BENCHMARKS

Raspberry Pi 5 Runs Qwen3 30B at 7–8 Tokens/Sec with Potato OS

James Whitfield · Mar 20, 2026 · Updated Apr 7, 2026 · 4 min read
Engine Score 7/10 — Important

This story highlights a significant technical achievement in running a large language model on low-cost edge hardware, demonstrating impressive performance for a Raspberry Pi 5. While the source is a community platform, the benchmark is highly actionable for developers and hobbyists exploring edge AI.

Reddit user jslominski demonstrated running Alibaba’s Qwen3 30B language model at 7–8 tokens per second on a Raspberry Pi 5 with 8GB of RAM, using only local CPU compute, with no GPU acceleration or external API calls. The accompanying video shows the full inference workflow on the single-board computer. The result is a self-reported improvement over the developer’s own earlier benchmark, published approximately one week prior.

  • Qwen3 30B in Q3_K_S quantization (2.66 bits per weight) runs at 7–8 tokens/second on a Raspberry Pi 5 with 8GB RAM
  • Four optimizations drove the gain: NVMe SSD, official active cooler, a custom ik_llama.cpp build, and prompt caching
  • The developer packaged the setup as “Potato OS,” a headless Debian image that auto-configures and downloads a default model on first boot
  • The project is early-stage — no over-the-air updates exist, and users must reflash the image to upgrade

What Happened

Jslominski used the byteshape/Qwen3-30B-A3B-Instruct-2507-GGUF model file, selecting its Q3_K_S quantized variant at 2.66 bits per weight, and ran inference on a Raspberry Pi 5 with 8GB of LPDDR4X RAM. The system operated at a context length of 16,384 tokens. No dedicated graphics hardware was involved; all compute ran on the Pi 5’s ARM Cortex-A76 CPU cores.
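For readers who want to reproduce the general shape of this setup, the sketch below loads a GGUF file of this kind through the llama-cpp-python bindings. It is an approximation: the original run used a custom ik_llama.cpp build, so the exact binary, flags, and throughput will differ, and the file name here is illustrative.

```python
# Minimal sketch using the llama-cpp-python bindings, not the custom
# ik_llama.cpp build from the original post; flags and throughput will differ.
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3-30B-A3B-Instruct-2507-Q3_K_S.gguf",  # illustrative filename
    n_ctx=16384,      # context length used in the demonstration
    n_gpu_layers=0,   # CPU-only: no layers offloaded to a GPU
    n_threads=4,      # the Pi 5 exposes four Cortex-A76 cores
)

out = llm("Explain mixture-of-experts routing in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```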

The demonstration was shared on Reddit and shows the system responding to prompts in real time. Author details beyond the Reddit username jslominski were not available at time of publication.

Why It Matters

Running a 30-billion-parameter model at usable inference speeds on sub-$100 hardware is a practical stress test of how far weight quantization and edge inference tooling have progressed. The Qwen3-30B-A3B variant is a Mixture-of-Experts architecture — the “A3B” in the model name designates approximately 3 billion active parameters per forward pass rather than the full 30 billion, which substantially reduces per-token compute and memory bandwidth requirements. That architectural property is what makes the model tractable on hardware with only 8GB of RAM.
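Back-of-envelope arithmetic (ours, not from the post) makes that concrete. At 2.66 bits per weight, the full weight file is larger than the Pi’s 8GB of RAM, but the per-token working set is only the routed experts:

```python
# Illustrative arithmetic at the stated 2.66 bits per weight.
total_params, active_params, bpw = 30e9, 3e9, 2.66

total_gb = total_params * bpw / 8 / 1e9    # ~10.0 GB of weights in the file
active_gb = active_params * bpw / 8 / 1e9  # ~1.0 GB touched per generated token
print(f"full model: ~{total_gb:.1f} GB, per-token working set: ~{active_gb:.1f} GB")
```

Since llama.cpp memory-maps model files by default, part of a ~10GB file can stay on storage while each token reads only roughly a gigabyte of expert weights, which also plausibly explains why the NVMe upgrade described below pays off.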

The developer’s own week-over-week comparison suggests the bottleneck was not the hardware ceiling but the software configuration. A stock llama.cpp installation on the same device produced slower results; reaching 7–8 tokens per second required stacking several targeted changes.

Technical Details

Four changes produced the reported performance: attaching an NVMe SSD to reduce storage I/O latency, installing the official Raspberry Pi active cooler to prevent thermal throttling under sustained inference load, switching from the standard llama.cpp binary to a custom ik_llama.cpp build, and enabling prompt caching to reduce redundant prefill computation on repeated context. The context window was fixed at 16,384 tokens for the demonstration.
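Of the four, prompt caching is the most visible at the API level. The sketch below approximates the idea with llama-cpp-python’s in-RAM cache; the original setup relied on ik_llama.cpp’s own mechanism, so treat this as an illustration rather than the developer’s configuration.

```python
# Approximate prompt caching with llama-cpp-python's in-RAM KV cache.
from llama_cpp import Llama, LlamaRAMCache

llm = Llama(model_path="Qwen3-30B-A3B-Instruct-2507-Q3_K_S.gguf", n_ctx=16384)
llm.set_cache(LlamaRAMCache())  # reuse KV state for previously seen prompt prefixes

system = "You are a helpful assistant running on a Raspberry Pi.\n"
for q in ("User: What is quantization?", "User: What is thermal throttling?"):
    # The shared system-prompt prefix is served from cache on the second call,
    # skipping most of the redundant prefill work.
    print(llm.create_completion(system + q, max_tokens=64)["choices"][0]["text"])
```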

For comparison, the 4-bit quantized versions of the same Qwen3 model family achieved only 4–5 tokens per second on the same Raspberry Pi 5 hardware. The lower throughput of the higher-bit-depth variant is consistent with the Pi 5’s LPDDR4X memory bus becoming the binding constraint: larger per-weight representations require more memory bandwidth per token, even though 4-bit quantization nominally preserves more model fidelity. The Q3_K_S format’s smaller per-weight footprint moves fewer bytes across that bus per token, yielding higher throughput at the cost of some precision.
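A quick sanity check of that explanation, using our own arithmetic: the post does not name the exact 4-bit quant, so the 4.25 bits-per-weight figure below, typical of Q4_K-style formats, is an assumption.

```python
# If memory bandwidth is the ceiling, bytes/token x tokens/s should be roughly
# constant across quantizations. Assumes ~3e9 active params per token.
active_params = 3e9
for label, bpw, tok_s in (("Q3_K_S", 2.66, 7.5), ("4-bit", 4.25, 4.5)):
    gb_per_token = active_params * bpw / 8 / 1e9
    print(f"{label}: ~{gb_per_token:.2f} GB/token -> ~{gb_per_token * tok_s:.1f} GB/s")
# Both work out to roughly 7 GB/s of weight traffic, plausibly pinned near the
# Pi 5's practical LPDDR4X streaming throughput.
```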

Who’s Affected

The target audience is developers and researchers who want a self-contained local language model server on inexpensive, offline-capable hardware. Jslominski packaged the configuration as “Potato OS,” a flashable headless Debian image that automates the entire setup process after first boot. As the developer described: “After boot there’s a 5 minute timeout that automatically downloads Qwen3.5 2B with vision encoder (~1.8GB), so if you come back in 10 minutes and go to http://potato.local it’s ready to go.”

The system exposes an OpenAI-compatible API endpoint, meaning applications already integrated with GPT-4 or similar services can redirect requests to the local server with minimal code changes. A web interface supports model selection and direct file uploads for custom models. The developer has confirmed compatibility testing across the Qwen3, Qwen3-VL, and Qwen3.5 model families.
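In practice, redirecting a client looks like swapping its base URL. The sketch below uses the official openai Python package; the /v1 path and the model identifier are assumptions, so check the project README or web interface for the exact values the image exposes.

```python
# Point an existing OpenAI-style client at the local Potato OS server.
from openai import OpenAI

client = OpenAI(base_url="http://potato.local/v1", api_key="unused")  # local server; key is ignored
resp = client.chat.completions.create(
    model="qwen3-30b-a3b-instruct",  # hypothetical name; pick from the web UI's model list
    messages=[{"role": "user", "content": "Say hello from the Pi."}],
)
print(resp.choices[0].message.content)
```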

What’s Next

The source code is publicly available on GitHub at github.com/slomin/potato-os, and jslominski is soliciting community feedback on hardware compatibility and system stability. One significant limitation is explicit in the project documentation: there is no over-the-air update mechanism. Applying any fix or improvement requires users to reflash the entire SD card image from scratch, and the developer warns that bugs should be expected at this stage.

The vision-language angle introduces an open question: the default model auto-downloaded on first boot is Qwen3.5 2B with its vision encoder, enabling image-and-text inference out of the box. Whether the multimodal pipeline runs at speeds comparable to the text-only Qwen3 30B configuration was not addressed in the original post, and that performance profile on the Pi 5’s CPU-only stack remains untested publicly.
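Anyone wanting to fill that gap could probe the multimodal path through the same OpenAI-compatible endpoint, assuming the server accepts the standard image-plus-text chat format; the model identifier below is a guess.

```python
# Hypothetical probe of the default vision model via the OpenAI-style chat API.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://potato.local/v1", api_key="unused")
with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="qwen3.5-2b",  # hypothetical identifier for the auto-downloaded vision model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```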
