TOOL UPDATES

Developer Creates 397B Parameter Model Runner for 48GB MacBook Pro

Ryan Matsuda · Mar 22, 2026 · Updated Apr 7, 2026 · 4 min read
Engine Score 8/10 — Important

This story details a significant technical feat: a 397-billion-parameter model running on consumer-grade hardware, which broadens access to frontier-scale AI. Its high actionability for local AI development makes it directly useful to developers and researchers.

  • Developer Dan Woods built Flash-MoE, a pure C/Metal inference engine that runs the 397-billion-parameter Qwen3.5-397B model on a MacBook Pro with 48GB of RAM at 4.4 tokens per second.
  • The engine streams the entire 209GB model from SSD using custom Metal compute shaders, requiring no Python or external frameworks.
  • Flash-MoE uses SSD expert streaming to load only four active experts per layer on demand, keeping total memory usage around 6GB.
  • The project was built in 24 hours using Claude Code's autoresearch pattern, which ran 90 optimization experiments automatically.

What Happened

Dan Woods, a software developer known on GitHub as danveloper, released Flash-MoE, an open-source inference engine written in C, Objective-C, and hand-tuned Metal shaders. The engine runs Qwen3.5-397B-A17B, a 397-billion-parameter Mixture-of-Experts model, on a standard MacBook Pro with an M3 Max chip and 48GB of unified memory.

The project achieves 4.36 tokens per second at 4-bit quantization with production-quality output, including functional tool calling. At 2-bit quantization, speeds reach 5.74 tokens per second, though JSON output becomes unreliable at that precision.

Woods described the development process on X, noting he “handed Claude Code Karpathy’s autoresearch repo and Apple’s ‘LLM in a Flash’ paper, told it to get Qwen3.5-397B running on my M3 Max 48GB… it did.” The entire project was completed within 24 hours.

Why It Matters

Running a 397-billion-parameter model locally on consumer hardware was previously considered impractical. Models of this scale typically require multiple high-end GPUs with hundreds of gigabytes of VRAM, often costing tens of thousands of dollars. Flash-MoE demonstrates that Apple Silicon’s unified memory architecture and fast NVMe storage can substitute for dedicated GPU memory when paired with careful engineering.

The achievement has implications for AI accessibility. Developers and researchers who cannot afford cloud GPU time or enterprise hardware could use commodity Apple laptops for inference with frontier-scale models. The 4.4 tokens-per-second throughput is slow compared to data center deployments but fast enough for interactive use, code generation, and tool-calling workflows.

The project also highlights a shift in how complex systems software gets built. Woods used an AI-assisted autoresearch pattern rather than manual optimization, letting Claude Code generate and test 90 different approaches automatically. The entire engine, roughly 7,000 lines of C code and 1,200 lines of Metal shaders, was produced in a single 24-hour session.

Technical Details

The Qwen3.5-397B model uses 60 transformer layers with 512 experts per layer, activating only four experts per token. Flash-MoE exploits this sparsity through SSD expert streaming: expert weights are read on demand via parallel pread() calls using Grand Central Dispatch, with each active expert consuming approximately 6.75MB.
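
At four active experts of roughly 6.75MB each, a layer streams about 27MB, on the order of 1.6GB per token across all 60 layers before page-cache hits are counted. A minimal sketch of that fan-out read pattern in plain C with Grand Central Dispatch follows; the file-layout helper (expert_offset), the constants, and the buffer handling are illustrative assumptions, not Flash-MoE's actual code.

    #include <dispatch/dispatch.h>
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define ACTIVE_EXPERTS    4
    #define EXPERTS_PER_LAYER 512
    #define EXPERT_BYTES      6750000u  /* ~6.75MB of 4-bit weights */

    /* Hypothetical layout: experts stored contiguously, layer-major. */
    static off_t expert_offset(int layer, int id) {
        return ((off_t)layer * EXPERTS_PER_LAYER + id) * EXPERT_BYTES;
    }

    /* Read the four routed experts concurrently. pread() is safe to
       call from multiple threads because each call carries its own
       file offset; dispatch_apply blocks until all reads complete. */
    static void load_active_experts(int fd, int layer,
                                    const int ids[ACTIVE_EXPERTS],
                                    uint8_t *dst /* 4 * EXPERT_BYTES */) {
        dispatch_apply(ACTIVE_EXPERTS,
                       dispatch_get_global_queue(QOS_CLASS_USER_INITIATED, 0),
                       ^(size_t i) {
            ssize_t n = pread(fd, dst + i * EXPERT_BYTES, EXPERT_BYTES,
                              expert_offset(layer, ids[i]));
            if (n != (ssize_t)EXPERT_BYTES)
                perror("pread");
        });
    }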

Non-expert weights occupy 5.5GB of memory-mapped read-only space, while Metal scratch buffers add roughly 200MB. Total memory usage stays around 6GB, leaving 42GB for the operating system and page cache. The OS page cache naturally achieves a 71% hit rate for expert data.
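
The 5.5GB of shared weights can be handled with a plain read-only memory mapping and left to the kernel's page cache. A minimal sketch, assuming the non-expert weights live in a single flat file (the path and layout are hypothetical):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Map the non-expert weights read-only. Pages are demand-faulted
       from disk and stay evictable, so the mapping never counts as
       wired memory; caching is the OS page cache's job. */
    static void *map_shared_weights(const char *path, size_t *len_out) {
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return NULL; }

        struct stat st;
        if (fstat(fd, &st) != 0) { perror("fstat"); close(fd); return NULL; }

        void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);  /* the mapping keeps its own reference to the file */
        if (p == MAP_FAILED) { perror("mmap"); return NULL; }

        *len_out = (size_t)st.st_size;
        return p;
    }

Note that this path applies only to the dense, always-resident weights: per the experiment log discussed below, mmap-based loading of the experts themselves was 5x slower, which is why they are streamed with pread() instead.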

A key optimization is the FMA-optimized dequantization kernel, which rearranges the standard computation from (nibble * scale + bias) * x to fma(nibble, scale*x, bias*x). This GPU instruction-level change yielded a 12% performance improvement. The project also routes the GatedDeltaNet recurrence through Accelerate's BLAS, running 64% faster than a scalar implementation.
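
The rewrite is a pure algebraic identity: (nibble * scale + bias) * x = nibble * (scale * x) + bias * x, and the right-hand side is a single fused multiply-add per weight once scale*x and bias*x are available. A sketch of the identity in plain C (the real kernel is a Metal shader); the assumption that scale*x and bias*x can be shared across a quantization group depends on the weight layout and is not confirmed by the source:

    #include <math.h>

    /* Naive form: dequantize the 4-bit value, then multiply by the
       activation x. */
    static inline float dequant_naive(float nibble, float scale,
                                      float bias, float x) {
        return (nibble * scale + bias) * x;
    }

    /* Rearranged form: identical result, but the nibble-dependent
       work collapses to one fused multiply-add. If every nibble in a
       quantization group multiplies the same activation, scale*x and
       bias*x can be computed once per group (a layout assumption). */
    static inline float dequant_fma(float nibble, float scale,
                                    float bias, float x) {
        return fmaf(nibble, scale * x, bias * x);
    }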

Per-layer timing breaks down to approximately 4.28ms: 1.22ms for GPU attention projections, 0.55ms for output projection and routing, and 2.41ms for parallel SSD expert loading.

Who’s Affected

Machine learning researchers and developers working with large language models on Apple Silicon hardware stand to benefit most. The project proves that Mixture-of-Experts architectures can run on laptops when combined with SSD streaming, potentially reducing dependence on cloud GPU providers for inference tasks.

The approach is hardware-specific. It requires Apple’s M-series chips with their unified memory controller and high-bandwidth NVMe storage, specifically the M3 Max with its 400 GB/s memory bandwidth and 17.5 GB/s sequential SSD read speed. The engine targets macOS 26.2 and will not run on other platforms without significant rework.

For the broader MoE model community, Flash-MoE validates a design pattern. The 512-experts-per-layer architecture of Qwen3.5-397B was designed for distributed data center inference, but the extreme sparsity that activates only four of 512 experts per token makes SSD streaming viable on a single machine.

What’s Next

The Flash-MoE codebase is available on GitHub under open-source terms. Woods documented 58 failed optimization approaches in the repository’s experiment log, including LZ4 compression (13% slower), temporal expert prediction (18% slower with 25% hit rate), and mmap-based expert loading (5x slower due to per-page fault overhead).

The 2-bit quantization mode remains unreliable for structured output, producing malformed JSON that breaks tool calling. Woods noted the engine produces output like \name\ instead of "name" at 2-bit precision. The project also found that Apple’s F_RDADVISE prefetch hint caused a 73% GPU slowdown through memory controller contention, and a custom Metal LRU cache was 38% slower than simply relying on the OS page cache.
