A developer has released Flash-MoE, a pure C and Metal inference engine that runs Qwen3.5-397B-A17B, a 397 billion parameter Mixture-of-Experts model, on a MacBook Pro with 48GB RAM at 4.4+ tokens per second. The project, created by GitHub user danveloper, streams the entire 209GB model from SSD through a custom Metal compute pipeline without using Python or machine learning frameworks.
The implementation represents a significant technical achievement in local large language model inference. According to the project documentation, “Pure C/Metal inference engine that runs Qwen3.5-397B-A17B (a 397 billion parameter Mixture-of-Experts model) on a MacBook Pro with 48GB RAM at 4.4+ tokens/second with production-quality output including tool calling.” The system uses only C, Objective-C, and hand-tuned Metal shaders.
The technical approach centers on several key optimizations. The engine uses SSD expert streaming: expert weights are read on demand from NVMe SSD via parallel pread() calls coordinated with GCD dispatch groups, loading only the 4 active experts per layer. An FMA-optimized dequantization kernel rearranges the math from “(nibble * scale + bias) * x” to “fma(nibble, scale*x, bias*x)”, which the documentation reports as 12% faster than the naive formulation. The system also employs hand-written Metal compute shaders for 4-bit and 2-bit dequantized matrix-vector multiply operations.
Testing on an Apple M3 Max MacBook Pro with 16-core CPU, 40-core GPU, and 48GB unified memory shows the 4-bit expert configuration achieving 4.36 tokens per second with “excellent” quality and full tool calling support. A 2-bit configuration reaches 5.74 tokens per second but breaks JSON and tool calling functionality, with the documentation noting it “produces \name\ instead of ‘name’ in JSON output, making tool calling unreliable.”
The project includes a technical paper with “90+ experiments” detailing the implementation. The model architecture comprises 60 transformer layers: 45 GatedDeltaNet linear attention layers and 15 standard full attention layers. Each layer contains 512 experts, with 4 activated per token plus one shared expert. The complete source code is available on GitHub, where the repository had 694 stars and 85 forks as of the snapshot.
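The routing figures above explain why streaming from SSD is viable at all: with 4 routed experts plus 1 shared expert active out of 512 per layer, each token touches under 1% of a layer's expert pool. A back-of-envelope check, assuming (hypothetically) that all experts in a layer are the same size:

```c
// Fraction of a layer's expert weights touched per token, given the
// counts from the article: 512 experts, 4 routed active, 1 shared.
// Assumes uniform expert sizes, which is an illustration-only
// simplification, not a claim from the project.
static double active_fraction(int experts_per_layer,
                              int routed_active,
                              int shared_experts) {
    return (double)(routed_active + shared_experts)
           / (double)experts_per_layer;
}
```

active_fraction(512, 4, 1) comes out to 5/512, roughly 0.98% of the per-layer expert weights, which is what keeps the per-token SSD read volume small enough for 4+ tokens per second.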
