LAUNCHES

Hypura Scheduler Enables Oversized LLM Inference on Apple Silicon

megaone_admin · Mar 24, 2026 · 2 min read
Engine Score 8/10 — Important

This new open-source scheduler takes a novel approach to optimizing LLM inference on Apple Silicon and is directly actionable for developers. Its primary-source backing and timeliness make it an important development for on-device AI performance, despite its niche audience.


A new open-source project called Hypura promises to run large language models that exceed available memory on Apple Silicon Macs by intelligently distributing model components across GPU, RAM, and NVMe storage tiers. The storage-tier-aware inference scheduler, developed by GitHub user t8, can run a 31 GB Mixtral 8x7B model on a 32 GB Mac Mini at 2.2 tokens per second, and a 40 GB Llama 70B at 0.3 tokens per second—both configurations that would crash standard llama.cpp implementations.

The system addresses a fundamental limitation of consumer Apple hardware: fast unified memory and NVMe storage paired with limited capacity. According to the project documentation, “A 32 GB M1 Max cannot naively load a 40 GB model — the OS will swap-thrash until the OOM killer intervenes.” Hypura solves this by understanding model architecture and optimizing tensor placement based on access patterns.
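A back-of-envelope check makes the constraint concrete, and also shows why NVMe bandwidth caps throughput once weights must stream from disk. The figures below are illustrative assumptions, not measurements from Hypura:

```python
# Rough sketch of the memory-fit and streaming-throughput arithmetic.
# The OS reserve and NVMe bandwidth numbers are assumptions chosen
# for illustration, not values profiled or reported by Hypura.

def naive_load_fits(model_gb: float, ram_gb: float, os_reserve_gb: float = 4.0) -> bool:
    """A model can only be loaded naively if it fits in RAM after
    leaving headroom for the OS; otherwise the machine swap-thrashes."""
    return model_gb <= ram_gb - os_reserve_gb

def streaming_tokens_per_sec(streamed_gb_per_token: float, nvme_gb_per_sec: float) -> float:
    """Upper bound on throughput when each token requires reading
    `streamed_gb_per_token` of weights from NVMe (compute overlapped)."""
    return nvme_gb_per_sec / streamed_gb_per_token

# A 40 GB model on a 32 GB machine cannot load naively:
print(naive_load_fits(40, 32))                     # False
# If ~32 GB of FFN weights stream per token at an assumed ~7 GB/s:
print(round(streaming_tokens_per_sec(32, 7), 2))   # 0.22
```

That bandwidth-bound estimate lands in the same regime as the 0.3 tokens per second Hypura reports for the dense 70B configuration, which is what one would expect when disk reads, not compute, dominate each step.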

The scheduler implements three distinct inference modes based on model size and available memory. Models that fit entirely in GPU memory run fully resident, with no scheduling overhead. For Mixture of Experts (MoE) models like Mixtral, it exploits sparsity by keeping only non-expert tensors (~1 GB) on GPU while streaming expert tensors from NVMe through a pool buffer on demand. The system achieves a 99.5% hit rate through a neuron cache that tracks loaded expert slices across tokens, with co-activation tracking predicting which experts will fire next for speculative prefetch. For dense models like Llama 70B, attention layers and norms stay GPU-resident (~8 GB) while FFN tensors (~32 GB) stream from NVMe through a dynamically sized pool buffer.
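The expert-caching idea can be sketched as an LRU cache paired with a co-activation table that drives speculative prefetch. This is a hedged illustration of the mechanism as described above; the class and method names are invented, not Hypura's actual Rust internals:

```python
from collections import OrderedDict, defaultdict

class ExpertCache:
    """Minimal sketch of a 'neuron cache' for MoE expert slices:
    an LRU over loaded experts plus a co-activation table used to
    prefetch experts that historically fire together."""

    def __init__(self, capacity: int, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn            # stand-in for an NVMe read
        self.cache = OrderedDict()        # expert_id -> tensor
        self.coactivation = defaultdict(lambda: defaultdict(int))
        self.hits = self.misses = 0

    def get(self, expert_id):
        if expert_id in self.cache:
            self.hits += 1
            self.cache.move_to_end(expert_id)   # mark most recently used
        else:
            self.misses += 1
            self._insert(expert_id, self.load_fn(expert_id))
        return self.cache[expert_id]

    def record_step(self, active_experts):
        """Track which experts fired together on this token."""
        for a in active_experts:
            for b in active_experts:
                if a != b:
                    self.coactivation[a][b] += 1

    def prefetch(self, active_experts, depth=2):
        """Speculatively load the experts most often co-activated
        with the ones that just fired."""
        for a in active_experts:
            partners = sorted(self.coactivation[a].items(),
                              key=lambda kv: -kv[1])[:depth]
            for b, _ in partners:
                if b not in self.cache:
                    self._insert(b, self.load_fn(b))

    def _insert(self, expert_id, tensor):
        if len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)  # evict least recently used
        self.cache[expert_id] = tensor

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Because expert activations are bursty and correlated across tokens, even a small cache like this can reach very high hit rates once the co-activation table warms up, which is consistent with the 99.5% figure Hypura reports.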

Hypura automatically profiles hardware capabilities including GPU working set limits, RAM capacity, and NVMe bandwidth to solve a placement optimization problem. The system assigns tensors to three tiers: Metal GPU for attention layers, norms, and embeddings; RAM for overflow layers accessed via memory mapping; and NVMe for remaining layers loaded on-demand via direct I/O with F_NOCACHE and pread operations. Pool buffer sizes, prefetch depth, and memory budgets are computed automatically without manual tuning.
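One way to picture the placement step is a greedy assignment of tensors to tiers under per-tier budgets. The sketch below is a simplification under assumed budgets and tensor roles; the `Tensor` and `place` names are hypothetical, and Hypura derives its budgets automatically from profiled hardware rather than taking them as arguments:

```python
from dataclasses import dataclass

@dataclass
class Tensor:
    name: str
    size_gb: float
    role: str  # e.g. "attention", "norm", "embedding", "ffn"

# Roles kept hot on the Metal GPU, per the article's tier description.
GPU_PRIORITY = ("attention", "norm", "embedding")

def place(tensors, gpu_budget_gb, ram_budget_gb):
    """Greedily assign tensors to gpu / ram / nvme tiers:
    hot roles go to GPU first, overflow goes to RAM (mmap),
    and the remainder falls through to NVMe direct I/O."""
    plan, used = {}, {"gpu": 0.0, "ram": 0.0}
    # Visit GPU-priority tensors first; sort is stable otherwise.
    ordered = sorted(tensors, key=lambda t: t.role not in GPU_PRIORITY)
    for t in ordered:
        if t.role in GPU_PRIORITY and used["gpu"] + t.size_gb <= gpu_budget_gb:
            plan[t.name] = "gpu"
            used["gpu"] += t.size_gb
        elif used["ram"] + t.size_gb <= ram_budget_gb:
            plan[t.name] = "ram"                # accessed via mmap
            used["ram"] += t.size_gb
        else:
            plan[t.name] = "nvme"               # on-demand direct I/O
    return plan
```

A real solver would also weigh access frequency and prefetch depth, but even this greedy pass reproduces the tiering the article describes: attention and norms land on GPU, overflow in RAM, and the rest on NVMe.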

The project, written in Rust, is available on GitHub under the t8/hypura repository and had 262 stars as of publication. The system promises zero overhead for models that fit entirely in memory while enabling previously impossible configurations on memory-constrained Apple Silicon devices.


MegaOne AI Editorial Team

MegaOne AI monitors 200+ sources daily to identify and score the most important AI developments. Every story is fact-checked by our editorial team, linked to primary sources, and rated using our six-factor Engine Score methodology.
