A new open-source project called Hypura promises to run large language models that exceed available memory on Apple Silicon Macs by intelligently distributing model components across GPU, RAM, and NVMe storage tiers. The storage-tier-aware inference scheduler, developed by GitHub user t8, can run a 31 GB Mixtral 8x7B model on a 32 GB Mac Mini at 2.2 tokens per second, and a 40 GB Llama 70B at 0.3 tokens per second—both configurations that would crash standard llama.cpp implementations.
The system addresses a fundamental tension in consumer Apple hardware: unified memory and NVMe storage are both fast, but memory capacity is limited. According to the project documentation, “A 32 GB M1 Max cannot naively load a 40 GB model — the OS will swap-thrash until the OOM killer intervenes.” Hypura avoids this by analyzing the model's architecture and placing tensors according to their access patterns.
The scheduler implements three distinct inference modes, selected by model size and available memory. For Mixture of Experts (MoE) models like Mixtral, it exploits sparsity: only the non-expert tensors (~1 GB) stay on the GPU, while expert tensors stream from NVMe through a pool buffer on demand. A neuron cache tracks which expert slices are already loaded across tokens, and co-activation tracking predicts which experts will fire next for speculative prefetch; together these yield a reported 99.5% hit rate. For dense models like Llama 70B, attention layers and norms stay GPU-resident (~8 GB) while FFN tensors (~32 GB) stream from NVMe through a dynamically sized pool buffer.
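To make the neuron-cache idea concrete, here is a minimal sketch in Rust. It is illustrative only: the struct, method names, eviction policy, and slice sizes are assumptions for this example, not Hypura's actual implementation. The point it shows is why the hit rate climbs so high: consecutive tokens tend to reuse the same expert slices, so NVMe reads happen mostly on first touch.

```rust
use std::collections::HashMap;

/// Hypothetical neuron cache for MoE expert slices (not Hypura's real API).
/// Tracks which (layer, expert) slices are resident and counts hits/misses.
struct NeuronCache {
    resident: HashMap<(usize, usize), Vec<u8>>, // (layer, expert) -> slice bytes
    capacity: usize,                            // max resident slices
    hits: u64,
    misses: u64,
}

impl NeuronCache {
    fn new(capacity: usize) -> Self {
        Self { resident: HashMap::new(), capacity, hits: 0, misses: 0 }
    }

    /// Fetch an expert slice, loading it on a miss (the NVMe read is stubbed).
    fn fetch(&mut self, layer: usize, expert: usize) -> &Vec<u8> {
        if self.resident.contains_key(&(layer, expert)) {
            self.hits += 1;
        } else {
            self.misses += 1;
            if self.resident.len() >= self.capacity {
                // Evict an arbitrary slice; a real cache would evict based on
                // co-activation statistics or recency.
                let victim = self.resident.keys().next().copied();
                if let Some(key) = victim {
                    self.resident.remove(&key);
                }
            }
            // Stand-in for a direct-I/O read of the expert tensor from NVMe.
            self.resident.insert((layer, expert), vec![0u8; 16]);
        }
        &self.resident[&(layer, expert)]
    }

    fn hit_rate(&self) -> f64 {
        self.hits as f64 / (self.hits + self.misses) as f64
    }
}

fn main() {
    let mut cache = NeuronCache::new(8);
    // Simulate 100 tokens that keep activating the same two experts:
    // only the first touches miss, so the hit rate approaches 1.0.
    for _token in 0..100 {
        for expert in 0..2 {
            cache.fetch(0, expert);
        }
    }
    println!("hit rate: {:.3}", cache.hit_rate());
}
```

Under this access pattern, only the two initial loads miss out of 200 fetches, which mirrors how expert reuse across tokens can push real-world hit rates toward the reported 99.5%.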
Hypura automatically profiles hardware capabilities including GPU working set limits, RAM capacity, and NVMe bandwidth to solve a placement optimization problem. The system assigns tensors to three tiers: Metal GPU for attention layers, norms, and embeddings; RAM for overflow layers accessed via memory mapping; and NVMe for remaining layers loaded on-demand via direct I/O with F_NOCACHE and pread operations. Pool buffer sizes, prefetch depth, and memory budgets are computed automatically without manual tuning.
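The NVMe tier's read path can be sketched in a few lines of Rust. This is a simplified illustration of the F_NOCACHE-plus-pread pattern the documentation describes, not Hypura's actual code; the function name, offsets, and file layout are invented for the example. F_NOCACHE is macOS-specific (the call is simply ignored elsewhere), and it matters here because streamed weights read through the normal buffer cache would evict the very pages the GPU-resident tensors depend on.

```rust
use std::ffi::c_void;
use std::fs::File;
use std::io::Write;
use std::os::unix::io::AsRawFd;

// Raw libc bindings so the example needs no external crates.
extern "C" {
    fn fcntl(fd: i32, cmd: i32, ...) -> i32;
    fn pread(fd: i32, buf: *mut c_void, count: usize, offset: i64) -> isize;
}

const F_NOCACHE: i32 = 48; // macOS fcntl command; a no-op failure elsewhere

/// Read `len` bytes of a tensor at a known byte `offset` in a weight file,
/// bypassing the OS buffer cache where supported (illustrative helper).
fn read_tensor(file: &File, offset: i64, len: usize) -> std::io::Result<Vec<u8>> {
    let fd = file.as_raw_fd();
    // Ask the kernel not to cache these pages; errors are ignored on
    // platforms that don't support F_NOCACHE.
    unsafe { fcntl(fd, F_NOCACHE, 1i32) };
    let mut buf = vec![0u8; len];
    let n = unsafe { pread(fd, buf.as_mut_ptr() as *mut c_void, len, offset) };
    if n < 0 {
        return Err(std::io::Error::last_os_error());
    }
    buf.truncate(n as usize);
    Ok(buf)
}

fn main() -> std::io::Result<()> {
    // Demo against a small stand-in "weights" file.
    let path = std::env::temp_dir().join("hypura_demo.bin");
    File::create(&path)?.write_all(&[7u8; 4096])?;
    let f = File::open(&path)?;
    let tensor = read_tensor(&f, 1024, 256)?;
    println!("read {} bytes", tensor.len());
    std::fs::remove_file(&path)?;
    Ok(())
}
```

Because `pread` takes an explicit offset, reads from many layers can be issued concurrently on one file descriptor without seeking, which is what makes on-demand streaming of individual tensors practical.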
The project is available on GitHub in the t8/hypura repository; it is written in Rust and had 262 stars as of publication. The system promises zero overhead for models that fit entirely in memory, while enabling previously impossible configurations on memory-constrained Apple Silicon devices.
