Key Takeaways
- Google’s Gemma 4 8B model runs comfortably on Mac minis with 16GB+ unified memory using Ollama, consuming roughly 9.6GB in Q4_K_M quantization.
- The 26B variant requires more than 24GB of unified memory and causes severe swapping on base-model Mac minis, making the 8B version the practical choice for most users.
- A LaunchAgent plist can keep the model preloaded in memory and warm, reducing first-response latency to near zero after boot.
- Ollama's MLX backend enables an automatic CPU/GPU split on Apple Silicon, with typical configurations showing 86% GPU and 14% CPU utilization.
What Happened
A detailed setup guide posted to GitHub Gist on April 3, 2026, walks through the complete process of installing Ollama and running Google’s Gemma 4 model locally on an Apple Silicon Mac mini. The guide, which reached 302 points on Hacker News, covers installation via Homebrew, model selection between the 8B and 26B variants, auto-start configuration, and keep-alive preloading to maintain inference readiness.
The guide addresses a common pitfall head-on: the 26B parameter variant of Gemma 4 consumes nearly all of a 24GB Mac mini’s unified memory, leaving the system “barely responsive and causing frequent swapping under concurrent requests.” The author recommends the default 8B variant with Q4_K_M quantization, which occupies roughly 9.6GB and runs with headroom to spare.
Why It Matters
Local LLM inference on consumer hardware continues to gain traction as models shrink and Apple Silicon's unified memory architecture proves well suited to the task. Gemma 4, released by Google, represents one of the strongest open-weight models available in early April 2026, and Mac minis remain among the most cost-effective Apple Silicon machines for always-on local AI workloads.
Running a capable language model locally eliminates API costs entirely, removes network latency from every request, and keeps all data on-device. For developers and small teams processing sensitive information — legal documents, medical records, proprietary code — this is a meaningful advantage over cloud-based alternatives where data leaves the local network.
The guide's popularity on Hacker News reflects sustained demand for practical, step-by-step instructions that get local AI running without debugging. Many users have the hardware but lack the specific configuration knowledge to set up auto-start, preloading, and GPU acceleration correctly.
Technical Details
The setup begins with installing the Ollama macOS app via Homebrew: `brew install --cask ollama-app`. This installs both Ollama.app in `/Applications/` and the `ollama` CLI at `/opt/homebrew/bin/ollama`. The app includes auto-updates and the MLX backend optimized for Apple Silicon's GPU cores.
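The installation step amounts to a single command; the verification lines below are a suggested sanity check (the paths assume a default Apple Silicon Homebrew prefix, as described in the guide):

```shell
# Install the Ollama macOS app, which bundles the CLI and the background service
brew install --cask ollama-app

# Sanity-check that both pieces landed where expected
ls /Applications/Ollama.app
/opt/homebrew/bin/ollama --version
```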
Pulling the model is a single command: `ollama pull gemma4`, which downloads approximately 9.6GB for the default 8B variant. After installation, running `ollama ps` shows the CPU/GPU split, typically around 14% CPU and 86% GPU on Apple Silicon machines. This split confirms that the MLX backend is properly utilizing the GPU for the bulk of inference computation.
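In practice the sequence looks like this; the throwaway prompt is illustrative and simply forces the model to load so `ollama ps` has something to report:

```shell
# Download the default 8B variant (~9.6GB per the guide)
ollama pull gemma4

# Run a throwaway prompt so the model is loaded into memory
ollama run gemma4 "Say hello" >/dev/null

# Inspect the loaded model and its CPU/GPU split
ollama ps
```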
The most useful portion of the guide covers creating a macOS LaunchAgent that preloads the model into memory at startup and refreshes it every 300 seconds to prevent memory eviction. The LaunchAgent is configured through a plist file at `~/Library/LaunchAgents/com.ollama.preload-gemma4.plist` and activated with `launchctl load`. The result is a Mac mini that boots up ready to serve Gemma 4 responses immediately, with the model already loaded in GPU memory.
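A minimal version of such a LaunchAgent might look like the following. The gist's exact plist isn't reproduced here, so the curl-based keep-alive request and the `keep_alive` value are assumptions; an empty generate request is a documented way to load a model into memory via Ollama's HTTP API:

```shell
# Write a LaunchAgent that preloads the model at login (RunAtLoad) and
# re-pings it every 300 seconds (StartInterval) so it stays resident.
cat > ~/Library/LaunchAgents/com.ollama.preload-gemma4.plist <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.ollama.preload-gemma4</string>
  <key>ProgramArguments</key>
  <array>
    <string>/usr/bin/curl</string>
    <string>-s</string>
    <string>http://localhost:11434/api/generate</string>
    <string>-d</string>
    <string>{"model": "gemma4", "keep_alive": "10m"}</string>
  </array>
  <key>RunAtLoad</key>
  <true/>
  <key>StartInterval</key>
  <integer>300</integer>
</dict>
</plist>
EOF

# Activate the agent for the current user
launchctl load ~/Library/LaunchAgents/com.ollama.preload-gemma4.plist
```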
Hardware requirements are straightforward: any Apple Silicon Mac mini (M1 through M5) with at least 16GB of unified memory for the 8B model. The author tested the 26B variant on a 24GB machine and found it consumed nearly all available memory, making the system unusable for other tasks. Users with 36GB or 64GB configurations could run the larger model comfortably.
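Before picking a variant, it is worth confirming how much unified memory a given machine actually has. On macOS, `sysctl -n hw.memsize` reports the figure in bytes; the small helper below converts it (the function name is ours, not the guide's):

```shell
# Convert a byte count to whole GiB (pure awk, runs anywhere)
bytes_to_gib() {
  awk -v b="$1" 'BEGIN { printf "%d GiB\n", b / (1024^3) }'
}

# On macOS, feed it the installed unified memory:
#   bytes_to_gib "$(sysctl -n hw.memsize)"
bytes_to_gib 17179869184   # a 16GB machine → prints "16 GiB"
```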
Who’s Affected
Developers, hobbyists, and small teams running local AI workloads on Apple hardware benefit most from this guide. The Mac mini’s compact form factor, low power consumption, and relatively low price point make it a popular choice for always-on home servers and small office deployments. Pairing it with Gemma 4 via Ollama turns it into a private AI endpoint that runs 24/7 without ongoing API charges.
Users who previously relied on cloud API calls for tasks like code completion, document summarization, translation, or conversational AI can now run those workloads locally at zero marginal cost after the initial hardware investment.
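Once the server is running, those workloads go through Ollama's local HTTP API on its default port 11434. A document-summarization call might look like the following; the prompt text is illustrative:

```shell
# Summarize text locally via the Ollama HTTP API; no data leaves the machine.
# "stream": false returns one complete JSON response instead of a token stream.
curl -s http://localhost:11434/api/generate -d '{
  "model": "gemma4",
  "prompt": "Summarize in one sentence: Unified memory lets the CPU and GPU share a single pool of RAM.",
  "stream": false
}'
```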
What’s Next
As Apple continues to increase unified memory capacities across its Mac lineup and open-weight models improve at smaller parameter counts, local inference will become viable for an expanding set of use cases. The 26B Gemma 4 variant, currently impractical on 24GB machines, would run comfortably on Mac minis with 36GB or more. Ollama’s MLX backend is under active development, with further performance optimizations expected throughout 2026 as Apple releases new silicon generations.
