Nvidia’s Nemotron Cascade 2 30B-A3B, released on March 19, 2026, attracted community attention when Reddit user ilintar posted benchmark results on the r/LocalLLaMA subreddit reporting 97.6% on HumanEval and 88% on ClassEval. The tester described the model as one that “has largely flown under the radar” despite being part of Nvidia’s growing Nemotron model family.
- Community testing on an IQ4_XS quantized version produced 97.6% on HumanEval and 88% on ClassEval, outscoring mid-range Qwen 3.5 models on HumanEval.
- The model uses a hybrid Mamba SSM plus attention architecture—not the Qwen architecture—with 30B total parameters but only 3B active per inference pass under a mixture-of-experts design.
- Official Nvidia benchmarks report gold-medal-equivalent scores on IOI 2025 (439.3 points) and IMO 2025 (35 points), with 87.2% on LiveCodeBench v6.
- The model supports context windows up to 262,144 tokens and is available in 28 quantized formats via llama.cpp, LM Studio, Ollama, and Jan.
What Happened
Shortly after Nvidia’s March 19, 2026 release of Nemotron Cascade 2 30B-A3B, Reddit user ilintar posted structured benchmark results on the r/LocalLLaMA subreddit, drawing attention to a model that had received comparatively little coverage. The model was developed by a team at Nvidia including lead authors Zhuolin Yang, Zihan Liu, and Yang Chen, with methodology detailed in arXiv:2603.19220.
Ilintar’s tests used mradermacher’s IQ4_XS quantized version of the model rather than the full-precision weights. The tester explained the reasoning: “I’ve been running some evals on local models lately since I’m kind of tired of the ‘vibe feels’ method of judging them.” HumanEval and ClassEval were chosen because they are “quick to run and complicated enough for most small models to still have noticeable differences.”
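For readers who want to replicate this kind of run, the sketch below outlines a minimal HumanEval pass@1 harness built on OpenAI’s human-eval package, pointed at a local OpenAI-compatible server such as the one llama.cpp or LM Studio exposes. The endpoint, port, and model alias are illustrative assumptions, not details from ilintar’s post.

```python
# Minimal HumanEval pass@1 sketch. Assumes `pip install human-eval requests`
# and a local OpenAI-compatible server (e.g. llama.cpp's llama-server with the
# IQ4_XS GGUF) listening on localhost:8080. Endpoint and model alias are
# illustrative placeholders, not taken from ilintar's post.
import requests
from human_eval.data import read_problems, write_jsonl

ENDPOINT = "http://localhost:8080/v1/completions"  # assumed local server
MODEL = "nemotron-cascade-2-30b-a3b-iq4_xs"        # assumed model alias

def generate_one_completion(prompt: str) -> str:
    resp = requests.post(ENDPOINT, json={
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": 512,
        "temperature": 0.0,  # greedy decoding for a deterministic pass@1 run
    })
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]

problems = read_problems()  # the 164 HumanEval tasks
samples = [
    {"task_id": task_id,
     "completion": generate_one_completion(problems[task_id]["prompt"])}
    for task_id in problems
]
write_jsonl("samples.jsonl", samples)
# Score afterwards with: evaluate_functional_correctness samples.jsonl
```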
The 97.6% HumanEval result reportedly left “both medium Qwen 3.5 models in the rear window,” though ilintar did not publish Qwen comparison scores in the original post.
Why It Matters
Nemotron Cascade 2 competes in a well-populated segment of sub-30B open models targeting local inference. What distinguishes it architecturally is the use of Mamba state-space model (SSM) layers combined with standard attention—a hybrid design that differs from the pure-transformer stacks used by Qwen, Llama, and Mistral-family models at similar parameter counts.
Ilintar noted directly in the post that the model “is not based on the Qwen architecture despite a similar size, it’s a properly hybrid model based on Nemotron’s own arch.” This distinction has practical inference consequences: SSM layers require a dedicated cache type (--mamba-ssm-cache-dtype float32 in vLLM) and are not interchangeable with standard key-value caches.
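A back-of-envelope comparison shows why the two cache types are not interchangeable: an attention layer’s key-value cache grows linearly with sequence length, while a Mamba layer’s recurrent state stays a fixed size. All dimensions in the sketch below are illustrative placeholders rather than Nemotron’s actual configuration.

```python
# Back-of-envelope cache sizing: attention KV cache vs. Mamba SSM state.
# All dimensions are illustrative placeholders, NOT Nemotron's real config.

seq_len = 262_144                   # the full advertised context window
n_kv_heads, head_dim = 8, 128       # assumed attention geometry
d_inner, d_state = 8_192, 128       # assumed Mamba layer geometry

# Attention: one K and one V vector per token, per layer -> grows with seq_len.
kv_bytes_per_layer = 2 * seq_len * n_kv_heads * head_dim * 2   # fp16 = 2 bytes

# Mamba: a single recurrent state per layer, independent of sequence length,
# kept in float32 (hence vLLM's --mamba-ssm-cache-dtype float32 flag).
ssm_bytes_per_layer = d_inner * d_state * 4                    # fp32 = 4 bytes

print(f"KV cache per layer @ 262k ctx: {kv_bytes_per_layer / 2**30:.2f} GiB")
print(f"SSM state per layer:           {ssm_bytes_per_layer / 2**20:.2f} MiB")
```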
The release also drew less public attention than the earlier Nemotron Super model family, which received broader coverage at launch, even though the Cascade 2 series posts competitive numbers across both mathematics and code categories.
Technical Details
The model’s naming convention—30B-A3B—describes its mixture-of-experts architecture: 30 billion total parameters with approximately 3 billion activated per inference pass. This reduces compute cost relative to dense 30B models while retaining access to the full parameter space.
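The toy sketch below illustrates the general top-k expert-routing pattern behind this kind of “A3B” activation scheme; the expert count, routing width, and layer sizes are placeholders, since the source does not state Cascade 2’s actual routing configuration.

```python
# Toy top-k mixture-of-experts routing, showing how only a fraction of total
# parameters is touched per token. Expert count and k are placeholders; the
# source does not state Cascade 2's actual routing configuration.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 16, 2

# One tiny feed-forward "expert" per slot (stand-ins for billion-scale FFNs).
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_forward(x: np.ndarray) -> np.ndarray:
    logits = x @ router                    # router score per expert
    top = np.argsort(logits)[-top_k:]      # indices of the k best experts
    gate = np.exp(logits[top] - logits[top].max())
    gate /= gate.sum()                     # softmax over the selected k only
    # Only top_k of n_experts weight matrices are used: 2 of 16 here; at
    # scale, this is how 30B total parameters can run with far fewer active.
    return sum(g * (x @ experts[i]) for g, i in zip(gate, top))

y = moe_forward(rng.standard_normal(d_model))
print(y.shape)  # (64,) -- same shape as the input, computed by just 2 experts
```

In a real model, the roughly 3 billion active parameters also include always-on components such as attention and embedding layers, not just the routed experts.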
Nvidia trained the model using two post-training techniques: Cascade RL (reinforcement learning) and Multi-Domain On-Policy Distillation, applied on top of the base model Nemotron-3-Nano-30B-A3B-Base. Training data is publicly available as Nemotron-Cascade-2-SFT-Data and Nemotron-Cascade-2-RL-data on Hugging Face.
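For anyone inspecting the released data, a minimal loading sketch with the Hugging Face datasets library might look like the following; note that the nvidia/ organization prefix is an assumption, since the source gives only the dataset names.

```python
# Sketch of loading the published post-training data with Hugging Face
# `datasets`. The "nvidia/" org prefix and the split name are assumptions;
# the source provides only the dataset names.
from datasets import load_dataset

sft = load_dataset("nvidia/Nemotron-Cascade-2-SFT-Data", split="train")
print(sft)      # inspect features and row count
print(sft[0])   # peek at one SFT record
```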
Official benchmarks from Nvidia’s model card show 92.4% on AIME 2025 (rising to 98.6% with Tool-Integrated Reasoning), 87.2% on LiveCodeBench v6, and 99.0% on the NIAH@1M long-context retrieval benchmark. Nvidia reports gold-medal-equivalent competition scores on IOI 2025 (439.3 points) and IMO 2025 (35 points). The community HumanEval score of 97.6% was obtained on a quantized IQ4_XS version; quantization can affect benchmark outcomes relative to full-precision weights.
Who’s Affected
Local inference users running models on single-GPU consumer hardware are the primary beneficiaries of the 3B-active-parameter design. The model is available in 28 quantized formats through llama.cpp, LM Studio, Jan, and Ollama. Full-precision deployment requires vLLM version 0.17.1 or later with the --mamba-ssm-cache-dtype float32 and --trust_remote_code flags.
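Translated to vLLM’s offline Python API, a minimal launch consistent with those flags might look like the sketch below; the Hugging Face repo id is an assumed placeholder.

```python
# Offline-inference sketch with vLLM, using the flags the model card calls
# for. The Hugging Face repo id is an assumed placeholder; the two keyword
# arguments mirror the --mamba-ssm-cache-dtype and --trust_remote_code flags.
from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/Nemotron-Cascade-2-30B-A3B",  # assumed repo id
    trust_remote_code=True,                     # custom hybrid Mamba+attention arch
    mamba_ssm_cache_dtype="float32",            # dedicated SSM state cache
)
params = SamplingParams(temperature=0.6, max_tokens=512)
outputs = llm.generate(["Write a Python function that merges two sorted lists."], params)
print(outputs[0].outputs[0].text)
```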
Developers working on competition-level mathematics or programming automation—agentic coding via OpenHands, olympiad problem solving, or code generation pipelines—are the primary audience identified in Nvidia’s model card. The model is released under the NVIDIA Open Model License rather than a permissive open-source license such as Apache 2.0 or MIT; the NVIDIA license places restrictions on certain commercial deployments.
What’s Next
Ilintar indicated further testing is planned: “I’m going to run some more tests on this model, but I feel it deserves a bit more attention.” No timeline was given, and independent replication of the HumanEval and ClassEval figures had not appeared in the thread at time of writing.
Known limitations documented in the official model card include moderate performance on long-context benchmarks AA-LCR (39.1) and LongBench (40.3), and a general-knowledge score of 86.3% on MMLU-Redux. Users deploying the model for tool-use tasks should note that it does not follow standard function-calling conventions: tool responses must be wrapped in <tool_response> tags inside the user role rather than a dedicated tool role. Full training and evaluation methodology is available in the paper at arXiv:2603.19220.
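To make the tool-use caveat concrete, the message layout described in the model card would look roughly like the following; the tool name, result, and the assistant turn’s call format are invented or elided for illustration.

```python
# Sketch of the tool-use convention noted above: the tool's output is sent
# back inside a *user* message wrapped in <tool_response> tags, rather than
# in a dedicated "tool" role. The tool name and result are invented, and the
# assistant turn's call format is elided (it depends on the chat template).
messages = [
    {"role": "user", "content": "What is 37 * 41? Use the calculator tool."},
    {"role": "assistant", "content": "..."},  # model's tool call (format elided)
    # Tool result: wrapped in <tool_response> tags, inside the user role.
    {"role": "user", "content": "<tool_response>1517</tool_response>"},
]
```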