TOOL UPDATES

NVIDIA’s Puzzle NAS Cuts OpenAI’s 120B Model to 88B With 2.82× Speedup

Ryan Matsuda · Mar 26, 2026 · Updated Apr 7, 2026 · 4 min read
Engine Score 8/10 — Important

NVIDIA’s release of an 88B-parameter model on Hugging Face is a significant development for AI researchers and developers. The weights are immediately usable, even though the news first circulated through a secondary channel (a Reddit thread) rather than an official NVIDIA announcement.


NVIDIA published gpt-oss-puzzle-88B on Hugging Face on March 26, 2026 — an 88-billion parameter language model derived from OpenAI’s gpt-oss-120b using a post-training neural architecture search framework called Puzzle. The release represents a concrete application of NAS techniques to a frontier-scale mixture-of-experts model, with documented throughput gains over the parent.

  • Puzzle reduced the 120B parent to roughly 88B total parameters (73% of the original) using heterogeneous expert pruning and selective window attention.
  • On a single H100 GPU, gpt-oss-puzzle-88B delivers a 2.82× throughput improvement over gpt-oss-120b while matching or exceeding its accuracy across reasoning tasks.
  • The model supports 128K-token context and three runtime reasoning modes (low, medium, high), allowing operators to trade latency against response depth at inference time.
  • A formal technical paper extending Puzzle to mixture-of-experts models — arXiv 2602.11937 — accompanies the release.

What Happened

NVIDIA released gpt-oss-puzzle-88B on March 26, 2026. The model is derived from OpenAI’s gpt-oss-120b through NVIDIA’s Puzzle post-training neural architecture search framework, documented in arXiv preprint 2602.11937: “Extending Puzzle for Mixture-of-Experts Reasoning Models with Application to GPT-OSS Acceleration.” It is licensed under NVIDIA’s Open Model License, which permits commercial use. The model card did not list individual authors at the time of publication.

Why It Matters

Deploying models at the 100B+ parameter scale is expensive. Quantization and weight pruning are common approaches to reducing inference costs, but post-training NAS takes a different path: it searches for an optimal sub-architecture rather than simply compressing existing weights, and then applies knowledge distillation to recover accuracy in the new architecture. The original Puzzle framework was first described for dense models in arXiv 2411.19146; this release extends it to mixture-of-experts architectures, which present a different pruning challenge because experts are activated selectively per token rather than uniformly.
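To make the distinction concrete, here is a minimal sketch of a knowledge-distillation recovery step in PyTorch. The loss form, temperature, and frozen-teacher setup are illustrative assumptions, not NVIDIA’s published Puzzle training code:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions.

    The temperature and loss form are illustrative; Puzzle's actual
    recovery objective is described in the accompanying paper.
    """
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 to keep gradient magnitudes comparable across temperatures.
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

# Usage sketch: the teacher (the 120B parent) stays frozen; only the
# pruned student's weights are updated.
# with torch.no_grad():
#     teacher_logits = teacher(input_ids).logits
# loss = distillation_loss(student(input_ids).logits, teacher_logits)
```

The point of the distillation stage is that the searched sub-architecture starts from inherited weights that no longer quite fit it; matching the parent’s output distribution recovers most of the lost accuracy far more cheaply than retraining from scratch.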

The release also builds directly on OpenAI’s decision to publish gpt-oss-120b and gpt-oss-20b as open weights, documented in arXiv 2508.10925, which made the base model available for downstream architectural work of this kind.

Technical Details

The Puzzle framework applied three structural changes to produce the 88B model from the 120B base. First, heterogeneous MoE expert pruning removed experts non-uniformly across layers using activation-based importance scoring, resulting in different expert counts per layer rather than a uniform reduction. Second, selective window attention was introduced, using an 8K attention window instead of full attention, which the model card states reduces KV-cache requirements by approximately 40%. Third, RoPE scaling was adjusted via an increased YaRN factor to maintain stability at the full 128K context length despite the architectural changes.
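NVIDIA has not released the scoring pipeline, but activation-based expert importance can be sketched along these lines. The statistic (mean routed probability mass per expert over a calibration set) and the coverage-threshold rule are assumptions for illustration, not the published method:

```python
import torch

def score_experts(router_probs: torch.Tensor) -> torch.Tensor:
    """Score experts in one MoE layer by mean routed probability mass.

    router_probs: [num_tokens, num_experts] router softmax outputs
    collected over a calibration set. This statistic is an assumption;
    NVIDIA's exact importance measure is not published in the card.
    """
    return router_probs.mean(dim=0)  # [num_experts]

def experts_to_keep(router_probs: torch.Tensor,
                    mass_threshold: float = 0.95) -> torch.Tensor:
    """Keep the smallest set of experts covering `mass_threshold` of mass.

    Layers with concentrated routing keep fewer experts, so the surviving
    expert count differs per layer -- the heterogeneous pruning the card
    describes, as opposed to a uniform per-layer reduction.
    """
    scores = score_experts(router_probs)
    order = torch.argsort(scores, descending=True)
    cum = torch.cumsum(scores[order], dim=0) / scores.sum()
    k = int((cum < mass_threshold).sum().item()) + 1
    return order[:k]
```

A coverage rule like this is one simple way to get non-uniform expert counts: layers whose routing is spread thin keep more experts than layers where a few experts absorb most of the traffic.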

Recovery training ran in two stages: a knowledge distillation phase over 84 billion tokens at 128K sequence length with the MoE router frozen, followed by a reinforcement learning stage covering competitive programming, mathematical reasoning, multi-choice QA, instruction following, and structured outputs. MoE expert weights were quantized to MXFP4 and the KV cache to FP8, which the model card states yields approximately 2× KV-cache token capacity. On an 8×H100 configuration, the published figures show a 1.63× throughput improvement for long-context workloads (64K input / 64K output) and 1.22× for short-context workloads (4K / 4K) compared to gpt-oss-120b. On a single H100, the reported gain is 2.82×.
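The 2× figure follows from element width alone: FP8 stores one byte per cached key/value element versus two for BF16. A back-of-the-envelope sketch, where the layer and head counts are placeholders rather than the published architecture:

```python
# Back-of-the-envelope KV-cache capacity. All architecture numbers below
# are placeholders for illustration, not gpt-oss-puzzle-88B's real config.
num_layers = 36
num_kv_heads = 8
head_dim = 64

def kv_bytes_per_token(bytes_per_elem: int) -> int:
    # Two tensors (K and V) per layer, per KV head, per head dimension.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

budget = 20 * 2**30  # hypothetical cache budget in bytes on an H100-80GB
tokens_bf16 = budget // kv_bytes_per_token(2)  # BF16: 2 bytes per element
tokens_fp8 = budget // kv_bytes_per_token(1)   # FP8: 1 byte per element
print(f"BF16: {tokens_bf16:,} tokens, FP8: {tokens_fp8:,} tokens")
# FP8 halves bytes per element, so capacity doubles -- the ~2x the card cites.
```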

The model’s chat template exposes a reasoning_effort parameter, settable to “low,” “medium” (default), or “high,” allowing operators to control inference depth at runtime. Benchmarks listed in the model card include GPQA-Diamond (the 198-question graduate-level science subset of GPQA), AIME25, MMLU-Pro, and RULER 128K for long-context evaluation, though numerical scores for gpt-oss-puzzle-88B versus its parent on those benchmarks were not published in the model card at time of writing.
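As a concrete illustration, here is how runtime mode selection might look through Hugging Face Transformers, which forwards extra keyword arguments to the chat template. The repository ID is assumed from the model name, and the exact kwarg plumbing should be verified against the published template:

```python
from transformers import AutoTokenizer

# Repository ID assumed from the model name; check the Hugging Face page.
tok = AutoTokenizer.from_pretrained("nvidia/gpt-oss-puzzle-88B")

messages = [{"role": "user", "content": "Prove that sqrt(2) is irrational."}]

# Transformers passes extra keyword arguments through to the Jinja chat
# template; the card describes a reasoning_effort parameter with
# low/medium/high values.
prompt = tok.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=False,
    reasoning_effort="high",  # default is "medium"
)
print(prompt)
```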

Who’s Affected

Developers building agentic applications are a direct audience: the model’s chat template includes native support for browser and Python tool integrations via a builtin_tools parameter, and the TypeScript-style type system in the template handles arrays, nullable types, and oneOf union schemas for structured tool-call payloads. NVIDIA recommends vLLM as the primary inference backend, with Hugging Face Transformers 4.57.3 or later also supported.
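A similarly hedged sketch of enabling the built-in tools through the same template interface; the repository ID and the list-of-strings value format are assumptions rather than confirmed API, and should be checked against the model card:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("nvidia/gpt-oss-puzzle-88B")  # ID assumed

messages = [{"role": "user",
             "content": "What changed in the latest vLLM release?"}]

# The card describes native browser and Python tools enabled through a
# builtin_tools template parameter; the value format here is an assumption.
prompt = tok.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=False,
    builtin_tools=["browser", "python"],
)
```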

Infrastructure teams running multi-tenant inference on NVIDIA H100 or B200 hardware stand to benefit most directly from the throughput figures. At 73% of the parent’s parameter count, gpt-oss-puzzle-88B reduces the GPU memory footprint meaningfully — moving from a model that requires multi-node serving to one that can run on a single H100-80GB node affects both hardware procurement and per-query serving costs.
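Rough weight-footprint arithmetic makes the single-GPU claim plausible. The expert/non-expert parameter split and per-parameter bit costs below are assumptions, not published figures:

```python
# Rough weight-footprint estimate. The expert/non-expert parameter split
# and per-parameter bit costs are assumptions, not published figures.
total_params = 88e9
expert_fraction = 0.90  # MoE models keep most parameters in experts
expert_bits = 4.25      # MXFP4: 4-bit values plus shared block scales
other_bits = 16         # attention/embeddings kept in BF16 (assumed)

bytes_total = total_params * (
    expert_fraction * expert_bits + (1 - expert_fraction) * other_bits
) / 8
print(f"~{bytes_total / 2**30:.0f} GiB of weights")
# Roughly 56 GiB by this estimate -- consistent with fitting on a single
# H100-80GB with room left over for KV cache, as noted above.
```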

What’s Next

The model card does not publish side-by-side numerical benchmark comparisons between gpt-oss-puzzle-88B and gpt-oss-120b on specific tasks such as GPQA-Diamond or AIME25, which would allow independent verification of the accuracy-retention claims. The accompanying arXiv preprint (2602.11937) contains the full methodology for the MoE extension of Puzzle, and third-party evaluation on those benchmarks would clarify how much task-specific accuracy the architectural changes trade away.

The training data disclosure notes a range of 1 billion to 10 trillion tokens — an unusually wide range that leaves the exact distillation dataset scale ambiguous. Reproducing the Puzzle search process for other base models would require access to the activation-scoring pipeline, which has not been separately open-sourced alongside the weights.
