- NVIDIA released Star Elastic on May 9, 2026, a post-training method that embeds nested 23B and 12B reasoning submodels inside a single 30B Nemotron Nano v3 checkpoint, trained in one run on ≈160 billion tokens.
- All three variants live in one checkpoint and can be extracted without additional fine-tuning. Active parameters are 3.6B / 2.8B / 2.0B for the three sizes respectively.
- A learnable Gumbel-Softmax router selects the nested submodel architectures end-to-end, trained against a knowledge-distillation loss with the non-elastified parent model as teacher.
- Star Elastic uses Router-Weighted Expert Activation Pruning (REAP) for MoE layers and supports nesting along SSM dimension, embedding channels, attention heads, Mamba heads, MoE expert count, and FFN intermediate dimension.
What Happened
NVIDIA released Star Elastic on May 9, 2026 — a post-training method that embeds multiple nested submodels at different parameter budgets inside a single parent reasoning model, using a single training run. Applied to Nemotron Nano v3 (a hybrid Mamba-Transformer-MoE model with 30B total / 3.6B active parameters), Star Elastic produces 23B (2.8B active) and 12B (2.0B active) nested variants trained with approximately 160 billion tokens. All three variants live in one checkpoint and can be extracted without additional fine-tuning.
Why It Matters
Training a family of LLMs has historically required a separate full training run for each size, multiplying compute costs by the number of variants supported. Star Elastic collapses that cost into one training run plus extraction. For dev teams running inference at scale, this dramatically reduces the storage, deployment, and ongoing fine-tuning overhead of supporting multiple model sizes. The technique also unlocks a different inference strategy: using a smaller submodel for the thinking phase and a larger one for the answering phase, the configuration NVIDIA's evaluation found optimal and denotes "ℳS → ℳL".
Technical Details
Star Elastic supports nesting along multiple architectural axes: the SSM (State Space Model) dimension, embedding channels, attention heads, Mamba heads and head channels, MoE expert count, and FFN intermediate dimension. For MoE layers, Star Elastic uses Router-Weighted Expert Activation Pruning (REAP), which ranks experts by both routing gate values and expert output magnitudes, a more principled signal than naive frequency-based pruning, which ignores how much each expert actually contributes to the layer output.
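To make the REAP criterion concrete, here is a minimal sketch of a gate-times-magnitude saliency score, assuming per-token router gates and per-expert outputs are available; the function names and tensor shapes are illustrative, not NVIDIA's implementation.

```python
import torch

def reap_expert_saliency(gate_values: torch.Tensor,
                         expert_outputs: torch.Tensor) -> torch.Tensor:
    """Score experts by routing gate value times expert output magnitude.

    gate_values:    (num_tokens, num_experts) router gate weights
    expert_outputs: (num_tokens, num_experts, d_model) per-expert outputs
    """
    # L2 norm of each expert's output per token -> (num_tokens, num_experts)
    out_norms = expert_outputs.norm(dim=-1)
    # Gate-weighted magnitude, averaged over tokens -> (num_experts,)
    return (gate_values * out_norms).mean(dim=0)

def keep_top_experts(saliency: torch.Tensor,
                     target_num_experts: int) -> torch.Tensor:
    # Frequency-based pruning would only count how often an expert fires;
    # this keeps the experts whose gated contribution to the layer output
    # is largest.
    return torch.topk(saliency, k=target_num_experts).indices
```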
A key distinction from prior compression methods like Minitron: Star Elastic uses an end-to-end trainable router rather than a fixed compression recipe. The router takes a target budget (e.g., “give me a 2.8B active parameter model”) as a one-hot input and outputs differentiable masks selecting which components are active at that budget level. Masks are trained jointly with the model through Gumbel-Softmax, allowing gradient flow through discrete architectural decisions.
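A sketch of what a budget-conditioned Gumbel-Softmax router could look like in PyTorch, under stated assumptions: a one-hot budget input, one keep/drop decision per prunable component, and straight-through hard sampling. The module and parameter names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BudgetRouter(nn.Module):
    """Maps a one-hot budget to per-component keep/drop masks."""

    def __init__(self, num_budgets: int, num_components: int, hidden: int = 64):
        super().__init__()
        # Two logits per component: index 0 = drop, index 1 = keep.
        self.net = nn.Sequential(
            nn.Linear(num_budgets, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_components * 2),
        )

    def forward(self, budget_one_hot: torch.Tensor, tau: float = 1.0):
        logits = self.net(budget_one_hot).view(-1, 2)
        # hard=True yields discrete {0,1} masks in the forward pass while
        # gradients flow through the soft relaxation (straight-through).
        masks = F.gumbel_softmax(logits, tau=tau, hard=True)
        return masks[:, 1]  # the "keep" channel is the component mask

# Usage: request the middle of three budgets for, e.g., 32 attention heads.
router = BudgetRouter(num_budgets=3, num_components=32)
budget = F.one_hot(torch.tensor(1), num_classes=3).float()
head_mask = router(budget)  # (32,) differentiable 0/1 mask
```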
The loss function combines knowledge distillation (KD) — where the non-elastified parent model acts as the teacher — with a router loss penalizing deviation from the target resource budget (parameter count, memory, or latency). The router learns architecture choices that actually improve accuracy under KD, rather than just minimizing a proxy metric.
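Putting the two terms together, a hedged sketch of the combined objective; the forward-KL form of the distillation term, the quadratic budget penalty, and the weighting are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def star_elastic_loss(student_logits, teacher_logits, mask_param_count,
                      target_param_count, kd_temp=1.0, budget_weight=1.0):
    """KD loss against the non-elastified parent plus a budget penalty."""
    # Knowledge distillation: match the frozen parent's output distribution.
    kd = F.kl_div(
        F.log_softmax(student_logits / kd_temp, dim=-1),
        F.log_softmax(teacher_logits / kd_temp, dim=-1),
        log_target=True,
        reduction="batchmean",
    ) * (kd_temp ** 2)
    # Penalize deviation of the masked submodel's resource usage from the
    # requested budget (parameter count here; could be memory or latency).
    budget = (mask_param_count / target_param_count - 1.0) ** 2
    return kd + budget_weight * budget
```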
Training uses a two-stage curriculum: a short-context Stage 1 (sequence length 8,192 tokens) with uniform budget sampling, followed by an extended-context Stage 2 (sequence length 49,152 tokens) with non-uniform sampling that prioritizes the full 30B model (p(30B)=0.5, p(23B)=0.3, p(12B)=0.2). The extended-context stage is critical for reasoning performance: NVIDIA's ablations on the prior-generation Nano v2, which it presents as the empirical basis for the Nano v3 setup, show gains from Stage 2 alone of up to 19.8% on AIME-2025 for the 6B variant and 4.0 percentage points for the 12B variant, motivating its use in Star Elastic.
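The published sampling schedule reduces to a few lines; a sketch using the stage numbers and probabilities quoted above.

```python
import random

BUDGETS = ["30B", "23B", "12B"]
STAGE2_WEIGHTS = [0.5, 0.3, 0.2]  # non-uniform, favoring the full model

def sample_budget(stage: int) -> str:
    """Stage 1: uniform over budgets; Stage 2: weighted toward 30B."""
    if stage == 1:
        return random.choice(BUDGETS)
    return random.choices(BUDGETS, weights=STAGE2_WEIGHTS, k=1)[0]

def seq_len(stage: int) -> int:
    # 8,192-token short-context phase, then 49,152-token extended phase.
    return 8_192 if stage == 1 else 49_152
```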
Beyond static extraction, Star Elastic extends budget control during inference. Nemotron Nano v3's existing budget control caps the number of tokens generated during the thinking phase; Star Elastic additionally allows different nested submodels for the thinking phase versus the answering phase. NVIDIA evaluated four configurations; the optimal "ℳS → ℳL" setup allocates a cheaper submodel to generate extended reasoning, then switches to the larger model for the final answer.
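Schematically, the ℳS → ℳL pattern might look like the following; the `generate()` API, its `budget` argument, and the `<think>` delimiters are hypothetical stand-ins for whatever interface ships with the checkpoint.

```python
def ms_to_ml_generate(model, prompt: str, think_budget_tokens: int) -> str:
    """Think with the small nested submodel, answer with the large one.

    Assumes model.generate(text, budget=..., max_new_tokens=...) can run
    any nested submodel from the shared checkpoint; this API is hypothetical.
    """
    # Phase 1: cheap extended reasoning with the 12B nested submodel.
    thinking = model.generate(
        prompt + "<think>",
        budget="12B",
        max_new_tokens=think_budget_tokens,
        stop="</think>",
    )
    # Phase 2: switch to the full 30B model for the final answer.
    return model.generate(
        prompt + "<think>" + thinking + "</think>",
        budget="30B",
        max_new_tokens=1024,
    )
```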
Who’s Affected
NVIDIA's Nemotron Nano v3 customers gain three model sizes from one training run, dramatically reducing the compute and storage multiplier. The broader open-weight ecosystem (DeepSeek, Xiaomi MiMo, Moonshot Kimi, Zhipu GLM, Meta Llama) faces a methodology that, if adopted, would collapse their multi-size release strategies into single-run efforts. Inference platform providers (Together, Fireworks, Anyscale, Hugging Face) gain a new pattern for deploying the ℳS → ℳL inference strategy. Compression-focused research labs face a shift as well: Star Elastic's learnable router is a meaningful departure from fixed-recipe compression like Minitron.
What’s Next
The Star Elastic checkpoints for Nemotron Nano v3 should appear on Hugging Face shortly. Watch for independent benchmark validation on AIME-2025, GPQA, MMLU, and SWE-bench across the three nested variants. Other labs — particularly Meta and DeepSeek — are likely to release comparable elastic-architecture methods within the next 1-2 quarters. The ℳS → ℳL inference strategy will require platform-side support before it can be deployed widely; expect Together, Fireworks, and similar providers to add this in the next several months.