- On April 28, 2026, NVIDIA released Nemotron 3 Nano Omni, an open-weights 30B-parameter omni-modal model that jointly processes text, images, video, and audio.
- The model outperforms Qwen3-Omni 30B-A3B on document understanding (MMLongBench-Doc: 57.5 vs. 49.5), GUI automation (OSWorld: 47.4 vs. 29.0), and automatic speech recognition (HF Open ASR word error rate: 5.95 vs. 6.55, lower is better).
- NVIDIA claims 9.2x higher system throughput for video use cases and 7.4x for multi-document use cases compared to other open omni models at equivalent per-user interactivity thresholds.
- Weights are available on Hugging Face in BF16, FP8, and NVFP4 formats; training code components are open-sourced via NeMo-RL.
What Happened
On April 28, 2026, NVIDIA released Nemotron 3 Nano Omni, an open-weights model that processes text, images, video, and audio in a unified system. The model was developed by a team that includes researchers Tuomas Rintamaki, Amala Sanjay Deshmukh, Nabin Mulepati, and Collin McCarthy, among others.
The release extends NVIDIA’s Nemotron multimodal line, which previously covered vision-language tasks through Nemotron Nano V2 VL. Nemotron 3 Nano Omni adds native audio understanding and long-form video-plus-audio processing to that foundation, targeting five workload classes: document analysis, automatic speech recognition, audio-video understanding, GUI-based computer use, and general multimodal reasoning.
Why It Matters
Open-weights omni-modal models — systems that jointly process all four major modalities without routing to separate specialized models — represent a narrow field in mid-2026. Alibaba’s Qwen3-Omni (30B-A3B) is the most directly comparable publicly available model. Nemotron 3 Nano Omni’s release establishes a second major open-weights competitor in this category and claims to lead on multiple evaluations.
For organizations running inference on private infrastructure, the efficiency figures are operationally significant. NVIDIA’s blog post states that the model delivers “superior compression” through a combination of Conv3D frame fusion and inference-time token pruning, enabling higher concurrency at fixed hardware budgets compared to models that process video frames independently.
Technical Details
Nemotron 3 Nano Omni uses a 30B-parameter Mixture-of-Experts architecture with approximately 3 billion active parameters per forward pass (designated 30B-A3B). The language backbone interleaves 23 Mamba selective state-space layers, 23 MoE layers with 128 experts and top-6 routing plus one shared expert, and 6 grouped-query attention layers. The vision encoder is C-RADIOv4-H; the audio encoder is Parakeet-TDT-0.6B-v2, connected via a 2-layer MLP projector.
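To make the layer mix concrete, the sketch below expresses the published counts as a configuration object. This is an illustration only: the release states the layer counts and routing parameters but not the interleaving order, and every field name here is hypothetical.

```python
from dataclasses import dataclass

# Hypothetical config mirroring the published layer mix; the interleaving
# order of the three layer types is not specified in the release.
@dataclass
class NemotronOmniBackboneConfig:
    n_mamba_layers: int = 23       # Mamba selective state-space layers
    n_moe_layers: int = 23         # Mixture-of-Experts layers
    n_attention_layers: int = 6    # grouped-query attention layers
    n_experts: int = 128           # routed experts per MoE layer
    top_k: int = 6                 # experts activated per token
    n_shared_experts: int = 1      # always-on shared expert

    @property
    def total_layers(self) -> int:
        return self.n_mamba_layers + self.n_moe_layers + self.n_attention_layers

cfg = NemotronOmniBackboneConfig()
print(cfg.total_layers)  # 52 layers in the language backbone
```

The top-6-of-128 routing is what separates the 30B total parameter count from the roughly 3B activated per forward pass: each token touches only six routed experts plus the shared expert in every MoE layer.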
On the vision side, the model replaces the tiling approach used in V2 VL with dynamic resolution processing at native aspect ratio, representing each image with between 1,024 and 13,312 visual patches of 16×16 pixels — equivalent to resolutions from 512×512 up to 1,840×1,840. For video, a Conv3D tubelet embedding path fuses pairs of consecutive frames before the vision transformer, halving the number of vision tokens passed to the language backbone.
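A minimal sketch of both mechanisms follows, assuming (since the release does not specify them) a temporal kernel and stride of 2 for the tubelet path and a hypothetical 1,024-dimensional embedding width:

```python
import torch
import torch.nn as nn

# 16x16 patching at native aspect ratio: 512x512 -> 32x32 = 1,024 patches,
# the stated lower bound of the per-image token budget.
def patch_count(height: int, width: int, patch: int = 16) -> int:
    return (height // patch) * (width // patch)

print(patch_count(512, 512))  # 1024

# Tubelet embedding that fuses pairs of consecutive frames before the
# vision transformer. Kernel/stride of 2 on the time axis is an assumption
# consistent with pairwise fusion; 1024 channels is a placeholder width.
tubelet = nn.Conv3d(3, 1024, kernel_size=(2, 16, 16), stride=(2, 16, 16))
frames = torch.randn(1, 3, 8, 512, 512)   # batch, RGB, 8 frames of 512x512
tokens = tubelet(frames)                   # shape: (1, 1024, 4, 32, 32)
# 8 frames x 1,024 patches = 8,192 per-frame tokens, fused down to 4,096.
print(tokens.shape[2] * tokens.shape[3] * tokens.shape[4])  # 4096
```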
On the audio side, the model was trained on sequences of up to 1,200 seconds (20 minutes) and can fit over five hours of 16 kHz audio within the LLM context window. For video, an inference-time feature called Efficient Video Sampling (EVS) further prunes static tokens — those whose content is unchanged between frames — after the vision encoder, reducing latency while preserving accuracy on dynamic content.
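The release does not publish EVS's pruning criterion; the sketch below uses a cosine-similarity threshold between consecutive fused frames as one plausible stand-in, so both the rule and the 0.98 threshold are assumptions:

```python
import torch
import torch.nn.functional as F

def prune_static_tokens(tokens: torch.Tensor, threshold: float = 0.98):
    """Drop patch tokens that barely change between frames.

    tokens: (frames, patches, dim) features after the vision encoder.
    The similarity rule and threshold are illustrative assumptions.
    """
    kept = [tokens[0]]                       # always keep the first frame
    for prev, curr in zip(tokens[:-1], tokens[1:]):
        sim = F.cosine_similarity(prev, curr, dim=-1)  # (patches,)
        kept.append(curr[sim < threshold])   # keep only changed patches
    return torch.cat(kept, dim=0)

feats = torch.randn(4, 1024, 768)            # 4 fused frames, 1,024 patches
print(prune_static_tokens(feats).shape)      # (<= 4096, 768)
```

On mostly static footage such as slides or dashboards, pruning of this kind removes the bulk of per-frame tokens, which is where the claimed concurrency gains at fixed hardware budgets would come from.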
Headline benchmark results, comparing against Qwen3-Omni 30B-A3B and, where reported, Nemotron Nano V2 VL:

| Benchmark | Nemotron 3 Nano Omni | Qwen3-Omni 30B-A3B | Nemotron Nano V2 VL |
|---|---|---|---|
| MMLongBench-Doc | 57.5 | 49.5 | 38.0 |
| OSWorld (computer use) | 47.4 | 29.0 | n/a |
| VoiceBench | 89.4 | 88.8 | n/a |
| HF Open ASR leaderboard (WER, lower is better) | 5.95 | 6.55 | n/a |
Who’s Affected
Enterprises handling high-volume document workflows — legal, compliance, finance, and research organizations processing multi-page contracts, regulatory filings, or technical reports — are the primary stated target. NVIDIA documented the model as capable of handling documents exceeding 100 pages, covering layout, tables, figures, formulas, and cross-page references.
Developers building GUI automation agents are directly addressed by the OSWorld and ScreenSpot-Pro scores (47.4 and 57.8, respectively), which measure task completion in real graphical interface environments. Video-heavy enterprise workflows — meeting transcription, customer support analysis, and product demo parsing — fall within the model’s documented audio-video reasoning scope.
What’s Next
Checkpoints are available now on Hugging Face in BF16, FP8, and NVFP4 formats. NVIDIA also released the Nemotron-Image-Training-v3 dataset and open-sourced training code components through the NeMo-RL repository. A full technical report covering architecture design, training methodology, and data pipelines is available at research.nvidia.com. No API availability or commercial licensing terms were announced alongside the release.
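For teams that want to try the checkpoints, a hypothetical loading sketch is below. The announcement names the formats but not the repository id, so the id is a placeholder, and the `transformers` auto classes are guesses; an omni-modal checkpoint may require custom code or a dedicated model class.

```python
# Hypothetical loading sketch. The repo id below is a placeholder (the
# announcement does not give one), and the auto classes are assumptions;
# the actual checkpoint may need trust_remote_code or a dedicated class.
from transformers import AutoModelForCausalLM, AutoProcessor

repo_id = "nvidia/<nemotron-3-nano-omni>"  # placeholder, not a real id
processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id, torch_dtype="bfloat16", trust_remote_code=True
)
```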