Key Takeaways
- NVIDIA’s Cosmos Predict 2.5 is a world foundation model, trained on 200 million curated video clips, that unifies text-to-world, image-to-world, and video-to-world generation in a single architecture.
- The model generates physics-compliant video simulations up to 30 seconds long, used to train robots and autonomous vehicles before real-world deployment.
- Available at 2B and 14B parameter scales under the NVIDIA Open Model License, the model supports specialized post-training for autonomous driving (7-camera multiview) and robotic control tasks.
- Partners including 1X, Figure AI, Agility Robotics, Uber, and Waabi are already using Cosmos models for synthetic data generation and policy evaluation.
What Happened
NVIDIA released Cosmos Predict 2.5, the latest generation of its World Foundation Models for Physical AI. The flow-based model was trained on 200 million high-quality video clips, curated by a pipeline that processed 35 million hours of raw video into more than 6 billion clips before filtering. The release, first made available in October 2025 with ongoing updates through February 2026, includes both 2B and 14B parameter versions alongside specialized variants for autonomous driving and robotic control.
“Just as large language models revolutionized generative and agentic AI, Cosmos world foundation models are a breakthrough for physical AI,” said Jensen Huang, founder and CEO of NVIDIA.
Why It Matters
Training robots and autonomous vehicles in the physical world is slow, expensive, and dangerous. A self-driving car cannot safely learn from a near-collision, and a warehouse robot cannot practice grasping fragile items without breaking them. World foundation models address this by generating synthetic environments where AI agents can practice at scale before touching real hardware.
Cosmos Predict 2.5 represents a consolidation step. Its predecessor required three separate models for text-to-world, image-to-world, and video-to-world generation. The 2.5 release merges all three into a single unified architecture, reducing complexity for developers building physical AI pipelines. It also integrates Cosmos Reason 1, a 7-billion-parameter vision-language model that topped the Physical Reasoning leaderboard, as its text-scene encoder for improved grounding.
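The consolidation can be pictured as a single entry point that dispatches on the type of conditioning input. The sketch below is purely illustrative; the class and function names are assumptions for this article, not Cosmos’s actual API:

```python
from dataclasses import dataclass

# Hypothetical conditioning types -- stand-ins, not the real Cosmos interface.
@dataclass
class TextPrompt:
    text: str

@dataclass
class ImageFrame:
    pixels: list  # stand-in for an image tensor

@dataclass
class VideoClip:
    frames: list  # stand-in for a sequence of frames

def generate_world(conditioning):
    """One unified entry point instead of three separate models."""
    if isinstance(conditioning, TextPrompt):
        mode = "text-to-world"
    elif isinstance(conditioning, ImageFrame):
        mode = "image-to-world"
    elif isinstance(conditioning, VideoClip):
        mode = "video-to-world"
    else:
        raise TypeError("unsupported conditioning")
    # A real model would run its flow-based sampler here;
    # this sketch only reports the dispatch path.
    return mode

print(generate_world(TextPrompt("a warehouse robot stacking boxes")))
# -> text-to-world
```

The design benefit is that downstream pipelines maintain one model checkpoint and one interface, regardless of which modality supplies the conditioning.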
Technical Details
The model’s data pipeline is one of its defining features. NVIDIA’s Cosmos Curator processed 35 million hours of raw video, yielding over 6 billion clips. After quality filtering, 200 million clips survived. The resulting model generates spatially and temporally coherent video up to 30 seconds in length, maintaining consistent physics, lighting, and object permanence throughout each sequence.
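The curation arithmetic — 35 million hours of raw video cut into more than 6 billion clips, of which 200 million survive (roughly a 3% keep rate) — amounts to aggressive threshold filtering. The sketch below is a toy stand-in: the quality signals, scoring function, and threshold are all hypothetical, not details of NVIDIA’s Cosmos Curator:

```python
def quality_filter(clips, score_fn, threshold):
    """Keep only clips whose quality score clears the threshold.
    A toy stand-in for the filtering stage of a curation pipeline."""
    return [c for c in clips if score_fn(c) >= threshold]

# Toy clips: (clip_id, sharpness, motion_stability), each in [0, 1].
clips = [
    ("clip_a", 0.9, 0.8),
    ("clip_b", 0.2, 0.9),
    ("clip_c", 0.7, 0.1),
    ("clip_d", 0.8, 0.9),
]

# Hypothetical score: average of the two quality signals.
score = lambda c: (c[1] + c[2]) / 2

kept = quality_filter(clips, score, threshold=0.75)
print([c[0] for c in kept])  # -> ['clip_a', 'clip_d']
```

At production scale the same idea runs over billions of clips with learned quality models rather than hand-set thresholds, but the shape of the computation is the same: score everything, keep the small fraction that clears the bar.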
Post-training uses a reinforcement learning algorithm that refines the model’s outputs for prompt alignment and reduced hallucination. Grounded prompt alignment, a technique introduced in this release, anchors generated scenes to semantic descriptions, reducing the gap between what was requested and what appears on screen.
Specialized variants include a 7-camera multiview model for autonomous driving simulation, an action-conditioned robotics model, and a policy model post-trained on the Libero and RoboCasa benchmarks. The companion model, Cosmos Transfer 2.5, is 3.5x smaller than its predecessor while delivering up to a 60% improvement on autonomous vehicle lane and cuboid detection tasks, evaluated with the LATR and BEVFormer frameworks.
Who’s Affected
Robotics and autonomous vehicle developers stand to gain the most. Partners already integrating Cosmos models include 1X, Agility Robotics, Figure AI, Skild AI, Uber, Waabi, Foretellix, Parallel Domain, Nexar, Oxa, and Virtual Incision. These companies use the models for synthetic data generation, policy evaluation, and closed-loop simulation across warehouse robotics, surgical robotics, and self-driving systems.
Independent developers also have access. The models, code, and post-training scripts are available on Hugging Face and GitHub under the NVIDIA Open Model License, with source code released under Apache 2.0. NVIDIA’s Cosmos Cookbook provides step-by-step recipes for building and deploying custom world models. The Cosmos Dataset Search tool uses vector-based retrieval to search billions of clips in seconds, shortening post-training data cycles from years to days.
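Vector-based retrieval of the kind the Cosmos Dataset Search tool describes reduces to embedding the query and ranking stored clip embeddings by similarity. A minimal sketch, with made-up clip IDs and three-dimensional embeddings standing in for real ones (a production system would use an approximate nearest-neighbor index over billions of vectors, not a dict):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy index: clip id -> embedding. All entries are fabricated examples.
index = {
    "forklift_turn": [0.9, 0.1, 0.0],
    "rainy_merge":   [0.1, 0.9, 0.2],
    "arm_grasp":     [0.0, 0.2, 0.9],
}

def search(query_vec, k=2):
    """Return the k clip ids most similar to the query embedding."""
    ranked = sorted(index.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [clip_id for clip_id, _ in ranked[:k]]

print(search([0.85, 0.15, 0.05]))  # -> ['forklift_turn', 'rainy_merge']
```

The speedup the article attributes to the tool comes from this pattern: a single embedding lookup plus a ranked scan replaces manually reviewing footage to assemble a post-training set.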
What’s Next
NVIDIA announced research on Cosmos Policy for visuomotor control in January 2026, signaling that the next phase targets end-to-end robotic policy learning directly within simulated worlds. Distilled 2B checkpoints were released in December 2025 for teams that need faster inference on constrained hardware. The company also introduced a Physical AI Data Factory Blueprint for organizations building full synthetic data augmentation pipelines. For enterprises requiring custom licensing, NVIDIA directs inquiries to [email protected].
