- Researchers from NVIDIA, UC Berkeley, Stanford, and Carnegie Mellon released CaP-X, a framework that systematically tests how well frontier AI models can control robots by writing code — and found none of the 12 models tested could match human-written programs in a single attempt.
- Without human-designed abstractions (pre-built commands for grasping, lifting, and placing objects), even top models including Gemini 3 Pro, GPT-5.2, and Claude Opus 4.5 fail at basic manipulation tasks, with success rates that collapse at low abstraction levels.
- The best frontier models achieved only around 30% average success on robot control tasks; human-written code sets the ceiling these models cannot reach without structured scaffolding.
- Agentic techniques — parallel code generation, automated debugging, and reinforcement learning — partially close the gap, lifting a 7B-parameter model from 20% to 72% success in simulation.
What Happened
A research team from NVIDIA, UC Berkeley, Stanford University, and Carnegie Mellon University published a paper in March 2026 introducing CaP-X (Code-as-Policy Extended), a framework designed to benchmark how well AI coding agents can control robots through self-written programs. The paper is available on arXiv (2603.22435) and the full framework is open-sourced at capgym.github.io under an MIT license.
The authors include Jim Fan, NVIDIA Director of AI and Distinguished Scientist and co-lead of Project GR00T, alongside Ken Goldberg of UC Berkeley, Fei-Fei Li and Jiajun Wu of Stanford, and researchers from CMU. Fan announced the release on X, writing: “CaP-X is brought to you by NVIDIA, Berkeley, Stanford, and CMU. I’d like to thank the legend Ken Goldberg who co-advised the work.”
The central finding: across 12 frontier models tested on 39 robot manipulation tasks, none could match the reliability of human-written control programs in a single attempt. The models included Gemini 3 Pro, GPT-5.2, Claude Opus 4.5, and open-source models such as Qwen3-235B and DeepSeek-V3.1.
Why It Matters
The claim that large language models can replace specialized robotics engineering has circulated widely since models demonstrated strong coding ability. CaP-X provides the first systematic, reproducible test of that claim across a structured set of physical manipulation tasks — and the results are more constrained than the broader narrative suggests.
Today’s best frontier models achieve just over 30% average success on CaP-Bench, the framework’s benchmark suite of robot control tasks, according to the paper. Human-written programs set the performance ceiling, and the gap is not marginal: the researchers describe a consistent 56-point shortfall between top model performance and human baselines on the benchmark’s most demanding tasks.
Physical AI differs from text or code generation in a critical way: errors cascade. A misplaced grasp command does not produce a wrong answer — it produces a failed physical interaction that no downstream correction can recover. The cost of a single reasoning error is absolute task failure.
Technical Details
CaP-X evaluates models along three axes: Abstraction Level, Temporal Interaction, and Perceptual Grounding. The abstraction level axis is where the sharpest failures appear.
Human-designed abstractions are pre-built function libraries that encapsulate complex physical operations behind a single callable command. A high-level abstraction like `pick_and_place(object_A, location_B)` hides the underlying steps: image segmentation to locate the object, depth processing to estimate distance, grasp planning to select a grip angle, and inverse kinematics to compute the arm joint positions. When models operate at the highest abstraction level (S1 in CaP-Bench’s tier system), they only need to sequence these calls correctly.
When those convenience functions are removed and models must work from low-level primitives (tier S4), performance collapses across all 12 models. Compilation rates — the share of generated code that even runs without errors — drop sharply for weaker and open-source models at low abstraction levels. The models must correctly combine dozens of interdependent code steps where one function call previously sufficed.
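The contrast between tiers can be sketched in a few lines of Python. Everything below — the MockArm class and every method name — is an illustrative stand-in, not CaP-Bench’s actual API; the point is only that one S1 call fans out into a chain of interdependent low-level steps at S4:

```python
class MockArm:
    """Records every call so the two tiers can be compared."""
    def __init__(self):
        self.log = []

    def __getattr__(self, name):
        # Any undefined method becomes a logged no-op returning a dummy value.
        def call(*args, **kwargs):
            self.log.append(name)
            return (0.0, 0.0, 0.0)
        return call

# Tier S1: one high-level convenience call does everything.
def stack_cubes_s1(env):
    env.pick_and_place("cube_red", on_top_of="cube_blue")

# Tier S4: the same task from low-level primitives. Every step consumes
# the previous step's output, so one bad value fails the whole task.
def stack_cubes_s4(env):
    mask = env.segment("cube_red")          # locate object in the image
    point = env.depth_to_point(mask)        # estimate 3-D position
    grasp = env.plan_grasp(point)           # choose a grip pose
    env.move_joints(env.inverse_kinematics(grasp))
    env.close_gripper()
    target = env.depth_to_point(env.segment("cube_blue"))
    env.move_joints(env.inverse_kinematics(target))
    env.open_gripper()

s1, s4 = MockArm(), MockArm()
stack_cubes_s1(s1)
stack_cubes_s4(s4)
print(len(s1.log), len(s4.log))  # → 1 11
```

One sequencing decision at S1 becomes eleven interdependent calls at S4 — and in the benchmark’s real primitives, each of those calls can independently fail.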
CaP-Bench tests tasks ranging from simple cube lifting to bimanual coordination across three simulation environments: RoboSuite, LIBERO-PRO, and BEHAVIOR. The benchmark includes 8 tiers covering both single-turn (S1–S4) and multi-turn (M1–M4) interaction, allowing evaluation of whether models can recover from failure with feedback.
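The multi-turn (M-tier) protocol can be sketched as a simple retry loop. The StubModel and StubEnv below are invented stand-ins, not the benchmark’s interfaces; they exist only so the loop runs end to end:

```python
from dataclasses import dataclass

@dataclass
class Result:
    success: bool
    error: str = ""

class StubEnv:
    """Pretend simulator: succeeds once the code contains the fix."""
    def execute(self, code):
        return Result(success="fixed" in code, error="grasp missed")

class StubModel:
    """Pretend model: fails on the first try, then uses the feedback."""
    def generate(self, feedback):
        return "fixed" if "failed" in feedback else "attempt"

def evaluate_multi_turn(model, env, task_description, max_turns=4):
    feedback = task_description
    for turn in range(1, max_turns + 1):
        code = model.generate(feedback)   # model writes a policy program
        result = env.execute(code)        # run it in simulation
        if result.success:
            return turn                   # number of turns needed
        # Feed the execution error back for the next attempt.
        feedback = task_description + "\nprevious attempt failed: " + result.error
    return None                           # no recovery within budget

print(evaluate_multi_turn(StubModel(), StubEnv(), "lift the cube"))  # → 2
```

Single-turn tiers are the same loop with `max_turns=1`: one shot, no feedback, no recovery.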
Who’s Affected
The findings most directly affect robotics companies and hardware manufacturers building on the assumption that general-purpose language models can serve as drop-in robot controllers. That assumption requires human engineers to maintain abstraction libraries — a dependency the CaP-X results make explicit rather than implicit.
Enterprise automation deployments that have marketed “AI-controlled” robotic systems face a more specific question: what layer of human-designed scaffolding sits beneath the model, and who maintains it when task requirements change? The CaP-X results show that scaffold quality, not model capability alone, determines whether a robot completes a task.
AI model developers face a benchmarking accountability problem. CaP-Bench is reproducible, open-source, and tied to physical outcomes rather than text scores. Claims about model capability in physical domains can now be tested against a standard the field did not previously have.
What’s Next
The researchers propose two paths toward closing the performance gap. The first is agentic scaffolding at inference time: their CaP-Agent0 system generates nine solution candidates in parallel from varying models and temperatures, uses a Visual Differencing Module to generate text status reports after each execution step, and builds an automatically accumulated function library from successful runs. CaP-Agent0 matches or exceeds human-written code on four of seven tested tasks without any task-specific training.
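The best-of-n candidate loop at the heart of this approach can be sketched as follows. The sampler, executor, and library structures are assumptions for illustration — the paper’s Visual Differencing Module feedback reports are omitted entirely:

```python
def best_of_n(task, samplers, execute, library, n=9):
    """Try up to n candidate programs; bank successes for later reuse.

    samplers: callables (e.g., different models or temperatures) that
              map a task description to a candidate program.
    execute:  runs a program in simulation, returns True on task success.
    library:  dict accumulating solved tasks -> working programs.
    """
    if task in library:                        # reuse a banked solution
        return library[task]
    for i in range(n):
        sampler = samplers[i % len(samplers)]  # vary model / temperature
        program = sampler(task)
        if execute(program):                   # physics sim is the judge
            library[task] = program            # grow the function library
            return program
    return None                                # all nine candidates failed

# Toy usage: the second sampler happens to produce a working program.
samplers = [lambda task: "bad_program", lambda task: "good_program"]
execute = lambda program: program == "good_program"
library = {}
print(best_of_n("lift cube", samplers, execute, library))  # → good_program
```

The accumulated library is the interesting part: each solved task effectively becomes a new human-free abstraction for future tasks, which is how the system can match human-written code on some tasks without task-specific training.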
The second path is reinforcement learning. CaP-RL trains coding agents using physics simulation as a reward signal — the robot either completes the task or it doesn’t. A Qwen2.5-Coder-7B model trained this way improved from 20% to 72% average success in simulation after 50 training iterations. On a real Franka Emika robot, the same model reached 84% success on cube lifting and 76% on cube stacking without additional fine-tuning, because the model optimizes through programmatic interfaces rather than visual observations — minimizing the gap between simulated and real environments.
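A toy sketch of that reward structure — assuming nothing about the paper’s actual training algorithm — looks like this: the simulation returns a binary outcome, and candidate programs that earn reward are reinforced over iterations:

```python
def sim_reward(program):
    """Binary physics-sim reward: the task is completed or it isn't.
    The program names here are invented placeholders."""
    return 1.0 if program == "lift_then_place" else 0.0

def train(programs, iterations=50, lr=0.5):
    """Minimal reinforcement sketch: repeatedly roll out each candidate
    and bump the weight of whatever the simulator rewards."""
    weights = {p: 1.0 for p in programs}
    for _ in range(iterations):
        for p in programs:                 # roll out every candidate
            weights[p] += lr * sim_reward(p)
    return max(weights, key=weights.get)   # the reinforced behavior

best = train(["drop", "lift_then_place", "spin"])
print(best)  # → lift_then_place
```

The real system trains a 7B coding model rather than reweighting a fixed candidate set, but the signal is the same: no labels, no demonstrations, just task completion in simulation — which is also why the learned policies transfer to hardware, since they act through programmatic interfaces rather than raw pixels.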
The full CaP-X framework — including CaP-Gym, CaP-Bench, CaP-Agent0, and CaP-RL — is open-sourced at github.com/capgym/cap-x, with the paper at arxiv.org/abs/2603.22435. The benchmark is designed for ongoing community use, and the authors intend CaP-Bench to serve as a standard evaluation for physical AI coding agents as the field develops. Teams evaluating robotic AI deployments can run the benchmark against their own model stack to identify where abstraction dependencies actually sit.
