- A review paper from Meta, Stanford, and the University of Illinois Urbana-Champaign argues code is the foundation AI agents use to reason, act, and coordinate.
- The authors argue that the ‘harness’ (the software layer that wraps the model with tools, sandboxed execution, memory, testing, and feedback channels) is the real bottleneck for autonomous systems.
- Commercial systems like Claude Code and OpenAI’s Codex already operate on this principle.
- The authors caution that current software tests can be incomplete and obscure risks, calling for more transparent evaluation mechanisms.
What Happened
A review paper from researchers at the University of Illinois Urbana-Champaign, Meta, and Stanford argues that code is how AI agents think and act, not just what they produce. This goes beyond the conventional view of code as an LLM output: in the paper’s framing, code is the medium agents use to reason, act, and coordinate with each other.
Why It Matters
The framing has direct implications for how the agentic-AI category is built and deployed. If code is the reasoning substrate rather than just the output, the bottleneck for capability gains shifts from model improvements (which are now largely commoditised at the top tier per the Cursor Composer 2.5 result) to harness improvements: the software layer that gives the model tools, isolated execution environments, memory, and feedback.
The paper’s argument also aligns with Hugging Face’s recent agent-vocabulary glossary (May 25) that distinguished ‘harness’ (the execution wrapper) from ‘scaffolding’ (the structural prompting layer) and ‘agent’ (the system as a whole). Together these pieces signal that the agentic-AI research community is converging on shared technical vocabulary.
Technical Details
The authors call the wrapping software layer the ‘harness.’ It covers everything from tools and interfaces to sandboxed execution environments, memory, testing, permission boundaries, execution loops, and feedback channels. Without the harness, a language model is stateless: it produces text in response to a prompt and stops. With the harness, the model becomes a working agent that can keep working through a task over many steps.
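To make the idea concrete, here is a minimal sketch of a harness loop in Python. It is not the paper's design or any vendor's implementation: the `Harness` class, the `TOOLS` registry, the `ALLOWED_TOOLS` set, and the `scripted_model` function are all hypothetical names invented for illustration, and the "model" is a hard-coded stand-in rather than a real LLM call. Sandboxed execution is left out entirely. The sketch only shows four of the pieces listed above: tools, a permission boundary, memory, and an execution loop that feeds observations back to the model.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical tool registry; the tools themselves are toy examples.
TOOLS: dict[str, Callable[[str], str]] = {
    "add": lambda arg: str(sum(int(x) for x in arg.split())),
    "upper": lambda arg: arg.upper(),
}

# Permission boundary: only these tools may actually be invoked.
ALLOWED_TOOLS = {"add"}


@dataclass
class Harness:
    model: Callable[[list[str]], str]  # stand-in for an LLM call
    max_steps: int = 5
    memory: list[str] = field(default_factory=list)

    def run(self, task: str) -> str:
        self.memory.append(f"task: {task}")
        # Execution loop: ask the model, act, record the observation.
        for _ in range(self.max_steps):
            action = self.model(self.memory)
            name, _, arg = action.partition(" ")
            if name == "done":
                return arg
            if name not in ALLOWED_TOOLS:
                observation = f"error: tool '{name}' not permitted"
            else:
                observation = TOOLS[name](arg)
            # Feedback channel: the result goes back into memory.
            self.memory.append(f"{action} -> {observation}")
        return "stopped: step limit reached"


def scripted_model(memory: list[str]) -> str:
    """Toy policy standing in for a model; it only looks at memory length."""
    if len(memory) == 1:
        return "upper hello"  # will be refused by the permission check
    if len(memory) == 2:
        return "add 2 3"
    return "done " + memory[-1].split(" -> ")[1]


harness = Harness(model=scripted_model)
result = harness.run("sum two numbers")
```

The model function on its own holds no state between calls. Everything that lets the run continue across steps (the memory list, the step limit, the refusal of the disallowed tool) lives in the wrapper, which is the point the authors make about where agent behaviour comes from.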
The paper identifies several reasons code is the right execution format. Code is executable — model outputs become operations that can be checked rather than just inspected as text. Code is traceable — intermediate calculations show up as structured traces the system can read and store. And code persists across steps — the running program logs task progress in a form the agent can pick back up later. Commercial systems already operating on this principle include Anthropic’s Claude Code and OpenAI’s Codex.
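The three properties can be illustrated with a short Python sketch. This is an assumption-laden toy, not a description of how Claude Code or Codex work: the `run_step` and `resume` functions and the `trace.json` file name are hypothetical, and running a snippet in a child Python process is not a real sandbox. The sketch executes a snippet (executable), captures its output and exit status as a structured record (traceable), and appends that record to a file that a later step can reload (persistent).

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path


def run_step(code: str, log_path: Path) -> dict:
    """Execute one snippet in a child process and append a trace record to disk."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=10,
    )
    # Structured trace: what ran, what it printed, whether it succeeded.
    record = {
        "code": code,
        "stdout": proc.stdout.strip(),
        "stderr": proc.stderr.strip(),
        "ok": proc.returncode == 0,
    }
    history = json.loads(log_path.read_text()) if log_path.exists() else []
    history.append(record)
    log_path.write_text(json.dumps(history, indent=2))
    return record


def resume(log_path: Path) -> list[dict]:
    """Reload earlier steps so a later run can pick up where it left off."""
    return json.loads(log_path.read_text()) if log_path.exists() else []


log = Path(tempfile.mkdtemp()) / "trace.json"
good = run_step("print(6 * 7)", log)   # succeeds
bad = run_step("print(1 / 0)", log)    # fails with ZeroDivisionError
history = resume(log)
```

The failing step is the useful one here: the error is recorded as a checkable fact (`ok` is false and `stderr` names the exception) instead of being buried in free text the system would have to interpret.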
Who’s Affected
AI research groups working on agentic-AI design gain a structured framework for analyzing where capability improvements come from. Frontier-lab platform owners (Anthropic, OpenAI, Google DeepMind) face the question of whether their proprietary harness, rather than the underlying model alone, is the structural moat going forward. Open-source agent frameworks (SWE-Agent, AutoGen, LangGraph) gain validation from a research paper. The Cisco-OpenAI partnership announced May 28 (Codex as ‘AI engineering teammate’) and Endava’s ‘agentic organization’ framing are commercial instantiations of the same architecture pattern.
What’s Next
The paper’s authors specifically call for more transparent harness-evaluation mechanisms, because current software tests are often incomplete and can obscure risks. Expect parallel research papers on harness-specific evaluation methodologies. Industry-level standards work, through bodies like the AI Verify Foundation or NIST’s AISI, may produce harness-evaluation frameworks through 2026-2027.