- GrandCode, a multi-agent reinforcement learning system, placed first in three consecutive live Codeforces competitions in March 2026, defeating all human participants including top-ranked grandmasters.
- The system orchestrates five specialized agentic modules — including hypothesis proposal, solver, and test generator — trained jointly through post-training RL and online test-time RL.
- The researchers developed Agentic GRPO, a new training algorithm built to handle delayed rewards and severe off-policy drift in multi-stage agent rollouts.
- The prior best AI result on Codeforces, Google’s Gemini 3 Deep Think at 8th place, was not achieved under live competition conditions.
What Happened
Researchers published a paper on April 3, 2026, describing GrandCode, a multi-agent reinforcement learning system that placed first in three consecutive live Codeforces competitive programming contests: Round 1087 on March 21, Round 1088 on March 28, and Round 1089 on March 29, 2026. The paper, titled “GrandCode: Achieving Grandmaster Level in Competitive Programming via Agentic Reinforcement Learning,” was posted to arXiv and asserts that GrandCode outperformed every human competitor in each event, including top-rated grandmasters. The authors’ names were not included in the source abstract available at publication.
Why It Matters
Competitive programming at the grandmaster level had remained a domain where human performance exceeded that of AI systems, partly because problems are novel, time-constrained, and demand multi-step reasoning rather than pattern retrieval. The previous best AI result on Codeforces came from Google’s Gemini 3 Deep Think, which placed 8th — but that evaluation was not conducted under live competition conditions, meaning the system faced neither unseen problem sets nor real-time pressure alongside human competitors. GrandCode’s three first-place finishes in live rounds, against active human participants, close that methodological gap.
Technical Details
GrandCode coordinates five specialized agentic modules during problem-solving, including hypothesis proposal, a solver, a test generator, and a summarization component. These modules are jointly trained through post-training reinforcement learning and then refined via online test-time RL, allowing the system to adapt its behavior during inference on unseen problems. To train this pipeline, the team developed Agentic GRPO, which the abstract describes as “specifically designed for multi-stage agent rollouts with delayed rewards and the severe off-policy drift that is prevalent in agentic RL.” The paper states that GrandCode is “the first AI system that consistently beats all human participants in live contests of competitive programming,” citing the three March 2026 Codeforces rounds as evidence.
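To give a flavor of the training setup described above, here is a minimal sketch of the two ingredients the abstract names: a group-relative advantage in the style of standard GRPO, and a delayed terminal reward spread across a multi-stage rollout. This is an illustration only, not the paper's Agentic GRPO; the function names, the group-normalization formula, and the uniform credit-assignment scheme are assumptions for exposition.

```python
# Illustrative sketch, assuming a standard GRPO-style group-relative
# advantage and the simplest possible credit assignment for a delayed
# reward. NOT the paper's Agentic GRPO algorithm.
from statistics import mean, pstdev

def group_relative_advantages(episode_returns, eps=1e-8):
    """Normalize each rollout's terminal return against its group:
    A_i = (R_i - mean(R)) / (std(R) + eps)."""
    mu = mean(episode_returns)
    sigma = pstdev(episode_returns)
    return [(r - mu) / (sigma + eps) for r in episode_returns]

def broadcast_delayed_reward(advantage, num_stages):
    """With a single reward arriving only at the end of a multi-stage
    rollout (e.g. hypothesis -> solve -> test -> summarize), the
    simplest credit assignment gives every stage the same
    group-relative advantage."""
    return [advantage] * num_stages

# Example: four rollouts of one problem, rewarded only at verdict time
# (1.0 = accepted, 0.0 = rejected).
returns = [1.0, 0.0, 0.0, 1.0]
advs = group_relative_advantages(returns)
per_stage = [broadcast_delayed_reward(a, num_stages=4) for a in advs]
```

In this toy form, accepted rollouts receive a positive advantage and rejected ones a negative advantage, with no learned value function; handling off-policy drift across agent stages, which the abstract highlights, would require machinery beyond this sketch.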
Who’s Affected
The Codeforces community is the most directly affected: human grandmasters — the platform’s highest-rated tier — were outranked by an AI system in live events for the first time across multiple consecutive contests. AI research teams benchmarking code generation and reasoning systems will need to update their baselines, as competitive programming can no longer function as a clear separator between human and AI capability. Developers of multi-agent coding assistants and automated software engineering tools will also find GrandCode’s Agentic GRPO methodology — particularly its handling of delayed reward signals — relevant to their own training pipelines.
What’s Next
The paper does not announce a public release of GrandCode or deployment in any commercial product. Independent verification of the three Codeforces placements — Rounds 1087, 1088, and 1089 — is straightforward, as Codeforces publishes full leaderboards for each contest. Researchers will likely examine whether GrandCode’s performance generalizes across a broader range of contests and difficulty distributions, and whether the Agentic GRPO framework transfers to other domains with similar multi-stage, delayed-reward structures.