RMA Solves 8/10 Research-Level Math Problems

Research Math Agents (RMA) — a new agentic framework for research-level mathematics — solved 8 of 10 problems on the First Proof benchmark.
RMA outperforms GPT-5.2R and Aletheia, two strong baseline systems.
The framework decomposes proof-solving into specialised modules: problem analysis, literature search, fair comparison, knowledge-bank construction, and proof verification.
Initializer, proposer, and verifier agents coordinate through a shared structured memory in a multi-role, multi-round workflow.

What Happened

Researchers introduced Research Math Agents (RMA), an agentic framework for automated reasoning on research-level mathematical problems, in a paper submitted to arXiv on May 20, 2026. RMA solved 8 of 10 problems on the First Proof benchmark, outperforming GPT-5.2R and Aletheia.

Why It Matters

Where prior agentic-math work has centred on competition mathematics or formal theorem proving, RMA targets research-level problems that require long-horizon reasoning, literature grounding, and iterative proof refinement. This is a substantially harder regime: such problems have no known textbook solutions and often require connecting disparate ideas across the literature.

The result arrived in the same week as Google DeepMind’s AlphaProof Nexus result, which solved 9 open Erdős problems. Two research teams reporting independent advances in research-level mathematical reasoning within days of each other suggests the field has crossed a meaningful capability threshold for autonomously generating original mathematics.

Technical Details

RMA decomposes research-level proof solving into specialised modules: problem analysis, literature search and understanding, fair comparison, knowledge-bank construction, and proof verification. The decomposition is coordinated by initializer, proposer, and verifier agents that work through a shared structured memory. Within this unified framework, these agents operate in a multi-role, multi-round workflow — collaboratively generating, refining, and verifying candidate proofs through iterative feedback.
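The coordination pattern described above can be illustrated with a minimal Python sketch. This is not the authors' implementation, which has not been released. Every name here (`SharedMemory`, `run_workflow`, the toy agents) is hypothetical, and the agents are plain functions standing in for what would presumably be model calls in the real system.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class SharedMemory:
    """Hypothetical structured memory that every agent reads and writes."""
    problem: str
    notes: List[str] = field(default_factory=list)       # analysis and literature notes
    candidates: List[str] = field(default_factory=list)  # one proof attempt per round
    feedback: List[str] = field(default_factory=list)    # verifier objections so far


def run_workflow(
    problem: str,
    initializer: Callable[[SharedMemory], None],
    proposer: Callable[[SharedMemory], str],
    verifier: Callable[[SharedMemory, str], Optional[str]],
    max_rounds: int = 3,
) -> Optional[str]:
    """Run propose/verify rounds; return the accepted proof or None."""
    memory = SharedMemory(problem=problem)
    initializer(memory)  # seed the memory before any proposal is made
    for _ in range(max_rounds):
        candidate = proposer(memory)
        memory.candidates.append(candidate)
        objection = verifier(memory, candidate)
        if objection is None:  # the verifier found nothing to object to
            return candidate
        memory.feedback.append(objection)  # the next proposal can see this
    return None


# Toy agents (assumptions for illustration only).
def toy_initializer(memory: SharedMemory) -> None:
    memory.notes.append(f"analysis of: {memory.problem}")


def toy_proposer(memory: SharedMemory) -> str:
    # Each draft is numbered by how many objections have been raised so far.
    return f"draft v{len(memory.feedback) + 1}"


def toy_verifier(memory: SharedMemory, candidate: str) -> Optional[str]:
    # Rejects the first draft, accepts once one objection is on record.
    return None if memory.feedback else "gap in step 2"


result = run_workflow("toy problem", toy_initializer, toy_proposer, toy_verifier)
print(result)
```

In this toy run the verifier rejects the first draft and accepts the second, so the loop ends after two rounds. The point of the sketch is only that verifier feedback is written to the shared memory, where the proposer reads it in the next round.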

The First Proof benchmark consists of 10 research-level problems contributed by expert mathematicians across diverse domains. In expert evaluation, RMA outperformed strong baselines including GPT-5.2R and Aletheia, solving 8 of the 10 problems and producing proofs judged more logically sound and more readable. Ablation studies indicate that the performance gains arise from the interaction of structured reasoning modules, iterative refinement, and verifier-based feedback, rather than from any single component. The authors say solutions and implementations will be made publicly available upon acceptance.

Who’s Affected

Research mathematicians gain another agentic system worth comparing against AlphaProof Nexus and existing baselines. AI research groups working on agentic frameworks gain a concrete decomposition pattern: initializer, proposer, and verifier agents coordinating through shared structured memory. The teams behind the baseline systems (GPT-5.2R, Aletheia) face the question of how to extend their architectures in response. The broader AI capability-evaluation community gains another data point on how agentic systems perform relative to monolithic models.

What’s Next

The paper has been submitted for peer review; solutions and implementations will be made publicly available upon acceptance. Expect follow-up benchmarks that test the framework on harder problem sets, and adaptations of the multi-agent decomposition pattern to research-level domains beyond mathematics. The same-week Google DeepMind AlphaProof Nexus result will likely prompt side-by-side comparisons in subsequent literature.
