- Research Math Agents (RMA) — a new agentic framework for research-level mathematics — solved 8 of 10 problems on the First Proof benchmark.
- RMA outperforms GPT-5.2R and Aletheia, two strong baseline systems.
- The framework decomposes proof-solving into specialised modules: problem analysis, literature search, fair comparison, knowledge-bank construction, and proof verification.
- Initializer, proposer, and verifier agents coordinate through a shared structured memory in a multi-role, multi-round workflow.
What Happened
Researchers introduced Research Math Agents (RMA), an agentic framework for automated reasoning on research-level mathematical problems, in a paper submitted to arXiv on May 20, 2026. RMA solved 8 of 10 problems on the First Proof benchmark, outperforming GPT-5.2R and Aletheia.
Why It Matters
Where prior agentic-math work has centred on competition mathematics or formal theorem proving, RMA targets research-level mathematical problems that require long-horizon reasoning, literature grounding, and iterative proof refinement. This is a substantially harder regime — research-level math problems do not have known textbook solutions and often require connecting disparate ideas across the literature.
The result also arrives in the same week as Google DeepMind’s AlphaProof Nexus result, which solved 9 open Erdős problems. Two research teams reporting independent advances on research-level mathematical reasoning in the same week suggests the field may have crossed a meaningful capability threshold for autonomously generating original mathematics.
Technical Details
RMA decomposes research-level proof solving into specialised modules: problem analysis, literature search and understanding, fair comparison, knowledge-bank construction, and proof verification. These modules are coordinated by initializer, proposer, and verifier agents that communicate through a shared structured memory. The agents operate in a multi-role, multi-round workflow, collaboratively generating, refining, and verifying candidate proofs through iterative feedback.
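To make the coordination pattern concrete, here is a minimal Python sketch of an initializer/proposer/verifier loop over a shared memory. It is an illustration under stated assumptions, not the paper's implementation, which has not been released: the `SharedMemory` fields, the `run_workflow` function, the `max_rounds` parameter, and the three `toy_*` agents are all hypothetical names invented for this example, and the toy agents are deterministic stubs standing in for model calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class SharedMemory:
    """Structured memory that every agent reads and writes (hypothetical layout)."""
    problem: str
    notes: List[str] = field(default_factory=list)       # analysis and literature notes
    candidates: List[str] = field(default_factory=list)  # proof drafts, one per round
    feedback: List[str] = field(default_factory=list)    # verifier critiques
    accepted: Optional[str] = None                       # set once a draft passes


Agent = Callable[[SharedMemory], None]


def run_workflow(memory: SharedMemory, initializer: Agent, proposer: Agent,
                 verifier: Agent, max_rounds: int = 3) -> SharedMemory:
    """Run the initializer once, then alternate proposer and verifier rounds."""
    initializer(memory)
    for _ in range(max_rounds):
        proposer(memory)   # writes a new candidate, informed by earlier feedback
        verifier(memory)   # either accepts the candidate or records a critique
        if memory.accepted is not None:
            break
    return memory


# Toy agents: stand-ins for model calls, assumed purely for illustration.
def toy_initializer(memory: SharedMemory) -> None:
    memory.notes.append(f"analysis of: {memory.problem}")


def toy_proposer(memory: SharedMemory) -> None:
    # The draft version counts how many critiques have been seen so far.
    memory.candidates.append(f"draft v{len(memory.feedback)}")


def toy_verifier(memory: SharedMemory) -> None:
    draft = memory.candidates[-1]
    if draft == "draft v2":  # arbitrary acceptance rule for the toy example
        memory.accepted = draft
    else:
        memory.feedback.append(f"gap found in {draft}")


result = run_workflow(SharedMemory(problem="toy problem"),
                      toy_initializer, toy_proposer, toy_verifier, max_rounds=5)
print(result.accepted, len(result.candidates), len(result.feedback))
```

In this sketch the agents never call each other directly. Each one only reads from and appends to the shared memory, which is what lets the verifier's critiques from one round shape the proposer's draft in the next.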
The First Proof benchmark consists of 10 research-level problems contributed by expert mathematicians across diverse domains. In expert evaluation, RMA outperforms strong baselines including GPT-5.2R and Aletheia, solving 8 of the 10 research problems and producing more logically sound and readable proofs. Ablation studies show that the performance gains arise from the interaction of structured reasoning modules, iterative refinement, and verifier-based feedback, rather than from any single component. Solutions and implementations will be made publicly available upon acceptance.
Who’s Affected
Research mathematicians gain another agentic system worth comparing against AlphaProof Nexus and existing baselines. AI research groups working on agentic frameworks gain a concrete decomposition pattern: initializer, proposer, and verifier agents with a shared structured memory. The developers of the baseline systems (GPT-5.2R, Aletheia) face the question of how to extend their architectures in response. The broader AI capability-evaluation community gains another data point on how agentic systems compare with monolithic models.
What’s Next
The paper has been submitted for peer review; solutions and implementations will be made publicly available upon acceptance. Expect follow-up benchmarks that test the framework on harder problem sets, and adaptations of the multi-agent decomposition pattern to research-level domains beyond mathematics. The same-week Google DeepMind AlphaProof Nexus result will likely prompt side-by-side comparisons in subsequent literature.