RESEARCH

GPT-5.5 Proved a New Ramsey Number Theorem, Verified in Lean

James Whitfield · Apr 25, 2026 · 6 min read
Engine Score 8/10 — Important

This story details a significant breakthrough in AI's ability to contribute to formally verified mathematical proofs, establishing a new benchmark for LLM capabilities in scientific discovery. Its high novelty and the formal verification of the result make it highly impactful for AI research and development.

GPT-5.5 (OpenAI, April 25, 2026) has contributed to a formally verified mathematical proof about Ramsey numbers — the first publicly disclosed instance of a frontier large language model producing a result subsequently verified in the Lean 4 theorem prover. The result establishes a new lower bound in a sub-problem of classical Ramsey theory that no published human work had previously reached.

This is not a demonstration. The result is machine-checked by Lean 4, making it formally exact and reproducible by anyone with access to the proof file.

What Ramsey Numbers Are, and Why They Resist Proof

A Ramsey number R(s,t) is the smallest integer n such that any 2-coloring of the edges of a complete graph on n vertices must contain either a clique of size s whose edges all carry the first color or a clique of size t whose edges all carry the second color. The canonical example: R(3,3) = 6, meaning any group of six people contains either three mutual acquaintances or three mutual strangers.
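
The claim R(3,3) = 6 is small enough to check by exhaustion. The sketch below is a minimal brute force in Python written for this article, not drawn from the proof in question; it confirms that K_5 admits a triangle-free 2-coloring while K_6 does not:

```python
from itertools import combinations, product

def has_mono_triangle(n, coloring):
    """coloring maps each edge (i, j), i < j, of K_n to color 0 or 1."""
    return any(
        coloring[(a, b)] == coloring[(a, c)] == coloring[(b, c)]
        for a, b, c in combinations(range(n), 3)
    )

def forces_mono_triangle(n):
    """True iff every 2-coloring of K_n contains a monochromatic triangle."""
    edges = list(combinations(range(n), 2))
    return all(
        has_mono_triangle(n, dict(zip(edges, colors)))
        for colors in product((0, 1), repeat=len(edges))
    )

print(forces_mono_triangle(5))  # False: some coloring of K_5 avoids a mono triangle
print(forces_mono_triangle(6))  # True: together these give R(3,3) = 6
```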

The search space scales catastrophically. R(5,5) is only known to fall between 43 and 48, bounds that have barely moved in decades. Exhaustive verification is computationally intractable for any moderately sized graph. Progress requires structural insight into how combinatorial configurations can be constrained without enumeration.
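
A back-of-envelope count makes the intractability concrete: each of the C(n, 2) edges of K_n takes one of two colors, so settling a 43-vertex question by enumeration would mean inspecting 2^903 colorings. A two-line illustration:

```python
from math import comb

# Every edge of K_n takes one of two colors: 2 ** C(n, 2) colorings in total.
for n in (6, 18, 43):
    print(f"K_{n}: 2^{comb(n, 2)} edge colorings")
```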

Paul Erdős described the difficulty directly: if an alien civilization threatened to destroy Earth unless humans computed R(5,5), the right strategy would be to “marshal all our computers and all our mathematicians” and attempt it. For R(6,6), he said, humanity should “try to destroy the aliens first.” The scale of the problem is not rhetorical.

What GPT-5.5 Actually Contributed

OpenAI’s April 25, 2026 disclosure describes an internal variant of GPT-5.5 operating with a custom harness that contributed to a new lower bound result in a sub-problem of Ramsey theory. The model identified a structural approach that had not appeared in the published combinatorics literature, and from which a formal proof could be derived.

That proof was subsequently encoded in Lean 4 and accepted by the Lean 4 kernel. Lean 4 acceptance is not peer review — it is a mechanically enforced audit of every individual inference step. Human reviewers miss errors. Lean does not accept them.

OpenAI’s framing is precise: GPT-5.5 contributed to the proof, not authored it autonomously. Human researchers set the objective, validated intermediate milestones, and made the decision to pursue formal verification. The model’s role was directed search at a scale and speed no human team can match — not self-directed mathematical discovery.

The “Custom Harness” — Human-AI Collaboration, Not Autonomous Discovery

A custom harness in this context is a structured scaffolding layer that controls how the model generates, tests, and iterates on mathematical reasoning. The model proposes proof elements; the harness runs formal checks — type errors, logical inconsistencies, Lean 4 compilation failures — and feeds the results back to the model for revision. The loop runs faster than any human can track manually.
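
OpenAI has not published the harness itself, so the sketch below is only an illustrative outline of the propose-check-revise loop described above: model.propose is a hypothetical interface standing in for the undisclosed model API, and the only concrete tool invoked is the Lean 4 command-line checker.

```python
import subprocess
import tempfile
from pathlib import Path

def check_with_lean(lean_source: str) -> tuple[bool, str]:
    """Compile a candidate proof with the Lean 4 CLI and capture its diagnostics."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "Candidate.lean"
        path.write_text(lean_source)
        result = subprocess.run(["lean", str(path)], capture_output=True, text=True)
        return result.returncode == 0, result.stdout + result.stderr

def harness_loop(model, goal: str, max_rounds: int = 50) -> str | None:
    """Propose-check-revise: the model drafts Lean source, the checker's
    error output is fed back verbatim until a candidate compiles."""
    feedback = ""
    for _ in range(max_rounds):
        candidate = model.propose(goal, feedback)  # hypothetical model interface
        ok, diagnostics = check_with_lean(candidate)
        if ok:
            return candidate                       # kernel-accepted proof
        feedback = diagnostics                     # lossless error signal
    return None
```

The essential property is that the feedback returned to the model is the checker's own diagnostics, not a human summary of them.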

This creates a feedback signal that does not exist in software engineering at comparable precision. When code fails a test, the failure may be ambiguous or underspecified. When a Lean 4 step fails compilation, the error message identifies the exact violated inference rule. The model revises with surgical specificity, not trial and error.

The distinction between “harness-assisted” and “autonomous” matters for scientific credit and for understanding the current state of the technology. The emerging architecture of autonomous AI exploration — where models select and pursue their own research directions — remains a separate, earlier-stage research frontier. What GPT-5.5 demonstrated here was collaboration at high bandwidth: human objectives, AI-executed combinatorial search, machine-verified results.

Lean 4 Verification: Why the Standard Matters

The Lean 4 theorem prover, created by Leonardo de Moura at Microsoft Research and now developed under the Lean Focused Research Organization, checks mathematical proofs by compiling them into a formal logic whose every inference step the kernel verifies. It is the backbone of Mathlib, a community library now containing over 100,000 formalized theorems and among the most comprehensive formal mathematics repositories in existence.
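
At the smallest possible scale, kernel acceptance looks like this: a complete Lean 4 file the kernel checks in milliseconds. It is unrelated to the Ramsey proof and uses only lemmas shipped with Lean itself; change the 4 to a 5 and compilation fails with a precise error.

```lean
-- Both sides reduce to the same value, so reflexivity (`rfl`) closes the goal.
theorem two_plus_two : 2 + 2 = 4 := rfl

-- Inferences can also be named explicitly; Nat.succ_le_succ and Nat.zero_le
-- are lemmas from Lean 4's built-in Nat library.
example : 1 ≤ 2 := Nat.succ_le_succ (Nat.zero_le 1)
```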

Historically, encoding cutting-edge results in formal languages was not worth the overhead. The labor cost of Lean-encoding a proof exceeded the marginal certainty benefit when human reviewers could check it. AI inverts this calculus: if a model generates proof sketches at scale, the bottleneck shifts from human encoding to formal verification — and Lean 4 is purpose-built for that throughput. GPT-5.5’s Ramsey result is the first publicly confirmed case of this pipeline producing a novel, publishable-quality output.

FrontierMath Tier 4: The Benchmark That Explains the Proof

OpenAI reports GPT-5.5 scores 35.4% on FrontierMath Tier 4 — the hardest subset of the FrontierMath benchmark, designed by Epoch AI to contain unpublished problems from working research mathematicians, specifically to resist memorization.

Claude Opus 4.7 (Anthropic) scores 22.9% on the same tier, and the previous state of the art sat below 25%. A 12.5 percentage point gap on problems calibrated to require hours of expert mathematician time is not a marginal improvement; it is a qualitative shift in the model’s capacity to reason about novel mathematical structure.

Model                         FrontierMath Tier 4   Release Period
GPT-5.5 (OpenAI)              35.4%                 April 2026
Claude Opus 4.7 (Anthropic)   22.9%                 2026
Prior SOTA                    <25%                  Pre-April 2026

FrontierMath Tier 4 problems sit at the boundary of publishable research mathematics. A 35.4% solve rate means GPT-5.5 is clearing roughly one-third of problems that working mathematicians would call genuinely hard. Ramsey theory sub-problems sit squarely in that tier. The benchmark result and the proof event are not coincidental — they describe the same capability.

Why AI-Assisted Math Is Accelerating Faster Than AI-Assisted Coding

Software engineering has 40 years of accumulated tooling that already provides AI dense feedback: type systems, linters, test suites, static analysis. Mathematics, counterintuitively, is producing faster AI capability gains in 2026 because its feedback signal is more precise, not less.

When a model writes code, “correct” is defined by passing a test suite that may be incomplete or underspecified. When a model proposes a mathematical proof step, Lean either accepts it or returns a machine-generated error identifying the exact violated inference rule. The signal is lossless. There is no ambiguity about what went wrong.

This is the mechanism driving investment in formal mathematics at every frontier lab. As AI capabilities expand into knowledge work, mathematics may be the first domain where AI participation shifts from productivity tool to credited research contributor — not because it is easier, but because it is more verifiable. MegaOne AI tracks 139+ AI tools across 17 categories, and the pattern through Q1 2026 is unambiguous: the largest capability jumps are occurring in domains with binary correctness signals. Formal mathematics has the most binary signal of any knowledge domain.

The Hardware Making This Tractable

GPT-5.5 was developed and deployed on NVIDIA GB200 NVL72 and GB300 NVL72 systems. The NVL72 configuration links 72 Blackwell GPUs into a single unified memory domain with 13.5 TB of GPU memory addressable as flat space, eliminating the PCIe bottleneck that constrained earlier multi-GPU architectures. For Ramsey theory work specifically, where graph search and combinatorial enumeration require holding large adjacency tensors in fast memory, this architectural change is not incidental to the result.
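
As a rough illustration of why memory capacity matters for this kind of search (a back-of-envelope estimate written for this article, not a description of OpenAI's actual workload): holding one billion candidate 2-colorings of a 48-vertex complete graph at one bit per edge already consumes on the order of 140 GB.

```python
def coloring_batch_bytes(n: int, batch: int, bits_per_edge: int = 1) -> float:
    """Memory needed to hold `batch` bit-packed edge 2-colorings of K_n."""
    edges = n * (n - 1) // 2          # C(n, 2) edges in the complete graph
    return batch * edges * bits_per_edge / 8

# One billion candidate colorings of K_48 at one bit per edge:
print(f"{coloring_batch_bytes(48, 10**9) / 1e9:.0f} GB")  # ~141 GB
```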

OpenAI also reports GPT-5.5 was applied internally to optimize its own infrastructure. Load-balancing heuristics generated by the model increased token generation speed by more than 20%. The same model contributing to a Ramsey proof was simultaneously reducing the cost of running that model — a compounding return that illustrates why frontier AI investment is structurally self-reinforcing.

Infrastructure scale continues to determine who competes at the frontier. GB300-NVL72 clusters represent a capital commitment that concentrates this class of work at a shrinking number of organizations.

What This Result Does Not Prove

GPT-5.5 did not solve R(5,5). It did not autonomously identify an open problem, formulate a proof strategy, and submit a result without human direction. The Ramsey result came from a human-directed search on a specific sub-problem with defined objectives — not from general mathematical curiosity or open-ended exploration.

What it established is narrower and more consequential: given a well-framed mathematical objective and formal verification infrastructure, a frontier model with structured scaffolding can produce a proof element that passes the most rigorous correctness check in mathematics. That combination had not been demonstrated at this level before April 25, 2026.

The open question is whether the custom harness methodology generalizes to open-ended problem selection — models choosing their own research directions and pursuing them to formal verification without human framing. OpenAI’s broader research infrastructure is pointed directly at that question. Anthropic’s competing formal reasoning program, evidenced by Opus 4.7’s 22.9% Tier 4 score, is approaching the same capability from a different architectural direction.

The FrontierMath Tier 4 leaderboard is the clearest leading indicator available. The first model to cross 50% will likely appear as a named contributor on a mathematics paper within six months of that result.
