- Claude Opus 4.7 scores 64.3% on SWE-bench Pro, up from 53.4% for Opus 4.6 and ahead of OpenAI’s GPT-5.4 at 57.7%.
- Anthropic deliberately reduced certain cybersecurity capabilities during training as part of its Project Glasswing safeguard strategy, with automatic blocking of high-risk requests now built into the model.
- Image processing resolution more than triples to approximately 3.75 megapixels; Document Reasoning benchmark scores jump from 57.1% to 80.6%.
- Per-token pricing is unchanged at $5/$25 per million, but a new tokenizer can map the same text to up to 35% more tokens, raising effective per-request costs.
What Happened
Anthropic has released Claude Opus 4.7, a direct upgrade to Opus 4.6, with autonomous coding performance as its primary selling point. The model scores 64.3% on the SWE-bench Pro benchmark, up from 53.4% for its predecessor and ahead of OpenAI’s GPT-5.4 at 57.7% on the same evaluation. Anthropic’s higher-tier Claude Mythos Preview retains a substantial lead at 77.8%.
Why It Matters
The release marks the first deployment of cybersecurity safeguards developed under Anthropic’s Project Glasswing, a policy initiative the company announced to address the dual-use risks of frontier models in security contexts. Anthropic had previously stated it would test new cyber restrictions on less capable models before applying them to Mythos Preview—Opus 4.7 is explicitly positioned as that first test case. That framing makes the model’s handling of cyber capabilities at least as consequential as its benchmark numbers.
Technical Details
Opus 4.7 processes images at up to 2,576 pixels on the long edge, approximately 3.75 megapixels and more than three times the resolution ceiling of earlier Claude models; the change is implemented at the model level rather than as an API parameter. On the Document Reasoning benchmark OfficeQA Pro, accuracy rose from 57.1% with Opus 4.6 to 80.6%. Anthropic’s system card notes that Opus 4.7 still declines to assist with 33% of simulated AI safety research tasks, down from the 88% refusal rate recorded for Opus 4.6. The company acknowledges, however, that the test suite was designed around Opus 4.6’s known weaknesses, which limits direct comparisons.
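As a rough sanity check on those numbers, the megapixel figure follows from the long-edge cap and an image's aspect ratio. The 2,576 px cap is from the article; the aspect ratios below are assumptions used only to illustrate the arithmetic (a 16:9 frame lands closest to the stated ~3.75 MP):

```python
# Back-of-the-envelope check of the stated resolution ceiling.
# LONG_EDGE comes from the article; the aspect ratios are illustrative assumptions.
LONG_EDGE = 2576

def megapixels(long_edge: int, aspect_w: int, aspect_h: int) -> float:
    """Total pixels (in MP) for an image capped at `long_edge` on its long side."""
    short_edge = long_edge * aspect_h // aspect_w
    return long_edge * short_edge / 1e6

for w, h in [(3, 2), (4, 3), (16, 9)]:
    print(f"{w}:{h} -> {megapixels(LONG_EDGE, w, h):.2f} MP")
```

Squarer images at the same long-edge cap contain more total pixels, so the ~3.75 MP figure is best read as the cap for wide-format inputs rather than a hard total-pixel budget.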
Anthropic introduced a new tokenizer that maps the same input text to up to 1.35 times as many tokens as prior models. Per-token pricing remains at $5 per million input tokens and $25 per million output tokens, but the tokenizer change can significantly increase cost per request. On factual hallucinations, the system card states the gap between Opus 4.7 and Mythos Preview stems “mainly from Mythos Preview’s higher hit rate on obscure facts, not from a higher error rate in Opus 4.7”, meaning accuracy on common knowledge is broadly comparable between the two tiers.
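The pricing implication is easy to work out: since the per-token rates are unchanged, a request whose text now tokenizes to 1.35x as many tokens costs 1.35x as much. A minimal sketch, using the article's prices and worst-case multiplier with made-up baseline token counts:

```python
# Rough per-request cost estimate under the new tokenizer.
# Prices are from the article ($5 / $25 per million input / output tokens);
# the 1.35x multiplier is the article's stated worst case.
# The 20k-input / 2k-output baseline is an invented illustrative workload.
INPUT_PRICE = 5.00 / 1_000_000    # USD per input token
OUTPUT_PRICE = 25.00 / 1_000_000  # USD per output token
TOKENIZER_MULTIPLIER = 1.35       # up to 35% more tokens for the same text

def request_cost(input_tokens: int, output_tokens: int, multiplier: float = 1.0) -> float:
    """USD cost of one request; `multiplier` scales the token counts uniformly."""
    return (input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE) * multiplier

old = request_cost(20_000, 2_000)                        # prior tokenizer
new = request_cost(20_000, 2_000, TOKENIZER_MULTIPLIER)  # same text, new tokenizer
print(f"old: ${old:.4f}  new: ${new:.4f}  (+{new / old - 1:.0%})")
```

Actual inflation will vary by content type; 35% is the ceiling, so budgeting against real token counts from both tokenizers is the safer approach.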
Who’s Affected
Developers using Opus 4.6 for coding agents or document-analysis workflows will need to audit existing prompts: Anthropic notes the model now interprets instructions more literally than its predecessor, which previously skipped or loosely applied portions of prompts. Security researchers requiring penetration testing or red-teaming capabilities must enroll in Anthropic’s new Cyber Verification Program to access those functions. The model is available via the Claude API and through Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry.
What’s Next
Anthropic has published a migration guide for users moving from Opus 4.6. Claude Code gains a new /ultrareview command for dedicated code review workflows, along with an expanded Auto Mode for Max-tier subscribers. The company has indicated it will extend the cybersecurity safeguards piloted in Opus 4.7 to more capable models, including Mythos Preview, contingent on evaluating results from the Cyber Verification Program.