- The Center for AI Standards and Innovation (CAISI) within NIST published a report finding DeepSeek V4 Pro is roughly eight months behind leading US AI models across cybersecurity, software development, math, natural sciences, and abstract reasoning.
- CAISI says DeepSeek V4 performs closer to GPT-5 than to Claude Opus 4.6 or GPT-5.4 — contradicting DeepSeek’s own technical-report framing of parity with current US models.
- Math is the one area where CAISI says DeepSeek V4 nearly matches the top US models.
- On price, DeepSeek V4 came in cheaper than the comparable GPT-5.4 mini in five of seven CAISI tests.
What Happened
The Center for AI Standards and Innovation (CAISI) — a unit within the National Institute of Standards and Technology (NIST) — published a report finding that DeepSeek V4 Pro, the newest Chinese open-weight model, performs roughly eight months behind leading US frontier models. CAISI tested across cybersecurity, software development, math, natural sciences, and abstract reasoning. The report calls DeepSeek V4 the most capable Chinese AI model to date but says private testing shows weaker performance than DeepSeek’s own technical report claims.
Why It Matters
This is the first formal US government benchmark report directly comparing a Chinese frontier model against US counterparts on operational tasks. The framing is politically consequential: CAISI’s verdict diverges both from independent measurements (which The Decoder notes show the gap holding roughly constant rather than widening) and from DeepSeek’s own published parity claims. The report also sits within a broader administration push on AI export-control policy and federal procurement standards. CAISI’s institutional placement within NIST gives the report formal weight in US regulatory and policy contexts even if its methodology is debated.
Technical Details
CAISI’s headline finding: DeepSeek V4 Pro is closer to the older GPT-5 than to the current Claude Opus 4.6 or GPT-5.4 across most evaluation categories. It is specifically weaker in abstract reasoning, cybersecurity, and software development; its one strong area is math, where it nearly matches the top US models. DeepSeek’s own technical report had positioned V4 as roughly on par with Opus 4.6 and GPT-5.4. The size of the discrepancy matters for evaluation methodology: vendor-published numbers and independent testing rarely match exactly, but a “closer to GPT-5” placement, implying an eight-month gap, is a sharper downgrade than typical methodology differences explain.
On pricing, DeepSeek V4 came in cheaper than the comparable GPT-5.4 mini in five of seven CAISI tests. The price-per-task gap matters because it shifts the relevant comparison from “which model wins absolutely” to “which model wins per dollar at production scale.” CAISI’s report notes that as AI models run longer and handle more complex tasks, price increasingly factors into deployment decisions. Top-tier US models keep getting more expensive even as DeepSeek and other Chinese open-weight options keep prices low.
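The per-dollar framing can be made concrete with a toy calculation. The sketch below is purely illustrative: the prices, token counts, and success rates are placeholder assumptions, not figures from CAISI’s report.

```python
# Toy cost-per-solved-task comparison. All numbers below are hypothetical
# placeholders, NOT CAISI's or any vendor's published figures.

def cost_per_solved_task(price_per_m_tokens: float,
                         tokens_per_task: int,
                         success_rate: float) -> float:
    """Expected dollars spent per successfully completed task."""
    cost_per_attempt = price_per_m_tokens * tokens_per_task / 1_000_000
    return cost_per_attempt / success_rate

# Hypothetical: a pricier, stronger frontier model vs. a cheaper, weaker
# open-weight model, both spending ~50k tokens per task.
frontier = cost_per_solved_task(price_per_m_tokens=10.0,
                                tokens_per_task=50_000,
                                success_rate=0.80)
open_weight = cost_per_solved_task(price_per_m_tokens=1.5,
                                   tokens_per_task=50_000,
                                   success_rate=0.65)

print(f"frontier:    ${frontier:.3f} per solved task")
print(f"open-weight: ${open_weight:.3f} per solved task")
```

With these placeholder numbers the weaker model still wins per dollar despite a lower success rate, which is the production-scale trade-off the report highlights.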
OpenAI CEO Sam Altman captured the tension publicly. In a recent X post, he wrote: “I keep thinking I want the models to be cheaper/faster more than I want them to be smarter, but it seems that just being smarter is still the most important thing.” Altman’s view rests on the bet that smarter models accelerate AI R&D itself; OpenAI, Anthropic, and Chinese developers have all said their models now accelerate their own research work.
Who’s Affected
- US policymakers and procurement teams gain formal cover to argue that Chinese AI models should be excluded from federal use on capability grounds rather than purely on supply-chain or national-security grounds.
- DeepSeek and other Chinese open-weight providers now face a US government-issued report contradicting their public benchmarks.
- Cursor, which according to CAISI’s report built its custom fine-tuned coding model on top of a Chinese open-weight model, illustrates the commercial dynamic: even if the underlying model is “8 months behind,” the price and customization advantage can justify the trade-off for production use cases.
- Independent AI evaluation organizations face implicit pressure to either corroborate or contradict CAISI’s methodology.
What’s Next
Independent reproductions of CAISI’s specific test results, particularly the abstract-reasoning, cybersecurity, and software-development gaps, will be the cleanest external test. The report also lands in a charged political context (The Decoder suggests the agency “likely has its own political agenda”), so methodology transparency will determine whether it is treated as authoritative or politicized. Expect DeepSeek to publish a methodology rebuttal, and watch whether CAISI extends the framework to evaluate Xiaomi MiMo-V2.5-Pro, Kimi K2.6, GLM 5.1, and other Chinese open-weight models in a follow-up.