REGULATION

CAISI Adds Google DeepMind, Microsoft, xAI to US Pre-Release AI Testing Program

Priya Sharma · May 6, 2026 · 3 min read
Engine Score 8/10 — Important

US government gains pre-release access to AI models from five major labs for national-security testing

  • The Center for AI Standards and Innovation (CAISI), part of the US Department of Commerce, signed new agreements on May 5, 2026, with Google DeepMind, Microsoft, and xAI for pre-release AI model testing.
  • The new deals expand earlier agreements with Anthropic and OpenAI, bringing the total covered labs to five.
  • CAISI Director Chris Fall states the agency has already run more than 40 evaluations, “some on unreleased models,” with labs providing versions with reduced safety guardrails for testing.
  • The expansion comes as AI models rapidly improve at finding and exploiting security vulnerabilities — UK AISI confirmed last week that GPT-5.5 matches Claude Mythos in cyber-attack capability — and as the US-China AI race intensifies.

What Happened

The Center for AI Standards and Innovation (CAISI) within NIST signed new agreements on May 5, 2026, with Google DeepMind, Microsoft, and xAI to test advanced AI models for national-security risks before they become publicly available. The new deals expand earlier CAISI agreements with Anthropic and OpenAI, bringing the program to five major labs. The agreements include classified-environment testing.

Why It Matters

Pre-release government access to frontier AI models is the single most consequential piece of AI-policy infrastructure in the US in 2026. CAISI’s framework operates in parallel to the UK AI Security Institute (AISI) and to private red-teaming arrangements between labs. The expansion to five labs — covering essentially all major US frontier-AI developers — signals the framework has stabilized as the de facto pre-release safety review for US AI. Combined with the White House’s reported consideration of formal pre-release review (covered earlier this week) and the Pentagon’s seven-vendor classified-network deals, federal AI infrastructure is consolidating around government access to model capabilities before public release.

Technical Details

CAISI Director Chris Fall stated: “Independent, rigorous measurement science is essential to understanding frontier AI and its national security implications.” The agency has already run more than 40 evaluations, some on unreleased models. Critical detail: AI labs provide versions with reduced safety guardrails for testing — meaning CAISI’s evaluations measure raw capability rather than the deployed safety-mitigated capability. This is the same methodology UK AISI uses for its evaluations of Claude Mythos and GPT-5.5.

The original agreements with Anthropic and OpenAI covered joint safety assessments and research into risk mitigation. The expansion adds classified-environment testing to the new three-lab cohort (Google DeepMind, Microsoft, xAI). Evaluation categories that CAISI has already run include cybersecurity capability, biosecurity risk, autonomous-systems behavior, and reasoning-chain analysis. The agency’s previously published DeepSeek V4 evaluation — which positioned the Chinese open-weight model “roughly eight months behind” leading US models — illustrates the depth of CAISI’s methodology.

The expansion comes as AI models rapidly improve at finding and exploiting security vulnerabilities. UK AISI’s report last week documented OpenAI’s GPT-5.5 matching Anthropic’s Claude Mythos Preview on Expert-tier cyber-attack benchmarks, with both models fully solving multi-stage enterprise-attack simulations against undefended networks. The cyber-capability question, plus the continuing US-China AI rivalry, frames CAISI’s testing scope.

Who’s Affected

Anthropic, OpenAI, Google DeepMind, Microsoft, and xAI are the five labs now formally included in pre-release testing. Their model release schedules will increasingly coordinate around CAISI evaluation timelines. Chinese AI labs — DeepSeek, Moonshot, Xiaomi, Zhipu — are structurally excluded, but face an implicit competitive disadvantage in US enterprise and federal procurement: a CAISI-evaluated model carries a different policy posture than an unevaluated one. The seven Pentagon-aligned AI vendors confirmed last week largely overlap with CAISI’s covered labs (with some exceptions, notably Reflection AI). UK AISI, the ECB-Anthropic banks-testing program reported earlier this week, and other national-level AI safety frameworks gain a peer benchmark for cross-jurisdictional evaluation methodology.

What’s Next

CAISI is expected to publish summary evaluation reports for the new three-lab cohort in coming months, similar to the DeepSeek V4 evaluation. Watch for whether the program adds a sixth lab — Reflection AI is the most obvious candidate given its Pentagon classified-network contract — and for whether the evaluation methodology converges with UK AISI’s framework or remains distinct. The deeper policy question: whether CAISI evaluations become a soft prerequisite for US federal AI procurement, which would make participation effectively mandatory for any lab targeting government revenue.
