- MLCommons released MLPerf Inference v6.0 with five new or updated benchmarks including GPT-OSS 120B, DeepSeek-R1, and the suite’s first text-to-video generation test.
- 24 organizations submitted results, with three first-time participants — the largest round of submissions to date.
- Multi-node submissions increased 30% over v5.1, with the largest system spanning 72 nodes and 288 accelerators.
- AMD’s Instinct MI355X achieved 93% of NVIDIA B200 single-node performance in Single Stream inference.
What Happened
MLCommons released MLPerf Inference v6.0 on April 1, 2026, representing what co-chair Frank Han of Dell called “the most significant revision of the Inference benchmark suite that we’ve ever done.” The update replaces or adds five of the eleven datacenter benchmarks: a 120-billion-parameter open-weight LLM (GPT-OSS 120B) for math, science, and coding; DeepSeek-R1 with an interactive scenario supporting speculative decoding; a third-generation recommender system (DLRMv3); the suite’s first text-to-video generation benchmark; and a vision-language model test using Shopify product catalogs.
Why It Matters
MLPerf is the primary standardized benchmark for comparing AI inference hardware across vendors. The v6.0 update reflects a shift in what matters for production AI deployment: reasoning models like DeepSeek-R1 and multimodal workloads including text-to-video now sit alongside traditional classification and object detection tasks. The addition of speculative decoding support in DeepSeek-R1 testing acknowledges a technique that has become standard in production LLM serving.
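For readers unfamiliar with the technique, the sketch below shows the core idea of greedy speculative decoding: a cheap draft model proposes several tokens, the large target model verifies them in a single pass, and the longest accepted prefix is kept. It is purely illustrative and not part of any MLPerf harness; `draft_model` and `target_model` are hypothetical stand-ins, and production implementations verify proposals against the target model’s full probability distribution rather than a single greedy token.

```python
def speculative_step(tokens, draft_model, target_model, k=4):
    # Draft phase: the small model proposes k tokens autoregressively.
    ctx = list(tokens)
    proposals = []
    for _ in range(k):
        t = draft_model(ctx)
        proposals.append(t)
        ctx.append(t)

    # Verify phase: one pass of the large model returns its own greedy choice
    # at each of the k+1 positions following the original context.
    verified = target_model(tokens, k)

    accepted = []
    for proposed, correct in zip(proposals, verified):
        if proposed != correct:
            accepted.append(correct)   # first mismatch: take the target's token, stop
            break
        accepted.append(proposed)
    else:
        accepted.append(verified[k])   # every proposal accepted: bonus token for free

    return tokens + accepted


# Toy usage with stand-in "models" (hypothetical, purely illustrative).
draft = lambda ctx: (ctx[-1] + 1) % 100
target = lambda ctx, k: [(ctx[-1] + i + 1) % 100 for i in range(k + 1)]
print(speculative_step([7], draft, target))  # -> [7, 8, 9, 10, 11, 12]
```

The payoff is that a single expensive target-model pass can emit several tokens whenever the draft’s guesses are accepted, which is why interactive LLM serving scenarios increasingly assume the technique.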
Technical Details
Twenty-four organizations submitted results, including first-time participants Inventec Corporation, Netweb Technologies India, and Stevens Institute of Technology. Multi-node submissions increased 30% over the previous round (v5.1, released six months earlier), with 10% of submissions featuring more than 10 nodes — up from 2% previously. The largest system tested comprised 72 nodes with 288 accelerators, four times the size of the previous round’s largest configuration.
AMD’s Instinct MI355X achieved 93% of NVIDIA B200 single-node performance and 87% of B300 single-node performance in Single Stream testing. MLCommons also introduced LoadGen++, a new harness that lets submitters run LLM serving-style software stacks for more realistic testing conditions. Co-chair Miro Hodak of AMD said partnerships with industry were “essential in ensuring that the tests include scenarios and workloads that represent the current state of the industry.”
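To illustrate what “LLM serving-style” load generation means in practice, the following is a rough sketch under simple assumptions: Poisson request arrivals against a streaming endpoint, with per-request time-to-first-token (TTFT) measured. It is not the LoadGen++ API; `fake_llm_stream`, `run_load`, and the other names are hypothetical stand-ins.

```python
import asyncio
import random
import time

async def fake_llm_stream(prompt):
    # Hypothetical stand-in for a streaming LLM inference endpoint.
    for token in ["Hello", " world", "."]:
        await asyncio.sleep(0.05)  # pretend per-token decode latency
        yield token

async def issue_query(prompt, ttfts):
    # Record time-to-first-token for one request, then drain the stream.
    start = time.perf_counter()
    first = True
    async for _ in fake_llm_stream(prompt):
        if first:
            ttfts.append(time.perf_counter() - start)
            first = False

async def run_load(target_qps=4.0, duration_s=2.0):
    """Issue queries with exponentially distributed inter-arrival times
    (a Poisson process), as server-style inference benchmarks typically do."""
    ttfts, tasks = [], []
    deadline = time.perf_counter() + duration_s
    while time.perf_counter() < deadline:
        tasks.append(asyncio.create_task(issue_query("benchmark prompt", ttfts)))
        await asyncio.sleep(random.expovariate(target_qps))
    await asyncio.gather(*tasks)
    print(f"{len(ttfts)} queries, mean TTFT {sum(ttfts) / len(ttfts):.3f}s")

asyncio.run(run_load())
```

A real harness tracks full latency distributions (TTFT, time per output token, tail percentiles) and enforces accuracy targets; the point here is only the concurrent, arrival-driven request pattern, as opposed to issuing one query at a time.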
Who’s Affected
Cloud providers and enterprises evaluating AI inference hardware now have updated benchmarks reflecting current workloads. AMD’s competitive results against NVIDIA’s B200 and B300 give buyers more data for procurement decisions. The text-to-video and vision-language benchmarks provide the first standardized comparison points for companies deploying generative AI in production.
What’s Next
MLPerf Training v5.0 results are expected later in 2026. The new benchmarks will likely accelerate hardware optimization around reasoning and multimodal workloads, with vendors already racing to submit optimized results for the next inference round. Edge inference benchmarks were also updated, with YOLOv11 Large replacing the previous object detection test.
