ANALYSIS

Google Study: AI Benchmarks Need 10+ Human Raters Per Test Example

Anika Patel · Apr 6, 2026 · 4 min read
Engine Score 5/10 — Notable
  • A joint study by Google Research and the Rochester Institute of Technology found that the standard three to five human raters per benchmark example is statistically insufficient for reliable AI model comparisons.
  • Experiments showed that at least ten raters per example are required to reproducibly detect performance differences between models.
  • Approximately 1,000 total annotations can yield reliable results, but only when the budget is correctly split between the number of test examples and raters per item.
  • The optimal allocation strategy depends on the evaluation metric: majority-vote accuracy favors breadth, while distribution-aware metrics require depth.

What Happened

Researchers from Google Research and the Rochester Institute of Technology published findings challenging the annotation practices underlying most major AI benchmarks, concluding that the near-universal standard of three to five human raters per test example is insufficient for reproducible model comparisons. As reported by The Decoder, the study demonstrates that current benchmark construction systematically discards the diversity of human opinion by collapsing evaluator disagreements into a single majority-vote answer.

The core problem, as the researchers frame it, is a misallocation of annotation budgets. The study illustrates the tradeoff with a restaurant analogy: asking 1,000 guests to each sample a different dish yields a broad but shallow snapshot, whereas asking 20 diners to each rate the same 50 dishes, the same 1,000 tastings in total, produces a far richer picture of each dish's quality. Today's AI benchmarks, the researchers argue, overwhelmingly follow the first model.

Why It Matters

Human evaluation is the primary mechanism for ranking AI models on tasks involving subjectivity — including toxicity detection, chatbot safety, and cross-cultural offensiveness classification. Benchmarks built on thin annotation layers can produce rankings that are not reproducible and do not reflect the genuine distribution of human judgment, with direct consequences for how models are selected, deployed, and assessed for safety compliance.

The annotation reliability problem has been noted in the broader natural language processing literature, but this study is notable for subjecting it to systematic budget-optimization analysis across multiple real-world evaluation domains and thousands of simulated configurations.

Technical Details

The team built a simulator that replicates human rating patterns using real datasets, generating synthetic evaluation data for two models where one performs measurably worse than the other under controlled conditions — allowing the researchers to test which configurations reliably detect that gap. The simulator was calibrated against five real datasets covering toxicity detection, chatbot safety, and cross-cultural offensiveness assessment.
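The paper's simulator itself is not reproduced here, but the core mechanic can be sketched in a few lines: assign each test example a probability that a random rater approves model A's output, shift that probability down by a known gap for model B, and draw a fixed number of rater votes per example. The distributional choices below (a Beta prior over items, a five-point gap, binary labels) are illustrative assumptions, not the study's calibrated values.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_votes(n_items, n_raters, gap=0.05):
    """Synthetic rater votes for two models, where model B is `gap`
    worse than model A on average (assumed parameters, for illustration)."""
    # Heterogeneous per-item probability that a random rater approves model A.
    p_a = rng.beta(4, 2, size=n_items)
    # Model B is measurably worse on every item.
    p_b = np.clip(p_a - gap, 0.0, 1.0)
    # Each item receives n_raters independent binary votes per model.
    votes_a = rng.binomial(1, p_a[:, None], size=(n_items, n_raters))
    votes_b = rng.binomial(1, p_b[:, None], size=(n_items, n_raters))
    return votes_a, votes_b
```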

Across thousands of tested budget combinations, the experiments found that fewer than ten raters per example consistently failed to produce reproducible model comparisons under standard statistical thresholds. Around 1,000 total annotations proved sufficient for reliability, but only when split correctly: a poor balance between test examples and raters rendered larger total budgets unreliable.
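A budget sweep of this kind can be reproduced in spirit by reusing the simulate_votes sketch above: hold the total number of annotations per model fixed, vary how it is split between items and raters, and measure how often a paired comparison correctly flags model A as better. The significance test and trial counts below are illustrative choices, not the study's protocol.

```python
from scipy.stats import ttest_rel

def detection_rate(n_items, n_raters, gap=0.05, trials=200, alpha=0.05):
    """Fraction of simulated evaluations in which model A is correctly
    found significantly better than model B (illustrative criterion)."""
    hits = 0
    for _ in range(trials):
        votes_a, votes_b = simulate_votes(n_items, n_raters, gap)
        # Per-item score: mean rater approval of each model's output.
        stat, p_value = ttest_rel(votes_a.mean(axis=1), votes_b.mean(axis=1))
        hits += int(p_value < alpha and stat > 0)
    return hits / trials

budget = 1000  # total annotations per model
for n_raters in (1, 2, 5, 10, 20):
    n_items = budget // n_raters
    print(f"{n_items:4d} items x {n_raters:2d} raters -> "
          f"detection rate {detection_rate(n_items, n_raters):.2f}")
```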

The study also identifies a metric-dependent divergence in optimal strategy. For majority-vote accuracy — which only examines the most common evaluator answer — a wide approach using many examples with few raters per item performs best, since additional raters provide diminishing returns. For distribution-aware metrics such as total variation, which measure the full spread of evaluator responses, the opposite holds: fewer examples but significantly more raters per item are required, and this configuration also achieved reliability with the smallest total annotation budget in the experiments.
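The practical difference between the two metric families is easiest to see with binary labels. In the sketch below (hypothetical helper names, continuing the vote matrices from the earlier snippets), majority-vote accuracy keeps only the most common answer per item, so extra raters beyond a stable majority add little; a distribution-aware comparison instead needs an estimate of each item's full label distribution, which a handful of raters cannot provide.

```python
import numpy as np

def majority_score(votes):
    """Share of items whose majority rater label is 'acceptable' (1).
    Everything except the most common answer per item is discarded."""
    return float((votes.mean(axis=1) > 0.5).mean())

def mean_total_variation(votes_a, votes_b):
    """Mean per-item total variation distance between the two models'
    rater-label distributions; for binary labels TV reduces to |p_a - p_b|."""
    p_a = votes_a.mean(axis=1)  # empirical P(label = 1) per item
    p_b = votes_b.mean(axis=1)
    return float(np.abs(p_a - p_b).mean())
```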

Who’s Affected

The findings apply directly to AI labs, academic research groups, and evaluation organizations that design the benchmarks used to rank large language models, content-moderation classifiers, and chatbot safety systems. Any benchmark relying on three to five raters per example — which describes the majority of current evaluation suites — falls below the ten-rater threshold the study identifies as necessary for statistical reproducibility.

Model developers and deployers who use these benchmarks to make safety or capability decisions are downstream of whatever unreliability exists in the underlying annotation design. Regulatory frameworks that reference benchmark performance as evidence of model safety or compliance are similarly exposed.

What’s Next

The research does not prescribe a single universal rater count; instead, it argues that benchmark designers must identify their target metric first and then allocate annotation budgets accordingly. Increasing raters per example from five to ten or more roughly doubles annotation costs for a fixed number of test items, meaning adoption will depend on whether benchmark providers treat annotation depth as a cost center or a validity requirement.

The study’s simulator framework is described as a tool for future benchmark designers to model their own budget tradeoffs before committing annotation resources — though whether major evaluation efforts adopt it in practice remains to be seen.
