- A study from Harbin Institute of Technology and Xiaohongshu argues leading AI search agents mostly use the web to confirm answers they already have, not to research.
- The authors call this ‘intrinsic knowledge dependence’ (IKD) — reliance on knowledge absorbed during training rather than active web research.
- Without any internet access, MiniMax M2.5 solved 44.5% of BrowseComp tasks from memory alone; Kimi K2.6 hit 62% on the Chinese BrowseComp-ZH variant.
- When the search index contains no supporting documents, models perform worse than with no search tools at all: MiniMax M2.5 dropped from 44.5% to 8.0%, and Kimi K2.6 from 25.5% to 2.3%.
What Happened
Leading AI search agents do little actual research on established benchmarks. They mostly use the web to confirm answers they already have, according to a study from researchers at Harbin Institute of Technology and Xiaohongshu reported by The Decoder. Frontier models like GPT-5.4, Gemini 3.1 Pro, Claude Sonnet 4.6, DeepSeek-V4-Pro, and Kimi K2.6 have been posting higher scores on BrowseComp, a benchmark that asks agents complex questions requiring multi-step web browsing, but the gains say less about research skill than is commonly assumed.
Why It Matters
The finding reframes how to interpret recent benchmark progress on agentic-AI search. BrowseComp scores have been treated as evidence that frontier models are getting better at multi-step web research. The Harbin–Xiaohongshu study suggests a substantial fraction of those scores come from training-data memorization rather than active research capability.
This has practical consequences. Enterprise deployments increasingly rely on agentic-AI search for tasks where the right answer must come from the live web (current pricing, recent regulatory changes, freshly published research). If frontier models use the web to confirm answers they already have rather than to discover new information, those deployments may produce confidently wrong outputs when the underlying data has shifted since the model’s training cutoff.
Technical Details
The researchers tested 11 models. The first test stripped away all search and browsing tools, so models had to answer BrowseComp questions from memory alone. The results were striking: MiniMax M2.5 solved 44.5% of BrowseComp tasks from memory, and Kimi K2.6 hit 62% on the Chinese BrowseComp-ZH variant. A substantial fraction of benchmark performance is therefore achieved before any search action takes place.
The second test was more telling. The researchers left the search interface in place but removed all answer-supporting documents from the search index. Every tested model then scored lower than it had with no tool access at all. MiniMax M2.5 dropped from 44.5% to 8.0%. Kimi K2.6 fell from 25.5% to 2.3%. The interpretation: when no confirming hits show up, searching actively pulls agents away from correct answers they would otherwise have given from memory. The authors call the pattern ‘intrinsic knowledge dependence’ (IKD).
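To put the size of that drop in perspective, the short Python sketch below works through the arithmetic on the two reported score pairs. It assumes, as the article's wording implies, that the first number in each pair is the model's no-tools score on the same benchmark as the second. The names `reported` and `score_drop` are illustrative and do not come from the study.

```python
# Scores as reported in the article, in percent of tasks solved.
# "no_tools": no search or browsing tools available (assumed baseline).
# "ablated":  search interface present, answer-supporting documents removed.
reported = {
    "MiniMax M2.5": {"no_tools": 44.5, "ablated": 8.0},
    "Kimi K2.6": {"no_tools": 25.5, "ablated": 2.3},
}


def score_drop(no_tools: float, ablated: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute = no_tools - ablated
    relative = 100.0 * absolute / no_tools
    return round(absolute, 1), round(relative, 1)


for model, scores in reported.items():
    points, relative = score_drop(scores["no_tools"], scores["ablated"])
    print(f"{model}: -{points} points ({relative}% of the no-tools score lost)")
```

Under those assumptions, MiniMax M2.5 loses 36.5 percentage points, or about 82% of what it scored from memory, and Kimi K2.6 loses 23.2 points, or about 91%.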
Who’s Affected
AI safety researchers and benchmark designers face a methodological question: how to test for genuine research capability rather than confirmation of memorized answers. Enterprise AI buyers deploying agentic-AI search must ask whether their use cases involve information that wasn’t in the model’s training data and, if so, whether the deployed agent will produce confidently wrong outputs. Frontier AI providers (OpenAI, Anthropic, Google DeepMind, Mistral, xAI, DeepSeek, Moonshot, MiniMax) face renewed scrutiny of how their agents handle queries that require new information versus queries within their training distribution.
What’s Next
The Harbin–Xiaohongshu paper will likely face independent replication attempts. Expect follow-up benchmark designs that specifically test for IKD-resistant research capability. The broader question of how to design agentic-search benchmarks that resist training-data contamination is one of the most active open problems in agentic-AI evaluation.