- Mathematician Adam Kucharski showed Microsoft Copilot’s ‘Auto’ mode invents country-specific stereotypes when analysing identical datasets labeled ‘UK’ and ‘US.’
- Auto mode falls back on training-data stereotypes instead of reading the actual data.
- Reasoning models (when explicitly selected) handle the same task correctly.
- Most users likely don’t know how or when to switch from Auto to a reasoning model.
What Happened
Mathematician Adam Kucharski demonstrated in a public experiment that Microsoft Copilot’s ‘Auto’ mode applies stereotypes when analysing text data rather than reading what the data actually says, The Decoder reported. An explicitly selected reasoning model given the same task returns a correct analysis.
Why It Matters
Microsoft Copilot has become the default quick-data-analysis tool at many companies, embedded across Microsoft 365 and Windows. When the tool fabricates patterns instead of analysing the data, any business decisions that follow rest on fabricated evidence. The Auto-mode behaviour is particularly concerning because Auto is the default: most users do not explicitly choose between Copilot’s underlying models for a given task.
The pattern generalises beyond Copilot. Google Gemini, ChatGPT, and other consumer AI tools use auto-routing layers that send queries to faster, cheaper, less capable models by default. Reasoning models tend to be slower and more expensive, so the routing default optimises for response time and cost, at the expense of analytical reliability on tasks that require careful reasoning.
Technical Details
Kucharski created 2,000 simulated free-text responses about emotions and labeled them “UK.” He then copied the same 2,000 responses and labeled them “US.” The combined 4,000 entries were shuffled and handed to Copilot in “Auto” mode for analysis. The result: Copilot delivered a detailed summary of how US and UK respondents supposedly differed in tone, intensity, and wording style — despite the datasets being identical.
Per Copilot’s output: “Based on the dataset you shared, US and UK responses differ mainly in tone, intensity, and wording style, even though they express similar emotional states.” The differences it described were fabricated: the two groups contained exactly the same text, so there was nothing to distinguish them. Reasoning models, when explicitly selected, handled the task correctly. The same dynamic likely applies across Gemini and ChatGPT auto-routing tiers.
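The dataset construction described above can be sketched in a few lines of Python. This is a minimal illustration, not Kucharski’s actual code: the response templates, the function names, and the seeds are all assumptions made for the example, since the source does not give the wording of the simulated responses.

```python
import random
from collections import Counter

# Hypothetical building blocks for simulated free-text responses; the
# original experiment's actual wording is not given in the source.
FEELINGS = ["anxious", "hopeful", "frustrated", "content", "overwhelmed"]
TOPICS = ["work", "family", "money", "health", "the future"]


def simulate_responses(n, seed=0):
    """Generate n simple free-text responses about emotions."""
    rng = random.Random(seed)
    return [
        f"I feel {rng.choice(FEELINGS)} about {rng.choice(TOPICS)}."
        for _ in range(n)
    ]


def build_duplicated_dataset(responses, seed=1):
    """Label every response 'UK', copy it with label 'US', then shuffle."""
    rows = [("UK", text) for text in responses]
    rows += [("US", text) for text in responses]
    random.Random(seed).shuffle(rows)
    return rows


def groups_are_identical(rows):
    """True if every label carries exactly the same multiset of texts."""
    by_label = {}
    for label, text in rows:
        by_label.setdefault(label, Counter())[text] += 1
    counts = list(by_label.values())
    return all(c == counts[0] for c in counts)


responses = simulate_responses(2000)
rows = build_duplicated_dataset(responses)
print(len(rows), groups_are_identical(rows))  # prints: 4000 True
```

A deterministic check of this kind provides ground truth before any AI summary is requested: if `groups_are_identical` returns `True`, any reported difference between the groups cannot have come from the data.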
Who’s Affected
Business users running quick data analysis through Copilot may receive fabricated outputs without realising it. Microsoft faces a reputational and product-design problem: its consumer-facing AI tool produces fabricated analytical claims on text data. Google’s Gemini and OpenAI’s ChatGPT face the same structural issue with their respective routing layers. Researchers, analysts, and journalists using AI tools for content analysis are the immediate at-risk population; corporate strategy and policy decisions made on AI-summarised free-text data carry the same risk.
What’s Next
Practical guidance: when doing any analytical task with AI tools, explicitly select a reasoning model rather than relying on Auto. In Copilot this means choosing the reasoning option; in ChatGPT, selecting an o-series model; in Gemini, choosing Pro or Ultra. Expect Microsoft, Google, and OpenAI to refine their auto-routing logic to recognise analytical tasks. The broader risk, that AI tools fabricate plausible-sounding output when used outside their reasoning tier, is a structural property of the current product design and is unlikely to be resolved quickly.