- Microsoft launched Copilot Cowork and a Critique function on March 30, 2026, enabling one AI model to draft research and a second model to review it for accuracy before delivery.
- Critique pairs OpenAI’s GPT for content generation with Anthropic’s Claude for independent review, separating writing from evaluation in a two-stage pipeline.
- Microsoft reports a 13.8-point improvement on the DRACO benchmark, which measures accuracy, completeness, and objectivity in research outputs.
- A separate Model Council feature runs multiple AI models simultaneously on the same query, with a third model summarizing areas of agreement and disagreement.
What Happened
Microsoft announced the broader rollout of Copilot Cowork and a new Critique function for its Researcher tool on March 30, 2026, as part of Wave 3 of Microsoft 365 Copilot. Copilot Cowork lets the AI system manage multi-step workflows on its own: accessing files, managing calendars, compiling daily briefings, and executing tasks across Outlook, Teams, and Excel.
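Microsoft has not published Cowork’s internals, but the behavior it describes maps onto a familiar agent loop: a model plans the next step, invokes a tool, and repeats until the task completes. The sketch below is a conceptual illustration under that assumption; the tool names and stub planner are hypothetical, not Copilot or Microsoft Graph APIs.

```python
# Conceptual sketch of a Cowork-style agent loop. Tool names and
# the fixed plan are hypothetical illustrations, not Copilot APIs.

TOOLS = {
    "list_calendar": lambda day: f"<events on {day}>",
    "read_file": lambda path: f"<contents of {path}>",
    "send_summary": lambda text: f"sent: {text}",
}

def plan_next_step(task, history):
    """Stub planner: in the real system a model would choose the next
    action from the task and history; a fixed plan stands in here."""
    plan = [
        ("list_calendar", "today"),
        ("read_file", "briefing_sources.md"),
        ("send_summary", "daily briefing"),
    ]
    return plan[len(history)] if len(history) < len(plan) else None

def run_workflow(task):
    # Loop until the planner signals completion, recording each step.
    history = []
    while (step := plan_next_step(task, history)) is not None:
        tool, arg = step
        history.append(f"{tool}({arg}) -> {TOOLS[tool](arg)}")
    return history

print(run_workflow("compile my daily briefing"))
```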
The Critique function introduces a two-stage review process where one AI model generates a draft and a second, independent model reviews it. Microsoft stated that the feature “separates generation from evaluation and utilises a combination of models from Frontier labs, including Anthropic and OpenAI.”
Why It Matters
Single-model AI systems have a well-documented tendency to produce confident but inaccurate output, particularly in research and analysis tasks. By routing generated content through an independent reviewer from a different model provider, Microsoft is addressing one of the most persistent complaints about AI assistants in enterprise settings: unreliable facts presented as authoritative. The separation of generation and evaluation reduces the likelihood that a single model’s blind spots pass through unchecked.
The multi-model approach also represents a shift in how major tech companies deploy AI. Rather than betting on a single model provider, Microsoft is combining OpenAI’s GPT for generation with Anthropic’s Claude for evaluation, treating each model’s strengths as complementary rather than competing. That makes Microsoft 365 Copilot the first major productivity suite to ship cross-provider AI verification as a default feature.
Technical Details
In the Critique pipeline, GPT drafts a response to a research query. Claude then reviews the draft for accuracy, completeness, and citation quality before the response reaches the user. Critique activates by default in Researcher when users select “Auto” in the model picker. Microsoft reports that Researcher with Critique scores 57.4 on the DRACO benchmark, a 13.8-point improvement over the previous version’s 43.6. DRACO measures accuracy, completeness, and objectivity in research outputs.
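Microsoft has not documented Critique’s interfaces, so the flow can only be sketched conceptually: two sequential model calls across providers. Everything below, including the call_model helper and the model labels, is an assumption for illustration; in particular, the announcement does not say whether the review triggers a revision or simply gates delivery.

```python
# Illustrative two-stage Critique pipeline: one model drafts, a
# second model from a different provider reviews. call_model() and
# the model labels are stand-ins, not documented Copilot APIs.

def call_model(model, prompt):
    """Stub for a provider chat-completion call."""
    return f"[{model} output for: {prompt[:50]}...]"

def research_with_critique(query):
    # Stage 1: generation by one provider's model.
    draft = call_model(
        "generator-gpt",
        f"Write a sourced research report answering: {query}",
    )
    # Stage 2: independent review by a different provider's model,
    # checking accuracy, completeness, and citation quality.
    review = call_model(
        "reviewer-claude",
        "Review this draft for accuracy, completeness, and citation "
        f"quality; flag unsupported claims:\n\n{draft}",
    )
    # How the review is applied is undocumented; one plausible design
    # has the generator revise against it before delivery.
    return call_model(
        "generator-gpt",
        f"Revise the draft to address this review.\n\n"
        f"Review:\n{review}\n\nDraft:\n{draft}",
    )

print(research_with_critique("impact of the EU AI Act on SaaS vendors"))
```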
The company claims the new Researcher outperforms Perplexity with Claude Opus 4.6 by 7 points in internal benchmarks, though no independent verification of these results has been published. A comparison against OpenAI’s GPT-5-based Deep Research or Google’s Gemini models was not included.
Model Council, a separate feature, runs models from both Anthropic and OpenAI simultaneously on identical queries and generates full reports from each. A third “judge” model then summarizes key findings, highlighting areas of agreement, disagreement, and unique insights from each model.
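Model Council’s described structure fans a single query out to several models and hands the full reports to a judge. The sketch below again uses a stubbed call_model and invented model names rather than real Copilot interfaces:

```python
# Illustrative Model Council pattern: identical query to several
# models in parallel, then a judge compares the full reports.
# Model names and call_model() are assumptions, not Copilot APIs.
from concurrent.futures import ThreadPoolExecutor

def call_model(model, prompt):
    """Stub for a provider chat-completion call."""
    return f"[{model} report on: {prompt[:40]}...]"

def model_council(query, models=("anthropic-model", "openai-model")):
    # Fan out: each model produces a full report on the same query.
    with ThreadPoolExecutor() as pool:
        reports = list(pool.map(lambda m: call_model(m, query), models))
    # Judge: a third model summarizes agreement, disagreement, and
    # what each report contributes uniquely.
    labeled = "\n\n".join(
        f"Report from {m}:\n{r}" for m, r in zip(models, reports)
    )
    return call_model(
        "judge-model",
        "Summarize where these reports agree, where they disagree, "
        f"and what each contributes uniquely:\n\n{labeled}",
    )

print(model_council("state of grid-scale battery storage in 2026"))
```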
Who’s Affected
Copilot Cowork is currently available through Microsoft’s Frontier early-access program, which began enrolling customers on March 9, 2026. The feature is not yet available to all Microsoft 365 subscribers. Enterprise customers using Copilot for research-heavy tasks in legal, financial, and consulting workflows stand to benefit most from the accuracy improvements that come from cross-model verification.
Individual knowledge workers who rely on Copilot for daily briefings, email summarization, and calendar management gain access to autonomous task execution through Cowork, reducing the number of manual steps required to complete routine workflows.
The current version of Copilot Cowork lacks the local computer use and third-party integrations found in the standalone Claude Cowork, limiting its scope to tasks within the Microsoft 365 ecosystem.
What’s Next
Microsoft has not announced a timeline for general availability beyond the Frontier program. The DRACO benchmark results remain self-reported, and the absence of comparisons against Google’s Gemini and OpenAI’s latest GPT-5-based Deep Research tool makes independent performance assessment difficult.
The multi-model approach also raises questions about latency and cost. Running two frontier models sequentially on every research query roughly doubles the compute required compared with single-model execution, and the review stage adds its full latency to every response. Whether the accuracy gains justify that extra time and expense in real-world enterprise use, rather than under controlled benchmark conditions, will determine long-term adoption of the Critique pattern.
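As a rough illustration of the tradeoff, the sequential Critique pattern adds the reviewer’s full latency on top of generation, while a parallel council is bounded by its slowest member plus the judge. The timings below are invented placeholders, not measurements of either feature:

```python
# Back-of-the-envelope latency comparison; every number here is an
# assumed placeholder, not a measurement of Copilot.
t_report_a = 30.0  # seconds for model A to draft a full report (assumed)
t_report_b = 35.0  # seconds for model B's report (assumed)
t_review = 20.0    # seconds for the independent review pass (assumed)
t_judge = 10.0     # seconds for the council's judge summary (assumed)

# Critique runs its stages back to back, so latencies add.
sequential_critique = t_report_a + t_review

# Model Council fans out in parallel, so latency is bounded by the
# slowest report, plus the judge's comparison at the end.
parallel_council = max(t_report_a, t_report_b) + t_judge

# Prints: critique: 50s, council: 45s (under these assumed timings)
print(f"critique: {sequential_critique:.0f}s, council: {parallel_council:.0f}s")
```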
