- Bridgewater and Thinking Machines Lab fine-tuned an open-weight Qwen3-235B model for analyzing financial documents.
- On their own internal evaluation, the model reached about 84.7 percent accuracy, beating Gemini, Claude, and GPT.
- The team reports the fine-tuned model is roughly 14 times cheaper to operate than the leading commercial models.
- The results come from the companies’ own testing on tasks whose answers were never public, not an independent benchmark.
What Happened
Hedge fund Bridgewater and Thinking Machines Lab, the startup founded by former OpenAI chief technology officer Mira Murati, say a fine-tuned open-weight model outperforms the strongest commercial AI systems at evaluating financial documents, at a fraction of the cost. As reported by The Decoder on July 3, 2026, the fine-tuned Qwen3-235B model reached nearly 85 percent accuracy in the teams’ own tests, ahead of Gemini, Claude, and GPT.
Why It Matters
The claim is that a firm can build a competitive domain model on its own data without sending sensitive information to a large provider. That directly challenges the assumption that frontier general-purpose models from OpenAI, Anthropic, and Google are the default choice for specialized enterprise work.
The work also leans on the open-weight Qwen line from Alibaba rather than a closed API, which means the resulting model can be run in-house, kept private, and operated at a cost the teams put at roughly a fourteenth of that of the commercial alternatives. For a hedge fund whose edge depends on proprietary analysis, keeping both the data and the model inside the building is the point. The claim that GPT and Claude “failed” has a narrow meaning: they were tested on tasks whose correct answers were never public, so a general model could not have absorbed them during training.
Technical Details
According to a report from Bridgewater’s AIA Labs and Thinking Machines Lab, the base model was Qwen3-235B, fine-tuned with internal expert knowledge across six defined financial tasks. The reported figures, about 84.7 percent accuracy and roughly 14 times lower operating cost than leading commercial models, come from the companies’ own internal evaluation rather than a public benchmark, so they have not been independently verified. The researchers framed the target task as triage rather than reading: in their words, reading “isn’t the real work”; the real work is “the constant stream of small, repeated judgment calls about what actually matters.” The six tasks were designed to capture those judgment calls, which is where the fine-tuned model’s domain knowledge is meant to pay off against a broader but shallower general model.
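For readers who want to see how headline numbers of this kind are typically derived, here is a minimal Python sketch of an accuracy and cost-ratio calculation over a privately held answer set. It is an illustration under stated assumptions, not the teams’ evaluation code: the report’s scoring method is not described in this article, and the task labels, answers, and cost figures below are all hypothetical toy values.

```python
from dataclasses import dataclass


@dataclass
class EvalItem:
    task: str       # hypothetical task label; the report's six tasks are not named here
    expected: str   # privately held correct answer
    predicted: str  # the model's answer for this item


def is_correct(item):
    """Exact match after trimming whitespace and ignoring case (an assumed scoring rule)."""
    return item.predicted.strip().lower() == item.expected.strip().lower()


def accuracy(items):
    """Fraction of items whose prediction matches the held-out answer."""
    if not items:
        raise ValueError("no evaluation items")
    return sum(1 for item in items if is_correct(item)) / len(items)


def accuracy_by_task(items):
    """Accuracy computed separately for each task label."""
    grouped = {}
    for item in items:
        grouped.setdefault(item.task, []).append(item)
    return {task: accuracy(group) for task, group in grouped.items()}


def cost_ratio(baseline_cost, candidate_cost):
    """How many times cheaper the candidate is than the baseline."""
    if candidate_cost <= 0:
        raise ValueError("candidate cost must be positive")
    return baseline_cost / candidate_cost


# Toy data for illustration only; none of these values come from the report.
items = [
    EvalItem("relevance", "yes", "yes"),
    EvalItem("relevance", "no", "yes"),
    EvalItem("materiality", "high", "High"),
    EvalItem("materiality", "low", "low"),
]

print(accuracy(items))          # overall accuracy across all items
print(accuracy_by_task(items))  # per-task breakdown
print(cost_ratio(14.0, 1.0))    # a baseline costing 14 units vs. a candidate costing 1
```

Breaking accuracy out per task matters because a single overall figure can hide a model that does well on some judgment calls and poorly on others.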
Who’s Affected
The result is aimed at financial institutions and other data-rich enterprises weighing whether to fine-tune open models in-house or pay for commercial APIs. It is also a data point for the open-weight ecosystem around Alibaba’s Qwen models, which the work positions as a viable foundation for serious commercial systems. The commercial providers whose general models were beaten on this task, the makers of Gemini, Claude, and GPT, are the implicit competitors, and Thinking Machines Lab’s involvement gives the claim added weight given Murati’s OpenAI pedigree.
What’s Next
The key limitation is that the numbers are self-reported on an internal test whose answers were never public — which the teams present as a feature, since it rules out memorization, but which also means no outside party has verified them. Independent replication on a shared benchmark would strengthen the claim. For now, the work is a template other firms can copy: fine-tune an open-weight model on proprietary data for a narrow, high-value task, and measure it against the commercial models on problems those models have never seen.