- Researchers at Handshake AI and McGill University released BankerToolBench, an open-source benchmark testing nine AI models on real investment banking deliverables.
- Approximately 500 current and former investment bankers evaluated outputs across 100 tasks; not one deliverable was rated fit for immediate client submission.
- GPT-5.4, the top-scoring model, cleared all critically weighted criteria on just 2 percent of tasks; Gemini 2.5 Pro scored zero percent on that same measure.
- Claude Opus 4.6 produced visually polished outputs but hardcoded most key figures as static values rather than Excel formulas, making scenario analysis impossible.
What Happened
A research team at Handshake AI and McGill University published BankerToolBench, an open-source benchmark evaluating nine leading AI models on the daily workflows of junior investment bankers. The evaluation enlisted approximately 500 current and former bankers from firms including Goldman Sachs, JPMorgan, Evercore, Morgan Stanley, and Lazard. Not a single output from any tested model was deemed ready for client submission, though more than half of the participating bankers said they would use AI outputs as a starting point.
Why It Matters
Investment banking is among the highest-stakes domains for AI deployment in knowledge work, with analyst-level deliverables carrying direct financial and legal weight. The findings align with other recent assessments: a Vals.ai study conducted with a global systemically important bank found that OpenAI’s o3 reached only 48.3 percent accuracy on financial analysis tasks, and UC Berkeley researchers concluded that production-grade agent deployments currently rely on simple, tightly controlled pipelines with few steps. A separate analysis from Carnegie Mellon and Stanford argued that agent benchmarks have concentrated too narrowly on coding tasks, leaving economically important fields such as finance and law largely absent.
Technical Details
The benchmark comprised 100 tasks designed by 172 practicing bankers who collectively logged more than 5,700 hours building them. Each task took a human banker an average of five hours to complete, with some running as long as 21 hours. A single task can trigger up to 539 language model calls, with 97 percent tied to tool use or code execution.
Deliverables graded by the benchmark include Excel financial models with working formulas, PowerPoint client decks, PDF reports, and Word memos — each scored against rubrics averaging 150 individual criteria across six dimensions including technical correctness, client readiness, compliance, and cross-file consistency. Grading was handled by an AI verifier the authors built, called Gandalf, based on Gemini 3 Flash Preview; it agreed with human reviewers 88.2 percent of the time, slightly above the 84.6 percent agreement rate between two human reviewers.
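The article above only summarizes the grading scheme; the sketch below shows one plausible way per-criterion verdicts could roll up into the weighted scores, critical-criteria checks, and agreement rates quoted here. The `Criterion` structure and both helper functions are illustrative assumptions, not the benchmark's actual code.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str          # e.g. "EBITDA bridge ties to the cash flow statement"
    weight: float      # relative importance within the rubric
    critical: bool     # must pass for the deliverable to be usable at all
    passed: bool       # verdict from the AI verifier or a human reviewer

def score_task(criteria: list[Criterion]) -> dict:
    """Aggregate per-criterion verdicts into task-level metrics (illustrative only)."""
    total_weight = sum(c.weight for c in criteria)
    weighted_pass = sum(c.weight for c in criteria if c.passed)
    return {
        "weighted_score": weighted_pass / total_weight,
        # Analogous to "cleared all critically weighted criteria"
        "all_critical_passed": all(c.passed for c in criteria if c.critical),
    }

def agreement_rate(verifier: list[bool], human: list[bool]) -> float:
    """Fraction of criteria on which two graders give the same pass/fail verdict."""
    return sum(v == h for v, h in zip(verifier, human)) / len(verifier)
```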
GPT-5.4 led all nine models tested — which also included GPT-5.2, Claude Opus 4.5 and 4.6, Gemini 2.5 Pro, Gemini 3.1 Pro Preview, Grok 4, Qwen-3.5-397B, and GLM-5 — but still failed nearly half the evaluation criteria overall. Only 16 percent of GPT-5.4 outputs were accepted by bankers as a useful starting point; requiring three consistent runs dropped that figure to 13 percent. GPT-5.4 cleared all critically weighted criteria on just 2 percent of tasks, while Gemini 2.5 Pro cleared zero. An analysis of GPT-5.4 agent trajectories identified four recurring failure modes: code and formula bugs at 41 percent of errors, business logic breakdowns such as adding cost synergies to the revenue line instead of the cost line at 27 percent, aborted data queries at 18 percent, and fabricated numbers presented as sourced at 13 percent.
Claude Opus 4.6’s outputs appeared polished at first glance, but the underlying Excel models embedded most key figures as hardcoded static values rather than calculated formulas. The paper describes this as “a dealbreaker in investment banking” because changing a purchase price leaves all dependent cells unchanged, rendering scenario analysis impossible. Claude Opus 4.5 exhibited the same behavior.
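Detecting this failure mode is mechanically straightforward. A minimal sketch, assuming the deliverable is a standard .xlsx workbook readable with openpyxl and using a placeholder file name:

```python
from openpyxl import load_workbook

# data_only=False keeps formula strings instead of cached numeric results
wb = load_workbook("lbo_model.xlsx", data_only=False)  # hypothetical file name

hardcoded, formulas = [], []
for ws in wb.worksheets:
    for row in ws.iter_rows():
        for cell in row:
            if cell.value is None:
                continue
            if cell.data_type == "f":                 # cell holds a formula, e.g. "=B4*C4"
                formulas.append(f"{ws.title}!{cell.coordinate}")
            elif isinstance(cell.value, (int, float)):  # a static number typed into the cell
                hardcoded.append(f"{ws.title}!{cell.coordinate}")

# A model dominated by hardcoded numerics cannot propagate a changed purchase
# price through dependent cells, which is the failure described above.
print(f"formula cells: {len(formulas)}, hardcoded numeric cells: {len(hardcoded)}")
```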
Who’s Affected
AI labs with models in the evaluation — Anthropic, OpenAI, Google DeepMind, xAI, and the developers behind Qwen-3.5-397B and GLM-5 — now have a structured, banker-reviewed account of performance gaps in finance-specific knowledge work. Financial institutions that have begun piloting AI agents for analyst-level tasks face a detailed breakdown of where and how frequently failures occur. The authors noted that performance improved meaningfully when tasks were enriched with domain context that experienced bankers take for granted, suggesting that fine-tuning or retrieval augmentation could reduce the gap in practical deployments.
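The paper does not specify how that domain context was supplied. One naive retrieval-style approach, prepending the most relevant banker-written notes to the task prompt, is sketched below; the `enrich_prompt` helper and its keyword-overlap ranking are hypothetical, standing in for a real embedding-based retriever.

```python
def enrich_prompt(task: str, domain_notes: list[str], top_k: int = 3) -> str:
    """Prepend the most relevant domain notes to a task prompt (naive keyword overlap)."""
    task_terms = set(task.lower().split())
    ranked = sorted(
        domain_notes,
        key=lambda note: len(task_terms & set(note.lower().split())),
        reverse=True,
    )
    context = "\n".join(f"- {note}" for note in ranked[:top_k])
    return f"Relevant domain context:\n{context}\n\nTask:\n{task}"
```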
What’s Next
The benchmark is publicly available along with its full dataset, rubrics, and AI verifier. The team demonstrated that reinforcement learning methods — specifically Dr. GRPO and DPO applied to Qwen-3-4B and 32B — raised benchmark scores by a factor of five to thirteen, though from a very low baseline. Anthropic has separately introduced tooling that allows Claude to switch autonomously between Excel and PowerPoint, and Cowork plugins now pipe FactSet, MSCI, and LSEG market data directly into agent workflows — capabilities the paper identifies as directly relevant to the gaps BankerToolBench exposes, though their effect on benchmark scores has not yet been tested.
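For orientation, the DPO half of that result optimizes the standard published preference loss, sketched in PyTorch below. This is not the authors' training code, and the idea of pairing chosen versus rejected trajectories by rubric score is an assumption here.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective: push the policy to prefer the chosen (e.g. higher-rubric)
    completion over the rejected one, measured relative to a frozen reference model.
    Inputs are per-sequence summed log-probabilities from the policy and reference."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```

Dr. GRPO takes a different route, scoring groups of sampled outputs against a reward signal rather than preference pairs; in this setting the reward would plausibly come from the rubric verifier, though the paper's exact setup may differ.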