
RESEARCH
Benchmarks & Research Editor · MegaOne AI
James Whitfield runs MegaOne AI's benchmarks and research desk, covering model evaluations, technical papers, and the methodology debates shaping how AI systems are measured. His work examines what published benchmark scores actually mean, where evaluation harnesses can be gamed, and how new releases stack up against the prior state of the art on tasks that matter — coding, reasoning, agentic workflows, multimodality, and long-context retrieval.

James reads the papers other reporters skim. He pays close attention to ablations, evaluation protocols, dataset contamination concerns, and the gap between cherry-picked demos and reproducible results. He has a background in computer science research and is comfortable digging into model cards, training-compute disclosures, and the supplementary appendices where the most consequential details often live. James prefers to wait for independent reproduction before declaring a new model the leader on any given task.

His reporting is built for technical readers who want to understand what a benchmark result implies for real-world deployment, and for non-technical readers who want a clear explanation of which claims hold up under scrutiny.
44 stories published

