- A research team from several Chinese universities published RealChart2Code, testing 14 AI models on chart-to-code generation with 2,800+ real-world test cases drawn from Kaggle datasets.
- Anthropic’s Claude 4.5 Opus topped the proprietary category with an average score of 8.2 out of 10; Google’s Gemini 3 Pro Preview led on basic chart replication at 9.0.
- A “complexity gap” cuts scores by half or more: Gemini 3 Pro Preview drops from about 96% on ChartMimic to about 50% on RealChart2Code; Qwen3-VL-235B falls from roughly 85% to under 25%.
- Open-weight models hallucinate non-existent libraries while proprietary models make data assignment errors; all tested models struggle with iterative code repair.
What Happened
Researchers at several Chinese universities published RealChart2Code, a benchmark evaluating how well AI models translate real-world visualizations into runnable code. The April 2026 study tested 14 models—five proprietary and nine open-weight—across more than 2,800 test cases derived from 1,036 Kaggle datasets totaling roughly 860 million rows of data. The benchmark, evaluation code, and model outputs are available on GitHub and Hugging Face.
Why It Matters
Earlier chart-understanding benchmarks, including Plot2Code and ChartMimic, relied primarily on synthetic data and simple single-panel charts. RealChart2Code introduces composite multi-panel layouts, 50 distinct chart types, and large raw data files sourced from real research datasets—conditions closer to production visualization work. The cross-benchmark score comparisons in the paper indicate that performance figures from older evaluations have been substantially overstating real-world capability.
The finding echoes results from Google’s PaperBanana project, which used five specialized agents to generate scientific charts from text descriptions and achieved a visualization fidelity of 45.8 percent—below human reference levels—though human reviewers still preferred the outputs over plain image generation in 73 percent of cases.
Technical Details
RealChart2Code structures evaluation around three tasks. “Chart Replication” gives a model only an image and asks for working visualization code. “Chart Reproduction” adds the raw data source. “Chart Refinement” opens a multi-turn dialogue: the model receives broken code and must fix it incrementally. The paper describes this as the first benchmark to systematically evaluate iterative refinement in a conversational format alongside large raw dataset inputs.
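The three tasks differ only in what the model receives as input: an image alone, an image plus raw data, or an image plus broken code in a repair dialogue. A minimal sketch of how such prompts might be assembled — all function names, file names, and wording here are hypothetical illustrations, not taken from the released benchmark code:

```python
# Hypothetical sketch of the three RealChart2Code task configurations.
# Function names, file names, and prompt wording are illustrative only,
# not drawn from the benchmark's released evaluation code.

def build_prompt(task: str, chart_image: str, data_file: str = None,
                 broken_code: str = None) -> str:
    """Assemble the model input for one of the three evaluation tasks."""
    if task == "replication":
        # Chart Replication: image only, reproduce the chart from pixels.
        return f"Write Matplotlib code that recreates the chart in {chart_image}."
    if task == "reproduction":
        # Chart Reproduction: image plus the raw data source.
        return (f"Write Matplotlib code that recreates the chart in {chart_image} "
                f"using the data in {data_file}.")
    if task == "refinement":
        # Chart Refinement: broken code opens a multi-turn repair dialogue.
        return (f"The following code should produce the chart in {chart_image} "
                f"but is broken. Fix it:\n{broken_code}")
    raise ValueError(f"unknown task: {task}")

print(build_prompt("replication", "chart_001.png"))
```

In a real multi-turn harness, the refinement prompt would be re-issued with the model's latest attempt after each failed execution, which is exactly where the paper observes regressive edits.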
Claude 4.5 Opus posted the highest proprietary average of 8.2 on a ten-point scale covering eight visual accuracy criteria. Gemini 3 Pro Preview followed at 8.1 overall and led on basic chart replication at 9.0. GPT-5.1 scored 5.4. The open-weight leaders—Qwen3-VL-235B at 3.6 and Intern-VL-3.5-241B at 3.4—scored less than half of what the leading proprietary models achieved. DeepSeek-VL-7B recorded a chart replication pass rate of just 9.7 percent, meaning its generated code failed to execute in more than 90 percent of cases.
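An execution pass rate like DeepSeek-VL-7B's 9.7 percent can be measured by attempting to run each generated script and counting the fraction that execute without raising. A minimal sketch — the sample snippets below are invented for illustration, and a production harness would sandbox execution rather than use a bare `exec`:

```python
# Minimal sketch of an execution pass-rate check: run each generated
# snippet and count the fraction that execute without raising.
# The sample snippets are invented for illustration.

def execution_pass_rate(snippets: list) -> float:
    passed = 0
    for code in snippets:
        try:
            exec(code, {})  # fresh namespace; real harnesses sandbox this
            passed += 1
        except Exception:
            pass  # syntax errors, hallucinated imports, etc. count as failures
    return passed / len(snippets)

samples = [
    "x = [1, 2, 3]\ny = [v * 2 for v in x]",  # runs fine
    "import nonexistent_plotting_lib",         # hallucinated library
    "def f(:\n    pass",                       # syntax error
    "total = sum(range(10))",                  # runs fine
]
print(f"pass rate: {execution_pass_rate(samples):.1%}")  # → pass rate: 50.0%
```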
The paper identifies two distinct failure modes. Open-weight models frequently call non-existent API functions: Qwen3-VL-235B generated invalid Matplotlib calls—including a nonexistent style parameter—in roughly 20 percent of cases, often producing overlapping subplots or broken grid structures when the code did run. Proprietary models such as Claude 4.5 Opus and GPT-5.1 rarely produce syntax errors but tend to assign data series to incorrect axes or misapply visual attributes. Across all models, the researchers documented what they term “regressive editing”: when a model is asked to fix one error during iterative refinement, it “frequently breaks previously correct parts of the code,” the paper states. Automated scoring agreed with human expert judgments at a Cohen’s Kappa of 0.83, with inter-agent agreement at a Fleiss’ Kappa of 0.82.
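Cohen's Kappa quantifies agreement between two raters beyond what chance would produce: κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement rate and p_e the agreement expected if both raters labeled independently at their own base rates. A minimal computation on invented pass/fail judgments (the labels below are made up for illustration, not benchmark data):

```python
# Cohen's Kappa: rater agreement corrected for chance.
# The pass/fail labels below are invented for illustration.
from collections import Counter

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    n = len(rater_a)
    # Observed agreement: fraction of items labeled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement: chance both pick the same label independently.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical judgments: an automated scorer vs. a human expert.
auto  = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "fail"]
human = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
print(f"kappa: {cohens_kappa(auto, human):.2f}")  # → kappa: 0.75
```

A Kappa of 0.83, as reported in the paper, sits in the range conventionally read as near-complete agreement, which is why the authors treat the automated scores as a usable proxy for human review.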
Who’s Affected
Developers building AI-assisted data visualization tools—products that generate or repair dashboard and report code—face the most direct implications. The benchmark documents concrete failure modes (hallucinated APIs, wrong-axis data assignments, regressive edits) that degrade user-facing output in ways simpler benchmarks would not have surfaced. Anthropic, Google, and OpenAI all have models in the results, establishing a more demanding public baseline; open-source teams behind Qwen3-VL and InternVL face the widest gap between prior benchmark scores and performance on real-world data.
What’s Next
The benchmark is currently scoped to Matplotlib as the only supported visualization library, and the authors acknowledge that automated scoring may miss subtle visual artifacts such as minor element overlap or imprecise color rendering. The full dataset, evaluation pipeline, and results are publicly available on GitHub and Hugging Face for independent replication. No extension to additional visualization libraries has been announced as of April 19, 2026.