On March 30, 2026, researcher Rongtian Ye submitted ChartDiff, a benchmark of 8,541 annotated chart pairs, to arXiv. The benchmark evaluates whether vision-language models can reason comparatively across two charts at once, and it is the first large-scale resource dedicated to cross-chart comparative summarization, filling a gap that standard chart understanding benchmarks leave unaddressed.
- ChartDiff contains 8,541 chart pairs, each annotated with human-verified summaries describing differences in trends, fluctuations, and anomalies.
- Frontier general-purpose models scored highest on GPT-based quality evaluation; chart-specialized models posted higher ROUGE scores but lower human-aligned quality.
- Multi-series charts proved the most difficult subtype across all tested model families.
- The results reveal a measurable gap between lexical overlap metrics and human-judged output quality — a methodological concern for benchmark design.
What Happened
Rongtian Ye submitted “ChartDiff: A Large-Scale Benchmark for Comprehending Pairs of Charts” to arXiv on March 30, 2026 (identifier: 2603.28902). The benchmark contains 8,541 chart pairs spanning diverse data sources, chart types, and visual styles. Each pair is annotated with summaries describing differences in trends, fluctuations, and anomalies between the two charts; annotations were generated by a large language model and subsequently verified by human reviewers before being used as evaluation references.
Why It Matters
Existing chart understanding benchmarks have focused almost exclusively on single-chart tasks: reading a value, answering a factual question, or producing a caption for one visualization. Comparative reasoning — identifying what differs between two charts — is a more demanding cognitive task and one that arises frequently in analytical work, from comparing quarterly performance across periods to examining differences between experimental conditions.
No prior benchmark at this scale had formalized the cross-chart comparison task. ChartDiff provides a structured way to measure this capability directly, which existing single-chart benchmarks cannot do by design.
Technical Details
The 8,541 chart pairs cover multiple chart types and visual encodings across diverse data domains. Summaries target three specific categories of inter-chart difference: trends (directional patterns over a data range), fluctuations (variance or volatility), and anomalies (outliers or unexpected deviations). Annotation proceeded in two stages — LLM generation followed by human verification — to balance scale with quality control.
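The paper names these three difference categories but does not publish a record layout; a hypothetical annotation record for one chart pair might look like the following sketch (the class and field names are illustrative, not the actual ChartDiff schema):

```python
# Hypothetical annotation record for one chart pair. Field names are
# illustrative only; the paper specifies the three difference categories
# (trends, fluctuations, anomalies) but not a concrete schema.
from dataclasses import dataclass

@dataclass
class ChartPairAnnotation:
    pair_id: str
    trend_diff: str        # directional patterns over a data range
    fluctuation_diff: str  # variance or volatility
    anomaly_diff: str      # outliers or unexpected deviations
    human_verified: bool = False  # flipped to True after reviewer sign-off

ann = ChartPairAnnotation(
    pair_id="pair-0001",
    trend_diff="Chart B trends upward where Chart A stays flat.",
    fluctuation_diff="Chart A shows higher month-to-month volatility.",
    anomaly_diff="Only Chart B contains a single-point spike.",
)
print(ann.human_verified)  # False until the second annotation stage
```

The `human_verified` flag mirrors the two-stage process described above: LLM-generated records enter unverified and become evaluation references only after human review.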
Ye evaluated three categories of model: general-purpose frontier vision-language models, chart-specialized models, and pipeline-based systems that decompose the task into subtasks such as chart-to-data extraction followed by text generation. General-purpose frontier models achieved the highest scores under GPT-based quality evaluation, in which a language model judges the coherence and accuracy of generated summaries. Chart-specialized and pipeline-based models scored higher on ROUGE — a metric measuring word overlap with reference summaries — but lower on the human-aligned evaluation.
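As a rough illustration of the pipeline-based decomposition described above, here is a toy two-stage sketch in which both stages are stubbed with canned data (the function names and file paths are hypothetical, not the systems the paper evaluated):

```python
# Toy sketch of a pipeline-based system: stage 1 extracts tabular data
# from each chart, stage 2 generates a comparative summary from the
# extracted tables. Real systems would use a chart-parsing model and a
# language model; here both stages are stand-ins with canned values.
from typing import Dict, List

def extract_data(chart_image: str) -> Dict[str, List[float]]:
    # Stand-in for chart-to-data extraction: returns canned series
    # keyed by a fake image path instead of parsing pixels.
    canned = {
        "chart_a.png": {"sales": [1.0, 1.1, 1.2]},
        "chart_b.png": {"sales": [1.0, 1.8, 2.9]},
    }
    return canned[chart_image]

def summarize_difference(a: Dict[str, List[float]],
                         b: Dict[str, List[float]]) -> str:
    # Stand-in for text generation: compares end-to-end growth per series.
    def growth(series: List[float]) -> float:
        return series[-1] - series[0]
    parts = []
    for key in a:
        faster = "second" if growth(b[key]) > growth(a[key]) else "first"
        parts.append(f"The {faster} chart's {key} series grows faster.")
    return " ".join(parts)

summary = summarize_difference(extract_data("chart_a.png"),
                               extract_data("chart_b.png"))
print(summary)
```

The decomposition makes each stage separately testable, but errors in extraction propagate into generation, which is one reason such pipelines can lag behind end-to-end models on human-aligned quality.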
The paper identifies “a clear mismatch between lexical overlap and actual summary quality” as a central finding. ROUGE scores, taken alone, can favor models that reproduce surface-level phrasing without producing summaries that human reviewers judge as accurate or useful. Multi-series charts were the hardest subtype across all tested model families. Notably, strong end-to-end models showed relative robustness to variation in plotting libraries, suggesting the difficulty lies in tracking multiple concurrent data series rather than in parsing any particular visual encoding syntax.
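To see how this lexical-overlap mismatch can arise, consider a toy ROUGE-1 computation on invented summaries (not drawn from the dataset): a factually inverted copy of the reference, which merely swaps which chart is which, outscores an accurate paraphrase.

```python
# Illustrative only: a minimal ROUGE-1 F1 on hypothetical summaries,
# showing that unigram overlap can favor surface phrasing over meaning.
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between candidate and reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "chart B shows a sharper upward trend than chart A after 2020"

# Reuses the reference wording but inverts the comparison: factually wrong.
surface_copy = "chart A shows a sharper upward trend than chart B after 2020"
# Accurate meaning, different wording.
paraphrase = "the second chart rises more steeply than the first from 2020 onward"

print(rouge1_f1(surface_copy, reference) > rouge1_f1(paraphrase, reference))  # True
```

The inverted copy scores a perfect 1.0 (identical word multiset) while the correct paraphrase scores far lower, which is exactly the failure mode a human-aligned judge would catch and a pure overlap metric cannot.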
Who’s Affected
Vision-language model researchers working on chart or document understanding have a new evaluation resource that directly measures comparative reasoning rather than single-chart perception. Developers building AI tools for data analysis — automated report summarization, business intelligence dashboard comparison, financial chart monitoring — face the exact reasoning task ChartDiff formalizes.
The benchmark’s findings on evaluation metrics also affect teams selecting models for production summarization tasks. Organizations relying on ROUGE to compare candidate models may be selecting against human preference: the benchmark demonstrates that higher ROUGE does not reliably correspond to higher human-judged summary quality for this task.
What’s Next
The ChartDiff dataset is available through arXiv, and the authors position it as a standing benchmark for multi-chart research. As the paper states, “comparative chart reasoning remains a significant challenge for current vision-language models,” with no tested system achieving strong, consistent performance across all chart types and evaluation criteria.
The annotation methodology — LLM-generated summaries reviewed by human annotators — is a pragmatic approach to building a dataset at this scale, but raises open questions about annotation consistency and whether human verification systematically catches LLM errors in chart-difference description. These are areas subsequent work building on ChartDiff will need to examine. Author details beyond Rongtian Ye were not available in the arXiv submission at time of publication.