ANALYSIS

ScoringBench Ranks Tabular AI Models on Full Distribution Accuracy

Elena Volkov · Apr 1, 2026 · Updated Apr 7, 2026 · 4 min read
Engine Score 3/10 — Logged

A tabular foundation model scoring benchmark is specialized academic work.


A benchmark submitted to arXiv on March 31, 2026 by Jonas Landsgesell and Pascal Knoll measures tabular foundation models on how accurately they predict full probability distributions — not just single-point estimates — exposing a systematic blind spot in how models like TabPFN and TabICL have been evaluated.

  • TabPFN and TabICL already output full predictive distributions, but are routinely assessed only on RMSE and R², which collapse those distributions into a single number.
  • ScoringBench introduces six proper scoring rules — CRPS, CRLS, Interval Score, Energy Score, weighted CRPS, and Brier Score — as evaluation criteria alongside standard point metrics.
  • Model rankings shift depending on which scoring rule is applied; the authors found no single pretraining objective is universally optimal across all rules.
  • The benchmark includes a live leaderboard maintained via git pull requests, designed for transparency and community contribution.

What Happened

Jonas Landsgesell and Pascal Knoll submitted ScoringBench: A Benchmark for Evaluating Tabular Foundation Models with Proper Scoring Rules to arXiv (arXiv:2603.29928) on March 31, 2026. The paper introduces an open benchmark that evaluates tabular foundation models using proper scoring rules — statistical measures that assess the quality of a full predicted distribution, not just its central tendency.

The models under evaluation, TabPFN and TabICL, already generate complete predictive distributions as standard outputs. The benchmark’s core argument is that evaluating these models on RMSE and R² alone discards the distributional information those models actually produce, measuring only a point estimate derived from a richer output.

The code and a live leaderboard are linked in the paper, with submissions accepted via git pull requests.

Why It Matters

In finance and clinical research — the two application domains the paper singles out — asymmetric risk profiles mean that tail outcomes carry disproportionate weight relative to their frequency. A patient survival model that performs well on average while underestimating risk in the worst-case percentiles would look adequate under RMSE while failing exactly where accuracy matters most.

Proper scoring rules are designed to penalize miscalibrated or overconfident distributions across the full range of outcomes. The CRPS (Continuous Ranked Probability Score), for instance, integrates the squared difference between the predicted cumulative distribution function and the step function at the observed outcome across all thresholds, a fundamentally different signal than mean squared error. Standard benchmarks have not routinely included these measures for tabular regression tasks.
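As an illustration (not code from the paper), the CRPS can also be estimated directly from samples of a predictive distribution via its energy form, CRPS = E|X - y| - 0.5 E|X - X'|, where X and X' are independent draws from the forecast:

```python
import numpy as np

def crps_from_samples(samples, y):
    """Sample-based CRPS estimate using the energy form
    CRPS = E|X - y| - 0.5 * E|X - X'|, where X and X' are
    independent draws from the predictive distribution."""
    x = np.asarray(samples, dtype=float)
    term1 = np.mean(np.abs(x - y))
    term2 = 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))
    return term1 - term2

# A sharp, well-centered forecast scores lower (better) than a diffuse one
tight = crps_from_samples([-0.1, 0.0, 0.1], y=0.0)
wide = crps_from_samples([-1.0, 0.0, 1.0], y=0.0)
```

Note that for a degenerate (point) forecast the CRPS collapses to the absolute error, so it strictly generalizes a point metric rather than replacing it.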

Technical Details

ScoringBench computes six proper scoring rules: CRPS, CRLS (Continuous Ranked Log Score), Interval Score, Energy Score, weighted CRPS, and Brier Score, alongside RMSE and R² for direct comparison. The evaluation covers three model configurations: realTabPFNv2.5 fine-tuned with different scoring rule objectives, TabICL, and untuned realTabPFNv2.5 as a baseline, run across a suite of regression benchmarks.

The central empirical finding, stated directly in the paper, is that “model rankings depend on the chosen scoring rule and that no single pretraining objective is universally optimal.” A model ranking first under CRPS may rank lower under Interval Score, meaning the apparent best-performing model changes based on which aspect of distributional accuracy is prioritized.

The authors further conclude that “for applications sensitive to extreme events the choice of evaluation metric is as much a domain specific requirement as the data itself” — framing metric selection as a design decision tied to the deployment context, not a universal default.

The paper notes that fine-tuning realTabPFNv2.5 with different scoring rule objectives produces models with different ranking profiles, demonstrating that pretraining choices have measurable, scoring-rule-dependent effects on distributional accuracy.

Who’s Affected

Researchers and practitioners using TabPFN or TabICL for tabular regression in finance, healthcare, and insurance are most directly affected. Model rankings established under RMSE-only evaluation may not hold when proper scoring rules are applied, potentially changing which model is selected for deployment in risk-sensitive contexts.

Developers fine-tuning tabular foundation models on custom objectives now have a public, versioned leaderboard to benchmark against. Because the leaderboard is maintained through git pull requests, every submission is traceable and reproducible — a design choice the authors describe as enabling “transparency, traceability, agility, and reproducibility.”

What’s Next

ScoringBench is publicly available with both the benchmark code and the live leaderboard linked in the arXiv paper. The leaderboard is structured to accept new model submissions from the community as tabular foundation models continue to develop.

The paper does not propose a method for selecting among the six scoring rules for a given application, framing that choice as inherently domain-specific. How practitioners should combine or prioritize scoring objectives during fine-tuning — particularly for multi-objective scenarios — remains an open question the benchmark surfaces but does not resolve.

Author details for Jonas Landsgesell and Pascal Knoll beyond their names were not available at time of publication.
