ANALYSIS ScoringBench Ranks Tabular AI Models on Full Distribution Accuracy 3/10 4 min read 1 month ago
ANALYSIS Epistemic Uncertainty Proposed as Routing Signal for Cheaper, More Reliable AI Explanations 3/10 4 min read 1 month ago
ANALYSIS ATP-Bench: Researchers Benchmark 10 MLLMs on Agentic Tool Planning 3/10 4 min read 1 month ago
ANALYSIS ShapE-GRPO Uses Shapley Values to Fix GRPO Free-Rider Problem in LLM Training 3/10 4 min read 1 month ago
ANALYSIS Dual-Capability Bottleneck in Chess AI Formalized, Model Hits Lichess 2570 3/10 4 min read 1 month ago
ANALYSIS CausalPulse Multi-Agent Copilot Achieves 98.7% Success at Bosch Plant 4/10 4 min read 1 month ago
ANALYSIS LLM Use Boosts Output but Degrades Metacognitive Accuracy, Paper Argues 4/10 4 min read 1 month ago
ANALYSIS ELT-Bench-Verified: Benchmark Flaws Were Masking AI Agent Performance 4/10 4 min read 1 month ago
ANALYSIS BenchScope: AI Benchmarks Show 20x Variance in Independent Signal 3/10 4 min read 1 month ago
ANALYSIS Nomad System Uses Exploration Maps to Surface Insights Without User Queries 4/10 4 min read 1 month ago
ANALYSIS PSPA-Bench: New Benchmark Exposes Personalization Gap in Smartphone GUI Agents 3/10 4 min read 1 month ago