AI benchmarks

Articles tagged with AI benchmarks

12 articles

Rating scale: Critical (9-10), Important (7-8), Notable (5-6), Logged (1-4)

BullshitBench Results Show Anthropic Claude Models Dominate Top Seven Spots in Nonsense Detection Rankings

7/10 4 min read 2 months ago

NIST AI 800-3 Introduces Statistical Models to Fix AI Benchmark Gaps

8/10 4 min read 2 months ago

Function Calling Harness Pushes Qwen From 6.75 Percent to 100 Percent Success on Complex Schemas

8/10 4 min read 2 months ago

Liquid AI Runs 24-Billion-Parameter Model at 50 Tokens Per Second in a Web Browser

7/10 2 min read 2 months ago

Open-Source ATLAS System on a $500 GPU Outperforms Claude Sonnet on Coding Benchmarks

7/10 2 min read 2 months ago

Claude Appears as Third Top Contributor on OpenAI Repository, Sparking Viral Debate

7/10 2 min read 2 months ago

Study Finds Hundreds of AI Benchmark Tests Are Fundamentally Flawed

8/10 4 min read 2 months ago

iPhone 17 Pro Demonstrated Running a 400-Billion-Parameter LLM with 12GB of RAM

7/10 2 min read 2 months ago

SWE-Rebench February Update: GPT-5.4 and Qwen3.5 Lead on Decontaminated Coding Tasks

8/10 4 min read 2 months ago

Mystery AI Model Suspected as DeepSeek V4 Revealed as Xiaomi’s 1-Trillion-Parameter MiMo-V2-Pro

7/10 2 min read 2 months ago

Benchmarks Show Vulkan Outperforms ROCm 7 for Short-Context LLM Inference on AMD MI50

7/10 4 min read 2 months ago

Humanity’s Last Exam Exposes the Gap Between AI Hype and Expert-Level Knowledge

8/10 2 min read 2 months ago
