Independent reference. Not affiliated with OpenAI, Anthropic, Google DeepMind, Meta, Mistral, xAI, Papers With Code, HuggingFace, Langfuse, LangSmith, Braintrust, Arize, Humanloop, or HoneyHive. Scores cited with source and capture date. Affiliate disclosure.
§ BENCHMARKING AGENTS
Last verified April 2026 - 42 sources
[Chart: SOTA progression (0-100%) on MMLU, SWE-bench Verified, GPQA-Diamond, and ARC-AGI (dashed), quarterly from Q1 2023 to Q1 2026]

SOTA progression 2023-Q1 2026. Sources: Papers With Code, HuggingFace Open LLM Leaderboard v2, official leaderboards. Captured April 2026.

AI Benchmarking 2026 - Measure Models, Agents, and Your Own Evals

A reference for how AI is measured in 2026, and how to measure your own agents. Every benchmark score on this site carries a capture date, N-shot setting, CoT flag, and a link to the primary source. No vendor affiliation. No affiliate-gated pages.

4 leaderboards indexed · 17 benchmarks explained · 7 eval tools compared · 0 affiliate-gated pages
§ 01

Three Layers of AI Measurement

§ 01

Public Model Benchmarks

The leaderboards everyone quotes: MMLU, GPQA-Diamond, HumanEval, ARC-AGI. We cover 17 benchmarks with current 2026 SOTA scores, saturation dates, and contamination notes.

MMLU-Pro · GPQA-Diamond · ARC-AGI-2 · HumanEval
§ 02

Agent Benchmarks

The newer, less-settled category. SWE-bench Verified, WebArena, AgentBench, Terminal-Bench, OSWorld, Tau-Bench. Scores are still moving fast and the methodology itself is disputed.

SWE-bench Verified · WebArena · OSWorld · Tau-Bench
§ 03

Your Own Evals

How teams actually measure their workflows. Braintrust, Langfuse, LangSmith, Arize, Humanloop, HoneyHive, DeepEval. A neutral, honest comparison that no vendor can publish.

Braintrust · Langfuse · LangSmith · Arize Phoenix
§ 02

Frontier Snapshot - April 2026

Top 8 frontier models across 6 canonical benchmarks. Best-in-column marked with an asterisk. Click any model name to see the full benchmark page.

Model            | MMLU-Pro | HumanEval | GPQA-Dia | SWE-bench | ARC-AGI-2 | Arena Elo
Claude 4.5 Opus  | 85.2%    | 97.4%     | 76.3%    | 74.5% *   | 61.2%     | 1389
GPT-5            | 86.1% *  | 98.1% *   | 78.4% *  | 71.3%     | 65.8% *   | 1401 *
Gemini 2.5 Pro   | 83.7%    | 96.8%     | 74.1%    | 68.9%     | 59.4%     | 1371
Claude 4 Sonnet  | 81.4%    | 95.3%     | 71.2%    | 64.6%     | 54.7%     | 1342
Llama 4 Maverick | 79.8%    | 93.7%     | 66.3%    | 58.2%     | 48.1%     | 1318
Grok 4           | 82.3%    | 96.1%     | 72.8%    | 60.4%     | 52.6%     | 1355
Mistral Large 3  | 76.4%    | 91.2%     | 61.7%    | 49.3%     | 41.2%     | 1287
DeepSeek V3      | 78.9%    | 92.8%     | 64.5%    | 53.1%     | 44.8%     | 1301
Captured April 2026. Sources: vendor model cards, HuggingFace Open LLM Leaderboard v2, Papers With Code, LMSYS Chatbot Arena. HumanEval pass@1 0-shot. MMLU-Pro 5-shot CoT. GPQA-Diamond 0-shot CoT. SWE-bench Verified end-to-end. ARC-AGI-2 official leaderboard. See /what-these-benchmarks-miss for contamination notes.
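The HumanEval footnote above reports pass@1. When a score is estimated from n samples per problem rather than a single greedy completion, the usual tool is the unbiased pass@k estimator from the original HumanEval paper; here is a minimal sketch (illustrative only, not the exact harness any vendor runs):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem:
    n = samples generated, c = samples that pass all tests, k = sample budget."""
    if n - c < k:
        return 1.0  # too few failures left for any k-subset to miss
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over a benchmark; results is a list of (n, c) per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

# Example: 3 problems, 10 samples each, with 7, 2, and 0 passing samples.
print(benchmark_pass_at_k([(10, 7), (10, 2), (10, 0)], k=1))  # -> 0.3
```

For k=1 the estimator reduces to c/n, so pass@1 over a single greedy sample is just the raw solve rate; the gap between pass@1 and best-of-n numbers is exactly the methodology difference flagged below.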
§ 04 - EDITORIAL

What Most Listicles Miss

The benchmark landscape in 2026 has three systemic problems that most coverage never mentions. First, contamination: MMLU test questions have been found verbatim in Common Crawl; HumanEval problems are near-duplicates of LeetCode solutions that appear in pre-training data. A 94% MMLU score might reflect memorisation as much as reasoning.
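A common first-pass contamination screen is verbatim n-gram overlap between benchmark items and pre-training text. Below is a toy sketch of the idea, using 13-gram word matching (a convention used in some model reports); real audits run over terabyte-scale corpora with normalisation and fuzzy matching:

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Lower-cased word n-grams of a string."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(benchmark_item: str, corpus_docs: list[str], n: int = 13) -> bool:
    """Flag a benchmark item if any of its n-grams appears verbatim in the corpus."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return False  # item shorter than n tokens; needs a different check
    return any(item_grams & ngrams(doc, n) for doc in corpus_docs)
```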

Second, saturation: MMLU, HumanEval, and MBPP no longer discriminate between frontier models. The field has moved to MMLU-Pro, GPQA-Diamond, ARC-AGI-2, and Humanity's Last Exam - but most comparison sites still quote the saturated versions because the scores are higher and more familiar.

Third, methodology opacity: "best-of-16 with CoT and tool use" is not comparable to "greedy 0-shot." Scores without methodology footnotes are unfalsifiable claims. Every table on this site documents the evaluation setup.
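Concretely, documenting the setup means keeping the prompting and decoding knobs attached to the number itself. A minimal sketch of the kind of record this site attaches to each score (field names are illustrative, not a formal schema):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BenchmarkScore:
    """A score is only comparable together with how it was obtained."""
    model: str
    benchmark: str          # e.g. "GPQA-Diamond"
    score: float            # 0.0-1.0
    n_shot: int             # few-shot examples in the prompt
    chain_of_thought: bool  # CoT prompting on or off
    samples: int            # 1 = greedy/single sample, 16 = best-of-16, etc.
    tool_use: bool          # whether tools or code execution were allowed
    captured: str           # ISO date the number was recorded
    source_url: str         # primary source (official leaderboard or model card)

greedy = BenchmarkScore("model-a", "GPQA-Diamond", 0.71, 0, True, 1, False,
                        "2026-04-01", "https://example.org/leaderboard")
best_of_16 = BenchmarkScore("model-a", "GPQA-Diamond", 0.78, 0, True, 16, True,
                            "2026-04-01", "https://example.org/model-card")
# Same model, same benchmark, different setups: the two scores are not comparable.
```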

§ 05

Eval Tools - Quick View

Tool          | Type        | Best for                             | Free tier
Braintrust    | Cloud       | CI integration, developer experience | Yes
Langfuse      | OSS + Cloud | Self-hosting, cost-conscious teams   | Generous
LangSmith     | Cloud       | LangChain users                      | Limited
Arize Phoenix | OSS + Cloud | Production monitoring, tracing       | Yes (OSS)
§ 06

Frequently Asked Questions

What is AI benchmarking?
AI benchmarking is the practice of measuring model or agent performance on standardised test sets to enable objective comparison. Benchmarks range from knowledge tests (MMLU, GPQA) to coding tasks (HumanEval, SWE-bench) to agentic challenges (WebArena, OSWorld). Every benchmark score should be read as a claim with a specific methodology, not a universal fact.
Which benchmarks matter in 2026?
For frontier model comparisons: MMLU-Pro (not plain MMLU, which is saturated), GPQA-Diamond, ARC-AGI-2, and Humanity's Last Exam. For coding: LiveCodeBench and SWE-bench Verified. For agentic capability: SWE-bench Verified, WebArena, and Terminal-Bench. For human preference: LMSYS Chatbot Arena.
Are benchmark scores reliable?
Benchmark scores are useful but require critical reading. Key questions: When was this captured? What N-shot and CoT settings? Is the score from the official leaderboard or a vendor model card? Is the test set public (contamination risk)? MMLU, HumanEval, and MBPP have documented training-data overlap issues.
What is the difference between an eval and a benchmark?
A benchmark is a standardised public test set used to compare models across the field. An eval is any measurement of model quality - it may use a public benchmark, a custom golden dataset, LLM-as-judge scoring, or human annotation. Public benchmarks are one kind of eval; custom evals are the other kind, built for specific workflows.
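A custom eval can be as small as a golden dataset plus a scoring function, optionally followed by LLM-as-judge for open-ended outputs. A vendor-neutral sketch, where `call_model` is a placeholder for whatever client or framework you use:

```python
from typing import Callable

# Golden dataset: inputs paired with expected answers for your own workflow.
GOLDEN = [
    {"input": "Refund policy for damaged items?", "expected": "30 days"},
    {"input": "Which plan includes SSO?", "expected": "Enterprise"},
]

def exact_match(output: str, expected: str) -> float:
    """Simplest scorer: does the expected answer appear in the output?"""
    return 1.0 if expected.lower() in output.lower() else 0.0

def run_eval(call_model: Callable[[str], str],
             scorer: Callable[[str, str], float] = exact_match) -> float:
    """Run every golden example through the model and average the scores."""
    scores = [scorer(call_model(ex["input"]), ex["expected"]) for ex in GOLDEN]
    return sum(scores) / len(scores)

# Usage: plug in any model client, e.g.
#   run_eval(lambda prompt: my_client.complete(prompt))
```

Swap exact_match for an LLM-as-judge scorer once expected answers stop being single strings; the structure of the loop stays the same.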
Should I trust vendor-published benchmark scores?
Treat vendor-published scores as a starting point, not a final answer. Model cards are produced by the same company that built the model. Methodology details are often omitted or buried. Independent replications on Papers With Code or HuggingFace's Open LLM Leaderboard v2 are more reliable, though not immune to issues.