
Benchmark Leaderboards

Compare AI model performance across industry-standard benchmarks.

Arena Elo

Crowdsourced ranking where users compare model outputs head-to-head. Higher Elo indicates stronger overall performance across diverse tasks.

1. Gemini 3.1 Pro: 1,500
2. Claude Opus 4.6: 1,496
3. Gemini 3 Pro: 1,486
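
Arena-style ratings are derived from these pairwise votes; some leaderboards apply an online Elo update per vote, others fit a Bradley-Terry model over all votes. Below is a minimal sketch of the online Elo update, with an illustrative K-factor of 32 rather than this leaderboard's actual parameters:

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Return both models' updated ratings after one head-to-head vote."""
    delta = k * ((1.0 if a_won else 0.0) - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta


# Illustrative example: a 1,500-rated model loses a vote to a 1,486-rated model.
print(elo_update(1500.0, 1486.0, a_won=False))
```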

MMLU

Tests knowledge across 57 academic subjects including STEM, humanities, and social sciences. Measures breadth of world knowledge.

1. Gemini 3 Pro: 93.1%
2. GPT-5.3 Codex: 93%
3. GPT-5.2: 92.8%
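
Because MMLU is split into 57 subject-level tasks, the headline score also depends on how per-subject accuracies are aggregated: pooling all questions versus averaging subjects equally. A minimal sketch of the two conventions, using hypothetical per-subject counts rather than real leaderboard data:

```python
# Hypothetical (questions, correct) counts for a few of the 57 subjects.
subjects = {
    "college_physics": (102, 95),
    "moral_scenarios": (895, 760),
    "high_school_statistics": (216, 201),
}

# Micro average: pool every question, so large subjects weigh more.
total_q = sum(n for n, _ in subjects.values())
total_c = sum(c for _, c in subjects.values())
micro = total_c / total_q

# Macro average: mean of per-subject accuracies, so subjects weigh equally.
macro = sum(c / n for n, c in subjects.values()) / len(subjects)

print(f"micro={micro:.3f}  macro={macro:.3f}")
```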

GPQA

Expert-written, graduate-level questions in biology, physics, and chemistry, designed to be "Google-proof": even skilled researchers struggle to answer them with full internet access.

1. Gemini 3.1 Pro: 94.3%
2. Gemini 3 Pro: 91.9%
3. Claude Opus 4.6: 91.3%

MATH

Competition-level mathematics problems spanning algebra, geometry, number theory, and precalculus. Tests multi-step mathematical reasoning.

1. GPT-5.2: 98%
2. Claude Sonnet 4.6: 97.8%
3. Claude Opus 4.6: 97.6%
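
MATH is typically graded by exact match against a reference final answer, so normalizing the model's answer string (unwrapping \boxed{}, stripping spacing) is part of the evaluation. The sketch below is deliberately crude; real harnesses use far more thorough LaTeX equivalence checks:

```python
import re


def normalize_answer(ans: str) -> str:
    """Crude normalization of a final-answer string for exact-match grading."""
    ans = ans.strip()
    boxed = re.search(r"\\boxed\{(.*)\}", ans)  # unwrap \boxed{...} if present
    if boxed:
        ans = boxed.group(1)
    return ans.replace(" ", "").rstrip(".")


def is_correct(model_answer: str, reference: str) -> bool:
    return normalize_answer(model_answer) == normalize_answer(reference)


# Illustrative check: formatting differences should not count as errors.
print(is_correct(r"\boxed{\frac{1}{2}}", r"\frac{1}{2}"))  # True
```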

HumanEval

Evaluates ability to generate correct Python functions from docstrings. Measures programming skill and code synthesis.

1. Claude Opus 4.6: 97%
2. GPT-5.2: 97%
3. DeepSeek R1: 96.1%
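
HumanEval results are usually reported as pass@k: the probability that at least one of k sampled completions passes the problem's unit tests. The standard unbiased estimator (from the paper that introduced HumanEval) computes this from n samples of which c pass; a minimal sketch:

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k failing samples, so any k-sample draw contains a pass
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


# Illustrative example: 200 samples for one problem, 140 of them pass the tests.
print(pass_at_k(n=200, c=140, k=1))   # 0.70
print(pass_at_k(n=200, c=140, k=10))
```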

SWE-bench

Tests ability to resolve real GitHub issues from popular open-source projects. Measures practical software engineering capability.

1. Gemini 3.1 Pro: 80.6%
2. GPT-5.3 Codex: 80%
3. Claude Opus 4.6: 72.5%
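
In SWE-bench, an issue counts as resolved only if the model's patch makes the issue's previously failing tests (FAIL_TO_PASS) pass while the previously passing tests (PASS_TO_PASS) stay green. A minimal sketch of that final check, assuming the harness has already applied the patch and collected per-test results; the function and test names here are illustrative:

```python
def is_resolved(results: dict[str, bool],
                fail_to_pass: list[str],
                pass_to_pass: list[str]) -> bool:
    """True if every FAIL_TO_PASS test now passes and no PASS_TO_PASS test broke."""
    return (all(results.get(t, False) for t in fail_to_pass)
            and all(results.get(t, False) for t in pass_to_pass))


# Illustrative instance: two tests must flip to passing, one must stay passing.
results = {"test_issue_case": True, "test_issue_edge": True, "test_existing": True}
print(is_resolved(results, ["test_issue_case", "test_issue_edge"], ["test_existing"]))
```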

AIME

Competition-level math problems from the American Invitational Mathematics Examination, part of the AMC pipeline. Tests advanced multi-step problem solving; every answer is a single integer, so grading is exact match. A strong differentiator among frontier models.

1. o3: 96.7%
2. GPT-5.3 Codex: 94%
3. o4-mini: 93.4%

SimpleQA

Short-form factual questions with verifiable answers. Measures factual accuracy and resistance to hallucination. Lower scores are common even for frontier models.

1. Gemini 3.1 Pro: 79.6%
2. GPT-5.3 Codex: 58%
3. GPT-5.2: 52.5%
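
Graders for SimpleQA-style benchmarks typically label each response as correct, incorrect, or not attempted, so the headline number depends on whether abstentions count against the model. A minimal sketch of two common summaries, using hypothetical grades:

```python
from collections import Counter

# Hypothetical grader labels for a batch of responses.
grades = ["correct", "incorrect", "not_attempted", "correct", "incorrect"]

counts = Counter(grades)
attempted = counts["correct"] + counts["incorrect"]

overall = counts["correct"] / len(grades)        # abstentions count as misses
given_attempted = counts["correct"] / attempted  # abstentions excluded

print(f"correct: {overall:.2f}  correct-given-attempted: {given_attempted:.2f}")
```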