
Benchmark Leaderboards

Compare AI model performance across industry-standard benchmarks.

Arena Elo

Crowdsourced ranking where users compare model outputs head-to-head. Higher Elo indicates stronger overall performance across diverse tasks.

1. Gemini 3.1 Pro: 1,500
2. Claude Opus 4.6: 1,496
3. Gemini 3 Pro: 1,486
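
Arena-style ratings are derived from these pairwise votes; some leaderboards apply an online Elo update per vote, others fit a Bradley-Terry model over all votes. Below is a minimal sketch of the online Elo update, with an illustrative K-factor of 32 rather than this leaderboard's actual parameters:

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Return both models' updated ratings after one head-to-head vote."""
    delta = k * ((1.0 if a_won else 0.0) - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta


# Illustrative example: a 1,500-rated model loses a vote to a 1,486-rated model.
print(elo_update(1500.0, 1486.0, a_won=False))
```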

MMLU

Tests knowledge across 57 academic subjects including STEM, humanities, and social sciences. Measures breadth of world knowledge.

1. Gemini 3 Pro: 93.1%
2. GPT-5.3 Codex: 93%
3. GPT-5.2: 92.8%
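
Because MMLU is split into 57 subject-level tasks, the headline score also depends on how per-subject accuracies are aggregated: pooling all questions versus averaging subjects equally. A minimal sketch of the two conventions, using hypothetical per-subject counts rather than real leaderboard data:

```python
# Hypothetical (questions, correct) counts for a few of the 57 subjects.
subjects = {
    "college_physics": (102, 95),
    "moral_scenarios": (895, 760),
    "high_school_statistics": (216, 201),
}

# Micro average: pool every question, so large subjects weigh more.
total_q = sum(n for n, _ in subjects.values())
total_c = sum(c for _, c in subjects.values())
micro = total_c / total_q

# Macro average: mean of per-subject accuracies, so subjects weigh equally.
macro = sum(c / n for n, c in subjects.values()) / len(subjects)

print(f"micro={micro:.3f}  macro={macro:.3f}")
```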

GPQA

Expert-written, graduate-level questions in biology, physics, and chemistry, designed to be "Google-proof": even skilled researchers struggle to answer them with full internet access.

1. Gemini 3.1 Pro: 94.3%
2. Gemini 3 Pro: 91.9%
3. Claude Opus 4.6: 91.3%

MATH

Competition-level mathematics problems spanning algebra, geometry, number theory, and precalculus. Tests multi-step mathematical reasoning.

1. GPT-5.2: 98%
2. Claude Sonnet 4.6: 97.8%
3. Claude Opus 4.6: 97.6%
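
MATH is typically graded by exact match against a reference final answer, so normalizing the model's answer string (unwrapping \boxed{}, stripping spacing) is part of the evaluation. The sketch below is deliberately crude; real harnesses use far more thorough LaTeX equivalence checks:

```python
import re


def normalize_answer(ans: str) -> str:
    """Crude normalization of a final-answer string for exact-match grading."""
    ans = ans.strip()
    boxed = re.search(r"\\boxed\{(.*)\}", ans)  # unwrap \boxed{...} if present
    if boxed:
        ans = boxed.group(1)
    return ans.replace(" ", "").rstrip(".")


def is_correct(model_answer: str, reference: str) -> bool:
    return normalize_answer(model_answer) == normalize_answer(reference)


# Illustrative check: formatting differences should not count as errors.
print(is_correct(r"\boxed{\frac{1}{2}}", r"\frac{1}{2}"))  # True
```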

HumanEval

Evaluates ability to generate correct Python functions from docstrings. Measures programming skill and code synthesis.

1. Claude Opus 4.6: 97%
2. GPT-5.2: 97%
3. DeepSeek R1: 96.1%
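
HumanEval results are usually reported as pass@k: the probability that at least one of k sampled completions passes the problem's unit tests. The standard unbiased estimator (from the paper that introduced HumanEval) computes this from n samples of which c pass; a minimal sketch:

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k failing samples, so any k-sample draw contains a pass
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


# Illustrative example: 200 samples for one problem, 140 of them pass the tests.
print(pass_at_k(n=200, c=140, k=1))   # 0.70
print(pass_at_k(n=200, c=140, k=10))
```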

SWE-bench

Tests ability to resolve real GitHub issues from popular open-source projects. Measures practical software engineering capability.

1. Gemini 3.1 Pro: 80.6%
2. GPT-5.3 Codex: 80%
3. Claude Opus 4.6: 72.5%
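
In SWE-bench, an issue counts as resolved only if the model's patch makes the issue's previously failing tests (FAIL_TO_PASS) pass while the previously passing tests (PASS_TO_PASS) stay green. A minimal sketch of that final check, assuming the harness has already applied the patch and collected per-test results; the function and test names here are illustrative:

```python
def is_resolved(results: dict[str, bool],
                fail_to_pass: list[str],
                pass_to_pass: list[str]) -> bool:
    """True if every FAIL_TO_PASS test now passes and no PASS_TO_PASS test broke."""
    return (all(results.get(t, False) for t in fail_to_pass)
            and all(results.get(t, False) for t in pass_to_pass))


# Illustrative instance: two tests must flip to passing, one must stay passing.
results = {"test_issue_case": True, "test_issue_edge": True, "test_existing": True}
print(is_resolved(results, ["test_issue_case", "test_issue_edge"], ["test_existing"]))
```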

AIME

Competition-level math problems from the American Invitational Mathematics Examination, part of the AMC pipeline. Tests advanced multi-step problem solving; every answer is a single integer, so grading is exact match. A strong differentiator among frontier models.

1. o3: 96.7%
2. GPT-5.3 Codex: 94%
3. o4-mini: 93.4%

SimpleQA

Short-form factual questions with verifiable answers. Measures factual accuracy and resistance to hallucination. Lower scores are common even for frontier models.

1. Gemini 3.1 Pro: 79.6%
2. GPT-5.3 Codex: 58%
3. GPT-5.2: 52.5%
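
Graders for SimpleQA-style benchmarks typically label each response as correct, incorrect, or not attempted, so the headline number depends on whether abstentions count against the model. A minimal sketch of two common summaries, using hypothetical grades:

```python
from collections import Counter

# Hypothetical grader labels for a batch of responses.
grades = ["correct", "incorrect", "not_attempted", "correct", "incorrect"]

counts = Counter(grades)
attempted = counts["correct"] + counts["incorrect"]

overall = counts["correct"] / len(grades)        # abstentions count as misses
given_attempted = counts["correct"] / attempted  # abstentions excluded

print(f"correct: {overall:.2f}  correct-given-attempted: {given_attempted:.2f}")
```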