Compare AI model performance across industry-standard benchmarks.
Crowdsourced ranking in which users compare model outputs head-to-head and vote for the better response. A higher Elo rating indicates stronger overall performance across diverse tasks.
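As a rough illustration of how head-to-head votes become ratings, the sketch below applies the standard Elo update after a single vote. The starting ratings and the K-factor of 32 are assumptions for illustration; the actual leaderboard fits ratings over all votes with a more sophisticated statistical model.

    def expected_score(r_a, r_b):
        # Probability that model A beats model B under the Elo model.
        return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

    def update_elo(r_a, r_b, a_won, k=32.0):
        # Update both ratings after one head-to-head vote.
        e_a = expected_score(r_a, r_b)
        s_a = 1.0 if a_won else 0.0
        return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

    # Two models start at 1000 (assumed); model A wins one comparison.
    print(update_elo(1000, 1000, a_won=True))  # (1016.0, 984.0)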
Tests knowledge across 57 academic subjects including STEM, humanities, and social sciences. Measures breadth of world knowledge.
Expert-crafted, graduate-level questions in biology, physics, and chemistry designed to be "Google-proof": skilled non-experts answer them poorly even with unrestricted web access, and PhD-level domain experts still find them difficult.
Competition-level mathematics problems spanning algebra, geometry, number theory, and precalculus. Tests multi-step mathematical reasoning.
Evaluates the ability to generate correct Python functions from docstrings; correctness is checked by executing the generated code against unit tests. Measures programming skill and code synthesis.
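To make the task format concrete, here is a sketch of a problem in this style: the model is shown a function signature and docstring and must produce the body, which is then executed against unit tests. The specific function and test below are illustrative, not drawn from the benchmark itself.

    # Prompt: signature and docstring; the body is what the model must write.
    def running_max(numbers):
        """Return a list where each element is the maximum seen so far."""
        result, current = [], float("-inf")
        for n in numbers:
            current = max(current, n)
            result.append(current)
        return result

    # Scoring: the completed function is executed against held-out unit tests.
    assert running_max([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]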
Tests ability to resolve real GitHub issues from popular open-source projects. Measures practical software engineering capability.
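In this setup, the model receives an issue description and the repository and proposes a code patch; the issue counts as resolved when the tests that reproduced the bug pass after the patch is applied. The helper below is a simplified, hypothetical sketch of that check, not the official evaluation harness.

    import subprocess

    def resolves_issue(repo_dir, patch_file, failing_tests):
        # Apply the model-generated patch to a clean checkout of the repo.
        subprocess.run(["git", "apply", patch_file], cwd=repo_dir, check=True)
        # Re-run the tests that reproduced the issue; success means resolved.
        result = subprocess.run(["python", "-m", "pytest", *failing_tests], cwd=repo_dir)
        return result.returncode == 0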
Competition-level math problems from the American Mathematics Competitions (AMC/AIME) pipeline. Tests advanced problem-solving; every answer is a single integer from 0 to 999, so results are easy to verify automatically. Strong differentiator among frontier models.
Short-form factual questions with verifiable answers. Measures factual accuracy and resistance to hallucination; even frontier models typically score relatively low.