Model Horizon

MMLU Leaderboard

Massive Multitask Language Understanding

Tests knowledge across 57 academic subjects including STEM, humanities, and social sciences. Measures breadth of world knowledge.

Models Tested: 26
Highest Score: 93.1%
Average: 86.7%
Spread: 21.1%
#    Model              Provider   Score
1    Gemini 3 Pro       Google     93.1%
2    GPT-5.3 Codex      OpenAI     93.0%
3    GPT-5.2            OpenAI     92.8%
4    Gemini 3.1 Pro     Google     92.6%
5    Claude Opus 4.6    Anthropic  92.5%
6    GPT-5.1            OpenAI     91.5%
7    Claude Sonnet 4.6  Anthropic  91.0%
8    DeepSeek R1        DeepSeek   90.8%
9    Claude Opus 4.5    Anthropic  90.5%
10   Grok 4.1           xAI        90.2%
11   o3                 OpenAI     89.4%
12   Gemini 2.5 Pro     Google     89.0%
13   Claude Sonnet 4.5  Anthropic  88.7%
14   Gemini 3 Flash     Google     88.0%
15   Grok 4             xAI        87.5%
16   o4-mini            OpenAI     86.8%
17   GPT-4.1            OpenAI     86.0%
18   DeepSeek V3.2      DeepSeek   85.7%
19   Mistral Large 3    Mistral    84.0%
20   Gemini 2.5 Flash   Google     83.5%
21   Llama 4 Maverick   Meta       82.5%
22   Claude Haiku 4.5   Anthropic  82.0%
23   GPT-4.1 mini       OpenAI     80.5%
24   Llama 4 Scout      Meta       78.0%
25   Mistral Small 3.2  Mistral    72.7%
26   GPT-4.1 nano       OpenAI     72.0%
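The summary figures on this page (models tested, highest score, average, spread) can be reproduced directly from the score column. A minimal Python sketch, with the scores transcribed from the table above; the variable names are illustrative, not part of any API:

```python
# Scores (%) transcribed from the MMLU leaderboard table, rank 1 through 26.
scores = [93.1, 93.0, 92.8, 92.6, 92.5, 91.5, 91.0, 90.8, 90.5, 90.2,
          89.4, 89.0, 88.7, 88.0, 87.5, 86.8, 86.0, 85.7, 84.0, 83.5,
          82.5, 82.0, 80.5, 78.0, 72.7, 72.0]

models_tested = len(scores)                    # number of rows: 26
highest = max(scores)                          # top score: 93.1
average = round(sum(scores) / len(scores), 1)  # mean score: 86.7
spread = round(max(scores) - min(scores), 1)   # top minus bottom: 21.1

print(models_tested, highest, average, spread)
```

Here "spread" is read as the gap between the best and worst score (93.1% − 72.0% = 21.1%), which matches the figure shown above.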