
HumanEval Leaderboard

HumanEval Code Generation

HumanEval evaluates a model's ability to generate correct Python functions from natural-language docstrings, measuring programming skill and code synthesis.
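The page does not state which sampling regime produced these scores; HumanEval results are conventionally reported as pass@k (often pass@1). For reference, the sketch below implements the unbiased pass@k estimator from the original HumanEval paper (Chen et al., 2021). The example numbers are hypothetical.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021,
    "Evaluating Large Language Models Trained on Code").

    n -- total samples generated for a problem
    c -- samples that passed the problem's unit tests
    k -- evaluation budget
    """
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 200 samples drawn, 190 pass the tests.
print(pass_at_k(n=200, c=190, k=1))   # 0.95
print(pass_at_k(n=200, c=190, k=10))  # ~1.0
```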

- Models tested: 26
- Highest score: 97%
- Average: 92.8%
- Spread (highest minus lowest): 15.8%
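The summary figures above can be reproduced directly from the score column of the table below; a quick sanity-check sketch:

```python
# Scores from the leaderboard table, in rank order.
scores = [97.0, 97.0, 96.1, 96.0, 96.0, 96.0, 95.0, 95.0, 94.6, 94.0,
          93.6, 93.2, 93.0, 92.9, 92.8, 92.4, 92.0, 92.0, 92.0, 91.5,
          90.3, 90.0, 90.0, 89.5, 89.0, 81.2]

print(len(scores))                          # 26 models tested
print(max(scores))                          # 97.0 (highest score)
print(round(sum(scores) / len(scores), 1))  # 92.8 (average)
print(round(max(scores) - min(scores), 1))  # 15.8 (spread)
```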
| # | Model | Provider | Score |
|---|-------|----------|-------|
| 1 | Claude Opus 4.6 | Anthropic | 97% |
| 2 | GPT-5.2 | OpenAI | 97% |
| 3 | DeepSeek R1 | DeepSeek | 96.1% |
| 4 | Claude Sonnet 4.6 | Anthropic | 96% |
| 5 | Claude Opus 4.5 | Anthropic | 96% |
| 6 | GPT-5.1 | OpenAI | 96% |
| 7 | Claude Sonnet 4.5 | Anthropic | 95% |
| 8 | Gemini 3 Pro | Google | 95% |
| 9 | Gemini 3.1 Pro | Google | 94.6% |
| 10 | Gemini 3 Flash | Google | 94% |
| 11 | o4-mini | OpenAI | 93.6% |
| 12 | Gemini 2.5 Pro | Google | 93.2% |
| 13 | GPT-5.3 Codex | OpenAI | 93% |
| 14 | Mistral Small 3.2 | Mistral | 92.9% |
| 15 | o3 | OpenAI | 92.8% |
| 16 | GPT-4.1 | OpenAI | 92.4% |
| 17 | Grok 4.1 | xAI | 92% |
| 18 | DeepSeek V3.2 | DeepSeek | 92% |
| 19 | Mistral Large 3 | Mistral | 92% |
| 20 | Llama 4 Maverick | Meta | 91.5% |
| 21 | Gemini 2.5 Flash | Google | 90.3% |
| 22 | Claude Haiku 4.5 | Anthropic | 90% |
| 23 | Grok 4 | xAI | 90% |
| 24 | GPT-4.1 mini | OpenAI | 89.5% |
| 25 | Llama 4 Scout | Meta | 89% |
| 26 | GPT-4.1 nano | OpenAI | 81.2% |