HumanEval Code Generation
Evaluates a model's ability to generate correct Python functions from docstrings, as a measure of programming skill and code synthesis.
| # | Model | Provider | Score |
|---|---|---|---|
| 1 | Claude Opus 4.6 | Anthropic | 97% |
| 2 | GPT-5.2 | OpenAI | 97% |
| 3 | DeepSeek R1 | DeepSeek | 96.1% |
| 4 | Claude Sonnet 4.6 | Anthropic | 96% |
| 5 | Claude Opus 4.5 | Anthropic | 96% |
| 6 | GPT-5.1 | OpenAI | 96% |
| 7 | Claude Sonnet 4.5 | Anthropic | 95% |
| 8 | Gemini 3 Pro | Google | 95% |
| 9 | Gemini 3.1 Pro | Google | 94.6% |
| 10 | Gemini 3 Flash | Google | 94% |
| 11 | o4-mini | OpenAI | 93.6% |
| 12 | Gemini 2.5 Pro | Google | 93.2% |
| 13 | GPT-5.3 Codex | OpenAI | 93% |
| 14 | Mistral Small 3.2 | Mistral | 92.9% |
| 15 | o3 | OpenAI | 92.8% |
| 16 | GPT-4.1 | OpenAI | 92.4% |
| 17 | Grok 4.1 | xAI | 92% |
| 18 | DeepSeek V3.2 | DeepSeek | 92% |
| 19 | Mistral Large 3 | Mistral | 92% |
| 20 | Llama 4 Maverick | Meta | 91.5% |
| 21 | Gemini 2.5 Flash | Google | 90.3% |
| 22 | Claude Haiku 4.5 | Anthropic | 90% |
| 23 | Grok 4 | xAI | 90% |
| 24 | GPT-4.1 mini | OpenAI | 89.5% |
| 25 | Llama 4 Scout | Meta | 89% |
| 26 | GPT-4.1 nano | OpenAI | 81.2% |
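
To make the task concrete, here is a minimal sketch of how a HumanEval-style check could work: a prompt holding a function signature and docstring is completed by a model, and the result is run against unit tests. It assumes the score is the share of problems whose generated function passes its tests; the page does not state the exact metric. The problems, completions, and helper names (`PROBLEMS`, `COMPLETIONS`, `passes`, `score`) are hypothetical illustrations, not part of the benchmark.

```python
# Hypothetical HumanEval-style problems: each has a prompt (signature plus
# docstring) and a block of assertions used as the unit test.
PROBLEMS = [
    {
        "prompt": 'def add(a, b):\n    """Return the sum of a and b."""\n',
        "test": "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n",
    },
    {
        "prompt": 'def is_even(n):\n    """Return True if n is even."""\n',
        "test": "assert is_even(4) is True\nassert is_even(7) is False\n",
    },
]

# Hypothetical function bodies standing in for model output.
COMPLETIONS = [
    "    return a + b\n",
    "    return n % 2 == 1\n",  # deliberately wrong, so this problem fails
]


def passes(prompt: str, completion: str, test: str) -> bool:
    """Run prompt + completion, then the tests; True if nothing raises."""
    namespace: dict = {}
    try:
        # NOTE: exec on untrusted code is unsafe; a real harness would
        # run this in an isolated sandbox with a timeout.
        exec(prompt + completion + "\n" + test, namespace)
    except Exception:
        return False
    return True


def score(problems: list, completions: list) -> float:
    """Percentage of problems whose completion passes all of its tests."""
    passed = sum(
        passes(p["prompt"], c, p["test"])
        for p, c in zip(problems, completions)
    )
    return 100.0 * passed / len(problems)


if __name__ == "__main__":
    print(f"{score(PROBLEMS, COMPLETIONS):.1f}%")
```

In this sketch one of the two completions passes, so the printed score is 50.0%. A percentage in the table above would be read the same way under the stated assumption, computed over the benchmark's full problem set.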