Software Engineering Bench
Tests a model's ability to resolve real GitHub issues from popular open-source projects, as a measure of practical software engineering capability.
| # | Model | Provider | Score |
|---|---|---|---|
| 1 | Gemini 3.1 Pro | Google | 80.6% |
| 2 | GPT-5.3 Codex | OpenAI | 80% |
| 3 | Claude Opus 4.6 | Anthropic | 72.5% |
| 4 | o3 | OpenAI | 71.7% |
| 5 | Gemini 3 Pro | Google | 70.8% |
| 6 | Claude Sonnet 4.6 | Anthropic | 70.3% |
| 7 | o4-mini | OpenAI | 68.5% |
| 8 | GPT-5.2 | OpenAI | 68% |
| 9 | Grok 4.1 | xAI | 65% |
| 10 | Claude Opus 4.5 | Anthropic | 64% |
| 11 | Gemini 2.5 Pro | Google | 63.8% |
| 12 | GPT-5.1 | OpenAI | 62% |
| 13 | Gemini 3 Flash | Google | 57% |
| 14 | Claude Sonnet 4.5 | Anthropic | 55.8% |
| 15 | GPT-4.1 | OpenAI | 54.6% |
| 16 | DeepSeek V3.2 | DeepSeek | 52% |
| 17 | DeepSeek R1 | DeepSeek | 49.2% |
| 18 | Grok 4 | xAI | 48% |