Model Horizon
© 2026 Model Horizon

SWE-bench Leaderboard

Software Engineering Bench

Tests a model's ability to resolve real GitHub issues from popular open-source projects, as a measure of practical software engineering capability.

Models Tested: 18
Highest Score: 80.6%
Average: 64.1%
Spread: 32.6%
#    Model               Provider    Score
1    Gemini 3.1 Pro      Google      80.6%
2    GPT-5.3 Codex       OpenAI      80%
3    Claude Opus 4.6     Anthropic   72.5%
4    o3                  OpenAI      71.7%
5    Gemini 3 Pro        Google      70.8%
6    Claude Sonnet 4.6   Anthropic   70.3%
7    o4-mini             OpenAI      68.5%
8    GPT-5.2             OpenAI      68%
9    Grok 4.1            xAI         65%
10   Claude Opus 4.5     Anthropic   64%
11   Gemini 2.5 Pro      Google      63.8%
12   GPT-5.1             OpenAI      62%
13   Gemini 3 Flash      Google      57%
14   Claude Sonnet 4.5   Anthropic   55.8%
15   GPT-4.1             OpenAI      54.6%
16   DeepSeek V3.2       DeepSeek    52%
17   DeepSeek R1         DeepSeek    49.2%
18   Grok 4              xAI         48%
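
The summary figures (models tested, highest score, average, spread) can be recomputed from the listed scores. The sketch below is an illustration only: the function name `summarize` is hypothetical, and it assumes that "Spread" means the highest score minus the lowest and that the average is an unweighted mean rounded to one decimal place. The page does not state either definition, although both assumptions reproduce the published figures.

```python
from statistics import mean

# Scores (percent) copied from the leaderboard, in rank order.
scores = [
    80.6, 80.0, 72.5, 71.7, 70.8, 70.3, 68.5, 68.0, 65.0,
    64.0, 63.8, 62.0, 57.0, 55.8, 54.6, 52.0, 49.2, 48.0,
]


def summarize(values):
    """Return the headline statistics for a list of percentage scores.

    Assumption: spread = max - min, average = unweighted mean,
    both rounded to one decimal place.
    """
    return {
        "models_tested": len(values),
        "highest": max(values),
        "average": round(mean(values), 1),
        "spread": round(max(values) - min(values), 1),
    }


summary = summarize(scores)
print(summary)
```

Under these assumptions the result matches the summary shown on the page: 18 models, a highest score of 80.6, an average of 64.1, and a spread of 32.6.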