# SimpleQA Factuality Benchmark

SimpleQA consists of short-form factual questions with verifiable answers, measuring factual accuracy and resistance to hallucination. Low scores are common even for frontier models.
| # | Model | Organization | Score |
|---|---|---|---|
| 1 | Gemini 3.1 Pro | Google | 79.6% |
| 2 | GPT-5.3 Codex | OpenAI | 58% |
| 3 | GPT-5.2 | OpenAI | 52.5% |
| 4 | Gemini 3 Pro | Google | 49% |
| 5 | GPT-5.1 | OpenAI | 48% |
| 6 | o3 | OpenAI | 47.9% |
| 7 | Claude Opus 4.6 | Anthropic | 43.2% |
| 8 | GPT-4.1 | OpenAI | 42.8% |
| 9 | Gemini 2.5 Pro | Google | 41.5% |
| 10 | o4-mini | OpenAI | 40.3% |
| 11 | Claude Sonnet 4.6 | Anthropic | 39.5% |
| 12 | Grok 4.1 | xAI | 38% |
| 13 | Claude Opus 4.5 | Anthropic | 36% |
| 14 | Gemini 3 Flash | Google | 36% |
| 15 | Grok 4 | xAI | 34.2% |
| 16 | DeepSeek V3.2 | DeepSeek | 33% |
| 17 | DeepSeek R1 | DeepSeek | 31.4% |
| 18 | Claude Sonnet 4.5 | Anthropic | 30.8% |
| 19 | Mistral Large 3 | Mistral | 29% |
| 20 | Gemini 2.5 Flash | Google | 28.3% |
| 21 | Llama 4 Maverick | Meta | 27.5% |
| 22 | GPT-4.1 mini | OpenAI | 26.5% |
| 23 | Llama 4 Scout | Meta | 21% |
| 24 | Claude Haiku 4.5 | Anthropic | 19% |
| 25 | Mistral Small 3.2 | Mistral | 15.5% |
| 26 | GPT-4.1 nano | OpenAI | 14.2% |