How do top AI models compare on MMLU, MATH-500, HumanEval, SWE-bench, and Chatbot Arena? A comprehensive benchmark analysis of 4,587 models across 95 providers.
MMLU (Massive Multitask Language Understanding) tests knowledge across 57 academic subjects. MMLU-Pro is a harder variant requiring deeper reasoning.
| Model | MMLU | MMLU-Pro | Provider | Input $/M |
|---|---|---|---|---|
| GPT-4.1 | ~90% | ~78% | OpenAI | $2.00 |
| Claude Opus 4 | ~90% | ~78% | Anthropic | $15.00 |
| Gemini 2.5 Pro | ~90% | ~78% | Google | $1.25 |
| Claude Sonnet 4 | ~88% | ~76% | Anthropic | $3.00 |
| Grok 3 | ~87% | ~75% | xAI | $3.00 |
| DeepSeek R1 | ~85% | ~72% | DeepSeek | Free |
| Qwen3-235B | ~85% | ~72% | Alibaba | Free |
| Llama 4 Maverick | ~82% | ~68% | Meta | Free |
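
The "Input $/M" column is the price in US dollars per million input tokens. As a rough illustration of what that means for a budget, the sketch below converts the paid prices from the table into the input-side cost of a workload. The 50-million-token monthly volume and the helper name `input_cost_usd` are hypothetical, and output-token pricing (which the table does not list) is ignored.

```python
# Input prices in USD per million tokens, copied from the table above.
INPUT_PRICE_PER_M = {
    "GPT-4.1": 2.00,
    "Claude Opus 4": 15.00,
    "Gemini 2.5 Pro": 1.25,
    "Claude Sonnet 4": 3.00,
    "Grok 3": 3.00,
}


def input_cost_usd(model: str, input_tokens: int) -> float:
    """Input-side cost of a workload in USD (output tokens are not counted)."""
    return INPUT_PRICE_PER_M[model] * input_tokens / 1_000_000


# Hypothetical workload: 50 million input tokens per month.
MONTHLY_INPUT_TOKENS = 50_000_000
for model in INPUT_PRICE_PER_M:
    cost = input_cost_usd(model, MONTHLY_INPUT_TOKENS)
    print(f"{model}: ${cost:,.2f} per month (input only)")
```

At similar MMLU scores, the price gap between models is what drives the difference in this estimate, which is why the tables list price next to accuracy.
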
MATH-500 tests competition-level mathematics. AIME 2024 is an even harder math competition benchmark.
| Model | MATH-500 | AIME 2024 | Provider | Input $/M |
|---|---|---|---|---|
| o3 | ~96% | ~83% | OpenAI | $2.00 |
| o4-mini | ~93% | ~75% | OpenAI | $1.10 |
| DeepSeek R1 | ~92% | ~72% | DeepSeek | Free |
| Gemini 2.5 Pro | ~91% | ~70% | Google | $1.25 |
| Qwen3-235B | ~90% | ~68% | Alibaba | Free |
| Claude Sonnet 4 | ~88% | ~65% | Anthropic | $3.00 |
HumanEval tests Python code generation on short, self-contained functions. SWE-bench Verified tests whether a model can resolve real GitHub issues, which makes it more representative of production work.
| Model | HumanEval | SWE-bench Verified | Provider | Input $/M |
|---|---|---|---|---|
| Claude Sonnet 4 | ~93% | ~72% | Anthropic | $3.00 |
| o3 | ~92% | ~70% | OpenAI | $2.00 |
| GPT-4.1 | ~91% | ~65% | OpenAI | $2.00 |
| Gemini 2.5 Pro | ~90% | ~63% | Google | $1.25 |
| DeepSeek V3 | ~88% | ~55% | DeepSeek | $0.07 |
| Codestral | ~86% | N/A | Mistral | $0.30 |
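
HumanEval results are usually reported as pass@k: the probability that at least one of k sampled solutions passes the unit tests. The table does not say which k it uses (pass@1 is the common choice, and is assumed here). The sketch below implements the unbiased estimator from the original HumanEval paper, `1 - C(n-c, k) / C(n, k)`; the sample counts in the usage lines are made up for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: number of samples generated for the problem
    c: number of those samples that pass the unit tests
    k: number of attempts the metric allows
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples fail, so pass@k becomes 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 10 samples generated, 3 of them pass.
print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

A benchmark score is then the mean of this value over all problems. With k = 1 the estimate reduces to the plain fraction of passing samples, c / n.
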
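GPQA (Graduate-Level Google-Proof Q&A) tests expert-level scientific reasoning. The questions are written by domain experts and are designed so that skilled non-experts cannot answer them reliably even with unrestricted web access.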
| Model | GPQA Diamond | Provider | Input $/M |
|---|---|---|---|
| o3 | ~80% | OpenAI | $2.00 |
| Gemini 2.5 Pro | ~78% | Google | $1.25 |
| Claude Opus 4 | ~75% | Anthropic | $15.00 |
| o4-mini | ~73% | OpenAI | $1.10 |
| DeepSeek R1 | ~71% | DeepSeek | Free |
BFCL (Berkeley Function Calling Leaderboard) tests function-calling accuracy, which is critical for AI agents.
| Model | BFCL v3 | Provider | Input $/M |
|---|---|---|---|
| GPT-4.1 | ~88% | OpenAI | $2.00 |
| Claude Sonnet 4 | ~86% | Anthropic | $3.00 |
| Gemini 2.5 Pro | ~85% | Google | $1.25 |
| Grok 3 | ~83% | xAI | $3.00 |
| Gemini 2.5 Flash | ~82% | Google | Free |
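LMSYS Chatbot Arena ranks models from blind, pairwise human comparisons: voters see two anonymous responses to the same prompt and pick the better one. Because the scores reflect real user preferences rather than a fixed answer key, this is arguably the most practical benchmark for chat quality.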
| Model | Arena Score | Provider | Input $/M |
|---|---|---|---|
| GPT-4.1 | ~1380 | OpenAI | $2.00 |
| Claude Sonnet 4 | ~1370 | Anthropic | $3.00 |
| Gemini 2.5 Pro | ~1360 | Google | $1.25 |
| Grok 3 | ~1350 | xAI | $3.00 |
| DeepSeek R1 | ~1330 | DeepSeek | Free |
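
Arena scores are ratings, not percentages, so the gaps are easier to read as head-to-head win rates. The sketch below assumes the scores sit on the standard Elo logistic scale (base 10, 400 points per factor of ten in odds). That scale and the function name `expected_win_rate` are assumptions for illustration, not something the table states.

```python
def expected_win_rate(score_a: float, score_b: float, scale: float = 400.0) -> float:
    """Expected probability that model A is preferred over model B.

    Assumes an Elo-style logistic curve: a gap of `scale` points
    corresponds to 10:1 odds in favor of the higher-rated model.
    """
    return 1.0 / (1.0 + 10 ** ((score_b - score_a) / scale))


# Approximate scores from the table above.
ARENA = {"GPT-4.1": 1380, "Claude Sonnet 4": 1370, "DeepSeek R1": 1330}

for rival in ("Claude Sonnet 4", "DeepSeek R1"):
    p = expected_win_rate(ARENA["GPT-4.1"], ARENA[rival])
    print(f"GPT-4.1 vs {rival}: {p:.1%} expected preference rate")
```

Under this assumption, a 10-point gap is close to a coin flip and even a 50-point gap corresponds to a preference rate well below 60%, so small differences near the top of the table should not be over-interpreted.
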
| Benchmark | Best Free | Best Paid | Best Overall |
|---|---|---|---|
| MMLU | DeepSeek R1 / Qwen3 | Gemini 2.5 Pro ($1.25) | GPT-4.1 |
| MATH | DeepSeek R1 | o4-mini ($1.10) | o3 |
| Coding | DeepSeek V3 ($0.07; cheapest listed, not free) | Gemini 2.5 Pro ($1.25) | Claude Sonnet 4 |
| GPQA | DeepSeek R1 | Gemini 2.5 Pro ($1.25) | o3 |
| Tool Calling | Gemini 2.5 Flash | Gemini 2.5 Pro ($1.25) | GPT-4.1 |
| Chat | DeepSeek R1 | Gemini 2.5 Pro ($1.25) | GPT-4.1 |
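
The source does not state how the "Best Free", "Best Paid", and "Best Overall" picks were made. As a minimal sketch, the code below derives them for the MATH-500 table under assumed rules: highest score among free models, highest score overall, and (for paid models) highest score per dollar of input price. The score-per-dollar rule is a guess; it happens to reproduce the MATH row but may not match every other row.

```python
from typing import NamedTuple, Optional


class Row(NamedTuple):
    model: str
    score: float            # approximate MATH-500 accuracy, in percent
    price: Optional[float]  # input USD per million tokens; None means free


# Values copied from the MATH-500 table above.
MATH_500 = [
    Row("o3", 96, 2.00),
    Row("o4-mini", 93, 1.10),
    Row("DeepSeek R1", 92, None),
    Row("Gemini 2.5 Pro", 91, 1.25),
    Row("Qwen3-235B", 90, None),
    Row("Claude Sonnet 4", 88, 3.00),
]

free = [r for r in MATH_500 if r.price is None]
paid = [r for r in MATH_500 if r.price is not None]

best_overall = max(MATH_500, key=lambda r: r.score)
best_free = max(free, key=lambda r: r.score)
# Assumed "value" rule: accuracy points per dollar of input price.
best_paid_value = max(paid, key=lambda r: r.score / r.price)

print("Best overall:", best_overall.model)
print("Best free:", best_free.model)
print("Best paid (score per dollar):", best_paid_value.model)
```

Swapping in a different value rule, for example a minimum score threshold followed by lowest price, only requires changing the `key` function.
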