📊 AI Model Benchmarks Comparison 2025

How do top AI models compare on MMLU, MATH-500, HumanEval, SWE-bench, and Chatbot Arena? A comprehensive benchmark analysis of 4,587 models across 95 providers.

1. General Knowledge — MMLU & MMLU-Pro

MMLU (Massive Multitask Language Understanding) tests knowledge across 57 academic subjects. MMLU-Pro is a harder variant requiring deeper reasoning.

| Model | MMLU | MMLU-Pro | Provider | Input price ($/M tokens) |
| --- | --- | --- | --- | --- |
| GPT-4.1 | ~90% | ~78% | OpenAI | $2.00 |
| Claude Opus 4 | ~90% | ~78% | Anthropic | $15.00 |
| Gemini 2.5 Pro | ~90% | ~78% | Google | $1.25 |
| Claude Sonnet 4 | ~88% | ~76% | Anthropic | $3.00 |
| Grok 3 | ~87% | ~75% | xAI | $3.00 |
| DeepSeek R1 | ~85% | ~72% | DeepSeek | Free |
| Qwen3-235B | ~85% | ~72% | Alibaba | Free |
| Llama 4 Maverick | ~82% | ~68% | Meta | Free |

Key Insight: MMLU is near-saturated for frontier models. Use MMLU-Pro or GPQA for more discriminating comparisons.

2. Mathematics — MATH-500 & AIME

MATH-500 tests competition-level mathematics on a 500-problem subset of the MATH dataset. AIME 2024 draws on the American Invitational Mathematics Examination and is harder still.

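| Model | MATH-500 | AIME 2024 | Provider | Input price ($/M tokens) |
| --- | --- | --- | --- | --- |
| o3 | ~96% | ~83% | OpenAI | $2.00 |
| o4-mini | ~93% | ~75% | OpenAI | $1.10 |
| DeepSeek R1 | ~92% | ~72% | DeepSeek | Free |
| Gemini 2.5 Pro | ~91% | ~70% | Google | $1.25 |
| Qwen3-235B | ~90% | ~68% | Alibaba | Free |
| Claude Sonnet 4 | ~88% | ~65% | Anthropic | $3.00 |

Key Insight: Reasoning models (o3, DeepSeek R1) dominate math benchmarks. For cost-sensitive math tasks, DeepSeek R1 is free and scores within about 4 points of o3 on MATH-500, though it trails by about 11 points on AIME 2024.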

3. Coding — HumanEval & SWE-bench

HumanEval tests Python code generation on short, self-contained functions that are checked by unit tests. SWE-bench tests resolution of real GitHub issues, which is more realistic for production use.

| Model | HumanEval | SWE-bench Verified | Provider | Input price ($/M tokens) |
| --- | --- | --- | --- | --- |
| Claude Sonnet 4 | ~93% | ~72% | Anthropic | $3.00 |
| o3 | ~92% | ~70% | OpenAI | $2.00 |
| GPT-4.1 | ~91% | ~65% | OpenAI | $2.00 |
| Gemini 2.5 Pro | ~90% | ~63% | Google | $1.25 |
| DeepSeek V3 | ~88% | ~55% | DeepSeek | $0.07 |
| Codestral | ~86% | N/A | Mistral | $0.30 |

Key Insight: SWE-bench is more realistic than HumanEval. Claude Sonnet 4 leads on SWE-bench. For budget coding, DeepSeek V3 at $0.07/M offers remarkable value.
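
HumanEval-style scores are commonly reported as pass@k: the chance that at least one of k sampled completions passes the unit tests. The following is a minimal sketch of the usual unbiased estimator, assuming n samples per problem of which c pass. The function name and the sample counts are illustrative, and the table above does not state which k its figures use.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples with c passing: 1 - C(n-c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 10 samples generated, 3 of them pass the tests.
print(pass_at_k(n=10, c=3, k=1))
```

A benchmark-level score would then be the mean of this value over all problems.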

4. Science & Reasoning — GPQA

GPQA (Graduate-Level Google-Proof Q&A) tests expert-level scientific reasoning. The questions are written so that even PhD-level non-specialists with unrestricted web access answer most of them incorrectly.

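| Model | GPQA Diamond | Provider | Input price ($/M tokens) |
| --- | --- | --- | --- |
| o3 | ~80% | OpenAI | $2.00 |
| Gemini 2.5 Pro | ~78% | Google | $1.25 |
| Claude Opus 4 | ~75% | Anthropic | $15.00 |
| o4-mini | ~73% | OpenAI | $1.10 |
| DeepSeek R1 | ~71% | DeepSeek | Free |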

5. Tool Calling — BFCL v3

BFCL (Berkeley Function Calling Leaderboard) tests function calling accuracy — critical for AI agents.

| Model | BFCL v3 | Provider | Input price ($/M tokens) |
| --- | --- | --- | --- |
| GPT-4.1 | ~88% | OpenAI | $2.00 |
| Claude Sonnet 4 | ~86% | Anthropic | $3.00 |
| Gemini 2.5 Pro | ~85% | Google | $1.25 |
| Grok 3 | ~83% | xAI | $3.00 |
| Gemini 2.5 Flash | ~82% | Google | Free |

Key Insight: 2,350 models in our catalog support tool calling. GPT-4.1 leads on BFCL, but Gemini 2.5 Flash offers strong performance for free.
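
Function-calling accuracy comes down to whether the model emitted the right function name with the right arguments. Here is a rough sketch of that check, assuming each call is a JSON object with `name` and `arguments` keys. It is a simplified exact-match illustration, not BFCL's actual evaluator, and `get_weather` is a made-up tool.

```python
import json


def call_matches(predicted_json: str, expected: dict) -> bool:
    """True when the model's JSON call has the expected name and arguments."""
    try:
        predicted = json.loads(predicted_json)
    except json.JSONDecodeError:
        return False  # malformed output counts as a miss
    if not isinstance(predicted, dict):
        return False
    return (predicted.get("name") == expected["name"]
            and predicted.get("arguments") == expected["arguments"])


def accuracy(outputs: list[str], expected_calls: list[dict]) -> float:
    """Share of outputs that exactly match their expected call."""
    hits = sum(call_matches(o, e) for o, e in zip(outputs, expected_calls))
    return hits / len(expected_calls)


# Hypothetical expected call and two model outputs (one right, one wrong city).
expected = {"name": "get_weather", "arguments": {"city": "Paris"}}
outputs = [
    '{"name": "get_weather", "arguments": {"city": "Paris"}}',
    '{"name": "get_weather", "arguments": {"city": "Rome"}}',
]
print(accuracy(outputs, [expected, expected]))
```

Exact matching is deliberately strict. A real harness would likely tolerate equivalent argument forms, such as `"5"` versus `5`.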

6. Human Preference — Chatbot Arena

LMSYS Chatbot Arena ranks models from blind, pairwise human comparisons: voters see two anonymous responses and pick the better one. This is the most practical benchmark for chat quality.

| Model | Arena Score | Provider | Input price ($/M tokens) |
| --- | --- | --- | --- |
| GPT-4.1 | ~1380 | OpenAI | $2.00 |
| Claude Sonnet 4 | ~1370 | Anthropic | $3.00 |
| Gemini 2.5 Pro | ~1360 | Google | $1.25 |
| Grok 3 | ~1350 | xAI | $3.00 |
| DeepSeek R1 | ~1330 | DeepSeek | Free |

Key Insight: Chatbot Arena correlates best with real-world chat quality. The top 5 models sit within about 50 points of each other (~1380 to ~1330), so pricing and features should drive your decision.
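
To get a feel for how small the gap between the Arena scores is, a rating difference can be turned into an expected head-to-head preference rate. The sketch below assumes Arena scores sit on the conventional Elo scale (base 10, 400-point divisor). That scale is an assumption here, not something the table states, and ties are ignored.

```python
def expected_win_rate(rating_a: float, rating_b: float) -> float:
    """Elo-style probability that model A is preferred over model B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


top = 1380    # GPT-4.1, approximate score from the table above
fifth = 1330  # DeepSeek R1, approximate score from the table above
print(f"{expected_win_rate(top, fifth):.3f}")  # 0.571
```

Under that assumption, the first-ranked model would be preferred over the fifth in only about 57% of direct matchups.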

7. Best Value by Benchmark

| Benchmark | Best Free | Best Paid | Best Overall |
| --- | --- | --- | --- |
| MMLU | DeepSeek R1 / Qwen3 | Gemini 2.5 Pro ($1.25) | GPT-4.1 |
| MATH | DeepSeek R1 | o4-mini ($1.10) | o3 |
| Coding | DeepSeek V3 ($0.07, lowest-cost; none free) | Gemini 2.5 Pro ($1.25) | Claude Sonnet 4 |
| GPQA | DeepSeek R1 | Gemini 2.5 Pro ($1.25) | o3 |
| Tool Calling | Gemini 2.5 Flash | Gemini 2.5 Pro ($1.25) | GPT-4.1 |
| Chat | DeepSeek R1 | Gemini 2.5 Pro ($1.25) | GPT-4.1 |
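
A value pick like the ones above can be reproduced by filtering on price and then taking the top score. The sketch below does this with the approximate SWE-bench Verified figures and input prices from the coding section. The function name and the budget caps are illustrative choices, not part of the original analysis.

```python
# (model, approx. SWE-bench Verified %, input $/M tokens) from the coding table
CODING = [
    ("Claude Sonnet 4", 72, 3.00),
    ("o3", 70, 2.00),
    ("GPT-4.1", 65, 2.00),
    ("Gemini 2.5 Pro", 63, 1.25),
    ("DeepSeek V3", 55, 0.07),
]


def best_within_budget(rows, max_price):
    """Return the highest-scoring row priced at or below max_price, else None."""
    affordable = [row for row in rows if row[2] <= max_price]
    return max(affordable, key=lambda row: row[1], default=None)


for cap in (0.10, 1.50, 3.00):
    print(cap, best_within_budget(CODING, cap))
```

With a $1.50 cap this returns Gemini 2.5 Pro, which matches the Best Paid pick for coding in the table above.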

8. Benchmark Limitations

Data contamination: Models may have seen benchmark data during training. Prefer LiveCodeBench over HumanEval for coding.
Task narrowness: Benchmarks test specific skills. Real-world performance may differ significantly.
Cost blindness: Benchmarks ignore pricing, latency, and availability. Always combine with our pricing data.
Staleness: Saturated benchmarks (GSM8K, HellaSwag) are uninformative. Focus on harder benchmarks like GPQA and SWE-bench.
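
The data contamination point can be made concrete with a simple overlap check: if most of a benchmark item's word n-grams also appear verbatim in a training document, the item may have leaked. This is a minimal sketch, assuming whitespace tokenization and a hypothetical default of 8-word n-grams. Real contamination audits are more involved.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Set of lowercase word n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item: str, corpus_text: str, n: int = 8) -> float:
    """Fraction of the item's n-grams that also appear in corpus_text."""
    item = ngrams(benchmark_item, n)
    if not item:
        return 0.0  # item shorter than n words: nothing to compare
    return len(item & ngrams(corpus_text, n)) / len(item)


# Made-up example: the benchmark prompt appears verbatim inside a scraped page.
item = "write a function that returns the sum of two integers"
page = "tutorial: write a function that returns the sum of two integers in python"
print(overlap_ratio(item, page, n=5))
```

A ratio near 1.0 suggests the item was seen verbatim. A low ratio does not rule out paraphrased leakage.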