AI Model Benchmarks Comparison 2025

1. General Knowledge — MMLU & MMLU-Pro

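MMLU (Massive Multitask Language Understanding) tests knowledge across 57 academic subjects using four-option multiple-choice questions. MMLU-Pro is a harder variant that offers up to ten answer options per question and puts more weight on reasoning.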

Model	MMLU	MMLU-Pro	Provider	Input $/M tokens
GPT-4.1	~90%	~78%	OpenAI	$2.00
Claude Opus 4	~90%	~78%	Anthropic	$15.00
Gemini 2.5 Pro	~90%	~78%	Google	$1.25
Claude Sonnet 4	~88%	~76%	Anthropic	$3.00
Grok 3	~87%	~75%	xAI	$3.00
DeepSeek R1	~85%	~72%	DeepSeek	Free
Qwen3-235B	~85%	~72%	Alibaba	Free
Llama 4 Maverick	~82%	~68%	Meta	Free

Key Insight: MMLU is near-saturated for frontier models; the top three models in the table tie at ~90%. Use MMLU-Pro or GPQA for more discriminating comparisons.
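
One rough way to see the saturation is to compare how far apart the models sit on each benchmark. The sketch below uses the approximate scores from the table above; the `spread` helper is a name introduced here for illustration, not part of any benchmark tooling.

```python
# Approximate (MMLU, MMLU-Pro) scores copied from the table above, in percent.
scores = {
    "GPT-4.1": (90, 78),
    "Claude Opus 4": (90, 78),
    "Gemini 2.5 Pro": (90, 78),
    "Claude Sonnet 4": (88, 76),
    "Grok 3": (87, 75),
    "DeepSeek R1": (85, 72),
    "Qwen3-235B": (85, 72),
    "Llama 4 Maverick": (82, 68),
}


def spread(values):
    """Gap between the best and worst score, in percentage points."""
    values = list(values)
    return max(values) - min(values)


mmlu_spread = spread(mmlu for mmlu, _ in scores.values())
mmlu_pro_spread = spread(pro for _, pro in scores.values())
print(f"MMLU spread: {mmlu_spread} points")
print(f"MMLU-Pro spread: {mmlu_pro_spread} points")
```

With these rounded figures the spread widens only slightly on MMLU-Pro (10 points versus 8), and the top three models still tie, which is one reason to consult GPQA as well.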

2. Mathematics — MATH-500 & AIME

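MATH-500 is a 500-problem sample of competition-level mathematics questions. AIME 2024 is harder: it uses the 30 problems from the 2024 American Invitational Mathematics Examination, so a single problem is worth about 3.3 percentage points.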

Model	MATH-500	AIME 2024	Provider	Input $/M tokens
o3	~96%	~83%	OpenAI	$2.00
o4-mini	~93%	~75%	OpenAI	$1.10
DeepSeek R1	~92%	~72%	DeepSeek	Free
Gemini 2.5 Pro	~91%	~70%	Google	$1.25
Qwen3-235B	~90%	~68%	Alibaba	Free
Claude Sonnet 4	~88%	~65%	Anthropic	$3.00

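Key Insight: Reasoning models (o3, o4-mini, DeepSeek R1) dominate math benchmarks. For cost-sensitive math tasks, DeepSeek R1 is free and trails o3 by about 4 points on MATH-500 (~92% vs ~96%), though the gap widens to about 11 points on AIME 2024 (~72% vs ~83%).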

3. Coding — HumanEval & SWE-bench

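HumanEval tests Python code generation on short, self-contained functions checked by unit tests. SWE-bench tests the resolution of real GitHub issues in existing repositories, which is closer to production work; the table reports the human-validated SWE-bench Verified subset.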

Model	HumanEval	SWE-bench Verified	Provider	Input $/M tokens
Claude Sonnet 4	~93%	~72%	Anthropic	$3.00
o3	~92%	~70%	OpenAI	$2.00
GPT-4.1	~91%	~65%	OpenAI	$2.00
Gemini 2.5 Pro	~90%	~63%	Google	$1.25
DeepSeek V3	~88%	~55%	DeepSeek	$0.07
Codestral	~86%	N/A	Mistral	$0.30

Key Insight: SWE-bench is more realistic than HumanEval, and Claude Sonnet 4 leads on it (~72%). For budget coding, DeepSeek V3 costs $0.07/M input tokens, roughly 43 times less than Claude Sonnet 4's $3.00, while still scoring ~55% on SWE-bench Verified.
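
HumanEval results are usually reported as pass@k, the probability that at least one of k sampled completions passes the unit tests. The table above does not state which k was used, so the sketch below only illustrates the commonly used unbiased estimator; `pass_at_k` and the sample counts are names and numbers chosen here, not taken from the source.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: completions sampled, c: completions that passed, k: attempt budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k subset
    # of the n samples contains no passing completion.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical example: 10 samples for one problem, 3 of which pass.
print(f"pass@1 = {pass_at_k(10, 3, 1):.3f}")
print(f"pass@5 = {pass_at_k(10, 3, 5):.3f}")
```

A benchmark-level score is the mean of this estimate over all problems, so a model can look noticeably stronger at larger k than at k = 1.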

4. Science & Reasoning — GPQA

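GPQA (Graduate-Level Google-Proof Q&A) tests expert-level scientific reasoning in biology, physics, and chemistry. The questions are written so that web search offers little help: in the original study, skilled non-experts with unrestricted internet access scored about 34%, while domain experts scored about 65%.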

Model	GPQA Diamond	Provider	Input $/M tokens
o3	~80%	OpenAI	$2.00
Gemini 2.5 Pro	~78%	Google	$1.25
Claude Opus 4	~75%	Anthropic	$15.00
o4-mini	~73%	OpenAI	$1.10
DeepSeek R1	~71%	DeepSeek	Free
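
Because a benchmark has a finite number of questions, small gaps between models may sit within sampling noise. The sketch below approximates the standard error of an accuracy score with a binomial model. It assumes GPQA Diamond contains 198 questions and treats questions as independent, which is a simplification; `standard_error` is a helper name introduced here.

```python
from math import sqrt

GPQA_DIAMOND_QUESTIONS = 198  # assumed size of the Diamond subset


def standard_error(accuracy: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy expressed in [0, 1]."""
    return sqrt(accuracy * (1.0 - accuracy) / n_questions)


# Approximate scores from the table above.
for model, acc in [("o3", 0.80), ("Gemini 2.5 Pro", 0.78), ("DeepSeek R1", 0.71)]:
    se = standard_error(acc, GPQA_DIAMOND_QUESTIONS)
    # A rough 95% interval is the score plus or minus 1.96 standard errors.
    print(f"{model}: {acc:.0%} +/- {1.96 * se:.1%}")
```

Under these assumptions the 95% interval is roughly 6 points either side of each score, so the 2-point gap between o3 and Gemini 2.5 Pro should not be read as a clear win.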

5. Tool Calling — BFCL v3

BFCL (Berkeley Function Calling Leaderboard) tests how accurately a model selects the right function and fills in its arguments, a capability that AI agents depend on.

Model	BFCL v3	Provider	Input $/M tokens
GPT-4.1	~88%	OpenAI	$2.00
Claude Sonnet 4	~86%	Anthropic	$3.00
Gemini 2.5 Pro	~85%	Google	$1.25
Grok 3	~83%	xAI	$3.00
Gemini 2.5 Flash	~82%	Google	Free

Key Insight: 2,350 models in our catalog support tool calling. GPT-4.1 leads on BFCL (~88%), but Gemini 2.5 Flash scores ~82% at no cost.
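
To make "function calling accuracy" concrete, the sketch below scores a model's proposed call against an expected call by comparing the function name and arguments. It is a deliberately simplified illustration, not BFCL's actual evaluator; the `get_weather` function, its arguments, and the `call_matches` helper are all hypothetical.

```python
import json


def call_matches(predicted_json: str, expected: dict) -> bool:
    """Return True if the predicted call has the expected name and arguments."""
    try:
        predicted = json.loads(predicted_json)
    except json.JSONDecodeError:
        return False  # malformed output counts as a failure
    if not isinstance(predicted, dict):
        return False
    return (
        predicted.get("name") == expected["name"]
        and predicted.get("arguments") == expected["arguments"]
    )


# Hypothetical expected call and three hypothetical model outputs.
expected = {"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}
outputs = [
    '{"name": "get_weather", "arguments": {"unit": "celsius", "city": "Paris"}}',
    '{"name": "get_weather", "arguments": {"city": "Paris"}}',  # missing argument
    "get_weather(Paris)",  # not valid JSON
]

accuracy = sum(call_matches(o, expected) for o in outputs) / len(outputs)
print(f"accuracy: {accuracy:.0%}")
```

Argument order does not matter in this check because the arguments are compared as dictionaries; a real evaluator also has to handle type coercion, optional parameters, and multi-call scenarios.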

6. Human Preference — Chatbot Arena

LMSYS Chatbot Arena ranks models from blind, side-by-side human comparisons and reports the result as an Elo-style rating. This is the most practical benchmark for chat quality.

Model	Arena Score	Provider	Input $/M tokens
GPT-4.1	~1380	OpenAI	$2.00
Claude Sonnet 4	~1370	Anthropic	$3.00
Gemini 2.5 Pro	~1360	Google	$1.25
Grok 3	~1350	xAI	$3.00
DeepSeek R1	~1330	DeepSeek	Free

Key Insight: Chatbot Arena reflects real user preferences more directly than the other benchmarks here. The top 5 models fall within about 50 points of each other (~1380 to ~1330), so pricing and features should drive your decision.
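
A rating gap can be translated into an expected head-to-head win rate. The sketch below uses the standard Elo logistic formula with a 400-point scale; whether the leaderboard's current rating model matches this formula exactly is an assumption, and `expected_win_rate` is a name introduced here.

```python
def expected_win_rate(rating_a: float, rating_b: float) -> float:
    """Expected probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Approximate scores from the table above: GPT-4.1 vs DeepSeek R1.
first, fifth = 1380, 1330
print(f"{expected_win_rate(first, fifth):.1%}")  # prints 57.1%
```

Under this model the 50-point gap between first and fifth place means the higher-rated model is preferred in roughly 57% of pairings, which is consistent with treating the top models as close.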

7. Best Value by Benchmark

Benchmark	Best Free or Lowest-Cost	Best-Value Paid	Best Overall
MMLU	DeepSeek R1 / Qwen3	Gemini 2.5 Pro ($1.25)	GPT-4.1
MATH	DeepSeek R1	o4-mini ($1.10)	o3
Coding	DeepSeek V3 ($0.07)	Gemini 2.5 Pro ($1.25)	Claude Sonnet 4
GPQA	DeepSeek R1	Gemini 2.5 Pro ($1.25)	o3
Tool Calling	Gemini 2.5 Flash	Gemini 2.5 Pro ($1.25)	GPT-4.1
Chat	DeepSeek R1	Gemini 2.5 Pro ($1.25)	GPT-4.1
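
A "best value" pick can be made reproducible by keeping only the models that no other model beats on both price and score (a Pareto frontier). The sketch below applies this to the SWE-bench Verified figures from section 3; `pareto_frontier` is a helper introduced here, and Codestral is omitted because its score is listed as N/A.

```python
# (model, input price in $ per million tokens, SWE-bench Verified %)
models = [
    ("Claude Sonnet 4", 3.00, 72),
    ("o3", 2.00, 70),
    ("GPT-4.1", 2.00, 65),
    ("Gemini 2.5 Pro", 1.25, 63),
    ("DeepSeek V3", 0.07, 55),
]


def pareto_frontier(rows):
    """Names of rows that no other row dominates.

    A row is dominated when another row costs no more, scores no lower,
    and is strictly better on at least one of the two.
    """
    frontier = []
    for name, price, score in rows:
        dominated = any(
            p <= price and s >= score and (p < price or s > score)
            for _, p, s in rows
        )
        if not dominated:
            frontier.append(name)
    return frontier


print(pareto_frontier(models))
```

With these figures only GPT-4.1 drops out, since o3 matches its listed price with a higher score; the remaining four models each represent a different price and quality trade-off.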

8. Benchmark Limitations

Data contamination: Models may have seen benchmark data during training, which inflates scores. For coding, prefer LiveCodeBench, which keeps adding newly published problems, over HumanEval.

Task narrowness: Benchmarks test specific skills. Real-world performance may differ significantly.

Cost blindness: Benchmarks ignore pricing, latency, and availability. Always combine with our pricing data.

Staleness: Saturated benchmarks (GSM8K, HellaSwag) are uninformative. Focus on harder benchmarks like GPQA and SWE-bench.
