1,548 models that see, hear, speak, and create — compared with pricing, context windows, and capabilities
The most capable multimodal models across all providers:
| Model | Provider | Context | Input | Output | Tool Call | Price (in/out per 1M tokens) |
|---|---|---|---|---|---|---|
| gpt-4o | OpenAI | 128K | text, image | text | ✓ | $2.50/$10 |
| gpt-4.1 | OpenAI | 1M | text, image | text | ✓ | $2/$8 |
| claude-sonnet-4 | Anthropic | 200K | text, image | text | ✓ | $3/$15 |
| gemini-2.5-pro | Google | 1M | text, image, audio, video | text | ✓ | $1.25/$10 |
| gemini-2.5-flash | Google | 1M | text, image, audio, video | text | ✓ | $0.15/$0.60 |
| llama-4-maverick | Meta | 1M | text, image | text | ✓ | Varies |
| qwen3-235b-a22b | Alibaba | 128K | text, image | text | ✓ | Varies |
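Because prices are quoted per 1M tokens, the cost of a single request is the token counts scaled by those rates. The following is a minimal sketch of that arithmetic, using the fixed prices from the table above; the function name `estimate_cost` and the example workload (10,000 input tokens, 1,000 output tokens) are illustrative assumptions, and real bills may differ by provider (for example, in how image or audio inputs are converted to tokens).

```python
# USD per 1M tokens as (input, output), copied from the comparison table.
# Models priced as "Varies" are omitted because no fixed rate is given.
PRICES_PER_1M = {
    "gpt-4o": (2.50, 10.00),
    "gpt-4.1": (2.00, 8.00),
    "claude-sonnet-4": (3.00, 15.00),
    "gemini-2.5-pro": (1.25, 10.00),
    "gemini-2.5-flash": (0.15, 0.60),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one request (hypothetical helper)."""
    price_in, price_out = PRICES_PER_1M[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


if __name__ == "__main__":
    # Assumed workload: 10,000 input tokens and 1,000 output tokens per request.
    for name in PRICES_PER_1M:
        print(f"{name}: ${estimate_cost(name, 10_000, 1_000):.4f}")
```

For the assumed workload, gpt-4o comes to (10,000 × 2.50 + 1,000 × 10) / 1,000,000 = $0.035 per request, while gemini-2.5-flash comes to $0.0021.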
1,487 models can accept images as input alongside text. These are the most common type of multimodal model:
→ See all 1,487 vision models compared
118 models can process audio input — for transcription, voice analysis, and audio understanding:
| Model | Provider | Audio Capabilities | Context |
|---|---|---|---|
| gemini-2.5-pro | Google | Audio understanding + transcription | 1M |
| gemini-2.5-flash | Google | Audio understanding + transcription | 1M |
| gpt-4o-audio-preview | OpenAI | Audio input + output | 128K |
| claude-sonnet-4 | Anthropic | Audio transcription | 200K |
28 models can generate images from text descriptions. This is a rapidly growing category:
| Model | Provider | Capabilities |
|---|---|---|
| gpt-image-1 | OpenAI | Text-to-image, image editing |
| dall-e-3 | OpenAI | Text-to-image generation |
| flux-1.1-pro | Black Forest Labs | High-quality text-to-image |
| stable-diffusion-3.5 | Stability AI | Open-weight text-to-image |
→ See all 28 image generation models
34 models can generate audio output — for text-to-speech, voice cloning, and audio generation:
167 models can process video input — for video analysis, summarization, and content understanding:
- gemma-3-27b-it (free) or gpt-4o
- gemini-2.5-flash (cheapest multimodal) or gemini-2.5-pro
- gpt-image-1 or flux-1.1-pro
- gemini-2.5-pro (best video understanding)
- llama-4-maverick or claude-sonnet-4
- gemini-2.5-flash ($0.15/$0.60 per 1M tokens)
- gemma-3-27b-it (Google, free) or qwen3-32b (Alibaba, free)
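Narrowing the catalog to a shortlist like this comes down to filtering on required input types and context size. Below is a rough sketch of such a filter over the rows of the main comparison table; the `Model` class and `find_models` function are hypothetical names, and treating "128K", "200K", and "1M" as 128,000, 200,000, and 1,000,000 tokens is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    context_tokens: int  # approximate, from the "Context" column
    inputs: frozenset    # accepted input modalities


# Rows copied from the main comparison table.
CATALOG = [
    Model("gpt-4o", 128_000, frozenset({"text", "image"})),
    Model("gpt-4.1", 1_000_000, frozenset({"text", "image"})),
    Model("claude-sonnet-4", 200_000, frozenset({"text", "image"})),
    Model("gemini-2.5-pro", 1_000_000, frozenset({"text", "image", "audio", "video"})),
    Model("gemini-2.5-flash", 1_000_000, frozenset({"text", "image", "audio", "video"})),
    Model("llama-4-maverick", 1_000_000, frozenset({"text", "image"})),
    Model("qwen3-235b-a22b", 128_000, frozenset({"text", "image"})),
]


def find_models(required_inputs: set[str], min_context: int = 0) -> list[str]:
    """Return names of models that accept every required input type
    and offer at least `min_context` tokens of context."""
    return [
        m.name
        for m in CATALOG
        if required_inputs <= m.inputs and m.context_tokens >= min_context
    ]


if __name__ == "__main__":
    print(find_models({"video"}))
    print(find_models({"image"}, min_context=1_000_000))
```

With the table data as given, only the two Gemini models pass the video filter, which matches the video recommendation in the list.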