🎨 Multimodal AI Models

1,548 models that see, hear, speak, and create — compared with pricing, context windows, and capabilities

1,548 Multimodal Models | 1,487 Vision | 118 Audio Input | 34 Audio Output | 28 Image Output | 167 Video Input
Contents
  1. Modality Breakdown
  2. Flagship Multimodal Models
  3. Vision Models (Image Input)
  4. Audio Input Models
  5. Image Generation Models
  6. Audio Output Models
  7. Video Understanding Models
  8. Choosing the Right Multimodal Model

📊 Modality Breakdown

| Modality | Models |
|---|---|
| 👁️ Vision (Image Input) | 1,487 |
| 🎬 Video Input | 167 |
| 🎤 Audio Input | 118 |
| 📄 PDF Input | 141 |
| 🖼️ Image Output | 28 |
| 🔊 Audio Output | 34 |
| 🎥 Video Output | 4 |
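
To put the counts above in proportion, the sketch below computes each modality's share of the 1,548 models in the catalog. The counts are copied from this page; the names `MODALITY_COUNTS` and `modality_shares` are illustrative only. A single model can support several modalities, so the shares are expected to sum to more than 100%.

```python
# Counts copied from the Modality Breakdown on this page.
TOTAL_MODELS = 1548
MODALITY_COUNTS = {
    "Vision (Image Input)": 1487,
    "Video Input": 167,
    "Audio Input": 118,
    "PDF Input": 141,
    "Image Output": 28,
    "Audio Output": 34,
    "Video Output": 4,
}


def modality_shares(counts, total):
    """Return each modality's share of the catalog as a percentage."""
    return {name: 100 * n / total for name, n in counts.items()}


shares = modality_shares(MODALITY_COUNTS, TOTAL_MODELS)

# Print the modalities from most to least common.
for name, pct in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<22} {pct:5.1f}%")
```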

🏆 Flagship Multimodal Models

The most capable multimodal models across all providers:

| Model | Provider | Context | Input | Output | Tool Call | Price (input/output, USD per 1M tokens) |
|---|---|---|---|---|---|---|
| gpt-4o | OpenAI | 128K | text, image | text | | $2.50/$10 |
| gpt-4.1 | OpenAI | 1M | text, image | text | | $2/$8 |
| claude-sonnet-4 | Anthropic | 200K | text, image | text | | $3/$15 |
| gemini-2.5-pro | Google | 1M | text, image, audio, video | text | | $1.25/$10 |
| gemini-2.5-flash | Google | 1M | text, image, audio, video | text | | $0.15/$0.60 |
| llama-4-maverick | Meta | 1M | text, image | text | | Varies |
| qwen3-235b-a22b | Alibaba | 128K | text, image | text | | Varies |
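
Because prices are quoted per 1M tokens, the cost of a single request is easier to compare once it is scaled down. The following is a minimal sketch that uses only the prices listed above; the function name `estimate_cost` and the 10,000-in / 1,000-out workload are assumptions made for illustration. How images, audio, and video convert into billable tokens differs by provider and is not modeled here.

```python
# Prices in USD per 1M tokens as (input, output), copied from the table above.
# Models listed as "Varies" are omitted because no fixed price is given.
PRICES_PER_MILLION = {
    "gpt-4o": (2.50, 10.00),
    "gpt-4.1": (2.00, 8.00),
    "claude-sonnet-4": (3.00, 15.00),
    "gemini-2.5-pro": (1.25, 10.00),
    "gemini-2.5-flash": (0.15, 0.60),
}


def estimate_cost(model, input_tokens, output_tokens):
    """Estimate the USD cost of one request from per-1M-token prices."""
    in_price, out_price = PRICES_PER_MILLION[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Hypothetical workload: 10,000 input tokens and 1,000 output tokens.
for model in PRICES_PER_MILLION:
    print(f"{model:<18} ${estimate_cost(model, 10_000, 1_000):.4f}")
```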

👁️ Vision Models (Image Input)

1,487 of the 1,548 models accept images as input alongside text, making vision the most common multimodal capability:

Best Vision Models by Use Case


🎤 Audio Input Models

118 models can process audio input — for transcription, voice analysis, and audio understanding:

| Model | Provider | Audio Capabilities | Context |
|---|---|---|---|
| gemini-2.5-pro | Google | Audio understanding + transcription | 1M |
| gemini-2.5-flash | Google | Audio understanding + transcription | 1M |
| gpt-4o-audio-preview | OpenAI | Audio input + output | 128K |
| claude-sonnet-4 | Anthropic | Audio transcription | 200K |

🖼️ Image Generation Models

28 models can generate images from text descriptions. This is a rapidly growing category:

| Model | Provider | Capabilities |
|---|---|---|
| gpt-image-1 | OpenAI | Text-to-image, image editing |
| dall-e-3 | OpenAI | Text-to-image generation |
| flux-1.1-pro | Black Forest Labs | High-quality text-to-image |
| stable-diffusion-3.5 | Stability AI | Open-weight text-to-image |


🔊 Audio Output Models

34 models can generate audio output — for text-to-speech, voice cloning, and audio generation:

Key Audio Output Models

🎬 Video Understanding Models

167 models can process video input — for video analysis, summarization, and content understanding:

Top Video Understanding Models

🤔 Choosing the Right Multimodal Model

Decision Framework
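
One simple way to narrow the choice is to filter by the input modalities and context window a task needs. The sketch below is an illustration, not the catalog's own decision framework: it encodes only the flagship table from this page, treats "128K", "200K", and "1M" as 128,000, 200,000, and 1,000,000 tokens, and uses the hypothetical helper name `models_supporting`.

```python
# Flagship models from this page: (input modalities, approximate context in tokens).
FLAGSHIP_MODELS = {
    "gpt-4o": ({"text", "image"}, 128_000),
    "gpt-4.1": ({"text", "image"}, 1_000_000),
    "claude-sonnet-4": ({"text", "image"}, 200_000),
    "gemini-2.5-pro": ({"text", "image", "audio", "video"}, 1_000_000),
    "gemini-2.5-flash": ({"text", "image", "audio", "video"}, 1_000_000),
    "llama-4-maverick": ({"text", "image"}, 1_000_000),
    "qwen3-235b-a22b": ({"text", "image"}, 128_000),
}


def models_supporting(required_inputs, min_context=0):
    """List models that accept every required input and meet the context floor."""
    required = set(required_inputs)
    return [
        name
        for name, (inputs, context) in FLAGSHIP_MODELS.items()
        if required <= inputs and context >= min_context
    ]


# Example queries: video input, and image input with a long context window.
print(models_supporting({"video"}))
print(models_supporting({"image"}, min_context=200_000))
```

Price and output modality can be added as further filters in the same way once the required inputs are settled.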
