A complete guide to small language models for edge deployment, mobile apps, and cost-efficient production. All data comes from AI Models Catalog (first-party data only).
Small Language Models (SLMs) are AI models with fewer than ~10 billion parameters, designed for efficiency, low latency, and deployment on resource-constrained hardware, from smartphones to edge servers. They offer a practical alternative to large frontier models when cost, speed, or privacy matters.
Key advantages of SLMs:
Best value SLMs for AI agents and tool-use workflows (first-party providers only):
| Model | Provider | Input ($/1M tokens) | Output ($/1M tokens) | Context (tokens) | Reasoning |
|---|---|---|---|---|---|
| ling-2.6-flash | ling | $0.01 | $0.03 | 262K | |
| klusterai--Meta-Llama-3.1-8B-Instruct-Turbo | klusterai | $0.015 | $0.02 | 131K | |
| granite-4.0-h-micro | ibm | $0.017 | $0.112 | 131K | |
| llama-3.1-8b-instruct--fp-16 | fireworks | $0.02 | $0.03 | 131K | |
| schematron-3b | fireworks | $0.02 | $0.05 | 131K | |
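The per-million-token prices in the table above translate into a workload cost with simple arithmetic: multiply each token count by its price and divide by one million. The following is a minimal sketch of that calculation, not part of the catalog itself. The `Pricing` class, the `estimate_cost` function, and the workload figures (one million calls per month, 2,000 input and 500 output tokens per call) are hypothetical names and numbers chosen for illustration; only the prices are taken from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Per-million-token prices in USD (hypothetical helper type)."""
    input_per_m: float
    output_per_m: float


# Prices copied from three rows of the table above.
PRICES = {
    "ling-2.6-flash": Pricing(0.01, 0.03),
    "granite-4.0-h-micro": Pricing(0.017, 0.112),
    "schematron-3b": Pricing(0.02, 0.05),
}


def estimate_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a workload with the given total token counts."""
    total = input_tokens * pricing.input_per_m + output_tokens * pricing.output_per_m
    return total / 1_000_000


# Assumed workload, for illustration only: 1M calls per month,
# each with 2,000 input tokens and 500 output tokens.
CALLS_PER_MONTH = 1_000_000
INPUT_TOKENS_PER_CALL = 2_000
OUTPUT_TOKENS_PER_CALL = 500

if __name__ == "__main__":
    for name, pricing in PRICES.items():
        monthly = estimate_cost(
            pricing,
            CALLS_PER_MONTH * INPUT_TOKENS_PER_CALL,
            CALLS_PER_MONTH * OUTPUT_TOKENS_PER_CALL,
        )
        print(f"{name}: ${monthly:,.2f} per month")
```

Under these assumed volumes, ling-2.6-flash works out to about $35 per month and granite-4.0-h-micro to about $90, which shows how a higher output price can dominate the bill even when input prices are close.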
48 small models are available at zero cost, well suited to prototyping and development:
| Model | Provider | Context | Tool Calling | Reasoning |
|---|---|---|---|---|
| deepseek-r1-distill-llama-8b | cerebras | 131K | ✓ | |
| llama-4-scout-17b-16e-instruct | cerebras | 131K | ✓ | |
| qwen-2.5-32b | cerebras | 131K | ✓ | |
| gemma-4-26b-a4b-it | auriko | 262K | ✓ | |
| glm-4.5-flash | auriko | 200K | ✓ | |
| glm-4.6v-flash | auriko | 128K | ✓ | |
| baidu--ernie-4.5-0.3b | aimlapi | 120K | ✓ | |
557 small models offer reasoning capabilities, ideal for math, logic, and step-by-step problem solving:
| Model | Provider | Input ($/1M tokens) | Output ($/1M tokens) | Context (tokens) | Tool Calling |
|---|---|---|---|---|---|
| qwen3.5-0.8b | qwen | $0.01 | $0.05 | 262K | |
| qwen3.5-2b | qwen | $0.02 | $0.10 | 262K | |
| qwen--qwen3-4b-fp8 | fireworks | $0.03 | $0.03 | 128K | |
| qwen3.5-4b | qwen | $0.03 | $0.15 | 262K | |
| deepseek-r1-distill-llama-8b | cerebras | Free | Free | 131K | |
ling-2.6-flash ($0.01/$0.03/M): cheapest tool-calling model, with 262K context. Well suited to high-volume agent workflows.
Qwen3.5 0.8B: ultra-compact reasoning model. Gemma 4 27B IT: free, with vision and tool calling.
bdc-coder ($0.01/$0.01/M): cheapest coding model. Qwen3 4B ($0.03/$0.15/M): open-source, with reasoning.
DeepSeek R1 Distill Llama 8B: free reasoning model. Qwen3.5 0.8B ($0.01/$0.05/M): cheapest reasoning model.
GPT-4.1-nano ($0.10/$0.40/M): fast, cheap, reliable. Qwen3 4B ($0.03/$0.15/M): open-source alternative.
| Factor | Small Model (SLM) | Large Model (LLM) |
|---|---|---|
| Cost per 1M tokens | $0.01 to $0.20 | $1 to $40 |
| Latency (first token) | 50 to 200 ms | 200 to 2000 ms |
| Deployment | On-device, edge, cloud | Cloud only |
| Privacy | Data stays on device | Data sent to cloud |
| Customization | Easy fine-tuning | Expensive fine-tuning |
| Complex reasoning | Good for simple tasks | Superior for complex tasks |
| Best for | High-volume, real-time, edge | Complex, nuanced, creative |