Llama 4 Pricing: What You Actually Pay
Llama 4 is Meta’s open-weight model family. The weights are free to download, with no per-token licensing fee and no usage quotas (Meta’s license terms still apply). Our calculator shows $0/$0 per million tokens because that is the cost of the model itself. But “free” hides the real expense: compute.
Whether you self-host or use an API provider, you’re paying for the hardware that runs inference. This guide breaks down what Llama 4 actually costs in practice, how to compare it against proprietary models, and where the cost savings are real versus theoretical.
Llama 4 Model Family
Meta released two models under the Llama 4 umbrella, both using a mixture-of-experts (MoE) architecture with 17 billion active parameters per forward pass:
- Llama 4 Scout — 109B total parameters, 16 experts, 10M token context window. The lighter model, designed for long-document understanding and retrieval tasks. Cheaper to self-host because it fits on fewer GPUs.
- Llama 4 Maverick — 400B total parameters, 128 experts, 1M token context window. The flagship model, competitive with GPT-5 and Claude on coding, math, and multilingual benchmarks. Requires more hardware but delivers higher quality.
The calculator above uses Maverick pricing by default since it’s the model most developers deploy for production workloads.
API Provider Pricing
You don’t need to self-host. Several providers offer Llama 4 Maverick as a pay-per-token API:
- Together AI — ~$0.27 input / $0.85 output per million tokens
- Fireworks AI — ~$0.22 input / $0.88 output per million tokens
- AWS Bedrock — ~$0.35 input / $1.05 output per million tokens
- Groq — competitive pricing with the fastest inference speeds
Prices change frequently, so check each provider for current rates. The key takeaway: hosted Llama 4 costs a fraction of proprietary models because the model license is free. You’re paying only for compute plus the provider’s margin.
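To make the per-token rates concrete, the sketch below turns them into a bill for a given workload. The rates are the approximate figures listed above, and the names `Rate`, `MAVERICK_RATES`, and `workload_cost`, as well as the example token volumes, are illustrative assumptions rather than any provider's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rate:
    """Price in USD per one million tokens."""
    input_per_m: float
    output_per_m: float


# Approximate Llama 4 Maverick rates quoted in this guide (they change often).
MAVERICK_RATES = {
    "Together AI": Rate(0.27, 0.85),
    "Fireworks AI": Rate(0.22, 0.88),
    "AWS Bedrock": Rate(0.35, 1.05),
}


def workload_cost(rate: Rate, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a workload at the given per-million-token rate."""
    total = input_tokens * rate.input_per_m + output_tokens * rate.output_per_m
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical monthly workload: 500M input tokens, 100M output tokens.
    for provider, rate in MAVERICK_RATES.items():
        cost = workload_cost(rate, 500_000_000, 100_000_000)
        print(f"{provider}: ${cost:,.2f}")
```

Input and output are priced separately, so the same total token count can cost quite different amounts depending on how output-heavy the workload is.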
Cost Comparison: Llama 4 vs Proprietary Models
At hosted API rates, Llama 4 Maverick costs roughly $0.20-1.00 per million tokens. Here’s how that compares:
- GPT-5 — $2.00-10.00 per 1M tokens depending on variant (5-10x more expensive)
- Claude Opus 4 — $15.00 input / $75.00 output per 1M tokens (the premium tier)
- Claude Sonnet 4 — $3.00 input / $15.00 output per 1M tokens
- Gemini 2.5 Pro — $1.25-10.00 per 1M tokens depending on context usage
For tasks where Llama 4 matches proprietary model quality (straightforward code generation, translation, summarization, structured extraction), the cost advantage is roughly 5-50x, and higher still against the premium tier. The gap narrows on complex reasoning where larger proprietary models still lead.
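The size of the multiple depends on the input/output mix, because output tokens are priced higher than input tokens. A minimal sketch of that calculation, using the prices quoted in this guide; the model keys, the function names, and the default 25% output share are assumptions chosen for illustration.

```python
# (input, output) prices in USD per 1M tokens, as quoted in this guide.
PRICES = {
    "llama-4-maverick": (0.27, 0.85),  # approximate hosted rate (Together AI)
    "claude-sonnet-4": (3.00, 15.00),
    "claude-opus-4": (15.00, 75.00),
}


def blended_price(model: str, output_share: float) -> float:
    """Average USD per 1M tokens when `output_share` of all tokens are output."""
    if not 0.0 <= output_share <= 1.0:
        raise ValueError("output_share must be between 0 and 1")
    input_price, output_price = PRICES[model]
    return (1.0 - output_share) * input_price + output_share * output_price


def cost_multiple(model: str, baseline: str = "llama-4-maverick",
                  output_share: float = 0.25) -> float:
    """How many times more expensive `model` is than `baseline` for the same mix."""
    return blended_price(model, output_share) / blended_price(baseline, output_share)


if __name__ == "__main__":
    for name in ("claude-sonnet-4", "claude-opus-4"):
        print(f"{name}: {cost_multiple(name):.1f}x the hosted Llama 4 price")
```

Changing `output_share` shows how sensitive the comparison is to workload shape.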
Self-Hosting: The Real Numbers
Running Llama 4 Maverick yourself means paying for GPU hours instead of per-token fees:
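- Hardware requirement: 4-8 A100 80GB GPUs (or equivalent H100/L40S capacity), assuming quantized weights. MoE routes each token through only 17B active parameters, which lowers compute per token, but all 400B parameters must stay resident in GPU memory, so you still need enough VRAM for the full model. At 16-bit precision the weights alone take roughly 800GB, more than eight 80GB cards can hold.
- Cloud GPU cost: $8-15/hour for an 8xA100 instance depending on provider and commitment length. Spot instances can drop this to $3-6/hour with interruption risk.
- Break-even point: at $10/hour, an always-on instance costs $240/day whether or not it is busy. At hosted rates of about $0.27-0.85 per million tokens, the same $240 buys roughly 280-890M tokens per day through an API, so self-hosting only comes out ahead above that volume, and only if the instance can sustain that throughput at the 80% utilization you are planning for. Below that, the idle-time penalty makes APIs cheaper.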
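The break-even reasoning above can be sketched as a short calculation: find the daily token volume at which the API bill equals the GPU bill, then check what sustained throughput that volume implies. The function names and the $0.50 blended API price in the example are assumptions; plug in your own rates.

```python
SECONDS_PER_DAY = 86_400


def break_even_tokens_per_day(gpu_hourly_usd: float, api_price_per_m_usd: float) -> float:
    """Daily token volume at which API spend equals an always-on GPU instance."""
    daily_gpu_cost = gpu_hourly_usd * 24
    return daily_gpu_cost / api_price_per_m_usd * 1_000_000


def required_tokens_per_second(tokens_per_day: float, utilization: float) -> float:
    """Throughput needed while busy if the instance is busy `utilization` of the day."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return tokens_per_day / (SECONDS_PER_DAY * utilization)


if __name__ == "__main__":
    # Assumed inputs: $10/hour instance, $0.50 blended API price per 1M tokens.
    volume = break_even_tokens_per_day(10.0, 0.50)
    speed = required_tokens_per_second(volume, utilization=0.8)
    print(f"Break-even volume: {volume / 1e6:.0f}M tokens/day")
    print(f"Needed throughput at 80% utilization: {speed:,.0f} tokens/s")
```

The second number matters as much as the first: if the hardware cannot deliver that throughput, the break-even volume is out of reach regardless of price.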
Reducing Llama 4 Costs
Three techniques that meaningfully lower your per-token spend:
- Quantization: running Maverick in FP8 or INT4 precision cuts weight memory roughly in half (FP8) or to a quarter (INT4) relative to 16-bit weights and increases throughput, typically with small quality loss (often reported as under 2% on benchmarks). Total memory savings are smaller because the KV cache and activations still need room. This lets you use fewer GPUs or serve more concurrent requests.
- Request batching: each decoding step reads the model weights from GPU memory once, so grouping multiple requests into that step spreads the cost across all of them. Most serving frameworks (vLLM, TGI, TensorRT-LLM) batch automatically.
- Prompt caching: if your application reuses a long system prompt or context prefix, caching the KV-cache for that prefix avoids recomputing it for every request. Several API providers now offer this as a built-in feature.
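To see why quantization changes the GPU count, here is a rough weight-memory estimate. It is a sketch under stated assumptions: it counts weights only (no KV cache or activations), treats 1 GB as 1e9 bytes, and assumes 90% of each GPU's memory is usable for weights. `usable_fraction` and the function names are hypothetical.

```python
import math

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gb(total_params_billion: float, precision: str) -> float:
    """Approximate weight memory in GB; excludes KV cache and activations."""
    return total_params_billion * BYTES_PER_PARAM[precision]


def min_gpus(total_params_billion: float, precision: str,
             gpu_memory_gb: float = 80.0, usable_fraction: float = 0.9) -> int:
    """Lower bound on GPUs needed just to hold the weights."""
    needed = weight_memory_gb(total_params_billion, precision)
    return math.ceil(needed / (gpu_memory_gb * usable_fraction))


if __name__ == "__main__":
    # Maverick: all 400B parameters must be resident, not just the 17B active ones.
    for precision in BYTES_PER_PARAM:
        gb = weight_memory_gb(400, precision)
        print(f"{precision}: ~{gb:.0f} GB of weights, at least {min_gpus(400, precision)} x 80GB GPUs")
```

Real deployments need headroom beyond this lower bound, and tensor-parallel setups often round up to a power of two.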
When to Choose Llama 4
Self-hosting or API-hosted Llama 4 makes sense when:
- Cost is the primary driver: Llama 4 is one of the cheapest frontier-quality models to run at scale
- Data must stay on your infrastructure (self-hosting only): no third-party data processing agreements needed
- You need to fine-tune: open weights let you fine-tune and deploy the result on your own hardware, and Meta’s license permits commercial use, whereas proprietary models offer at most hosted fine-tuning through the vendor’s API
- Predictable billing matters (self-hosting only): fixed GPU costs beat variable per-token charges for high-volume use
Self-hosting is a weaker fit for low-volume prototyping, where idle GPU hours add up and a pay-per-token API is cheaper. Llama 4 in general is a weaker fit for tasks where GPT-5/Claude significantly outperform open models (complex multi-step reasoning, highly specialized domains).
Use the calculator above to plug in your actual token volumes and see the cost difference across every model — including Llama 4 at both self-hosted and API rates.