How LLM API Pricing Works
If you’re building anything with AI, you’ve probably stared at a pricing page and thought “okay, but what will this actually cost me?” That’s what this calculator is for. Plug in your numbers, see what you’ll pay across every major provider, and stop guessing.
Most LLM providers use a token-based pricing model. Tokens aren’t characters or words — they’re chunks of text that the model processes internally. A rough rule of thumb: 1 token is about 3-4 characters in English, or roughly 0.75 words. But that varies by model and tokenizer.
The key thing to understand: input tokens and output tokens are priced differently. Input tokens (your prompt) are cheaper because the model just reads them. Output tokens (the model’s response) cost more because generation is computationally heavier. With most providers, output tokens run 2-5x the input price.
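The arithmetic is simple: per-request cost is input tokens times the input rate plus output tokens times the output rate, with rates quoted per million tokens. A minimal sketch (the prices here are illustrative, not any provider's real rates):

```python
# Per-request cost: tokens are billed per million, with separate
# input and output rates. Prices below are illustrative only.
def request_cost(input_tokens, output_tokens,
                 input_price_per_m, output_price_per_m):
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# e.g. a 2,000-token prompt with a 500-token reply,
# at $3/M input and $15/M output (a 5x output multiplier)
cost = request_cost(2_000, 500, 3.00, 15.00)
print(f"${cost:.4f}")  # $0.0135
```

Notice that even though the prompt is four times longer than the reply, the reply accounts for more than half the bill — that's the output multiplier at work.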
The Real Cost Breakdown
Here’s what actually hits your bill:
- Input tokens — everything you send: system prompts, user messages, conversation history, injected context (like RAG results)
- Output tokens — everything the model generates back
- Hidden multiplier — if you’re passing conversation history, those tokens get re-sent (and re-charged) every turn
That last point catches a lot of people. A chatbot that keeps 20 turns of history isn’t just paying for the latest message — it’s paying for all 20 turns as input every single time. This is where costs spiral.
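To see how history compounds, here's a rough sketch. It assumes each turn adds about 150 tokens of user-plus-assistant text on top of a 300-token system prompt — both numbers are made up for illustration:

```python
# Cumulative input tokens for a chatbot that re-sends its full
# history every turn. Token counts are illustrative assumptions.
def total_input_tokens(turns, tokens_per_turn=150, system_prompt=300):
    total = 0
    history = 0
    for _ in range(turns):
        total += system_prompt + history  # whole history re-sent as input
        history += tokens_per_turn        # this turn appended for next time
    return total

print(total_input_tokens(1))   # 300
print(total_input_tokens(20))  # 34500 — input grows quadratically with turns
```

One turn costs 300 input tokens; twenty turns cost 34,500 — over five times what you'd pay if each turn were billed in isolation. This is why long-running chats get expensive even when individual messages are short.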
Beyond Per-Token Pricing
Token costs aren’t the whole picture. Depending on your use case, watch for:
- Fine-tuning costs — training runs are billed per token at higher rates, plus you pay storage fees for custom models
- Embedding costs — if you’re doing RAG, you’re paying separately for embedding generation (usually cheap, but it adds up at scale)
- Rate limits — higher-throughput tiers are typically gated behind higher monthly spend, so scaling up can push you into a pricier usage tier
- Minimum spend — some enterprise tiers require committed spend
- Cached input discounts — Anthropic and OpenAI both discount input tokens covered by a cached prompt prefix, cutting those costs by 50-90% (though writing to the cache may carry a small premium)
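To put rough numbers on the cache discount: only the cached prefix is discounted, and fresh tokens still bill at full rate. The 90% discount and the $3/M price below are assumptions for illustration, not a specific provider's rates:

```python
# Input cost when part of the prompt hits a cached prefix.
# Price and discount are illustrative assumptions.
def input_cost(cached_tokens, fresh_tokens,
               price_per_m=3.00, cache_discount=0.90):
    cached = cached_tokens * price_per_m * (1 - cache_discount)
    fresh = fresh_tokens * price_per_m
    return (cached + fresh) / 1_000_000

# 10k-token system prompt cached, 500 fresh tokens per request
print(f"cached:   ${input_cost(10_000, 500):.4f}")   # $0.0045
print(f"uncached: ${input_cost(0, 10_500):.4f}")     # $0.0315
```

With a large, stable system prompt, the cached request here costs about a seventh of the uncached one — which is why caching is usually the first optimization worth reaching for.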
Tips for Keeping Costs Down
After working with these APIs for a while, here’s what actually moves the needle:
Pick the right model for the job. Don’t use GPT-5.4 or Claude Opus for tasks that Haiku or GPT-4o Mini can handle. Classification, extraction, and simple Q&A don’t need frontier models.
Trim your prompts. Every token in your system prompt gets charged on every request. Be concise. Use abbreviations in few-shot examples. Strip unnecessary formatting.
Use caching. If your system prompt or context doesn’t change between requests, prompt caching can cut input costs dramatically. Both Anthropic and OpenAI support this.
Set max_tokens wisely. Don’t set it to 4096 if you expect 200-token responses. While you only pay for tokens actually generated, a lower cap prevents runaway completions.
Batch when possible. Some providers offer batch APIs at 50% discounts for non-real-time workloads.
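To put numbers on the batch discount (50% is the commonly offered rate; the request volume and per-request cost below are illustrative assumptions):

```python
# Monthly cost of a non-real-time workload, with and without a
# 50% batch-API discount. All numbers are illustrative.
requests_per_month = 1_000_000
cost_per_request = 0.0135   # e.g. ~2k input + 500 output tokens at mid-tier rates
realtime = requests_per_month * cost_per_request
batched = realtime * 0.5    # 50% batch discount
print(f"real-time: ${realtime:,.0f}  batched: ${batched:,.0f}")
# real-time: $13,500  batched: $6,750
```

For overnight jobs like bulk classification or backfills, that's money saved for simply tolerating a delayed response.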
Comparing Pricing Models
The landscape breaks down roughly like this:
- Premium frontier (Claude Opus 4.6, GPT-5.4, Gemini 3) — highest capability, highest cost. Use for complex reasoning, coding, analysis.
- Mid-tier (Claude Sonnet, GPT-4o, Gemini 2.5 Pro, Grok 3) — great balance of quality and cost for production apps.
- Budget (Claude Haiku, GPT-4o Mini, Gemini 2.5 Flash, DeepSeek V3) — surprisingly capable for routine tasks at a fraction of the price.
- Open-weight (Llama 4 Maverick) — zero API cost if self-hosted, but you’re paying for GPU compute instead.
The right choice depends on your latency requirements, quality bar, and volume. Most production systems end up using 2-3 models: a cheap one for simple tasks, a mid-tier for most work, and a frontier model for the hard stuff.
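A minimal sketch of that tiered routing — the task labels, tier names, and model pairings here are placeholders, not a real API:

```python
# Route each task type to the cheapest tier that can handle it.
# Tier names are placeholders; map them to real models as needed.
ROUTES = {
    "classify": "budget",       # e.g. Haiku / GPT-4o Mini
    "extract": "budget",
    "summarize": "mid",         # e.g. Sonnet / GPT-4o
    "chat": "mid",
    "code_review": "frontier",  # e.g. Opus / frontier-class models
}

def pick_model(task, default="mid"):
    """Fall back to the mid-tier for unknown task types."""
    return ROUTES.get(task, default)

print(pick_model("classify"))  # budget
print(pick_model("planning"))  # mid (fallback)
```

The routing table is the cheap part; the real work is deciding which tasks genuinely need the frontier tier, which is an empirical question worth testing with your own evals.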
Use the calculator above to model your specific usage pattern. Enter your token counts per request, how many requests you expect, and see exactly what each provider will charge.