Claude Token Counting
Anthropic’s Claude models use a custom BPE tokenizer optimized for both natural language and code. If you’ve been building with the Claude API, you’ve probably noticed that Claude tokenizes text a bit more finely than GPT – roughly 3.5 characters per token for English text, versus roughly 4 for GPT.
That tighter tokenization isn’t necessarily a disadvantage. It means Claude captures more granular information per token, which can help with nuanced understanding. And Anthropic’s pricing reflects the different ratio, so you’re not paying more just because the token count is higher.
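Exact counts require a tokenizer (or Anthropic’s token-counting endpoint), but for budgeting, a character-based heuristic built on the ratios above is often close enough. A minimal sketch – `estimate_tokens` is an illustrative helper, not part of any SDK:

```python
import math

def estimate_tokens(text: str, chars_per_token: float = 3.5) -> int:
    """Back-of-envelope token estimate from character count.

    3.5 chars/token is a ballpark for Claude on English text;
    4.0 is the common ballpark for GPT-style tokenizers.
    """
    return math.ceil(len(text) / chars_per_token)

prompt = "Summarize the quarterly report in three bullet points."
claude_estimate = estimate_tokens(prompt)       # ~3.5 chars/token
gpt_estimate = estimate_tokens(prompt, 4.0)     # ~4 chars/token
```

As expected from the ratio, the Claude-style estimate comes out a little higher for the same text.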
Claude Model Lineup and Pricing
| Model | Context | Max Output | Input $/1M | Output $/1M |
|---|---|---|---|---|
| Claude Opus 4.6 | 200K | 32K | $15.00 | $75.00 |
| Claude Sonnet 4.6 | 200K | 16K | $3.00 | $15.00 |
| Claude Haiku 4.5 | 200K | 8K | $0.80 | $4.00 |
All three models share the same 200K context window, which makes them great for long-document tasks. The difference is in reasoning capability and output length. Opus 4.6 is Anthropic’s most capable model, while Haiku 4.5 is the speed-and-cost champion.
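The pricing table translates directly into a cost function. A quick sketch using the numbers above (the model ID strings here are illustrative keys, not official API identifiers):

```python
# $/1M tokens, taken from the pricing table above.
PRICES = {
    "claude-opus-4.6":   {"input": 15.00, "output": 75.00},
    "claude-sonnet-4.6": {"input": 3.00,  "output": 15.00},
    "claude-haiku-4.5":  {"input": 0.80,  "output": 4.00},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request at the listed per-million-token rates."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# e.g. a 10K-token prompt with a 1K-token reply on Sonnet:
print(round(request_cost("claude-sonnet-4.6", 10_000, 1_000), 4))  # → 0.045
```

Running the same request through all three models is a quick way to see the roughly 19x spread between Opus and Haiku.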
When to Use Each Claude Model
Opus 4.6 is the right call for complex reasoning, nuanced writing, code architecture decisions, and tasks where accuracy matters more than speed. It’s the most expensive option, but it’s also the most capable.
Sonnet 4.6 hits a sweet spot for most production workloads. It handles coding tasks, content generation, and data analysis at a fraction of Opus’s cost. If you’re not sure which to pick, start here.
Haiku 4.5 is your go-to for high-volume, latency-sensitive tasks: classification, extraction, summarization, and chatbot responses where you need fast turnaround at minimal cost.
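The guidance above can be encoded as a simple routing table. A sketch, assuming you classify tasks upstream – the task categories and model ID strings are illustrative:

```python
# Illustrative task-type → model routing, following the guidance above.
ROUTES = {
    "classification":    "claude-haiku-4.5",
    "extraction":        "claude-haiku-4.5",
    "summarization":     "claude-haiku-4.5",
    "coding":            "claude-sonnet-4.6",
    "content":           "claude-sonnet-4.6",
    "architecture":      "claude-opus-4.6",
    "complex-reasoning": "claude-opus-4.6",
}

def pick_model(task_type: str) -> str:
    # Default to Sonnet – the "start here" choice when you're not sure.
    return ROUTES.get(task_type, "claude-sonnet-4.6")
```

A static table like this is crude, but it makes the routing policy explicit and easy to audit as prices and models change.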
Tips for Claude API Cost Optimization
- Use prompt caching aggressively. If your system prompt is longer than a few hundred tokens, caching pays for itself almost immediately.
- Batch requests when possible. Anthropic’s Message Batches API offers 50% cost savings for non-time-sensitive workloads.
- Set max_tokens thoughtfully. Don’t request 4096 output tokens if your task only needs 200. max_tokens is a cap, not a target – you pay only for tokens actually generated – but a tight cap bounds your worst-case output spend and cuts off runaway generations before they get expensive.
- Use Haiku for filtering. Run a cheap Haiku call first to determine if a task needs Opus-level reasoning. Route only the hard cases to the expensive model.
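The filtering tip is worth quantifying. A back-of-envelope comparison, assuming 1,000 requests of 2K input / 500 output tokens each, with 20% of them genuinely needing Opus-level reasoning – all the workload numbers here are illustrative, while the prices come from the table above:

```python
# $/1M tokens, from the pricing table above.
OPUS  = {"input": 15.00, "output": 75.00}
HAIKU = {"input": 0.80,  "output": 4.00}

def cost(price: dict, input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1e6

requests, inp, out = 1_000, 2_000, 500
hard_fraction = 0.20   # illustrative: share of tasks that really need Opus
triage_out = 10        # Haiku's yes/no routing verdict is tiny

everything_to_opus = requests * cost(OPUS, inp, out)

triaged = (requests * cost(HAIKU, inp, triage_out)                    # cheap first pass on everything
           + requests * hard_fraction * cost(OPUS, inp, out)          # hard 20% escalated to Opus
           + requests * (1 - hard_fraction) * cost(HAIKU, inp, out))  # easy 80% answered by Haiku

print(f"${everything_to_opus:.2f} vs ${triaged:.2f}")
```

Under these assumptions, triage cuts the bill to roughly a quarter of the send-everything-to-Opus baseline – the extra Haiku pass is cheap enough that it pays for itself many times over.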