The first month an AI feature goes to production, the LLM bill is usually a pleasant surprise — small, because traffic is small. The third month, after a launch, it’s an unpleasant surprise. We’ve been called in more than once to fix a SaaS whose LLM costs were eating the entire margin on the AI feature.
The good news: there are four levers, and pulling all four typically cuts spend by 10x without a noticeable quality drop.

Lever 1: Route by difficulty (60-80% savings)
The single biggest lever. Most requests don’t need your most expensive model. Classify the request — cheaply, with a tiny model or a heuristic — and route easy requests to a small/fast model, hard ones to the big model.
Concretely: a customer-support bot might use a small model for “what are your hours?” and only escalate to the flagship model for “help me debug why my integration is failing.” If 70% of your traffic is easy, you just cut 70% of your model spend to a fraction.
Lever 2: Cache aggressively (30-50% savings)
LLM calls are deterministic enough to cache. Two layers:
- Exact-match cache. Identical prompt → cached response. Trivial to implement, surprisingly high hit rate for common queries.
- Semantic cache. Embed the prompt; if a previous prompt is within a similarity threshold, serve its cached response. Catches paraphrases.
Also: use the provider’s prompt caching (Anthropic, OpenAI both offer it) for the static parts of your prompt — the system prompt and few-shot examples are identical across calls and shouldn’t be re-billed at full rate every time.
Lever 3: Trim the context (20-40% savings)
Teams routinely send the entire conversation history, the full document, all the retrieved chunks — when the model only needs a fraction. You pay per token. Sending 8K tokens when 2K would do is a 4x overcharge on input.
- Summarize old conversation turns instead of sending them verbatim
- Send the top-3 retrieved chunks, not the top-20
- Strip boilerplate from documents before sending
- Use a smaller, focused system prompt — not a 2,000-word manifesto
Lever 4: Batch and stream (throughput + UX)
Not a direct cost cut, but it changes the economics:
- Batch async work.If a task isn’t interactive (nightly summarization, bulk classification), use the provider’s batch API — typically 50% cheaper for non-realtime work.
- Stream interactive responses.Streaming doesn’t reduce tokens, but it makes the feature feel fast, which means you can use a slightly smaller/cheaper model without users perceiving it as worse.
Measure before you optimize
Before pulling any lever, instrument: log tokens-in, tokens-out, model, latency and cost per request, tagged by feature. Most teams are shocked to find 80% of their spend comes from one feature, or from a handful of power users, or from an accidental loop re-calling the model. You can’t cut what you can’t see.
Set a budget guardrail
Put a hard per-user and per-org rate limit on LLM calls. Without one, a single runaway script (or abusive user) can run up thousands of dollars overnight. Treat LLM access like any metered resource: rate-limited, budgeted, alerted.
How we approach this
For AI features we ship via AI Software Development, model routing and caching go in from day one — not as a later optimization. We instrument cost-per-request per feature so the spend never surprises anyone, and we set budget guardrails before the feature ever sees production traffic.
Takeaways
- Route by difficulty — the biggest lever, 60-80% savings.
- Cache exact and semantic matches; use provider prompt caching.
- Trim context aggressively — you pay per token.
- Batch async work; stream interactive responses.
- Measure cost-per-feature first. Set budget guardrails always.







