Controlling LLM costs in production

The first month an AI feature goes to production, the LLM bill is usually a pleasant surprise — small, because traffic is small. The third month, after a launch, it’s an unpleasant surprise. We’ve been called in more than once to fix a SaaS whose LLM costs were eating the entire margin on the AI feature.

The good news: there are four levers, and pulling all four typically cuts spend by 10x without a noticeable quality drop.

Lever 1: Route by difficulty (60-80% savings)

The single biggest lever. Most requests don’t need your most expensive model. Classify the request — cheaply, with a tiny model or a heuristic — and route easy requests to a small/fast model, hard ones to the big model.

Concretely: a customer-support bot might use a small model for “what are your hours?” and only escalate to the flagship model for “help me debug why my integration is failing.” If 70% of your traffic is easy, those requests now cost a fraction of what they did on the flagship model.
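A minimal sketch of the routing step, assuming a keyword-and-length heuristic as the cheap classifier. The model names, keyword list, and word cutoff are hypothetical placeholders, not recommendations; a tiny classifier model could replace the heuristic without changing the shape of the code.

```python
from dataclasses import dataclass

# Hypothetical model identifiers; substitute the names your provider uses.
SMALL_MODEL = "small-fast-model"
LARGE_MODEL = "flagship-model"

# Assumed signals that a request needs deeper reasoning.
HARD_KEYWORDS = ("debug", "error", "failing", "integration", "stack trace")
MAX_EASY_WORDS = 40  # assumed cutoff; tune against your own traffic


@dataclass
class Route:
    model: str
    reason: str


def route_request(text: str) -> Route:
    """Pick a model using a cheap heuristic instead of a model call."""
    lowered = text.lower()
    for keyword in HARD_KEYWORDS:
        if keyword in lowered:
            return Route(LARGE_MODEL, f"matched keyword '{keyword}'")
    if len(lowered.split()) > MAX_EASY_WORDS:
        return Route(LARGE_MODEL, "long request")
    return Route(SMALL_MODEL, "short request with no hard signals")


if __name__ == "__main__":
    for question in ("What are your hours?",
                     "Help me debug why my integration is failing"):
        route = route_request(question)
        print(f"{question!r} -> {route.model} ({route.reason})")
```

Logging the reason next to the chosen model makes it easier to audit misroutes later and adjust the heuristic.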

Lever 2: Cache aggressively (30-50% savings)

Many LLM responses are repeatable enough to cache: the same question usually deserves the same answer. Two layers:

Exact-match cache. Identical prompt → cached response. Trivial to implement, surprisingly high hit rate for common queries.
Semantic cache. Embed the prompt; if a previous prompt is within a similarity threshold, serve its cached response. Catches paraphrases.
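The two layers can be sketched roughly as follows. This is an in-memory illustration only: `embed` is a toy bag-of-words stand-in for a real embedding model, and the class name and the 0.8 threshold are assumptions made for the example.

```python
import hashlib
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words stand-in for a real embedding model (assumption)."""
    return Counter(text.lower().replace("?", " ").split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TwoLayerCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold  # assumed similarity cutoff
        self.exact = {}             # sha256(prompt) -> response
        self.semantic = []          # list of (embedding, response)

    @staticmethod
    def _key(prompt: str) -> str:
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get(self, prompt: str):
        # Layer 1: exact match on the prompt hash.
        hit = self.exact.get(self._key(prompt))
        if hit is not None:
            return hit
        # Layer 2: closest previous prompt, if it clears the threshold.
        query = embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self.semantic:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, prompt: str, response: str) -> None:
        self.exact[self._key(prompt)] = response
        self.semantic.append((embed(prompt), response))
```

In a real system the exact-match key would likely also include the model name and sampling parameters, and the linear scan would be replaced by a vector index. A threshold set too low serves wrong answers to questions that only look similar, so it is worth tuning on logged traffic.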

Also: use the provider’s prompt caching (Anthropic, OpenAI both offer it) for the static parts of your prompt — the system prompt and few-shot examples are identical across calls and shouldn’t be re-billed at full rate every time.

Lever 3: Trim the context (20-40% savings)

Teams routinely send the entire conversation history, the full document, all the retrieved chunks — when the model only needs a fraction. You pay per token. Sending 8K tokens when 2K would do is a 4x overcharge on input.

Summarize old conversation turns instead of sending them verbatim
Send the top-3 retrieved chunks, not the top-20
Strip boilerplate from documents before sending
Use a smaller, focused system prompt — not a 2,000-word manifesto
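Two of the items above, sketched under simple assumptions: token counts are approximated at about four characters per token (a real tokenizer would be more accurate), and the function names are hypothetical. `split_history` separates the recent turns that fit a budget from the older ones, which then become candidates for summarization by a separate, cheaper call that is not shown here.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate assuming ~4 characters per token."""
    return max(1, len(text) // 4)


def split_history(turns: list[str], budget: int) -> tuple[list[str], list[str]]:
    """Return (older, recent): recent turns fit the budget, older ones do not."""
    recent, used = [], 0
    for turn in reversed(turns):  # walk from the newest turn backwards
        cost = estimate_tokens(turn)
        if used + cost > budget:
            break
        recent.append(turn)
        used += cost
    recent.reverse()  # restore chronological order
    older = turns[: len(turns) - len(recent)]
    return older, recent


def top_chunks(scored_chunks: list[tuple[float, str]], k: int = 3) -> list[str]:
    """Return the k highest-scoring retrieved chunks, best first."""
    ranked = sorted(scored_chunks, key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:k]]
```

Sending the recent turns verbatim plus a short summary of the older ones keeps the input size bounded no matter how long the conversation runs.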

Lever 4: Batch and stream (throughput + UX)

Streaming is not a direct cost cut, and batching applies only to non-interactive work, but both change the economics:

Batch async work. If a task isn’t interactive (nightly summarization, bulk classification), use the provider’s batch API, which is typically 50% cheaper for non-realtime work.
Stream interactive responses. Streaming doesn’t reduce tokens, but it makes the feature feel fast, which means you can use a slightly smaller/cheaper model without users perceiving it as worse.

Measure before you optimize

Before pulling any lever, instrument: log tokens-in, tokens-out, model, latency and cost per request, tagged by feature. Most teams are shocked to find 80% of their spend comes from one feature, or from a handful of power users, or from an accidental loop re-calling the model. You can’t cut what you can’t see.
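One possible shape for that instrumentation, as a sketch: one record per call, with cost derived from token counts and a price table. The prices and model names below are made-up placeholders, and `CallRecord` and `spend_by` are hypothetical names used only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens; replace with your provider's rates.
PRICES = {
    "small-fast-model": {"in": 0.50, "out": 1.50},
    "flagship-model": {"in": 5.00, "out": 15.00},
}


@dataclass
class CallRecord:
    feature: str
    user_id: str
    model: str
    tokens_in: int
    tokens_out: int
    latency_ms: float

    @property
    def cost_usd(self) -> float:
        price = PRICES[self.model]
        total = self.tokens_in * price["in"] + self.tokens_out * price["out"]
        return total / 1_000_000


def spend_by(records: list[CallRecord], field: str) -> dict[str, float]:
    """Total cost grouped by a record field, largest spender first."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        totals[getattr(record, field)] += record.cost_usd
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))
```

Calling `spend_by(records, "feature")` and `spend_by(records, "user_id")` on a day of logs is usually enough to show whether spend is concentrated in one feature or a few users.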

Set a budget guardrail

Put a hard per-user and per-org rate limit on LLM calls. Without one, a single runaway script (or abusive user) can run up thousands of dollars overnight. Treat LLM access like any metered resource: rate-limited, budgeted, alerted.

How we approach this

For AI features we ship via AI Software Development, model routing and caching go in from day one — not as a later optimization. We instrument cost-per-request per feature so the spend never surprises anyone, and we set budget guardrails before the feature ever sees production traffic.

Takeaways

Route by difficulty — the biggest lever, 60-80% savings.
Cache exact and semantic matches; use provider prompt caching.
Trim context aggressively — you pay per token.
Batch async work; stream interactive responses.
Measure cost-per-feature first. Set budget guardrails always.
