Dezen Technology
AI · May 25, 2026 · 7 min read

Controlling LLM costs in production

Four levers cut spend 10x without cutting quality: route by difficulty, cache, trim context, batch and stream. Measure cost-per-feature first; set budget guardrails always.


The first month an AI feature goes to production, the LLM bill is usually a pleasant surprise — small, because traffic is small. The third month, after a launch, it’s an unpleasant surprise. We’ve been called in more than once to fix a SaaS whose LLM costs were eating the entire margin on the AI feature.

The good news: there are four levers, and pulling all four typically cuts spend by 10x without a noticeable quality drop.

[Figure: the four LLM cost levers: route by difficulty, cache, trim context, batch and stream]

Lever 1: Route by difficulty (60-80% savings)

The single biggest lever. Most requests don’t need your most expensive model. Classify the request — cheaply, with a tiny model or a heuristic — and route easy requests to a small/fast model, hard ones to the big model.

Concretely: a customer-support bot might use a small model for “what are your hours?” and escalate to the flagship model only for “help me debug why my integration is failing.” If 70% of your traffic is easy, 70% of your requests now cost a fraction of what they did on the flagship model.
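A minimal sketch of the routing idea, assuming a keyword-and-length heuristic as the cheap classifier. The model names, the hint list and the word threshold are all hypothetical placeholders, not values from any provider; in practice a tiny classifier model could replace `classify_difficulty`.

```python
# Hypothetical model identifiers and hint words; substitute your own.
SMALL_MODEL = "small-fast-model"
LARGE_MODEL = "flagship-model"
HARD_HINTS = ("debug", "error", "failing", "integration", "stack trace")


def classify_difficulty(message: str, max_easy_words: int = 30) -> str:
    """Cheap heuristic: long messages or troubleshooting words count as hard."""
    text = message.lower()
    if len(text.split()) > max_easy_words:
        return "hard"
    if any(hint in text for hint in HARD_HINTS):
        return "hard"
    return "easy"


def pick_model(message: str) -> str:
    """Route easy requests to the small model, hard ones to the flagship."""
    if classify_difficulty(message) == "hard":
        return LARGE_MODEL
    return SMALL_MODEL


def blended_cost_ratio(easy_share: float, small_to_large_price: float) -> float:
    """Cost after routing, as a fraction of sending everything to the flagship.

    easy_share: fraction of requests routed to the small model (0..1).
    small_to_large_price: small-model price divided by flagship price (assumed).
    """
    return easy_share * small_to_large_price + (1.0 - easy_share)
```

With 70% easy traffic and a small model priced at one tenth of the flagship (an assumed ratio, and assuming easy and hard requests use similar token counts), `blended_cost_ratio(0.7, 0.1)` comes to about 0.37, which is roughly a 63% saving. Misrouted hard requests are the main risk, so it is worth logging the chosen route and sampling the small model's answers for quality.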

Lever 2: Cache aggressively (30-50% savings)

Many LLM requests repeat, and their answers are stable enough to cache. Two layers:

  • Exact-match cache. Identical prompt → cached response. Trivial to implement, surprisingly high hit rate for common queries.
  • Semantic cache. Embed the prompt; if a previous prompt is within a similarity threshold, serve its cached response. Catches paraphrases.

Also: use the provider’s prompt caching (Anthropic, OpenAI both offer it) for the static parts of your prompt — the system prompt and few-shot examples are identical across calls and shouldn’t be re-billed at full rate every time.
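The exact-match and semantic layers can be sketched together as one in-memory cache. This is an illustration under assumptions: `embed` is a hypothetical callable you supply (any function that maps text to a vector), the 0.9 threshold is arbitrary, and the linear scan over stored vectors would be replaced by a vector index at real traffic volumes.

```python
import hashlib
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TwoLayerCache:
    """Exact-match lookup first, then a semantic lookup within the same model."""

    def __init__(self, embed, threshold=0.9):
        self.embed = embed          # hypothetical: text -> list of floats
        self.threshold = threshold  # assumed value; tune against real traffic
        self.exact = {}             # sha256(model + prompt) -> response
        self.semantic = {}          # model -> list of (vector, response)

    @staticmethod
    def _key(model, prompt):
        return hashlib.sha256(f"{model}\n{prompt}".encode("utf-8")).hexdigest()

    def get(self, model, prompt):
        """Return a cached response, or None on a miss."""
        hit = self.exact.get(self._key(model, prompt))
        if hit is not None:
            return hit
        query = self.embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self.semantic.get(model, []):
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        if best_score >= self.threshold:
            return best_response
        return None

    def put(self, model, prompt, response):
        """Store a response in both layers."""
        self.exact[self._key(model, prompt)] = response
        self.semantic.setdefault(model, []).append((self.embed(prompt), response))
```

A threshold that is too loose serves the wrong answer to a question that merely looks similar, so the semantic layer is usually safest on features where answers do not depend on user-specific data. Neither layer here expires entries; a production cache would need a TTL and invalidation when the underlying content changes.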

Lever 3: Trim the context (20-40% savings)

Teams routinely send the entire conversation history, the full document, all the retrieved chunks — when the model only needs a fraction. You pay per token. Sending 8K tokens when 2K would do is a 4x overcharge on input.

  • Summarize old conversation turns instead of sending them verbatim
  • Send the top-3 retrieved chunks, not the top-20
  • Strip boilerplate from documents before sending
  • Use a smaller, focused system prompt — not a 2,000-word manifesto
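The first two bullets can be sketched as a single trimming step. Everything here is illustrative: `summarize` is a hypothetical callable (for example, a call to a small model) and the defaults of four recent turns and three chunks are assumptions to tune per feature.

```python
def trim_context(turns, chunks, summarize, keep_turns=4, top_k=3):
    """Build a trimmed list of context parts for one request.

    turns: conversation turns as strings, oldest first.
    chunks: (relevance_score, text) pairs from retrieval; higher is better.
    summarize: hypothetical callable that condenses a list of turns to a string.
    """
    cut = max(0, len(turns) - keep_turns)
    older, recent = turns[:cut], turns[cut:]

    parts = []
    if older:
        # Replace old turns with one summary instead of sending them verbatim.
        parts.append("Summary of earlier conversation: " + summarize(older))
    parts.extend(recent)

    # Keep only the most relevant retrieved chunks.
    best = sorted(chunks, key=lambda chunk: chunk[0], reverse=True)[:top_k]
    parts.extend(text for _, text in best)
    return parts
```

For a ten-turn conversation with twenty retrieved chunks, this sends one summary, four recent turns and three chunks instead of thirty items. The summary itself costs a model call, so it pays off when it is cached and reused across later turns rather than regenerated on every request.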

Lever 4: Batch and stream (throughput + UX)

Streaming is not a direct cost cut, and batching only applies to work that can wait, but both change the economics:

  • Batch async work. If a task isn’t interactive (nightly summarization, bulk classification), use the provider’s batch API, which is typically 50% cheaper for non-realtime work.
  • Stream interactive responses. Streaming doesn’t reduce tokens, but it makes the feature feel fast, which means you can use a slightly smaller/cheaper model without users perceiving it as worse.

Measure before you optimize

Before pulling any lever, instrument: log tokens-in, tokens-out, model, latency and cost per request, tagged by feature. Most teams are shocked to find 80% of their spend comes from one feature, or from a handful of power users, or from an accidental loop re-calling the model. You can’t cut what you can’t see.
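A minimal sketch of that instrumentation, assuming each call is logged as a record and costs are computed from a price table. The model names and the per-million-token prices below are hypothetical; use your provider's current price list.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens: (input, output).
PRICES = {
    "small-fast-model": (0.50, 1.50),
    "flagship-model": (5.00, 15.00),
}


@dataclass
class CallRecord:
    feature: str
    user_id: str
    model: str
    tokens_in: int
    tokens_out: int
    latency_ms: float

    @property
    def cost_usd(self) -> float:
        price_in, price_out = PRICES[self.model]
        return (self.tokens_in * price_in + self.tokens_out * price_out) / 1_000_000


def spend_by(records, field):
    """Total cost grouped by a record field, largest spender first."""
    totals = defaultdict(float)
    for record in records:
        totals[getattr(record, field)] += record.cost_usd
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))
```

Calling `spend_by(records, "feature")` and `spend_by(records, "user_id")` on a day of logs is usually enough to show whether spend is concentrated in one feature or a few heavy users.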

Set a budget guardrail

Put a hard per-user and per-org rate limit on LLM calls. Without one, a single runaway script (or abusive user) can run up thousands of dollars overnight. Treat LLM access like any metered resource: rate-limited, budgeted, alerted.
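One way to sketch such a guardrail is a daily spend cap per key, where the key is a user ID or an org ID. This is an assumption-laden illustration: it caps dollars rather than request counts, keeps state in process memory, and the class and method names are hypothetical. A real deployment would keep the counters in a shared store and add alerting.

```python
import time


class BudgetGuard:
    """Hard daily spend cap per key (for example a user ID or an org ID)."""

    def __init__(self, daily_limit_usd, clock=time.time):
        self.daily_limit_usd = daily_limit_usd
        self.clock = clock  # injectable so the day rollover can be tested
        self.spent = {}     # (key, day number) -> USD spent that day

    def _bucket(self, key):
        # Days are counted in UTC from the Unix epoch.
        return (key, int(self.clock() // 86_400))

    def allow(self, key, estimated_cost_usd):
        """True if the call fits within today's remaining budget for this key."""
        spent = self.spent.get(self._bucket(key), 0.0)
        return spent + estimated_cost_usd <= self.daily_limit_usd

    def record(self, key, cost_usd):
        """Add the actual cost of a completed call to today's total."""
        bucket = self._bucket(key)
        self.spent[bucket] = self.spent.get(bucket, 0.0) + cost_usd
```

Using two instances, one keyed by user and one keyed by org, gives both limits: a request goes ahead only if both `allow` checks pass. Old day buckets are never deleted in this sketch, so a long-running process would need to prune them.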

How we approach this

For AI features we ship via AI Software Development, model routing and caching go in from day one — not as a later optimization. We instrument cost-per-request per feature so the spend never surprises anyone, and we set budget guardrails before the feature ever sees production traffic.

Takeaways

  • Route by difficulty — the biggest lever, 60-80% savings.
  • Cache exact and semantic matches; use provider prompt caching.
  • Trim context aggressively — you pay per token.
  • Batch async work; stream interactive responses.
  • Measure cost-per-feature first. Set budget guardrails always.