The Token Audit · Maximand

The score

The Token Efficiency Score: 0 to 100.

We build a spend cube from your billing and observability data, reconcile it to the invoice, and score each of the twelve levers from the evidence in your data rather than from what anyone reports. The result is a single number, a band from Tokenmaxxer to Yield-optimised, and an overlap-adjusted estimate of recoverable spend. The paid audit replaces every benchmark default with a measured value from your own data.
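The page does not spell out how the overlap adjustment is computed. As a rough sketch of why one is needed, the snippet below assumes one simple approach: each lever's fractional saving applies only to the spend left after the previous levers, so savings compound rather than add. The function name, the spend figure, and the per-lever fractions are all hypothetical.

```python
def overlap_adjusted_savings(baseline_spend: float, lever_savings: list[float]) -> float:
    """Apply each lever's fractional saving to the spend remaining after the earlier levers.

    Assumption for illustration only: levers overlap fully, so each one can
    only save a share of what the previous ones left behind.
    """
    remaining = baseline_spend
    for fraction in lever_savings:
        if not 0.0 <= fraction <= 1.0:
            raise ValueError("each saving must be a fraction between 0 and 1")
        remaining *= 1.0 - fraction
    return baseline_spend - remaining


baseline = 100_000.0             # hypothetical monthly spend in dollars
fractions = [0.30, 0.20, 0.10]   # hypothetical per-lever savings

naive = baseline * sum(fractions)                         # simple sum double-counts
adjusted = overlap_adjusted_savings(baseline, fractions)  # compounding does not

print(f"naive sum: {naive:,.0f}  overlap-adjusted: {adjusted:,.0f}")
```

With these made-up inputs the simple sum overstates the recoverable spend, which is the reason to report an adjusted figure.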

<30 · Tokenmaxxer (unmanaged)

30–49 · Reactive

50–69 · Managed

70–100 · Disciplined to Yield-optimised
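The bands can be read as a simple lookup. The sketch below assumes half-open boundaries for fractional scores (for example, 49.5 counts as Reactive); the page only lists whole-number ranges, and the function name is hypothetical.

```python
def band_for(score: float) -> str:
    """Map a Token Efficiency Score (0 to 100) to its band.

    Assumption: fractional scores fall into the band below the next
    whole-number threshold; the source only lists integer ranges.
    """
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score < 30:
        return "Tokenmaxxer (unmanaged)"
    if score < 50:
        return "Reactive"
    if score < 70:
        return "Managed"
    return "Disciplined to Yield-optimised"


examples = {s: band_for(s) for s in (12, 30, 69, 70, 100)}
```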

The taxonomy

Twelve levers, in four tiers, worked in order. The early tiers come first because you cannot trust a saving you cannot yet measure or contain.

Tier 0 · Stop the bleeding · visibility & control

visibility · Attribute every dollar to a team, app, and model in near real time. The precondition for everything else.

caps · Enforced spend caps and token budgets at the key, agent, and seat level, with spike alerts.

zombies · Detect and kill idle agents, abandoned apps, runaway retry loops, and non-prod traffic at prod prices.

Tier 1 · Right-size · the largest dollar levers

routing · Route each task to the cheapest model that clears a quality gate, instead of defaulting to a frontier model.

caching · Cache repeated context: system prompts, boilerplate, retrieved documents.

batch · Move work with no real-time deadline to a Batch API at roughly half the price.

arbitrage · Match the pricing model, seat versus metered, to each usage pattern.

Tier 2 · Engineer the waste out · compounds across calls

hygiene · Trim prompts and context, cap output length, eliminate reflexive retries.

semantic · De-duplicate near-identical queries with a semantic cache.

distill · Fine-tune or distill a small model for narrow, high-volume tasks.

Tier 3 · Strategic · whether the spend earns its keep

triage · Measure ROI per use case and cut or redeploy spend that returns nothing.

finops · A standing FinOps-for-AI operating model so savings do not creep back.

How the levers actually work

The mechanics behind the biggest levers.

Routing, behind an eval gate

We do not swap models on a hunch. Each workload gets a golden set and a pass-rate floor; a cheaper model is promoted only after it clears that floor offline and then holds it in a live A/B holdout. For workloads with a hard tail, a confidence-gated cascade answers with a cheap model first and escalates only the difficult cases. Routing stays inside your approved, security-vetted model set.
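The two mechanisms above, the pass-rate floor and the confidence-gated cascade, can be sketched in a few lines. This is an illustration under stated assumptions, not the audit's implementation: the `Answer` type, the stub models, the confidence values, and the 0.8 threshold are all hypothetical, and a real confidence signal would come from the workload itself.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Answer:
    text: str
    confidence: float  # 0.0 to 1.0; how it is scored is workload-specific


Model = Callable[[str], Answer]


def clears_floor(passed: list[bool], floor: float) -> bool:
    """True if the pass rate on a golden set meets the floor (empty sets never pass)."""
    return bool(passed) and sum(passed) / len(passed) >= floor


def cascade(prompt: str, cheap: Model, frontier: Model, threshold: float = 0.8) -> tuple[Answer, str]:
    """Answer with the cheap model first; escalate only when its confidence is low."""
    first = cheap(prompt)
    if first.confidence >= threshold:
        return first, "cheap"
    return frontier(prompt), "frontier"


# Stub models standing in for real endpoints (hypothetical behaviour:
# the cheap stub is confident only on short prompts).
def cheap_stub(prompt: str) -> Answer:
    return Answer("short answer", 0.95 if len(prompt) < 40 else 0.40)


def frontier_stub(prompt: str) -> Answer:
    return Answer("careful answer", 0.99)


easy = cascade("Reset my password", cheap_stub, frontier_stub)
hard = cascade(
    "Reconcile these three conflicting contract clauses and cite each",
    cheap_stub,
    frontier_stub,
)
```

In this sketch the short request stays on the cheap model and the long one escalates. `clears_floor` is the offline half of the gate; the live A/B holdout described above would still be needed before promotion.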

Caching, at the prefix

A cache read costs roughly a tenth of the price of a fresh input token, and a cache write costs about a quarter more than a fresh input token, so any context resent more than twice pays for itself. The detail that matters: cacheable content has to sit at the prompt prefix. We therefore restructure prompts to put the stable system prompt and retrieved context first and the variable query last, which maximizes both the cacheable prefix and the hit rate.
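The arithmetic and the ordering rule can both be checked with a short sketch. It assumes the rough multipliers quoted above (0.10 for a read, 1.25 for a write, relative to a fresh input token) and assumes every resend hits the cache; real pricing and cache expiry vary by provider. The prompt segments and helper names are hypothetical.

```python
# Assumed price multipliers relative to one fresh (uncached) input token.
READ_MULT = 0.10
WRITE_MULT = 1.25


def uncached_cost(sends: int) -> float:
    """Cost of sending the same prefix `sends` times without caching, in prefix units."""
    return float(sends)


def cached_cost(sends: int) -> float:
    """One cache write on the first send, then cache reads (assumes no misses or expiry)."""
    if sends <= 0:
        return 0.0
    return WRITE_MULT + READ_MULT * (sends - 1)


def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Number of leading segments two prompts have in common."""
    count = 0
    for left, right in zip(a, b):
        if left != right:
            break
        count += 1
    return count


SYSTEM = "You are a support assistant."   # hypothetical stable system prompt
DOCS = "<retrieved policy documents>"     # hypothetical stable retrieved context


def stable_first(query: str) -> list[str]:
    return [SYSTEM, DOCS, query]   # stable content first, variable query last


def query_first(query: str) -> list[str]:
    return [query, SYSTEM, DOCS]   # variable query first breaks the shared prefix


costs = {n: (uncached_cost(n), cached_cost(n)) for n in (1, 2, 3)}
good_overlap = shared_prefix_len(stable_first("Where is my order?"), stable_first("Cancel my plan"))
bad_overlap = shared_prefix_len(query_first("Where is my order?"), query_first("Cancel my plan"))
```

Under these assumed ratios a single send is more expensive with caching (the write premium), and the cached path is already cheaper by the second send. The two overlap values show why ordering matters: with the query last, two different requests still share the system prompt and documents as a prefix; with the query first, they share nothing.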

Distillation, for narrow high-volume tasks

For a stable, high-volume task we collect labeled traces from the frontier model, fine-tune a small approved model, evaluate it against a held-out golden set to a quality floor you set, take it through your model-risk process, and deploy it behind the same eval gate as routing.

Verified by holdout, not by argument

Quality-sensitive changes are proven with an A/B holdout: the optimized path runs against a concurrent control on comparable traffic, so the saving is the measured difference between the two arms. Everything is computed on your own invoices and observability data.
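As a minimal sketch of "the measured difference between the two arms", the snippet below compares mean cost per request in a control arm and an optimized arm. The per-request costs are made up for illustration, and the function name is hypothetical; a real readout would also check the quality floor and the comparability of traffic.

```python
from statistics import mean


def holdout_saving(control_costs: list[float], treatment_costs: list[float]) -> float:
    """Fractional saving of the optimized arm relative to the concurrent control arm."""
    control = mean(control_costs)
    treatment = mean(treatment_costs)
    return (control - treatment) / control


# Hypothetical per-request costs in dollars for each arm.
control_arm = [0.040, 0.050, 0.060]
optimized_arm = [0.020, 0.030, 0.025]

saving = holdout_saving(control_arm, optimized_arm)
print(f"measured saving: {saving:.0%}")
```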

What access we need

We run on usage metadata, not prompt content. The spend cube is built from your billing exports and observability data: per request, the model, input and output token counts, cached-token counts, latency, retries, and team and application tags. Redacted data is fine. We prefer read-only access inside your environment. Prompt and response content is needed only for a specific use-case deep-dive, and only with your sign-off.
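To make the metadata list concrete, here is a minimal sketch of a per-request record and a spend cube keyed by team, application, and model. The field names, the `cost_usd` column (assumed to be joined in from the billing export), the sample values, and the 1% reconciliation tolerance are all assumptions for illustration. No prompt or response content appears in the record.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestRecord:
    team: str
    app: str
    model: str
    input_tokens: int
    output_tokens: int
    cached_tokens: int
    latency_ms: float
    retries: int
    cost_usd: float  # assumption: billed cost joined from the billing export


def build_spend_cube(records: list[RequestRecord]) -> dict[tuple[str, str, str], float]:
    """Sum billed cost per (team, app, model) cell."""
    cube: dict[tuple[str, str, str], float] = defaultdict(float)
    for record in records:
        cube[(record.team, record.app, record.model)] += record.cost_usd
    return dict(cube)


def reconciles(cube: dict[tuple[str, str, str], float], invoice_total: float, tolerance: float = 0.01) -> bool:
    """True if the cube total is within a relative tolerance of the invoice."""
    return abs(sum(cube.values()) - invoice_total) <= tolerance * invoice_total


# Hypothetical sample rows; team, app, and model names are placeholders.
records = [
    RequestRecord("search", "assistant", "model-large", 1200, 300, 800, 950.0, 0, 0.021),
    RequestRecord("search", "assistant", "model-large", 900, 250, 600, 870.0, 1, 0.016),
    RequestRecord("support", "triage-bot", "model-small", 400, 80, 0, 310.0, 0, 0.002),
]

cube = build_spend_cube(records)
```

Calling `reconciles(cube, invoice_total)` against the invoice figure is the check that the attribution accounts for the whole bill before any lever is scored.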

The engagement

Five stages, two to four weeks.


Scope & instrument

Define the perimeter and access, then build the spend cube and reconcile it to the invoice. Attribution is the visibility lever in action.

Diagnose

Score all twelve levers from your own data, with at least one quantified finding per lever.

Quantify

Set the baseline, compute the score and the overlap-adjusted savings, and flip at least one lever live so the estimate becomes an observed delta.

Present

A readout, a written report, a tier-sequenced roadmap, and an implementation proposal priced on verified savings.

Then · Implement & verify

Remediate in tier order, verify on your own ledger against a quality floor, sign off monthly.

Always · Govern

Stand up FinOps for AI so the savings stay won.

What your AI estate is actually wasting.