The Token-Efficiency Standard v0.1

Seen in practice. The published case study is a conformant TES-Verified result. The fund scored 26 (Tokenmaxxer) on the score in section 4 and moved to 71 (Disciplined); the $8.7M saving was booked under the verification protocol in section 5, computed on the fund's own invoices, with a quality floor its researchers signed off on.

1. What this specifies

The Token-Efficiency Standard (TES) is the normative specification for diagnosing and reducing an enterprise's generative-AI cost while holding output quality. It governs the delivery method: the taxonomy of waste, the scoring, the verification, and the conformance levels a result must meet before it can be called verified. Conformance keywords (MUST, MUST NOT, SHOULD, MAY) are used in the usual sense.

In scope: metered LLM inference and the per-seat AI subscriptions adjacent to it, for an enterprise buyer. Non-goals: the TES does not rank model vendors, does not promise a fixed savings percentage, and does not claim to improve model quality beyond holding a pre-agreed floor. It is a cost-and-yield protocol, not a model-evaluation one.

2. The claims that would refute it

A standard that cannot be wrong is not a standard. The TES rests on four claims, each stated so it can be disproven. The fourth is a self-correcting loop: the method is wired to be revised by its own accumulating evidence.

C1 · Recoverability. An estate scoring in the lower bands carries materially recoverable unit cost.

Refuted if a conformant audit of a low-scoring estate repeatedly fails to identify and verify recoverable unit cost above a stated threshold.

C2 · Recovery without quality loss. That cost can be recovered while holding the quality floor.

Refuted if conformant remediation cannot realize the savings while in-scope quality stays at or above its agreed floor.

C3 · Counterfactual reality. The saving is caused by the intervention, not by provider price cuts or demand swings.

Refuted if an A/B holdout shows no unit-cost delta between the optimized and control arms.

C4 · Calibration. The Standard's benchmark rates predict measured rates within tolerance.

Tested in every engagement by comparing each lever's benchmark rate against its measured rate in the engagement record. Persistent divergence on a lever requires a new version that updates that benchmark.
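The C4 check can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Standard's own tooling: the tolerance of 0.05, the minimum of three engagements, and the reading of "persistent" as "every recorded engagement missed tolerance" are all hypothetical, as are the names LeverObservation and levers_needing_revision and the sample rates.

```python
from dataclasses import dataclass


@dataclass
class LeverObservation:
    lever: str
    benchmark_rate: float  # savings rate the benchmark predicts
    measured_rate: float   # savings rate measured in one engagement


def within_tolerance(obs: LeverObservation, tolerance: float) -> bool:
    """True when the measured rate is within +/- tolerance of the benchmark."""
    return abs(obs.measured_rate - obs.benchmark_rate) <= tolerance


def levers_needing_revision(record, tolerance=0.05, min_engagements=3):
    """Return levers whose measured rates missed tolerance in every recorded
    engagement, given at least min_engagements observations.

    ASSUMPTION: the tolerance, the minimum count, and this definition of
    'persistent divergence' are illustrative, not taken from the Standard.
    """
    totals, misses = {}, {}
    for obs in record:
        totals[obs.lever] = totals.get(obs.lever, 0) + 1
        if not within_tolerance(obs, tolerance):
            misses[obs.lever] = misses.get(obs.lever, 0) + 1
    return sorted(
        lever for lever, n in totals.items()
        if n >= min_engagements and misses.get(lever, 0) == n
    )


# Hypothetical engagement record: caching misses every time, routing once.
record = [
    LeverObservation("caching", 0.30, 0.18),
    LeverObservation("caching", 0.30, 0.20),
    LeverObservation("caching", 0.30, 0.15),
    LeverObservation("routing", 0.25, 0.27),
    LeverObservation("routing", 0.25, 0.24),
    LeverObservation("routing", 0.25, 0.40),
]
print(levers_needing_revision(record))
```

Under these assumptions only caching would be flagged for a benchmark update; a single out-of-tolerance routing result is treated as noise rather than persistent divergence.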

3. The taxonomy: twelve levers, four tiers

A conformant diagnosis MUST assess all twelve levers and MUST NOT rename or omit them. They are worked in tier order, because a saving you cannot attribute or contain cannot be trusted.

Tier 0, stop the bleeding: visibility, caps, zombies.
Tier 1, right-size: routing, caching, batch, arbitrage.
Tier 2, engineer the waste out: hygiene, semantic, distill.
Tier 3, strategic: triage, finops.
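One way to enforce the "all twelve, no renaming" rule is a fixed lookup that a diagnosis is checked against. The sketch below uses the lever names exactly as listed above; the names TIERS, LEVERS_IN_TIER_ORDER, and check_diagnosis are hypothetical and not part of the Standard.

```python
# Lever names as listed in the four tiers above, keyed by tier number.
TIERS = {
    0: ("visibility", "caps", "zombies"),
    1: ("routing", "caching", "batch", "arbitrage"),
    2: ("hygiene", "semantic", "distill"),
    3: ("triage", "finops"),
}

# Flattened in tier order, the order in which the levers are worked.
LEVERS_IN_TIER_ORDER = [lever for tier in sorted(TIERS) for lever in TIERS[tier]]


def check_diagnosis(assessed):
    """Return (missing, unknown) for a list of assessed lever names.

    missing: normative levers the diagnosis omitted, in tier order.
    unknown: names that are not normative levers (for example, renamed ones).
    A diagnosis passes this check only when both lists are empty.
    """
    assessed_set = set(assessed)
    missing = [l for l in LEVERS_IN_TIER_ORDER if l not in assessed_set]
    unknown = sorted(assessed_set - set(LEVERS_IN_TIER_ORDER))
    return missing, unknown


print(check_diagnosis(["visibility", "caps", "cache"]))
```

Here "cache" would be reported as unknown and "caching" as missing, which is how a renamed lever shows up in this check.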

Each lever's definition, applicable-share default, savings range, and weight are fixed parameters. The full scoring model, with the benchmark behind every range and a worked example, is published separately.

4. The Token Efficiency Score

Adoption per lever is scored from evidence (no = 0, partial = 0.5, yes = 1.0). A claimed "yes" that the data contradicts MUST be scored lower.

TES-Score = round( 100 × Σ weight·adoption / Σ weight )
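As a minimal sketch, the score formula can be computed as follows. The four weights and the adoption answers in the example are hypothetical placeholders, since the normative weights live in the published scoring model; the band thresholds are those given in the paragraph that follows. The function names tes_score and band are illustrative.

```python
# Adoption values as defined in this section.
ADOPTION = {"no": 0.0, "partial": 0.5, "yes": 1.0}

# Band lower bounds, highest first, as listed in this section.
BANDS = [
    (85, "Yield-optimised"),
    (70, "Disciplined"),
    (50, "Managed"),
    (30, "Reactive"),
    (0, "Tokenmaxxer"),
]


def tes_score(weights, adoption):
    """round(100 * sum(weight * adoption) / sum(weight)).

    Note: Python's round() rounds exact halves to the nearest even integer;
    the Standard's text does not say how ties are broken.
    """
    total = sum(weights.values())
    earned = sum(w * ADOPTION[adoption[lever]] for lever, w in weights.items())
    return round(100 * earned / total)


def band(score):
    """Map a 0-100 score to its band name."""
    for floor, name in BANDS:
        if score >= floor:
            return name
    raise ValueError("score must be between 0 and 100")


# HYPOTHETICAL weights and answers for four levers, for illustration only.
weights = {"visibility": 3, "caps": 2, "routing": 3, "caching": 2}
answers = {"visibility": "yes", "caps": "partial", "routing": "no", "caching": "partial"}

score = tes_score(weights, answers)
print(score, band(score))
```

With these placeholder inputs the weighted adoption is 5 out of 10, so the score is 50, which falls in the Managed band.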

Bands: under 30 Tokenmaxxer (unmanaged); 30 to 49 Reactive; 50 to 69 Managed; 70 to 84 Disciplined; 85 to 100 Yield-optimised.

Recoverable spend MUST use the overlap-adjusted compounded fraction, capped at 0.65, so overlapping levers are never double-counted and the model cannot claim an implausible total.

recoverable = 1 − Π_i ( 1 − applicable_i · rate_i · (1 − adoption_i) ), then min(·, 0.65)
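A short sketch of the compounded fraction shows why it never double-counts. The lever inputs below are hypothetical numbers chosen for easy arithmetic, not benchmark values, and the function name recoverable_fraction is illustrative.

```python
import math

CAP = 0.65  # ceiling on the recoverable fraction, as stated in this section


def recoverable_fraction(levers):
    """levers: iterable of (applicable_share, rate, adoption) tuples.

    Each lever leaves a fraction (1 - applicable * rate * (1 - adoption)) of
    spend untouched. Multiplying those fractions compounds the levers, so each
    lever acts only on spend the others have not already removed.
    """
    remaining = math.prod(1 - a * r * (1 - ad) for a, r, ad in levers)
    return min(1 - remaining, CAP)


# HYPOTHETICAL inputs: lever 1 applies to half the spend at a 40% rate and is
# not adopted; lever 2 applies to all spend at a 20% rate and is half adopted.
example = [(0.5, 0.4, 0.0), (1.0, 0.2, 0.5)]
print(round(recoverable_fraction(example), 4))
```

Working it by hand: the two levers contribute 0.20 and 0.10 individually, which a naive sum would report as 0.30. Compounded, the remaining spend is 0.8 × 0.9 = 0.72, so the recoverable fraction is 0.28. A fully adopted lever contributes nothing, and any result above 0.65 is clipped to the cap.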

5. The verification protocol

This is the part that lets a client book the number, and it is where most cost-savings claims fail.

MUST co-define the baseline in writing, on the client's own invoices and observability data, before any optimization, then freeze it. Anomalies are normalized out of the baseline.

MUST measure savings as a reduction in cost per unit of work by default, so volume growth neither creates nor destroys a fee. Permitted alternatives: frozen run-rate for one-time eliminations, and A/B holdout (the strongest) for quality-sensitive levers.

MUST hold a per-workload quality floor agreed at baseline. A period's savings are creditable only while in-scope quality stays at or above it.

MUST carve out provider price cuts, organic volume changes, post-baseline use cases, and client-led changes.

MUST compute savings reproducibly on the client's own ledger and obtain the client's sign-off each period before any fee is charged. Each verified saving is credited for twelve months, then rolls into the client's baseline at no further fee.
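The default unit-cost method and the quality-floor gate can be sketched as below. This is an illustration under assumptions, not the protocol's reference implementation: the Period fields, the choice to price the unit-cost reduction at the current period's volume, and the clipping of negative results to zero are all hypothetical, and the figures are invented for the example. Carve-outs are assumed to have been removed from the cost before it reaches this function.

```python
from dataclasses import dataclass


@dataclass
class Period:
    cost: float     # metered spend for the period, after carve-outs
    units: float    # units of work delivered in the period
    quality: float  # measured in-scope quality for the period


def creditable_saving(baseline: Period, current: Period, quality_floor: float) -> float:
    """Saving from a reduction in cost per unit of work.

    Returns 0.0 when quality is below the agreed floor, because savings are
    creditable only while in-scope quality stays at or above it.
    ASSUMPTION: the reduction is valued at current-period volume and a
    unit-cost increase yields 0.0 rather than a negative credit.
    """
    if current.quality < quality_floor:
        return 0.0
    baseline_unit_cost = baseline.cost / baseline.units
    current_unit_cost = current.cost / current.units
    return max(0.0, (baseline_unit_cost - current_unit_cost) * current.units)


# Invented figures: volume grows 20% while unit cost falls from 2.00 to 1.50.
baseline = Period(cost=100_000, units=50_000, quality=0.93)
current = Period(cost=90_000, units=60_000, quality=0.92)

print(creditable_saving(baseline, current, quality_floor=0.90))
```

In this example the invoice fell by only 10,000, but the unit-cost reduction of 0.50 across 60,000 units gives a creditable saving of 30,000. That gap is the point of measuring per unit of work: the volume growth neither inflates nor erases the number. If the same period had measured quality below the floor, the function would return 0.0.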

6. Conformance levels

Marketing or case studies MUST distinguish a diagnostic estimate from a verified result.

TES-Diagnostic. A scored estate using measured adoption and the normative score. Establishes the prize. The free scorecard is at this level.
TES-Verified. Savings realized and verified under section 5, quality floor intact, client signed off. The only level that may be cited as a savings result. The case study is at this level.
TES-Governed. A standing operating model keeps the verified savings durable across periods.

7. Practitioner certification

Competence is defined in levels so a trained practitioner, not only the original author, can deliver an engagement to the same standard. This is what removes key-person risk and lets the method scale beyond any one person.

Associate. May run the diagnostic under supervision.
Practitioner. May own a full audit, set the baseline, and choose verification methods. Certified by a method examination plus a supervised engagement whose record falls within calibration tolerance.
Lead. May structure and sign the gainshare engagement and the verified number.

8. Limits and non-claims

A diagnostic figure is a benchmark-grounded estimate, not a guarantee. Only a verified figure is a result.
The Standard does not promise a fixed savings percentage; recoverability depends on the starting estate.
It does not improve model quality; it holds a floor.
Benchmark rates are time-bound and decay as the model market moves. C4 and the engagement record keep them current; a stale version SHOULD NOT be used for sizing.
It is not legal, tax, or model-risk advice.

9. Governance and versioning

The TES is a living standard, versioned with semver. Engagements record the version they were delivered under. The calibration loop in C4 is the formal amendment mechanism: when the accumulated evidence shows a lever's measured rates persistently diverging from its benchmark, a new version MUST update that benchmark, with the change and its evidence recorded. We intend the Standard to be open and citable, on the model of the FinOps Foundation, where being in the open is part of why it can be trusted. The method is published; the moat is the execution under a client's constraints, the verification their finance and risk teams will sign, and the benchmark that grows with every engagement.
