Least-Cost Routing for AI: How We Cut Token Spend Without Losing Quality

The default approach to AI model selection is to use the most capable model available for every task. It feels like the safe choice. If you are paying for quality, use quality everywhere.

In practice, this is neither safe nor economical. A highly capable, expensive model applied to a simple, well-defined task produces the same output as a less expensive model — and the cost difference between them can be substantial at production volumes.

I noticed this relatively early in building the WAT system. We were routing all agent work to Opus — the highest-capability model in the family — by default. The outputs were excellent. The spend was not. A review of which tasks actually required Opus-level capability, and which produced identical outputs from a less expensive model, revealed a clear pattern: tasks with high judgment requirements genuinely benefited from the more capable model. Tasks that were well-defined, structured, and deterministic produced equivalent results from models that cost a fraction of the price.


This is least-cost routing applied to AI operations. It is standard practice in telecommunications infrastructure. It is surprisingly rare in AI deployments.


The task type determines the model requirement

The core insight is simple enough to state: different task types have different capability requirements, and capability requirements map to model tiers with different cost profiles.

In the WAT system, Tony Jones is the finance controller agent who evaluates model selection across task types. The framework he uses organises tasks by the kind of reasoning they require:

High judgment tasks — synthesis across complex or ambiguous information, novel reasoning, handling edge cases that require real contextual understanding, nuanced quality assessment. These tasks benefit from the most capable model. The output quality differential is real, not theoretical.

Structured execution tasks — well-defined procedures with specific inputs and outputs, deterministic logic, format checking, data extraction against a defined schema. These tasks produce equivalent outputs from less expensive models because the task is not testing the model’s judgment — it is testing its ability to follow explicit instructions reliably.

Simple classification and routing tasks — categorisation, routing logic, binary decisions against defined criteria. These tasks are the lowest capability requirement. The most capable model applied to them is cost-inefficient.
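The three tiers above can be sketched as a simple routing table. A minimal sketch follows — the tier names, model labels, and per-token prices are illustrative placeholders, not the WAT system's actual configuration or any vendor's real pricing:

```python
# Illustrative least-cost router: map each task type to the cheapest
# model tier that meets its capability requirement. Names and prices
# are placeholders, not real pricing.

TIERS = {
    "high_judgment": {"model": "tier-large", "usd_per_1k_tokens": 0.015},
    "structured_execution": {"model": "tier-medium", "usd_per_1k_tokens": 0.003},
    "classification": {"model": "tier-small", "usd_per_1k_tokens": 0.00025},
}

def route(task_type: str) -> str:
    """Return the model for a task type. Unknown task types fall back
    to the most capable tier -- fail safe on quality, not on cost."""
    return TIERS.get(task_type, TIERS["high_judgment"])["model"]

def estimate_cost(task_type: str, expected_tokens: int) -> float:
    """Rough cost estimate for a single task at its routed tier."""
    tier = TIERS.get(task_type, TIERS["high_judgment"])
    return tier["usd_per_1k_tokens"] * expected_tokens / 1000
```

The fallback direction is a deliberate design choice: misclassifying a task should degrade cost, never quality.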

The discipline is in the honest classification. The instinct is to categorise everything as “high judgment” because that feels like it is taking the work seriously. The honest classification is based on what the task actually requires, not on how important it feels.


Budget caps and spend reporting

Routing decisions without spend visibility are just guesses. The WAT system includes budget cap enforcement and regular spend reporting precisely because cost decisions made without data accumulate into surprises.

Every significant batch operation — generating 30 pages of content, running a complete site audit, executing a multi-agent research pipeline — is evaluated against a per-run cost estimate before execution. If the estimate exceeds the defined budget cap, the operation requires explicit approval rather than autonomous execution. This is not significant friction — it is a governance gate that prevents runaway spend on misrouted tasks.
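A minimal sketch of that gate, assuming a cost estimate is produced before execution — the cap value and per-token price here are illustrative, not the system's actual figures:

```python
# Pre-execution budget gate: batch operations whose estimated cost
# exceeds the cap require explicit approval rather than running
# autonomously. Cap and price are illustrative placeholders.

BUDGET_CAP_USD = 25.00

def estimate_batch_cost(items: int, tokens_per_item: int,
                        usd_per_1k_tokens: float) -> float:
    """Rough pre-execution estimate for a batch operation."""
    return items * tokens_per_item * usd_per_1k_tokens / 1000

def gate(estimated_cost_usd: float, cap: float = BUDGET_CAP_USD) -> str:
    """'auto' when the estimate is under the cap; 'needs_approval' otherwise."""
    return "auto" if estimated_cost_usd <= cap else "needs_approval"

# e.g. a 30-page content run at roughly 4k tokens per page
run_cost = estimate_batch_cost(30, 4000, 0.015)
decision = gate(run_cost)
```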

The spend reports, produced by Tony Jones on a regular basis, break down expenditure by agent, by task type, and by model tier. This produces the data required to identify patterns: which task types are being over-routed to expensive models, which agents are consuming disproportionate budget relative to their output volume, which operations have higher-than-expected token usage suggesting specification inefficiency.
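The aggregation itself is straightforward. A sketch, assuming per-call records carry agent, task type, tier, and cost fields (the field names are assumptions for illustration):

```python
from collections import defaultdict

# Aggregate per-call spend records along the three axes the reports
# break down by: agent, task type, and model tier. Record field
# names are assumptions, not the system's actual schema.

def spend_report(records):
    """records: iterable of dicts with 'agent', 'task_type',
    'model_tier', and 'cost_usd' keys. Returns totals per axis."""
    totals = {
        "by_agent": defaultdict(float),
        "by_task_type": defaultdict(float),
        "by_model_tier": defaultdict(float),
    }
    for r in records:
        totals["by_agent"][r["agent"]] += r["cost_usd"]
        totals["by_task_type"][r["task_type"]] += r["cost_usd"]
        totals["by_model_tier"][r["model_tier"]] += r["cost_usd"]
    return totals
```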

Specification inefficiency is a cost signal as well as a quality signal. An agent that uses significantly more tokens than expected on a well-defined task is often doing extra work — re-evaluating decisions, producing unnecessary hedging, working around an underspecified prompt. The token count is the footprint of the reasoning. A large footprint on a simple task is a specification problem, not a model capability problem.
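That check can be automated once expected token budgets exist per task type. A minimal version, where the field names and the 1.5x threshold are assumptions, not the system's actual values:

```python
# Flag calls whose token usage runs well past the expected budget
# for their task type -- the specification-problem signal described
# above. Threshold and field names are illustrative assumptions.

def flag_spec_inefficiency(records, ratio: float = 1.5):
    """Return the records whose actual token use exceeds the
    expected budget by more than `ratio`."""
    return [r for r in records
            if r["actual_tokens"] > ratio * r["expected_tokens"]]
```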


Chain of Draft as a cost reduction technique

One practical technique I have found valuable, and have documented in my LinkedIn archive, is Chain of Draft prompting.

The standard prompting approach for complex reasoning tasks asks the model to reason through a problem step by step, producing a full thought chain before the answer. This produces high-quality reasoning but at the cost of a large token footprint. Every step in the thought chain is tokens.

Chain of Draft modifies this: instead of asking the model to produce full sentences for each reasoning step, ask it to produce minimal drafts — short, compressed summaries of each reasoning step. The quality of the final output is comparable. The token usage is substantially lower. The reduction I have seen in practice is meaningful at production volumes.
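The difference is easiest to see as a prompt pair. The exact wording below is an assumption for the sake of the example, not a prompt quoted from any production system:

```python
# Illustrative prompt instructions: standard step-by-step reasoning
# versus Chain of Draft. Wording is an assumption for illustration.

CHAIN_OF_THOUGHT_INSTRUCTION = (
    "Think through the problem step by step, writing out each "
    "reasoning step in full before giving the final answer."
)

CHAIN_OF_DRAFT_INSTRUCTION = (
    "Think through the problem step by step, but keep only a "
    "minimal draft for each step, of five words at most. "
    "Then give the final answer."
)
```

The reasoning path is preserved in both cases; only the verbosity of the intermediate steps changes.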

This is not a universal technique. It works well on tasks where the reasoning path is important but the intermediate steps do not need to be human-readable. It is less appropriate where the reasoning chain is itself a deliverable — an audit trail, a governance record, a client-facing explanation — where intermediate step clarity matters.

The broader principle: token usage is a design choice, not a fixed consequence of task complexity. Prompting strategy, output format specification, context management (what you include in the context and what you exclude), and model tier selection all affect the token cost of a task. Managing these deliberately is materially different from accepting whatever the default produces.


The governance dimension

I write this post primarily for practitioners building AI systems. There is a board-level dimension worth naming, because boards are increasingly approving AI investment decisions where the ongoing operational cost is not well-understood at the time of approval.

AI model usage costs are not fixed. They scale with usage volume, and they are affected by architecture decisions — model selection, context management, prompting strategy — that are typically made by engineering teams rather than finance or governance functions. A system that was cost-acceptable at pilot stage may have a materially different cost profile in production, for reasons that are not visible to anyone outside the engineering team.

The governance question is: what is the ongoing cost model for this AI deployment, who is accountable for managing it, and what visibility does the board have over it?

This is not a question most AI governance frameworks address, because it sits at the boundary of finance and technology. Boards that have approved AI investments without a clear answer to this question often receive the first production cost reports as a surprise. The surprise is avoidable.


What I would tell someone starting a production AI system today

Model spend is meaningful even at moderate scale. The difference between routing all tasks to the most capable model and routing tasks to appropriate models can be 40-60% of total spend at production volumes. That is not a marginal optimisation.
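The arithmetic behind a figure in that range is worth sketching. The prices (USD per 1k tokens), monthly volume, and task-mix split below are illustrative assumptions chosen to show the shape of the saving, not real pricing or real usage data:

```python
# Compare routing all token volume to the top tier against a routed
# mix across tiers. All numbers are illustrative assumptions.

PRICES = {"large": 0.015, "medium": 0.003, "small": 0.00025}  # USD / 1k tokens

def blended_cost(total_tokens: float, mix: dict) -> float:
    """Total spend for a given split of token volume across tiers."""
    return sum(total_tokens * share * PRICES[tier] / 1000
               for tier, share in mix.items())

TOKENS = 100_000_000  # an illustrative month of production volume

all_top_tier = blended_cost(TOKENS, {"large": 1.0})
routed = blended_cost(TOKENS, {"large": 0.4, "medium": 0.4, "small": 0.2})
saving = 1 - routed / all_top_tier  # ~0.52 with these assumptions
```

With this split, routing saves roughly half the spend — squarely in the 40-60% range, and sensitive mainly to how much volume genuinely needs the top tier.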

Build the spend visibility first, before you have accumulated significant spend you cannot explain. Running a month of production work and then trying to understand the cost breakdown is considerably harder than building the reporting into the system from the start.

The least-cost routing principle applies at every level: model tier, context size, prompting strategy, agent architecture. Every unnecessary token is a cost. Every task routed to a more expensive model than the task requires is a cost. These accumulate.

And finally: the cheapest overall architecture is the one where agents are well-specified enough to complete tasks without requiring multiple correction cycles. Rework costs tokens. A specification investment upfront is cheaper than the token cost of repeated correction runs. This is the one cost efficiency argument I make most frequently, because it is both correct and counterintuitive — the investment in specification quality is not just a quality decision, it is a cost decision.


The Board AI Governance Framework includes a section on AI investment governance — the questions boards should ask about ongoing operational costs, cost accountability, and spend visibility for AI deployments.

Steven advises boards navigating AI adoption and deep tech commercial strategy. For AI governance consulting, contact Steven directly. For quantum security, visit Quantum Security Defence.

Steven Vaile

Board technology advisor and QSECDEF co-founder. Writes on AI governance, quantum security, and commercial strategy for boards and deep tech founders.