You can tell me how many tokens your AI agent consumed last month. Can you tell me how much intelligence it delivered?
An enterprise deployed an AI agent to review batch records. The agent consumed 4.2 million tokens in its first week. The CIO asked a reasonable question: “Is that a lot?” No one could answer — because the industry measures AI in tokens, and tokens measure cost. They do not measure intelligence.
This is the fundamental gap in how organisations think about agentic AI. We have precise metrics for what AI costs (tokens consumed, API calls made, compute hours billed) but no shared language for what AI delivers. A deviation investigation that required 14 reasoning turns, 6 tool calls across 3 systems, and cross-referencing 200 pages of batch data is reported as “47,000 tokens” — collapsing the entire cognitive process into a billing line item.
This article proposes a framework called Intelligence Units — a way to decompose, measure, and reason about the intelligence that agentic systems actually deliver. Not to replace token-based pricing, but to give organisations a second axis: alongside cost, capability.
Tokens are to intelligence what kilowatt-hours are to comfort. They measure resource consumption, not the outcome produced. An Intelligence Unit measures what the AI actually did — how hard it thought, what it retrieved, how many actions it took, and what model capability it required.
When AI was limited to single-turn completions — ask a question, get an answer — tokens were a reasonable proxy for work done. More tokens meant longer responses. Cost tracked roughly with output. But agentic systems break this assumption. An agent that investigates a deviation might spend 80% of its tokens on internal reasoning that never appears in the final output. Another agent might make 12 tool calls to query databases, each consuming tokens for the request and response, with the actual “intelligence” sitting in how it decided which tools to call and in what sequence.
The result is that two agents can consume identical token counts while delivering fundamentally different levels of intelligence. One might have brute-forced its way through a task with verbose, unfocused reasoning. The other might have executed a precise, multi-step investigation with targeted data retrieval and surgical tool use. The token bill is the same. The intelligence delivered is not.
$52B: up from $7.8B today, yet no standard framework exists for measuring agent capability.
60-70%: without multi-turn reasoning and tool use, LLM accuracy on complex tasks plateaus well below enterprise requirements.
90%: plan-and-execute patterns using tiered models can reduce costs by 90% vs. frontier-only architectures.
An Intelligence Unit is not a single number. It is a composite measure across five dimensions — each capturing a different aspect of the cognitive work an agent performs. Together, they describe not just how much an agent consumed, but how intelligently it operated.
Think of it like a medical diagnosis. A doctor’s intelligence is not measured by how many words they speak. It is measured by the reasoning they apply, the tests they order, the data they review, and the expertise they bring. An Intelligence Unit applies the same logic to AI agents.
Reasoning tokens: the internal thinking an agent performs before producing output. Modern reasoning models (extended thinking, chain-of-thought) generate thousands of tokens of internal deliberation that never appear in the response. These reasoning tokens represent the depth of analysis: the difference between a snap judgment and a considered investigation. A deviation review that generates 8,000 reasoning tokens is doing fundamentally different cognitive work than one that generates 200.
Agent turns: the number of iterative reasoning cycles an agent executes to complete a task. Each turn represents a plan-act-observe loop: the agent reasons about its current state, takes an action, observes the result, and decides what to do next. A single-turn agent is a chatbot. A 14-turn agent that progressively narrows a root cause investigation is demonstrating genuine problem-solving behaviour. Turns measure persistence and adaptability.
Tool calls: the actions an agent takes against external systems, such as database queries, API calls, document retrievals, and calculations. Tool calls are where reasoning meets reality. An agent that makes 6 targeted tool calls to cross-reference batch data across an MES, LIMS, and document management system is demonstrating integration intelligence. The number, sequence, and precision of tool calls are a direct measure of operational capability.
Data retrieval: the volume and relevance of information an agent pulls into its reasoning context. This includes RAG (retrieval-augmented generation) lookups, document searches, database queries, and cross-system data correlation. An agent reviewing a cleaning validation SOP that retrieves the relevant FDA guidance, the HBEL calculation report, and the last three inspection observations is building a richer reasoning context. More relevant retrieval means higher-quality conclusions.
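The four measured dimensions described above can be recorded per task as a simple tally. The sketch below is one possible shape, assuming a hypothetical event log in which every agent step carries a `kind` tag. The event format, the field names, and the rule that each reasoning step opens a new turn are illustrative assumptions, not the output of any particular agent framework.

```python
from dataclasses import dataclass


@dataclass
class TaskTrace:
    """Per-task tally of the four measured dimensions (field names are illustrative)."""
    reasoning_tokens: int = 0
    turns: int = 0
    tool_calls: int = 0
    documents_retrieved: int = 0


def tally(events: list[dict]) -> TaskTrace:
    """Fold a hypothetical agent event log into a TaskTrace.

    Assumed event shapes:
      {"kind": "reason", "tokens": int}      one reasoning step; opens a new turn
      {"kind": "tool_call", "system": str}   one action against an external system
      {"kind": "retrieve", "documents": int} documents pulled into context
    """
    trace = TaskTrace()
    for event in events:
        kind = event["kind"]
        if kind == "reason":
            trace.turns += 1
            trace.reasoning_tokens += event["tokens"]
        elif kind == "tool_call":
            trace.tool_calls += 1
        elif kind == "retrieve":
            trace.documents_retrieved += event["documents"]
    return trace


# Two plan-act-observe loops: reason, act, then reason again on the result.
log = [
    {"kind": "reason", "tokens": 300},
    {"kind": "tool_call", "system": "MES"},
    {"kind": "retrieve", "documents": 2},
    {"kind": "reason", "tokens": 500},
    {"kind": "tool_call", "system": "LIMS"},
]
trace = tally(log)
print(trace)
```

Note that this tally counts retrieved documents only. Scoring their relevance, which the retrieval dimension also calls for, would need a separate judgment that this sketch does not attempt.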
The fifth dimension — and the one that transforms the framework from descriptive to prescriptive — is the Model Multiplier.
Not all reasoning is equal. A reasoning token generated by a frontier model (like Claude Opus) carries more cognitive weight than the same token generated by a lightweight model (like Claude Haiku). The model multiplier captures this difference. It is the recognition that the same task, executed with the same number of tokens and turns, will produce materially different intelligence depending on which model does the thinking.
Consider three tiers: a lightweight model (Haiku-class) operates at a 1x multiplier — fast, efficient, suitable for classification, routing, and structured extraction. A mid-tier model (Sonnet-class) operates at a 3-5x multiplier — capable of nuanced reasoning, multi-step analysis, and contextual judgment. A frontier model (Opus-class) operates at a 10-15x multiplier — deep reasoning, complex investigation, novel problem-solving, and expert-level domain synthesis.
The model multiplier means an Intelligence Unit is not just about volume of work — it is about quality of thought.
An Intelligence Unit for a given agent task can be expressed as a weighted composite of its five dimensions. The exact weights will vary by domain and use case, but the structure remains consistent.
The formula is intentionally simple:
IU = (Reasoning Tokens × Model Multiplier) + (Agent Turns × Turn Weight) + (Tool Calls × Call Weight) + Data Retrieval Score
The point is not mathematical precision. It is giving organisations a shared vocabulary for comparing agent work that goes beyond “how many tokens did it use.”
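To see how the composite behaves, here is a minimal sketch that applies the formula to the three task profiles quoted in this article's deviation example. The article leaves the weights open, so `TURN_WEIGHT`, `CALL_WEIGHT`, and `DOC_WEIGHT` are placeholder values chosen purely for illustration. Treating the Data Retrieval Score as documents retrieved times a flat weight is also an assumption, since relevance is not modelled here. The multipliers use the values quoted in the example (1x, 4x, 10x).

```python
from dataclasses import dataclass

# Multipliers follow the values quoted in the article's example.
MODEL_MULTIPLIER = {"haiku": 1.0, "sonnet": 4.0, "opus": 10.0}

# Hypothetical weights: the article does not specify these.
TURN_WEIGHT = 100.0
CALL_WEIGHT = 250.0
DOC_WEIGHT = 50.0  # assumed: Data Retrieval Score = documents * DOC_WEIGHT


@dataclass
class Task:
    name: str
    tier: str
    reasoning_tokens: int
    turns: int
    tool_calls: int
    documents: int


def intelligence_units(task: Task) -> float:
    """IU = tokens x multiplier + turns x weight + calls x weight + retrieval score."""
    return (
        task.reasoning_tokens * MODEL_MULTIPLIER[task.tier]
        + task.turns * TURN_WEIGHT
        + task.tool_calls * CALL_WEIGHT
        + task.documents * DOC_WEIGHT
    )


# The three profiles from the article's example; task names are illustrative.
TASKS = [
    Task("simple check", "haiku", 400, 1, 0, 0),
    Task("deep investigation", "opus", 8_200, 14, 6, 12),
    Task("regulatory lookup", "sonnet", 3_100, 4, 8, 5),
]

for task in sorted(TASKS, key=intelligence_units, reverse=True):
    print(f"{task.name:20s} {intelligence_units(task):>10,.0f} IU")
```

With these placeholder weights, the ranking matches the article's qualitative labels (High, Medium-High, Low). The absolute numbers mean nothing until an organisation calibrates the weights for its own domain.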
Worked example: a temperature excursion deviation in tablet coating. Each phase of the investigation consumes a different level of intelligence, which is why tokens alone don't tell the story.
Step 1: Has this happened before? The AI scans 2 years of historical deviations to check whether this issue, a temperature excursion during tablet coating, has occurred before, pattern matching against 1,847 closed deviations.
Intelligence Unit Breakdown
Opaque: 12,000 tokens consumed. Cost: $0.18. No further insight into what happened.
Transparent: 400 reasoning tokens, 1 turn, 0 tool calls, no retrieval. Haiku-class model (1x). IU score: Low. Appropriate for the task complexity.
Opaque: 47,000 tokens consumed. Cost: $2.35. Looks expensive. No context on why.
Transparent: 8,200 reasoning tokens, 14 turns, 6 tool calls across 3 systems, 12 documents retrieved. Opus-class (10x). IU score: High. Deep investigation justified the model choice.
Opaque: 31,000 tokens consumed. Cost: $0.93. Mid-range. Unclear if optimised.
Transparent: 3,100 reasoning tokens, 4 turns, 8 tool calls (regulatory DB lookups), 5 guidance docs retrieved. Sonnet-class (4x). IU score: Medium-High. Efficient use of mid-tier model with heavy retrieval.
Intelligence Units shift the conversation from “How much does AI cost?” to “How much intelligence does AI deliver per unit of cost?” This reframing has practical consequences for how organisations architect, deploy, and optimise agentic systems.
Intelligence Units make model selection empirical. Route tasks to the model tier that delivers the required IU score at the lowest cost. Classification tasks do not need Opus. Root cause investigations should not run on Haiku. The framework makes the mismatch visible.
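One way to make that routing rule concrete is a minimal sketch that assumes each task type is tagged with the minimum model multiplier it needs and each tier carries a relative price. The prices and the per-task requirements below are hypothetical placeholders. Only the tier names and multiplier values echo the tiers described in this article.

```python
# (tier, multiplier, relative price per token). Prices are hypothetical.
TIERS = [
    ("haiku", 1.0, 1.0),
    ("sonnet", 4.0, 4.0),
    ("opus", 10.0, 20.0),
]


def cheapest_tier(required_multiplier: float) -> str:
    """Return the lowest-priced tier whose multiplier meets the requirement."""
    candidates = [tier for tier in TIERS if tier[1] >= required_multiplier]
    if not candidates:
        raise ValueError(f"no tier offers a multiplier of {required_multiplier}")
    return min(candidates, key=lambda tier: tier[2])[0]


# Hypothetical minimum multiplier per task type.
REQUIRED = {
    "document_classification": 1.0,
    "regulatory_lookup": 3.0,
    "root_cause_investigation": 10.0,
}

routing = {task: cheapest_tier(need) for task, need in REQUIRED.items()}
print(routing)
```

The design choice here is to route on a required capability floor rather than on task size, so a long but simple classification job still lands on the cheapest tier.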
Compare agents not by token consumption but by intelligence delivered per task. An agent that resolves deviations in 8 turns with targeted tool calls is measurably better than one that takes 22 turns with redundant queries — even if the second one costs less in tokens.
Allocate AI spend by intelligence required, not tokens estimated. A quality operations team might budget 10,000 IUs per month for deviation investigations (high complexity, frontier model) and 50,000 IUs for document classifications (low complexity, lightweight model). The spend may be the same; the intelligence profiles are not.
Track IU efficiency over time. Are your agents getting smarter — delivering higher IU scores with fewer tokens? Or are they getting bloated — consuming more resources without improving outcomes? Intelligence Units make agent drift measurable.
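A rough sketch of what tracking this could look like, assuming IU totals and token totals are already being collected per period. The monthly figures, the 10% tolerance, and the first-versus-latest comparison are all hypothetical choices made for illustration. A production version would likely use a longer window and a proper trend test.

```python
def iu_per_1k_tokens(iu: float, tokens: int) -> float:
    """Intelligence Units delivered per 1,000 tokens consumed."""
    return iu * 1000 / tokens


def drifting(history: list[tuple[str, float, int]], tolerance: float = 0.10) -> bool:
    """Flag drift when the latest period's efficiency falls more than
    `tolerance` below the first period's efficiency."""
    _, first_iu, first_tokens = history[0]
    _, last_iu, last_tokens = history[-1]
    first = iu_per_1k_tokens(first_iu, first_tokens)
    last = iu_per_1k_tokens(last_iu, last_tokens)
    return last < first * (1 - tolerance)


# Hypothetical monthly totals for one agent: (month, IU delivered, tokens consumed).
history = [
    ("2025-01", 80_000.0, 400_000),
    ("2025-02", 82_000.0, 450_000),
    ("2025-03", 81_000.0, 540_000),
]

for month, iu, tokens in history:
    print(month, round(iu_per_1k_tokens(iu, tokens), 1))
print("drifting:", drifting(history))
```

In this made-up series the IU delivered stays roughly flat while token consumption climbs, which is the bloat pattern described above.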
The organisations that will extract the most value from agentic AI are not those that spend the most on tokens. They are those that understand what intelligence their agents are delivering — and can systematically optimise for capability, not just cost.
Intelligence Units are a framework, not a specification. The exact weights, multipliers, and scoring methodology will evolve as the industry matures and as agentic architectures become more sophisticated. What matters now is the shift in mental model: from measuring AI by what it consumes to measuring AI by what it delivers.
The parallel to manufacturing is direct. Pharmaceutical companies do not measure production efficiency by kilowatt-hours consumed — they measure it by batch yield, right-first-time rates, and cycle times. The energy bill is a cost input, not a performance metric. Tokens are the same. They are the energy bill for intelligence. Intelligence Units are the yield metric.
As agentic AI moves from pilot to production — Gartner projects 40% of enterprise applications will embed AI agents by the end of 2026 — the organisations that build measurement frameworks around intelligence delivered, not just tokens consumed, will make systematically better decisions about where to deploy agents, which models to use, and how to optimise their AI operations over time. Those that don’t will keep asking the same unanswerable question: “Is 4.2 million tokens a lot?”