The AI industry loves token inflation. Your company shouldn’t


The AI industry has a quiet addiction problem: It is addicted to tokens. 

Every new generation of agentic AI seems to assume that the answer to complexity is to throw more context at the model, keep longer histories, spawn more calls, loop over more tools, and let the token meter run wild. 

The rise of agentic systems, and now projects like OpenClaw, makes that temptation even stronger. Once you give models more autonomy, they do not just consume tokens to answer questions. They consume them to plan, reflect, retry, summarize, call tools, inspect outputs, and keep themselves on track. OpenClaw itself describes the product as an "agent-native" gateway with sessions, memory, tool use, and multi-agent routing across messaging platforms, which tells you exactly where this is going: more autonomy, more orchestration, and, unless someone intervenes, a lot more token burn. 

That trajectory delights almost everyone selling the infrastructure. If billing is based on tokens, more token consumption looks like growth. If you sell the compute behind those tokens, it looks even better. Google said in its October 2025 earnings call that it was processing more than 1.3 quadrillion monthly tokens across its surfaces, or more than 20 times the volume of a year earlier. Nvidia, for its part, has been leaning hard into the economics of inference and agentic AI, highlighting both the demand surge and the opportunity to sell ever more infrastructure into it. 

But companies buying AI should look at this very differently. From the customer’s perspective, explosive token growth is not necessarily a sign of intelligence. In many cases, it is a sign of inefficiency. 

More tokens are not the same thing as more intelligence 

The current industry narrative often treats token consumption as if it were a proxy for progress. Bigger context windows, more reasoning traces, more agent loops, more memory, more retrieval, more interactions. It all sounds impressive. 

But a system that needs to ingest and regenerate enormous amounts of context at every turn is not necessarily smarter. It may simply be badly designed. 

Anthropic’s own engineering guidance makes this point with unusual clarity. Its team argues that context should be treated as a finite resource, and that good context engineering means finding “the smallest possible set of high-signal tokens” for the task at hand. That is not a marginal optimization. It is a fundamentally different philosophy. It says the future does not belong to systems that can swallow the most context, but to systems that know what context actually matters. 

That distinction is becoming more important as agentic workflows spread. Once an AI system is allowed to act repeatedly, use tools, revisit plans, and maintain session state, token consumption can compound quickly. What looks like one task from the outside may involve many hidden prompts, subqueries, summaries, and retries underneath. Deloitte now describes tokens as the new currency of AI economics, precisely because the structure of agentic systems changes the cost dynamics so dramatically. 

And yet many companies are still behaving as if scale alone will solve the problem. 

It won’t. 

Long context is not a free lunch

One of the most persistent myths in enterprise AI is that if some context is good, more context must be better. That assumption was always too simplistic, and the evidence against it is getting stronger.

The paper “Lost in the Middle” showed that language models often struggle to use relevant information when it is buried inside long contexts, performing best when key information appears near the beginning or the end. More recently, Chroma’s long-context evaluation across 18 models found that model performance becomes increasingly unreliable as input length grows. In other words, there is a point at which more tokens stop being additional intelligence and start becoming additional noise. 

This is where the brute-force model begins to look less like technological inevitability and more like lazy architecture. If your answer to every new requirement is to stuff more material into the prompt, preserve every turn forever, and keep every intermediate artifact in the active context window, you are not building a better AI system. You are building a more expensive one, and quite possibly a worse one. 

The real frontier is context engineering

The more interesting future is not bigger and hungrier. It is more selective, more structured, and more deliberate. That is why the most important emerging concept in applied AI may not be prompt engineering, but context engineering. 

Anthropic explicitly frames context engineering as the next step beyond prompt engineering. OpenAI offers retrieval and prompt caching to avoid repeatedly sending the same large bodies of information. Google offers context caching for repeated use of substantial initial context. Microsoft’s guidance on retrieval-augmented generation (RAG) and chunking is similarly direct: Sending entire documents or oversized chunks is expensive, can overwhelm token limits, and often produces worse results than well-prepared retrieval pipelines. 

These are not fringe techniques. They are signals from the industry itself that the brute-force era has limits. 

The pattern is clear. The future enterprise stack will not rely on blindly resending everything a company knows into a model at every interaction. It will rely on better architecture: retrieval layers, access controls, selective memory, hierarchical summaries, context compaction, caching, routing, and strong query planning. 
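The "smallest possible set of high-signal tokens" idea can be made concrete with a toy example. The sketch below is not a production retrieval layer: token counts are approximated by word count and relevance by keyword overlap, both stand-ins for a real tokenizer and embedding-based ranking. What it illustrates is the core discipline of treating context as a budget, not a dumping ground.

```python
# Minimal sketch of budget-constrained context selection.
# Approximations (word count as tokens, keyword overlap as relevance)
# are illustrative assumptions, not how a real system would do it.

def approx_tokens(text: str) -> int:
    """Rough token estimate: one token per whitespace-separated word."""
    return len(text.split())

def score(chunk: str, query: str) -> float:
    """Naive relevance: fraction of query words appearing in the chunk."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / max(len(q), 1)

def select_context(chunks: list[str], query: str, budget: int) -> list[str]:
    """Pick the highest-signal chunks that fit within the token budget."""
    ranked = sorted(chunks, key=lambda ch: score(ch, query), reverse=True)
    chosen, used = [], 0
    for ch in ranked:
        cost = approx_tokens(ch)
        if used + cost <= budget:
            chosen.append(ch)
            used += cost
    return chosen

chunks = [
    "Quarterly revenue grew 12 percent on strong cloud demand.",
    "The cafeteria menu rotates every two weeks.",
    "Cloud revenue is reported in the segment disclosures.",
]
print(select_context(chunks, "cloud revenue growth", budget=20))
```

Even this crude version keeps the irrelevant cafeteria chunk out of the prompt. The engineering effort goes into the scorer and the budget policy, not into renting a bigger context window.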

In other words, it will rely on engineering. 

Why the current economics are misleading 

This is where the incentives become distorted. 

Model vendors can live quite happily in a world where customers believe token growth is natural, unavoidable, and even desirable. The more calls, the more context, the more loops, the more revenue. Graphics processing unit (GPU) makers are similarly well positioned when inference demand keeps climbing. 

And of course some of that demand is legitimate. There are real use cases that need more context, more modalities, and more sophisticated inference. But it would be a mistake to confuse “demand exists” with “waste does not.” 

OpenAI says prompt caching can reduce latency by up to 80% and input token costs by up to 90% for repeated content. Google says context caching is especially useful when a substantial initial context is referenced repeatedly. Microsoft says good chunking removes irrelevant information and improves both cost and quality. None of those capabilities would matter if the brute-force approach were already efficient. Their very existence is proof that smarter architecture beats indiscriminate token flooding. 
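To see why caching matters economically, a back-of-envelope calculation is enough, using the 90% best-case figure cited above. The per-token price, call volume, and document size below are invented assumptions for illustration, not any vendor's actual pricing.

```python
# Back-of-envelope illustration of prompt-caching economics.
# All numbers are made-up assumptions; the 90% discount is the best-case
# figure cited in the text above.

PRICE_PER_MTOK = 3.00      # assumed dollars per million input tokens
CACHED_DISCOUNT = 0.90     # cited best case: 90% off cached input tokens

def monthly_input_cost(calls, prefix_tokens, fresh_tokens, cached=False):
    """Input-token cost when a shared prefix is (or is not) cached."""
    prefix_rate = PRICE_PER_MTOK * (1 - CACHED_DISCOUNT) if cached else PRICE_PER_MTOK
    prefix_cost = calls * prefix_tokens / 1e6 * prefix_rate
    fresh_cost = calls * fresh_tokens / 1e6 * PRICE_PER_MTOK
    return prefix_cost + fresh_cost

# 100k calls per month, each resending a 20k-token policy document
# plus a 500-token user question.
naive = monthly_input_cost(100_000, 20_000, 500)
cached = monthly_input_cost(100_000, 20_000, 500, cached=True)
print(f"naive: ${naive:,.0f}  cached: ${cached:,.0f}")
```

Under these assumptions the bill drops from $6,150 to $750 a month for the same workload. The savings come entirely from architecture: putting the stable prefix first so the cache can apply, not from a smarter model.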

That is why companies should be very careful about adopting the vocabulary of the vendors selling them the compute. "More capable" and "more expensive to run" are not synonyms.

The AI industry is monetizing token inflation. Smart companies will engineer their way out of it. 

The enterprise advantage will come from knowing your own context

This is where the argument becomes more than a complaint about cost. The real opportunity is not merely to reduce token bills. It is to build better systems. 

A company that understands its own knowledge structure, internal permissions, workflows, terminology, and decision logic should not need to approach every interaction with an AI system as if it were talking to a stranger from scratch. It should be able to architect context intelligently: retrieve the right information at the right moment, preserve what matters, discard what does not, and ground outputs in its own internal logic. 

That is not a small improvement. It radically changes the economics of enterprise AI. 

If the company platform is built properly, the model should not need to carry the whole world in active memory all the time. It should be working with a curated, dynamic, high-signal subset of relevant information. Microsoft’s agentic retrieval architecture points in exactly this direction: focused subqueries, structured responses, citations, security trimming, and knowledge-source-aware grounding instead of indiscriminate context stuffing. 
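The pattern is generic enough to sketch in a few lines. The code below is not Microsoft's actual API; the in-memory document store, the group-based ACL model, and the comma-based subquery split are all simplifying assumptions. It only shows the shape of the idea: focused subqueries, and security trimming applied before anything reaches the model.

```python
# Generic sketch of agentic retrieval with security trimming.
# The store, ACL scheme, and decomposition heuristic are illustrative
# assumptions, not any vendor's real implementation.

DOCS = [
    {"id": "fin-1", "acl": {"finance"}, "text": "FY25 budget by region."},
    {"id": "hr-1",  "acl": {"hr"},      "text": "Compensation bands."},
    {"id": "pub-1", "acl": {"all"},     "text": "Public pricing page."},
]

def subqueries(task: str) -> list[str]:
    """Naive decomposition: one focused subquery per comma-separated clause."""
    return [part.strip() for part in task.split(",") if part.strip()]

def retrieve(query: str, groups: set[str]) -> list[dict]:
    """Security trimming first, keyword match second: the model never
    sees documents the calling user is not allowed to see."""
    visible = [d for d in DOCS if d["acl"] & (groups | {"all"})]
    words = query.lower().split()
    return [d for d in visible if any(w in d["text"].lower() for w in words)]

for q in subqueries("FY25 budget by region, public pricing"):
    print(q, "->", [d["id"] for d in retrieve(q, {"finance"})])
```

The point of the ordering is that access control is a retrieval-layer concern, not a prompt-engineering concern: trimming happens before grounding, so permissions cannot be bypassed by a clever query.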

This is also why I argued in an earlier article that "AI won't replace strategy: It will expose it." The same is true here. AI will not merely expose whether you have adopted the latest model. It will expose whether your company actually understands its own information architecture, or whether it has been living in a fog of disconnected documents, permissions, and processes. 

What the next phase of AI will actually reward

The companies that win in the next phase of artificial intelligence will not be the ones that can afford the biggest token bills. They will be the ones that build systems that do not need them. 

They will treat tokens the way good engineers treat bandwidth, battery life, or latency: not as infinite resources to be consumed theatrically, but as constraints that reward intelligent design. They will store most of the context in world models. They will use large models when large models are justified. They will use retrieval when retrieval is enough. They will cache repeated context. They will route simpler work to cheaper models. They will manage memory instead of romanticizing it. They will distinguish between context that is available and context that is actually useful. 
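One way to "route simpler work to cheaper models" is a heuristic dispatcher. The model names and the length-and-keyword heuristic below are assumptions for illustration; production routers typically use trained classifiers or the model's own confidence signals rather than a hardcoded word list.

```python
# Toy router illustrating "route simpler work to cheaper models."
# Model names and the heuristic are illustrative assumptions only.

CHEAP_MODEL = "small-model-v1"      # hypothetical name
EXPENSIVE_MODEL = "large-model-v1"  # hypothetical name

HARD_HINTS = ("analyze", "plan", "multi-step", "prove", "compare")

def route(query: str) -> str:
    """Send long or reasoning-flavored queries to the large model;
    everything else goes to the cheap one."""
    q = query.lower()
    if len(q.split()) > 50 or any(hint in q for hint in HARD_HINTS):
        return EXPENSIVE_MODEL
    return CHEAP_MODEL

print(route("What is our refund policy?"))                # cheap path
print(route("Analyze Q3 churn drivers and plan a fix."))  # expensive path
```

Even a crude router like this changes the cost curve, because in most workloads the easy queries vastly outnumber the hard ones.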

And, crucially, they will stop confusing brute force with sophistication. That is the part of the current AI narrative that deserves a serious correction. The industry keeps encouraging us to imagine a future in which ever-growing token consumption is simply the price of progress. 

It probably isn’t. It is, at least in part, the price of immature architecture. And mature architecture has a way of destroying bad business models. 

The future of AI will not belong to the companies that consume the most tokens. It will belong to the ones that know how to need fewer of them. 
