June 4, 2026

Token Costs Are an Architecture Problem, Not a Model Problem

By Samir Dutta, Co-founder and CEO of Farsight  ·  Published June 4, 2026

The most effective way to reduce enterprise AI token costs is to stop paying for intelligence twice. In high-repeat domains, most output follows patterns that recur constantly, and yet most systems treat every request as if it were the first one they had ever seen. The result is a cost structure that scales linearly with usage, when it shouldn’t.

The math behind this is already visible at the industry level. The cost of running a model at GPT-3.5 capability fell more than 280-fold between November 2022 and October 2024, according to Stanford’s AI Index. And yet total AI bills have risen sharply over the same period, because usage has grown faster than unit costs have fallen. Gartner now forecasts global AI spending will reach $2.5 trillion in 2026. Cheaper tokens have not produced cheaper AI. They have produced more of it.
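To make that arithmetic concrete, here is a minimal sketch in Python. Only the 280-fold price decline comes from the AI Index figure above; the starting price, the token volumes, and the usage growth factor are invented purely for illustration.

```python
# Illustrative arithmetic only: every number except PRICE_DROP is hypothetical.
PRICE_DROP = 280        # fold decline in unit cost, as cited above
USAGE_GROWTH = 1_000    # hypothetical fold growth in tokens consumed


def total_bill(price_per_million_tokens: float, tokens: float) -> float:
    """Bill in dollars: unit price multiplied by volume."""
    return price_per_million_tokens * tokens / 1_000_000


old_price = 28.00                    # hypothetical dollars per million tokens
new_price = old_price / PRICE_DROP   # same capability after the decline
old_tokens = 1_000_000_000           # hypothetical monthly token volume
new_tokens = old_tokens * USAGE_GROWTH

old_bill = total_bill(old_price, old_tokens)
new_bill = total_bill(new_price, new_tokens)

print(f"old bill: ${old_bill:,.0f}")
print(f"new bill: ${new_bill:,.0f}")
print(f"bill multiple: {new_bill / old_bill:.2f}x")
```

Under these assumed numbers the unit price falls 280-fold and the bill still rises roughly 3.6 times, because the bill multiple is simply usage growth divided by the price drop.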

Most AI products still treat the foundation model as the product. That creates a predictable trap. Every workflow gets routed through the biggest, most expensive model, every request becomes a fresh inference event, and cost grows in lockstep with adoption.

That works for a demo. It breaks at enterprise scale.

The instinct is to treat this as an optimization problem: pick a cheaper model, tune a prompt, trim some tokens. But the root cause isn’t the model. It’s the architecture around it. If a system has to rethink the same work every time, you are paying for reasoning that should already be settled.

Why do enterprise AI costs scale with usage?

In most enterprise environments, a large share of knowledge work is not truly net new. Teams produce the same categories of output over and over: reports, profiles, summaries, recurring analyses. The specifics change, but the structure, the logic, the source hierarchy, and the quality bar stay remarkably consistent.

That repeatability is exactly where cost discipline should come from, and exactly where most systems leave it on the table. The expensive part of intelligence is genuinely new reasoning: synthesis, judgment, interpretation, creation. The cheap part, or what should be the cheap part, is everything the system has effectively done before.

The problem is that a model-first architecture can’t tell the difference. To it, the fortieth version of a recurring task looks identical to the first. So it spends the same effort, and you pay the same price, regardless of how much is actually new. Deloitte has noted that agentic AI, with its continuous inference, is already sending some enterprises’ monthly token costs into the tens of millions of dollars. The systems most exposed to this are precisely the ones doing the same work over and over without recognizing it as repeat work.

A worked example

Consider a recurring report, the kind someone might refresh every week. The first version is real work: deciding what matters, how to frame it, what to include and what to cut, and how the narrative should read. That is the kind of net-new reasoning a frontier model is built for, and exactly where its budget should go.

By the fortieth refresh, almost none of that is new. The structure, the standards, and the framing are all settled. What has actually changed is the underlying data and a handful of recent developments. Paying a frontier model to re-derive the whole thing from scratch every week is paying for intelligence that has already been spent.

The fix isn’t a cheaper model doing the same redundant work. It’s an architecture that recognizes what is genuinely new and reserves expensive reasoning for that, and only that. Get this right and the per-refresh cost falls dramatically.
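One simple way to picture this is a cache keyed on each report section's inputs, so that the expensive call happens only when those inputs change. The sketch below is a hypothetical illustration in Python, not a description of any particular product: `SectionCache`, `fake_model`, and the sample report data are all invented names, and exact-match fingerprinting is only the most basic way to detect repeat work.

```python
import hashlib
import json
from typing import Callable, Dict


class SectionCache:
    """Reuse a previously generated section when its inputs are unchanged."""

    def __init__(self, generate: Callable[[str, dict], str]) -> None:
        self._generate = generate          # stand-in for an expensive model call
        self._store: Dict[str, str] = {}   # fingerprint -> generated text
        self.model_calls = 0               # how often we paid for inference

    @staticmethod
    def _fingerprint(section: str, inputs: dict) -> str:
        # Stable hash of the section name plus its inputs.
        payload = json.dumps({"section": section, "inputs": inputs}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def render(self, section: str, inputs: dict) -> str:
        key = self._fingerprint(section, inputs)
        if key not in self._store:
            # Genuinely new inputs: pay for reasoning once and remember it.
            self.model_calls += 1
            self._store[key] = self._generate(section, inputs)
        return self._store[key]


def fake_model(section: str, inputs: dict) -> str:
    """Hypothetical placeholder for a frontier-model call."""
    return f"{section}: {sorted(inputs.items())}"


cache = SectionCache(fake_model)
week_1 = {"methodology": {"version": 3}, "outlook": {"revenue": 100}}
week_2 = {"methodology": {"version": 3}, "outlook": {"revenue": 112}}

for report in (week_1, week_2):
    for section, inputs in report.items():
        cache.render(section, inputs)

print(cache.model_calls)  # 3: the unchanged methodology section is reused in week 2
```

Two refreshes of a two-section report cost three model calls instead of four here. Over forty refreshes, the settled sections would be paid for once rather than forty times.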

What does an architecture-first approach actually save?

Shifting from model-first to architecture-first has three measurable consequences.

Lower unit costs. Repeated work should get dramatically cheaper over time, not stay permanently expensive. Cost should track how much work is genuinely new, not how often the system runs.

Lower latency. Fewer unnecessary model calls usually means faster output, and in live workflows, speed matters as much as cost.

More predictable spend. When every request hits a frontier model, cost volatility becomes a real governance problem. When more of the workload is handled deterministically, spend becomes forecastable.
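The difference between cost that tracks run count and cost that tracks novelty can be sketched as a toy cost model in Python. The dollar figure and the share of each refresh that is genuinely new are hypothetical assumptions, chosen only to show the shape of the two curves.

```python
# Toy cost model: all constants are hypothetical assumptions.
FULL_REPORT_COST = 2.00   # assumed dollars for one full frontier-model generation
REFRESHES = 40            # the fortieth refresh from the worked example
NOVEL_SHARE = 0.10        # assumed share of each later refresh that is new


def model_first_cost(refreshes: int) -> float:
    """Every refresh is regenerated in full, so cost tracks run count."""
    return refreshes * FULL_REPORT_COST


def architecture_first_cost(refreshes: int, novel_share: float) -> float:
    """The first run is paid in full; later runs pay only for the new share."""
    if refreshes <= 0:
        return 0.0
    return FULL_REPORT_COST * (1 + (refreshes - 1) * novel_share)


print(f"model-first:        ${model_first_cost(REFRESHES):.2f}")
print(f"architecture-first: ${architecture_first_cost(REFRESHES, NOVEL_SHARE):.2f}")
```

With these assumed inputs the cumulative cost is $80.00 in the first case and $9.80 in the second. The second figure also depends on a quantity that can be estimated ahead of time, which is what makes the spend easier to forecast.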

Why isn’t swapping models enough?

Efficiency is about more than reaching for a cheaper model. Model swapping helps at the margin, but it doesn’t touch the root cause, which is redundant inference on work the system has effectively already done.

The better posture is model-agnostic and architecture-first. Use the strongest model for the parts that demand it. Use smaller, faster models where they are sufficient. And avoid unnecessary inference altogether where the work is already well understood.
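That three-way split can be expressed as a small routing function. The Python sketch below is hypothetical: the `Task` fields and the `Route` names are invented for illustration, and in a real system the hard part is deciding how `seen_before` and `needs_judgment` get set, which this sketch takes as given.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    REUSE = "reuse"                    # no inference: the work is already settled
    SMALL_MODEL = "small_model"        # a smaller, faster model is sufficient
    FRONTIER_MODEL = "frontier_model"  # strongest model for genuinely new reasoning


@dataclass(frozen=True)
class Task:
    name: str
    seen_before: bool     # an equivalent task already has a stored result
    needs_judgment: bool  # requires synthesis or interpretation


def route(task: Task) -> Route:
    """Pick the cheapest path that can still handle the task."""
    if task.seen_before:
        return Route.REUSE
    if task.needs_judgment:
        return Route.FRONTIER_MODEL
    return Route.SMALL_MODEL


tasks = [
    Task("boilerplate disclosures", seen_before=True, needs_judgment=False),
    Task("reformat updated table", seen_before=False, needs_judgment=False),
    Task("interpret a surprise result", seen_before=False, needs_judgment=True),
]

for task in tasks:
    print(f"{task.name} -> {route(task).value}")
```

The ordering of the checks matters: reuse is tested first so that work already done never reaches a model at all, regardless of how much judgment it originally required.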

This is the shift I think the next phase of enterprise AI hinges on. The first wave was about access to raw capability, getting the most powerful model into the workflow at all. The next wave is about spending that capability wisely: building systems that know when they are solving something new versus repeating themselves, and that don’t pay frontier prices for work that has already been figured out.

In that world, token costs stop being the headline. They become a byproduct of good product design.

This is the principle we are building on at Farsight: AI for finance that treats efficiency as an architecture decision, not a model one.

Sources: Stanford HAI, 2025 AI Index Report (inference unit-cost decline); Gartner (2026 global AI spending forecast); Deloitte (agentic AI inference costs).