Intelligence Got Cheap This Year — and the Value Moved to What You Build With It

The price of machine intelligence fell by roughly three-quarters in twelve months. That is not a footnote. It quietly rewrites who wins.

The Number That Changed the Calculus

Here is the figure worth sitting with: the average cost of a million tokens — the unit AI models are billed in — fell from around $10 to around $2.50 over the past year. Roughly a fourfold drop in twelve months, on the core input of an entire industry.
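To make that arithmetic concrete, here is a small sketch using the rounded figures above. The workload (1,000 calls a day at 2,000 tokens each) is a hypothetical chosen for illustration, not a number from any provider.

```python
# Illustrative arithmetic using the article's rounded figures (USD per million tokens).
old_price = 10.00
new_price = 2.50

fold_drop = old_price / new_price                  # how many times cheaper
percent_drop = (1 - new_price / old_price) * 100   # the same change as a percentage

# Hypothetical workload (an assumption): 1,000 calls a day, 2,000 tokens per call.
tokens_per_day = 1_000 * 2_000


def daily_cost(price_per_million: float, tokens: int) -> float:
    """Cost in dollars for a given number of tokens at a per-million-token price."""
    return price_per_million * tokens / 1_000_000


print(f"{fold_drop:.0f}x cheaper, a {percent_drop:.0f}% drop")
print(f"daily cost: ${daily_cost(old_price, tokens_per_day):.2f} "
      f"-> ${daily_cost(new_price, tokens_per_day):.2f}")
```

At these assumed volumes the same job goes from about $20 a day to about $5 a day, which is the difference between rationing calls and not thinking about them.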

It is hard to think of a comparable input cost that has collapsed this fast. Imagine if the price of electricity, or steel, or bandwidth had dropped seventy-five percent in a year. Whole business models that did not make sense at the old price suddenly pencil out at the new one. Things you would never have automated because the per-call cost was absurd become trivially cheap to run at scale. When a fundamental cost falls this hard, the interesting action moves to whoever can take advantage of the new price — and that is rarely the people who set it.

I have felt this directly. Eighteen months ago, building anything that called a frontier model in a loop meant watching a meter and rationing every request. The same loop today costs little enough that I stopped counting. That shift — from rationing intelligence to spending it freely — changes what you are willing to attempt.

Who Pushed the Price Down

The downward pressure came largely from a direction the dominant US labs did not fully control: cheap, genuinely capable open-weight models, many of them from China. Names like DeepSeek, Qwen, and Kimi released models that are competitive on hard tasks — coding, reasoning — at a fraction of the cost, and crucially, with open weights you can download and run yourself. Once a capable model is free to obtain and cheap to run, it sets a floor that every paid API has to reckon with.

The response from the frontier has been telling. In June, business press reported that OpenAI was weighing steep cuts to its token prices, specifically to counter both Anthropic's enterprise gains and price undercutting by the Chinese open-weight models. There is a real tension here that is worth naming honestly: the same labs reportedly preparing to go public are the ones under pressure to cut prices, and cutting prices compresses exactly the margins that public-market investors will scrutinize. The economics of running a frontier lab remain brutal: these companies still burn cash heavily, and the path to profitability runs through efficiency gains and enterprise contracts, not consumer enthusiasm.

One caveat, which the research I am drawing on also raises: the eye-popping valuation figures floating around for these labs vary wildly between sources and should be treated with suspicion. The verifiable story is the price of tokens, not the price of the companies.

Where the Moat Actually Is Now

The strategic consequence is the part worth internalizing, because it inverts the instinct most people have about this industry.

When raw intelligence was scarce and expensive, access to the best model was a genuine advantage. If you had the frontier model and your competitor did not, you had something they could not easily replicate. But when capable intelligence is cheap and available from a dozen providers — and increasingly downloadable for free — model access stops being a moat. Everyone has it. It becomes a commodity input, like electricity or cloud storage: necessary, but not differentiating.

So where does the advantage go? Downstream. It moves to the things that were always hard and are still hard: a genuinely useful product, proprietary or hard-won data, a distribution channel that reaches real users, and the unglamorous engineering of making a system reliable and cost-efficient. The moat is the application, not the model. The company that wins is increasingly the one that understands a real problem deeply and wraps cheap intelligence around it — not the one that owns the smartest model.

This is genuinely good news if you are a builder rather than a heavily capitalized frontier lab. The expensive, capital-intensive part of the stack, training frontier models, is being commoditized by competition. The part that rewards insight, taste, and proximity to a real problem is the part left standing. That favors small teams who understand their users over large ones who own infrastructure.

The Model Moved Onto Your Laptop

There is a second shift compounding the first: capable models are increasingly small enough to run locally, on your own machine, with no per-token fee at all.

A model like Qwen's recent edge-focused release is built specifically to run on a single device. Paired with a now-mature ecosystem of local tools (Ollama, llama.cpp, LM Studio, and MLX on Apple Silicon), it is genuinely practical to run a useful assistant entirely offline. The drivers are obvious once you list them: privacy (your data never leaves the machine), latency (no round trip to a server), cost (no metered fees), and regulatory comfort (sensitive data stays inside a boundary you control).

Local models still trail the frontier on the hardest reasoning, so the pattern emerging is hybrid rather than either-or: a small local model handles the routine, private, high-volume work, and routes only the genuinely hard problems out to a frontier API. That architecture — cheap-and-local by default, expensive-and-remote by exception — is quietly becoming the sensible default for cost-conscious systems.
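A minimal sketch of that local-by-default pattern follows. Everything named here is an assumption for illustration: the cue list, the word-count threshold, and the two stub handlers stand in for whatever difficulty signal and model clients a real system would use.

```python
from typing import Callable

# Hypothetical cues suggesting a prompt needs heavier reasoning (an assumption).
HARD_CUES = ("prove", "step by step", "debug", "refactor")


def looks_hard(prompt: str, max_easy_words: int = 200) -> bool:
    """Crude difficulty check: very long prompts or explicit reasoning cues."""
    lowered = prompt.lower()
    too_long = len(prompt.split()) > max_easy_words
    return too_long or any(cue in lowered for cue in HARD_CUES)


def make_router(
    local: Callable[[str], str],
    frontier: Callable[[str], str],
    is_hard: Callable[[str], bool] = looks_hard,
) -> Callable[[str], str]:
    """Return a function that answers locally unless the prompt looks hard."""
    def answer(prompt: str) -> str:
        handler = frontier if is_hard(prompt) else local
        return handler(prompt)
    return answer


# Stub handlers standing in for a local model and a paid frontier API.
def local_stub(prompt: str) -> str:
    return f"[local] {prompt}"


def frontier_stub(prompt: str) -> str:
    return f"[frontier] {prompt}"


answer = make_router(local_stub, frontier_stub)
print(answer("Summarize this meeting note."))
print(answer("Prove that the sum of two even numbers is even."))
```

The design choice worth noticing is that the expensive path is the exception you have to justify. In practice the keyword check would likely give way to something better, such as a small classifier or a confidence signal from the local model itself.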

What This Means for What You Learn

If cheap inference means application builders win, then the skills that matter shift accordingly, and they are not the ones the hype cycle emphasizes.

The valuable engineering skill is no longer "get access to the best model." It is building cost-efficient systems: caching results so you do not pay twice for the same answer, routing easy requests to a cheap small model and hard ones to a frontier model, and tracking price-per-quality as the real purchasing metric rather than chasing the top of a benchmark. Learning to fine-tune and serve an open-weight model, using techniques like LoRA and a serving layer like vLLM, is now one of the most transferable, vendor-independent skills in the field, and the tooling is free to experiment with.
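Two of those habits, caching and price-per-quality, fit in a few lines. This is a sketch under stated assumptions: the cache only helps when identical prompts recur, the model names and prices are made up, and the quality scores are meant to come from your own evaluation set rather than a public benchmark.

```python
import hashlib
from typing import Callable, Dict


class CachedModel:
    """Wraps any prompt -> text callable and remembers answers by prompt hash."""

    def __init__(self, call_model: Callable[[str], str]) -> None:
        self._call_model = call_model
        self._cache: Dict[str, str] = {}
        self.paid_calls = 0  # how many times the underlying model was actually called

    def ask(self, prompt: str) -> str:
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self.paid_calls += 1
            self._cache[key] = self._call_model(prompt)
        return self._cache[key]


def price_per_quality(price_per_million: float, quality_score: float) -> float:
    """Dollars per million tokens per unit of quality; lower is better."""
    if quality_score <= 0:
        raise ValueError("quality_score must be positive")
    return price_per_million / quality_score


# Hypothetical candidates: (price per million tokens, score from 0 to 1 on your own evals).
candidates = {"small-local": (0.10, 0.70), "frontier": (2.50, 0.90)}
best = min(candidates, key=lambda name: price_per_quality(*candidates[name]))

model = CachedModel(lambda prompt: prompt.upper())  # stand-in for a real model call
model.ask("hello")
model.ask("hello")  # served from the cache, no second paid call
print(best, model.paid_calls)
```

A ratio like this is only a starting point. If a task has a hard quality floor, filter out the candidates that miss it first and then compare what is left on price.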

The single cheapest, highest-return move I can suggest is concrete: install Ollama or LM Studio this week and run a small model locally. It costs nothing, and an hour of doing it teaches you more about inference, quantization, and the real limits of these systems than a month of reading. It also happens to be a safe, no-cost sandbox for a curious kid — a model running on your own laptop, answering questions, with no account, no bill, and no data leaving the house.
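Once Ollama is installed and running, a few lines of standard-library Python are enough to talk to it. This sketch assumes Ollama's default local endpoint on port 11434 and a model you have already pulled; the model name "llama3.2" is a placeholder for whichever one you chose.

```python
import json
import urllib.request

# Ollama's default local endpoint; adjust if you changed the host or port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "llama3.2") -> urllib.request.Request:
    """Build a non-streaming generate request. The model name is a placeholder."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_local(prompt: str, model: str = "llama3.2", timeout: float = 120.0) -> str:
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["response"]
```

With the server up, calling ask_local("Explain quantization in one sentence.") returns the model's reply as a string. Nothing in that round trip leaves your machine.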

The labs will keep fighting at the frontier, and that fight is genuinely consequential. But the quiet truth of this year is that the frontier got cheap enough to stop being where most of the value lives. The value moved to the people close enough to a real problem to know what is worth building. That has always been the more interesting place to stand.

FAQ

If models are getting cheaper, why do the labs need so much money?

Training frontier models is enormously capital-intensive — compute, data, and talent all cost a great deal up front. The price you pay per token has fallen, but the cost to build the model behind it has not fallen nearly as fast. That gap is why frontier labs still burn cash and lean on enterprise contracts and efficiency gains rather than consumer revenue for a path to profitability.

Does cheap inference mean I should stop paying for frontier models?

Not necessarily — it means matching the tool to the task. For the hardest reasoning, a frontier model still earns its price. For routine, high-volume work, a cheap or local model is often more than sufficient. The cost-efficient pattern is hybrid: default to cheap, escalate to expensive only when the task genuinely demands it.
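How much the hybrid saves depends on how often you escalate. The sketch below computes a blended price for a given escalation rate. The two prices are hypothetical, and it assumes cheap and escalated requests use similar token counts.

```python
def blended_price(cheap_price: float, frontier_price: float, escalation_rate: float) -> float:
    """Average price per million tokens when a fraction of traffic is escalated."""
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    return (1 - escalation_rate) * cheap_price + escalation_rate * frontier_price


# Hypothetical prices in USD per million tokens (assumptions, not quotes).
cheap, frontier = 0.20, 2.50

for rate in (0.0, 0.1, 0.5, 1.0):
    print(f"escalate {rate:.0%} of traffic: ${blended_price(cheap, frontier, rate):.2f} per million tokens")
```

With these assumed prices, escalating one request in ten gives a blended price of about $0.43 per million tokens, against $2.50 if everything went to the frontier model.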

Are open-weight models really competitive with the big paid ones?

On many tasks, yes. Open models now lag the frontier by months rather than years, and on specific tasks like coding, some are at or near the top tier. They still trail on the very hardest reasoning, and you take on the work of running your own infrastructure. But for a large share of real applications, an open model you control is now a serious option.

What is the catch with running models locally?

Local models trade some capability for control. They fit on consumer hardware largely through quantization, which stores the model's weights at lower numerical precision; that costs some quality, and they will not match a frontier model on the hardest problems. The upside is privacy, zero per-use cost, and no dependence on any vendor. For routine and private tasks, that trade is increasingly worth it.
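A rough way to see why quantization matters is to estimate the memory the weights alone need at different precisions. The 7-billion-parameter size is a hypothetical example. The estimate ignores the context cache, activations, and the small overhead that quantization formats add, so real usage will be somewhat higher.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Hypothetical 7-billion-parameter model at three common precisions.
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_memory_gb(7e9, bits):.1f} GB")
```

Going from 16-bit to 4-bit weights cuts the estimate from about 14 GB to about 3.5 GB, which is roughly the difference between needing a workstation GPU and fitting on an ordinary laptop.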