Stop Buying Tokens: The AI Cost Strategy Most Businesses Get Wrong
AI Strategy | April 30, 2026 · 10 min read


Uber burned its entire 2026 AI budget in four months. The tools were working -- the budget architecture was just completely wrong. Here is the framework for getting it right: when to buy tokens, when to use subscriptions, how to tier your models, and how to build an LLM layer that does not surprise you at the end of the month.

Gabe Keding, Parker Newell, Luke Keding
The OneWave Team · AI Consulting

Uber Burned Its Entire 2026 AI Budget in Four Months

In December 2025, Uber gave 5,000 engineers access to Claude Code. By February, usage had nearly doubled. By April, the CTO announced to the company that they had burned through the entire annual AI budget. All of it. Before summer.

The culprit was not rogue spending or a bad contract. It was the adoption curve. Monthly API costs per engineer ranged from $500 to $2,000 as usage scaled. By the time the bill caught up to the rollout, 84% of Uber's developers were classified as agentic-coding users in internal telemetry. The consumption-based pricing that looked manageable in pilot had become ruinous at company-wide scale.

Uber is not a cautionary tale about AI being wasteful. It is a cautionary tale about what happens when you do not have a strategy before you scale. The tools were working. The engineers were productive. The budget architecture was just completely wrong for what was actually being used.

We see a version of this all the time -- not at Uber scale, but the same mistake. Teams burning through tokens for work that subscriptions would cover. Agents running on Opus when Haiku would do. No visibility into where the spend is going until the invoice arrives. This post is about fixing that before it happens to you.


The Two Legitimate Cases for Buying Tokens

Token-based API access is not inherently wrong. It is wrong in the wrong context. There are two situations where it makes sense.

The first is the obvious one: you are building a product, AI inference is a core cost of goods, and you are charging customers at a margin on top of it. If you are reselling AI output as part of a SaaS or service, token-based billing is how you track costs against revenue. That math works.

The second is less talked about but equally valid: agents and automated workflows running on cheaper models. If you are orchestrating a pipeline that processes thousands of documents, classifies inbound requests, extracts structured data at scale, or runs background tasks without a human in the loop -- tokens make sense there. The key word is cheaper. Haiku, Sonnet, GPT-4o-mini, Gemini Flash. These models are fast, cheap, and more than capable for well-defined automated tasks. Tokens for agents on mid and budget tier models can be extremely cost-effective.

What does not make sense is buying frontier tokens -- Opus 4.7, o3, the top-of-the-line models -- for every employee, for every task, all the time. That is where companies blow their budget. They treat their most expensive model like a default setting instead of a precision tool.

The mistake is not buying tokens. The mistake is buying Opus-level tokens for everything, including work that Haiku or a subscription could handle for a fraction of the cost.
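To make that gap concrete, here is a quick back-of-the-envelope in Python, using the per-1M-token prices quoted later in this post. The pipeline volumes are hypothetical:

```python
# Monthly cost of a hypothetical automated pipeline that consumes
# 50M input tokens and produces 10M output tokens.
PRICES = {                     # (input, output) in USD per 1M tokens
    "opus-4.7":  (15.00, 75.00),
    "haiku-4.5": (0.25, 1.25),
}

def monthly_cost(model: str, input_millions: float, output_millions: float) -> float:
    price_in, price_out = PRICES[model]
    return input_millions * price_in + output_millions * price_out

print(monthly_cost("opus-4.7", 50, 10))    # $1,500 per month
print(monthly_cost("haiku-4.5", 50, 10))   # $25 per month
```

Same workload, a 60x difference. That is the gap a default-to-Opus habit quietly eats.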

Subscriptions Are for Humans. Tokens Are for Machines.

This is the cleanest way to think about it. When a person is sitting at a keyboard doing work -- writing, coding, analyzing, building -- they should be on a subscription. When a system is running automated tasks at scale without a human in the loop, it should use tokens, and it should be doing it on a model that is priced appropriately for the task.

For internal human use, our recommendation is to get every serious user to the maximum subscription tier on both Anthropic and OpenAI. Claude Max covers the full Claude family including Opus, plus Claude Code for developers. ChatGPT Pro covers GPT-4o, o3, and Codex. Together that is roughly $300 per month per seat -- and it comes with no token meter, no rationing, no moment where someone decides not to use the better model because they are watching a budget. That friction is expensive in ways that do not show up on an invoice.

If you are coding with AI, the subscription argument is even stronger. The moment a developer starts throttling their Claude Code or Codex usage because of API costs, you have already lost more in productivity than you saved in billing. Subscriptions remove that friction entirely.

Humans doing work → use subscriptions
• Employees writing, coding, researching, analyzing
• Claude Max ($100/mo) + ChatGPT Pro ($200/mo) ≈ $300/seat total. No token meter.

Agents + workflows → tokens on cheap models
• Automated pipelines, classification, extraction, background tasks
• Haiku 4.5 ($0.25 in / $1.25 out per 1M tokens) · GPT-4o-mini ($0.15 / $0.60) · Gemini Flash ($0.075 / $0.30)

Frontier tokens for everything → stop
• Opus 4.7 API for every employee, every task, all the time ($15 / $75 per 1M tokens)
• This is where budgets blow up.

Model Tiers: Stop Using Opus When Haiku Will Do

Not every task deserves your most expensive model. This is where most teams that do use the API bleed money -- they default to Opus or GPT-4o for everything because those are the names they know, and they never build the habit of matching model capability to task complexity.

Tier 1 — High Complexity. Use sparingly; reserve for work where quality is non-negotiable.
• Complex reasoning · long-form analysis · nuanced writing · high-stakes review
• Claude Opus 4.7 ($15 in / $75 out per 1M tokens) · OpenAI o3 (premium pricing)

Tier 2 — General Purpose. Your production daily driver for most automated tasks.
• Production tasks · customer-facing content · structured extraction · summarization
• Claude Sonnet 4 ($3 / $15 per 1M) · GPT-4o ($2.50 / $10 per 1M)

Tier 3 — Fast and Cheap. Use more than you think; most high-volume workloads live here.
• Classification · routing logic · simple extraction · first-pass filtering · high-volume pipelines
• Claude Haiku 4.5 ($0.25 / $1.25 per 1M) · GPT-4o-mini ($0.15 / $0.60 per 1M) · Gemini 2.0 Flash ($0.075 / $0.30 per 1M)

The rule of thumb we give clients: if the task is well-defined and the output is verifiable, use a cheaper model. If the task requires genuine reasoning, synthesis across conflicting information, or produces output that goes directly to a customer without review, use a top-tier model. Most workloads split roughly 70/30 between tiers two and three once teams actually audit what they are using Opus for.

Task-to-Model Decision Flow
1. Is a human in the loop?
Yes → use a subscription (Claude Max / ChatGPT Pro). Do not buy tokens for work people do interactively.
No → it is an automated pipeline; go to question 2.
2. Is the output high-stakes or going to customers without review?
Yes → Tier 1 tokens (Opus 4.7, o3). Use sparingly and measure ROI.
No → Tier 2/3 tokens (Sonnet, Haiku, Flash). This is the cost-effective zone.
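In code, the whole flow reduces to two booleans. A minimal sketch; the return values are labels, not API model IDs:

```python
def choose_model(human_in_loop: bool, high_stakes: bool) -> str:
    """Map the two questions in the decision flow to a model choice."""
    if human_in_loop:
        # Interactive work: a subscription seat, never metered tokens.
        return "subscription (Claude Max / ChatGPT Pro)"
    if high_stakes:
        # Unreviewed customer-facing output or genuinely hard reasoning.
        return "tier-1 tokens (Opus 4.7 / o3)"
    # Well-defined, verifiable output: the cost-effective zone.
    return "tier-2/3 tokens (Sonnet / Haiku / Flash)"
```

If your routing logic is much more complicated than this, that is usually a sign the task categories themselves have not been defined clearly enough.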

Build Redundancy Into Your LLM Layer

Single-provider AI architectures are a liability. Models go down. Rate limits hit at the worst moments. A provider has an outage during your demo. These are not edge cases -- they happen regularly enough that any production system needs a fallback plan baked in from the start.

The practical version of this is building your stack so that if Anthropic is unreachable, an equivalent OpenAI or Gemini call fires automatically. For most tasks, Claude Sonnet and GPT-4o produce similar enough results that a fallback does not degrade your product. For Opus-level tasks, you need to decide in advance what acceptable degradation looks like -- is it retrying on o3, or is it queuing the request and waiting?
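The DIY version is an ordered provider list and a try/except. A minimal sketch using the Anthropic and OpenAI Python SDKs; the model names are illustrative, and production code would catch specific exception types and add retries:

```python
import anthropic
from openai import OpenAI

def complete_with_fallback(prompt: str) -> str:
    """Try Anthropic first; if the call fails, fire the OpenAI equivalent."""
    try:
        resp = anthropic.Anthropic().messages.create(
            model="claude-sonnet-4",  # illustrative model name
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.content[0].text
    except Exception:
        # Anthropic unreachable or rate-limited: equivalent OpenAI call.
        resp = OpenAI().chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content
```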

This is also why the multi-subscription approach matters. When you have active accounts with multiple providers, failover is a configuration change, not a procurement process.

Router Tools: The Infrastructure Layer You Are Probably Missing

If you are building anything beyond simple single-model calls, a routing layer belongs in your architecture. These tools sit between your application and the model providers, handling fallbacks, cost optimization, load balancing, and observability automatically.

LiteLLM is the most widely used. It gives you a unified interface across 100+ models -- swap Claude for GPT-4o or Gemini with a one-line config change. It handles rate limit retries, tracks spend per project, and lets you set budget caps so a runaway process cannot surprise you with a $4,000 bill. For teams that want open-source and self-hosted, this is the default choice.
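A hedged sketch of that setup with LiteLLM's Router; the group and model names are illustrative, and the exact options vary by version, so check LiteLLM's docs:

```python
from litellm import Router

# One logical model group with an explicit fallback group behind it.
router = Router(
    model_list=[
        {"model_name": "workhorse",
         "litellm_params": {"model": "anthropic/claude-sonnet-4"}},
        {"model_name": "workhorse-fallback",
         "litellm_params": {"model": "openai/gpt-4o"}},
    ],
    fallbacks=[{"workhorse": ["workhorse-fallback"]}],
    num_retries=2,  # retry rate-limit errors before falling back
)

resp = router.completion(
    model="workhorse",
    messages=[{"role": "user", "content": "Extract the invoice total: ..."}],
)
print(resp.choices[0].message.content)
```

Your application code only ever asks for "workhorse"; which provider actually answers is the router's problem.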

OpenRouter works similarly but as a managed service -- they sit in the middle, you call one API, and they route to whichever provider you configure. Useful if you do not want to manage the infrastructure yourself and want access to models outside the main provider APIs.

RouteLLM, developed by the team at LMSYS, takes a different angle: it uses a trained classifier to automatically decide whether a given prompt needs a strong model or a weak one. Instead of you manually routing by task type, the router learns which requests actually need Opus-level reasoning and which ones a Haiku can handle. In their benchmarks this approach cut costs by over 50% while maintaining output quality. You can read more about how agents and routing decisions interact in our post on chatbots vs AI agents and how to build your first AI agent.

Portkey and Helicone are worth knowing for observability -- they sit in front of your model calls and give you per-request logging, latency tracking, and cost visibility without having to build that instrumentation yourself. If you are running a production system and do not know where your token spend is actually going, one of these should be your first stop.

LiteLLM — Open-source gateway
  • Unified API for 100+ models
  • Budget caps per project
  • Rate limit retries + fallbacks
  • Self-hosted — your infra, your data
Best for: Teams that want full control
OpenRouter — Managed gateway
  • No infra to manage
  • Access to models outside main APIs
  • Single unified billing
  • Fast to get started
Best for: Rapid prototyping, smaller teams
RouteLLM — Intelligent routing
  • Auto-decides: strong vs weak model
  • Trained classifier on prompt complexity
  • 50%+ cost reduction in benchmarks
  • No manual routing rules
Best for: Mixed workloads, variable complexity
Portkey / Helicone — Observability
  • Per-request cost + latency logging
  • Prompt caching analytics
  • Spend dashboards by project
  • Drop-in — no code changes
Best for: Understanding where money goes
A routing layer is not a nice-to-have once you are running multiple models in production. It is the difference between a managed LLM cost structure and one that surprises you every month.

Your LLM Architecture Needs a Map

The question every company should be able to answer: which model does what in our system, and what fails over to what when something goes down?

Most companies cannot answer this. They have one or two models they use for everything, no documented fallback logic, and no visibility into whether the model they are paying for is actually the right one for the job. That is not a technology problem -- it is a planning problem.

A proper LLM layer design looks something like this: you have primary models assigned to specific task categories, secondary models for fallback, budget caps per service, and routing logic that escalates or degrades based on task complexity and availability. You know your cost per output type. You have alerts when spend crosses thresholds. You review model performance quarterly and adjust assignments as the model landscape shifts.
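Concretely, the map can start as a declarative config your router layer reads. A hypothetical sketch; the task categories, model names, and caps are all illustrative:

```python
# Which model does what, what it falls back to, and what it may spend.
LLM_MAP = {
    "high_stakes_review": {
        "primary": "claude-opus-4.7", "fallback": "o3",
        "monthly_cap_usd": 2_000,
    },
    "customer_content": {
        "primary": "claude-sonnet-4", "fallback": "gpt-4o",
        "monthly_cap_usd": 1_000,
    },
    "classification": {
        "primary": "gemini-2.0-flash", "fallback": "gpt-4o-mini",
        "monthly_cap_usd": 200,
    },
}
```

If you cannot write this file for your system today, that is the planning gap.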

Multi-Provider Redundancy Architecture
Your Application → Router Layer (LiteLLM / OpenRouter) → three providers:
• Anthropic (primary): Opus · Sonnet · Haiku
• OpenAI (secondary): o3 · 4o · 4o-mini
• Google Gemini (budget / fallback): Flash · Pro
Behavior at the router layer:
• Primary fails → secondary fires automatically
• Secondary fails → budget tier, or queue + retry
• Budget caps enforced at the router layer
• Cost + latency logged per request

The Full-Stack Advantage: Why Google, xAI, and Groq Play a Different Game

Most companies buying AI are renting from someone else's stack. A small number of companies own the whole thing -- model, compute, infrastructure, distribution -- and that changes what they can do with pricing in ways that everyone else simply cannot match.

Google is the clearest example. They own the TPUs their models run on, the data centers that house them, and the distribution channels -- Android, Chrome, Search, Workspace -- that put AI in front of billions of people. That integration means Google can price Gemini aggressively, subsidize free tiers, and still come out ahead because the value is captured elsewhere in the ecosystem. Gemini Flash being some of the cheapest capable inference available is not charity. It is a strategic choice made possible by owning the whole stack.

xAI is moving fast toward the same position. With Colossus -- one of the largest GPU clusters ever built -- xAI owns the raw compute. Grok is their model. And the recent acquisition of Cursor was a move that solved three problems at once: it gave Cursor access to xAI's compute infrastructure (solving their cost problem at scale), gave xAI a world-class engineering team and product DNA (solving a talent and product problem), and handed xAI a massive, deeply engaged developer user base overnight (solving a distribution and virality problem). That is what full-stack ownership enables -- acquisitions that are immediately accretive in multiple dimensions because you can offer what the acquisition needs.

Groq takes the compute advantage even further. Their custom LPU hardware is built specifically for inference, not training -- which means their throughput and cost per token on supported models is dramatically lower than GPU-based alternatives. They can offer inference speeds and prices that are structurally impossible for providers running on commodity hardware. It is not a better product. It is a different category of infrastructure.

The companies with full-stack ownership -- model, compute, and distribution -- can offer pricing, free tiers, and acquisitions that API-only players structurally cannot. Everyone else is playing on their terms.

For businesses making AI decisions today, this matters because it tells you something about where the floor on pricing goes. Google and xAI will continue to push inference costs down because their business models do not depend on inference margin the way a pure API provider's does. If you are building products on AI, the cheapest capable model two years from now is probably going to come from one of the companies that owns its own silicon -- and your architecture should be flexible enough to route to it when it does.

The OpenAI vs Anthropic Quality Gap Is Closing

This matters for cost strategy because the pricing gap remains real even as the quality gap narrows. A year ago, Opus was in a category by itself. Today, OpenAI's o3 and GPT-4o are competitive on most benchmarks, and o4-mini produces, at a fraction of the price, results that would have required Opus-level compute twelve months ago.

Anthropic still leads on certain task types -- particularly long-form reasoning, nuanced writing, and tasks where faithfulness to source material matters. As we covered in our piece on why we chose Anthropic over OpenAI, Claude's tendency to qualify uncertainty rather than confabulate is a genuine competitive advantage for high-stakes business use cases. We also ran a detailed head-to-head comparison of Claude, ChatGPT, Gemini, and others if you want the full breakdown. But for a growing number of tasks, OpenAI models are the cost-optimal choice, and a well-designed system will route to them when it makes sense.

The same logic applies to Gemini. Google's pricing on Flash-tier models is aggressive. For classification tasks, structured extraction, and high-volume lightweight inference, Gemini Flash is hard to beat on cost. It belongs in your toolkit even if it is never your primary model. For a deeper look at where each provider stands right now, see our breakdown of the OpenAI vs Anthropic ecosystem race.

Everyone Is in the Same Game

Here is something that does not get said enough: the AI cost journey every company goes through is the same journey. The scale is different. The dollar amounts are different. But the progression -- from first using AI to eventually thinking hard about compute infrastructure -- is the same arc, whether you are a ten-person SMB or a hyperscaler signing nuclear power deals.

It starts the same way for almost everyone. Someone on the team starts using ChatGPT or Claude through a browser. It is free or cheap. They do not think about cost at all -- they are just exploring. This is the discovery phase. Most companies are still here.

Then the team gets serious. People start hitting usage limits. The company buys subscriptions -- Claude Max, ChatGPT Pro -- and starts expecting employees to use AI as part of their actual workflow. Costs are predictable and manageable. This is the productivity phase.

Then someone wants to build something. An internal tool. A customer-facing feature. An automated workflow. Now you need the API. You start buying tokens. The billing is variable and harder to predict, but the output has clear business value. This is the building phase.

If the thing you built works, you start selling it. AI becomes a cost of goods. Token spend scales with revenue. You start caring about margins on inference, not just whether the thing works. This is the product phase. It is where smart token strategy starts to matter a lot.

Sooner than most people expect, the infrastructure question forces its way in. Not just because of volume -- but because of data. Healthcare companies, law firms, financial services, any business handling sensitive customer information: you cannot send that data to a public API endpoint. AWS Bedrock, Google Vertex AI, and Azure OpenAI exist precisely for this. They give you frontier model access inside a private, compliant environment where your data never leaves your cloud tenant. This is not an enterprise-only edge case. It is the default path for any company in a regulated industry or with serious data governance requirements. Many mid-sized businesses need to be here before they think they do.
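What that looks like in practice: the same Claude call, routed through your own AWS account instead of a public endpoint. A hedged sketch using boto3's Bedrock Converse API; the model ID is illustrative, and you would use whichever IDs are enabled in your region:

```python
import boto3

# Runs inside your AWS tenant; request data never leaves your environment.
client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",  # illustrative ID
    messages=[{
        "role": "user",
        "content": [{"text": "Summarize this patient intake form: ..."}],
    }],
)
print(response["output"]["message"]["content"][0]["text"])
```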

Keep going and the infrastructure question gets deeper. Dedicated capacity. Reserved instances. On-premise deployments for sensitive workloads. Running open-weight models locally because the unit economics finally make sense at your scale. The hardware your employees have starts to matter. Companies start thinking about power -- literally, the electricity required to run inference at scale.

And at the extreme end, you have Anthropic and OpenAI themselves -- companies that realized the only way to control their own future was to own the compute. Data center deals. Nuclear power agreements. Custom silicon. The instinct that makes a small business ask "should we run a local model?" is the same one that makes a frontier lab sign a 20-year power contract. The scale is incomprehensible but the question is identical.

The AI Cost Maturity Curve
1. Discovery — most companies start here. ChatGPT free · Claude.ai basic. $0–$20/mo. "This is interesting."
2. Productivity — where every serious team should be. Claude Max · ChatGPT Pro subscriptions. $100–$300/seat/mo. "This is how we work now."
3. Building. Anthropic API · OpenAI API · tokens. Variable, $500–$5K/mo. "We're building on top of AI."
4. Product — smart token strategy matters most here. Tokens at scale · COGS optimization. Scales with revenue. "AI is in our margin structure."
5. Managed Infrastructure — more common than people think; any regulated or sensitive-data workload lands here. AWS Bedrock · Google Vertex AI · Azure OpenAI. Contracted rates, volume pricing. "Our data can't leave our environment."
6. Compute. On-prem GPUs · local models · custom silicon. CapEx, not OpEx. "We need to own this layer."
7. Deep Infrastructure — same instinct as Stage 6, different scale. Data centers · power contracts · custom chips. Billions. "Anthropic / OpenAI territory."

The reason this progression matters is that most companies try to skip stages. They go from Stage 1 directly to buying tokens without ever committing to subscriptions. Or they hit Stage 3 costs and start dreaming about Stage 6 on-prem deployments before they have actually optimized their model tier usage. Every stage has a right time. Jumping ahead creates costs without the scale to justify them.

Know where you are. Optimize for that stage. The next one will come.

The Practical AI Budget Plan

Here is the framework we walk clients through when their AI costs are out of control or before they start spending at scale:

Step one: subscriptions first. Get your team on Claude Max and ChatGPT Pro before you spend a dollar on tokens. Every developer, every power user. The effective hourly rate on subscription coverage is dramatically better than API billing for human-in-the-loop work.

Step two: audit what actually needs API access. Production systems and automated pipelines need tokens. Human workflows usually do not. Draw that line clearly.

Step three: tier your model usage. Document which tasks go to which model tier and why. Audit it quarterly. The model that was the right choice six months ago may not be now.

Step four: instrument everything. You cannot optimize what you cannot see. LiteLLM, Helicone, or Portkey will give you the visibility to make decisions based on actual data rather than vibes about what is expensive.
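If you want to see what those tools are doing under the hood, the minimum viable version is a wrapper that prices each response's token usage. A sketch, assuming an OpenAI-style SDK response with a usage object; in production, use Helicone or Portkey rather than hand-rolling this:

```python
import time

def instrumented_call(client, model, messages, price_in, price_out, log):
    """Log cost and latency per request, priced in USD per 1M tokens."""
    start = time.time()
    resp = client.chat.completions.create(model=model, messages=messages)
    cost = (resp.usage.prompt_tokens / 1e6) * price_in \
         + (resp.usage.completion_tokens / 1e6) * price_out
    log.append({
        "model": model,
        "latency_s": round(time.time() - start, 3),
        "cost_usd": round(cost, 6),
    })
    return resp
```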

Step five: design for redundancy from day one. Multi-provider fallbacks are a one-time architecture decision that pays off every time there is an outage, a rate limit spike, or a new model that makes your current primary obsolete.


The companies we see overspending on AI are almost always doing one of three things: buying tokens when subscriptions would cover them, using top-tier models for tasks that do not need them, or operating a single-provider system with no visibility into where the money goes. Any one of those is fixable. All three at once is a budget problem that compounds fast.

If you want help mapping out your LLM architecture or auditing where your AI spend is actually going, reach out. This is exactly the kind of problem we work through with clients before it becomes expensive to ignore.

