Tokenmaxxing: Waste Tokens, Save Time

On a recent podcast, Naval Ravikant talked with Guillermo Rauch (Vercel), Blake Scholl (Boom Supersonic), and Max Hodak (Science), and the four of them landed on a pretty simple idea: waste tokens to save time. Inference is cheap. Engineers aren't. So over-spend the model, under-spend the human.

They're right. If you're a small, technical team, that's really the whole post: go burn the tokens. But take the same advice and hand it to a 500- to 2,000-engineer org, and you get the news cycle of the last week. Both things are true at the same time, and it's worth pulling them apart.

There's a name for this whole debate now: tokenmaxxing. It's shorthand for treating token consumption as a proxy for productivity with AI: the more tokens spent, the more innovative the employee. A few weeks ago, this was the corporate fashion. Meta, Amazon, OpenAI, and others stood up formal or informal leaderboards and encouraged their engineers to compete on token usage. The Naval/Rauch camp still uses the term approvingly: stop counting pennies on inference, the math obviously favors burning tokens. The skeptics use it as a warning. Fortune declared it dead last week. Both camps have evidence on their side. Both are right, depending on who's reading the bill.

The math

A senior engineer at a US tech company runs $200–$300 an hour fully loaded. Globally, $80–$150 is more common. Either way: real money, every hour they're on the clock.

Now look at tokens. Today's frontier models — Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro — sit between $2 and $5 per million input tokens, $12 and $25 per million output. A heavy interactive coding session might burn 500K to 2M tokens an hour. That works out to $3–$20 in API spend per engineer-hour. A typical session is closer to $5.
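To see where a range like that comes from, here's a rough sketch of the arithmetic. The 75/25 split between input and output tokens is our assumption for illustration (the real mix varies a lot from session to session), and hourly_token_cost is a hypothetical helper, not part of any vendor SDK. The prices are the per-million ranges quoted above.

```python
def hourly_token_cost(tokens_per_hour, input_price, output_price, output_share=0.25):
    """Estimate API spend for one hour of work, in dollars.

    Prices are dollars per million tokens. output_share is an assumed
    fraction of tokens billed at the output rate (not a measured figure).
    """
    input_tokens = tokens_per_hour * (1 - output_share)
    output_tokens = tokens_per_hour * output_share
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Light session at the cheapest quoted prices.
low = hourly_token_cost(500_000, input_price=2.0, output_price=12.0)
# Heavy session at the most expensive quoted prices.
high = hourly_token_cost(2_000_000, input_price=5.0, output_price=25.0)

print(f"${low:.2f} to ${high:.2f} per engineer-hour")
# prints "$2.25 to $20.00 per engineer-hour"
```

Under that assumed split the estimate lands in the same ballpark as the $3–$20 figure. Shift the split toward output tokens and the low end rises; shift it toward input and the high end falls.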

So the trade looks like this: $250 of engineer time on one side, $5 of token spend on the other. Those numbers differ by a factor of fifty. If a $5 spend saves you twelve minutes of engineer time, worth about $50 at that rate, you're ahead ten to one. Almost anything worth spending engineer time on clears that bar.
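The break-even point is worth working out explicitly. This is a minimal sketch using only the figures quoted in this section; break_even_minutes is a hypothetical helper name.

```python
def break_even_minutes(token_spend, hourly_rate):
    """Minutes of engineer time that cost the same as token_spend.

    Both arguments are in dollars; hourly_rate is the fully loaded rate.
    """
    return token_spend / hourly_rate * 60


# Typical case from the text: $5 session, $250/hour engineer.
typical = break_even_minutes(5, 250)
# Least favorable case from the text: $20 session, $80/hour engineer.
worst = break_even_minutes(20, 80)

print(f"typical: {typical:.1f} min, worst case: {worst:.1f} min")
# prints "typical: 1.2 min, worst case: 15.0 min"
```

So a typical session pays for itself if it saves a little over a minute per hour. Even the heaviest session quoted above, paired with the lowest hourly rate, only has to save a quarter of an hour.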

For a small technical team, this isn't a close call.

Why this works for small teams

Naval and Rauch run small teams of senior people. When a senior engineer on a team like that burns $20 in tokens over an afternoon, the loop closes on itself: they know what they were trying to ship, they can tell whether the spend produced it, and if it didn't, they'll adjust how they prompt next time.

Heavy session, real feature shipped — fine, no one's worried. Heavy session, nothing landed — they notice it themselves, before anyone has to ask. The token bill ends up reading like a signal they can act on, because the person spending is also the person who can tell what came out the other side. The judgment to spend tokens well lives at the keyboard. That's the part of the math everyone misses.

What breaks at enterprise scale

Now imagine handing “waste tokens to save time” to a 500- to 2,000-engineer org. The news from the last week tells you exactly what happens.

The corporate retreat from tokenmaxxing has been fast. Amazon shut down its internal Kirorank leaderboard after employees gamed it by running up agent costs to climb the rankings. Meta took down the informal leaderboard its engineers had built. Microsoft cancelled Claude Code subscriptions for engineers in several key product divisions, per The Verge. Uber burned through its entire 2026 token budget in the first four months of the year, mostly on Claude Code. Salesforce's CEO Marc Benioff said the company's Anthropic bill will run roughly $300 million this year, and publicly wished for a “smart router” that could send simpler queries to cheaper, good-enough models instead of the most capable ones.

Each of these stories has its own specifics, but the shape is the same. The leaderboards weren't broken because tokens are a bad metric in some philosophical sense. They were broken because the company couldn't look at someone's score and tell whether the spend behind it was producing anything real, or just running up a number on a dashboard.

Then there's METR, the AI research lab. They've been trying to repeat a 2025 study that showed AI tools made developers 19% slower — even as those same developers self-reported being 20% faster. They couldn't run the repeat. Developers refused to participate without AI tools. They'd rather skip the study than work on the terms the experiment required.

So METR did the next best thing and ran a self-report survey. Engineers reported they were twice as valuable to their organizations. GitHub Copilot's own marketing claims 55% faster task completion. The measured number, the last time anyone could measure it, was minus 19%. Set the 55% claim against the measured minus 19% and you get a 74-point gap between perception and reality, and the people most invested in the perception are the ones writing the narrative.

Jellyfish's Q1 2026 data on 7,548 engineers found the same shape from a different angle: engineers with the largest token budgets shipped the most pull requests, but the productivity gain didn't scale linearly with the spend. Bigger budget, more PRs — but not proportionally more actual progress.

The rule didn't fail. The judgment behind the rule didn't scale.

We don't have a clean fix for the enterprise version

On a small expert team, Naval's position works because the person at the keyboard can also read the bill against their own output. On a 500-person engineering org, that loop falls apart: the person spending the tokens isn't the person reading the bill isn't the person measuring the output. The signal scatters across roles, and whoever's closest to a leaderboard ends up shaping it.

We've written about pieces of the fix: per-invocation attribution in The Month-Three Moment, and cost-per-shipped-feature instead of cost-per-token in Tokens Are Not the Metric. Those help. But we're honestly not sure anyone has built the org-wide judgment layer that scales the small-team math. Not yet.
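For a sense of what those two pieces look like together, here's a small sketch. Everything in it is hypothetical: the Invocation record, the feature IDs, and the summarize_spend helper are illustrations, not a description of any real tool. It assumes each model call has been tagged with the feature it served, which is the hard part in practice.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Invocation:
    engineer: str
    feature_id: Optional[str]  # None when the call was never tagged
    cost_usd: float


def summarize_spend(invocations, shipped_features):
    """Roll per-invocation costs up to features and compare against what shipped."""
    spend_by_feature = defaultdict(float)
    for inv in invocations:
        spend_by_feature[inv.feature_id] += inv.cost_usd

    shipped_spend = {f: c for f, c in spend_by_feature.items() if f in shipped_features}
    total_spend = sum(spend_by_feature.values())
    unshipped_spend = total_spend - sum(shipped_spend.values())
    # Divide ALL spend by shipped features, so untagged and abandoned
    # work still shows up in the headline number.
    per_feature = total_spend / len(shipped_spend) if shipped_spend else None
    return {
        "shipped_spend": shipped_spend,
        "unshipped_spend": unshipped_spend,
        "cost_per_shipped_feature": per_feature,
    }


calls = [
    Invocation("ana", "FEAT-1", 12.0),
    Invocation("ana", "FEAT-1", 8.0),
    Invocation("ben", "FEAT-2", 30.0),
    Invocation("ben", None, 50.0),
]
summary = summarize_spend(calls, shipped_features={"FEAT-1"})
print(summary)
```

In this made-up data, $20 of the $100 total went to the one feature that shipped. A leaderboard would rank ben first; the cost-per-shipped-feature view shows $80 with nothing to point to. The metric still can't tell you whether that $80 was wasted or was exploration that pays off next month, and that is the judgment this section says doesn't scale yet.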

So: for the team of four shipping a production feature this week, Naval is right. Spend whatever tokens save you time. For the CTO of a 500-person org watching the bill creep up with nothing obvious to point to, the same advice produces Kirorank. The math is identical. What changes is who's reading the bill — and whether they can tell what got shipped.