Google shipped Gemini 3.5 Flash at I/O on May 20. The headline is that a Flash-tier model now leads the Pro tier on most agentic benchmarks: 76.2% on Terminal-Bench 2.1 versus 70.3% for Gemini 3.1 Pro, and ahead of both Claude Opus 4.7 and GPT-5.5 on MCP Atlas and Finance Agent v2. Every AI newsletter will run that story this week.
Here's the one they won't: Gemini 3.5 Flash costs $1.50 per million input tokens and $9 per million output tokens. That's six times the price of Gemini 3.1 Flash-Lite, and roughly 75% of what Gemini 3.1 Pro charges. The tier name stayed the same. The price did not.
The routing assumption that just broke
If your app has a model routing layer, you almost certainly made some version of this call: route complex, high-stakes tasks to a Pro-tier model; route cheaper, simpler tasks to Flash. The logic was sound. Flash used to be 5–10× cheaper per token and good enough for retrieval, summarization, and low-stakes generation.
That assumption is now stale for two reasons.
First, Gemini 3.5 Flash is the better agentic model on the benchmarks that matter for coding agents and tool-calling workflows. If you're routing those workloads to 3.1 Pro on the theory that it's the smarter model, you're now paying more for worse outputs on that class of task.
Second, if you're routing cheap work to Flash on the theory that it's the cheap model — and you migrated from 3.1 Flash-Lite without checking the pricing — your inference bill just went up 6×. For most production apps, that's the difference between a workload being economically viable or not.
Why “it’ll just keep getting cheaper” stopped being safe
LLM API costs dropped roughly 80–90% between 2023 and early 2025. That trend conditioned a lot of architectural decisions: use a more capable model now, optimize cost later. Simon Willison observed this week that all three major labs appear to be “probing the price tolerance of their API customers.” Prices on frontier-tier performance are not falling right now. They're testing a ceiling.
The gap between “Flash” and “Pro” pricing has compressed from a 10× difference to less than 2×, while the benchmark gap has flipped. The tier labels are now close to arbitrary. You're paying for capability curves, not tier buckets.
What to actually do
Re-benchmark your production workloads this week. Pick three representative tasks from real traffic. Run them through the models you're currently routing to. Measure quality + latency + cost per call. Then route by those numbers, not by the name stamped on the model.
Build per-workload cost tracking if you haven't. At 6× price variation within a single model family, you can't eyeball inference costs from tier assumptions anymore. You need actual token counts per call type in production, not estimates from launch-day pricing.
Check your prompts against the new benchmark class. Terminal-Bench 2.1 (76.2% for 3.5 Flash) measures multi-step agentic task completion in a terminal environment. MCP Atlas (83.6%) tests tool-use and multi-hop reasoning across MCP server interactions. If your product involves coding agents or chained tool calls, these numbers are relevant. If you're doing document summarization, run your own evals — don't route by someone else's benchmark.
Gemini 3.5 Flash is available now via the Gemini API and on OpenRouter. The benchmark numbers are real. The price is also real. Routing decisions made with only one of those facts are wrong.