When to Build an Agent (And When You're Just Building a Workflow With Extra Steps)

At the AI Engineer conference last year, an Anthropic engineer presented a checklist for deciding whether a task warrants an agent or a deterministic workflow. It's a good checklist and still relevant in 2026. It asks four questions, each with clear branching logic: is the task complex enough, is it valuable enough, are all parts doable, and what's the cost of an error?

One framing note up front: a workflow can absolutely call one or more LLMs. What makes it a workflow is that the orchestration is fixed, which makes each interaction predictable in output, latency, and cost — not that there's no model in the loop. Most production AI systems are workflows in this sense, with LLM calls at specific steps. Agents are the case where the model itself decides what the next step should be — and you give up that predictability in exchange.
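To make the distinction concrete, here is a minimal sketch of the two shapes. Everything in it is an assumption for illustration: `llm` is a hypothetical stand-in for whatever model call you use, and the prompts, the `finish:` convention, and the tool names are invented. The point is only where the control flow lives.

```python
from typing import Callable

# Hypothetical stand-in for a model call; a real system would call an LLM API here.
LLM = Callable[[str], str]
Tool = Callable[[str], str]


def run_workflow(ticket: str, llm: LLM) -> str:
    """Fixed orchestration: always the same three model calls, in the same order."""
    category = llm(f"classify: {ticket}")
    summary = llm(f"summarize: {ticket}")
    return llm(f"draft reply for {category}: {summary}")


def run_agent(task: str, llm: LLM, tools: dict[str, Tool], max_steps: int = 10) -> str:
    """Model-driven orchestration: the model picks the next tool, or decides to stop."""
    context = task
    for _ in range(max_steps):
        decision = llm(f"next action for: {context}")
        if decision.startswith("finish:"):
            return decision.removeprefix("finish:").strip()
        # Assumed convention: the model answers "<tool_name> <argument>".
        tool_name, _, arg = decision.partition(" ")
        tool = tools.get(tool_name)
        observation = tool(arg) if tool else f"unknown tool {tool_name}"
        context += f"\n{decision} -> {observation}"
    return "stopped: step budget exhausted"
```

The workflow makes exactly three calls every time, so its cost and latency are known before it runs. The agent makes anywhere from one call to `max_steps` calls, and you only find out which after the fact.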

We've been running the checklist against real engagements, and it holds up well as a starting point. Like most frameworks, it also leaves out the things that matter once you try to apply it to production systems. This isn't a criticism of the checklist; no slide can cover everything. It's an observation that the checklist tells you whether an agent is appropriate, but not whether it's buildable in your environment, with your team, for your users.

Here's what we've learned about the gap.

The checklist, briefly

For anyone who hasn't seen it, the four questions are roughly:

Is the task complex enough to warrant agent reasoning, or is a workflow enough?
Is the value per invocation high enough to justify agent-level cost, or are the economics better as a workflow?
Are all parts of the task actually doable by an LLM, or does something need to be scoped out?
What's the cost of an error: high enough to require read-only output and human review, or low enough to let the agent act autonomously?

The checklist produces defensible answers. The problem is that defensible answers on paper sometimes lead to systems that are painful to operate.

What the checklist doesn't ask

Four considerations come up repeatedly in real builds, and none of them map cleanly to the four questions above.

Latency tolerance. Agents are slower than workflows, often by a lot. A workflow with five deterministic steps completes in the time it takes to run five API calls. An agent solving the same problem might take fifteen calls, or fifty, depending on how it reasons. If the user is waiting — a customer on a chat interface, a form that needs to submit — the agent's flexibility stops being an advantage and starts being a liability. “Is the task complex enough” doesn't capture this. A task can be genuinely complex and still belong in a workflow, purely because the end user won't tolerate a thirty-second response time.
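A back-of-the-envelope version of that arithmetic, as a sketch. The two seconds per call is an assumed figure, not a measurement, and the model assumes calls run strictly one after another with no parallelism or caching.

```python
def sequential_latency_s(num_calls: int, seconds_per_call: float) -> float:
    """Total wall-clock time if every call waits for the previous one to finish."""
    return num_calls * seconds_per_call


# Assumption for illustration: each model or API call takes about 2 seconds.
PER_CALL_S = 2.0

for label, calls in [("workflow, 5 steps", 5), ("agent, 15 calls", 15), ("agent, 50 calls", 50)]:
    print(f"{label}: ~{sequential_latency_s(calls, PER_CALL_S):.0f}s")
```

Under that assumption the five-step workflow answers in about ten seconds, the fifteen-call agent run lands at the thirty-second mark, and the fifty-call run is well past the point where anyone is still watching the spinner.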

Observability requirements. Debugging a workflow is straightforward. Step three failed. Here's the input to step three, here's the output, here's the error. Debugging an agent is an entirely different exercise. The agent took seventeen actions, some of which were reasonable and some of which were strange, and the final output is wrong. Which of the seventeen was the mistake? Was it a mistake at all, or the consequence of an earlier mistake? Teams without strong tracing infrastructure underestimate how much operational overhead agents add. The checklist treats agents as a capability choice. They're also an operational commitment.
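As a rough idea of the minimum you'd want in place, here is a sketch of a per-step trace. The `StepRecord` and `Trace` names and their fields are hypothetical; a production setup would more likely use an existing tracing system than a hand-rolled class, but the record it needs to capture looks about like this.

```python
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class StepRecord:
    """One agent action: what it did, what it saw, and what came back."""
    step: int
    action: str
    input: str
    output: str
    started_at: float = field(default_factory=time.time)


class Trace:
    """Append-only log of every step in a single agent run."""

    def __init__(self) -> None:
        self.steps: list[StepRecord] = []

    def record(self, action: str, input: str, output: str) -> StepRecord:
        rec = StepRecord(len(self.steps) + 1, action, input, output)
        self.steps.append(rec)
        return rec

    def to_jsonl(self) -> str:
        # One JSON object per line, so a run can be grepped or replayed later.
        return "\n".join(json.dumps(asdict(s)) for s in self.steps)
```

With something like this, "which of the seventeen was the mistake?" becomes a matter of reading seventeen records in order, rather than reconstructing the run from the final output.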

How failures compound. Workflows fail in one place. Agents fail across many places, and the failures interact. An agent that makes a reasonable decision based on slightly wrong context will produce outputs that look right but aren't, and those outputs become the context for its next decision. By step ten, the drift is invisible but substantial. Workflows don't have this property — each step either succeeded or failed, and the failure is local. “What's the cost of error” asks the right question, but the real cost of errors in agents isn't just the wrong output. It's the difficulty of even knowing the output is wrong.
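A simplified way to see the effect, assuming each step is independently right with the same probability. That assumption is ours and is generous to the agent: the failures described above interact, so an early error tends to make later ones more likely, not less.

```python
def chance_all_steps_correct(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step in a chain is right, assuming independent steps."""
    return per_step_accuracy ** steps


# Assumed per-step accuracy of 95%, chosen only for illustration.
for steps in (1, 5, 10, 17):
    print(f"{steps:>2} steps: {chance_all_steps_correct(0.95, steps):.0%} chance of a clean run")
```

At an assumed 95% per step, a ten-step run comes out clean only about 60% of the time. No single step looks unreliable, which is exactly why the drift is hard to notice.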

What human-in-the-loop actually costs. The checklist suggests human review when error costs are high, which is correct. What it doesn't surface is that human review is not free, and the cost isn't just the reviewer's time. It's the queue, the SLA, the interface they use, the training they need, and the escalation path when they're uncertain. Many AI projects underestimate this and end up with a review layer that's slower and more expensive than the workflow it replaced. Sometimes “human-in-the-loop” is the right answer (you definitely don't want to send an AI-generated email to the customer without any human review). Sometimes it means the task wasn't ready for automation at all, and the honest move is to scope it out rather than wrap it in review.

A longer checklist

If we were extending the AI Engineer checklist for our own use, we'd add four questions:

How long can the user wait? If the answer is seconds, default to workflow. Agents belong in places where latency is flexible — batch processing, background enrichment, offline analysis.
Can you trace what the system did? If the answer is “sort of,” you're not ready to run an agent in production. The observability has to exist before the agent does, not after.
How do errors compound? If a wrong output at step three poisons step seven, the cost of error is higher than it looks, and the case for deterministic structure gets stronger.
What's the full cost of the human review layer? Not just the reviewer, but the queue, the tooling, the training. If it's more expensive than the work being automated, the agent isn't the right shape for this problem.
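The four questions above can be written down as a small screening function. This is a sketch under our own assumptions: the `Candidate` fields, the ten-second cutoff for "the user is waiting," and the wording of the concerns are all hypothetical, and real answers are rarely this binary.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Answers to the four extra questions for one proposed agent (hypothetical fields)."""
    max_wait_seconds: float        # how long the user can wait for a result
    can_trace_every_step: bool     # "sort of" counts as False
    errors_feed_later_steps: bool  # does a wrong step three poison step seven?
    review_cost_per_task: float    # reviewer time plus queue, tooling, training
    manual_cost_per_task: float    # cost of the work being automated


def agent_concerns(c: Candidate, interactive_limit_s: float = 10.0) -> list[str]:
    """Return the reasons to prefer a workflow; an empty list means none were found."""
    concerns = []
    if c.max_wait_seconds <= interactive_limit_s:
        concerns.append("latency: the user is waiting, default to a workflow")
    if not c.can_trace_every_step:
        concerns.append("observability: build the tracing before the agent")
    if c.errors_feed_later_steps:
        concerns.append("compounding: early errors poison later steps")
    if c.review_cost_per_task > c.manual_cost_per_task:
        concerns.append("review: the human layer costs more than the work it covers")
    return concerns
```

For example, `agent_concerns(Candidate(3, False, True, 6.0, 4.0))` returns all four concerns, which is roughly what a customer-facing chat use case with no tracing looks like. An empty list doesn't mean "build the agent"; it means these four questions didn't rule it out.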

When an agent is actually the right answer

None of this is an argument against agents. For the right problem, they're transformative — genuinely open-ended tasks where the path isn't knowable in advance, where the value per invocation is high, where latency is flexible, and where the operational maturity exists to run them.

But a lot of systems that get called agents are really workflows with an LLM call in the middle, which is fine and often better. And a lot of things that want to be agents would be better as workflows with a narrower scope. The checklist is useful for sorting. The judgment about what to actually build sits one layer down, in the questions the checklist doesn't ask.

The honest version of the framework is that agents are the right answer less often than the current discourse suggests, and when they are the right answer, the hard part isn't deciding to build one. It's building the infrastructure around it so you can run it in production without losing sleep.

Most of the interesting engineering in AI right now is in that infrastructure, not in the agent itself.
