Edition 3 · May 26, 2026 · Frameworks · 5 min read

The operator's evaluation framework

Every AI tool looks like a win in the demo. Whether it is a win in production is a different question, and it is the only one that matters. Here is how I actually decide.

People ask me which AI tools they should be using, and I have learned to dodge the question, because it is the wrong one. The tool that transformed one team quietly drowns another. The question is never which tool. It is how you decide, and most teams are deciding off the demo.

A demo shows you the tool at its best. Carefully chosen input, a clean happy path, someone who knows exactly how to drive it. Production shows you the tool at its worst, on a Tuesday, in the hands of someone tired, on input nobody anticipated, again and again. I evaluate for the second one. Here is the checklist I run, built mostly from getting it wrong a few times first.

1. Does it remove the work, or just move it?

This is the first question and most tools fail it quietly. AI rarely deletes work. It relocates it. The code gets written in seconds, and now the work is reviewing code you did not write. The draft appears instantly, and now the work is catching the three things it got subtly wrong. The load did not vanish. It moved from your hands to your attention, and attention is the scarcer resource.

So I do not trust the demo's time savings. I run the tool across a real sprint and measure the full round trip: idea to shipped and trusted, not idea to first draft. Often the tool that writes code fastest produces the slowest round trip, because verifying generated code you do not understand takes longer than writing code you do. If a tool saves an hour of typing and adds ninety minutes of reviewing, that is not a win. It is a loss wearing a costume.
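To make the round trip concrete, here is a minimal sketch of the arithmetic. The class, the function, and every number in it are illustrative assumptions, not measurements from a real sprint. Swap in what you actually record per task.

```python
from dataclasses import dataclass


@dataclass
class RoundTrip:
    """Minutes spent on one task, from idea to shipped and trusted."""
    drafting: float
    reviewing: float
    rework: float = 0.0

    def total(self) -> float:
        return self.drafting + self.reviewing + self.rework


def net_minutes_saved(without_tool: RoundTrip, with_tool: RoundTrip) -> float:
    """Positive means the tool shortened the full round trip."""
    return without_tool.total() - with_tool.total()


# Hypothetical task: the tool removes 60 minutes of typing
# and adds 90 minutes of review.
by_hand = RoundTrip(drafting=90, reviewing=30)
with_tool = RoundTrip(drafting=30, reviewing=120)

print(net_minutes_saved(by_hand, with_tool))  # negative: the tool lost time
```

The point of tracking drafting and reviewing separately is that the demo only ever shows you the first number.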

2. What does being wrong cost, and how fast do I find out?

Accuracy is the wrong frame. Two other numbers matter more: the cost when it is wrong, and the time before you notice. A tool that is right ninety-five percent of the time but buries the other five percent until production is more dangerous than one that is right eighty percent of the time and obvious about its failures. Visible failure is cheap. You catch it, you fix it, you move on. Invisible failure compounds.

So for anything near the core I ask one thing: when this is wrong, does it fail loud or fail silent? Tools that fail silently belong on the surface area, never the spine.
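A rough way to put numbers on this, assuming a deliberately simple model: a wrong result costs a fixed amount to repair, plus a fixed amount of damage for every hour it goes unnoticed. The linear model, the function name, and all the figures are assumptions for illustration. Real damage often grows faster than linearly.

```python
def expected_cost_per_run(error_rate, fix_cost, hourly_damage, hours_to_notice):
    """Expected cost of one run under an assumed linear damage model.

    A wrong result costs fix_cost to repair, plus hourly_damage for
    every hour it goes unnoticed.
    """
    cost_when_wrong = fix_cost + hourly_damage * hours_to_notice
    return error_rate * cost_when_wrong


# Hypothetical figures, in arbitrary cost units.
silent_tool = expected_cost_per_run(
    error_rate=0.05, fix_cost=1.0, hourly_damage=2.0, hours_to_notice=72.0
)
loud_tool = expected_cost_per_run(
    error_rate=0.20, fix_cost=1.0, hourly_damage=2.0, hours_to_notice=0.5
)

print(f"95% accurate, silent failures: {silent_tool:.2f} per run")
print(f"80% accurate, loud failures:   {loud_tool:.2f} per run")
```

With these made-up inputs the more accurate tool comes out at about 7.25 per run against 0.40 for the less accurate one. The time to notice does most of the work, which is why accuracy alone tells you so little.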

3. Does it make the team deeper, or just more dependent?

Some tools leave your people sharper. Some leave them unable to work without the tool. The difference is whether the tool shows its reasoning or only its output. A tool that explains how it got there teaches while it helps. A tool that hands you an answer and hides the path makes you faster today and hollower over a year. (I wrote about this last week. The feeling of understanding is not understanding.)

For junior people especially, I weigh this heavily. The tool that does their thinking for them looks like a productivity gain and is actually a training failure with a delay on it.

4. Can I see what it did?

If a tool touches production, I have to be able to inspect what it changed, understand why, and reverse it. Black boxes are fine in a sandbox and a liability in a system you are on call for. The question I ask is plain: when this breaks at 2am, can the person holding the pager understand and undo what this tool did? If the answer is no, it does not go near the core, no matter how good the output looks.

5. What happens when it is gone?

Pricing triples. The API gets deprecated. The company is acquired and the roadmap dies. I do not adopt anything critical without knowing the cost of leaving. The more a tool saves you, the more it is worth asking how trapped you are if it disappears. Leverage you cannot walk away from is not leverage. It is a dependency you have not priced yet.

Put together, the framework is almost embarrassingly simple. Ignore the demo. Find out where the work really goes, what it costs when it is wrong, whether it deepens or hollows your team, whether you can see inside it, and what happens when it is gone. None of those questions are answerable from a landing page or a launch thread. All of them are answerable from two weeks of real use on real work.
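If it helps to write the checklist down, here is one possible encoding. The field names, the three placements, and the exact gating rules are my own simplification for illustration. The judgment behind each answer still has to come from real use.

```python
from dataclasses import dataclass


@dataclass
class Evaluation:
    """Answers gathered from real use, not from the demo."""
    round_trip_shorter: bool          # 1. removes work rather than moving it
    fails_loudly: bool                # 2. errors surface quickly
    shows_reasoning: bool             # 3. deepens the team
    inspectable_and_reversible: bool  # 4. on-call can understand and undo it
    exit_cost_known: bool             # 5. the cost of leaving has been priced


def placement(e: Evaluation) -> tuple[str, list[str]]:
    """Return ('core' | 'edge' | 'reject', warnings) under assumed rules."""
    warnings = []
    if not e.shows_reasoning:
        warnings.append("hides its reasoning: weigh the training cost for juniors")
    if not e.round_trip_shorter:
        return "reject", warnings
    core_ok = (
        e.fails_loudly and e.inspectable_and_reversible and e.exit_cost_known
    )
    return ("core" if core_ok else "edge"), warnings


# Hypothetical tool: fast and useful, but it fails silently.
where, notes = placement(Evaluation(
    round_trip_shorter=True,
    fails_loudly=False,
    shows_reasoning=True,
    inspectable_and_reversible=True,
    exit_cost_known=True,
))
print(where, notes)
```

A tool that fails silently or cannot be undone can still earn a place at the edge. It just never gets promoted to the core on the strength of its output.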

That is the whole discipline. Most teams adopt on the upside and meet the downside in production. The operator's move is to go looking for the downside on purpose, before it becomes load-bearing.

I am not anti-tool. I put AI into production for a living and the leverage is real. But the teams that win with these tools are not the ones that adopt the fastest. They are the ones that adopt on evidence instead of demos, and that know exactly what each tool costs them, not only what it saves.

So, before your next adoption: do you actually know where that tool moves the work, or only what it promised to remove? Run it for two weeks on real work and find out.

Reply and tell me. I read every one.