In the chat window, everything looks good. You feed it a tidy example, the agent handles it well and returns exactly what you expected, and you nod, satisfied. Then what it built meets real work: data missing half its fields, a user who typed something nobody anticipated, a case that comes up once a month. And it breaks. Not loudly, just a wrong number, a dropped row, something that should have been caught slipping through.
This isn't a weak model's fault. It's the built-in gap between performing in practice and playing the real match — and most of the pain of putting an agent into production lives in that gap.
01 Why "nice demo" fools you
The demo fools you because it's played on a path you cleared yourself. You supply the example — and a human's example is always the typical case: enough data, standard format, no weird edge. The agent handles the typical case very well, because that's exactly what it's seen the most of.
But production doesn't send you typical cases. It sends the field that's null which "is never null," the string ten times longer than expected, the operation two people do at the same instant. Those cases are rare on the demo desk but certain to appear in the wild — and the agent meets them with the same confidence it uses for the common case. Confident, and wrong.
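One way to make those cases concrete is to write them down as inputs before production does it for you. The sketch below is illustrative only: `normalize_record`, `MAX_NAME_LEN`, and the `name` field are hypothetical stand-ins for whatever step the agent actually built, not names from any real system.

```python
from typing import Optional

MAX_NAME_LEN = 64  # assumed limit, purely illustrative


def normalize_record(record: dict) -> dict:
    """Hypothetical stand-in for an agent-written step; it rejects instead of guessing."""
    name: Optional[str] = record.get("name")
    if name is None:
        raise ValueError("name is null")
    if len(name) > MAX_NAME_LEN:
        raise ValueError(f"name too long: {len(name)} chars")
    return {"name": name.strip()}


# The cases a demo never includes: the "never null" field, the missing key,
# the string ten times longer than expected.
HOSTILE_INPUTS = [
    {"name": None},
    {},
    {"name": "x" * (MAX_NAME_LEN * 10)},
]


def run_hostile_cases() -> list:
    """Return the error raised by each hostile input, or None if it slipped through."""
    outcomes = []
    for record in HOSTILE_INPUTS:
        try:
            normalize_record(record)
            outcomes.append(None)  # accepted silently: the failure to look for
        except ValueError as exc:
            outcomes.append(str(exc))
    return outcomes
```

A `None` in the returned list is the interesting result: it marks a hostile input that was processed with the same confidence as a typical one.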
02 Three traps that repeat
Nearly every production faceplant I've seen fits one of three molds, and all three share one root — the agent is strong in the middle, weak at the edges, and doesn't know it's at an edge:
- It falls at the edges — it handles the common path well, then hits a rare case and gets it wrong just as confidently. No flag that says "I'm not sure about this one."
- Trust drift — it's right twenty times, you stop checking. The twenty-first is wrong, and it sails straight through because you've stopped looking.
- Silence taken for fine — no error firing doesn't mean it's right. The worst failure is the one that doesn't make a sound, sitting still until the worst possible moment.
The three pieces ahead go deeper into each. The thing worth remembering now: none of the three is fixed by finding a smarter agent. They're fixed by treating every "done" as a hypothesis to test in the world, not a conclusion proven in the chat.
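Treating "done" as a hypothesis can be as small as one reconciliation check after the agent's code runs. The following is a minimal sketch under assumptions: `process`, `reconcile`, and the `amount` field are invented for illustration. The idea is only that every input row must be accounted for, so a dropped row makes noise instead of sitting still.

```python
def process(rows: list) -> tuple:
    """Hypothetical agent-written step: converts amounts, sets aside rows it cannot use."""
    kept, rejected = [], []
    for row in rows:
        if row.get("amount") is None:
            rejected.append(row)  # set aside explicitly, never dropped quietly
        else:
            kept.append({"amount": float(row["amount"])})
    return kept, rejected


def reconcile(rows_in: list, kept: list, rejected: list) -> None:
    """Raise unless every input row is either kept or explicitly rejected."""
    unaccounted = len(rows_in) - len(kept) - len(rejected)
    if unaccounted != 0:
        raise RuntimeError(f"{unaccounted} row(s) unaccounted for")
```

Calling `reconcile(rows, *process(rows))` after each batch turns "no error fired" into a claim that has actually been checked, at the cost of three `len` calls.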
03 Narrow the gap, don't erase it
You'll never fully erase the demo–field gap; the real world has too many corners to foresee. But you can narrow it, with a few cheap habits: test the agent with exactly the bad cases, not just the nice ones; put your instrumentation at the boundary where its work meets the real world; and keep checking even after it's been right many times.
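The last habit, checking even after a long run of correct results, is the easiest to let slide, so it helps to take the decision out of your own hands. Here is a rough sketch, assuming a hypothetical `should_review` helper and an arbitrary 10% review rate; the point is that the function never looks at how many results were right before.

```python
import random


def should_review(rng: random.Random, rate: float = 0.1) -> bool:
    """Decide whether to hand-check this result.

    Deliberately takes no success history as input, so a streak of
    correct answers cannot lower the chance of the next check.
    """
    return rng.random() < rate


def pick_for_review(results: list, rng: random.Random, rate: float = 0.1) -> list:
    """Return the subset of results selected for a manual look."""
    return [r for r in results if should_review(rng, rate)]
```

The rate is a budget choice, not a measure of trust: raise it after a change to the agent or its inputs, but do not drop it to zero because the last twenty checks passed.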
This cluster, more than the others, is made of lessons paid for in real tuition, retold with the private parts removed so that the price at least buys you one dodge.