It runs in the chat. It falls apart in the field.

The gap between 'works in the demo' and 'survives the real world' — and how to narrow it

Published: 2026-05-31
Read: 3 min
Type: Field notes
TL;DR

Agents perform well on the "clean path": nice data, common cases, nobody interfering. The real world is messy, strange, full of rare cases — and that's where an agent's work tends to break, and break quietly. This cluster gathers field faceplants, with every name and specific detail stripped, keeping only the general trap so you can dodge it first.

In the chat window, everything looks good. You feed it a tidy example, the agent handles it well, returns exactly what you expected, you nod, satisfied. Then what it built meets real work — hits data missing half its fields, a user who typed something nobody anticipated, a case that happens once a month. And it breaks. Not loudly — just a wrong number, a dropped row, something that should have been caught slipping through.

This isn't a weak model's fault. It's the built-in gap between performing well in practice sessions and playing the real match, and most of the pain of putting an agent into production lives in that gap.

01 Why "nice demo" fools you

The demo fools you because it's played on a path you cleared yourself. You supply the example — and a human's example is always the typical case: enough data, standard format, no weird edge. The agent handles the typical case very well, because that's exactly what it's seen the most of.

But production doesn't send you typical cases. It sends the field that "is never null" arriving as null, the string ten times longer than expected, the operation two people perform at the same instant. Those cases are rare on the demo desk but certain to appear in the wild, and the agent meets them with the same confidence it uses for the common case. Confident, and wrong.
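One cheap way to stop clearing the path yourself is to take the tidy example you already have and break it on purpose. What follows is a minimal sketch, not a full fuzzing setup: the helper name `edge_case_variants`, the `stretch` parameter, and the sample record are all made up for illustration. It covers only the null, missing, and oversized cases from the paragraph above; concurrency needs its own kind of test.

```python
from copy import deepcopy
from typing import Any, Iterator


def edge_case_variants(
    record: dict[str, Any], stretch: int = 10
) -> Iterator[tuple[str, dict[str, Any]]]:
    """Yield (label, variant) pairs derived from one typical record.

    Hypothetical helper: each variant breaks exactly one field, so a
    failure points at the field that caused it.
    """
    for key, value in record.items():
        # The field that "is never null" arrives as None.
        nulled = deepcopy(record)
        nulled[key] = None
        yield f"{key}=None", nulled

        # The field is missing entirely.
        missing = deepcopy(record)
        del missing[key]
        yield f"{key} missing", missing

        # A string ten times longer than expected (only for string fields).
        if isinstance(value, str):
            stretched = deepcopy(record)
            stretched[key] = value * stretch
            yield f"{key} x{stretch}", stretched


# Illustrative record; feed each variant to whatever the agent built.
typical = {"name": "Ada", "email": "ada@example.com", "quantity": 3}
for label, variant in edge_case_variants(typical):
    print(label, variant)
```

Run whatever the agent built against every variant and look at what comes back, not just at whether it raised. The quiet wrong answer is the one you are hunting.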

02 Three traps that repeat

Nearly every production faceplant I've seen fits one of three molds, and all three share one root — the agent is strong in the middle, weak at the edges, and doesn't know it's at an edge:

  • It falls at the edges — it handles the common path well, then hits a rare case and gets it wrong just as confidently. No flag that says "I'm not sure about this one."
  • Trust drift — it's right twenty times, you stop checking. The twenty-first is wrong, and it sails straight through because you've stopped looking.
  • Silence taken for fine — no error firing doesn't mean it's right. The worst failure is the one that doesn't make a sound, sitting still until the worst possible moment.

The three pieces ahead go deeper into each. The thing worth remembering now: none of the three is fixed by finding a smarter agent. They're fixed by treating every "done" as a hypothesis to test in the world, not a conclusion proven in the chat.
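Trust drift in particular lends itself to a small mechanical guard: let the checking rate relax as the agent keeps being right, but never let it reach zero. The sketch below is one possible shape, with assumptions stated up front: the class name `SpotChecker`, the 10% floor, and the `1 / (1 + streak)` decay are illustrative choices, not a recommendation from measured data.

```python
import random
from typing import Optional


class SpotChecker:
    """Decide which agent outputs get reviewed (hypothetical helper)."""

    def __init__(self, floor_rate: float = 0.1, seed: Optional[int] = None) -> None:
        if not 0.0 < floor_rate <= 1.0:
            raise ValueError("floor_rate must be in (0, 1]")
        self.floor_rate = floor_rate
        self.streak = 0  # consecutive reviewed outputs that were correct
        self._rng = random.Random(seed)

    def rate(self) -> float:
        # Decays as the streak grows, but never drops below the floor.
        return max(self.floor_rate, 1.0 / (1 + self.streak))

    def should_check(self) -> bool:
        return self._rng.random() < self.rate()

    def record(self, correct: bool) -> None:
        # One wrong answer resets the streak, so checking goes back to 100%.
        self.streak = self.streak + 1 if correct else 0


checker = SpotChecker(floor_rate=0.1, seed=0)
for _ in range(20):
    checker.record(correct=True)
print(checker.rate())  # still checking at the floor after twenty right answers
```

The point of the floor is that the twenty-first output still has a real chance of being looked at, and a single miss puts every output back under review.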

03 Narrow the gap, don't erase it

You'll never fully erase the demo–field gap; the real world has too many corners to foresee. But you can narrow it, with a few cheap habits: test the agent with exactly the bad cases, not just the nice ones; put your instrumentation at the boundary where its work meets the real world; and keep checking even after it's been right many times.
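For the boundary habit, the simplest instrument is a reconciliation check: count what went in, count what came out, and name what went missing. This is a rough sketch under assumptions of my own: rows are dicts that carry a unique `id` key, and `reconcile` and `BoundaryReport` are hypothetical names, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class BoundaryReport:
    rows_in: int
    rows_out: int
    missing_ids: list[str]

    @property
    def ok(self) -> bool:
        # Fine only if nothing was dropped and the counts match.
        return not self.missing_ids and self.rows_in == self.rows_out


def reconcile(inputs: list[dict], outputs: list[dict], key: str = "id") -> BoundaryReport:
    """Compare what went into the agent's step with what came out."""
    out_ids = {row[key] for row in outputs}
    missing = [row[key] for row in inputs if row[key] not in out_ids]
    return BoundaryReport(len(inputs), len(outputs), missing)


# Illustrative data: the agent's step quietly dropped row "b".
report = reconcile(
    [{"id": "a"}, {"id": "b"}, {"id": "c"}],
    [{"id": "a"}, {"id": "c"}],
)
if not report.ok:
    print(f"dropped rows: {report.missing_ids} ({report.rows_out}/{report.rows_in} kept)")
```

A check like this proves nothing about whether the surviving rows are right. It only turns one kind of silent failure, the dropped row, into a loud one.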

More than the others, this cluster is made of lessons paid for in real tuition, retold with the private parts removed, so that the price at least buys you one dodge.

End of piece · Cluster 05 · 1/4
The author

craftagent is the notebook of someone still building — told over coffee, each story wrapped around a lesson paid for in full.