You ask the agent to write a piece of code that processes a list. You test it with a sample list and it runs fine. The agent confidently reports that it's done. Two weeks later an empty list comes through, and the code blows up or, worse, quietly returns something meaningless. The "empty list" edge case never landed on your test bench, so the agent never considered it either.
The scary part isn't that it fails. It's that it fails in the same confident voice it used when it was right. There's no little frown of "hmm, this one's odd." To an agent, the edge case and the common case look identical until the result turns out wrong.
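As a concrete illustration of those two failure modes, here is a minimal Python sketch. Both function names are hypothetical, invented for this example; they stand in for whatever list-processing code the agent produced.

```python
def average_order_value(order_totals):
    """Hypothetical helper: mean of a list of order totals."""
    # Works on the sample list, but divides by zero when the list is empty.
    return sum(order_totals) / len(order_totals)


def cheapest_price(prices):
    """Hypothetical helper: lowest price in a list."""
    # default=0.0 avoids a crash on an empty list, but the result then
    # reads as "the cheapest item is free", which is meaningless.
    return min(prices, default=0.0)


print(average_order_value([20.0, 30.0]))  # the nice case passes
print(cheapest_price([]))                 # no error, just a misleading number

try:
    average_order_value([])
except ZeroDivisionError as exc:
    print("empty list blew up:", exc)
```

The first function fails loudly; the second fails quietly, which is arguably the harder one to catch.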
01 Rare doesn't mean cheap
That 1% is easy to wave off while testing: "it's rare, deal with it later." But low frequency doesn't mean low impact. An edge case that reaches production is often the one that corrupts a data stream, puts a wrong number on a report, or causes the two-a.m. incident. You save five minutes in testing, then pay for it with an afternoon of debugging.
02 Test with bad cases, not just nice ones
The fix isn't a smarter agent. It's changing what you test it with. The human instinct is to test with the nice example, because the nice example is easy to think up and easy to recognize as "right." But being right on the nice case says almost nothing about robustness in the wild.
So before trusting something an agent made, actively ask: what's the edge here? Which field could be empty, null, or unusually long? What does an extreme input look like? What happens when two things arrive at once? Then throw exactly those cases at it, or make the agent list the edge cases of the very thing it just built.
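One way to make those questions concrete is a small table of edge cases that runs alongside the nice example. The sketch below is in Python and rests on assumptions: `average_order_value`, `EDGE_CASES`, and `failing_cases` are hypothetical names, and returning `None` for empty or missing input is one possible policy, not the only reasonable one. It does not cover the "two things arrive at once" question, which needs a concurrency test of its own.

```python
from typing import Optional, Sequence


def average_order_value(order_totals: Optional[Sequence[float]]) -> Optional[float]:
    """Hypothetical helper: mean of the totals, or None if there are none."""
    if not order_totals:  # covers both None and an empty sequence
        return None
    return sum(order_totals) / len(order_totals)


# (label, input, expected result): the nice case plus the awkward ones.
EDGE_CASES = [
    ("nice case", [20.0, 30.0], 25.0),
    ("empty list", [], None),
    ("null input", None, None),
    ("single item", [42.0], 42.0),
    ("unusually long", [1.0] * 100_000, 1.0),
    ("extreme values", [-1e12, 1e12], 0.0),
]


def failing_cases():
    """Run every case and describe the ones that crash or return the wrong value."""
    failures = []
    for label, data, expected in EDGE_CASES:
        try:
            actual = average_order_value(data)
        except Exception as exc:  # a crash counts as a failure, not a test abort
            failures.append(f"{label}: raised {exc!r}")
            continue
        if actual != expected:
            failures.append(f"{label}: expected {expected!r}, got {actual!r}")
    return failures


print(failing_cases() or "all edge cases handled")
```

The table is the useful part: adding a row is cheap, and the same list can be pasted back to the agent as the set of inputs its code has to survive.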
One tight question to make a habit of: "where does this break if the input isn't as nice as what I just tested?" Ask it while still in the chat and it's cheap. Let production ask it for you and it's expensive — and it always asks at the worst time.