AI-built software: the bugs that pass review

Most of the AI Code Rescue jobs we take on start the same way. The owner tells us the software works. The tests pass. There's a code coverage figure that looks reassuring. The developer who built it has moved on, and on paper there's nothing wrong with it.

Then we actually read the code. What we find is rarely cosmetic. Live passwords sitting in plain text in the files. A checkout that charges a customer twice if the button is pressed twice in the same second. A whole "send the order to the warehouse" step wrapped in code that quietly swallows every error, so when it fails, nobody is ever told.

None of that showed up in a test. None of it failed a review. That's the part worth understanding. AI didn't make the code look broken. It made broken code look finished.

To be clear, we're not against AI. We use it ourselves every day, and we've said so on this blog before. In the hands of someone who knows what good looks like, it's genuinely brilliant. The problem is narrower than "AI is bad". It's that AI is very good at producing code that reads as correct, and the signals we've always trusted to tell us code is safe were never designed to catch a mistake this confident and this plausible.

"All the tests pass" isn't the reassurance it sounds like

A passing test suite used to mean something specific. A person sat down, thought hard about how the software could go wrong, and wrote a test for each of those cases. The green tick was a record of that thinking.

When AI writes the tests, that green tick can mean a lot less. AI tends to test the code it has just written, using the inputs it already expected. It writes the test that the code passes. So the test and the code agree with each other. Whether either of them agrees with reality is a separate question, and often nobody asked it.

The coverage figure has the same weakness. Coverage tells you how much of the code ran while the tests were going. It doesn't tell you whether the code does the right thing. You can have 90% coverage and never once have tested what happens when a customer uploads an empty file, or a file that's 500MB, or types their name into the price box.

Tests passing tells you the software does what the code says. It doesn't tell you the code says the right thing. With AI in the mix, that gap is where the real bugs live.

This is the "happy path" problem, and it's everywhere in AI-built software. Ask AI to build a booking form and it'll handle an ordinary booking beautifully. A booking for a date in the past? Two people booking the same slot a second apart? A phone number field with a chunk of code pasted into it? Often there's simply nothing there. The code reads like a tidy story with a beginning, a middle and an end, and no awkward questions in between.

When we audit, the tell is a test file where every test feeds in one obvious input, checks one obvious result, and is named after the function it covers. It looks thorough. It's really just the code described twice. So the first thing we do is rewrite the tests from what the software is meant to do: empty inputs, the wrong type of data, the largest size you'd ever realistically see, and at least one deliberately awkward input.

The security holes we find most often

Veracode's 2025 GenAI Code Security Report put a number on this that's hard to ignore. Across more than 100 AI models, when a model had a choice between a secure way and an insecure way to do the same job, it picked the insecure one 45% of the time. Not occasionally. That's close to a coin toss each time the choice came up.

Three patterns turn up again and again.

Passwords and keys written straight into the code. AI has read millions of public projects, and plenty of them had a password or an access key committed by mistake. So it learned that as a normal thing to do. We still open projects and find live credentials sitting in plain text. Anyone who gets a copy of the code gets the keys to the bank with it.
Pages that never check who's asking. This is the one we find most often. An app has a page that shows a customer their own orders, or their own documents. The code happily fetches whatever record the web address asks for, with no check that the person logged in is actually allowed to see that particular record.
Database queries built by gluing text together. When a search box builds a database instruction by pasting the customer's typing straight into it, a customer can type something that isn't a search term at all, but a command for the database. This has been a solved problem for over a decade. AI knows the safe way. It just doesn't always use it.

That middle one is worth seeing, because it's so small. A page that shows an account often looks something like this:

app.get('/account/:id', (req, res) => {
  const account = db.getAccount(req.params.id);
  res.send(account);   // never checks this is YOUR account
});

It reads cleanly. It works perfectly in a demo, because in a demo you only ever look at your own account. It's also a data breach waiting for the first curious customer who changes the number in the address bar to someone else's.
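A minimal sketch of the missing check, under some assumptions: the accounts map stands in for the database, each record carries an ownerId field, and an earlier login step has put the current user on req.user. All of those names are ours, made up for illustration. The handler has the same shape as the one above, so it could be passed to app.get in the same way, and the fake response object lets it run without a server.

```javascript
// Hypothetical in-memory stand-in for the database in the example above.
const accounts = new Map([
  ['1', { id: '1', ownerId: 'alice', balance: 120 }],
  ['2', { id: '2', ownerId: 'bob', balance: 75 }],
]);

// Assumes login middleware has already set req.user (an assumption of this sketch).
function showAccount(req, res) {
  if (!req.user) {
    return res.status(401).send('Please log in');
  }
  const account = accounts.get(req.params.id);
  // Answer "not found" both when the record is missing and when it belongs
  // to someone else, so the response doesn't reveal which ids exist.
  if (!account || account.ownerId !== req.user.id) {
    return res.status(404).send('Not found');
  }
  return res.send(account);
}

// Minimal fake response object for trying the handler without a server.
function fakeResponse() {
  return {
    statusCode: 200,
    body: undefined,
    status(code) { this.statusCode = code; return this; },
    send(body) { this.body = body; return this; },
  };
}

// Logged in as alice, asking for bob's account.
const refused = fakeResponse();
showAccount({ user: { id: 'alice' }, params: { id: '2' } }, refused);
console.log(refused.statusCode); // 404
```

The whole fix is one comparison. The hard part is remembering that every page showing private data needs it.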

How we handle this: we scan every project for secrets before we touch anything else, then move every password and key out of the code itself. And we test every page that shows private data by logging in as one customer and trying to open another customer's records. If that returns anything other than a flat refusal, the work isn't finished.
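The glued-together query from the list above has an equally small fix, and it's easiest to see side by side. This sketch only builds the queries, it doesn't run them. The products table and both function names are hypothetical. The { text, values } shape with a $1 placeholder follows the convention used by the node-postgres library; other database drivers use ? or named placeholders instead.

```javascript
// Unsafe: the customer's typing becomes part of the SQL text itself.
function unsafeSearchQuery(term) {
  return "SELECT id, name FROM products WHERE name LIKE '%" + term + "%'";
}

// Safer: the SQL text is fixed, and the typing travels separately as a value
// that the database driver never interprets as a command.
function safeSearchQuery(term) {
  return {
    text: 'SELECT id, name FROM products WHERE name LIKE $1',
    values: ['%' + term + '%'],
  };
}

const hostile = "'; DROP TABLE products; --";

// The hostile text is now part of the instruction sent to the database.
console.log(unsafeSearchQuery(hostile));

// The instruction is the same whatever was typed.
console.log(safeSearchQuery(hostile).text);
```

With the second form, the worst a hostile search term can do is fail to match any products.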

Works for one person, falls over for a thousand

Most AI-built software is tested by one person clicking through it on their own. It works. Then real customers arrive all at once, and two things tend to give way.

The same-moment problem. Picture a checkout. The code checks that the last ticket is still available, then takes the payment. Written simply, that's two separate steps. If two requests land in the same fraction of a second, both can pass the check before either one finishes, and you sell the same ticket twice or charge one customer twice. One tester never sees this, because one person can't click in two places at once. A busy Friday sees it straight away.
The database-in-a-loop problem. To show a list of 50 customers, AI will often write code that asks the database about each customer separately. Fifty small questions instead of one. With 50 customers it's instant. With 10,000 customers and real traffic it's 10,000 questions, a page that crawls, and a cloud bill that climbs for no good reason.
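The database-in-a-loop problem is easy to show with a pretend database that counts the questions it's asked. Everything here is a stand-in we made up for the sketch: the db object, its two methods and the order data. In a real SQL database the batched version would typically be a single query with a JOIN or an IN list.

```javascript
// Hypothetical in-memory "database" that counts how many queries it receives.
const db = {
  queryCount: 0,
  orders: [
    { customerId: 1, total: 20 },
    { customerId: 2, total: 35 },
    { customerId: 1, total: 15 },
  ],
  // One customer per call.
  ordersForCustomer(id) {
    this.queryCount += 1;
    return this.orders.filter((o) => o.customerId === id);
  },
  // Many customers in a single call.
  ordersForCustomers(ids) {
    this.queryCount += 1;
    const wanted = new Set(ids);
    return this.orders.filter((o) => wanted.has(o.customerId));
  },
};

// One query per customer: the pattern described above.
function totalsOneByOne(customerIds) {
  const totals = new Map();
  for (const id of customerIds) {
    const sum = db.ordersForCustomer(id).reduce((acc, o) => acc + o.total, 0);
    totals.set(id, sum);
  }
  return totals;
}

// One query for the whole list, grouped in memory afterwards.
function totalsBatched(customerIds) {
  const totals = new Map(customerIds.map((id) => [id, 0]));
  for (const o of db.ordersForCustomers(customerIds)) {
    totals.set(o.customerId, totals.get(o.customerId) + o.total);
  }
  return totals;
}

db.queryCount = 0;
totalsOneByOne([1, 2, 3]);
console.log(db.queryCount); // 3

db.queryCount = 0;
totalsBatched([1, 2, 3]);
console.log(db.queryCount); // 1
```

Both functions return the same totals. Only the number of trips to the database changes, and that number is the one that grows with your customer list.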

What links these two is that ordinary testing never catches either. You don't find them by clicking around carefully. You find them by deliberately throwing a crowd at the software at once, which AI won't do unless told, and which the person prompting it usually doesn't know to ask for. So for anything that touches money, stock or a shared resource, we ask one blunt question, "what happens if a hundred of these arrive at the same time", and we run that test for real before launch.
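Here is a small sketch of that "hundred at once" test against the ticket example. The stock object, takePayment and both buy functions are hypothetical stand-ins, and the payment is just a five-millisecond pause. It runs inside a single Node process, so it only illustrates the shape of the problem. In a real system the check and the reservation have to happen as one step in the database itself, for example a conditional update inside a transaction.

```javascript
// Hypothetical stand-in for a slow payment provider.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
async function takePayment() {
  await sleep(5);
}

// Check, pay, then update: every buyer can pass the check before any finishes.
async function buyUnsafe(stock) {
  if (stock.tickets < 1) return false;
  await takePayment();
  stock.tickets -= 1;
  return true;
}

// Check and reserve in one uninterrupted step, then pay,
// and hand the ticket back if the payment fails.
async function buySafe(stock) {
  if (stock.tickets < 1) return false;
  stock.tickets -= 1; // nothing else can run between the check and this line
  try {
    await takePayment();
    return true;
  } catch (err) {
    stock.tickets += 1;
    throw err;
  }
}

// Throw a crowd of buyers at a single remaining ticket.
async function crowd(buy, buyers) {
  const stock = { tickets: 1 };
  const attempts = Array.from({ length: buyers }, () => buy(stock));
  const results = await Promise.all(attempts);
  return { sold: results.filter(Boolean).length, left: stock.tickets };
}

crowd(buyUnsafe, 100).then((r) => console.log('unsafe', r)); // { sold: 100, left: -99 }
crowd(buySafe, 100).then((r) => console.log('safe', r));     // { sold: 1, left: 0 }
```

Run with one buyer, both versions behave identically, which is exactly why a single tester never notices the difference.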

Silent failures: the bugs nobody gets told about

This is the pattern that worries us most, because it's the one that hides the longest.

When something goes wrong inside software, the code is meant to do something about it. Write it down in a log. Alert someone. Tell the customer. At the very least, stop and refuse to carry on as though nothing happened.

AI very often writes the opposite. It wraps a risky step in code that catches every possible error and then does nothing at all with it:

try {
  sendOrderToWarehouse(order);
} catch (e) {
  // nothing here
}

To the customer, the order looks placed. To you, everything looks fine, because nothing ever complained. The order simply never reached the warehouse. You find out when a customer rings to ask where their delivery is, possibly weeks later, and possibly after it's quietly happened a hundred times.

There's a reason AI does this. Catching one specific, expected failure is a deliberate decision a developer makes because they had that exact failure in mind. Catching everything and ignoring it is what gets written when nobody's sure what could go wrong and just wants the red error to stop appearing. AI reaches for the second one far too readily. When we take a project on, "catch everything and do nothing" becomes a banned pattern: every real failure has to be recorded somewhere a human will actually see it. A problem you can't see is a problem you can't fix.
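For contrast, a sketch of the deliberate version. WarehouseBusyError, the simulate field and the placeOrder wrapper are all made up for this example, and sendOrderToWarehouse is a stand-in for the real call. The idea is simply that one expected failure gets a considered response, and everything else is recorded and passed on rather than swallowed.

```javascript
// Hypothetical error type for the one failure we expect and can recover from.
class WarehouseBusyError extends Error {}

// Stand-in for the real warehouse call. The `simulate` field exists only
// so this sketch can trigger each kind of failure.
function sendOrderToWarehouse(order) {
  if (order.simulate === 'busy') throw new WarehouseBusyError('warehouse busy');
  if (order.simulate === 'down') throw new Error('connection refused');
}

// `log` defaults to console.error; a real project would send this somewhere
// a person is actually watching.
function placeOrder(order, log = console.error) {
  try {
    sendOrderToWarehouse(order);
    return 'sent';
  } catch (e) {
    if (e instanceof WarehouseBusyError) {
      // Expected and recoverable: record it and queue the order for a retry.
      log(`order ${order.id} queued for retry: ${e.message}`);
      return 'queued';
    }
    // Anything else: record it, then rethrow so the caller cannot carry on
    // as though the order had been placed.
    log(`order ${order.id} FAILED to reach the warehouse: ${e.message}`);
    throw e;
  }
}

console.log(placeOrder({ id: 1 }));                   // sent
console.log(placeOrder({ id: 2, simulate: 'busy' })); // queued
```

An order with simulate set to 'down' logs the failure and then throws, so whatever called placeOrder has to deal with it.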

Code that looks current but is quietly out of date

AI learned from an enormous amount of code, much of it written years ago. So it has a habit of reaching for yesterday's tools with complete confidence.

Sometimes that means an outdated way of doing a security-sensitive job, the common one being how customer passwords are scrambled before they're stored. There are methods that were perfectly acceptable in 2015 and are a genuine liability now. AI will still suggest them, because it saw them used thousands of times. The code runs. It's simply not safe.
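To make that concrete, here is a sketch of one currently accepted approach using scrypt from Node's built-in crypto module. The salt size, the 64-byte output, the "salt:hash" storage format and the function names are choices we made for the example, not a recommendation for any particular project, and libraries offering bcrypt or Argon2 are reasonable alternatives. The kind of outdated method we tend to find is a single pass of a fast general-purpose hash with no salt.

```javascript
const crypto = require('node:crypto');

// Scramble a password for storage. A fresh random salt per password means
// two customers with the same password get different stored values.
function hashPassword(password) {
  const salt = crypto.randomBytes(16);
  const hash = crypto.scryptSync(password, salt, 64);
  return `${salt.toString('hex')}:${hash.toString('hex')}`;
}

// Check a login attempt against the stored value.
function verifyPassword(password, stored) {
  const [saltHex, hashHex] = stored.split(':');
  const expected = Buffer.from(hashHex, 'hex');
  const actual = crypto.scryptSync(password, Buffer.from(saltHex, 'hex'), expected.length);
  // Constant-time comparison, so timing doesn't leak how close a guess was.
  return crypto.timingSafeEqual(actual, expected);
}

const stored = hashPassword('correct horse battery staple');
console.log(verifyPassword('correct horse battery staple', stored)); // true
console.log(verifyPassword('password123', stored));                  // false
```

scrypt is deliberately slow and memory-hungry, which barely matters for one login and matters a great deal to someone trying billions of guesses against a stolen database.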

More surprising is when AI invents an ingredient that doesn't exist. Modern software is assembled from small, free, ready-made components called packages, downloaded from public libraries. AI sometimes makes one up. It confidently writes code that relies on a sensible-sounding package that nobody ever published.

In one large study, close to one in five of the software components suggested by common AI models simply didn't exist. Attackers noticed. They now watch for the names AI tends to invent, then publish real packages under those exact names with their own code tucked inside. The AI suggests the made-up name, a developer accepts it, and the build quietly downloads a stranger's code. The trick is common enough now to have a name: slopsquatting.

How we handle it: we check every outside component a project pulls in. Does it exist, who maintains it, when was it last updated, is it genuinely widely used. Anything obscure or brand new gets a hard look. We also pin every component to an exact version, so the project keeps using the parts we checked rather than whatever happens to turn up later. It's the same discipline behind keeping a website properly patched, which we wrote about in "Why website updates are a security issue, not just a chore".
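The pinning half of that is simple enough to check automatically. This sketch assumes a Node project whose components are listed in package.json. The findUnpinned function is our own illustration, and the package list is example data rather than a real project. It flags anything declared as a range (such as ^ or ~) instead of one exact version.

```javascript
// Returns the dependencies whose declared version is a range or tag
// rather than one exact version such as 4.17.21.
function findUnpinned(packageJson) {
  const exact = /^\d+\.\d+\.\d+(-[0-9A-Za-z.-]+)?$/;
  const sections = ['dependencies', 'devDependencies'];
  const unpinned = [];
  for (const section of sections) {
    for (const [name, range] of Object.entries(packageJson[section] ?? {})) {
      if (!exact.test(range)) unpinned.push(`${name}@${range}`);
    }
  }
  return unpinned;
}

// Example data only: the shape of a package.json file, not a real project.
const example = {
  dependencies: { express: '^4.19.2', lodash: '4.17.21' },
  devDependencies: { jest: '~29.7.0' },
};

console.log(findUnpinned(example)); // [ 'express@^4.19.2', 'jest@~29.7.0' ]
```

A check like this covers only the components a project names directly. A committed lockfile does the same job for everything those components pull in underneath. Whether each package exists, who maintains it and how widely it's used still needs a person to look.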

What a careful review actually looks for

None of this means AI-built software is doomed. Most of what reaches us can be fixed, and the fixes are usually targeted rather than a full rebuild. We said much the same about AI-built websites. But it does mean the old comfort blanket, "the tests pass and somebody reviewed it", isn't enough on its own any more.

When we review AI-built code, we're really asking three questions the test suite tends to skip:

What happens when a thousand people use this at once, instead of one?
What happens when someone feeds it something deliberately hostile, instead of something sensible?
What happens when a step fails, and does anyone actually find out?

If the honest answer to any of those is "we're not sure", that's the work, and that's where we start.

AI has made writing software fast and cheap. It hasn't done the same for reading it properly, and reading it properly is the part that keeps your customers' data private and your software standing up on a busy day. That's the bit still worth paying a person for.

If you've had an app, a website or a booking or membership system built with AI tools and you're not certain what's underneath it, that's exactly what our AI Code Rescue service is for. We'll take an honest look and tell you plainly what we find, including when the answer is "this is fine, leave it alone". Get in touch and we'll set up a chat.