Coasting on reputation
The gap between what code review signals and what it delivers
Life is complex; working everything out from first principles is hard. And time-consuming. So we all use proxies. Marks & Spencer == good food. Tesco Value less so. Volkswagen == quality cars. Lada less so.
But quite how these proxies get established is rarely clear. Nor is whether they remain accurate. Regardless, we continue using them.
Software proxies
Something similar happens with software. One oft-used proxy is code review. "Does your team review all your code?" is slowly morphing into "Do humans review all your code?"
But, really, asking about code review is a proxy for quality. Teams that care about quality have rigorous dev processes. And one of those processes is code review.
The thing is, code review isn’t great at finding bugs. Multiple studies have shown that human code review is most effective within narrow constraints: under 400 lines reviewed in under an hour. Even then, the majority of what’s flagged is style and maintainability. Finding the subtle - yet critical - bugs is extremely hard. I’m not alone in admitting I never enjoyed reviewing code - it was hard and unrewarding. Spending multiple hours to find nothing more than a few typos - well, it didn’t seem like a great use of time.
But we did it because, once in a while, it turned up great bugs - like the time I spotted a whole new multi-threaded feature lacked any support for concurrency. Things like that make the pain worthwhile.
The past tense
You’ll have spotted I’ve been writing about human code review in the past tense. That’s deliberate - it’s increasingly clear that the era of human code review is over. If you are interested in quality - and you should be - nowadays manual code review offers a very poor return on the time invested.
Instead AI has opened the door to some incredible opportunities. Opportunities that could result in better quality software than we could ever have dreamed of.
Unit testing is essentially free now. By default Claude and Codex will produce a paltry few tests. But give them a goal - say 100% line, region, or function coverage - and they will diligently work for hours building a test suite to achieve it.
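A coverage goal is, mechanically, just a number you can assert on. A minimal sketch using Python’s coverage.py and pytest (the tests/ path is a stand-in; a real setup would more likely use pytest-cov):

```python
# Minimal sketch: enforcing a coverage goal with coverage.py and pytest.
# The "tests/" path is a stand-in for wherever the generated suite lives.
import coverage
import pytest

cov = coverage.Coverage(branch=True)  # branch coverage is the closest analogue to "region" coverage
cov.start()
pytest.main(["tests/"])               # run the (AI-generated) test suite in-process
cov.stop()

percent = cov.report()                # prints a summary and returns the total percentage
assert percent >= 100.0, f"coverage goal missed: {percent:.1f}%"
```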
Even better, the door is starting to open for MC/DC testing. Modified Condition/Decision Coverage tests that every condition within a decision independently affects its outcome. So if you have something like if (A && B || C), it’ll test what happens when you flip A alone (while holding B and C fixed), and then the same for B and C independently.
MC/DC is the standard for life critical systems - think flight control systems, or aircraft engine control systems.
Until now it has been prohibitively expensive for normal software to use. You can end up writing many lines of test code for every line of production code. But when you’ve got AI it suddenly comes into reach.
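Here’s what those independence tests look like in practice - a minimal hand-written sketch, not output from any particular tool:

```python
# MC/DC sketch for the decision (A and B) or C.
# Each commented pair of rows differs in exactly one condition and flips
# the outcome - demonstrating that each condition independently matters.
def decide(a: bool, b: bool, c: bool) -> bool:
    return (a and b) or c

# ((a, b, c), expected outcome)
mcdc_cases = [
    ((True,  True,  False), True),   # flip A against the next row...
    ((False, True,  False), False),  # ...outcome flips: A shown independent
    ((True,  True,  False), True),   # flip B against the next row...
    ((True,  False, False), False),  # ...outcome flips: B shown independent
    ((True,  False, True),  True),   # flip C against the next row...
    ((True,  False, False), False),  # ...outcome flips: C shown independent
]

for (a, b, c), expected in mcdc_cases:
    assert decide(a, b, c) == expected
```

Four distinct vectors cover all three conditions - MC/DC needs a minimum of n+1 tests for n conditions, against 2^n for an exhaustive truth table.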
Then there’s property testing. AI can build comprehensive property test suites. Or mutation test suites. Or fuzz test suites.
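To give a flavour of the first of those: a property test is just an invariant plus an input generator. A minimal sketch using Python’s Hypothesis library (the round-trip property is illustrative, not taken from any project mentioned here):

```python
# Property-test sketch with Hypothesis: serialisation must round-trip for
# *any* generated input, not just a handful of hand-picked examples.
import json
from hypothesis import given, strategies as st

@given(st.dictionaries(st.text(), st.integers()))
def test_json_round_trip(d):
    assert json.loads(json.dumps(d)) == d
```

Hypothesis generates hundreds of inputs per run and shrinks any failure to a minimal counterexample.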
E2E tests become easy too - my emulator project has hundreds of end-to-end tests that use real games or Windows apps.
Then there’s exploratory testing. Once again Claude has you covered - build harnesses that let it drive your app, then let it explore all night. The emulator has emuscru (EMUlator SCript RUnner) and Gremlin (probably don’t need to explain that one) which allow Claude to create all sorts of test cases and havoc. Even better, another agent will fix up issues as soon as they are found.
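emuscru and Gremlin are project-specific, but the shape of such a harness is simple. A hypothetical sketch (every name below is a stand-in, not the real API of either tool):

```python
# Hypothetical "gremlin" harness: hammer an app with random input all
# night, and make any crash replayable. The App class is a stand-in for
# whatever lets an agent drive your system under test.
import random

class App:
    def click(self, x: int, y: int) -> None: ...
    def keypress(self, key: str) -> None: ...
    def is_alive(self) -> bool: return True

def gremlin(app: App, steps: int, seed: int) -> None:
    rng = random.Random(seed)  # seeded, so a crash can be replayed exactly
    for _ in range(steps):
        if rng.random() < 0.5:
            app.click(rng.randrange(640), rng.randrange(480))
        else:
            app.keypress(chr(rng.randrange(32, 127)))
        assert app.is_alive(), f"crash found: replay with seed={seed}"
```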
What about code review?
Of course, code review still has a role to play in this new world. Except this time you can run it every night. You can run multiple agents working on adversarial reviews. You should run multiple agents every night. These things aren’t possible in the human world - imagine how many times you could ask an engineer to review the same code before they’d revolt. Twice? Three times?
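What might that orchestration look like? A sketch under heavy assumptions - review-agent is an invented stand-in for whichever agent CLI you use, not a real command:

```python
# Hypothetical sketch: point several adversarial reviewer personas at the
# same diff every night. "review-agent" is an invented stand-in command.
import subprocess

PERSONAS = [
    "You are a security reviewer. Find exploitable flaws in this diff.",
    "You are a concurrency reviewer. Find races and deadlocks.",
    "You are hostile. Argue this change should be rejected, with evidence.",
]

def nightly_review(diff_path: str) -> list[str]:
    reports = []
    for persona in PERSONAS:
        result = subprocess.run(
            ["review-agent", "--prompt", persona, "--diff", diff_path],
            capture_output=True, text=True,
        )
        reports.append(result.stdout)
    return reports
```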
And so?
Trouble is, proxies decay. Take Volkswagen. Despite all the “German engineering” marketing, Volkswagen came last in JD Power’s dependability survey last year. But if you haven’t bought one recently, you probably still think of them as quality cars.
And human code review?
In reality, it’s no longer the proxy for quality it once was. If anything, insisting on it signals an organisation that hasn’t adjusted its thinking for the new world. A world where you can do so much more than human code review ever could. A world where your time is far better spent orchestrating agents.
Who knows? Maybe "do you review your code" will get replaced with "what’s your MC/DC coverage?"


When I review AI code, I'm not reviewing the _code_. I'm reviewing the design decisions that it made at coding time, to spot where the design we agreed up front had gaps that the AI has had to fill by itself. Because I don't know about your experience, but mine is that AI is still bad at design.
Working on something over the weekend that exposed a basic CRUD API, we proceeded in phases - first, to support local testing of business logic with no external dependencies, it built an initial mode of operation where it maintained an in-memory state store and then persisted the whole thing to disk. Okay, cool, that's useful. Then, in the next phase, it was time to introduce an actual database. The AI's approach? "Well, we've got this in-memory state store, so let's just maintain the pattern of mutating that and then persisting it, only now to the database. I'll just delete the database and recreate it in full on every write operation!"
Ummm. Now, in this case, I got another AI agent to do the first pass of review, and it did highlight that that was... not very efficient. The coding agent's handling of that feedback?
"You're absolutely right! Let's instead iterate through every item in the store, attempt to INSERT it, attempt to UPDATE it if the INSERT generated a conflict, then iterate through every item in the store again to build a set of IDs, and DELETE everything in the database whose ID isn't in that set. Much better."
Sorry - I still can't bring myself to put code into production that I haven't looked at at all. I don't need to check for dotted Is and crossed Ts, but AI has yet to develop a good enough sense of smell to remove the need for me to at least give it a sniff.
I'd like to offer something of a counterpoint here. I think it's a mistake to look at code review by humans solely as a mechanism to improve code quality. In a properly functioning development organisation, it's doing a lot more jobs than just that:
- It disseminates knowledge about code changes around the development team.
- It's an opportunity for alternative solutions to be proposed.
- It provides opportunities for teaching/mentoring between more senior and junior developers.
- It's a key checkpoint for ensuring the maintainability of the code (this is distinct from whether there are bugs in it!).
- It's a way of making people feel more collective ownership of the code and less personal ownership. After review, the code is accepted as belonging to the team that looks after it as opposed to being the baby of the person/tool that wrote it.
A key thread running through all of these points is that code review is a vector for communication between humans working on a project together. With one or two developers perhaps that is less important, but as your project scales to have more people working on it, this kind of communication is critical to a well-functioning team. Maybe exactly what you need for each of these points changes as the amount one person can accomplish by themselves with AI help continues to evolve. But I'm unconvinced that the need for them will disappear. Unless your vision of the future is that all software projects can be done by a single human? (Which personally I find quite depressing - collaboration is fun!)