Discussion about this post

Martin Davidson

I agree - 4 million is quite a few! But the majority (3.99 million+) are opcode tests from a mix of 8086/80286/80386 test suites. I've only built a few thousand new tests. But I can also automate a whole slew of things - I got Claude to build a suite of real-world apps which it then runs, captures screenshots and hashes them. Then Claude repeats the tests with the app in the emulator and checks the hashes. Sure, there are limitations - timing is a pain - but there's a lot of mileage you can cover cheaply.
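For anyone curious what the screenshot check might look like, here's a minimal Rust sketch, assuming the screenshots land on disk as image files and using the sha2 crate; the paths and function names are illustrative, not the actual tooling.

```rust
// Sketch only: compare a "golden" screenshot from a known-good run against
// one captured from the emulator by hashing the raw file bytes.
// Requires the `sha2` crate; paths and names are made up for illustration.
use sha2::{Digest, Sha256};
use std::path::Path;

/// SHA-256 of a file's raw bytes.
fn sha256_file(path: &Path) -> std::io::Result<Vec<u8>> {
    let bytes = std::fs::read(path)?;
    Ok(Sha256::digest(&bytes).to_vec())
}

fn main() -> std::io::Result<()> {
    let golden = sha256_file(Path::new("golden/app_startup.png"))?;
    let emulated = sha256_file(Path::new("emulator/app_startup.png"))?;

    if golden == emulated {
        println!("screenshot matches the golden hash");
    } else {
        println!("MISMATCH: emulator output differs from the reference");
    }
    Ok(())
}
```

Hashing whole frames is cheap but strict: anything non-deterministic (clocks, cursors, timing-dependent redraws) breaks the match, which is exactly the timing pain mentioned above.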

As to testing in the AI world, my thought process goes like this:

- Human code-review is too slow. IIRC Google reckons 200-400 loc per hour, 4 hours per day - call it 800-1,600 loc per reviewer-day. At that rate 120kloc is 75-150 working days of review, i.e. months to the better part of a year of one reviewer's time.

- Human code-review isn't great. It's hard, and it's really easy to miss both subtle and big-picture things. While it was all we had it was worthwhile, but I'm not convinced of its value now.

AI makes testing cheap. There's a whole set of testing approaches and tooling out there - much of which I've never used before because it was too expensive for commercial software - e.g. MC/DC (aviation-grade) coverage testing, mutation testing, fuzzing. Plus there's Kani (a Rust verifier). I'm not an expert on any of these yet, but I'm sure I'll be getting to know them better in 2026 :). My instinct is that using these tools will result in better code than the traditional human code-review plus UTs/FVs/ST. Of course, I'm assuming we move to writing everything in Rust - but that seems like a sensible move when AI is generating the code.
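As a flavour of what Kani verification looks like, here's a minimal proof-harness sketch; the `add8` function and `Flags` struct are hypothetical stand-ins for whatever the emulator actually exposes, not its real API.

```rust
// Hypothetical example: a toy 8-bit ADD with carry/zero flags,
// roughly like an 8086 `ADD r8, r8`.
struct Flags {
    carry: bool,
    zero: bool,
}

fn add8(a: u8, b: u8) -> (u8, Flags) {
    let (result, carry) = a.overflowing_add(b);
    (result, Flags { carry, zero: result == 0 })
}

fn main() {
    let (r, f) = add8(0xFF, 0x01);
    println!("0xFF + 0x01 = {r:#04x}, carry={}, zero={}", f.carry, f.zero);
}

// Compiled only under `cargo kani`, which explores all 65,536 input pairs
// symbolically rather than the handful a hand-written test would sample.
#[cfg(kani)]
mod proofs {
    use super::*;

    #[kani::proof]
    fn add8_flags_are_consistent() {
        let a: u8 = kani::any();
        let b: u8 = kani::any();
        let (result, flags) = add8(a, b);
        assert_eq!(flags.zero, result == 0);
        assert_eq!(flags.carry, (a as u16) + (b as u16) > 0xFF);
    }
}
```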

For safety-critical core pieces humans still have a role to play. But for the rest? It increasingly looks like a poor use of time. Maybe we'll find human code-review becomes obsolete while quality improves through the use of these tools and frameworks?

Martin

4 million tests for 140kloc seems excessive - surely that's overfitting the tests to the point where the whole system is brittle? If so, any change will break a number of them, giving you (the human) more to review when the AI fixes those tests. That leads into a wider question: I've been wondering whether AI making tests cheap leads to loads of low-value tests, and whether ending up with a large suite of low-value tests is desirable.

If we assume only AI will work on a repo, then what does an AI-first test suite look like? Do we assume humans just shouldn't look at the tests because they're so numerous, and should focus their review on the deliverable code instead? Or the opposite - the production code doesn't matter as long as the tests are good enough to prove it works?

What do you think?

That said, perhaps the numbers are inflated in this particular case because you've got loads of opcode combinations to iterate through, and the core test is high value?
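For what it's worth, the combinatorics do add up fast. A toy sketch (instruction and register names are just for illustration, not the actual test generator) of enumerating only the 8-bit register-to-register ALU forms over a few boundary values:

```rust
// Illustrative only: why opcode test counts explode.
fn main() {
    let opcodes = ["ADD", "ADC", "SUB", "SBB", "AND", "OR", "XOR", "CMP"];
    let regs = ["AL", "CL", "DL", "BL", "AH", "CH", "DH", "BH"];
    let boundary_values: [u8; 5] = [0x00, 0x01, 0x7F, 0x80, 0xFF];

    let mut count = 0u64;
    for _op in &opcodes {
        for _dst in &regs {
            for _src in &regs {
                for _a in &boundary_values {
                    for _b in &boundary_values {
                        count += 1; // one generated test case
                    }
                }
            }
        }
    }
    // 8 opcodes x 8 x 8 registers x 5 x 5 values = 12,800 cases,
    // before touching memory operands, wider registers, or flag setups.
    println!("{count} cases");
}
```

Scale that across the full 8086/286/386 instruction set, addressing modes and flag states, and millions of cases stop looking surprising.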
