There's something odd happening in software engineering
The rise of Oracle Driven Development
It’s exam season in our house. My kids are going through the well-trodden "revise, past papers, mark" loop. I’d like to say they were enthusiastic, but resigned to their fate is probably more accurate.
One of the regular topics of discussion is how to mark history and English papers. Maths & physics are easy - they have clear yes/no answers. But history and English? They’re a bit vague and wishy-washy.
In a very real way my kids are learning the advantages of good oracles. Good testing oracles enable you to quickly work out if something is right. Maths has good oracles; English less so.
Oracles
Oracles have become a central part of my life. These days I spend a lot of time thinking about them. They turn out to be critical to the new way of creating software.
I’ll admit it. In the past I paid lip service to testing oracles. If the code built, well, that was often good enough. Building and manual testing went hand-in-hand; by the time I was finished I had a pretty good idea of the quality - and investing more time in test oracles felt like a poor use of time.
But now? Now oracles are critical. They are the tool that turns Claude and Codex from capable partners into amazing ones.
Take the Rust Opus audio codec side-project I’ve been working on. I added a video capture feature to my DOS emulator; but there were no good native Rust audio or video codecs. In the past I’d have been stuck wrapping C deps in a Rust FFI. But nowadays?
Nowadays I just build the codec I need. For Opus we started with the canonical xiph C implementation. Claude reverse engineered the codebase, writing detailed design docs. While that was underway we worked through the oracles we’d need: unit tests, the IETF test suite, differential fuzzing against the canonical xiph implementation, and sample audio files for perf testing.
The goal was to build a memory-safe, bit-identical encoder in Rust, with performance comparable to xiph.
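To make "differential fuzzing against the canonical implementation" concrete, here is a minimal sketch of the idea. It is not the ropus harness: the toy delta "codec", the function names and the little xorshift generator are all stand-ins I've made up for illustration. A real setup would call the C reference through FFI and would more likely use a proper fuzzer.

```rust
/// Hypothetical stand-in for the reference implementation.
/// Toy "codec": delta-encode the samples as little-endian bytes.
fn reference_encode(samples: &[i16]) -> Vec<u8> {
    let mut prev: i16 = 0;
    let mut out = Vec::with_capacity(samples.len() * 2);
    for &s in samples {
        let delta = s.wrapping_sub(prev);
        out.extend_from_slice(&delta.to_le_bytes());
        prev = s;
    }
    out
}

/// Hypothetical stand-in for the port under test: same behaviour,
/// written differently.
fn ported_encode(samples: &[i16]) -> Vec<u8> {
    samples
        .iter()
        .scan(0i16, |prev, &s| {
            let delta = s.wrapping_sub(*prev);
            *prev = s;
            Some(delta.to_le_bytes())
        })
        .flatten()
        .collect()
}

/// Tiny deterministic PRNG (xorshift64) so the sketch needs no crates.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// The differential oracle: feed the same random input to both
/// implementations and demand bit-identical output.
/// Returns the first input on which they disagree.
fn differential_fuzz<R, P>(reference: R, port: P, cases: usize, seed: u64) -> Result<(), Vec<i16>>
where
    R: Fn(&[i16]) -> Vec<u8>,
    P: Fn(&[i16]) -> Vec<u8>,
{
    let mut rng = XorShift64(seed.max(1)); // xorshift must not start at 0
    for _ in 0..cases {
        let len = (rng.next_u64() % 64) as usize;
        let input: Vec<i16> = (0..len).map(|_| rng.next_u64() as i16).collect();
        if reference(&input) != port(&input) {
            return Err(input);
        }
    }
    Ok(())
}
```

Calling `differential_fuzz(reference_encode, ported_encode, 1000, 42)` either returns `Ok(())` or hands back a failing input. That failing input is exactly the kind of thing an agent can chew on without me in the loop.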
With the design in hand I set Claude (Opus 4.6) and Codex (GPT 5.4) off in parallel building two independent versions. Why both? Well, I was interested in seeing how they compared when given the same brief. Which was faster? Which could work longer? Which produced better code?
In the end it was a damp squib. Claude quickly drew ahead - mostly because it was able to run for far longer than Codex. At the point where Claude had a basic encoder working, I abandoned the Codex version (or, more precisely, we reviewed to see if there was anything worth saving - and decided there wasn’t).
But the first stage was building the oracles. Once the oracles exist, Claude and Codex have something to build against. They have a way of checking their work. Suddenly they can run independently for hours. These days it’s not uncommon to have 10-20 agent teams busy working for 2-4 hours each across a range of projects.
Over the past three weeks Claude has been grinding away in the background - porting a function and then fuzzing it against the differential oracle. The codec hasn’t been a high priority - but Claude has made steady progress.
Towards the end we focused on performance; Claude ran a perf loop - work through some likely candidates, bench them, test them and then either commit or drop.
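The shape of that perf loop is simple enough to sketch. This is my own illustration under stated assumptions, not the script Claude ran: `bench`, `decide`, `Verdict` and the `min_gain` threshold are hypothetical names, and a serious benchmark would do warm-up runs and take multiple samples.

```rust
use std::time::{Duration, Instant};

/// Outcome for one candidate optimisation.
#[derive(Debug, PartialEq)]
enum Verdict {
    Commit,
    Drop,
}

/// Crude wall-clock benchmark: run `f` `iters` times and return the total.
fn bench<F: FnMut()>(mut f: F, iters: u32) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        f();
    }
    start.elapsed()
}

/// Keep a candidate only if it still passes the correctness oracle AND
/// beats the baseline by at least `min_gain` (0.02 means 2% faster).
fn decide(baseline: Duration, candidate: Duration, passes_oracle: bool, min_gain: f64) -> Verdict {
    let gain = 1.0 - candidate.as_secs_f64() / baseline.as_secs_f64();
    if passes_oracle && gain >= min_gain {
        Verdict::Commit
    } else {
        Verdict::Drop
    }
}
```

The point is that the decision is mechanical: a candidate that is faster but fails the oracle gets dropped, and so does one that passes but doesn't clear the threshold.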
In the end decode perf is slightly ahead of the C version - the mean across all different audio types is 5% faster. Encode is 2% slower. I can live with that.
And then we were done; Opus ported (modulo some more niche features). 24 days. 365 commits. 63kloc. >100 hours of fuzzing. >24 hours of mutation testing.
You can find "ropus" here and here.
Reimplementing is easy
Now you might well think porting is easy. And you’d be partly right. Having a reference implementation that can act as an oracle made this much easier. Creating an effective audio codec from scratch would be significantly more work. Even starting from just the documentation would be a lot harder.
But then pause to think of all the legacy code written in memory-unsafe languages. Riddled with tech debt. The products we’d love to rewrite but are too scared to. Too worried that our customers are relying on unintentional bugs or weird edge cases. This technique offers an automated way to tackle those. Your old product is suddenly the perfect oracle. Could the world finally retire COBOL? Fortran?
It’s not just porting
The oracle technique works for new code as well. Assuming you can get the right oracle. I’m building a Raspberry Pi Pico emulator; the prime oracle is the real hardware - Claude has a couple of them hooked up via USB debug probes and can run functional tests and differential fuzzing against them. They are invaluable.
But…
If this sounds too good to be true, well…
For tasks with maths-style oracles, Claude and Codex are imperious. Porting, refactoring, performance tuning against a reference. They all have natural oracles. But UI work. Or product decisions. Or architecture. Less so. They’re the English tests of the coding world.
It’s the sense of taste that the AI coding tools lack. Take the Pico emulator. The real hardware has two cores, a complex programmable IO controller (PIO), DMA and a bunch of peripherals. A naive implementation steps each core, then ticks PIO, DMA and all the peripherals in a loop. It works but it’s slow - maybe 10% of real-time performance. And that’s on a fast host.
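For the curious, the naive loop looks roughly like this. It is a sketch only: the types below are placeholders that just count cycles, and none of the names come from the real emulator.

```rust
/// Anything that advances by one clock cycle (hypothetical trait).
trait Tick {
    fn tick(&mut self);
    fn ticks(&self) -> u64;
}

/// Placeholder component that only counts the cycles it has seen.
/// It stands in for a core, the PIO block, DMA or a peripheral.
#[derive(Default)]
struct Counter {
    cycles: u64,
}

impl Tick for Counter {
    fn tick(&mut self) {
        self.cycles += 1;
    }
    fn ticks(&self) -> u64 {
        self.cycles
    }
}

struct Machine {
    cores: [Counter; 2],
    pio: Counter,
    dma: Counter,
    peripherals: Vec<Box<dyn Tick>>,
}

impl Machine {
    fn new() -> Self {
        Machine {
            cores: [Counter::default(), Counter::default()],
            pio: Counter::default(),
            dma: Counter::default(),
            peripherals: vec![
                Box::new(Counter::default()) as Box<dyn Tick>,
                Box::new(Counter::default()),
            ],
        }
    }

    /// One emulated cycle: step each core, then tick everything else,
    /// strictly in sequence on a single host thread.
    fn step(&mut self) {
        for core in self.cores.iter_mut() {
            core.tick();
        }
        self.pio.tick();
        self.dma.tick();
        for p in self.peripherals.iter_mut() {
            p.tick();
        }
    }

    fn run(&mut self, cycles: u64) {
        for _ in 0..cycles {
            self.step();
        }
    }
}
```

Everything advances in lockstep, which makes it easy to reason about and easy to check against hardware. It also means every emulated cycle pays for every component, one after another.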
Much better is to parallelize the work across multiple cores. But getting the split right still seems beyond Claude - it offered multiple solutions but they all felt, well, a bit clunky. And some were slower than the naive single core approach. But once we’d agreed on an approach? Well, then we were away. We had an architecture. And a perf oracle. Life was good.
And so?
The quality of your oracles now directly correlates with how fast and long you can get Claude to run. What was once an afterthought is now critical. The worse my oracle - or the more the problem resists having one at all - the more I need to stay in the loop.
Suddenly two skills seem critical: designing the thing that checks the code, and having a view on the things no check can settle. Everything in between is getting eaten.
I’m not sure my kids are going to like what’s on the other side of exam season.

