The Oracle of Doom
Why building the test is the new hard problem
Over the years I’ve been lucky to get to know some very smart people. There were the two girls in my primary class who consistently beat me every year. The friend at uni who was studying physics but decided to spend their spare time doing a maths degree as well (lovely person, clearly mad). The numerous work colleagues who have solved problems I couldn’t. Or provided insights and guidance I never saw.
In that time, I’ve realised it’s reasonably easy to tell where I fit in the pecking order. I can work out who is smarter than me. And who is not. But beyond that it gets harder. Out of a set of five smarter people, who is the smartest?
Opus 4.6 / Codex 5.3
Last week Opus 4.6 and Codex 5.3 were released. And once again I found myself faced with this problem. The previous set of models - Opus 4.5 and Codex 5.2 - were clearly smarter than their predecessors. They could do things their predecessors couldn’t. When I first tried Opus 4.5 it fixed a raft of bugs Opus 4.1 and Codex 5.1 had failed with. It just nailed them. It was clear who was smarter.
Those were bugs I reckoned I could fix. So Opus 4.5 was at least as good as me. And since then it’s become obvious it’s probably better than me. It’s definitely faster.
But Opus 4.6? How to judge it? The benchmark numbers are better - but on coding benchmarks Opus 4.6 isn’t much ahead of 4.5. Some scores are better, some not so much.
Since Christmas I’ve been working intermittently with Opus 4.5 on a 386/DOS/Win16 emulator. We’ve been trying to get Doom to run. Which has turned out to be surprisingly hard. DOS is a 16-bit OS with a 640K-ish memory limit. Memory is split into 64K segments. Memory management is hard.
But under the covers the 386 can also work as a 32-bit processor with a "flat" addressing mode of 4GB. Writing software for this mode is much easier. Plus access to more than 640K means software can be a lot more interesting. Hello Doom.
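To make the difference concrete, here is a minimal sketch of how a real-mode address is formed compared with a flat 32-bit one. The function and constant names are mine, purely for illustration, and are not taken from the emulator. It relies only on the standard real-mode rule that a 16-bit segment is shifted left by four bits and added to a 16-bit offset.

```python
# Illustrative sketch only: names are hypothetical, not from the emulator.

SEGMENT_SIZE = 0x10000        # one segment spans 64K of offsets
FLAT_32BIT_SIZE = 1 << 32     # a flat 32-bit address space spans 4GB


def real_mode_linear(segment: int, offset: int) -> int:
    """Return the linear address for a real-mode segment:offset pair."""
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset must both fit in 16 bits")
    # The segment is multiplied by 16 and the offset is added on top.
    return (segment << 4) + offset


if __name__ == "__main__":
    # Segment 0xA000 starts exactly at the 640K boundary.
    print(hex(real_mode_linear(0xA000, 0x0000)))   # 0xa0000
    # Different segment:offset pairs can name the same byte.
    print(hex(real_mode_linear(0x1234, 0x0010)))   # 0x12350
    print(hex(real_mode_linear(0x1235, 0x0000)))   # 0x12350
```

Two different pairs landing on the same byte is a big part of why segmented memory management is hard. In a flat model an address is just one 32-bit number.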
Supporting a 4GB flat address space on a 16-bit segmented OS was an impressive technical feat. Even today it feels a little bit like magic. And it was all thanks to the DOS Protected Mode Interface (DPMI). DPMI handled all the complexities of interfacing with 16-bit DOS and the complex processor modes of the 386. It let apps think they were running with a flat 32-bit address space. It was - is - remarkably clever.
Applications like Doom use DPMI. So, without a working DPMI, Doom won’t run. And until Friday I didn’t have a working DPMI.
Since Christmas Opus 4.5 and Codex 5.2 have spent many hours investigating problems. We tried fixing problems as they arose, stepping back and doing detailed reviews against specs, differential fuzzing against working implementations, building tools to run in the emulator, driving code coverage over 95%. But DPMI remained stubbornly broken.
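For anyone who hasn’t met differential fuzzing: the idea is to feed the same random inputs to a trusted implementation and to the one under test, and flag every disagreement. The sketch below is a toy version under stated assumptions. It compares two hypothetical 16-bit add routines, one of which deliberately forgets the carry flag. It is not the harness we used, and every name in it is made up for illustration.

```python
import random
from typing import Callable, List, Tuple

AddFn = Callable[[int, int], Tuple[int, bool]]


def reference_add16(a: int, b: int) -> Tuple[int, bool]:
    """Trusted model: 16-bit add returning (result, carry flag)."""
    total = a + b
    return total & 0xFFFF, total > 0xFFFF


def candidate_add16(a: int, b: int) -> Tuple[int, bool]:
    """Hypothetical implementation under test; it never sets carry (a bug)."""
    return (a + b) & 0xFFFF, False


def differential_fuzz(reference: AddFn, candidate: AddFn,
                      trials: int = 1000, seed: int = 0) -> List[Tuple[int, int]]:
    """Return every random input pair on which the two functions disagree."""
    rng = random.Random(seed)  # fixed seed so failures can be replayed
    mismatches = []
    for _ in range(trials):
        a = rng.randrange(0x10000)
        b = rng.randrange(0x10000)
        if reference(a, b) != candidate(a, b):
            mismatches.append((a, b))
    return mismatches


if __name__ == "__main__":
    bad_inputs = differential_fuzz(reference_add16, candidate_add16)
    print(f"{len(bad_inputs)} mismatching inputs found")
```

The catch is that this only finds bugs in behaviour both sides can be asked about. It says nothing about paths the fuzzer never reaches.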
Opus 4.6 was released last Thursday. On Friday I spun up an agent team (a new Opus 4.6 feature) to work together to investigate, discuss, explore theories - to get Doom working. I gave them all the background, all the tools, all the diags. The emulator has built-in screen capture and keyboard injection features - so Opus 4.6 could drive the app and test fixtures. And then, as I am wont to do, I went to bed.
Partway through Saturday I went to check in on the team. They were claiming success. "Seen that before," I thought. Except this time it was true.
It’s possible Opus 4.5 had been just about to fix DPMI. But I’m confident Opus 4.6 is the reason Doom works.
And so?
How did Opus 4.6 fix DPMI? Perhaps a deeper understanding of x86 memory models? Better reasoning about edge cases? Luck? I don’t know.
But, really, does it matter? What matters is that Doom works.
We’ve reached the point - arguably several months ago - where understanding the internals isn’t an option. Not because the code is badly written, but because of scale. Nobody understands the Linux kernel end-to-end. We’ve been relying on oracles for years - test suites, integration tests. We just didn’t call them that.
Doom was my oracle for DPMI.
And it’s a great oracle. Binary. It runs or it doesn’t. But most things aren’t that clean. How do you build an oracle for reliability? For performance under edge cases you haven’t thought of? For security? For maintainability?
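A binary oracle like that is almost trivially small to write down, which is part of its appeal. The sketch below is illustrative only: `run_program` stands in for whatever launches the program and captures its screen, and the marker bytes are invented. None of it is the emulator’s real interface.

```python
from typing import Callable


def binary_oracle(run_program: Callable[[], bytes], expected_marker: bytes) -> bool:
    """Pass only if the program runs and its captured output shows the marker."""
    try:
        screen = run_program()
    except Exception:
        # A crash counts as a fail; the oracle doesn't care why.
        return False
    return expected_marker in screen


# Hypothetical stand-ins for "launch the program and capture the screen".
def working_run() -> bytes:
    return b"...TITLE SCREEN..."


def crashing_run() -> bytes:
    raise RuntimeError("general protection fault")


if __name__ == "__main__":
    print(binary_oracle(working_run, b"TITLE SCREEN"))   # True
    print(binary_oracle(crashing_run, b"TITLE SCREEN"))  # False
```

Notice what it doesn’t tell you: how fast, how reliable, how secure. Those need oracles that return something richer than yes or no, and that is where it gets hard.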
Oracles are rapidly becoming the problem. AI can write the code - that’s clear. Knowing whether a model is better is much less important than knowing whether the thing it builds actually works. As models generate code at scales beyond comprehension, proving function, reliability, performance - it’s all we’ve got left.
Building the software used to be the hard part. But now? It’s building the test.
For now, Doom runs. Slowly. Very slowly. You can see it below. You might want to grab a chair. But me? I’ll take the win.

