WOW
I think I have a problem
At this time of year, apps often seem to send you a yearly summary. How many words you’ve learnt, or miles you’ve walked, or commits you’ve pushed. Not wanting to be left out, ChatGPT now offers a "Your year with ChatGPT" summary.
I initially ignored it, but when someone on LinkedIn posted their stats I thought I’d have a quick look to see how mine compared. They had ~6,000 messages across 377 chats. And had generated 37 images. That seemed pretty impressive. But then I looked at mine and was shocked to discover I had ~27,000 messages in 1,500 chats. And ~700 images.
And then I remembered this was just one of my four ChatGPT accounts. Plus I have a couple of Claude accounts and Gemini and Grok as well. I think I have a problem.
But it got me thinking - how much have I actually used these tools on things I care about? How much code? How many stories? How much music? So I got Claude to build a tool to find out.
Meet AnnualReport
The result was an annual report tool (written in Rust, of course). It clones all the repos from my GitHub and runs mdkloc over the changes made in 2025. It also looks for short stories (Claude has written quite a few in English and Spanish), novels (I’ve experimented with generating longer content) and songs (my AI music needs lyrics). And then it goes and scrapes YouTube and my private audio repo to work out how much video & audio I’ve created.
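One building block for a tool like this is tallying lines added over a period. A minimal sketch: sum the "added" column of `git log --numstat` output, assuming git itself does the date filtering (e.g. `git log --since=2025-01-01 --until=2026-01-01 --numstat`). This is illustrative, not the actual tool's code.

```rust
// Sum lines added from `git log --numstat` output.
// numstat lines look like: "<added>\t<deleted>\t<path>";
// binary files show "-" for both counts, and commit header
// lines fail the parse, so both are skipped by filter_map.
fn added_lines(numstat: &str) -> u64 {
    numstat
        .lines()
        .filter_map(|line| {
            let mut parts = line.split('\t');
            let added = parts.next()?.parse::<u64>().ok()?;
            let _deleted = parts.next()?;
            let _path = parts.next()?;
            Some(added)
        })
        .sum()
}

fn main() {
    let sample = "12\t3\tsrc/main.rs\n-\t-\tassets/icon.png\n5\t0\tREADME.md\n";
    println!("added = {}", added_lines(sample)); // prints: added = 17
}
```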
The results?
About 2Mloc of useful code (and about the same again in stuff I’ve thrown away). 3,000 commits. 140 new repos. 29 novels. ~200 short stories. The lyrics for ~2,000 songs, of which 162 got turned into music running to ~19 hours in length. And 4 hours of video.
Last year I was astonished when AI created 10kloc of code for me and wrote half a novel. This year things are… different.
Feeling the exponential
Is this what an exponential feels like?
In the first 30 years of my career I created roughly 200kloc of code across work and side projects. About 7kloc a year. Last year AI accelerated that to 10kloc in a couple of months. This year? 10x more. And most of that’s in the past 4 months.
As more evidence of the mad world we now inhabit, in the three days since I created the reporting tool, I’ve built another ~100kloc.
Right now creating a 1-2kloc tool feels like what creating a 20 line script used to. It’s trivial for Claude/Codex/Gemini to build, it’ll work, and it’s quick. I’m starting to build these things without thinking.
Some examples:
The annual report tool (2.5kloc). A replacement for df (2kloc). A disk-space visualizer tool (3kloc). An Android Spanish flash-card app (6kloc). A Chrome plugin for my son to manage his tabs (0.7kloc). Three sheep-themed versions of Battleships (where the ships are replaced by sheep and the missiles by sheepdogs) to pass the time while waiting for the pantomime to start (~0.7kloc each).
These can often be built on the sofa or the bus or in the theatre using Claude Code on the Web.
Then there are the medium-sized things. These require a bit more work around the design, UX and testing. I’ve got a repo analyser because, well, managing 150 repos is hard (8kloc). A memory monitor, mdoomchk (8kloc). A skills manager (12kloc). An LLM council app where multiple LLMs discuss answers to a question before giving you their collective wisdom (13kloc). A Bluetooth file-sharing app (6kloc).
The medium-sized things take a bit more time to get right - mostly because I need to play with them a little to understand how to make the UX work well. And, being more complex, they often come with sharper corners that need elapsed time to find and smooth. The repo analyser, for example, decided to analyse all 150 repos in parallel and launched six hundred terminals; Windows didn’t cope well and promptly blue-screened…
And then there are the large things. These take weeks to build. But can be amazing. The most recent of those is my Xmas project: mddosem.
mddosem
Microsoft have always done backwards compatibility well. Years ago I remember looking at the Windows 95 source code. I was amazed at just how much effort had gone into ensuring that all the existing third party apps would continue to work properly on Windows 95, despite those apps doing many strange and unconventional things.
Microsoft did the same thing with Windows NT. The new (in the mid 1990s) 32-bit OS shipped with an emulator - Windows On Windows (or WOW for short) that enabled old 16-bit Windows apps to run. Under the covers this was technically impressive - 16-bit DOS/Windows was hard. Be glad if you don’t know about segmented memory or the difference between small, medium and huge memory models. WOW earnt its name - it worked well. Most people didn’t even realise the magic WOW was performing.
WOW relied on the x86 processor’s virtual 8086 mode, which let 16-bit code run in its own VM while the processor was in 32-bit protected mode. But 64-bit long mode dropped that support. So 64-bit Windows spelt the end of WOW.
But I’ve often wondered - could you write a 386 emulator and glue it together with Win API mapping code and recreate WOW for the modern era? Could I build something that would natively (ish) run my old 16-bit apps? I’m pretty sure the answer is yes. But I’ve never had the time. Or, more to the point, the talent. This stuff is hard.
Enter Claude, Codex and Gemini
Earlier this month it occurred to me that perhaps AI could build this emulator. So I started work on mddosem. It’s a 386 emulator, with DOS 6 and Windows 3.1 support built in. It supports DPMI, EMS and XMS. It has SoundBlaster, GUS and MT-32 audio. The goal is for it to run all the 16-bit apps I care about - whether DOS or Win16.
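The heart of any CPU emulator is a fetch-decode-execute loop. Here’s a minimal sketch of that shape for three real 8086 opcodes; the struct layout and the tiny opcode subset are illustrative, not mddosem’s actual design.

```rust
// A toy 8086-style core handling just MOV AL,imm8 / INC AL / HLT.
struct Cpu {
    al: u8,       // low byte of AX
    ip: usize,    // instruction pointer
    halted: bool,
    mem: Vec<u8>, // program loaded at address 0
}

impl Cpu {
    fn new(program: &[u8]) -> Self {
        Cpu { al: 0, ip: 0, halted: false, mem: program.to_vec() }
    }

    fn fetch(&mut self) -> u8 {
        let b = self.mem[self.ip];
        self.ip += 1;
        b
    }

    fn step(&mut self) {
        match self.fetch() {
            0xB0 => { let imm = self.fetch(); self.al = imm; } // MOV AL, imm8
            0xFE => { // INC r/m8 - only modrm 0xC0 (INC AL) handled here
                let modrm = self.fetch();
                if modrm == 0xC0 { self.al = self.al.wrapping_add(1); }
            }
            0xF4 => self.halted = true, // HLT
            op => panic!("unimplemented opcode {op:#04x}"),
        }
    }

    fn run(&mut self) {
        while !self.halted { self.step(); }
    }
}

fn main() {
    // MOV AL, 0x41; INC AL; HLT
    let mut cpu = Cpu::new(&[0xB0, 0x41, 0xFE, 0xC0, 0xF4]);
    cpu.run();
    println!("AL = {:#04x}", cpu.al); // prints: AL = 0x42
}
```

The real thing is this loop scaled up to the full 386 instruction set, plus segmentation, protected mode, interrupts and device emulation - which is where the 110kloc comes from.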
After a couple of weeks of intermittent progress, I gave it a bit more focus in the week before Xmas. And then, on Christmas day, this happened:
And then a little later this:
This is the emulator running Tetris (which dates from 1987) with EGA graphics. What a cool Xmas present :).
Then on Sunday I got this:
This is Geoff Crammond’s original Formula One Grand Prix game. The emulator crashes when it tries to run a race, but still… this is amazing progress.
And the emulator? It’s 110kloc of Rust plus 30kloc of assembler. But the impressive numbers are the tests. There are ~4 million of them. There are some comprehensive x86 test suites online - for testing 8086, 286 and 386 CPUs. mddosem runs all of them - and they all pass.
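Those single-step opcode suites generally follow the same pattern: each case specifies an initial CPU state, the instruction bytes, and the expected final state. A hedged sketch of what checking one case might look like - the field names and the stub `execute` are illustrative, not the suites’ actual format or mddosem’s core:

```rust
// A single-step opcode test case: run one instruction from a known
// state and compare the result against the expected state.
#[derive(Clone, Debug, PartialEq)]
struct RegState { ax: u16, flags: u16 }

struct TestCase {
    name: &'static str,
    initial: RegState,
    bytes: &'static [u8],
    expected: RegState,
}

// Stand-in for the real emulator core: executes one instruction.
// Only 0x40 (INC AX) is modelled, and only the zero flag bit.
fn execute(state: &RegState, bytes: &[u8]) -> RegState {
    match bytes {
        [0x40] => {
            let ax = state.ax.wrapping_add(1);
            let zf = if ax == 0 { 0x40 } else { 0 };
            RegState { ax, flags: zf }
        }
        _ => state.clone(),
    }
}

fn run_case(case: &TestCase) -> bool {
    execute(&case.initial, case.bytes) == case.expected
}

fn main() {
    let cases = [TestCase {
        name: "INC AX from 0x00FF",
        initial: RegState { ax: 0x00FF, flags: 0 },
        bytes: &[0x40],
        expected: RegState { ax: 0x0100, flags: 0 },
    }];
    let passed = cases.iter().filter(|c| run_case(c)).count();
    println!("{passed}/{} cases passed", cases.len()); // prints: 1/1 cases passed
}
```

Multiply that by every opcode, addressing mode and flag combination and you can see how the count reaches millions without anyone hand-writing them.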
But there’s more than just these static tests. There’s a built in fuzzer which runs overnight. Plus a growing set of real-world apps (the emulator has support for keyboard input and periodic screenshotting - so the coding agents can drive the emulator themselves).
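The screenshot checks boil down to something simple: hash the emulator’s framebuffer and compare it against a hash captured from the app running for real. A minimal sketch - FNV-1a is used here purely for illustration, not necessarily what the tooling uses:

```rust
// FNV-1a: a tiny, deterministic 64-bit hash - good enough for
// "is this framebuffer byte-identical to the reference?" checks.
fn fnv1a(data: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in data {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

fn screens_match(reference: &[u8], emulated: &[u8]) -> bool {
    fnv1a(reference) == fnv1a(emulated)
}

fn main() {
    // 80x25 text mode: one char byte + one attribute byte per cell.
    let reference = vec![0x07u8; 80 * 25 * 2];
    let emulated = reference.clone();
    println!("match = {}", screens_match(&reference, &emulated)); // prints: match = true
}
```

Exact-match hashing is brittle against anything timing-dependent (blinking cursors, animations), which is presumably part of why timing is "a pain".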
It’s strange how quickly those numbers seem normal. But they’re not. These are ridiculous numbers of tests. Some of the classic emulators (86Box, PCem) don’t have any automated testing.
There’s still a long way to go. Performance is, well, mixed. CheckIt thinks the CPU speed is 0MHz. In the CGA version of Tetris the blocks drop so fast it’s unplayable. Windows apps don’t run at all. Hangs are frequent. But this is so much fun.
And we’re making progress. Plus, if Opus 4.5 / Codex 5.2 / Gemini 3 can’t get it working then Opus 5 / Codex 6 / Gemini 4 will be able to.
And so?
My hope is I can modernize my favourite legacy apps. Take them - and mddosem - and build them into a single Rust exe that runs on modern systems. Then I can spend my time replaying Tetris, Formula One Grand Prix, The Secret of Monkey Island, Doom and FreeCell. Could I perhaps even add DirectX support and emulate a Voodoo card to run Colin McRae Rally?
Last year was amazing. The past few months utterly crazy. What will 2026 hold?
If I can get mddosem working then that would be a thing. There’s something cool about using the tools of the future to rescue the software of the past. WOW, you might say :).

4 million tests for 140kloc seems excessive - surely that's overfitting the tests such that the whole system is brittle? If so, any change will break a number of them, giving you (the human) more to review when the AI fixes those tests. Which leads into a wider question: I've been wondering whether AI making tests cheap leads to loads of low-value tests, and whether ending up with a large suite of low-value tests is desirable.
If we assume only AI will work on a repo, then what does an AI-first test suite look like? Do we assume humans just shouldn't look at the tests because they're so numerous, and only focus their review on the deliverable code? Or the opposite, and the production code is meaningless as long as the tests are good enough to prove it works?
What do you think?
That said, perhaps the numbers are inflated in this particular case because you've got loads of opcode combinations to iterate through, and the core test is high value?

I agree - 4 million is quite a few! But the majority (3.99 million+) are opcode tests from a mix of 8086/80286/80386 test suites. I've only built a few thousand new tests. But I can also automate a whole slew of things - I got Claude to build a suite of real-world apps which it then runs, captures screenshots of and hashes. Then Claude repeats the tests with each app in the emulator and checks the hashes. Sure, there are limitations - timing is a pain - but there's a lot of mileage you can cover cheaply.
As to testing in the AI world, my thought process goes like this:
- Human code review is too slow. IIRC Google reckons 200-400loc per hour, 4 hours per day. 120kloc would take ~ a year to review.
- Human code review isn't great. It's hard, and it's really easy to miss both subtle and big-picture things. While it was all we had it was worthwhile, but I'm not convinced of the value now.
AI makes testing cheap. There's a whole set of testing approaches and tooling around - many I've never used before because they were too expensive to use in commercial software - e.g. MC/DC aviation-grade testing, mutation testing, fuzzing. Plus there's Kani as well (a Rust verifier). I'm not an expert on any of these yet - but I'm sure I'll be getting to know them better in 2026 :). My instinct is that using these tools will result in better code than the traditional human code review plus UTs/FVs/ST. Of course, I'm assuming we move to writing everything in Rust - but that seems like a sensible move when AI is generating the code.
For safety-critical core pieces humans still have a role to play. But for the rest? It increasingly looks like a poor use of time. Maybe we'll find human code review becomes obsolete while quality improves through the use of these tools/frameworks?