All you need is o3
With a little help from my AI friends
Last week was fun. Really fun. It’s up there in my list of all-time great weeks. Working with o3-mini-high is a lot of fun. We’ve built tons of code. There have been highs. And some lows. And now I’ve got a pretty good feel for what o3-mini is capable of - and what it can’t do.
So, what have I learnt?
o3-mini is great at one-shotting. Get the requirements right and it can generate over a kloc of code in one hit. I’ve been using C# and Rust - and it works well with both.
It’s fantastic at little Rust tools. Need a wrapper to streamline git? Need a tool to combine source files into a single text file? It pretty much always gets these right first time.
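To illustrate the kind of tool I mean, here’s a minimal sketch of a source combiner - std-only Rust, my own illustration rather than anything o3 produced. It recursively walks a directory and concatenates every `.rs` file into one output, with a path header before each file:

```rust
use std::fs;
use std::io::{self, Write};
use std::path::Path;

/// Recursively append every .rs file under `dir` to `out`,
/// prefixing each file's contents with its path as a header.
fn combine_sources(dir: &Path, out: &mut impl Write) -> io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_dir() {
            combine_sources(&path, out)?;
        } else if path.extension().map_or(false, |e| e == "rs") {
            writeln!(out, "// ===== {} =====", path.display())?;
            out.write_all(&fs::read(&path)?)?;
            writeln!(out)?;
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    // Directory to combine, defaulting to the current one.
    let dir = std::env::args().nth(1).unwrap_or_else(|| ".".into());
    let mut out = Vec::new();
    combine_sources(Path::new(&dir), &mut out)?;
    io::stdout().write_all(&out)
}
```

Thirty-odd lines, no crates - exactly the size of problem where a one-shot prompt reliably lands.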
It knows. It knows the best open source libraries to build on. It knows the right patterns to use. Whether to use a collection or a tuple or a list or whatever. Gone are the hours of exploring and experimenting to find the best framework/crate to use.
It’s lazy. By default the output and logs rarely look nice. I’m sure that’s a reflection of the training data - most of us don’t invest time in making the output look nice. But when it’s essentially free to get nice output - neatly formatted text, appropriate use of colour - it’d be great if o3 did that by default.
Claude is still useful - it’s good at explaining things, creating flow diagrams to explain the code, reviewing. And it even fixed a JSON parsing bug o3 struggled with.
o3 is also much less prone to random refactoring than previous models. >90% of the time it only changes what needs to be changed. But the trouble is the few percent where it succumbs to the temptation of an unnecessary refactor. One time it removed a bunch of working code and replaced it with comments that read “Omitted for brevity”. Err, no. I only discovered the missing code multiple iterations later, and retrofitting it was tedious. Since then I’ve kept a closer eye on what o3 is doing.
The limits
1kloc is about the current limit for code generation. Beyond that o3 struggles. And I know this because I've been experimenting with a more complex challenge - can I recreate Windiff, Microsoft's legendary file comparison tool from the 90s? At around 9kloc, it's a good test of where the boundaries lie (yeah, I know WinMerge is better, but it’s ~800kloc).
The test was simple - can anything generate Windiff in one shot? o1-pro came closest - generating plausible UI code and file handling. Except it forgot the diffing algorithm…
Sure, I could iterate with o3 to add the missing functionality. But we're clearly not in one-shot territory for applications of this size. At least not yet.
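For context, the piece o1-pro skipped - the actual diff - is a classic longest-common-subsequence problem. Here’s a minimal line-diff sketch (my illustration; real tools like Windiff use more sophisticated algorithms such as Myers diff):

```rust
/// Minimal line diff via longest common subsequence (LCS).
/// Tags each line ' ' (common), '-' (only in a), or '+' (only in b).
fn diff<'a>(a: &[&'a str], b: &[&'a str]) -> Vec<(char, &'a str)> {
    let (n, m) = (a.len(), b.len());
    // lcs[i][j] = length of the LCS of a[i..] and b[j..]
    let mut lcs = vec![vec![0usize; m + 1]; n + 1];
    for i in (0..n).rev() {
        for j in (0..m).rev() {
            lcs[i][j] = if a[i] == b[j] {
                lcs[i + 1][j + 1] + 1
            } else {
                lcs[i + 1][j].max(lcs[i][j + 1])
            };
        }
    }
    // Walk the table to emit the edit script.
    let (mut i, mut j, mut out) = (0, 0, Vec::new());
    while i < n && j < m {
        if a[i] == b[j] {
            out.push((' ', a[i])); i += 1; j += 1;
        } else if lcs[i + 1][j] >= lcs[i][j + 1] {
            out.push(('-', a[i])); i += 1;
        } else {
            out.push(('+', b[j])); j += 1;
        }
    }
    out.extend(a[i..].iter().map(|&l| ('-', l)));
    out.extend(b[j..].iter().map(|&l| ('+', l)));
    out
}

fn main() {
    let a = ["fn main() {", "    old();", "}"];
    let b = ["fn main() {", "    new();", "}"];
    for (tag, line) in diff(&a, &b) {
        println!("{tag} {line}");
    }
}
```

Forty lines - well inside one-shot range on its own. The gap is composing it with 9kloc of UI and file handling in a single pass.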
So what?
Current AI tools encourage decomposed architectures. If your product is split into multiple small components around the 1-2kloc mark then the current tools are fantastic. Microservices anyone?
But larger, more monolithic codebases remain challenging. For example, I can give Gemini-Thinking (1M token context window) the entire original Windiff codebase - that uses about 200k tokens. And it’s genuinely useful - Gemini found a buffer overflow bug along with a lot of questionable string handling.
But OpenSSL (850kloc) is well beyond Gemini’s capacity…
A human is still needed to figure out how to split the codebase into chunks that LLMs can work with effectively.
Back in the real world
A month or so ago I had a coding interview for one of the FAANGs. It was the classic Codility interview - you get a poorly specified problem and then have to manually write code to solve it. And this from one of the companies actively promoting AI - to write code - to the outside world.
It really brought home to me how far behind the curve the majority of people are. This type of interview test is pointless. An LLM can solve the problem in seconds. Sure - I can learn how to do this test well. But life is short. I don’t want to waste my limited learning hours developing or improving obsolete skills.
I discussed AI with the interviewer. They told me it was still important to be able to solve these problems. They were wrong. It’s not. There’s an infinite set of skills you can test for. You’ve got to test for the ones that matter. It’s like choosing a house builder based on their ability to convert clay into bricks.
We need to be testing people on their abilities to effectively use tools - the skills we need now are higher level. Clearly defining requirements. Understanding the limitations of the tools. An ability to keep up with the jagged frontier. Those are the skills we should be testing for.
Where does that take us?
The contrast between cutting-edge AI capabilities and current industry practices is stark - for pure coding work I’m at least 10x faster than I used to be. Overall I’m at least 2x more productive.
This disconnect can’t last. Those kinds of productivity gains can’t be ignored. And the tools are only going to get better from here. Six months ago, one-shotting more than a few hundred lines was next to impossible. Now, here we are at 1kloc. When does that become 10kloc? 100kloc?
Many of the large software companies will struggle. They have too much inertia, too much process, too many legal constraints to adapt rapidly. It’s ironic - they are the ones helping accelerate the AI race, yet they are poorly placed to take advantage of the changes themselves.
In forty years of coding I've seen many changes. The shift from assembly to high-level languages. The rise of open source. The adoption of cloud computing. Each felt revolutionary at the time. But this is different. This isn't just a better way to code - it's the end of coding as we know it.
The irony isn't lost on me. After decades of writing code, I've spent the past week building more than ever before. And it’s been fun - so much fun.
And the interview? It wasn't for me. I want to be somewhere that innovates. That rides the leading edge. That moves fast. And right now, that isn't big tech.


