The 584 million year bug
Rough edges, hung tests, and coding while you sleep
Every relationship has a honeymoon period. A time when everything seems amazing. Everything is possible. But, over time, you start to take the strengths for granted. And then you notice the rough edges. The clothes on the floor. The dirty dishes in the sink. The sour milk in the fridge.
Reality bites
Something similar happens with coding agents. Take Codex. I’ve been getting it to sort through a unit test suite and fix problems. But some of the tests hang. And Codex doesn’t notice. It’ll happily sit for hours doing nothing while the test hangs. You can break in and tell it that the test seems hung. Oftentimes Codex will deny the test has hung. And it will almost invariably try to re-run the test that hung straight away. And get stuck. Again.
The solution is to be firmer with Codex. And teach it about the timeout command (strangely while Codex is a dab hand with many obscure Unix commands, it doesn’t naturally turn to timeout).
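On the command line, GNU timeout takes a duration and then the command to run, and exits with status 124 if it had to kill the command. The same guard can be sketched in Python. This is a minimal illustration rather than anything Codex produced: the function name run_with_timeout is made up here, and returning 124 on a timeout is just a choice to mirror GNU timeout.

```python
import subprocess
import sys


def run_with_timeout(cmd, seconds):
    """Run cmd and return (exit_code, timed_out).

    subprocess.run kills the child process when the timeout expires and
    then raises TimeoutExpired, so a hung test cannot block forever.
    """
    try:
        completed = subprocess.run(cmd, timeout=seconds)
        return completed.returncode, False
    except subprocess.TimeoutExpired:
        # 124 mirrors the exit status GNU timeout uses for "timed out".
        return 124, True


# Example: a stand-in "test" that sleeps far longer than the budget.
hung_test = [sys.executable, "-c", "import time; time.sleep(30)"]
exit_code, timed_out = run_with_timeout(hung_test, seconds=1)
print(f"exit_code={exit_code} timed_out={timed_out}")
```

The point is that a hang becomes an ordinary failure with a recognisable exit code, which an agent can react to instead of waiting for hours.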
Then there are the bizarre issues Claude found reviewing an HLD (high-level design):
Yeah, sure. I’m not going to worry about a bug that won’t manifest for 584 million years. But lest you think Claude is scraping the barrel, in the next few passes it uncovered three critical defects. This is not your typical reviewer.
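For a sense of where a number like that can come from: an unsigned 64-bit counter of milliseconds wraps after roughly 584 million years. That is an assumption for illustration only, and not necessarily the defect Claude flagged. The arithmetic, using a 365.25-day year:

```python
# Ticks in one year if the counter advances once per millisecond.
MS_PER_YEAR = 1000 * 60 * 60 * 24 * 365.25


def years_until_wrap(bits=64, ticks_per_year=MS_PER_YEAR):
    """Years before an unsigned counter of the given width overflows."""
    return 2 ** bits / ticks_per_year


print(f"{years_until_wrap() / 1e6:.1f} million years")
```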
There are the frustrations with context windows. The latest 0.46 update to Codex is bugged and the "/compact" command frequently fails to free context. Which leaves you with a fiddly challenge of manually getting context into a new session. But, a week from now, when I’m running 0.50, this will be a forgotten irritation.
And then there are the struggles. I’ve been working on building a SIP/RTP mixer with Codex. The inbound audio path works well - jitter buffers work properly and the audio streams reliably. The mixer works and a steady stream of 20ms packets pops out the end of it. But after that things go wonky with large jumps in RTP timestamps and distorted audio. Codex has really struggled to diagnose the problem - and some of the proposed solutions are, well, not great. Adding more threads and queues between the mixer and the RTP sender to pace the audio? I don’t think so - we’ve already got a nice steady stream of packets. Re-queueing them isn’t the solution.
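One cheap check for this kind of problem is to look at the difference between consecutive RTP timestamps on the outbound stream. With 20ms packets and an 8 kHz RTP clock (an assumption here, typical of G.711; the codec isn't stated above), each packet should advance the timestamp by exactly 160. The sketch below flags anything else. The function name and sample data are hypothetical, and this is a diagnostic aid, not a diagnosis of the mixer.

```python
def find_timestamp_jumps(timestamps, clock_rate=8000, ptime_ms=20):
    """Return (index, delta) for every packet whose RTP timestamp did not
    advance by exactly one packet's worth of samples.

    RTP timestamps are 32-bit and wrap around, so deltas are taken
    modulo 2**32.
    """
    expected = clock_rate * ptime_ms // 1000  # 160 for 8 kHz / 20 ms
    jumps = []
    for i in range(1, len(timestamps)):
        delta = (timestamps[i] - timestamps[i - 1]) % 2 ** 32
        if delta != expected:
            jumps.append((i, delta))
    return jumps


# Made-up capture: the fourth packet jumps ahead by 800 instead of 160.
captured = [1000, 1160, 1320, 2120, 2280]
print(find_timestamp_jumps(captured))
```

If the packets leave at a steady 20ms cadence but the deltas are uneven, the timestamps are worth a look before anything is done to the pacing.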
But…
So am I falling out of love with Codex and chums? Not a bit. They may have rough edges. I may frequently stumble. But they are still amazing tools. I’ve been able to create ~75kloc in the past week. That seems bizarre to write.
And, in reality, a big part of the problem is me. Anyone using these tools is still learning how to use them. There are no paved paths. And things are moving fast.
I’m being forced to think differently. Codex may be less smart than a human. But it is more persistent. It can replicate itself and run 24x7. That opens new opportunities.
Swarm computing: A swarm of Codexs can review all my codebases every night. That’s not something that was ever possible with humans. The swarm might not find all the issues the first night. But running it every night increases the chances of finding the lurking issues. Swarm computing, anyone?
Exhaustive iteration: Codex can repeatedly review docs until it doesn’t find any new issues. OK, maybe some of the issues it finds are, err, niche. But I can filter them out. Give all the issues to another instance to check that they are genuine.
Sleep coding: I can do something else while Codex works. Eat. Sleep. Learn plumbing. Coding is no longer something that only happens when I’m awake and engaged. It can happen all day, every day.
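The exhaustive iteration idea boils down to a loop that stops when a pass turns up nothing new, followed by a filter. Here is a minimal sketch. The review and verify callables are hypothetical stand-ins for whatever actually invokes an agent, and the fake reviewer exists only to make the example runnable.

```python
def review_until_stable(review, doc, max_passes=10):
    """Call review(doc) repeatedly until a pass reports no new issues.

    Returns (all_issues, passes_run). max_passes is a safety cap so a
    reviewer that always invents something new cannot loop forever.
    """
    seen = set()
    for pass_no in range(1, max_passes + 1):
        new_issues = set(review(doc)) - seen
        if not new_issues:
            return seen, pass_no
        seen |= new_issues
    return seen, max_passes


def filter_genuine(issues, verify):
    """Keep only the issues a second, independent checker confirms."""
    return {issue for issue in issues if verify(issue)}


def make_fake_reviewer(batches):
    """Stand-in reviewer: returns one canned batch of issues per call."""
    remaining = iter(batches)
    return lambda doc: next(remaining, [])


reviewer = make_fake_reviewer([
    ["race in shutdown", "counter wraps"],
    ["counter wraps", "missing retry"],
    ["missing retry"],
])
issues, passes = review_until_stable(reviewer, "design.md")
genuine = filter_genuine(issues, lambda issue: issue != "counter wraps")
print(passes, sorted(genuine))
```

Run nightly across several codebases, the same loop is most of what the swarm idea needs.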
What’s more, Codex and Claude Code are starting to be able to learn. I can teach them what matters to me in software engineering by configuring my config.toml and claude.md files. That’s the basics. Then Claude gained 'skills' a few weeks ago - the ability to 'teach' Claude new skills. I’ve started experimenting with getting Claude to read software engineering books, work out what it’s learnt and turn that into a skill. Whether this makes Claude a better engineer remains to be seen - but you can imagine using this to make Claude a SIP/RTP expert (which might help with my mixer problems).
Claude has agents - get the main Claude session to act as the 'manager' and get the agents to do the actual work - design, implementation, review. Currently my experiments haven’t been overly successful (think frustrated, confused manager and recalcitrant agents). But the tools will get better. And so will our understanding of how to use them.
And so?
All relationships have ups and downs. But it’s important not to lose sight of the strengths. And important to ask what part you play in any of the frustrations.
These tools remain incredible. In the past month I’ve created ~250kloc of code, with 1500 unit tests and 50+ E2E tests. I periodically wonder if I’m dreaming but, no, I don’t seem to be. It’s a different world. And what will this new world look like? Swarms of skilled-up coding agents generating code day and night? Ever-growing test suites? Zero-length defect backlogs? I’m not sure, but I am sure it’s going to look very different from the world of just a few years ago.



