Understanding codebases with AI
Can an LLM retrofit an HLD to a legacy codebase?
Let me tell you about a hospital I visited many years ago. Like many, it had evolved over decades. At the core was a Victorian building with high ceilings. Over the years other parts had been added. These had varying ceiling heights, meaning the floor levels didn’t match. Level three in the Victorian part was level four in the 1960s part. At some stage the hospital had consumed a nearby row of houses which were only accessible by going outside. The once elegant design had turned into a sprawling mess. Even if you paid close attention you were likely to get lost.
Legacy codebases are like that hospital. Over time the architecture starts to decay. Duplicate code appears. Code gets orphaned but not deleted.
Working on such a codebase is hard. Where do you start? How do you know whether the code you are reading is actually called? These codebases nearly always lack a decent architecture diagram - it is easy to get lost in the maze of code.
But maybe LLMs can provide a solution. If they can reason across an entire codebase maybe they can retrofit High Level Designs (HLD) and architecture diagrams? I’ve spent the past few days discovering what is possible…
Conclusions
There are details of my testing below. But if you’re short of time here’s the spoiler for the conclusions:
LLMs are capable of retrofitting HLDs, architecture diagrams and flow diagrams. Or, more accurately, Claude is. o1 is a reasonable second choice if you don’t have access to Claude. GPT4o and Gemini are, erm, hopeless.
They are limited to relatively small codebases at present. About 3kloc. If you’ve got a nicely encapsulated, decomposed product you could analyse each chunk one at a time. But LLMs aren’t (yet) capable of making sense of a 100kloc sprawling mess of code.
The HLDs are relatively basic - but they serve as a useful starting point.
Claude is the clear leader at the moment. From a usability perspective the others aren’t even close. The combination of artefacts and a fantastic model (even better in V3.5.1 form) is a winner. o1 and Gemini lack artefacts - they produce raw Mermaid or SVG or PlantUML files and you’ve got to find a tool to view them.
Net - if you can divide your codebase up, and your org allows you to use Claude, then it’s a potentially powerful tool for understanding legacy code. Pretty cool!
BlueSCSI
I wanted to start with a project that was big, but not too big. Enter BlueSCSI. If you’ve not heard of it, SCSI was the USB of the 90s. You could connect hard drives, CD-ROM drives, scanners - all using a nice chunky cable.
BlueSCSI is an open source project designed to emulate SCSI devices (hard drives, CD-ROMs, tapes) using microcontrollers. It’s really awesome. I’ve used it to bring various old computer systems back to life (a VAXstation, a SPARCstation and an HP712, since you didn’t ask).
It’s about 17kloc in size - a fairly small application - and would likely take a team of five experienced devs somewhere between 6 and 12 months to build from scratch.
GPT4o
I gave the GitHub code repository (a zip file) to GPT4o and asked it to:
Analyze the attached code and provide an overall HLD for the code. Include a high-level architecture diagram and a two page HLD for the project.
GPT4o thought for quite a while and then produced this architecture diagram:
Err. No.
The two page HLD turned out to be a single page with little detail. Maybe useful as a very basic intro, but it wasn’t an HLD.
Claude
The paid version of Claude supports Projects - in Claude’s words these are:
A dedicated workspace that maintains context across multiple conversations about a specific task or topic. The aim is to allow more organized and efficient interactions over time.
But Claude doesn’t support zip files. I could have uploaded all the files individually, but there are a lot of files so I used Claude to write a script to parse the zip file, find source files and concatenate them into a single text file.
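The script itself isn’t reproduced in this post, but a minimal sketch of that kind of zip-to-single-file concatenation looks like this (the extension list is my assumption - BlueSCSI is mostly C/C++):

```python
import zipfile
from pathlib import Path

# Extensions to treat as source code - an assumption for a C/C++ firmware project.
SOURCE_EXTS = {".c", ".cpp", ".h", ".hpp", ".ino"}

def concat_sources(zip_path: str, out_path: str) -> int:
    """Concatenate every source file in a zip into one text file.

    Each file is preceded by a header line naming it, so the LLM
    can tell where one file ends and the next begins.
    Returns the number of source files written.
    """
    count = 0
    with zipfile.ZipFile(zip_path) as zf, \
         open(out_path, "w", encoding="utf-8") as out:
        for name in zf.namelist():
            if Path(name).suffix.lower() in SOURCE_EXTS:
                out.write(f"\n// ===== {name} =====\n")
                out.write(zf.read(name).decode("utf-8", errors="replace"))
                count += 1
    return count
```

Non-source files (docs, images, build artefacts) are skipped entirely, which is most of the size saving.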
This worked well - I soon had a 729KB file that I uploaded to my new project. Only to discover the file was too big:
That implies 1% of knowledge capacity equals 729KB / 145 ≈ 5KB.
Back to Claude! I got it to improve the concatenation script to strip out all comments and less useful files. The output was now 417KB. That’s ~84% of the capacity - so this should work? Well, maybe:
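As a rough sketch of the comment stripping involved (my reconstruction, not the actual script - and deliberately naive, so it would mangle comment markers that appear inside string literals):

```python
import re

# Naive C/C++ comment removal - good enough for shrinking an LLM upload,
# not good enough for anything that must preserve exact semantics.
BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)  # /* ... */ across lines
LINE_COMMENT = re.compile(r"//[^\n]*")               # // to end of line

def strip_comments(source: str) -> str:
    """Remove block and line comments, then drop now-empty lines."""
    source = BLOCK_COMMENT.sub("", source)
    source = LINE_COMMENT.sub("", source)
    lines = (ln.rstrip() for ln in source.splitlines())
    return "\n".join(ln for ln in lines if ln.strip())
```

Dropping blank lines after comment removal matters more than you’d expect - heavily commented code leaves a lot of empty lines behind.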
But trying to ask any questions failed:
So another fail - BlueSCSI is too big for Claude Professional.
o1
o1 only supports text. So I tried to paste the 417KB file. But it was too large for o1:
Gemini
I had high hopes for Gemini. It has a 1 million token context window - could it handle BlueSCSI and make sense of it?
But it got off to a bad start. It doesn’t support ZIP files, so I pasted in the 729KB concatenated version of the codebase. Unfortunately that locked up the UI. So I attached the file instead. But that didn’t work either…
I couldn’t get past this error. I tried the smaller, 417KB, concatenated version but got the same result.
So back to the drawing board. I needed something smaller.
Estimation Whist
Enter Estimation Whist. I wrote this back in 1993 using Turbo Pascal for Windows V1.5. Strange as it might seem now, those were the days when compilers cost money. Lots of money. Microsoft Visual C cost $300 (~$650 today). Being a student there was no way I could afford it. So when I was given a free copy of Turbo Pascal, I just learnt to write in Pascal.
Claude
Estimation Whist is about 2.9kloc of Pascal. And conveniently it’s all in a single file of ~115KB. Claude was happy to analyze it and came up with this architecture diagram.
I rather like it. It is clear and simple. It shows how the classes/subsystems fit together. I briefly wondered if TChetBox was a mistake. But, no, that’s what the code says. I seem to have shortened Cheat to Chet. I’m sure there was a reason but I don’t remember. It’s probably an 8.3 thing. Or I couldn’t spell. Both are equally probable…
The HLD was useful in identifying the key classes:
But not so much at understanding how the pieces fit together. So I asked for a flow diagram.
This was surprisingly successful:
That looks good. And Claude was happy to annotate the diagram with the key classes/functions.
I count that as a success. If I wanted to make changes, I would have a better chance at putting them in the right place.
What about bugs?
I was pleased. Trump should never be greater than 4 (there are only 4 suits after all!), so I can live with my 30-years-younger self not handling edge cases. And I know the game plays randomly (although, surprisingly, it is quite an effective strategy).
Then I asked Claude how long it would have taken to write it.
Based on the code complexity and scope, I estimate this would have taken approximately 9-13 weeks (2.5-3 months) for an experienced Pascal/Windows developer.
The code is well-structured and documented, suggesting careful planning and professional development practices, which would have extended the timeline but improved maintainability.
I’m liking the compliments.
As an aside, a 2.9kloc file used 21% of Claude’s knowledge capacity. Assuming a 200k token context window, that says ~14.5 tokens per loc. Which feels plausible.
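Spelling the arithmetic out (the 200k-token window is the assumption stated above):

```python
# Back-of-envelope check on context usage, using the numbers from this post.
context_window = 200_000   # tokens - assumed size of Claude's context window
fraction_used = 0.21       # 21% of knowledge capacity reported by Claude
loc = 2_900                # lines of code in the Estimation Whist source

tokens_used = context_window * fraction_used   # 42,000 tokens
tokens_per_loc = tokens_used / loc
print(f"{tokens_per_loc:.1f} tokens per line of code")  # prints "14.5 tokens per line of code"
```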
o1
o1 thought for 12 seconds and produced this architecture diagram. But it’s really a class hierarchy diagram. Useful, but not what I asked for.
The flow diagram was less impressive:
The ‘HLD’ was again a list of classes. It addressed what the code was trying to do, but was light on why and how. And it didn’t have anything to say about performance or maintainability.
Gemini
The HLD Gemini produced implied more structure than the code actually has - it talked of separate game logic, UI and main window logic. In reality those are all implemented in the same, rather large, class.
The architecture and flow diagrams looked like this.
Both are pretty simple. And Gemini produces output as raw Mermaid or PlantUML - so you need to copy and paste it into an external viewer to see the result. Claude’s inline artefacts are a much nicer experience.
I asked Gemini for a more detailed flow diagram. And got this.
Yeah, but no. Explicitly showing that four players have a turn wasn’t what I had in mind. And, strictly speaking, the number of players is configurable (although four is the default).
I asked Gemini how long the code would have taken to write.
Gemini feels a bit simplistic. There are 4068 lines in the source file, not 8000. Compare the rest of the output to Claude’s. Claude is just… better.
GPT4o
For completeness I included GPT4o. I’ll be honest - I didn’t have high hopes based on the BlueSCSI experience. Here’s the architecture diagram.
It’s pretty. But it’s not really an architecture diagram. A fail.
The HLD was more of an intro to the code. In fairness it would make a useful starting point, but it needs more detail to be an HLD.
And GPT4o optimistically ended with:
This HLD document, paired with the architecture diagram, provides a comprehensive high-level view of the card game’s structure and functionality.
I respectfully disagree.
Conclusions
Conclusions are at the top of this post. Claude was the clear winner. And things are only going to get better from here. Maybe that sinking feeling of confusion when you have to make changes to an unfamiliar codebase will soon be a thing of the past? We can but hope…
Postscript
As I nearly always do these days I asked Claude to review this post. And as ever it had some useful feedback which I incorporated, and some I chose to ignore. But I was intrigued to find it asking me questions:
Those are good questions. FWIW the answers are:
yes
experience
no, good question, something to try next time
But, more importantly, it’s interesting to have a model asking decent questions. Up to now models haven’t asked enough questions to clarify human prompts. Could this mark the start of an evolution where models want to understand before responding?