A good creator, but a bad editor
AI's struggle with connecting dots
When I was a kid I remember learning the maxim “Fire is a good servant, but a bad master”. I’m increasingly coming to think that a variation of this maxim applies to AI: “AI is a good creator, but a poor editor”. Why? Let me explain.
I’m learning Spanish. For some inexplicable reason Spanish has two forms of the verb “to be”. There’s ser. And then there’s estar. Ser is mostly used for permanent things (e.g. she is tall); estar for temporary things (e.g. he is tired). But there are exceptions. Quite a few it turns out.
Get it wrong and it can be embarrassing. “Él está aburrido” means “he is bored”. But use the wrong form - “él es aburrido” - and you change the meaning to “he is boring”. Plenty of potential for the embarrassing foot-in-mouth mistakes I’m all too familiar with…
So I decided to learn it properly.
Enter Claude
And what better way than to get Claude to create a web artefact to learn from? I could have my own little web app that could focus on the cases I struggle with. Isn’t this the modern dream? Software specifically built to solve my problems. Software that is custom tailored for me.
First I got Claude to write down a list of the various rules for ser and estar. Partly so I could check what they were. And partly because breaking a complex problem down invariably produces better results - it forces the model to spend tokens (aka time) on the aspects of the problem that matter to you. Plus it forces the model to spend more tokens in total - i.e. spend more time thinking, which should result in a better outcome.
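For illustration, the core of such a drill can be sketched in a few lines. This is a hypothetical Python sketch, not the actual artefact (which was a web app Claude generated), and the rule names and questions are just examples I've made up to show the shape of the thing: rules written down first as data, then a quiz loop built on top of them.

```python
# Hypothetical sketch of a ser/estar drill: rules as data, questions
# that each cite the rule they test, and feedback that names the rule.
QUESTIONS = [
    # (sentence with a blank, correct verb, the rule being tested)
    ("Él ___ aburrido. (He is bored)", "estar", "temporary states and moods"),
    ("Él ___ aburrido. (He is boring)", "ser", "inherent traits"),
    # A classic exception: location takes estar even when it's permanent.
    ("Madrid ___ en España.", "estar", "location of people and things"),
]

def check_answer(question_index: int, answer: str) -> str:
    """Return feedback for an answer, citing the rule involved."""
    _, correct, rule = QUESTIONS[question_index]
    if answer == correct:
        return f"Correct! Rule: {rule}."
    return f"Not quite - the answer is '{correct}'. Rule: {rule}."
```

Answering "estar" to the second question gets corrected to ser, with the "inherent traits" rule as the explanation - exactly the bored/boring trap described above.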
With Claude 3.7 you’ve got to be careful - it’s a bit like an over-eager puppy - the briefest mention of a web artefact has it salivating wildly as it churns out reams of code. So it’s also important to brief Claude that we’re designing - and NOT writing code just yet.
With the rules defined, and a rough design in hand, I got Claude to create the web artefact. And a few minutes later it was ready. As ever it looked nice…
…but it had some problems.
First, the feedback from the previous question was displayed under the current question.
Claude fixed that. But then it didn’t display any feedback at all. And so began the vibe coding iteration cycle many of us are familiar with - where you slowly inch towards a solution, often stuck in a one step forward, one step back loop. Worse, occasionally something unrelated will break without you noticing. Aarrgh.
And in the back of my mind I’m wondering: is it worth persevering? Maybe I should scrap this version and start again? I persisted for a bit and got something that was semi-workable - but once I finish this article I’m going to try again. The current artefact is here if you feel the need to practice ser and estar with a less-than-ideal tool.
It turns out this editing struggle isn't limited to code - it extends to any content that requires maintaining consistency over time or across a complex structure.
Other problems
I’ve written before about the problems AI models have with editing large content. I’ve got a 60,000 word AI generated novel I use for testing. When I last tried, Gemini 2.5 Pro, o3 and o4-mini hadn’t been released. So I was interested to see whether they could do better than their predecessors.
The first problem is the context window. o3 and o4-mini only have a 128k token window. And 60,000 words turns out to be too many to fit in that window. Fortunately you can upload the doc as an attachment. This appears to be a form of RAG, where the model stores the doc in a separate database and then attempts to pull relevant parts into the context window to answer the query. Unfortunately that doesn’t work well for summarizing a long story - sure, you can attempt to review a story by reading a few pages, but you’ll inevitably miss key parts. And so it turned out.
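This failure mode is easy to sketch. The toy below reflects my assumption about how attachment retrieval roughly works - real systems use embedding search rather than the keyword overlap used here to keep the sketch self-contained - and shows how little of a long document a summarization query can actually pull back into the context window:

```python
# Toy illustration of retrieval-style attachment handling: split the
# document into chunks, score each chunk against the query, and pass
# only the top few chunks to the model.

def chunk(text: str, size: int = 50) -> list[str]:
    """Split text into fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(chunks: list[str], query: str, k: int = 3) -> list[str]:
    """Score each chunk by word overlap with the query; return the top k.
    (A stand-in for embedding similarity search.)"""
    q = set(query.lower().split())
    scored = sorted(chunks, key=lambda c: len(q & set(c.lower().split())),
                    reverse=True)
    return scored[:k]

# A 60,000-word novel splits into 1,200 fifty-word chunks; a summary
# query only retrieves a handful of them.
novel = " ".join(f"word{i}" for i in range(60_000))
chunks = chunk(novel)
seen = retrieve(chunks, "summarize the whole story")
print(f"{len(seen)} of {len(chunks)} chunks reach the model")
# prints "3 of 1200 chunks reach the model"
```

Whatever the retrieval quality, a query like "summarize this novel" has no good chunks to match - the answer depends on all of them, which is exactly what this architecture can't deliver.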
o4-mini-high told me:
Interesting, not least because there are 12 chapters. Then o3 told me:
I had to dig to discover what o3 meant by the “Wednesday-evening prospectus paragraph”. And then it transpired that while the paragraph does appear in Chapter 1, it doesn’t ever appear again. o3 is merrily hallucinating content that isn’t there.
The conclusion? o3 and o4-mini-high are essentially useless for summarizing docs that exceed their context window.
Can Gemini 2.5 do any better?
Gemini 2.5 has a 1 million token context window. The raw text used about 90,000 tokens (so plenty of space left for discussions), and Gemini proved quite capable of summarizing the doc. Good. It provided a list of suggested improvements, so I asked it to implement some of them. There was a scene where two people are taking a break after a busy few hours (due to a colleague having called in sick). In the original text, the colleague arrives part way through their break saying “they were feeling better so had come to work”. Gemini decided to improve this by inserting a new scene between the start of the break and the colleague coming back to work. And what was this scene?
It was the ill colleague popping their head into the breakroom to offer them a cigarette. Yup. That’s right. The ill colleague makes a miraculous unexplained recovery, pops in, offers a cigarette and then leaves. Only to return in the very next paragraph to explain they are feeling better and have decided to come into work. It doesn’t make any sense. This error wouldn’t make it past even the most basic human editorial review. And this is from arguably the best model we have available right now…
Putting aside the question of how models like this are going to take over the world and replace all humans (!), these two examples reveal something interesting about models.
Models are good at creating. Their first drafts are invariably consistent and generally good quality. But when edits are needed, the models often get tied in knots - and make errors that seem surprisingly silly to us humans. They seem to lack:
The ability to create an overall summary of the code or text.
The ability to understand time’s arrow.
I now approach editing with caution, especially if the model is left to do it unsupervised. Each edit introduces some decay - and my job as a human is to spot that decay and minimize it. Sure, tools such as Claude Code can be left to make changes on their own, but the outputs can often be, err, interesting. For now, human oversight of editing is crucial.
AI is a good creator but a poor editor.
Are our metrics wrong?
Many people talk about an intelligence explosion. Or the rapidly decreasing cost of intelligence. But it's beginning to dawn on me that perhaps we're using the wrong metrics to measure intelligence. What if our benchmarks are skewed toward skills that AI is naturally good at, while missing what makes humans special?
For years we've defined intelligence as the ability of some humans to be better than others at things we collectively find difficult. Maths. Engaging in sesquipedalian language. Memorizing vast amounts of information. AI turns out to be great at those things. And so we consider it highly intelligent.
But there's also a set of things all of us humans can do. We understand the arrow of time. We understand the rules that keep stories consistent. We understand the laws of physics. We grasp that sick people don't offer cigarettes before announcing they're feeling better. We instinctively understand narrative coherence. We know that events happen in sequence, that causes precede effects.
Yet the evidence suggests this is something current models cannot do. And may never be able to do. Remember no one really understands how current models actually work, so it's next to impossible to predict whether these are soluble problems. For now at least, there are no obvious solutions on the horizon.
Perhaps we need a wider definition of intelligence? One that recognizes the unique set of abilities all humans have? One that enables us to properly measure the capabilities of the current models? Maybe the true test of intelligence isn't creating brilliance from scratch, but successfully improving something that already exists without breaking it. Being a good creator is necessary but not sufficient. To be truly intelligent AI also needs to be a good editor.