The Lazarus effect
How book editing exposes the flaws in AGI predictions
The hype machine from the AI labs remains at full speed. If you listen to senior figures from the labs, you’d think AGI (Artificial General Intelligence) or ASI (Artificial Super Intelligence) was just around the corner. That’s certainly what they are saying - OpenAI suggests it will arrive imminently, and Anthropic says perhaps as soon as 2026.
And that’s scary. Creating something that is more intelligent than us - something that could replace us - is troubling. There aren’t many good examples of a less intelligent species controlling a more intelligent one.
But for all the amazing benchmark results, the remarkable vibe-coded games, the way AI helps me every day, it still has some massive shortcomings. Shortcomings that are shared by all the models, and which appear fundamental and unlikely to be easily resolved.
What I’ve found is that these problems only show up when you are pushing the models towards their limits. They are like the class of software bugs that only appear after you’ve been running your product for a month - the long-running memory leaks, the weird stability issues. There is scant evidence the labs are testing for these problems - the focus is firmly on short-duration benchmarks. But they are exactly the kind of thing that matters for superintelligence - a superintelligence must be able to run for years and years.
Should we be scared?
Recently I’ve been exploring editing a book with Claude. Why? Well, partly because a book is something that models should be able to handle well. It’s text, after all. And partly because it’s much easier to notice a model’s flaws in a book than it is in code or other formats.
The book is ~65,000 words long - much longer than the typical documents people use LLMs to handle, and, it seems, much, much longer than any of the documents used by the AI labs in their benchmarks. But it’s not beyond the context window of a model. And definitely not beyond the scope of a human.
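For a rough sense of scale, here’s the back-of-the-envelope sum - a sketch assuming the common rule of thumb of roughly 1.3 tokens per English word and a 200,000-token context window (Claude’s advertised figure; some other models offer more, some less):

```python
# Back-of-the-envelope: does a 65,000-word manuscript fit in a model's context window?
# The 1.3 tokens-per-word figure is a rough heuristic, not an exact tokeniser count.
WORDS = 65_000
TOKENS_PER_WORD = 1.3
CONTEXT_WINDOW = 200_000  # assumption: Claude's advertised context window

estimated_tokens = int(WORDS * TOKENS_PER_WORD)
print(f"~{estimated_tokens:,} tokens - about {estimated_tokens / CONTEXT_WINDOW:.0%} "
      f"of a {CONTEXT_WINDOW:,}-token window")
# ~84,500 tokens - about 42% of a 200,000-token window
```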
It’s been interesting. The quick summary? I’m not worried about imminent AGI or ASI.
The problems
It quickly became apparent how challenging this task was for Claude. And Gemini. And GPT-4.5. Lots of things went wrong.
Narrative consistency: Claude had a poor grasp of the timeline. When I asked it to create new paragraphs to insert in the book, or modify existing ones, it’d produce superficially nice-looking text, except…
The day after Tuesday would be Sunday. People would reflect on events that had yet to happen. Characters managed to be in multiple places at once. Dead characters would stage miraculous recoveries and reappear as if nothing had happened. If I hadn’t been there it would have rapidly become a hopeless jumble.
Throwing it all away: One time I asked Claude to rewrite a section in the middle of chapter seven. Rather than confine itself to just the affected section, Claude decided to give me an updated chapter seven that I could drop in. Nice, except it was broken. Claude had copied the first sections of chapter seven, then added the new section. All good so far. But then it threw away the rest of chapter seven, all of chapter eight and most of chapter nine. It tagged the original end of chapter nine onto the new chapter seven. The story made considerably less sense after that, err, editing.
Chapter counting: It also became apparent that Claude can’t count. It would confidently assert that a chunk of text was in chapter 2 when it was actually in chapter 8. More often than not it got chapter numbers wrong.
Remembering the gist, not actuality: On many occasions I found that Claude had remembered the gist of a conversation rather than the exact words. Here’s an example - this is the original text.
And here’s how Claude presented it when suggesting an edit to some text after this snippet.
I found myself wondering if this is why AI often seems to rewrite code. Presumably when it is regenerating code a similar thing is happening, where the model remembers the gist of the logic rather than the precise original implementation?
Temporal feedback inconsistency: Claude would offer detailed feedback, such as this interesting (and valid) note on exposition issues.
Claude happily gave me updated text which I merged in. But the next time it reviewed the novel, it complained about too much exposition!
Of course it’s not just AI that does that - as humans we’re quite prone to do the same.
Hallucinations: There were many of these. Text that didn’t exist but which sounded plausible. Or feedback on parts of the text that didn’t exist. For example, I asked Gemini to check for any Americanisms that might have crept in. I got a long list.
Of these, only “okay” and “gotten” appeared in the original text. And okay is, well, okay - as a Brit I wouldn’t consider it an Americanism. Gotten is a fair catch. But the others? All made up.
I tried to work out why Gemini had got this wrong. For example, I initially thought “rom-com” might have crept in because the text says “from competitor”. But then I realised it came from an earlier response from Gemini where it was summarising a previous chapter.
And then it became clear - all the apparently hallucinated Americanisms came from an earlier response by Gemini. Once again the model’s inability to accurately select the relevant context came into play. I was clearly only asking about the original text I’d provided - not any of the subsequent answers Gemini gave.
So ASI then?
These limitations aren't just minor inconveniences - they reveal fundamental flaws in how today's AI models process and maintain information over time. While the labs focus on short-term benchmarks that make for impressive headlines, the reality of working with AI on complex, long-form content exposes far more mundane but critical shortcomings.
Now maybe spending more thinking tokens, or scaling the models, or wrapping Claude in some sort of manager model might help. And it’s conceivable I could have helped by chopping the manuscript up into smaller chunks, or by getting Claude to summarise the key timeline first. But I’m sceptical. These problems feel pretty fundamental - they reveal the limitations of the current technology. And I don’t see any evidence that the labs are aware of these problems, never mind trying to address them.
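For completeness, this is the sort of chunk-and-summarise workaround I have in mind - a minimal sketch in Python, assuming the manuscript sits in one text file with “Chapter N” headings and that you have the Anthropic SDK and an API key to hand. The splitting regex, the prompt, and the model name are my assumptions, not anything the labs prescribe.

```python
import re

import anthropic  # assumes the official Anthropic Python SDK is installed

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("manuscript.txt", encoding="utf-8") as f:
    manuscript = f.read()

# Split on "Chapter N" headings - a simplifying assumption about how the
# manuscript is laid out; a real manuscript may need a smarter splitter.
chapters = re.split(r"(?m)^Chapter \d+\s*$", manuscript)[1:]

# Ask for a per-chapter timeline summary that later edits can be checked against.
timeline = []
for number, text in enumerate(chapters, start=1):
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # model name is an assumption; substitute your own
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": (
                f"Summarise the timeline of chapter {number}: which days it covers, "
                f"where each character is, and who is alive at the end.\n\n{text}"
            ),
        }],
    )
    timeline.append(f"Chapter {number}: {response.content[0].text}")

# The condensed timeline can then be pasted ahead of any editing request,
# rather than relying on the model to keep the whole book straight.
print("\n\n".join(timeline))
```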
The gap between performing well on carefully curated benchmarks and maintaining coherence across a 65,000-word manuscript isn’t just a matter of degree - it’s a difference in kind. It suggests that the path to AGI isn’t a smooth continuum where we just need “more of the same.”
So while AI continues to impress and assist us in countless ways, my experience editing this book has given me a healthy perspective on the AGI hype cycle. The next time you hear bold predictions about superintelligence arriving by 2026, remember that today's most advanced AI models still can't reliably track whether a character is dead or alive across a few chapters of text.
That doesn't diminish what these tools can do—they remain remarkable achievements—but it does suggest we have time to thoughtfully consider how to integrate AI into our world before it surpasses human capabilities in any meaningful sense.