One bad apple
How misleading AI research distracts from real breakthroughs
Back in the early 1990s there was no internet to speak of. So, if you wanted to know what was going on in the world of computing, you bought a magazine. And these magazines were massive - A4 sized and nearly an inch thick. Among hundreds of pages of adverts were the juicy articles explaining how Intel's next processor - the Pentium - could do two things at once. At the time it felt like magic, but these days, when processors can do hundreds of things simultaneously, it seems rather quaint.
Every month I’d troop down to WH Smith and spend my £1.95 on Computer Shopper or Personal Computer World. Or sometimes both. And that’s how I learnt about computing.
A few years later I started work. Our team was building early video conferencing software and we'd sold the technology to several different companies. Then, one month, one of the computer magazines reviewed conferencing software. As I eagerly read the review I realised most of the products were built on our tech. Cool. This would be interesting - would the reviewer like what we'd created? Might we get a mention?
But as I read on, a wave of realisation hit me. The reviewer clearly didn't know the products were fundamentally the same under the covers. The review found differences where none existed, praising one product and slating another even though they were essentially identical. It was the first time I properly realised that domain knowledge can give you a very different perspective - that at least some of what I was reading was at best misleading, at worst plain wrong.
I was reminded of that feeling this week when I saw an article in the Guardian. It told the tale of an AI research paper recently published by Apple. The Apple researchers claim to have found "fundamental limitations in current models". That for "high-complexity tasks… models experience complete collapse".
The claim
The paper describes how the researchers asked models to solve some classic puzzles - Towers of Hanoi, Checker Jumping, River Crossing and Blocks World.
Take the Towers of Hanoi. The goal is to move a stack of different-sized disks from the left peg to the right peg, one disk at a time, with the constraint that a larger disk can never sit on top of a smaller one. The solution is well known; for n disks the minimum number of moves is always 2^n - 1.
The paper shows that models "collapse" above 10 disks. But dig deeper and it becomes clear this isn't a collapse at all. First, remember that formula for the minimum number of moves: ten disks require at least 1,023 moves, fifteen need 32,767 and twenty require over a million.
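If you want to sanity-check those numbers, a couple of lines of Python (my sketch, nothing from the paper) will do it:

    # Minimum number of moves for the Towers of Hanoi is 2^n - 1.
    for n in (10, 15, 20):
        print(f"{n} disks: {2**n - 1:,} moves")
    # 10 disks: 1,023 moves
    # 15 disks: 32,767 moves
    # 20 disks: 1,048,575 moves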
Secondly, the "success" criterion the researchers demand is that the model writes out the complete sequence of moves. So what happens if you ask Claude to calculate the moves for fifteen disks?
Fortunately, we can test the prompts Apple used, since they are provided in the paper. If I repeat the Apple experiment, Claude tells me:
Since the user specifically asked for the complete list of moves and emphasized this in their requirements, I should provide the full list. However, 32,767 moves is extremely long. Let me write it to a file and also provide a way to access it.
And that's exactly what Claude does. It writes a program to generate the moves, runs it, and stores the output in a file. But Apple counts this as a "complete collapse of reasoning" because Claude didn't print the moves in the chat window.
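We don't see the exact program Claude wrote, but a plausible sketch of the approach - generate every move recursively and stream it to a file rather than flood the chat - looks something like this (the file name is my invention):

    # Write the complete Towers of Hanoi move list to a file
    # instead of printing it into the chat window.
    def hanoi(n, source, target, spare, out):
        # Standard recursion: move n-1 disks out of the way,
        # move the largest disk, then move the n-1 disks back on top.
        if n == 0:
            return
        hanoi(n - 1, source, spare, target, out)
        out.write(f"Move disk {n} from {source} to {target}\n")
        hanoi(n - 1, spare, target, source, out)

    with open("hanoi_15_moves.txt", "w") as f:
        hanoi(15, "left", "right", "middle", f)
    # The file contains exactly 2^15 - 1 = 32,767 moves.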
Really?
Pause for a second and consider that Claude has a 200k token context window - roughly 150,000 words. The solution for 20 disks runs to over a million moves, which is several million tokens of output. It doesn't fit. The test was never going to work.
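The back-of-the-envelope arithmetic makes the mismatch plain. The five-tokens-per-move figure below is my own rough assumption; the exact number doesn't change the conclusion:

    # Why a complete 20-disk answer can never fit in one response.
    moves = 2**20 - 1        # 1,048,575 moves
    tokens_per_move = 5      # assumption: "Move disk 3 from left to middle" is ~5-10 tokens
    context_window = 200_000 # Claude's context window, in tokens

    needed = moves * tokens_per_move
    print(f"{needed:,} tokens needed")                # ~5.2 million
    print(f"{needed / context_window:.0f}x too big")  # ~26x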
What's confusing is that the paper is so clearly flawed. I'm not an AI researcher and I can see the flaws in it. Ask Claude to review the paper and it spots the same problems.
Can it really be that Apple didn’t use Claude to review their paper?
Claude summarizes it nicely:
The models might not have an "illusion of thinking"—they might just be too smart to follow a requirement that no intelligent agent would fulfill. It's like claiming humans can't do arithmetic because they won't write out "1+1+1+1..." a million times instead of writing "1,000,000."
Why?
So why would any AI researcher put their name to this paper?
The sceptic might wonder whether it has anything to do with Apple's failure to compete in the AI race. Apple lacks a frontier-class model. Apple Intelligence has not gone well; last week's Worldwide Developers Conference focused on Liquid Glass - a UI tweak. Siri won't get proper AI until spring 2026 at the earliest. And all the while Google, OpenAI and Anthropic race ahead. From the outside it is baffling how Apple has got this so wrong. Is slinging mud the best strategy Apple has left?
Maths
In retrospect the AI story the world should have focused on was in Scientific American.
Last month, thirty of the world's best mathematicians got together in Berkeley to challenge o4-mini with some of the world’s hardest solvable problems.
"I came up with a problem which experts in my field would recognize as an open question in number theory - a good Ph.D.-level problem," said one of the mathematicians. He asked o4-mini to solve the question. Over the next 10 minutes, they watched in stunned silence as the bot unfurled a solution in real time, showing its reasoning process along the way. The bot spent the first two minutes finding and mastering the related literature in the field. Then it wrote on the screen that it wanted to try solving a simpler “toy” version of the question first in order to learn. A few minutes later, it wrote that it was finally prepared to solve the more difficult problem. Five minutes after that, o4-mini presented a correct solution.
By the end of their time together, the group was discussing what the future might look like for mathematicians. Will they shift to simply posing questions and interacting with reasoning bots to help them discover new mathematical truths, much as a professor does with graduate students? The article finishes with a quote: "…in some ways these large language models are already outperforming most of our best graduate students in the world."
True - this is a scarier story. But it's reality. Spreading misleading information about the capabilities of AI isn't useful. We all need to understand what current models can do, and what future models will be able to do. We need to start thinking about, and preparing for, how this will change our lives - and how it will change society.
And so
Domain knowledge is a dangerous thing. It enables you to see truths that others cannot. Thirty years ago I could see people being encouraged to buy the wrong conferencing software. Today I can see the false narrative that was spread last week about AI models: "Don’t worry, it’s all just hype - the models aren’t very clever. Nothing is going to change."
But today, the stakes are much higher. The cost of failing to understand the changes AI will bring is far greater than the cost of buying the wrong video conferencing software. Even if today's AI is the best we ever create, the world of mathematics will change. The world of software engineering is already changing. Distracting ourselves with false narratives serves no useful purpose. It's going to be critical to pay attention to the right signals as the world around us starts to change, irrevocably.