Spring forward? Or fall behind...
As new releases bloom, some spring ahead while others stumble
New models continue to be delivered apace.
Saturday 5th April brought Llama 4 - Meta’s new open source model. Monday 14th brought GPT 4.1. Wednesday the 16th, we got o3 and o4-mini. DeepSeek R2 is expected imminently (possibly today). As ever, each model comes complete with amazing (but cherry-picked) benchmark results and lots of hype (cue claims that o3 is ‘AGI’ - which it clearly is not - but it’s a way to get attention).
Nonetheless some interesting patterns are starting to emerge.
Firstly, the performance of the non-reasoning models appears to be levelling off. In the real world, GPT4.5 isn’t noticeably better than GPT4o or the non-thinking versions of Claude. Yet GPT4.5 is much more computationally expensive and, as a result, much slower to respond. That slowness is part of the reason I rarely use it - the performance/latency trade-off just isn’t worth it. And OpenAI seem to have realised this too - they are planning to retire GPT4.5 in July and reuse the GPUs it requires for other models. That’s the first time a leading-edge model has been retired in favour of older ones - perhaps a sign that we’ve reached the limit of what can be achieved with base models, and that compute isn’t all we need?
Secondly, the gap is closing. There are many GPT4-class models now. And not just from the US - China has Qwen, Lshan, Hunyuan and, of course, DeepSeek. OpenAI might have a slight edge right now, but the initial hints are that o3/o4 are roughly comparable to Gemini 2.5. OpenAI no longer have a lead measured in years; for much of this year they have been behind Anthropic and Google.
We’re also seeing convergence between the leading companies. All now have reasoning models. All have multi-modal models. All have web search (well, at least if you are in the US). All either have or are working on voice mode. The gaps are shrinking. This is bad news for the smaller labs which don’t have the resources of the big folk like Google - it is becoming harder for them to differentiate. In the long run there will inevitably be consolidation. Will Apple or Amazon buy up Anthropic or OpenAI? For now that seems unlikely - AI labs have sky-high valuations. Take Ilya Sutskever (one of the original founders of OpenAI). His new company, Safe Super Intelligence, is 10 months old and still has no products. But it has a valuation of $32 billion. A nice trick if you can manage it…
But others are stumbling. Llama 4 has been a disaster for Meta. It has impressive benchmark results (e.g. it took top spot on the LM Arena benchmark), but has disappointed in real-world usage. The unofficial news coming out of Meta is that the team trained it on the benchmarks and cherry-picked results to meet internal targets. Seemingly an excellent example of Goodhart’s law: when a measure becomes a target, it ceases to be a good measure. There are also rumours of a massive AI team riven with political infighting. Compare that to the small, focussed DeepSeek team…
Worse, Llama 4 is too big to run on consumer-grade hardware (the smallest model requires 80GB of VRAM) - so while it is open-weights, it’s not practical for most people. For now the best open-weights models remain DeepSeek and Google’s Gemma.
With Gemma and Gemini, Google are slowly but surely moving into top spot in the AI race. And it’s not just LLMs where they are strong. Google have video generation (veo-2), image generation (imagen), research tools (NotebookLM), coding tools (Firebase), speech tools, music generation tools. They develop their own AI hardware - they are not dependent on Nvidia. They may not be the leader in each segment - yet - but they are well positioned. If they had a better marketing department - and didn’t have the baggage of being Google - they’d likely be doing even better.
And if the future of better models depends on access to data, then Google is sorted. They have access to email, docs, videos (Youtube), web data, maps. More than anyone else. Plus, with Android, Gmail and Google docs they are uniquely positioned to control access to AI.
But the question remains - are we headed for a world where AI is added to our existing tools? Or a more transformative one where AI enables us to build new tools and workflows? I suspect (and hope) the latter is the long-term destination. New tech invariably doesn’t just enable us to do the old ways faster - it creates new ways to do things.
In the real world
As Llama 4 has demonstrated, benchmarks can deceive. What really matters is whether these models can reliably solve everyday problems.
Let’s go back to my garden area calculation from earlier this week - a simple geometric problem to calculate the area of an irregularly shaped garden plot from an image. This seemingly straightforward task requires visual understanding, spatial reasoning, and mathematical problem-solving - capabilities that should be trivial for a genuinely "general" intelligence.
The problem is also unique - I doubt it is in the training data - not least because I suspect (hope) no one else would have been daft enough to try to build a structure in such an awkward-shaped area.
So let’s find out how some of the more recent models did…
Back to the garden
I’ll be honest. Overall the results disappointed. Only o3 & o4-mini got the correct answer. And none had as elegant a solution as Gemini.
Failures
GPT4.5: 52m², 35s
Llama 4 (Maverick): 10.51m²
Llama 4 (Scout): 5.1m²
DeepSeek R1: 49.05m², 85s
Grok 3 (thinking): 7.1m², 102s (it decided the shape was a triangle).
Successes
o3: 31.5m², 291s
o4-mini: 31.5m², 50s
o4-mini-high: 31.5m², 79s
o3
o3 got the correct answer. But the process revealed a lot.
First it was very slow. It took nearly 5 minutes to produce an answer.
For the first time, the thinking steps show which parts of the image the model is ‘looking’ at; first it rotated the shape (the original image was rotated by 90 degrees):
Then it zoomed in on key details, although sometimes these seemed, erm, surprising:
Then it thought the shape might be a pentagon:
And then it gave the correct answer.
But read that carefully and you’ll realise it’s wrong. It claims to label the right-angled corner ‘A’, but actually labels it ‘B’. Then it describes a right-angled triangle ‘ABD’ with legs ‘AB’ and ‘BC’ - where did ‘D’ go?
Sam Altman claimed that o3 is hallucination-free. I beg to differ. o3 may well be smart, but these are basic mistakes. It’s not AGI.
o4-mini
Both variants of o4-mini were correct and quickish - though still twice as slow as Gemini. Again you get to see the image-processing thinking, and the models are good at using tools - specifically code - to do the maths. Interestingly, they used a different approach: finding the location of the bottom-right corner as the intersection of two circles, then using the shoelace formula to work out the area. That might be obvious to you, but I learnt something new from watching the models!
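For anyone curious, here’s a minimal sketch of that approach in Python. The measurements below are purely illustrative - my actual garden dimensions aren’t reproduced here - and this isn’t the code the models wrote, just the two underlying techniques: locating an unknown corner as a circle-circle intersection (a point at known distances from two known corners), then feeding the ordered vertices to the shoelace formula.

```python
import math

def circle_intersections(c1, r1, c2, r2):
    """Return the two intersection points of circles centred at c1 and c2
    with radii r1 and r2 (assumes the circles do intersect)."""
    (x1, y1), (x2, y2) = c1, c2
    d = math.hypot(x2 - x1, y2 - y1)          # distance between centres
    a = (d**2 + r1**2 - r2**2) / (2 * d)      # distance from c1 to the chord
    h = math.sqrt(r1**2 - a**2)               # half-length of the chord
    mx = x1 + a * (x2 - x1) / d               # midpoint of the chord
    my = y1 + a * (y2 - y1) / d
    return [(mx + h * (y2 - y1) / d, my - h * (x2 - x1) / d),
            (mx - h * (y2 - y1) / d, my + h * (x2 - x1) / d)]

def shoelace_area(pts):
    """Area of a simple polygon from its vertices, taken in order."""
    n = len(pts)
    s = sum(pts[i][0] * pts[(i + 1) % n][1] - pts[(i + 1) % n][0] * pts[i][1]
            for i in range(n))
    return abs(s) / 2

# Illustrative example: an unknown corner measured as 5m from (0, 0)
# and 5m from (6, 0). Take the candidate on the side we expect (y > 0).
candidates = circle_intersections((0, 0), 5, (6, 0), 5)
corner = max(candidates, key=lambda p: p[1])   # (3.0, 4.0)

# Area of the resulting quadrilateral via the shoelace formula.
area = shoelace_area([(0, 0), (6, 0), corner, (0, 4)])
print(corner, area)
```

The circle intersection always yields two candidate points (one on each side of the line joining the centres), so you still need a sanity check - here, picking the one consistent with the rough shape of the plot - which is exactly the kind of judgement step where a model can go astray.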
And so?
The volume of model releases is remarkable. Perhaps even overwhelming. It’s amazing how far we’ve come in just a year. But the days of dramatic leaps are giving way to incremental improvements. And while AI can do remarkable things - especially if related examples are in the training data - it can still fail at things we humans find straightforward. It’s increasingly clear that AI is a different kind of intelligence.
The AI industry is maturing - from a race dominated by radical breakthroughs to one characterized by specialization and refinement. The winners won't necessarily be those with marginally better benchmarks, but those who can deliver consistent, reliable performance on real-world tasks. Who can seamlessly integrate useful tools into our daily lives.
And the competitive landscape is changing. OpenAI is no longer the untouchable leader. Google is quietly building an impressive set of tools, and smaller players like DeepSeek continue to punch above their weight.
Personally, I'm looking forward to getting to know o3 and o4-mini better - and adding them to my team. But the early signs suggest Gemini 2.5 still has an edge in raw technical horsepower, while Claude 3.7 remains my go-to for writing tasks. So, just as with humans, each model has its own strengths - and weaknesses.

I’ve found one of the best ways to get Claude 3.7 to hallucinate is to ask it impossible questions. Not deliberately - just by saying things like “I think this code is hitting an infinite loop somewhere, can you find it? If you can’t, just let me know”. It will then spend minutes trying to find it, before coming back with utter nonsense - often hallucinating that the code says something it doesn’t, or saying the fix is what the code already does. It’s still transformational as an assistant, but its limits are becoming more and more apparent. This week I have mostly been getting it to help me with 6502 assembly…