The state of AI
Storm clouds gather over Silicon Valley's biggest bet
Last Saturday ChatGPT turned two. Launched in November 2022, the GPT-3.5-powered platform became the fastest-growing consumer software application in history, passing 100 million users by January 2023. Arguably it was ChatGPT that kick-started the current AI race - Gemini, Claude, Llama and Grok all owe their existence to it.
Then, in March 2023, OpenAI released GPT-4. It was larger in every way - more data, more parameters, more training compute - and it further validated the “scale is all you need” theory. The race to build data centers and fill them with GPUs was truly on.
Shortly after, Microsoft, which had first partnered with OpenAI back in 2019, released Copilot, and millions of Windows users got their first taste of AI.
Since then there has been a rush to integrate AI everywhere: Apple Intelligence, Gemini across Google’s products, eBay, Duolingo… Hundreds of new models have been built - not just chat models, but image, video and audio models too. Large, general models. Small, specialized models. Open-source models. Highly censored models. Loosely censored models. Hundreds of billions of dollars have been invested. The sun shone brightly.
Fast forward to today. Where are we now?
Has progress slowed?
For most of this year the labs continued to believe scale was all that was needed to keep improving the models: train on more data for longer and the models would get smarter and smarter. It was the recipe OpenAI had followed successfully since GPT-1 in 2018. Some folk were beginning to worry we’d run out of good-quality data, but the general consensus was there was enough left for several more cranks of the handle.
But the clouds are gathering. It’s becoming clear things aren’t panning out as expected. The news leaking out of the large labs is that the latest models are proving disappointing. GPT-5 was first rumored to arrive in mid-2024, yet it’s now the end of 2024 and there’s still no timeline for release.
Anthropic are also struggling. In March 2024 they announced a family of three models: Haiku, Sonnet and Opus. Haiku was small and fast, Opus large and advanced, and Sonnet struck a balance between them. All three were released as V3.0. Sonnet was upgraded to V3.5 in June, with Haiku following in October. But Opus V3.5 has not arrived - it’s believed the performance was disappointing. More recently Anthropic have quietly removed references to Opus V3.5 from their website - it increasingly looks like it will never appear.
Then there are computational and financial constraints. The bigger the model, the more compute - and time - needed for training. Training a large frontier model is an expensive business, costing on the order of $1 billion, and it takes time - anywhere from 12 to 18 months. Even the large labs can’t afford to keep building these models if there is no guarantee of success. It’s too big a gamble.
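To put rough numbers on that claim, here is a minimal back-of-envelope sketch using the widely quoted ~6·N·D FLOPs approximation for training compute. Every figure below - model size, token count, cluster size, utilization, price per GPU-hour - is an illustrative assumption, not a number from any lab:

```python
# Back-of-envelope frontier training cost, using the common ~6 * N * D
# FLOPs approximation (N = parameters, D = training tokens).
# Every number here is an illustrative assumption, not a lab's figure.

params = 1.8e12        # assume a ~1.8T-parameter model
tokens = 15e12         # assume ~15T training tokens
total_flops = 6 * params * tokens            # ~1.6e26 FLOPs

flops_per_gpu = 4e14   # assume ~1 PFLOP/s peak per GPU at ~40% utilization
gpu_count = 25_000     # assume a 25k-GPU cluster

seconds = total_flops / (flops_per_gpu * gpu_count)
gpu_hours = seconds / 3600 * gpu_count
cost = gpu_hours * 2.0                       # assume ~$2 per GPU-hour, all-in

print(f"~{seconds / 86_400:.0f} days of training, ~${cost / 1e9:.2f}B in compute")
# Gives roughly six months of training and ~$0.2B just for compute.
# The raw run is only part of the bill - failed experiments, post-training,
# data and staff can multiply the total several-fold, towards ~$1B.
```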
How useful is AI?
The right model used for the right task can be powerful. Summarization, coding, writing, ideation, research - they all benefit from AI. GitHub Copilot, in particular, has been a great success.
But we’ve also learnt that bad AI is terrible. Take Apple Intelligence. It was launched to great fanfare in June 2024, but the rollout has been a mess. Six months later it’s only just starting to become available. As Wired noted, “you might find yourself saying, ‘Is that it?’” But maybe that’s because Apple were late to the AI party. They are way behind - perhaps by as much as two years.
What about Google then? They invented the transformer, the architecture that underpins modern AI. Presumably they are doing well? Turns out, no, not really. Consider their new flagship phone, the Pixel 9. It comes with Google’s AI, Gemini, built in. Yet the main feature Google advertises is… AI photo editing. All this tech to eliminate selfie sticks. Sure, it has other features too, but the reviews of them haven’t been glowing - AI is not a must-have Android feature (yet).
But surely Microsoft are doing well? They helped kick-start the AI race with the launch of Copilot in March 2023. That must be a success?
Well, even they are struggling. Reviews of Copilot have been mixed. And while it’s making money (reportedly ~$1 billion annually), it’s nowhere near enough - Microsoft has poured billions into AI and needs a better return. Making things worse, Microsoft has an OpenAI-shaped problem. Copilot depends on OpenAI for the base model, but their relationship has soured, and what happens when they part ways is unclear. No doubt Satya Nadella has a plan, but what is it and will it work?
Hype - or something else?
Regardless, the leaders of the large labs continue to predict never-ending sun. Sam Altman has recently claimed the path towards AGI is clear and that OpenAI may achieve it in 2025. Dario Amodei (CEO of Anthropic) has also suggested ~18 months as the point where (unspecified) things kick into gear.
Self-interest is clearly at play. But Sam Altman has another reason to talk up AGI. The OpenAI contract with Microsoft ends as soon as OpenAI achieve AGI - and apparently the contract doesn’t formally define AGI; it’s up to OpenAI to define it. So as soon as Altman can plausibly claim AGI he can part ways with Microsoft. Maybe that’s what’s driving him to hype AGI?
Nonetheless, back in the real world it’s obvious to pretty much everyone that the idea of AGI next year is laughable. We’re nowhere close. I spend a lot of time using Claude (generally considered the best model). It’s brilliant in so many ways. But it’s a long way from being able to replace a human, never mind being considered AGI.
Are the labs delusional?
Part of the problem is benchmarks. The AI world uses a range of common benchmarks - MMLU, HumanEval, GPQA, etc. Many of these benchmarks are now saturated, with models achieving scores of 90% or more. The labs interpret this as evidence that current models are brilliant.
The trouble is the majority of the benchmarks are, well, flawed. Badly flawed. Take the MMLU. It was created by academic researchers to test models across a wide range of tasks and domains, with ~16,000 questions across 57 subjects. But many of the questions are bad. One analysis revealed ~60% of the virology questions contained errors: wrong answers, misleading answers, ambiguous questions.
And it gets worse. Most of the benchmarks have fixed questions with fixed answers. It’s as if the A-level maths paper remained the same year after year. The questions - and answers - inevitably seep into the training data. The models learn the expected answers.
What happens if you change the questions? Instead of asking the model how best to arrange 30 pupils in a classroom with 7 desks, ask about 25 pupils and 6 desks. One recent paper explored exactly that, and found that model performance dropped by an astonishing 50-80%. A clear sign that the models have learnt the answers rather than the method.
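As a sketch of how this kind of perturbation works - with a hypothetical template and number ranges of my own, not the paper’s - the idea is to re-sample the quantities in each question so a memorized answer no longer matches, and to compute the ground truth programmatically:

```python
# Sketch of benchmark perturbation: keep the question's structure but
# re-sample its numbers, so a model that memorized the original answer
# can no longer rely on it. Template and ranges are hypothetical.
import random

TEMPLATE = ("A classroom has {pupils} pupils and {desks} desks. "
            "Each desk seats at most {seats} pupils. "
            "Can every pupil be seated? Answer yes or no.")

def make_variant(rng: random.Random) -> tuple[str, str]:
    pupils = rng.randint(20, 35)
    desks = rng.randint(5, 9)
    seats = rng.randint(2, 5)
    question = TEMPLATE.format(pupils=pupils, desks=desks, seats=seats)
    answer = "yes" if desks * seats >= pupils else "no"  # computed, not memorized
    return question, answer

rng = random.Random(42)
for _ in range(3):
    question, answer = make_variant(rng)
    print(question, "->", answer)
```

Scoring a model against freshly generated variants like these, rather than against a fixed question set, makes contamination from training data much harder.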
There are more realistic benchmarks. One example is SimpleBench, which tests real-world reasoning. On this benchmark humans score ~84%. The best AI result? 42%.
Yet the leading labs seem trapped in an echo chamber of their own making. When you hear their leaders talk, it’s easy to wonder whether they actually use their own products.
The backlash?
Every time a human has a poor experience it reinforces the perception that AI isn’t useful. The more this happens, the more people view AI as a toy. Or even a joke. The memes grow.
Maybe this isn't surprising - humans seem naturally skeptical of AI-generated content. A recent poetry experiment found humans consistently downgraded the quality of a poem when told it was AI-generated. Seems we prefer things our species created. Who’d have guessed?
The computer industry has been here before. It has been trying to create agents for decades. Microsoft Bob, Clippy, Siri, Google Now, Alexa, Cortana... Each promised to revolutionize human-computer interaction. Each failed to deliver. Amazon reportedly lost $25 billion on Alexa between 2017 and 2021. Clippy became a punchline. Cortana quietly disappeared.
Sadly, history seems to be repeating itself. Take Apple Intelligence message summaries - they are often terrible. It’s a bad sign when your product becomes the punchline of jokes.
The rot is starting to set in. The current AI agents need to do a lot better if they are going to succeed. They need to provide real value. And quickly.
Where does this lead?
There’s a gulf opening between people’s lived experience of AI and the capabilities being talked up. A year ago most people believed the leaders of the major AI labs. But today - less so. Now that more of us have experience of using AI, we can see the clouds. The leaders of the labs are treading a dangerous path: they risk turning people away from AI as it fails to deliver on their grandiose predictions.
And what of the market leader, OpenAI? Their current position is reminiscent of the early days of the PC market. Back then IBM led and everyone else followed. But it didn’t take many missteps before IBM lost their lead and faded into insignificance. OpenAI appear on track to follow a similar trajectory. Today they dominate, but they don’t look healthy. There has been a series of high-profile departures of senior staff in recent months. Increasingly they are becoming the Sam Altman show - and he seems somewhat disconnected from reality. Does that leave an opportunity for someone else to become the modern-day Compaq? Probably.
In the long run Microsoft, Apple & Google should win. AI assistants will be most useful when they have access to our data. Microsoft has the business data. Apple & Google our personal data. But can these tech giants weather the gathering storm? Microsoft is able to use the biggest, most powerful models and yet Copilot is a mixed bag. Things are worse for Google and Apple. Data privacy means they are stuck with smaller - and much more limited - models that run on handsets. Those small models don’t get close to delivering what has been touted.
There is a long way to go. For AI to succeed, we need much better local models, major advances in hardware efficiency, and proof that AI can deliver real value consistently. The storm clouds are gathering - public patience with grandiose claims is wearing thin.
Which is not to say AI won't continue to impact life. The technology is too powerful, the investment too great, the momentum too strong to simply disappear. But the next few years will be about reality, not hype. The era of blind faith in AI is ending. The era of proving real value is just beginning.



Welcome to the trough of disillusionment: https://www.gartner.com/en/articles/hype-cycle-for-artificial-intelligence