Give your AI frog more lily pads
Why extended thinking matters
Earlier this week a colleague showed me the output of a Claude conversation. In it, Claude was struggling to analyze data - trying approach after approach with little success. But as they shared the chat with me, one thing stood out:
Did you spot it?
Yup - extended thinking isn’t enabled. That little blue clock can make all the difference to Claude’s ability to do deep thinking.
LLM frogs
To understand why, think of it this way: models process data sequentially, token by token, word by word. Each new token is generated by a forward pass through the model weights, and there’s a limit to how much computation a single pass can do. It’s a bit like a frog - the LLM can only jump so far on each pass through the weights.
Ask the model to jump too far and it falls into the pond. For example, ask ChatGPT (a non-thinking model) to multiply two large numbers:
Close. But wrong. The correct answer is 1,934,513,452,356.
Extended thinking (or what OpenAI calls 'reasoning') is like adding lots of lily pads to the pond. Our LLM frog can now break the big jump into multiple smaller ones - and avoid falling into the pond. Give o3 (a thinking model) the same math question and it will construct a set of lily pads (or, more technically, a 'chain of thought') to get to the correct answer:
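The lily-pad strategy can be sketched in code. Here’s a toy illustration (the numbers are made up for the example, not the ones from the chat above): instead of one giant leap to the product, break the multiplication into small partial products, each easy to check on its own - just as a chain of thought does.

```python
def multiply_with_lily_pads(a: int, b: int) -> int:
    """Multiply a and b the way a chain of thought would:
    split b into decimal digits and sum the partial products."""
    total = 0
    for place, digit in enumerate(reversed(str(b))):
        # Each partial product is one lily pad: a small,
        # verifiable hop rather than a single giant jump.
        total += a * int(digit) * 10 ** place
    return total

print(multiply_with_lily_pads(123456, 789))  # 97406784, same as 123456 * 789
```

The point isn’t the arithmetic itself - it’s that each intermediate step stays within the frog’s jumping range, so errors have far fewer places to hide.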
Why isn’t extended thinking the default?
So why isn’t extended thinking / reasoning on by default? There are a few reasons:
Firstly, it’s not always needed. There are plenty of tasks (e.g. web search, text summarisation) where the models don’t benefit from extended thinking.
Secondly, it’s slower. ChatGPT gave me a wrong answer (albeit only off by 0.003%) in under a second. o3 took 10 seconds to produce the correct answer. ChatGPT offers speed at the expense of accuracy. Often that’s good enough.
Finally, more tokens make the models more expensive to serve, which costs the labs more dollars. So they naturally default to the cheaper option.
So, for now, we users need to make the decision about which models to use. And there is a lot of choice - even if we just consider two of today’s market leaders: Anthropic and OpenAI.
Anthropic offers Claude 4 in Opus and Sonnet variants. Extended thinking can be enabled for either. So there are four combinations.
Of these, 'Sonnet + non-thinking' is the fast variant, while 'Opus + extended thinking' gives the best answers.
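If you’re calling Claude through the API rather than the chat UX, the same four combinations fall out of two choices: which model, and whether to grant a thinking budget. Here’s a minimal sketch of how the request payloads differ, following the shape of Anthropic’s Messages API - the model names and token budget below are illustrative, so check the current docs before using them.

```python
def build_request(model: str, prompt: str, thinking: bool) -> dict:
    """Build a Messages-API-style request payload.

    Extended thinking is opt-in: enabling it grants the model a
    token budget for its chain of thought before the final answer.
    """
    req = {
        "model": model,
        "max_tokens": 16000,  # must exceed the thinking budget
        "messages": [{"role": "user", "content": prompt}],
    }
    if thinking:
        req["thinking"] = {"type": "enabled", "budget_tokens": 8000}
    return req

# The fast combination vs. the best-answers combination:
fast = build_request("claude-sonnet-4", "Summarise this page", thinking=False)
deep = build_request("claude-opus-4", "Multiply 123456 by 789", thinking=True)
```

The design choice worth noticing: thinking is a per-request flag, not a property of the model - which is exactly why all four combinations exist.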
But why do the other two combinations exist? Are they of any use? Even Anthropic doesn’t offer clear advice. Ask Claude why they’re exposed in the UX and it has this to say:
And Claude has a point. The UX is terrible. But it’s not the worst. OpenAI wins that award. They offer a bewildering range of models.
And then some tools as well:
How is a normal user meant to choose between seven different models and six tools? How is "Study and learn" different from just... asking questions? And the version numbering? It’s a mess.
Most of us don’t care whether our ovens have multiple heating elements, convection fans, moisture injection, temperature probes - we just want our pizzas and cakes to taste good. But modern AI tools are like ovens which ask users whether they’d like laminar or turbulent convection flow. Or ask them to specify their Reynolds number preference.
The irony is that AI is perfectly capable of deciding which model best suits a given problem - and then invoking it. The UX could be so much better. And it might get better soon, at least for ChatGPT: GPT-5, rumoured for later this year, is expected to merge all the models behind a single unified interface. So while GPT-5 may not significantly advance model intelligence, it may make the capabilities of today’s models significantly easier to access.
And so?
But, right now, we still face the question of which model to use when. And we can get AI to help with that - here’s a handy flowchart from Claude.
And ChatGPT produced this handy chart of the tradeoff between speed and intelligence.
But, ultimately, I've got one simple rule: if you're only going to use one AI setup, default to Claude Opus with extended thinking. Yes, it's slower than Sonnet, but not dramatically so. And if you have Enterprise Claude, the token limits are generous - you’re unlikely to hit usage caps with typical work. Better to give your AI frog plenty of lily pads than watch it struggle in the pond.