Step by step
You can't BLUF your way directly to the answer
Recently I spent an afternoon helping my son with his Higher Maths Paper 2 (the Scottish equivalent of A-levels). The questions are complicated and open-ended. They are hard (at least for me). Invariably we found ourselves writing out every fact from the question, while hoping inspiration would strike and we’d realise how to solve the problem. (Now, of course, I could have taken a photo of the question and Claude would do the rest - but we’d both agreed only to use AI as a fallback.)
But as we wrote out the details, rather than solving the question, my mind wandered to the similarity between this approach and the one that makes the most recent reasoning models so successful.
Not so BLUF
I’m a great fan of BLUF - Bottom Line Up Front. If you’re not familiar with it, it’s a technique where you put the bottom line - the key message - at the start of any communication. Asking for something? Ask first and then explain why. It’s great for clarity and avoiding confusion.
But LLMs are bad at BLUF. Really bad. Why?
Models generate output sequentially. They can’t back up - once they’ve generated some output they can’t change it. And that makes BLUF next to impossible - unless they can produce the correct answer immediately.
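To see why, here’s a toy sketch of the generation loop. It’s purely illustrative - a real model samples the next token from a learned probability distribution - but the append-only shape of the loop is the point:

```python
# Toy autoregressive loop: output is produced one token at a time,
# and tokens that have already been emitted are never revised.
def next_token(tokens):
    # Stand-in for a real model; just follows a hard-coded continuation.
    continuation = {"2+2": "=", "=": "4", "4": "<END>"}
    return continuation.get(tokens[-1], "<END>")

tokens = ["2+2"]
while tokens[-1] != "<END>":
    tokens.append(next_token(tokens))  # append only - there's no going back

print(" ".join(tokens))  # 2+2 = 4 <END>
```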
For simple questions where they’ve memorized the answer, it works…
But anything more complex is a struggle:
And that’s the same with humans. I can’t solve that integral in a single step (to be honest, I’m not sure I could solve it even if you gave me unlimited steps) - and neither can an LLM. It’s just like when I’m writing an email with BLUF - I can’t go straight to the summary in one go. I need to write the body before I can write the summary.
Step by step
The game changer came when folk realised that instructing LLMs to think “step by step” as part of a prompt caused them to switch from trying to jump straight to the answer to a more human-like approach: writing out what they knew and breaking the task down into smaller steps. And that was revolutionary - that simple phrase had a significant impact on the quality of the answers.
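As a concrete (and purely illustrative) example using the OpenAI Python client - the model name and exact wording don’t matter, the extra sentence does:

```python
from openai import OpenAI

client = OpenAI()

question = ("A curve has equation y = x^3 - 6x^2 + 9x. "
            "Find the coordinates of its stationary points.")

# Same question, two prompts: one asks for the answer directly,
# the other adds the 'step by step' instruction.
direct = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": question}],
)
step_by_step = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user",
               "content": question + "\n\nLet's think step by step."}],
)

print(direct.choices[0].message.content)
print(step_by_step.choices[0].message.content)
```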
The next step was to train models to think step by step by default - and that led to the explosion of reasoning models. o1-preview was the first (last September) but has since been followed by many more - Claude, Gemini, Grok and DeepSeek all now have reasoning models.
How does this work?
Understanding how reasoning models work requires understanding how LLMs are trained. These days there are three main stages of training:
Pre-training. In this phase models are trained on massive sets of internet data to predict the next token (roughly akin to a word). The models learn how language is constructed as well as storing a lot of facts and knowledge from the internet. These are called ‘base’ models; you can view them as compressed versions of the internet. Give them the start of a sentence and they’ll continue it. But they don’t know how to answer questions or solve problems. They don’t know how to be a useful assistant. For that we need more training.
Post-training - Supervised Fine Tuning (SFT). This training teaches the model how to answer questions. In this training phase the models are shown many examples of ‘user-assistant’ conversations. The user asks a question; the assistant replies. The examples were originally human generated, with responses that were crafted to be helpful, truthful and harmless - although these days they are increasingly synthetic. This is also the stage where models learn how to use tools (e.g. web search) and say “I don’t know”. One amazing thing about this stage is how quickly models pick up this skill. Pre-training may take 3 months; SFT can take as little as 3 hours. And once this is complete you have a non-reasoning model.
Post-training - Reinforcement Learning (RL). This is the new stage which brings out the reasoning capabilities in LLMs. The idea here is to give the model a set of questions. For each question we know the correct answer, but not the steps required to get to that answer. The model then generates a set of solutions; the incorrect ones are discarded, the correct ones used for further training. It’s a bit like practicing past papers for an exam. The more you practice, the more familiar you become with effective problem-solving techniques and the better the answers become. The graph below shows how problem-solving ability rises as the number of training steps increases.
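Here’s a heavily simplified sketch of that filtering loop. Real pipelines use proper reinforcement learning algorithms rather than this rejection-sampling toy, and the ‘model’ below is just a random-guess stand-in so the code runs anywhere - but the shape is the idea: generate lots of attempts, keep the ones that reach the known answer, train on those.

```python
import random

# Questions where we know the final answer, but not the working needed to reach it.
problems = [
    {"question": "What is 17 * 24?", "answer": 408},
    {"question": "What is 91 - 37?", "answer": 54},
]

def sample_attempt(question):
    """Stand-in for the model generating a worked solution plus a final answer.
    The final answer here is a random guess, purely so the example runs."""
    return {
        "working": f"...chain of thought for: {question}...",
        "final": random.randint(0, 500),
    }

kept_for_training = []
for problem in problems:
    attempts = [sample_attempt(problem["question"]) for _ in range(1000)]
    # Discard attempts with the wrong final answer; the correct ones
    # (question + working + answer) become data for further training.
    kept_for_training += [a for a in attempts if a["final"] == problem["answer"]]

print(f"Kept {len(kept_for_training)} correct attempts out of 2000")
```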
But even without RL, all models have some capability for ‘in-context’ learning, which means you can change the behaviour of the model depending on how you prompt it. For example, you can instruct the model to ‘answer in a single token’:
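Something like this (OpenAI Python client; the model name and wording are just for illustration):

```python
from openai import OpenAI

client = OpenAI()

# The instruction alone changes the model's behaviour - no retraining involved.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Answer in a single token."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
)
print(response.choices[0].message.content)  # e.g. "Paris"
```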
Or you can simulate reasoning behaviour (this is GPT-4o which is a non-reasoning model):
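Roughly along these lines (again, the prompt wording is illustrative) - note the instruction to put the final answer after an ‘<ANSWER>’ separator:

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": ("Reason through the problem step by step, writing out what "
                     "you know and checking your working as you go. When you are "
                     "finished, write <ANSWER> followed by the final answer only.")},
        {"role": "user",
         "content": "Find the x-coordinate of the turning point of y = x^2 - 8x + 3."},
    ],
)

full_output = response.choices[0].message.content
# Everything before the separator is the 'thinking'; everything after is the answer.
thinking, _, answer = full_output.partition("<ANSWER>")
print(answer.strip())
```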
In the most recent reasoning models the ‘thinking’ process - or chain of thought - is largely hidden by the UI. In the example above everything prior to the ‘<ANSWER>’ separator would be summarized. Some labs let you see the full detail if you want (e.g. DeepSeek); others obfuscate it (e.g. OpenAI).
It very much reminds me of when I’m writing an email using BLUF. I’ll often write the body of the email, only to throw it away when it becomes clear what my ask is. Like an LLM’s, my chain of thought is only useful insofar as it helped me figure out my ask.
So if reasoning is so brilliant, why isn’t it the default? The problem is time. And cost. Generating all those reasoning tokens takes time - and costs compute. So we still have non-reasoning models for when we need a quick answer to a simple question. But there are attempts to find compromises - e.g. a recent variant of chain of thought called “Chain of Draft”. The idea is to strike a balance between answer quality, elapsed time and cost: the model is told to limit the number of tokens it generates in each thinking step.
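The Chain of Draft instruction is roughly of this form (paraphrased; the wording, separator and example question are illustrative):

```python
from openai import OpenAI

client = OpenAI()

# Chain of Draft: keep each reasoning step down to a handful of tokens.
COD_INSTRUCTION = (
    "Think step by step, but keep only a minimum draft of each thinking step, "
    "using at most five words per step. "
    "Return the final answer after the separator ####."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": COD_INSTRUCTION},
        {"role": "user",
         "content": ("Jason had 20 lollipops. He gave Denny some. "
                     "Now Jason has 12. How many did he give Denny?")},
    ],
)
print(response.choices[0].message.content)
# e.g. "20 - 12 = 8 #### 8"
```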
As you can see, it’s got a chain of thought feel, but it’s more concise. So it’s faster - and cheaper.
Finding the human in the machine
Reflecting on my Higher Maths experience with my son, I'm struck by how similar our approach is to what's happening inside these models. Pre-training is like attending a lecture or learning from a book; reinforcement learning is practicing problems.
And this parallel isn't coincidental - it's by design. The reinforcement learning process teaches these models to mimic effective human problem-solving strategies. Rather than attempting to leap directly to answers (which rarely works for complex problems), they've been trained to show their work, just as we teach students to do.
But it raises an interesting question. How much further will these approaches scale? Eighteen months ago the labs were confidently predicting that more compute in the pre-training phase was all that was needed to deliver AGI. Of course, we now know that isn’t true - GPT-4.5 and Grok 3 were both trained with ten or more times the compute of GPT-4. They are better, but only marginally. And they are nowhere near AGI.
Nowadays the hope is that spending more compute on RL will unlock AGI. And while it’s undoubtedly true that there’s more performance to be gained from RL, it remains unclear whether it will deliver AGI. The labs remain confident, but they have financial valuations to support.
Nonetheless it is striking how similar the problem-solving approaches between humans and AI are - methodical, iterative, and fundamentally imperfect in ways that make them uniquely powerful. In teaching machines to think step by step, it seems we've inadvertently created a mirror that reflects our own cognitive processes back to us.