It's all about context
The art of conversation engineering
Earlier this week I was playing with Grok 3. I fed it a long story (~100k words) and asked it to summarize. To my surprise, it skipped the middle of the story - and then insisted that the document I’d given it didn’t contain the missing chapters.
It’s not the first time I’ve seen something like this. I’ve also seen it the other way round - where the LLM gets fixated on one aspect of the chat, to the exclusion of the rest of the chat history. This behaviour invariably becomes more pronounced the longer the chat gets. I’ve heard people describe it as “the model losing the plot” or “going off the rails”.
Going off the rails is not exactly, err, desirable. In this post I’ll explain why LLMs lose track of long conversations, compare them with human memory, and offer some practical solutions. But to begin, we need to go back to basics.
Context windows: the limited memory of LLMs
LLMs have a “context window”. The context window contains the complete text of the conversation so far: everything you’ve asked, and everything the LLM has replied with. When you ask another question, everything in the context window - plus the new question - gets fed into the LLM to generate the output.
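To make that concrete, here’s a minimal sketch of a chat loop in Python. It’s illustrative only - call_llm is a stand-in for whatever model API you’re actually using - but it shows the key point: the model never sees anything except this one growing list.

```python
# A minimal sketch of how a chat loop rebuilds the context on every turn.
# call_llm is a placeholder for whatever model API you actually call;
# here it just returns a canned reply so the sketch runs on its own.

SYSTEM_PROMPT = "You are a helpful assistant."

def call_llm(messages: list[dict]) -> str:
    """Placeholder: send the full message list to a model, return its reply."""
    return f"(model reply, given {len(messages)} messages of context)"

messages = [{"role": "system", "content": SYSTEM_PROMPT}]

def ask(question: str) -> str:
    # Every question and every answer is appended to the same list...
    messages.append({"role": "user", "content": question})
    # ...and the *entire* list is sent to the model on every single turn.
    answer = call_llm(messages)
    messages.append({"role": "assistant", "content": answer})
    return answer

ask("Summarize chapter one for me.")
ask("Now compare it with chapter two.")  # the model also re-reads turn one
```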
Notice that the first thing in the context window is called the “System prompt”.
So what’s that then? The system prompt is a set of instructions to the model telling it how to behave. Here’s a snippet of Claude’s (the full prompt is here):
Human vs. AI memory
We humans have something similar:
To a first approximation you can map:
Your values and self-identity map to the system prompt. When someone asks you about your political views, you don't need to recalculate them - your core values and identity guide your response automatically.
Short-term memory corresponds to the content of the conversation. Like an LLM, you remember what was just said.
Long-term memory provides background knowledge and personal experiences. When discussing a movie, you can instantly recall when you saw it, with whom, and how you felt - something LLMs can't do without external retrieval.
But it’s different in some critical ways.
Most LLMs don’t have any concept of long-term memory. A technique known as RAG (retrieval-augmented generation) can be used to approximate it - but it’s a selective long-term memory, populated up-front from external sources and not updated with conversation history. (There’s a rough sketch of the idea just after these comparison points.)
The memory - or state - of the conversation is contained in the text of the conversation. The AI models don’t maintain any external state.
Humans never run out of context - we compress and prune our memories to ensure we can always have another conversation. LLMs frequently run out of context.
Humans filter and compress short-term memory. We all know short-term memory is an imprecise record of our conversations and interactions. But LLMs retain the precise contents of the entire conversation up to that point.
And it’s that latter human ability - the ability to filter and compress - which is so important.
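To illustrate the RAG point above, here’s a rough sketch of retrieval as “selective long-term memory”. It’s deliberately toy-like: real systems use an embedding model and a vector store, whereas here relevance is just naive word overlap - the shape of the idea is what matters.

```python
# A toy sketch of RAG as "selective long-term memory".
# Real systems use an embedding model and a vector store; here relevance
# is scored by naive word overlap, purely to show the shape of the idea.

def score(query: str, passage: str) -> float:
    """Crude relevance score: fraction of query words found in the passage."""
    q_words = set(query.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / max(len(q_words), 1)

def retrieve(query: str, knowledge_base: list[str], k: int = 2) -> list[str]:
    """Pick the k passages most relevant to the query."""
    return sorted(knowledge_base, key=lambda p: score(query, p), reverse=True)[:k]

# The "memory" is populated up-front from external sources...
knowledge_base = [
    "Q3 revenue grew 12% year on year, driven by the new subscription tier.",
    "The office is closed on public holidays.",
    "Churn fell to 2.1% after the onboarding redesign in May.",
]

# ...and only the retrieved passages get injected into the prompt.
# Nothing from the conversation itself is ever written back into memory.
query = "Why did revenue grow last quarter?"
background = "\n".join(retrieve(query, knowledge_base))
prompt = f"Use this background:\n{background}\n\nQuestion: {query}"
print(prompt)
```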
The filtering problem: why LLMs get confused
The human brain's ability to selectively retrieve relevant information is remarkably efficient. When I discuss work with a colleague, my brain automatically filters out the dinner plans I made earlier with my wife. It can also pull in relevant context from long-term memory. And as the conversation advances my brain is continuously reviewing what is most relevant to the conversation.
But LLMs don’t do that. They lack this intuitive relevance filtering - everything in the context window has equal priority.
And as the context window fills up, this confuses the model - especially if the conversation evolves and moves in different directions. The model doesn’t know - has no way to know - which bits of the conversation are relevant to what’s being discussed right now.
Often the topic that’s mentioned the most gets prioritized. For example, if most of the context window contains a conversation about the first chapter of a document and you then pivot to discuss the second chapter, there’s a good chance the model will get confused - it doesn’t know that all the discussion about the first chapter is suddenly irrelevant.
Solutions
For now the labs prioritize performance on benchmarks - which typically involve short, focussed prompts and answers. This works well with the current models, but there’s relatively little focus on longer conversations. That will come (not least because it’s table stakes for anyone who expects ASI/AGI to take over the world). But for now we need to accept that it’s not there, and adjust.
So what to do?
Keep conversations focussed. Wide-ranging conversations on multiple disjoint subjects will confuse the models. I find myself consciously thinking about how to split conversations across multiple sessions.
Get the model to summarize the conversation; you can then filter what’s important and move that to a new session (there’s a sketch of this pattern below).
Remember that most models allow you to rewrite previous questions. If you don’t like the way the conversation is going, you can go back and branch it. I find myself doing this more and more when I realise I’ve asked the wrong question - invariably my second attempt gives the model a more focussed instruction (e.g. generate COMPLETE, CONSOLIDATED, UPDATED code, not deltas). As an aside, being able to do this with humans would be quite useful - you could have fun testing out potential answers to questions before having to commit :).
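Here’s what that summarize-filter-restart pattern might look like in code. It’s a sketch under obvious assumptions - summarize_with_llm and the session format are placeholders, and the filtering step is deliberately left as a human edit.

```python
# A rough sketch of the "summarize, filter, start a fresh session" pattern.
# summarize_with_llm is a placeholder for a real model call; the canned
# return value just keeps the sketch runnable on its own.

def summarize_with_llm(transcript: str) -> str:
    """Placeholder: ask the model for a compact summary of the transcript."""
    return f"(summary of a {len(transcript.split())}-word conversation)"

def start_new_session(seed_context: str) -> list[dict]:
    """Begin a fresh conversation whose only history is the distilled context."""
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": f"Context carried over from an earlier chat:\n{seed_context}"},
    ]

old_transcript = "...the full text of a long, meandering chat..."

# 1. Ask the model to summarize the old conversation.
summary = summarize_with_llm(old_transcript)

# 2. Filter the summary down to what still matters (a manual, human step).
filtered_summary = summary  # in practice you would edit this by hand

# 3. Seed a new session with only the filtered summary - a far smaller context.
messages = start_new_session(filtered_summary)
```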
Had I broken my 100k-word story into smaller chunks or added specific instructions for Grok 3 to process it sequentially, I would have gotten a better summary. The model wasn't being difficult - it was drowning in context.
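And here’s roughly what the chunked, sequential approach could have looked like for the story. Again, a sketch only: the chunk size is arbitrary and summarize stands in for a real model call.

```python
# A sketch of "break the story into chunks, summarize each, then summarize
# the summaries". The chunk size is arbitrary and summarize is a placeholder
# for a real model call with a short, focussed prompt.

def summarize(text: str) -> str:
    """Placeholder: ask the model to summarize a manageable piece of text."""
    return f"(summary of {len(text.split())} words)"

def chunked_summary(story: str, words_per_chunk: int = 5_000) -> str:
    words = story.split()
    chunks = [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]
    # Summarize each chunk on its own, so no single call drowns in context...
    partial_summaries = [summarize(chunk) for chunk in chunks]
    # ...then summarize the summaries to get the overall picture.
    return summarize("\n".join(partial_summaries))
```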
While future models will improve at handling long conversations through better attention mechanisms and memory management, the fundamental architecture of LLMs suggests this will remain a challenge. The most effective users of AI will be those who understand these limitations and structure their interactions accordingly.
In many ways it’s not so different from talking to a human. Persuading and convincing other humans is rarely a matter of blurting out what you want - you need a strategy, a plan. And it’s similar with AI - learning how to ‘work’ the AI will inevitably produce better results.
As these tools become more integrated into our daily work, it’s not going to be prompt engineering that matters, but conversation engineering - the ability to structure exchanges that play to AI's strengths while minimizing its weaknesses. How might you restructure your AI interactions knowing what you now know about how these models 'think'?

Thanks Martin. Tbh I’m struggling to see why conversation history is such a big problem for the labs. What’s the best tool we have for summarising a conversation, or for parsing a conversation into separate threads? ...yes, an AI. So why can’t the AI just do this automatically, as a maintenance operation?
And maybe the AI needs multiple levels of summarisation, to add longer-term memories. E.g. our memories of school days are more heavily filtered and summarised the longer we leave them behind.
An end goal is perhaps for an AI to be able to update its own system prompt based on its learnings. Is this part of getting to AGI? Just as with humans, it shouldn’t be easy for AIs to change their personality or fundamental traits. Maybe changes will “flow up” as summaries of summaries are compared to the AI’s “fundamental beliefs”. But this sounds harder.