Losing the plot
When ChatGPT's advanced voice mode goes off script
Most mornings I go for a walk in the hills behind my house. The days are short in December, which means I get to see the sunrise from the summits. Which is nice. But it does mean the first hour is in the dark.
As luck would have it, though, the walk in the dark is an excellent time to talk to ChatGPT using Advanced Voice mode. I’m also learning Spanish, so recently I’ve been getting ChatGPT to make up short stories in Spanish and quiz me on them.
When the cracks start showing
For the first 45 minutes or so it generally works impressively. But then the artefacts start to creep in. The first is an increasing number of content-flag hits: the conversation cuts off and a different voice announces, “My guidelines won’t let me talk about that. Can I help you with something else?”
In the transcript it looks like this. Interestingly there’s more text in the transcript than was actually spoken.
In my experience, once this starts it becomes persistent. I had a content-flag hit roughly every minute for the remaining 15 minutes of the chat (Advanced Voice mode usage is capped at an hour per day). And it’s not obvious why: the stories were about scarecrows, fountains, and enchanted forests - the same subjects as the first 45 minutes. A bug?
Ghosts in the transcript
Making things worse, the model also starts to hallucinate extra fragments of speech. This story is about a carousel in a city park.
But after the first question (in English), ChatGPT adds (in Spanish): “Correcto. Al atardecer, en un pueblo costero”. Which means “Correct. At sunset, in a coastal village”.
It’s not related to the current conversation. Nor does the extra speech appear in the transcript.
This hallucination can feel like gaslighting. One day ChatGPT added this sentence to several stories: “Tomó fotos de las flores, los árboles y las personas paseando con sus perros” (“he took photos of the flowers, the trees, and the people walking with their dogs”). Because I didn’t realise it was a hallucination, I assumed it was part of the story and got the comprehension questions wrong. When I got home I checked the transcript. It was disconcerting to discover that the speech I remembered didn’t appear - for a brief moment I wondered: was I the one hallucinating?
Here’s another example where the audio doesn’t match the transcript. ChatGPT has inserted: “What could visitors learn about at the greenhouse?” into the middle of the audio.
Net: you can’t trust the transcripts.
Voices in my head
Other voices also occasionally appear. This happens less often, but sometimes after ChatGPT asks a question a different voice will start to answer it. It’s particularly noticeable when it’s a male voice. And it’s very disconcerting when it happens…
A pattern emerges
One thing I’ve found with models of all types is that response quality degrades as a session gets longer. With AI-generated video, image quality noticeably degrades on longer clips (where ‘long’ means 20-30 seconds). Similarly, Udio’s audio quality starts to degrade after several minutes (e.g. cymbals lose their shine and start to sound ‘muddy’). Advanced Voice mode seems no different: after about 45 minutes it starts to lose coherence. Perhaps the 60-minute limit is based not so much on resource constraints as on the fact that models just aren’t capable of working reliably for extended periods of time.
Don’t get me wrong: Advanced Voice mode is a fantastic tool. Having an unwaveringly patient companion who is happy to create endless practice stories and quiz me on them is amazing. If I don’t understand, I can ask for an explanation. And if I still don’t understand (a frequent occurrence) I can keep asking questions until I finally get it.
But it’s also dangerous. Gaslighting and random voices are disconcerting - and not something you can afford in a production system. Imagine this technology being used to implement an appointment-booking system. The agent asks when you’d like to see the doctor - and then a different voice starts answering! Or the agent inserts other, random text into the middle of the conversation. At a minimum it’s confusing; at worst it results in failure.
The 45-minute meltdown
Advanced Voice mode represents both the promise and perils of conversational AI. Having an endlessly patient language learning companion is transformative - it's the kind of breakthrough that could democratize education and make personalized tutoring available to everyone. But the degradation in quality over time, coupled with weird behaviours like hallucinations and voice switching, highlights how far we still have to go.
Are these fundamental limitations of large language models, or just initial teething troubles with voice interfaces? The parallels across different AI modalities - video degrading after 20 seconds, audio quality dropping after minutes, and now voice interactions breaking down after 45 minutes - suggest something more fundamental. Are we hitting the limits of what current context windows can reliably maintain?
Can AI keep it together?
This matters because real-world applications need sustained reliability. An AI medical assistant can't start hallucinating halfway through a consultation. A teaching assistant can't inject random Spanish phrases into an English lesson. For AI to transform areas like healthcare, education, and customer service, we need systems that maintain coherence and reliability not just for minutes, but for hours.
The 60-minute session limit may be less about computational resources and more about the current limits of AI models. Understanding and solving these limitations will be crucial for moving conversational AI from an impressive tech demo to a transformative technology. Until then, we'll have to keep our conversations short - and double-check those transcripts.