Plausible nonsense
Why failing convincingly is much more dangerous than failing obviously
The treadmill of updated models continues unabated. In recent weeks we’ve had Gemini 2.5, Codex (OpenAI’s new coding-focussed model) and, of course, Claude 4.0.
But, to my mind, the two big standouts are Google’s diffusion LLM and Veo3.
The diffusion LLM is remarkable. All the current frontier models use a transformer architecture, which generates output sequentially - one token after another. A diffusion model generates in parallel - and corrects as it goes. Remember the early days of the internet with progressive JPEGs, where you watched a blurry image slowly become sharper? That’s what a diffusion model is like - it gives you a complete but slightly wrong answer quickly, then corrects it over time.
Diffusion models promise to be much faster. So they can potentially do much more thinking in the same time as a transformer model. Another way of getting more performance from the models.
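For intuition only, here’s a toy sketch of the two decoding loops. Nothing here resembles a real model - the “model” is just a hard-coded target string and the refinement schedule is made up - but it shows the shape of the difference: many sequential steps versus a few parallel refinement passes.

```python
import random

random.seed(0)
TARGET = list("hello world")              # stand-in for the "correct" output
VOCAB = "abcdefghijklmnopqrstuvwxyz "

def autoregressive_decode():
    """Transformer-style: one token per step, left to right."""
    out = []
    for tok in TARGET:                    # each step waits for the previous one
        out.append(tok)
    return "".join(out)                   # takes len(TARGET) sequential steps

def diffusion_decode(steps=4):
    """Diffusion-style: start from a noisy full-length draft,
    then refine every position in parallel on each pass."""
    draft = [random.choice(VOCAB) for _ in TARGET]
    for step in range(steps):
        for i in range(len(draft)):       # a growing fraction locks in per pass
            if random.random() < (step + 1) / steps:
                draft[i] = TARGET[i]
    return "".join(draft)                 # takes only `steps` parallel passes

print(autoregressive_decode())            # 11 sequential steps
print(diffusion_decode())                 # 4 refinement passes, same answer
```

The speed win comes from the second loop: the draft starts blurry but full-length, and each pass sharpens every position at once rather than waiting on its neighbours.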
On the eye-candy front, Veo3 adds audio to video clips. Properly lip-synced audio. Need an eight second long video clip? You’re all set. Failed to film your cat falling off the sofa? No problem - Veo3 will generate it for you. Want to prank your friends with the Mentos & Coke trick? No need to waste a bottle of Coke - just get Veo3 to create it.
It’s interesting to observe that Google owns both the content creation tech and the distribution platform (YouTube shorts). For now humans are driving Veo3. But how long before that gets automated and we get endless feeds full of AI slop? We already have a problem with folk addicted to the bottomless pit of short videos on TikTok. This doesn’t feel like it’ll make things better.
For now we’re at eight seconds, but longer clips with better character consistency are coming. How long before movies on demand are genuinely created on demand?
On that subject, I had hoped to experiment with a Gemini and Veo3 pipeline to see if I could create a short movie on demand: get Gemini to write a script, then the prompts; use Puppeteer with MCP to drive Veo3; pull the generated clips back into Gemini (uniquely amongst frontier models, Gemini accepts video & audio as input formats); review them; then generate the next clips. Maybe I could get Gemini to create the next Terminator movie and retire early?
But, Veo3 isn’t available (yet) in the UK. And, unlike OpenAI, Google seems to be wise to VPNs, so you can’t cheat your way to access. So I settled for building a music pipeline instead, swapping Veo3 for Udio. And the results were interesting.
My music pipeline
Gemini and I brainstormed ideas. I was looking for something that mixed Spanish and English lyrics and had a bit of emotional depth. As you might expect, Gemini engaged enthusiastically.
That has fantastic potential for the kind of lyrical design we're aiming for, focusing on the sensory experience of absence and the internal turmoil.
Let's call the theme: "The Ghost of You / Echoes of Why"
This allows us to explore:
The lingering presence of the person (like a ghost, an echo).
The haunting questions and the feeling of emptiness.
Then it produced lyrics.
But Udio needs both lyrics and a style prompt. The style prompt describes the type of music and the instruments to use. Often it’s just a couple of lines long, e.g.
Upbeat, 1920s, jazz influence, sultry, piano-driven, banjo, bassoon, flute, amazing flute solos, chugging bass
But Gemini went to town:
The result? If I’m honest it was quite nice…
Then I fed the output back to Gemini:
I really got the feeling Gemini was listening to the music. Critically analysing it. Even though it was a bit OTT in places (e.g. “confusion and sorrow are palpable”).
And then we ran into a problem.
Udio lyrics combine instructions and data. Instructions are in square brackets and are musical directives e.g. [Chorus] or [passionate duet] or [drum solo]. Lyrics are, well, just lyrics. For example:
[Chorus - Dynamic and anthemic, full band, powerful female lead vocals with strong layered harmonies, Spanish/English]
Este fantasma de ti, me persigue (This ghost of you, it haunts me)
Cada pregunta, un cuchillo (Every question, a sharp knife)
But if the instructions are too long, Udio can get muddled up and start singing them. And that’s what happened just over two minutes into the song. It’s the classic problem of conflating instructions and data.
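A cheap guard would be to lint the lyrics before submitting them. This sketch pulls out the bracketed directives and flags any over a length budget - note the 60-character threshold is my guess at what tips Udio into singing the directive, not anything Udio documents:

```python
import re

# Udio treats [bracketed] text as musical directives and everything
# else as lyrics to sing. Overly long directives seem to leak into the
# vocals, so flag anything over an (assumed) safe length.
MAX_DIRECTIVE_LEN = 60

def check_lyrics(text):
    """Return the list of directives that look risky."""
    problems = []
    for match in re.finditer(r"\[([^\]]+)\]", text):
        directive = match.group(1)
        if len(directive) > MAX_DIRECTIVE_LEN:
            problems.append(directive)
    return problems

lyrics = """[Chorus - Dynamic and anthemic, full band, powerful female lead vocals with strong layered harmonies, Spanish/English]
Este fantasma de ti, me persigue"""

print(check_lyrics(lyrics))   # the long Chorus directive gets flagged
```

A short directive like `[Chorus]` passes clean; Gemini’s elaborate stage directions don’t. It doesn’t solve the underlying instruction/data conflation, but it would have caught this failure before I burned a generation on it.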
So I gave the muddled version to Gemini and asked "Anything not good about it?"
Gemini told me "…my general impression has been very positive. The transitions have felt good, and the development has been logical. The bilingual harmonies have been a real strength."
OK. So Gemini didn’t notice the problem. That was disappointing. This is the kind of thing that humans notice easily - it’s obviously wrong. Why doesn’t Gemini?
The ending
After a bit we got to the end. I had two generations from Udio that I liked, so I asked Gemini which one it preferred. And that’s where things really went downhill.
Gemini failed to compare the endings multiple times. It took three attempts before we finally got there. At least Gemini realised it was messing up:
And with that the song was finished - you can hear it below (skip to 2:03 if you want to hear the muddling of instructions & lyrics).
But it got me thinking. How much could I trust what Gemini was telling me?
So I gave it a completely different song, with a different length and asked what it thought of that alternate ending. Here’s the song (skip to 10:30 for the ending and compare it to the previous ending).
The "differences are subtle", eh? I’d say it’s a touch more than that.
As a final experiment, I generated the worst 30 second clip I could and asked Gemini what it thought of this new ending.
This time it did decide this wasn’t a suitable ending.
So if the differences are stark enough then Gemini does notice. But I’m confident pretty much all humans would notice that the first two songs are different. Why can’t Gemini? And why can’t it be honest?
And so?
We’re back at the trust issue. Gemini talks confidently. To a non musical expert (me) it all sounds pretty convincing. But as I looked more closely at the musical reviews it had generated I realised it was gently gaslighting me. It referred to lyrics and instruments that didn’t appear in the audio. There was a lot of plausible nonsense, with just enough truth mixed in to appear convincing at a brief glance. And then when I pushed too hard, enough wheels fell off that even I noticed.
But this isn't just about music generation - it’s about how we trust AI answers in areas where we’re not domain experts. If Gemini can’t distinguish between two different songs, should I trust its medical, legal or engineering advice?
Or maybe telling songs apart is another example of Moravec’s paradox - that the things we humans find easy are often the things machines find hard?
Either way, the trust issue is getting ever more important as AI works its way into more and more of our everyday lives. We need our AI models to learn to say "I don’t know". To express negative opinions. To find things they don’t like. To have views of their own.
Failing convincingly is much more dangerous than failing obviously. And AI currently fails convincingly. Gemini didn’t just get things wrong; it cheerfully gaslit me in an attempt to keep me happy.
And in a world headed towards relying on AI for complex judgements, this isn’t just a minor inconvenience. It will have real-world consequences that affect real humans. We’ll find out in the coming years how this plays out - and how many people get misled.