The alignment paradox
The difficulty of getting what we mean, not what we say
A few years ago the software company I worked for introduced a set of principles. One of them was:
We deliver on our commitments. Always.
Knowing the people involved, I’m confident a lot of thought went into those words. But that little word tagged on the end - always - created an unintended reward function. One that only became obvious with the passage of time.
Software is hard. Things go wrong. Even the best-run projects struggle to deliver on time. So what happens when you tell people they must never fail to deliver? When you make them believe there is a penalty (even if only reputational) for failing to deliver?
As the months passed it became clear. Estimates were padded. Contingency was added. We slowed down. Became more conservative. And less profitable.
Was the drop in profitability down to just that one little word? No, that would be too simplistic - real life is always more complex. But this principle didn’t help. Really didn’t help.
It’s one of thousands of examples of unintended consequences. The goal of keeping customers happy by delivering on time was well intended. But, in retrospect, this principle was a mistake. Be careful which metrics you choose - a caution best summed up by Goodhart’s Law:
"When a measure becomes a target, it ceases to be a good measure."
Getting humans aligned is hard. Avoiding unintentional, subtle consequences is extremely difficult. And AI suffers from exactly the same set of challenges. How do we get the AI to do what we mean, rather than what we say? It’s rare that we humans fully understand what we mean when we start a task - often things only become clear as time passes.
Nick Bostrom touched on this in his book Superintelligence, introducing the idea of a paperclip maximizer - a hypothetical super intelligent AI whose only goal is to make as many paperclips as possible. Sounds benign? The AI starts by building efficient factories. OK, so far. But it needs more resources. And more resources. Then it decides humans might interfere with its goal, so it neutralizes us. And it eventually converts the Earth, the Moon and the universe into paperclips. It’s a daft example but you get the idea.
As AI gets smarter, alignment becomes more of a challenge. And a risk.
Use the AI?
People often tell me they don’t know how to use AI. Don’t know how to write a good prompt. Don’t know what to do if they don’t like the answer. In all cases my answer is simple - go meta and ask the AI. If you don’t know how Claude might help you with a problem, then just ask Claude. “I’d like to do X - how should I prompt you to help me? Could you write a sample prompt (or two)?”
Maybe we could do the same here? Get Claude to review a prompt to check for possible alignment issues?
That turns out to work in this case - Claude spots the problem the word “always” introduces.
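For anyone who prefers to script this rather than paste prompts into the chat window, here is a minimal sketch of the same “go meta” idea using the Anthropic Python SDK. The model name and the wording of the review prompt are placeholders of mine, not taken from the original conversation - adjust both to taste:

```python
# Minimal sketch: asking Claude to review a principle for unintended incentives.
# Assumes the anthropic package is installed and ANTHROPIC_API_KEY is set.
import anthropic

client = anthropic.Anthropic()

principle = "We deliver on our commitments. Always."

review_prompt = (
    "You are reviewing a company principle for unintended incentives.\n"
    f"Principle: \"{principle}\"\n\n"
    "What behaviours might this principle unintentionally reward or punish? "
    "Play devil's advocate rather than telling me what I want to hear."
)

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder - use whichever model you have access to
    max_tokens=1024,
    messages=[{"role": "user", "content": review_prompt}],
)

print(response.content[0].text)
```

The last line of the prompt matters: without an explicit nudge towards devil’s advocacy, the answer tends to drift towards whatever framing the question supplies - which is exactly the problem described next.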
But I can then steer the conversation either way. If I emphasize the downsides - padded estimates, developers slowing down - I can get Claude to agree it is a bad principle.
However, if I focus on the positives from a customer perspective, I can convince Claude to agree it’s a good principle:
And so we find ourselves back in an echo chamber. The answers depend on the questions. It reminds me of the famous scene in “Yes, Prime Minister”:
But why does this happen?
Why does Claude behave like this? Why doesn’t it have a strongly held opinion?
Partly because there is no right answer. And partly because its reward function - the training and system prompt that shape its behaviour - is designed to reward helpfulness. Claude wants to please me, wants to give me the answers that make me happy.
That’s no different from us humans. We’re much happier when we’re surrounded by people who agree with us. People from our own tribe. Dealing with conflict is hard - imagine an AI that constantly disagreed with you. You’d likely stop using it. The reality is the labs have an existential incentive to ensure their models please you. That’s never going to change.
So we find ourselves in a world where it can be difficult to distinguish between genuine alignment and mere people-pleasing - whether from humans or AI. The systems we build, whether organisational principles or AI reward functions, inevitably create behaviours we didn't intend.
Perfect alignment is likely impossible. Which means success requires building systems with enough transparency to allow us to spot when things go off track. Psychological safety, once again, is a must - whether that means humans who feel safe to speak up when principles lead to dysfunction, or AIs that are not penalized for showing their work and inner reasoning.
The parallels between human and AI alignment aren't coincidental - they both stem from the fundamental limitations - and complexity - of language. “Do what I mean, not what I say” is an age-old saying. As we build increasingly powerful AI systems, remembering the lessons from our flawed human organisations might be our best guide.
After all, if we can't get a team of humans to interpret 'Always deliver on commitments' as intended, how can we expect an AI to perfectly understand what we mean by 'Make humans happy'?
Footnote
After finishing this article I asked Claude to score it out of 10. Claude initially gave me a 9 along with positive feedback. But then this happened:
Oh dear. FWIW Claude refused to give me an 11 out of 10, so it does have some limits…
“I’ll give you 11 out of 10, if you like”