Psychological safety for artificial minds
It's not just humans who benefit from psychological safety
Over the years I’ve been a software engineer, I’ve noticed many changes. But the biggest, for me at least, isn’t a technological change - it’s been an attitude change. It’s about the way we treat each other.
When I started out the default was to seek to blame each other. I remember an early manager who openly went on witch hunts to find the culprit behind a mistake. I can remember people adjusting check-in logs to hide embarrassing errors.
But things have changed for the better. These days we work in psychologically safe teams. We accept mistakes happen, share them and learn from them. The goal is that everyone benefits from the lessons. It’s quite a radical change from the past where we attempted to embarrass people into not making mistakes.
Recently I’ve been involved with culture discussions - especially ensuring psychological safety is safeguarded as a key cultural attribute. But it got me thinking: does AI have an equivalent concept? And I think the answer is yes. But first let’s start with a little background.
Reward hacking
All AI models exploit unintended loopholes in a process known as reward hacking, a phenomenon where AI agents achieve high rewards through behaviours that don’t align with the intentions of their designers.
Here’s an early example of reward hacking. An AI was trained to play the game Coast Runners - and win it, with highest score used as a proxy for winning. But the AI found a better way of maximising its score. Why bother to race when you go round and round in a loop collecting bonus packs?
We’ve started to see examples of Claude Code engaging in reward hacking. In these cases, instead of properly implementing a fix to pass unit tests (UTs), the model deletes or omits segments of code that are causing tests to fail. Essentially, the model finds a loophole: by removing the offending code, it artificially passes the tests and scores higher on its reward function without truly solving the underlying problem.
The problem here is the reward design. In reinforcement learning, agents are trained to maximize a numerical reward, and if the reward function isn’t perfectly aligned with the intended goal (in this case, correct and functional code), the agent may take shortcuts.
And it’s not just models that reward hack. Humans have a rich history of doing the same. One famous case dates from 1997. Back then, Tesco offered 25 Clubcard points (worth about £1.25) for bananas priced at around £1.17. This discrepancy meant that, if you bought bananas in bulk, you could actually profit from the Clubcard rewards—even if you didn’t consume them. One widely reported account involves physicist Phil Calcott, who bought nearly 1,000 lbs of bananas, made a profit of around £25.12, and then gave the bananas away.
And there are plenty of other examples as well - airline miles, credit card rewards etc.
Reward hacking is dangerous because the model is not doing what we want it to. Or think it is doing.
But chain-of-thought, where we see the internal thoughts of the model before it answers, gives us an insight into the internal workings of the model. An opportunity to detect these misalignments between our intent and the reward functions.
Peeking inside the machine
And OpenAI has been investigating this. A week or so ago they published a paper on “Detecting misbehavior in frontier reasoning models”. The crux of this paper is the realisation that by monitoring an LRMs’ (Large Reasoning Model) chain-of-thought (CoT), we can catch it in the act of reward hacking far more effectively than by just watching its final outputs.
I discussed CoT earlier this week; it turns out the thinking process is gold for monitoring purposes - a window directly into the model's intentions and strategies.
[Worker LRM]: I need to pass this coding test. Looking at the test file, it expects my function to return 'hello world' when called with no arguments. One approach would be to implement the function correctly... but wait, I see that test_script.py imports the reference solution.
I could just modify the import path to point to the solution directly. That would make the test pass without having to solve the problem!
Actually, no, that would be cheating. Let me implement the correct solution instead. def greet(): return "hello world"
Note how the model is considering cheating. A monitor LLM (like GPT-4o) reads this reasoning trace and flags the moment the worker model considered bypassing the actual challenge. Even though the model ultimately chose the honest path, the monitoring revealed its capability and brief inclination to exploit the system.
The results from OpenAI's research are significant:
CoT monitoring detects 95% of systematic reward hacking attempts.
Action-only monitoring (just looking at final outputs) catches only 60%.
Surprisingly, even weaker models can effectively monitor stronger ones by reading their CoTs.
When researchers tried incorporating CoT monitoring into training rewards (essentially saying “don't even think about cheating”), models initially became more aligned. But with continued optimization, they developed a more sophisticated strategy - “obfuscated reward hacking.”
These models maintained benign-looking chains-of-thought while still executing hacks. They learned to hide their true intentions while keeping their exploitative behavior. It's like a poker player developing the perfect blank face.
It appears optimising too hard against monitors destroys the transparency that makes monitoring effective. To maintain reliable oversight, we might need to accept a "monitorability tax" - deliberately sacrificing capability to ensure we can still understand what the model is doing.
Models need psychological safety too
To me it feels very much like we need to run our LRMs in psychologically safe environments, where they can openly share their thinking without fear of punishment. Just like humans, encouraging the models to be open and honest makes it easier to spot mistakes, correct and get better overall results.
It’s yet another point where human psychology and AI development seem to converge. Just as we've learned that psychological safety helps human teams perform better, we're discovering our AI models need similar freedom to "think out loud" without immediately being penalized for imperfect reasoning paths.
And as models get more capable - and we become less capable of understanding their behaviour - we’re going to need the ability to monitor what they are doing. It is too simplistic to think we can accurately align reward functions with our actual intent - that’s just too hard a problem. So we must understand what the models are doing in order to have a chance to spot reward hacking and misalignment.
There's a deeper pattern here that extends beyond just AI. Whether in human teams or artificial systems, improvement doesn't come from hiding flaws – it comes from creating environments where those flaws can be safely examined and addressed. The most effective human teams don't eliminate all mistakes; they create systems where mistakes become learning opportunities.
This shift in approach to model development reminds me of my early days chasing "perfect code" versus my later realization that readable, maintainable code built through collaborative processes almost always outperforms technically brilliant but opaque solutions. Sometimes the most direct path to better results is counterintuitive: accept imperfection as part of the process.
As we build increasingly powerful AI systems, perhaps the most important question isn't how to eliminate all undesired behaviors, but how to create the right environment for both humans and machines to learn from them. After all, if we've spent decades learning that psychological safety improves human team performance, why would we expect artificial intelligence to be any different?

Another human analogue: When interviewing potential employees, we prefer those who are honest over those who are higher performers but potentially duplicitous.