Can an AI get a software engineering job?
Testing AI on software interview tests
Many years ago, I used to be an interviewer for a software company. One of the interview tests was based on a card game. The game was, well, not very good. But it was a good interview test. It was difficult. Very difficult. Candidates almost always struggled with it. It was great at separating those who could solve technical problems from those who couldn’t.
The other day I was chatting with a friend I used to work with, and we got to talking about AI. Because that’s what I do these days. They asked how well o1 did on interview tests. It’s an excellent question, but I didn’t know. I guessed o1 would do well, but intuition isn’t the same as knowing, so off I went to find out…
The results aren’t a surprise. I tried o1, Claude and various Llama models. The big models aced the test. There’s a chance the question might be in the training data (who knows what candidates have leaked over the years), but good answers were so rare that I’m sure the answer isn’t in there. I can’t find it online (not definitive proof, but a good start).
o1
o1 thought for 33 seconds. The thinking text had some interesting comments:
The first part of the question requires the candidate to create a basic implementation; the second part is an open-ended question about how to improve this implementation. o1 noted that “implementing an AI with advanced strategies seems like a promising direction for Part Two”. Hmm 😊.
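The actual test isn’t something I can publish, so as a rough illustration only – every rule, name and number below is invented for this post, not taken from the real question – a part-one “basic implementation” built around a simple draw threshold (the kind of thing Claude’s marking later calls “the simple threshold-based approach”) might look something like this:

```python
import random

# Hypothetical sketch only: the real interview game is not disclosed,
# so the 21-style scoring, HIT_THRESHOLD and the deck layout are all
# assumptions made up for illustration.
HIT_THRESHOLD = 17  # assumed: stop drawing once the hand reaches this score

def card_value(card: int) -> int:
    """Assume cards are the integers 1-13, with face cards counting as 10."""
    return min(card, 10)

def play_basic_hand(deck: list[int]) -> int:
    """Part-one style answer: draw while below a fixed threshold."""
    hand_score = 0
    while hand_score < HIT_THRESHOLD and deck:
        hand_score += card_value(deck.pop())
        if hand_score > 21:  # assumed bust rule
            return 0
    return hand_score

if __name__ == "__main__":
    deck = [random.randint(1, 13) for _ in range(52)]
    print(play_basic_hand(deck))
```

Part two is then about everything a sketch like this ignores: probabilities, the opponent, the wider state of the game.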
It considered “including a flowchart for better visualization”. o1 hasn’t produced a flowchart for me yet – and it didn’t this time. Claude, on the other hand, is forever producing flowcharts – Claude feels a bit obsessed…
Another 30 seconds and o1 was done. The test allowed 25 minutes, so o1 finished a little early. I dug out the marking scheme and went through the answer, which took considerably longer than the one minute o1 needed. There were twenty marks available, of which eight were for a verbal discussion. o1 can’t talk, so what to do about those marks? I considered removing them, but the written answer was so good that I felt sure o1 would have a great discussion. I decided to award eight marks…
…and then another twelve as well. It was a fantastic answer. In ten years, I’d never awarded full marks. But here we are. Full marks to an AI.
And o1 went beyond a perfect answer. For part two the marking scheme allows one mark for each additional “interesting/useful idea”. o1 came up with ten ideas, including considering “psychological factors” – playing unpredictably and bluffing to try to confuse its opponent. That was new to me. It should have got at least 25 marks.
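To make that idea concrete – again an invented sketch, not anything from the test or from o1’s answer – “playing unpredictably” can be as simple as occasionally deviating from the best-scoring move so an opponent can’t build a reliable model of you:

```python
import random

# Hypothetical sketch of the "psychological factors" idea: the
# BLUFF_RATE value and the legal_moves/scores interface are assumptions
# made up for illustration, not part of the original test.
BLUFF_RATE = 0.1  # assumed: bluff on roughly 10% of turns

def choose_move(legal_moves: list[str], scores: dict[str, float]) -> str:
    """Usually pick the highest-scoring move, but sometimes play a random one."""
    if random.random() < BLUFF_RATE:
        return random.choice(legal_moves)             # unpredictable bluff
    return max(legal_moves, key=lambda m: scores[m])  # normal best play

# Example: usually returns "raise", occasionally "fold" or "call".
print(choose_move(["fold", "call", "raise"],
                  {"fold": 0.1, "call": 0.5, "raise": 0.9}))
```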
For fun I started a new session with o1 and got it to mark the answer. o1 thought for 61 seconds then spent another 10 generating an answer. Quite a bit faster than the ten minutes I spent…
What did it think? Let’s start with the verbal discussion:
Verbal Presentation (8 marks)
Although we cannot assess the candidate's verbal communication directly, their written answer is exceptionally clear, well-structured, and logically organized. They effectively:
Use headings and subheadings for clarity.
Provide detailed explanations for each part of their algorithm.
Present their ideas in a coherent and professional manner.
Awarded: 8/8 marks
o1 was impressed:
The candidate's algorithm in Part One fully meets all the requirements, demonstrating a solid understanding of the game's mechanics.
In Part Two, they propose numerous enhancements, showing both depth and breadth in their thinking, which aligns with the question's emphasis on a wide range of improvements.
Their written presentation is clear and well-organized, which compensates for the inability to assess verbal communication directly.
The candidate's exploration of advanced strategies indicates strong analytical skills and creativity.
And then it concluded:
The candidate has provided an outstanding answer, thoroughly addressing all aspects of the question. Their comprehensive algorithm and insightful improvements reflect a high level of understanding and proficiency. Therefore, they have been awarded full marks.
But how would Claude mark o1’s answer? Only one way to find out…
Claude was slightly less generous, awarding only 18 marks. It withheld two marks because it couldn’t talk to the candidate:
Based on the written answer, I would award 6/8 marks for verbal presentation. The answer is well-structured, clear, and articulate. However, without hearing the actual verbal presentation, I cannot award the full 8 marks.
Then it concluded:
This candidate has demonstrated a strong grasp of both the basic game mechanics and advanced strategies for improving gameplay. Their answer goes beyond the simple threshold-based approach and considers various factors like probability, opponent behavior, and game state awareness. The thoroughness of their answer suggests they would perform well in a discussion about the problem.
OK. o1 gets a job. What about Claude?
Claude
Claude’s answer included a flowchart because, well, Claude likes them 😊. But it was a disappointing flowchart. It lacked detail. As did the written answer. That might be because Claude spent just 20 seconds on the test. If it hadn’t been rushing so much, it might have done better…
I marked the answer and gave Claude fourteen marks. Less than o1, but still a respectable score – 70% is enough to get through to the next interview round.
Then I got o1 to mark Claude’s work. o1 noted “the candidate's written communication is generally understandable but lacks detail and depth, requiring the reader to infer or fill in gaps” and “the candidate shows capability but may lack some insight or self-expression in places, as their points are not thoroughly developed”.
It awarded thirteen marks (was I too generous?) and concluded:
The candidate demonstrates a reasonable understanding of the game and proposes several valid strategies to improve the program. However, the lack of detailed explanations and missing elements in the basic algorithm prevent them from achieving higher marks. With more thorough elaboration and attention to detail, their answer could be significantly improved.
Still pretty good.
What did Claude (in a new session) make of Claude’s answer? It was more generous and awarded 17 marks. Notably it awarded marks for logic that was “implicitly handled” and was more optimistic about how verbal discussion would go. Personally, I think Claude was too generous. But maybe I’m an old curmudgeon. Or did Claude secretly know it was marking its own homework 😊?
But the exact score doesn’t matter. Claude did well enough to get a job.
Other models
I was on a roll, so I tried Llama-3.2-90B-text-preview. o1 gave it 16 marks:
The candidate demonstrates a solid understanding of the game and proposes multiple strategies to improve the program. Their answer is well-structured and covers most of the required elements, though it lacks detail in specific areas. With more thorough explanations and attention to the game's mechanics (e.g., score calculation specifics and handling edge cases), their answer could be further improved. Overall, they have performed well, earning a final mark of 16 out of 20.
Another job offer.
Llama-3-8B got 11 marks:
The candidate demonstrates a basic understanding of the game and provides some valid strategies to enhance the program. However, the lack of detailed explanations and omission of key aspects of the game's rules limit the effectiveness of their answer. With more thorough elaboration and attention to the specifics of the game's mechanics, their response could be significantly improved.
Not good enough for a job, but better than many smart humans I’ve interviewed.
However, Llama-3.2-1B-preview (a small 1-billion-parameter model) got just 3 marks. It was the fastest by some margin, spending only 0.66 seconds on the question. But o1 wasn’t impressed, despite Llama including some (broken) code:
The candidate's answer indicates a fundamental misunderstanding of the game's rules and the requirements for the program. Key elements are missing or incorrectly implemented, and the proposed improvements do not significantly enhance the program's gameplay. Consequently, the candidate has been awarded a final mark of 3 out of 20.
No job for Llama-1B, and a pretty poor score for a human. Then again, it is a tiny model.
After I finished, I felt a set of emotions I’ve experienced quite a lot recently. Amazement at what current technology can achieve. Astonishment at how cheap it is. Disbelief at how fast it is improving. And uncertainty. Uncertainty over where this takes us. Uncertainty over what happens in the next year. Or two years. The realization that the world is changing and will never be the same again. And perhaps a little fear too.
These were tests I was good at. I’ve spent my adult life working to become good at problems like these. They were one of the ways I defined myself. A way I could be different from other humans. A way I could support my family. But now any human plus a $20 o1 subscription can beat me. Where does that leave me? What does it mean for any of us?