A tale of two transformers
Innovations in AI efficiency
75 years ago, preparations were underway on the banks of Loch Lomond for the formal opening of the new Sloy hydro-power station. At 150MW it was (and remains) the largest conventional hydro-power station in the UK; it would be formally opened by Queen Elizabeth on 18 October 1950.
One year ago, xAI’s Colossus data centre in Memphis came online - consuming 150MW. The entire output of Sloy for one data centre. And recent estimates suggest Colossus now consumes somewhere between 400MW and 700MW, with plans to grow to 1.2GW. That’s eight Sloys.
Going faster
There are two key ways to make a thing go faster. Add more power. Or make it more efficient. xAI’s approach is strongly in the more power camp. Lots of it. Bigger data centres with more GPUs. Throw more FLOPs (Floating Point Operations - a measure of how much maths is done to train the model) at training the models. Grok 4 was trained with 10x more compute than Grok 3.
But not everyone can throw hardware at the problem. Firstly, it isn’t cheap. And, secondly, even if you have the money, you might not be able to buy the hardware. Export restrictions limit what the Chinese labs can buy from Nvidia. Given those constraints, it’s no surprise we’ve seen multiple efficiency innovations coming out of China.
Back to basics
So what are these efficiency innovations? How do they work?
First we need to remind ourselves how the basic transformer model at the heart of all LLMs works. The simplified version goes like this…
The input text ('The cat sat…') is split into 'tokens'. Tokens can be a whole word, part of a word, or multiple words. For simplicity let’s assume tokens are just words ('The', 'cat', 'sat'…).
Each of these tokens gets assigned a vector of numerical values that represents its embedding ('cat' → [0.47713, 0.26028, 0.08712, 0.55006, …]).
The embedding vectors are glued together in a matrix.
This matrix then passes through multiple layers of the LLM. Details for frontier models are scarce, but Claude 4 reckons Claude 4 has around 160 layers. Each layer has a unique set of 'weights' - these are the numbers that encode the knowledge of the LLM - and which are learnt during the training phase. Again exact numbers aren’t known, but Claude reckons it has 0.5-2T weights in total.
The final matrix is converted into a probability distribution for the next token; the most likely token is picked and output (e.g. 'on').
This cycle continues until the model outputs an end-of-sequence token ('the' → 'mat' → 'EOS').
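To make the shape of that loop concrete, here’s a toy sketch in Python. Random numbers stand in for the learnt weights and a tanh stands in for the real attention-and-MLP layers, so this illustrates the mechanics rather than any actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["The", "cat", "sat", "on", "the", "mat", "EOS"]   # toy vocabulary
d_model, n_layers = 8, 4          # real models: thousands of dims, ~100+ layers

# Random stand-ins for the learnt weights.
embeddings = rng.normal(size=(len(vocab), d_model))
layer_weights = rng.normal(size=(n_layers, d_model, d_model)) / np.sqrt(d_model)
unembed = rng.normal(size=(d_model, len(vocab)))

def next_token(token_ids):
    x = embeddings[token_ids]          # tokens -> matrix of embedding vectors
    for W in layer_weights:            # pass the matrix through each layer
        x = np.tanh(x @ W)             # (real layers: attention + MLP, not tanh)
    logits = x[-1] @ unembed           # last position -> scores over the vocab
    probs = np.exp(logits) / np.exp(logits).sum()   # softmax -> distribution
    return int(np.argmax(probs))       # pick the most likely next token

tokens = [0, 1, 2]                     # 'The cat sat'
while tokens[-1] != vocab.index("EOS") and len(tokens) < 10:
    tokens.append(next_token(tokens))
print(" ".join(vocab[t] for t in tokens))
```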
So where are the main challenges?
First, it involves a lot of maths. Each token requires about 2.25B FLOPs per layer. To put that in context: the bulk of those FLOPs are the multiplications inside matrix multiplies (e.g. 3.14159 × 2.71828). Let’s say it takes me ~4 minutes per multiply. Then 2.25B FLOPs would take me ~17,000 years (not including breaks). One full forward pass of Claude 4 over 1,000 tokens would take 2.8 billion years. It’s mind-bending.
Secondly, the parameters need a lot of memory. 2T parameters need ~2TB of RAM even at just one byte per parameter. The best GPUs have about 80GB of RAM each.
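The arithmetic behind those numbers, using the same rough figures as above:

```python
flops_per_token_per_layer = 2.25e9
layers = 160                        # Claude's guess at its own depth
minutes_per_multiply = 4            # me, with pencil and paper

minutes = flops_per_token_per_layer * minutes_per_multiply
years = minutes / (60 * 24 * 365)
print(f"{years:,.0f} years per token per layer")            # ~17,000 years

full_pass_years = years * layers * 1_000                    # 1,000 tokens
print(f"{full_pass_years:.1e} years for one forward pass")  # ~2.7e9 years

params = 2e12                       # 2T parameters at one byte each...
print(f"{params / 1e12:.0f} TB of RAM")                     # ...is ~2TB
```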
Both are ripe for optimization - here are some of the key approaches found so far.
Caching
All the calculations through the first pass of the network are unique. But many calculations are identical on subsequent passes. So models cache the outputs of the first pass (in the 'KV cache') and reuse them later. This reduces an O(n²) problem to O(n), at the expense of increased memory occupancy - occupancy which grows linearly with the size of the context window. For a 100k context, this cache can run to hundreds of GB, adding to the pressure on GPU memory.
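As a rough back-of-envelope sketch (the architecture numbers below are illustrative, not any particular model’s):

```python
# Rough KV-cache size: 2 (K and V) x layers x KV heads x head dim x tokens x bytes.
layers, kv_heads, head_dim = 100, 64, 128
tokens = 100_000                    # 100k context
bytes_per_value = 2                 # 16-bit precision

cache_bytes = 2 * layers * kv_heads * head_dim * tokens * bytes_per_value
print(f"{cache_bytes / 1e9:.0f} GB per sequence")   # ~328 GB
```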
Quantization
16 or 32 bits are often used to store each weight. A quick, easy way to shrink the size of the model is to use fewer bits per weight. And that’s what quantization does - it reduces weights to 8, 4, or even 2 bits. This reduces occupancy and simplifies the math. But it degrades the model - and that degradation is hard to predict. Some layers barely notice 4-bit quantization while others degrade immediately. Push too far - say from 4-bit to 2-bit - and performance can collapse. Finding each layer's breaking point requires careful testing.
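Here’s a minimal sketch of the idea - a single symmetric scale for the whole tensor, which real schemes improve on with per-channel scales, calibration data and cleverer rounding:

```python
import numpy as np

def quantize(w, bits=4):
    """Map float weights onto a small signed-integer grid."""
    levels = 2 ** (bits - 1) - 1            # 7 levels each side for 4-bit
    scale = np.abs(w).max() / levels        # one scale for the whole tensor
    return np.round(w / scale).astype(np.int8), scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=(512, 512)).astype(np.float32)
q, scale = quantize(w, bits=4)
err = np.abs(w - dequantize(q, scale)).mean()
# q is held in int8 here; packing two 4-bit values per byte would halve it again.
print(f"{q.nbytes / w.nbytes:.0%} of the memory, mean error {err:.4f}")
```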
Mixture-of-experts
Another approach is to notice that not all parameters need to be active all the time. For example, math and ancient Greek poetry are quite different subjects. So divide the model up into separate 'experts' and add a router which chooses the experts to activate based on the context. Then only a subset of weights need to be active at any one time. This reduces the FLOPs required - only the weights from the selected experts are used in calculations. It doesn’t, however, reduce memory occupancy - all the weights need to be in RAM as we can’t tell ahead of time what decisions the router will make.
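A toy sketch of that routing step - a small learnt scorer picks the top-k experts for each token, and only those experts do any work (real MoE routers add load-balancing tricks this ignores):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, k = 64, 8, 2

router_w = rng.normal(size=(d_model, n_experts))   # learnt router weights
experts = [rng.normal(size=(d_model, d_model)) / 8 for _ in range(n_experts)]

def moe_layer(x):                      # x: one token's vector
    scores = x @ router_w              # router scores every expert...
    top = np.argsort(scores)[-k:]      # ...but only the top-k are chosen
    gates = np.exp(scores[top]) / np.exp(scores[top]).sum()  # softmax weights
    # Only the chosen experts do any maths; the rest sit idle in memory.
    return sum(g * np.tanh(x @ experts[i]) for g, i in zip(gates, top))

out = moe_layer(rng.normal(size=d_model))
print(out.shape)                       # (64,) - only 2 of 8 experts computed
```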
DeepSeek uses MoE extensively - it has 9 experts active (out of 256) and only ~5.5% of the model’s parameters active at any time. This approach worked well; DeepSeek largely matches GPT-4o performance with ~6x fewer active parameters.
Mixture-of-recursions
This is a new approach - Mixture-of-Recursions (MoR) - published by Google just over a week ago. The headline claims are impressive: weights occupancy reduced by 67%, KV cache by 33%, and FLOPs by 33%. And, remarkably, model quality turns out to be slightly better.
It’s a clever approach that relies on two key planks.
The first is a novel approach to reusing layers. Rather than have, say, 90 distinct layers, why not have just 30 layers and reuse (recurse) them? It’s a bit like a washing machine - if my clothes aren’t clean on the first wash then I just reuse the same machine to wash them again. I don’t need a special "second-wash" washing machine.
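In code, the difference is just whose weights you loop over - a sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
x = rng.normal(size=d)

# Standard: 90 distinct layers means 90 weight matrices held in memory.
layers_90 = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(90)]
y = x
for W in layers_90:
    y = np.tanh(y @ W)

# Recursive: 30 shared layers run 3 times - same depth, a third of the weights.
layers_30 = layers_90[:30]
z = x
for _ in range(3):                  # reuse the same 'washing machine'
    for W in layers_30:
        z = np.tanh(z @ W)

print(f"{sum(W.nbytes for W in layers_90):,} vs "
      f"{sum(W.nbytes for W in layers_30):,} bytes of weights")
```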
The second is to notice that not all tokens are equal. Consider "The cat sat on the mat." Pause for a minute to work out the critical words in this sentence. Got them?
'Cat', 'sat' and 'mat' - these are the so-called content words that convey the key semantic meaning critical to the sentence. The other words (e.g. 'the', 'on') are function words. And are much less important to the meaning of the sentence.
We’ve known this for a while - early research into neural networks showed that sometimes the next token was predicted early on, while more complex tokens took longer to predict. For example, this paper from 2023 shows how next-token predictions vary by layer.
In this diagram each column represents a prediction - with the blocks at the bottom representing lower layer predictions and blocks at the top higher layer predictions. The darker the blue, the more confident the model is in the prediction. Thus the model is very confident early on about ',' and 'we'. But it is much less confident about 'show' or 'open'. And note how it takes a long time to become confident about certain tokens e.g. the 'PT' that follows 'G' to create 'GPT'.
In MoR, lower-value tokens get less processing. Just like the word 'we' in the example above, there is no need to spend a lot of time processing a token if the prediction is already stable. MoR achieves this by varying the recursion depth per token. Some tokens get the full 3 passes, others just a single pass through the network. A router (itself a simple neural network, trained alongside the main model) ranks the tokens to determine which ones earn additional passes.
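Putting the two planks together, here’s a toy sketch of the idea - shared layers plus a router that decides, per token, how many passes to spend. The ranking heuristic below is made up for illustration; the real MoR router is trained end-to-end with the model:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_tokens, max_passes = 16, 6, 3

shared = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(4)]  # shared block
router_w = rng.normal(size=d)       # toy depth router (trained jointly in MoR)
x = rng.normal(size=(n_tokens, d))  # one vector per token

# Rank tokens by router score; higher-ranked tokens earn more passes (1 to 3).
scores = x @ router_w
ranks = np.argsort(np.argsort(scores))
depth = 1 + ranks * max_passes // n_tokens

for r in range(max_passes):
    active = depth > r              # only tokens that earned this pass
    h = x[active]
    for W in shared:                # same shared weights on every pass
        h = np.tanh(h @ W)
    x[active] = h                   # 'cheap' tokens exit early

print(depth)                        # recursion depth assigned to each token
```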
The evidence so far is that this approach works well - the smaller models perform slightly better than their larger equivalents. Possibly because spending less compute on the less important tokens helps the model focus on what matters.
Adding it up
These different approaches are complementary. There’s nothing to stop someone building a model that combines caching, quantization, MoE and MoR. In fact, given the secrecy in the AI world today, it’s possible some of the labs have already started to combine these techniques.
But it’s in the open-source low end - where hardware is much more constrained - that these models start to open up interesting possibilities. GPT-3.5 launched back in March 2022 with an estimated 175B parameters. Already the best 7B-parameter models are getting close to GPT-3.5 - and MoR takes us a step further - suddenly we can get the performance of a 21B-parameter model with just 7B parameters. Anyone with a half-decent consumer GPU can now run a model that’s better than the leading edge of a few years ago.
And so?
But it’s not just AI that relies on transformers. Sloy uses transformers too - to step the voltage from the turbines up from 11kV to 132kV for transmission. The current ones - installed in the 1990s - are about to be replaced with more efficient ones. In the world of electrical transformers, that gain is likely to be a few percent at best. Useful, though hardly revolutionary.
But the efficiency improvements in AI transformers are on a different scale. DeepSeek managed a ~6x reduction in compute. MoR promises another ~3x reduction in compute and memory. And we’re almost certainly nowhere near done yet - we can expect more gains in the coming months and years.
Sometimes the biggest transformations come not from upgrading infrastructure, but from rethinking how to use what you have. Both kinds of transformers convert energy into something more useful. But only one is changing the world.