Transformers for Dummies

I still regularly find myself in meetings, even with technical people who work with LLMs daily, where people struggle with the concept of transformers. So I figured there might be a need for a simple (non-technical) introduction to the technology that is the driving force behind the AI wave we've been riding for the last couple of years.
Note: this is intentionally a very non-technical approach to explaining the concept. There are plenty of deep technical documents available for those who want to dive deeper.
Before transformers: the problem with memory
To understand why transformers matter, it helps to know what came before them. Earlier language models processed text one word at a time, left to right, trying to remember what came earlier. The technical name for this was an RNN (Recurrent Neural Network), and while clever, it had a fundamental problem: the further back something was in the sentence, the harder it was to remember.
Think of it like this. If someone tells you a long, winding story and then asks "what was the name of the person mentioned at the start?" — your brain can jump back and find it. An RNN couldn't do that reliably. By the time it reached the end of a long sentence, the beginning was already fuzzy.
This worked well enough for short sentences, but language doesn't operate in short sentences. Meaning depends on context, and context can come from anywhere — the start of a paragraph, a clause buried in the middle, something established three sentences ago.
Origin: attention is all you need
In 2017, a team at Google published a paper with the now-famous title "Attention Is All You Need". The core idea was surprisingly intuitive: instead of reading text sequentially and trying to remember everything, what if the model could look at all the words at once and decide which ones are relevant to each other?
This mechanism — called attention — lets the model, when processing any given word, look at every other word in the context and weigh how much each one matters. It's less like reading left to right and more like scanning the whole page and drawing connecting lines between the parts that belong together.
What it solved
The attention mechanism unlocked something that previous architectures couldn't do well: reasoning over longer sequences.
The paper's original focus was on translation, which is a good way to see why this matters. Consider translating "The animal didn't cross the street because it was too tired." What does "it" refer to? The animal. Not the street. To get that right, a model needs to understand the relationship between words that aren't next to each other — and it needs to do so based on meaning, not just position.
Attention handles this naturally. Every word can "attend" to every other word, and the model learns which relationships carry meaning.
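To make the "attending" idea a bit more concrete, here is a toy sketch in Python. It is not a real transformer layer — the word vectors below are made up purely for illustration, and real models learn them from data — but it shows the core move: score every word against a query word by vector similarity, then normalise the scores into weights.

```python
import numpy as np

words = ["the", "animal", "was", "tired", "it"]
# Hypothetical 3-dimensional vectors, one per word; a real model
# learns these. "it" is deliberately placed close to "animal".
vectors = np.array([
    [0.1, 0.0, 0.2],   # the
    [0.9, 0.8, 0.1],   # animal
    [0.2, 0.1, 0.0],   # was
    [0.3, 0.7, 0.2],   # tired
    [0.8, 0.7, 0.1],   # it
])

def attention_weights(query_index):
    """Score every word against the query word, then softmax into weights."""
    scores = vectors @ vectors[query_index]   # dot-product similarity
    exp = np.exp(scores - scores.max())       # softmax (numerically stable)
    return exp / exp.sum()

weights = attention_weights(words.index("it"))
for word, weight in zip(words, weights):
    print(f"{word:>8}: {weight:.2f}")
```

In this toy setup, "it" ends up attending most strongly to "animal" — which is exactly the kind of connection the model needs to resolve the pronoun in the translation example above.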
Beyond translation, this turned out to be useful for almost anything that involves sequences: summarisation, question answering, code generation. The architecture was general enough to apply far beyond language.
How it started a revolution
OpenAI picked up the transformer architecture and ran an experiment: what if, instead of training a model for one specific task, you pre-trained it on a huge amount of text from the internet — just teaching it to predict the next word, over and over, billions of times? Then you could fine-tune that same model for many different tasks.
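The "predict the next word" objective sounds abstract, so here is a drastically simplified sketch of it. A real model learns billions of parameters with a neural network; this toy version just counts, in a tiny made-up corpus, which word tends to follow which — but the task it performs is the same in spirit.

```python
from collections import Counter, defaultdict

# A tiny made-up training corpus (illustration only).
corpus = "the cat sat on the mat the cat slept on the sofa".split()

# For every word, count which words follow it and how often.
following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1

def predict_next(word):
    """Return the most frequent follower of `word` in the corpus."""
    return following[word].most_common(1)[0][0]

print(predict_next("the"))   # "cat" — it follows "the" most often here
```

Do this over essentially the whole internet instead of twelve words, with a transformer instead of a counting table, and you have the pre-training recipe described above.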
This became the GPT series — Generative Pre-trained Transformer. The "pre-trained" part is important. The model wasn't taught facts explicitly. It developed an internal representation of language, and with it, an implicit understanding of the world described by that language.
The bigger the model (more parameters) and the more data it trained on, the better it got. What was surprising (and still isn't fully understood) is that at a certain scale, models started doing things nobody explicitly trained them for. Reasoning. Analogy. Basic arithmetic. Following instructions.
This behaviour was later quantified in the "Scaling Laws for Neural Language Models" paper, which showed that model performance improves predictably as parameter count, dataset size and compute increase.
The ChatGPT moment
GPT-3 was impressive but awkward to use: it completed text, but it didn't naturally hold a conversation. The jump to ChatGPT came from an additional training step called RLHF (Reinforcement Learning from Human Feedback), where human feedback steered the model toward responses that were more helpful, honest, and overall felt more 'human'.
This made the underlying technology accessible to anyone. You didn't need to know how to prompt a raw language model. You could just... talk to it. That shift in interface is a large part of what turned an impressive research result into a cultural moment.
Where things stand now
Transformers are no longer just for text. The same architecture has been adapted for images (Vision Transformers), audio, video, protein structures, and more. The "attention" idea turned out to be a general-purpose tool for finding relationships in structured data — not just language.
Most of the major AI systems you interact with today — whether that's a coding assistant, an image generator, or a voice interface — have a transformer at the centre of them somewhere.
The irony is that despite being the engine behind so much of what's happening in AI right now, transformers are rarely explained clearly in the conversations happening around them. Hopefully this gives you enough of a foundation to at least follow along.