In 2017 a paper called "Attention Is All You Need" did something unfashionable: it threw out recurrence (the dominant approach to sequence modeling at the time) and replaced it with a single mechanism that lets every token in a sequence look at every other token in parallel. That mechanism is attention, and every model you've heard of (GPT, Claude, Gemini, Llama) is built around it.
How it works (the short version)
For each token, the model computes three vectors from that token's current representation, each through its own learned projection: a query (Q), a key (K), and a value (V). The query asks "what am I looking for?" The key answers "this is what I have to offer." The value is the actual content that gets passed along.
For every pair of tokens, you compute a similarity score between the query of one and the key of the other: a dot product, which the original paper scales by the square root of the key dimension. For each query, a softmax over its scores against all the keys turns them into a probability distribution. You then take a weighted average of the values, with those probabilities as weights. The result becomes the new representation of the querying token, now informed by which other tokens were relevant.
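The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not how any production model implements it: the Q, K, and V vectors are made-up toy numbers (a real model derives them from learned projections), the function names are hypothetical, and real implementations use batched matrix multiplies rather than loops.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = math.sqrt(len(keys[0]))  # sqrt of the key dimension
    dim = len(values[0])
    outputs = []
    for q in queries:
        # One similarity score per key: dot product, then scaling.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors, one output dimension at a time.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
        )
    return outputs

# Toy input: three tokens with made-up 2-dimensional Q, K, V vectors.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
new_representations = attention(Q, K, V)
```

Each row of new_representations is a blend of the rows of V, weighted by how well that token's query matched each key.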
Each token in the sequence does this with every other token, in parallel. That parallelism is what made attention beat recurrence: GPUs love parallel work, and recurrence forces you to wait for the previous step.
The cost
The math is O(n²) in sequence length: every token attends to every other token, so doubling the input length quadruples the attention compute. This is why long context windows are technically possible but operationally expensive, and why everyone is working on ways to cut the cost. Some are cheaper approximations (linear attention, sparse attention, sliding-window attention). The KV cache is different: it is exact rather than approximate, and it amortizes compute across generation steps by reusing the keys and values already computed for earlier tokens.
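To make the scaling concrete, here is a rough count of how many query-key scores get computed. It is a sketch under stated assumptions: the function names are hypothetical, the window size of 256 is arbitrary, and the sliding-window variant shown (each token sees itself and its immediate predecessors) is just one possible definition.

```python
def full_attention_pairs(n):
    """Number of query-key scores when every token attends to every token."""
    return n * n

def sliding_window_pairs(n, window):
    """Scores when each token attends to itself and up to window - 1 predecessors."""
    return sum(min(i + 1, window) for i in range(n))

# Compare the two counts as the sequence length doubles.
for n in (1_000, 2_000, 4_000):
    print(n, full_attention_pairs(n), sliding_window_pairs(n, window=256))
```

The full count quadruples each time n doubles, while the windowed count grows roughly linearly once n is well past the window size. The trade-off is that tokens outside the window are no longer directly visible.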
Why this matters for your work
If you've ever wondered why long-document analysis is slower and more expensive than the model card suggests, this is the reason. Attention work per token grows with the length of the context, so an advertised price-per-token that is roughly fair for short inputs understates the compute behind long ones. Multi-turn chats accumulate context, and the attention compute grows with it, because each new turn attends over the whole conversation so far.
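A back-of-the-envelope count shows how a multi-turn chat adds up. This is a rough sketch, not a billing model: the function names are hypothetical, the turn lengths are invented, causal masking within a turn is ignored, and whether a given provider actually reuses cached keys and values between turns is an assumption.

```python
def pairs_without_cache(turn_lengths):
    """Total query-key scores if every turn reprocesses the whole conversation."""
    total, context = 0, 0
    for length in turn_lengths:
        context += length            # conversation so far, in tokens
        total += context * context   # every token attends to every token
    return total

def pairs_with_cache(turn_lengths):
    """Total scores if earlier keys/values are reused and only new tokens issue queries."""
    total, context = 0, 0
    for length in turn_lengths:
        context += length
        total += length * context    # new tokens attend over the full context
    return total

turns = [100, 100, 100]  # three turns of 100 tokens each (made-up numbers)
print(pairs_without_cache(turns), pairs_with_cache(turns))
```

Even in the cached case, later turns cost more than earlier ones of the same length, because each new token still attends over everything that came before it.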
When a model "forgets" something from earlier in a conversation, the architecture isn't failing: the attention weights for that earlier content got diluted by everything that came after. Recent models work around this with retrieval and caching tricks, but the underlying quadratic shape hasn't gone away.
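The dilution effect falls straight out of the softmax. The sketch below is a deliberately simplified toy, not a description of any real model: it assumes one "target" token with a fixed score and a growing crowd of other tokens that all share a lower score, and the function name and the scores 2.0 and 0.0 are made up for illustration.

```python
import math

def weight_on_target(target_score, other_score, n_others):
    """Softmax weight on one token competing with n_others tokens at other_score."""
    target = math.exp(target_score)
    return target / (target + n_others * math.exp(other_score))

# The target's score never changes, but its share of attention shrinks
# as more tokens join the context.
for n_others in (10, 100, 1_000, 10_000):
    print(n_others, weight_on_target(2.0, 0.0, n_others))
```

Trained models can learn sharper scores that resist this, so treat it as intuition for why long contexts make recall harder rather than as a prediction for any particular model.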
What to read next
Transformers are stacks of attention layers with some glue (feed-forward networks, layer normalisation, residual connections). KV cache is the trick that makes generation feasible despite the quadratic cost. Context windows are the practical limit on how much you can fit through attention at once.