Attention — AIght

In 2017 a paper called Attention Is All You Need did something unfashionable: it threw out recurrence (the dominant architecture for sequences at the time) and replaced it with a single mechanism that lets every token in a sequence look at every other token in parallel. That mechanism is attention, and every model you've heard of — GPT, Claude, Gemini, Llama — is built around it.

Computing attention from token: "cat"

Query vector for "cat": q = [+0.80, -0.30, +0.50, +0.10]

Q · K scores (dot product with each key token) and weights (softmax over all 5 scores, sums to 100%):

Token    Q · K score    Weight
The      +0.00          12.3%
cat      +0.84          28.4%
sat      +0.32          16.9%
on       -0.27           9.4%
mat      +0.99 (max)    33.0%

Output: weighted sum of V vectors = [0.51, 0.24, -0.10, 0.13]

"cat" attends most to "mat" and to itself, then to "sat"; the function words "The" and "on" get the least weight. The math is just dot products and a softmax.
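The weights in the example can be checked by hand. Here is a minimal sketch in Python, assuming the five scores shown above are fed to a plain softmax with no extra scaling. The names `scores`, `softmax`, and `weights` are illustrative and not taken from any library.

```python
import math

# Q · K scores for the query token "cat", copied from the example above.
scores = {"The": 0.00, "cat": 0.84, "sat": 0.32, "on": -0.27, "mat": 0.99}

def softmax(values):
    """Exponentiate each score and normalise so the results sum to 1."""
    peak = max(values)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Pair each token with its softmax weight (dicts preserve insertion order).
weights = dict(zip(scores, softmax(list(scores.values()))))

for token, weight in weights.items():
    print(f"{token:>3}: {weight:.1%}")
```

Run as written, this prints the same five percentages listed in the example, with "mat" and "cat" taking the two largest shares.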

How it works (the short version)

For each token, the model computes three vectors: a query (Q), a key (K), and a value (V). The query asks "who am I, and what am I looking for?" The keys answer "this is what I have to offer." The values are the actual content.

For each token, you compute a similarity score between its query and the key of every token in the sequence (a dot product), then apply a softmax across those scores to turn them into a probability distribution. You then take a weighted average of the values, with those probabilities as weights. The result becomes the new representation of the original token, now informed by which other tokens were relevant.
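The steps above fit in a few lines of NumPy. What follows is a sketch under stated assumptions, not a production implementation: a single attention head, no masking, random stand-in embeddings `X`, and random projection matrices `W_q`, `W_k`, `W_v` (all hypothetical names). It also divides the scores by the square root of the key dimension, as the original paper does; the description above leaves that scaling out.

```python
import numpy as np

def softmax(x, axis=-1):
    """Row-wise softmax; subtracting the max avoids overflow in exp()."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Q, K: (n_tokens, d_k); V: (n_tokens, d_v). Returns (output, weights)."""
    scores = Q @ K.T                         # one dot product per token pair: (n, n)
    scores = scores / np.sqrt(K.shape[-1])   # scaling used in the original paper
    weights = softmax(scores, axis=-1)       # each row is a probability distribution
    return weights @ V, weights              # weighted average of the value vectors

# Stand-in data: 5 tokens with 4-dimensional embeddings (illustrative only).
rng = np.random.default_rng(0)
n_tokens, d_model = 5, 4
X = rng.normal(size=(n_tokens, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out, weights = attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape)             # (5, 4): one new vector per token
print(weights.sum(axis=1))   # every row of weights sums to 1
```

The single matrix product `Q @ K.T` scores every pair of tokens at once, so no token has to wait for another to be processed.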

Each token in the sequence does this with every other token, in parallel. That parallelism is what made attention beat recurrence: GPUs love parallel work, and recurrence forces you to wait for the previous step.

The cost

The math is O(n²) in sequence length: every token attends to every other token, so doubling the input length quadruples the attention compute. This is why long context windows are technically possible but operationally expensive, and why everyone is working on cheaper variants (linear attention, sparse attention, sliding-window attention) alongside exact optimizations such as the KV cache, which reuses keys and values across generation steps instead of recomputing them.
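To make the quadratic growth concrete, the sketch below counts query-key pairs for a few input lengths. It is a back-of-the-envelope illustration: the hypothetical helper `attention_score_count` counts only the entries of one score matrix and ignores constant factors, multiple heads and layers, and everything outside attention.

```python
def attention_score_count(n_tokens: int) -> int:
    """Number of query-key pairs in full self-attention over n_tokens."""
    return n_tokens * n_tokens

baseline = attention_score_count(1_000)

for n in (1_000, 2_000, 4_000, 8_000):
    pairs = attention_score_count(n)
    ratio = pairs / baseline
    print(f"{n:>5} tokens -> {pairs:>12,} pairs ({ratio:.0f}x the 1,000-token count)")
```

Each doubling of the input multiplies the pair count by four, so an 8,000-token input needs 64 times the score computations of a 1,000-token one.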

Why this matters for your work

If you've ever wondered why long-document analysis is slower and more expensive than the model card suggests, this is the reason. The advertised price-per-token is roughly fair for short inputs and quietly disastrous for long ones. Multi-turn chats accumulate context, and the attention compute grows with it.

When a model "forgets" something from earlier in a conversation, it's not the architecture failing — it's that attention weights for that earlier content got diluted by everything that came after. Recent models work around this with retrieval and caching tricks, but the underlying quadratic shape hasn't gone away.
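The dilution effect can be illustrated with a toy calculation. The sketch assumes one earlier token with a fixed score competing against a growing number of later tokens that all share the same lower score. Real models learn their scores, so the effect is not this uniform; the function name and the numbers are hypothetical.

```python
import math

def weight_of_target(target_score: float, n_other: int, other_score: float = 0.0) -> float:
    """Softmax weight of one token competing with n_other equally scored tokens."""
    target = math.exp(target_score)
    others = n_other * math.exp(other_score)
    return target / (target + others)

# Same target score each time; only the amount of competing context grows.
for n_other in (10, 100, 1_000, 10_000):
    weight = weight_of_target(2.0, n_other)
    print(f"{n_other:>6} other tokens -> weight {weight:.4f}")
```

With the score held constant, the target's share falls from roughly 0.42 against 10 competitors to under 0.001 against 10,000, because the softmax denominator grows with every token added.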

What to read next

Transformers are stacks of attention layers with some glue (feed-forward networks, layer normalisation, residual connections). KV cache is the trick that makes generation feasible despite the quadratic cost. Context windows are the practical limit on how much you can fit through attention at once.
