Transformers — AIght

For a long time, AI models read text the way someone with a very short memory might: one word at a time, dragging a running summary of what came before. They were good at moving forward. They struggled with looking back. By the time a model reached the end of a long paragraph, the beginning was a blur.

Understanding "the trophy didn't fit in the suitcase because it was too big" requires knowing that "it" refers to the trophy, not the suitcase, which means remembering context from earlier in the sentence and connecting it to something later. That kind of reference is easy for humans, and was genuinely hard for the sequential models that came before.

The transformer architecture, introduced in a 2017 paper called Attention Is All You Need[·], changed the fundamental approach. Instead of reading one token at a time, it reads everything at once, and uses a mechanism called attention to decide which parts of the input matter most for understanding each part.

The attention mechanism

Attention is the key insight, and it's worth understanding in its simplest form.

For each token in a sequence, the model asks: which other tokens in this sequence are most relevant to me right now? It computes a score for each pair (how much should this word "attend to" that one?), then uses those scores to blend information from every position into a single representation for that token.

"Trophy" and "too big" have a strong relationship in that sentence. The attention mechanism learns to capture that relationship, across whatever distance separates them in the text.

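# Simplified self-attention for a single token
import numpy as np

def self_attention(query, keys, values, scale):
    # query: shape (d,); keys: shape (n, d); values: shape (n, d_v)
    # scale: typically sqrt(d), as in scaled dot-product attention
    query, keys, values = (np.asarray(a, dtype=float) for a in (query, keys, values))

    # Score: how relevant is each key to this query?
    scores = keys @ query / scale

    # Normalize scores to probabilities (which positions to attend to)
    weights = np.exp(scores - scores.max())  # subtract max for numerical stability
    weights /= weights.sum()

    # Weighted sum: blend values according to relevance
    output = weights @ values
    return output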
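
The single-token version above generalizes to every token at once by stacking the queries into a matrix. The following is a minimal sketch, assuming NumPy and using the raw input `X` as queries, keys, and values; a real transformer would first pass `X` through learned projection matrices, which are left out here. The function name `self_attention_all_tokens` is made up for this illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_all_tokens(Q, K, V):
    """Scaled dot-product attention for a whole sequence (hypothetical helper)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # (n, n): one score per token pair
    weights = softmax(scores, axis=-1)   # each row is a distribution over positions
    return weights @ V, weights          # blended values, plus the weights for inspection

# Toy input: 5 tokens with 8 features each (random numbers, not real embeddings).
rng = np.random.default_rng(0)
n, d = 5, 8
X = rng.normal(size=(n, d))

out, weights = self_attention_all_tokens(X, X, X)
print(out.shape)      # one output vector per token
print(weights.shape)  # one row of attention weights per token
```

Each row of `weights` sums to 1, so every output vector is a weighted average of the value vectors.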

Attention is the famous part, not the whole machine. Each transformer layer wraps multi-head attention and a feed-forward network in residual connections and normalization, and the model stacks that same block N times.
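
To make that block structure concrete, here is a rough sketch of one layer and a small stack of them, assuming NumPy. It applies normalization after each residual connection (the ordering used in the 2017 paper) and omits biases, learned normalization parameters, masking, and positional information. The parameter names (`Wq`, `Wk`, `Wv`, `Wo`, `W1`, `W2`), the helper `init_params`, and the random weights are all assumptions made for illustration, not a trained model.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each token's feature vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def multi_head_attention(x, p, num_heads):
    n, d = x.shape
    head_dim = d // num_heads

    def split(t):
        # (n, d) -> (heads, n, head_dim)
        return t.reshape(n, num_heads, head_dim).transpose(1, 0, 2)

    q, k, v = split(x @ p["Wq"]), split(x @ p["Wk"]), split(x @ p["Wv"])
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)  # (heads, n, n)
    heads = softmax(scores) @ v                            # (heads, n, head_dim)
    merged = heads.transpose(1, 0, 2).reshape(n, d)        # concatenate the heads
    return merged @ p["Wo"]

def feed_forward(x, p):
    # Two linear maps with a ReLU in between, applied to each position separately.
    return np.maximum(0.0, x @ p["W1"]) @ p["W2"]

def transformer_block(x, p, num_heads):
    x = layer_norm(x + multi_head_attention(x, p, num_heads))  # residual + norm
    x = layer_norm(x + feed_forward(x, p))                     # residual + norm
    return x

def init_params(rng, d, d_ff):
    # Hypothetical random initialization, for shape-checking only.
    shapes = {"Wq": (d, d), "Wk": (d, d), "Wv": (d, d), "Wo": (d, d),
              "W1": (d, d_ff), "W2": (d_ff, d)}
    return {name: rng.normal(scale=0.1, size=shape) for name, shape in shapes.items()}

rng = np.random.default_rng(0)
n, d, d_ff, num_heads, num_layers = 6, 16, 64, 4, 3
layers = [init_params(rng, d, d_ff) for _ in range(num_layers)]

x = rng.normal(size=(n, d))
for p in layers:          # same block structure, separate weights per layer
    x = transformer_block(x, p, num_heads)
print(x.shape)            # shape is preserved through the stack
```

Because every block maps an `(n, d)` array to another `(n, d)` array, blocks can be stacked to any depth without changing the surrounding code.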

Why this was a big deal

The previous generation of models processed text sequentially, which meant relating a word near the end of a paragraph to a word at the beginning required threading through every intermediate step. Information degraded over distance. Long documents were hard.

Transformers have no such constraint. Every token attends to every other token in one operation. Long-range dependencies — connections between ideas separated by many words — are as easy to capture as adjacent ones.

The architecture also scales. Bigger models, more layers, more attention heads, more data: performance kept improving. Scaling laws turned out to be remarkably consistent[·], which is the kind of finding that gets entire research labs funded.

This is what enabled the language models we have now. GPT, Claude, Gemini: all transformer-based. The architecture turned out to be weirdly general: language, code, images (via Vision Transformers), protein structures[·]. Most foundation models you can name today are this 2017 architecture, refined and scaled.

What transformers don't do

Understanding what attention is also means knowing what it isn't.

Attention doesn't mean understanding. The model learns to attend to the right things through training, but "the right things" is defined by predicting the next token well — not by understanding meaning the way humans do. What attention captures is statistical co-occurrence, shaped by enormous scale.

Context windows are still finite. Transformers can attend to everything in the input, but that input has a maximum length. Fitting your entire codebase or document library into a single prompt isn't possible, at least not yet. This is one reason retrieval-augmented generation (RAG) and other retrieval approaches exist.
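
The idea behind retrieval can be sketched in a few lines: rank candidate chunks by similarity to the question, then keep only as many as fit a fixed budget. This toy version is an assumption-heavy illustration. It uses word-count vectors instead of learned embeddings, counts whitespace-separated words instead of real tokens, and the names `select_chunks` and `max_words` are invented for this example.

```python
import math
from collections import Counter

def bag_of_words(text):
    # Toy stand-in for an embedding: lowercase word counts.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def select_chunks(query, chunks, max_words):
    """Pick the most similar chunks that fit inside a word budget (hypothetical helper)."""
    q = bag_of_words(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, bag_of_words(c)), reverse=True)
    selected, used = [], 0
    for chunk in ranked:
        size = len(chunk.split())
        if used + size <= max_words:   # skip anything that would overflow the budget
            selected.append(chunk)
            used += size
    return selected

chunks = [
    "attention scores compare every token with every other token",
    "the feed-forward network transforms each position independently",
    "residual connections add the input of a sublayer to its output",
]
picked = select_chunks("how are attention scores computed", chunks, max_words=12)
print(picked)
```

With a 12-word budget, only the chunk that shares words with the question is kept; the others are both less relevant and too long to fit alongside it.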

Quadratic cost at scale. For a sequence of length n, attention requires computing n² scores. Short sequences are fine. Very long ones are expensive. A whole research subfield is dedicated to making attention faster at long contexts — sparse attention, linear approximations, FlashAttention, and similar tricks.
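
A little arithmetic shows how quickly n² grows. The sketch below counts the pairwise scores for a single attention head and estimates the memory needed if the full score matrix were stored naively, assuming 4 bytes per score (32-bit floats). Both function names are made up for this example, and real systems vary in precision, head count, and whether the matrix is ever stored in full.

```python
def attention_score_count(n: int) -> int:
    """Number of pairwise scores for one attention head over n tokens."""
    return n * n

def score_matrix_bytes(n: int, bytes_per_score: int = 4) -> int:
    """Memory for one naively stored score matrix (assumes 32-bit floats by default)."""
    return attention_score_count(n) * bytes_per_score

for n in (1_000, 10_000, 100_000):
    count = attention_score_count(n)
    gigabytes = score_matrix_bytes(n) / 1e9
    print(f"{n:,} tokens: {count:,} scores, about {gigabytes:g} GB")
```

Going from 1,000 to 100,000 tokens multiplies the sequence length by 100 but the score count by 10,000, from about 0.004 GB to about 40 GB per head under these assumptions.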

The paper came out in 2017. By 2023, nearly every widely used language model was built directly on its ideas, scaled up by orders of magnitude. That's a fast transformation for any technology.

The mechanism is simple enough to sketch in a few lines of pseudocode. What emerged from scaling it is not simple at all.