Concept

Transformers

The architecture that changed what AI could do with language — and then everything else

Mankaran Singh·Updated May 17, 2026

For a long time, AI models read text the way someone with a very short memory might: one word at a time, dragging a running summary of what came before. They were good at moving forward. They struggled with looking back. By the time a model reached the end of a long paragraph, the beginning was a blur.

Understanding "the trophy didn't fit in the suitcase because it was too big" requires knowing that "it" refers to the trophy, not the suitcase, which means remembering context from earlier in the sentence and connecting it to something later. That kind of reference is easy for humans, and was genuinely hard for the sequential models that came before.

If you've ever wondered why every AI model in the last five years looks suspiciously similar inside, this is why.

The transformer architecture, introduced in a 2017 paper called Attention Is All You Need[·], changed the fundamental approach. Instead of reading one token at a time, it reads everything at once, and uses a mechanism called attention to decide which parts of the input matter most for understanding each part.


The attention mechanism

Attention is the key insight, and it's worth understanding in its simplest form.

For each token in a sequence, the model asks: which other tokens in this sequence are most relevant to me right now? It computes a score for each pair (how much should this word "attend to" that one), then uses those scores to blend information from every position into a single representation.

"Trophy" and "too big" have a strong relationship in that sentence. The attention mechanism learns to capture that relationship, across whatever distance separates them in the text.

import numpy as np

# Simplified self-attention for a single token
# query: shape (d,); keys: shape (n, d); values: shape (n, d_v)
# scale is typically the square root of the key dimension d
def self_attention(query, keys, values, scale):
    # Score: how relevant is each key to this query?
    scores = keys @ query / scale

    # Normalize scores to weights that sum to 1 (softmax),
    # subtracting the max first for numerical stability
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()

    # Weighted sum: blend value vectors according to relevance
    return weights @ values

“The paper's title was a claim. The mechanism was enough. Everything else (sequential processing, recurrence, the lot) could be replaced.”

Multi-head attention is just running this same calculation in parallel, several times, so different "heads" can learn different kinds of relationships (syntax, references, tone) at once.
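
To make "several times, in parallel" concrete, here is a minimal NumPy sketch of multi-head attention. It is an illustration under assumptions, not the paper's exact code: the function name `multi_head_attention`, the weight names `w_q`, `w_k`, `w_v`, `w_o`, and the tiny sizes are all made up for this example, and random matrices stand in for weights a real model would learn.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (n, d_model); each weight matrix: (d_model, d_model)."""
    n, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(m):
        # (n, d_model) -> (num_heads, n, d_head)
        return m.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(x @ w_q)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    # Each head computes its own (n, n) table of attention weights
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    heads = weights @ v  # (num_heads, n, d_head)

    # Concatenate the heads, then mix them with an output projection
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ w_o, weights

# Hypothetical sizes: 5 tokens, model width 8, 2 heads
rng = np.random.default_rng(0)
n, d_model, num_heads = 5, 8, 2
x = rng.normal(size=(n, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape)      # (5, 8)
print(weights.shape)  # (2, 5, 5)
```

The `weights` array holds one 5-by-5 attention table per head. With trained weights, those tables are where different heads can end up tracking different relationships.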

Why this was a big deal

The previous generation of models processed text sequentially, which meant relating a word near the end of a paragraph to a word at the beginning required threading through every intermediate step. Information degraded over distance. Long documents were hard.
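
A toy recurrence shows what "threading through every intermediate step" costs. This is a sketch under loud assumptions: a single hidden number, hand-picked weights `w_h` and `w_x`, and made-up function names. It is not any specific historical model, only the general shape of sequential processing.

```python
import math

def final_state(inputs, w_h=0.5, w_x=1.0):
    # One hidden value, updated once per token, in order
    h = 0.0
    for x in inputs:
        h = math.tanh(w_h * h + w_x * x)
    return h

def first_token_influence(length):
    # Compare two sequences that differ only in their first token
    with_signal = [1.0] + [0.0] * (length - 1)
    without_signal = [0.0] * length
    return abs(final_state(with_signal) - final_state(without_signal))

for length in (2, 5, 10, 20):
    print(length, first_token_influence(length))
```

The printed influence shrinks as the sequence gets longer, because the first token's signal has to survive every update on the way to the end. Real recurrent models used gating to slow this decay, but the path length still grew with distance.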

Transformers have no such constraint. Every token attends to every other token in one operation. Long-range dependencies — connections between ideas separated by many words — are as easy to capture as adjacent ones.

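The architecture is also highly parallel: during training, every position in a sequence can be processed at the same time instead of one after another. That meant GPUs could be used efficiently, which meant scale, which meant everything that came next.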
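
The "one operation" and the parallelism are easiest to see in matrix form. The sketch below is a simplification under stated assumptions: the name `self_attention_all` is made up, the token vectors are random, and the input is used directly as queries, keys, and values, where a real transformer would first apply learned projections.

```python
import numpy as np

def self_attention_all(x):
    """Attention for every token at once. x has shape (n, d).

    Assumption: x serves directly as queries, keys, and values.
    """
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)                           # (n, n): all pairs in one matmul
    scores = scores - scores.max(axis=1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # each row sums to 1
    return weights @ x, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))  # 6 made-up token vectors of width 4
out, weights = self_attention_all(x)

print(weights.shape)  # (6, 6)
print(out.shape)      # (6, 4)
# The last token's weight on the first token is a single entry,
# computed directly rather than passed along through the tokens in between.
print(weights[-1, 0])
```

Each row of `weights` depends only on that token's query and the shared keys, so the rows can be computed independently. That independence is what lets a GPU work on all positions at the same time.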

The architecture also scales. Bigger models, more layers, more attention heads, more data: performance kept improving. The scaling trends turned out to be remarkably consistent[·], which is the kind of finding that gets entire research labs funded.

This is what enabled the language models we have now. GPT, Claude, Gemini: all transformer-based. The architecture turned out to be weirdly general: language, code, images (via Vision Transformers), protein structures[·]. Most foundation models you can name today are this 2017 architecture, refined and scaled.


What transformers don't do

Understanding what attention is also means knowing what it isn't.

Attention doesn't mean understanding. The model learns to attend to the right things through training, but "the right things" is defined by predicting the next token well — not by understanding meaning the way humans do. What attention captures is statistical co-occurrence, shaped by enormous scale.

Worth repeating: the model's attention pattern is what helped it predict tokens, not what it "thinks." Reading too much into it is a recurring research bug.

Context windows are still finite. Transformers can attend to everything in the input, but that input has a maximum length. Fitting your entire codebase or document library into a single prompt often isn't possible, at least not yet. This is one reason retrieval-augmented generation (RAG) and other retrieval approaches exist.
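
The core idea behind retrieval can be sketched in a few lines: rank pieces of text by relevance, then keep only what fits. Everything here is a stand-in for illustration. `count_tokens` counts whitespace-separated words where a real system would use a tokenizer, the word-overlap score replaces the embedding search real systems typically use, and the function names and sample chunks are made up.

```python
def count_tokens(text):
    # Stand-in for a real tokenizer: whitespace-separated words
    return len(text.split())

def select_chunks(query, chunks, budget):
    query_words = set(query.lower().split())

    def overlap(chunk):
        # Crude relevance score: number of query words the chunk shares
        return len(query_words & set(chunk.lower().split()))

    ranked = sorted(chunks, key=overlap, reverse=True)

    # Greedily keep the best-ranked chunks that still fit in the budget
    selected, used = [], 0
    for chunk in ranked:
        cost = count_tokens(chunk)
        if used + cost <= budget:
            selected.append(chunk)
            used += cost
    return selected

chunks = [
    "attention scores every pair of tokens",
    "the cat sat on the mat",
    "softmax turns scores into weights that sum to one",
]
query = "how are attention scores computed"
print(select_chunks(query, chunks, budget=15))
```

With a budget of 15 words, the two chunks that mention scores fit and the unrelated one is left out. The model only ever sees the selected text, which is how a large library gets squeezed through a finite window.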

Quadratic cost at scale. For a sequence of length n, attention requires computing n² scores. Short sequences are fine. Very long ones are expensive. A whole research subfield is dedicated to making attention faster at long contexts — sparse attention, linear approximations, FlashAttention, and similar tricks.
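
A little arithmetic makes the n² concrete. The sketch assumes one attention head in one layer, with the full score table stored as 4-byte floats. Those storage assumptions and the function names are illustrative, and some of the tricks named above exist precisely to avoid holding the whole table in memory.

```python
def attention_scores(n):
    # One score for every (query, key) pair
    return n * n

def score_memory_mib(n, bytes_per_score=4):
    # Memory for the full score table, in MiB
    return attention_scores(n) * bytes_per_score / 2**20

for n in (1_000, 10_000, 100_000):
    print(f"n={n:>7,}  scores={attention_scores(n):>15,}  ~{score_memory_mib(n):,.0f} MiB")
```

Making the sequence 10 times longer means 100 times more scores, which is why long contexts get expensive so quickly.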


The paper came out in 2017. By 2023, nearly every widely used large language model was built directly on its ideas, scaled up by orders of magnitude. That's a fast transformation for any technology.

The mechanism is simple enough to sketch in a few lines of pseudocode. What emerged from scaling it is not simple at all.
