
Concept

Context Windows

What the model can see right now — and why the edges matter

Mankaran Singh·Updated May 17, 2026

Common misconceptions:

  • Bigger context windows are free.
  • The model 'reads' the whole context for every token.
  • Longer context = better answers.

Every conversation you have with a language model is, from the model's perspective, a single document it reads from scratch. It has no memory between sessions. No running thread. Each time you send a message, the model reads everything — every prior message, every system instruction, every uploaded file — as one long piece of text, then continues it.
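
A minimal sketch of what "reads from scratch" looks like, assuming a made-up chat format. The function name `render_prompt` and the bracketed role labels are illustrative only, not any vendor's actual API.

```python
def render_prompt(system: str, history: list[tuple[str, str]], new_message: str) -> str:
    """Flatten the whole conversation into the single text the model sees."""
    lines = [f"[system] {system}"]
    for role, text in history:
        lines.append(f"[{role}] {text}")
    lines.append(f"[user] {new_message}")
    lines.append("[assistant]")  # the model continues the document from here
    return "\n".join(lines)


# Hypothetical two-turn history; every earlier turn is re-sent on each request.
history = [("user", "My name is Ada."), ("assistant", "Nice to meet you, Ada.")]
prompt = render_prompt("You are helpful.", history, "What is my name?")
print(prompt)
```

The model can answer the last question only because the first message is still inside the text it was handed.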

The context window is the maximum length of that text, measured in tokens. One token is roughly three-quarters of a word in English.

GPT-2, in 2019, had a context window of 1,024 tokens — about 750 words. GPT-4o, in 2024, handles 128,000. Gemini 1.5 Pro stretches to a million. The numbers are hard to make visceral until you start hitting the edges.

A million tokens sounds infinite. It's roughly 750,000 words, the length of several long novels. Still finite. Still has edges.
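
The arithmetic behind those word counts, as a rough sketch. The 0.75 ratio is the English rule of thumb from above; real tokenizers vary by model and language, so treat the output as an estimate.

```python
WORDS_PER_TOKEN = 0.75  # rough English average; an approximation, not a tokenizer


def tokens_to_words(tokens: int) -> int:
    """Estimate how many English words fit in a given number of tokens."""
    return round(tokens * WORDS_PER_TOKEN)


def words_to_tokens(words: int) -> int:
    """Estimate how many tokens a given number of English words will use."""
    return round(words / WORDS_PER_TOKEN)


for name, window in [("GPT-2", 1_024), ("GPT-4o", 128_000), ("Gemini 1.5 Pro", 1_000_000)]:
    print(f"{name}: {window:,} tokens is about {tokens_to_words(window):,} words")
```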

Why there's a limit at all

The constraint comes from the transformer architecture at the heart of most modern models. When a model processes a sequence, the attention mechanism computes a relationship score between every pair of tokens. For a sequence of length n, that's n² calculations.

Double the context length. The computation quadruples.
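
To make the quadrupling concrete, here is a small sketch that counts pairwise scores for full attention. It ignores everything else a model does, so it illustrates the scaling rather than actual compute cost.

```python
def attention_pairs(n: int) -> int:
    """Number of token-pair scores for a sequence of n tokens under full attention."""
    return n * n


for n in (1_024, 2_048, 128_000):
    print(f"{n:>7,} tokens -> {attention_pairs(n):,} scores")

# Doubling the length multiplies the number of scores by four.
ratio = attention_pairs(2_048) / attention_pairs(1_024)
print(ratio)
```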

This is expensive at inference time (each response costs compute) and during training (the model is trained on sequences, and longer sequences cost more). There's a hard practical limit determined by available memory and acceptable latency.

A lot of the "infinite context" announcements you see online are real — and also rely on tricks like sliding windows, sparse attention, or KV-cache compression that change what "attention to everything" actually means.

Extending context windows is an active research area. Sparse attention, linear approximations, FlashAttention, rotary position embeddings (RoPE), and KV-cache tricks have all pushed the limits further than seemed feasible a few years ago. But for any given model, there is a ceiling — and that ceiling shapes how it can be used.


What happens near the edges

Long context isn't just long input. There are failure modes that matter in practice.

Lost in the middle. Models perform best at retrieving information placed near the beginning or end of a long context, and notably worse on information buried in the middle.[·] Put the critical fact on page 50 of a 100-page document and some models will fail to weight it appropriately. This effect varies by model and is improving, but it hasn't disappeared.
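
One workaround some practitioners try is to reorder material so the most relevant pieces sit at the edges. The sketch below assumes you already have passages ranked by relevance. The function name is hypothetical, and whether this helps depends on the model.

```python
def reorder_for_edges(docs_by_relevance: list[str]) -> list[str]:
    """Put the most relevant items at the start and end, least relevant in the middle.

    Expects the input sorted from most to least relevant.
    """
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        # Alternate: rank 1 goes to the front, rank 2 to the back, and so on.
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]


ranked = ["A", "B", "C", "D", "E"]  # "A" is most relevant, "E" least
ordered = reorder_for_edges(ranked)
print(ordered)
```

Here the two strongest passages end up first and last, and the weakest lands in the middle.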

“The context window isn't a bucket you fill. What you put in it, and where, changes what the model pays attention to.”

Dilution. A full-context prompt with a lot of irrelevant material tends to produce weaker responses than a focused prompt with only the relevant material. The model has no built-in signal for which parts you consider important, so irrelevant text competes for its attention unless you tell it what matters.

Cost. Most API pricing is token-based. Very long contexts cost more per request. An application that passes a user's entire document library on every message will be expensive at scale.[·]
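
A back-of-the-envelope sketch of that cost difference. The prices and token counts below are invented for illustration; check your provider's current rates before relying on any number here.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 usd_per_million_input: float, usd_per_million_output: float) -> float:
    """Cost in USD of one request under simple per-token pricing."""
    total = input_tokens * usd_per_million_input + output_tokens * usd_per_million_output
    return total / 1_000_000


# Hypothetical prices (USD per million tokens), not any real provider's rates.
PRICE_IN, PRICE_OUT = 2.0, 8.0

full_library = request_cost(500_000, 500, PRICE_IN, PRICE_OUT)   # whole library every time
retrieved_only = request_cost(4_000, 500, PRICE_IN, PRICE_OUT)   # only the relevant chunks

print(f"full library: ${full_library:.3f} per message")
print(f"retrieved only: ${retrieved_only:.3f} per message")
```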


The practical implications

This is why certain AI design patterns exist.

Retrieval-augmented generation (RAG) solves "can't fit everything in context" by searching a knowledge base and inserting only the relevant chunks. Instead of giving the model a 500-page manual, you search for the right pages and insert only those.
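
A toy version of the search-then-insert idea. Real systems usually rank chunks by embedding similarity; the word-overlap score here is a deliberately crude stand-in, and all names and sample text are made up for the example.

```python
def score(query: str, chunk: str) -> int:
    """Count distinct query words that also appear in the chunk (crude relevance proxy)."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks with the highest overlap score."""
    return sorted(chunks, key=lambda chunk: score(query, chunk), reverse=True)[:k]


def build_prompt(query: str, chunks: list[str], k: int = 2) -> str:
    """Insert only the retrieved chunks into the prompt, not the whole knowledge base."""
    context = "\n".join(retrieve(query, chunks, k))
    return f"Context:\n{context}\n\nQuestion: {query}"


manual = [
    "To reset the router hold the reset button for ten seconds",
    "The warranty covers parts for two years",
    "Firmware updates are installed from the admin page",
]
top = retrieve("how do I reset the router", manual, k=1)
print(build_prompt("how do I reset the router", manual, k=1))
```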

Summarisation chains handle very long documents by processing them in segments — summarise each section, then summarise the summaries. You lose some fidelity but stay within context.
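
The summarise-the-summaries loop can be sketched like this. `toy_summarise` is a placeholder for a real model call, and the word-based segment limit stands in for a token limit; both are assumptions made to keep the example runnable.

```python
from typing import Callable


def split_into_segments(text: str, max_words: int) -> list[str]:
    """Cut the text into consecutive segments of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def summarise_long(text: str, summarise: Callable[[str], str], max_words: int) -> str:
    """Summarise each segment, then summarise the joined summaries until one call fits."""
    if len(text.split()) <= max_words:
        return summarise(text)
    partials = [summarise(segment) for segment in split_into_segments(text, max_words)]
    combined = " ".join(partials)
    if len(combined.split()) >= len(text.split()):
        raise ValueError("summariser did not shorten the text")
    return summarise_long(combined, summarise, max_words)


def toy_summarise(text: str) -> str:
    """Stand-in for a model call: keeps only the first five words."""
    return " ".join(text.split()[:5])


document = " ".join(f"w{i}" for i in range(100))
result = summarise_long(document, toy_summarise, max_words=20)
print(result)
```

Each pass discards detail, which is the fidelity loss described above.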

Sliding window approaches maintain a rolling view of a long conversation, summarising older turns when they'd push the context past the limit. Most long-running chat apps do something like this.
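
A simplified sketch of the rolling view. It drops the oldest turns outright rather than summarising them, and it estimates tokens from word counts instead of using a real tokenizer; both are simplifications, and the function names are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate from the 0.75 words-per-token rule of thumb."""
    return max(1, round(len(text.split()) / 0.75))


def fit_to_window(messages: list[str], budget: int) -> list[str]:
    """Keep the most recent messages whose estimated tokens fit within the budget."""
    kept, used = [], 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = estimate_tokens(message)
        if used + cost > budget:
            break  # everything older than this falls off the edge
        kept.append(message)
        used += cost
    return kept[::-1]  # restore chronological order


messages = ["one two three", "four five six", "seven eight nine"]
print(fit_to_window(messages, budget=8))
```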

If the model "forgets" something you said an hour ago, it's not being rude. The thing you said has either been truncated, summarised into something fuzzier, or the model is choosing to weight more recent context more heavily.

Careful context management is why thoughtful system prompts stay short. Every token in the system prompt is a token not available for user content.
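
The budget arithmetic, as a sketch. It assumes the reply shares the same window as the input, which is common but worth confirming for your model; the window size and token counts are illustrative.

```python
def remaining_for_user(window: int, system_tokens: int, reserved_output: int = 0) -> int:
    """Tokens left for user content after the system prompt and reserved reply space."""
    remaining = window - system_tokens - reserved_output
    if remaining < 0:
        raise ValueError("system prompt and reserved output exceed the window")
    return remaining


# Illustrative numbers: a 128,000-token window with 4,000 tokens held back for the reply.
short_prompt = remaining_for_user(128_000, system_tokens=300, reserved_output=4_000)
long_prompt = remaining_for_user(128_000, system_tokens=20_000, reserved_output=4_000)
print(short_prompt, long_prompt)
```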


How to think about it in practice

For everyday use, context windows are rarely the constraint. Pasting a few documents, an email thread, or a code file into a modern model is well within limits.

Where it matters:

  • Long codebases. Even at 128K tokens, a reasonably large repository may not fit. This is why AI coding tools use indexing and retrieval rather than raw context.
  • Long conversations. Browser-based chat interfaces typically truncate or summarise older turns automatically. If a model seems to "forget" something you said an hour ago, this is why.
  • Multi-document analysis. Fitting fifty papers in one context is possible now. The quality of analysis across all fifty — whether the model is genuinely synthesising or surface-skimming — is a separate question.
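
A quick way to sanity-check whether a set of files is likely to fit. This is a rough sketch: it reuses the prose rule of thumb, and code often tokenizes less efficiently than English text, so a real tokenizer would give different (usually higher) counts. The file contents are invented.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate from whitespace-separated words (0.75 words per token)."""
    return round(len(text.split()) / 0.75)


def fits_in_window(files: dict[str, str], window: int) -> tuple[bool, int]:
    """Return whether the combined files fit in the window, plus the estimated total."""
    total = sum(estimate_tokens(body) for body in files.values())
    return total <= window, total


# Hypothetical miniature "repository" mapping file names to contents.
files = {"a.py": "x = 1", "b.py": "print ( x )"}
print(fits_in_window(files, window=10))
```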

The context window is one of the most honest numbers in AI. It tells you what the model is actually working with right now. Everything outside that window doesn't exist to the model.

Understanding this resolves a lot of confusion about why models "forget" things, give inconsistent answers across sessions, or seem to miss something you obviously told them earlier. They weren't told. Or they were — but the telling has since scrolled off the edge.
