Every conversation you have with a language model is, from the model's perspective, a single document it reads from scratch. It has no memory between sessions. No running thread. Each time you send a message, the model reads everything — every prior message, every system instruction, every uploaded file — as one long piece of text, then continues it.
The context window is the maximum length of that text, measured in tokens. One token is roughly three-quarters of a word in English.
GPT-2, in 2019, had a context window of 1,024 tokens — about 750 words. GPT-4o, in 2024, handles 128,000. Gemini 1.5 Pro stretches to a million. The numbers are hard to make visceral until you start hitting the edges.
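To make those figures concrete, here is a minimal sketch that converts between tokens and words using the three-quarters rule of thumb above. The ratio is only an English-language average, and the function names are illustrative; a real tokenizer would give different counts for any particular text.

```python
# Rough conversion between tokens and English words.
# WORDS_PER_TOKEN is the rule of thumb from the text, not a real tokenizer.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(text: str) -> int:
    """Estimate a token count from a whitespace word count."""
    words = len(text.split())
    return round(words / WORDS_PER_TOKEN)


def words_that_fit(context_tokens: int) -> int:
    """Estimate how many English words fit in a context window."""
    return int(context_tokens * WORDS_PER_TOKEN)


print(words_that_fit(1_024))    # 768, close to the "about 750 words" above
print(words_that_fit(128_000))  # 96000
```

Anything that needs an exact count (billing, hard limits) should use the tokenizer that ships with the model in question rather than this estimate.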
Why there's a limit at all
The constraint comes from the transformer architecture at the heart of most modern models. When a model processes a sequence, the attention mechanism computes a relationship score between every pair of tokens. For a sequence of length n, that's on the order of n² calculations.
Double the context length. The computation quadruples.
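The quadratic growth is easy to check by hand. The sketch below counts token pairs at a few sequence lengths and builds a toy score matrix from plain dot products. It illustrates the n-by-n shape only; real attention adds scaling, a softmax, learned projections, and multiple heads, none of which are shown here.

```python
def attention_pair_count(n_tokens: int) -> int:
    """Number of query-key pairs scored for a sequence of n tokens."""
    return n_tokens * n_tokens


def score_matrix(vectors: list) -> list:
    """Toy pairwise scores: one dot product for every pair of token vectors."""
    return [
        [sum(a * b for a, b in zip(query, key)) for key in vectors]
        for query in vectors
    ]


for n in (1_024, 2_048, 4_096):
    print(n, attention_pair_count(n))

# Doubling the length quadruples the number of pairs.
print(attention_pair_count(2_048) / attention_pair_count(1_024))  # 4.0

# Three toy token vectors produce a 3 x 3 matrix of scores.
scores = score_matrix([[1, 0], [0, 1], [1, 1]])
print(len(scores), len(scores[0]))  # 3 3
```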
This is expensive at inference time (each response costs compute) and during training (the model is trained on sequences, and longer sequences cost more). There's a hard practical limit determined by available memory and acceptable latency.
Extending context windows is an active research area. Sparse attention, linear approximations, FlashAttention, rotary position embeddings (RoPE), and KV-cache tricks have all pushed the limits further than seemed feasible a few years ago. But for any given model, there is a ceiling — and that ceiling shapes how it can be used.
What happens near the edges
Long context isn't just long input. There are failure modes that matter in practice.
Lost in the middle. Models perform best at retrieving information placed near the beginning or end of a long context, and notably worse on information buried in the middle. Put the critical fact on page 50 of a 100-page document and some models will fail to weight it appropriately. This effect varies by model and is improving, but it hasn't disappeared.
Dilution. A full-context prompt with a lot of irrelevant material tends to produce weaker responses than a focused prompt with only the relevant material. The model doesn't know which parts you consider important unless you tell it, so the irrelevant text competes for attention with the material that matters.
Cost. Most API pricing is token-based. Very long contexts cost more per request. An application that passes a user's entire document library on every message will be expensive at scale.
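A back-of-the-envelope calculation shows how quickly this adds up. The prices below are hypothetical placeholders, not any provider's actual rates, and the request sizes are invented for illustration.

```python
# Hypothetical prices in dollars per million tokens (assumptions, not real rates).
INPUT_PRICE_PER_M = 3.00
OUTPUT_PRICE_PER_M = 15.00


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one request under simple per-token pricing."""
    return (
        input_tokens / 1_000_000 * INPUT_PRICE_PER_M
        + output_tokens / 1_000_000 * OUTPUT_PRICE_PER_M
    )


full_library = request_cost(input_tokens=100_000, output_tokens=500)
focused = request_cost(input_tokens=2_000, output_tokens=500)

print(f"full library per request: ${full_library:.4f}")  # $0.3075
print(f"focused per request:      ${focused:.4f}")       # $0.0135

# The same comparison at 10,000 requests per day.
print(f"full library per day: ${full_library * 10_000:,.2f}")  # $3,075.00
print(f"focused per day:      ${focused * 10_000:,.2f}")       # $135.00
```

Under these assumed numbers the answer is the same length in both cases; the difference is entirely the input sent along with each question.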
The practical implications
This is why certain AI design patterns exist.
Retrieval-augmented generation (RAG) solves "can't fit everything in context" by searching a knowledge base and inserting only the relevant chunks. Instead of giving the model a 500-page manual, you search for the right pages and insert only those.
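The retrieval step can be sketched in a few lines. This version ranks chunks by simple word overlap with the question, which is a stand-in chosen to keep the example self-contained; production systems more commonly rank by embedding similarity. The chunk texts, function names, and prompt wording are all illustrative assumptions.

```python
import re


def words(text: str) -> set:
    """Lowercased set of words, used for naive overlap scoring."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, chunks: list, top_k: int = 2) -> list:
    """Return up to top_k chunks that share the most words with the query."""
    query_words = words(query)
    scored = [(len(query_words & words(chunk)), i) for i, chunk in enumerate(chunks)]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [chunks[i] for score, i in ranked[:top_k] if score > 0]


def build_prompt(query: str, chunks: list, top_k: int = 2) -> str:
    """Insert only the retrieved chunks into the prompt, not the whole manual."""
    context = "\n\n".join(retrieve(query, chunks, top_k))
    return f"Answer using only this context.\n\nContext:\n{context}\n\nQuestion: {query}"


# A made-up three-chunk "manual" for illustration.
manual = [
    "To reset the device, hold the power button for ten seconds.",
    "The warranty covers manufacturing defects for two years.",
    "Replace the filter every six months for best performance.",
]

print(build_prompt("How do I reset the device?", manual, top_k=1))
```

Word overlap is noisy (common words like "the" match everywhere), which is one reason real systems prefer semantic similarity; the shape of the pipeline is the same either way.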
Summarisation chains handle very long documents by processing them in segments — summarise each section, then summarise the summaries. You lose some fidelity but stay within context.
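A minimal sketch of the summarise-the-summaries loop follows. The `summarise` argument stands in for a call to a model; here it is replaced by a toy function that keeps the first few words, purely so the example runs on its own. Segment size is counted in words for simplicity, where a real implementation would count tokens.

```python
from typing import Callable, List


def split_into_segments(text: str, max_words: int) -> List[str]:
    """Split text into consecutive segments of at most max_words words."""
    tokens = text.split()
    return [" ".join(tokens[i:i + max_words]) for i in range(0, len(tokens), max_words)]


def summarise_long(text: str, summarise: Callable[[str], str], max_words: int) -> str:
    """Summarise each segment, then summarise the joined summaries.

    Assumes summarise() returns something shorter than its input;
    otherwise the recursion would never reach the base case.
    """
    if len(text.split()) <= max_words:
        return summarise(text)
    partials = [summarise(segment) for segment in split_into_segments(text, max_words)]
    return summarise_long(" ".join(partials), summarise, max_words)


def first_words(text: str, keep: int = 5) -> str:
    """Toy stand-in for a model call: keep only the first few words."""
    return " ".join(text.split()[:keep])


document = " ".join(f"w{i}" for i in range(100))
print(summarise_long(document, first_words, max_words=20))
```

The fidelity loss mentioned above is visible even in the toy: each pass discards most of every segment, and nothing discarded in an early pass can be recovered later.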
Sliding window approaches maintain a rolling view of a long conversation, summarising older turns when they'd push the context past the limit. Most long-running chat apps do something like this.
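One way such a rolling view might be implemented is sketched below: keep as many recent turns as fit a token budget and replace everything older with a single summary. The token estimate reuses the three-quarters rule of thumb, `summarise` is a stand-in for a model call, and the budget covers recent turns only, so a real system would also need to reserve room for the summary itself.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Rough token estimate (about 0.75 words per token); not a real tokenizer."""
    return round(len(text.split()) / 0.75)


def fit_to_window(
    messages: List[str],
    recent_budget: int,
    summarise: Callable[[List[str]], str],
) -> List[str]:
    """Keep the newest messages within recent_budget; summarise the rest."""
    kept: List[str] = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > recent_budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    older = messages[: len(messages) - len(kept)]
    if older:
        return [summarise(older)] + kept
    return kept


def placeholder_summary(older: List[str]) -> str:
    """Toy stand-in for a model-written summary of earlier turns."""
    return f"[summary of {len(older)} earlier turns]"


history = ["a b c", "d e f", "g h i"]
print(fit_to_window(history, recent_budget=8, summarise=placeholder_summary))
```

Whatever detail the summary omits is gone from the model's view, which is the mechanism behind a chat that appears to forget something said much earlier.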
Careful context management is why thoughtful system prompts stay short. Every token in the system prompt is a token not available for user content.
How to think about it in practice
For everyday use, context windows are rarely the constraint. Pasting a few documents, an email thread, or a code file into a modern model is well within limits.
Where it matters:
- Long codebases. Even at 128K tokens, a reasonably large repository may not fit. This is why AI coding tools use indexing and retrieval rather than raw context.
- Long conversations. Browser-based chat interfaces typically truncate or summarise older turns automatically. If a model seems to "forget" something you said an hour ago, this is why.
- Multi-document analysis. Fitting fifty papers in one context is possible now. The quality of analysis across all fifty — whether the model is genuinely synthesising or surface-skimming — is a separate question.
The context window is one of the most honest numbers in AI. It tells you what the model is actually working with right now. Everything outside that window doesn't exist to the model.
Understanding this resolves a lot of confusion about why models "forget" things, give inconsistent answers across sessions, or seem to miss something you obviously told them earlier. They weren't told. Or they were — but the telling has since scrolled off the edge.