What is a context window?
A model's context window is its working memory — the total amount of text it can consider at once. Understanding it explains most of why AI apps behave the way they do.
The short version
A context window is the maximum amount of text — measured in tokens — that a language model can take into account at one time. It covers everything in a single request: the system prompt, the conversation so far, any documents you’ve pasted or retrieved, and the response the model is generating. If the total would exceed the window, something has to give — older text gets dropped, or the request is rejected. The window is the model’s short-term memory, and it is finite.
What’s a token?
Models don’t read characters or words — they read tokens, chunks of text the model’s tokenizer produces. A token is roughly ¾ of a word, or about 4 characters of English, but it varies: common words may be one token, rare words and code split into several, and non-English text often uses more. “Context window = 128,000 tokens” means about 90,000–100,000 words of working space — shared across input and output.
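The rules of thumb above can be turned into a quick estimate. This is a minimal sketch, not a real tokenizer: the function names and the two constants are assumptions made for illustration, and actual counts depend on the specific model's tokenizer.

```python
import math

CHARS_PER_TOKEN = 4      # rough English average quoted above (an approximation)
WORDS_PER_TOKEN = 0.75   # "a token is roughly 3/4 of a word"

def estimate_tokens(text: str) -> int:
    """Rough token count from character length; not a real tokenizer."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)

def tokens_to_words(tokens: int) -> int:
    """Approximate number of English words that fit in a token budget."""
    return int(tokens * WORDS_PER_TOKEN)

# A 44-character sentence comes out at about 11 tokens.
print(estimate_tokens("The quick brown fox jumps over the lazy dog."))  # 11

# A 128,000-token window holds roughly 96,000 English words.
print(tokens_to_words(128_000))  # 96000
```

For anything that must not overflow, count with the tokenizer your provider ships rather than a heuristic like this one, since code and non-English text can come out well above the estimate.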
What fills the window
Every request packs several things into the same budget:
- System prompt — the instructions that set the model’s role and rules.
- Conversation history — every previous turn the app sends back for continuity. This grows with every message and is usually the first thing to overflow a long chat.
- Retrieved context — documents you paste, or chunks pulled in by RAG. This can dominate the budget.
- The output — the model’s reply is generated into the same window, so the longer the input, the less room is left to answer.
See it laid out interactively with the Context Window Visualizer.
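To make the shared budget concrete, here is a small sketch that adds up the input parts and reports what is left for the reply. The class name, field names, and token figures are hypothetical, chosen only to illustrate the arithmetic.

```python
from dataclasses import dataclass

@dataclass
class ContextBudget:
    """Token accounting for one request (all values in tokens)."""
    window: int             # total context window of the model
    system_prompt: int = 0  # instructions
    history: int = 0        # previous turns sent back by the app
    retrieved: int = 0      # pasted documents or RAG chunks

    @property
    def input_tokens(self) -> int:
        return self.system_prompt + self.history + self.retrieved

    @property
    def room_for_output(self) -> int:
        # The reply shares the window with the input; never report below zero.
        return max(0, self.window - self.input_tokens)

# Illustrative numbers only.
budget = ContextBudget(window=128_000, system_prompt=1_500,
                       history=40_000, retrieved=80_000)
print(budget.input_tokens)     # 121500
print(budget.room_for_output)  # 6500
```

In this made-up request the retrieved documents take most of the window, leaving only about 6,500 tokens for the answer.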
Why it matters
- Things get dropped. Exceed the window and apps truncate — usually the middle or oldest turns — so the model silently “forgets” what it can no longer see.
- Cost and latency scale with tokens. A bigger context isn’t free: you pay per token and wait longer. Stuffing the window “just in case” is both more expensive and slower.
- “Lost in the middle.” Models tend to recall the start and end of a long context more reliably than the middle, so facts buried in the middle of a huge prompt are more likely to be missed. Bigger is not automatically better.
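The silent "forgetting" described in the first point can be sketched as a trimming step an app might run before each request. This is one possible policy (drop the oldest turns first), not how any particular app does it; `trim_history`, its parameters, and the 4-characters-per-token heuristic are assumptions for illustration.

```python
import math

def estimate_tokens(text: str) -> int:
    # Rough heuristic (about 4 characters per token); not a real tokenizer.
    return math.ceil(len(text) / 4)

def trim_history(system_prompt: str, turns: list[str],
                 window: int, reserve_output: int) -> list[str]:
    """Keep the most recent turns that fit; drop the oldest ones first."""
    budget = window - reserve_output - estimate_tokens(system_prompt)
    kept: list[str] = []
    # Walk from newest to oldest so recent turns are kept preferentially.
    for turn in reversed(turns):
        cost = estimate_tokens(turn)
        if cost > budget:
            break  # this turn and everything older is dropped
        kept.append(turn)
        budget -= cost
    kept.reverse()  # restore chronological order
    return kept

# Three turns of about 100 tokens each, in a deliberately tiny window.
turns = ["a" * 400, "b" * 400, "c" * 400]
kept = trim_history("x" * 40, turns, window=400, reserve_output=150)
print(len(kept))  # 2 (the oldest turn no longer fits)
```

Nothing signals to the model that the first turn existed, which is why the answer can quietly ignore something said earlier in a long chat.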
How context limits are managed
- Summarize old conversation turns instead of resending them verbatim.
- Retrieve, don’t stuff — use RAG to pull only the relevant chunks instead of pasting whole documents.
- Chunk and rank — break sources into pieces and include only the top matches.
- Reserve output space — leave enough room for the answer; the reply competes for the same budget.
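The "chunk and rank" and "reserve output space" ideas above can be combined in a few lines. This is a toy sketch: real RAG systems usually rank with embeddings or a search index, whereas the keyword-overlap score here is a stand-in, and all function names and numbers are hypothetical.

```python
import math

def estimate_tokens(text: str) -> int:
    # Rough heuristic (about 4 characters per token); not a real tokenizer.
    return math.ceil(len(text) / 4)

def chunk_words(text: str, max_words: int) -> list[str]:
    """Break a source into pieces of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def score(query: str, passage: str) -> int:
    """Toy relevance: number of distinct query words found in the passage."""
    return len(set(query.lower().split()) & set(passage.lower().split()))

def select_chunks(query: str, chunks: list[str], window: int,
                  reserve_output: int, used_tokens: int = 0) -> list[str]:
    """Take the best-scoring chunks that fit after reserving room for the reply."""
    budget = window - reserve_output - used_tokens
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)
    selected: list[str] = []
    for c in ranked:
        cost = estimate_tokens(c)
        if score(query, c) > 0 and cost <= budget:
            selected.append(c)
            budget -= cost
    return selected

doc = ("refund policy allows returns within 14 days shipping takes five "
       "business days refund requests need a receipt")
pieces = chunk_words(doc, max_words=6)

# Tiny window so the budget visibly limits what gets included.
print(select_chunks("refund receipt", pieces, window=24, reserve_output=10))
# ['refund requests need a receipt']
```

With a larger window the second-best chunk is included as well, while the chunk with no matching words is never sent.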
Context-window sizes range from a few thousand tokens on small/older models to 1–2 million on long-context models. Exact limits change often and vary by model — check your provider’s current docs rather than trusting a remembered number.
FAQ
Is the context window the same as the model’s memory?
It’s the short-term memory for one request. Anything outside the window — earlier in a long chat, or not retrieved — the model simply can’t see. Long-term memory across sessions is a separate feature an app builds on top.
Does a bigger context window mean better answers?
Not necessarily. It lets you include more, but cost, latency, and the “lost in the middle” effect all push the other way. Relevant-and-concise usually beats large-and-noisy.
What happens when I exceed it?
Depending on the app, the request errors out or older/middle content is trimmed to fit — so the model answers without the part that was dropped.
Related
Try the Context Window Visualizer, or read about RAG and MCP.