What is a context window? Tokens, limits & how to manage it

The short version

A context window is the maximum amount of text, measured in tokens, that a language model can take into account at one time. It covers everything in a single request: the system prompt, the conversation so far, any documents you’ve pasted or retrieved, and the response the model is generating. If the total would exceed the window, something has to give: older text gets dropped, or the request is rejected. The window is the model’s short-term memory, and it is finite.

What’s a token?

Models don’t read characters or words; they read tokens, the chunks of text that the model’s tokenizer produces. A token is roughly ¾ of a word, or about 4 characters of English, but it varies: common words may be one token, rare words and code split into several, and non-English text often uses more. “Context window = 128,000 tokens” means about 90,000–100,000 English words of working space, shared across input and output.
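
The rules of thumb above can be turned into a quick estimator. This is only a sketch: the function names are made up for illustration, the constants are the approximations quoted in the text, and a real tokenizer will give different counts, especially for code and non-English text.

```python
import math

# Rules of thumb from the text; English prose only, not a real tokenizer.
CHARS_PER_TOKEN = 4
WORDS_PER_TOKEN = 0.75

def estimate_tokens(text: str) -> int:
    """Rough token estimate based on character count (hypothetical helper)."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)

def tokens_to_words(tokens: int) -> int:
    """Approximate number of English words that fit in a token budget."""
    return int(tokens * WORDS_PER_TOKEN)

print(estimate_tokens("The context window is finite."))  # 29 characters -> 8
print(tokens_to_words(128_000))                          # 96000
```

For anything that matters (billing, hard limits), count with the tokenizer your provider actually uses instead of this estimate.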

What fills the window

Every request packs several things into the same budget:

System prompt: the instructions that set the model’s role and rules.
Conversation history: every previous turn the app sends back for continuity. This grows with every message and is usually the first thing to overflow a long chat.
Retrieved context: documents you paste, or chunks pulled in by RAG. This can dominate the budget.
The output: the model’s reply is generated into the same window, so the longer the input, the less room is left to answer.

See it laid out interactively with the Context Window Visualizer.
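
The same accounting can be written down in a few lines. This is a minimal sketch with invented numbers; the class and field names are illustrative, not any provider’s API, and it assumes the simple model described above where input and output share one window.

```python
from dataclasses import dataclass

@dataclass
class ContextBudget:
    """Token counts for one request (all values are illustrative)."""
    window: int         # total context window, in tokens
    system_prompt: int  # instructions that set role and rules
    history: int        # previous conversation turns
    retrieved: int      # pasted documents or RAG chunks

    def input_tokens(self) -> int:
        return self.system_prompt + self.history + self.retrieved

    def room_for_output(self) -> int:
        # Whatever the input leaves over is all the model has to answer in.
        return max(0, self.window - self.input_tokens())

budget = ContextBudget(window=128_000, system_prompt=1_500,
                       history=40_000, retrieved=70_000)
print(budget.input_tokens())     # 111500
print(budget.room_for_output())  # 16500
```

Here retrieved context takes more than half the window, and only about 16,500 tokens remain for the reply.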

Why it matters

Things get dropped. Exceed the window and the app truncates the conversation, usually dropping the oldest or middle turns, so the model silently “forgets” what it can no longer see.
Cost and latency scale with tokens. A bigger context isn’t free: you pay per token and wait longer. Stuffing the window “just in case” is expensive and slow.
“Lost in the middle.” Models attend best to the start and end of a long context; facts buried in the middle of a huge prompt are more likely to be missed. Bigger is not automatically better.

Under the hood (optional): Why doubling the context can quadruple the cost

Inside the model, attention lets every token look at every other token to work out what matters. With n tokens in the window, that is n × n comparisons:

attention work ≈ n² (n = tokens in the window)

So the attention work does not grow in step with the length of the text; it grows with the square of it. Twice the tokens is roughly four times the work; ten times the tokens is a hundred times. That is a major reason a huge prompt is slow and expensive, and part of why facts get “lost in the middle”: one buried sentence competes for attention against every other token at once. Modern models soften this with techniques such as sliding-window attention, which limits how many tokens each token compares against, and FlashAttention, which still does the full n² computation but with far less memory traffic. The instinct holds either way: relevant-and-short beats large-and-noisy.
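
To make the arithmetic concrete, here is a small sketch of the idealized n² model. The function name is invented for illustration, and real models with the optimizations mentioned above will not follow this curve exactly.

```python
def relative_attention_work(old_tokens: int, new_tokens: int) -> float:
    """How many times more pairwise comparisons new_tokens needs than
    old_tokens, under the idealized n * n attention model."""
    if old_tokens <= 0 or new_tokens <= 0:
        raise ValueError("token counts must be positive")
    return (new_tokens / old_tokens) ** 2

print(relative_attention_work(8_000, 16_000))  # 4.0   (2x tokens)
print(relative_attention_work(8_000, 80_000))  # 100.0 (10x tokens)
```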

How context limits are managed

Summarize old conversation turns instead of resending them verbatim.
Retrieve, don’t stuff: use RAG to pull only the relevant chunks instead of pasting whole documents.
Chunk and rank: break sources into pieces and include only the top matches.
Reserve output space: leave enough room for the answer; the reply competes for the same budget.
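
The “chunk and rank” idea can be sketched as follows. Everything here is an assumption made for illustration: the helper names are hypothetical, the token count is the rough 4-characters-per-token estimate, and relevance is scored by plain word overlap, where a real RAG system would typically use embeddings or a search index.

```python
import math

def estimate_tokens(text: str) -> int:
    # Rough 4-characters-per-token rule of thumb, not a real tokenizer.
    return math.ceil(len(text) / 4)

def chunk(text: str, max_words: int = 50) -> list[str]:
    """Split a source into pieces of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def score(query: str, passage: str) -> int:
    """Toy relevance score: number of distinct query words in the passage."""
    return len(set(query.lower().split()) & set(passage.lower().split()))

def select_chunks(query: str, chunks: list[str], token_budget: int) -> list[str]:
    """Take the best-scoring chunks that fit in token_budget."""
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)
    selected, used = [], 0
    for c in ranked:
        cost = estimate_tokens(c)
        if score(query, c) == 0 or used + cost > token_budget:
            continue  # skip irrelevant chunks and ones that do not fit
        selected.append(c)
        used += cost
    return selected

docs = [
    "refund policy lasts thirty days",
    "shipping takes one week",
    "our office dog answers to Biscuit",
]
print(select_chunks("refund policy", docs, token_budget=100))
# ['refund policy lasts thirty days']
```

Only the matching chunk is sent, even though the budget had room for all three; that is the point of retrieving rather than stuffing.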

Context-window sizes range from a few thousand tokens on small/older models to 1–2 million on long-context models. Exact limits change often and vary by model, so check your provider’s current docs rather than trusting a remembered number.

FAQ

Is the context window the same as the model’s memory?

It’s the short-term memory for one request. Anything outside the window (text from earlier in a long chat, or a document that was never retrieved) the model simply can’t see. Long-term memory across sessions is a separate feature an app builds on top.

Does a bigger context window mean better answers?

Not necessarily. It lets you include more, but cost, latency, and the “lost in the middle” effect all push the other way. Relevant-and-concise usually beats large-and-noisy.

What happens when I exceed it?

Depending on the app, the request errors out or older/middle content is trimmed to fit, so the model answers without the part that was dropped.
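
Both behaviors can be sketched in one function. This is a hypothetical example, not how any particular app works: the names are invented, tokens are estimated with the rough 4-characters rule, and the trimming policy shown is the simple “drop the oldest turns first” variant.

```python
import math

class ContextOverflowError(Exception):
    """Raised in strict mode when the request does not fit (hypothetical)."""

def estimate_tokens(text: str) -> int:
    # Rough 4-characters-per-token rule of thumb, not a real tokenizer.
    return math.ceil(len(text) / 4)

def fit_history(system_prompt: str, turns: list[str], window: int,
                reserve_output: int, strict: bool = False) -> list[str]:
    """Return the turns to send, keeping room for the system prompt and reply."""
    budget = window - reserve_output - estimate_tokens(system_prompt)
    if sum(estimate_tokens(t) for t in turns) <= budget:
        return list(turns)
    if strict:
        raise ContextOverflowError("conversation does not fit in the window")
    kept, used = [], 0
    for turn in reversed(turns):      # walk from newest to oldest
        cost = estimate_tokens(turn)
        if used + cost > budget:
            break                     # this turn and everything older is dropped
        kept.append(turn)
        used += cost
    kept.reverse()                    # restore chronological order
    return kept

turns = ["a" * 40, "b" * 40, "c" * 40]   # about 10 tokens each
kept = fit_history("Be brief.", turns, window=40, reserve_output=15)
print(len(kept))  # 2: the oldest turn was dropped
```

With strict=True the same call raises ContextOverflowError instead of trimming, which mirrors an app that rejects the request.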

Try the Context Window Visualizer, or read about RAG and MCP.