RAG Chunking Visualizer: see how text splits into chunks

Why chunking decides whether RAG works

In retrieval-augmented generation, a document is split into chunks, and each chunk becomes one embedding in a vector database. Retrieval returns whole chunks, so the chunk is the unit of meaning. Too large and a single embedding has to represent several ideas, so matches get fuzzy. Too small and a chunk loses the context around it. Overlap repeats a little text between neighbouring chunks so that a sentence falling on a boundary still shows up intact in at least one of them.

The two common strategies

Fixed window slides a window of N characters (or tokens) across the text, stepping forward by size minus overlap. Simple and predictable, but it can cut mid-sentence.
Sentence-aware splits on sentence boundaries first, then packs whole sentences up to the size limit. Chunks are cleaner, and overlap is carried by repeating the trailing sentences of one chunk at the start of the next.
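
As a rough illustration of the two strategies, here is a minimal Python sketch in character mode. It is not the visualizer's own code: the function names, the regex-based sentence splitter, and the `overlap_sentences` parameter are assumptions made for this example.

```python
import re
from typing import List


def fixed_window(text: str, size: int, overlap: int) -> List[str]:
    """Slide a window of `size` characters, stepping by size - overlap."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):  # this window already reached the end
            break
    return chunks


def sentence_aware(text: str, size: int, overlap_sentences: int = 1) -> List[str]:
    """Pack whole sentences up to `size` characters, carrying trailing sentences over."""
    # Naive splitter (assumption): a sentence ends at . ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > size:
            chunks.append(" ".join(current))
            # Overlap: start the next chunk with the last few sentences.
            current = current[-overlap_sentences:] if overlap_sentences else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

In this sketch a sentence-aware chunk can run past `size` when a single sentence, or the carried-over overlap plus the next sentence, is longer than the limit. A real splitter would need a policy for that case, such as falling back to a fixed window.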

Token mode uses the real o200k_base tokenizer (the GPT-4o family), so the counts match what an embedding model on that encoding would see. Character mode is tokenizer-agnostic. Everything runs locally; your document never leaves the browser.
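
For readers who want to reproduce token mode outside the browser, the sketch below applies the same fixed-window idea to token ids. It assumes the Python `tiktoken` package as a stand-in for whatever tokenizer build the tool ships, and the function names are hypothetical.

```python
from typing import List, Sequence


def window_tokens(tokens: Sequence[int], size: int, overlap: int) -> List[Sequence[int]]:
    """Fixed-window chunking over token ids instead of characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + size])
        if start + size >= len(tokens):  # last window reached the end
            break
    return windows


def chunk_with_o200k(text: str, size: int, overlap: int) -> List[str]:
    """Token-mode chunking with o200k_base (assumes `pip install tiktoken`)."""
    import tiktoken  # imported lazily so window_tokens works without it

    enc = tiktoken.get_encoding("o200k_base")
    token_ids = enc.encode(text)
    return [enc.decode(list(w)) for w in window_tokens(token_ids, size, overlap)]
```

One caveat with this approach: a window boundary can fall inside a multi-byte character, in which case decoding that chunk may produce a replacement character at the edge.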

FAQ

Should I chunk by characters or tokens?

Embedding models have a token limit, so tokens are the honest unit. Characters are a convenient proxy (very roughly four characters per token for English). Use token mode when you need to respect a model’s real limit.
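
The four-characters-per-token rule of thumb can be turned into a quick estimate, sketched below. The helper names and the 0.8 safety margin are assumptions for illustration, and the real ratio varies with language and content, so treat the result as a guess to confirm in token mode.

```python
import math

CHARS_PER_TOKEN = 4  # very rough English average; not a guarantee


def estimate_tokens(text: str) -> int:
    """Approximate token count from character length."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def char_budget(token_limit: int, safety: float = 0.8) -> int:
    """Character chunk size to aim for when only a token limit is known.

    `safety` (hypothetical default) leaves headroom for text that
    tokenizes less efficiently than the average.
    """
    return int(token_limit * CHARS_PER_TOKEN * safety)
```

For a model with a 512-token limit, `char_budget(512)` suggests chunks of about 1,638 characters.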

How much overlap should I use?

A common starting point is 10–20% of the chunk size. Enough to keep boundary sentences intact, not so much that you bloat the index with duplicated text.
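
To see what a given overlap costs, the sketch below works out the chunk count and the total text stored for a fixed-window split. The function name and the returned fields are assumptions for this example. Lengths can be in characters or tokens, as long as they use the same unit.

```python
import math
from typing import Dict


def overlap_plan(doc_len: int, size: int, overlap_ratio: float) -> Dict[str, float]:
    """Chunk count and index bloat for a fixed-window split."""
    overlap = round(size * overlap_ratio)
    step = size - overlap
    if size <= 0 or step <= 0:
        raise ValueError("need size > 0 and an overlap smaller than size")
    if doc_len <= size:
        chunks = 1 if doc_len > 0 else 0
    else:
        chunks = math.ceil((doc_len - size) / step) + 1
    # Every chunk after the first repeats `overlap` units of the previous one.
    stored = doc_len + overlap * max(chunks - 1, 0)
    bloat = stored / doc_len if doc_len else 0.0
    return {"overlap": overlap, "step": step, "chunks": chunks,
            "stored": stored, "bloat": bloat}
```

With a 10,000-unit document, a chunk size of 500 and 15% overlap, this gives an overlap of 75, a step of 425, 24 chunks, and 11,725 units stored, which is roughly 17% more text in the index than the document itself.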

Is my document uploaded?

No. Chunking runs entirely in your browser, so the text stays on your device and the tool works offline.

Learn what RAG is, how embeddings and a vector database fit in, or count tokens with the Token Counter.
