LLM VRAM Calculator: can my GPU run this model?

What eats GPU memory

Running a model locally needs room for three things. The weights dominate, and that part is exact arithmetic: parameters × bits-per-weight ÷ 8 gives the size in bytes. Quantization is the big lever: dropping from 16-bit to a 4-bit quant roughly quarters the weight memory, which is how a 7B model (about 3.5 GB of weights at 4 bits) fits in 8 GB. The KV cache grows with context length, because every token you keep in the window stores keys and values for every layer. Finally there’s overhead: activations, CUDA context, and fragmentation.
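The weight arithmetic is simple enough to check by hand. Here is a minimal sketch of it in Python; the function name `weight_bytes` and the choice of decimal gigabytes are assumptions for illustration, not the calculator's actual code.

```python
def weight_bytes(params: float, bits_per_weight: float) -> float:
    """Weight memory in bytes: parameters x bits-per-weight / 8."""
    return params * bits_per_weight / 8


GB = 1e9  # assumption: decimal gigabytes (the page does not say GB vs GiB)

# A 7B-parameter model at three common precisions.
for bits in (16, 8, 4):
    print(f"7B at {bits}-bit: {weight_bytes(7e9, bits) / GB:.1f} GB")
# 7B at 16-bit: 14.0 GB
# 7B at 8-bit: 7.0 GB
# 7B at 4-bit: 3.5 GB
```

Real quant formats store some scaling metadata alongside the weights, so their effective bits-per-weight tends to run a little above the nominal figure; pass the effective value if you know it.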

Reading the numbers

Weights are computed exactly from your inputs.
KV cache is an estimate because it depends on the exact architecture. The layer and hidden-size defaults are inferred from the parameter count and are editable; enter your model’s real values for a tighter number.
Overhead is a rough allowance; real usage varies by runtime (llama.cpp, vLLM, Transformers).
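To see why layers, hidden size, and context length all matter for the KV cache, here is a rough sketch of the usual estimate. It assumes standard multi-head attention, where keys and values each span the full hidden size, and a 16-bit cache; models using grouped-query or multi-query attention store less. The function name and the example shape are hypothetical, not taken from the calculator.

```python
def kv_cache_bytes(
    n_layers: int,
    hidden_size: int,
    context_len: int,
    bytes_per_element: int = 2,  # assumption: 16-bit cache entries
    batch_size: int = 1,
) -> int:
    """Estimate KV cache size: one key and one value vector per token per layer."""
    keys_and_values = 2
    return (
        keys_and_values
        * n_layers
        * hidden_size
        * context_len
        * bytes_per_element
        * batch_size
    )


# Illustrative shape only: 32 layers, hidden size 4096, 4096-token context.
size = kv_cache_bytes(n_layers=32, hidden_size=4096, context_len=4096)
print(f"{size / 1e9:.2f} GB")  # 2.15 GB
```

Doubling the context length or the batch size doubles the result, which is why long windows can rival the weights themselves.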

Treat the total as a planning figure with headroom, not a guarantee. Batch size, longer context, LoRA adapters, and the runtime all move it. Everything is computed in your browser.
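As a planning aid, the three parts can be summed and padded before comparing against the card's memory. This is a sketch under stated assumptions: the 1 GB overhead allowance and the 10% headroom are placeholder values chosen for illustration, not figures from the calculator, and the function names are hypothetical.

```python
def total_vram_gb(
    weights_gb: float,
    kv_cache_gb: float,
    overhead_gb: float = 1.0,  # assumption: placeholder allowance
    headroom: float = 0.10,    # assumption: 10% safety margin
) -> float:
    """Sum the three components and pad the result by a headroom fraction."""
    return (weights_gb + kv_cache_gb + overhead_gb) * (1 + headroom)


def fits(total_gb: float, gpu_gb: float) -> bool:
    """True if the padded estimate is within the GPU's memory."""
    return total_gb <= gpu_gb


# Example: 3.5 GB of weights plus a 2.15 GB KV cache on an 8 GB card.
estimate = total_vram_gb(weights_gb=3.5, kv_cache_gb=2.15)
print(f"{estimate:.1f} GB needed, fits in 8 GB: {fits(estimate, 8.0)}")
```

If the estimate lands near the limit, shortening the context or picking a smaller quant are the two levers that move it most.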

FAQ

Why is the weight number the reliable one?

Because it’s pure arithmetic: a parameter stored at N bits takes N/8 bytes, full stop. The KV cache and overhead depend on architecture and runtime, so those are honest estimates rather than exact figures.

What quantization should I pick?

4-bit (Q4_K_M) is the popular quality/size sweet spot for local use; 5–6 bit trades a little memory for a little more fidelity; 8-bit stays very close to full precision, and FP16 is typically the unquantized baseline itself. Lower than 4-bit saves memory but degrades quality faster.

Is anything uploaded?

No. It’s arithmetic in your browser: nothing is sent anywhere, and it works offline.

See how context length drives the KV cache, or browse AI Explained.
