AI Explained · Tool

LLM VRAM Calculator

Estimate the GPU memory needed to run a model locally from its size, quantization, and context length, and see which cards fit. Weights are exact; KV cache and overhead are labelled estimates.

Model

Quantization

Runtime

Architecture
Recommended VRAM0 GB
Model weights (exact)0 GB
KV cache (est.)0 GB
Overhead (est.)0 GB

What eats GPU memory

Running a model locally needs room for three things. The weights dominate, and that part is exact arithmetic: parameters × bits-per-weight ÷ 8. Quantization is the big lever, dropping from 16-bit to a 4-bit quant roughly quarters the weight memory, which is how a 7B model fits in 8 GB. The KV cache grows with context length, every token you keep in the window stores keys and values for every layer. Finally there’s overhead: activations, CUDA context, and fragmentation.

Reading the numbers

Treat the total as a planning figure with headroom, not a guarantee. Batch size, longer context, LoRA adapters, and the runtime all move it. Everything is computed in your browser.

FAQ

Why is the weight number the reliable one?

Because it’s pure arithmetic: a parameter stored at N bits takes N/8 bytes, full stop. The KV cache and overhead depend on architecture and runtime, so those are honest estimates, not exact.

What quantization should I pick?

4-bit (Q4_K_M) is the popular quality/size sweet spot for local use; 5–6 bit trades a little memory for a little more fidelity; 8-bit and FP16 are closer to full precision. Lower than 4-bit saves memory but degrades quality faster.

Is anything uploaded?

No. It’s arithmetic in your browser, nothing is sent anywhere and it works offline.

Related

See how context length drives the KV cache, or browse AI Explained.