Estimate the GPU memory needed to run a model locally from its size, quantization, and context length, and see which cards fit. Weights are exact; KV cache and overhead are labelled estimates.
Running a model locally needs room for three things. The weights dominate, and that part is exact arithmetic: parameters × bits-per-weight ÷ 8. Quantization is the big lever, dropping from 16-bit to a 4-bit quant roughly quarters the weight memory, which is how a 7B model fits in 8 GB. The KV cache grows with context length, every token you keep in the window stores keys and values for every layer. Finally there’s overhead: activations, CUDA context, and fragmentation.
Treat the total as a planning figure with headroom, not a guarantee. Batch size, longer context, LoRA adapters, and the runtime all move it. Everything is computed in your browser.
Because it’s pure arithmetic: a parameter stored at N bits takes N/8 bytes, full stop. The KV cache and overhead depend on architecture and runtime, so those are honest estimates, not exact.
4-bit (Q4_K_M) is the popular quality/size sweet spot for local use; 5–6 bit trades a little memory for a little more fidelity; 8-bit and FP16 are closer to full precision. Lower than 4-bit saves memory but degrades quality faster.
No. It’s arithmetic in your browser, nothing is sent anywhere and it works offline.
See how context length drives the KV cache, or browse AI Explained.