How Much VRAM Do You Need? AI Server Memory Calculator
The most common question we get when sizing an AI server is also the simplest: how much GPU memory does the model I want actually need? Pick a model size and a quantization level below and you will see the approximate VRAM and the kind of card that fits. The figures are real-world ranges — an 8B model at Q4 lands around 5 to 6GB, a 70B model at Q4_K_M around 38 to 40GB, and that same 70B in full FP16 precision near 140GB.
The formula in plain English
VRAM for running a model breaks into three parts: the model weights, the KV cache, and a little overhead. You can estimate the big one — the weights — with simple math.
VRAM for weights ≈ parameters × bytes per weight. A 70B model has 70 billion parameters. In FP16 each weight takes 2 bytes, so 70 billion × 2 bytes ≈ 140GB. Quantize it to Q4_K_M, which averages a little over 4 bits (just over half a byte) per weight, and that shrinks to around 38 to 40GB.
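As a rough sketch of that arithmetic, the Python below multiplies the parameter count by the bits per weight. The function name `weights_gb` and the bits-per-weight table are assumptions for illustration: 16 bits for FP16 comes from the text, while 8.5 for Q8 and 4.5 for Q4_K_M are approximate averages chosen to land inside the 70B ranges quoted on this page. Real files vary with the exact model and build.

```python
# Rough weights-only estimate: parameters x bits per weight / 8 bits per byte.
# The Q8 and Q4_K_M values are assumed averages, not exact format constants.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8": 8.5,      # assumption: 8 bits plus a little per-block overhead
    "Q4_K_M": 4.5,  # assumption: a little over 4 bits on average
}


def weights_gb(params_billion: float, quant: str) -> float:
    """Approximate VRAM for the weights in GB (1 GB = 1e9 bytes)."""
    bytes_per_weight = BITS_PER_WEIGHT[quant] / 8
    # Billions of parameters x bytes per parameter = gigabytes.
    return params_billion * bytes_per_weight


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"70B at {quant}: ~{weights_gb(70, quant):.0f} GB")
    # 70B at FP16: ~140 GB
    # 70B at Q8: ~74 GB
    # 70B at Q4_K_M: ~39 GB
```

This covers the weights only; the KV cache and overhead come on top.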
Then add the KV cache, which grows with how many people are talking to the server and how long each context is, plus a small slice of overhead. The table and calculator below cover the weights only, so add that headroom on top when you plan.
VRAM by model size and quantization
Approximate VRAM for the model weights at three common precision levels. All figures are ranges that move with the exact model and build — treat them as planning estimates, not guarantees.
| Model size | FP16 (full) | Q8 | Q4_K_M | Fits on |
|---|---|---|---|---|
| 8B | ~16 GB | ~9–10 GB | ~5–6 GB | Consumer card (16–24GB) |
| 13B | ~26 GB | ~14–15 GB | ~8–9 GB | Consumer / pro card (24GB) |
| 32B | ~64 GB | ~34–36 GB | ~19–22 GB | One 32–48GB+ pro card |
| 70B | ~140 GB | ~70–75 GB | ~38–40 GB | One 96GB card (Q4); multi-GPU at FP16 |
| 120B | ~240 GB | ~120–130 GB | ~65–70 GB | One 96GB card (Q4); multi-GPU above |
Want the cards behind these numbers? See the AI server GPU comparison, or read how the models themselves run on a local LLM server.
Don't forget the KV cache — VRAM grows with users and context
The model weights are only the fixed cost. Every active conversation also keeps a KV cache — short-term memory the model holds while it generates a reply. That cache lives in VRAM too.
The KV cache grows with two things: the number of people using the server at the same time, and how long each context (the prompt plus the conversation so far) gets. Ten people each running a long document through the model use far more cache than one person asking a quick question.
This is why a server that "fits" a model on paper still needs headroom on top. When we size a build, we leave room for the weights and a realistic KV-cache budget for your team. There is more on translating people into hardware in our concurrency and team sizing guide.
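For a sense of scale, here is a minimal sketch of the usual KV-cache estimate for a transformer: one key and one value vector per layer, per token, per session. The function name `kv_cache_gb` and the architecture numbers (80 layers, 8 KV heads, head dimension 128, 16-bit cache entries) are illustrative assumptions, not figures from this page. Substitute the values from your own model's configuration.

```python
def kv_cache_gb(
    sessions: int,
    context_tokens: int,
    layers: int = 80,          # assumption: illustrative large-model depth
    kv_heads: int = 8,         # assumption: grouped-query attention
    head_dim: int = 128,       # assumption
    bytes_per_value: int = 2,  # assumption: 16-bit cache entries
) -> float:
    """Approximate KV-cache size in GB (1 GB = 1e9 bytes)."""
    # Keys and values (the factor of 2) are stored for every layer and token.
    bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return sessions * context_tokens * bytes_per_token / 1e9


one_quick_question = kv_cache_gb(sessions=1, context_tokens=500)
ten_long_documents = kv_cache_gb(sessions=10, context_tokens=8000)
print(f"1 user, 500 tokens:    ~{one_quick_question:.2f} GB")   # ~0.16 GB
print(f"10 users, 8000 tokens: ~{ten_long_documents:.1f} GB")   # ~26.2 GB
```

Under these assumptions the cache goes from a rounding error to a budget line comparable to the quantized weights themselves. The estimate is a worst case in which every session fills its context; serving engines that share or page the cache may use less.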
VRAM calculator
Pick a model size and quantization to see the approximate VRAM for the weights and the kind of card that fits. Results are ranges — add headroom for the KV cache when several people use the server at once.
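The same lookup can be sketched in a few lines of Python. This is a minimal stand-in, not the calculator's actual code: the names `VRAM_TABLE`, `FITS_ON`, and `calculate` are hypothetical, and the figures are copied from the table on this page.

```python
# Weights-only ranges in GB, copied from the table on this page: (low, high).
VRAM_TABLE = {
    "8B":   {"FP16": (16, 16),   "Q8": (9, 10),    "Q4_K_M": (5, 6)},
    "13B":  {"FP16": (26, 26),   "Q8": (14, 15),   "Q4_K_M": (8, 9)},
    "32B":  {"FP16": (64, 64),   "Q8": (34, 36),   "Q4_K_M": (19, 22)},
    "70B":  {"FP16": (140, 140), "Q8": (70, 75),   "Q4_K_M": (38, 40)},
    "120B": {"FP16": (240, 240), "Q8": (120, 130), "Q4_K_M": (65, 70)},
}

# "Fits on" column from the same table.
FITS_ON = {
    "8B": "Consumer card (16-24GB)",
    "13B": "Consumer / pro card (24GB)",
    "32B": "One 32-48GB+ pro card",
    "70B": "One 96GB card (Q4); multi-GPU at FP16",
    "120B": "One 96GB card (Q4); multi-GPU above",
}


def calculate(model_size: str, quant: str) -> tuple[str, str]:
    """Return (approx. VRAM for weights, recommended GPU) for one table cell."""
    low, high = VRAM_TABLE[model_size][quant]
    vram = f"~{low} GB" if low == high else f"~{low} to {high} GB"
    return vram, FITS_ON[model_size]


vram, gpu = calculate("70B", "Q4_K_M")
print(f"Approx. VRAM for weights: {vram}")  # ~38 to 40 GB
print(f"Recommended GPU: {gpu}")            # One 96GB card (Q4); multi-GPU at FP16
```

A lookup keeps the output identical to the published ranges; an unknown model size or quantization raises a `KeyError` rather than guessing.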
Numbers in hand but not sure which engine serves them best? Compare Ollama vs vLLM vs llama.cpp.
What quantization actually costs you
Quantization shrinks a model by storing its weights in fewer bits. FP16 keeps every weight at 16 bits, Q8 roughly halves that, and Q4_K_M is roughly 70% smaller than FP16, with only a small quality cost. That is why Q4_K_M is the community sweet spot: it lets a 70B model that would need about 140GB in FP16 fit in roughly 38 to 40GB.
The trade is real but modest: at Q4_K_M most users do not notice a quality difference on everyday business tasks, while the VRAM saving is dramatic. If you have memory to spare, Q8 splits the difference. If accuracy is critical and you have the hardware, FP16 leaves nothing on the table.
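To put numbers on the saving, the short sketch below computes the percentage reduction relative to FP16 using the 70B figures from the table on this page. The helper name `saving_vs_fp16` is hypothetical.

```python
def saving_vs_fp16(fp16_gb: float, quant_gb: float) -> float:
    """Percent of VRAM saved relative to the FP16 size."""
    return (1 - quant_gb / fp16_gb) * 100


# 70B figures from the table: FP16 ~140 GB, Q8 ~70-75 GB, Q4_K_M ~38-40 GB.
q8_low, q8_high = saving_vs_fp16(140, 75), saving_vs_fp16(140, 70)
q4_low, q4_high = saving_vs_fp16(140, 40), saving_vs_fp16(140, 38)
print(f"Q8:     {q8_low:.0f}% to {q8_high:.0f}% smaller")   # 46% to 50% smaller
print(f"Q4_K_M: {q4_low:.0f}% to {q4_high:.0f}% smaller")   # 71% to 73% smaller
```

The smaller models in the table save slightly less at Q4_K_M, so treat the percentage as a range rather than a constant.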
We help you pick the precision that matches your workload and your budget. There is more on how the models behave at each level on our local LLM server page.
Mapping VRAM to a real server
Once you know the VRAM, the build follows naturally:
- Small models (8B–13B): at Q4_K_M or Q8, a single consumer card with 16–24GB handles these with room for context and a few users. It is a great low-cost starting point.
- Mid-size (32B): a single pro card in the 32–48GB range runs it comfortably at Q4; a 24GB card fits the weights but leaves little room for the KV cache. Step up for FP16.
- Large (70B at Q4_K_M): one 96GB card — the RTX PRO 6000 Blackwell sweet spot — runs it with room left over for the KV cache and several concurrent users, no multi-GPU juggling.
- Above 96GB (70B at FP16, 120B at Q8 or FP16): this is where multiple GPUs earn their keep, splitting one big model across cards.
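The tiers above can be sketched as a simple threshold check on the total VRAM budget (weights plus KV cache plus overhead). The function name `suggest_tier` is hypothetical, the cut-offs at 24, 48, and 96GB are taken from the list, and the multi-GPU card count is a capacity-only assumption that ignores per-card overhead when a model is split.

```python
import math


def suggest_tier(total_gb: float) -> str:
    """Map a total VRAM budget in GB to one of the tiers in the list above."""
    if total_gb <= 24:
        return "single consumer card (16-24GB class)"
    if total_gb <= 48:
        return "single pro card (up to 48GB)"
    if total_gb <= 96:
        return "one 96GB card"
    # Assumption: capacity only; real multi-GPU splits need extra headroom.
    cards = math.ceil(total_gb / 96)
    return f"multi-GPU: at least {cards} x 96GB cards"


print(suggest_tier(6 + 4))     # 8B at Q4 plus a modest cache budget
print(suggest_tier(40 + 26))   # 70B at Q4 plus a large cache budget
print(suggest_tier(140 + 26))  # 70B at FP16 plus the same cache budget
```

Passing in the sum rather than the weights alone is the point: a 70B model at Q4 fits a 48GB card on paper, but a realistic cache budget moves it to the 96GB tier.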
Spec the GPU once and you are done — there is no per-token meter waiting on the other side. See the cards that map to each tier in the GPU comparison, or the full builds on GPU AI servers.
We size the VRAM so you don't have to guess
Tell us the models you want to run and how many people will use them, and we spec the GPU and VRAM to match — then build and install it on-site across Houston, Katy, Fulshear and the rest of Fort Bend County. See our Texas service areas.
VRAM questions
How much VRAM do I need to run a 70B model?
At Q4_K_M quantization a 70B model needs roughly 38 to 40GB of VRAM for the weights, so a single 96GB card runs it comfortably with room for context and several users. In full FP16 precision the same model needs about 140GB, which means multiple GPUs.
How much VRAM does an 8B model need?
A small 8B model fits in roughly 5 to 6GB of VRAM at Q4_K_M, about 9 to 10GB at Q8, and around 16GB at FP16. A consumer card with 16 to 24GB handles it with room for context and a few concurrent users.
What does quantization do to VRAM and quality?
Quantization compresses the model weights to fewer bits. Q4_K_M is roughly 70 percent smaller than FP16 with only a small quality cost, which is why it is the community sweet spot. Q8 sits in the middle at about half the size of FP16, and FP16 is full size and full quality.
Can one RTX PRO 6000 Blackwell run a 70B model?
Yes. With 96GB of VRAM, one card runs a 70B model at Q4_K_M (about 38 to 40GB) with plenty of headroom left for the KV cache, longer context, and several concurrent users — no multi-GPU juggling required.
Why does adding more users increase VRAM?
Each active conversation keeps its own KV cache — short-term memory the model holds while generating. That cache lives in VRAM and grows with the number of concurrent sessions and the length of each context, so the more people on the server at once, the more headroom you need beyond the model weights.
Next, pick the card with the GPU comparison, or go back to AI Servers.
Turn the number into a build
Tell us the models and the team size — we'll spec the VRAM, build the server, and install it on-site in Texas. You own it outright.