From FP16 to Q4: Understanding Quantization in Ollama

What is quantization?
An unquantized LLM typically stores its weights as 32-bit (FP32) or 16-bit (FP16) floating-point numbers.
Quantization means storing (and often computing with) those weights using fewer bits per value.
Common formats
- FP16 – 16 bits
- INT8 – 8 bits
- INT4 – 4 bits
- INT2 – 2 bits
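Fewer bits means exponentially fewer representable levels per weight: INT8 can encode 256 distinct values (2^8), INT4 only 16, and INT2 just 4, which is why quality drops sharply at the very low end.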
Example
0.12345678 (32-bit float)
Approximated with fewer bits, this becomes roughly:
0.12 (8-bit/4-bit)
Note that real quantization does not simply truncate decimals: it rounds each weight to one of a small set of integer levels and stores a shared scale factor that maps those integers back to floats.
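To make that concrete, here is a minimal sketch of symmetric 8-bit quantization. This is not Ollama's actual scheme (llama.cpp's K-quants work block-wise, with multiple scales per tensor), but it shows the same core idea:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric int8 quantization: map floats onto 255 integer levels."""
    scale = np.abs(weights).max() / 127.0   # one shared scale factor
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the integer codes."""
    return q.astype(np.float32) * scale

weights = np.array([0.12345678, -0.5, 0.9, 0.001], dtype=np.float32)
q, scale = quantize_int8(weights)
print(dequantize(q, scale))  # close to the originals, but not exact
```

Running this, 0.12345678 comes back as roughly 0.1205: the information lost in rounding is exactly the quality cost of quantization.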
Ollama quantization formats
Model names encode the quantization format in their suffix, e.g.:
llama3:8b-q4_K_M
mistral:7b-q8_0
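If the Ollama library publishes that tag for the model you want (availability varies per model), you can pull and run it directly:
ollama pull llama3:8b-q4_K_M
ollama run llama3:8b-q4_K_M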
Format table
| Format | Bits | Meaning |
|---|---|---|
| Q2_K | ~2 | Extreme compression, noticeable quality loss |
| Q4_0 | 4 | Legacy 4-bit format; fast, lower quality |
| Q4_K | ~4 | K-quant: block-wise quantization with per-block scales |
| Q4_K_M | ~4 | Medium K-quant variant; the usual Q4 sweet spot |
| Q5_K_M | ~5 | Better quality, more RAM |
| Q6_K | ~6 | Near-FP16 quality |
| Q8_0 | 8 | Very high quality, nearly lossless |
| FP16 | 16 | Unquantized half precision, effectively the original |
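A handy rule of thumb for sizing these against your RAM: weight memory ≈ parameter count × bits per weight ÷ 8. A quick sketch (bit widths here are nominal; actual GGUF files run a bit larger because each block of weights also stores scale metadata):

```python
PARAMS = 8e9  # an 8B-parameter model, e.g. llama3:8b

for name, bits in {"Q2_K": 2, "Q4_K_M": 4, "Q5_K_M": 5,
                   "Q6_K": 6, "Q8_0": 8, "FP16": 16}.items():
    gb = PARAMS * bits / 8 / 1e9  # bytes for weights alone, as GB
    print(f"{name:>7}: ~{gb:.0f} GB of weights")
```

For an 8B model, that works out to roughly 4 GB at Q4 versus 16 GB at FP16, which is why Q4_K_M is such a common default for local machines.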
Wrapping up
I hope you now have a clearer understanding of what quantization means and what those format suffixes actually represent. Running an LLM locally offers many learning opportunities, and quantization is just one of them.