From FP16 to Q4: Understanding Quantization in Ollama

Published: December 15, 2025 at 10:30 PM EST
1 min read
Source: Dev.to

What is quantization?

An unquantized LLM stores its weights as 32-bit (FP32) or 16-bit (FP16) floating-point numbers.
Quantization stores (and often computes with) those weights using fewer bits, trading a little precision for a much smaller memory footprint.

Common formats

  • FP16 – 16 bits
  • INT8 – 8 bits
  • INT4 – 4 bits
  • INT2 – 2 bits
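
To get a feel for what that buys you, here is a back-of-the-envelope estimate of the weight memory for an 8B-parameter model at each width (a sketch; it ignores the KV cache, activations, and per-block quantization metadata):

```python
# Approximate memory needed just to hold the weights of an 8B model.
params = 8e9

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4), ("INT2", 2)]:
    gigabytes = params * bits / 8 / 1e9
    print(f"{name}: ~{gigabytes:.0f} GB")  # FP16: ~16 GB ... INT2: ~2 GB
```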

Example

0.12345678 (32-bit float)

Approximated with fewer bits, it might be stored as:

0.12 (roughly the precision an 8-bit or 4-bit format can preserve)
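
In reality the weights aren't rounded decimally; each one is mapped to the nearest of a small set of representable levels. Here is a minimal sketch of symmetric 8-bit quantization (the scale choice and NumPy round-trip are illustrative, not Ollama's actual code path):

```python
import numpy as np

# A handful of "weights" in FP32.
weights = np.array([0.12345678, -0.5, 0.9, -0.03], dtype=np.float32)

# Symmetric int8 quantization: map [-absmax, absmax] onto [-127, 127].
scale = np.abs(weights).max() / 127
q = np.round(weights / scale).astype(np.int8)   # stored as 8-bit integers
dequant = q.astype(np.float32) * scale          # reconstructed at compute time

print(q)        # [ 17 -71 127  -4]
print(dequant)  # close to the originals, but not exact
```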

Ollama quantization formats

Model names encode the quantization format in their suffix, e.g.:

llama3:8b-q4_K_M
mistral:7b-q8_0
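
Everything after the size tag is the quantization suffix. Here is a tiny, hypothetical helper to pull it out of a tag string (Ollama itself reports this via `ollama show <model>`; there is no official Python API like this):

```python
def quant_of(tag: str) -> str:
    """Return the quantization suffix of an Ollama model tag, if any.

    Hypothetical helper for illustration only.
    """
    _, _, variant = tag.partition(":")       # "llama3:8b-q4_K_M" -> "8b-q4_K_M"
    size, _, quant = variant.partition("-")  # "8b-q4_K_M" -> ("8b", "q4_K_M")
    return quant or "default"

print(quant_of("llama3:8b-q4_K_M"))  # q4_K_M
print(quant_of("mistral:7b-q8_0"))   # q8_0
```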

Format table

Format   Bits   Meaning
Q2       ~2     Extreme compression, poor quality
Q4_0     4      Fast, lower quality
Q4_K     4      K-quant scheme, better quality than Q4_0
Q4_K_M   4      Best Q4 quality/size trade-off
Q5_K_M   5      Better quality, more RAM
Q6_K     6      Near-FP16 quality
Q8_0     8      Very high quality
FP16     16     Almost original
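
The 4-bit formats don't quantize a whole tensor with one scale; weights are grouped into small blocks, each with its own scale. Below is a rough round-trip sketch in that spirit (the block size of 32 and the exact level mapping are assumptions loosely based on llama.cpp's GGUF formats, not a faithful reimplementation):

```python
import numpy as np

BLOCK = 32  # assumed block size, as in llama.cpp's Q4_0

def q4_roundtrip(block: np.ndarray) -> np.ndarray:
    """Quantize one block to 16 signed levels (4 bits), then dequantize."""
    scale = np.abs(block).max() / 7 + 1e-12      # one scale per block
    q = np.clip(np.round(block / scale), -8, 7)  # 4-bit codes in [-8, 7]
    return (q * scale).astype(np.float32)        # reconstructed weights

weights = np.random.randn(BLOCK).astype(np.float32)
approx = q4_roundtrip(weights)
print(float(np.abs(weights - approx).max()))  # per-weight error <= scale / 2
```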

Wrapping up

Hope you now have a clearer understanding of what quantization means and what those format suffixes actually represent. Running an LLM locally offers many learning opportunities, and quantization is just one of them.
