From FP16 to Q4: Understanding Quantization in Ollama

What is quantization?
An unquantized LLM typically stores its weights as 32-bit (FP32) or 16-bit (FP16) floating-point numbers.
Quantization means storing (and often computing with) those weights using fewer bits per value.
Common formats
- FP16 – 16 bits
- INT8 – 8 bits
- INT4 – 4 bits
- INT2 – 2 bits
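Fewer bits means exponentially fewer representable levels per weight: INT8 can encode 256 distinct values (2^8), INT4 only 16, and INT2 just 4, which is why quality drops sharply at the very low end.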
Example
0.12345678 (32-bit float)
Approximated with fewer bits, this becomes roughly:
0.12 (8-bit/4-bit)
Note that real quantization does not simply truncate decimals: it rounds each weight to one of a small set of integer levels and stores a shared scale factor that maps those integers back to floats.
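To make that concrete, here is a minimal sketch of symmetric 8-bit quantization. This is not Ollama's actual scheme (llama.cpp's K-quants work block-wise, with multiple scales per tensor), but it shows the same core idea:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric int8 quantization: map floats onto 255 integer levels."""
    scale = np.abs(weights).max() / 127.0   # one shared scale factor
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the integer codes."""
    return q.astype(np.float32) * scale

weights = np.array([0.12345678, -0.5, 0.9, 0.001], dtype=np.float32)
q, scale = quantize_int8(weights)
print(dequantize(q, scale))  # close to the originals, but not exact
```

Running this, 0.12345678 comes back as roughly 0.1205: the information lost in rounding is exactly the quality cost of quantization.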
Ollama quantization formats
Model names encode the quantization format in their suffix, e.g.:
llama3:8b-q4_K_M
mistral:7b-q8_0
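If the Ollama library publishes that tag for the model you want (availability varies per model), you can pull and run it directly:
ollama pull llama3:8b-q4_K_M
ollama run llama3:8b-q4_K_M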
Format table
| Format | Bits | Meaning |
|---|---|---|
| Q2_K | ~2 | Extreme compression, noticeable quality loss |
| Q4_0 | 4 | Legacy 4-bit format; fast, lower quality |
| Q4_K | ~4 | K-quant: block-wise quantization with per-block scales |
| Q4_K_M | ~4 | Medium K-quant variant; the usual Q4 sweet spot |
| Q5_K_M | ~5 | Better quality, more RAM |
| Q6_K | ~6 | Near-FP16 quality |
| Q8_0 | 8 | Very high quality, nearly lossless |
| FP16 | 16 | Unquantized half precision, effectively the original |
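A handy rule of thumb for sizing these against your RAM: weight memory ≈ parameter count × bits per weight ÷ 8. A quick sketch (bit widths here are nominal; actual GGUF files run a bit larger because each block of weights also stores scale metadata):

```python
PARAMS = 8e9  # an 8B-parameter model, e.g. llama3:8b

for name, bits in {"Q2_K": 2, "Q4_K_M": 4, "Q5_K_M": 5,
                   "Q6_K": 6, "Q8_0": 8, "FP16": 16}.items():
    gb = PARAMS * bits / 8 / 1e9  # bytes for weights alone, as GB
    print(f"{name:>7}: ~{gb:.0f} GB of weights")
```

For an 8B model, that works out to roughly 4 GB at Q4 versus 16 GB at FP16, which is why Q4_K_M is such a common default for local machines.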
Wrapping up
I hope you now have a clearer understanding of what quantization means and what those format suffixes actually represent. Running an LLM locally offers many learning opportunities, and quantization is just one of them.