Legacy Concept Lab

Quantization: Compressing Models to Integers

How large models run on consumer GPUs: 4-bit quantization cuts a 70B model's weights to about 35 GB, and models of roughly 30B parameters fit in 24 GB of VRAM

Phase 6: Modern efficiency & inference · Concept 63 of 100 · Efficiency

Why It Matters for Modern Models

What is still poorly explained in textbooks and papers:

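Neural nets are surprisingly robust to quantization: most of the precision in FP32 weights is not needed to preserve accuracy
Outliers are the enemy: a few large activations stretch the quantization range, which forces a coarse step size on every other value
Quantization-aware training (QAT) usually preserves accuracy better than post-training quantization (PTQ), but it requires training with quantization in the loop, which is expensive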
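
The outlier point can be made concrete with a small experiment. The sketch below applies the uniform quantization rule from the key equation to a list of made-up values, once as-is and once with a single large value appended. The helper name `mean_abs_error` and the sample data are illustrative assumptions, not taken from any library or paper.

```python
def mean_abs_error(values, bits=8):
    """Quantize and dequantize with one shared range, then report mean |error|."""
    w_min, w_max = min(values), max(values)
    delta = (w_max - w_min) / (2 ** bits - 1)  # step size set by the full range
    restored = [round((v - w_min) / delta) * delta + w_min for v in values]
    return sum(abs(v - r) for v, r in zip(values, restored)) / len(values)

# Hypothetical data: 101 "typical" values in [-0.5, 0.5].
typical = [i / 100 for i in range(-50, 51)]
# The same values plus one outlier that stretches the range to [-0.5, 60].
with_outlier = typical + [60.0]

error_typical = mean_abs_error(typical)
error_outlier = mean_abs_error(with_outlier)
print(f"mean error without outlier: {error_typical:.5f}")
print(f"mean error with outlier:    {error_outlier:.5f}")
```

With the outlier present, the step size grows by the same factor as the range, so the typical values are squeezed onto a handful of integer codes and their reconstruction error rises accordingly.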

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation

w_q = \text{round}\left(\frac{w - w_{min}}{\Delta}\right)

Map FP32 weights to INT8 (or INT4):

Uniform quantization:

w_q = \text{round}\left(\frac{w - w_{min}}{\Delta}\right), \quad \Delta = \frac{w_{max} - w_{min}}{2^b - 1}
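
As a quick worked instance of the formula above, assume (purely for illustration) that the weights span w_{min} = -1 to w_{max} = 1, that b = 8, and that the weight being quantized is w = 0.5:

```latex
\Delta = \frac{1 - (-1)}{2^8 - 1} = \frac{2}{255} \approx 0.00784

w_q = \text{round}\left(\frac{0.5 - (-1)}{2/255}\right) = \text{round}(191.25) = 191
```

The result is an integer code between 0 and 2^b - 1 = 255, so it fits in one unsigned byte.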

Dequantization:

\hat{w} = w_q \cdot \Delta + w_{min}
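
A minimal round trip through both formulas might look like the sketch below. The function names `quantize` and `dequantize` and the sample weights are assumptions chosen for illustration; real libraries typically operate on tensors and store the offset as an integer zero point rather than as w_min.

```python
def quantize(weights, bits=8):
    """Uniform quantization: map floats to integer codes in [0, 2**bits - 1]."""
    w_min, w_max = min(weights), max(weights)
    delta = (w_max - w_min) / (2 ** bits - 1)
    if delta == 0:
        delta = 1.0  # constant input: any step works, every code is 0
    # Note: Python's round() sends exact halves to the nearest even integer.
    codes = [round((w - w_min) / delta) for w in weights]
    return codes, delta, w_min

def dequantize(codes, delta, w_min):
    """Reconstruct approximate floats from integer codes."""
    return [q * delta + w_min for q in codes]

weights = [-1.0, -0.25, 0.0, 0.5, 1.0]  # hypothetical sample weights
codes, delta, w_min = quantize(weights, bits=8)
restored = dequantize(codes, delta, w_min)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:    ", codes)
print("restored: ", [round(r, 4) for r in restored])
print("max error:", max_error, "(bounded by delta / 2 =", delta / 2, ")")
```

Because every weight is rounded to the nearest grid point, the reconstruction error of any single weight is at most half a step, Δ/2.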

Per-channel scaling (better quality):

W_{q,i,:} = \text{round}\left(W_{i,:} / s_i\right), \quad s_i = \max_j |W_{i,j}| / 127
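
To see why a separate scale per channel helps, the sketch below quantizes a tiny made-up matrix twice: once with one scale per row, as in the formula above, and once with a single scale shared by the whole matrix. The function name `quantize_per_channel` and the matrix values are illustrative assumptions.

```python
def quantize_per_channel(matrix):
    """Symmetric INT8 quantization with one scale per row (channel)."""
    quantized, scales = [], []
    for row in matrix:
        max_abs = max(abs(w) for w in row)
        scale = max_abs / 127 if max_abs > 0 else 1.0  # avoid dividing by zero
        quantized.append([round(w / scale) for w in row])
        scales.append(scale)
    return quantized, scales

# Hypothetical weights: row 0 is small in magnitude, row 1 is large.
W = [
    [0.02, -0.01, 0.005],
    [4.0, -2.0, 1.0],
]
W_q, s = quantize_per_channel(W)

# For comparison: one scale shared by the whole matrix (per-tensor).
tensor_scale = max(abs(w) for row in W for w in row) / 127
per_tensor_row0 = [round(w / tensor_scale) for w in W[0]]

print("per-channel row 0:", W_q[0])
print("per-tensor  row 0:", per_tensor_row0)
```

With a shared scale, the large row dictates the step size and the small row collapses to codes near zero. With per-row scales, each row uses the full range from -127 to 127, at the cost of storing one extra scale per row.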

Memory: FP32 → INT8 is a 4× reduction; FP32 → INT4 is an 8× reduction (ignoring the small overhead of storing scales and offsets).
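
The arithmetic behind these ratios is simple enough to check directly. The sketch below estimates weight storage for a 70B-parameter model at several bit widths. The helper `weight_memory_gb` is a hypothetical name, and the estimate deliberately counts weights only, leaving out scales, activations, and the KV cache.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Weight storage only, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

params_70b = 70e9  # 70 billion parameters
sizes = {bits: weight_memory_gb(params_70b, bits) for bits in (32, 16, 8, 4)}

for bits, gigabytes in sizes.items():
    print(f"{bits:>2}-bit weights: {gigabytes:6.1f} GB")
```

Under these assumptions, 4-bit weights for a 70B model still need about 35 GB, so a single 24 GB GPU requires a smaller model, offloading, or a lower bit width.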

Frantar et al., 2023, ICLR

Dettmers et al., 2022, NeurIPS
