Legacy Concept Lab

Quantization: Compressing Models to Integers

How you run large models on consumer GPUs: at 4 bits, a 70B model needs about 35 GB for its weights (two 24 GB cards), and a ~30B model fits in a single 24 GB card

Concept 63 of 100 | Efficiency | Phase 6

Why It Matters for Modern Models

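  • How you run large models on consumer GPUs: at 4 bits, a 70B model needs about 35 GB for its weights (two 24 GB cards), and a ~30B model fits in a single 24 GB card
  • GPTQ, AWQ, and GGML (now GGUF) are standard for deploying open-weight LLMs
  • INT8 tensor-core peak throughput is about 2× that of FP16 on modern NVIDIA GPUs; measured end-to-end speedups are usually smaller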

What Tutorials Skip

What is still poorly explained in textbooks and papers:

  • Neural nets are surprisingly robust to quantization—most precision is wasted
  • Outliers are the enemy: a few large activations force bad scaling for everything else
  • Quantization-aware training (QAT) usually preserves accuracy better than post-training quantization (PTQ), but it requires retraining and is expensive
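
The outlier point can be made concrete with a small sketch. It assumes symmetric absmax INT8 quantization with a single scale for the whole vector; the activation values and the helper name `absmax_int8` are made up for illustration and do not come from any library.

```python
def absmax_int8(values):
    """Symmetric INT8 quantization with one shared scale (hypothetical helper)."""
    scale = max(abs(v) for v in values) / 127
    codes = [round(v / scale) for v in values]
    errors = [abs(v - q * scale) for v, q in zip(values, codes)]
    return codes, errors

# Illustrative activations: eight typical values, then the same plus one outlier.
typical = [0.12, -0.35, 0.48, -0.07, 0.26, -0.41, 0.33, 0.05]
with_outlier = typical + [60.0]

_, err_clean = absmax_int8(typical)
codes_outlier, err_outlier = absmax_int8(with_outlier)

# Compare the error on the typical values only.
n = len(typical)
mean_clean = sum(err_clean) / n
mean_outlier = sum(err_outlier[:n]) / n

print(f"mean error without outlier: {mean_clean:.5f}")
print(f"mean error with outlier:    {mean_outlier:.5f}")
print("codes used by typical values:", sorted(set(codes_outlier[:n])))
```

With the outlier present, the shared scale grows so much that the typical values collapse onto only a few integer codes. LLM.int8() (listed under Canonical Papers) targets this effect by handling outlier dimensions in higher precision.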

Interactive Visualization

Core Math (Optional Deep Dive)

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation
$$w_q = \text{round}\left(\frac{w - w_{min}}{\Delta}\right)$$

Map FP32 weights to INT8 (or INT4):

Uniform quantization:

$$w_q = \text{round}\left(\frac{w - w_{min}}{\Delta}\right), \quad \Delta = \frac{w_{max} - w_{min}}{2^b - 1}$$

Dequantization:

$$\hat{w} = w_q \cdot \Delta + w_{min}$$
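
The quantize and dequantize formulas can be sketched as a round trip in plain Python. This is a minimal illustration, not a production kernel: the function names, the sample weights, and the guard for a constant tensor are assumptions added here.

```python
def quantize(weights, bits=8):
    """Uniform quantization: map floats to integer codes in [0, 2**bits - 1]."""
    w_min, w_max = min(weights), max(weights)
    levels = 2 ** bits - 1
    # Guard (an assumption of this sketch): avoid a zero step for constant input.
    delta = (w_max - w_min) / levels if w_max > w_min else 1.0
    # Python's round() rounds halves to the nearest even integer;
    # the worst-case error of delta / 2 still holds.
    codes = [round((w - w_min) / delta) for w in weights]
    return codes, delta, w_min

def dequantize(codes, delta, w_min):
    """Reconstruct approximate weights: w_hat = w_q * delta + w_min."""
    return [q * delta + w_min for q in codes]

weights = [-0.62, -0.10, 0.0, 0.07, 0.33, 0.91]  # made-up example values
codes, delta, w_min = quantize(weights, bits=4)
restored = dequantize(codes, delta, w_min)
max_err = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:   ", codes)
print("restored:", [f"{r:.3f}" for r in restored])
print(f"max error {max_err:.4f} vs. delta/2 = {delta / 2:.4f}")
```

The reconstruction error of each weight is at most half a step, Δ/2, so halving the bit width roughly squares the number of lost levels rather than merely doubling the error.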

Per-channel scaling (better quality):

$$W_q = \text{round}(W / s), \quad s_i = \max(|W_{i,:}|) / 127$$
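
A short sketch of why one scale per row helps, assuming symmetric INT8 quantization and treating each row of W as one output channel. The matrix values and function names are illustrative, and the per-tensor variant is included only for comparison.

```python
def quantize_per_channel(W, qmax=127):
    """One scale per row: s_i = max(|W[i, :]|) / qmax."""
    W_q, scales = [], []
    for row in W:
        max_abs = max(abs(w) for w in row)
        s = max_abs / qmax if max_abs > 0 else 1.0  # guard for all-zero rows
        scales.append(s)
        W_q.append([round(w / s) for w in row])
    return W_q, scales

def quantize_per_tensor(W, qmax=127):
    """A single scale shared by every entry of W."""
    max_abs = max(abs(w) for row in W for w in row)
    s = max_abs / qmax if max_abs > 0 else 1.0
    return [[round(w / s) for w in row] for row in W], s

def max_row_error(row, q_row, s):
    return max(abs(w - q * s) for w, q in zip(row, q_row))

W = [
    [0.010, -0.020, 0.015],  # small-magnitude channel
    [4.000, -2.500, 1.000],  # large-magnitude channel
]
W_q_pc, scales = quantize_per_channel(W)
W_q_pt, s_pt = quantize_per_tensor(W)

# Error on the small-magnitude channel under each scheme.
err_pc = max_row_error(W[0], W_q_pc[0], scales[0])
err_pt = max_row_error(W[0], W_q_pt[0], s_pt)
print(f"per-channel error: {err_pc:.6f}")
print(f"per-tensor error:  {err_pt:.6f}")
```

With a shared scale, the large row dictates the step size and the small row is rounded almost entirely to zero; with per-row scales, each row uses the full INT8 range. The cost is storing one extra scale per row.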

Memory: FP32 → INT8 is a 4× reduction; FP32 → INT4 is an 8× reduction.
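
The arithmetic behind these ratios, as a back-of-the-envelope sketch. It counts weight storage only and assumes 1 GB = 10^9 bytes; scales, zero-points, activations, and the KV cache add to the real footprint. The function name is made up for illustration.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Weight storage only, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

for bits in (32, 16, 8, 4):
    gb = weight_memory_gb(70e9, bits)
    print(f"70B parameters at {bits:>2}-bit: {gb:.0f} GB")
```

For a 70B-parameter model this gives 280 GB at FP32, 140 GB at FP16, 70 GB at INT8, and 35 GB at INT4.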

Canonical Papers

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Frantar et al., 2023, ICLR

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

Dettmers et al., 2022, NeurIPS
