Legacy Concept Lab
Quantization: Compressing Models to Integers
How you run 70B models on consumer GPUs: 4-bit quantization fits in 24GB VRAM
#63QuantizationEfficiency
key equation
w_q = \text{round}\left(\frac{w - w_{min}}{\Delta}\right)Phase 6: Modern efficiency & inferenceConcept 63 of 100
Why It Matters for Modern Models
- How you run 70B models on consumer GPUs: 4-bit quantization fits in 24GB VRAM
- GPTQ/AWQ/GGML are standard for deploying open-weight LLMs
- INT8 inference is ~2× faster than FP16 on modern GPUs (tensor cores)
What Tutorials Skip
What is still poorly explained in textbooks and papers:
- Neural nets are surprisingly robust to quantization—most precision is wasted
- Outliers are the enemy: a few large activations force bad scaling for everything else
- Quantization-aware training (QAT) is better than post-training quantization (PTQ), but expensive
Interactive Visualization
Core Math (Optional Deep Dive)
If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.
Key Equation
Map FP32 weights to INT8 (or INT4):
Uniform quantization:
Dequantization:
Per-channel scaling (better quality):
Memory: FP32 → INT8 = 4× reduction. INT4 = 8× reduction.