Efficiency: Quantization, Distillation, LoRA & Sparse MoE
Canonical Papers
- Distilling the Knowledge in a Neural Network
- LoRA: Low-Rank Adaptation of Large Language Models
- Switch Transformers: Scaling to Trillion Parameter Models
Core Mathematics
- Distillation: train the student to match the teacher's softened outputs:
  $\mathcal{L} = (1 - \alpha)\,\mathrm{CE}\big(y, \sigma(z_s)\big) + \alpha\,T^2\,\mathrm{KL}\big(\sigma(z_t / T)\,\|\,\sigma(z_s / T)\big)$
- Quantization: map float weights to low-bit integers:
  $q = \mathrm{clamp}\big(\mathrm{round}(w / s) + z,\; q_{\min},\; q_{\max}\big), \qquad \hat{w} = s\,(q - z)$
- LoRA: re-parameterize a weight matrix as $W = W_0 + \Delta W = W_0 + BA$, where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, $r \ll \min(d, k)$; only $A$ and $B$ are trained, with $W_0$ frozen.
- Sparse MoE: FFN layers are replaced by many experts $E_1, \dots, E_N$, with a router that sends each token to its top-$k$ experts (top-1 in Switch Transformers):
  $y = \sum_{i \in \text{top-}k} G(x)_i\, E_i(x), \qquad G(x) = \mathrm{softmax}(W_g x)$
Minimal code sketches of each of these follow below.
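A minimal PyTorch sketch of the distillation loss above. The temperature `T`, mixing weight `alpha`, and tensor shapes are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft-target KL (scaled by T^2) mixed with the usual hard-label cross-entropy."""
    # Softened teacher / student distributions at temperature T.
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    # KL(teacher || student); the T^2 factor keeps gradient magnitudes comparable across temperatures.
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T ** 2)
    ce = F.cross_entropy(student_logits, labels)
    return (1 - alpha) * ce + alpha * kd

# Toy usage: a batch of 4 examples with 10 classes and random logits.
student = torch.randn(4, 10)
teacher = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student, teacher, labels))
```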
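A sketch of the quantization mapping, assuming symmetric per-tensor int8 quantization (so $z = 0$); the 8-bit range and per-tensor scale are illustrative choices, not the only scheme in use.

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Symmetric per-tensor quantization: q = round(w / s), w_hat = s * q."""
    s = w.abs().max() / 127.0                              # scale so the largest weight maps to +/-127
    q = torch.clamp(torch.round(w / s), -127, 127).to(torch.int8)
    return q, s

def dequantize_int8(q: torch.Tensor, s: torch.Tensor):
    return q.to(torch.float32) * s

w = torch.randn(512, 512)
q, s = quantize_int8(w)
w_hat = dequantize_int8(q, s)
print("mean abs quantization error:", (w - w_hat).abs().mean().item())
```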
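A minimal LoRA wrapper around a frozen linear layer, mirroring $W_0 + BA$ above. The rank `r` and scaling `alpha` are illustrative hyperparameters; production implementations (e.g. the peft library) add dropout, merging, and per-module targeting.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = x @ (W0 + (alpha/r) * B A)^T with W0 frozen and only A, B trainable."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)               # freeze W0
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)   # r x d_in, small random init
        self.B = nn.Parameter(torch.zeros(d_out, r))         # d_out x r, zero init => no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768), r=8)
out = layer(torch.randn(2, 16, 768))
print(out.shape)  # torch.Size([2, 16, 768])
```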
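A toy Switch-style MoE layer implementing the routing equation with top-1 dispatch. The loop over experts is for readability only; real Switch/MoE layers dispatch tokens in parallel with capacity limits and a load-balancing loss, and all sizes here are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchFFN(nn.Module):
    """Top-1 routing: each token goes to exactly one expert, scaled by its gate value."""
    def __init__(self, d_model=64, d_ff=256, n_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                   # x: (tokens, d_model)
        gates = F.softmax(self.router(x), dim=-1)           # (tokens, n_experts)
        gate_vals, expert_idx = gates.max(dim=-1)           # top-1 gate and expert per token
        out = torch.zeros_like(x)
        # Readable dense dispatch: loop over experts, process only the tokens routed to each.
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e
            if mask.any():
                out[mask] = gate_vals[mask].unsqueeze(-1) * expert(x[mask])
        return out

moe = SwitchFFN()
y = moe(torch.randn(32, 64))   # 32 tokens
print(y.shape)                  # torch.Size([32, 64])
```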
Why It Matters for Modern Models
- Quantization + LoRA are standard for deploying and fine-tuning Llama-class models on modest GPUs
- Distillation compresses large base models into "small assistants"
- MoE/Switch-style sparsity powers very large Google-scale models (likely Gemini)
Missing Intuition
What is still poorly explained in textbooks and papers:
- Geometric views of low-rank updates: LoRA as adding a small, oriented "slice" in weight space
- Intuitive trade-offs in quantization: how error propagates and why some layers are more sensitive than others (a rough probing sketch follows below)
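One rough way to build that intuition: quantize one layer at a time and measure how far the model's output drifts. This sketch uses a toy MLP, random inputs, and the symmetric int8 scheme from above; all of those are illustrative assumptions, not a standard sensitivity benchmark.

```python
import copy
import torch
import torch.nn as nn

def fake_quant_(linear: nn.Linear, bits: int = 8):
    """In-place symmetric fake quantization of a linear layer's weight."""
    qmax = 2 ** (bits - 1) - 1
    s = linear.weight.abs().max() / qmax
    linear.weight.data = torch.clamp(torch.round(linear.weight.data / s), -qmax, qmax) * s

# Toy 3-layer MLP standing in for a real network.
model = nn.Sequential(
    nn.Linear(128, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 10),
)
x = torch.randn(64, 128)
with torch.no_grad():
    baseline = model(x)
    # Quantize one linear layer at a time and compare against the float baseline.
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            probe = copy.deepcopy(model)
            fake_quant_(dict(probe.named_modules())[name])
            drift = (probe(x) - baseline).abs().mean().item()
            print(f"layer {name}: mean output drift {drift:.6f}")
```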