
Efficiency: Quantization, Distillation, LoRA & Sparse MoE

Canonical Papers

Distilling the Knowledge in a Neural Network

Hinton et al., 2015, NeurIPS Workshop

LoRA: Low-Rank Adaptation of Large Language Models

Hu et al., 2021, ICLR

Switch Transformers: Scaling to Trillion Parameter Models

Fedus et al., 2021, JMLR

Core Mathematics

Distillation: train the student $q_\psi$ to match the teacher $p_\theta$ using temperature-$T$ softened output distributions:

$\mathcal L = T^2\,\mathrm{KL}\big(p_\theta^T(\cdot\mid x)\,\|\,q_\psi^T(\cdot\mid x)\big)$
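
A minimal PyTorch sketch of this objective (the helper name, batch shapes, and the temperature values are illustrative choices, not taken from the paper):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      T: float = 2.0) -> torch.Tensor:
    """KL between temperature-T softened teacher and student distributions,
    scaled by T^2 so gradient magnitudes stay comparable across temperatures."""
    log_q = F.log_softmax(student_logits / T, dim=-1)  # student log-probs at temperature T
    p = F.softmax(teacher_logits / T, dim=-1)          # teacher probs at temperature T
    return (T ** 2) * F.kl_div(log_q, p, reduction="batchmean")

# Usage sketch: logits from a frozen teacher and a trainable student.
teacher_logits = torch.randn(8, 1000)
student_logits = torch.randn(8, 1000, requires_grad=True)
distillation_loss(student_logits, teacher_logits, T=4.0).backward()
```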

Quantization: map float weights to low-bit integers: $\tilde w = \Delta \cdot \mathrm{round}(w/\Delta)$
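
A minimal sketch of this round-to-nearest scheme, assuming a single symmetric per-tensor scale $\Delta$ (real deployments typically use per-channel or per-group scales):

```python
import torch

def quantize_symmetric(w: torch.Tensor, bits: int = 8):
    """Symmetric round-to-nearest quantization: Delta is chosen so the
    largest-magnitude weight lands at the edge of the signed integer range."""
    qmax = 2 ** (bits - 1) - 1            # e.g. 127 for int8
    delta = w.abs().max() / qmax          # one scale for the whole tensor (an assumption)
    q = torch.clamp(torch.round(w / delta), -qmax, qmax)
    return q.to(torch.int8), delta

def dequantize(q: torch.Tensor, delta: torch.Tensor) -> torch.Tensor:
    """Reconstruct w_tilde = Delta * round(w / Delta)."""
    return delta * q.float()

w = torch.randn(4096, 4096)
q, delta = quantize_symmetric(w, bits=8)
print((w - dequantize(q, delta)).abs().max())  # worst-case rounding error <= Delta / 2
```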

LoRA: re-parameterize a weight matrix as $W' = W + BA$, where $B\in\mathbb R^{d\times r}$, $A\in\mathbb R^{r\times d}$, $r\ll d$; only $A$ and $B$ are trained while $W$ stays frozen.
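
A minimal sketch of this re-parameterization wrapped around an existing nn.Linear; the rank r = 8, the alpha/r scaling, and the init constants are common conventions, not fixed by the formula above:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W plus a trainable low-rank update BA."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():                     # freeze W (and bias)
            p.requires_grad = False
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)   # A: r x d_in, small random init
        self.B = nn.Parameter(torch.zeros(d_out, r))         # B: d_out x r, zero init => W' = W at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x @ W^T  +  scale * (x @ A^T) @ B^T  ==  x @ (W + scale * BA)^T
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(nn.Linear(4096, 4096), r=8)
print(layer(torch.randn(2, 4096)).shape)  # torch.Size([2, 4096]); only A and B receive gradients
```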

Sparse MoE: FFN layers are replaced by many expert FFNs $f_e$; a router picks an expert $e^*(x)$ for each token: $\text{FFN}_{\text{MoE}}(x) = f_{e^*(x)}(x)$
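
A minimal top-1 ("Switch"-style) routing sketch; the load-balancing auxiliary loss and expert-capacity limits from the Switch Transformer paper are omitted, and all class and parameter names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top1MoE(nn.Module):
    """Each token is routed to the single expert with the highest router
    probability; the output is scaled by that probability so the router
    itself receives gradient."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (n_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)           # router distribution over experts
        gate, expert_idx = probs.max(dim=-1)                # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):           # each expert only sees its own tokens
            mask = expert_idx == e
            if mask.any():
                out[mask] = gate[mask, None] * expert(x[mask])
        return out

moe = Top1MoE(d_model=512, d_ff=2048, n_experts=8)
print(moe(torch.randn(16, 512)).shape)  # torch.Size([16, 512])
```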

Key Equation
$W' = W + BA,\quad r \ll d$
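
To make the $r \ll d$ condition concrete: for a square $d \times d$ weight with, say, $d = 4096$ and $r = 8$, a full update has $d^2 \approx 16.8$M parameters, while the LoRA factors $B$ and $A$ together have only $2dr = 65{,}536$, roughly a $256\times$ reduction per matrix.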


Why It Matters for Modern Models

  • Quantization + LoRA are standard for deploying and fine-tuning Llama-class models on modest GPUs
  • Distillation compresses large base models into "small assistants"
  • MoE/Switch-style sparsity powers very large Google-scale models (likely Gemini)

Missing Intuition

What is still poorly explained in textbooks and papers:

  • Geometric views of low-rank updates: LoRA as adding a small, oriented "slice" in weight space
  • Intuitive trade-offs in quantization: how rounding error propagates through a network, and why some layers are more sensitive to low precision than others


Next Moves

Explore this concept from different angles — like a mathematician would.