
Efficiency: Quantization, Distillation, LoRA & Sparse MoE

Canonical Papers

Distilling the Knowledge in a Neural Network

Hinton et al., 2015, NeurIPS Workshop

LoRA: Low-Rank Adaptation of Large Language Models

Hu et al., 2021, ICLR

Switch Transformers: Scaling to Trillion Parameter Models

Fedus et al., 2021, JMLR

Core Mathematics

Distillation: train the student $q_\psi$ to match the teacher $p_\theta$ using temperature-$T$ softened output distributions:

$\mathcal L = T^2\,\mathrm{KL}\big(p_\theta^T(\cdot\mid x)\,\|\,q_\psi^T(\cdot\mid x)\big)$
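
A minimal PyTorch sketch of this objective (the helper name, batch shapes, and the temperature values are illustrative choices, not taken from the paper):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      T: float = 2.0) -> torch.Tensor:
    """KL between temperature-T softened teacher and student distributions,
    scaled by T^2 so gradient magnitudes stay comparable across temperatures."""
    log_q = F.log_softmax(student_logits / T, dim=-1)  # student log-probs at temperature T
    p = F.softmax(teacher_logits / T, dim=-1)          # teacher probs at temperature T
    return (T ** 2) * F.kl_div(log_q, p, reduction="batchmean")

# Usage sketch: logits from a frozen teacher and a trainable student.
teacher_logits = torch.randn(8, 1000)
student_logits = torch.randn(8, 1000, requires_grad=True)
distillation_loss(student_logits, teacher_logits, T=4.0).backward()
```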

Quantization: map float weights to low-bit integers: $\tilde w = \Delta \cdot \mathrm{round}(w/\Delta)$
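
A minimal sketch of this round-to-nearest scheme, assuming a single symmetric per-tensor scale $\Delta$ (real deployments typically use per-channel or per-group scales):

```python
import torch

def quantize_symmetric(w: torch.Tensor, bits: int = 8):
    """Symmetric round-to-nearest quantization: Delta is chosen so the
    largest-magnitude weight lands at the edge of the signed integer range."""
    qmax = 2 ** (bits - 1) - 1            # e.g. 127 for int8
    delta = w.abs().max() / qmax          # one scale for the whole tensor (an assumption)
    q = torch.clamp(torch.round(w / delta), -qmax, qmax)
    return q.to(torch.int8), delta

def dequantize(q: torch.Tensor, delta: torch.Tensor) -> torch.Tensor:
    """Reconstruct w_tilde = Delta * round(w / Delta)."""
    return delta * q.float()

w = torch.randn(4096, 4096)
q, delta = quantize_symmetric(w, bits=8)
print((w - dequantize(q, delta)).abs().max())  # worst-case rounding error <= Delta / 2
```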

LoRA: re-parameterize a weight matrix as $W' = W + BA$, where $B\in\mathbb R^{d\times r}$, $A\in\mathbb R^{r\times d}$, $r\ll d$; only $A$ and $B$ are trained while $W$ stays frozen.
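
A minimal sketch of this re-parameterization wrapped around an existing nn.Linear; the rank r = 8, the alpha/r scaling, and the init constants are common conventions, not fixed by the formula above:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W plus a trainable low-rank update BA."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():                     # freeze W (and bias)
            p.requires_grad = False
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)   # A: r x d_in, small random init
        self.B = nn.Parameter(torch.zeros(d_out, r))         # B: d_out x r, zero init => W' = W at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x @ W^T  +  scale * (x @ A^T) @ B^T  ==  x @ (W + scale * BA)^T
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(nn.Linear(4096, 4096), r=8)
print(layer(torch.randn(2, 4096)).shape)  # torch.Size([2, 4096]); only A and B receive gradients
```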

Sparse MoE: FFN layers are replaced by many expert FFNs $f_e$; a router picks an expert $e^*(x)$ for each token: $\text{FFN}_{\text{MoE}}(x) = f_{e^*(x)}(x)$
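
A minimal top-1 ("Switch"-style) routing sketch; the load-balancing auxiliary loss and expert-capacity limits from the Switch Transformer paper are omitted, and all class and parameter names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top1MoE(nn.Module):
    """Each token is routed to the single expert with the highest router
    probability; the output is scaled by that probability so the router
    itself receives gradient."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (n_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)           # router distribution over experts
        gate, expert_idx = probs.max(dim=-1)                # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):           # each expert only sees its own tokens
            mask = expert_idx == e
            if mask.any():
                out[mask] = gate[mask, None] * expert(x[mask])
        return out

moe = Top1MoE(d_model=512, d_ff=2048, n_experts=8)
print(moe(torch.randn(16, 512)).shape)  # torch.Size([16, 512])
```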

Key Equation
$W' = W + BA,\quad r \ll d$
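
To make the $r \ll d$ condition concrete: for a square $d \times d$ weight with, say, $d = 4096$ and $r = 8$, a full update has $d^2 \approx 16.8$M parameters, while the LoRA factors $B$ and $A$ together have only $2dr = 65{,}536$, roughly a $256\times$ reduction per matrix.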


Why It Matters for Modern Models

  • Quantization + LoRA are standard for deploying and fine-tuning Llama-class models on modest GPUs
  • Distillation compresses large base models into "small assistants"
  • MoE/Switch-style sparsity powers very large Google-scale models (likely Gemini)

Missing Intuition

What is still poorly explained in textbooks and papers:

  • Geometric views of low-rank updates: LoRA as adding a small, oriented "slice" in weight space
  • Intuitive trade-offs in quantization: how rounding error propagates through a network, and why some layers are more sensitive to low precision than others


Next Moves

Explore this concept from different angles — like a mathematician would.