Legacy Concept Lab

Pruning: Removing Unnecessary Weights

Lottery ticket hypothesis changed how we think about overparameterization

Concept 64 of 100 · Phase 6: Modern efficiency & inference

Why It Matters for Modern Models

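The lottery ticket hypothesis changed how we think about overparameterization: dense networks contain sparse subnetworks that can be trained to comparable accuracy
SparseGPT can prune 50% of the weights of OPT-175B in one shot, without retraining, with minimal quality loss
Structured pruning yields real speedups on standard dense hardware; unstructured sparsity needs sparse kernels or hardware support to run faster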

What is still poorly explained in textbooks and papers:

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation

W_{pruned} = W \odot M, \quad M_{ij} = \mathbf{1}[|W_{ij}| > \theta]

Magnitude pruning: remove the weights with the smallest $|w|$ by applying a binary mask $M$ that keeps only entries above a threshold $\theta$:

W_{pruned} = W \odot M, \quad M_{ij} = \mathbf{1}[|W_{ij}| > \theta]
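
The masking rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `magnitude_prune` and the `sparsity` argument are assumptions introduced here, and $\theta$ is chosen as the $k$-th smallest magnitude so that roughly the requested fraction of weights is removed. Ties at the threshold are pruned as well, so the achieved sparsity can slightly exceed the target.

```python
import numpy as np

def magnitude_prune(W, sparsity):
    """Return (W * M, M), where M keeps entries with |W_ij| > theta.

    `sparsity` is the target fraction of weights to zero out (an
    illustrative parameter, not part of the original equation).
    """
    k = int(sparsity * W.size)  # number of weights to remove
    if k == 0:
        return W.copy(), np.ones_like(W, dtype=bool)
    # theta = k-th smallest magnitude; everything at or below it is pruned
    theta = np.partition(np.abs(W).ravel(), k - 1)[k - 1]
    M = np.abs(W) > theta
    return W * M, M

W = np.array([[0.9, -0.05, 0.3],
              [-0.7, 0.01, -0.2]])
W_pruned, M = magnitude_prune(W, sparsity=0.5)
print(M.astype(int))
print(W_pruned)
```

Note that `W_pruned` has the same shape as `W`. The zeros save memory or compute only if the storage format or kernel exploits them.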

Structured pruning: Remove entire neurons/attention heads:

W_{pruned} = W[:, \text{keep\_indices}]
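
As a rough sketch of what the slicing means for a two-layer MLP: removing hidden neuron $j$ deletes row $j$ of the first weight matrix and column $j$ of the second, leaving smaller dense matrices. The layer shapes, the L2-norm importance score, and all variable names below are assumptions chosen for illustration. Real structured-pruning methods use a variety of scoring criteria.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 4, 8, 3
W1 = rng.normal(size=(d_hidden, d_in))   # row j produces hidden neuron j
W2 = rng.normal(size=(d_out, d_hidden))  # column j consumes hidden neuron j

def forward(W1, W2, x):
    """Two-layer ReLU MLP without biases."""
    return W2 @ np.maximum(W1 @ x, 0.0)

# Hypothetical importance score: L2 norm of each neuron's outgoing weights.
scores = np.linalg.norm(W2, axis=0)
keep_indices = np.sort(np.argsort(scores)[-4:])  # keep the 4 highest-scoring neurons

# Slice out whole neurons: the results are smaller *dense* matrices.
W1_pruned = W1[keep_indices, :]
W2_pruned = W2[:, keep_indices]

# Sanity check: the small network matches the original network with the
# dropped neurons' outgoing weights zeroed.
x = rng.normal(size=d_in)
mask = np.zeros(d_hidden, dtype=bool)
mask[keep_indices] = True
y_masked = forward(W1, W2 * mask, x)
y_small = forward(W1_pruned, W2_pruned, x)
print(W1_pruned.shape, W2_pruned.shape, np.allclose(y_masked, y_small))
```

Because the pruned matrices are dense and simply smaller, ordinary matrix-multiply kernels run faster on them without any sparse-format support.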

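Lottery Ticket: There exist sparse subnetworks (a mask $M$ applied to the initialization $W_0$) that, trained in isolation, reach roughly the same accuracy as the dense network: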

\exists M: \text{train}(W_0 \odot M) \approx \text{train}(W_0)
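
One common way to search for such a mask is iterative magnitude pruning with rewinding: train, prune the smallest surviving weights, reset the survivors to $W_0$, and repeat. The sketch below shows only the mechanics of that loop on a toy linear regression whose true solution is sparse. The data, hyperparameters, and helper names (`train`, `mse`) are all assumptions for illustration, and a linear toy problem says nothing about whether the hypothesis holds for deep networks.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 20
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:5] = [3.0, -2.0, 1.5, -1.0, 2.5]  # only 5 of 20 features matter
y = X @ w_true                             # noiseless targets

def train(w_init, mask, steps=500, lr=0.05):
    """Gradient descent on MSE; pruned weights stay at zero throughout."""
    w = w_init * mask
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n
        w -= lr * grad * mask
    return w

def mse(w):
    return float(np.mean((X @ w - y) ** 2))

w0 = rng.normal(scale=0.1, size=d)  # the initialization W_0 we rewind to
mask = np.ones(d)

for _ in range(3):                  # three prune rounds: 20 -> 12 -> 8 -> 5 weights
    w = train(w0, mask)             # always restart from w0 (rewinding)
    alive = np.flatnonzero(mask)
    n_prune = int(0.4 * alive.size) # drop 40% of the surviving weights
    drop = alive[np.argsort(np.abs(w[alive]))[:n_prune]]
    mask[drop] = 0.0

w_ticket = train(w0, mask)          # train(W_0 * M)
w_dense = train(w0, np.ones(d))     # train(W_0)
print(int(mask.sum()), mse(w_ticket), mse(w_dense))
```

The comparison at the end mirrors the statement $\text{train}(W_0 \odot M) \approx \text{train}(W_0)$: both runs start from the same `w0`, and only the mask differs.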

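OBS/OBD criterion (second-order): prune the weight whose removal increases the loss the least. For OBS, with $[H^{-1}]_{ii}$ denoting the $i$-th diagonal entry of the inverse Hessian:

\delta L_i \approx \frac{w_i^2}{2 [H^{-1}]_{ii}}

OBD approximates $H$ as diagonal, which reduces the saliency to $\delta L_i \approx \frac{1}{2} H_{ii} w_i^2$.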
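
A small numerical check can make the second-order criterion concrete. The sketch assumes an exactly quadratic loss around a minimum `w_star` with a hand-picked positive-definite Hessian `H`. Both are hypothetical values chosen for illustration. It computes the OBS saliency $w_q^2 / (2 [H^{-1}]_{qq})$ and the diagonal (OBD) saliency $\frac{1}{2} H_{qq} w_q^2$, then applies the standard OBS compensating update to the remaining weights and confirms that the resulting loss increase matches the OBS saliency.

```python
import numpy as np

# Hypothetical positive-definite Hessian and trained weights (loss minimum).
H = np.array([[2.0, 0.8, 0.0],
              [0.8, 1.0, 0.3],
              [0.0, 0.3, 0.5]])
w_star = np.array([0.5, -1.0, 0.8])

def loss(w):
    """Quadratic loss with minimum 0 at w_star."""
    d = w - w_star
    return 0.5 * d @ H @ d

H_inv = np.linalg.inv(H)

# Saliency = estimated loss increase from removing each weight.
obd = 0.5 * np.diag(H) * w_star**2        # diagonal approximation, no compensation
obs = w_star**2 / (2.0 * np.diag(H_inv))  # full inverse Hessian, with compensation

q = int(np.argmin(obs))                   # weight whose removal is cheapest under OBS

# OBS update: zero out w_q and adjust the other weights to compensate.
delta_w = -(w_star[q] / H_inv[q, q]) * H_inv[:, q]
w_new = w_star + delta_w

print("OBD saliencies:", obd)
print("OBS saliencies:", obs)
print("pruned index:", q, "loss after pruning:", loss(w_new))
```

For a positive-definite Hessian the OBS saliency is never larger than the OBD one, because OBS lets the surviving weights absorb part of the damage.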

Frankle & Carbin, 2019, ICLR

Frantar & Alistarh, 2023, ICML
