Domain Neighborhood
Efficiency
How we make models cheaper to train and serve: quantization, distillation, low-rank adapters, sparsity, and the memory/latency tradeoffs that dominate real deployments.
Recommended Route
Start here, then follow the prerequisites forward.
This sequence is ordered for learning rather than inventory: lower difficulty, fewer prerequisites, and more central concepts come first.
- 01 Knowledge Distillation: Learning from Teachers
Train a smaller student to mimic a stronger teacher by matching soft probability distributions (often with temperature), transferring 'dark knowledge' beyond hard labels.
16 min | code | demo | after: Maximum Likelihood; Label Smoothing & Soft Targets
Check Maximum Likelihood first if the symbols feel slippery.
- 02 Pruning: Removing Unnecessary Weights
Reduce parameter count by zeroing or removing weights. Unstructured sparsity needs sparse kernels for speed; structured pruning removes whole channels/heads to shrink dense tensor shapes.
16 min | code | demo | after: Efficiency: Quantization, Distillation, LoRA & Sparse MoE; Weight Initialization: Xavier, He & muP
Why this follows: both pages keep the efficiency thread active.
- 03 Quantization: Compressing Models to Integers
Reduce memory and bandwidth by storing weights/activations in low-bit integers (INT8/INT4) with careful scaling to limit accuracy loss.
16 min | code | demo | after: Efficiency: Quantization, Distillation, LoRA & Sparse MoE; LLM Serving at Scale: Prefill, Decode & Continuous Batching
Why this follows: both pages keep the efficiency thread active.
- 04 Efficiency: Quantization, Distillation, LoRA & Sparse MoE
The practical toolkit for making big models cheaper: quantize weights/activations, distill teachers into students, adapt with low-rank updates (LoRA), and use sparsity (MoE).
20 min | code | demo | after: Maximum Likelihood; Scaled Dot-Product Attention & Transformer Layers; Loss Landscapes, Sharpness & Flat Minima
Why this follows: both pages keep the efficiency / quantization thread active.
- 05 Sparse Mixture of Experts: Routing, Load Balancing & Expert Parallelism
Conditional computation: a router picks a few experts per token. You can increase total expert parameters while keeping activated expert FFN compute small, but distributed systems may pay in communication and scheduling.
20 min | code | demo | after: Scaled Dot-Product Attention & Transformer Layers; Maximum Likelihood; Efficiency: Quantization, Distillation, LoRA & Sparse MoE
Why this follows: Sparse Mixture of Experts: Routing, Load Balancing & Expert Parallelism uses Efficiency: Quantization, Distillation, LoRA & Sparse MoE directly.
All Published Notebooks
Browse the territory.
Knowledge Distillation: Learning from Teachers
Train a smaller student to mimic a stronger teacher by matching soft probability distributions (often with temperature), transferring 'dark knowledge' beyond hard labels.
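To make "matching soft probability distributions with temperature" concrete, here is a minimal sketch of one common form of the distillation loss: the KL divergence between temperature-softened teacher and student distributions, scaled by T^2. The function names, the example logits, and the choice of T = 2 are illustrative assumptions, not taken from the notebook.

```python
import math

def softmax(logits, temperature=1.0):
    # Dividing by the temperature flattens the distribution when T > 1.
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # KL(teacher || student) on softened distributions.
    # The T^2 factor is a common convention that keeps gradient
    # magnitudes comparable across temperatures (assumed here).
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

# Hypothetical logits for a 3-class problem.
teacher = [4.0, 1.0, 0.5]
student = [2.5, 1.5, 0.2]
loss = distillation_loss(student, teacher)
```

In practice this term is usually mixed with the ordinary hard-label cross-entropy; the mixing weight is a hyperparameter.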
Pruning: Removing Unnecessary Weights
Reduce parameter count by zeroing or removing weights. Unstructured sparsity needs sparse kernels for speed; structured pruning removes whole channels/heads to shrink dense tensor shapes.
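The difference between the two pruning styles shows up in the tensor shape. The sketch below uses a toy weight matrix stored as nested lists: magnitude pruning zeroes small entries but leaves the shape unchanged, while structured pruning drops whole rows (standing in for channels or heads) and returns a smaller dense matrix. The helper names, the L2-norm row criterion, and the example values are assumptions for illustration.

```python
def magnitude_prune(matrix, sparsity):
    # Unstructured: zero the smallest-magnitude fraction of weights.
    # The shape is unchanged, so a dense kernel does the same work as before.
    flat = sorted(abs(w) for row in matrix for w in row)
    k = int(len(flat) * sparsity)
    if k == 0:
        return [row[:] for row in matrix]
    threshold = flat[k - 1]  # ties at the threshold are also zeroed
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in matrix]

def prune_rows(matrix, keep):
    # Structured: keep the `keep` rows with the largest squared L2 norm
    # and drop the rest, which shrinks the dense tensor itself.
    norms = [sum(w * w for w in row) for row in matrix]
    ranked = sorted(range(len(matrix)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:keep])  # preserve the original row order
    return [matrix[i][:] for i in kept]

W = [
    [0.9, -0.05, 0.4],
    [0.01, 0.02, -0.03],
    [-0.7, 0.6, 0.08],
]
sparse = magnitude_prune(W, sparsity=0.5)  # still 3 x 3, with zeros inside
compact = prune_rows(W, keep=2)            # now 2 x 3
```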
Quantization: Compressing Models to Integers
Reduce memory and bandwidth by storing weights/activations in low-bit integers (INT8/INT4) with careful scaling to limit accuracy loss.
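As a rough illustration of "careful scaling", the sketch below applies symmetric per-tensor INT8 quantization: one scale maps the largest absolute weight to 127, and each value is rounded to the nearest integer step. This is only one of several schemes (per-channel scales and zero points are common alternatives); the function names and example weights are assumptions.

```python
def quantize_int8(values):
    # Symmetric per-tensor scheme: a single scale, no zero point.
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest step and clamp to the symmetric INT8 range.
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    # Recover approximate real values from the stored integers.
    return [qi * scale for qi in q]

weights = [0.52, -1.27, 0.03, 0.9]
q, scale = quantize_int8(weights)
recovered = dequantize(q, scale)

# Rounding error per weight is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, recovered))
```

Because the scale is set by the largest magnitude, a single outlier widens every step, which is one reason finer-grained scales are often preferred at low bit widths.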
Efficiency: Quantization, Distillation, LoRA & Sparse MoE
The practical toolkit for making big models cheaper: quantize weights/activations, distill teachers into students, adapt with low-rank updates (LoRA), and use sparsity (MoE).
Sparse Mixture of Experts: Routing, Load Balancing & Expert Parallelism
Conditional computation: a router picks a few experts per token. You can increase total expert parameters while keeping activated expert FFN compute small, but distributed systems may pay in communication and scheduling.
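The routing step can be sketched for a single token as follows. The router produces one logit per expert, only the top-k experts run, and their outputs are combined with gate weights. Renormalizing the softmax over the selected experts is one convention among several and is assumed here; the scalar "experts" are toy stand-ins for FFN blocks, and all names are hypothetical.

```python
import math

def route_token(router_logits, top_k=2):
    # Pick the top_k experts by router logit (ties go to the lower index).
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    chosen = order[:top_k]
    # Softmax over the chosen experts only, so the gates sum to 1.
    m = max(router_logits[i] for i in chosen)
    exps = {i: math.exp(router_logits[i] - m) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_forward(x, router_logits, experts, top_k=2):
    # Only the selected experts are evaluated; the rest cost no compute.
    gates = route_token(router_logits, top_k)
    return sum(g * experts[i](x) for i, g in gates.items())

# Toy scalar experts standing in for expert FFNs.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
logits = [0.1, 2.0, -1.0, 2.0]

gates = route_token(logits)            # experts 1 and 3, equal weight
y = moe_forward(3.0, logits, experts)  # 0.5 * 6.0 + 0.5 * 9.0
```

Adding more experts to the list grows total parameters while the per-token work stays at top_k expert calls. The communication and load-balancing costs mentioned above arise when those experts live on different devices, which this single-process sketch does not model.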
Advanced Bridges