Domain Neighborhood

Efficiency

How we make models cheaper to train and serve: quantization, distillation, low-rank adapters, sparsity, and the memory/latency tradeoffs that dominate real deployments.

5 concepts · 5 published · 5 demos

Start with Knowledge Distillation: Learning from Teachers

Recommended Route

Start here, then follow the prerequisites forward.

This sequence is ordered for learning rather than inventory: lower difficulty, fewer prerequisites, and more central concepts come first.

01
Knowledge Distillation: Learning from Teachers
Train a smaller student to mimic a stronger teacher by matching soft probability distributions (often with temperature), transferring 'dark knowledge' beyond hard labels.
16 min · code · demo · after: Maximum Likelihood; Label Smoothing & Soft Targets
Check Maximum Likelihood first if the symbols feel slippery.
02
Pruning: Removing Unnecessary Weights
Reduce parameter count by zeroing or removing weights. Unstructured sparsity needs sparse kernels for speed; structured pruning removes whole channels/heads to shrink dense tensor shapes.
16 min · code · demo · after: Efficiency: Quantization, Distillation, LoRA & Sparse MoE; Weight Initialization: Xavier, He & muP
Why this follows: both pages keep the efficiency thread active.
03
Quantization: Compressing Models to Integers
Reduce memory and bandwidth by storing weights/activations in low-bit integers (INT8/INT4) with careful scaling to limit accuracy loss.
16 min · code · demo · after: Efficiency: Quantization, Distillation, LoRA & Sparse MoE; LLM Serving at Scale: Prefill, Decode & Continuous Batching
Why this follows: both pages keep the efficiency thread active.
04
Efficiency: Quantization, Distillation, LoRA & Sparse MoE
The practical toolkit for making big models cheaper: quantize weights/activations, distill teachers into students, adapt with low-rank updates (LoRA), and use sparsity (MoE).
20 min · code · demo · after: Maximum Likelihood; Scaled Dot-Product Attention & Transformer Layers; Loss Landscapes, Sharpness & Flat Minima
Why this follows: both pages keep the efficiency / quantization thread active.
05
Sparse Mixture of Experts: Routing, Load Balancing & Expert Parallelism
Conditional computation: a router picks a few experts per token. You can increase total expert parameters while keeping activated expert FFN compute small, but distributed systems may pay in communication and scheduling.
20 min · code · demo · after: Scaled Dot-Product Attention & Transformer Layers; Maximum Likelihood; Efficiency: Quantization, Distillation, LoRA & Sparse MoE
Why this follows: Sparse Mixture of Experts: Routing, Load Balancing & Expert Parallelism uses Efficiency: Quantization, Distillation, LoRA & Sparse MoE directly.

All Published Notebooks

Browse the territory.

Knowledge Distillation: Learning from Teachers

Train a smaller student to mimic a stronger teacher by matching soft probability distributions (often with temperature), transferring 'dark knowledge' beyond hard labels.

Level 3 · 16 min · demo
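
As a rough illustration of the soft-target idea described in this card, here is a minimal sketch in plain Python. It assumes one common formulation (a temperature-scaled KL term weighted by T^2, mixed with ordinary cross-entropy on the hard label); the function names, the `alpha` mixing weight, and the example logits are all hypothetical and not taken from the notebook.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label, temperature=2.0, alpha=0.5):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * cross-entropy on the hard label."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(p_teacher, p_student) if p > 0)
    hard_ce = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * kl + (1 - alpha) * hard_ce

# Hypothetical logits for a 3-class problem.
teacher = [4.0, 1.0, 0.5]
soft_targets = softmax(teacher, temperature=4.0)   # flatter: exposes relative class similarity
sharp_targets = softmax(teacher, temperature=1.0)  # close to a hard label

loss_matched = distillation_loss(teacher, teacher, label=0)            # KL term is zero
loss_mismatched = distillation_loss([0.5, 4.0, 1.0], teacher, label=0)  # student disagrees
```

Raising the temperature moves probability mass from the top class onto the others, which is where the extra signal beyond the hard label comes from.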

Pruning: Removing Unnecessary Weights

Reduce parameter count by zeroing or removing weights. Unstructured sparsity needs sparse kernels for speed; structured pruning removes whole channels/heads to shrink dense tensor shapes.

Level 3 · 16 min · demo
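
The difference between the two pruning styles in this card can be sketched on a small weight matrix. This is an illustrative example only: it assumes magnitude-based pruning for the unstructured case and L2-norm row ranking for the structured case, and the helper names and matrix values are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Unstructured: zero the smallest-magnitude entries; the matrix shape is unchanged."""
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(len(flat) * sparsity)  # number of entries to zero
    if k == 0:
        return [row[:] for row in weights]
    threshold = flat[k - 1]  # ties at the threshold are pruned as well
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]

def prune_rows(weights, keep):
    """Structured: keep the `keep` rows (output channels) with the largest L2 norm."""
    norms = [sum(w * w for w in row) ** 0.5 for row in weights]
    ranked = sorted(range(len(weights)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:keep])  # preserve the original row order
    return [weights[i][:] for i in kept]

# Hypothetical 3x3 weight matrix.
W = [[0.9, -0.05, 0.4],
     [0.01, 0.02, -0.03],
     [-0.7, 0.6, 0.08]]

sparse_W = magnitude_prune(W, sparsity=0.5)  # still 3x3, with zeros scattered through it
small_W = prune_rows(W, keep=2)              # now 2x3: a genuinely smaller dense matrix
```

The unstructured result keeps its dense shape, so a dense matmul does the same work unless a sparse kernel skips the zeros; the structured result is simply a smaller dense tensor.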

Quantization: Compressing Models to Integers

Reduce memory and bandwidth by storing weights/activations in low-bit integers (INT8/INT4) with careful scaling to limit accuracy loss.

Level 3 · 16 min · demo
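
To make the role of scaling concrete, here is a minimal sketch of one simple scheme: symmetric, per-tensor INT8 quantization with a single scale. It is an assumption chosen for illustration (real deployments often use per-channel or per-group scales), and the function names and weight values are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: one scale maps floats into [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]  # round, then clamp
    return q, scale

def dequantize(q, scale):
    """Map integers back to approximate float values."""
    return [qi * scale for qi in q]

# Hypothetical weights; the largest magnitude sets the scale.
weights = [0.02, -1.27, 0.5, 0.003, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding error per element is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Because one outlier sets the scale for the whole tensor, small values such as 0.003 can round to zero, which is one reason finer-grained scaling is often preferred.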

Efficiency: Quantization, Distillation, LoRA & Sparse MoE

The practical toolkit for making big models cheaper: quantize weights/activations, distill teachers into students, adapt with low-rank updates (LoRA), and use sparsity (MoE).

Level 4 · 20 min · demo
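
The low-rank update mentioned in this card can be sketched as follows. This assumes the usual LoRA parameterization, W_eff = W + (alpha / r) * B A, with B initialized to zero so the adapter starts as a no-op; the dimensions, values, and helper names are hypothetical, and plain nested lists stand in for tensors.

```python
def matmul(a, b):
    """Naive dense matrix product for nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def lora_effective_weight(W, A, B, alpha):
    """W_eff = W + (alpha / r) * B @ A, with B: d_out x r and A: r x d_in."""
    r = len(A)
    delta = matmul(B, A)
    s = alpha / r
    return [[w + s * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

d_out, d_in, r = 8, 8, 2
W = [[0.1 * (i + j) for j in range(d_in)] for i in range(d_out)]   # frozen base weight
A = [[0.01 * (j + 1) for j in range(d_in)] for _ in range(r)]      # trainable, r x d_in
B_zero = [[0.0] * r for _ in range(d_out)]     # zero init: adapter changes nothing at first
B_trained = [[0.5] * r for _ in range(d_out)]  # hypothetical values after some training

unchanged = lora_effective_weight(W, A, B_zero, alpha=4)
adapted = lora_effective_weight(W, A, B_trained, alpha=4)

full_params = d_out * d_in        # parameters in a full fine-tune of W
lora_params = r * (d_out + d_in)  # parameters actually trained in A and B
```

The saving grows with layer size: the trainable count scales with r * (d_out + d_in) rather than d_out * d_in.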

Sparse Mixture of Experts: Routing, Load Balancing & Expert Parallelism

Conditional computation: a router picks a few experts per token. You can increase total expert parameters while keeping activated expert FFN compute small, but distributed systems may pay in communication and scheduling.

Level 4 · 20 min · demo
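
A toy version of the routing step described in this card is sketched below. It assumes top-k routing with softmax gates renormalized over the selected experts, which is one common variant; the scalar "experts", router logits, and function names are hypothetical stand-ins for real FFN experts and a learned router.

```python
import math

def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their gate weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda e: router_logits[e], reverse=True)[:k]
    m = max(router_logits[e] for e in top)  # subtract the max for numerical stability
    exps = {e: math.exp(router_logits[e] - m) for e in top}
    total = sum(exps.values())
    return {e: exps[e] / total for e in top}

def moe_layer(token, router_logits, experts, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    gates = top_k_route(router_logits, k)
    out = sum(g * experts[e](token) for e, g in gates.items())
    return out, gates

# Four hypothetical scalar experts; each token activates only two of them.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x, lambda x: x * x]
batch_logits = [[2.0, 0.1, -1.0, 1.5],
                [0.0, 3.0, 2.5, -2.0],
                [1.0, 0.9, -0.5, 2.2]]
tokens = [1.0, 2.0, 3.0]

load = [0] * len(experts)  # tokens assigned to each expert
outputs = []
for tok, logits in zip(tokens, batch_logits):
    y, gates = moe_layer(tok, logits, experts, k=2)
    outputs.append(y)
    for e in gates:
        load[e] += 1
```

The `load` counts show why balancing matters: if the router favors a few experts, those experts become the bottleneck while the rest sit idle.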

Advanced Bridges

Use these after the core path.

Efficiency: Quantization, Distillation, LoRA & Sparse MoE
Sparse Mixture of Experts: Routing, Load Balancing & Expert Parallelism