Domain Neighborhood

Efficiency

How we make models cheaper to train and serve: quantization, distillation, low-rank adapters, sparsity, and the memory/latency tradeoffs that dominate real deployments.
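
To make the memory side of that tradeoff concrete, here is a minimal sketch in plain Python. It estimates weight storage at different bit widths and round-trips a few values through symmetric INT8 quantization. The helper names (`weight_memory_gib`, `quantize_int8`, `dequantize`), the 7-billion-parameter model size, and the sample weights are all assumptions made for illustration; real deployments typically use per-channel or per-group scales and also have to budget for activations and the KV cache.

```python
# Hypothetical helpers for illustration; not taken from any library.

def weight_memory_gib(num_params: int, bits_per_weight: int) -> float:
    """Approximate weight storage in GiB (ignores activations, KV cache, scales)."""
    return num_params * bits_per_weight / 8 / 2**30


def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integer codes in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0  # one float scale for the whole tensor
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from integer codes."""
    return [c * scale for c in codes]


# Weight memory for an assumed 7-billion-parameter model at three precisions.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gib(7_000_000_000, bits):.2f} GiB")

# Round-trip a few made-up weights and measure the worst-case rounding error.
weights = [0.5, -1.27, 0.01, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Halving the bit width halves weight memory, while the rounding error of each value stays within about half a quantization step (`scale / 2`), which is why the choice of scale matters so much for accuracy.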

5 concepts · 5 published · 5 demos

Recommended Route

This sequence is ordered for learning rather than inventory: lower difficulty, fewer prerequisites, and more central concepts come first.

  1. Knowledge Distillation: Learning from Teachers

    Train a smaller student to mimic a stronger teacher by matching soft probability distributions (often with temperature), transferring 'dark knowledge' beyond hard labels.

    16 min · code · demo · after Maximum Likelihood; Label Smoothing & Soft Targets

    Check Maximum Likelihood first if the symbols feel slippery.

  2. Pruning: Removing Unnecessary Weights

    Reduce parameter count by zeroing or removing weights. Unstructured sparsity needs sparse kernels for speed; structured pruning removes whole channels/heads to shrink dense tensor shapes.

    16 min · code · demo · after Efficiency: Quantization, Distillation, LoRA & Sparse MoE; Weight Initialization: Xavier, He & muP

    Why this follows: both pages keep the efficiency thread active.

  3. Quantization: Compressing Models to Integers

    Reduce memory and bandwidth by storing weights/activations in low-bit integers (INT8/INT4) with careful scaling to limit accuracy loss.

    16 min · code · demo · after Efficiency: Quantization, Distillation, LoRA & Sparse MoE; LLM Serving at Scale: Prefill, Decode & Continuous Batching

    Why this follows: both pages keep the efficiency thread active.

  4. Efficiency: Quantization, Distillation, LoRA & Sparse MoE

    The practical toolkit for making big models cheaper: quantize weights/activations, distill teachers into students, adapt with low-rank updates (LoRA), and use sparsity (MoE).

    20 min · code · demo · after Maximum Likelihood; Scaled Dot-Product Attention & Transformer Layers; Loss Landscapes, Sharpness & Flat Minima

    Why this follows: both pages keep the efficiency / quantization thread active.

  5. Sparse Mixture of Experts: Routing, Load Balancing & Expert Parallelism

    Conditional computation: a router picks a few experts per token. You can increase total expert parameters while keeping activated expert FFN compute small, but distributed systems may pay in communication and scheduling.

    20 min · code · demo · after Scaled Dot-Product Attention & Transformer Layers; Maximum Likelihood; Efficiency: Quantization, Distillation, LoRA & Sparse MoE

    Why this follows: Sparse Mixture of Experts: Routing, Load Balancing & Expert Parallelism uses Efficiency: Quantization, Distillation, LoRA & Sparse MoE directly.
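
As a taste of the first stop on the route above, the soft-target matching used in knowledge distillation can be sketched in a few lines of plain Python. This is a minimal sketch, not the notebook's implementation: the function names `log_softmax` and `distillation_loss` are hypothetical, the logits are made up, and multiplying the KL term by the squared temperature is a common convention assumed here rather than something stated on this page.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_sum for s in scaled]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) between temperature-softened distributions.

    The temperature**2 factor is an assumed convention that keeps gradient
    magnitudes comparable across temperatures.
    """
    log_p = log_softmax(teacher_logits, temperature)  # teacher (target)
    log_q = log_softmax(student_logits, temperature)  # student (prediction)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return temperature**2 * kl


# Made-up logits for a three-class example.
teacher = [4.0, 1.0, 0.5]
student = [2.5, 2.0, 0.0]

sharp = [math.exp(lp) for lp in log_softmax(teacher, temperature=1.0)]
soft = [math.exp(lp) for lp in log_softmax(teacher, temperature=4.0)]
print("teacher probs at T=1:", sharp)
print("teacher probs at T=4:", soft)
print("distillation loss:", distillation_loss(student, teacher, temperature=2.0))
```

Raising the temperature flattens the teacher's distribution, so the relative probabilities of the non-argmax classes (the "dark knowledge") contribute more to the loss than they would with hard labels.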

All Published Notebooks

Browse the territory.

Advanced Bridges

Use these after the core path.