Domain Neighborhood

LLM Systems

How models run in production: prefill vs decode, KV cache memory, batching and scheduling, and the techniques that make latency and throughput practical.

6 concepts · 5 published · 6 demos

Recommended Route

This sequence is ordered for learning rather than inventory: lower difficulty, fewer prerequisites, and more central concepts come first.

  1. Decoding & Sampling: Temperature, Top-p & Inference-Time Control

    How inference settings reshape the next-token distribution into actual model behavior: temperature, nucleus sampling, and why decoding is a control knob.

    16 min · code · demo · after Maximum Likelihood, Scaled Dot-Product Attention & Transformer Layers

    Check Maximum Likelihood first if the symbols feel slippery.

  2. LLM Serving at Scale: Prefill, Decode & Continuous Batching

    A systems mental model for LLM inference: prefill vs decode, TTFT vs TPOT, batching/scheduling, and why KV cache memory dominates.

    22 min · code · demo · after Scaled Dot-Product Attention & Transformer Layers, Efficient Attention at Scale: KV Cache, GQA & FlashAttention

    Why this follows: both pages keep the LLM systems thread active.

  3. Structured Decoding: Token Masks From Schema Automata

    How a schema automaton or parser state turns next-token logits into constraint-valid generation by masking invalid continuations, while leaving truth and task success outside the formal guarantee.

    22 min · code · demo · after Decoding & Sampling: Temperature, Top-p & Inference-Time Control, Tokenization & Vocabulary Design

    Why this follows: both pages keep the LLM systems thread active.

  4. Speculative Decoding: Lossless Multi-Token Generation

    Draft several tokens with a fast model, score draft prefixes with the target model in parallel, then use modified rejection/residual sampling so the sampled distribution matches target-model decoding.

    18 min · code · demo · after Maximum Likelihood, Scaled Dot-Product Attention & Transformer Layers, LLM Serving at Scale: Prefill, Decode & Continuous Batching

    Why this follows: both pages keep the LLM systems / decoding thread active.

  5. MoE Serving & Scheduling: Token Dispatch, All-to-All, Disaggregated Parallelism

    Serving MoE turns sparse compute into a scheduling problem: routing skew can create stragglers and token-dispatch communication can bottleneck, motivating scheduling and, in systems such as MegaScale-Infer, disaggregated expert-parallel serving.

    20 min · code · demo · after Scaled Dot-Product Attention & Transformer Layers, Efficient Attention at Scale: KV Cache, GQA & FlashAttention, LLM Serving at Scale: Prefill, Decode & Continuous Batching

    Why this follows: both pages keep the LLM systems thread active.
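
A taste of the first stop on the route: the sketch below shows one common way temperature and nucleus (top-p) sampling reshape a next-token distribution. It is a minimal, standard-library-only illustration and is not taken from the notebook itself; the function names (`softmax_with_temperature`, `top_p_filter`, `sample_token`) and the toy logits are assumptions made for this example.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p=0.9):
    """Keep the smallest high-probability set whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


def sample_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Sample one token index under the given (hypothetical) decoding settings."""
    probs = top_p_filter(softmax_with_temperature(logits, temperature), top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.1]  # made-up scores for a three-token vocabulary
    print(softmax_with_temperature(toy_logits, temperature=1.0))
    print(softmax_with_temperature(toy_logits, temperature=0.5))
    print(sample_token(toy_logits, temperature=0.8, top_p=0.9))
```

Lowering the temperature concentrates probability on the highest-scoring token, while a smaller `top_p` removes the low-probability tail before sampling. Real serving stacks typically apply these steps to batched tensors on the accelerator, not to Python lists.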

All Published Notebooks

Browse the territory.

Advanced Bridges

Use these after the core path.

In Progress

Notebooks still below the publish bar.

Retrieval-Augmented Generation: External Memory for Generation