Domain Neighborhood
LLM Systems
How models run in production: prefill vs decode, KV cache memory, batching and scheduling, and the techniques that make latency and throughput practical.
Recommended Route
Start here, then follow the prerequisites forward.
This sequence is ordered for learning rather than inventory: lower difficulty, fewer prerequisites, and more central concepts come first.
- 01 Decoding & Sampling: Temperature, Top-p & Inference-Time Control
How inference settings reshape the next-token distribution into actual model behavior: temperature, nucleus sampling, and why decoding is a control knob.
16 min | code | demo | after Maximum Likelihood; Scaled Dot-Product Attention & Transformer Layers. Check Maximum Likelihood first if the symbols feel slippery.
- 02 LLM Serving at Scale: Prefill, Decode & Continuous Batching
A systems mental model for LLM inference: prefill vs decode, TTFT vs TPOT, batching/scheduling, and why KV cache memory dominates.
22 min | code | demo | after Scaled Dot-Product Attention & Transformer Layers; Efficient Attention at Scale: KV Cache, GQA & FlashAttention. Why this follows: both pages keep the LLM systems thread active.
- 03 Structured Decoding: Token Masks From Schema Automata
How a schema automaton or parser state turns next-token logits into constraint-valid generation by masking invalid continuations, while leaving truth and task success outside the formal guarantee.
22 min | code | demo | after Decoding & Sampling: Temperature, Top-p & Inference-Time Control; Tokenization & Vocabulary Design. Why this follows: both pages keep the LLM systems thread active.
- 04 Speculative Decoding: Lossless Multi-Token Generation
Draft several tokens with a fast model, score draft prefixes with the target model in parallel, then use modified rejection/residual sampling so the sampled distribution matches target-model decoding.
18 min | code | demo | after Maximum Likelihood; Scaled Dot-Product Attention & Transformer Layers; LLM Serving at Scale: Prefill, Decode & Continuous Batching. Why this follows: both pages keep the LLM systems and decoding thread active.
- 05 MoE Serving & Scheduling: Token Dispatch, All-to-All, Disaggregated Parallelism
Serving MoE turns sparse compute into a scheduling problem: routing skew can create stragglers and token-dispatch communication can bottleneck, motivating scheduling and, in systems such as MegaScale-Infer, disaggregated expert-parallel serving.
20 min | code | demo | after Scaled Dot-Product Attention & Transformer Layers; Efficient Attention at Scale: KV Cache, GQA & FlashAttention; LLM Serving at Scale: Prefill, Decode & Continuous Batching. Why this follows: both pages keep the LLM systems thread active.
All Published Notebooks
Browse the territory.
Decoding & Sampling: Temperature, Top-p & Inference-Time Control
How inference settings reshape the next-token distribution into actual model behavior: temperature, nucleus sampling, and why decoding is a control knob.
LLM Serving at Scale: Prefill, Decode & Continuous Batching
A systems mental model for LLM inference: prefill vs decode, TTFT vs TPOT, batching/scheduling, and why KV cache memory dominates.
Structured Decoding: Token Masks From Schema Automata
How a schema automaton or parser state turns next-token logits into constraint-valid generation by masking invalid continuations, while leaving truth and task success outside the formal guarantee.
Speculative Decoding: Lossless Multi-Token Generation
Draft several tokens with a fast model, score draft prefixes with the target model in parallel, then use modified rejection/residual sampling so the sampled distribution matches target-model decoding.
MoE Serving & Scheduling: Token Dispatch, All-to-All, Disaggregated Parallelism
Serving MoE turns sparse compute into a scheduling problem: routing skew can create stragglers and token-dispatch communication can bottleneck, motivating scheduling and, in systems such as MegaScale-Infer, disaggregated expert-parallel serving.
Advanced Bridges
Use these after the core path.
In Progress