Domain Neighborhood

LLM Systems

How models run in production: prefill vs decode, KV cache memory, batching and scheduling, and the techniques that make latency and throughput practical.

6 concepts · 5 published · 6 demos

Start with Decoding & Sampling: Temperature, Top-p & Inference-Time Control

Recommended Route

Start here, then follow the prerequisites forward.

This sequence is ordered for learning rather than inventory: lower difficulty, fewer prerequisites, and more central concepts come first.

01
Decoding & Sampling: Temperature, Top-p & Inference-Time Control
How inference settings reshape the next-token distribution into actual model behavior: temperature, nucleus sampling, and why decoding is a control knob.
16 min · code · demo · after Maximum Likelihood; Scaled Dot-Product Attention & Transformer Layers
Check Maximum Likelihood first if the symbols feel slippery.
02
LLM Serving at Scale: Prefill, Decode & Continuous Batching
A systems mental model for LLM inference: prefill vs decode, TTFT vs TPOT, batching/scheduling, and why KV cache memory dominates.
22 min · code · demo · after Scaled Dot-Product Attention & Transformer Layers; Efficient Attention at Scale: KV Cache, GQA & FlashAttention
Why this follows: both pages keep the LLM systems thread active.
03
Structured Decoding: Token Masks From Schema Automata
How a schema automaton or parser state turns next-token logits into constraint-valid generation by masking invalid continuations, while leaving truth and task success outside the formal guarantee.
22 min · code · demo · after Decoding & Sampling: Temperature, Top-p & Inference-Time Control; Tokenization & Vocabulary Design
Why this follows: both pages keep the LLM systems thread active.
04
Speculative Decoding: Lossless Multi-Token Generation
Draft several tokens with a fast model, score draft prefixes with the target model in parallel, then use modified rejection/residual sampling so the sampled distribution matches target-model decoding.
18 min · code · demo · after Maximum Likelihood; Scaled Dot-Product Attention & Transformer Layers; LLM Serving at Scale: Prefill, Decode & Continuous Batching
Why this follows: both pages keep the LLM systems and decoding thread active.
05
MoE Serving & Scheduling: Token Dispatch, All-to-All, Disaggregated Parallelism
Serving MoE turns sparse compute into a scheduling problem: routing skew can create stragglers and token-dispatch communication can bottleneck, motivating scheduling and, in systems such as MegaScale-Infer, disaggregated expert-parallel serving.
20 min · code · demo · after Scaled Dot-Product Attention & Transformer Layers; Efficient Attention at Scale: KV Cache, GQA & FlashAttention; LLM Serving at Scale: Prefill, Decode & Continuous Batching
Why this follows: both pages keep the LLM systems thread active.

All Published Notebooks

Browse the territory.

Decoding & Sampling: Temperature, Top-p & Inference-Time Control

How inference settings reshape the next-token distribution into actual model behavior: temperature, nucleus sampling, and why decoding is a control knob.

Level 3 · 16 min · demo

LLM Serving at Scale: Prefill, Decode & Continuous Batching

A systems mental model for LLM inference: prefill vs decode, TTFT vs TPOT, batching/scheduling, and why KV cache memory dominates.

Level 4 · 22 min · demo

Structured Decoding: Token Masks From Schema Automata

How a schema automaton or parser state turns next-token logits into constraint-valid generation by masking invalid continuations, while leaving truth and task success outside the formal guarantee.

Level 4 · 22 min · demo

Speculative Decoding: Lossless Multi-Token Generation

Draft several tokens with a fast model, score draft prefixes with the target model in parallel, then use modified rejection/residual sampling so the sampled distribution matches target-model decoding.

Level 4 · 18 min · demo

MoE Serving & Scheduling: Token Dispatch, All-to-All, Disaggregated Parallelism

Serving MoE turns sparse compute into a scheduling problem: routing skew can create stragglers and token-dispatch communication can bottleneck, motivating scheduling and, in systems such as MegaScale-Infer, disaggregated expert-parallel serving.

Level 4 · 20 min · demo

Advanced Bridges

Use these after the core path.

LLM Serving at Scale: Prefill, Decode & Continuous Batching
Structured Decoding: Token Masks From Schema Automata
Speculative Decoding: Lossless Multi-Token Generation
MoE Serving & Scheduling: Token Dispatch, All-to-All, Disaggregated Parallelism

In Progress

Notebooks still below the publish bar.

Retrieval-Augmented Generation: External Memory for Generation