Legacy Concept Lab

Distributed Training: Data, Tensor & Pipeline Parallelism

Large models require distributed training: no single GPU can hold a GPT-4-class model.

Concept 50 of 100 · Phase 9: Advanced architectures & generation · Tags: Distributed, Efficiency

Key equation:
$$\text{Memory per GPU} \approx \frac{\text{Model} + \text{Optimizer}}{\text{Parallelism degree}}$$

Why It Matters for Modern Models

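  • Large models require distributed training: with mixed-precision Adam, weights, gradients, and optimizer state take about 16 bytes per parameter, so a 10B-parameter model needs roughly 160 GB before activations, more than a single 80 GB GPU holds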
  • 3D parallelism (data + tensor + pipeline) is how 100B+ models are trained
  • Understanding communication patterns explains why some architectures scale better than others

What Tutorials Skip

What is still poorly explained in textbooks and papers:

  • Communication is often the bottleneck: AllReduce time can exceed compute time at scale
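  • ZeRO stages trade memory for communication: stages 1 and 2 keep the communication volume of plain data parallelism, while ZeRO-3 also shards the parameters and sends about 1.5x as much, making it the most memory-efficient stage but the most communication-heavy
  • Pipeline bubbles waste compute: splitting each batch into m micro-batches across p stages shrinks the idle fraction to about (p-1)/(m+p-1), and the 1F1B schedule limits how many micro-batches of activations each stage must hold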
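
The communication and bubble costs in the list above can be put into numbers with a small sketch. It assumes a bandwidth-optimal ring AllReduce and a pipeline whose stages all take the same time per micro-batch; both helper names are hypothetical and do not come from any library.

```python
def ring_allreduce_bytes_per_gpu(grad_bytes: float, num_gpus: int) -> float:
    """Bytes each GPU sends in a ring AllReduce (reduce-scatter + all-gather).

    Assumption: bandwidth-optimal ring, so each GPU sends 2 * (N - 1) / N
    times the gradient size, independent of latency effects.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be >= 1")
    return 2 * (num_gpus - 1) / num_gpus * grad_bytes


def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idle fraction of a pipeline with p equal-time stages and m micro-batches.

    A pass occupies (m + p - 1) time slots per stage, of which (p - 1)
    are spent waiting while the pipeline fills and drains.
    """
    p, m = num_stages, num_microbatches
    if p < 1 or m < 1:
        raise ValueError("num_stages and num_microbatches must be >= 1")
    return (p - 1) / (m + p - 1)


# More micro-batches shrink the bubble for a fixed 4-stage pipeline.
for m in (1, 4, 16, 64):
    print(f"stages=4 microbatches={m:>2} idle={bubble_fraction(4, m):.1%}")

# fp16 gradients of a 4B-parameter model (8 GB) across 8 GPUs.
print(ring_allreduce_bytes_per_gpu(8e9, 8) / 1e9, "GB sent per GPU")
```

With 4 stages the idle fraction falls from 75% at one micro-batch to under 5% at 64, which is why pipeline schedules rely on many small micro-batches.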

Interactive Visualization

Core Math (Optional Deep Dive)

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation
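$$\text{Memory per GPU} \approx \frac{\text{Model} + \text{Optimizer}}{\text{Parallelism degree}}$$

Here the parallelism degree counts the GPUs across which state is sharded (tensor, pipeline, or ZeRO). Plain data parallelism replicates the full model and optimizer on every GPU, so it does not reduce this term.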

Data parallelism: each of the N GPUs holds a full copy of the model and processes a different shard of the batch; the local gradients are then averaged across GPUs:

$$\nabla L = \frac{1}{N}\sum_{i=1}^{N} \nabla L_i \quad \text{(AllReduce)}$$
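
A minimal NumPy sketch of this averaging step, simulating the workers in a single process. The linear model, the mean-squared-error loss, and the helper `mse_grad` are illustrative assumptions; a real setup would run a collective AllReduce across devices instead of `np.mean`.

```python
import numpy as np


def mse_grad(X, y, w):
    """Gradient of the mean squared error (1/B) * ||Xw - y||^2 w.r.t. w."""
    return 2.0 * X.T @ (X @ w - y) / X.shape[0]


rng = np.random.default_rng(0)
num_workers, batch, dim = 4, 32, 5
X = rng.normal(size=(batch, dim))
y = rng.normal(size=batch)
w = rng.normal(size=dim)  # every worker holds the same full copy of w

# Each simulated worker computes a gradient on its own equal-sized shard.
local_grads = [
    mse_grad(X_i, y_i, w)
    for X_i, y_i in zip(np.split(X, num_workers), np.split(y, num_workers))
]

# Stand-in for AllReduce: sum the local gradients and divide by N.
averaged = np.mean(local_grads, axis=0)

# With equal shard sizes the average matches the full-batch gradient.
full = mse_grad(X, y, w)
print(np.allclose(averaged, full))
```

The equality relies on equal shard sizes; with uneven shards the local gradients would need to be weighted by shard size.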

Tensor parallelism: Split matrix multiplies across GPUs:

$$Y = XW = X[W_1 \mid W_2] = [XW_1 \mid XW_2] \quad \text{(column split)}$$
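
The column split can be checked numerically. This is a single-process sketch with two simulated GPUs; the shapes are arbitrary, and the final concatenation stands in for the all-gather a real implementation would perform.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 16))   # activations, replicated on every GPU
W = rng.normal(size=(16, 12))  # full weight matrix

num_gpus = 2
# Column split: each simulated GPU stores one block of W's columns.
W_shards = np.split(W, num_gpus, axis=1)

# Each GPU multiplies the full input by its own column block, no communication needed.
Y_shards = [X @ W_k for W_k in W_shards]

# Concatenating along columns (an all-gather in practice) recovers Y = XW.
Y = np.concatenate(Y_shards, axis=1)
print(np.allclose(Y, X @ W))
```

Each shard holds only 1/`num_gpus` of the weights, which is where the memory saving comes from.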

Pipeline parallelism: Split layers across GPUs, micro-batch for efficiency:

$$\text{GPU}_i: \text{layers } [l_i, l_{i+1})$$
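
One simple way to choose the boundaries l_i is an even contiguous split, sketched below. `partition_layers` is a hypothetical helper, and it assumes all layers cost about the same; real systems may balance by measured compute or memory instead.

```python
from typing import List, Tuple


def partition_layers(num_layers: int, num_stages: int) -> List[Tuple[int, int]]:
    """Return half-open ranges [l_i, l_{i+1}) of contiguous layers per stage.

    Layers are spread as evenly as possible; earlier stages take the
    remainder when num_layers is not divisible by num_stages.
    """
    if not 1 <= num_stages <= num_layers:
        raise ValueError("need 1 <= num_stages <= num_layers")
    base, extra = divmod(num_layers, num_stages)
    bounds = []
    start = 0
    for stage in range(num_stages):
        size = base + (1 if stage < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


print(partition_layers(10, 4))  # [(0, 3), (3, 6), (6, 8), (8, 10)]
```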

ZeRO partitions optimizer states (stage 1), gradients (stage 2), and parameters (stage 3) across data-parallel ranks instead of replicating them on every rank.
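
The per-GPU savings of each stage can be estimated with a short sketch. It follows the accounting in the ZeRO paper for mixed-precision Adam: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and K = 12 for fp32 optimizer state. The helper `zero_bytes_per_gpu` is hypothetical, and activations and buffers are ignored.

```python
def zero_bytes_per_gpu(num_params: float, num_ranks: int, stage: int, k: int = 12) -> float:
    """Model-state bytes per GPU under ZeRO stage 0 (plain DP) through 3.

    Assumes mixed-precision Adam: 2 bytes/param weights, 2 bytes/param
    gradients, k bytes/param optimizer state. Activations are not counted.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    weights = 2 * num_params
    grads = 2 * num_params
    optimizer = k * num_params
    if stage >= 1:
        optimizer /= num_ranks  # stage 1 shards optimizer states
    if stage >= 2:
        grads /= num_ranks      # stage 2 also shards gradients
    if stage >= 3:
        weights /= num_ranks    # stage 3 also shards parameters
    return weights + grads + optimizer


# 7.5B parameters on 64 data-parallel ranks (GB = 1e9 bytes).
for stage in range(4):
    gb = zero_bytes_per_gpu(7.5e9, 64, stage) / 1e9
    print(f"ZeRO stage {stage}: {gb:.1f} GB per GPU")
# ZeRO stage 0: 120.0 GB per GPU
# ZeRO stage 1: 31.4 GB per GPU
# ZeRO stage 2: 16.6 GB per GPU
# ZeRO stage 3: 1.9 GB per GPU
```

Only stage 3 divides all model state by the number of ranks, which is the case the key equation describes.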

Canonical Papers

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Shoeybi et al., 2020, arXiv

ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

Rajbhandari et al., 2020, SC
