Legacy Concept Lab

Distributed Training: Data, Tensor & Pipeline Parallelism

Large models require distributed training: no single GPU can hold a GPT-4-class model.

Concept 50 of 100 | Efficiency | Phase 9

Phase 9: Advanced architectures & generation | Concept 50 of 100

Why It Matters for Modern Models

Large models require distributed training: no single GPU has enough memory for the weights, gradients, and optimizer states of a GPT-4-class model.
3D parallelism (data + tensor + pipeline) is how models with 100B+ parameters are trained.
Understanding communication patterns explains why some architectures scale better than others.

What is still poorly explained in textbooks and papers:

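Communication is often the bottleneck: at scale, AllReduce time can exceed compute time.
ZeRO stages trade memory for communication: ZeRO-1 partitions optimizer states, ZeRO-2 also partitions gradients, and ZeRO-3 also partitions parameters. ZeRO-3 is the most memory-efficient but has the highest communication volume, about 1.5x that of plain data parallelism.
Pipeline bubbles waste compute: splitting each batch into micro-batches shrinks the idle fraction to about (p - 1)/(m + p - 1) for p stages and m micro-batches. The 1F1B schedule keeps the same bubble but caps the activation memory held per stage.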
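
To make the communication-bottleneck point concrete, here is a rough sketch of the bandwidth cost of a ring AllReduce, in which each GPU sends about 2(N - 1)/N times the payload. The function names, the 1B-parameter model, and the 100 GB/s link speed are illustrative assumptions, and the estimate ignores latency and any overlap with compute.

```python
def ring_allreduce_bytes_per_gpu(payload_bytes, num_gpus):
    """Bytes each GPU sends in a ring AllReduce: 2 * (N - 1) / N * payload."""
    if num_gpus < 2:
        return 0.0  # a single GPU has nothing to exchange
    return 2.0 * (num_gpus - 1) / num_gpus * payload_bytes


def allreduce_seconds(payload_bytes, num_gpus, link_bytes_per_s):
    """Bandwidth-only time estimate; ignores latency and compute overlap."""
    return ring_allreduce_bytes_per_gpu(payload_bytes, num_gpus) / link_bytes_per_s


# Hypothetical setup: 1B parameters, fp16 gradients (2 bytes each),
# 8 GPUs, 100 GB/s per-GPU link bandwidth.
grad_bytes = 1_000_000_000 * 2
sent_bytes = ring_allreduce_bytes_per_gpu(grad_bytes, 8)
seconds = allreduce_seconds(grad_bytes, 8, 100e9)
print(f"sent per GPU: {sent_bytes / 1e9:.2f} GB, time: {seconds * 1e3:.1f} ms")
```

Because the per-GPU volume approaches twice the payload as N grows, the cost is set mainly by model size and link bandwidth rather than by the number of GPUs.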

Key Equation

\text{Memory per GPU} \approx \frac{\text{Model} + \text{Optimizer}}{\text{Parallelism degree}}

Data parallelism: each GPU holds a full copy of the model and processes a different shard of the batch; the per-GPU gradients are then averaged:

\nabla L = \frac{1}{N}\sum_{i=1}^{N} \nabla L_i \quad \text{(AllReduce)}
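
The averaging step can be checked with a small simulation. This is a minimal sketch in plain Python, assuming a scalar linear-regression model, equal-sized shards, and a hypothetical `all_reduce_mean` helper that stands in for a real AllReduce collective.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y)^2) with respect to the scalar weight w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def all_reduce_mean(values):
    """Stand-in for AllReduce: every worker receives the same mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
w = 0.5
num_workers = 4
shard = len(xs) // num_workers  # equal-sized shards, one per worker

# Each worker computes a gradient on its own shard only.
local_grads = [
    grad_mse(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
    for i in range(num_workers)
]

# After the AllReduce, every worker holds the averaged gradient.
synced_grads = all_reduce_mean(local_grads)
full_batch_grad = grad_mse(w, xs, ys)
print(synced_grads[0], full_batch_grad)
```

With equal shard sizes, the averaged gradient matches the full-batch gradient up to floating-point rounding, which is why data parallelism reproduces large-batch training.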

Tensor parallelism: split each weight matrix across GPUs so that every GPU computes one slice of the matrix multiply:

Y = XW = X[W_1 | W_2] = [XW_1 | XW_2] \quad \text{(column split)}
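
The column-split identity can be verified numerically. The sketch below uses plain Python lists as matrices; `matmul`, `split_columns`, and `concat_columns` are hypothetical helpers written for this example, and each "GPU" is simulated by one entry of a list.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def split_columns(w, parts):
    """Split W into `parts` equal-width column blocks [W_1 | W_2 | ...]."""
    cols = len(w[0])
    if cols % parts != 0:
        raise ValueError("column count must be divisible by parts")
    width = cols // parts
    return [[row[p * width:(p + 1) * width] for row in w] for p in range(parts)]


def concat_columns(blocks):
    """Concatenate per-GPU outputs along columns (the gather step)."""
    return [sum((blk[i] for blk in blocks), []) for i in range(len(blocks[0]))]


X = [[1, 2, 3], [4, 5, 6]]
W = [[1, 0, 2, 1], [0, 1, 1, 3], [2, 1, 0, 1]]

W_shards = split_columns(W, parts=2)           # W_1 and W_2, one per GPU
Y_shards = [matmul(X, w_i) for w_i in W_shards]  # each GPU computes X W_i
Y_parallel = concat_columns(Y_shards)          # [X W_1 | X W_2]
Y_reference = matmul(X, W)                     # single-device result
print(Y_parallel == Y_reference)
```

Each simulated GPU needs the full input X but only its own column block of W, so weight memory is divided while the input is replicated.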

Pipeline parallelism: split the layers into consecutive stages, one per GPU, and feed micro-batches through the stages to keep them busy:

\text{GPU}_i: \text{layers } [l_i, l_{i+1})
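
A short sketch of the stage assignment and of how micro-batching shrinks the pipeline bubble. The helper names and the 30-layer, 4-stage configuration are hypothetical; the idle fraction (p - 1)/(m + p - 1) assumes every stage takes the same time per micro-batch and applies to GPipe-style and non-interleaved 1F1B schedules.

```python
def stage_boundaries(num_layers, num_stages):
    """Return l_0..l_p so that stage i owns layers [l_i, l_{i+1})."""
    base, extra = divmod(num_layers, num_stages)
    bounds = [0]
    for i in range(num_stages):
        # The first `extra` stages take one additional layer.
        bounds.append(bounds[-1] + base + (1 if i < extra else 0))
    return bounds


def bubble_fraction(num_stages, num_microbatches):
    """Idle fraction of a pipeline step, assuming equal stage times."""
    p, m = num_stages, num_microbatches
    return (p - 1) / (m + p - 1)


bounds = stage_boundaries(num_layers=30, num_stages=4)
for i in range(4):
    print(f"GPU {i}: layers [{bounds[i]}, {bounds[i + 1]})")

for m in (1, 4, 32):
    print(f"micro-batches={m}: bubble={bubble_fraction(4, m):.3f}")
```

With a single micro-batch, three of the four stages are idle at any moment (bubble 0.75); with 32 micro-batches the idle fraction falls below 9%.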

ZeRO partitions optimizer states, gradients, and parameters across data-parallel ranks instead of replicating them on every GPU.
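
The memory effect of each ZeRO stage can be estimated with a few lines of arithmetic. This sketch assumes mixed-precision Adam with 2 bytes per parameter for fp16 weights, 2 bytes for fp16 gradients, and 12 bytes for fp32 optimizer states; it ignores activations, buffers, and fragmentation. The function name and the 7.5B-parameter, 64-GPU configuration are illustrative.

```python
def zero_memory_gb(num_params, num_gpus, stage):
    """Approximate model-state memory per GPU in GB (1 GB = 1e9 bytes).

    stage 0: plain data parallelism, everything replicated
    stage 1: optimizer states partitioned
    stage 2: optimizer states and gradients partitioned
    stage 3: optimizer states, gradients, and parameters partitioned
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    param_bytes, grad_bytes, optim_bytes = 2.0, 2.0, 12.0
    if stage >= 1:
        optim_bytes /= num_gpus
    if stage >= 2:
        grad_bytes /= num_gpus
    if stage >= 3:
        param_bytes /= num_gpus
    return num_params * (param_bytes + grad_bytes + optim_bytes) / 1e9


for stage in range(4):
    gb = zero_memory_gb(num_params=7.5e9, num_gpus=64, stage=stage)
    print(f"ZeRO stage {stage}: {gb:.1f} GB per GPU")
```

Under these assumptions the per-GPU footprint drops from 120 GB with plain data parallelism to about 31.4 GB, 16.6 GB, and 1.9 GB for stages 1, 2, and 3. Only at stage 3 does memory shrink in full proportion to the parallelism degree, as the key equation suggests.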

Shoeybi et al., 2020, arXiv

Rajbhandari et al., 2020, SC
