
Mastering Scale: An Expert's Guide to Distributed Training and Optimization of Large Neural Networks


1. Introduction to Distributed Training and Scaling
1.1. The Imperative for Scale: Why Large Models Demand Distributed Approaches
The landscape of artificial intelligence has been dramatically reshaped by the advent
of increasingly large and complex neural network models. The development of
architectures boasting billions, and in some cases, trillions of parameters—such as
GPT-3, Turing Natural Language Generation (Turing-NLG), Megatron-Turing NLG 530B (MT-NLG), and
BLOOM—has pushed the boundaries of what can be achieved with single-GPU
systems.1 Training these colossal models necessitates computational resources and
memory capacities that far exceed the capabilities of any individual processing unit.
This surge in model scale is not arbitrary; empirical evidence consistently
demonstrates that model performance, particularly in complex domains like natural
language understanding and generation, often correlates positively with size.
However, this growth introduces fundamental hardware bottlenecks concerning GPU
memory—required for storing parameters, gradients, optimizer states, and
activations—and the sheer computational throughput (measured in Floating Point
Operations Per Second, or FLOPs) needed for training. Consequently, distributed
training is no longer a niche technique but an essential methodology for advancing
the state-of-the-art in AI.

Deep learning itself can be conceptualized as a novel software paradigm that
inherently requires a new, massively parallel computing model to be effectively
realized.3 The challenges encountered in training models like the 17-billion-parameter
Turing-NLG, and the continued research into trillion-parameter architectures,
underscore this necessity.2 The pursuit of scale is driven not only by the desire for
larger models but also by the need to effectively leverage vast datasets and tackle
increasingly sophisticated tasks. This creates a dynamic interplay: advancements in
model capabilities fuel the demand for more data and complex problem-solving,
which in turn necessitates more powerful and scalable distributed systems. This
co-evolution is a defining characteristic of modern AI research, where progress in
models, algorithms, and systems are inextricably linked. The ability to distribute the
computational and memory load across multiple devices is paramount to continued
innovation.

1.2. An Overview of Parallelism Dimensions in Deep Learning


To address the multifaceted challenges of training large neural networks, several
distinct parallelism strategies have been developed. Broadly, these can be
categorized into Data Parallelism (DP), Model Parallelism (MP)—which itself
encompasses techniques like Tensor Parallelism (TP)—and Pipeline Parallelism (PP).5
These strategies are not mutually exclusive; in fact, contemporary large-scale training
systems often employ hybrid approaches, combining these dimensions to optimize for
specific model architectures and hardware configurations.7

Data parallelism focuses on distributing the data across multiple workers, each
holding a replica of the model. Model parallelism, in contrast, involves partitioning the
model itself across different devices. Pipeline parallelism takes this further by dividing
the model's layers into sequential stages, processed in an assembly-line fashion.
Tensor parallelism is a specific form of model parallelism that splits individual
operations within a layer across devices. Each of these approaches targets different
bottlenecks: data parallelism primarily addresses compute limitations by processing
more data concurrently, while model and pipeline parallelism are chiefly concerned
with fitting large models into limited per-device memory. The selection and
orchestration of these parallelism dimensions are critical design choices, heavily
influenced by the model's architecture (e.g., its depth versus the size of its individual
layers), the available hardware (GPU memory, interconnect speeds), and the primary
bottlenecks encountered—be they memory capacity, computational throughput, or
inter-device communication bandwidth. There is no universal solution; the optimal
strategy emerges from a careful consideration of these interacting factors, often
representing a complex optimization problem in its own right.

2. Fundamental Parallelism Strategies


2.1. Data Parallelism (DP)
Concept and Mechanism
Data Parallelism (DP) is a widely adopted strategy in distributed deep learning where
the core idea is to replicate the entire model across multiple processing units, often
referred to as workers (typically GPUs). Each worker then processes a different
subset, or mini-batch, of the training data simultaneously.5 After each worker
computes the gradients based on its local data, these gradients are aggregated
across all workers—commonly through an All-Reduce operation—to produce a
consistent update. This aggregated gradient is then used to update the model
parameters on each worker, ensuring that all model replicas remain synchronized.6
This approach primarily accelerates training by enabling the processing of a larger
effective batch size in the same amount of wall-clock time.
The mechanics involve several steps: data is divided into chunks, each chunk is
assigned to a separate processor, these processors work on their data concurrently
applying the same operations, and finally, the results (gradients) are combined.9 While
conceptually straightforward and often the easiest parallelism method to implement, a
key prerequisite is that the entire model, along with its optimizer states and memory
required for activations, must fit within the memory of each individual GPU.5
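
The following minimal sketch, assuming a launch via torchrun and using a toy model and random dataset purely for illustration, shows this pattern with PyTorch's DistributedDataParallel: each rank trains on its own data shard while gradients are All-Reduced automatically during the backward pass.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# Assumes launch via torchrun, which sets RANK/WORLD_SIZE/LOCAL_RANK.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(512, 10).cuda()        # full replica on every GPU
ddp_model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)

# Toy random dataset; each rank sees a distinct shard via DistributedSampler.
dataset = TensorDataset(torch.randn(1024, 512), torch.randint(0, 10, (1024,)))
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=8, sampler=sampler)

for inputs, targets in loader:
    optimizer.zero_grad()
    logits = ddp_model(inputs.cuda())
    loss = torch.nn.functional.cross_entropy(logits, targets.cuda())
    loss.backward()                            # DDP All-Reduces gradients during backward
    optimizer.step()                           # identical parameter update on every replica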

Benefits
Data parallelism offers several advantages. It can lead to significant improvements in
training performance and throughput, especially when the dataset is large and
processing it sequentially would be time-consuming.9 The strategy scales well with
increasing dataset sizes, provided the model fits in memory. By distributing the
workload, DP allows for more efficient utilization of available computational
resources.9

Limitations
The primary limitation of data parallelism is its memory inefficiency concerning model
states. Since the model parameters and optimizer states are replicated on every GPU,
the memory footprint per GPU remains high.5 This means DP alone is insufficient if the
model itself is too large to fit on a single device. Another critical challenge is the
communication overhead associated with synchronizing gradients after each
backward pass. This aggregation step, typically an All-Reduce operation, can become
a significant bottleneck, especially for very large models (which produce large
gradient tensors) or when using systems with slower inter-GPU or inter-node
interconnects.8

Synchronization and Communication


The efficiency of gradient synchronization is paramount in DP. The All-Reduce
collective operation sums the gradients from all workers and distributes the result
back to all of them.6 The time taken for this step depends on the size of the gradients,
the number of workers, the interconnect bandwidth (e.g., PCIe, NVLink, Ethernet), and
the specific All-Reduce algorithm employed (e.g., ring-based, tree-based).

Data parallelism is often the initial strategy considered for scaling training when the
model can be accommodated by individual GPU memory, but the sheer volume of
data makes training on a single GPU too slow. Its principal drawback is the redundant
storage of model parameters and optimizer states, which becomes prohibitive for
extremely large models. This memory redundancy is precisely what advanced
techniques like ZeRO and FSDP aim to eliminate.

2.2. Model Parallelism (MP) - General Concept


Model Parallelism (MP) offers a solution when a neural network is too large to fit into
the memory of a single device. Instead of replicating the entire model on each worker
as in data parallelism, MP involves partitioning the model itself across multiple
devices.6 Each device then becomes responsible for storing and computing only a
portion of the model's architecture, such as a subset of its layers, distinct
sub-modules, or even fragments of individual operations within a layer.8

This approach directly addresses the GPU memory capacity bottleneck by distributing
the storage of model parameters. Consequently, the aggregate memory of multiple
GPUs can be harnessed to accommodate models that would otherwise be
untrainable. Model parallelism is a broad term that encompasses more specific
techniques, most notably Tensor Parallelism (splitting individual operations) and
Pipeline Parallelism (splitting sequential layers), each with its own mechanisms and
trade-offs. The fundamental contribution of model parallelism is enabling the training
of models whose size transcends the memory limitations of individual hardware
accelerators.

2.3. Tensor Parallelism (TP) (A form of Model Parallelism)


Concept: Intra-layer sharding (horizontal splitting)
Tensor Parallelism (TP), also known as intra-layer model parallelism, is a technique
that partitions the execution of individual operations within a neural network layer,
such as matrix multiplications or attention computations, across multiple GPUs.5 This
is often described as a "horizontal" split of the model, as it divides the computation
along the feature or hidden dimensions of a tensor, rather than splitting sequential
layers.12 TP is particularly effective for models characterized by very large individual
layers with substantial weight matrices, a common feature in modern Large Language
Models (LLMs) like Transformers.6 Each GPU in a tensor-parallel group computes a
part of an operation's result, and these partial results are subsequently combined
through collective communication operations to produce the full output.

Mechanism: Splitting weight matrices (Column/Row Parallelism)


Consider a typical linear layer represented by the matrix multiplication Y = XA + B, where
X is the input activation, A is the weight matrix, and B is the bias. Tensor parallelism
can be implemented by sharding the weight matrix A (and correspondingly the bias B)
across GPUs. Two common strategies are column parallelism and row parallelism.7

In column parallelism, the weight matrix A is split column-wise across N GPUs. If
A = [A_1, A_2, ..., A_N], then each GPU i computes X A_i. The input X is typically broadcast or
made available to all GPUs in the TP group. The outputs X A_i are then concatenated
along the column dimension to form the full output Y. This concatenation is usually
achieved using an All-Gather collective operation.7

In row parallelism, the weight matrix A is split row-wise into blocks A_1, A_2, ..., A_N stacked
vertically, and the input X must also be split (sharded) along its feature dimension so that
each GPU i receives X_i and computes X_i A_i. The partial outputs X_i A_i are then summed
across all GPUs to form the final output Y. This summation is typically performed using an
All-Reduce collective operation.7

The choice between column and row parallelism, and their sequencing, is critical for
efficiency. For example, in a Transformer's feed-forward network (FFN) block, which
often consists of two linear layers with a non-linear activation in between (e.g., GELU),
a common pattern is to use column parallelism for the first linear layer and row
parallelism for the second. This "pairwise sharding" can be more
communication-efficient because the output of the first (column-parallel) layer is
sharded, and if the activation function (like GELU) is element-wise, it can operate on
the sharded data directly. The second (row-parallel) layer can then take this sharded
output as input, and only a single All-Reduce is needed after the second layer to
produce the full output. This avoids an All-Gather after the first layer and an
All-Reduce after the second, effectively halving the communication for the block.7
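
To make the pairwise pattern concrete, the following minimal sketch (assuming an already initialized tensor-parallel process group and illustrative dimension names) implements a column-parallel first linear layer followed by a row-parallel second layer, with a single All-Reduce per FFN block.

import torch
import torch.distributed as dist

class ColumnRowParallelFFN(torch.nn.Module):
    def __init__(self, hidden, ffn_dim, tp_group):
        super().__init__()
        self.tp_group = tp_group
        tp_size = dist.get_world_size(tp_group)
        # Column-parallel: shard the ffn_dim (output) dimension of the first weight.
        self.w1 = torch.nn.Linear(hidden, ffn_dim // tp_size)
        # Row-parallel: shard the ffn_dim (input) dimension of the second weight.
        # Bias is omitted here so the summed partial outputs are not double-counted.
        self.w2 = torch.nn.Linear(ffn_dim // tp_size, hidden, bias=False)

    def forward(self, x):
        # x is replicated across the TP group; the column-parallel output is already
        # sharded, so the element-wise GELU runs locally with no communication.
        h = torch.nn.functional.gelu(self.w1(x))
        partial = self.w2(h)                            # partial sums of the final output
        dist.all_reduce(partial, group=self.tp_group)   # single All-Reduce per FFN block
        return partial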

Communication Overhead
Tensor parallelism inherently involves frequent communication among the GPUs
participating in the parallel computation. After each sharded operation in the forward
pass, a collective communication (e.g., All-Gather or All-Reduce) is typically required
to combine the partial results. Similar communication is needed during the backward
pass for gradients with respect to weights and inputs.7 This high frequency of
synchronization makes TP highly sensitive to the bandwidth and latency of the
interconnects between GPUs.6 For some large models, communication overhead in TP
can account for a substantial portion, potentially 50-70%, of the total runtime.7
Consequently, TP is most effective when deployed on GPUs connected by very
high-speed, low-latency interconnects like NVIDIA's NVLink, and is often restricted to
GPUs within a single server node.12

Use Cases and Considerations


TP is ideal for models where individual layers are too large to fit in a single GPU's
memory or where parallelizing the computation of these large layers can yield
speedups despite the communication costs. This is particularly true for the massive
MLP and attention layers found in LLMs.6

The introduction of TP presents a clear trade-off: it reduces the memory footprint per
GPU for very large layers and can accelerate their computation, but this comes at the
cost of significant inter-GPU communication. The effectiveness of TP is therefore
intrinsically linked to the availability of fast interconnects. If the communication
infrastructure is not sufficiently robust, the overhead can outweigh the computational
gains, rendering TP inefficient.

Furthermore, the mathematical structure of TP directly dictates the type of collective
communication required (All-Gather for column parallelism outputs, All-Reduce for
row parallelism outputs). Optimizing the sequence of these operations, as seen in
pairwise sharding, is crucial for minimizing the overall communication volume and is a
non-trivial aspect of implementing TP efficiently. This requires a careful analysis of the
data flow and the properties of the operators involved to adapt the model script with
the necessary collective operations, demanding a deeper understanding from the
user or reliance on sophisticated frameworks like Megatron-LM or DeepSpeed that
handle these transformations.7

2.4. Pipeline Parallelism (PP)


Concept: Inter-layer sharding (vertical splitting)
Pipeline Parallelism (PP) partitions a model "vertically" by assigning sequential layers
or blocks of layers to different GPUs, forming a pipeline of stages.5 Each GPU (or set
of GPUs) in the pipeline is responsible for executing only its assigned portion of the
model.8 Data flows from one stage to the next, with the output of one stage becoming
the input for the subsequent stage. This strategy is particularly well-suited for training
very deep neural networks, where the sheer number of layers makes it impossible to
fit the entire model onto a single device, even if individual layers are not excessively
large.8

Mechanism: Micro-batching and Staging


To enable concurrent execution across stages and improve GPU utilization, the input
data batch is typically divided into smaller units called micro-batches.5 These
micro-batches are then fed into the pipeline in a staggered fashion. This allows
different stages to operate on different micro-batches simultaneously, akin to an
assembly line.8 For instance, while stage k is processing micro-batch m, stage k−1 can
be processing micro-batch m+1, and stage k+1 can be processing micro-batch m−1.
This concurrency is essential for mitigating the "pipeline bubble."
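
The sketch below is a conceptual illustration only: it covers just the forward phase of a GPipe-style schedule and assumes known activation shapes and pre-wired stage ranks. It shows how micro-batches flow through one pipeline stage using point-to-point sends and receives; real implementations such as torch.distributed.pipelining also interleave backward passes.

import torch
import torch.distributed as dist

def run_stage_forward(stage_module, num_microbatches, batch, prev_rank, next_rank, act_shape):
    """Forward phase for one pipeline stage (conceptual sketch).

    The first stage consumes `batch` split into micro-batches; later stages receive
    activations from the previous stage. Backward passes and 1F1B interleaving are omitted.
    """
    stage_inputs = batch.chunk(num_microbatches) if prev_rank is None else None
    outputs = []
    for i in range(num_microbatches):
        if prev_rank is None:
            x = stage_inputs[i].cuda()
        else:
            x = torch.empty(act_shape, device="cuda")
            dist.recv(x, src=prev_rank)          # activations from the previous stage
        out = stage_module(x)
        if next_rank is not None:
            dist.send(out, dst=next_rank)        # hand off to the next stage
        outputs.append(out)                      # retained for the backward phase
    return outputs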

The "Pipeline Bubble" and Mitigation Strategies


A significant challenge in naive pipeline parallelism is the "pipeline bubble" or idle time
incurred by GPUs at the beginning and end of processing a full batch.6 At the start,
later stages must wait for the first micro-batches to propagate through earlier stages.
Similarly, towards the end, earlier stages finish their micro-batches and become idle
while later stages complete processing. This bubble reduces overall hardware
utilization and efficiency.15

Several strategies are employed to mitigate this bubble:


1.​ Micro-batching: As mentioned, processing multiple micro-batches concurrently
helps keep stages busy.
2.​ Scheduling: Sophisticated scheduling algorithms, such as those used in GPipe,
aim to interleave forward and backward passes of different micro-batches to
maximize overlap and minimize idle time. Some schedules, termed "zero bubble
schedules," attempt to fill the bubble by using the backward pass computations
for weights to occupy GPUs that would otherwise be idle.15
3.​ Load Balancing: Ensuring that the computational load (i.e., the time taken to
process a micro-batch) is roughly equal across all pipeline stages is crucial.
Imbalanced stages lead to faster stages waiting for slower ones, reintroducing
bubbles.7
4.​ Non-Consecutive Layer Assignment: Some advanced pipeline schedules, like
PTD-P, assign multiple non-consecutive layers to each device, which can reduce
bubble overhead at the cost of more complex network communication patterns.5

PyTorch's torch.distributed.pipelining module, for example, provides tools to
automatically partition models and manage the execution of pipeline stages, including
micro-batch splitting and scheduling, to simplify implementation and improve
efficiency.15

Balancing Stages
Achieving good load balance across pipeline stages is critical for PP efficiency. If one
stage takes significantly longer than others, it becomes the bottleneck, and all other
stages will experience idle time waiting for it.7 This requires careful partitioning of the
model's layers, which can be a complex task, especially for models with
heterogeneous layer structures. Automated model splitting tools, as provided by
frameworks like PyTorch, can assist in this process by analyzing the computational
graph of the model.15

Pipeline parallelism is most effective for models whose primary bottleneck is their
depth. However, its efficiency is highly dependent on minimizing the pipeline bubble
through optimal micro-batch sizing and scheduling, as well as meticulous load
balancing across stages. This makes PP generally more complex to implement and
tune effectively compared to data parallelism.

The communication in pipeline parallelism involves passing activations forward
between stages and gradients backward. While these communication events are
typically less frequent than in tensor parallelism (occurring per micro-batch between
stages, rather than within each layer), the volume of data transferred (the activations)
can be substantial, especially for layers with large activation tensors or when using
large micro-batch sizes. This characteristic makes PP relatively more tolerant of
slower, inter-node interconnects compared to TP.15 However, the total communication
volume can still be a concern and may benefit from techniques such as activation
compression if activations are particularly large and need to traverse slower network
links.

2.5. Hybrid Parallelism Approaches


Concept: Combining DP, TP, and PP (e.g., 3D Parallelism)
As models grow in both depth and width, and datasets expand, relying on a single
parallelism strategy often proves insufficient. Consequently, modern large-scale
training frameworks frequently employ hybrid parallelism, which combines Data
Parallelism (DP), Tensor Parallelism (TP), and Pipeline Parallelism (PP) to harness their
respective advantages and mitigate their individual limitations.5 For instance, a
common configuration involves using TP within a node (leveraging fast NVLink
interconnects between GPUs), PP across nodes (to distribute deep models), and DP
across replicas of these pipelined/tensor-parallelized model instances to increase
throughput. DeepSpeed's 3D parallelism formalizes this concept, orchestrating these
dimensions to train models with up to trillions of parameters.2 Some frameworks even
discuss 4D parallelism, incorporating sequence parallelism alongside DP, TP, and PP.13

Rationale and Benefits


The primary rationale for hybrid parallelism is to simultaneously address multiple
scaling challenges. TP and PP help manage the memory footprint of massive models
by partitioning parameters and activations, while DP and TP contribute to
computational efficiency by parallelizing operations and data processing.2 This
multi-pronged approach is what enables the training of models at the
trillion-parameter scale, efficiently utilizing the memory and compute resources of
large GPU clusters.

Complexity
The power of hybrid parallelism comes at the cost of significantly increased
complexity. Managing the interactions between different parallelism dimensions,
scheduling computations and communications, and mapping the parallel execution
strategy onto the physical hardware topology requires sophisticated algorithms and
software support.2 For example, DeepSpeed's 3D parallelism involves topology-aware
mapping to optimize communication patterns, placing tensor-parallel groups on
tightly connected GPUs within a node and pipeline stages across nodes.2

The evolution towards hybrid and 3D parallelism underscores that scaling large
language models and other massive neural networks is a multifaceted challenge
where no single dimension of parallelism is a panacea. Optimizing such complex
systems requires a holistic understanding of how these strategies interact with each
other and with the underlying hardware. For example, the degree of tensor parallelism
affects the memory available per GPU for each pipeline stage, and the number of
pipeline stages can influence the effective batch size available for data parallelism.
Effective hybrid parallelism is therefore not merely an additive combination of
techniques but a synergistic one, demanding careful co-optimization of all dimensions
to achieve peak performance.
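
As an illustration of how such a layout can be described in code, the following hedged sketch uses PyTorch's DeviceMesh to carve an assumed 8-GPU job into data-, pipeline-, and tensor-parallel process groups; production frameworks such as DeepSpeed and Megatron-LM construct equivalent groups internally, and the 2 x 2 x 2 shape is purely illustrative.

import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

# 8 GPUs arranged so that the fastest-changing ("tp") dimension maps to GPUs
# sharing the fastest intra-node links (assumes a distributed job of 8 processes).
mesh = init_device_mesh("cuda", (2, 2, 2), mesh_dim_names=("dp", "pp", "tp"))

tp_group = mesh.get_group("tp")   # intra-layer All-Reduce/All-Gather (tensor parallelism)
pp_group = mesh.get_group("pp")   # activation passing between pipeline stages
dp_group = mesh.get_group("dp")   # gradient All-Reduce across model replicas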

Table 2.1: Comparative Analysis of Data, Tensor, and Pipeline Parallelism

● Data Parallelism
○ Granularity: Batch-level (data subsets).6
○ Primary Benefit: Increased training throughput; scales with data volume.9
○ Key Challenge(s): Model must fit on one GPU; memory redundancy for parameters/optimizer states.5
○ Communication Pattern: Gradient aggregation (All-Reduce) per batch.6
○ Typical Use Case: Model fits on a single GPU; large dataset; compute-bound training.6

● Tensor Parallelism
○ Granularity: Tensor-level (within-layer operations).6
○ Primary Benefit: Reduces memory for large layers; parallelizes layer compute.12
○ Key Challenge(s): High communication overhead (frequent synchronizations); sensitive to interconnect speed.6
○ Communication Pattern: All-Reduce/All-Gather within layer computations (intra-node).7
○ Typical Use Case: Very large individual layers (e.g., LLM MLP/attention); requires fast interconnects.6

● Pipeline Parallelism
○ Granularity: Layer-level (sequential model stages).6
○ Primary Benefit: Enables training of very deep models; reduces parameter memory per GPU.5
○ Key Challenge(s): Pipeline bubble (GPU idle time); load balancing across stages is critical.6
○ Communication Pattern: Activation passing between stages (inter-stage, can be inter-node).5
○ Typical Use Case: Very deep models; can tolerate slower inter-node links than TP.6

3. Advanced Memory Optimization: Sharding Model States with ZeRO and FSDP
3.1. The Need for Advanced Memory Optimization beyond Basic Parallelism
While fundamental parallelism strategies like Data Parallelism (DP), Tensor Parallelism
(TP), and Pipeline Parallelism (PP) provide pathways to scale training, they each have
limitations, particularly concerning memory efficiency. Standard DP, for instance,
replicates optimizer states and gradients across all participating GPUs, leading to
substantial memory consumption that does not scale with the number of devices.2
Even with TP and PP, which partition the model parameters, the optimizer states
associated with the parameters on each GPU can still be very large, as can the
gradients and activations.

Advanced memory optimization techniques, exemplified by DeepSpeed's Zero
Redundancy Optimizer (ZeRO) and PyTorch's Fully Sharded Data Parallel (FSDP), have
emerged to address these shortcomings. These frameworks build upon the data
parallelism paradigm but incorporate ideas from model parallelism—specifically, the
sharding of model parameters, gradients, and optimizer states—to drastically reduce
the memory footprint on each GPU. This allows for the training of significantly larger
models than would be possible with basic parallelism strategies alone, or allows for
larger batch sizes, improving hardware utilization. The core innovation of ZeRO and
FSDP lies in their ability to make data parallelism highly memory-efficient by
systematically eliminating redundancy across distributed workers.
3.2. DeepSpeed ZeRO (Zero Redundancy Optimizer)
Core Idea: Eliminating Memory Redundancy in Data Parallelism
DeepSpeed ZeRO is a suite of memory optimization techniques designed to eliminate
memory redundancy in data-parallel training.16 Instead of replicating all model training
states (parameters, gradients, and optimizer states) on each data-parallel worker,
ZeRO partitions these states across the available devices (GPUs, and potentially
CPUs/NVMe with ZeRO-Infinity).4 This approach significantly reduces the per-device
memory consumption, enabling the training of extremely large models, potentially
with trillions of parameters, often with minimal or no modifications to the model code
itself.2

ZeRO Stages (Incremental Optimizations)


ZeRO implements its optimizations in several incremental stages, where each
subsequent stage builds upon the previous one to offer greater memory savings:
● ZeRO Stage 1: Optimizer State Partitioning (P_os)
In this stage, ZeRO partitions the optimizer states across the data-parallel
processes.17 For common optimizers like Adam, which maintain momentum and
variance terms (often 32-bit each) in addition to a copy of the 32-bit master
parameters for mixed-precision training, these states can consume significantly
more memory than the model parameters themselves (e.g., 4x to 8x the
parameter memory for FP16 training). By sharding these states, ZeRO-1 ensures
that each GPU is responsible for storing and updating only a fraction of the total
optimizer states. This can lead to memory savings of up to 4x compared to
traditional data parallelism where optimizer states are fully replicated.4
● ZeRO Stage 2: Gradient Partitioning (P_os+g)
ZeRO Stage 2 extends the optimizations of Stage 1 by also partitioning the
gradients.17 During the backward pass, after gradients are computed locally, they
are typically reduced across all GPUs. In ZeRO-2, these reduced gradients (often
in 16-bit precision) are also sharded such that each GPU only holds the gradients
corresponding to its partition of the optimizer states.20 This further reduces the
memory required on each GPU, potentially offering up to an 8x reduction in
memory for model states (optimizer states + gradients) compared to standard
DP.19
● ZeRO Stage 3: Parameter Partitioning (P_os+g+p)
ZeRO Stage 3 is the most comprehensive optimization, partitioning not only the
optimizer states and gradients but also the model parameters themselves
(typically 16-bit) across the data-parallel GPUs.16 During the forward and
backward passes, the parameters required for computation by a particular
module or layer are dynamically gathered on-the-fly onto the respective GPU(s)
as needed and then discarded immediately after use to free up memory.18 This
approach allows the memory savings for parameters to scale linearly with the
degree of data parallelism; for instance, with N_dp data-parallel GPUs, the
parameter memory footprint per GPU is reduced by a factor of N_dp.16
The benefits of ZeRO-3 are particularly significant:
○​ It enables the training of extremely large models, often with billions or even
trillions of parameters, that would be far too large for even the aggregate
memory of a cluster without such partitioning.16
○​ The memory savings scale effectively with the number of GPUs.
○​ It generally requires minimal code changes from the user's perspective, as
DeepSpeed handles the parameter partitioning and gathering.16
○​ It seamlessly integrates with offloading capabilities (ZeRO-Infinity) for even
greater scale. ZeRO-3 fundamentally changes the memory landscape for data
parallelism by ensuring that each GPU effectively holds only a slice of the
entire training state (parameters, gradients, and optimizer states), thereby
maximizing memory distribution and enabling unprecedented model sizes.

Table 3.1: DeepSpeed ZeRO Stages: Sharded Components and Memory Savings

● Stage 0 (Baseline DDP)
○ Sharded Components: None (all replicated).
○ Typical Memory Reduction (vs. DDP for model states): 1x.
○ Key Mechanism: Full replication of parameters, gradients (temporarily), and optimizer states.4

● Stage 1 (P_os)
○ Sharded Components: Optimizer states.
○ Typical Memory Reduction: Up to 4x.4
○ Key Mechanism: Optimizer states (e.g., Adam moments, master weights) are partitioned across data-parallel GPUs.17

● Stage 2 (P_os+g)
○ Sharded Components: Optimizer states, gradients.
○ Typical Memory Reduction: Up to 8x.19
○ Key Mechanism: Gradients are also partitioned after reduction, corresponding to the optimizer state shards.17

● Stage 3 (P_os+g+p)
○ Sharded Components: Optimizer states, gradients, parameters.
○ Typical Memory Reduction: Scales with DP degree (e.g., N_dp x for parameters).16
○ Key Mechanism: Model parameters are partitioned; gathered on-the-fly for computation and then discarded.16

ZeRO-Infinity: Offloading to CPU and NVMe


For models that exhaust even the distributed GPU memory capacity available with
ZeRO-3, DeepSpeed offers ZeRO-Infinity. This extension allows the offloading of
model states (parameters, gradients, and optimizer states) from GPU memory to more
abundant, albeit slower, CPU RAM or even NVMe solid-state drives.16 ZeRO-Infinity
builds upon ZeRO-Offload (which was primarily for ZeRO-2 and offloaded
optimizer/gradient states to CPU) by providing comprehensive offloading for all
ZeRO-3 states.2 This dramatically expands the effective memory pool, enabling the
training of trillion-scale models. However, this comes at the cost of increased data
transfer latency between GPU and CPU/NVMe, which DeepSpeed attempts to mitigate
through techniques like memory-centric tiling (breaking large operators into smaller,
sequentially processed tiles to reduce working memory) and overlapping
communication with computation.17

Practical Usage with Hugging Face Transformers


DeepSpeed, including its ZeRO optimizer, is well-integrated with popular libraries like
Hugging Face Transformers and Accelerate. Users can typically enable DeepSpeed
and configure ZeRO stages and offloading options through a JSON configuration file,
which is then passed to the Trainer or managed by Accelerate.18 This simplifies the
adoption of these advanced memory optimization techniques, allowing researchers
and practitioners to scale their models with relative ease. The configuration options
are extensive, covering ZeRO stages, offload targets (CPU/NVMe), batch sizes,
optimizer choices, and more.20
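
As a concrete illustration, the following hedged example shows a minimal ZeRO-3 configuration with CPU offloading expressed as a Python dictionary and passed to the Hugging Face Trainer; the specific values (and the "auto" placeholders) are assumptions to be tuned for a given setup.

from transformers import TrainingArguments

# Minimal ZeRO-3 configuration with optimizer and parameter offload to CPU.
ds_config = {
    "zero_optimization": {
        "stage": 3,                               # shard params, grads, and optimizer states
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "bf16": {"enabled": True},
    "gradient_accumulation_steps": "auto",        # let the Trainer fill these in
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto",
}

# The dict (or a path to an equivalent JSON file) is passed via TrainingArguments.
args = TrainingArguments(output_dir="out", deepspeed=ds_config, bf16=True)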

The staged approach of ZeRO provides a flexible gradient of memory savings versus
communication overhead. Higher stages save more memory but typically involve more
intricate data management and potentially more communication as parameters are
gathered and scattered. ZeRO-3 combined with ZeRO-Infinity represents the most
aggressive end of this spectrum, prioritizing maximum model scale by leveraging
system memory as an extension of GPU memory, accepting the trade-off of slower
access speeds.

While a key advantage of ZeRO is the "minimal code changes" promise 16, its efficacy,
particularly for ZeRO-3, hinges on the DeepSpeed runtime's intelligent management
of parameter gathering and partitioning. This often occurs at a submodule granularity.
The runtime attempts to automatically detect when parameters are needed for a
module's forward or backward pass.17 However, the performance can be sensitive to
the model's structure and the internal heuristics DeepSpeed employs. For instance,
the granularity at which modules are defined can affect fetching efficiency: very
fine-grained modules might lead to excessive, inefficient fetching, while very
coarse-grained ones might materialize more parameters than strictly necessary at a
given moment. In some cases, manual hints using constructs like
deepspeed.zero.GatheredParameters might be needed to ensure correctness or
optimize performance for parameters accessed outside their defining module's
standard forward pass.16

3.3. PyTorch Fully Sharded Data Parallel (FSDP)


Concept and Mechanism
PyTorch's Fully Sharded Data Parallel (FSDP) is a native data parallelism feature
designed to reduce memory consumption on each GPU by sharding model
parameters, gradients, and optimizer states across the ranks (GPUs) in a
DistributedDataParallel (DDP) group.11 When an FSDP-wrapped module (often a layer
or a block of layers) executes its forward or backward pass, it first performs an
All-Gather operation to collect the full parameters for that specific module from all
participating GPUs. After the computation is complete, these full parameters are
immediately discarded to free up GPU memory. During the backward pass, gradients
are synchronized using a Reduce-Scatter operation, ensuring each rank only
possesses the shard of gradients corresponding to its shard of parameters, which it
then uses to update its local optimizer state shard.11

Sharding Strategies
FSDP offers several sharding strategies, allowing users to choose the level of sharding
based on their needs:
●​ FULL_SHARD (or 1): This strategy shards model parameters, gradients, and
optimizer states. It provides the maximum memory savings, conceptually similar to
ZeRO-3.21
●​ SHARD_GRAD_OP (or 2): This strategy shards gradients and optimizer states,
but model parameters are replicated on each GPU (like standard DDP for
parameters). However, the optimizer states for these replicated parameters are
sharded. This is akin to ZeRO-2.21 With use_orig_params=True, this strategy
exposes the unsharded parameters after the forward pass, unlike FULL_SHARD.23
●​ NO_SHARD (or 3): This effectively reverts to standard DistributedDataParallel
behavior, with no sharding of parameters, gradients, or optimizer states.21
●​ HYBRID_SHARD (or 4): This strategy implements FULL_SHARD within each node
(intra-node sharding of parameters, gradients, and optimizer states) and then
applies standard DDP across different nodes (inter-node replication). This can be
useful in heterogeneous network environments where intra-node communication
is much faster than inter-node communication.
●​ HYBRID_SHARD_ZERO2 (or 5): Similar to HYBRID_SHARD, but applies
SHARD_GRAD_OP (ZeRO-2 like behavior) within each node and DDP across
nodes.

Table 3.2: PyTorch FSDP Sharding Strategies

● FULL_SHARD (1)
○ Shards Parameters: Yes. Shards Gradients: Yes. Shards Optimizer States: Yes.
○ Description/Use Case: Maximum memory saving; shards all model states across all ranks. Similar to ZeRO-3.

● SHARD_GRAD_OP (2)
○ Shards Parameters: No (replicated). Shards Gradients: Yes. Shards Optimizer States: Yes.
○ Description/Use Case: Shards gradients and optimizer states; parameters are replicated. Similar to ZeRO-2.

● NO_SHARD (3)
○ Shards Parameters: No. Shards Gradients: No. Shards Optimizer States: No.
○ Description/Use Case: Standard DDP behavior; no sharding.

● HYBRID_SHARD (4)
○ Shards Parameters: Yes (intra-node). Shards Gradients: Yes (intra-node). Shards Optimizer States: Yes (intra-node).
○ Description/Use Case: Full sharding within each node, DDP (replication) across nodes. Balances memory saving with inter-node communication.

● HYBRID_SHARD_ZERO2 (5)
○ Shards Parameters: No (replicated intra-node). Shards Gradients: Yes (intra-node). Shards Optimizer States: Yes (intra-node).
○ Description/Use Case: Gradient and optimizer state sharding within each node, DDP across nodes.

CPU Offloading
Similar to DeepSpeed ZeRO-Infinity, FSDP supports offloading model parameters and
gradients to the CPU when they are not actively being used on the GPU.21 This can
further reduce peak GPU memory usage, especially for very large models, at the cost
of increased communication latency between CPU and GPU. There are some
limitations, such as with gradient accumulation when CPU offloading is enabled.23

Wrapping Policies (Auto-wrap, Size-based)


FSDP is applied to a model by wrapping its modules (e.g., individual layers or blocks of
layers) with the FSDP class. To simplify this process and optimize performance
through communication-computation overlap, FSDP supports various wrapping
policies 21:
●​ Auto-wrapping: Policies like TRANSFORMER_BASED_WRAP automatically
identify and wrap specific types of layers (e.g., Transformer blocks) within the
model. This is often the simplest approach and requires no code changes to the
model definition.21
●​ Size-based wrapping: This policy wraps any module whose parameters exceed a
certain specified size.
●​ Manual wrapping: Users can also manually wrap specific parts of their model.
The wrapping is typically applied in a nested fashion, meaning an outer FSDP
instance might wrap several inner FSDP instances (or individual layers). This
hierarchical sharding allows for fine-grained control over memory usage, as the
full weights for an FSDP unit are materialized only during its execution and
discarded afterward, enabling memory for subsequent units.11 The choice of
wrapping policy and the granularity of FSDP units are crucial for performance, as
they determine the balance between the peak memory required for a single unit
and the frequency of All-Gather/Reduce-Scatter communication operations.23
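
The following hedged sketch illustrates one such setup, wrapping a stand-in Transformer with FULL_SHARD, a transformer-based auto-wrap policy, mixed precision, and optional CPU offloading; it assumes a distributed process group has already been initialized, and the model and its dimensions are illustrative.

import functools
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy, CPUOffload, MixedPrecision
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

# A stand-in Transformer: each TransformerEncoderLayer becomes one FSDP unit.
layer = torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True)
model = torch.nn.TransformerEncoder(layer, num_layers=12).cuda()

wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={torch.nn.TransformerEncoderLayer},
)

fsdp_model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,    # ZeRO-3-like full sharding
    auto_wrap_policy=wrap_policy,                     # one FSDP unit per Transformer block
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16),
    cpu_offload=CPUOffload(offload_params=True),      # optional: idle shards live in CPU RAM
    device_id=torch.cuda.current_device(),
)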

Relationship and Comparison to DeepSpeed ZeRO


FSDP shares many conceptual similarities with DeepSpeed ZeRO, particularly when
FSDP's FULL_SHARD strategy is used, which closely mirrors ZeRO-3.21 Both systems
aim to enhance data parallelism by sharding parameters, gradients, and optimizer
states to reduce per-GPU memory load.11

The primary distinction often lies in their ecosystem and scope. FSDP is a component
of the core PyTorch library, offering potentially tighter integration and a more
streamlined experience for users already within the PyTorch ecosystem.11 DeepSpeed,
on the other hand, is a more comprehensive optimization library from Microsoft that
includes ZeRO but also offers a broader suite of tools such as specialized inference
engines, compression techniques, and other advanced parallelism strategies like
pipeline and tensor parallelism utilities.1 For users seeking cutting-edge features or
already accustomed to the DeepSpeed environment, it might be the preferred choice.
For those new to model-parallel training or migrating from standard PyTorch DDP,
FSDP can offer a more direct path.25 The decision between them can depend on
specific feature requirements, the desired level of integration with PyTorch, and the
maturity or stability of particular advanced features in each framework.

The FSDP wrapping policy is a key tuning parameter, analogous to how DeepSpeed
ZeRO-3 handles submodules. It defines the FSDP units, and each unit materializes its
full parameters via All-Gather. Smaller, more granular FSDP units will result in lower
peak memory per unit but will incur more frequent communication. Conversely, larger
FSDP units will require more memory when active but will communicate less
frequently. Optimizing this trade-off is essential for achieving good performance with
FSDP.

3.4. Sharding Specifics (Consolidated View)


The advanced memory optimization techniques in DeepSpeed ZeRO and PyTorch
FSDP revolve around the systematic sharding of the three main components of model
state during training: parameters, gradients, and optimizer states.
●​ Parameter Sharding:
○​ Mechanism: Instead of each data-parallel GPU storing a full copy of the
model's weights (parameters), the parameters are divided (sharded) across
these GPUs. Each GPU, therefore, holds only a unique slice or partition of the
total model parameters. When a specific module or FSDP unit needs to
perform its forward or backward computation, the full set of parameters for
that unit is dynamically reconstructed on the participating GPU(s) via an
All-Gather collective communication. Once the computation is done, these
temporarily gathered full parameters are discarded to free up memory.
○​ Frameworks: This is the hallmark of DeepSpeed ZeRO-3 16 and PyTorch FSDP
when using strategies like FULL_SHARD or HYBRID_SHARD (for intra-node
sharding).11
○​ Benefit: This dramatically reduces the static memory required on each GPU
for storing model weights, enabling the training of models that are orders of
magnitude larger than what could fit on a single GPU or even multiple GPUs
with standard data parallelism.
●​ Gradient Sharding:
○​ Mechanism: Gradients, which are computed during the backward pass and
are the same size as the parameters, are also partitioned. After local gradient
computations, instead of performing a standard All-Reduce operation that
would result in each GPU holding the full gradient tensor, a Reduce-Scatter
operation is used. This operation computes the sum of gradients across all
GPUs (like All-Reduce) but then distributes (scatters) only the relevant shard
of the total gradient to each GPU—specifically, the shard corresponding to
the parameters that GPU is responsible for updating.
○​ Frameworks: Implemented in DeepSpeed ZeRO-2 and ZeRO-3 17, and in
PyTorch FSDP with strategies like FULL_SHARD, SHARD_GRAD_OP, and
HYBRID_SHARD.11
○​ Benefit: This reduces the memory required for storing gradients during the
backward pass and before the optimizer step, contributing significantly to
overall memory savings.
●​ Optimizer State Sharding:
○​ Mechanism: Optimizer states, such as the momentum and variance terms for
Adam or AdamW optimizers, and potentially 32-bit master copies of
parameters for mixed-precision training, can consume a large amount of
memory (often 2x to 4x or more the size of the parameters themselves). These
states are also partitioned across the data-parallel GPUs. Each GPU stores
and updates only the optimizer states corresponding to its assigned shard of
the parameters.
○​ Frameworks: This is the foundational optimization in DeepSpeed ZeRO-1 and
is carried through in ZeRO-2 and ZeRO-3.17 PyTorch FSDP also shards
optimizer states in all its sharding strategies except NO_SHARD.11
○​ Benefit: This provides substantial memory relief, as optimizer states are a
major contributor to memory usage in traditional data parallelism.

By holistically sharding these three components, ZeRO and FSDP transform data
parallelism from a memory-intensive technique into a highly memory-efficient one.
This distribution of the entire model and training state across the collective memory of
available GPUs is the cornerstone of their ability to scale to extremely large neural
networks.
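
The collectives underlying this sharding can be illustrated directly. The sketch below (toy shapes, assuming an initialized process group) shows All-Gather reconstructing full parameters from per-rank shards, and Reduce-Scatter leaving each rank with only its slice of the summed gradients.

import torch
import torch.distributed as dist

world = dist.get_world_size()
shard_numel = 1024                                   # per-rank shard size (toy value)

# All-Gather: rebuild the full parameter tensor from per-rank shards before compute.
local_param_shard = torch.randn(shard_numel, device="cuda")
full_param = torch.empty(world * shard_numel, device="cuda")
dist.all_gather_into_tensor(full_param, local_param_shard)

# ... forward/backward with full_param would produce a full-size local gradient ...
local_full_grad = torch.randn(world * shard_numel, device="cuda")

# Reduce-Scatter: sum gradients across ranks, but keep only this rank's slice.
grad_shard = torch.empty(shard_numel, device="cuda")
dist.reduce_scatter_tensor(grad_shard, local_full_grad, op=dist.ReduceOp.SUM)
# grad_shard now holds the summed gradient slice that this rank's optimizer shard updates.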

4. Managing Activation Memory


4.1. The Challenge of Activation Memory in Large Models
Activations are the intermediate tensors generated during the forward pass of a
neural network, which are then stored in memory because they are required for
computing gradients during the backward pass via the chain rule.27 In large and deep
models, particularly those with substantial batch sizes or processing long sequences
(common in Transformers), the cumulative size of these stored activations can
become a dominant component of GPU memory usage.28 In fact, for certain
architectures and configurations, the memory consumed by activations can surpass
that required for the model weights and optimizer states combined.28

The memory footprint of activations typically scales with the batch size, sequence
length (especially for attention mechanisms), model depth (number of layers), and
hidden dimension size. As these dimensions grow, activation memory can quickly
become a bottleneck, limiting the model size or batch size that can be trained, even if
parameter memory is managed effectively through techniques like ZeRO or FSDP. This
is particularly acute for Transformer models due to their multi-layer structure and the
attention mechanism, whose memory requirements can scale quadratically with
sequence length in naive implementations, although techniques like FlashAttention
help mitigate this specific aspect. The peak activation memory is usually reached at
the beginning of the backward pass, just before gradients start being computed and
activations can be freed.27
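
A rough, illustrative estimate shows how quickly this grows; the per-layer multiplier below is a crude assumption rather than an exact accounting for any particular architecture or attention implementation.

# Back-of-envelope activation-memory estimate for a Transformer-like model.
batch, seq_len, hidden, layers = 8, 4096, 4096, 32
bytes_per_elem = 2                      # fp16/bf16 activations
tensors_per_layer = 10                  # assumed number of stored intermediates per layer

activation_bytes = batch * seq_len * hidden * bytes_per_elem * tensors_per_layer * layers
print(f"~{activation_bytes / 1e9:.1f} GB of activations")   # ~85.9 GB for these values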

4.2. Activation Sharding (Implicit in Full Parameter Sharding and Sequence/Context Parallelism)
The term "activation sharding" is not always used to describe a standalone technique
in the same explicit way as parameter, gradient, or optimizer state sharding. However,
certain parallelism strategies inherently lead to a form of activation sharding or
distribution.

When full parameter sharding is employed (e.g., DeepSpeed ZeRO-3 or PyTorch FSDP
FULL_SHARD), each GPU only materializes parameters for the specific part of the
model it is currently computing. Consequently, the activations generated are also local
to that GPU's computation. While each GPU still stores activations for its portion of
the work, the entire set of activations for the full model (if it were on one GPU) is not
present on any single device simultaneously. This is more of an indirect benefit of
parameter sharding rather than a direct sharding strategy applied to activations
themselves across the data-parallel group in the same way parameters are sharded
and gathered.

More direct forms of activation sharding occur with parallelism strategies that split the
input data along dimensions other than the batch dimension. For example:
●​ Sequence Parallelism: This technique, particularly relevant for Transformer
models, splits the input sequence length across multiple devices.13 Since
activations are computed based on these input hidden_states, splitting the
sequence effectively shards the activations along the sequence dimension. Each
GPU then processes and stores activations for only a segment of the full
sequence.
●​ Context Parallelism (CP): As implemented in NVIDIA's NeMo Framework, CP also
splits the sequence dimension across GPUs for all layers of the model, not just a
few select ones as in some sequence parallelism variants.28 This means each GPU
processes and stores activations corresponding to only a chunk of the sequence,
leading to a direct reduction in activation memory per GPU for long sequences.

These approaches (Sequence/Context Parallelism) are distinct from activation
checkpointing or offloading, as they aim to distribute the storage of activations rather
than recomputing them or moving them to slower memory.

4.3. Activation Checkpointing (Recomputation)


Concept: Trading Compute for Memory
Activation checkpointing, also known as activation recomputation or gradient
checkpointing, is a widely used technique to reduce the memory footprint of
activations.27 The core idea is to avoid storing all intermediate activations generated
during the forward pass. Instead, only a subset of activations (e.g., the inputs to
designated checkpointed segments or layers) are saved in GPU memory. During the
backward pass, when an activation that was not saved is needed for gradient
computation, it is recomputed on-the-fly by re-executing the relevant portion of the
forward pass using the saved inputs.28 This trade-off—sacrificing additional compute
time for reduced memory usage—can be very effective. The typical computational
overhead introduced by recomputation is often cited as around 30% 28, but this can
vary depending on the model and the extent of checkpointing.

Selective Activation Checkpointing


A more refined version of this technique is selective activation checkpointing (SAC).
Standard activation checkpointing recomputes all operations within a checkpointed
region. However, not all operations are equally expensive to recompute. For instance,
element-wise operations are cheap, while large matrix multiplications can be costly.
SAC allows for more granular control by enabling users or automated policies to
specify which intermediate activations within a checkpointed region should be saved
(typically those resulting from expensive operations) and which should be
recomputed.27 This provides a better balance on the speed-versus-memory trade-off
curve, aiming to achieve significant memory savings while minimizing the
recomputation of the most performance-critical operations. For example, one might
choose to checkpoint every n Transformer blocks instead of every single block, or
even make decisions at an operator level within a block.30 This model-dependent
tuning can yield substantial throughput improvements compared to naive full
checkpointing.

Implementation in PyTorch and Frameworks (e.g., NeMo)


PyTorch provides built-in support for activation checkpointing via
torch.utils.checkpoint.27 Libraries like FairScale have historically offered enhanced
checkpointing wrappers with additional functionalities.29 Frameworks designed for
large-model training, such as NVIDIA's NeMo, also incorporate activation
recomputation as a key feature for managing memory, especially for long-context
models.28 The choice of which parts of the model to checkpoint is crucial; applying it
strategically to memory-intensive sections can significantly reduce peak memory
usage.
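
A minimal sketch of this pattern, assuming a list of Transformer-style blocks and an illustrative checkpoint interval, is shown below using torch.utils.checkpoint.

import torch
from torch.utils.checkpoint import checkpoint

def forward_with_checkpointing(blocks, x, checkpoint_every_n=2):
    for i, block in enumerate(blocks):
        if i % checkpoint_every_n == 0:
            # Only the inputs to this block are saved; its internal activations
            # are recomputed on-the-fly during the backward pass.
            x = checkpoint(block, x, use_reentrant=False)
        else:
            x = block(x)              # activations stored as usual
    return x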

Activation checkpointing is a vital tool for balancing memory constraints and
computational speed. Selective checkpointing further refines this balance,
recognizing that the cost of recomputation is not uniform across all operations. The
optimal checkpointing strategy is highly dependent on the specific model architecture
and the available hardware resources. For instance, as observed with different model
sizes, aggressive checkpointing might be necessary for larger models, while smaller
models might achieve better throughput with less or no checkpointing if memory
permits.30

4.4. Activation Offloading to CPU/NVMe


Mechanism and Benefits
Activation offloading is another memory-saving technique where activations are
moved from GPU VRAM to more abundant but slower CPU RAM (or even NVMe
storage) during the forward pass.31 When these offloaded activations are required for
gradient computation during the backward pass, they are transferred back to the
GPU. This approach is particularly beneficial when dealing with very large batch sizes
or extremely long context lengths where activation memory becomes a severe
constraint, even with checkpointing, or when the compute overhead of checkpointing
is undesirable.31

Drawbacks and Performance Considerations


The primary drawback of activation offloading is the latency incurred by data
transfers between the GPU and CPU/NVMe.31 These transfers are significantly slower
than GPU VRAM access and can substantially reduce the overall training speed
(measured in tokens-per-second or samples-per-second). To mitigate this, some
implementations attempt to overlap the data transfers with ongoing GPU computation
using techniques like asynchronous CUDA streams.31 However, the effectiveness of
such overlap depends on the model architecture, the size and number of tensors
being offloaded, and the interconnect bandwidth.

Usage in Frameworks (e.g., NeMo, Torchtune)


Activation offloading is supported by various deep learning libraries and frameworks.
PyTorch itself allows for this through saved_tensors_hooks in its autograd system.31
Higher-level frameworks like NVIDIA NeMo and Torchtune provide more user-friendly
configurations to enable activation offloading.31 For example, NeMo allows specifying
the number of transformer layers for which activations (and potentially weights)
should be offloaded to the CPU.32
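
As a minimal illustration of the PyTorch-level mechanism, the following hedged sketch uses torch.autograd.graph.save_on_cpu (built on saved_tensors_hooks) to keep saved-for-backward activations in pinned CPU memory; the toy model and shapes are assumptions for illustration.

import torch

# Toy stand-in model; any module whose activations dominate memory would do.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 4096)
).cuda()
x = torch.randn(8, 4096, device="cuda")

with torch.autograd.graph.save_on_cpu(pin_memory=True):
    loss = model(x).sum()             # tensors saved for backward are moved to CPU RAM
loss.backward()                       # they are copied back to the GPU as backward needs them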

Activation offloading can be viewed as an extension of the memory hierarchy, treating
CPU RAM or NVMe as a larger, slower tier of memory for activations. It is often
considered when activation checkpointing alone is insufficient to meet memory
targets or when its recomputation overhead proves too detrimental to training speed
for a given memory budget. Interestingly, activation checkpointing and offloading can
be used in conjunction.31 In such a hybrid approach, some activations within a
checkpointed region might be recomputed, while others (perhaps those that are
particularly expensive to recompute or those that still don't fit in GPU memory even
after some recomputation within the segment) might be offloaded to the CPU. This
offers a more nuanced set of trade-offs, allowing for finer control over the balance
between GPU memory usage, recomputation overhead, and data transfer latency.

5. The Impact of Hardware Interconnects on Scaling Performance


5.1. Why Interconnects Matter: The Communication Bottleneck
All distributed training strategies, by their very nature, involve communication: data
parallelism requires gradient synchronization 8, model parallelism (including tensor
parallelism) necessitates the exchange of activations and gradients between model
shards 6, and pipeline parallelism involves passing activations between stages.15 The
efficiency of these communication steps is critically dependent on the underlying
hardware interconnects that link GPUs within a server and servers within a cluster. The
bandwidth (data transfer rate) and latency (delay) of these interconnects often
become the primary bottleneck limiting the scalability and overall performance of
distributed training jobs.12

As models grow larger and training clusters expand to more GPUs and nodes, the
volume of data that needs to be moved increases dramatically. If the interconnects
cannot sustain the required data transfer rates, GPUs will spend a significant amount
of time idle, waiting for data to arrive or for synchronization to complete. This
diminishes the benefits of adding more computational resources and can lead to poor
scaling efficiency.2 Therefore, the choice and configuration of interconnect
technologies are first-order concerns in designing effective large-scale distributed
training systems. Communication can, in some cases, consume a very large fraction of
the total runtime, for instance, 50-70% in tensor parallelism for certain large models if
interconnects are not optimal.7
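
A rough, hedged estimate makes the stakes concrete: in a ring All-Reduce, each GPU sends and receives roughly 2(N-1)/N times the gradient size per step, so interconnect bandwidth directly bounds synchronization time. The model size and bandwidth figures below are illustrative placeholders.

# Back-of-envelope per-step gradient All-Reduce time for data parallelism.
params = 7e9                          # 7B-parameter model (illustrative)
bytes_per_grad = 2                    # fp16/bf16 gradients
n_gpus = 64
grad_bytes = params * bytes_per_grad

per_gpu_traffic = 2 * (n_gpus - 1) / n_gpus * grad_bytes      # ~2x the gradient size
for name, bw in [("NVLink-class fabric (~300 GB/s)", 300e9),
                 ("100 Gb Ethernet (~12.5 GB/s)", 12.5e9)]:
    print(f"{name}: ~{per_gpu_traffic / bw:.2f} s per All-Reduce")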

5.2. On-Node Interconnects (GPU-to-GPU within a server)


Communication between GPUs residing within the same physical server relies on
specialized on-node interconnects.
●​ PCIe (Peripheral Component Interconnect Express):​
PCIe is the standard interface used to connect GPUs to the motherboard and, in
some configurations, to each other if direct GPU-to-GPU links are not available or
fully utilized.33 While PCIe has evolved through several generations (e.g., PCIe
3.0, 4.0, 5.0), offering increasing bandwidth (e.g., a PCIe 4.0 x16 slot provides
approximately 64 GB/s bidirectional bandwidth, and PCIe 5.0 x16 offers around
128 GB/s), its bandwidth and, importantly, its latency (typically in the range of
100-200 nanoseconds) are generally inferior to more specialized GPU
interconnects like NVLink.33 Communication over PCIe often involves the CPU as
an intermediary, which can add to the latency.35
●​ NVLink & NVSwitch:​
NVLink is a proprietary high-speed, low-latency interconnect technology
developed by NVIDIA specifically for direct GPU-to-GPU communication.35 It
offers significantly higher bandwidth and lower latency (typically around 10-20
ns) compared to PCIe.33 For example, a single NVLink 3.0 link provides 50 GB/s of
bidirectional bandwidth (25 GB/s per direction), and modern GPUs such as the A100
or H100 carry multiple NVLink connections, yielding aggregate bidirectional
bandwidths of 600 GB/s (A100) or 900 GB/s (H100) per GPU.33
NVSwitch extends the capabilities of NVLink by creating a fully connected,
non-blocking fabric that allows any GPU in a multi-GPU system (like an NVIDIA
DGX server) to communicate with any other GPU at full NVLink speed.33 This is
crucial for enabling efficient all-to-all communication patterns required by some
collective operations.​
The presence and speed of NVLink/NVSwitch are critical enablers for
communication-intensive parallelism strategies, most notably Tensor Parallelism,
within a server node.12 The performance disparity between PCIe and NVLink can
significantly influence the viability and efficiency of certain scaling approaches.
For instance, attempting to run highly synchronous tensor parallelism over PCIe
links would likely result in severe performance degradation due to insufficient
bandwidth and higher latency, pushing users towards strategies that minimize
such inter-GPU communication or to accept suboptimal scaling.
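The practical gap between these links is easy to observe directly. The snippet below is a minimal sketch (device indices and payload size are arbitrary assumptions) that checks whether peer-to-peer access is available between two GPUs and times a device-to-device copy; whether that copy actually travels over NVLink or PCIe depends on the system topology, which tools such as nvidia-smi topo -m can report.

```python
import time
import torch

assert torch.cuda.device_count() >= 2, "example assumes at least two GPUs"

# Peer-to-peer access lets GPU 0 read/write GPU 1's memory directly (over
# NVLink or PCIe); without it, copies are staged through host memory.
print("P2P 0 -> 1:", torch.cuda.can_device_access_peer(0, 1))

# Time a 1 GiB device-to-device copy as a rough bandwidth probe.
src = torch.empty(1024**3, dtype=torch.uint8, device="cuda:0")
dst = torch.empty(1024**3, dtype=torch.uint8, device="cuda:1")

dst.copy_(src)                      # warm-up
torch.cuda.synchronize(0)
torch.cuda.synchronize(1)
t0 = time.perf_counter()
for _ in range(10):
    dst.copy_(src, non_blocking=True)
torch.cuda.synchronize(0)
torch.cuda.synchronize(1)
print(f"~{10 * src.numel() / (time.perf_counter() - t0) / 1e9:.1f} GB/s GPU0 -> GPU1")
```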

5.3. Intra-GPU Memory: HBM (High Bandwidth Memory)


While not an interconnect between GPUs, High Bandwidth Memory (HBM) is a critical
component of individual GPU performance and thus indirectly impacts distributed
training. HBM is a type of 3D-stacked SDRAM that is integrated directly onto the GPU
package, often via a silicon interposer.36 This tight coupling and wide memory
interface provide exceptionally high memory bandwidth (e.g., HBM2 can offer over
256 GB/s per stack, and HBM3/HBM3E push this to 800 GB/s to over 1 TB/s per stack,
with GPUs often having multiple stacks) at lower power consumption compared to
traditional off-package GDDR memory.37

This massive on-chip memory bandwidth is essential for feeding the thousands of
parallel compute cores within a modern GPU. Without HBM, these cores would
frequently be starved for data, leading to underutilization and bottlenecking overall
computational throughput. Therefore, HBM is a foundational technology for the high
single-GPU performance that distributed training aims to leverage and scale. If
individual GPUs are slow due to memory bandwidth limitations, the entire distributed
system will suffer.
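A quick roofline-style calculation illustrates why this bandwidth matters. The peak figures below are rough, illustrative assumptions for an A100-80GB-class GPU rather than vendor-exact numbers; the point is the ratio, which indicates how many FLOPs a kernel must perform per byte moved from HBM before compute, rather than memory bandwidth, becomes the limiting factor.

```python
# Roofline-style back-of-the-envelope (all peak figures are illustrative assumptions).
peak_flops = 312e12       # approximate BF16 tensor-core peak, FLOP/s
hbm_bandwidth = 2.0e12    # approximate HBM2e bandwidth, bytes/s

ridge = peak_flops / hbm_bandwidth   # FLOPs needed per byte to saturate compute
print(f"ridge point: ~{ridge:.0f} FLOPs/byte")

# An elementwise add of two BF16 tensors performs 1 FLOP per 6 bytes moved
# (read a, read b, write out, 2 bytes each) -> deeply bandwidth-bound.
intensity = 1 / 6
print(f"elementwise add: ~{intensity:.2f} FLOPs/byte, "
      f"{ridge / intensity:.0f}x below the ridge point")
```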

5.4. Multi-Node Interconnects (Server-to-Server)


When distributed training scales beyond a single server, communication between
nodes relies on network fabrics.
●​ Ethernet (especially RoCE - RDMA over Converged Ethernet):​
Ethernet is a ubiquitous networking technology that has seen continuous
evolution in speed, with standards supporting 100 Gbps, 200 Gbps, 400 Gbps,
and even 800 Gbps and beyond.38 For high-performance computing (HPC) and
AI workloads, a key development is RDMA over Converged Ethernet (RoCE). RDMA
allows one computer to access the memory of another computer directly without
involving the operating system or CPU of either, significantly reducing latency and
CPU overhead.38 This makes Ethernet with RoCE more competitive for
demanding distributed training tasks. Emerging standards like Ultra Ethernet aim
to further optimize Ethernet for AI/HPC workloads by incorporating features like
advanced congestion control and packet spraying, striving to close the
performance gap with InfiniBand while maintaining Ethernet's cost-effectiveness
and broad ecosystem compatibility.38
●​ InfiniBand:​
InfiniBand was designed from the ground up as a high-performance interconnect
for HPC and data centers.38 It typically offers very low latency (sub-microsecond,
with some RDMA operations under 300ns) and high bandwidth, with current
generations like NDR (Next Data Rate) providing 400 Gbps per link and future
iterations aiming higher.38 InfiniBand natively supports RDMA and employs a
switched fabric architecture that can be configured in various topologies (e.g.,
fat-tree) to provide high bisection bandwidth and minimize contention.38
●​ Comparison (Latency, Bandwidth, RDMA, Cost, Complexity):​
Historically, InfiniBand has held an edge in raw latency and often in effective
bandwidth for HPC workloads due to its native RDMA support and leaner protocol
stack.38 Ethernet, while generally more cost-effective and widely deployed, has
traditionally incurred higher latency due to TCP/IP overhead, though RoCE
mitigates this significantly.39​
The cost and complexity can also differ; InfiniBand solutions have often been
more expensive, potentially accounting for a larger fraction of the total cluster
cost (e.g., 20% for InfiniBand vs. <10% for Ethernet, as noted in one source).39
However, for this increased cost, InfiniBand has often delivered corresponding
performance improvements.​
The choice between InfiniBand and high-speed Ethernet (with RoCE or Ultra
Ethernet technologies) for multi-node training is becoming increasingly nuanced.
While InfiniBand has been the traditional leader for ultra-low latency applications,
Ethernet's continuous advancements, its larger ecosystem, inherent cost
advantages, and improved RDMA support and congestion control mechanisms
are making it a highly viable and often preferred option for large-scale AI clusters.
The decision often involves a complex trade-off analysis considering peak
performance requirements, budget constraints, existing infrastructure, ease of
management, and the specific communication patterns of the AI workloads. For AI
data flows, which can be bursty and unpredictable, robust congestion
management (a focus of Ultra Ethernet) can be as critical as raw latency.39
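In practice, the fabric a job actually uses is often decided by the collectives library rather than by application code. The sketch below is a non-authoritative example assuming a PyTorch + NCCL setup launched with torchrun; it shows where this decision surfaces. NCCL selects InfiniBand, RoCE, or plain TCP sockets at initialization, and environment variables such as NCCL_DEBUG, NCCL_SOCKET_IFNAME, and NCCL_IB_DISABLE let you inspect or constrain that choice.

```python
import os
import torch
import torch.distributed as dist

# Illustrative launch (assumed): torchrun --nnodes=2 --nproc-per-node=8 this_script.py
# Typical NCCL knobs, usually exported by the job script rather than set in code:
#   NCCL_DEBUG=INFO           log which transport NCCL selected (IB, RoCE, or sockets)
#   NCCL_SOCKET_IFNAME=eth0   pin bootstrap/socket traffic to a specific NIC
#   NCCL_IB_DISABLE=1         ignore InfiniBand/RoCE entirely (handy for A/B comparisons)
os.environ.setdefault("NCCL_DEBUG", "INFO")

dist.init_process_group(backend="nccl")   # rank, world size, master addr come from torchrun
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

if dist.get_rank() == 0:
    print(f"{dist.get_world_size()} ranks initialized; see NCCL logs for the chosen transport")
dist.destroy_process_group()
```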

Table 5.1: Comparison of Interconnect Technologies

| Technology | Type | Typical Bandwidth (Example) | Typical Latency | Key Features | Primary Use Case in DL |
| PCIe (Gen 5 x16) | Intra-node (GPU-Host/GPU) | ~128 GB/s bidirectional 33 | 100-200 ns 35 | Standard host interface, ubiquitous | GPU-CPU communication; baseline GPU-GPU if no specialized link |
| NVLink (e.g., H100) | Intra-node (GPU-GPU) | 900 GB/s bidirectional per GPU 33 | 10-20 ns 33 | High-speed, low-latency direct GPU-GPU link; proprietary NVIDIA | Tensor Parallelism; fast All-Reduce within a node |
| NVSwitch | Intra-node (GPU-GPU fabric) | Full NVLink speed between all GPUs in a system 33 | ~10-20 ns (via NVLink) | All-to-all GPU connectivity at NVLink speeds | Scaling Tensor Parallelism and other communication-intensive tasks across many GPUs in a node |
| HBM (e.g., HBM3) | Intra-GPU memory | >800 GB/s per stack (multiple stacks per GPU) 37 | Very low (<10 ns) | 3D-stacked DRAM, very wide bus, on-package with the GPU | Feeding GPU compute units with data at extremely high speeds |
| Ethernet (e.g., 400GbE + RoCE) | Inter-node (Server-Server) | 400 Gbps (~50 GB/s) per port | Microseconds (RoCE) | Widely adopted; RDMA over Ethernet (RoCE); evolving (Ultra Ethernet) 38 | Data Parallelism and Pipeline Parallelism across nodes; general cluster networking |
| InfiniBand (e.g., NDR) | Inter-node (Server-Server) | 400 Gbps (~50 GB/s) per port | Sub-microsecond 39 | Designed for HPC; native RDMA; low latency; switched fabric 38 | Demanding Data/Pipeline Parallelism across nodes; latency-sensitive applications |

5.5. Choosing the Right Interconnect for Your Distributed Workload


The optimal interconnect configuration is not a one-size-fits-all solution but is deeply
intertwined with the chosen distributed training parallelism strategy. A mismatch
between the communication demands of the parallelism strategy and the capabilities
of the interconnects will inevitably lead to performance bottlenecks.
●​ Tensor Parallelism (TP), with its frequent intra-layer communication (All-Reduce,
All-Gather), places the highest demands on intra-node bandwidth and latency. It
is most effectively implemented using high-speed, direct GPU-to-GPU
interconnects like NVLink and NVSwitch.12 Attempting TP across nodes connected
by slower Ethernet, or even within a node solely over PCIe, would likely yield poor
performance due to communication stalls.
●​ Pipeline Parallelism (PP) involves passing activations between stages. If these
stages are on different nodes, this communication traverses the inter-node
network (Ethernet or InfiniBand).15 While the frequency of communication is lower
than in TP (once per micro-batch per stage), the volume of activations can be
large. PP is generally more tolerant of inter-node latencies than TP but still
benefits significantly from high inter-node bandwidth.
●​ Data Parallelism (DP) requires gradient aggregation (typically All-Reduce) after
each backward pass. The size of the gradient tensors scales with the model size.
For intra-node DP, NVLink/NVSwitch can accelerate this. For inter-node DP,
high-bandwidth, low-latency inter-node networks like InfiniBand or advanced
Ethernet are crucial, especially for large models (a simple all-reduce timing sketch
follows this list).
●​ Hybrid Strategies (e.g., 3D Parallelism), which combine TP, PP, and DP, require
robust interconnects at all levels. For example, a common setup might use NVLink
for TP within nodes, and InfiniBand/high-speed Ethernet for PP and DP across
nodes.2 Topology-aware scheduling, which maps communication-intensive parts
of the parallel strategy to faster links, becomes essential.
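The data-parallel case is the easiest to measure directly. The following is a minimal sketch (payload size and iteration count are arbitrary; it assumes a PyTorch + NCCL job launched with torchrun) that times an all-reduce over roughly 1 GiB of float32 "gradients" and converts the result to NCCL's bus-bandwidth convention, which can then be compared against the nominal interconnect figures in Table 5.1.

```python
import os
import time
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# ~1 GiB of float32 values, standing in for the gradients of a ~270M-parameter model.
grads = torch.randn(256 * 1024 * 1024, device="cuda")

dist.barrier()
torch.cuda.synchronize()
iters = 5
t0 = time.perf_counter()
for _ in range(iters):
    dist.all_reduce(grads)
torch.cuda.synchronize()
elapsed = (time.perf_counter() - t0) / iters

# NCCL bus-bandwidth convention for ring all-reduce: about 2 * (N - 1) / N bytes
# cross the wire per payload byte.
n = dist.get_world_size()
bus_gbps = 2 * (n - 1) / n * grads.numel() * 4 / elapsed / 1e9
if dist.get_rank() == 0:
    print(f"all-reduce of 1 GiB: {elapsed * 1e3:.1f} ms, ~{bus_gbps:.1f} GB/s bus bandwidth")
dist.destroy_process_group()
```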

In essence, designing a distributed training system necessitates a co-design
approach where the parallelism strategy is chosen or adapted based on the available
interconnect capabilities, and conversely, interconnect infrastructure investments are
guided by the communication demands of the target AI workloads and scaling
strategies.

6. Conclusion: Synthesizing Strategies for Effective Large-Scale Training
6.1. Recap of Key Techniques and Their Interdependencies
The journey to effectively train large-scale neural networks involves navigating a
complex landscape of parallelism strategies, memory optimization techniques, and
hardware considerations. We have explored the fundamental dimensions of
parallelism: Data Parallelism (DP) for scaling throughput by distributing data, Tensor
Parallelism (TP) for partitioning large individual layers, and Pipeline Parallelism (PP) for
distributing deep sequential models. Each of these addresses different bottlenecks
but also introduces its own communication and synchronization challenges.

To overcome the memory limitations inherent in these basic strategies, especially the
redundancy in data parallelism, advanced frameworks like DeepSpeed ZeRO and
PyTorch's Fully Sharded Data Parallel (FSDP) have emerged. These systems
systematically shard model parameters, gradients, and optimizer states across
distributed workers, dramatically reducing per-GPU memory requirements and
enabling the training of models with billions or even trillions of parameters.
Complementing these are techniques for managing activation memory, a significant
consumer of GPU resources. Activation checkpointing (recomputation) trades
additional compute for memory savings, while activation offloading moves less
frequently needed activations to CPU or NVMe storage, further expanding the
effective memory capacity.

The performance of all these distributed techniques is critically dependent on the
underlying hardware interconnects. Within a server, NVLink and NVSwitch provide
high-bandwidth, low-latency pathways for GPU-to-GPU communication, essential for
strategies like Tensor Parallelism. High Bandwidth Memory (HBM) on the GPU itself is
fundamental for feeding the compute units. Across servers, high-speed networks like
InfiniBand and advanced Ethernet (with RoCE or Ultra Ethernet capabilities) are vital
for efficient inter-node communication.

It is crucial to recognize that these elements—parallelism algorithms, memory
optimization software, activation management, and hardware interconnects—are not
isolated components. They form an intricate, interdependent system. Effective
large-scale training is, therefore, an exercise in co-design, where choices in one area
profoundly impact others. The algorithmic decisions regarding how to parallelize a
model must align with the capabilities of the software frameworks used to implement
that parallelism, and both must be matched to the strengths and limitations of the
hardware infrastructure.

6.2. Holistic Considerations for Designing Distributed Training Setups


There is no single "best" strategy for all scenarios. The optimal approach to
distributed training is highly contextual and requires a holistic consideration of several
factors:
●​ Model Architecture: The structure of the neural network plays a significant role.
Models with extremely wide layers (large hidden dimensions) may benefit most
from Tensor Parallelism. Very deep models are natural candidates for Pipeline
Parallelism. The sheer number of parameters will dictate the necessity of
parameter sharding techniques like ZeRO-3 or FSDP FULL_SHARD. The size and
lifespan of activations will influence the need for checkpointing or offloading.
●​ Hardware Availability: The specific GPUs (memory capacity, compute power),
intra-node interconnects (PCIe vs. NVLink/NVSwitch), inter-node network fabric
(Ethernet speed, InfiniBand), and the total number of available GPUs and nodes
heavily constrain the feasible parallelism strategies and the achievable scale.
●​ Performance Goals: The primary objective—whether it's minimizing time-to-train
(raw speed), maximizing the trainable model size, or optimizing for cost-efficiency
(e.g., using fewer or less powerful GPUs more effectively)—will guide the
trade-offs. For instance, aggressive offloading techniques save memory but can
reduce training speed.
●​ Ease of Use vs. Performance Tuning: Frameworks like DeepSpeed and PyTorch
FSDP significantly lower the barrier to entry for distributed training by automating
many complex aspects of sharding and communication. However, achieving peak
performance often requires a deeper understanding of the underlying principles
to fine-tune configurations, such as sharding granularity, micro-batch sizes, and
communication scheduling, to match the specific model and hardware.
●​ The Future: Emerging Trends: The field continues to evolve rapidly. We
anticipate further advancements in the co-design of algorithms, software, and
hardware. This includes more sophisticated automated parallelism tools,
potentially more specialized hardware accelerators for specific AI computations
or communication patterns, and even more refined techniques for memory
management and communication optimization. The drive for larger, more capable
models will continue to push the boundaries of distributed systems.

Ultimately, designing an effective distributed training setup involves navigating a
complex decision tree. One must weigh the memory demands of the model against
the available resources, the communication overhead of different parallelism schemes
against the interconnect speeds, and the computational costs of techniques like
activation recomputation against the memory savings they provide. While
sophisticated tools are increasingly abstracting away some of this complexity, expert
knowledge of these interacting components remains invaluable for pushing the
frontiers of AI model scale and efficiency.

Works cited

1.​ DeepSpeed is a deep learning optimization library that makes distributed training
and inference easy, efficient, and effective. - GitHub, accessed June 9, 2025,
https://github.com/deepspeedai/DeepSpeed
2.​ DeepSpeed: Extreme-scale model training for everyone - Microsoft ..., accessed
June 9, 2025,
https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-mod
el-training-for-everyone/
3.​ Accelerating AI with GPUs: A New Computing Model - NVIDIA Blog, accessed
June 9, 2025,
https://blogs.nvidia.com/blog/accelerating-ai-artificial-intelligence-gpus/
4.​ ZeRO-2 & DeepSpeed: Shattering barriers of deep learning speed & scale -
Microsoft, accessed June 9, 2025,
https://www.microsoft.com/en-us/research/blog/zero-2-deepspeed-shattering-b
arriers-of-deep-learning-speed-scale/
5.​ Techniques for training large neural networks | OpenAI, accessed June 9, 2025,
https://openai.com/index/techniques-for-training-large-neural-networks/
6.​ Can you explain the difference between tensor parallelism and other parallelism
techniques? - Massed Compute, accessed June 9, 2025,
https://massedcompute.com/faq-answers/?question=Can%20you%20explain%2
0the%20difference%20between%20tensor%20parallelism%20and%20other%20
parallelism%20techniques?
7.​ Demystifying Tensor Parallelism | Robot Chinwag, accessed June 9, 2025,
https://robotchinwag.com/posts/demystifying-tensor-parallelism/
8.​ Parallelism and Distributed Training for Maximizing AI Efficiency ..., accessed June
9, 2025,
https://www.exxactcorp.com/blog/deep-learning/parallelization-and-distributed-t
raining
9.​ What Is Data Parallelism? | Pure Storage, accessed June 9, 2025,
https://www.purestorage.com/knowledge/what-is-data-parallelism.html
10.​Data parallelism - Wikipedia, accessed June 9, 2025,
https://en.wikipedia.org/wiki/Data_parallelism
11.​ Getting Started with Fully Sharded Data Parallel(FSDP) — PyTorch ..., accessed
June 9, 2025, https://docs.pytorch.org/tutorials/intermediate/FSDP1_tutorial.html
12.​Part 4.1: Tensor Parallelism - the UvA Deep Learning Tutorials!, accessed June 9,
2025,
https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/scaling/JAX
/tensor_parallel_simple.html
13.​Tensor Parallelism and Sequence Parallelism: Detailed Analysis - Insu Jang,
accessed June 9, 2025,
https://insujang.github.io/2024-01-11/tensor-parallelism-and-sequence-parallelis
m-detailed-analysis/
14.​Part 3.1: Pipeline Parallelism — UvA DL Notebooks v1.2 documentation, accessed
June 9, 2025,
https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/scaling/JAX
/pipeline_parallel_simple.html
15.​Pipeline Parallelism — PyTorch 2.7 documentation, accessed June 9, 2025,
https://docs.pytorch.org/docs/stable/distributed.pipelining.html
16.​Zero Redundancy Optimizer - DeepSpeed, accessed June 9, 2025,
https://www.deepspeed.ai/tutorials/zero/
17.​ZeRO — DeepSpeed 0.17.1 documentation - DeepSpeed Docs, accessed June 9,
2025, https://deepspeed.readthedocs.io/en/latest/zero3.html
18.​DeepSpeed - Hugging Face, accessed June 9, 2025,
https://huggingface.co/docs/peft/main/accelerate/deepspeed
19.​DeepSpeed ZeRO 1, 2, 3 Differences Explained - BytePlus, accessed June 9, 2025,
https://www.byteplus.com/en/topic/407298
20.​DeepSpeed - Hugging Face, accessed June 9, 2025,
https://huggingface.co/docs/transformers/deepspeed
21.​FullyShardedDataParallel - Hugging Face, accessed June 9, 2025,
https://huggingface.co/docs/transformers/fsdp
22.​Part 2.2: (Fully-Sharded) Data Parallelism — UvA DL Notebooks v1 ..., accessed
June 9, 2025,
https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/scaling/JAX
/data_parallel_fsdp.html
23.​FullyShardedDataParallel — PyTorch 2.7 documentation, accessed June 9, 2025,
https://docs.pytorch.org/docs/stable/fsdp.html
24.​Tensor and Fully Sharded Data Parallelism - How Trillion Parameter Models Are
Trained : r/mlscaling - Reddit, accessed June 9, 2025,
https://www.reddit.com/r/mlscaling/comments/1i4n2wr/tensor_and_fully_sharded
_data_parallelism_how/
25.​lightning.ai, accessed June 9, 2025,
https://lightning.ai/docs/pytorch/stable//advanced/model_parallel.html#:~:text=Us
e%20FSDP%20if%20you%20are,and%20are%20migrating%20to%20Lightning.
26.​Train models with billions of parameters - Lightning AI, accessed June 9, 2025,
https://lightning.ai/docs/pytorch/stable/advanced/model_parallel.html
27.​Current and New Activation Checkpointing Techniques in PyTorch ..., accessed
June 9, 2025, https://pytorch.org/blog/activation-checkpointing-techniques/
28.​Scaling to Millions of Tokens with Efficient Long-Context LLM ..., accessed June 9,
2025,
https://developer.nvidia.com/blog/scaling-to-millions-of-tokens-with-efficient-lon
g-context-llm-training/
29.​Enhanced Activation Checkpointing | FairScale documentation - Read the Docs,
accessed June 9, 2025,
https://fairscale.readthedocs.io/en/latest/deep_dive/activation_checkpointing.htm
l
30.​Maximizing training throughput using PyTorch FSDP, accessed June 9, 2025,
https://pytorch.org/blog/maximizing-training/
31.​Memory Optimization Overview — torchtune 0.3 documentation, accessed June
9, 2025,
https://docs.pytorch.org/torchtune/0.3/tutorials/memory_optimizations.html
32.​CPU Offloading — NVIDIA NeMo Framework User Guide, accessed June 9, 2025,
https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/features/
optimizations/cpu_offloading.html
33.​Can you compare the performance of PCIe, NVLink, and NVSwitch in terms of
bandwidth and latency? - Massed Compute, accessed June 9, 2025,
https://massedcompute.com/faq-answers/?question=Can%20you%20compare%
20the%20performance%20of%20PCIe,%20NVLink,%20and%20NVSwitch%20in
%20terms%20of%20bandwidth%20and%20latency?
34.​Interconnect - Latest Articles and Reviews on AnandTech, accessed June 9, 2025,
https://www.anandtech.com/tag/interconnect
35.​What are the key differences between NVLink and PCIe? | AI FAQ, accessed June
9, 2025,
https://www.runpod.io/ai-faq/what-are-the-key-differences-between-nvlink-and
-pcie
36.​How HBM is Transforming GPU Performance - Reddit, accessed June 9, 2025,
https://www.reddit.com/r/gpu/comments/1fq7ic0/how_hbm_is_transforming_gpu
_performance/
37.​High Bandwidth Memory - Wikipedia, accessed June 9, 2025,
https://en.wikipedia.org/wiki/High_Bandwidth_Memory
38.​InfiniBand vs Ethernet: Which is Best for Your Data Center? - GEARit, accessed
June 9, 2025, https://www.gearit.com/blogs/news/infiniband-vs-ethernet
39.​Ultra Ethernet vs. InfiniBand: The Next Generation of Networking Solutions -
UfiSpace, accessed June 9, 2025,
https://www.ufispace.com/company/blog/ultra-ethernet-vs-infiniband
