[Figure 1: four sub-operators distributed along the spatial dimension (a) versus the temporal dimension (b). In (a) the parameter tensor is replicated among D0-D3 and the summations are implemented as all-reduce among D0-D3; in (b) the parameter tensor is stored only in D0 and the summations happen locally in D0.]

Figure 1. Illustration of distributing 4 sub-operators along the spatial dimension (a) and the temporal dimension (b). The horizontal axis represents 4 devices (D0-D3) while the vertical axis represents 4 temporal steps (t0-t3). Each rounded rectangle represents a sub-operator, with the upper and lower halves representing two tensors. The same color represents identical tensors, while different colors indicate distinct partial sums that require summation.

adjust interconnection topology to match the application and deliver more bandwidth [22].

Parallel training systems are required to adapt to the development of transformer models and hardware resources, so that the expensive computational power, interconnection bandwidth and memory resources are efficiently utilized. The current common practice for training transformer models is to resort to 3D parallelism [37, 45, 49], which combines data, model and pipeline parallelism. Within this 3D space, pipeline parallelism [18, 35] features inexpensive point-to-point communication at the cost of pipeline bubbles caused by periodic flushes during training. Data and model parallelism [9, 12, 13, 21, 53, 57], which can be collectively represented by tensor partitioning, are more communication-intensive and are favored on devices with relatively high interconnection bandwidth. The performance of tensor partitioning is therefore critical to system performance, and with the advancement of interconnection topology and bandwidth, tensor partitioning will see wider application scenarios and play an even more important role.

However, existing tensor partitioning fails to fully utilize hardware resources. The inadequacy is that the current tensor partition primitive only distributes the partitioned sub-operators along spatial dimensions. It is often the case that sub-operators produce partial-sum results that necessitate further aggregation across devices. As depicted in Fig. 1, when these sub-operators are distributed along spatial dimensions (i.e. to different devices), the summation of partial sums leads to all-reduce communication traffic among the mapped devices. On the contrary, distributing these sub-operators along the temporal dimension on a single device can avoid all-reduce, since the summation is accomplished locally by accumulating outputs through the passage of time. If tactically handled, temporal partitioning can both mitigate memory/communication overhead and fully utilize the computation resources of multiple devices when combined with spatial partitioning. Our research reveals that adding the temporal dimension to the tensor partition space unlocks numerous possibilities, including avoiding all-reduce, reducing memory footprint and overlapping communication with computation.

Guided by these insights, we propose PrimePar, an automated parallel training framework targeting large transformer models. PrimePar incorporates the extended tensor partition space, resulting in parallel solutions that achieve superior training throughput with lower peak memory occupancy. To the best of our knowledge, this is the first research that introduces the temporal dimension into tensor partitioning for parallel training. Our major contributions can be summarized as:

• We expand the tensor partition space for LLM training by introducing a new partition primitive, which leverages the previously ignored temporal dimension to distribute sub-operators in a more communication- and memory-efficient fashion.
• Addressing the lack of a scalable tensor partition optimization algorithm for transformer models, we design a dynamic programming-based algorithm capable of identifying the optimal tensor partition strategy from our extended space with tractable complexity.
• We evaluate the effectiveness of PrimePar on training workloads of popular transformer models with a cluster of 32 GPUs. Compared to the state-of-the-art distributed training systems Megatron-LM [37, 49] and Alpa [67], we achieve up to 1.68× speedup with only 69% peak memory occupancy. When scaling to 32 GPUs, a 1.30× geo-mean speedup is achieved. When applied to 3D parallelism, up to 1.46× speedup over Megatron-LM is achieved.

2 BACKGROUND AND MOTIVATION

2.1 Tensor Partitioning for Transformers

Tensor partitioning has been extensively explored as a key technique to parallelize neural networks [5, 8, 12, 24, 63], both to speed up the computation and to amortize the memory overhead. Recursive tensor partitioning has been proposed [15, 20, 21, 33, 51, 52, 58] to formalize a relatively general parallel space and enable automatic search for the best parallel plans.

Tensor partitioning has been applied to accelerate training transformer models. Megatron-LM [37, 49] manually designed efficient tensor partitions for transformer models, where row or column dimensions of linear operators and the head dimension of attention matrix multiplications were partitioned.
Alpa [67] proposed a framework supporting automatic searching and deployment of optimal tensor partition solutions, achieving transformer model training performance that is comparable to Megatron-LM.

[Figure 2: (a) proportion of all-reduce latency during training: 41.5% for OPT 6.7B, 38.4% for Llama2 70B, 64.6% for BLOOM 176B; (b) Megatron-LM versus ideal memory occupancy on 4, 8, 16, 32 GPUs, with gaps of 1.10×, 1.29× and 1.66×; (c) existing partitions (batch, row, column, sequence) incur both collective communication and replication, while the proposed partition incurs neither.]

Figure 2. The proportions of all-reduce latency during training OPT 6.7B, Llama2 70B and Bloom 176B on 16 V100 GPUs using Megatron-LM (model parallelism within a node and data parallelism across nodes) are shown in (a). Training the Llama2 70B model with the same batch size on 4, 8, 16, 32 GPUs, the disparity in peak memory occupation per GPU between Megatron-LM and the ideal scenario is shown in (b). The advantage of the proposed partition primitive compared with existing ones is shown in (c).

2.2 Motivation

Unfortunately, the tensor partition space explored so far is incomplete, missing opportunities to better utilize hardware resources. Take partitioning the row dimension of a linear operator weight as an example, which is one of the strategies proposed in Megatron-LM. The row dimension of weight 𝑊 is partitioned into 𝑛 slices, generating 𝑛 tensor blocks 𝑊_1, · · · , 𝑊_𝑛. The forward computation is therefore divided into 𝑛 sub-operators 𝐼_1𝑊_1, · · · , 𝐼_𝑛𝑊_𝑛, where 𝐼_1, · · · , 𝐼_𝑛 are input tensor blocks partitioned along the last dimension in accordance with the partitioning of 𝑊. These 𝑛 sub-operators are distributed to 𝑛 different devices, each computing a partial result, so an all-reduce among these 𝑛 devices is necessary to compute the final result 𝐼_1𝑊_1 + · · · + 𝐼_𝑛𝑊_𝑛. Also, after the all-reduce, each device holds a copy of the output tensor, of which only one copy is actually needed; the rest is a waste of memory.

The all-reduce and memory waste are not coincidences. When partitioning a dimension of an operator that is mathematically summed over, distributing the partitioned sub-operators to different devices will inevitably induce all-reduce. When a tensor has no dimension partitioned, it is replicated in the devices' memory. As shown in Fig. 2(a), all-reduce communication takes up a significant proportion of the training latency of Megatron-LM. Fig. 2(b) depicts the disparity in peak memory occupation between Megatron-LM and the ideal scenario, where the ideal scenario assumes no tensor replication among devices. It can be seen that the memory waste caused by tensor replication becomes progressively more severe as the parallelism size increases.

The inefficiencies mentioned above have become increasingly unacceptable considering the demanding requirements placed on computation, communication and memory by large transformer model training workloads. Although these inefficiencies seem unavoidable, they are in fact addressed if we introduce a new dimension into tensor partitioning: the temporal dimension. Inspired by Cannon's parallel matrix multiplication [3], we consider distributing partitioned sub-operators along both the spatial dimensions and the temporal dimension (i.e. on the same device but at different times). Specifically, as illustrated in Fig. 1, if the results of a group of sub-operators need to be aggregated (or generally reduced), distributing them along the temporal dimension eliminates the necessity for all-reduce communication. Moreover, distributing sub-operators that share a common tensor along the temporal dimension avoids physically replicating their common tensor in device memory.

Motivated by these considerations, we propose a novel spatial-temporal tensor partition primitive for parallel training. This is challenging in that the proposed partition primitive should meet the training requirements, such as tensor alignment between different training phases, while exhibiting better parallel performance. As summarized in Fig. 2(c), our proposed partition primitive avoids incurring the overhead associated with both collective communication and tensor replication, differing from the conventional partitioning methodologies employed in transformer models.

3 PRIMEPAR PARTITION SPACE

3.1 Basic Formulation

PrimePar partitions over 2^n homogeneous devices, with each device indexed by a Device ID D = (d_1, · · · , d_n), where d_i = 0, 1. After tensor partitioning, an operator is split into sub-operators, with each sub-operator holding a part of the partitioned tensor. In PrimePar, two sub-operators can be allocated to different devices D, or to a single device at different temporal steps t, hence both D and t are necessary to identify a sub-operator.
We introduce the Dimension Slice Index (DSI), which is a function of D and t and records which slice of each dimension the sub-operator (D, t) holds. Specifically, consider the linear operator in transformer models:

Forward:  O_[B,M,K] = I_[B,M,N] · W_[N,K]
Backward: dI_[B,M,N] = dO_[B,M,K] · W^T_[K,N]
Gradient: dW_[N,K] = I^T_[B,N,M] · dO_[B,M,K]        (1)

The linear operator has four dimensions B, M, N, K (referring to the batch, sequence, and input/output hidden dimensions). Each dimension, say dimension N, maintains three DSIs I^F_N, I^B_N, I^G_N for the Forward, Backward and Gradient phases respectively. If I^F_N(D, t) = 2, then at step t of the Forward phase, device D computes a sub-operator whose I, W tensors' dimension N is the 2nd slice among the slices of the original dimension N generated by tensor partitioning. Any tensor partition plan in the space of PrimePar can be uniquely represented by specifying DSIs.

In the rest of this section, we introduce PrimePar's basic tensor partitions and their DSI formulations. Our tensor partition space is represented as the space of sequences of these basic partitions.

3.2 Existing Tensor Partitions

The current paradigm of tensor partition can be summarized as partition by dimension. This involves dividing a single dimension of the operator into two slices, resulting in the partitioning of tensors that include this dimension into two parts, each allocated to a separate sub-operator. Tensors not containing this dimension are replicated in the sub-operators. Recursive partitioning by dimension forms a large tensor partition space which covers the data parallelism and model parallelism that have been applied to train transformer models. For example, in Megatron-LM linear operators are parallelized by recursively partitioning the row or column dimension (dimensions N, K in Eq. 1), while matrix multiplications in the attention block are partitioned by the head dimension. PrimePar establishes a comprehensive tensor partition space encompassing conventional partition by dimension, with adaptation to transformer operators.

Algorithm 1 PrimePar Linear Operator DSIs
1:  Input: Sequence of partitions P
2:  Output: DSIs I^F_X, I^B_X, I^G_X, where X refers to B, M, N, K
3:  Initialize: i = 1, I^F_X = I^B_X = I^G_X = 0
4:  for p in P do
5:    if p is partitioning by dimension X then
6:      I^F_X ← 2I^F_X + d_i,  I^B_X ← 2I^B_X + d_i,  I^G_X ← 2I^G_X + d_i
7:      i ← i + 1
8:    else if p is P_{2^k×2^k} then
9:      r = 2^{k−1} d_i + 2^{k−2} d_{i+2} + · · · + 2^0 d_{i+2k−2}
10:     c = 2^{k−1} d_{i+1} + 2^{k−2} d_{i+3} + · · · + 2^0 d_{i+2k−1}
11:     temporal index t varies from 0 to 2^k − 1
12:     I^F_M ← 2^k I^F_M + (r mod 2^k)
13:     I^B_M ← 2^k I^B_M + (r mod 2^k)
14:     I^G_M ← 2^k I^G_M + ((r + t) mod 2^k)
15:     I^F_N ← 2^k I^F_N + ((r + c + t) mod 2^k)
16:     I^B_N ← 2^k I^B_N + ((r + c − 1) mod 2^k)
17:     I^G_N ← 2^k I^G_N + ((r + c − 1 + δ_{t,2^k−1}) mod 2^k)
18:     I^F_K ← 2^k I^F_K + (c mod 2^k)
19:     I^B_K ← 2^k I^B_K + ((c + t) mod 2^k)
20:     I^G_K ← 2^k I^G_K + ((c − 1 + δ_{t,2^k−1}) mod 2^k)
21:     i ← i + 2k
22:   end if
23: end for
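To make the DSI bookkeeping of Algorithm 1 concrete, the following is a minimal Python sketch of the same logic. The function name `linear_dsis` and the partition descriptors are our own illustrative choices, not PrimePar's actual API; the sketch folds a partition sequence P into the per-dimension DSIs of one device at one temporal step.

```python
# Minimal sketch of Algorithm 1 (hypothetical helper names, not PrimePar's API).
# A partition is either ("dim", X) for partition-by-dimension of X in {B, M, N, K},
# or ("P", k) for the spatial-temporal primitive P_{2^k x 2^k}.

def linear_dsis(partitions, device_bits, t):
    """Return {dim: (I_F, I_B, I_G)} for the device encoded by device_bits at step t."""
    dsi = {X: [0, 0, 0] for X in "BMNK"}   # I^F, I^B, I^G per dimension
    i = 0                                   # index into the device-id bits d_1, d_2, ...
    for p in partitions:
        if p[0] == "dim":                   # partition by dimension X (lines 5-7)
            X = p[1]
            d = device_bits[i]
            for phase in range(3):
                dsi[X][phase] = 2 * dsi[X][phase] + d
            i += 1
        else:                               # P_{2^k x 2^k} (lines 8-21)
            k = p[1]
            n = 1 << k
            # Row/column indices of the device inside the 2^k x 2^k logical square.
            r = sum(device_bits[i + 2 * j] << (k - 1 - j) for j in range(k))
            c = sum(device_bits[i + 2 * j + 1] << (k - 1 - j) for j in range(k))
            delta = 1 if t == n - 1 else 0
            dsi["M"] = [n * dsi["M"][0] + r % n,
                        n * dsi["M"][1] + r % n,
                        n * dsi["M"][2] + (r + t) % n]
            dsi["N"] = [n * dsi["N"][0] + (r + c + t) % n,
                        n * dsi["N"][1] + (r + c - 1) % n,
                        n * dsi["N"][2] + (r + c - 1 + delta) % n]
            dsi["K"] = [n * dsi["K"][0] + c % n,
                        n * dsi["K"][1] + (c + t) % n,
                        n * dsi["K"][2] + (c - 1 + delta) % n]
            i += 2 * k
    return {X: tuple(v) for X, v in dsi.items()}

# Example: 4 devices, partition M then N (the Fig. 3 example); D = (d1, d2).
for bits in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(bits, linear_dsis([("dim", "M"), ("dim", "N")], bits, t=0))
```

Running the example reproduces Eqs. 2 and 3 below: each device's M-slice equals d_1 and its N-slice equals d_2 in all three phases.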
[Figure 3: partitioned blocks of I, W, O and their device assignments for the 4-device example, with the per-device sub-operators of the Forward and Gradient phases shown on the right.]

Figure 3. Parallelizing a linear operator to 4 devices by partitioning dimensions M and N. The colored blocks represent partitioned tensors, and the numbers (decimal device ids) within them indicate which devices hold the tensor. Blocks with multiple numbers indicate tensor replication among devices. Dashed boxes on the right show the sub-operators mapped to devices, where ⊕ represents all-reduce.

Linear Operator. PrimePar supports partitioning all four dimensions B, M, N, K of the linear operator (Eq. 1). We exemplify parallelizing a linear operator to 4 devices (D = (d_1, d_2)) by partitioning dimensions M and N sequentially. Initially all DSIs are 0, indicating that no dimension has been partitioned. The first partition, by dimension M, can be represented as

I^F_M ← 2I^F_M + d_1,  I^B_M ← 2I^B_M + d_1,  I^G_M ← 2I^G_M + d_1        (2)

where dimension M is partitioned into 2 slices, and devices with d_1 = 0 hold the 0-th slice while the others hold the 1-st slice. The second partition

I^F_N ← 2I^F_N + d_2,  I^B_N ← 2I^B_N + d_2,  I^G_N ← 2I^G_N + d_2        (3)

partitions dimension N into 2 slices, and devices with d_2 = 0 and d_2 = 1 hold the 0-th and 1-st slices respectively. The partitioned tensor distribution and the way these slices form sub-operators are depicted in Fig. 3.

Consider devices whose device ids differ only in d_1 (devices D = (0,0), (1,0) or D = (0,1), (1,1) in Fig. 3): they hold different slices of dimension M while holding the same slice of every other dimension.
Table 1. Sender device coordinates for receiver device (r, c) for ring communications. Table rows distinguish temporal intervals that have different communication directions. Each coordinate entry specifies a ring communication that overlaps with computation step t; a blank entry means no communication is required.

Forward
Temporal step  | I              | W
t < 2^k − 1    | (r, c + 1)     | (r + 1, c)

Backward
Temporal step  | dO             | W
t < 2^k − 1    | (r, c + 1)     | (r − 1, c + 1)
t = 2^k − 1    |                | (r, c + 1)

Gradient
Temporal step  | I              | dO             | dW
t < 2^k − 2    | (r + 1, c − 1) | (r + 1, c)     |
t = 2^k − 2    | (r + 1, c)     | (r + 1, c + 1) |
t = 2^k − 1    |                |                | (r, c + 1)

Thus during the Gradient phase, they compute the same dW (containing dimensions N, K), but each of them obtains only a partial sum because it does not hold the complete dimension M. This induces all-reduce communication among them, as shown in Fig. 3 (Gradient phase). Also, tensors not containing dimension M (W, dW) are replicated among them (blue and red blocks with black borders in Fig. 3).

Other Operators in Transformer. PrimePar supports partitioning all dimensions of the matrix multiplications in attention, except for the head embedding dimension. The head embedding dimension typically takes values of 64 or 128; partitioning it would significantly reduce the operational intensity and lower the throughput. For the softmax operator, we partition all of its dimensions except for the dimension along which softmax is computed (usually the last dimension). For the normalization operator, we support partitioning all of its dimensions, with a potential all-reduce of expectations and of the gradients of the parameters γ, β. For the remaining element-wise operators, we support partitioning all of their dimensions.

While partitioning by dimension dominates existing tensor partitioning, it suffers from the significant shortcomings of requiring collective communication and tensor replication. As transformer models continue to scale up, these shortcomings are becoming increasingly destructive to parallel training.

3.3 Novel Tensor Partition Primitive

We introduce a novel tensor partition primitive that effectively addresses the shortcomings of conventional partition by dimension. Incorporating this novel partition, we can identify parallel strategies that achieve higher throughput and lower peak memory occupancy simultaneously.

Formulation of Partition. We denote the proposed partition as P_{2^k×2^k} (k ≥ 1), which is parallelized across a group of 2^{2k} devices. P_{2^k×2^k} logically views these devices as a 2^k × 2^k square. Each device D in the group has relative row and column indices r(D), c(D) within the logical square.

Unlike partitioning by dimension, P_{2^k×2^k} assigns each device 2^k sub-operators, which are executed sequentially in temporal steps, with 0 ≤ t < 2^k indexing these steps. We design P_{2^k×2^k} to satisfy the following features:

1. Collective communication free: collective communication, which is expensive and hard to overlap with computation, is not required throughout the whole training process.
2. Memory efficient: no replication of tensors among the memory of devices.
3. Support training: the Forward, Backward and Gradient phases can be executed periodically without additional operations between them.

Features 1 and 2 endow P_{2^k×2^k} with superior performance and memory efficiency compared to partitioning by dimension. Feature 3 requires two kinds of tensor distribution alignment. For a tensor stashed in memory in an earlier phase and used in a later phase, its distribution at the last step of the earlier phase must align with that at the first step of the later phase. Also, the weight distribution at the first step of the Forward phase must align with that at the last step of the Gradient phase. If Feature 3 is not satisfied, frequent redistribution of weights, gradients and other intermediate tensors is required between training phases.

We now specify the details of P_{2^k×2^k}. For device (r, c) at temporal step t, its DSIs are:

Forward:   I_M = r mod 2^k,   I_N = (r + c + t) mod 2^k,   I_K = c mod 2^k        (4)

Backward:  I_M = r mod 2^k,   I_N = (r + c − 1) mod 2^k,   I_K = (c + t) mod 2^k        (5)

Gradient:  I_M = (r + t) mod 2^k,   I_N = (r + c − 1 + δ_{t,2^k−1}) mod 2^k,   I_K = (c − 1 + δ_{t,2^k−1}) mod 2^k        (6)

where δ_{t,2^k−1} takes the value 1 if t = 2^k − 1 and 0 otherwise. It can be verified that Features 1, 2 and 3 are satisfied.

Feature 1: During the Forward phase, the summed-over dimension of the matrix multiplication is dimension N. In Eq. 4, for an arbitrary device (r, c), I_N takes all 2^k values as t varies from 0 to 2^k − 1 while I_M, I_K are fixed to r, c. Thus after 2^k steps of computation, device (r, c) has computed the (r, c)-th block of the output tensor, with no device yielding a partial sum, because the partial sums over all slices of dimension N are accumulated locally in device (r, c) during these 2^k steps (see Fig. 4). The same conclusion can be verified for the Backward and Gradient phases similarly.
[Figure 4: per-step sub-operator orchestration of the 2 × 2 devices (upper half) and the compute/communication timeline (lower half) across Forward steps 0-1, Backward steps 0-1 and Gradient steps 0-1; ring receives overlap with the matrix multiplications of each step.]

Figure 4. The process of training with partition P_{2×2}. Details of the orchestration among the 2 × 2 devices are shown in the upper half, while the corresponding execution timeline is shown in the lower half. The 0/1 numbers to the upper-right of a tensor block, or within a communication ring, indicate which double buffer the tensor or the communication source resides in.

Feature 2: Consider two arbitrary devices (r, c) and (r′, c′) in the Gradient phase (Eq. 6), and suppose their corresponding I_N and I_K are the same at temporal step t:

(r + c − 1 + δ_{t,2^k−1}) ≡ (r′ + c′ − 1 + δ_{t,2^k−1})  (mod 2^k)
(c − 1 + δ_{t,2^k−1}) ≡ (c′ − 1 + δ_{t,2^k−1})  (mod 2^k)

The latter equation yields c = c′, which together with the former yields r = r′. Thus at any step t during the Gradient phase, no two devices have the same I_N and I_K simultaneously, which means W, dW are never replicated among devices. Similar reasoning applies to all tensors in all phases.

Feature 3: Let t = 2^k − 1 in Eq. 4 and t = 0 in Eq. 6; we can verify that I_M, I_N at the last step of Forward match those at the first step of Gradient. This means the distribution of tensor I does not change in the transition from Forward to Gradient. Thus each device can stash its I at the end of Forward and directly use the stashed I for computation when entering the Gradient phase (see the blue blocks in Forward step 1 and Gradient step 0 in Fig. 4). The other alignments can be verified similarly.
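The three features can also be checked mechanically. The sketch below is our own transcription of Eqs. 4-6 into a small brute-force test (it is not PrimePar code); for k = 2 it verifies that Forward sweeps every slice of N locally (Feature 1), that no two devices ever hold the same dW block (Feature 2), and that the distribution of I is aligned between the last Forward step and the first Gradient step (one of the Feature 3 alignments).

```python
# Brute-force check of Features 1-3 for P_{2^k x 2^k}, k = 2; dsi() transcribes Eqs. 4-6.

def dsi(phase, r, c, t, n):
    d = 1 if t == n - 1 else 0
    if phase == "F":
        return {"M": r % n, "N": (r + c + t) % n, "K": c % n}
    if phase == "B":
        return {"M": r % n, "N": (r + c - 1) % n, "K": (c + t) % n}
    return {"M": (r + t) % n, "N": (r + c - 1 + d) % n, "K": (c - 1 + d) % n}  # Gradient

n = 4
devices = [(r, c) for r in range(n) for c in range(n)]

# Feature 1: in Forward, each device sweeps every slice of the summed-over dimension N.
assert all({dsi("F", r, c, t, n)["N"] for t in range(n)} == set(range(n))
           for r, c in devices)

# Feature 2: in Gradient, no two devices ever hold the same (I_N, I_K), i.e. the same dW block.
for t in range(n):
    blocks = [(dsi("G", r, c, t, n)["N"], dsi("G", r, c, t, n)["K"]) for r, c in devices]
    assert len(set(blocks)) == len(devices)

# Feature 3: I's distribution (dims M, N) at the last Forward step matches the first Gradient step.
for r, c in devices:
    f, g = dsi("F", r, c, n - 1, n), dsi("G", r, c, 0, n)
    assert (f["M"], f["N"]) == (g["M"], g["N"])
print("Features 1-3 hold for k = 2")
```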
𝑑𝑊 with its last step computation result to yield the final
can verify that 𝐼𝑀 , 𝐼 𝑁 at the last step of Forward match with
𝑑𝑊 result. This 𝑑𝑊 redistribution is necessary for weight
that at the first step of Gradient. This means the distribution
alignment between start of Forward and end of Gradient.
of tensor 𝐼 does not change when transition from Forward
Here we explicitly derive the communication pattern of
to Gradient. Thus each device can stash its 𝐼 at the end of
𝑊 in Backward phase. In Eq.5, 𝐼 𝑁 , 𝐼𝐾 are the DSIs of 𝑊 . At
Forward and directly use the stashed 𝐼 for computation when
step 𝑡 + 1, device (𝑟, 𝑐) holds this part of 𝑊 :
entering Gradient phase (see the blue blocks in Forward step
1 and Gradient step 0 in Fig.4). Other alignments can be 𝐼 𝑁 = (𝑟 + 𝑐 − 1) mod 2𝑘
verified similarly. 𝐼𝐾 = (𝑐 + (𝑡 + 1)) mod 2𝑘
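As an illustration of the double-buffering scheme, the Forward steps could be overlapped with ring point-to-point transfers using torch.distributed as sketched below. This is a simplified sketch under our own assumptions, not PrimePar's CUDA/MPI implementation: it assumes the process group is already initialized, the buffers live on the local GPU, and `send_to`/`recv_from` hold the ring-neighbor ranks implied by Table 1.

```python
import torch
import torch.distributed as dist

def forward_steps(I_buf, W_buf, O, num_steps, send_to, recv_from):
    """I_buf, W_buf: two-entry double buffers; O: local output block, accumulated in place."""
    for t in range(num_steps):
        cur, nxt = t % 2, (t + 1) % 2
        works = []
        if t < num_steps - 1:
            # Post the ring transfers for step t + 1 first, so they overlap with the
            # matmul of step t (the transfers are not data-dependent on its result).
            ops = [dist.P2POp(dist.isend, I_buf[cur], send_to["I"]),
                   dist.P2POp(dist.irecv, I_buf[nxt], recv_from["I"]),
                   dist.P2POp(dist.isend, W_buf[cur], send_to["W"]),
                   dist.P2POp(dist.irecv, W_buf[nxt], recv_from["W"])]
            works = dist.batch_isend_irecv(ops)
        O += torch.matmul(I_buf[cur], W_buf[cur])   # local accumulation replaces all-reduce
        for w in works:
            w.wait()
```

The in-place accumulation into O is exactly the local summation over the slices of N that eliminates all-reduce; the Backward and Gradient phases follow the same pattern with dO, W and I, dO respectively.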
Here we explicitly derive the communication pattern of W in the Backward phase. In Eq. 5, I_N, I_K are the DSIs of W. At step t + 1, device (r, c) holds the part of W with

I_N = (r + c − 1) mod 2^k,   I_K = (c + (t + 1)) mod 2^k

To identify which device was holding the part of W specified by the above I_N, I_K at step t, rewrite the equations as

I_N = ((r − 1) + (c + 1) − 1) mod 2^k,   I_K = ((c + 1) + t) mod 2^k

which indicates that device (r − 1, c + 1) was holding this part of W. Note that r − 1 wraps to 2^k − 1 if r = 0 and c + 1 wraps to 0 if c = 2^k − 1. Therefore, we can conclude that for t < 2^k − 1 in the Backward phase, each device receives W from its right-top neighbor, establishing a ring point-to-point communication pattern.
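The same derivation can be automated for every communicated tensor. The sketch below is our own helper, not part of PrimePar; it recovers the sender entries of Table 1 by brute force from the DSI formulas of Eqs. 4-6 for transitions within a phase (t + 1 < 2^k).

```python
# Recover Table 1's ring senders by brute force from Eqs. 4-6, for a 2^k x 2^k square.

def dsi(phase, r, c, t, n):
    """(I_M, I_N, I_K) for device (r, c) at step t; n = 2^k."""
    d = 1 if t == n - 1 else 0
    if phase == "F":
        return (r % n, (r + c + t) % n, c % n)
    if phase == "B":
        return (r % n, (r + c - 1) % n, (c + t) % n)
    return ((r + t) % n, (r + c - 1 + d) % n, (c - 1 + d) % n)   # Gradient

# Dimensions carried by each communicated tensor (from Eq. 1).
TENSOR_DIMS = {"I": ("M", "N"), "W": ("N", "K"), "dO": ("M", "K")}
IDX = {"M": 0, "N": 1, "K": 2}

def sender(phase, tensor, r, c, t, n):
    """Device that holds, at step t, the slice of `tensor` that (r, c) needs at step t + 1."""
    need = tuple(dsi(phase, r, c, t + 1, n)[IDX[d]] for d in TENSOR_DIMS[tensor])
    for rs in range(n):
        for cs in range(n):
            have = tuple(dsi(phase, rs, cs, t, n)[IDX[d]] for d in TENSOR_DIMS[tensor])
            if have == need:
                return (rs, cs)

n, r, c, t = 4, 1, 2, 0
print(sender("F", "I", r, c, t, n))   # -> (r, (c + 1) % n), matching Table 1
print(sender("F", "W", r, c, t, n))   # -> ((r + 1) % n, c)
print(sender("B", "W", r, c, t, n))   # -> ((r - 1) % n, (c + 1) % n), the right-top neighbor
```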
[Figure 6: computation graph of a transformer layer with nodes 0-12: Norm, Q, K, V, Softmax, Proj, Norm, MLP In, NonLin, MLP Out.]

The intra-operator cost of an operator n under partition sequence P is modeled as

intraC(n, P) = Σ_t max(compute(n, P, t), ring(n, P, t)) + allreduce(n, P) + α · memory(n, P)        (7)

where α is the adjustment coefficient between the latency and memory metrics.
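A toy rendering of Eq. 7 follows; the function name and the example numbers are ours, and in the real system the per-step compute/ring latencies, all-reduce latency and memory footprint would come from profiling and analytical models.

```python
# Illustrative sketch of the intra-operator cost of Eq. 7.

def intra_cost(compute_lat, ring_lat, allreduce_lat, memory_bytes, alpha):
    """compute_lat[t], ring_lat[t]: per-step latencies. Eq. 7 takes the max per step
    because ring point-to-point traffic overlaps with that step's computation."""
    overlapped = sum(max(c, r) for c, r in zip(compute_lat, ring_lat))
    return overlapped + allreduce_lat + alpha * memory_bytes

# Example: 4 temporal steps; ring traffic is fully hidden whenever it is shorter
# than the compute of the same step.
print(intra_cost([5.0, 5.0, 5.0, 5.0], [3.0, 3.0, 3.0, 6.0],
                 allreduce_lat=0.0, memory_bytes=2.0e9, alpha=1e-9))
```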
By evaluating the DSIs for all dimensions of the tensor, the intersection lengths in each dimension can be computed. The product of these intersection lengths is the size of the part of n_1's output tensor computed by device D that also serves as part of device D's input tensor for computing n_2. Consequently, the total redistribution communication traffic of all devices when running from n_1 to n_2 during forward computation is

Σ_D ( V − Π_X |S^1_X(D) ∩ S^2_X(D)| )        (9)

where V is the partitioned size of n_2's input tensor on each device and |·| denotes the length of a dimension slice. The redistribution traffic from n_2 to n_1 during backward computation can be computed similarly. We model the inter-operator communication latency between n_1 and n_2 as a linear function of the sum of the forward and backward redistribution communication traffic. The linear coefficients are obtained by profiling real system latency. Following this method, we establish the inter-operator cost function, denoted interC(n_1, n_2, P_1, P_2).
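A small sketch of Eq. 9 follows, under the assumption that each partition state exposes, per device, a (start, stop) slice for every dimension; the helper names are ours.

```python
# Illustrative computation of the redistribution traffic of Eq. 9.

def overlap(a, b):
    """Length of the intersection of two 1-D slices given as (start, stop)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def redistribution_traffic(devices, slices_out, slices_in, per_device_size):
    """slices_out[D][X]: slice of n1's output held by D; slices_in[D][X]: slice of n2's
    input needed by D; per_device_size: V in Eq. 9."""
    traffic = 0
    for D in devices:
        local = 1
        for X in slices_in[D]:
            local *= overlap(slices_out[D][X], slices_in[D][X])
        traffic += per_device_size - local   # elements that must be fetched from peers
    return traffic

# Example: two devices, perfectly aligned partitions -> zero redistribution traffic.
slices = {0: {"M": (0, 4), "N": (0, 8)}, 1: {"M": (4, 8), "N": (0, 8)}}
print(redistribution_traffic([0, 1], slices, slices, per_device_size=32))
```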
5 OPTIMIZATION ALGORITHM

PrimePar establishes a comprehensive tensor partition space which contains better solutions that release more system performance. However, it is not trivial to find the optimal partition strategy, given that our partition space is extended by the temporal dimension.

To optimize the tensor partition strategy for a model, it is necessary to find the optimal strategy for each operator in the model's computation graph such that they collectively achieve the minimal cost. This forms a search space that expands exponentially as the number of operators increases, with the base being the size of the operator partition space. A previous optimization algorithm applicable to transformer models formalized the problem as an integer linear program (ILP) [67]. However, because ILP is NP-hard, that optimization algorithm scales poorly. In this section, we propose an optimization algorithm that efficiently searches for the optimal tensor partition strategy for transformer models, minimizing the sum of intra-operator (node) and inter-operator (edge) costs according to the cost model established in the previous section.
However, due to the non-linear model structure of transformer models, it is challenging to formalize tractable Bellman equations. We therefore propose segmented dynamic programming, which divides the model into segments and carries out dynamic programming within each segment.

Consider two nodes n_i, n_j in the computation graph (i < j). We define the sub-model Model_{i,j} as consisting of the nodes n_i, n_{i+1}, ..., n_j (listed in increasing topological order of the computation graph) and the edges whose source and destination nodes both lie within n_i, n_{i+1}, ..., n_j. Define the optimal cost of sub-model Model_{i,j} when n_i, n_j are under partition states p_i, p_j as C_{i,j}(p_i, p_j), which is the optimal sub-structure in our dynamic programming formulation.

Assumption 1. Of all the edges that connect to n_j and belong to Model_{i,j+1}, the only one not contained in Model_{i,j} is e_{j,j+1}.

Assumption 2. Among all nodes in Model_{i,j}, only n_i and n_j can connect with n_{j+1}.

Under these two assumptions we can derive the Bellman equation relating the optimal costs of Model_{i,j} and Model_{i,j+1}:

C_{i,j+1}(p_i, p_{j+1}) = min_{p_j} { C_{i,j}(p_i, p_j) + n_{j+1}(p_{j+1}) + e_{j,j+1}(p_j, p_{j+1}) }        (11)

or

C_{i,j+1}(p_i, p_{j+1}) = min_{p_j} { C_{i,j}(p_i, p_j) + n_{j+1}(p_{j+1}) + e_{j,j+1}(p_j, p_{j+1}) + e_{i,j+1}(p_i, p_{j+1}) }        (12)

where n_{j+1}(p_{j+1}) represents the intra-operator cost of node n_{j+1} under partition state p_{j+1}, and e_{j,j+1}(p_j, p_{j+1}) represents the inter-operator cost between n_j and n_{j+1} when they are under p_j and p_{j+1} respectively. Eq. 12 covers the case where both n_i and n_j connect with n_{j+1}. Ideally, the dynamic programming for the whole model would be accomplished with only these two Bellman equations.

However, it is not possible to straightforwardly iterate from the first operator of the transformer to the last with only Eqs. 11 and 12. As illustrated in Fig. 6, when iterating from Model_{0,4} to Model_{0,5}, n_2 in Model_{0,4} also connects with n_5, so Assumption 2 is violated. A similar violation happens when iterating from Model_{0,11} to Model_{0,12}.

With segmented dynamic programming, we avoid these violations. Since n_0, n_2, n_7 have extended edges whose destination is not the subsequent node, not treating them as the starting node of an iteration inevitably violates Assumption 2 when iterating to the destination node of their extended edge. We therefore apply dynamic programming within the segments Model_{0,2}, Model_{2,7} and Model_{7,12} separately, where Assumptions 1 and 2 are never violated. In this way, the optimal sub-structures C_{0,2}, C_{2,7}, C_{7,12} can be computed with Eqs. 11 and 12.

With C_{0,2}, C_{2,7}, C_{7,12} computed, optimal sub-structures for larger sub-models can be obtained by merging. Fig. 6 shows that e_{0,7} is contained in neither Model_{0,2} nor Model_{2,7}, so its cost needs to be added when merging C_{0,2}, C_{2,7}. Also, Model_{0,2} and Model_{2,7} have a common node n_2 whose cost needs to be subtracted:

C_{0,7}(p_0, p_7) = min_{p_2} { C_{0,2}(p_0, p_2) + C_{2,7}(p_2, p_7) + e_{0,7}(p_0, p_7) − n_2(p_2) }        (13)

Furthermore, the optimal sub-structure of a whole transformer layer can be obtained by merging C_{0,7}, C_{7,12}:

C_{0,12}(p_0, p_12) = min_{p_7} { C_{0,7}(p_0, p_7) + C_{7,12}(p_7, p_12) − n_7(p_7) }        (14)

Due to the stacking of identical transformer layers in transformer models, it is sufficient to compute C_{0,12} once. In Fig. 6, n_0 and n_12 are the last operators of the previous layer and the current layer respectively, meaning that the optimal sub-structures of neighboring layers have one node in common and can be merged as in Eq. 14. With log(#layers) steps of recursive merging, we can compute optimal sub-structures for 2, 4, 8 layers and so on.
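The Bellman update of Eqs. 11-12 and the merge of Eqs. 13-14 reduce to a few nested minimizations over partition states. The sketch below is an illustrative Python rendering under our own data layout (dictionaries keyed by partition-state pairs), not PrimePar's optimizer implementation.

```python
# Segmented dynamic programming sketch. node_cost[j][p] is n_j(p); edge_cost[(i, j)][(p_i, p_j)]
# is e_{i,j}(p_i, p_j); P is the operator partition space; C is keyed by (p_i, p_j).

import math

def extend(C, j, node_cost, edge_cost, P, extra_edge=None):
    """Bellman step from C = C_{i,j} to C_{i,j+1} (Eq. 11; Eq. 12 when e_{i,j+1} exists)."""
    Cn = {}
    for p_i in P:
        for p_next in P:
            best = math.inf
            for p_j in P:
                cost = (C[(p_i, p_j)] + node_cost[j + 1][p_next]
                        + edge_cost[(j, j + 1)][(p_j, p_next)])
                if extra_edge is not None:          # the e_{i,j+1} term of Eq. 12
                    cost += extra_edge[(p_i, p_next)]
                best = min(best, cost)
            Cn[(p_i, p_next)] = best
    return Cn

def merge(C_left, C_right, shared_node_cost, P, bridge_edge=None):
    """Merge two optimal sub-structures that share one boundary node (Eqs. 13, 14)."""
    Cm = {}
    for p_a in P:
        for p_b in P:
            best = math.inf
            for p_s in P:
                cost = C_left[(p_a, p_s)] + C_right[(p_s, p_b)] - shared_node_cost[p_s]
                if bridge_edge is not None:         # e_{0,7} in Eq. 13
                    cost += bridge_edge[(p_a, p_b)]
                best = min(best, cost)
            Cm[(p_a, p_b)] = best
    return Cm
```

Each call is O(|P|^3) over the partition-state space, which matches the complexity discussed in Section 5.3.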
5.2 Optimality Proof

5.2.1 Correctness of the Bellman equation. We prove the correctness of Eq. 12 by induction, where the induction hypothesis is that C_{i,j}(p_i, p_j) is the minimal cost for Model_{i,j}. The proof for Eq. 11 is similar and omitted. Based on Assumptions 1 and 2, it can be inferred that Model_{i,j+1} differs from Model_{i,j} by the addition of a single node n_{j+1} and two edges e_{j,j+1}, e_{i,j+1}. For simplicity, we abbreviate n for n_{j+1}, e for e_{j,j+1} and e′ for e_{i,j+1} in the proof. Suppose there exists some C*_{i,j+1}(p_i, p_{j+1}) for Model_{i,j+1} that is smaller than C_{i,j+1}(p_i, p_{j+1}), and that the partition state of n_j in C*_{i,j+1}(p_i, p_{j+1}) is p*_j. This is equivalent to saying

C*_{i,j}(p_i, p*_j) + n(p_{j+1}) + e(p*_j, p_{j+1}) + e′(p_i, p_{j+1}) < min_{p_j} { C_{i,j}(p_i, p_j) + n(p_{j+1}) + e(p_j, p_{j+1}) + e′(p_i, p_{j+1}) }

Since the R.H.S. takes the minimum over all possible p_j, it is no greater than the value obtained by fixing p_j to p*_j, thus

C*_{i,j}(p_i, p*_j) + n(p_{j+1}) + e(p*_j, p_{j+1}) + e′(p_i, p_{j+1}) < C_{i,j}(p_i, p*_j) + n(p_{j+1}) + e(p*_j, p_{j+1}) + e′(p_i, p_{j+1})

which yields C*_{i,j}(p_i, p*_j) < C_{i,j}(p_i, p*_j), contradicting the induction hypothesis.

5.2.2 Correctness of Merging. We only prove the correctness of Eq. 13, because the proof for Eq. 14 is similar. Suppose there exists some C*_{0,7} that is better than C_{0,7}, and that the partition state of n_2 in C*_{0,7} is p*_2. This is equivalent to saying

C*_{0,2}(p_0, p*_2) + C*_{2,7}(p*_2, p_7) + e_{0,7}(p_0, p_7) − n_2(p*_2) < min_{p_2} { C_{0,2}(p_0, p_2) + C_{2,7}(p_2, p_7) + e_{0,7}(p_0, p_7) − n_2(p_2) }

Similar to Section 5.2.1, fixing p_2 to p*_2 on the R.H.S. gives

C*_{0,2}(p_0, p*_2) + C*_{2,7}(p*_2, p_7) + e_{0,7}(p_0, p_7) − n_2(p*_2) < C_{0,2}(p_0, p*_2) + C_{2,7}(p*_2, p_7) + e_{0,7}(p_0, p_7) − n_2(p*_2)
Thus C*_{0,2}(p_0, p*_2) + C*_{2,7}(p*_2, p_7) < C_{0,2}(p_0, p*_2) + C_{2,7}(p*_2, p_7), contradicting the optimality of C_{0,2} and C_{2,7}, which was proved in Section 5.2.1.

5.3 Generality and Complexity

This optimization algorithm is applicable to most transformer models, with only the segmentation differing from model to model. The complexity is O(P^3), where P is the size of the operator partition space. When optimizing for a system with 32 devices, a modern CPU can solve it in seconds. Moreover, the main computation comes from Eqs. 11-14, which are parallelizable and can be implemented as CUDA kernels to enable fast optimization for potentially larger parallel systems.

[Figure 7: normalized training throughput (y-axis from 0.8 to 1.6) of OPT 6.7B/175B, Llama 7B/70B and Bloom 7B1/176B on 4, 8, 16 and 32 GPUs.]

Figure 7. Normalized training throughputs of Megatron, Alpa and PrimePar. (A downward pillar indicates that the model structure is not supported by the current version of the framework.)

6 EVALUATION

Implementation. We implement PrimePar with PyTorch [41]. Given a searched optimal tensor partition strategy, or any other strategy, it automatically deploys models to multiple GPUs. We use the PyTorch distributed communication package with the NCCL [38] backend to implement collective and inter-operator communications. The proposed novel tensor partition primitive is implemented in C++ using CUDA and CUDA-aware MPI APIs [39, 40], and is invoked from PyTorch code as a C++ extension.

Environment and models. We evaluate PrimePar on a cluster of 8 nodes, where each node consists of 4 NVIDIA V100-SXM2 32GB GPUs and an Intel Xeon Gold 5218 32-core CPU. The GPUs within each node are connected via 300 GB/s NVLink, while nodes are connected via 100 GB/s InfiniBand. We evaluate the training workloads of 6 popular transformer-based language models: OPT 6.7B and 175B, Llama2 7B and 70B, and BLOOM 7B1 and 176B [55, 60, 66].

Baseline and metrics. We choose Megatron-LM [37, 49] as the baseline. We also compare with the previous state-of-the-art automated parallel training framework Alpa [67]. Since PrimePar parallelism rigorously preserves the mathematical semantics of the original training, we focus on evaluating training throughput and memory occupancy.

6.1 Performance Analysis

To assess the efficacy of PrimePar tensor partitioning, we evaluate the training throughput of transformer models scaling to 4, 8, 16 and 32 GPUs under different tensor partition strategies. Here, we refrain from utilizing pipeline parallelism to control variables. When evaluating Megatron-LM on n GPUs, we enumerate all possible data parallelism (partitioned batch dimension) sizes d and employ Megatron-LM's model parallelism (partitioned head/row/column dimensions) with size n/d. We select the configuration that exhibits the best performance of Megatron-LM to compare with PrimePar. For Alpa, we follow a similar process, where the partition strategy for non-batch dimensions is determined by the model parallelism strategy searched by Alpa on our machine. Note that PrimePar incorporates partitioning of all dimensions, including the batch dimension, into its search space, so the enumeration of data parallelism sizes is unnecessary for PrimePar.

Fig. 7 illustrates the training throughput of the 6 models under 4 parallelism scales. In all test cases, PrimePar achieves better throughput than Megatron-LM and Alpa. Note that Megatron-LM and Alpa demonstrate close performance, as they are both state-of-the-art within the conventional tensor partition space. PrimePar achieves 1.16 - 1.20× the throughput of Megatron-LM on models with parameter scales around 7B. For large models exceeding 100B, 1.11 - 1.68× throughput can be achieved. When scaling to 32 GPUs, the geo-mean speedup across benchmarks is 1.30×. It can be observed that the speedup increases as the number of GPUs grows, and large models gain significantly when scaling to 16 and 32 GPUs.

6.2 Peak Memory Occupation Analysis

We evaluate the memory-saving effect of PrimePar tensor partitioning. Under the same tensor partition configurations that achieve the throughputs illustrated in Fig. 7, we profile the peak memory occupation during the entire training. Due to the Single Program Multiple Data (SPMD) nature of Megatron-LM, Alpa and PrimePar, the memory occupation is consistent across different GPUs, so it is sufficient to profile the memory of one GPU for evaluation.
[Figure 8: normalized peak memory occupancy (y-axis from 0.6 to 1.0) of OPT 6.7B/175B, Llama 7B/70B and Bloom 7B1/176B on 4, 8, 16 and 32 GPUs.]

Figure 8. Normalized peak memory occupancy during training of Megatron, Alpa and PrimePar.

As shown in Fig. 8, PrimePar presents lower peak memory occupation in all test cases. PrimePar consumes around 90% of the memory footprint of Megatron-LM when training models with a scale around 7B. It is worth highlighting that PrimePar demonstrates notable memory saving for large models. Specifically, when training BLOOM 176B with a parallelism size of 16 or 32, PrimePar consumes only 68% of the memory footprint of Megatron-LM. This memory saving originates from reduced tensor replication when applying the novel tensor partition.

6.3 Ablation Study

We explain the source of PrimePar's effectiveness by comparing latency breakdowns and partition strategies in detail with Megatron-LM. As shown in the left part of Fig. 9, the major speedup originates from the reduction in collective communication latency, where PrimePar consumes 19.9% - 62.2% of the collective communication latency of Megatron-LM. Also, the ring point-to-point communication originating from the novel partition introduced in PrimePar has short latency and can be fully overlapped with computation. Note that Megatron-LM and PrimePar share roughly the same computation latency, which means PrimePar does not trade computation efficiency for communication efficiency. The right part of Fig. 9 shows the partition strategies and corresponding kernel execution timelines for the OPT 175B MLP block when parallelizing to 8 GPUs. We only demonstrate the timeline of one device; due to the SPMD nature, this reflects the system latency.

We now elucidate the correspondence between partition strategies and collective communication kernels. We number the 8 GPUs from 0 to 7 (D = (d_1, d_2, d_3)), where GPUs 0 to 3 constitute one node and GPUs 4 to 7 constitute another node. For example, kernel ① of Megatron-LM arises from the innermost two partitions of fc2.P, where dimension N is partitioned twice and distributed with group indicator (d_2, d_3). This leads to an all-reduce of the output tensor O within each of the groups (0, 1, 2, 3) and (4, 5, 6, 7), i.e. intra-node all-reduce. Since fc2.P contains a partition along dimension B, the size of the tensor O for all-reduce is half of the original size.

Kernel ① of PrimePar arises from the outermost partition of fc2.P, where dimension N is partitioned once and distributed with group indicator (d_1). This induces an all-reduce of the output tensor O within each of the groups (0, 4), (1, 5), (2, 6), (3, 7) via inter-node communication. Since fc2.P contains a P_{2×2}, the size of the tensor O for all-reduce is 1/4 of the original size. As for P_{2×2}, it induces ring point-to-point communications which happen in groups with group indicator (d_2, d_3); thus ring communications happen within the groups (0, 1, 2, 3) and (4, 5, 6, 7). It can be seen that these ring communications have small latency and can be fully overlapped with computation kernels. In this example, PrimePar demonstrates minimal communication overhead within each of the groups (0, 1, 2, 3) and (4, 5, 6, 7), attributed to the utilization of P_{2×2}. When scaling to 8 GPUs, PrimePar entails only mild collective communication involving 1/4 the size of the input and output tensors of the MLP block, because P_{2×2} partitions both the input and output tensors.

The primary source of PrimePar's speedup is the introduction of the novel partition and its appropriate position in the partition sequence searched by our optimization algorithm. PrimePar strategies successfully trade expensive collective communication for point-to-point communication that is relatively cheap and convenient to overlap with computation.

6.4 Impact on 3D Parallelism

Since 3D parallelism is the prevailing training technique for large transformer models, it is important to know the improvement that PrimePar brings to 3D parallelism. We use the notation (p, d, m) to represent the parallelism sizes of pipeline, data and model parallelism respectively. We fit in pipeline parallelism in the same way as introduced in Megatron-LM. For example, when (p, d, m) = (4, 1, 8), the 32 devices are first separated into 4 groups, with each group of 8 devices computing a pipeline stage; 8-way model parallelism is further implemented within each group.
[Figure 9 (left): normalized latency breakdown of MLP blocks (collective communication, computation, ring point-to-point communication, redistribution communication) for OPT 175B and Llama 70B with batch sizes 8 and 16 on 8 and 16 GPUs; the labeled collective communication latency reductions are 62.2%, 59.1%, 51.2%, 34.6%, 23.0%, 36.5%, 19.9% and 22.2%. (Right): partition sequences and kernel timelines for the OPT 175B MLP block (batch size 8). Megatron-LM: fc1.P = {B, K, K}, relu.P = {B, D, D}, fc2.P = {B, N, N}, with collective kernels ① fc2.P: intra-node reduce of size(O)/2, ② fc2.P: inter-node reduce of size(W)/4, ③ fc1.P: intra-node reduce of size(I)/2, ④ fc1.P: inter-node reduce of size(W)/4. PrimePar: fc1.P = {K, P2×2}, relu.P = {D, S, D}, fc2.P = {N, P2×2}, with collective kernels ① {N, P2×2}: inter-node reduce of size(O)/4 and ② {K, P2×2}: inter-node reduce of size(I)/4.]

Figure 9. Left: normalized latency breakdown of MLP blocks with batch sizes 8 and 16, scaling to 8 and 16 GPUs. Each pair of pillars represents Megatron-LM and PrimePar respectively, with the collective communication latency reductions labeled. Right: detailed partition sequences P of the operators in the OPT 175B MLP block for Megatron-LM and PrimePar, and the corresponding kernel execution timelines. Capital letters in P represent the dimensions being partitioned; the notation for fc1 and fc2 is the same as in Eq. 1, and B, S, D represent the batch, sequence and hidden dimensions of the relu operator. We use F, B, G to label the forward, backward and gradient computation phases. The correspondence between collective communication kernels and partitions is listed.

We evaluate training each model with 32 GPUs using 3D parallelism under all possible (p, d, m) configurations with p > 1. Under each (p, d, m) configuration, we evaluate the training throughput of Megatron-LM and PrimePar by applying their respective model parallel strategies with size m, while the pipeline and data parallelism configurations are kept consistent. In order to control d, we disable partitioning of the batch dimension in PrimePar and search for the optimal strategy over non-batch dimensions to generate model parallelism of size m.

Fig. 10 demonstrates that 3D parallelism incorporating PrimePar consistently achieves superior throughput across different models and (p, d, m) configurations. For models around 7B, both Megatron-LM and PrimePar exhibit their highest throughput under the configuration (p = 2, d = 4, m = 4), where PrimePar slightly outperforms Megatron-LM. In the case of OPT 175B, Llama2 70B and Bloom 176B, PrimePar performs significantly better than Megatron-LM, achieving 1.46×, 1.27× and 1.40× the highest throughput that Megatron-LM achieves. Unlike the models around 7B, large models exceeding 100B achieve peak performance under the configuration (p = 2, d = 1, m = 16), where model parallelism is preferred over data parallelism. This is attributed to the expensive all-reduce communication induced by data parallelism when the weight scale grows large.

Table 2. Optimization time in milliseconds for the OPT, Llama2 and Bloom model structures for parallelism sizes 4, 8, 16 and 32.

Models  | 4    | 8    | 16    | 32
OPT     | 85.3 | 86.5 | 170.9 | 5,357.3
Llama2  | 86.5 | 88.8 | 185.9 | 6,070.3
Bloom   | 85.4 | 80.2 | 165.8 | 4,153.0

6.5 Optimization Time

We report the time consumption of the proposed segmented dynamic programming algorithm. Our optimization algorithm runs in a single thread of an Intel Xeon Gold 5218 (2.3 GHz) CPU. As Table 2 shows, searching for the optimal solution for the OPT, Llama2 and Bloom models only consumes seconds, even when the parallelism size scales to 32.

7 DISCUSSION

PrimePar's design supports modeling and optimizing for general computational environments with various interconnection topologies. Since PrimePar's novel partition primitive only induces ring communication, PrimePar is expected to scale more efficiently on interconnection topologies with tori. For example, TPU v4 [22] chips are connected with twistable tori. This interconnection is expected to suit PrimePar's novel partition, because the tori can be configured to be horizontal, vertical or parallel to the diagonal to cater to PrimePar's ring communications, such that the bandwidth is fully utilized.
[Figure 10: three panels of throughput normalized to Megatron-LM at (2,1,16) or (2,2,8), comparing Megatron-LM and PrimePar over the (p, d, m) configurations (2,1,16), (2,2,8), (2,4,4), (4,1,8), (4,2,4) and (8,1,4).]

Figure 10. Normalized 3D parallelism throughput of Megatron-LM and PrimePar with different (p, d, m) configurations, parallelizing to 32 GPUs.

PrimePar is expected to achieve linear scaling on hardware platforms with balanced bandwidth and computation resources, as long as the ring communication latency per step is no longer than the computation latency. Moreover, it is possible to use PrimePar to guide the co-design of computation and interconnection architecture.

8 RELATED WORKS

Tensor partition and automatic searching. Tensor partitioning has been explored to enable parallel deployment of models, accelerating them and reducing the memory overhead on individual devices [15, 20, 21, 24, 63]. Optimizing tensor partitioning for minimum overhead [1, 17, 19, 20, 23, 25, 27, 31, 32, 34, 42, 47, 48] has been a highly anticipated topic. Recursive tensor partitioning has emerged as a critical formulation for automating tensor partitioning [33, 51, 52, 58], covering a vast tensor partition space while remaining optimization friendly. Nevertheless, as far as we know, no research has introduced the temporal dimension into tensor partitioning for parallel training, and the potential benefits and challenges brought by the temporal dimension have not been studied.

Transformer parallel training systems. 3D parallelism has been extensively optimized to support large transformer model training [4, 26, 28, 37, 45, 46, 49, 50, 61, 67]. Data parallelism [9–12, 16, 29, 59, 65] and model parallelism [13, 21, 53, 57] were applied to distribute input data or model parameters to multiple devices, each offering its own trade-offs. Pipeline parallelism [18, 30, 35, 36] was proposed to stripe the layers of a model over multiple GPUs, where pipeline execution is scheduled over micro-batches. ZeRO [45, 46] tackled the tensor replication problem and proposed splitting the optimizer states, at the cost of the collective communications reduce-scatter and all-gather. These parallel training systems overlooked the temporal dimension in tensor partitioning (data and model parallelism) and can be further enhanced through the application of our proposed technique.

9 CONCLUSION

PrimePar introduces the previously ignored temporal dimension into tensor partitioning, which provides abundant opportunities for better utilizing hardware resources and improving the performance of parallel training. We design a novel tensor partition primitive and apply it to improve LLM training performance, a successful practice in unlocking the potential of spatial-temporal tensor partitioning. We provide a formalization that encodes the spatial-temporal distribution of sub-operators and exposes the essence of computation and communication, facilitating the design of efficient spatial-temporal tensor partitioning primitives. Further exploration of the spatial-temporal tensor partition space is worthwhile.

ACKNOWLEDGMENTS

This work was supported in part by the National Natural Science Foundation of China (Grant No. 62025404, 62104229, 62104230, 61874124, 62222411), in part by the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDB44030300, XDB44020300), and by Zhejiang Lab under Grant 2021PC0AC01. The corresponding authors are Haobo Xu and Ying Wang.
[27] Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.
[28] Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling giant models with conditional computation and automatic sharding, 2020.
[29] Mu Li, David G Andersen, Jun Woo Park, Alexander J Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J Shekita, and Bor-Yiing Su. Scaling distributed machine learning with the parameter server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 583–598, 2014.
[30] Zhuohan Li, Siyuan Zhuang, Shiyuan Guo, Danyang Zhuo, Hao Zhang, Dawn Song, and Ion Stoica. Terapipe: Token-level pipeline parallelism for training large-scale language models. In International Conference on Machine Learning, pages 6543–6552. PMLR, 2021.
[31] Qingda Lu, Christophe Alias, Uday Bondhugula, Thomas Henretty, Sriram Krishnamoorthy, Jagannathan Ramanujam, Atanas Rountev, Ponnuswamy Sadayappan, Yongjian Chen, Haibo Lin, et al. Data layout transformation for enhancing data locality on nuca chip multiprocessors. In 2009 18th International Conference on Parallel Architectures and Compilation Techniques, pages 348–357. IEEE, 2009.
[32] Wenyan Lu, Guihai Yan, Jiajun Li, Shijun Gong, Yinhe Han, and Xiaowei Li. Flexflow: A flexible dataflow accelerator architecture for convolutional neural networks. In 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 553–564. IEEE, 2017.
[33] Jiachen Mao, Zhongda Yang, Wei Wen, Chunpeng Wu, Linghao Song, Kent W Nixon, Xiang Chen, Hai Li, and Yiran Chen. Mednn: A distributed mobile system with enhanced partition and deployment for large-scale dnns. In 2017 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pages 751–756. IEEE, 2017.
[34] Igor Z Milosavljevic and Marwan A Jabri. Automatic array alignment in parallel matlab scripts. In Proceedings 13th International Parallel Processing Symposium and 10th Symposium on Parallel and Distributed Processing. IPPS/SPDP 1999, pages 285–289. IEEE, 1999.
[35] Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R Devanur, Gregory R Ganger, Phillip B Gibbons, and Matei Zaharia. Pipedream: Generalized pipeline parallelism for dnn training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pages 1–15, 2019.
[36] Deepak Narayanan, Amar Phanishayee, Kaiyu Shi, Xie Chen, and Matei Zaharia. Memory-efficient pipeline-parallel dnn training. In International Conference on Machine Learning, pages 7937–7947. PMLR, 2021.
[37] Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, et al. Efficient large-scale language model training on gpu clusters using megatron-lm. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–15, 2021.
[38] NVIDIA. Nvidia collective communications library (nccl). https://developer.nvidia.com/nccl, 2018.
[39] NVIDIA. Nvidia cuda. https://developer.nvidia.com/cuda-zone, 2023.
[40] OPEN-MPI. Cuda aware mpi. https://www.open-mpi.org/faq/?category=runcuda, 2023.
[41] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.
[42] Michael Philippsen. Automatic alignment of array data and processes to reduce communication time on dmpps. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 156–165, 1995.
[43] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
[44] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
[45] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2020.
[46] Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, and Yuxiong He. Zero-infinity: Breaking the gpu memory wall for extreme scale deep learning. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–14, 2021.
[47] J Ramanujam and P Sadayappan. Compile-time techniques for data distribution in distributed memory machines. IEEE Transactions on Parallel and Distributed Systems, 2(4):472–482, 1991.
[48] Jagannathan Ramanujam and Ponnuswamy Sadayappan. A methodology for parallelizing programs for multicomputers and complex memory multiprocessors. In Proceedings of the 1989 ACM/IEEE Conference on Supercomputing, pages 637–646, 1989.
[49] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019.
[50] Jaeyong Song, Jinkyu Yim, Jaewon Jung, Hongsun Jang, Hyung-Jin Kim, Youngsok Kim, and Jinho Lee. Optimus-cc: Efficient large nlp model training with 3d parallelism aware communication compression. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pages 560–573, 2023.
[51] Linghao Song, Fan Chen, Youwei Zhuo, Xuehai Qian, Hai Li, and Yiran Chen. Accpar: Tensor partitioning for heterogeneous deep learning accelerators. In 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 342–355. IEEE, 2020.
[52] Linghao Song, Jiachen Mao, Youwei Zhuo, Xuehai Qian, Hai Li, and Yiran Chen. Hypar: Towards hybrid parallelism for deep learning accelerator array. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 56–68. IEEE, 2019.
[53] Jakub M Tarnawski, Deepak Narayanan, and Amar Phanishayee. Piper: Multidimensional planner for dnn parallelization. Advances in Neural Information Processing Systems, 34:24829–24840, 2021.
[54] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models, 2023.
[55] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models, 2023.