0% found this document useful (0 votes)

72 views17 pages

PrimePar Efficient Spatial-Temporal Tensor

Uploaded by

yichuan.cheng2000

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

72 views17 pages

PrimePar Efficient Spatial-Temporal Tensor

Uploaded by

yichuan.cheng2000

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 17

PrimePar: Efficient Spatial-temporal Tensor

Partitioning for Large Transformer Model Training

Haoran Wang Lei Wang Haobo Xu
wanghaoran20g@ict.ac.cn wanglei21s@ict.ac.cn xuhaobo@ict.ac.cn
Institute of Computing Technology, Institute of Computing Technology, Institute of Computing Technology,
Chinese Academy of Sciences Chinese Academy of Sciences Chinese Academy of Sciences
University of Chinese Academy of University of Chinese Academy of Beijing, China
Sciences Sciences
Beijing, China Beijing, China

Ying Wang Yuming Li Yinhe Han

wangying2009@ict.ac.cn liyuming22@mails.ucas.ac.cn yinhes@ict.ac.cn
Institute of Computing Technology, Institute of Computing Technology, Institute of Computing Technology,
Chinese Academy of Sciences Chinese Academy of Sciences Chinese Academy of Sciences
Beijing, China University of Chinese Academy of Beijing, China
Sciences
Beijing, China

Abstract an optimization algorithm that can find optimal parallel solu-

With the rapid up-scaling of transformer-based large lan- tions from our spatial-temporal tensor partition space with
guage models (LLM), training these models is becoming in- acceptable search time. Our evaluation shows that our op-
creasingly demanding on novel parallel training techniques. timized tensor partitioning achieves up to 1.68 × training
Tensor partitioning is an extensively researched parallel tech- throughput with 69% peak memory occupancy compared
nique, encompassing data and model parallelism, and has to state-of-the-art distributed training systems when train-
a significant influence on LLM training performance. How- ing LLMs. Upon scaling to 32 GPUs, the geo-mean speedup
ever, existing state-of-the-art parallel training systems are across benchmarks is 1.30 ×. When applied in 3D parallelism,
based on incomplete tensor partitioning space, where the up to 1.46 × training throughput can be achieved.
distribution of partitioned sub-operators is limited to the spa- ACM Reference Format:
tial dimension. We discover that introducing the temporal Haoran Wang, Lei Wang, Haobo Xu, Ying Wang, Yuming Li, and Yinhe
dimension into tensor partitioning of LLM training instance Han. 2024. PrimePar: Efficient Spatial-temporal Tensor Partition-
provides extra opportunities to avoid collective communica- ing for Large Transformer Model Training. In 29th ACM Interna-
tion across devices, saving memory space and also overlap- tional Conference on Architectural Support for Programming Lan-
guages and Operating Systems, Volume 3 (ASPLOS ’24), April 27-May
ping device-to-device communication with computation. In
1, 2024, La Jolla, CA, USA. ACM, New York, NY, USA, 17 pages.
this paper, we propose a new tensor partition primitive that
https://doi.org/10.1145/3620666.3651357
distributes sub-operators along both the spatial and temporal
dimensions to further explore communication and memory 1 INTRODUCTION
overhead reduction over current solutions. This new primi-
tive creates a broader parallelization space and leads to par- With the recent advances in artificial intelligence, the contin-
allel solutions that achieve better training throughput with uous scaling of transformer models [56] has significantly en-
lower peak memory occupancy compared to state-of-the-art hanced model accuracy, generalization, and self-supervised
techniques. To efficiently deploy optimized parallel trans- capabilities. Transformer-based large language models with
former model training to multiple devices, we further present tens or hundreds of billions of parameters have achieved
state-of-the-art performance on a wide range of tasks [2, 7,
14, 43, 44, 54, 55, 60, 62, 64, 66].
With the exponential growth in parameter scale, train-
ing transformer models is becoming increasingly demand-
This work is licensed under a Creative Commons Attribution International ing on hardware resources, not only in terms of individual
4.0 License. device computational power, but also in terms of intercon-
ASPLOS ’24, April 27-May 1, 2024, La Jolla, CA, USA
nection bandwidth and memory. Responsively, NVIDIA has
© 2024 Copyright held by the owner/author(s).
ACM ISBN 979-8-4007-0386-7/24/04.
introduced NVSwitch to support high-performance commu-
https://doi.org/10.1145/3620666.3651357 nication between cross-node GPUs [6]. TPU v4 proposed
optical circuit switching to improve networking, which can

801
ASPLOS ’24, April 27-May 1, 2024, La Jolla, CA, USA Haoran Wang, Lei Wang, Haobo Xu, Ying Wang, Yuming Li, and Yinhe Han

Temporal Temporal
all-reduce since the summation is accomplished locally by
t3 t3
+
accumulating outputs through the passage of time. If tacti-
t2 t2 cally handled, temporal partitioning can both mitigate mem-
+
ory/communication overhead and fully utilize the compu-
t1 t1
+ tation resource of multiple devices when combined with
t0 + + +
t0 spatial partitioning. Our research reveals that adding the
Spatial Spatial
D0 D1 D2 D3 D0 D1 D2 D3 temporal dimension to the tensor partition space unlocks
numerous possibilities, including avoiding all-reduce, reduc-
Summations are
Parameter tensor Parameter tensor Summations ing memory footprint and overlapping communication with
implemented as
is replicated is only stored in happen in D0
all-reduce among
among D0-D3
D0-D3
D0 locally computation.
(a) All-reduce and tensor (b) Temporal distribution eliminates
Guided by these insights, we propose PrimePar, an au-
replication along spatial dimension the all-reduce and replication tomated parallel training framework targeting large trans-
former models. PrimePar incorporates the extended tensor
Figure 1. Illustration of distributing 4 sub-operators along partition space, resulting in parallel solutions that achieve
spatial dimension (a) and temporal dimension (b). The hori- superior training throughput with lower peak memory occu-
zontal axis represents 4 devices (D0-D3) while the vertical pancy. To the best of our knowledge, this is the first research
axis represents 4 temporal steps (t0-t3). Each rounded rec- that introduces the temporal dimension into tensor parti-
tangle represents a sub-operator, with the upper and lower tioning for parallel training. Our major contributions can be
representing two tensors. The same color represents iden- summarized as:
tical tensors, while different colors indicate distinct partial
sums that require summation. • We expand the tensor partition space for LLM train-
ing by introducing a new partition primitive, which
leverages the previously ignored temporal dimension
adjust interconnection topology to match the application to distribute sub-operators in a fashion of more com-
and deliver more bandwidth [22]. munication and memory efficiency.
Parallel training systems are required to adapt to the de- • Regarding the lack of scalable tensor partition opti-
velopment of transformer models and hardware resources, mization algorithm for transformer models, we design
so that the expensive computational power, interconnection a dynamic programming-based algorithm capable of
bandwidth and memory resources are efficiently utilized. identifying the optimal tensor partition strategy from
Current common practice for training transformer models is our extended space with solvable complexity.
resorting to 3D parallelism [37, 45, 49], including data, model • We evaluate the effectiveness of PrimePar on train-
and pipeline parallelism. Amongst 3D space, the Pipeline ing workloads of popular transformer models with a
parallelism [18, 35] features inexpensive point-to-point com- cluster of 32 GPUs. Compared to state-of-the-art dis-
munication in the cost of pipeline bubbles caused by pe- tributed training systems Megatron-LM [37, 49] and
riodic flushes during training. Data and model parallelism Alpa [67], we can achieve up to 1.68 × speedup, with
[9, 12, 13, 21, 53, 57], which can be collectively represented only 69% peak memory occupancy. When scaling to
by tensor partitioning, are more communication-intensive 32 GPUs, 1.30 × geo-mean speedup can be achieved.
and favored in devices with relatively high interconnection When applied to 3D parallelism, up to 1.46 × speedup
bandwidth. The performance of tensor partitioning is critical over Megatron-LM can be achieved.
to system performance. With the advancement of intercon-
nection topology and bandwidth, tensor partitioning will
2 BACKGROUND AND MOTIVATION
have wider application scenarios and play a more important
role in system performance. 2.1 Tensor Partitioning for Transformers
However, existing tensor partitioning fails in fully uti- Tensor partitioning has been extensively explored as a key
lizing hardware resources. The inadequacy is that current technique to parallelize neural networks [5, 8, 12, 24, 63] to
tensor partition primitive only distributes the partitioned speedup the computation and amortize the memory over-
sub-operators along spatial dimensions. It is often the case head. Recursive tensor partitioning has been proposed [15,
that sub-operators produce partial sum results that necessi- 20, 21, 33, 51, 52, 58] to formalize a relatively general parallel
tate further aggregation from devices. As depicted in Fig.1, space and enable automatic search for the best parallel plans.
when distributing these sub-operators along spatial dimen- Tensor partitioning has been applied to accelerate train-
sions (i.e. to different devices), the summation of partial-sum ing transformer models. Megatron-LM [37, 49] manually
leads to all-reduce communication traffic among mapped designed efficient tensor partitions for transformer models,
devices. On the contrary, distributing these sub-operators where row or column dimensions of linear operators and
along the temporal dimension in a single device can avoid head dimension of attention matrix multiplications were

802
PrimePar: Efficient Spatial-temporal Tensor Partitioning for Large Transformer Model Training ASPLOS ’24, April 27-May 1, 2024, La Jolla, CA, USA

All-reduce The rest is a waste of memory. The all-reduce and memory waste are
41.5%
OPT 6.7B
not coincidences. When partitioning a dimension of an oper-
ator which is mathematically summed-over, distributing the
38.4%
Llama2 70B partitioned sub-operators to different devices will inevitably
64.6%
induce all-reduce. When some tensor has no dimension par-
BLOOM 176B titioned, then it is replicated among devices’ memory. As
(a) proportion of all-reduce latency during training
shown in Fig.2 (a), all-reduce communication takes up a sig-
nificant proportion of training latency of Megatron-LM. Fig.2
Megatron memory
occupancy
(b) depicts the disparity in peak memory occupation between
Collective Replication
Ideal memory Megatron-LM and the ideal scenario, where the ideal sce-
occupancy nario assumes no tensor replication among devices. It can
Proposed × × be seen that the memory waste caused by tensor replication
becomes progressively more severe as the parallelism size
Partition
1.10 ×
batch
√ √ increases.
The inefficiencies mentioned above have become increas-
Partition
1.29 ×
row
√ √ ingly unacceptable considering the demanding requirements
placed on computation, communication and memory by
1.66 × Partition
√ √ large transformer models training workloads. Although these
column
inefficiencies seem unavoidable, in fact, they are addressed
Partition if we introduce a new dimension into tensor partitioning,
sequence
√ √
4 8 16 32 which is the temporal dimension. Inspired by cannon par-
(b) memory occupation (c) advantage of proposed partitioning allel matrix multiplication [3], we consider distributing par-
titioned sub-operators along both spatial dimensions and
Figure 2. The proportions of all-reduce latency during train- temporal dimension (i.e. on the same device but at different
ing OPT 6.7B, Llama2 70B and Bloom 176B on 16 V100 GPUs time). Specifically, as illustrated in Fig.1, if the results of a
using Megatron-LM (model parallelism within a node and group of sub-operators need to be aggregated (or general
data parallelism across nodes) are shown in (a). Training reduction), distributing them along temporal dimension elim-
Llama2 70B model with the same batch size on 4, 8, 16, 32 inates the necessity for all-reduce communication. Moreover,
GPUs, the disparity in peak memory occupation per GPU distributing sub-operators which have some common tensor
between Megatron-LM and the ideal scenario is shown in along temporal dimension can avoid physically replicating
(b). The advantage of proposed partition primitive compared their common tensor in device memory.
with existing ones is shown in (c). Motivated by these considerations, we propose a novel
(c) 应该包含的元素：
spatial-temporal tensor partition primitive for parallel train-
Temporal dimension -> ing. This is a challenging work in that the proposed partition
Partition batch/row
partitioned. … ->
Alpa distribute
[67] sub- a framework supporting au-
proposed primitive should meet the training requirements, such as
operators to spatial dimensions -> 产生
tomatic searching and
all-reduce, replication
deploying optimalSub-operator: temporal dimension
tensor partition tensor alignment between different training phases, while
Novel partition
solutions -> to
in their temporal
space, achieving transformer model train- exhibiting better parallel performance. As summarized in
dimensions -> 消除
ing performance that is comparable to Megatron-LM.
另外两个关键元素：reduce among sub- Fig.2 (c), our proposed partition primitive avoids incurring
operators the overhead associated with both collective communication
Replicated tensor among sub-operators
2.2 Motivation and tensor replication, differing from conventional partition-
Unfortunately, the tensor partition space explored so far is ing methodologies employed in transformer models.
incomplete, missing opportunities to better utilize hardware
resources. Take partitioning row dimension of linear opera-
3 PRIMEPAR PARTITION SPACE
tor weight as an example, which is one of the strategies pro-
posed in Megatron-LM. Row dimension of weight 𝑊 is par- 3.1 Basic Formulation
titioned into 𝑛 slices, generating 𝑛 tensor blocks 𝑊1, · · · ,𝑊𝑛 . PrimePar partitions over 2𝑛 homogeneous devices with each
Therefore, the forward computation is divided into 𝑛 sub- device indexed by a Device ID D = (𝑑 1, · · · , 𝑑𝑛 ), where
operators 𝐼 1𝑊1, · · · , 𝐼𝑛𝑊𝑛 , where 𝐼 1, · · · , 𝐼𝑛 are input tensor 𝑑𝑖 = 0, 1. After tensor partitioning, an operator is partitioned
blocks partitioned along last dimension in accordance with into sub-operators with each sub-operator holding a part of
partitioning 𝑊 . These 𝑛 sub-operators are distributed to 𝑛 the partitioned tensor. In PrimePar, two sub-operators can
different devices with each of them computing a partial re- be allocated to different devices D, or one single device at
sult, thus all-reduce among these 𝑛 devices is necessary for different temporal steps 𝑡, hence both D, 𝑡 are necessary to
computing the final result 𝐼 1𝑊1 + · · · + 𝐼𝑛𝑊𝑛 . Also, after all- identify a sub-operator. We introduce the Dimension Slice
reduce, each device holds a copy of the output tensor, which Index (DSI), which is a function of D, 𝑡 and records which

803
ASPLOS ’24, April 27-May 1, 2024, La Jolla, CA, USA Haoran Wang, Lei Wang, Haobo Xu, Ying Wang, Yuming Li, and Yinhe Han

F F F
slice of dimension the sub-operator (D, 𝑡) holds. Specifically, Forward I M = d1 I N = d2 I K = 0 D=(0,0) D=(0,1)
consider the linear operator in transformer models: I W = O 0 × 0,2 1 × 1,3
0 1 0,2 0,1
I W O Device 0 Device 1
0 1
[N, 0,2 0,1 0 × 0,2 + 1 × 1,3
[B, M, K] [B, M, N] × K] M
Forward: = · 𝑊
2 3 1,3
=
2,3 2 × 0,2 + 3 × 1,3
2 3 1,3 2,3 D=(1,0) D=(1,1)
𝑂 𝐼 Forward Device 2 Device 3

T
Device 0 Device 2 2 × 0,2 3 × 1,3
N K
I dO dW
[B, M, N] [B, M, K]0 2 [K,0,1N] 0,2 0 × 0,1 + 2 × 2,3
×
𝑇=
Backward: 𝑑𝐼 = 𝑑𝑂 1
·𝑊
Gradient
3 2,3 1,3 1 × 0,1
Device 1
(1)
+ 3 × 2,3
Device 3
G G G
Gradient I M = d1 I N = d2 I K = 0 D=(0,0) D=(1,0)
[N, K] [B, N, M] [B, M, K] I
T
dO = dW
𝑇 0 × 0,1 2 × 2,3
Gradient: 𝑑𝑊 = 𝐼 · 𝑑𝑂 0 2 0,1 0,2
N D=(0,1) D=(1,1)
Linear operator has four dimensions 𝐵, 𝑀, 𝑁 , 𝐾 (referring 1 3 2,3 1,3
1 × 0,1 3 × 2,3
to batch, sequence, input/output hidden dimension), each M K
dimension, say, dimension 𝑁 maintains three DSIs 𝐼 𝑁𝐹 , 𝐼 𝑁𝐵 , 𝐼 𝑁𝐺
for Forward, Backward and Gradient phases respectively. Figure 3. Parallelizing linear operator to 4 devices by parti-
If 𝐼 𝑁𝐹 (D, 𝑡) = 2, then at step 𝑡 of Forward phase, device D tion dimension 𝑀 and 𝑁 . The colored blocks represent par-
computes a sub-operator whose 𝐼,𝑊 tensors’ dimension 𝑁 titioned tensors, and the number (decimal device id) within
is the 2-th slice among the slices of original dimension 𝑁 indicate which devices are holding the tensor. Blocks with
generated by tensor partitioning. Any tensor partition plan multiple numbers indicate the tensor replication among de-
in the space of PrimePar can be uniquely represented by vices. Dashed boxes on the right show the sub-operators
specifying DSIs. mapped on devices, where ⊕ represents all-reduce.
In the following of this section, we introduce PrimePar
basic tensor partitions and their DSI formulations. Our tensor
dimension of the operator into two slices, resulting in the
partition space is represented as the space of sequences of
partitioning of tensors that include this dimension into two
these basic partitions.
parts, each allocated to separate sub-operators. For tensors
3.2 Existing Tensor Partitions not containing this dimension, they are replicated in sub-
operators. Recursively partition by dimension forms a large
The current paradigm of tensor partition can be summarized
tensor partition space which covers data parallelism and
as partition by dimension. This involves dividing a single
model parallelism that have been applied to train transformer
models. For example, in Megatron-LM linear operators are
Algorithm 1 PrimePar Linear Operator DSIs parallelized by recursively partitioning row or column (di-
mension 𝑁 , 𝐾 in Eq.1), while matrix multiplications in the
1: Input: Sequence of partitions P
attention block are partitioned by head dimension. PrimePar
2: Output: DSIs 𝐼𝑋𝐹 , 𝐼𝑋𝐵 , 𝐼𝑋𝐺 , 𝑋 refers to 𝐵, 𝑀, 𝑁 , 𝐾
establishes a comprehensive tensor partition space encom-
3: Initialize: 𝑖 = 1, 𝐼𝑋𝐹 = 𝐼𝑋𝐵 = 𝐼𝑋𝐺 = 0
passing conventional partition by dimension, with adapta-
4: for p in P do
tion to transformer operators.
5: if p is partitioning by dimension 𝑋 then
Linear Operator. PrimePar supports partitioning all four
6: 𝐼𝑋𝐹 ← 2𝐼𝑋𝐹 + 𝑑𝑖 , 𝐼𝑋𝐵 ← 2𝐼𝑋𝐵 + 𝑑𝑖 , 𝐼𝑋𝐺 ← 2𝐼𝑋𝐺 + 𝑑𝑖
dimensions 𝐵, 𝑀, 𝑁 , 𝐾 of linear operator (Eq.1). We exemplify
7: 𝑖 ←𝑖 +1
parallelizing linear operator to 4 devices (D = (𝑑 1, 𝑑 2 )) by
8: else if p is 𝑃2𝑘 ×2𝑘 then
partitioning dimension 𝑀 and 𝑁 sequentially. Initially all
9: 𝑟 = 2𝑘 −1𝑑𝑖 + 2𝑘 −2𝑑𝑖+2 + · · · + 20𝑑𝑖+2𝑘 −2
DSIs are 0, indicating that no dimension has been partitioned.
10: 𝑐 = 2𝑘 −1𝑑𝑖+1 + 2𝑘 −2𝑑𝑖+3 + · · · + 20𝑑𝑖+2𝑘 −1
The first partition by dimension 𝑀 can be represented as
11: temporal index 𝑡 varying from 0 to 2𝑘 − 1
𝐹 𝐹 𝐵 𝐵 𝐺 𝐺
12: 𝐼𝑀 𝐹 ← 2𝑘 𝐼 𝐹 + (𝑟 mod 2𝑘 )
𝑀
𝐼𝑀 ← 2𝐼𝑀 + 𝑑 1, 𝐼 𝑀 ← 2𝐼𝑀 + 𝑑 1, 𝐼 𝑀 ← 2𝐼𝑀 + 𝑑1 (2)
13: 𝐼𝑀 ← 2𝑘 𝐼𝑀
𝐵 𝐵 + (𝑟 mod 2𝑘 )
where dimension 𝑀 is partitioned into 2 slices and devices
14: 𝐺 ← 2𝑘 𝐼 𝐺 + ((𝑟 + 𝑡) mod 2𝑘 )
𝐼𝑀 𝑀 with 𝑑 1 = 0 hold the 0-th slice while others hold the 1-th
15: 𝐼 𝑁𝐹 ← 2𝑘 𝐼 𝑁𝐹 + ((𝑟 + 𝑐 + 𝑡) mod 2𝑘 ) slice. The second partition
16: 𝐼 𝑁𝐵 ← 2𝑘 𝐼 𝑁𝐵 + ((𝑟 + 𝑐 − 1) mod 2𝑘 ) 𝐼 𝑁𝐹 ← 2𝐼 𝑁𝐹 + 𝑑 2, 𝐼 𝑁𝐵 ← 2𝐼 𝑁𝐵 + 𝑑 2, 𝐼 𝑁𝐺 ← 2𝐼 𝑁𝐺 + 𝑑 2 (3)
17: 𝐼 𝑁𝐺 ← 2𝑘 𝐼 𝑁𝐺 + ((𝑟 + 𝑐 − 1 + 𝛿𝑡,2𝑘 −1 ) mod 2𝑘 )
18: 𝐼𝐾𝐹 ← 2𝑘 𝐼𝐾𝐹 + (𝑐 mod 2𝑘 ) partitions dimension 𝑁 into 2 slices and devices with 𝑑 2 = 0
and 𝑑 2 = 1 hold the 0-th slice and 1-th slice respectively.
19: 𝐼𝐾𝐵 ← 2𝑘 𝐼𝐾𝐵 + ((𝑐 + 𝑡) mod 2𝑘 )
The partitioned tensor distribution and the way they form
20: 𝐼𝐾𝐺 ← 2𝑘 𝐼𝐾𝐺 + ((𝑐 − 1 + 𝛿𝑡,2𝑘 −1 ) mod 2𝑘 ) sub-operators are depicted in Fig.3.
21: 𝑖 ← 𝑖 + 2𝑘 Consider devices whose device ids differ only in 𝑑 1 (devices
22: end if D=(0,0),(1,0) or D=(0,1),(1,1) in Fig. 3), they hold different
23: end for slices of dimension 𝑀 while holding the same slice of any

804
PrimePar: Efficient Spatial-temporal Tensor Partitioning for Large Transformer Model Training ASPLOS ’24, April 27-May 1, 2024, La Jolla, CA, USA

Table 1. Sender device coordinates for receiver device (𝑟, 𝑐) Formulation of Partition. We denote the proposed parti-
for ring communications. Table row distinguishes temporal tion as 𝑃2𝑘 ×2𝑘 (𝑘 ≥ 1), which is parallelized across a group of
intervals that have different communication directions. Note 22𝑘 devices. 𝑃2𝑘 ×2𝑘 logically sees these devices as a 2𝑘 × 2𝑘
that each coordinate entry specifies a ring communication square. Each device D in the group has its relative row and
which overlaps with computation step 𝑡, and the blank entry column indices 𝑟 (D), 𝑐 (D) within the logical square.
means no communication is required. Unlike partitioning by dimension, 𝑃 2𝑘 ×2𝑘 assigns each de-
vice with 2𝑘 sub-operators, which are executed sequentially
Forward in temporal steps with 0 ≤ 𝑡 < 2𝑘 indexing these steps. We
Temporal step 𝐼 𝑊 design 𝑃2𝑘 ×2𝑘 to satisfy the following features:
𝑡< 2𝑘 −1 (𝑟, 𝑐 + 1) (𝑟 + 1, 𝑐) 1. Collective communication free: Collective communi-
Backward cation, which is expensive and hard to overlap with
computation, is not required throughout the whole
Temporal step 𝑑𝑂 𝑊
training process.
𝑡< 2𝑘 −1 (𝑟, 𝑐 + 1) (𝑟 − 1, 𝑐 + 1) 2. Memory efficient: No replication of tensors among the
𝑡 = 2𝑘 − 1 (𝑟, 𝑐 + 1) memory of devices.
Gradient 3. Support training: Forward, Backward and Gradient
Temporal step 𝐼 𝑑𝑂 𝑑𝑊 phases can be executed periodically without additional
𝑡< 2𝑘 −2 (𝑟 + 1, 𝑐 − 1) (𝑟 + 1, 𝑐) operations between them.
𝑡 = 2𝑘 − 2 (𝑟 + 1, 𝑐) (𝑟 + 1, 𝑐 + 1) Feature 1 and 2 endow 𝑃2𝑘 ×2𝑘 with superior performance and
memory efficiency compared to partitioning by dimension.
𝑡 = 2𝑘 − 1 (𝑟, 𝑐 + 1)
Feature 3 requires two kinds of tensor distribution align-
ments. For the tensor stashed in memory in an earlier phase
other dimensions. Thus during Gradient phase, they com- which will be used in some latter phase, its distribution at the
pute the same 𝑑𝑊 (containing dimension 𝑁 , 𝐾), but each of last step of the earlier phase must align with that at the first
them obtains partial sum due to not having the complete di- step of the latter phase. Also, the weight distribution at the
mension 𝑀. This induces all-reduce communication among first step of Forward phase must align with that at the last
them as shown in Fig.3 (Gradient phase). Also, tensors not step of Gradient phase. If feature 3 is not satisfied, frequent
containing dimension 𝑀 (𝑊 , 𝑑𝑊 ) are replicated among them redistribution of weight, gradient and other intermediate
(blue and red blocks with black borders in Fig.3). tensors is required between training phases.
Other Operators in Transformer. PrimePar supports par- We now specify the details of 𝑃2𝑘 ×2𝑘 . For device (𝑟, 𝑐) at
titioning all dimensions of matrix multiplications in atten- temporal step 𝑡, its DSIs are:
tion, except for head embed dimension. The head embed 

 𝐼𝑀 = 𝑟 mod 2𝑘
dimension typically takes values of 64 or 128, partitioning Forward:


𝐼 𝑁 = (𝑟 + 𝑐 + 𝑡) mod 2𝑘 (4)
which would significantly reduce the operational intensity 
𝐼𝐾 = 𝑐 mod 2𝑘


and lower the throughput. For softmax operator, we parti- 
tion all of its dimensions except for the dimension along 

 𝐼𝑀 = 𝑟 mod 2𝑘
which softmax is computed (usually the last dimension). For


Backward: 𝐼 𝑁 = (𝑟 + 𝑐 − 1) mod 2𝑘 (5)
normalization operator, we support partitioning all of its 
𝐼𝐾 = (𝑐 + 𝑡) mod 2𝑘


dimensions, with potential all-reduce of expectations and 
gradient of parameters 𝛾, 𝛽. For the remaining element-wise  𝐼 = (𝑟 + 𝑡) mod 2𝑘
 𝑀


operators, we support partitioning all of their dimensions.

Gradient: 𝐼 𝑁 = (𝑟 + 𝑐 − 1 + 𝛿𝑡,2𝑘 −1 ) mod 2𝑘 (6)
While partitioning by dimension dominates existing ten- 
 𝐼𝐾 = (𝑐 − 1 + 𝛿𝑡,2𝑘 −1 ) mod 2𝑘

sor partition, it suffers from significant shortcomings of need- 
ing collective communication and tensors replication. As where 𝛿𝑡,2𝑘 −1 takes value 1 if 𝑡 = 2𝑘 − 1 otherwise 0. It can
transformer models continue to scale up, these shortcomings be verified that features 1,2,3 are satisfied.
are becoming increasingly destructive to parallel training. Feature 1: During Forward phase, the summed-over dimen-
sion of matrix multiplication is dimension 𝑁 . In Eq.4, for an
3.3 Novel Tensor Partition Primitive arbitrary device (𝑟, 𝑐), 𝐼 𝑁 takes all 2𝑘 values as 𝑡 varies from
We introduce a novel tensor partition primitive that effec- 0 to 2𝑘 − 1 while 𝐼𝑀 , 𝐼𝐾 are fixed to 𝑟, 𝑐. Thus after 2𝑘 steps
tively addresses the shortcomings of conventional partition of computation, device (𝑟, 𝑐) computes the (𝑟, 𝑐)-th block of
by dimension. When incorporating this novel partition, we output tensor, with no device yielding partial sum because
can identify parallel strategies that achieves higher through- the partial sums of all slices of dimension 𝑁 are summed
put and lower peak memory occupancy simultaneously. locally in device (𝑟, 𝑐) during these 2𝑘 steps (see Fig.4). The

805
The upper-right corner of tensor block: which double buffer
In the ring: which double buffer as source of communication
holds
ASPLOS ’24, April this tensor
27-May block
1, 2024, La Jolla, CA, USA Haoran Wang, Lei Wang, Haobo Xu, Ying Wang, Yuming Li, and Yinhe Han

c
0 0 1 1
O00 O01 O00 O01 dI01 dI00 dI01 dI00 dW11 dW00 dW00 dW11
0
+= 0 += 0 += 1 += 1 += 0 += 0 += 1 += 1 += 1 += 1 += 0 += 0
0 0 T T T T
I 00 I 01 I 01 I 00 dO00 dO01 dO01 dO00 I 10 I 00 I 01 I 11
×0 ×0 ×1 ×1 ×1 ×1 ×0 ×0 ×1 ×1 ×0 ×0
T T T 0 T
r W00 W11 W10 W01 W01 W10 W11 W00 dO01 dO00 dO10 dO11
0 0 1 1 1 1 1 1
0 0 1 1
O10 O11 O10 O11 dI10 dI11 dI10 dI11 dW01 dW10 dW10 dW01
0
+= 0 += 0 += 1 += 1 += 0 += 0 += 1 += 1 += 1 += 1 += 0 += 0
T T T T
I 11 I 10 I 10 I 11 dO10 dO11 dO11 dO10 I 01 I 11 I 10 I 00
0 0
×0 ×0 ×1 ×1 ×1 ×1 ×0 ×0 ×1 ×1 ×0 ×0
T T 0
W10 W01 W00 W11 W00 W11 T
W10 T
W01 dO11 dO10 dO00 dO01

Forward Step 0 Forward Step 1 Backward Step 0 Backward Step 1 Gradient Step 0 Gradient Step 1
Comp stream

× × × × × ×

Comm stream
recv recv recv recv recv recv recv recv
次 right bot right right-top right bot right-bot right Time

Figure 4. The process of training with partition 𝑃2×2 . Details of orchestration among 2 × 2 devices are shown in the upper
的 half, while the corresponding execution timeline is shown in the lower. The 0-1 numbers to the upper-right of tensor block or
within communication ring indicate the tensor or communication source is in which double buffer.
Computation × × × × × ×

same conclusion
Communication can be verified for Backward and Gradient result and can be carried out simultaneously with computa-
phases similarly. tion. This plays a crucial role in effectively harnessing Time
both
Feature 2: Consider arbitrary two devices (𝑟, 𝑐) and (𝑟 ′, 𝑐 ′ ) computational power and communication bandwidth.
in Gradient phase (Eq.6), suppose their corresponding 𝐼 𝑁 Specifically, in Forward phase, tensors 𝐼,𝑊 require com-
and 𝐼𝐾 are the same at temporal step 𝑡 munication between steps. Leveraging double buffer, each
device can compute with 𝐼,𝑊 of the current step while receiv-
(𝑟 + 𝑐 − 1 + 𝛿𝑡,2𝑘 −1 ) ≡ (𝑟 ′ + 𝑐 ′ − 1 + 𝛿𝑡,2𝑘 −1 ) mod 2𝑘
ing 𝐼,𝑊 for the next step and storing them into double buffer.
(𝑐 − 1 + 𝛿𝑡,2𝑘 −1 ) ≡ (𝑐 ′ − 1 + 𝛿𝑡,2𝑘 −1 ) mod 2𝑘 The communications of 𝑑𝑂,𝑊 in Backward phase and 𝐼, 𝑑𝑂
in Gradient phase are similar. As for 𝑑𝑊 in Gradient phase,
The latter equation yields 𝑐 = 𝑐 ′ which further yields 𝑟 = 𝑟 ′
it is the result tensor accumulating the computation result
in the former. Thus at any step 𝑡 during Gradient phase, no
of each step. Since 𝐼 𝑁 , 𝐼𝐾 only change from step 𝑡 = 2𝑘 − 2 to
two devices have the same 𝐼 𝑁 and 𝐼𝐾 simultaneously, which
𝑡 = 2𝑘 − 1 when 𝛿𝑡,2𝑘 −1 turns from 0 to 1, 𝑑𝑊 accumulated in
means 𝑊 , 𝑑𝑊 are never replicated among devices. Similar
previous steps should be redistributed during the last step of
reasoning can be done for all tensors in all phases.
computation. Afterwards, each device adds the redistributed
Feature 3: Let 𝑡 = 2𝑘 − 1 in Eq.4 and 𝑡 = 0 in Eq.6, where we
𝑑𝑊 with its last step computation result to yield the final
can verify that 𝐼𝑀 , 𝐼 𝑁 at the last step of Forward match with
𝑑𝑊 result. This 𝑑𝑊 redistribution is necessary for weight
that at the first step of Gradient. This means the distribution
alignment between start of Forward and end of Gradient.
of tensor 𝐼 does not change when transition from Forward
Here we explicitly derive the communication pattern of
to Gradient. Thus each device can stash its 𝐼 at the end of
𝑊 in Backward phase. In Eq.5, 𝐼 𝑁 , 𝐼𝐾 are the DSIs of 𝑊 . At
Forward and directly use the stashed 𝐼 for computation when
step 𝑡 + 1, device (𝑟, 𝑐) holds this part of 𝑊 :
entering Gradient phase (see the blue blocks in Forward step
1 and Gradient step 0 in Fig.4). Other alignments can be 𝐼 𝑁 = (𝑟 + 𝑐 − 1) mod 2𝑘
verified similarly. 𝐼𝐾 = (𝑐 + (𝑡 + 1)) mod 2𝑘
Formulation of Communication. Due to the variation
of DSIs with 𝑡, there exists tensor communication between To identify which device was holding the part of 𝑊 specified
temporal steps. In Eq.4, 𝐼 𝑁 changes with 𝑡. Since both tensors by the above 𝐼 𝑁 , 𝐼𝐾 at step 𝑡, rewrite the above equations as:
𝐼 and 𝑊 contain dimension 𝑁 , the communication for 𝐼,𝑊 𝐼 𝑁 = ((𝑟 − 1) + (𝑐 + 1) − 1) mod 2𝑘
is necessary. Similarly from Eq.5, 6 it can be inferred that
in Backward phase, communication is required for tensors 𝐼𝐾 = ((𝑐 + 1) + 𝑡) mod 2𝑘
𝑑𝑂,𝑊 , while in Gradient phase, communication is necessary which indicates that device (𝑟 − 1, 𝑐 + 1) was holding this part
for tensors 𝐼, 𝑑𝑂, 𝑑𝑊 . of 𝑊 . Note here 𝑟 −1 = 2𝑘 −1 if 𝑟 = 0 and 𝑐 +1 = 0 if 𝑐 = 2𝑘 −1.
However, unlike all-reduce communication, these com- Therefore, we can conclude that for 𝑡 < 2𝑘 − 1 in Backward
munications are not data-dependent on the computation phase, each device receives 𝑊 from its right-top neighbor,

806
PrimePar: Efficient Spatial-temporal Tensor Partitioning for Large Transformer Model Training ASPLOS ’24, April 27-May 1, 2024, La Jolla, CA, USA

establishing a ring point-to-point communication pattern, High bandwidth connection Low bandwidth connection

where each device only communicates with its neighbors

in the ring (Fig.4 Backward step 0). During the last step of
Device Device Device Device
Backward, 𝑊 should be redistributed to align with the 𝑊 (0,0,0) (0,0,1) (1,0,0) (1,0,1)

distribution at the start of Forward. At the start of Forward,

device (𝑟, 𝑐) holds: Device Device Device Device
(0,1,0) (0,1,1) (1,1,0) (1,1,1)
𝐼 𝑁 = (𝑟 + 𝑐) mod 2𝑘
𝐼𝐾 = 𝑐 mod 2𝑘
(a) Grouping pattern of group indicator (d1, d3 )
which can be rewritten into:
𝐼 𝑁 = (𝑟 + (𝑐 + 1) − 1) mod 2𝑘
Device Device Device Device
(0,0,0) (0,0,1) (1,0,0) (1,0,1)
𝐼𝐾 = ((𝑐 + 1) + (2𝑘 − 1)) mod 2𝑘
comparing these equations with Eq.5 shows that during the Device Device Device Device
last step of Backward, device (𝑟, 𝑐 + 1) holds the part of 𝑊 (0,1,0) (0,1,1) (1,1,0) (1,1,1)

which device (𝑟, 𝑐) holds at the start of Forward. Thus during

the last step of Backward, each device should receive 𝑊 from
(b) Grouping pattern of group indicator (d2, d3 )
its right neighbor to ensure the 𝑊 distribution alignment
(Fig.4 Backward step 1).
Figure 5. Illustration of grouping patterns of group indi-
Similar reasoning can derive all communication patterns
cators (a) (𝑑 1, 𝑑 3 ) and (b) (𝑑 2, 𝑑 3 ), where devices within the
listed in Table.1, all of which are ring communication and
same dashed box form a group. When executing all-reduce
can be effectively overlapped with computation.
or ring communication, (a) exhibits longer latency than (b)
because low bandwidth interconnections are contained in
4 COST MODEL
each group of (a).
4.1 Intra-operator Cost
The cost of executing a parallelized operator based on a
tensor partition strategy is the intra-operator cost, which All-reduce happens within each group, hence the overall
is dependent on both the operator configuration and the all-reduce latency is determined by the slowest among these
tensor partition strategy. We model two aspects of intra- groups, which can be modeled as a linear function of the
operator cost, training latency and peak memory occupancy. tensor size being all-reduced. Fig.5 visualizes the grouping
The intra-operator cost is a weighted sum of these two met- patterns when the group indicators are (𝑑 1, 𝑑 3 ) and (𝑑 2, 𝑑 3 ).
rics, enabling joint optimization of performance and memory For each group pattern, we obtain its linear function coef-
occupation. We introduce the intra-operator cost for an op- ficients by profiling real system latency with different all-
erator 𝑛 that is partitioned with partition sequence P (refer reduce tensor sizes and apply linear regression. Note this
to Alg.1). profiling is scalable for large systems since the number of
Intra-operator Communication. PrimePar has two kinds group indicators is less than the number of devices (group
of intra-operator communication: all-reduce communication indicator is a sub-sequence of device id D). The modeling of
caused by conventional partition by dimension and ring ring communication latency is similar. In this manner, we
communication caused by 𝑃2𝑘 ×2𝑘 . Both all-reduce and ring establish functions for computing the total all-reduce latency
communication happens in groups. and ring communication latency at step 𝑡, denoting them as
The grouping pattern for all-reduce is determined by 𝑎𝑙𝑙𝑟𝑒𝑑𝑢𝑐𝑒 (𝑛, P), 𝑟𝑖𝑛𝑔(𝑛, P, 𝑡) for convenience.
the positions of partitioning the summed-over dimension Computation. We model the computation latency of op-
in P. Consider parallelizing linear operator to 8 devices erators as linear functions of the amount of floating point
(D = (𝑑 1, 𝑑 2, 𝑑 3 )), we determine the grouping pattern of all- operations and memory access. We also obtain the linear
reduce in Forward phase. Suppose 𝐼 𝑁𝐹 changes with (𝑑 1, 𝑑 3 ), coefficients through real system latency profiling and linear
while the rest of device id (𝑑 2 ) does not affect the value of regression. The coefficients are profiled separately for differ-
𝐼 𝑁𝐹 . Two devices within one group must have the same (𝑑 2 ) ent types of operators. We denote the computation latency
and different (𝑑 1, 𝑑 3 ). The same (𝑑 2 ) ensures that they com- at step 𝑡 as 𝑐𝑜𝑚𝑝𝑢𝑡𝑒 (𝑛, P, 𝑡).
pute the same part of result tensor, while different (𝑑 1, 𝑑 3 ) Peak Memory Occupancy. For each operator, its peak mem-
ensures that they hold different slices of dimension 𝑁 to ory occupancy during training is modeled as the total size of
compute different partial sums. We call (𝑑 1, 𝑑 3 ) as the group its parameter tensors and tensors stashed in Forward phase
indicator. In this way, all devices are divided into several for use in Backward and Gradient phases. Denote the peak
disjoint groups, whose union is the complete set of devices. memory occupancy as 𝑚𝑒𝑚𝑜𝑟𝑦 (𝑛, P).

807
ASPLOS ’24, April 27-May 1, 2024, La Jolla, CA, USA Haoran Wang, Lei Wang, Haobo Xu, Ying Wang, Yuming Li, and Yinhe Han

The overall intra-operator cost can be modeled as: Transformer block

Softmax
Q
∑︁
𝑖𝑛𝑡𝑟𝑎𝐶 (𝑛, P) = 𝑚𝑎𝑥 (𝑐𝑜𝑚𝑝𝑢𝑡𝑒 (𝑛, P, 𝑡), 𝑟𝑖𝑛𝑔(𝑛, P, 𝑡))

MLP Out
NonLin
MLP In
Norm

Norm
Proj
K
𝑡
+ 𝑎𝑙𝑙𝑟𝑒𝑑𝑢𝑐𝑒 (𝑛, P) + 𝛼 · 𝑚𝑒𝑚𝑜𝑟𝑦 (𝑛, P)

V
(7)
Computation graph
where 𝛼 is the adjustment coefficient between latency and
0 1 2 3 4 5 6 7 8 9 10 11 12
memory metrics.

4.2 Inter-operator Cost Segment 1 Segment 2 Segment 3

Dynamic programming within segment 2
Consider two consecutive operators 𝑛 1, 𝑛 2 partitioned with
2 3 4 5 6 7
P1, P2 , where the output of 𝑛 1 serves as the input of 𝑛 2 .
On a given device D, the locally held output tensor after Eq. 7 Eq. 7 Eq. 8 Eq. 7 Eq. 7
n2 C 2,3 C 2,4 C 2,5 C 2,6 C 2,7
execution of 𝑛 1 may not cover the input tensor of 𝑛 2 . In such
cases, it becomes necessary to redistribute the missing part
of tensor from other devices [51, 67]. Figure 6. Architecture and computation graph of a trans-
For each dimension 𝑋 of the tensor passing from 𝑛 1 to former block in popular LLMs. The dynamic programming
𝑛 2 , there exist two DSIs, 𝐼𝑋1 , 𝐼𝑋2 , associated with 𝑛 1 and 𝑛 2 process within one of the segment is illustrated.
respectively. Note that 𝐼𝑋1 , 𝐼𝑋2 are determined by P1, P2 as
shown in Alg.1. Let 𝑠𝑋1 , 𝑠𝑋2 be the lengths of each segment of specified with partition strategy P𝑖 is:
dimension 𝑋 of 𝑛 1, 𝑛 2 after partitioning. We can determine ∑︁ ∑︁
the part of dimension 𝑋 held by each device D at the last 𝐶= 𝑖𝑛𝑡𝑟𝑎𝐶 (𝑛𝑖 , P𝑖 ) + 𝑖𝑛𝑡𝑒𝑟𝐶 (𝑛𝑖 , 𝑛 𝑗 , P𝑖 , P 𝑗 )
and first temporal steps of 𝑛 1 and 𝑛 2 respectively: 𝑛𝑖 ∈𝑁 (𝑛𝑖 ,𝑛 𝑗 ) ∈𝐸
(10)
𝑆𝑋1 (D) = [𝑠𝑋1 𝐼𝑋1 (D, 𝑡 = −1) : 𝑠𝑋1 𝐼𝑋1 (D, 𝑡 = −1) + 𝑠𝑋1 ] The optimization goal is to find the P𝑖 for each 𝑛𝑖 such
𝑆𝑋2 (D) = [𝑠𝑋2 𝐼𝑋2 (D, 𝑡 = 0) : 𝑠𝑋2 𝐼𝑋2 (D, 𝑡 = 0) + 𝑠𝑋2 ] (8) that they collectively minimize the overall cost 𝐶.

By evaluating DSIs for all dimensions of the tensor, the 5 OPTIMIZATION ALGORITHM
intersection lengths in each dimension can be computed.
PrimePar establishes a comprehensive tensor partition space,
The product of these intersection lengths is the size of the
which contains better solutions to release more system per-
part of 𝑛 1 output tensor computed by device D, which also
formance. However it is not trivial to find optimal partition
serves as part of device D’s input tensor for computing 𝑛 2 .
strategy, given that our partition space is extended by adding
Consequently, the total redistribution communication traffic
temporal dimension.
of all devices when running from 𝑛 1 to 𝑛 2 during forward
To optimize the tensor partition strategy for a model, it is
computation is:
necessary to find the optimal strategy for each operator in
the model’s computation graph such that they collectively
∑︁ Ö
(𝑉 − |𝑆𝑋1 (D) ∩ 𝑆𝑋2 (D)|) (9)
D 𝑋
achieve the minimal cost. This forms a search space which
expands exponentially as the number of operators increases,
where 𝑉 is the partitioned size of 𝑛 2 input tensor on each with the base being the size of operator partition space.
device and || represent taking the length of dimension slice. Previous optimization algorithm applicable to transformer
The redistribution traffic from 𝑛 2 to 𝑛 1 during backward models formalized the problem as an integer linear program-
computation can be computed similarly. We model the inter- ming (ILP) [67]. However, due to the fact that ILP is NP-hard,
operator communication latency between 𝑛 1 and 𝑛 2 as a this optimization algorithm scales poorly. In this section, we
linear function of the sum of forward and backward redis- propose an optimization algorithm to efficiently search for
tribution communication traffic. The linear coefficients are the optimal tensor partition strategy for transformer models,
obtained by profiling real system latency. Following this which minimize the sum of intra-operator (node) and inter-
method, we establish the inter-operator cost function, de- operator (edge) costs according to the cost model established
noted as 𝑖𝑛𝑡𝑒𝑟𝐶 (𝑛 1, 𝑛 2, P1, P2 ). in previous section.

4.3 Overall Cost 5.1 Segmented Dynamic Programming

Consider the computation graph 𝐺 =< 𝑁 , 𝐸 > of a model. Dynamic programming that defines the sub-problem as find-
Based on the intra-operator and inter-operator cost formula- ing optimal solution for sub-models can address the expo-
tions, the overall cost of the whole model with each node 𝑛𝑖 nential complexity with respect to the number of operators.

808
PrimePar: Efficient Spatial-temporal Tensor Partitioning for Large Transformer Model Training ASPLOS ’24, April 27-May 1, 2024, La Jolla, CA, USA

However, due to the non-linear model structure of trans- 𝑀𝑜𝑑𝑒𝑙 2,7 have a common node 𝑛 2 whose cost needs to be
former models, it is challenging to formalize tractable Bell- subtracted.
man equations. Thus we propose segmented dynamic pro-
𝐶 0,2 (𝑝 0, 𝑝 2 ) + 𝐶 2,7 (𝑝 2, 𝑝 7 )+

gramming that divides model into segments and carry out 𝐶 0,7 (𝑝 0, 𝑝 7 ) = min (13)
𝑝2 𝑒 0,7 (𝑝 0, 𝑝 7 ) − 𝑛 2 (𝑝 2 )
dynamic programming within each segment.
Consider two nodes 𝑛𝑖 , 𝑛 𝑗 in the computation graph (𝑖 < Furthermore, the optimal sub-structure of a whole Trans-
𝑗), we define the sub-model 𝑀𝑜𝑑𝑒𝑙𝑖,𝑗 consisting of nodes former layer can be obtained by merging 𝐶 0,7, 𝐶 7,12 .
𝑛𝑖 , 𝑛𝑖+1, ...𝑛 𝑗 (listed in increasing topological order of com- 𝐶 0,12 (𝑝 0, 𝑝 12 ) = min{𝐶 0,7 (𝑝 0, 𝑝 7 ) + 𝐶 7,12 (𝑝 7, 𝑝 12 ) − 𝑛 7 (𝑝 7 )}
𝑝7
putation graph) and edges whose source and destination
(14)
nodes lie within 𝑛𝑖 , 𝑛𝑖+1, ...𝑛 𝑗 . Define the optimal cost of sub-
model 𝑀𝑜𝑑𝑒𝑙𝑖,𝑗 when 𝑛𝑖 , 𝑛 𝑗 are under partition states 𝑝𝑖 , 𝑝 𝑗 Due to the stacking of identical transformer layers in trans-
as 𝐶𝑖,𝑗 (𝑝𝑖 , 𝑝 𝑗 ), which is the optimal sub-structure in our dy- former models, it is sufficient to compute 𝐶 0,12 once. In Fig.6,
namic programming formulation. 𝑛 0, 𝑛 12 are the last operator of previous layer and current
Assumption 1. For all edges that connects 𝑛 𝑗 and belongs layer respectively, meaning that optimal sub-structures of
to 𝑀𝑜𝑑𝑒𝑙𝑖,𝑗+1 , the only one that is not contained in 𝑀𝑜𝑑𝑒𝑙𝑖,𝑗 neighboring layers actually has one node in common and can
is 𝑒 𝑗,𝑗+1 . be merged like Eq.14. With 𝑙𝑜𝑔(#𝑙𝑎𝑦𝑒𝑟𝑠) steps of recursive
Assumption 2. Among all nodes in 𝑀𝑜𝑑𝑒𝑙𝑖,𝑗 only 𝑛𝑖 and 𝑛 𝑗 merging, we can compute optimal sub-structure for 2, 4, 8
can connect with 𝑛 𝑗+1 . layers and so on.
Under these two assumptions we can derive the Bellman
equation between optimal cost of 𝑀𝑜𝑑𝑒𝑙𝑖,𝑗 and 𝑀𝑜𝑑𝑒𝑙𝑖,𝑗+1 5.2 Optimality Proof
5.2.1 Correctness of Bellman equation. We prove the
𝐶𝑖,𝑗+1 (𝑝𝑖 , 𝑝 𝑗+1 ) = correctness of Eq.12 by induction where the induction hy-
min{𝐶𝑖,𝑗 (𝑝𝑖 , 𝑝 𝑗 ) + 𝑛 𝑗+1 (𝑝 𝑗+1 ) + 𝑒 𝑗,𝑗+1 (𝑝 𝑗 , 𝑝 𝑗+1 )} (11) pothesis is that 𝐶𝑖,𝑗 (𝑝𝑖 , 𝑝 𝑗 ) is the minimal cost for 𝑀𝑜𝑑𝑒𝑙𝑖,𝑗 .
𝑝𝑗
The proof for Eq.11 is similar and omitted. Based on as-
or sumptions 1,2 it can be inferred that 𝑀𝑜𝑑𝑒𝑙𝑖,𝑗+1 differs from
𝐶𝑖,𝑗 (𝑝𝑖 , 𝑝 𝑗 ) + 𝑛 𝑗+1 (𝑝 𝑗+1 ) + 𝑒 𝑗,𝑗+1 (𝑝 𝑗 , 𝑝 𝑗+1 ) 𝑀𝑜𝑑𝑒𝑙𝑖,𝑗 by the addition of a single node 𝑛 𝑗+1 and two edges

min (12) 𝑒 𝑗,𝑗+1, 𝑒𝑖,𝑗+1 . For simplicity, we will abbreviate 𝑛 for 𝑛 𝑗+1 , 𝑒
𝑝𝑗 + 𝑒𝑖,𝑗+1 (𝑝𝑖 , 𝑝 𝑗+1 )
for 𝑒 𝑗,𝑗+1 and 𝑒 ′ for 𝑒𝑖,𝑗+1 in the proof. Suppose there ex-
ists some 𝐶𝑖,𝑗+1 ∗ (𝑝𝑖 , 𝑝 𝑗+1 ) for 𝑀𝑜𝑑𝑒𝑙𝑖,𝑗+1 which is smaller than
where 𝑛 𝑗+1 (𝑝 𝑗+1 ) represents the intra-operator cost of node
𝑛 𝑗+1 under partition state 𝑝 𝑗+1 and 𝑒 𝑗,𝑗+1 (𝑝 𝑗 , 𝑝 𝑗+1 ) represents ∗
𝐶𝑖,𝑗+1 (𝑝𝑖 , 𝑝 𝑗+1 ) and the partition state of 𝑛 𝑗 in 𝐶𝑖,𝑗+1 (𝑝𝑖 , 𝑝 𝑗+1 )
the inter-operator cost between 𝑛 𝑗 and 𝑛 𝑗+1 when they are ∗
is 𝑝 𝑗 . It is equivalent to say
under 𝑝 𝑗 and 𝑝 𝑗+1 respectively. Eq.12 is for the case when ∗
𝐶𝑖,𝑗 (𝑝𝑖 , 𝑝 ∗𝑗 ) + 𝑛(𝑝 𝑗+1 ) + 𝑒 (𝑝 ∗𝑗 , 𝑝 𝑗+1 ) + 𝑒 ′ (𝑝𝑖 , 𝑝 𝑗+1 ) <
both 𝑛𝑖 , 𝑛 𝑗 connect with 𝑛 𝑗+1 . It is desired that the dynamic
programming for the whole model should be accomplished min{𝐶𝑖,𝑗 (𝑝𝑖 , 𝑝 𝑗 ) + 𝑛(𝑝 𝑗+1 ) + 𝑒 (𝑝 𝑗 , 𝑝 𝑗+1 ) + 𝑒 ′ (𝑝𝑖 , 𝑝 𝑗+1 )}
𝑝𝑗
with only these two Bellman equations.
However, it is not possible to straightforwardly iterate Since the R.H.S takes the minimal for all possible 𝑝 𝑗 , it is
from the first operator of transformer to the last with only obviously no greater than fixing 𝑝 𝑗 to 𝑝 ∗𝑗 , thus
∗
Eq.11, 12. As illustrated in Fig.6, when iterating from 𝑀𝑜𝑑𝑒𝑙 0,4 𝐶𝑖,𝑗 (𝑝𝑖 , 𝑝 ∗𝑗 ) + 𝑛(𝑝 𝑗+1 ) + 𝑒 (𝑝 ∗𝑗 , 𝑝 𝑗+1 ) + 𝑒 ′ (𝑝𝑖 , 𝑝 𝑗+1 ) <
to 𝑀𝑜𝑑𝑒𝑙 0,5 , 𝑛 2 in 𝑀𝑜𝑑𝑒𝑙 0,4 also connects with 𝑛 5 , thus As-
𝐶𝑖,𝑗 (𝑝𝑖 , 𝑝 ∗𝑗 ) + 𝑛(𝑝 𝑗+1 ) + 𝑒 (𝑝 ∗𝑗 , 𝑝 𝑗+1 ) + 𝑒 ′ (𝑝𝑖 , 𝑝 𝑗+1 )
sumption 2 is violated. Similar violation also happens when
∗ (𝑝 , 𝑝 ∗ ) < 𝐶 (𝑝 , 𝑝 ∗ ) contradicting the in-
which yields 𝐶𝑖,𝑗
iterating from 𝑀𝑜𝑑𝑒𝑙 0,11 to 𝑀𝑜𝑑𝑒𝑙 0,12 . 𝑖 𝑗 𝑖,𝑗 𝑖 𝑗
With segmented dynamic programming, we can avoid duction hypothesis.
these violations. Since 𝑛 0, 𝑛 2, 𝑛 7 have extended edge whose
5.2.2 Correctness of Merging. We also only prove the
destination is not the subsequent node, not regarding them
correctness of Eq.13 because the proof for Eq.14 is similar.
as starting node of iteration will inevitably violate Assump- ∗ which is better than 𝐶
Suppose there exists some 𝐶 0,7 0,7 and
tion 2 when iterating to the destination node of their ex- ∗ is 𝑝 ∗ . It is equivalent to say
the partition state of 𝑛 2 in 𝐶 0,7 2
tended edge. Thus we apply dynamic programming within
∗
segments 𝑀𝑜𝑑𝑒𝑙 0,2 , 𝑀𝑜𝑑𝑒𝑙 2,7 and 𝑀𝑜𝑑𝑒𝑙 7,12 separately, where 𝐶 0,2 (𝑝 0, 𝑝 2∗ ) + 𝐶 2,7
∗
(𝑝 2∗, 𝑝 7 ) + 𝑒 0,7 (𝑝 0, 𝑝 7 ) − 𝑛 2 (𝑝 2∗ ) <
Assumptions 1,2 are never violated. In this way, optimal sub- min{𝐶 0,2 (𝑝 0, 𝑝 2 ) + 𝐶 2,7 (𝑝 2, 𝑝 7 ) + 𝑒 0,7 (𝑝 0, 𝑝 7 ) − 𝑛 2 (𝑝 2 )}
𝑝2
structures 𝐶 0,2, 𝐶 2,7, 𝐶 7,12 can be computed with Eq.11, 12.
With 𝐶 0,2, 𝐶 2,7, 𝐶 7,12 computed, optimal sub-structure for Similar to section 5.2.1, fixing 𝑝 2 to 𝑝 2∗ in R.H.S
larger sub-models can be obtained by merging. Fig.6 shows ∗
𝐶 0,2 (𝑝 0, 𝑝 2∗ ) + 𝐶 2,7
∗
(𝑝 2∗, 𝑝 7 ) + 𝑒 0,7 (𝑝 0, 𝑝 7 ) − 𝑛 2 (𝑝 2∗ ) <
that 𝑒 0,7 is not contained in 𝑀𝑜𝑑𝑒𝑙 0,2 nor 𝑀𝑜𝑑𝑒𝑙 2,7 , whose cost
needs to be added when merging 𝐶 0,2, 𝐶 2,7 . Also, 𝑀𝑜𝑑𝑒𝑙 0,2 and 𝐶 0,2 (𝑝 0, 𝑝 2∗ ) + 𝐶 2,7 (𝑝 2∗, 𝑝 7 ) + 𝑒 0,7 (𝑝 0, 𝑝 7 ) − 𝑛 2 (𝑝 2∗ )

809
ASPLOS ’24, April 27-May 1, 2024, La Jolla, CA, USA Haoran Wang, Lei Wang, Haobo Xu, Ying Wang, Yuming Li, and Yinhe Han

Megatron Alpa PrimePar

1.8
Normalized throughput

1.6

1.4

1.2

1.0

0.8
OPT OPT llama llama bloom bloom OPT OPT llama llama bloom bloom OPT OPT llama llama bloom bloom OPT OPT llama llama bloom bloom
6.7B 175B 7B 70B 7B1 176B 6.7B 175B 7B 70B 7B1 176B 6.7B 175B 7B 70B 7B1 176B 6.7B 175B 7B 70B 7B1 176B
4 GPUs 8 GPUs 16 GPUs 32 GPUs

Figure 7. Normalized training throughputs of Megatron, Alpa and PrimePar. (The downward pillar indicates that the model
structure is not supported by the current version of the framework.)

∗ (𝑝 , 𝑝 ∗ ) + 𝐶 ∗ (𝑝 ∗, 𝑝 ) < 𝐶 (𝑝 , 𝑝 ∗ ) + 𝐶 (𝑝 ∗, 𝑝 )
Thus 𝐶 0,2 6.1 Performance Analysis
0 2 2,7 2 7 0,2 0 2 2,7 2 7
contradicting the optimality of 𝐶 0,2, 𝐶 2,7 , which is proved in To assess the efficacy of PrimePar tensor partitioning, we
section 5.2.1. evaluate the training throughput of transformer models scal-
ing to 4, 8, 16, 32 GPUs by different tensor partition strategies.
5.3 Generality and Complexity
Here, we refrain from utilizing pipeline parallelism to con-
This optimization algorithm is applicable to most trans- trol variables. When evaluating Megatron-LM on 𝑛 GPUs,
former models, with only the segmentation different from we enumerate all possible data parallelism (partition batch
model to model. The complexity is 𝑂 (𝑃 3 ), where 𝑃 is the size dimension) size 𝑑 and employ Megatron-LM’s model par-
of operator partition space. When optimizing for system with allelism (partition head/row/column dimensions) with size
32 devices, modern CPU can solve it in seconds. Moreover, 𝑛
𝑑 . We select the configuration that exhibits the best perfor-
the main computation comes from Eq.11-14, which are paral- mance of Megatron-LM to compare with PrimePar. For Alpa,
lelizable and can be implemented as CUDA kernels to enable we follow a similar process, where the partition strategy for
fast optimization for potentially larger parallel systems. non-batch dimensions are determined by the model paral-
lelism strategy searched by Alpa on our machine. Please note
6 EVALUATION that PrimePar incorporates partitioning all dimensions to its
Implementation. We implement PrimePar with PyTorch search space, including batch dimension, thus the enumera-
[41]. Specified with searched optimal tensor partition strat- tion of data parallelism size is unnecessary for PrimePar.
egy or any other strategies, it can automatically deploy mod- Fig.7 illustrates the training throughput of 6 models un-
els to multiple GPUs. We utilize PyTorch distributed com- der 4 parallelism scales. In all testcases, PrimePar achieves
munication package with NCCL [38] backend to implement better throughput than Megatron-LM and Alpa. Note that
collective and inter-operator communications. For proposed Megatron-LM and Alpa demonstrate close performance as
novel tensor partition primitive, we implement in C++ code they are both state-of-the-art within conventional tensor
utilizing CUDA and CUDA-aware MPI APIs [39, 40], which partition space. PrimePar achieves 1.16 - 1.20 × throughput
can be invoked from PyTorch code as a C++ extension. over Megatron-LM in models with parameter scales around
Environment and models. We evaluate PrimePar on a 7B. For large models exceeding 100B, 1.11 - 1.68 × through-
cluster of 8 nodes where each node consists of 4 NVIDIA put can be achieved. When scaling to 32 GPUs, the geo-mean
V100-SXM2 32GB GPUs and Intel Xeon Gold 5218 32-core speedup across benchmarks is 1.30 ×. It can be observed that
CPU. The GPUs within each node is connected via 300 GB/s the speedup increases as the number of GPUs grow, and
NVLink while nodes are connected via 100 GB/s InfiniBand. large models get significant promotion when scaling to 16,
We evaluate the training workloads of 6 sets of popular 32 GPUs.
transformer-based language models, including OPT 6.7B,
175B, Llama2 7B, 70B and BLOOM 7B1, 176B [55, 60, 66].
Baseline and metrics. We choose Megatron-LM [37, 49] 6.2 Peak Memomry Occupation Analysis
as the baseline. We also compare with previous state-of-the- We evaluate the effect of memory saving of PrimePar tensor
art automated parallel training framework Alpa [67]. Since partitioning. Under the same tensor partition configurations
PrimePar parallelism rigorously preserves the mathematical that achieve the throughputs illustrated in Fig.7, we pro-
semantics of original training, we focus on evaluating the file the peak memory occupation during the entire training.
training throughput and memory occupancy. Due to the Single Program Multiple Data (SPMD) nature of

810
PrimePar: Efficient Spatial-temporal Tensor Partitioning for Large Transformer Model Training ASPLOS ’24, April 27-May 1, 2024, La Jolla, CA, USA

Megatron Alpa PrimePar

Normalized memory occupancy

1.0

0.9

0.8

0.7

0.6
OPT OPT llama llama bloom bloom OPT OPT llama llama bloom bloom OPT OPT llama llama bloom bloom OPT OPT llama llama bloom bloom
6.7B 175B 7B 70B 7B1 176B 6.7B 175B 7B 70B 7B1 176B 6.7B 175B 7B 70B 7B1 176B 6.7B 175B 7B 70B 7B1 176B
4 GPUs 8 GPUs 16 GPUs 32 GPUs

Figure 8. Normalized peak memory occupancy during training of Megatron, Alpa and PrimePar.

Megatron-LM, Alpa and PrimePar, the memory occupation (𝑑 2, 𝑑 3 ). This leads to all-reduce of output tensor 𝑂 within
is consistent across different GPUs, thus it is sufficient to each group of (0, 1, 2, 3) and (4, 5, 6, 7), which means intra-
profile the memory of one GPU for evaluation. node all-reduce. Since fc2.P contains a partition along di-
As shown in Fig.8, PrimePar presents lower peak memory mension 𝐵, the size of tensor 𝑂 for all-reduce is half of the
occupation in all testcases. PrimePar consumes around 90% original size.
of the memory footprint compared to Megatron-LM when Kernel ○ 1 of PrimePar arises from the outermost partition
training models with a scale around 7B. It is worth highlight- of fc2.P where dimension 𝑁 is partitioned once and dis-
ing that PrimePar demonstrates notable memory saving for tributed with group indicator (𝑑 1 ). This induces all-reduce
large models. Specifically when training BLOOM 176B with of output tensor 𝑂 within each group of (0, 4), (1, 5), (2, 6),
parallelism size of 16 or 32, PrimePar consumes only 68% (3, 7) via inter-node communication. Since fc2.P contains
of the memory footprint compared to Megatron-LM. This a 𝑃2×2 , the size of tensor 𝑂 for all-reduce is 1/4 of the orig-
memory saving originates from reduced tensor replication inal size. As for 𝑃2×2 , it induces ring point-to-point com-
when applying the novel tensor partition. munications which happen in groups with group indicator
(𝑑 2, 𝑑 3 ). Thus ring communications happen within groups
6.3 Ablation Study of (0, 1, 2, 3) and (4, 5, 6, 7). It can be seen that these ring
We explain the source of effectiveness of PrimePar by com- communications have small latency and can be fully over-
paring details of latency breakdown and partition strategies lapped with computation kernels. In this example, PrimePar
with Megatron-LM. As shown in the left part of Fig.9, the demonstrates minimal communication overhead within each
major speedup originates from the reduction in collective group of (0, 1, 2, 3) and (4, 5, 6, 7) attributed to the utilization
communication latency, where PrimePar consumes 19.9% of 𝑃2×2 . When scaling to 8 GPUs, PrimePar entails mild col-
- 62.2% the collective communication latency compared to lective communication involving only 1/4 the size of input
Megatron-LM. Also, the ring point-to-point communication and output tensor of the MLP block due to the fact that 𝑃2×2
originating from the novel partition introduced in PrimePar partitions both input and output tensors.
has short latency and can be fully overlapped with computa- The primary source of speedup of PrimePar is the in-
tion. Note that Megatron-LM and PrimePar share roughly troduction of novel partition and its appropriate position
the same computation latency, which means PrimePar does in the partition sequence searched by our optimization al-
not trade computation efficiency for communication effi- gorithm. PrimePar strategies successfully trade expensive
ciency. The right part of Fig.9 shows the partition strategies collective communication for point-to-point communication
and corresponding kernel execution timelines for OPT 175B that is relatively cheap and convenient to be overlapped with
MLP block when parallelizing to 8 GPUs. Note that we only computation.
demonstrate the timeline of one device and this shows the
system latency due to the SPMD nature. 6.4 Impact on 3D Parallelism
We elucidate the correspondence between partition strate- Since 3D parallelism is the prevailing training technique
gies and collective communication kernels. Here we number for large transformer models, it is important to know the
the 8 GPUs from 0 to 7 (D = (𝑑 1, 𝑑 2, 𝑑 3 )), where GPUs 0 to improvement that PrimePar brings to 3D parallelism. We
3 constitute one node and GPUs 4 to 7 constitute another use the notation (𝑝, 𝑑, 𝑚) to represent the parallelism sizes
node. For example kernel ○ 1 of Megatron-LM arises from of pipeline, data and model parallelism respectively. We fit
the innermost two partitions of fc2.P, where dimension with pipeline parallelism in the same way as introduced in
𝑁 is partitioned twice and distributed with group indicator Megatron-LM. For example when (𝑝, 𝑑, 𝑚) = (4, 1, 8), 32

811
ASPLOS ’24, April 27-May 1, 2024, La Jolla, CA, USA Haoran Wang, Lei Wang, Haobo Xu, Ying Wang, Yuming Li, and Yinhe Han

collective Megatron OPT 175B BSZ=8 fc1.𝒫 ={B,K,K} relu.𝒫 ={B,D,D} fc2.𝒫 ={B,N,N}
computation
communication
167 168 165 166 162 164
ring p2p redistribution fc1.F fc2.F fc2.B fc2.G fc1.B fc1.G
communication communication
1.0 ① ② ③ ④
24 364 41 377
Time (ms)
62.2%

0.8

59.1%
① fc2.𝒫 ={B,N,N}, Intra-node reduce(size(O)/2) ② fc2. 𝒫 ={B,N,N}, Inter-node reduce(size(W)/4)
51.2%
34.6%

23.0%
0.6 ③ fc1.𝒫 ={B,K,K}, Intra-node reduce(size(I)/2) ④ fc1.𝒫 ={B,K,K}, Inter-node reduce(size(W)/4)

36.5%
19.9%
PrimePar OPT 175B BSZ=8 fc1.𝒫 ={K,P2x2} relu.𝒫 ={D,S,D} fc2.𝒫 ={N,P2x2}

22.2%
0.4
81 81 82 82 80 88 81 88 85 82 86 82

0.2 fc1.F fc2.F fc2.B fc2.G fc1.B fc1.G

① ②
0 157 128
Time (ms)
B=8 B=16 B=8 B=16 B=8 B=16 B=8 B=16
OPT 175B llama 70B OPT 175B llama 70B
① fc2.𝒫 ={N,P 2x2}, Inter-node reduce(size(O)/4) ② fc2.𝒫 ={K,P2x2}, Inter-node reduce(size(I)/4)
8 GPUs 16 GPUs

Figure 9. Left: normalized latency breakdown of MLP blocks with configurations of batch size 8, 16 and scaling to 8, 16 GPUs.
Each pair of pillars represent Megatron-LM and PrimePar respectively, where collective communication latency reductions
are labeled. Right: detailed partition sequence P of operators in OPT 175B MLP block of Megatron-LM and PrimePar and
corresponding kernel execution timelines. Capital letters in P represent dimensions being partitioned, where notions for fc1
and fc2 are the same as Eq.1 and B, S, D represent batch, sequence, hidden dimensions for relu operator. We use F, B, G to label
the forward, backward, gradient computation phases. The correspondence between collective communication kernels and
partitions are listed.

devices are first separated into 4 groups, with each group Table 2. Optimization time in milliseconds for OPT, Llama2
consisting 8 devices computing a pipeline stage. 8-way model and Bloom model structures for parallelism sizes 4, 8, 16, 32.
parallelism is further implemented within each group. We
evaluate training each model with 32 GPUs using 3D par- Models Parallelism size
allelism of all possible (𝑝, 𝑑, 𝑚) configurations (𝑝 > 1). Un- 4 8 16 32
der each (𝑝, 𝑑, 𝑚) configuration, we evaluate the training OPT 85.3 86.5 170.9 5,357.3
throughputs of Megatron-LM and PrimePar by applying
Llama2 86.5 88.8 185.9 6,070.3
their model parallel strategies with size 𝑚 respectively, where
the pipeline and data parallelism configurations are con- Bloom 85.4 80.2 165.8 4,153.0
trolled to be consistent. In order to control 𝑑, we disable
partitioning batch dimension in PrimePar and search for op-
timal partitioning non-batch dimensions strategy to generate 6.5 Optimization Time
model parallelism with size 𝑚. We report the time consumption of the proposed segmented
Fig.10 demonstrates that 3D parallelism incorporating dynamic programming algorithm. Our optimization algo-
PrimePar consistently achieves superior throughput across rithm runs in a single thread of Intel Xeon Gold 5218 (2.3GHz)
different models and (𝑝, 𝑑, 𝑚) configurations. For models CPU. As Table.2 shows, searching the optimal solution for
around 7B, both Megatron-LM and PrimePar exhibits high- OPT, Llama2 and Bloom models only consumes seconds even
est throughput under the configuration (𝑝 = 2, 𝑑 = 4, 𝑚 = 4), if the parallelism size scales to 32.
where PrimePar slightly outperforms Megatron-LM. In the
case of OPT 175B, Llama2 70B and Bloom 176B, PrimePar 7 DISCUSSION
performs significantly better than Megatron-LM, achieving PrimePar’s design supports modeling and optimizing for
1.46, 1.27 and 1.40 × highest throughput than Megatron-LM general computational environments with various intercon-
achieves. Different from models around 7B, large models nection topologies. Since PrimePar’s novel partition primi-
exceeding 100B achieve peak performance under the con- tive only induces ring communication, PrimePar is expected
figuration (𝑝 = 2, 𝑑 = 1, 𝑚 = 16), where model parallelism to achieve more efficient scaling in interconnection topolo-
is preferred over data parallelism. This is attributed to the gies with torus. For example, TPU v4 [22] chips are connected
expensive all-reduce communication induced by data paral- with twistable tori. It is expected that this interconnection
lelism when the weight scale grows large. is suitable for PrimePar’s novel partition because the tori
can be configured to be horizontal, vertical and parallel to
diagonal to cater to PrimePar’s ring communications such

812
PrimePar: Efficient Spatial-temporal Tensor Partitioning for Large Transformer Model Training ASPLOS ’24, April 27-May 1, 2024, La Jolla, CA, USA

OPT 6.7B OPT 175B Llama2 7B

2.0 1.6 2.0
PrimePar

Throughput normalized

Throughput normalized
Throughput normalized

to Megatron-(2,1,16)

to Megatron-(2,1,16)
to Megatron-(2,1,16)

1.8 1.4 1.8

Megatron
1.6 1.2 1.6

1.4 1.0 1.4

1.2 0.8 1.2

PrimePar PrimePar
1.0 0.6 1.0
Megatron Megatron
0.8 0.4 0.8
(2,1,16) (2,2,8) (2,4,4) (4,1,8) (4,2,4) (8,1,4) (2,1,16) (2,2,8) (2,4,4) (4,1,8) (4,2,4) (8,1,4) (2,1,16) (2,2,8) (2,4,4) (4,1,8) (4,2,4) (8,1,4)
(p,d,m) configurations (p,d,m) configurations (p,d,m) configurations

Llama2 70B Bloom 7B1 Bloom 176B

1.4 2.0 1.6
PrimePar PrimePar

Throughput normalized
Throughput normalized

Throughput normalized

to Megatron-(2,1,16)
to Megatron-(2,1,16)
1.8 1.4
to Megatron-(2,2,8)

1.2 Megatron Megatron

1.6 1.2

1.0 1.4 1.0

1.2 0.8
0.8 PrimePar
1.0 0.6
Megatron
0.6 0.8 0.4
(2,1,16) (2,2,8) (2,4,4) (4,1,8) (4,2,4) (8,1,4) (2,1,16) (2,2,8) (2,4,4) (4,1,8) (4,2,4) (8,1,4) (2,1,16) (2,2,8) (2,4,4) (4,1,8) (4,2,4) (8,1,4)
(p,d,m) configurations (p,d,m) configurations (p,d,m) configurations

Figure 10. Normalized 3D Parallelism throughput of Megatron-LM and PrimePar with different (𝑝, 𝑑, 𝑚) configurations,
parallelizing to 32 GPUs.

that bandwidth are fully utilized. PrimePar is expected to replication problem and proposed splitting optimizer states,
achieve linear scaling on hardware platforms with balanced in the cost of collective communication reduce-scatter and
bandwidth and computation resources, as long as the ring all-gather. These parallel training systems overlooked the
communication latency per step is no longer than compu- temporal dimension in tensor partitioning (data, model paral-
tation latency. Moreover, it is possible to utilize PrimePar lelism), and can be further enhanced through the application
to guide the co-design of computation and interconnection of our proposed technique.
architecture.
9 CONCLUSION
8 RELATED WORKS
PrimePar introduces previously ignored temporal dimension
Tensor partition and automatic searching. Tensor par- into tensor partitioning, which provides abundant opportu-
titioning has been explored to enable parallel deployment nities for better utilizing hardware resources and improving
of models for acceleration and reducing memory overhead the performance of parallel training. We design a novel ten-
on individual device [15, 20, 21, 24, 63]. Optimizing tensor sor partition primitive and apply it to improve LLM training
partitioning with minimum overhead [1, 17, 19, 20, 23, 25, performance, which is a successful practice in unlocking
27, 31, 32, 34, 42, 47, 48] has been a highly anticipated topic. the potentials of spatial-temporal tensor partitioning. We
Recursive tensor partitioning has emerged as a critical for- provide a formalization which encodes the spatial-temporal
mulation for automating tensor partitioning [33, 51, 52, 58], distribution of sub-operators and exposes the essence of
which covers a vast tensor partition space and is optimization computation and communication, facilitating the design of
friendly. Nevertheless, as far as we know, no research has efficient spatial-temporal tensor partitioning primitives. Fur-
introduced the temporal dimension into tensor partitioning ther explorations into spatial-temporal tensor partition space
for parallel training. The potential benefits and challenges are worthwhile.
brought by temporal dimension has not been studied.
Transformer parallel training systems. 3D parallelism
has been extensively optimized to support large transformer ACKNOWLEDGMENTS
model training [4, 26, 28, 37, 45, 46, 49, 50, 61, 67]. Data paral- This work was supported in part by the National Natural
lelism [9–12, 16, 29, 59, 65], model parallelism [13, 21, 53, 57] Science Foundation of China (Grant No. 62025404, 62104229,
were applied to distribute input data or model parameter to 62104230, 61874124, 62222411), and in part by the Strategic
multiple devices, each offering their own trade-offs. Pipeline Priority Research Program of Chinese Academy of Sciences
parallelism [18, 30, 35, 36] were proposed to strip layers of (Grant No. XDB44030300, XDB44020300), Zhe jiang Lab un-
a model over multiple GPUs, where pipeline execution is derGrants 2021PC0AC01. The corresponding authors are
scheduled over micro-batches. ZeRO [45, 46] tackled tensor Haobo Xu and Ying Wang.

813
ASPLOS ’24, April 27-May 1, 2024, La Jolla, CA, USA Haoran Wang, Lei Wang, Haobo Xu, Ying Wang, Yuming Li, and Yinhe Han

References a gpu-specialized parameter server. In Proceedings of the eleventh

[1] David Bau, Induprakas Kodukula, Vladimir Kotlyar, Keshav Pingali, european conference on computer systems, pages 1–16, 2016.
and Paul Stodghill. Solving alignment using elementary linear algebra. [12] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin,
In Languages and Compilers for Parallel Computing: 7th International Mark Mao, Marc’aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang,
Workshop Ithaca, NY, USA, August 8–10, 1994 Proceedings 7, pages et al. Large scale distributed deep networks. Advances in neural
46–60. Springer, 1995. information processing systems, 25, 2012.
[2] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared [13] Nikoli Dryden, Naoya Maruyama, Tim Moon, Tom Benson, Marc Snir,
Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, and Brian Van Essen. Channel and filter parallelism for large-scale
Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, cnn training. In Proceedings of the International Conference for High
Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Performance Computing, Networking, Storage and Analysis, pages 1–20,
Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, 2019.
Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, [14] William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers:
Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Scaling to trillion parameter models with simple and efficient sparsity.
Sutskever, and Dario Amodei. Language models are few-shot learners, The Journal of Machine Learning Research, 23(1):5232–5270, 2022.
2020. [15] Mingyu Gao, Jing Pu, Xuan Yang, Mark Horowitz, and Christos
[3] Lynn Elliot Cannon. A cellular computer to implement the Kalman filter Kozyrakis. Tetris: Scalable and efficient neural network acceleration
algorithm. Montana State University, 1969. with 3d memory. In Proceedings of the Twenty-Second International
[4] Yang Cheng, Dan Li, Zhiyuan Guo, Binyao Jiang, Jinkun Geng, Wei Conference on Architectural Support for Programming Languages and
Bai, Jianping Wu, and Yongqiang Xiong. Accelerating end-to-end deep Operating Systems, pages 751–764, 2017.
learning workflow with codesign of data preprocessing and scheduling. [16] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz
IEEE Transactions on Parallel and Distributed Systems, 32(7):1802–1814, Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming
2020. He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv
[5] Trishul Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalya- preprint arXiv:1706.02677, 2017.
naraman. Project adam: Building an efficient and scalable deep learn- [17] Chua-Huang Huang and Ponnuswamy Sadayappan. Communication-
ing training system. In 11th USENIX symposium on operating systems free hyperplane partitioning of nested loops. Journal of Parallel and
design and implementation (OSDI 14), pages 571–582, 2014. Distributed Computing, 19(2):90–102, 1993.
[6] Jack Choquette, Edward Lee, Ronny Krashinsky, Vishnu Balan, and [18] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao
Brucek Khailany. 3.2 the a100 datacenter gpu and ampere architec- Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui
ture. In 2021 IEEE International Solid-State Circuits Conference (ISSCC), Wu, et al. Gpipe: Efficient training of giant neural networks using
volume 64, pages 48–50. IEEE, 2021. pipeline parallelism. Advances in neural information processing systems,
[7] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, 32, 2019.
Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, [19] David E Hudak and Santosh G Abraham. Compiler techniques for data
Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha partitioning of sequentially iterated parallel loops. ACM SIGARCH
Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Computer Architecture News, 18(3b):187–200, 1990.
Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben [20] Zhihao Jia, Sina Lin, Charles R Qi, and Alex Aiken. Exploring hidden
Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael dimensions in parallelizing convolutional neural networks. In ICML,
Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, pages 2279–2288, 2018.
Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, [21] Zhihao Jia, Matei Zaharia, and Alex Aiken. Beyond data and model
Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne parallelism for deep neural networks. Proceedings of Machine Learning
Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiri- and Systems, 1:1–13, 2019.
donov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, [22] Norm Jouppi, George Kurian, Sheng Li, Peter Ma, Rahul Nagarajan,
Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Lifeng Nai, Nishant Patil, Suvinay Subramanian, Andy Swing, Brian
Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Towles, et al. Tpu v4: An optically reconfigurable supercomputer
Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark for machine learning with hardware support for embeddings. In
Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Proceedings of the 50th Annual International Symposium on Computer
Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. Palm: Scaling Architecture, pages 1–14, 2023.
language modeling with pathways, 2022. [23] Y J Ju and H Dietz. Reduction of cache coherence overhead by compiler
[8] Adam Coates, Brody Huval, Tao Wang, David Wu, Bryan Catanzaro, data layout and loop transformation. In Languages and Compilers for
and Ng Andrew. Deep learning with cots hpc systems. In Sanjoy Parallel Computing: Fourth International Workshop Santa Clara, Cali-
Dasgupta and David McAllester, editors, Proceedings of the 30th Inter- fornia, USA, August 7–9, 1991 Proceedings 4, pages 344–358. Springer,
national Conference on Machine Learning, volume 28 of Proceedings of 1992.
Machine Learning Research, pages 1337–1345, Atlanta, Georgia, USA, [24] Duckhwan Kim, Jaeha Kung, Sek Chai, Sudhakar Yalamanchili, and
17–19 Jun 2013. PMLR. Saibal Mukhopadhyay. Neurocube: A programmable digital neuromor-
[9] Valeriu Codreanu, Damian Podareanu, and Vikram Saletore. Scale out phic architecture with high-density 3d memory. In 2016 ACM/IEEE
for large minibatch sgd: Residual network training on imagenet-1k 43rd Annual International Symposium on Computer Architecture (ISCA),
with improved accuracy and reduced time to train. arXiv preprint pages 380–392, 2016.
arXiv:1711.04291, 2017. [25] Kathleen Knobe, Joan D Lukas, and Guy L Steele Jr. Data optimiza-
[10] Henggang Cui, James Cipar, Qirong Ho, Jin Kyu Kim, Seunghak Lee, tion: Allocation of arrays to reduce communication on simd machines.
Abhimanu Kumar, Jinliang Wei, Wei Dai, Gregory R Ganger, Phillip B Journal of parallel and Distributed Computing, 8(2):102–118, 1990.
Gibbons, et al. Exploiting bounded staleness to speed up big data [26] Vijay Anand Korthikanti, Jared Casper, Sangkug Lym, Lawrence
analytics. In 2014 USENIX Annual Technical Conference (USENIX ATC McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catan-
14), pages 37–48, 2014. zaro. Reducing activation recomputation in large transformer models.
[11] Henggang Cui, Hao Zhang, Gregory R Ganger, Phillip B Gibbons, and Proceedings of Machine Learning and Systems, 5, 2023.
Eric P Xing. Geeps: Scalable deep learning on distributed gpus with

814
PrimePar: Efficient Spatial-temporal Tensor Partitioning for Large Transformer Model Training ASPLOS ’24, April 27-May 1, 2024, La Jolla, CA, USA

[27] Alex Krizhevsky. One weird trick for parallelizing convolutional neural [43] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei,
networks. arXiv preprint arXiv:1404.5997, 2014. Ilya Sutskever, et al. Language models are unsupervised multitask
[28] Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, learners. OpenAI blog, 1(8):9, 2019.
Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and [44] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan
Zhifeng Chen. Gshard: Scaling giant models with conditional compu- Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring
tation and automatic sharding, 2020. the limits of transfer learning with a unified text-to-text transformer.
[29] Mu Li, David G Andersen, Jun Woo Park, Alexander J Smola, Amr The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
Ahmed, Vanja Josifovski, James Long, Eugene J Shekita, and Bor-Yiing [45] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He.
Su. Scaling distributed machine learning with the parameter server. Zero: Memory optimizations toward training trillion parameter models.
In 11th USENIX Symposium on operating systems design and implemen- In SC20: International Conference for High Performance Computing,
tation (OSDI 14), pages 583–598, 2014. Networking, Storage and Analysis, pages 1–16. IEEE, 2020.
[30] Zhuohan Li, Siyuan Zhuang, Shiyuan Guo, Danyang Zhuo, Hao Zhang, [46] Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, and
Dawn Song, and Ion Stoica. Terapipe: Token-level pipeline parallelism Yuxiong He. Zero-infinity: Breaking the gpu memory wall for extreme
for training large-scale language models. In International Conference scale deep learning. In Proceedings of the International Conference for
on Machine Learning, pages 6543–6552. PMLR, 2021. High Performance Computing, Networking, Storage and Analysis, pages
[31] Qingda Lu, Christophe Alias, Uday Bondhugula, Thomas Henretty, 1–14, 2021.
Sriram Krishnamoorthy, Jagannathan Ramanujam, Atanas Rountev, [47] J Ramanujam and P Sadayappan. Compile-time techniques for data
Ponnuswamy Sadayappan, Yongjian Chen, Haibo Lin, et al. Data layout distribution in distributed memory machines. IEEE Transactions on
transformation for enhancing data locality on nuca chip multiproces- parallel and distributed systems, 2(4):472–482, 1991.
sors. In 2009 18th International Conference on Parallel Architectures and [48] Jagannathan Ramanujam and Ponnuswamy Sadayappan. A method-
Compilation Techniques, pages 348–357. IEEE, 2009. ology for parallelizing programs for multicomputers and complex
[32] Wenyan Lu, Guihai Yan, Jiajun Li, Shijun Gong, Yinhe Han, and Xi- memory multiprocessors. In Proceedings of the 1989 ACM/IEEE confer-
aowei Li. Flexflow: A flexible dataflow accelerator architecture for ence on Supercomputing, pages 637–646, 1989.
convolutional neural networks. In 2017 IEEE International Symposium [49] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley,
on High Performance Computer Architecture (HPCA), pages 553–564. Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-
IEEE, 2017. billion parameter language models using model parallelism. arXiv
[33] Jiachen Mao, Zhongda Yang, Wei Wen, Chunpeng Wu, Linghao Song, preprint arXiv:1909.08053, 2019.
Kent W Nixon, Xiang Chen, Hai Li, and Yiran Chen. Mednn: A [50] Jaeyong Song, Jinkyu Yim, Jaewon Jung, Hongsun Jang, Hyung-Jin
distributed mobile system with enhanced partition and deployment Kim, Youngsok Kim, and Jinho Lee. Optimus-cc: Efficient large nlp
for large-scale dnns. In 2017 IEEE/ACM International Conference on model training with 3d parallelism aware communication compression.
Computer-Aided Design (ICCAD), pages 751–756. IEEE, 2017. In Proceedings of the 28th ACM International Conference on Architectural
[34] Igor Z Milosavljevic and Marwan A Jabri. Automatic array alignment Support for Programming Languages and Operating Systems, Volume 2,
in parallel matlab scripts. In Proceedings 13th International Parallel pages 560–573, 2023.
Processing Symposium and 10th Symposium on Parallel and Distributed [51] Linghao Song, Fan Chen, Youwei Zhuo, Xuehai Qian, Hai Li, and
Processing. IPPS/SPDP 1999, pages 285–289. IEEE, 1999. Yiran Chen. Accpar: Tensor partitioning for heterogeneous deep
[35] Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, learning accelerators. In 2020 IEEE International Symposium on High
Nikhil R Devanur, Gregory R Ganger, Phillip B Gibbons, and Matei Performance Computer Architecture (HPCA), pages 342–355. IEEE, 2020.
Zaharia. Pipedream: Generalized pipeline parallelism for dnn train- [52] Linghao Song, Jiachen Mao, Youwei Zhuo, Xuehai Qian, Hai Li, and
ing. In Proceedings of the 27th ACM Symposium on Operating Systems Yiran Chen. Hypar: Towards hybrid parallelism for deep learning
Principles, pages 1–15, 2019. accelerator array. In 2019 IEEE International Symposium on High Per-
[36] Deepak Narayanan, Amar Phanishayee, Kaiyu Shi, Xie Chen, and formance Computer Architecture (HPCA), pages 56–68. IEEE, 2019.
Matei Zaharia. Memory-efficient pipeline-parallel dnn training. In [53] Jakub M Tarnawski, Deepak Narayanan, and Amar Phanishayee. Piper:
International Conference on Machine Learning, pages 7937–7947. PMLR, Multidimensional planner for dnn parallelization. Advances in Neural
2021. Information Processing Systems, 34:24829–24840, 2021.
[37] Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGres- [54] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-
ley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric
Kashinkunti, Julie Bernauer, Bryan Catanzaro, et al. Efficient large- Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard
scale language model training on gpu clusters using megatron-lm. Grave, and Guillaume Lample. Llama: Open and efficient foundation
In Proceedings of the International Conference for High Performance language models, 2023.
Computing, Networking, Storage and Analysis, pages 1–15, 2021. [55] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Alma-
[38] NVIDIA. Nvidia collective communications library (nccl). https:// hairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal
developer.nvidia.com/nccl, 2018. Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton
[39] NVIDIA. Nvidia cuda. https://developer.nvidia.com/cuda-zone, 2023. Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes,
[40] OPEN-MPI. Cuda aware mpi. https://www.open-mpi.org/faq/ Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami,
?category=runcuda, 2023. Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan
[41] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Brad- Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann,
bury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut
Luca Antiga, et al. Pytorch: An imperative style, high-performance Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier
deep learning library. Advances in neural information processing sys- Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie,
tems, 32, 2019. Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi,
[42] Michael Philippsen. Automatic alignment of array data and processes Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian,
to reduce communication time on dmpps. In Proceedings of the fifth Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xi-
ACM SIGPLAN symposium on Principles and practice of parallel pro- ang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela
gramming, pages 156–165, 1995. Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert

815
ASPLOS ’24, April 27-May 1, 2024, La Jolla, CA, USA Haoran Wang, Lei Wang, Haobo Xu, Ying Wang, Yuming Li, and Yinhe Han

Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open founda- Pierre François Lavallée, Rémi Lacroix, Samyam Rajbhandari, Sanchit
tion and fine-tuned chat models, 2023. Gandhi, Shaden Smith, Stéphane Requena, Suraj Patil, Tim Dettmers,
[56] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Ahmed Baruwa, Amanpreet Singh, Anastasia Cheveleva, Anne-Laure
Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention Ligozat, Arjun Subramonian, Aurélie Névéol, Charles Lovering, Dan
is all you need. Advances in neural information processing systems, 30, Garrette, Deepak Tunuguntla, Ehud Reiter, Ekaterina Taktasheva, Eka-
2017. terina Voloshina, Eli Bogdanov, Genta Indra Winata, Hailey Schoelkopf,
[57] Minjie Wang, Chien-chin Huang, and Jinyang Li. Unifying data, model Jan-Christoph Kalo, Jekaterina Novikova, Jessica Zosa Forde, Jordan
and hybrid parallelism in deep learning via tensor tiling. arXiv preprint Clive, Jungo Kasai, Ken Kawamura, Liam Hazan, Marine Carpuat,
arXiv:1805.04170, 2018. Miruna Clinciu, Najoung Kim, Newton Cheng, Oleg Serikov, Omer
[58] Minjie Wang, Chien-chin Huang, and Jinyang Li. Supporting very large Antverg, Oskar van der Wal, Rui Zhang, Ruochen Zhang, Sebastian
models using automatic dataflow graph partitioning. In Proceedings of Gehrmann, Shachar Mirkin, Shani Pais, Tatiana Shavrina, Thomas
the Fourteenth EuroSys Conference 2019, pages 1–17, 2019. Scialom, Tian Yun, Tomasz Limisiewicz, Verena Rieser, Vitaly Pro-
[59] Jinliang Wei, Wei Dai, Aurick Qiao, Qirong Ho, Henggang Cui, Gre- tasov, Vladislav Mikhailov, Yada Pruksachatkun, Yonatan Belinkov,
gory R Ganger, Phillip B Gibbons, Garth A Gibson, and Eric P Xing. Zachary Bamberger, Zdeněk Kasner, Alice Rueda, Amanda Pestana,
Managed communication and consistency for fast data-parallel itera- Amir Feizpour, Ammar Khan, Amy Faranak, Ana Santos, Anthony
tive analytics. In Proceedings of the Sixth ACM Symposium on Cloud Hevia, Antigona Unldreaj, Arash Aghagol, Arezoo Abdollahi, Aycha
Computing, pages 381–394, 2015. Tammour, Azadeh HajiHosseini, Bahareh Behroozi, Benjamin Ajibade,
[60] BigScience Workshop, :, Teven Le Scao, Angela Fan, Christopher Akiki, Bharat Saxena, Carlos Muñoz Ferrandis, Daniel McDuff, Danish Con-
Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexan- tractor, David Lansky, Davis David, Douwe Kiela, Duong A. Nguyen,
dra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Edward Tan, Emi Baylor, Ezinwanne Ozoani, Fatima Mirza, Frankline
Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ononiwu, Habib Rezanejad, Hessie Jones, Indrani Bhattacharya, Irene
Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Solaiman, Irina Sedenko, Isar Nejadgholi, Jesse Passmore, Josh Seltzer,
Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Julio Bonis Sanz, Livia Dutra, Mairon Samagaio, Maraim Elbadri, Mar-
Bekman, Angelina McMillan-Major, Iz Beltagy, Huu Nguyen, Lucile got Mieskes, Marissa Gerchick, Martha Akinlolu, Michael McKenna,
Saulnier, Samson Tan, Pedro Ortiz Suarez, Victor Sanh, Hugo Lau- Mike Qiu, Muhammed Ghauri, Mykola Burynok, Nafis Abrar, Nazneen
rençon, Yacine Jernite, Julien Launay, Margaret Mitchell, Colin Raffel, Rajani, Nour Elkott, Nour Fahmy, Olanrewaju Samuel, Ran An, Rasmus
Aaron Gokaslan, Adi Simhi, Aitor Soroa, Alham Fikri Aji, Amit Al- Kromann, Ryan Hao, Samira Alizadeh, Sarmad Shubber, Silas Wang,
fassy, Anna Rogers, Ariel Kreisberg Nitzav, Canwen Xu, Chenghao Sourav Roy, Sylvain Viguier, Thanh Le, Tobi Oyebade, Trieu Le, Yoyo
Mou, Chris Emezue, Christopher Klamm, Colin Leong, Daniel van Yang, Zach Nguyen, Abhinav Ramesh Kashyap, Alfredo Palasciano,
Strien, David Ifeoluwa Adelani, Dragomir Radev, Eduardo González Alison Callahan, Anima Shukla, Antonio Miranda-Escalada, Ayush
Ponferrada, Efrat Levkovizh, Ethan Kim, Eyal Bar Natan, Francesco De Singh, Benjamin Beilharz, Bo Wang, Caio Brito, Chenxi Zhou, Chirag
Toni, Gérard Dupont, Germán Kruszewski, Giada Pistilli, Hady El- Jain, Chuxin Xu, Clémentine Fourrier, Daniel León Periñán, Daniel
sahar, Hamza Benyamina, Hieu Tran, Ian Yu, Idris Abdulmumin, Molano, Dian Yu, Enrique Manjavacas, Fabio Barth, Florian Fuhrimann,
Isaac Johnson, Itziar Gonzalez-Dios, Javier de la Rosa, Jenny Chim, Gabriel Altay, Giyaseddin Bayrak, Gully Burns, Helena U. Vrabec,
Jesse Dodge, Jian Zhu, Jonathan Chang, Jörg Frohberg, Joseph To- Imane Bello, Ishani Dash, Jihyun Kang, John Giorgi, Jonas Golde,
bing, Joydeep Bhattacharjee, Khalid Almubarak, Kimbo Chen, Kyle Jose David Posada, Karthik Rangasai Sivaraman, Lokesh Bulchandani,
Lo, Leandro Von Werra, Leon Weber, Long Phan, Loubna Ben allal, Lu Liu, Luisa Shinzato, Madeleine Hahn de Bykhovetz, Maiko Takeuchi,
Ludovic Tanguy, Manan Dey, Manuel Romero Muñoz, Maraim Ma- Marc Pàmies, Maria A Castillo, Marianna Nezhurina, Mario Sänger,
soud, María Grandury, Mario Šaško, Max Huang, Maximin Coavoux, Matthias Samwald, Michael Cullan, Michael Weinberg, Michiel De
Mayank Singh, Mike Tian-Jian Jiang, Minh Chien Vu, Mohammad A. Wolf, Mina Mihaljcic, Minna Liu, Moritz Freidank, Myungsun Kang,
Jauhar, Mustafa Ghaleb, Nishant Subramani, Nora Kassner, Nurulaqilla Natasha Seelam, Nathan Dahlberg, Nicholas Michio Broad, Nikolaus
Khamis, Olivier Nguyen, Omar Espejel, Ona de Gibert, Paulo Villegas, Muellner, Pascale Fung, Patrick Haller, Ramya Chandrasekhar, Re-
Peter Henderson, Pierre Colombo, Priscilla Amuok, Quentin Lhoest, nata Eisenberg, Robert Martin, Rodrigo Canalli, Rosaline Su, Ruisi
Rheza Harliman, Rishi Bommasani, Roberto Luis López, Rui Ribeiro, Su, Samuel Cahyawijaya, Samuele Garda, Shlok S Deshmukh, Shub-
Salomey Osei, Sampo Pyysalo, Sebastian Nagel, Shamik Bose, Shamsud- hanshu Mishra, Sid Kiblawi, Simon Ott, Sinee Sang-aroonsiri, Srishti
deen Hassan Muhammad, Shanya Sharma, Shayne Longpre, Somaieh Kumar, Stefan Schweter, Sushil Bharati, Tanmay Laud, Théo Gigant,
Nikpoor, Stanislav Silberberg, Suhas Pai, Sydney Zink, Tiago Tim- Tomoya Kainuma, Wojciech Kusa, Yanis Labrak, Yash Shailesh Bajaj,
poni Torrent, Timo Schick, Tristan Thrush, Valentin Danchev, Vas- Yash Venkatraman, Yifan Xu, Yingxin Xu, Yu Xu, Zhe Tan, Zhongli Xie,
silina Nikoulina, Veronika Laippala, Violette Lepercq, Vrinda Prabhu, Zifan Ye, Mathilde Bras, Younes Belkada, and Thomas Wolf. Bloom: A
Zaid Alyafeai, Zeerak Talat, Arun Raja, Benjamin Heinzerling, Chen- 176b-parameter open-access multilingual language model, 2023.
glei Si, Davut Emre Taşar, Elizabeth Salesky, Sabrina J. Mielke, Wil- [61] Yuanzhong Xu, HyoukJoong Lee, Dehao Chen, Blake Hechtman, Yan-
son Y. Lee, Abheesht Sharma, Andrea Santilli, Antoine Chaffin, Ar- ping Huang, Rahul Joshi, Maxim Krikun, Dmitry Lepikhin, Andy Ly,
naud Stiegler, Debajyoti Datta, Eliza Szczechla, Gunjan Chhablani, Marcello Maggioni, et al. Gspmd: General and scalable paralleliza-
Han Wang, Harshit Pandey, Hendrik Strobelt, Jason Alan Fries, Jos tion for ml computation graphs. arxiv e-prints, art. arXiv preprint
Rozen, Leo Gao, Lintang Sutawika, M Saiful Bari, Maged S. Al-shaibani, arXiv:2105.04663, 2021.
Matteo Manica, Nihal Nayak, Ryan Teehan, Samuel Albanie, Sheng [62] Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, and Yuji
Shen, Srulik Ben-David, Stephen H. Bach, Taewoon Kim, Tali Bers, Matsumoto. Luke: deep contextualized entity representations with
Thibault Fevry, Trishala Neeraj, Urmish Thakker, Vikas Raunak, Xi- entity-aware self-attention. arXiv preprint arXiv:2010.01057, 2020.
angru Tang, Zheng-Xin Yong, Zhiqing Sun, Shaked Brody, Yallow [63] Xuan Yang, Jing Pu, Blaine Burton Rister, Nikhil Bhagdikar, Stephen
Uri, Hadar Tojarieh, Adam Roberts, Hyung Won Chung, Jaesung Tae, Richardson, Shahar Kvatinsky, Jonathan Ragan-Kelley, Ardavan Pe-
Jason Phang, Ofir Press, Conglong Li, Deepak Narayanan, Hatim Bour- dram, and Mark Horowitz. A systematic approach to blocking convo-
foune, Jared Casper, Jeff Rasley, Max Ryabinin, Mayank Mishra, Min- lutional neural networks, 2016.
jia Zhang, Mohammad Shoeybi, Myriam Peyrounette, Nicolas Patry, [64] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R
Nouamane Tazi, Omar Sanseviero, Patrick von Platen, Pierre Cornette, Salakhutdinov, and Quoc V Le. Xlnet: Generalized autoregressive

816
PrimePar: Efficient Spatial-temporal Tensor Partitioning for Large Transformer Model Training ASPLOS ’24, April 27-May 1, 2024, La Jolla, CA, USA

pretraining for language understanding. Advances in neural informa- Zettlemoyer. Opt: Open pre-trained transformer language models,
tion processing systems, 32, 2019. 2022.
[65] Sixin Zhang, Anna E Choromanska, and Yann LeCun. Deep learning [67] Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng
with elastic averaging sgd. Advances in neural information processing Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo,
systems, 28, 2015. Eric P Xing, et al. Alpa: Automating inter-and {Intra-Operator} par-
[66] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, allelism for distributed deep learning. In 16th USENIX Symposium
Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria on Operating Systems Design and Implementation (OSDI 22), pages
Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel 559–578, 2022.
Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke

817

Auto Parallel
No ratings yet
Auto Parallel
21 pages
Final Assigment of PDC
No ratings yet
Final Assigment of PDC
12 pages
Deeplearning Ai
No ratings yet
Deeplearning Ai
57 pages
A S M M P: A G F F L M T: Mazon AGE Aker Odel Arallelism Eneral and Lexible Ramework For Arge Odel Raining
No ratings yet
A S M M P: A G F F L M T: Mazon AGE Aker Odel Arallelism Eneral and Lexible Ramework For Arge Odel Raining
24 pages
Transformer Training
No ratings yet
Transformer Training
10 pages
Parallelism Strategies in Machine Learning, Get The Free Cheat Sheet - 2
No ratings yet
Parallelism Strategies in Machine Learning, Get The Free Cheat Sheet - 2
32 pages
Mesh-Tensorflow: Deep Learning For Supercomputers
No ratings yet
Mesh-Tensorflow: Deep Learning For Supercomputers
16 pages
Pase: Parallelization Strategies For Efficient DNN Training
No ratings yet
Pase: Parallelization Strategies For Efficient DNN Training
10 pages
Mesh Tensorflow Deep Learning For Supercomputers
No ratings yet
Mesh Tensorflow Deep Learning For Supercomputers
10 pages
NeurIPS 2021 Redesigning The Transformer Architecture With Insights From Multi Particle Dynamical Systems Paper
No ratings yet
NeurIPS 2021 Redesigning The Transformer Architecture With Insights From Multi Particle Dynamical Systems Paper
14 pages
Vpipe A Virtualized Acceleration System For Achieving Efficient and Scalable Pipeline Parallel DNN Training
No ratings yet
Vpipe A Virtualized Acceleration System For Achieving Efficient and Scalable Pipeline Parallel DNN Training
18 pages
An Survey of Neural Network Compression
No ratings yet
An Survey of Neural Network Compression
73 pages
Tensor Networks For Dimensionality Reduction, Big Data and Deep Learning
No ratings yet
Tensor Networks For Dimensionality Reduction, Big Data and Deep Learning
47 pages
Activation Layers
No ratings yet
Activation Layers
17 pages
Article - Python - TensorFlow: A System For Large-Scale Machine Learning
No ratings yet
Article - Python - TensorFlow: A System For Large-Scale Machine Learning
18 pages
Hidet: Task-Mapping Programming Paradigm For Deep Learning Tensor Programs
No ratings yet
Hidet: Task-Mapping Programming Paradigm For Deep Learning Tensor Programs
15 pages
Tensor Networks Meet Neural Networks: A Survey and Future Perspectives
No ratings yet
Tensor Networks Meet Neural Networks: A Survey and Future Perspectives
20 pages
6040 NN Former Rethinking Grap
No ratings yet
6040 NN Former Rethinking Grap
20 pages
Eecs 2024 108
No ratings yet
Eecs 2024 108
48 pages
ETC: Efficient Training of Temporal Graph Neural Networks Over Large-Scale Dynamic Graphs
No ratings yet
ETC: Efficient Training of Temporal Graph Neural Networks Over Large-Scale Dynamic Graphs
13 pages
Z B P P: ERO Ubble Ipeline Arallelism
No ratings yet
Z B P P: ERO Ubble Ipeline Arallelism
19 pages
A Survey of Quantization Methods For Efficient Neural Network Inference
No ratings yet
A Survey of Quantization Methods For Efficient Neural Network Inference
33 pages
Enabling Fine-Grained Spatial Multitasking On Systolic-Array NPUs Using Dataflow Mirroring
No ratings yet
Enabling Fine-Grained Spatial Multitasking On Systolic-Array NPUs Using Dataflow Mirroring
16 pages
Distributed Training Scaling Deep Dive
No ratings yet
Distributed Training Scaling Deep Dive
33 pages
ICML'22 Big Model Tutorial (Public v2)
No ratings yet
ICML'22 Big Model Tutorial (Public v2)
160 pages
Single-Shot Pruning and Quantization For Hardware-Friendly Neural Network Acceleration
No ratings yet
Single-Shot Pruning and Quantization For Hardware-Friendly Neural Network Acceleration
8 pages
DeepSpeed Inference - Enabling Efficient Inference of Transformer Models at Unprecedented Scale
No ratings yet
DeepSpeed Inference - Enabling Efficient Inference of Transformer Models at Unprecedented Scale
13 pages
Tensorflow PDF
No ratings yet
Tensorflow PDF
62 pages
Parallel Physics-Informed Neural Networks Via Domain Decomposition
No ratings yet
Parallel Physics-Informed Neural Networks Via Domain Decomposition
23 pages
Automating Parallelism in Deep Learning
No ratings yet
Automating Parallelism in Deep Learning
20 pages
Tensor Networks Meet Neural Networks: A Survey: January 2023
No ratings yet
Tensor Networks Meet Neural Networks: A Survey: January 2023
21 pages
2207.02696v1 2
No ratings yet
2207.02696v1 2
15 pages
Efficient Multi-Task Learning Strategies
No ratings yet
Efficient Multi-Task Learning Strategies
10 pages
Snip: S - N P C S: Ingle Shot Etwork Runing Based On Onnection Ensitivity
No ratings yet
Snip: S - N P C S: Ingle Shot Etwork Runing Based On Onnection Ensitivity
15 pages
XGQ 2023816140319
No ratings yet
XGQ 2023816140319
31 pages
Yolov 7
100% (1)
Yolov 7
17 pages
A Survey On Tensor Techniques and Applications in
No ratings yet
A Survey On Tensor Techniques and Applications in
39 pages
Part 2 Applications and Future Perspectives
No ratings yet
Part 2 Applications and Future Perspectives
81 pages
Allspark: PIM Optimization for Visual Transformers
No ratings yet
Allspark: PIM Optimization for Visual Transformers
14 pages
ZeRO: Optimize Memory for Trillion-Parameter Models
No ratings yet
ZeRO: Optimize Memory for Trillion-Parameter Models
24 pages
2020 - EUSIPCO - Adaptive Algorithms For Tracking Tensor-Train Decomposition of Streaming Tensors
No ratings yet
2020 - EUSIPCO - Adaptive Algorithms For Tracking Tensor-Train Decomposition of Streaming Tensors
5 pages
Pytorch FSDP: Experiences On Scaling Fully Sharded Data Parallel
No ratings yet
Pytorch FSDP: Experiences On Scaling Fully Sharded Data Parallel
13 pages
Chapter DeepLearningwithTensorFlow
No ratings yet
Chapter DeepLearningwithTensorFlow
19 pages
04 Deep Learning Lab Guide-Student Version
No ratings yet
04 Deep Learning Lab Guide-Student Version
33 pages
Neural Networks and Deep Learning: Enhancing Ai Through Neural Network Optimization
No ratings yet
Neural Networks and Deep Learning: Enhancing Ai Through Neural Network Optimization
5 pages
EuroSys19 Parallax
No ratings yet
EuroSys19 Parallax
15 pages
Wu 2017
No ratings yet
Wu 2017
12 pages
Arxiv - 20210823 - Deepak Narayanan - Efficient Large-Scale Language Model Training On GPU Clusters Using Megatron-LM
No ratings yet
Arxiv - 20210823 - Deepak Narayanan - Efficient Large-Scale Language Model Training On GPU Clusters Using Megatron-LM
13 pages
Slides Architecture Based Continual Learning
No ratings yet
Slides Architecture Based Continual Learning
23 pages
T T - A: A T: Ransformers in IME Series Nalysis Utorial
No ratings yet
T T - A: A T: Ransformers in IME Series Nalysis Utorial
29 pages
NeurIPS 2021 Deeply Shared Filter Bases For Parameter Efficient Convolutional Neural Networks Paper
No ratings yet
NeurIPS 2021 Deeply Shared Filter Bases For Parameter Efficient Convolutional Neural Networks Paper
12 pages
MODeL Memory Optimizations For Deep Learning
No ratings yet
MODeL Memory Optimizations For Deep Learning
15 pages
A Hybrid Parallelization Approach For Distributed and Scalable Deep Learning
No ratings yet
A Hybrid Parallelization Approach For Distributed and Scalable Deep Learning
12 pages
YOLOv7: Leading Real-Time Object Detection
No ratings yet
YOLOv7: Leading Real-Time Object Detection
12 pages
Automated Gear Fault Diagnosis
No ratings yet
Automated Gear Fault Diagnosis
15 pages
He & Hofmann, Simplifying Transformer Blocks
No ratings yet
He & Hofmann, Simplifying Transformer Blocks
29 pages
Python - Mathan Aryha S
No ratings yet
Python - Mathan Aryha S
10 pages
Optimization of Nonlinear Convolutional Neural Networks Based On Improved Chameleon Group Algorithm
No ratings yet
Optimization of Nonlinear Convolutional Neural Networks Based On Improved Chameleon Group Algorithm
8 pages
Graphics
No ratings yet
Graphics
3 pages
NVIDIA Intelligent Video Analytics Platform Infographic Poster
No ratings yet
NVIDIA Intelligent Video Analytics Platform Infographic Poster
1 page
Phys X
100% (1)
Phys X
656 pages
User Manual (v7.X) : Tool For Visualization and Analysis of Two and Three - Dimensional Data Sets
No ratings yet
User Manual (v7.X) : Tool For Visualization and Analysis of Two and Three - Dimensional Data Sets
18 pages
Windows 10 System Specs Report
No ratings yet
Windows 10 System Specs Report
4 pages
System Diagnostics Report
No ratings yet
System Diagnostics Report
36 pages
GPU-Accelerated Quantum Chemistry with Python
No ratings yet
GPU-Accelerated Quantum Chemistry with Python
32 pages
Little Guide To Building Large Language Models in 2024
100% (1)
Little Guide To Building Large Language Models in 2024
65 pages
Cpuz Readme
No ratings yet
Cpuz Readme
18 pages
Modern Icon: Reimagining A Much-Loved Classic
No ratings yet
Modern Icon: Reimagining A Much-Loved Classic
52 pages
MIKE by DHI Software Catalogue 2014
No ratings yet
MIKE by DHI Software Catalogue 2014
52 pages
Marathi-English Translation via Transformers
No ratings yet
Marathi-English Translation via Transformers
5 pages
Energy Considerations of Large Language Model Inference and Efficiency Optimizations
No ratings yet
Energy Considerations of Large Language Model Inference and Efficiency Optimizations
16 pages
CRC Mastering Unity
No ratings yet
CRC Mastering Unity
280 pages
l440 Fru List 20140825
No ratings yet
l440 Fru List 20140825
9 pages
Agisoft Metashape: Processing Report 19 May 2022
No ratings yet
Agisoft Metashape: Processing Report 19 May 2022
7 pages
1 - Introduction To InfiniBand
No ratings yet
1 - Introduction To InfiniBand
21 pages
MaximumPC October 2011
100% (1)
MaximumPC October 2011
72 pages
HPE ProLiant DL380 Gen10 server-PSN1010026818USEN
No ratings yet
HPE ProLiant DL380 Gen10 server-PSN1010026818USEN
5 pages
Artificial Intelligence
No ratings yet
Artificial Intelligence
7 pages
Trinetra: GPS Device for Visually Impaired
No ratings yet
Trinetra: GPS Device for Visually Impaired
6 pages
Hands On Opencl: Created by Simon Mcintosh-Smith and Tom Deakin
No ratings yet
Hands On Opencl: Created by Simon Mcintosh-Smith and Tom Deakin
258 pages
Serif Pageplus Assignment Latest Update
No ratings yet
Serif Pageplus Assignment Latest Update
24 pages
Boq 841864
No ratings yet
Boq 841864
16 pages
Synopsis On Grocery App
No ratings yet
Synopsis On Grocery App
6 pages
NV JPEG
No ratings yet
NV JPEG
74 pages
Deep Learning Notes
No ratings yet
Deep Learning Notes
51 pages
Method of Moments Accelerations and Extensions in
No ratings yet
Method of Moments Accelerations and Extensions in
5 pages
Dell+PowerEdge+Features+ +Participant+Guide
No ratings yet
Dell+PowerEdge+Features+ +Participant+Guide
179 pages
Matrix Multiplication With CUDA - A Basic Introduction To The CUDA Programming Model
No ratings yet
Matrix Multiplication With CUDA - A Basic Introduction To The CUDA Programming Model
44 pages

PrimePar Efficient Spatial-Temporal Tensor

Uploaded by

PrimePar Efficient Spatial-Temporal Tensor

Uploaded by

PrimePar: Efficient Spatial-temporal Tensor

Partitioning for Large Transformer Model Training

Ying Wang Yuming Li Yinhe Han

Abstract an optimization algorithm that can find optimal parallel solu-

where each device only communicates with its neighbors

distribution at the start of Forward. At the start of Forward,

which device (𝑟, 𝑐) holds at the start of Forward. Thus during

The overall intra-operator cost can be modeled as: Transformer block

4.2 Inter-operator Cost Segment 1 Segment 2 Segment 3

4.3 Overall Cost 5.1 Segmented Dynamic Programming

Megatron Alpa PrimePar

Megatron Alpa PrimePar

0.2 fc1.F fc2.F fc2.B fc2.G fc1.B fc1.G

OPT 6.7B OPT 175B Llama2 7B

1.8 1.4 1.8

1.4 1.0 1.4

1.2 0.8 1.2

Llama2 70B Bloom 7B1 Bloom 176B

1.2 Megatron Megatron

1.0 1.4 1.0

References a gpu-specialized parameter server. In Proceedings of the eleventh

You might also like