Simulation Gpu Grid Refinement LBM
Simulation Gpu Grid Refinement LBM
H0
C1 S2 E2 O2 C2 H1 H2 S2 E2 S1 E1 O1
O2 C2 H1 H2 S2 E2 O2 C2 H1
Begin End
H1 H2 H2
C2 H2
E3 C3 H2 E3 C3 H2 S3 E3
E3 C3 E3 C3 S3
E3 C3 S3 E3 C3
C3 S3 S3 E3 C3
S3 S3 E2 O2
S3 H2 S2
[14] focused on scalability on multi-GPU systems, adaptive of Karlin-Bösch-Chikatamarla (KBC) [18]. Without loss of
mesh refinement on multiple GPUs [15] and load-balancing in generality and for conciseness, we only present the simple
distributed CPU-based systems [16], [17]. While these related BGK model here in which C is defined as,
works also aimed to develop grid refinement for LBM on ∆t
GPU, to our knowledge, there is no prior work that focuses {fi (x, t) − fieq (x, t)}
fi∗ (x, t) = fi (x, t) − (3)
τ
primarily on optimizing the single GPU implementation of
grid refinement in LBM. We believe this is a crucial precursory In the above equations ∆t is the time step and τ is a
investigation before further optimizations are undertaken for relaxation time, determined by the total kinematic viscosity
multi-GPU systems. ν of the fluid as,
ν ∆t
In this paper, we propose an optimized implementation τ= 2+ . (4)
of the LBM grid refinement method that utilizes the GPU cs 2
hardware efficiently. Our optimized implementation is based or in other words, ν = c2s (τ − ∆t/2).
on aggressive kernel fusion (Figure 2, bottom) leading to In Equation (1), fieq represents the components of the
improved memory accesses, reduced synchronization, reduced equilibrium distribution function and is defined as,
kernel launch overhead, and simpler code. Additionally, our
implementation uses less memory than the baseline algorithm fieq (x, t) =
by reducing the size of the ghost layer. We show the ef-
ei · u(x, t) (ei · u(x, t))2 ∥u(x, t)∥2
fectiveness of our optimized implementation and compare it wi ρ(x, t) 1 + + −
c2s 2c4s 2c2s
against CPU and GPU implementations where we achieve an (5)
order of magnitude speedup. While there exist many open-
source LBM codes, ours is the first to implement the grid where wi are weights associated with each lattice direction ei .
refinement method on the GPU. Our code is available at The fluid density ρ(x, t), velocity u(x, t) and pressure p(x, t)
https://github.com/Autodesk/Neon. are formally derived from fi as,
X
II. M ATHEMATICAL BACKGROUND density: ρ(x, t) = fi (x, t) (6)
The LBM equations determine the time-evolution of a i
collection of fictitious particles, represented by time-dependent 1 X
velocity: u(x, t) = ei fi (x, t) (7)
velocity distribution functions (fi ) along a set of discrete lattice ρ(x, t) i
directions denoted by ei = (e1 , . . . , eq ). In this work, we
pressure: p(x, t) = c2s ρ(x, t) (8)
employ 3D lattice structures with 19 (D3Q19) or 27 (D3Q27)
directions. The values of fi are evolved in time through in which cs is the constant speed of sound given by c2s =
a collide-and-stream algorithm. If we denote the nonlinear (1/3)(∆x/∆t)2 .
and computationally local collision operation by C, we may
represent the collide-and-stream algorithm as, A. Grid Refinement
Collision: fi∗ (x, t) = C(fi (x, t)) (1) Various techniques for grid-refinement in LBM have been
proposed in the literature which may be broadly categorized
Streaming:fi (x + ei ∆x, t + ∆t) = fi∗ (x, t) (2)
into node-based methods [11], [19]–[22] and volume-based
Throughout this paper, we use two collision operators, namely methods [16], [23]–[25]. In the node-based methods, fi is
(i) the single-relaxation collision model of Bhatnagar-Gross- stored at the cell vertices leading to shared vertices between
Krook (BGK) [2], and (ii) the multi-relaxation entropic model coarse and fine cells. In order to impose conservation of mass
and preserve second-order accuracy, the node-based meth-
ods often resort to scheme-dependent rescaling of fi across
transitions between levels and rely on ad hoc interpolation
and filtering schemes that may lead to spurious results [19].
However, in volume-based techniques, fi values of two sub-
sequent grid levels are not colocated but staggered which
renders the communication between coarse and fine levels
more straightforward. As a result, in volume-based methods a
simple homogeneous redistribution of fi acts as the required
interpolation scheme for coarse-to-fine communications while
conserving mass and momentum [23]. (a) Explosion
In this paper, we adopt the volume-based method of Rohde
et al. [25] as implemented by Schornbaum et al. [5], [16]. We
introduce some novel changes to this algorithm that are aimed
at optimizing the computational performance of the method
when implemented on the GPU.
We select a refinement ratio of 2 where a coarse cell
at level L is uniformly divided into 2d cells–where d is
the dimension– to arrive at level L + 1, or in other words
∆xL+1 = ∆xL /2. For neighboring cells that interface two
grid levels, a maximum jump in grid level of ∆L = 1 is
allowed. Due to acoustic scaling which requires the speed of
sound cs to remain constant across various grid levels [11], (b) Coalescence
∆tL ∝ ∆xL and hence ∆tL+1 = ∆tL /2. In addition, the Fig. 3: Besides collision and streaming, the nonuniform grid
fluid viscosity ν must also remain constant on each grid level refinement in LBM involves (a) coarse-to-fine explosion, and
which implies: (b) fine-to-coarse coalescence. The colored arrows highlight
∆tL ∆t0 how a few of the distribution functions fi travel along certain
νL = ν0 → c2s (τL − ) = c2s (τ0 − )
2 2 directions ei .
or equivalently:
τL τ0 1
= 2L + (1 − 2L ) III. BASELINE G RID R EFINEMENT A LGORITHM
∆tL ∆t0 2
It is often convenient to define a non-dimensional relaxation Our baseline algorithm is that by Schornbaum et al. [5],
parameter ω = ∆t/τ and hence, ωL on various grid levels are [16], which has been designed for distributed systems. This
obtained based on its value at the coarsest level or ω0 as, algorithm uses a strongly balanced octree grid where, as
2ω0 indicated earlier, the transition in resolution from one level to
ωL = L+1 (9) another is strictly 1 in all directions. We denote an octree level
2 + (1 − 2L )ω0
with L, where L = 0 implies the coarsest level of the tree.
The coarse-to-fine and fine-to-coarse communication steps The algorithm input is a grid that tiles the simulation domain
across the grids at level L and L + 1 (also introduced later in at potentially different resolutions. The algorithm depends on
the paper as the Explosion and Coalescence steps) are defined four main computational components (Figure 3)
as,
1) Collision: computes fi∗ using fi as per Equation (1)
Explosion: fi (xβ,L+1 , t) = fi (xα,L , t) (10) similar to the uniform LBM algorithm. This step in-
n volves only local computation and does not require
1X
Coalescence: fi (xα,L , t) = fi (xβ,L+1 , t) (11) any communications with other cells and hence is not
n
β=1 impacted by grid refinement.
which correspond to a homogeneous redistribution of fi for 2) Streaming: propagates the information in time and space
the coarse-to-fine communication (Explosion) and a simple by conducting a shift operation as per Equation (2). We
averaging of fi across n fine cells (denoted by xβ,L+1 ) that designate the same-level Streaming simply by “Stream-
interface with a coarse cell (denoted by xα,L ) for the fine-to- ing” and denote the cross-level Streaming by Explosion
coarse communication (Coalescence). and Coalescence.
In the remainder of this paper, we employ the LBM units in 3) Explosion: performs coarse-to-fine communications to
which it is conventional to non-dimensionalize space and time propagate fi from the coarse neighbor into finer cells as
by ∆x and ∆t. As a result, in our accompanied open-source per Equation (10). Explosion is a one-to-many commu-
code all the parameters are assumed to be dimensionless, for nication where a coarse cell distributes its values of fi
instance, ∆x = ∆t = 1 and c2s = 1/3. along ei to fine cells that interface with that coarse cell
along −ei (Figure 3a). Algorithm 1: Baseline Grid Refinement LBM Algo-
4) Coalescence: performs fine-to-coarse communications rithm [5]
to propagate fi from the fine neighbor into coarse 1 Function NonUniformTimeStep(L):
cells as per Equation (11). Coalescence is a many-to- 2 Collision(L)
one communication where a coarse cell averages the 3 if L ̸= Lmax − 1 then
contributions of fi from its neighboring fine cells along 4 NonUniformTimeStep(L + 1)
a certain direction indicated by the ei vector (Figure 3b). 5 end
At a high level, the algorithm applies the standard LBM 6 if L ̸= 0 then
collide-and-stream algorithm at each level starting from the 7 Explosion(L, L − 1)
coarsest level. However, during the streaming step, the inter- 8 end
face between different levels requires special treatment and 9 Streaming(L)
this is where the uniform versus nonuniform LBM algorithm 10 if L ̸= Lmax − 1 then
differs. Additionally, for a refinement ratio of two between 11 Coalescence(L, L + 1)
levels L and L + 1, the algorithm needs to complete two time 12 end
steps at level L + 1 before proceeding to the next time step of 13 if L == 0 then
the coarse level due to the acoustic scaling (see Section II). 14 return
In other words, given a grid with L levels, the finest grid will 15 end
perform 2L−1 time steps to complete one time step on the 16 Collision(L)
coarsest level. 17 if L ̸= Lmax − 1 then
In order to facilitate the communication between different 18 NonUniformTimeStep(L + 1)
levels of the grid, the interface between a grid at level L and 19 end
a finer neighbor at level L + 1 is extended by a ghost layer 20 if L ̸= 0 then
(Figure 4a) which lives in the finer L + 1 grid and covers two 21 Explosion(L, L − 1)
coarser layers from the grid at level L. These ghost layers 22 end
are used for communicating interface information both ways, 23 Streaming(L)
i.e., from coarse to fine and from fine to coarse. We note that 24 if L ̸= Lmax − 1 then
all reads/writes could be done as a gather or scatter operation 25 Coalescence(L, L + 1)
without the need for any atomic mechanism thanks to these 26 end
added ghost layers.
The baseline method in Algorithm 1 shows a single time
step of the coarsest level. This step is repeated in a loop until
a user-specified stopping criterion is met, e.g., based on the and Coalescence. Here, we demonstrate how the baseline
maximum number of iterations. To succinctly describe one algorithm can be modified to enable kernel fusion, which is
time step of the baseline method in Algorithm 1, we assume otherwise non-trivial. We will show that Streaming, Explosion,
a two-level grid: L (coarse) and L + 1 (fine). Starting from and Coalescence may all be combined into a single kernel.
the post-streaming state, the algorithm performs a Collision
step on L and L + 1. Then, an Explosion operation populates A. Reducing Ghost Layers
ghost layers of L + 1 by copying two layers of L along the
interface between L and L+1. This is followed by a Streaming The baseline algorithm uses four ghost layers on the fine
operation in L and L + 1 including the innermost layer of the grid which overlap two coarse layers. These ghost layers
ghost cells. Then, another Collision and Streaming steps are are employed to duplicate the overlapping coarse layers in
performed on L + 1. Finally, a Coalescence step is performed order to be leveraged (only the two innermost layers) during
where the innermost ghost layer populations in L + 1 are the Streaming step of the fine level. The fine ghost layers
averaged and copied to the overlapping corresponding coarse also allow the Coalescence step to be performed as gather
cells in L. operations (initiated by the coarse layer) avoiding atomic
operations. However, this approach is only reasonable if the
IV. A NALYSIS AND O PTIMIZATION distance in memory between the fine and coarse grid is large.
While the majority of previous work focused on distributed On a single GPU, four ghost layers are excessive and may limit
systems, here we analyze the performance of the baseline the maximum size of the problem that fits in a single GPU.
algorithm (Section III) on a single GPU and investigate its Additionally, depending on the data structure (see Section V),
potential bottlenecks to pave the way for its optimization. the distance between grids at different resolutions may not be
The main drawback of implementing the baseline algorithm large enough to warrant manual caching.
in its original form on the GPU is that every operation is Instead of allocating the ghost layer on the fine level, we
performed in isolation—missing the opportunity for kernel allocate only a single layer on the coarse layer effectively
fusion. This leads to loading the whole grid multiple times to reducing its size to 1/3 of what is needed by the baseline
operate only on a small subset of it, e.g., during Explosion algorithm. In this case, during the Explosion step (i.e., coarse-
(a) Baseline Algorithm (b) Reduced ghost layer
TABLE I: Comparing the performance of the modified baseline algorithm (Figure 4b) to our optimized implementation
(Figure 4f) using the example of flow over sphere where three levels of refinement were used to arrive at the finest level
around the sphere. Size indicates the length of the virtual wind tunnel box described in the finest level along x, y, and z
directions; Distribution indicates the distribution of active voxels across different levels starting from the finest level, Baseline
refers to the MLUPS achieved by our implementation of the modified baseline algorithm; Ours refers to our most optimized/fused
implementation; and Speedup = Ours/Baseline.
2000 achievable
Baseline
1800 Fused CA VII. C ONCLUSION AND F UTURE W ORK
Fused CA and SE
1600 Fused CA, SE, and SO Large-scale and accurate simulation of realistic turbulent
Fused CASE, and SO flows is essential for many applications. In this work, we
1400 investigated the nonuniform grid refinement algorithm in the
1200 Lattice Boltzmann Method (LBM) and optimized its perfor-
MLUPS