
Optimized GPU Implementation of Grid Refinement in Lattice Boltzmann Method

Ahmed H. Mahmoud (1,2), Hesam Salehipour (1), Massimiliano Meneghin (1)
(1) Autodesk Research, Canada; (2) University of California, Davis
{ahmed.mahmoud, hesam.salehipour, massimiliano.meneghin}@autodesk.com

Abstract—Nonuniform grid refinement plays a fundamental role in simulating realistic flows with a multitude of length scales. We introduce the first GPU-optimized implementation of this technique in the context of the lattice Boltzmann method. Our approach focuses on enhancing GPU performance while minimizing memory access bottlenecks. We employ kernel fusion techniques to optimize memory access patterns, reduce synchronization overhead, and minimize kernel launch latencies. Additionally, our implementation ensures efficient memory management, resulting in lower memory requirements compared to the baseline LBM implementations that were designed for distributed systems. Our implementation allows simulations of unprecedented domain size (e.g., 1596 × 840 × 840) using a single A100-40 GB GPU thanks to enabling grid refinement capabilities on a single GPU. We validate our code against published experimental data. Our optimization improves the performance of the baseline algorithm by 1.3–2X. We also compare against current state-of-the-art solutions for grid refinement LBM and show an order of magnitude speedup.

Index Terms—Parallel, GPU, Simulation, LBM, Boltzmann, Refinement

Fig. 1: Our optimized LBM grid refinement facilitates large-scale fluid simulations, e.g., airflow over an airplane within a virtual wind tunnel of size 1596 × 840 × 840 on a single GPU.

I. INTRODUCTION

Complex fluid flows are prevalent in engineering systems and natural environments, and hence simulating their behavior is of paramount importance in many disciplines ranging from vehicle aerodynamics and building ventilation to weather forecasting and planetary currents. Common to all these applications is the substantially wide range of spatial scales that must be resolved numerically in order to capture the realistic fluid flow dynamics accurately. For example, when simulating air flow around buildings, one needs to enclose the open environment in a computational box that is sufficiently large to eliminate the effects of the boundary conditions, while also resolving small geometric features of interest in the building (e.g., balconies and windows), notwithstanding the microscale turbulent features of the flow that are orders of magnitude smaller still.

Given the multiscale nature of such fluid-related phenomena, the field of computational fluid dynamics has evolved to enable accurate, efficient, and fast simulations that rely on nonuniform grids with enhanced refinements around the regions of interest [1]. Over the past two decades, the Lattice Boltzmann Method (LBM) has gained tremendous popularity as an alternative mesoscopic representation of fluid flows, which have classically been formulated using a macroscopic representation based on the Navier-Stokes (N-S) equations and solved numerically using methods such as finite-difference, finite-volume, or finite-element schemes. While LBM is proven to be equivalent to the weakly compressible N-S equations up to second-order accuracy in space and time (e.g., through the Chapman-Enskog analysis [2]), its localized formulation makes it ideally suited for massive parallelization, especially on GPUs. The massively parallel nature of LBM computation has made it an appealing CFD solution for commercial purposes, e.g., XFlow, PowerFLOW [3], and ultraFluidX [4].

Optimizing the GPU implementation of LBM has therefore gained much attention in recent years, leading to the development of just-in-time compiler-based implementations [6] as well as advanced algorithms on uniform grids such as the AA-method [7], Esoteric Twist [8], and reduced/mixed precision [9]. The goal of all these optimizations is to reduce the memory requirements of LBM and ultimately improve the wall-clock performance of the simulation [10].

As discussed earlier, large-scale CFD simulations are only possible by relying on nonuniform grid refinement. The memory-bound computations associated with LBM further necessitate efficient implementations of this approach on the GPU. However, a robust grid refinement algorithm in LBM involves multiple kernels with complex data dependencies and synchronizations (Figure 2, top), which makes caching ineffective and may result in poor performance without careful hardware-specific optimizations. Various prior investigations have focused on implementing the grid refinement technique in LBM on either single-GPU or multi-GPU architectures. An implementation of the grid refinement technique in LBM on both CPU and a single GPU was presented for 2D domains in [11], which was then extended to heterogeneous CPU-GPU machines in [12]. Other works [13],
[Figure 2 graphs: (a) Baseline Algorithm; (b) Our Optimized Implementation]
Fig. 2: Data dependency graph of the baseline grid refinement LBM algorithm [5] ported to the GPU (top) and our optimized
implementation (bottom). The baseline has complex dependencies because of multiple kernels required for different levels.
Each node in this graph represents a GPU kernel where C is Collision, E is Explosion, S is Streaming, O is Coalescence and
A is Accumulate as will be explained further in Section IV. The number represents the grid level where the coarsest level is
0. Our implementation is simpler because of our aggressive kernel fusion (around three times fewer kernels) which makes our
algorithm 1.3–2X faster than the baseline algorithm.

[14] focused on scalability on multi-GPU systems, adaptive mesh refinement on multiple GPUs [15], and load-balancing in distributed CPU-based systems [16], [17]. While these related works also aimed to develop grid refinement for LBM on GPU, to our knowledge, there is no prior work that focuses primarily on optimizing the single-GPU implementation of grid refinement in LBM. We believe this is a crucial precursory investigation before further optimizations are undertaken for multi-GPU systems.

In this paper, we propose an optimized implementation of the LBM grid refinement method that utilizes the GPU hardware efficiently. Our optimized implementation is based on aggressive kernel fusion (Figure 2, bottom), leading to improved memory accesses, reduced synchronization, reduced kernel launch overhead, and simpler code. Additionally, our implementation uses less memory than the baseline algorithm by reducing the size of the ghost layer. We show the effectiveness of our optimized implementation and compare it against CPU and GPU implementations where we achieve an order of magnitude speedup. While there exist many open-source LBM codes, ours is the first to implement the grid refinement method on the GPU. Our code is available at https://github.com/Autodesk/Neon.

II. MATHEMATICAL BACKGROUND

The LBM equations determine the time-evolution of a collection of fictitious particles, represented by time-dependent velocity distribution functions (fi) along a set of discrete lattice directions denoted by ei = (e1, . . . , eq). In this work, we employ 3D lattice structures with 19 (D3Q19) or 27 (D3Q27) directions. The values of fi are evolved in time through a collide-and-stream algorithm. If we denote the nonlinear and computationally local collision operation by C, we may represent the collide-and-stream algorithm as,

Collision: fi*(x, t) = C(fi(x, t))   (1)
Streaming: fi(x + ei∆x, t + ∆t) = fi*(x, t)   (2)

Throughout this paper, we use two collision operators, namely (i) the single-relaxation collision model of Bhatnagar-Gross-Krook (BGK) [2], and (ii) the multi-relaxation entropic model of Karlin-Bösch-Chikatamarla (KBC) [18]. Without loss of generality and for conciseness, we only present the simple BGK model here, in which C is defined as,

fi*(x, t) = fi(x, t) − (∆t/τ) (fi(x, t) − fi^eq(x, t))   (3)

In the above equations ∆t is the time step and τ is a relaxation time, determined by the total kinematic viscosity ν of the fluid as,

τ = ν/cs² + ∆t/2,   (4)

or in other words, ν = cs²(τ − ∆t/2).

In Equation (3), fi^eq represents the components of the equilibrium distribution function and is defined as,

fi^eq(x, t) = wi ρ(x, t) [1 + (ei · u(x, t))/cs² + (ei · u(x, t))²/(2cs⁴) − ∥u(x, t)∥²/(2cs²)]   (5)

where wi are weights associated with each lattice direction ei. The fluid density ρ(x, t), velocity u(x, t), and pressure p(x, t) are formally derived from fi as,

density: ρ(x, t) = Σi fi(x, t)   (6)
velocity: u(x, t) = (1/ρ(x, t)) Σi ei fi(x, t)   (7)
pressure: p(x, t) = cs² ρ(x, t)   (8)

in which cs is the constant speed of sound given by cs² = (1/3)(∆x/∆t)².

A. Grid Refinement

Various techniques for grid refinement in LBM have been proposed in the literature, which may be broadly categorized into node-based methods [11], [19]–[22] and volume-based methods [16], [23]–[25]. In the node-based methods, fi is stored at the cell vertices, leading to shared vertices between coarse and fine cells. In order to impose conservation of mass and preserve second-order accuracy, the node-based methods often resort to scheme-dependent rescaling of fi across transitions between levels and rely on ad hoc interpolation and filtering schemes that may lead to spurious results [19]. However, in volume-based techniques, fi values of two subsequent grid levels are not colocated but staggered, which renders the communication between coarse and fine levels more straightforward. As a result, in volume-based methods a simple homogeneous redistribution of fi acts as the required interpolation scheme for coarse-to-fine communications while conserving mass and momentum [23].

[Figure 3(a): Explosion]
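Before moving to the grid refinement machinery, the uniform collide-and-stream update of Equations (1)–(8) can be sketched as follows. This is a minimal NumPy illustration on a periodic grid — not the paper's CUDA implementation — using the D3Q19 lattice and LBM units (∆x = ∆t = 1, cs² = 1/3); all function names are ours:

```python
import numpy as np

# D3Q19 lattice: 1 rest direction + 6 face + 12 edge directions.
c = np.array([(0, 0, 0),
              (1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
             + [(x, y, 0) for x in (1, -1) for y in (1, -1)]
             + [(x, 0, z) for x in (1, -1) for z in (1, -1)]
             + [(0, y, z) for y in (1, -1) for z in (1, -1)])
w = np.array([1/3] + [1/18] * 6 + [1/36] * 12)   # lattice weights wi
cs2 = 1 / 3                                      # speed of sound squared

def moments(f):
    """f has shape (19, nx, ny, nz); returns density and velocity fields."""
    rho = f.sum(axis=0)                          # density, Eq. (6)
    u = np.tensordot(c.T, f, axes=1) / rho       # velocity, Eq. (7)
    return rho, u

def equilibrium(rho, u):
    cu = np.tensordot(c, u, axes=([1], [0]))     # ei . u for every cell
    usq = (u * u).sum(axis=0)
    return (w[:, None, None, None] * rho *
            (1 + cu / cs2 + cu**2 / (2 * cs2**2) - usq / (2 * cs2)))  # Eq. (5)

def collide_and_stream(f, omega):
    rho, u = moments(f)
    f = f - omega * (f - equilibrium(rho, u))    # BGK collision, Eq. (3), omega = dt/tau
    for i, ci in enumerate(c):                   # periodic streaming shift, Eq. (2)
        f[i] = np.roll(f[i], shift=tuple(ci), axis=(0, 1, 2))
    return f
```

Because both the BGK collision and periodic streaming conserve mass and momentum, the zeroth and first moments of f are unchanged by a time step — a useful sanity check on any implementation of this update.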
In this paper, we adopt the volume-based method of Rohde et al. [25] as implemented by Schornbaum et al. [5], [16]. We introduce some novel changes to this algorithm that are aimed at optimizing the computational performance of the method when implemented on the GPU.

We select a refinement ratio of 2, where a coarse cell at level L is uniformly divided into 2^d cells—where d is the dimension—to arrive at level L + 1; in other words, ∆x(L+1) = ∆xL/2. For neighboring cells that interface two grid levels, a maximum jump in grid level of ∆L = 1 is allowed. Due to acoustic scaling, which requires the speed of sound cs to remain constant across various grid levels [11], ∆tL ∝ ∆xL and hence ∆t(L+1) = ∆tL/2. In addition, the fluid viscosity ν must also remain constant on each grid level, which implies:

νL = ν0  →  cs²(τL − ∆tL/2) = cs²(τ0 − ∆t0/2)

or equivalently:

τL/∆tL = 2^L (τ0/∆t0) + (1 − 2^L)(1/2)

It is often convenient to define a non-dimensional relaxation parameter ω = ∆t/τ, and hence ωL on various grid levels is obtained from its value at the coarsest level, ω0, as

ωL = 2ω0 / (2^(L+1) + (1 − 2^L)ω0)   (9)

The coarse-to-fine and fine-to-coarse communication steps across the grids at levels L and L + 1 (introduced later in the paper as the Explosion and Coalescence steps) are defined as

Explosion: fi(xβ,L+1, t) = fi(xα,L, t)   (10)
Coalescence: fi(xα,L, t) = (1/n) Σ(β=1..n) fi(xβ,L+1, t)   (11)

which correspond to a homogeneous redistribution of fi for the coarse-to-fine communication (Explosion) and a simple averaging of fi across the n fine cells (denoted by xβ,L+1) that interface with a coarse cell (denoted by xα,L) for the fine-to-coarse communication (Coalescence).

In the remainder of this paper, we employ LBM units, in which it is conventional to non-dimensionalize space and time by ∆x and ∆t. As a result, in our accompanying open-source code all the parameters are assumed to be dimensionless, for instance, ∆x = ∆t = 1 and cs² = 1/3.

[Figure 3(b): Coalescence]

Fig. 3: Besides collision and streaming, the nonuniform grid refinement in LBM involves (a) coarse-to-fine explosion, and (b) fine-to-coarse coalescence. The colored arrows highlight how a few of the distribution functions fi travel along certain directions ei.

III. BASELINE GRID REFINEMENT ALGORITHM

Our baseline algorithm is that of Schornbaum et al. [5], [16], which was designed for distributed systems. This algorithm uses a strongly balanced octree grid where, as indicated earlier, the transition in resolution from one level to another is strictly 1 in all directions. We denote an octree level by L, where L = 0 is the coarsest level of the tree. The algorithm input is a grid that tiles the simulation domain at potentially different resolutions. The algorithm depends on four main computational components (Figure 3):

1) Collision: computes fi* using fi as per Equation (1), similar to the uniform LBM algorithm. This step involves only local computation, requires no communication with other cells, and hence is not impacted by grid refinement.
2) Streaming: propagates the information in time and space by conducting a shift operation as per Equation (2). We designate the same-level Streaming simply by "Streaming" and denote the cross-level Streaming by Explosion and Coalescence.
3) Explosion: performs coarse-to-fine communications to propagate fi from the coarse neighbor into finer cells as per Equation (10). Explosion is a one-to-many communication where a coarse cell distributes its values of fi along ei to fine cells that interface with that coarse cell
along −ei (Figure 3a).
4) Coalescence: performs fine-to-coarse communications to propagate fi from the fine neighbor into coarse cells as per Equation (11). Coalescence is a many-to-one communication where a coarse cell averages the contributions of fi from its neighboring fine cells along a certain direction indicated by the ei vector (Figure 3b).

At a high level, the algorithm applies the standard LBM collide-and-stream algorithm at each level starting from the coarsest level. However, during the streaming step, the interface between different levels requires special treatment, and this is where the uniform and nonuniform LBM algorithms differ. Additionally, for a refinement ratio of two between levels L and L + 1, the algorithm needs to complete two time steps at level L + 1 before proceeding to the next time step of the coarse level due to the acoustic scaling (see Section II). In other words, given a grid with L levels, the finest grid will perform 2^(L−1) time steps to complete one time step on the coarsest level.

In order to facilitate the communication between different levels of the grid, the interface between a grid at level L and a finer neighbor at level L + 1 is extended by a ghost layer (Figure 4a) which lives in the finer L + 1 grid and covers two coarser layers from the grid at level L. These ghost layers are used for communicating interface information both ways, i.e., from coarse to fine and from fine to coarse. We note that all reads/writes could be done as gather or scatter operations without the need for any atomic mechanism thanks to these added ghost layers.

Algorithm 1: Baseline Grid Refinement LBM Algorithm [5]

Function NonUniformTimeStep(L):
    Collision(L)
    if L ≠ Lmax − 1 then
        NonUniformTimeStep(L + 1)
    end
    if L ≠ 0 then
        Explosion(L, L − 1)
    end
    Streaming(L)
    if L ≠ Lmax − 1 then
        Coalescence(L, L + 1)
    end
    if L == 0 then
        return
    end
    Collision(L)
    if L ≠ Lmax − 1 then
        NonUniformTimeStep(L + 1)
    end
    if L ≠ 0 then
        Explosion(L, L − 1)
    end
    Streaming(L)
    if L ≠ Lmax − 1 then
        Coalescence(L, L + 1)
    end
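The recursive control flow of Algorithm 1 can be sketched in plain Python as follows; this records only the kernel sequence (it performs no actual computation), and the function and variable names are ours:

```python
def nonuniform_time_step(L, L_max, trace):
    """One coarse time step of the baseline algorithm; appends kernels to trace."""
    for _ in range(2):                       # every level runs two collide-and-stream
        trace.append(("Collision", L))       # halves, except the coarsest (L = 0)
        if L != L_max - 1:
            nonuniform_time_step(L + 1, L_max, trace)   # advance the finer level
        if L != 0:
            trace.append(("Explosion", L))       # pull data from coarser level L - 1
        trace.append(("Streaming", L))
        if L != L_max - 1:
            trace.append(("Coalescence", L))     # gather averages from level L + 1
        if L == 0:
            return                           # the coarsest level takes a single step
```

Counting the recorded kernels confirms the sub-stepping described above: for a three-level grid, one coarse time step triggers one Collision at level 0, two at level 1, and four at level 2.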
The baseline method in Algorithm 1 shows a single time step of the coarsest level. This step is repeated in a loop until a user-specified stopping criterion is met, e.g., based on the maximum number of iterations. To succinctly describe one time step of the baseline method in Algorithm 1, we assume a two-level grid: L (coarse) and L + 1 (fine). Starting from the post-streaming state, the algorithm performs a Collision step on L and L + 1. Then, an Explosion operation populates the ghost layers of L + 1 by copying two layers of L along the interface between L and L + 1. This is followed by a Streaming operation in L and L + 1, including the innermost layer of the ghost cells. Then, another Collision and Streaming step is performed on L + 1. Finally, a Coalescence step is performed where the innermost ghost layer populations in L + 1 are averaged and copied to the overlapping corresponding coarse cells in L.

IV. ANALYSIS AND OPTIMIZATION

While the majority of previous work focused on distributed systems, here we analyze the performance of the baseline algorithm (Section III) on a single GPU and investigate its potential bottlenecks to pave the way for its optimization. The main drawback of implementing the baseline algorithm in its original form on the GPU is that every operation is performed in isolation—missing the opportunity for kernel fusion. This leads to loading the whole grid multiple times to operate only on a small subset of it, e.g., during Explosion and Coalescence. Here, we demonstrate how the baseline algorithm can be modified to enable kernel fusion, which is otherwise non-trivial. We will show that Streaming, Explosion, and Coalescence may all be combined into a single kernel.

A. Reducing Ghost Layers

The baseline algorithm uses four ghost layers on the fine grid, which overlap two coarse layers. These ghost layers duplicate the overlapping coarse layers so that the two innermost layers can be leveraged during the Streaming step of the fine level. The fine ghost layers also allow the Coalescence step to be performed as gather operations (initiated by the coarse layer), avoiding atomic operations. However, this approach is only reasonable if the distance in memory between the fine and coarse grids is large. On a single GPU, four ghost layers are excessive and may limit the maximum size of the problem that fits in a single GPU. Additionally, depending on the data structure (see Section V), the distance between grids at different resolutions may not be large enough to warrant manual caching.

Instead of allocating the ghost layer on the fine level, we allocate only a single layer on the coarse level, effectively reducing its size to 1/3 of what is needed by the baseline algorithm. In this case, during the Explosion step (i.e., coarse-to-fine communication), the fine grid reads directly from the coarse grid.

[Figure 4 panels: (a) Baseline Algorithm; (b) Reduced ghost layer; (c) Fused CA; (d) Fused CA and SE; (e) Fused CA, SE, and SO; (f) Fused CASE and SO]

Fig. 4: Comparing the baseline algorithm against different optimization opportunities using a two-level grid. While the baseline algorithm (top left) uses a larger ghost layer (dotted cells), the ghost layer in our implementation (top right) is allocated on the coarse level, effectively reducing its size to 1/3. Additional optimizations include fusing the Collision (C) and Accumulate (A) steps by performing the Accumulate step as an atomic write (middle left). We could also fuse the Streaming (S) and Explosion (E) steps (middle right) as well as the Streaming and Coalescence (O) steps (bottom left). Finally, we could fuse all steps of a single time step of the finest resolution (bottom right).
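The fine-to-coarse path under this reduced ghost layer can be sketched as accumulate-then-average, i.e., Equation (11) split into a sum on the coarse ghost cells followed by a single division on the coarse level. The following is an illustrative NumPy sketch for one fine-level step with dense, assumed shapes — not the paper's sparse data structure, where the accumulation is an atomic write on the GPU:

```python
import numpy as np

def accumulate(f_fine, ghost):
    """Scatter each fine cell's populations into its overlapping coarse ghost cell.

    f_fine: (q, 2n, 2n, 2n) fine-level populations; ghost: (q, n, n, n) coarse
    ghost layer. On the GPU this '+=' would be an atomic add from the fine level.
    """
    n2 = f_fine.shape[1]
    for x in range(n2):
        for y in range(n2):
            for z in range(n2):
                ghost[:, x // 2, y // 2, z // 2] += f_fine[:, x, y, z]

def coalesce(ghost, n_samples=8):
    # The division of Eq. (11) is deferred to the coarse level: one divide per
    # coarse cell rather than one per fine-level write.
    return ghost / n_samples
```

Each coarse value is then the average of its 2³ fine children, so the total mass carried across the level transfer is conserved — the property Eq. (11) is designed to guarantee.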
We rely on this single ghost layer on the coarse grid to prepare the information needed by the interface cells of the coarse level. This approach turns the Coalescence step into a simple Streaming-like step (Figure 4b). As a result, the Coalescence step is split into two operations: (i) an Accumulate step on the ghost layer after every fine-level Collision step, and (ii) a Coalescence step by the coarse level to read and average the accumulated information. The Accumulate step is the same as the coarse-level accumulation in the baseline algorithm (Equation 11) but without the division. The division (to average the data) is now done during the Coalescence on the coarse level. The Accumulate step could be performed either as a gather read operation from the ghost layer or as a scatter atomic write operation from the fine level. Here, we note that the contention is not too high, as every ghost cell will be written by a maximum of 8 other fine cells. After two Accumulate steps, the information is ready for the coarse layer to do its Coalescence step.

B. Kernel Fusion

The modifications explained thus far open the door for a variety of kernel fusion options. For example, the Collision and Accumulate steps could be fused in a single kernel (Figure 4c). In this kernel, every cell performs its Collision operations, and if it is on the interface with a coarse cell, it simply Accumulates its populations, which are now stored in registers. When the coarse cell performs its Coalescence step, it resets the ghost layer, allowing subsequent Accumulate steps to be done correctly. Note that if the Accumulate step is done as a gather operation by the coarse cell, the data dependency would not allow kernel fusion, since the coarse level would have to wait for the fine level to finish its Collision step before performing the Accumulate step.

Additionally, the Explosion and Streaming steps (Figure 4d), and the Coalescence and Streaming steps (Figure 4e), could be fused, since Explosion and Coalescence are now simply sub-steps of the Streaming step where the missing populations live on a grid at a different resolution. When there are more than two levels, the middle level can fuse the Streaming, Explosion, and Coalescence steps all in one kernel. Finally, and similar to uniform LBM [10], the Collision and Streaming can be fused in one kernel (Figure 4f) for the finest resolution. Since the majority of the computation is done on the finest level, this fusion is expected to be extremely beneficial. Unlike the uniform LBM, this final kernel fusion also requires the Collision and Streaming steps to perform the Accumulate and Explosion steps respectively.

V. IMPLEMENTATION DETAILS

The grid refinement LBM algorithm is based on a typical octree-based spatial organization, since the branching factor from one level to another is two. The algorithm defines two distinct sets of operations: single-level operations that work at one level (e.g., Collision and Streaming) and multi-level operations that exchange information between two grids at different levels (e.g., Coalescence and Explosion). There are two main approaches for designing an LBM data structure [5]: a forest of octrees or a stack of uniform sparse grids. Each can support LBM single-level operations, with some glue to transition between different levels aimed at supporting multi-level operations. However, since the majority of the work occurs on the finest level, a stack of uniform sparse grids is expected to have superior performance compared to an octree structure. This is because most of the operations involved are stencil operations, which may be handled better by a uniform grid than by a tree traversal. Thus, we choose the sparse grid as the basis for our data structure.

Fig. 5: A simplified example of our block-based memory layout for vector field data where each 2 × 2 block coincides with a 2 × 2 CUDA block.

A. Block Sparse Data Structure

We partition the input domain into 3D blocks of fixed size. We place the blocks in active regions of the space associated with the fluid domain. For each block, we allocate a bitmask to track the active cells within the block. Additionally, we store the indices of all the neighboring blocks in all lattice directions (e.g., 26 directions for D3Q27) to be used for stencil operations. Given a cell, we can identify its intra-block and inter-block neighbor cells via division and modulo operations, either using inexpensive bit operations (for intra-block neighbors) or explicitly (for inter-block neighbors). The block size B³ is known at compile time, which allows for various optimizations. To maximize locality, at runtime a block is assigned to one CUDA block and every CUDA thread is assigned to a cell within the given block.

To represent vector fields over a grid, e.g., the 19-component vector field for D3Q19 populations, we adopt an Array of Structures of Arrays (AoSoA) layout (Figure 5), i.e., we store a block's field data in a contiguous memory region, where data is then grouped by the field's component. The memory layout and the mapping to CUDA blocks guarantee coalesced accesses and maximize memory locality within a block. Then, to improve the data locality between blocks, we arrange blocks in memory using space-filling curves (Sweep, Morton, or Hilbert).

B. Grid Refinement Data Structure

We implement our grid refinement data structure by stacking Lmax block sparse data structures. We then extend the block data structure with information to allow access to neighbor
[Figure 6 panels: (a) Nonuniform Grid; (b) Iteration 1000; (c) Iteration 5000; (d) Iteration 10000]
Fig. 6: Snapshots of the lid-driven cavity flow at Re = 100 based on our nonuniform LBM implementation using the BGK
collision model and D3Q19 lattice. The flow is incompressible, steady, and laminar characterized by a Reynolds number of
Re = ulid /ν = 100 where ν is the kinematic viscosity of the fluid and the cubic box is assumed to have a dimensionless size
of 1 in all directions. For this example, 3 levels of refinement with 240 voxels on the finest level (across all sides of the box)
were used.

interface cells at a different resolution. For a grid at level L, we store the block index of an overlapping ghost block at level L − 1. We also store the block index containing the fine cells, only for the ghost cells on a non-leaf level. The octree-based organization of the baseline and optimized algorithms defines a 3D blocking of 2³ because of its branching factor. Directly using such an organization for the memory may negatively impact the performance, since 2³ memory blocks provide low locality for stencil operations, and 2³ CUDA blocks do not declare enough threads to fill an entire CUDA warp. Thus, we decouple the branching factor from the memory layout. We do this by grouping a set of 2³ sub-blocks into B³-sized memory blocks as shown in Figure 5.

C. Programming Model

We implement our data structure in Neon [26]. Neon is a multi-GPU programming model that transforms sequential user code into an efficient parallel multi-GPU implementation. We use Neon's capability of extracting the data dependency graph by analyzing the input and output fields of each kernel. This allows us to run independent kernels concurrently and to place synchronization points only when necessary, leading to maximized concurrency. For composing kernels, Neon exposes a for-each abstraction to define computations that manipulate fields' data. In Neon, the user only writes what happens to a single cell; Neon then automatically launches and assigns CUDA blocks and threads to all cells.

Neon was originally designed for uniform grid computations with built-in dense, element-sparse, and block-sparse data structures. We extend Neon's capability to automatically construct a data dependency graph of multi-resolution applications thanks to a stack of uniform sparse grids. For multi-level kernels, the application defines at what level the kernel should run. Then, Neon launches the kernel on the specified grid level while relying on its capabilities for resolving data dependencies and assigning CUDA blocks and threads to active grid cells.

VI. RESULTS

We now show the effectiveness of our optimization on a set of use cases. We analyze both the correctness and performance of our implementation. The LBM community relies on MLUPS (Million Lattice Updates Per Second) as the primary metric for performance. For a uniform grid, MLUPS is simply defined as V·N/T, where V is the number of voxels and T is the wall-clock time in microseconds to complete N LBM iterations. For nonuniform LBM, we calculate MLUPS as Σ(L=0..Lmax−1) VL·NL/TL, where Lmax is the number of levels of the multi-resolution grid, VL is the number of active voxels at level L (excluding any ghost cells), and TL is the wall-clock time in microseconds to complete NL iterations on level L, i.e., NL = 2^L·N. We conduct all experiments on an A100 GPU with 40 GB of memory inside an NVIDIA DGX machine. Our implementation and comparisons use double precision unless stated otherwise. All tests are run in an NVIDIA docker container with CUDA 11.2.

We use the KBC model—compatible only with the D3Q27 lattice (Section II)—in the experiments with turbulent flow simulation (e.g., wind tunnel). For the laminar cases (e.g., lid-driven cavity), we use the BGK model, which can be employed with either D3Q19 or D3Q27. Since the results are identical, we report those based on D3Q19.

A. Lid-driven Cavity

Validation: A canonical benchmark problem in CFD is the flow inside a lid-driven cavity. In this problem, fluid inside a cubic box is driven by the tangential movement of the box lid. The no-slip boundary condition on all the walls as well as the moving wall boundary condition on the top lid are imposed by the halfway bounce-back method [27]. In order to capture the boundary layer near the walls accurately, we successively refine the voxels in all directions as they get closer to the boundaries. Figure 6 shows different iterations of our simulation based on a 3-level nonuniform grid. We validated our simulation against the accepted reference results of Ghia et al. [28], where we sampled the domain along x and y
axes that run through the center of the domain. Figure 7 shows that our results are well-aligned with the reference data.

Fig. 7: Validating our nonuniform LBM implementation against reference results of Ghia et al. [28]. The normalized velocity components in the x and y directions, namely u/ulid and v/ulid (where u = (u, v, w) and ulid = (ulid, 0, 0)), are probed along the y and x axes respectively. The indicated distance is measured with respect to the coordinate axis located at the box center.

Comparison: As mentioned in Section I, there are no open-source and optimized GPU implementations of the grid refinement method in LBM. As such, our choice for comparison was limited to Palabos [29] and waLBerla [30]. Palabos is a multi-core CPU implementation of LBM targeting applications with complex coupled physics. We used their nonuniform LBM implementation to compare the same lid-driven cavity example against our implementation. Given the same domain size and refinement levels, Palabos took 2.3 seconds while our implementation took 0.015 seconds per iteration, which is more than two orders of magnitude faster. Of course, a big factor of this speedup is due to differences between CPU and GPU hardware architectures, where LBM is more

The comparison between grid refinement and uniform resolution may not be straightforward, as the refinement of the grid is problem-dependent. For the lid-driven cavity and the level of refinement we selected, the time to solution for both well-optimized uniform [26] and grid refinement LBM is almost the same—grid refinement is only faster by 1.18X.

B. Wind Tunnel

In order to showcase the benefits of employing nonuniform grid refinement in LBM and to further demonstrate our efficient implementation of this technique on a single GPU, we constructed a virtual wind tunnel experiment with two configurations.

Flow over sphere: We place a sphere inside a virtual wind tunnel with three levels of refinement around the sphere (Figure 8 and Table I), where the no-slip boundary condition is again imposed on the side walls and on the sphere using the halfway bounce-back method [27]. The inlet velocity associated with the incoming flow inside the wind tunnel is also prescribed using the same bounce-back technique. For the outflow boundary condition where the flow exits the domain, the missing directions of the distribution function fi at the boundary are simply assigned the corresponding lattice weights wi.

Table I shows the performance of both the baseline and our most optimized implementation. The baseline here corresponds to the algorithm in Figure 4b, which includes two of our own improvements compared to the original algorithm proposed in [16]. More precisely, in the modified baseline here (i) the ghost layer is allocated on the coarse level and (ii) the Accumulate communication is initiated from the coarse level. Table I clearly demonstrates the speedups gained through our optimizations. We note that as the computational domain becomes larger, the speedup decreases because a larger fraction of time is spent on the domain interior (e.g., to perform Collision) and less time is spent on the interface between levels (e.g., to perform Coalescence); hence the overall impact of the computational operations targeted by our optimized implementation becomes relatively less pronounced. Nevertheless, our optimized implementation is able to sustain significantly better performance, with up to 30% speedup. In addition, we carry out an ablation study to show the impact of the different fusion options shown schematically in
amenable to massively parallel architectures like the GPU. For Figure 4. Figure 9 shows that the fusion of Collide and
that, we compared also against waLBerla; an LBM block- Streaming operation associated with the finest resolution has
structured implementation that relies on meta-programming the largest contribution to the speedup. This is because the
techniques to generate highly efficient code for CPUs and voxels at the finest resolution consume more than half of
GPUs. Again using the lid-driven cavity example, waLBerla the number of active voxels (Table I). We emphasize that
performs O(10) MLUPS while our implementation is more such fusion is only possible thanks to the introduction of the
than 2250 MLUPS i.e., more than two orders of magnitude Accumulate step which breaks the data dependency allowing
speedup. We acknowledge that waLBerla’s developers have the finest level to fuse the Collision and Streaming steps into
only recently ported their grid refinement algorithm to GPU, one.
and therefore it is not well-optimized. However, this highlights Flow over airplane: In this scenario, we demonstrate the
the importance of our careful optimizations and illustrates that impressive capabilities offered by our efficient implementa-
merely porting CPU code to GPU is not enough to utilize the tions of the grid refinement LBM running on a single GPU. We
GPU hardware efficiently. simulate a virtual domain with dimensions of 1596×840×840
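The benefit of this kind of fusion can be illustrated with a toy example. The sketch below is not the paper's CUDA code: it uses a 1D three-velocity (D1Q3) lattice on a periodic line, and the relaxation rate `OMEGA` and all function names are our own illustrative choices. The unfused version materializes the post-collision state in a full-size intermediate buffer that the streaming pass must read back, while the fused version collides and pushes outgoing populations in a single sweep over memory.

```python
# Toy 1D LBM (D1Q3: one rest population f0 and two moving populations
# fp, fm with velocities +1 and -1) on a periodic line of cells.

OMEGA = 0.8  # BGK relaxation rate (arbitrary choice for this sketch)

def _collide(f0, fp, fm):
    """BGK collision for one cell: relax toward the D1Q3 equilibrium."""
    rho = f0 + fp + fm
    u = (fp - fm) / rho
    eq0 = (2.0 / 3.0) * rho * (1.0 - 1.5 * u * u)
    eqp = (1.0 / 6.0) * rho * (1.0 + 3.0 * u + 3.0 * u * u)
    eqm = (1.0 / 6.0) * rho * (1.0 - 3.0 * u + 3.0 * u * u)
    return (f0 + OMEGA * (eq0 - f0),
            fp + OMEGA * (eqp - fp),
            fm + OMEGA * (eqm - fm))

def step_unfused(f0, fp, fm):
    """Two passes: a Collision kernel, then a Streaming kernel."""
    n = len(f0)
    # Pass 1: full sweep; post-collision state stored in an extra buffer.
    post = [_collide(f0[i], fp[i], fm[i]) for i in range(n)]
    # Pass 2: full sweep; reads the intermediate buffer back.
    new0 = [post[i][0] for i in range(n)]
    newp = [post[(i - 1) % n][1] for i in range(n)]  # pull from the left
    newm = [post[(i + 1) % n][2] for i in range(n)]  # pull from the right
    return new0, newp, newm

def step_fused(f0, fp, fm):
    """One pass: collide and immediately push (stream) outgoing populations."""
    n = len(f0)
    new0, newp, newm = [0.0] * n, [0.0] * n, [0.0] * n
    for i in range(n):
        c0, cp, cm = _collide(f0[i], fp[i], fm[i])
        new0[i] = c0
        newp[(i + 1) % n] = cp  # push to the right neighbor
        newm[(i - 1) % n] = cm  # push to the left neighbor
    return new0, newp, newm
```

Both variants produce identical populations; the fused one simply performs one sweep over the distribution arrays instead of two, which is the same memory-traffic argument that motivates fusing the Collision and Streaming kernels on the finest level once the Accumulate step removes the intervening dependency.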
(a) Nonuniform Grid (b) Iteration 5k (c) Iteration 10k (d) Iteration 30k
Fig. 8: Evolution of flow over a sphere with radius R at Re = u_inlet R/ν = 4000 using the KBC collision model and D3Q27 lattice.
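The outflow treatment described for the sphere setup, i.e., assigning the lattice weight w_i to each missing population, is equivalent to imposing the equilibrium distribution at unit density and zero velocity. Below is a minimal sketch using the standard D3Q27 weights; the index ordering and the `apply_outflow` helper are illustrative assumptions, not the paper's API.

```python
# D3Q27 lattice weights: rest (8/27), 6 face neighbors (2/27),
# 12 edge neighbors (1/54), and 8 corner neighbors (1/216).
# The ordering (rest, faces, edges, corners) is a common convention
# but an assumption of this sketch.
W = [8 / 27] + [2 / 27] * 6 + [1 / 54] * 12 + [1 / 216] * 8

def apply_outflow(f_cell, missing_dirs):
    """At an outlet voxel, the populations f_i that would stream in from
    outside the domain are unknown. Following the simple treatment in the
    text, each missing direction is set to its lattice weight w_i, which
    equals the equilibrium value at density rho = 1 and zero velocity."""
    for i in missing_dirs:
        f_cell[i] = W[i]
    return f_cell
```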
Size            | Distribution (×10^6)  | Baseline (MLUPS) | Ours (MLUPS) | Speedup
272 × 192 × 272 | 0.602, 0.296, 0.175   | 482.63           | 1081.67      | 2.20
544 × 384 × 544 | 4.81, 2.37, 1.40      | 1115.80          | 1646.37      | 1.40
816 × 576 × 816 | 16.25, 8.0, 4.74      | 1299.7           | 1805.03      | 1.30

TABLE I: Comparing the performance of the modified baseline algorithm (Figure 4b) to our optimized implementation (Figure 4f) using the example of flow over a sphere, where three levels of refinement were used to arrive at the finest level around the sphere. Size indicates the length of the virtual wind tunnel box described at the finest level along the x, y, and z directions; Distribution indicates the distribution of active voxels across different levels starting from the finest level; Baseline refers to the MLUPS achieved by our implementation of the modified baseline algorithm; Ours refers to our most optimized/fused implementation; and Speedup = Ours/Baseline.
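The MLUPS numbers in Table I can be related to the voxel distribution and the measured wall-clock time. In the standard LBM refinement time-stepping, a level that is k levels finer than the coarsest performs 2^k updates per coarse step; the helper below uses that accounting, but the exact normalization is an illustrative assumption rather than the paper's measurement code.

```python
def mlups(voxels_per_level, seconds_per_coarse_step):
    """Effective million lattice updates per second (MLUPS) for a
    nonuniform grid.

    voxels_per_level: active voxel counts ordered finest -> coarsest
    (as in the Distribution column of Table I). A level that is k levels
    finer than the coarsest is updated 2**k times per coarse time step,
    so its voxel count is weighted accordingly.
    """
    levels = len(voxels_per_level)
    updates = sum(n * 2 ** (levels - 1 - i)
                  for i, n in enumerate(voxels_per_level))
    return updates / seconds_per_coarse_step / 1e6

def speedup(ours_mlups, baseline_mlups):
    """Speedup as reported in Table I: the ratio of the two MLUPS values."""
    return ours_mlups / baseline_mlups
```

For example, a 3-level grid with 4M, 2M, and 1M active voxels (finest first) performs 4M·4 + 2M·2 + 1M·1 = 21M lattice updates per coarse step.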
Fig. 9: The impact of different fusion configurations (Baseline; Fused CA; Fused CA and SE; Fused CA, SE, and SO; Fused CASE and SO) on the achieved MLUPS for the flow over sphere problem, plotted against the total number of active voxels (1,074,048; 8,592,384; 28,999,296).

Flow over airplane: In this scenario, we demonstrate the impressive capabilities offered by our efficient implementation of the grid refinement LBM running on a single GPU. We simulate a virtual domain with dimensions of 1596 × 840 × 840 and place an aircraft model into this domain. We maintain consistent boundary conditions and employ the same collision model as in our previous case. In Figure 1, we depict the intricate turbulent flow patterns around the aircraft.

It is worth noting that with a uniform grid approach, the specified domain size of 1596 × 840 × 840 would be unattainable due to limitations in accommodating such large grids within a single GPU. Even with advanced optimizations for uniform grids, such as the AA-method [7] that employs a single buffer, the largest feasible domain size on a single 40 GB GPU would be restricted to approximately 794 × 794 × 794. This emphasizes the exceptional capabilities enabled by our grid refinement implementation that would otherwise be unachievable.

VII. CONCLUSION AND FUTURE WORK

Large-scale and accurate simulation of realistic turbulent flows is essential for many applications. In this work, we investigated the nonuniform grid refinement algorithm in the lattice Boltzmann method (LBM) and optimized its performance on a single GPU. The algorithm, which was inherited from a CPU-focused implementation, introduces complex data dependencies, leading to an inefficient implementation if ported to the GPU in its original form. We analyzed and optimized this complex data dependency. Our proposed algorithm, which heavily leverages kernel fusion and GPU atomic operations, features a lower memory footprint and reaches higher performance. Efficient nonuniform grid refinement on the GPU enables us to run large-scale problems on a single 40 GB GPU. Such large spatial scales open new avenues for tackling realistic engineering problems that were previously either impossible or only feasible on multi-GPU systems. The foundation laid by our optimized single-GPU algorithm positions us favorably for future research in extending this approach to multi-GPU frameworks, offering even higher scalability options for both speed and memory requirements. Additionally, we foresee promising research opportunities in Adaptive Mesh Refinement (AMR) for LBM, enabling dynamic grid resolution adjustments during runtime and enhancing the flexibility of our simulations.

REFERENCES

[1] J. H. Ferziger and M. Perić, Computational Methods for Fluid Dynamics, 3rd ed. Springer, 2019.
[2] T. Krüger, H. Kusumaatmaja, A. Kuzmin, O. Shardt, G. Silva, and E. M. Viggen, The Lattice Boltzmann Method, 1st ed. Springer Cham, 2016.
[3] D. S. S. Corp., "3DEXPERIENCE SIMULIA," 2023. [Online]. Available: https://www.3ds.com
[4] A. E. Inc., "ultraFluidX," 2023. [Online]. Available: https://altair.com
[5] F. Schornbaum and U. Rüde, "Extreme-scale block-structured adaptive mesh refinement," SIAM Journal on Scientific Computing, vol. 40, pp. 358–387, 2018.
[6] M. Ataei and H. Salehipour, "XLB: A differentiable massively parallel lattice Boltzmann library in Python," CoRR, vol. abs/2311.16080, Dec. 2023.
[7] P. Bailey, J. Myre, S. D. Walsh, D. J. Lilja, and M. O. Saar, "Accelerating lattice Boltzmann fluid flow simulations using graphics processors," in 2009 International Conference on Parallel Processing. IEEE, Sep. 2009, pp. 550–557.
[8] M. Geier and M. Schönherr, "Esoteric Twist: An efficient in-place streaming algorithmus for the lattice Boltzmann method on massively parallel hardware," Computation, vol. 5, no. 2, p. 19, Mar. 2017.
[9] M. Lehmann, M. J. Krause, G. Amati, M. Sega, J. Harting, and S. Gekle, "Accuracy and performance of the lattice Boltzmann method with 64-bit, 32-bit, and customized 16-bit number formats," Physical Review E, vol. 106, no. 1, Jul. 2022.
[10] J. Latt, C. Coreixas, and J. Beny, "Cross-platform programming model for many-core lattice Boltzmann simulations," PLOS ONE, vol. 16, no. 4, pp. 1–29, Apr. 2021.
[11] M. Schönherr, K. Kucher, M. Geier, M. Stiebler, S. Freudiger, and M. Krafczyk, "Multi-thread implementations of the lattice Boltzmann method on non-uniform grids for CPUs and GPUs," Computers & Mathematics with Applications, vol. 61, pp. 3730–3743, Jun. 2011. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0898122111002999
[12] P. Valero-Lara and J. Jansson, "Heterogeneous CPU+GPU approaches for mesh refinement over Lattice-Boltzmann simulations," Concurrency and Computation: Practice and Experience, vol. 29, no. 7, Apr. 2017. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/cpe.3919
[13] C.-M. Wu, Y.-S. Zhou, and C.-A. Lin, "Direct numerical simulations of turbulent channel flows with mesh-refinement lattice Boltzmann methods on GPU cluster," Computers & Fluids, vol. 210, p. 104647, Oct. 2020. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0045793020302188
[14] C. A. Niedermeier, C. F. Janssen, and T. Indinger, "Massively-parallel multi-GPU simulations for fast and accurate automotive aerodynamics," in Proceedings of the 7th European Conference on Computational Fluid Dynamics, ser. ECFD 7, Jun. 2018.
[15] S. Watanabe and T. Aoki, "Large-scale flow simulations using lattice Boltzmann method with AMR following free-surface on multiple GPUs," Computer Physics Communications, vol. 264, Jul. 2021. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0010465521000291
[16] F. Schornbaum and U. Rüde, "Massively parallel algorithms for the lattice Boltzmann method on nonuniform grids," SIAM Journal on Scientific Computing, vol. 38, pp. C96–C126, 2016.
[17] Z. Liu, J. Ruan, W. Song, L. Zhou, W. Guo, and J. Xu, "Parallel scheme for multi-layer refinement non-uniform grid lattice Boltzmann method based on load balancing," Energies, vol. 15, no. 21, 2022. [Online]. Available: https://www.mdpi.com/1996-1073/15/21/7884
[18] I. V. Karlin, F. Bösch, and S. Chikatamarla, "Gibbs' principle for the lattice-kinetic theory of fluid dynamics," Physical Review E, vol. 90, no. 3, Sep. 2014.
[19] D. Lagrava, O. Malaspinas, J. Latt, and B. Chopard, "Advances in multi-domain lattice Boltzmann grid refinement," Journal of Computational Physics, vol. 231, pp. 4808–4822, May 2012.
[20] A. Dupuis and B. Chopard, "Theory and applications of an alternative lattice Boltzmann grid refinement algorithm," Physical Review E, vol. 67, Jun. 2003. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRevE.67.066707
[21] B. Dorschner, N. Frapolli, S. S. Chikatamarla, and I. V. Karlin, "Grid refinement for entropic lattice Boltzmann models," Physical Review E, vol. 94, Nov. 2016. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRevE.94.053311
[22] O. Filippova and D. Hänel, "Grid refinement for lattice-BGK models," Journal of Computational Physics, vol. 147, pp. 219–228, Nov. 1998. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0021999198960892
[23] H. Chen, O. Filippova, J. Hoch, K. Molvig, R. Shock, C. Teixeira, and R. Zhang, "Grid refinement in lattice Boltzmann methods based on volumetric formulation," Physica A: Statistical Mechanics and its Applications, vol. 362, pp. 158–167, Mar. 2006. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0378437105009696
[24] M. Hasert, K. Masilamani, S. Zimny, H. Klimach, J. Qi, J. Bernsdorf, and S. Roller, "Complex fluid simulations with the parallel tree-based lattice Boltzmann solver Musubi," Journal of Computational Science, vol. 5, no. 5, pp. 784–794, Sep. 2014. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1877750313001270
[25] M. Rohde, D. Kandhai, J. J. Derksen, and H. E. A. van den Akker, "A generic, mass conservative local grid refinement technique for lattice-Boltzmann schemes," International Journal for Numerical Methods in Fluids, vol. 51, pp. 439–468, 2006. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/fld.1140
[26] M. Meneghin, A. H. Mahmoud, P. K. Jayaraman, and N. J. W. Morris, "Neon: A multi-GPU programming model for grid-based computations," in Proceedings of the 36th IEEE International Parallel and Distributed Processing Symposium, Jun. 2022, pp. 817–827. [Online]. Available: https://escholarship.org/uc/item/9fz7k633
[27] A. J. Ladd, "Numerical simulations of particulate suspensions via a discretized Boltzmann equation. Part 1. Theoretical foundation," Journal of Fluid Mechanics, vol. 271, pp. 285–309, 1994.
[28] U. Ghia, K. Ghia, and C. Shin, "High-Re solutions for incompressible flow using the Navier-Stokes equations and a multigrid method," Journal of Computational Physics, vol. 48, pp. 387–411, Dec. 1982. [Online]. Available: https://www.sciencedirect.com/science/article/pii/0021999182900584
[29] J. Latt, O. Malaspinas, D. Kontaxakis, A. Parmigiani, D. Lagrava, F. Brogi, M. B. Belgacem, Y. Thorimbert, S. Leclaire, S. Li, F. Marson, J. Lemus, C. Kotsalos, R. Conradin, C. Coreixas, R. Petkantchin, F. Raynaud, J. Beny, and B. Chopard, "Palabos: Parallel lattice Boltzmann solver," Computers & Mathematics with Applications, vol. 81, pp. 334–350, 2021. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0898122120301267
[30] M. Bauer, S. Eibl, C. Godenschwager, N. Kohl, M. Kuron, C. Rettinger, F. Schornbaum, C. Schwarzmeier, D. Thönnes, H. Köstler, and U. Rüde, "waLBerla: A block-structured high-performance framework for multiphysics simulations," Computers & Mathematics with Applications, vol. 81, pp. 478–501, 2021. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0898122120300146