Concise Python implementation of the LBM on HPC
P. Mora, G. Morra and D.A. Yuen
Accepted 2019 September 24. Received 2019 August 31; in original form 2019 April 4
SUMMARY
The lattice Boltzmann method (LBM) is a technique for simulating fluid dynamics based on modelling distributions of particles moving and colliding on a lattice. The Python scripting language provides a clean programming paradigm for developing codes based on the LBM; however, in order to reach performance comparable to compiled languages, it must be carefully implemented, making maximum use of its vectorized tools, mostly integrated in the NumPy module. We present here the details of a Python implementation of a concise LBM code, with the purpose of offering a pedagogical tool for students and professionals in the geosciences who are approaching this technique for the first time. The first half of the paper focuses on how to vectorize a 2-D LBM code and shows that, if this is done carefully, it allows performance close to that of a compiled code. In the second part of the paper, we use the vectorization described earlier to write a natural parallel implementation using MPI and test both weak and strong scaling up to 1280 cores. One benchmark, Poiseuille flow, and two applications, one on sound wave propagation and another on fluid flow through a simplified model of a rock matrix, are finally shown.
Key words: Permeability and porosity; Geomechanics; Non-linear differential equations;
Numerical approximations and analysis; Numerical modelling; Wave propagation.
1 INTRODUCTION
This paper aims to provide a clean Python high performance implementation for the lattice Boltzmann method (LBM) to model fluid
flow in geosciences. This method involves simulating the Boltzmann Equations on a discrete lattice—an approach that solves the Navier–
Stokes Equations in the macroscopic limit (Frisch et al. 1986; Chen & Doolen 1998)—rather than modelling the Navier–Stokes equations
themselves. A complete treatise of LBM covering all facets can be found in (Succi 2001). In recent years, the LBM has been applied to
various geophysical problems. This includes the study of viscoelastic waves (Xia et al. 2017), the study of flow in porous media (Keehm
et al. 2004; Guo et al. 2014), the study of imbibition in porous structures (Zheng & Wang 2018), the study of dissolution and precipitation
in porous media (Kang et al. 2003; Huber et al. 2014), the study of reactive flow in porous media (Kang et al. 2010), the study of plumes
and convection in the mantle (Mora & Yuen 2017, 2018a,b), the study of melting with convection (Huber et al. 2008) and the study of
reactive transport (Huber et al. 2008; Parmigiani et al. 2011). In the following, we will review briefly the LBM prior to presenting the
implementation.
The LBM allows fluid dynamics to be modelled by simulating the movement and collision of particle distributions on a discrete lattice
in 2-D or 3-D. LBMs have their origins in lattice gas automata (LGA) in which particles move and collide on a discrete lattice representing a
simplified discrete version of molecules moving and colliding in a gas. LGA were first proven by Frisch et al. (1986) to yield the Navier–Stokes
equations in the macroscopic limit. These initial LGA models were unconditionally stable and conserved mass and momentum perfectly.
However, they were computationally expensive with averaging needed over space to obtain the macroscopic equations and furthermore, costly
calculations were required to evaluate the collision term. Since the initial LGA models, the method has been extended to model distributions
of particles moving and colliding on a lattice. In these LBMs, one is solving the classical Boltzmann equation on a discrete lattice. An efficient
method to calculate the collision term via relaxation was proposed by Bhatnagar, Gross and Krook (BGK, Bhatnagar et al. 1954), but only
more recently has this been applied to the LBM, enabling efficient algorithms to be developed (e.g. Higuera & Jimenez 1989; Qian et al. 1992;
Chen & Doolen 1998), and accurate pressure and velocity boundary conditions to be derived (Zhou & He 1997). Because of the advancement of parallel computing, since the late 1990s research and applications of the LBM have exploded (see Huang et al. 2015; Krüger et al. 2017, for a review).
In particular, numerous studies have been conducted of thermal convection (e.g. Shan 1997; He et al. 1998; Guo et al. 2002; Wang et al.
2013; Arun et al. 2017). Multiphase methods have been developed as well, such as Shan & Chen (1993), and this remains a highly active
research field (Huang et al. 2015; Xie et al. 2017; Di Ilio et al. 2017).
The LBM has been applied to many disciplines of computational science and engineering, owing to how well it can be scaled up on high performance computing (HPC) clusters. Recently developed massively parallel software packages include the multiphysics framework waLBerla (Feichtinger et al. 2011), able to model $10^{12}$ nodes on petascale computers, the non-uniform grid implementation reaching up to a trillion grid nodes illustrated in Schornbaum & Rüde (2016), the widely used Palabos (Lagrava et al. 2012), OpenLB (Heuveline & Latt 2007) and LB3D (Groen et al. 2011), all of them showing excellent performance on HPC systems. In this work, we illustrate an example of a parallel implementation in 2-D, tested up to 1280 cores, written only in Python, for pedagogical purposes and for specific applications in the geosciences.
This paper provides the details of a technical implementation of the LBM using the Python language (Van Rossum et al. 2007; Lutz 2013). The goal of this manuscript is to provide geoscientists and non-experts in parallel programming with all the details necessary to write their own fast and scalable implementation of the LBM, using only NumPy, the fundamental package with built-in vectorized operations on N-dimensional array objects (Morra 2018), and mpi4py, the most commonly used package providing bindings to the Message Passing Interface for Python (Dalcin et al. 2011). Details on how to optimize every segment of the code are described throughout the paper in order to provide scientists with simple and clear instructions on how to write their own vectorized, scalable, parallel LBM code using only Python.
2 NUMERICAL SIMULATION METHODOLOGY
The LBM involves simulating particle number densities moving and colliding on a discrete lattice. In one time-step, the particle number
densities can move by one lattice spacing along the orthogonal axes, or along diagonals, followed by modification of the number densities at
the lattice nodes due to collision. We use the standard notation in LBM denoted DnQm for a simulation in D = n dimensions, and with Q
= m velocities on the discrete lattice. In the following, we restrict ourselves to 2-D and use the D2Q9 lattice Boltzmann lattice arrangement
shown in Fig. 1. In this lattice, we define fα (x, t) as the number density of particles moving in the α-direction where the Q = 9 velocities are
given by
$\mathbf{c}_\alpha = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1), (1, 1), (-1, 1), (-1, -1), (1, -1)]^T$. This choice means that $\mathbf{c}_0$ is the zero velocity vector and so represents stationary particles, and $\mathbf{c}_{\alpha+2} = -\mathbf{c}_\alpha$ for α = (1, 2, 5, 6) are the velocities in the eight directions shown in Fig. 1. The lattice is unitary, so the lattice spacing and time step are $\Delta x = \Delta t = 1$.
The LBM involves two steps: (a) streaming (movement) and (b) collision of a distribution function. If we wish to model thermal convection
or thermal–chemical convection, we just model additional distribution functions, representing the energy and the chemical component (e.g.
Luo & Girimaji 2002; Arcidiacono et al. 2007; Bartlett 2017). In this paper, we focus on the technical aspects of implementing the
LBM using Python. We therefore model a single distribution function, fα representing the mass density of particles moving and colliding
in the α-direction on the discrete lattice. The evolution equation encompassing the two steps of moving (streaming) and colliding is
given by
$$f_\alpha(\mathbf{x} + \mathbf{c}_\alpha \Delta t, t + \Delta t) = f_\alpha(\mathbf{x}, t) + f_\alpha^c(\mathbf{x}, t), \qquad (1)$$
where $f_\alpha^c(\mathbf{x}, t)$ is the collision term and represents the redistribution of particle number densities at lattice site $(\mathbf{x}, t)$ due to collisions. The
collision term can be calculated exactly, but this is computationally expensive and as such, is rarely done. Alternatively, the collision term can
be calculated by the BGK method (Qian et al. 1992; Chen & Doolen 1998), in which case the distributions relax to the equilibrium distribution.
The BGK method is computationally efficient and gives satisfactory results provided the distributions are not too far from equilibrium. The
BGK collision term is given by
$$f_\alpha^c(\mathbf{x}, t) = \frac{1}{\tau_f}\left( f_\alpha^{eq}(\mathbf{x}, t) - f_\alpha(\mathbf{x}, t) \right), \qquad (2)$$
where $f_\alpha^{eq}(\mathbf{x}, t)$ is used to denote the equilibrium distribution of $f_\alpha(\mathbf{x}, t)$. It should be noted that since the simple single-relaxation-time BGK collision term given by eq. (2) was proposed, more accurate and stable multiple-relaxation-time methods have been developed (Lallemand
& Luo 2000; d’Humieres et al. 2002). The equilibrium distribution is obtained by a Taylor expansion about the Boltzmann distribution
given by
$$f^{eq}(\mathbf{u}) = \frac{\rho}{(2\pi R T)^{D/2}} \exp\!\left[-\frac{(\mathbf{c} - \mathbf{u})^2}{2RT}\right], \qquad (3)$$
where D is the number of dimensions. Taking the Taylor expansion of $f^{eq}$, we obtain
$$f^{eq}(\mathbf{u}) = \frac{\rho}{(2\pi R T)^{D/2}} \exp\!\left(-\frac{\mathbf{c}^2}{2RT}\right) \left[ 1 + \frac{\mathbf{c}\cdot\mathbf{u}}{RT} + \frac{1}{2}\frac{(\mathbf{c}\cdot\mathbf{u})^2}{(RT)^2} - \frac{1}{2}\frac{\mathbf{u}^2}{RT} \right], \qquad (4)$$
Noting that the speed of sound is given by $c_s = \sqrt{RT}$, and that on the D2Q9 lattice we have $c_s = \sqrt{RT} = c/\sqrt{3}$, where $c = \Delta x/\Delta t = 1$ is the lattice speed, we obtain the equilibrium distribution on the lattice of
$$f_\alpha^{eq}(\mathbf{x}, t) = \rho w_\alpha \left[ 1 + 3(\mathbf{c}_\alpha \cdot \mathbf{u}) + \frac{9}{2}(\mathbf{c}_\alpha \cdot \mathbf{u})^2 - \frac{3}{2}\mathbf{u}^2 \right] = \rho w_\alpha \left[ c_1 + c_2 (\mathbf{c}_\alpha \cdot \mathbf{u}) + c_3 (\mathbf{c}_\alpha \cdot \mathbf{u})^2 + c_4 \mathbf{u}^2 \right]. \qquad (5)$$
In eq. (2), the value of $\tau_f$ is a relaxation time which relates to the kinematic viscosity $\nu_f$ through
$$\nu_f = c_s^2\left(\tau_f - \tfrac{1}{2}\right) = \frac{2\tau_f - 1}{6}. \qquad (6)$$
In eq. (5) for the equilibrium distribution, the weighting scalars wα are given by w0 = 4/9 for α = 0 (stationary particles), wα = 1/9 for
α = (1, 2, 3, 4) (particles travelling along the two Cartesian axes), and wα = 1/36 for α = (5, 6, 7, 8) which are the particles travelling
diagonally.
The macroscopic properties, density $\rho$ and velocity $\mathbf{u}$, relate to the distribution function $f_\alpha$ through
$$\rho(\mathbf{x}, t) = \sum_\alpha f_\alpha(\mathbf{x}, t), \qquad (7)$$
and
$$\mathbf{P}(\mathbf{x}, t) = \rho \mathbf{u}(\mathbf{x}, t) = \sum_\alpha f_\alpha(\mathbf{x}, t)\, \mathbf{c}_\alpha, \qquad (8)$$
where ρ = ρ(x, t) is the macroscopic density, P(x, t) is the momentum density, R is the universal gas constant, and D is the number of
dimensions. In the following, we restrict ourselves to two dimensions (D = 2) and we use units such that R = 1.
3 A VECTORIZED PYTHON IMPLEMENTATION
Next, we need to set an array, ai[], that contains pointers to opposite directions, an array, c[], that contains the lattice velocities, and an array, w[], with the weights associated with each of the nine lattice velocities. Refer to Section 2 for the precise formulation. We use the NumPy module to define the arrays explicitly:
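A minimal sketch of these definitions, assuming the D2Q9 ordering of Section 2 (the exact literals in the published listing may differ):

import numpy as np

na = 9                                    # number of lattice velocities (D2Q9)
# lattice velocities c[a] = (cx, cz), ordered as in Section 2
c = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
              [1, 1], [-1, 1], [-1, -1], [1, -1]])
# index of the opposite direction for each velocity, used for bounce-back
ai = np.array([0, 3, 4, 1, 2, 7, 8, 5, 6])
# weights of the equilibrium distribution, eq. (5)
w = np.array([4/9, 1/9, 1/9, 1/9, 1/9, 1/36, 1/36, 1/36, 1/36])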
Finally, we need to define physical quantities associated with the relaxation time. This can be done based on a non-dimensional time
step dt and lattice spacing dx equal to 1. Using a constant viscosity value in the entire domain and based on eq. (6), the relaxation time and
constants c1 , ...c4 of eq. (5) can also be computationally defined:
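A sketch of these definitions, assuming a uniform kinematic viscosity nu (the value below is illustrative):

dx = 1.0; dt = 1.0                 # non-dimensional lattice spacing and time step
nu = 0.1                           # kinematic viscosity (illustrative value)
tau_f = 3.0*nu + 0.5               # relaxation time from eq. (6), with c_s^2 = 1/3
c1, c2, c3, c4 = 1.0, 3.0, 9.0/2.0, -3.0/2.0   # constants of eq. (5)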
Next, we must initialize the size of the domains, the number of time steps, and the arrays that we will use. When using NumPy, unlike
standard Python, arrays need to be allocated, using the np.zeros() or np.ones() commands, defining their type (Int, Float, etc.). This allows
the NumPy routines to quickly access the arrays in a vectorized form and greatly speeds the calculations. The type is generically defined as
float, and will be set to either 32 bit or 64 bit float, depending on the architecture of the machine. The size of the grid is nx×nz. For this
implementation, we initialize nine arrays f, f_stream, f_eq, Delta_f, rho, u, Pi, u2 and cu:
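A sketch of these allocations; the array shapes are assumptions, chosen to be consistent with the einsum index order 'azx' used later, and the grid size and number of time steps are illustrative:

nx, nz = 401, 401                         # grid size (illustrative)
nt = 1000                                 # number of time steps (illustrative)
f        = np.zeros((na, nz, nx), dtype=float)
f_stream = np.zeros((na, nz, nx), dtype=float)
f_eq     = np.zeros((na, nz, nx), dtype=float)
Delta_f  = np.zeros((na, nz, nx), dtype=float)
rho      = np.ones((nz, nx), dtype=float)
u        = np.zeros((2, nz, nx), dtype=float)
Pi       = np.zeros((2, nz, nx), dtype=float)
u2       = np.zeros((nz, nx), dtype=float)
cu       = np.zeros((nz, nx), dtype=float)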
To run an example simulation, it is necessary to initialize the density and velocity. We assume for this simple demonstrative example that there are no obstacles and that the medium is homogeneous; later we show how to add internal heterogeneities. Here the density is stored in rho and, because the initial speed in the medium u is assumed to be zero, the function f depends on the density only, with the appropriate weights. We consider here a point source located in the right part of the domain; however, the initial density can be modified in any way.
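One possible initialization, as a sketch; the location and amplitude of the point source are assumptions:

rho[:, :] = 1.0                      # uniform background density
rho[nz//2, 3*nx//4] += 0.01          # small perturbation: point source in the right half
u[:, :, :] = 0.0                     # fluid initially at rest
for a in range(na):
    f[a] = rho*w[a]                  # zero-velocity equilibrium of eq. (5)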
Before running a simulation for nt time steps, a last step is necessary: the creation of a vector of indexes. This is a key step for running a vectorized simulation. This array of indexes will allow us to apply an instruction in the nine directions of the LBM algorithm and at every one of the nx×nz points in a vectorized manner, speeding up the streaming calculations by about two orders of magnitude:
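A sketch of how such an index array can be built. The sign convention assumes that streaming pulls each density from the node one lattice vector upstream, with periodic wrapping; the exact construction in the published listing may differ:

indTotal = np.zeros((na, nx*nz), dtype=int)      # flattened source index, per direction
for a in range(na):
    xArr = (np.arange(nx) - c[a][0] + nx) % nx   # x-coordinate of the upstream node
    zArr = (np.arange(nz) - c[a][1] + nz) % nz   # z-coordinate of the upstream node
    zInd, xInd = np.meshgrid(zArr, xArr, indexing='ij')
    indTotal[a] = (zInd*nx + xInd).reshape(nx*nz)
indexes = indTotal                               # name used in the streaming step below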
where the array c[] that contains the lattice velocities is embedded in a general index indTotal[] of dimension na×(nx×nz). This
architecture of indexes is what allows us to vectorize the Python version of the kernel of the LBM algorithm.
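For reference, here is a sketch of the resulting vectorized kernel, reconstructed from the instructions discussed in the notes below; the loop structure and some variable names are assumptions:

for t in range(nt):
    # (1) streaming: pull each f[a] from the upstream node using the precomputed indexes
    for a in range(na):
        f_new = f[a].reshape(nx*nz)[indexes[a]]
        f_stream[a] = f_new.reshape(nz, nx)
    f = f_stream.copy()
    # (2) macroscopic density and velocity
    rho = np.sum(f, axis=0)
    Pi = np.einsum('azx,ad->dzx', f, c)
    u = Pi/rho
    # (3) equilibrium distribution, eq. (5)
    u2 = u[0]*u[0] + u[1]*u[1]
    for a in range(na):
        cu = c[a][0]*u[0] + c[a][1]*u[1]
        f_eq[a] = rho*w[a]*(c1 + c2*cu + c3*cu*cu + c4*u2)
    # (4) BGK collision, eq. (2)
    Delta_f = (f_eq - f)/tau_f
    f += Delta_f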
1. The indexes array created above has been used in the definition of the new streamed distribution f_new = f[a].reshape(nx*nz)[indexes[a]]. Here two reshape() instructions are used to change the shape of the f array, but these do not consume any computing time, as they only refer to the internal structure of this array. This is explained in more detail in Morra (2018, chapter 3).
2. If copy() were absent from the instruction f = f_stream.copy(), then f would merely point to the memory allocated for f_stream, which would subsequently be modified.
3. In the instruction Pi = np.einsum('azx,ad->dzx',f,c) we used the function np.einsum(). This powerful tool allows one to perform any tensorial product, with arrays of any dimensionality, at the speed of the underlying optimized C code.
4. The instruction u2 = u[0]*u[0]+u[1]*u[1] could be written using the Einstein summation function np.einsum('ijk,ijk->jk', u, u) or by exploiting the linear algebra library of NumPy, np.linalg.norm(u,axis=0)**2, but that would not accelerate the code. In the same way, cu = c[a][0]*u[0] + c[a][1]*u[1] could be written as np.einsum('j,jkl->kl',c[a],u), but it would not accelerate the code. In both cases the vectorization of NumPy is equally fast.
5. Separating the instructions Delta_f = (f_eq - f)/tau_f and f += Delta_f helps to avoid issues with the allocations of the NumPy array f.
A user might also want to add a level of complexity by implementing regions that waves can penetrate and others that do not vibrate, with either absorbing or non-absorbing boundary conditions. If the impenetrable regions are called 'solid', and a solid array defines where flow is allowed (zero) and where it is not allowed (one), then the streaming step becomes:
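A sketch of such an unvectorized streaming step with bounce-back, assuming solid[z, x] is 1 (True) at impenetrable nodes; the exact loop in the published listing may differ:

for a in range(na):
    for z in range(nz):
        for x in range(nx):
            if solid[z, x]:
                continue                     # solid nodes hold no fluid to stream
            xp = (x + c[a][0]) % nx          # destination node in the a-direction
            zp = (z + c[a][1]) % nz
            if solid[zp, xp]:
                # bounce back: reflect the density into the opposite direction
                f_stream[ai[a], z, x] = f[a, z, x]
            else:
                f_stream[a, zp, xp] = f[a, z, x]
f = f_stream.copy()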
where the 1D array denoted ai[] that was defined previously contains pointers to the opposite directions of flow of the number densities.
Use of this pointer array allows so called ‘bounce-back’ boundary conditions to be applied to solid regions, which is equivalent to a zero-slip
boundary condition at the edges of solids.
The code above for the streaming step works but it is extremely slow due to its explicit loops and the if statement. The reader can test
this implementation and will find a decrease in speed of about two orders of magnitude relative to the optimized streaming step specified
below. The same experiment can be done with all the other steps, which have been initially written in this unoptimized manner, and then
vectorized.
The optimized streaming step with bounce-back boundary conditions is written as:
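A sketch of what this vectorized form may look like, assuming solid[a] stores, for each node, whether the neighbour it streams from in the a-direction is solid; the exact expression in the published listing may differ:

for a in range(na):
    f_new = f[a].reshape(nx*nz)[indexes[a]].reshape(nz, nx)
    # where the neighbour is solid (True ~ 1) bounce back the opposite direction,
    # elsewhere (False ~ 0) take the streamed value
    f_stream[a] = solid[a]*f[ai[a]] + (1 - solid[a])*f_new
f = f_stream.copy()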
where the array solid[a] is a Boolean array that is set to True if the adjacent point in the a-direction is a solid region of the model
where particles cannot penetrate, and to False if the adjacent point is a fluid region of the model. Here we exploited the equivalency in
Python between 0 and 1 and False and True, respectively.
5 PARALLEL VERSION
In the following, we present the algorithm and the Python parallel implementation of the LBM. In our implementation the calculations are
performed on the slave nodes, while the master node sends and receives data, and creates output files.
Figure 2. The domain decomposition showing the physical domains modelled in each processor.
The first task of developing an MPI code is to determine how the physical domain maps onto slave nodes. The simplest way is to divide the
physical domain into vertical strips. We will assume that the x-axis is longer than the z-axis and choose strips parallel to the z-axis.
Since in the LBM only nearest neighbor communications are needed, in the striped domain decomposition, the portion of data on the
slave nodes that has to be communicated covers only a single unit in width. As an example, we show in Fig. 2 the spatial decomposition onto slave nodes and in Fig. 3 the expanded view of a two-node problem with nx = 11. The first slave node has 6 units of space plus the
two overlapping edges, and the second slave node has 5 units of space plus the two overlapping edges. Each domain is a constant size which
equalizes work between processors, so the second domain has one empty column. Allowing for the one unit overlap of each domain, each
slave node therefore contains seven columns.
$$L = \sqrt{n^2/n_p} = n_p^{-1/2}\, n.$$
In this case, the communication cost is proportional to four times the length of the square domains. The factor of 4 is due to there being two
sides of each domain along each of the two dimensions. Hence, the communication cost is proportional to
$$C(d = 2, D = 2) = 4L = 4 n_p^{-1/2} n, \qquad (10)$$
One observes that the domain decomposition over d = 1 dimensions is less costly by a factor of 2 relative to the d = 2 dimensional domain decomposition, but more costly by a factor of $n_p^{1/2}$. Hence, for small problems (e.g. $n_p \sim 1$), the d = 1 dimensional domain decomposition is less costly by up to a factor of 2, while when $n_p \gg 1$, the d = 1 dimensional domain decomposition becomes more costly than the 2-D domain decomposition (d = 2) by a factor of $\sqrt{n_p}$. The number of processors at which the two domain decompositions are equally efficient can be calculated by solving
$$n = 2 n_p^{-1/2} n, \qquad (11)$$
In other words, for small cases with $n_p < 4$, the 1-D domain decomposition is most efficient, but the 2-D domain decomposition rapidly becomes much more efficient for cases of $n_p > 4$, and tends towards being more efficient by a factor of $\sqrt{n_p}/2$.
Next, we consider the communication cost for a 3-D problem. The cases of 1-D and 2-D domain decompositions (d = 1, 2) are already covered by the example above, aside from an additional factor of n required to calculate the area of the domains (i.e. the area of the side of a domain is given by L × n, where L is the length of the domain on the x–y plane and the factor n is the length of the domain along the z-axis). Namely, we have the communication cost for the case of D = 3 for 1-D and 2-D domains (d = 1, 2) given by
$$C(1, 3) = 2n^2, \qquad (12)$$
and
$$C(2, 3) = 4 n^2 n_p^{-1/2}. \qquad (13)$$
Next we consider the case of a 3-D problem (D = 3) decomposed into 3-D cubes (d = 3). The volume per processor is given by
$$V = \frac{n^3}{n_p},$$
so the side length of the cubic domains is $L = n\, n_p^{-1/3}$. The area of each side of the domain cubes is $L^2$, and each face of the cubes must be communicated (6 faces per cube), so the communication cost is proportional to
$$C(3, 3) = 6L^2 = 6 n^2 n_p^{-2/3}. \qquad (14)$$
The above formulae can be generalized to cover all of the above cases (i.e. d = 1, ..., 3 and D = 2, 3). The generalized equation for the communication cost is given by
$$C(d, D) = 2d\, n^{D-1} n_p^{-(d-1)/d}. \qquad (15)$$
Hence, the ratio of the communication cost for a 3-D problem (D = 3) using a 3-D relative to a 2-D domain decomposition (d = 3 relative to d = 2) is given by
$$\frac{C(3, 3)}{C(2, 3)} = \frac{3 n_p^{-2/3}}{2 n_p^{-1/2}} = \frac{3}{2}\, n_p^{-1/6}. \qquad (16)$$
As in the previous 2-D case (D = 2) with 1-D and 2-D domain decompositions (d = 1, 2), for a small number of processors ($n_p \sim 1$) the d = 2 domain decomposition is faster by a small factor (3/2), while as the number of processors increases the d = 3 domain decomposition becomes faster by a factor of $n_p^{1/6}$ ($n_p \gg 1$). The crossover point where the d = 3 domain decomposition becomes more efficient than the d = 2 decomposition is given by
$$n_p^{1/6} = \frac{3}{2} \;\;\Rightarrow\;\; n_p = \left(\frac{3}{2}\right)^6 \approx 11. \qquad (17)$$
The generalized equation for the crossover at which it becomes more efficient to use a d-dimensional domain decomposition compared to a (d − 1)-dimensional decomposition can be solved by setting C(d, D)/C(d − 1, D) to unity. Hence, we have
$$n_p^{\mathrm{crossover}} = \left(\frac{d}{d-1}\right)^{d(d-1)}.$$
In other words, once the number of processors exceeds np = 11, the 3-D domain decomposition is more efficient than a 2-D domain
decomposition, and is much more efficient than a 1-D domain decomposition. We conclude that it is most efficient to use a domain
decomposition over all axes (i.e. d = D). However, to simplify the example and Python coding in this didactic paper, we use a 1-D domain
decomposition in the following 2-D examples. Also, we find that communication costs are small relative to other parts of the LBM algorithm.
A future paper is planned on the implementation of the parallel 3-D LBM for convection.
Besides loading the numerical Python library, here we must also load the MPI library and initialize it in order to know the number of nodes (size) and on which node (rank) the software is presently running:
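A minimal sketch of this initialization:

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
size = comm.Get_size()     # total number of MPI processes (1 master + slaves)
rank = comm.Get_rank()     # index of the current process (0 = master)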
Next, we must initialize the size of the domains. Let us suppose that the size of the grid is nx×nz. We then partition the grid into widths along the x-axis that are as even as possible. Hence, the number of domains is nr=nx/nx_i, where nx is the total size of the x-axis and nx_i is the size of a domain, which must include the overlap with the adjacent domains. The size of nx_i can therefore be calculated using:
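One possible way to compute these quantities, as a sketch; the exact rounding used in the published listing may differ, and nx, nz are assumed to be already defined (e.g. nx = 1001, nz = 251):

nxi = nx // (size - 1)          # first estimate of the interior width per slave node
if nxi*(size - 1) < nx:
    nxi += 1                    # enlarge so that (size-1) slave nodes cover the whole x-axis
nx_i = nxi + 2                  # domain width including the two overlap columns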
which first estimates the size of nx_i and then updates it to ensure the entire x-axis can fit into the (size-1) slave nodes. Note that in the above, nxi is the size of the domain exclusive of the two overlap columns.
Next, we need to calculate how many slave processors, out of the total number (size-1), can be used to model the entire domain of size nx. This is achieved as follows:
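A sketch of this bookkeeping; the exact expressions are assumptions:

nr = nx // nxi                  # number of slave nodes completely filled by the model
dnx = nx - nr*nxi               # remaining columns, if any
if dnx > 0:
    nr += 1                     # one extra, partially filled, slave node
nr += 1                         # include the master node, so slaves are ranks 1..(nr-1)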
where, if dnx>0, dnx is the width, exclusive of the overlap columns, of the final (nr-1)th active slave node. In the above, nr denotes the total number of processors that are active, including the master node. Hence, processor 0 is the master node and processors 1 through (nr-1) are slave nodes.
It should be noted that, because we divide the domain into integer widths located on an integer number of processors, sometimes the (nr-1)th node may only span dnx<nxi columns of physical space. Furthermore, the number of nodes needed to span the physical space may be less than the number of processors being used, that is (nr-1)<(size-1). This is unavoidable, and means that for a given number of processors these cannot always all be used to model the (nr-1) regions of space that span the total physical space. If we denote by mx an array specifying the size, exclusive of overlap, of each of the (nr-1) active slave nodes, we have:
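For example, as a sketch:

mx = np.full(nr - 1, nxi, dtype=int)   # interior width of each active slave node
if dnx > 0:
    mx[-1] = dnx                       # the last active node spans only dnx columns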
In Python, arrays need only be allocated, never explicitly deallocated: the Python runtime, through its garbage collector, determines when an array is no longer used and deallocates it automatically when required. This is one of the advantages that makes developing Python codes easier.
Let us now look at how to allocate and set the values of arrays on the master node and on the slave nodes. In the following, we use rank=0 as the master node, on which we initialize the model in the entire model space, and rank>0 to specify each domain of the domain decomposition. Thus, the density ρ (specified as RHO), the velocity u (specified as U) and the definition of the non-deformable parts of the model (specified as SOLID) in the entire 2-D space of size nz × nx can be initialized on the master node with the following:
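A sketch of this initialization on the master node, using the example geometry quoted below; the exact contents of the published listing may differ:

nx, nz = 1001, 251                            # size of the rectangular simulation box
if rank == 0:
    RHO   = np.ones((nz, nx), dtype=float)    # density over the whole model space
    U     = np.zeros((2, nz, nx), dtype=float)
    SOLID = np.zeros((nz, nx), dtype=bool)    # True in the solid (non-deformable) regions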
where in the above, we specified first the size of the rectangular box where we run the simulations as, for example, nx = 1001 and nz = 251.
Once all of the variables that span the physical space are initialized on the master node (i.e. rank=0), namely the density denoted RHO and the velocity denoted U, one must send the nth domain of RHO and U to the nth node (i.e. rank=n). The following code segment achieves the process of sending each domain of space to the appropriate node and sets an array solid[a] which defines whether the adjacent point in direction α is solid or fluid. Subsequently, we specify the routine put_rho_u(), which puts the density and velocity onto the nodes, and mention put_solid(), which similarly puts the solid array SOLID onto the nodes as the array solid, where SOLID is True in solid regions of the model, and False in fluid regions.
Note that in the present implementation, the blocks sent using MPI are large (the entire interior of each domain associated with each
processor). This delegates the effort of optimizing the MPI communication to the mpi4py library. An alternative would be to send the
columns of each domain one by one. This choice is machine and problem-dependent.
In the Python code shown in this paper, uppercase RHO and U are used to denote the density and velocity on the master node (rank=0), and lowercase rho and u are used to denote the density and velocity on the slave nodes (rank>0). Note that the routine put_rho_u() also initializes the number densities, denoted f, using the equilibrium distribution shown in eq. (5), exactly as was done in the serial version:
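On the slave side, the body of put_rho_u() is likely similar to the following sketch; the buffer shapes and the use of blocking Recv are assumptions, and c, w, c1...c4 and na are assumed to be defined as in the serial code:

if rank > 0:
    rho = np.zeros((nz, nx_i), dtype=float)
    u   = np.zeros((2, nz, nx_i), dtype=float)
    comm.Recv(rho, source=0)               # receive this node's block of density
    comm.Recv(u,   source=0)               # receive this node's block of velocity
    u2 = u[0]*u[0] + u[1]*u[1]
    for a in range(na):
        cu = c[a][0]*u[0] + c[a][1]*u[1]
        f[a] = rho*w[a]*(c1 + c2*cu + c3*cu*cu + c4*u2)   # equilibrium of eq. (5)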
The routine put_solid() is written similarly, but taking into account that it operates on a Boolean array.
The mpi4py module offers two kinds of communication:
1. A pickle-based communication of generic Python objects, using the standard commands send(), recv() and bcast(). This option is very easy to use, as it does not require any type or object size specification. In order to send buffer-like objects, the receiving array must be sufficiently large. This option is the slowest, as it is not optimized for NumPy arrays.
2. A dedicated communication of buffer-like objects. This is based on the same commands as the pickle-based communication, but with a capitalized first letter, for example Send(), Recv(), Bcast(), Scatter() and Gather(). In this case, the data type must be specified in both the sending and receiving commands, for example MPI.INT or MPI.FLOAT.
Because of its greater efficiency and its specific design for NumPy arrays, we use the buffer-like instructions in the following (i.e. the commands with capitalized first letters).
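To illustrate the difference between the two styles, a small sketch; the array names and shapes are placeholders, and the buffer version assumes float32 data so that MPI.FLOAT matches:

# pickle-based, generic but slow:
if rank == 0:
    comm.send(rho, dest=1)                      # any Python object can be sent
elif rank == 1:
    rho = comm.recv(source=0)

# buffer-based, fast path for NumPy arrays (types and shapes must match):
if rank == 0:
    comm.Send([rho, MPI.FLOAT], dest=1)
elif rank == 1:
    rho = np.empty((nz, nx_i), dtype=np.float32)
    comm.Recv([rho, MPI.FLOAT], source=0)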
When designing a parallel code, one has to choose between point-to-point and collective communication. While for very large problems it is more efficient to use the collective options (e.g. broadcasting, scattering), we prefer to present here an implementation based on point-to-point instructions. This is based on (i) the observation that the communication costs are minimal compared to the computing time, at least up to the roughly 1000 processors for which we made our tests, and (ii) the greater flexibility of a point-to-point implementation, which allows one to send blocks of different sizes for domains that are not a simple multiple of the number of cores.
Another design choice is between blocking and non-blocking instructions. Blocking commands block the program until the data have been sent to the destination node and the buffer is available again. This approach is easier to use, but it implies a greater role played by the local MPI implementation: on some systems, MPI may copy the data and free the buffer immediately, while on others the data must first reach the destination. We tested both blocking and non-blocking communication and found comparable performance on the Beowulf system on which we performed our tests. On some systems non-blocking communication might be required; in this case it is necessary to use Isend() and Irecv() to send and receive data, and the Wait() call to ensure that the communication has finished. In Section 6, we show that the time required for communication is several orders of magnitude less than the total computing time, which implies that blocking instructions are sufficient. For larger systems, however, a more sophisticated non-blocking implementation might be required. For reasons of pedagogy and simplicity, we also believe that the present implementation is more suitable for readers who are new to parallel programming.
Each slave node contains a segment of physical space plus an edge from the adjacent slave nodes to the left and right, to allow the nearest-neighbor communications needed for the LBM streaming step. The standard Python command to send an edge to an adjacent slave node is of the form comm.send(edgeR, dest=dest_node) for the right edge, and comm.send(edgeL, dest=dest_node) for the left edge. For each message being sent, there must be a command to receive it on the destination slave node, of the form edge = comm.recv(source=source_node).
The above are the standard Python commands, valid for any data type. However, we show here the faster, vectorized commands designed for NumPy arrays. The instructions differ because they start with a capital letter (Send instead of send, and Recv instead of recv), and because they exploit the declaration of the variable type (MPI.FLOAT in this case). Here follows the Python code needed to send the two edges of each slave node to the right and left, and to receive these two edges from the adjacent nodes.
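A sketch of this exchange, assuming circular (periodic) neighbours among the active slave nodes (ranks 1 to nr-1) and float32 arrays so that MPI.FLOAT matches the buffer type; the exact slicing in the published listing may differ:

right = rank + 1 if rank < nr - 1 else 1        # circular boundary conditions
left  = rank - 1 if rank > 1 else nr - 1

edgeR = f[:, :, -2].copy()                      # rightmost interior column, made contiguous
edgeL = f[:, :,  1].copy()                      # leftmost interior column, made contiguous
comm.Send([edgeR, MPI.FLOAT], dest=right)
comm.Send([edgeL, MPI.FLOAT], dest=left)

bufL = np.empty((na, nz), dtype=np.float32)
bufR = np.empty((na, nz), dtype=np.float32)
comm.Recv([bufL, MPI.FLOAT], source=left)
comm.Recv([bufR, MPI.FLOAT], source=right)
f[:, :, 0]  = bufL                              # fill the left overlap column
f[:, :, -1] = bufR                              # fill the right overlap column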
Notice the instruction copy() after the slice of the array f: this reallocates that slice into a contiguous array, which is necessary when using the fast communication features. Note also how the right and left edges use two different allocations, as on every slave node they have to be stored separately. This also helps accelerate the synchronization (the left and right edges are first sent together, and then received together) and therefore improves the code performance (Fig. 3).
Figure 3. An enlargement of a domain decomposition showing two physical domains modelled in two processors and the circular boundary conditions.
In the examples shown in this paper, we specify a simple model of a porous solid rock matrix (Torquato 2013), and simulate (i) the flow from
the left to the right of the model due to a density (and hence pressure) difference at the left and right boundaries and (ii) the propagation of a
wave front through the matrix, where the grains are assumed rigid, and are therefore reflective for the incoming wave.
The following code shows the Python implementation that sets simplified left and right boundary conditions by using the left and right density to calculate the zero-velocity equilibrium distributions. It should be noted that the setting of accurate pressure or velocity boundary conditions is much more involved (Zhou & He 1997) and is not shown here, where the main purpose is to show the Python implementation of the core LBM. As the left boundary is located on node 1 (i.e. rank=1) and the right boundary is located on node (nr-1) (i.e. rank=(nr-1)), we require no MPI communication step to set the densities at these boundaries. For approximate general boundary conditions at any location in space, a code similar to the put_rho_u() routine using the equilibrium distribution to set the number densities can be written (see Section 5.3), although again, it should be noted that accurate pressure or velocity boundary conditions require a more detailed treatment (Zhou & He 1997):
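Based on the description that follows, the routine is likely similar to the following sketch; the argument names and the hard-coded boundary placement are assumptions, and w and nz are assumed to be defined globally as before:

def put_rho_boundaries(f, rank, nr, rho_left, rho_right):
    # impose the zero-velocity equilibrium f_a = w_a * rho on the boundary columns
    if rank == 1:                 # the left boundary lives on the first slave node
        f[:, :, 0]  = np.outer(w, rho_left*np.ones(nz))
    if rank == nr - 1:            # the right boundary lives on the last active slave node
        f[:, :, -1] = np.outer(w, rho_right*np.ones(nz))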
Although the gain is minimal, by calling this routine only once per loop of the simulation we vectorized the assignment and compacted it into one line of code for each case. Notice how the floats rho_left and rho_right are mapped to an array through np.ones(nz) in order to match the dimensionality of the boundary. The np.outer instruction represents an outer product, which allows two 1-D arrays to be combined into one 2-D array.
Figure 4. The speedup as a function of processor with up to 24 processors (o = 1001 × 1001 case, × = 101 × 101 case).
The operation is vectorized using NumPy arrays (McKinney 2012) and uses the associated vectorized MPI instructions from mpi4py (Dalcin et al. 2011).
6 PARALLEL PERFORMANCE
The parallel code was benchmarked on two different machines. The first is a home system running up to 24 2.7-GHz Xeon processors with 63 GB of memory. On this machine, the Python version was V2.6.6, installed on Linux CentOS V6.6. Fig. 4 shows the speedup as a function of the number of processors for 2-D models of sizes 101 × 101 and 1001 × 1001.
A second test was performed on the Queen Bee 2 Beowulf cluster, a 1.5-Petaflop peak performance cluster containing 504 compute nodes with over 10 000 Intel Xeon processing cores, 56 Gb/sec (FDR) InfiniBand 2:1 and a 1 Gb/s Ethernet management network. We tested the code on 1, 2, 4, 8, 16, 32, 48 and 64 nodes, corresponding to 20, 40, 80, 160, 320, 640, 960 and 1280 processors, respectively. Given the size of the computing resources, we tested the model with an increasing number of computing points. We ran 5000 time steps on a grid of 20 000 points on the z-axis and (NCPUS − 1) × 25 points on the x-axis, where NCPUS is the number of processors. With this choice of parameters, the problem size increases linearly with the number of cpus, ranging from 20 000 × 475 points on 20 processors up to 20 000 × 31 975 (over 600 million computing points) on 1280 processors. The use of NCPUS − 1 as a parameter reflects the fact that the domain is divided among all the processors except the first (the master core).
Fig. 5 shows the speedup as a function of the number of processors on the Beowulf cluster for both the weak scaling illustrated above and the strong scaling of a problem of 20 million computing points (10 000 × 2000) (Numrich 2018, chapter 6). The plots for the speedup are shown on a log–log scale (left) and allow the scaling to be estimated over several orders of magnitude in the number of processors. On the right, we show the normalized performance on a linear–linear scale (i.e. the speed divided by the number of processors, normalized by the smallest case, 20 cpus). Weak scaling results are shown as dots and are based on the numbers shown in Table 1. The scaling is perfectly linear up to 960 processors, and shows a slight decrease in performance for the case of 1280 cpus. The results for the strong scaling problem are more complex. We observe a threshold number of cpus, which varies with the problem size, above which the solver scales linearly with the number of processors. For 20 million computing points, the threshold is at about 200 cpus. This behavior is likely related to the size of the cache involved in handling the large arrays in the MPI implementation.
To understand the role of each component of the code in the performance, we compiled a detailed table with the broken-down computing time for the two communications necessary at every step (get_edges() and put_rho_boundaries()) and for the four computing steps: (1) streaming, (2) calculation of the macroscopic properties ρ and u from the number densities $f_\alpha$, (3) calculation of the equilibrium number densities $f_\alpha^{eq}$ and (4) calculation and addition of the collision term $f_\alpha^c$. The four computing steps are respectively denoted Stream, Macro, Equilibrium and Collision. The exact timing of all of these steps for all the parallel tests is shown in Table 1.
Figure 5. Weak and strong scaling speedup versus number of processors, up to 1280 processors. Top left-hand panel: log–log plot of the speedup for the weak scaling case. Top right-hand panel: linear–linear plot of the normalized performance, always assuming 1 for one node (20 cpus). One observes a slight decrease only for 1280 processors. The bottom panels show the same for strong scaling, i.e. the same problem size regardless of the number of processors. In this case, for 20 million computing points (10 000 × 2000), the code starts scaling linearly from 160 cpus up to 1280. For a smaller number of processors, there is not enough cache to properly handle the large arrays.
Table 1. Timing table for the weak scaling test of the parallel code, broken down into its detailed components. Times are given as a percentage of the total time, for the weak scaling problem described in the text (size growing proportionally to the number of processors). The performance remains stable from 20 up to 1280 processors, with only a slight decrease at 1280 processors, due to an increase in the time taken by the computational steps of the algorithm, mainly Collision.
Get_edges Put_rho_bound Stream Macro Equilibrium Collision
# Cpus (per cent) (per cent) (per cent) (per cent) (per cent) (per cent)
20 0.002 0.002 30.936 15.655 34.244 19.162
40 0.001 0.001 30.776 15.625 34.316 19.280
80 0.001 0.001 30.752 15.611 34.202 19.432
160 0.001 0.001 30.890 15.547 34.215 19.346
320 0.001 0.001 30.636 15.562 34.328 19.471
640 0.002 0.002 31.099 15.477 34.528 18.892
960 0.001 0.002 30.894 15.570 34.194 19.339
1280 0.001 0.001 30.785 14.811 32.689 21.713
The most important observation is that the time required for all the communications is several orders of magnitude smaller than the computing time, meaning that the algorithm we present is not prone to deadlocks or slowdowns for any number of processors. This can be appreciated clearly in the left-hand part of Fig. 6, where the times of the two communication steps are combined into one. This log–linear plot illustrates the orders-of-magnitude difference between computing and communication times, and demonstrates that communication costs are negligible. A more detailed look at the computing times, on the right of Fig. 6, shows that the slight decrease in performance at 1280 processors is due mainly to an increase in the computing time of the Collision step, and in part to the other three computing steps. Overall, the communication is efficient enough for practical use, including the large problems considered here.
Figure 6. Performance of the parallel code, broken down into its parts: the four computational steps (Stream, Macro, Equilibrium and Collision) and the combined cost of the two communication steps (Communication).
7 APPLICATIONS
The code has been benchmarked with Poiseuille flow, a common test in the geodynamic literature (Gerya 2009). Fig. 7 shows that the numerical solution matches the analytical solution over the entire domain. Towards the edges, the kink is due to the bounce-back boundary conditions, which place the walls halfway between lattice sites (not exactly at lattice sites).
Simulations with this code have been performed for acoustic wave propagation and for fluid flow, both in a simplified solid rock matrix
made up of circular grains and grain clusters. The matrix used for the fluid flow simulation is shown in Fig. 8 (left), and the matrix used for
the wave propagation simulation is shown on the same figure, to the right.
For the fluid-flow simulations, we set the density of the fluid in the pore space to unity, with the left and right boundaries having non-dimensional densities of 1.01 and 0.99, respectively. These values have been chosen so that the density difference is very small compared to the absolute magnitude. The resulting fluid speed after 5000 time steps is shown in Fig. 9.
It is well known that acoustic waves can propagate inside the fluids of a porous medium (Lighthill 1978). The second simulation example is a simplified case of acoustic wave propagation inside the fluids of a porous medium. The results of the acoustic propagation simulations are shown in Fig. 10, where a wave propagates from an initial point at (nz/2, 3nx/4). The wave front requires about 330 steps, on a resolution of 401 × 401, to cross the entire domain and interact with itself through the periodic boundary conditions. Snapshots of the fluid density in the simulation are shown in this figure up to time step 400.
8 CONCLUSIONS
The LBM is a flexible computational tool that allows one, among other things, to calculate wave propagation and fluid flow in complex, strongly heterogeneous media. We show that a code developed entirely in Python displays excellent performance, inferior only to the best optimized compiled C, if it is carefully written in a vectorized form (Morra 2018) and makes use of just-in-time compilation for selected functions. Compared to C and Fortran, however, Python is easier to write (Guttag 2013), to understand and to debug.
Figure 8. Two simplified rock matrices showing the solid region in red, and fluid region in blue. The matrix on the left is used for the porous flow simulation
example, while the one on the right is used for the wave propagation example.
Figure 9. The fluid speed after 5000 time steps in the simulation showing the fluid pathways through the rock matrix.
We also developed an MPI parallel implementation and verified that it achieves approximately linear speedup up to 24 processors on a single node of a home computer and, more relevantly for scientific applications, that it scales linearly up to at least 1280 processors on a Beowulf cluster. We therefore recommend Python, NumPy, JIT compilation and mpi4py for developing scientific high performance computing software.
Scientific problems in geosciences that can be efficiently tackled with the LBM include fluid flow and wave propagation in porous
media, thermochemical mantle convection, magma dynamics and volcanic eruptions. We will start off with classical 3-D mantle convection
problems (e.g. Rabinowicz et al. 1990). Assuming that the computational time continues to scale efficiently up to 10 thousand processors on a cluster comparable to the one we used for benchmarking (10 000 Xeon processing cores, connected with 56 Gb/sec (FDR) InfiniBand 2:1), it will be possible to model a lattice of $10^9$ points ($1000^3$) in about 3 minutes per 10 000 time steps. If memory is sufficiently large, a lattice of $10^{12}$ points ($10\,000^3$) could be calculated in about 48 hr per 10 000 time steps. Tests on larger computing facilities are planned and will be the topic of a follow-up publication.
Figure 10. Snapshots showing propagation of a wave front in a fluid with solid inclusions in the model. The algorithm captures the complexities of strong
scattering from the inclusions and superposition of waves.
ACKNOWLEDGEMENTS
D.A. Yuen would like to thank the National Science Foundation's geochemistry and CISE programs for support. G. Morra would like to thank the Board of Regents of Louisiana for support through the RCS project LEQSF(2014-17)-RD-A-14 and the Louisiana Optical Network Infrastructure (LONI), which provided the cluster for the test runs, through the project loni_lbm01. D.A. Yuen and G. Morra would like to thank Matthew G. Knepley for stimulating discussions on parallel scaling. The authors would like to thank the reviewers, C. Huber and C. Thieulot, for helpful suggestions that improved the paper.
Torquato, S., 2013. Random Heterogeneous Materials: Microstructure and Macroscopic Properties, Vol. 16, Springer Science & Business Media.
Van Rossum, G. et al., 2007. Python programming language, in USENIX Annual Technical Conference, Vol. 41, p. 36.
Wang, J., Wang, D., Lallemand, P. & Luo, L.-S., 2013. Lattice Boltzmann simulations of thermal convective flows in two dimensions, Comput. Math. Appl., 65(2), 262–286.
Xia, M., Wang, S., Shou, H., Shan, X., Chen, H., Li, Q. & Zhang, Q., 2017. Modelling viscoacoustic wave propagation with the lattice Boltzmann method, Scient. Rep., 7, 10169.
Xie, C., Raeini, A.Q., Wang, Y., Blunt, M.J. & Wang, M., 2017. An improved pore-network model including viscous coupling effects using direct simulation by the lattice Boltzmann method, Adv. Water Resour., 100, 26–34.
Zheng, J., Ju, Y. & Wang, M., 2018. Pore-scale modeling of spontaneous imbibition behavior in a complex shale porous structure by pseudopotential lattice Boltzmann method, J. geophys. Res.: Solid Earth, 123, 9586–9600.
Zhou, Q. & He, X., 1997. On pressure and velocity boundary conditions for the lattice Boltzmann BGK model, Phys. Fluids, 9(6), 1591–1598.