
CHAPTER 7

Working with cloud
If we need to perform a large amount of quantum chemistry simulations, where should these computations take place? Some people's first thought might be to use a supercomputer or a high-performance workstation. Aside from these options, cloud computing is an attractive alternative worth considering.
Although quantum chemistry simulations were traditionally executed on supercomputers, it is worth considering cloud resources as a viable alternative for performing large-scale chemistry simulations. Unlike supercomputers, which are often constrained by quotas or the availability of idle resources, cloud computing generally offers a plentiful supply of resources. This abundance makes cloud computing particularly attractive for high-throughput quantum chemistry simulations. Additionally, the cloud offers a wide range of computational tools and hardware resources that may not be available on supercomputers. Some might point to the uniqueness of supercomputers and argue that cloud computing is unlikely to replace them. While this is true, the presence and significant impact of cloud computing cannot be ignored. What we can do is harness it as an additional computational resource.
To effectively utilize cloud computing, we need to develop new concepts that differ from those required for supercomputers. In particular, several technical aspects should be considered for cloud computing:
• Where are the computing, storage, and network resources located? How can we access these resources?
• How to configure a service and manage its configuration?
• How to design a workflow and orchestrate the computing resources within the workflow?
• How to schedule parallel jobs and scale them up on the cloud?
Using cloud computing often requires the development of specialized programs. In these programs, large efforts are devoted to configuring cloud services and ensuring their availability. These techniques and skills are more closely related to the work of DevOps. Although DevOps skills do not directly impact algorithm design or the implementation of scientific programs, they can significantly enhance our productivity. In this chapter, we will explore the programming technologies associated with cloud computing. We will not present a step-by-step tutorial for each cloud service. Instead, we will introduce the general idea and then leverage the AI assistant to complete the remaining work.


7.1 Utilizing cloud computing


Cloud computing offers various technologies and features to harness computational resources. Exploring the details of cloud technology is beyond the scope of this book. In this section, we will discuss the basic principles of cloud computing. We will primarily use Python to deploy computation tasks using cloud computing APIs.
Please be aware that cloud computing charges based on the resources you consume. This includes not only CPU hours but also storage, data transfer, databases, and public IPv4 addresses, all of which are billable resources. In particular, data transfer services can lead to unexpected costs if not managed carefully. It is important to ensure that resources are properly terminated after use. Despite years of development in cloud computing technology, no cloud provider currently offers a solution that is "smart" enough to effectively balance cost and performance. An essential skill in cloud computing is orchestrating cloud resources effectively to minimize expenses. For any given problem, there is likely more than one strategy and combination of resources that can produce the same outcome. If you expect to make substantial use of cloud resources, it is worthwhile to thoroughly review the cloud documentation.
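For example, a small audit script can help confirm that no billable resources were left running after a batch of jobs. The sketch below is only an illustration under stated assumptions (the region and the cluster name python-qc are placeholders); it merely lists running EC2 instances and ECS tasks with the boto3 library:

import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')
ecs = boto3.client('ecs', region_name='us-east-1')

# List EC2 instances that are still running (and still billable)
resp = ec2.describe_instances(
    Filters=[{'Name': 'instance-state-name', 'Values': ['running']}])
for reservation in resp['Reservations']:
    for inst in reservation['Instances']:
        print('EC2 instance still running:', inst['InstanceId'])

# List running ECS tasks in a hypothetical cluster named "python-qc"
for arn in ecs.list_tasks(cluster='python-qc')['taskArns']:
    print('ECS task still running:', arn)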

7.1.1 Comparison between cloud and supercomputers


Running simulation tasks on the cloud is very different from executing them on traditional supercomputers with the PBS or SLURM queue systems. Supercomputers offer an interactive environment that is close to the experience of using a single Linux desktop. In this sense, cloud computing is not as intuitive. The different user experience can largely be attributed to the architectural differences between the cloud and traditional supercomputers:
• Data Accessibility. Traditional supercomputers typically offer a large-scale and high-performance global file system accessible to all computing nodes. Sharing data among different nodes is straightforward. Manipulating files in the global file system is similar to accessing data on a local disk. In contrast, the cloud typically does not use a global file system architecture. Instead, files are stored in cloud object storage, such as AWS S3 or Azure Blob. Accessing files stored in these locations requires the use of specific tools or APIs (see the sketch after this list).
• Software Stack. Traditional supercomputers typically offer pre-installed software within a homogeneous environment, operating under the same OS. In the cloud, software is installed in virtual machines (VMs) or Docker images. The runtime environments can be more flexible, isolated, and divergent across software packages. For a specific task, users only utilize VMs or images that contain the necessary software.
• Job Scheduling. Supercomputers utilize job scheduling systems like PBS or SLURM to manage and prioritize jobs. These systems launch jobs on reusable computing nodes, which have an environment almost identical to the one used in interactive sessions. Cloud computing platforms have their own job scheduling mechanisms. Individual jobs are configured separately and executed within isolated VMs in the cloud.
• Connectivity in Parallel Execution. Supercomputers can provide a low-latency and high-performance network that enables efficient communication between processors. Exchanging data between different processes via MPI (Message Passing Interface) is relatively simple. This level of connectivity is not as straightforward in the cloud.
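To make the contrast in data accessibility concrete, here is a minimal sketch (the bucket and key names are placeholders of our own choosing) of how a file in S3 must be fetched with boto3 before it can be read, whereas on a supercomputer the same data could be opened directly from the global file system:

import boto3

s3 = boto3.client('s3')

# On a supercomputer: data = open('/global/scratch/h2o.xyz').read()
# On the cloud, the file must first be fetched from object storage:
s3.download_file('python-qc-3vtl', 'inputs/h2o.xyz', '/tmp/h2o.xyz')
data = open('/tmp/h2o.xyz').read()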

7.1.2 Designing a workflow


Normally, we need to design a workflow to utilize the cloud computational resources and manage the computation tasks. A workflow is essentially a sequence of predefined computational tasks executed in a particular order. We may need to develop a series of Python programs to connect the various components of the workflow. When designing the workflow, the characteristics of the cloud architecture should be taken into consideration, in particular the following questions:
• How to access the input data? For example, a large volume of input data can be stored in the cloud object storage and accessed via cloud storage APIs. Input parameters can be stored in GitHub or GitLab and accessed via HTTP requests or Git operations. Alternatively, essential input data can be cached within the Docker image. Additionally, a database can serve as an option for providing input data.
• Where to save the output? Cloud object storage and databases are two common choices for saving results. It is also possible to send results through a remote procedure call (RPC) service to a receiver.
• How to connect different components of the workflow, and how to exchange messages between them? Cloud object storage and databases can be utilized to exchange data. RPC is often employed to connect different services.
• What resources are required for the computation? Based on the performance and the cost of different devices, we may need to determine the appropriate combination of resources, including the number of CPU cores, GPU cards, memory size, temporary disk space, and network accessibility.
• Which Docker images or virtual machine images to use? Which software or tools should be installed in the image, and how should they be configured?
For example, suppose that we need to design a workflow to generate the potential energy surface (PES) for a molecule using density functional theory (DFT). This workflow involves two types of executors: the molecule geometry generator and the executor that performs DFT calculations. As illustrated in Fig. 7.1, the two executors coordinate in the following manner:
1. The geometry generator reads the initial geometry of the molecule and generates a coarse-grained grid for the PES.
2. The geometry generator uploads all geometry files to the object storage, such as AWS S3, assuming the cloud provider is AWS.

FIGURE 7.1
Workflow for PES using DFT.

3. The geometry generator then invokes cloud APIs to launch several workers to execute DFT calculations. Within each job, metadata, such as the S3 paths of the geometry files, are provided as input to the DFT executor.
4. Each DFT executor downloads a geometry file and then executes the predefined DFT calculation.
5. The DFT executor uploads the results to the S3 storage and then sends a status signal to the geometry generator via RPC.
6. The geometry generator creates additional grids on the PES based on the existing results and returns to step 2 to continue the workflow. Alternatively, it can choose to terminate the workflow if sufficient data points have been generated.
This workflow exhibits the following characteristics regarding the utilization of resources:
• The workflow primarily consists of CPU-intensive tasks.
• The demands for storage space and I/O operations are small.
• If the selected DFT program supports the use of GPU devices, GPU workers can be allocated.
Regarding the software stack, to support the tasks defined in this workflow, we can create a Docker image containing all the necessary components: a geometry generator, a DFT program, cloud management toolkits (such as awscli), and certain Python libraries for RPC services.
To design a complete and robust workflow, there are additional matters to consider, such as fault tolerance for DFT calculations, the authentication scheme, the workflow restart scheme, etc. Developing a workflow for cloud computing is analogous to the process of developing a program. It may require continuous testing and debugging before deploying it on the cloud. Once the workflow is fully operational, it is convenient to repeatedly execute similar jobs in the cloud.

7.1.3 Computing resources


Virtual machines (VMs), such as AWS EC2 and Azure VM, are the fundamental computational resources offered by cloud platforms. The simplest method to perform a computation task is to interactively execute the calculation in the cloud virtual machines. To make the runtime environment reproducible across each VM, we can customize the VM images [1] by pre-installing commonly used software stacks. Additionally, we can leverage block storage to persist the configuration files and data [2]. By combining custom VM images and persistent storage, we can easily spin up VMs for computation with identical environments.
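As a sketch of this approach, the snippet below launches an instance from a customized image with boto3. The AMI ID, key pair, and security group are placeholders that would be replaced with your own values:

import boto3

ec2 = boto3.client('ec2')

# Launch a VM from a customized image (placeholder AMI ID) so that the
# pre-installed software stack is available immediately after boot.
resp = ec2.run_instances(
    ImageId='ami-0123456789abcdef0',
    InstanceType='c5.2xlarge',
    MinCount=1, MaxCount=1,
    KeyName='my-key-pair',                     # hypothetical key pair
    SecurityGroupIds=['sg-0123456789abcdef0'], # hypothetical security group
)
print(resp['Instances'][0]['InstanceId'])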
To efficiently execute a batch of computation tasks, we need a non-interactive scheme to manage the computing resources and environments. There are several options to consider. One option is to use the Batch services provided by cloud platforms, such as AWS Batch, Azure Batch, or Google Batch. Another option is to use a containerized solution like AWS ECS, Azure AKS, or Google GKE. Although the UIs (user interfaces) of these cloud platforms differ, the design and concepts of the containerized approach are similar. In the following, we will take AWS ECS (Elastic Container Service) [3] as the example to demonstrate how to deploy containerized computational tasks.
If you have just signed up for a new account on AWS, you may need several preparation steps to configure the basic cloud computation environment, including:
• AWS VPC (Virtual Private Cloud) [4], the foundational infrastructure where you can allocate virtual networks, virtual machines, and private storage.
• Subnets [5], which provide a range of IP addresses within a VPC.
• S3 (Simple Storage Service) [6], an object storage service that offers persistent data storage.
• ECR (Elastic Container Registry) [7], which provides a repository for hosting private Docker images.
• Security Groups [8], acting as a firewall to control inbound and outbound network traffic for virtual machines.
• AWS credentials [9], providing the necessary keys and authentication to log into virtual machines.
• IAM (Identity and Access Management) roles [10], which manage the permissions for accessing various services in the cloud.
In the following code example for the DFT PES computation, we will utilize ECS, S3 storage, ECR, and other relevant services. It is straightforward to configure the S3 services using the AWS web console, as shown in the screenshots in Fig. 7.2. We have created an S3 bucket named python-qc-3vtl for later use. Please note that AWS S3 bucket names are unique across all AWS accounts globally. For instance, the bucket name python-qc was already in use by other users. Consequently, we could not use python-qc as the bucket name for our project. To ensure uniqueness, we added the suffix -3vtl. You would need to choose a different name in your own implementation.
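The bucket can also be created programmatically. A minimal sketch with boto3, using our bucket name (substitute your own globally unique name):

import boto3

s3 = boto3.client('s3')
# Bucket names are globally unique; this call fails if the name is taken.
# In regions other than us-east-1, a CreateBucketConfiguration with a
# LocationConstraint must also be provided.
s3.create_bucket(Bucket='python-qc-3vtl')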
The AWS IAM permission system is complicated. For simplicity, we have created an
IAM policy with access to a wide range of AWS services as follows:

FIGURE 7.2
Creating an AWS S3 bucket using the AWS web console.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:*",
        "ecs:*",
        "ecr:*",
        "s3:*",
        "lambda:*",
        "sqs:*"
      ],
      "Resource": "*"
    }
  ]
}

Please note that this AWS IAM configuration is insecure. In real applications, it is recommended to carefully configure the IAM rules and limit access to only the necessary services and resources. Please consult the AWS online documentation for more details.
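For reference, the policy above can also be registered and attached with boto3. The following is a hedged sketch; the policy and user names are hypothetical:

import boto3, json

iam = boto3.client('iam')

policy_doc = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["ec2:*", "ecs:*", "ecr:*", "s3:*", "lambda:*", "sqs:*"],
        "Resource": "*",
    }],
}
resp = iam.create_policy(
    PolicyName='python-qc-broad-access',   # hypothetical policy name
    PolicyDocument=json.dumps(policy_doc))
iam.attach_user_policy(
    UserName='python-qc-user',             # hypothetical user name
    PolicyArn=resp['Policy']['Arn'])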
The containerized approach requires a Docker image specifically designed for the computational task. Assuming the goal is to establish the DFT PES workflow mentioned in Section 7.1.2, the following Dockerfile can be created:

FROM nvidia/cuda:12.0.1-runtime-ubuntu22.04

# Install necessary dependencies
RUN apt-get update && apt-get install -y python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*
RUN pip install --no-cache-dir pyscf gpu4pyscf-cuda12x awscli boto3

COPY start.sh /app/start.sh
WORKDIR /app

To accommodate the scenarios that require GPU acceleration, the base image is set to the nvidia/cuda:12.0.1-runtime image, which provides the CUDA 12 runtime environment. For the DFT computation engine, we utilize the quantum chemistry program PySCF [11] and its GPU-accelerated variant gpu4pyscf-cuda12x [12,13]. The awscli toolkit and boto3 library are installed for managing and interacting with AWS services. The start.sh script is responsible for managing computation and data transfer tasks. This script is implemented as follows:
#!/bin/bash
# Set the default time limit to 1 hour
TIMEOUT=${2:-3600}
S3PATH=$1
aws s3 cp $S3PATH/job.py ./
timeout $TIMEOUT python job.py > job.log
aws s3 cp job.log $S3PATH/

The computation task is defined in a Python program, job.py, which should be created by the geometry generator service. To prevent the computation from running indefinitely, we use the GNU timeout tool to enforce the time limit on the calculation. Computational results are uploaded to the S3 storage. We then build this image and upload it to the AWS ECR service:
$ aws ecr create-repository --repository-name python-qc/dft-gpu
$ docker build -t 987654321000.dkr.ecr.us-east-1.amazonaws.com/python-qc/dft-gpu:1.0 .
$ docker push 987654321000.dkr.ecr.us-east-1.amazonaws.com/python-qc/dft-gpu:1.0

The 12-digit number 987654321000 is just a placeholder for the account ID. You
should replace it with your actual account ID.
To deploy the computation workflow on AWS ECS, the process begins with defining an ECS cluster. An ECS cluster encompasses specifications of the VM hardware resources, operating system, network configuration, and various access permissions. Using the command provided by the awscli toolkit, we can establish a basic cluster that operates on managed, serverless VMs, known as FARGATE mode [14], with default settings.

$ aws ecs create-cluster --cluster-name ecs-basic-cluster

The FARGATE-mode cluster is sufficient for exploring the basic functionalities of AWS ECS. However, this cluster does not support GPU acceleration. To utilize GPUs, we need to enable ASG (Auto Scaling Groups) [15] for EC2 GPU instances when creating ECS clusters. It is recommended to configure this type of ECS cluster using the AWS web console (Fig. 7.3). For detailed instructions on ECS cluster configuration, please refer to the AWS ECS documentation [3,16].

FIGURE 7.3
Creating an ECS cluster with ASG for EC2 GPU instances.

Let's assume that an ECS cluster named python-qc with GPU support is properly configured. The next step is to create a task definition on this ECS cluster. A task definition acts like a template for a specific computational task. It specifies the computational resources, such as CPU, memory, and the Docker image. Task definitions can be configured on the AWS web console.
To ease the configuration process for future work, we will instead generate and manage task definitions programmatically. The idea is to use code generation techniques along with the AWS Python SDK boto3 to manage the configuration. To enhance the readability of the configuration, we store it in YAML format. The boto3 library is complex to use, as it involves numerous API details of the AWS cloud. Thanks to the progress of AI tools, it is no longer necessary to struggle with the boto3 API documentation. Configuring cloud computing is a task particularly well suited to AI-assisted programming: a massive amount of open-source examples is available for cloud APIs, AI models have been trained effectively on these examples, and their output is quite accurate.
Here, we utilize GPT-4 to generate a configuration sample, which can then be modified to meet specific requirements. Different AI coding assistants might produce different code samples. However, the differences should be insignificant for the current task. Our prompt for GPT is:
Create a Python script to register an AWS ECS task definition:
- Use GPU instances on ECS
- Use boto3 library
- Follow the template below
```
config = '''
- ECS configuration in YAML.
- Include docker image, the CPU, memory and other necessary requirements in
  configuration.
'''
py_config = convert YAML config to Python code
ecs_client = boto3.client('ecs')
response = ecs_client.appropriate_ecs_function(**py_config)
```

The output from GPT-4 is as follows:

import boto3
import yaml

# Your YAML configuration for the ECS task
config = '''
taskDefinition:
  family: gpu-task-family
  networkMode: "awsvpc"
  containerDefinitions:
  - name: gpu-container
    image: your-docker-image
    cpu: 1024
    memory: 2048
    resourceRequirements:
    - type: GPU
      value: "1"
  requiresCompatibilities:
  - EC2
  memory: "4GB"
  cpu: "1024"
'''

# Convert YAML config to Python dictionary
py_config = yaml.safe_load(config)

# Initialize a boto3 client for ECS
ecs_client = boto3.client('ecs')

# Register the ECS task definition
response = ecs_client.register_task_definition(
    **py_config['taskDefinition'])

While it is possible to refine the prompt to generate improved code samples, we will not pursue this refinement in the current context. The code sample above is clear enough to illustrate the parameters of the register_task_definition function of the boto3 ECS client. The generated YAML configuration sample is also straightforward to understand. Based on this code sample, we can adjust certain settings and derive a script to register the task definition as follows:
import boto3
import yaml

config_tpl = '''
task_definition:
  family: dft-demo
  networkMode: "awsvpc"
  containerDefinitions:
  - name: dft-demo-1
    image: 987654321000.dkr.ecr.us-east-1.amazonaws.com/python-qc/dft-gpu:1.0
    cpu: 4
    memory: 2048
    resourceRequirements:
    - type: GPU
      value: "1"
  requiresCompatibilities:
  - EC2
  memory: "4GB"
  cpu: "4"
'''
py_config = yaml.safe_load(config_tpl)
ecs_client = boto3.client('ecs')
response = ecs_client.register_task_definition(
    **py_config['task_definition'])
print(yaml.dump(response))

If you have previously created a basic ECS cluster with FARGATE mode, you can remove the "resourceRequirements" setting from the task configuration.
By using a similar methodology, we can define runnable tasks based on the above task definition and submit them to ECS. Running a task requires the run_task function. Here, we omit the intermediate code samples generated by GPT-4 and demonstrate only the revised script.
import boto3
import jinja2
import yaml

task_config_tpl = jinja2.Template('''
cluster: python-qc
taskDefinition: dft-demo
count: 1
overrides:
  containerOverrides:
  - name: dft-demo-1
{%- if image %}
    image: {{ image }}
{%- endif %}
    command:
    - /app/start.sh
    - s3://python-qc-3vtl/{{ task }}/{{ job_id }}
    - "{{ timeout or 7200 }}"
    environment:
    - name: JOB_ID
      value: "{{ job_id }}"
    - name: OMP_NUM_THREADS
      value: "{{ threads or 1 }}"
{%- if threads %}
    memory: {{ threads * 2048 }}  # container memory in MiB
{%- else %}
    memory: 2048
{%- endif %}
networkConfiguration:
  awsvpcConfiguration:
    subnets:
{%- for v in subnets %}
    - {{ v }}
{%- endfor %}
''')

ec2_client = boto3.client('ec2')
resp = ec2_client.describe_subnets()
subnets = [v['SubnetId'] for v in resp['Subnets']]

config = task_config_tpl.render(
    task='dft-demo', job_id='219a3d6c', threads=2, subnets=subnets)
config = yaml.safe_load(config)

ecs_client = boto3.client('ecs')
resp = ecs_client.run_task(**config)
print(yaml.dump(resp))

If a task is successfully submitted, ECS will start an EC2 VM with GPU devices. It then executes the docker run command within the VM, based on the parameters defined in the task and the task definition. For example, the task in this case is effectively identical to the following docker run command:
$ docker run --runtime=nvidia-container-runtime \
    -e OMP_NUM_THREADS=2 --cpus 2 --memory 4G \
    987654321000.dkr.ecr.us-east-1.amazonaws.com/python-qc/dft-gpu:1.0 \
    /app/start.sh s3://python-qc-3vtl/dft-demo/219a3d6c 7200
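After submission, the task can also be tracked programmatically. The run_task response above contains the task ARN, which can be polled with describe_tasks until the task stops; a minimal sketch:

import time

task_arn = resp['tasks'][0]['taskArn']   # resp returned by run_task above

while True:
    desc = ecs_client.describe_tasks(cluster='python-qc', tasks=[task_arn])
    status = desc['tasks'][0]['lastStatus']
    print('task status:', status)
    if status == 'STOPPED':
        break
    time.sleep(30)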

The ECS task is not limited to just a single run-until-completion job. We can use ECS workers to run Dask, Ray, or other job schedulers. These schedulers can execute various types of tasks, simplifying the complex process of configuring individual ECS tasks. We will explore this approach in Section 7.2.

7.1.4 Communications among cloud services


In the PES workflow, we may wish to adjust the geometry grids dynamically based on the progress of the PES calculations. To accomplish this, it is necessary to establish interactions between the geometry generator program and the DFT calculation tasks. Instead of generating all grids on the PES at once, the geometry generator program can generate coarse-grained grids on the PES initially and then refine the grids based on the accomplished results. The function geometry_on_pes below outlines the prototype of this functionality.
def geometry_on_pes(molecule, pes_params, existing_results=None):
    '''Generates molecule geometries for the important grids on the PES

    Arguments:
    - molecule: Molecular formula
    - pes_params: Targets to scan, such as bonds, bond angles.
    - existing_results: Results of the accomplished calculations.
    '''
    # This function should produce molecule configurations
    # based on the results of accomplished calculations.
    # Here is a fake implementation to demonstrate the functionality.
    if existing_results and len(existing_results) > 1:
        return []
    h2o_xyz = 'O 0 0 0; H 0.757 0.587 0; H -0.757 0.587 0'
    return [h2o_xyz] * 3

Since the DFT tasks are executed in containers or VMs remotely, direct interaction between the geometry generator and the DFT tasks within the same memory space is not possible. We thus utilize an RPC framework to enable communication between the two programs. The geometry generator can act as a service that generates tasks and launches them in the cloud. Additionally, this geometry service provides an RPC endpoint to accept requests from DFT executors. Each DFT task can send its status or results to the geometry service through this RPC. As shown in Fig. 7.4, the PES workflow requires the collaboration of the following components:
• A geometry service to generate grids on the PES.
• A job launcher to initiate DFT tasks running on ECS.
• An RPC system for the communication between the geometry service and the DFT tasks.

FIGURE 7.4
The design of the PES workflow based on the RPC service.

The configuration of the DFT task executor

To facilitate communication with the RPC server, the server address can be provided as a parameter or an environment variable for the Docker container of the DFT task. Based on the ECS task developed in Section 7.1.3, we make slight adjustments to set up DFT tasks:

rpc_config_tpl = jinja2.Template('''
cluster: {{ cluster }}
taskDefinition: {{ task }}
count: 1
overrides:
  containerOverrides:
  - name: {{ task }}-1
    command:
    - /app/start.sh
    - {{ job_path }}
    - "{{ timeout or 7200 }}"
    environment:
    - name: JOB_ID
      value: "{{ job_id }}"
    - name: RPC_SERVER
      value: "{{ rpc_server }}"
    - name: OMP_NUM_THREADS
      value: "{{ threads }}"
    resourceRequirements:
    - type: GPU
      value: "1"
    cpu: {{ threads * 1024 }}     # CPU units; 1024 per vCPU
    memory: {{ threads * 2048 }}  # container memory in MiB
launchType: EC2
networkConfiguration:
  awsvpcConfiguration:
    subnets:
{%- for v in subnets %}
    - {{ v }}
{%- endfor %}
''')

The start.sh script for the DFT executor performs the following operations:
1. It retrieves the previously created job.py program from S3 storage.
2. It executes the DFT computation by calling job.py.
3. It sends the results to the RPC server using the rpc.py program.
A possible implementation for start.sh is provided below:
#!/bin/bash
# Set the default time limit to 1 hour
TIMEOUT=${2:-3600}
S3PATH=$1

aws s3 cp $S3PATH/job.py ./

if (timeout $TIMEOUT python job.py > job.log); then
    aws s3 cp job.log $S3PATH/
    python /app/rpc.py finished job.log
else
    python /app/rpc.py failed job.log
fi

The rpc.py program functions as an RPC client. It invokes the RPC endpoint
set_result of the geometry service to transmit messages, such as the DFT energy
or the computation status, back to the geometry service. The code snippet below
illustrates the basic functionality of rpc.py:
import os
import sys
from xmlrpc.client import ServerProxy

rpc_server = os.getenv('RPC_SERVER')
job_id = os.getenv('JOB_ID')
status = sys.argv[1]
logfile = sys.argv[2]

def parse_log(logfile):
    '''Reads the log file and finds the energy'''
    log = open(logfile, 'r').read()
    ...
    return energy

with ServerProxy(rpc_server) as proxy:
    if status == 'finished':
        proxy.set_result(job_id, parse_log(logfile))
    else:
        proxy.set_result(job_id, status)

The RPC communication can be implemented in terms of REST APIs, as demonstrated in Chapter 6. However, for simplicity, we use the Python standard library xmlrpc to create an RPC client here. The client requires the address of the RPC server, which is stored in the environment variable RPC_SERVER. This environment variable was configured previously in the ECS task.
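The parse_log function was left as a stub above. A minimal sketch of an implementation, assuming the energy is to be extracted from PySCF's standard log line ("converged SCF energy = ..."):

import re

def parse_log(logfile):
    '''Reads the log file and finds the energy (sketch for PySCF logs)'''
    log = open(logfile, 'r').read()
    # PySCF prints a line such as "converged SCF energy = -76.432834402"
    match = re.search(r'converged SCF energy = (-?\d+\.\d+)', log)
    if match is None:
        raise ValueError('No converged energy found in the log')
    return float(match.group(1))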

The configuration of the RPC server

It is not a trivial task to establish the network communication between the RPC server and the DFT executor containers. One approach is to bind an endpoint to the RPC server that is accessible by other Docker containers. This endpoint can be an IP address, a DNS-resolvable address, or a proxy that forwards messages to the machine hosting the RPC server. In this example, the geometry service and the DFT executors are configured to operate within the AWSVPC network mode [17]. In this network mode, all containers can directly access each other via IP addresses. The IP address of each container instance can be obtained through the metadata URI provided by AWS [18].
import os, requests

def self_Ip():
    '''IP address of the current container or EC2 instance'''
    metadata_uri = os.getenv('ECS_CONTAINER_METADATA_URI')
    resp = requests.get(metadata_uri).json()
    return resp['Networks'][0]['IPv4Addresses'][0]

To prepare DFT tasks on the RPC server, we use the following template to create
the Python script job.py:
rpc_job_tpl = jinja2.Template('''
import pyscf
from gpu4pyscf.dft import RKS
mol = pyscf.M(atom="""{{ geom }}""", basis='def2-tzvp', verbose=4)
mf = RKS(mol, xc='wb97x').density_fit().run()
''')

The job.py script, once generated, is uploaded to S3 storage at the location specified by job_path for remote access.
These task preparation steps can be integrated into the launch_tasks function as shown below:
import boto3, hashlib, jinja2, json, sys, yaml
from typing import List, Dict
from concurrent.futures import Future, TimeoutError

CLUSTER = 'python-qc'
TASK = 'dft-demo'
RPC_PORT = 5005
s3_client = boto3.client('s3')
ecs_client = boto3.client('ecs')

job_pool: Dict[str, Future] = {}

def launch_tasks(geom_grids, results_path, timeout=7200):
    assert results_path.startswith('s3://')
    ip = self_Ip()

    jobs = {}
    for geom in geom_grids:
        job_conf = rpc_job_tpl.render(geom=geom).encode()
        job_id = hashlib.md5(job_conf).hexdigest()

        bucket, key = results_path.replace('s3://', '').split('/', 1)
        job_path = f's3://{bucket}/{key}/{job_id}'
        s3_client.put_object(Bucket=bucket,
                             Key=f'{key}/{job_id}/job.py',
                             Body=job_conf)

        task_config = rpc_config_tpl.render(
            cluster=CLUSTER, task=TASK, job_id=job_id, job_path=job_path,
            rpc_server=f'http://{ip}:{RPC_PORT}', timeout=timeout, threads=2)
        try:
            ecs_client.run_task(**yaml.safe_load(task_config))
        except Exception:
            pass  # Skip geometries whose tasks fail to launch
        else:
            fut = Future()  # (1)
            fut._timeout = timeout
            job_pool[job_id] = fut
            jobs[job_id] = fut
    return jobs

def set_result(job_id, result):
    fut = job_pool[job_id]
    fut.set_result(result)  # (2)

Given a set of molecular geometries, the launch_tasks function launches ECS tasks and returns a collection of Future objects corresponding to the DFT tasks. These DFT tasks and the geometry generator service operate asynchronously. To track the outcome of a DFT task, we introduce the Future object at line (1), which represents the result of the unfinished DFT task. When the Future.result() method is invoked, the calling program blocks until a result is available. The Future.set_result method, at line (2), can be used to set the result, which unblocks the waiting Future object. The set_result method is exposed to the RPC service, allowing it to be executed remotely by the RPC client on the DFT workers. The combination of the Future object and the set_result method is a common technique in asynchronous programming, which we will explore in more detail in Chapter 10.
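A minimal, self-contained illustration of this blocking and unblocking pattern, using only the standard library:

from concurrent.futures import Future
from threading import Timer

fut = Future()
# Simulate a remote worker that delivers a result after two seconds
Timer(2.0, fut.set_result, args=(-76.43,)).start()

# .result() blocks the caller until set_result() is invoked
print(fut.result(timeout=10))   # prints -76.43 after about two seconds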

Deploying the RPC server

To prevent the launch_tasks function from being blocked by the RPC service, we run the RPC server in the background.
from contextlib import contextmanager
from threading import Thread
from xmlrpc.server import SimpleXMLRPCServer

@contextmanager
def rpc_service(funcs):
    '''Creates an RPC service in the background'''
    rpc_server = SimpleXMLRPCServer(("", RPC_PORT))
    for fn in funcs:
        rpc_server.register_function(fn, fn.__name__)
    server_thread = Thread(target=rpc_server.serve_forever)
    server_thread.start()
    try:
        yield
    finally:
        # Terminate the RPC service
        rpc_server.shutdown()
        server_thread.join()

We then develop the driver function pes_app to manage the RPC server and the workflow. This function is defined in the pes_scan.py file. Depending on the results from the completed DFT tasks, the driver can either proceed to generate the next batch of computational tasks or terminate the PES workflow if necessary.
def parse_config(config_file):
    assert config_file.startswith('s3://')
    bucket, key = config_file.replace('s3://', '').split('/', 1)
    obj = s3_client.get_object(Bucket=bucket, Key=key)
    config = json.loads(obj['Body'].read())
    return config

def pes_app(config_file):
    config = parse_config(config_file)
    molecule = config['molecule']
    pes_params = config['params']
    results_path = config_file.rsplit('/', 1)[0]

    with rpc_service([set_result]):
        # Scan geometries until enough data are generated
        results = {}
        geom_grids = geometry_on_pes(molecule, pes_params, results)
        while geom_grids:
            jobs = launch_tasks(geom_grids, results_path)
            for key, fut in jobs.items():
                try:
                    result = fut.result(fut._timeout)
                except TimeoutError:
                    result = 'timeout'
                results[key] = result
            geom_grids = geometry_on_pes(molecule, pes_params, results)
    return results

if __name__ == '__main__':
    pes_app(sys.argv[1])

The RPC server program can be deployed on ECS as a task.

rpc_server_config = '''
family: pes-rpc-server
networkMode: awsvpc
requiresCompatibilities:
- FARGATE
cpu: "1024"
memory: "2GB"
containerDefinitions:
- name: rpc-server
  image: 987654321000.dkr.ecr.us-east-1.amazonaws.com/python-qc/dft-gpu:1.0
  portMappings:
  - containerPort: 5005
    hostPort: 5005
  command:
  - python
  - pes_scan.py
  - config_file
'''
config = yaml.safe_load(rpc_server_config)
ecs_client.register_task_definition(**config)

In this configuration, the portMappings field must be configured to expose port 5005 (specified by RPC_PORT previously) for RPC communication.

7.1.5 Function-as-a-Service
Developing a workflow on the cloud from scratch requires some coding work and expertise with cloud platforms. Once a workflow is successfully configured, we may want to enhance its accessibility, making it more convenient for future use or available to other users. Here, developing a Function-as-a-Service (FaaS) is one viable option that can simplify the complex deployment procedures associated with cloud platforms. This service can be accessed using RESTful APIs over the HTTP protocol. One only needs to send an HTTP request that specifies the molecular structure and DFT parameters to initiate a PES scan task.
The PES scan service requires the coordination of multiple components:
• An endpoint to receive HTTP requests.
• Middleware to forward HTTP requests to the backend server.
• A backend server to process the request and launch the PES scan job on AWS
ECS. It then returns the execution results in an HTTP response.
On the AWS cloud platform, we can leverage various AWS components to implement this service (Fig. 7.5). The AWS API Gateway can provide an HTTP endpoint and act as a proxy to forward HTTP requests [19]. AWS Lambda can serve as the backend to parse the request and launch the PES scan workflow we developed in Section 7.1.4. The results of the PES scan job can be stored in the S3 storage. The path of the S3 storage is returned in the HTTP response. The configuration of the Lambda handler and the API Gateway trigger are illustrated in Figs. 7.6–7.9.

FIGURE 7.5
The architecture of the FaaS for the PES scan service.
Let's start the project from AWS Lambda. AWS Lambda is a serverless computing service that enables us to execute code without the need to provision or manage servers. It automatically allocates computing resources and executes predefined functions based on the incoming request. Here, we create a Lambda function using the configuration shown in Fig. 7.6. We name this Lambda function pes-scan. This name will be automatically applied to the configurations of other services. You may choose any other name for this function. If you choose a different name, be sure to update all instances of pes-scan in the subsequent settings accordingly.
In the AWS Lambda console (Fig. 7.7), we can deploy the following Python function handler [20] in Box 1 of Fig. 7.7.

import json, hashlib, boto3

s3_client = boto3.client('s3')
ec2_client = boto3.client('ec2')
ecs_client = boto3.client('ecs')

def lambda_handler(event, context):
    # The data structure of event can be found in
    # https://docs.aws.amazon.com/lambda/latest/dg/services-apigateway.html
    method = event['httpMethod']
    if method != 'POST':
        return {'statusCode': 400}

    body = json.loads(event['body'])
    job = body['job']
    if job.upper() != 'PES':
        return {'statusCode': 400, 'body': f'Unknown job {job}'}

    config = json.dumps({
        'molecule': body['molecule'],
        'params': body['params']
    }).encode()
    job_id = hashlib.md5(config).hexdigest()

    resp = ec2_client.describe_subnets()
    subnets = [v['SubnetId'] for v in resp['Subnets']]
    bucket, path = 'python-qc-3vtl', f'pes-faas/{job_id}'
    s3_client.put_object(Bucket=bucket, Key=f'{path}/config.json',
                         Body=config)
    s3path = f's3://{bucket}/{path}/config.json'
    resp = ecs_client.run_task(
        cluster='python-qc',
        taskDefinition='pes-rpc-server',
        count=1,
        launchType='FARGATE',
        networkConfiguration={'awsvpcConfiguration': {'subnets': subnets}},
        overrides={
            'containerOverrides': [{
                'name': 'rpc-server',
                'command': ['python', 'pes_scan.py', s3path]
            }]
        }
    )
    msg = json.dumps(resp, default=str)  # default=str handles datetimes
    results = f's3://{bucket}/{path}/results.log'
    return {
        'statusCode': 200,
        'body': json.dumps({
            'results': results,
            'detail': msg,
        })
    }

This handler decodes the HTTP request and launches the PES scan task on ECS. Furthermore, a trigger for the Lambda function can be configured in Box 3 of Fig. 7.7 in the AWS Lambda console. The API Gateway is assigned to the trigger, as shown in Box 1 of Fig. 7.8. For simplicity, we set the security mechanism of the API Gateway to Open, which allows the gateway to accept all requests and forward them to the Lambda function. Please note that this setting is insecure. It is recommended to configure an authentication method to enhance the security of the service [21]. By deploying the API Gateway trigger, an HTTP endpoint is generated. This endpoint can be accessed by clicking on Box 1 in Fig. 7.9, which indicates the following URL:

https://d8iah6mwib.execute-api.us-east-1.amazonaws.com/default/pes-scan

FIGURE 7.6
Creating an AWS Lambda function to trigger the PES scan service.

FIGURE 7.7
Configuring AWS Lambda.

FIGURE 7.8
Adding an AWS API Gateway trigger.

FIGURE 7.9
Checking the API Gateway endpoint and configuring permissions.

The Lambda function handler needs access to several AWS services, including read and write permissions for S3, EC2, and ECS. We can configure the permissions of the Lambda function in the AWS Lambda console, as shown in Fig. 7.9. For the configuration method of the IAM policy, refer to Section 7.1.3. After assigning the necessary permissions to the Lambda function, we are ready to provide a PES scan service through the API Gateway endpoint. Here is an example of invoking the HTTP API using a JSON document, which specifies the geometry of a molecule and the PES parameters:
import requests

molecule = 'O 0 0 0; H 0.757 0.587 0; H -0.757 0.587 0'
resp = requests.post(
    'https://d8iah6mwib.execute-api.us-east-1.amazonaws.com/default/pes-scan',
    json={'job': 'pes', 'molecule': molecule, 'params': 'O,H'}
)
print(resp.json()['detail'])

The response will include an S3 path that indicates where the results are stored.
The above example is the most basic framework of an FaaS, in which we have omitted many technical details. In a comprehensive FaaS, there are more engineering and usability issues to consider, such as authorization, authentication, network security, fault tolerance, monitoring, and data persistence. Nevertheless, this basic framework can serve as a starting point, from which you can gradually introduce more features to enhance the usability of the FaaS.

7.2 Distributed job executors


In high-performance computing (HPC) environments, job schedulers such as PBS and SLURM are used to allocate the computational resources for job execution. Although applicable, they are not as commonly used in cloud environments. In cloud settings, resources such as computing nodes and storage are allocated on demand. It is inconvenient to reconfigure a traditional HPC management system each time new computing resources are allocated. Consequently, simple and lightweight tools for job scheduling have become more popular in cloud environments.
To execute Python programs in parallel, Celery, Dask, and Ray are three popular distributed executors well suited for cloud computing. Each of these tools has its unique characteristics, advantages, and suitable scenarios:

• Celery [22] is a distributed computing tool that relies on message queues to distribute tasks across multiple computing nodes. It is easy to scale up workers to process a high volume of similar tasks. Celery is commonly used for handling background tasks in web applications. It can also effectively support quantum chemistry data generation workloads.
• Dask [23] is a library that focuses on large-scale data processing. It is often employed in data analysis and numerical computation jobs that are distributed across multiple computing nodes. Its distribution system is a pure Python application, which does not require any external components (such as message queues or databases). Dask is a flexible framework for distributing workloads. It is not limited to pre-defined functions; Python functions can be created at runtime and sent to workers for execution.
• Ray [24] is a high-performance distributed computing framework. It shares certain similarities with Dask in distributed task execution. The highlight of the Ray project is its native integration with cloud computing environments. It offers convenient cloud resource management and task scheduling capabilities. In addition to distributing workloads, Ray offers a comprehensive solution for memory sharing and parallel computation in a distributed system. If the goal is to develop a distributed application that does more than just workload offloading, the Ray framework is an excellent option.
Distributed job executors typically consist of servers, clients, workers, and other components. Each component requires a different deployment strategy. Servers and workers are usually deployed on the cloud. Clients can be installed in the local Python environment.
Distributed computing is often closely related to parallel computation techniques. In this section, we will focus only on the capabilities and the deployment of distributed job executors in the context of cloud computing. The design of parallel programs will be explored in Chapter 10.

7.2.1 Celery
The distributed execution in Celery is based on the producer-consumer parallel computing model. In this setup, the Celery client acts as the producer, pushing new tasks into a queue. Celery workers then retrieve and execute tasks from this queue. To deploy Celery, we need to set up a message broker to manage the task queue.
Celery supports a variety of message brokers, such as RabbitMQ, Redis, and Amazon SQS (Simple Queue Service). Each message broker specializes in different functionalities, including security, performance, priority handling, error tolerance, and crash recovery. These considerations are crucial for highly concurrent web applications. However, for tasks such as DFT data generation or other chemistry simulation calculations, the differences among these brokers are minimal, and all are generally sufficient for our use cases.
Let's choose Redis to serve as the broker. It can be deployed locally using the command:

$ docker run -d -p 6379:6379 redis:7.2

We can accordingly install the Celery library with the Redis extension:
$ pip install celery[redis]

The first step in applying Celery is to develop a Celery application. A Celery application is essentially a Python module that can be imported both locally and by Celery workers.
$ cat dft_app.py
import hashlib
import pyscf
from gpu4pyscf.dft import RKS
from celery import Celery

app = Celery('dft-pes', broker='redis://localhost:6379/0',  # (1)
             backend='redis://localhost:6379/1')  # (2)

@app.task
def dft_energy(molecule):
    job_id = hashlib.md5(f'{molecule=}'.encode()).hexdigest()
    mol = pyscf.M(atom=molecule, basis='def2-tzvp', verbose=4)
    mf = RKS(mol, xc='wb97x').density_fit().run()
    return job_id, mf.e_tot

The message broker is specified by the keyword broker when creating the Celery application, as shown in line (1). It only serves as a task queue for scheduling jobs. To receive output from the remote Celery workers, it is necessary to configure a database backend to store results for the Celery application. There are several options available for result backends in Celery [25]. For simplicity, we continue to use Redis as the result backend. Redis uses numbers in the URL to identify databases; different numbers correspond to different databases. Results are configured to be stored in a separate database, as shown in line (2).
In the Celery application code, we should use the decorator @app.task to register any remote functions, as required by the Celery library. In our local environment, we can import this application and invoke the .delay() method of the remote function to submit tasks.
In [1]: from dft_app import dft_energy
        results = [
            dft_energy.delay('O 0 0 0; H 0.757 0.5 0; H -0.757 0.5 0'),
            dft_energy.delay('O 0 0 0; H 0.757 0.6 0; H -0.757 0.6 0')]
        print(results)
        print(results[0].status, results[1].status)

Out[1]:
[<AsyncResult: 41100ef3-2fed-4809-b8e7-a599b161b5a8>, <AsyncResult: 17271cc9-6aba-4bef-a8ce-544a512f8e9a>]
PENDING PENDING

Celery returns an AsyncResult object to represent the result of the remote function.
Initially, the status of the submitted tasks is PENDING. This is because we have not yet
deployed any workers for this application.
In another terminal, we can start a Celery worker by executing the command:
$ celery --app=dft_app worker

The application, as specified by the --app argument, must be a module that Celery workers can import. In this local setup, the worker successfully loads the Celery application because the command is executed in the same directory where the application program file resides. When deploying workers remotely, the application should be properly installed to ensure that it is importable. The package distribution techniques we discussed in Chapter 1 can be employed in this regard.
The Celery worker retrieves tasks from the message broker and sends the results
to the backend database. We can access the running status, error messages, and results
via the AsyncResult objects.
In [2]: print(results[0].get())
print(results[1].status)
Out[2]:
[’2bb88ae9b074ca2c190ac9969b4e7397’, -76.43283440169702]
SUCCESS

This example illustrates the basic usage of Celery, which is sufficient for the current data generation tasks. In order to efficiently process a large volume of tasks, we plan to deploy Celery workers on the AWS cloud. We will use AWS ECS to launch Celery workers and AWS SQS to serve as the broker. We choose SQS over deploying a standalone message queue because SQS can be conveniently accessed from a local machine. This eliminates the complex network configuration that would otherwise be necessary for a Celery client to access the message queue.
Despite its accessibility, the use of AWS SQS faces a different restriction: Celery does not support SQS as a backend for result storage. An alternative solution is to use other AWS database services as the result backend. For simplicity, we will just upload the results to AWS S3 storage. The Celery object and the task function in dft_app.py are modified accordingly:

import tempfile
import pyscf
from gpu4pyscf.dft import RKS
from celery import Celery
import boto3

app = Celery('dft-pes', broker='sqs://')

s3 = boto3.client('s3')

@app.task
def dft_energy(molecule, s3_path):
    output = tempfile.mktemp()
    mol = pyscf.M(atom=molecule, basis='def2-tzvp',
                  verbose=4, output=output)
    mf = RKS(mol, xc='wb97x').density_fit().run()
    bucket, key = s3_path.replace('s3://', '').split('/', 1)
    s3.upload_file(output, bucket, key)

The next step is to configure AWS ECS for the Celery workers. This is similar to the process we discussed in Section 7.1.3. We need to create a Docker image and upload it to AWS ECR. The Dockerfile for the Docker image is specified as follows:
FROM nvidia/cuda:12.0.1-runtime-ubuntu22.04

RUN apt-get update && \
    apt-get install -y python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*
RUN pip install --no-cache-dir pyscf gpu4pyscf-cuda12x boto3 celery[sqs]

COPY dft_app.py /app/
WORKDIR /app
CMD ["celery", "-A", "dft_app", "worker", "--loglevel=info"]

This Docker image is labeled as python-qc/celery-dft:1.0.

$ aws ecr create-repository --repository-name python-qc/celery-dft
$ docker build -t 987654321000.dkr.ecr.us-east-1.amazonaws.com/python-qc/celery-dft:1.0 .
$ docker push 987654321000.dkr.ecr.us-east-1.amazonaws.com/python-qc/celery-dft:1.0

We then register an ECS task definition using the new Docker image.
config = yaml.safe_load('''
family: celery-gpu-worker
networkMode: awsvpc
containerDefinitions:
- name: celery-worker
  image: 987654321000.dkr.ecr.us-east-1.amazonaws.com/python-qc/celery-dft:1.0
  environment:
  - name: OMP_NUM_THREADS
    value: "4"
runtimePlatform:
  cpuArchitecture: X86_64
  operatingSystemFamily: LINUX
cpu: 4 vcpu
memory: 8GB
''')
ecs_client = boto3.client('ecs')
resp = ecs_client.register_task_definition(**config)
print(yaml.dump(resp))

We then create ECS tasks to launch Celery workers.


ec2_client = boto3.client('ec2')
resp = ec2_client.describe_subnets()
subnets = [v['SubnetId'] for v in resp['Subnets']]

config = yaml.safe_load('''
cluster: python-qc
taskDefinition: celery-gpu-worker
count: 2
overrides:
  containerOverrides:
  - name: celery-worker
    environment:
    - name: AWS_ACCESS_KEY_ID
      value: xxxxxxxxxxxxxxxxxxxx
    - name: AWS_SECRET_ACCESS_KEY
      value: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    resourceRequirements:
    - type: GPU
      value: "1"
networkConfiguration:
  awsvpcConfiguration:
    subnets: {}
'''.format(subnets))

ecs_client = boto3.client('ecs')
resp = ecs_client.run_task(**config)
print(yaml.dump(resp))

If more computational resources are required, the count field in the configuration can be modified to adjust the number of ECS workers. In the ECS task configuration, the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are configured to provide AWS access credentials [26], which are necessary for the Celery workers to access SQS and S3. It is crucial to avoid hardcoding the AWS access credentials directly into the source code or Dockerfile of the application, as they might be publicly accessible.
Now, we are ready to load the modified dft_app.py and run the DFT computation tasks in the AWS cloud.

In [3]: from dft_app import dft_energy
        results = [
            dft_energy.delay('O 0 0 0; H 0.757 0.5 0; H -0.757 0.5 0'),
            dft_energy.delay('O 0 0 0; H 0.757 0.6 0; H -0.757 0.6 0')]
        print(results)

Out[3]:
[<AsyncResult: 32f68705-04f8-483c-83eb-45bf2622ced3>, <AsyncResult: 5833cf35-7fd2-4d7d-9ef2-f9e1acd4579a>]

Please note that the Celery workers do not automatically shut down after completing all computational tasks. To terminate Celery workers, we can stop the corresponding ECS tasks using the aws ecs stop-task command via the CLI or by deleting the allocated resources through the web console. Don't forget this cleanup step, or you will be continually charged for the ECS computing resources. It is possible to implement functions that monitor the status of the task queue and automatically terminate Celery workers upon completion. However, implementing this feature requires the coordination of several AWS cloud services, which is beyond the scope of Python programming. We will not cover this topic in this book.
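The manual cleanup itself is straightforward, however. A hedged sketch that stops every running task in our cluster with boto3:

import boto3

ecs = boto3.client('ecs')

# Stop all running tasks in the cluster created earlier
for arn in ecs.list_tasks(cluster='python-qc')['taskArns']:
    ecs.stop_task(cluster='python-qc', task=arn,
                  reason='batch finished, releasing billable resources')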
As shown by this example, using Celery is generally not complicated. If we have a
large number of similar computational tasks to execute, Celery is an excellent choice
for batch processing.
Let's briefly summarize the strengths and disadvantages of using Celery in distributed computing. Celery has the following advantages:
• Efficient for batch-executing a large number of pre-defined jobs, such as the data generation tasks.
• Quick response. As long as the Celery workers are active, they can repeatedly execute tasks. New tasks incur only a small overhead when retrieving input variables from the broker.
• Thanks to the message-queue architecture, Celery provides resilience against worker failures. If any workers crash, the unfinished tasks can be picked up and continued by other workers.
The disadvantages of Celery include:
• Celery uses JSON serialization to encode the task parameters and the results. Consequently, a general Python object may not be used as an input argument of a task. If we need to distribute general Python functions or objects, other distributed executors, such as dask.distributed, can be considered.
• There is no native support for cloud deployment. Deploying Celery requires certain expertise and knowledge of the cloud architecture.
• Scaling workers up or down is not automatic.
• There is no built-in monitor in Celery for conveniently tracking and managing the status of tasks. The package provides only a CLI tool for monitoring tasks. To manage Celery tasks through a GUI, third-party libraries such as Flower [27] are required.

7.2.2 Dask
Dask offers the dask.distributed module for distributed computing over a Dask cluster. A Dask cluster consists of a scheduler and several workers, as shown in Fig. 7.10. To install the required components for Dask distributed executors, we can execute

$ pip install dask distributed bokeh

Here, the package distributed provides the necessary tools for managing a Dask cluster, and bokeh offers a visualization dashboard for monitoring the status of the Dask cluster.

FIGURE 7.10
The architecture of the Dask cluster.

Let's first explore the functionality of dask.distributed in a local environment. We can start a Dask cluster by running two commands:

$ dask scheduler
$ dask worker localhost:8786

The command dask scheduler initializes the scheduler for the Dask cluster. The command dask worker launches workers.
Once the Dask cluster is initialized, the scheduler functions as the server. We can then create a Client to connect to the scheduler. The .submit() method of the Dask client can be used to offload computation to Dask workers. This operation returns a Future object, which represents the pending result of the distributed task. This Future object offers functionality similar to the Python built-in concurrent.futures.Future class. Additionally, the Dask client provides a shortcut method, .gather(), to quickly retrieve computational results.
import pyscf
from dask.distributed import Client

# Connect to the scheduler started above.
client = Client('localhost:8786')

def dft_energy(molecule):
    mol = pyscf.M(atom=molecule, basis='def2-tzvp', verbose=4)
    mf = mol.RKS(xc='wb97x').density_fit().run()
    return mf.e_tot

molecules = ['O 0 0 0; H 0.757 0.5 0; H -0.757 0.5 0',
             'O 0 0 0; H 0.757 0.6 0; H -0.757 0.6 0']
# Each .submit() call offloads one task and returns a Future.
results = [client.submit(dft_energy, x) for x in molecules]
print(client.gather(results))
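The returned Future objects can also be inspected individually; a brief sketch of this usage:

# Check a single task's status, then block until its result is ready.
future = client.submit(dft_energy, molecules[0])
print(future.status)    # e.g., 'pending' or 'finished'
print(future.result())  # blocks until the task completes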

By default, starting a scheduler automatically serves a bokeh dashboard on port
8787 of the same machine. In a local setup, the dashboard can be accessed at:
http://localhost:8787/status

We can use this dashboard to monitor the status of the Dask cluster.
Next, let’s examine how to deploy the Dask cluster in the cloud. The Dask
scheduler and the Dask workers may require different configurations. The scheduler
needs to be accessible to the Client running on our local device, either through its
public IP address or a proxy. Since data transfers via public IP addresses are subject
to charges, communication between the workers and the scheduler should occur
over the internal network.
If we deploy the Dask scheduler using ECS, additional network configurations
will be required to ensure the accessibility of the scheduler. For simplicity, we can
instead launch an EC2 instance to host the Dask scheduler. An EC2 instance can
have both public and private IP addresses.1 After installing the essential packages,
including dask, distributed, and bokeh, on the EC2 instance, we can start the
scheduler as we did in the local environment.
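As a minimal sketch, such an instance could be launched with boto3 as follows. The AMI ID and key pair name are placeholders to adapt, and the network and security settings are omitted; tagging the instance as dask-server matches the lookup used later in this section:

import boto3

ec2 = boto3.client('ec2')
resp = ec2.run_instances(
    ImageId='ami-080e1f13689e07408',   # an Ubuntu 22.04 image; region specific
    InstanceType='m5.xlarge',
    KeyName='my-key-pair',             # hypothetical SSH key pair name
    MinCount=1, MaxCount=1,
    TagSpecifications=[{
        'ResourceType': 'instance',
        'Tags': [{'Key': 'Name', 'Value': 'dask-server'}],
    }],
)
print(resp['Instances'][0]['InstanceId'])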
On the other hand, for Dask workers, we can still utilize the containerized service.
Similar to the ECS deployment method for Celery, we can create a Docker image
python-qc/dask-dft:1.0 using the following Dockerfile:

1 Exposing public IP addresses can introduce security risks. It is recommended to configure the firewall
(Security Group) of the EC2 instance to only allow connections from trusted IP addresses.
FROM nvidia/cuda:12.0.1-runtime-ubuntu22.04
RUN apt-get update && \
apt-get install -y python3 python3-pip && \
rm -rf /var/lib/apt/lists/*
RUN pip install --no-cache-dir gpu4pyscf-cuda12x boto3 dask distributed

The ECS task definition for Dask workers is essentially the same as that for Celery
workers, which can be reused here.
An ECS worker might be assigned multiple IP addresses. To ensure that the Dask
worker connects to the scheduler through its private address, we should specify that
address when starting the dask-worker command. The address can be acquired from
the AWS web console or through the boto3 library. We then override the image and
command in the task definition, leading to the following ECS task configuration for
launching Dask workers:
import boto3
import yaml

# Look up the VPC subnets and the private IP of the scheduler instance.
ec2_client = boto3.client('ec2')
resp = ec2_client.describe_subnets()
subnets = [v['SubnetId'] for v in resp['Subnets']]
private_ip = ec2_client.describe_instances(
    Filters=[{'Name': 'tag:Name', 'Values': ['dask-server']}]
)['Reservations'][0]['Instances'][0]['PrivateIpAddress']

config = yaml.safe_load('''
cluster: python-qc
taskDefinition: celery-gpu-worker
count: 2
overrides:
  containerOverrides:
  - name: dask-gpu-worker
    image: 987654321000.dkr.ecr.us-east-1.amazonaws.com/python-qc/dask-dft:1.0
    command:
    - dask-worker
    - "{0}:8786"  # the scheduler's private IP and its default port
    resourceRequirements:
    - type: GPU
      value: "1"
networkConfiguration:
  awsvpcConfiguration:
    subnets: {1}
'''.format(private_ip, subnets))

ecs_client = boto3.client('ecs')
resp = ecs_client.run_task(**config)
print(yaml.dump(resp))

Compared to Celery, a major advantage of the Dask distributed solution is its
support for general Python functions and objects (subject to certain serialization con­
straints, as discussed in Chapter 6). However, deploying Dask distributed systems in
the cloud is complex. Additionally, similar to the autoscaling issues faced with Cel­
ery, managing Dask workers and scaling them up or down is relatively inconvenient
in cloud environments.

7.2.3 Ray
Ray is more accurately described as a distributed execution framework than merely
a distributed task executor. It is designed for building and running scalable
distributed applications, and it also offers toolkits for developing distributed
machine learning applications.
When Ray is used as a distributed task scheduler, its functionality has certain
overlaps with Dask. It can offload a general Python function to a remote worker for
execution. One of the key advantages of Ray is its native integration with various
cloud platforms, such as AWS, GCP, Azure, and Aliyun. Ray allows users to manage
a cluster for distributed computation without needing extensive cloud expertise.
A Ray cluster consists of a head node and several worker nodes, as shown in
Fig. 7.11. Unlike Dask, which runs a simple service on each node, Ray employs
a more complex architecture. On each node, Ray runs multiple services for com­
munication, job scheduling, distributed storage, and other functions. The head node
additionally operates an auto-scaling service, which can automatically scale worker
nodes according to the workloads. Deploying these complex services does not require
a manual, step-by-step setup. By specifying a concise cluster configuration, Ray can
automatically configure the virtual cluster on the cloud, including all necessary
network connections, message queues, schedulers, storage, the autoscaler, and other
cloud services required by Ray.

FIGURE 7.11
The architecture of the Ray cluster.

The Ray package integrates both distributed computation functionality and cluster
management capabilities. A Ray cluster can be managed directly from our local
machine. Taking AWS as an example, to deploy a Ray cluster on the cloud, the
following packages should be installed:
$ pip install -U ray[default,aws] boto3

The boto3 library is required because Ray relies on this library to access AWS APIs.
To offload a computation task to the Ray cluster, we need to prepare and submit a
Ray job. A Ray job is typically composed of the following components:
• Functions for distribution. These functions are annotated with the @ray.remote
decorator. This decorator adds the .remote() method to them.
• Calling the .remote() method on these functions to generate Future objects.
• Retrieving the results of the Future objects using the ray.get function, which
blocks until the tasks complete.

Below is an example of a Ray job for the aforementioned DFT calculation task.
$ cat dft_job.py
import pyscf
import ray

@ray.remote
def dft_energy(molecule):
    mol = pyscf.M(atom=molecule, basis='def2-tzvp', verbose=4)
    mf = mol.RKS(xc='wb97x').density_fit().run()
    return mf.e_tot

molecules = ['O 0 0 0; H 0.757 0.5 0; H -0.757 0.5 0',
             'O 0 0 0; H 0.757 0.6 0; H -0.757 0.6 0']
# Each .remote() call schedules a task and immediately returns a future.
futures = [dft_energy.remote(x) for x in molecules]
print(ray.get(futures))

To experience the basic usage of Ray jobs, we can start by running the following
command to initiate a Ray cluster in the local environment:
$ ray start --head

According to the instructions displayed in the output, we can submit a Ray job using the
command:
$ export RAY_ADDRESS=’http://127.0.0.1:8265’
$ ray job submit --working-dir . -- python3 dft_job.py

The double-dash notation -- is a delimiter that separates the Ray options from the
command of the job to be executed remotely. The --working-dir option points to a
local directory where we can put the job scripts and the necessary input data.
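The same submission can also be performed from Python through the Ray job submission SDK; a minimal sketch, assuming the local cluster started above:

from ray.job_submission import JobSubmissionClient

# Connect to the Ray dashboard address exported above.
client = JobSubmissionClient('http://127.0.0.1:8265')
job_id = client.submit_job(
    entrypoint='python3 dft_job.py',
    runtime_env={'working_dir': '.'},  # replicated to the worker nodes
)
print(job_id)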
Please note the --working-dir path, which can be a common source of confusion.
Cloud environments do not have a shared file system like the one on an HPC cluster.
Script files, such as the dft_job.py we develop locally, are not automatically
accessible on the worker nodes. To address this file sharing problem, the command

ray job submit

will replicate the working-dir onto the worker nodes. The command for the job (such
as python3 dft_job.py in our example) is executed within this directory. Therefore,
we should avoid using absolute paths specific to the local machine within the job
command. Additionally, we need to limit the amount of data stored in the working-dir,
as sending a large volume of data can lead to file transfer errors during the Ray job
submission process.
If our job requires large libraries or a significant amount of input data, how can
we transfer them to the Ray worker nodes?
One approach is to create a script within the working-dir that installs the required
libraries and downloads the necessary data from cloud object storage. This script
can be executed prior to running the main computation task.
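A minimal sketch of such a preparation script is shown below; the bucket and object names are hypothetical:

# prepare.py -- placed in the working-dir and run before the main task
import subprocess
import boto3

# Install libraries that are not baked into the worker image.
subprocess.run(['pip', 'install', 'gpu4pyscf-cuda12x'], check=True)

# Download input data from S3; the bucket and key names are hypothetical.
s3 = boto3.client('s3')
s3.download_file('python-qc-data', 'inputs/molecules.json', 'molecules.json')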
Alternatively, libraries or data can be pre-installed on the worker nodes during the
Ray cluster deployment. These preparation commands can be specified in the cluster
configuration, which will become clearer in the subsequent discussion.
Let’s move on to setting up a Ray cluster on AWS. Below is a sample configuration
for such a cluster, which we have saved in a YAML file named config.yaml.

cluster_name: dft-ray

provider:
  type: aws
  region: us-east-1
  availability_zone: us-east-1a

available_node_types:
  ray.head.default:
    node_config:
      InstanceType: m5.xlarge
      ImageId: ami-080e1f13689e07408  # Ubuntu 22.04
  ray.worker.default:
    max_workers: 3
    node_config:
      InstanceType: p3.2xlarge  # GPU instance
      ImageId: ami-0b54855df82eef3a3  # Ubuntu 22.04 with deep learning tools

setup_commands:
  - sudo apt install -y python3-pip
  - pip install ray[default] boto3 pyscf gpu4pyscf-cuda12x

This configuration specifies several key settings:

• The resources are provisioned on the AWS cloud platform, as indicated by the
provider field.
• An autoscaling scheme is configured, which allows Ray to launch one head node
and up to three worker nodes.
• The configuration specifies the EC2 instance type and the operating system
image for each node. We can search for a public OS image in the AWS
marketplace or use a custom image created by ourselves. In this example, ImageId:
ami-0b54855df82eef3a3 is a public deep learning image, which pre-installs the
CUDA 12 drivers and runtime tools.
• The setup_commands field specifies the initialization commands to run after the
nodes boot up. This allows for the installation of new packages and the preparation
of data. The software installed in this phase must be compatible with the CUDA
toolkit provided by the OS image.
More custom options are available in the Ray cluster configuration, such as
the disk size and the autoscaling strategy. For further configuration options, you can
refer to the Ray documentation [28] or the cluster configuration examples available
in the Ray source code.
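For illustration, the disk size and the autoscaler's idle timeout could be adjusted with options along the following lines; these keys are taken from typical Ray example configurations and should be verified against the documentation:

# Possible additions to config.yaml
idle_timeout_minutes: 5          # tear down idle worker nodes after 5 minutes
available_node_types:
  ray.worker.default:
    node_config:
      BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
            VolumeSize: 100      # root disk size in GB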
After defining the cluster configuration, we can use the ray up command to create
or update a cluster.
$ ray up -y config.yaml

Once all computations are finished, the Ray cluster can be shut down using the com­
mand:
$ ray down -y config.yaml

By executing this command, Ray sends requests to the cloud platform to free and
clean up the resources it has allocated. However, this command does not fully track
the success of these requests. To prevent potential resource waste and unexpected
bills, it is recommended to log into the web console and manually verify the status of
each resource.
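For instance, a quick boto3 check along the following lines lists any instances that are still running. It assumes Ray's default behavior of tagging the instances it launches with a ray-cluster-name tag; verify this against your deployment:

import boto3

ec2 = boto3.client('ec2')
# Find instances of the 'dft-ray' cluster that are still pending or running.
resp = ec2.describe_instances(Filters=[
    {'Name': 'tag:ray-cluster-name', 'Values': ['dft-ray']},
    {'Name': 'instance-state-name', 'Values': ['pending', 'running']},
])
leftover = [i['InstanceId']
            for r in resp['Reservations'] for i in r['Instances']]
print('Instances still running:', leftover)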

Summary
In this chapter, we have briefly demonstrated the design and implementation of a
quantum chemistry simulation workflow on the cloud. We utilized the containerized
cloud service to provide computing resources and the object storage service for data
exchange. When configuring cloud services, we chose to manage them by calling
their APIs. This approach was combined with Python templating techniques. Although
we have used the AWS cloud to demonstrate this process, similar approaches
can be applied to other cloud platforms.
Based on the knowledge of cloud computing, we explored several distributed job
executors, including the Celery, Dask, and Ray libraries. We illustrated how to de­
ploy them on the cloud and how to use them to execute computation tasks remotely.
Notably, Ray exhibits excellent integration with cloud environments, making it a con­
venient choice for cloud computing.
During the development of cloud programs, we leveraged an AI coding assistant,
specifically the GPT-4 model, to provide prototypes for computation tasks and
configuration programs. We then refined the functionalities based on the AI-generated
prototypes. The world of cloud computing remains complex, and the relevant tools
and technologies evolve rapidly. Using AI to assist the development of cloud
programs is an effective way to keep pace with this situation.

References
[1] Amazon Web Services, Amazon machine images (AMIs), https://docs.aws.amazon.com/
AWSEC2/latest/UserGuide/AMIs.html, 2024.
[2] Amazon Web Services, Amazon EBS volumes, https://docs.aws.amazon.com/ebs/latest/
userguide/ebs-volumes.html, 2024.
[3] Amazon Web Services, What is Amazon elastic container service?, https://docs.aws.
amazon.com/AmazonECS/latest/developerguide/Welcome.html, 2024.
[4] Amazon Web Services, What is Amazon VPC?, https://docs.aws.amazon.com/vpc/latest/
userguide/what-is-amazon-vpc.html, 2024.
[5] Amazon Web Services, How Amazon VPC works, https://docs.aws.amazon.com/vpc/
latest/userguide/how-it-works.html, 2024.
[6] Amazon Web Services, Get started with Amazon S3, https://docs.aws.amazon.com/
AmazonS3/latest/userguide/GetStartedWithS3.html, 2024.
[7] Amazon Web Services, Moving an image through its lifecycle in Amazon ECR, https://
docs.aws.amazon.com/AmazonECR/latest/userguide/getting-started-cli.html, 2024.
[8] Amazon Web Services, Control traffic to your AWS resources using security groups,
https://docs.aws.amazon.com/vpc/latest/userguide/vpc-security-groups.html, 2024.
[9] Amazon Web Services, Configuration and credential file settings, https://docs.aws.
amazon.com/cli/latest/userguide/cli-configure-files.html, 2024.
[10] Amazon Web Services, IAM roles, https://docs.aws.amazon.com/IAM/latest/UserGuide/
id_roles.html, 2024.
[11] The PySCF Developers, Quantum chemistry with Python, https://pyscf.org/, 2024.
[12] R. Li, Q. Sun, X. Zhang, G.K.-L. Chan, Introducing GPU-acceleration into the Python­
based simulations of chemistry framework, arXiv:2407.09700, https://arxiv.org/abs/
2407.09700, 2024.
[13] X. Wu, Q. Sun, Z. Pu, T. Zheng, W. Ma, W. Yan, X. Yu, Z. Wu, M. Huo, X. Li, W. Ren, S.
Gong, Y. Zhang, W. Gao, Enhancing GPU-acceleration in the Python-based simulations
of chemistry framework, arXiv:2404.09452, https://arxiv.org/abs/2404.09452, 2024.
[14] Amazon Web Services, AWS Fargate for Amazon ECS, https://docs.aws.amazon.com/
AmazonECS/latest/developerguide/AWS_Fargate.html, 2024.
[15] Amazon Web Services, What is Amazon EC2 Auto Scaling?, https://docs.aws.amazon.
com/autoscaling/ec2/userguide/what-is-amazon-ec2-auto-scaling.html, 2024.
[16] Amazon Web Services, Amazon ECS task definitions for GPU workloads, https://docs.
aws.amazon.com/AmazonECS/latest/developerguide/ecs-gpu.html, 2024.
[17] Amazon Web Services, Amazon ECS task networking options for the EC2 launch
type, https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-networking.
html, 2024.
[18] Amazon Web Services, Amazon ECS task metadata endpoint version 3, https://docs.aws.
amazon.com/AmazonECS/latest/developerguide/task-metadata-endpoint-v3.html, 2024.
[19] Amazon Web Services, Tutorial: create a REST API with a Lambda proxy in­
tegration, https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-
create-api-as-simple-proxy-for-lambda.html, 2024.
[20] Amazon Web Services, Define Lambda function handler in Python, https://docs.aws.
amazon.com/lambda/latest/dg/python-handler.html, 2024.
[21] Amazon Web Services, Authenticating using an API gateway method, https://
docs.aws.amazon.com/transfer/latest/userguide/authentication-api-gateway.html#
authentication-custom-ip, 2024.
[22] A. Solem, contributors, Celery - distributed task queue, https://docs.celeryq.dev/en/
stable/, 2024.
[23] Dask Development Team, Dask: library for dynamic task scheduling, http://dask.pydata.
org, 2016.
[24] P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, M. Elibol, Z. Yang, W.
Paul, M.I. Jordan, I. Stoica, Ray: a distributed framework for emerging AI applications,
in: 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI
18), USENIX Association, Carlsbad, CA, 2018, pp. 561–577, https://www.usenix.org/
conference/osdi18/presentation/moritz.
[25] A. Solem, contributors, Celery documentation - backends and brokers, https://docs.
celeryq.dev/en/stable/getting-started/backends-and-brokers/index.html, 2024.
[26] Amazon Web Services, Environment variables to configure the AWS CLI, https://docs.
aws.amazon.com/cli/latest/userguide/cli-configure-envvars.html, 2024.
[27] M. Movsisyan, Flower documentation, https://flower.readthedocs.io/en/latest/, 2024.
[28] The Ray Team, Ray documentation - cluster YAML configuration options, https://docs.ray.
io/en/latest/cluster/vms/references/ray-cluster-configuration.html#cluster-config, 2024.
