ing mechanisms. Individual jobs are configured separately and executed within
isolated VMs in the cloud.
• Connectivity in Parallel Execution. Supercomputers provide a low-latency,
high-performance network that enables efficient communication between processors.
Exchanging data between different processes via MPI (Message Passing
Interface) is relatively simple. This level of connectivity is not as straightforward
in the cloud.
FIGURE 7.1
Workflow for PES using DFT.
3. The geometry generator then invokes cloud APIs to launch several workers to
execute DFT calculations. Within each job, metadata, such as the S3 paths of the
geometry files, are provided as input to the DFT executor.
4. Each DFT executor downloads a geometry file, and then executes the pre-defined
DFT calculation.
5. The DFT executor uploads the results to the S3 storage, and then sends a status
signal to the geometry generator via RPC.
6. The geometry generator creates additional grids on the PES based on the existing
results and returns to step 2 to continue the workflow. Alternatively, it can choose
to terminate the workflow if sufficient data points have been generated.
This workflow exhibits the following characteristics regarding the utilization of
resources:
• The workflow primarily consists of CPU-intensive tasks.
• The demands for storage space and I/O operations are small.
• If the selected DFT program supports the use of GPU devices, GPU workers can
be allocated.
Regarding the software stack, to support the tasks defined in this workflow, we can
create a Docker image containing all the necessary components: a geometry genera
tor, a DFT program, cloud management toolkits (such as awscli), and certain Python
libraries for RPC services.
To design a complete and robust workflow, there are additional matters to consider,
such as fault tolerance for the DFT calculations, the authentication scheme, and the
workflow restart scheme. Developing a workflow for cloud computing is analogous
to developing a program. It may require continuous testing and
debugging before deploying it on the cloud. Once the workflow is fully operational,
it is convenient to repeatedly execute similar jobs in the cloud.
FIGURE 7.2
Creating an AWS S3 bucket using the AWS web console.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"ec2:*",
"ecs:*",
"ecr:*",
"s3:*",
"lambda:*",
"sqs:*"
],
"Resource": "*"
}
]
}
Please note that this AWS IAM configuration is insecure. In real applications, it is
recommended to carefully configure the IAM rules and limit access only to the necessary
services and resources. Please consult the AWS online documentation for more
details.
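For example, a policy restricted to object access in a single bucket might look roughly like the following sketch (the bucket name is the one used later in this chapter):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::python-qc-3vtl",
        "arn:aws:s3:::python-qc-3vtl/*"
      ]
    }
  ]
}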
The containerized approach requires a Docker image specifically designed for
the computational task. Assuming the goal is to establish the DFT PES workflow
mentioned in Section 7.1.2, the following Dockerfile can be created:
FROM nvidia/cuda:12.0.1-runtime-ubuntu22.04
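# The remaining layers below are a sketch; the package selection follows the
# components described in the text.
RUN apt-get update && \
    apt-get install -y python3 python3-pip awscli && \
    rm -rf /var/lib/apt/lists/*
RUN pip install --no-cache-dir gpu4pyscf-cuda12x boto3
COPY start.sh /app/start.sh
RUN chmod +x /app/start.sh
WORKDIR /app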
To accommodate scenarios that require GPU acceleration, the base image is set
to the nvidia/cuda:12.0.1-runtime image, which provides the CUDA 12 runtime
environment. For the DFT computation engine, we utilize the quantum chemistry
program PySCF [11] and its GPU acceleration variant gpu4pyscf-cuda12x [12,13].
The awscli toolkit and boto3 library are installed for managing and interacting with
AWS services. The start.sh script is responsible for managing computation and data
transfer tasks. This script is implemented as follows:
#!/bin/bash
# Set the default time limit to 1 hour
TIMEOUT=${2:-3600}
S3PATH=$1
aws s3 cp $S3PATH/job.py ./
timeout $TIMEOUT python job.py > job.log
aws s3 cp job.log $S3PATH/
The computation task is defined in a Python program, job.py, which should be cre
ated by the geometry generator service. To prevent the computation from running
indefinitely, we use the GNU timeout tool to enforce the time limit on the calcula
tion. Computational results are uploaded to the S3 storage. We then build this image
and upload it onto the AWS ECR service:
$ aws ecr create-repository --repository-name python-qc/dft-gpu
$ docker build -t 987654321000.dkr.ecr.us-east-1.amazonaws.com/python-qc/dft-gpu:1.0 .
$ docker push 987654321000.dkr.ecr.us-east-1.amazonaws.com/python-qc/dft-gpu:1.0
The 12-digit number 987654321000 is just a placeholder for the account ID. You
should replace it with your actual account ID.
To deploy the computation workflow on AWS ECS, the process begins with
defining an ECS cluster. An ECS cluster encompasses specifications of the VM
hardware resources, operating system, network configuration, and various access permissions.
Using the command provided by the awscli toolkit, we can establish a basic
cluster that runs on AWS-managed serverless capacity, known as FARGATE mode [14], with default
settings.
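A minimal form of that command, assuming the cluster name python-qc used throughout this section, is:
$ aws ecs create-cluster --cluster-name python-qc --capacity-providers FARGATE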
The FARGATE-mode cluster would be sufficient for exploring the basic functionalities
of AWS ECS. However, this cluster does not support GPU acceleration. To utilize
GPUs, we need to enable ASG (Auto Scaling Groups) [15] for EC2 GPU instances
when creating ECS clusters. It is recommended to configure this type of ECS cluster
using the AWS web console (Fig. 7.3). For detailed instructions on ECS cluster
configuration, please refer to the AWS ECS documentation [3,16].
FIGURE 7.3
Creating an ECS cluster with ASG for EC2 GPU instances.
Let’s assume that an ECS cluster named python-qc with GPU support is properly
configured. The next step is to create a task definition on this ECS cluster.
A task definition acts like a template for a specific computational task. It specifies
the computational resources, such as CPU, memory, and the Docker image. Task
definitions can be configured in the AWS web console.
To ease the configuration process for future work, we will instead generate and
manage task definitions programmatically. The idea is to use code generation techniques
along with the AWS Python SDK boto3 to manage the configuration. To
enhance the readability of the configuration, we store it in the YAML format. The
boto3 library is complex to use, as it involves numerous API details of the AWS cloud.
Thanks to the progress of AI tools, it is no longer necessary to struggle with the boto3
API documentation. Configuring cloud computing is a task particularly well suited to
AI-assisted programming: a massive amount of open-source examples for cloud APIs
is available, AI models have been trained extensively on them, and their output
is quite accurate.
Here, we utilize GPT-4 to generate a configuration sample, which can then be
modified to meet specific requirements. Different AI coding assistants might produce
different code samples. However, the differences should be insignificant for the
current task. Our prompt for GPT is:
Create a Python script to register an AWS ECS task definition:
- Use GPU instances on ECS
- Use boto3 library
- Follow the template below
```
config = '''
- ECS configuration in YAML.
- Include docker image, the CPU, memory and other necessary requirements in
  configuration.
'''
py_config = convert YAML config to Python code
ecs_client = boto3.client('ecs')
response = ecs_client.appropriate_ecs_function(**py_config)
```
requiresCompatibilities:
- EC2
memory: "4GB"
cpu: "1024"
'''
print(yaml.dump(response))
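The exact script returned by the assistant varies from run to run; a representative version, assuming the task family dft-demo and the ECR image built above, is sketched below:
import boto3
import yaml

config = '''
family: dft-demo
requiresCompatibilities:
- EC2
networkMode: awsvpc
containerDefinitions:
- name: dft-demo-1
  image: 987654321000.dkr.ecr.us-east-1.amazonaws.com/python-qc/dft-gpu:1.0
  resourceRequirements:   # request one GPU for each task
  - type: GPU
    value: "1"
memory: "4GB"
cpu: "1024"
'''

# Convert the YAML configuration into keyword arguments for boto3
py_config = yaml.safe_load(config)
ecs_client = boto3.client('ecs')
response = ecs_client.register_task_definition(**py_config)
print(yaml.dump(response))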
If you have previously created a basic ECS cluster with FARGATE mode, you can
remove the setting "resourceRequirements" from the task configuration.
By using a similar methodology, we can define runnable tasks based on the
above task definition and submit them to ECS. The runnable task requires the use of
run_task function. Here, we omit the intermediate code samples generated by GPT-4
and demonstrate only the revised script.
import boto3
import jinja2
import yaml

task_config_tpl = jinja2.Template('''
cluster: python-qc
taskDefinition: dft-demo
count: 1
overrides:
  containerOverrides:
  - name: dft-demo-1
{%- if image %}
    image: {{ image }}
{%- endif %}
    command:
    - /app/start.sh
    - s3://python-qc-3vtl/{{ task }}/{{ job_id }}
    - "{{ timeout or 7200 }}"
    environment:
    - name: JOB_ID
      value: "{{ job_id }}"
    - name: OMP_NUM_THREADS
      value: "{{ threads or 1 }}"
{%- if threads %}
    memory: {{ threads * 2 }}GB
{%- else %}
    memory: 2GB
{%- endif %}
networkConfiguration:
  awsvpcConfiguration:
    subnets:
{%- for v in subnets %}
    - {{ v }}
{%- endfor %}
''')

ec2_client = boto3.client('ec2')
resp = ec2_client.describe_subnets()
subnets = [v['SubnetId'] for v in resp['Subnets']]

config = task_config_tpl.render(
    task='dft-demo', job_id='219a3d6c', threads=2, subnets=subnets)
config = yaml.safe_load(config)

ecs_client = boto3.client('ecs')
resp = ecs_client.run_task(**config)
print(yaml.dump(resp))
If a task is successfully submitted, the ECS will start an EC2 VM with GPU
devices. It then executes the docker run command within the VM, based on the
parameters defined in the task and the task definition. For example, the task in this case
is effectively identical to the following docker run command:
$ docker run --runtime=nvidia-container-runtime \
-e OMP_NUM_THREADS=2 --cpus 2 --memory 8G \
987654321000.dkr.ecr.us-east-1.amazonaws.com/python-qc/dft-gpu:1.0 \
/app/start.sh s3://python-qc-3vtl/dft-demo/219a3d6c 7200
The ECS task is not limited to just a single run-until-completion job. We can use
ECS workers to run Dask, Ray, or other job schedulers. These schedulers can execute
various types of tasks, simplifying the complex process of configuring individual
ECS tasks. We will explore this approach in Section 7.2.
def geometry_on_pes(molecule, pes_params, existing_results):
    '''Generates the next batch of geometries on the PES.

    Arguments:
    - molecule: Molecular formula
    - pes_params: Targets to scan, such as bonds, bond angles.
    '''
    # This function should produce molecule configurations
    # based on the results of accomplished calculations.
    # Here is a fake implementation to demonstrate the functionality.
    if len(existing_results) > 1:
        return []
    h2o_xyz = 'O 0 0 0; H 0.757 0.587 0; H -0.757 0.587 0'
    return [h2o_xyz] * 3
Since the DFT tasks are executed in containers or VMs remotely, it is not possible
to achieve direct interactions between the geometry generator and DFT tasks within
the same memory space. We thus utilize an RPC framework to enable the communica
tion between the two programs. The geometry generator can act as a service which
generates tasks and launches these tasks in the cloud. Additionally, this geometry
service provides an RPC endpoint to accept requests from DFT executors. Each DFT
task can send its status or results to the geometry service through this RPC. As shown
in Fig. 7.4, the PES workflow requires the collaboration of the following components:
• A geometry service to generate grids on PES.
• A job launcher to initiate DFT tasks running on ECS.
• An RPC system for the communication between the geometry service and the
DFT task.
FIGURE 7.4
The design of the PES workflow based on the RPC service.
rpc_config_tpl = jinja2.Template('''
cluster: {{ cluster }}
taskDefinition: {{ task }}
count: 1
overrides:
  containerOverrides:
  - name: {{ task }}-1
    command:
    - /app/start.sh
    - {{ job_path }}
    - "{{ timeout or 7200 }}"
    environment:
    - name: JOB_ID
      value: "{{ job_id }}"
    - name: RPC_SERVER
      value: "{{ rpc_server }}"
    - name: OMP_NUM_THREADS
      value: "{{ threads }}"
    resourceRequirements:
    - type: GPU
      value: "1"
    cpu: {{ threads }} vcpu
    memory: {{ threads * 2 }}GB
launchType: EC2
networkConfiguration:
  awsvpcConfiguration:
    subnets:
{%- for v in subnets %}
    - {{ v }}
{%- endfor %}
''')
The start.sh script for the DFT executor performs the following operations:
1. It retrieves the previously created job.py program from S3 storage.
2. It executes the DFT computation by calling job.py.
3. It sends the results to the RPC server using the rpc.py program.
A possible implementation for start.sh is provided below:
#!/bin/bash
# Set the default time limit to 1 hour
TIMEOUT=${2:-3600}
S3PATH=$1
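# The remainder of the script follows the same pattern as the earlier
# start.sh; the status strings passed to rpc.py are assumptions.
aws s3 cp $S3PATH/job.py ./
timeout $TIMEOUT python job.py > job.log
STATUS=$?
aws s3 cp job.log $S3PATH/
# Report the outcome and the log back to the geometry service
if [ $STATUS -eq 0 ]; then
    python rpc.py success job.log
else
    python rpc.py failed job.log
fi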
The rpc.py program functions as an RPC client. It invokes the RPC endpoint
set_result of the geometry service to transmit messages, such as the DFT energy
or the computation status, back to the geometry service. The code snippet below
illustrates the basic functionality of rpc.py:
import os, sys
from xmlrpc.client import ServerProxy

rpc_server = os.getenv('RPC_SERVER')
job_id = os.getenv('JOB_ID')
status = sys.argv[1]
logfile = sys.argv[2]

def parse_log(logfile):
    '''Reads the log file and finds the energy'''
    log = open(logfile, 'r').read()
    ...
    return energy
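The script then reports back to the geometry service through the endpoint; a minimal sketch of that call (the exact set_result signature is an assumption) is:
proxy = ServerProxy(f'http://{rpc_server}')
if status == 'success':
    result = parse_log(logfile)
else:
    result = status
# The call signature must match the set_result function registered by the
# geometry service.
proxy.set_result(job_id, result)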
The IP address of each container instance can be obtained through the metadata URI provided by
AWS [18].
import os, requests

def self_Ip():
    '''IP address of the current container or EC2 instance'''
    metadata_uri = os.getenv('ECS_CONTAINER_METADATA_URI')
    resp = requests.get(f'{metadata_uri}/task').json()
    return resp['Networks'][0]['IPv4Addresses']
To prepare DFT tasks on the RPC server, we use the following template to create
the Python script job.py:
rpc_job_tpl = jinja2.Template('''
import pyscf
from gpu4pyscf.dft import RKS
mol = pyscf.M(atom="""{{ geom }}""", basis='def2-tzvp', verbose=4)
mf = RKS(mol, xc='wb97x').density_fit().run()
''')
The job.py script, once generated, is uploaded to S3 storage at the location specified
by job_path for remote access.
These task preparation steps can be integrated into the launch_tasks function as
shown below:
import boto3, hashlib, jinja2, json, sys, yaml
from typing import List, Dict
from concurrent.futures import Future, TimeoutError

CLUSTER = 'python-qc'
TASK = 'dft-demo'
RPC_PORT = 5005

s3_client = boto3.client('s3')
ecs_client = boto3.client('ecs')
ec2_client = boto3.client('ec2')
subnets = [v['SubnetId'] for v in ec2_client.describe_subnets()['Subnets']]
job_pool = {}  # Futures shared with the RPC endpoint set_result

def launch_tasks(geom_grids, results_path, timeout=7200):
    ip = self_Ip()  # address that the DFT executors report back to
    jobs = {}
    for geom in geom_grids:
        job_conf = rpc_job_tpl.render(geom=geom).encode()
        job_id = hashlib.md5(job_conf).hexdigest()
        # Upload job.py for the remote executor (assumed layout:
        # the job files live under results_path/job_id)
        job_path = f'{results_path}/{job_id}'
        bucket, key = job_path.replace('s3://', '').split('/', 1)
        s3_client.put_object(Bucket=bucket, Key=f'{key}/job.py', Body=job_conf)
        task_config = rpc_config_tpl.render(
            cluster=CLUSTER, task=TASK, job_id=job_id, job_path=job_path,
            rpc_server=f'{ip}:{RPC_PORT}', timeout=timeout, threads=2,
            subnets=subnets)
        try:
            ecs_client.run_task(**yaml.safe_load(task_config))
        except Exception:
            pass
        else:
            fut = Future()  # (1)
            fut._timeout = timeout
            job_pool[job_id] = fut
            jobs[job_id] = fut
    return jobs
from threading import Thread
from contextlib import contextmanager
from xmlrpc.server import SimpleXMLRPCServer

@contextmanager
def rpc_service(funcs):
    '''Creates an RPC service in background'''
    try:
        rpc_server = SimpleXMLRPCServer(("", RPC_PORT))
        for fn in funcs:
            rpc_server.register_function(fn, fn.__name__)
        server_thread = Thread(target=rpc_server.serve_forever)
        server_thread.start()
        yield
    finally:
        # Terminate the RPC service
        rpc_server.shutdown()
        server_thread.join()
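The set_result endpoint registered with this service completes the pending Future of the corresponding job; a minimal sketch (its exact signature is an assumption) is:
def set_result(job_id, result):
    '''RPC endpoint called by DFT executors to report a result.'''
    fut = job_pool.get(job_id)
    if fut is not None and not fut.done():
        fut.set_result(result)
    return True  # XML-RPC requires a serializable return value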
We then develop the driver function pes_app to manage the RPC server and the
workflow. This function is defined in the pes_scan.py file. Depending on the results
from the completed DFT tasks, the driver can either proceed to generate the next
batch of computational tasks or terminate the PES workflow if necessary.
def parse_config(config_file):
    assert config_file.startswith('s3://')
    bucket, key = config_file.replace('s3://', '').split('/', 1)
    obj = s3_client.get_object(Bucket=bucket, Key=key)
    config = json.loads(obj['Body'].read())
    return config

def pes_app(config_file):
    config = parse_config(config_file)
    molecule = config['molecule']
    pes_params = config['params']
    results_path = config_file.rsplit('/', 1)[0]
    with rpc_service([set_result]):
        # Scan geometry until enough data are generated
        results = {}
        geom_grids = geometry_on_pes(molecule, pes_params, results)
        while geom_grids:
            jobs = launch_tasks(geom_grids, results_path)
            for key, fut in jobs.items():
                try:
                    result = fut.result(fut._timeout)
                except TimeoutError:
                    result = 'timeout'
                results[key] = result
            geom_grids = geometry_on_pes(molecule, pes_params, results)
    return results

if __name__ == '__main__':
    pes_app(sys.argv[1])
7.1.5 Function-as-a-Service
Developing a workflow on the cloud from scratch requires some coding work and
expertise with cloud platforms. Once a workflow is successfully configured, we may
want to enhance its accessibility, making it more convenient for future use or available
to other users. Here, developing a Function-as-a-Service (FaaS) is one viable
option that can simplify the complex deployment procedures associated with cloud
platforms. This service can be accessed using RESTful APIs over the HTTP protocol.
One only needs to send an HTTP request that specifies the molecular structure and
DFT parameters to initiate a PES scan task.
The PES scan service requires the coordination of multiple components:
• An endpoint to receive HTTP requests.
• Middleware to forward HTTP requests to the backend server.
• A backend server to process the request and launch the PES scan job on AWS
ECS. It then returns the execution results in an HTTP response.
On the AWS cloud platform, we can leverage various AWS components to implement
this service (Fig. 7.5). The AWS API Gateway can provide an HTTP endpoint and act
as a proxy to forward HTTP requests [19]. AWS Lambda can serve as the backend to
parse the request and launch the PES scan workflow we developed in Section 7.1.4.
The results of the PES scan job can be stored in the S3 storage, and the S3 path is
returned in the HTTP response. The configuration of the Lambda handler
and the API Gateway trigger is illustrated in Figs. 7.6--7.9.
FIGURE 7.5
The architecture of the FaaS for the PES scan service.
Let’s start the project with AWS Lambda. AWS Lambda is a serverless computing
service that enables us to execute code without the need to provision or
manage servers. It automatically allocates computing resources and executes predefined
functions based on the incoming request. Here, we create a Lambda function
using the configuration shown in Fig. 7.6. We name this Lambda function pes-scan.
This name will be automatically applied to the configuration of other services. You
may choose any other name for this function. If you choose a different name, be sure
to update all instances of pes-scan in the subsequent settings accordingly.
In the AWS Lambda console (Fig. 7.7), we can deploy the following Python func
tion handler [20] in Box 1 of Fig. 7.7.
import json
import hashlib
import boto3

s3_client = boto3.client('s3')
ec2_client = boto3.client('ec2')
ecs_client = boto3.client('ecs')

def lambda_handler(event, context):
    body = json.loads(event['body'])
    job = body['job']
    if job.upper() != 'PES':
        msg = f'Unknown job {job}'
        results = 'N/A'
    else:
        config = json.dumps({
            'molecule': body['molecule'],
            'params': body['params']
        }).encode()
        job_id = hashlib.md5(config).hexdigest()

        resp = ec2_client.describe_subnets()
        subnets = [v['SubnetId'] for v in resp['Subnets']]

        bucket, path = 'python-qc', f'pes-faas/{job_id}'
        s3_client.put_object(Bucket=bucket, Key=f'{path}/config.json',
                             Body=config)
        s3path = f's3://{bucket}/{path}/config.json'

        resp = ecs_client.run_task(
            cluster='python-qc',
            taskDefinition='pes-rpc-server',
            count=1,
            launchType='FARGATE',
            networkConfiguration={'awsvpcConfiguration': {'subnets': subnets}},
            overrides={
                'containerOverrides': [{
                    'name': 'rpc-server',
                    'command': ['python', 'pes_scan.py', s3path]
                }]
            }
        )
        msg = json.dumps(resp, default=str)
        results = f's3://{bucket}/pes-faas/{job_id}/results.log'

    return {
        'statusCode': 200,
        'body': json.dumps({
            'results': results,
            'detail': msg,
        })
    }
This handler decodes the HTTP requests and launches the PES scan task on ECS. Furthermore,
a trigger for the Lambda function can be configured in Box 3 of Fig. 7.7
in the AWS Lambda console. The API Gateway is assigned to the trigger, as shown
in Box 1 of Fig. 7.8. For simplicity, we set the security mechanism of the API
Gateway to Open, which allows the gateway to accept all requests and forward them
to the Lambda function. Please note that this setting is insecure. It is recommended to
configure an authentication method to enhance the security of the service [21]. By deploying
the API Gateway trigger, an HTTP endpoint is generated. This endpoint can be
accessed by clicking on Box 1 in Fig. 7.9, which indicates the following URL:
https://d8iah6mwib.execute-api.us-east-1.amazonaws.com/default/pes-scan
FIGURE 7.6
Creating an AWS Lambda function to trigger the PES scan service.
FIGURE 7.7
Configuring AWS Lambda.
FIGURE 7.8
Adding an AWS API Gateway trigger.
FIGURE 7.9
Checking the API Gateway endpoint and configuring permissions.
The Lambda function handler needs access to several AWS services, including
read and write permissions for S3, EC2, and ECS. We can configure the permissions
of the Lambda function in the AWS Lambda console, as shown in Fig. 7.9.
For the configuration method of the IAM policy, refer to Section 7.1.3. After assigning
the necessary permissions to the Lambda function, we are ready to provide a PES
scan service through the API Gateway endpoint. Here is an example of invoking the
HTTP API using a JSON document, which specifies the geometry of a molecule and
the PES parameters:
import requests

molecule = 'O 0 0 0; H 0.757 0.587 0; H -0.757 0.587 0'
resp = requests.post(
    'https://d8iah6mwib.execute-api.us-east-1.amazonaws.com/default/pes-scan',
    json={'job': 'pes', 'molecule': molecule, 'params': 'O,H'}
)
print(resp.json()['detail'])
print(resp.json()[’detail’])
The response will include an S3 path that indicates where the results are stored.
The above example is the most basic framework of an FaaS, in which we have
omitted many technical details. In a comprehensive FaaS, there are more engineering
and usability issues to consider, such as authorization, authentication, network
security, fault tolerance, monitoring, and data persistence. Nevertheless, this ba
sic framework can serve as a starting point, from which you can gradually introduce
more features to enhance the usability of the FaaS.
7.2 Distributed job executors
• Celery [22] is a distributed computing tool that relies on message queues to dis
tribute tasks across multiple computing nodes. It is easy to scale up workers to
process a high volume of similar tasks. Celery is commonly used for handling
background tasks in web applications. It can also effectively support quantum
chemistry data generation workloads.
• Dask [23] is a library that focuses on large-scale data processing. It is often em
ployed in data analysis and numerical computation jobs that are distributed across
multiple computing nodes. Its distribution system is a pure Python application,
which does not require any external components (such as message queues or
databases). Dask is a flexible framework for distributing workloads. It is not lim
ited to pre-defined functions. Python functions can be created at runtime and sent
to workers for execution.
• Ray [24] is a high-performance distributed computing framework. It shares cer
tain similarities with Dask in distributed task execution. The highlight of the Ray
project is its native integration with cloud computing environment. It offers con
venient cloud resource management and task scheduling capabilities. In addition
to distributing workloads, Ray offers a comprehensive solution for memory shar
ing and parallel computation in a distributed system. If the goal is to develop
a distributed application that does more than just workload offloading, the Ray
framework is an excellent option.
Distributed job executors typically consist of servers, clients, workers, and other
components. Each component requires a different deployment strategy. Servers and
workers are usually deployed on the cloud. Clients can be installed in the local Python
environment.
Distributed computing is often closely related to parallel computation techniques.
In this section, we will focus only on the capabilities and the deployments of dis
tributed job executors in the context of cloud computing. The designs of parallel
programs will be explored in Chapter 10.
7.2.1 Celery
The distributed execution in Celery is based on the producer-consumer parallel com
puting model. In this setup, the Celery client acts as the producer, pushing new tasks
into a queue. Celery workers then retrieve and execute tasks from this queue. To
deploy Celery, we need to set up a message broker to manage the task queue.
Celery supports a variety of message brokers, such as RabbitMQ, Redis, and
Amazon SQS (Simple Queue Service). Each message broker specializes in differ
ent functionalities, including security, performance, priority handling, error tolerance,
and crash recovery. These considerations are crucial for highly concurrent web applica
tions. However, for tasks such as DFT data generation or other chemistry simulation
calculations, the differences among these brokers are minimal, and all are generally
sufficient for our use cases.
Let’s choose Redis to serve as the broker. It can be deployed locally using the
command:
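For example, assuming Docker is available locally, a Redis broker can be started with:
$ docker run -d --name redis -p 6379:6379 redis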
We can accordingly install the Celery library with the Redis extension:
$ pip install celery[redis]
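The dft_app.py module first creates the Celery application itself; a minimal sketch, assuming a local Redis broker and a second Redis database for results, is:
import hashlib
import pyscf
from gpu4pyscf.dft import RKS
from celery import Celery

app = Celery('dft_app',
             broker='redis://localhost:6379/0',   # (1) message broker
             backend='redis://localhost:6379/1')  # (2) result backend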
@app.task
def dft_energy(molecule):
    job_id = hashlib.md5(f'{molecule=}'.encode()).hexdigest()
    mol = pyscf.M(atom=molecule, basis='def2-tzvp', verbose=4)
    mf = RKS(mol, xc='wb97x').density_fit().run()
    return job_id, mf.e_tot
The message broker is specified by the keyword broker when creating the Celery application,
as shown in line (1). It only serves as a task queue for scheduling jobs. To
receive output from the remote Celery workers, it is necessary to configure a database
backend to store results for the Celery application. There are several options available
for result backends in Celery [25]. For simplicity, we continue to use Redis as
the result backend. Redis uses numbers in the URL to identify databases. Different
numbers correspond to different databases. The results are configured to be stored in a
separate database, as shown in line (2).
In the Celery application code, we should use the decorator @app.task to register
any remote functions, as required by the Celery library. In our local environment, we
can import this application and invoke the .delay() method of the remote function to
submit tasks.
In [1]: from dft_app import dft_energy
results = [
    dft_energy.delay('O 0 0 0; H 0.757 0.5 0; H -0.757 0.5 0'),
    dft_energy.delay('O 0 0 0; H 0.757 0.6 0; H -0.757 0.6 0')]
print(results)
print(results[0].status, results[1].status)
Out[1]:
[<AsyncResult: 41100ef3-2fed-4809-b8e7-a599b161b5a8>, <AsyncResult: 17271cc9-6aba-4bef-a8ce-544a512f8e9a>]
PENDING PENDING
Celery returns an AsyncResult object to represent the result of the remote function.
Initially, the status of the submitted tasks is PENDING. This is because we have not yet
deployed any workers for this application.
In another terminal, we can start a Celery worker by executing the command:
$ celery --app=dft_app worker
The application, as specified by the --app argument, must be a module that Celery
workers can import. In this local setup, the worker successfully loads the Celery
application because the command is executed in the same directory where the ap
plication program file resides. When deploying workers remotely, the application
should be properly installed to ensure that it is importable. The package distribution
techniques we discussed in Chapter 1 can be employed in this regard.
The Celery worker retrieves tasks from the message broker and sends the results
to the backend database. We can access the running status, error messages, and results
via the AsyncResult objects.
In [2]: print(results[0].get())
print(results[1].status)
Out[2]:
['2bb88ae9b074ca2c190ac9969b4e7397', -76.43283440169702]
SUCCESS
This example illustrates the basic usage of Celery, which is sufficient for the cur
rent data generation tasks. In order to efficiently process a large volume of tasks, we
plan to deploy Celery workers on the AWS cloud. We will use AWS ECS to launch
Celery workers and AWS SQS to serve as the broker. We choose SQS over deploying
a standalone message queue because SQS can be conveniently accessed from a local
machine. This eliminates the complex network configurations that would otherwise
be necessary for a Celery client to access the message queue.
Despite its accessibility, the use of AWS SQS faces a different restriction. Celery
does not support SQS as a backend for result storage. An alternative solution is to
use other AWS database services as the result backend. For simplicity, we will just
upload the results to AWS S3 storage. The Celery object and the task functions in
dft_app.py are modified accordingly:
import tempfile
import pyscf
from gpu4pyscf.dft import RKS
from celery import Celery
import boto3

# SQS as the broker (assumed broker URL; Celery's SQS transport reads the
# AWS credentials from the environment)
app = Celery('dft_app', broker='sqs://')
s3 = boto3.client('s3')

@app.task
def dft_energy(molecule, s3_path):
    output = tempfile.mktemp()
    mol = pyscf.M(atom=molecule, basis='def2-tzvp',
                  verbose=4, output=output)
    mf = RKS(mol, xc='wb97x').density_fit().run()
    bucket, key = s3_path.replace('s3://', '').split('/', 1)
    s3.upload_file(output, bucket, key)
The next step is to configure AWS ECS for Celery workers. This is similar to the
process we discussed in Section 7.1.3. We need to create a Docker image and upload
it to AWS ECR. The Dockerfile for the Docker image is specified as follows:
FROM nvidia/cuda:12.0.1-runtime-ubuntu22.04
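# The remaining layers are a sketch; the packages follow the components
# this Celery worker needs.
RUN apt-get update && \
    apt-get install -y python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*
RUN pip install --no-cache-dir gpu4pyscf-cuda12x boto3 "celery[sqs]"
COPY dft_app.py /app/dft_app.py
WORKDIR /app
CMD ["celery", "--app=dft_app", "worker"]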
We then register an ECS task definition using the new Docker image.
config = yaml.safe_load('''
family: celery-gpu-worker
networkMode: awsvpc
containerDefinitions:
- name: celery-worker
  image: 987654321000.dkr.ecr.us-east-1.amazonaws.com/python-qc/celery-dft:1.0
  environment:
  - name: OMP_NUM_THREADS
    value: "4"
runtimePlatform:
  cpuArchitecture: X86_64
  operatingSystemFamily: LINUX
cpu: 4 vcpu
memory: 8GB
''')

ecs_client = boto3.client('ecs')
resp = ecs_client.register_task_definition(**config)
print(yaml.dump(resp))
config = yaml.safe_load('''
cluster: python-qc
taskDefinition: celery-gpu-worker
count: 2
overrides:
  containerOverrides:
  - name: celery-worker
    environment:
    - name: AWS_ACCESS_KEY_ID
      value: xxxxxxxxxxxxxxxxxxxx
    - name: AWS_SECRET_ACCESS_KEY
      value: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    resourceRequirements:
    - type: GPU
      value: "1"
networkConfiguration:
  awsvpcConfiguration:
    subnets: {}
'''.format(subnets))

ecs_client = boto3.client('ecs')
resp = ecs_client.run_task(**config)
print(yaml.dump(resp))
If more computational resources are required, the count field in the configuration can
be modified to adjust the number of ECS workers. In the ECS task configuration,
the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are configured
to provide AWS access credentials [26], which are necessary for the Celery
workers to access SQS and S3. It is crucial to avoid hardcoding the AWS access credentials
directly into the source code or Dockerfile of the application, as they might
be publicly accessible.
Now, we are ready to load the modified dft_app.py and run the DFT computation
tasks in the AWS cloud.
In [3]: from dft_app import dft_energy
results = [
    dft_energy.delay('O 0 0 0; H 0.757 0.5 0; H -0.757 0.5 0'),
    dft_energy.delay('O 0 0 0; H 0.757 0.6 0; H -0.757 0.6 0')]
print(results)
Out[3]:
[<AsyncResult: 32f68705-04f8-483c-83eb-45bf2622ced3>, <AsyncResult: 5833cf35-7fd2-4d7d-9ef2-f9e1acd4579a>]
Please note that the Celery workers do not automatically shut down after com
pleting all computational tasks. To terminate Celery workers, we can stop the corre
sponding ECS tasks using the
aws ecs stop-task
command via the CLI or by deleting the allocated resources through the web console.
Don’t forget this cleanup step or you will be continually charged for ECS computing
resources. It is possible to implement functions to monitor the status of the task queue
and automatically terminate Celery workers upon completion. However, implement
ing this feature requires the coordination of several AWS cloud services, which is
beyond the scope of Python programming. We will not cover this topic in this book.
As shown by this example, using Celery is generally not complicated. If we have a
large number of similar computational tasks to execute, Celery is an excellent choice
for batch processing.
Let’s briefly summarize the strengths and disadvantages of using Celery in dis
tributed computing. Celery has the following advantages:
• Efficient for batch executing a large number of pre-defined jobs, such as the data
generation tasks.
• Quick response. As long as the Celery workers are active, they can repeatedly ex
ecute tasks. New tasks incur only a small overhead when retrieving input variables
from the broker.
• Thanks to the message-queue architecture, Celery provides resilience against
worker failures. If any workers crash, the unfinished tasks can be picked up and
continued by other workers.
The disadvantages of Celery include:
• Celery uses JSON serialization to encode the task parameters and the results. Consequently,
a general Python object may not be used as the input arguments of a Celery task.
7.2.2 Dask
Dask offers the dask.distributed module for distributed computing over a Dask clus
ter. A Dask cluster consists of a scheduler and several workers, as shown in Fig. 7.10.
To install the required components for Dask distributed executors, we can execute
$ pip install dask distributed bokeh
Here, the package distributed provides the necessary tools for managing a Dask
cluster, and bokeh offers a visualization dashboard for monitoring the status of the
Dask cluster.
FIGURE 7.10
The architecture of the Dask cluster.
The command dask-scheduler initializes the scheduler for the Dask cluster. The com
mand dask-worker launches workers.
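For a quick local test, the two commands can be run in separate terminals; the scheduler listens on port 8786 by default:
$ dask-scheduler
$ dask-worker tcp://localhost:8786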
Once the Dask cluster is initialized, the dask-scheduler functions as the server.
We can then create a Client to connect to the scheduler. The .submit() method
of the Dask client can be used to offload computation to Dask workers. This op
eration returns a Future object, which represents the pending result of the dis
tributed task. This Future object offers the functionality similar to the Python built-in
concurrent.futures.Future class. Additionally, the Dask client provides a shortcut
method, .gather(), to quickly retrieve computational results.
import pyscf
from dask.distributed import Client

client = Client('localhost:8786')

def dft_energy(molecule):
    mol = pyscf.M(atom=molecule, basis='def2-tzvp', verbose=4)
    mf = mol.RKS(xc='wb97x').density_fit().run()
    return mf.e_tot
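A minimal usage sketch of .submit() and .gather() with this function (the geometries are placeholders) is:
molecules = [
    'O 0 0 0; H 0.757 0.5 0; H -0.757 0.5 0',
    'O 0 0 0; H 0.757 0.6 0; H -0.757 0.6 0',
]
futures = [client.submit(dft_energy, m) for m in molecules]  # Future objects
energies = client.gather(futures)  # blocks until all results are available
print(energies)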
We can use the Bokeh dashboard provided by the Dask scheduler to monitor the status of the Dask cluster.
Next, let’s examine how to deploy the Dask cluster in the cloud. Deploying the
Dask cluster scheduler and the Dask workers may involve different configurations.
The Dask scheduler needs to be accessible to the Client running on our local device,
either through its public IP address or a proxy. Data transfers via public IP addresses
are subject to charges. To avoid data transfers through public IP addresses, commu
nication between workers and the scheduler can occur over the internal network.
If we deploy the Dask scheduler using ECS, additional network configurations
will be required to ensure the accessibility of the Dask scheduler. For simplicity,
we can launch an EC2 instance to host the Dask scheduler. An EC2 instance can
have both public and private IP addresses.1 After installing the essential packages
including dask, distributed, and bokeh in the EC2 instance, we can start the dask
scheduler as we have done in the local environment.
1 Exposing public IP addresses can introduce security risks. It is recommended to configure the firewall
(Security Group) of the EC2 instance to only allow connections from trusted IP addresses.
On the other hand, for Dask workers, we can still utilize the containerized service.
Similar to the ECS deployment method for Celery, we can create a Docker image
python-qc/dask-dft:1.0 using the following Dockerfile:
FROM nvidia/cuda:12.0.1-runtime-ubuntu22.04
RUN apt-get update && \
apt-get install -y python3 python3-pip && \
rm -rf /var/lib/apt/lists/*
RUN pip install --no-cache-dir gpu4pyscf-cuda12x boto3 dask distributed
The ECS task definition for Dask workers is essentially the same as that for Celery
workers, which can be reused here.
An ECS worker might be assigned multiple IP addresses. To ensure that the Dask
worker attaches to the private address, we should specify the address when starting
the dask-worker command. This address can be acquired from the AWS web console,
or through the boto3 library. We then override the image and command in the task
definition, leading to the following ECS task configuration for launching Dask workers:
ec2_client = boto3.client('ec2')
resp = ec2_client.describe_subnets()
subnets = [v['SubnetId'] for v in resp['Subnets']]

private_ip = ec2_client.describe_instances(
    Filters=[{'Name': 'tag:Name', 'Values': ['dask-server']}]
)['Reservations'][0]['Instances'][0]['PrivateIpAddress']

config = yaml.safe_load('''
cluster: python-qc
taskDefinition: celery-gpu-worker
count: 2
overrides:
  containerOverrides:
  - name: celery-worker
    image: 987654321000.dkr.ecr.us-east-1.amazonaws.com/python-qc/dask-dft:1.0
    command:
    - dask-worker
    - "{0}:8786"
    resourceRequirements:
    - type: GPU
      value: "1"
networkConfiguration:
  awsvpcConfiguration:
    subnets: {1}
'''.format(private_ip, subnets))

ecs_client = boto3.client('ecs')
resp = ecs_client.run_task(**config)
print(yaml.dump(resp))
7.2.3 Ray
Ray is more accurately described as a distributed execution framework rather than
merely a distributed task executor. It is designed for building and running scalable
and distributed applications. It also offers scalable toolkits for developing distributed
machine learning applications.
When Ray is used as a distributed task scheduler, its functionality has certain
overlaps with Dask. It can offload a general Python function to a remote worker for
execution. One of the key advantages in Ray is its native integration with various
cloud platforms, such as AWS, GCP, Azure, and Aliyun. Ray allows users to manage
a cluster for distributed computation without needing extensive cloud expertise.
A Ray cluster is made of a head node and several worker nodes, as shown in
Fig. 7.11. Unlike Dask, which runs a simple service on each node, Ray employs
a more complex architecture. On each node, Ray runs multiple services for com
munication, job scheduling, distributed storage, and other functions. The head node
additionally operates an auto-scaling service, which can automatically scale worker
nodes according to the workloads. Deploying these complex services does not require
a manual, step-by-step setup. By specifying a concise cluster configuration, Ray can
automatically configure the virtual cluster on the cloud, including all necessary network
connections, message queues, schedulers, storage, autoscaler, and other cloud
services required by Ray.
FIGURE 7.11
The architecture of the Ray cluster.
The Ray package integrates both distributed computation functionality and cluster
management capabilities. The management of Ray cluster can be operated locally.
Taking the AWS cloud as an example, to deploy a Ray cluster on the AWS cloud, the
following packages should be installed:
$ pip install -U ray[default,aws] boto3
The boto3 library is required because Ray relies on this library to access AWS APIs.
To offload a computation task to the Ray cluster, we need to prepare and submit a
Ray job. A Ray job is typically composed of the following components:
• Functions for distribution. These functions are annotated with the @ray.remote
decorator. This decorator adds the .remote() method to them.
• Calling the .remote() method on these functions to generate Future objects.
• Retrieving the results of the Future objects with the ray.get function, which
blocks until the remote executions complete.
Below is an example of Ray jobs suitable for the aforementioned DFT calculation
task.
$ cat dft_job.py
import pyscf
import ray
@ray.remote
def dft_energy(molecule):
    mol = pyscf.M(atom=molecule, basis='def2-tzvp', verbose=4)
    mf = mol.RKS(xc='wb97x').density_fit().run()
    return mf.e_tot
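A driver section for dft_job.py, following the three components listed above, might look like this sketch (the geometries are placeholders):
if __name__ == '__main__':
    # Connects to the running cluster when executed as a Ray job
    ray.init()
    molecules = [
        'O 0 0 0; H 0.757 0.5 0; H -0.757 0.5 0',
        'O 0 0 0; H 0.757 0.6 0; H -0.757 0.6 0',
    ]
    # .remote() returns object references immediately
    futures = [dft_energy.remote(m) for m in molecules]
    # ray.get blocks until the remote DFT calculations finish
    print(ray.get(futures))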
To experience the basic usage of Ray jobs, we can start by running the following
command to initiate a Ray cluster in the local environment:
$ ray start --head
According to instructions displayed in the output, we can submit a Ray job using the
command:
$ export RAY_ADDRESS='http://127.0.0.1:8265'
$ ray job submit --working-dir . -- python3 dft_job.py
The double dash -- notation is a delimiter, which separates the Ray options from the
command of the job to be executed remotely. working-dir is a local directory where
we can put the job scripts and necessary input data.
Please note the --working-dir path, which can be a common source of confusion.
Cloud environments do not have the shared file systems found on HPC clusters. Script files,
such as dft_job.py that we develop locally, are not automatically accessible on the
worker nodes. To address the file-sharing problem, the --working-dir option of the
ray job submit command replicates the working directory onto the worker nodes. The command for the job (such
as python3 dft_job.py in our example) is executed within this directory. Therefore,
we should avoid using absolute paths specific to the local machine within the job
command. Additionally, we need to limit the amount of data stored in the working
dir. Sending a large volume of data can lead to file transfer errors during the Ray job
submission process.
If our job requires large libraries or a significant amount of input data, how can
we transfer them to the Ray worker nodes?
One approach is to create a script within the working-dir, which installs the required
libraries and downloads the necessary data from cloud object storage. This script
can be executed prior to running the main computation task.
Alternatively, libraries or data can be pre-installed on the worker nodes during the
Ray cluster deployment. These data preparation commands can be specified in the
cluster configuration, which will become clearer in the subsequent discussion.
Let’s move on to setting up a Ray cluster on AWS. Below is a sample configuration
for a Ray cluster on AWS, which we have saved in a YAML file named
config.yaml.
cluster_name: dft-ray

provider:
  type: aws
  region: us-east-1
  availability_zone: us-east-1a

available_node_types:
  ray.head.default:
    node_config:
      InstanceType: m5.xlarge
      ImageId: ami-080e1f13689e07408  # Ubuntu 22.04
  ray.worker.default:
    max_workers: 3
    node_config:
      InstanceType: p3.2xlarge        # GPU instance
      ImageId: ami-0b54855df82eef3a3  # Ubuntu 22.04 with Deep learning tools

setup_commands:
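  # Commands run on every node after it boots; the package list below is an
  # assumption based on the DFT jobs used in this chapter.
  - pip install -U "ray[default]" gpu4pyscf-cuda12x boto3

With this configuration file in place, the virtual cluster can be created from the local machine:

$ ray up -y config.yaml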
Once all computations are finished, the Ray cluster can be shut down using the com
mand:
$ ray down -y config.yaml
By executing this command, Ray sends requests to the cloud platform to free and
clean up the resources it has allocated. However, this command does not fully track
the success of these requests. To prevent potential resource waste and unexpected
bills, it is recommended to log into the web console and manually verify the status of
each resource.
Summary
In this chapter, we have briefly demonstrated the design and implementation of a
quantum chemistry simulation workflow on the cloud. We utilized the containerized
cloud service to provide computing resources and the object storage service for data
exchange. When configuring cloud services, we chose to manage them by calling
their APIs. This approach was combined with Python templating techniques. Al
though we have used the AWS cloud to demonstrate this process, similar approaches
can be applied to other cloud platforms.
Based on the knowledge of cloud computing, we explored several distributed job
executors, including the Celery, Dask, and Ray libraries. We illustrated how to de
ploy them on the cloud and how to use them to execute computation tasks remotely.
Notably, Ray exhibits excellent integration with cloud environments, making it a con
venient choice for cloud computing.
During the development of cloud programs, we leveraged an AI coding assistant,
specifically the GPT-4 model, to provide prototypes for computation tasks and con
figuration programs. We then refined the functionalities based on the AI-generated
prototypes. The world of cloud computing remains complex. Relevant tools and tech
nologies evolve rapidly. Using AI to assist the development of cloud programs is a
very effective method to accommodate this situation.
References
[1] Amazon Web Services, Amazon machine images (AMIs), https://docs.aws.amazon.com/
AWSEC2/latest/UserGuide/AMIs.html, 2024.
[2] Amazon Web Services, Amazon EBS volumes, https://docs.aws.amazon.com/ebs/latest/
userguide/ebs-volumes.html, 2024.
[3] Amazon Web Services, What is Amazon elastic container service?, https://docs.aws.
amazon.com/AmazonECS/latest/developerguide/Welcome.html, 2024.
[4] Amazon Web Services, What is Amazon VPC?, https://docs.aws.amazon.com/vpc/latest/
userguide/what-is-amazon-vpc.html, 2024.
[5] Amazon Web Services, How Amazon VPC works, https://docs.aws.amazon.com/vpc/
latest/userguide/how-it-works.html, 2024.
[6] Amazon Web Services, Get started with Amazon S3, https://docs.aws.amazon.com/
AmazonS3/latest/userguide/GetStartedWithS3.html, 2024.
[7] Amazon Web Services, Moving an image through its lifecycle in Amazon ECR, https://
docs.aws.amazon.com/AmazonECR/latest/userguide/getting-started-cli.html, 2024.
[8] Amazon Web Services, Control traffic to your AWS resources using security groups,
https://docs.aws.amazon.com/vpc/latest/userguide/vpc-security-groups.html, 2024.
[9] Amazon Web Services, Configuration and credential file settings, https://docs.aws.
amazon.com/cli/latest/userguide/cli-configure-files.html, 2024.
[10] Amazon Web Services, IAM roles, https://docs.aws.amazon.com/IAM/latest/UserGuide/
id_roles.html, 2024.
[11] The PySCF Developers, Quantum chemistry with Python, https://pyscf.org/, 2024.
[12] R. Li, Q. Sun, X. Zhang, G.K.-L. Chan, Introducing GPU-acceleration into the Python
based simulations of chemistry framework, arXiv:2407.09700, https://arxiv.org/abs/
2407.09700, 2024.
[13] X. Wu, Q. Sun, Z. Pu, T. Zheng, W. Ma, W. Yan, X. Yu, Z. Wu, M. Huo, X. Li, W. Ren, S.
Gong, Y. Zhang, W. Gao, Enhancing GPU-acceleration in the Python-based simulations
of chemistry framework, arXiv:2404.09452, https://arxiv.org/abs/2404.09452, 2024.
[14] Amazon Web Services, AWS Fargate for Amazon ECS, https://docs.aws.amazon.com/
AmazonECS/latest/developerguide/AWS_Fargate.html, 2024.
[15] Amazon Web Services, What is Amazon EC2 Auto Scaling?, https://docs.aws.amazon.
com/autoscaling/ec2/userguide/what-is-amazon-ec2-auto-scaling.html, 2024.
[16] Amazon Web Services, Amazon ECS task definitions for GPU workloads, https://docs.
aws.amazon.com/AmazonECS/latest/developerguide/ecs-gpu.html, 2024.
[17] Amazon Web Services, Amazon ECS task networking options for the EC2 launch
type, https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-networking.
html, 2024.
[18] Amazon Web Services, Amazon ECS task metadata endpoint version 3, https://docs.aws.
amazon.com/AmazonECS/latest/developerguide/task-metadata-endpoint-v3.html, 2024.
[19] Amazon Web Services, Tutorial: create a REST API with a Lambda proxy in
tegration, https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-
create-api-as-simple-proxy-for-lambda.html, 2024.
[20] Amazon Web Services, Define Lambda function handler in Python, https://docs.aws.
amazon.com/lambda/latest/dg/python-handler.html, 2024.
[21] Amazon Web Services, Authenticating using an API gateway method, https://
docs.aws.amazon.com/transfer/latest/userguide/authentication-api-gateway.html#
authentication-custom-ip, 2024.
[22] A. Solem, contributors, Celery - distributed task queue, https://docs.celeryq.dev/en/
stable/, 2024.
[23] Dask Development Team, Dask: library for dynamic task scheduling, http://dask.pydata.
org, 2016.
[24] P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, M. Elibol, Z. Yang, W.
Paul, M.I. Jordan, I. Stoica, Ray: a distributed framework for emerging AI applications,
in: 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI
18), USENIX Association, Carlsbad, CA, 2018, pp. 561--577, https://www.usenix.org/
conference/osdi18/presentation/moritz.
[25] A. Solem, contributors, Celery documentation - backends and brokers, https://docs.
celeryq.dev/en/stable/getting-started/backends-and-brokers/index.html, 2024.
[26] Amazon Web Services, Environment variables to configure the AWS CLI, https://docs.
aws.amazon.com/cli/latest/userguide/cli-configure-envvars.html, 2024.
[27] M. Movsisyan, Flower documentation, https://flower.readthedocs.io/en/latest/, 2024.
[28] The Ray Team, Ray documentation - cluster YAML configuration options, https://docs.ray.
io/en/latest/cluster/vms/references/ray-cluster-configuration.html#cluster-config, 2024.