
CHAPTER 7

Working with cloud
If we need to perform a large amount of quantum chemistry simulations, where should these computations take place? Some people's first thought might be to use a supercomputer or a high-performance workstation. Aside from these options, cloud computing is an attractive alternative worth considering.
Although quantum chemistry simulations were traditionally executed on supercomputers, it is worth considering cloud resources as a viable alternative for performing large-scale chemistry simulations. Unlike supercomputers, which are often constrained by quotas or the availability of idle resources, cloud computing generally offers a plentiful supply of resources. This abundance makes cloud computing particularly attractive for high-throughput quantum chemistry simulations. Additionally, the cloud offers a wide range of computational tools and hardware resources that may not be available on supercomputers. Some might point to the uniqueness of supercomputers and argue that cloud computing is unlikely to replace them. While this is true, the presence and significant impact of cloud computing cannot be ignored. What we can do is harness it as an additional computational resource.
To effectively utilize cloud computing, we need to develop new concepts that differ from those required for supercomputers. In particular, several technical aspects should be considered for cloud computing:
• Where are the computing, storage, and network resources located? How can we access these resources?
• How to configure a service and manage its configuration?
• How to design a workflow and orchestrate the computing resources within the workflow?
• How to schedule parallel jobs and scale them up on the cloud?
Using cloud computing often requires the development of specialized programs. In these programs, large efforts are devoted to configuring cloud services and ensuring their availability. These techniques and skills are more closely related to the work of DevOps. Although DevOps skills do not directly impact algorithm design or the implementation of scientific programs, they can significantly enhance our productivity. In this chapter, we will explore the programming technologies associated with cloud computing. We will not present a step-by-step tutorial for each cloud service. Instead, we will introduce the general idea and then leverage the AI assistant to complete the remaining work.


7.1 Utilizing cloud computing


Cloud computing offers various technologies and features to harness computational resources. Exploring the details of cloud technology is beyond the scope of this book. In this section, we will discuss the basic principles of cloud computing. We will primarily use Python to deploy computation tasks using cloud computing APIs.
Please be aware that cloud computing charges based on the resources you consume. This includes not only CPU hours but also storage, data transfer, databases, and public IPv4 addresses, all of which are billable resources. In particular, data transfer services can lead to unexpected costs if not managed carefully. It is important to ensure that resources are properly terminated after use. Despite years of development in cloud computing technology, no cloud provider currently offers a solution that is "smart" enough to effectively balance cost and performance. An essential skill in cloud computing is orchestrating cloud resources effectively to minimize expenses. For any given problem, there is likely more than one strategy and combination of resources that can produce the same outcome. If you expect to make substantial use of cloud resources, it is worthwhile to thoroughly review the cloud documentation.
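For example, a small audit script can help confirm that no billable resources were left running after a batch of jobs. The sketch below is only an illustration under stated assumptions (the region and the cluster name python-qc are placeholders); it merely lists running EC2 instances and ECS tasks with the boto3 library:

import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')
ecs = boto3.client('ecs', region_name='us-east-1')

# List EC2 instances that are still running (and still billable)
resp = ec2.describe_instances(
    Filters=[{'Name': 'instance-state-name', 'Values': ['running']}])
for reservation in resp['Reservations']:
    for inst in reservation['Instances']:
        print('EC2 instance still running:', inst['InstanceId'])

# List running ECS tasks in a hypothetical cluster named "python-qc"
for arn in ecs.list_tasks(cluster='python-qc')['taskArns']:
    print('ECS task still running:', arn)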

7.1.1 Comparison between cloud and supercomputers


Running simulation tasks on the cloud is very different from executing them on traditional supercomputers with the PBS or SLURM queue systems. Supercomputers offer an interactive environment that is close to the experience of using a single Linux desktop. In this sense, cloud computing is not as intuitive. The different user experience can largely be attributed to the architectural differences between the cloud and traditional supercomputers:
• Data Accessibility. Traditional supercomputers typically offer a large-scale and high-performance global file system accessible to all computing nodes. Sharing data among different nodes is straightforward. Manipulating files in the global file system is similar to accessing data on a local disk. In contrast, the cloud typically does not use a global file system architecture. Instead, files are stored in cloud object storage, such as AWS S3 or Azure Blob. Accessing files stored in these locations requires the use of specific tools or APIs (see the sketch after this list).
• Software Stack. Traditional supercomputers typically offer pre-installed software within a homogeneous environment, operating under the same OS. In the cloud, software is installed in virtual machines (VMs) or Docker images. The runtime environments can be more flexible, isolated, and divergent across software packages. For a specific task, users only utilize VMs or images that contain the necessary software.
• Job Scheduling. Supercomputers utilize job scheduling systems like PBS or SLURM to manage and prioritize jobs. These systems launch jobs on reusable computing nodes, which have an environment almost identical to the one used in interactive sessions. Cloud computing platforms have their own job scheduling mechanisms. Individual jobs are configured separately and executed within isolated VMs in the cloud.
• Connectivity in Parallel Execution. Supercomputers can provide a low-latency and high-performance network that enables efficient communication between processors. Exchanging data between different processes via MPI (Message Passing Interface) is relatively simple. This level of connectivity is not as straightforward in the cloud.
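To make the contrast in data accessibility concrete, here is a minimal sketch (the bucket and key names are placeholders of our own choosing) of how a file in S3 must be fetched with boto3 before it can be read, whereas on a supercomputer the same data could be opened directly from the global file system:

import boto3

s3 = boto3.client('s3')

# On a supercomputer: data = open('/global/scratch/h2o.xyz').read()
# On the cloud, the file must first be fetched from object storage:
s3.download_file('python-qc-3vtl', 'inputs/h2o.xyz', '/tmp/h2o.xyz')
data = open('/tmp/h2o.xyz').read()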

7.1.2 Designing a workflow


Normally, we need to design a workflow to utilize the cloud computational resources and manage the computation tasks. A workflow is essentially a sequence of predefined computational tasks executed in a particular order. We may need to develop a series of Python programs to connect the various components of the workflow. When designing the workflow, the characteristics of the cloud architecture should be taken into consideration, in particular the following questions:
• How to access the input data? For example, a large volume of input data can be stored in the cloud object storage and accessed via cloud storage APIs. Input parameters can be stored in GitHub or GitLab and accessed via HTTP requests or Git operations. Alternatively, essential input data can be cached within the Docker image. Additionally, a database can serve as an option for providing input data.
• Where to save the output? Cloud object storage and databases are two common choices for saving results. It is also possible to send results through a remote procedure call (RPC) service to a receiver.
• How to connect different components of the workflow, and how to exchange messages between them? Cloud object storage and databases can be utilized to exchange data. RPC is often employed to connect different services.
• What resources are required for the computation? Based on the performance and the cost of different devices, we may need to determine the appropriate combination of resources, including the number of CPU cores, GPU cards, memory size, temporary disk space, and network accessibility.
• Which Docker images or virtual machine images to use? Which software or tools should be installed in the image, and how should they be configured?
For example, suppose that we need to design a workflow to generate the potential energy surface (PES) for a molecule using density functional theory (DFT). This workflow involves two types of executors: the molecule geometry generator and the executor that performs DFT calculations. As illustrated in Fig. 7.1, the two executors coordinate in the following manner:
1. The geometry generator reads the initial geometry of the molecule and generates a coarse-grained grid for the PES.
2. The geometry generator uploads all geometry files to the object storage, such as AWS S3, assuming the cloud provider is AWS.

FIGURE 7.1
Workflow for PES using DFT.

3. The geometry generator then invokes cloud APIs to launch several workers to execute DFT calculations. Within each job, metadata, such as the S3 paths of the geometry files, are provided as input to the DFT executor.
4. Each DFT executor downloads a geometry file and then executes the predefined DFT calculation.
5. The DFT executor uploads the results to the S3 storage and then sends a status signal to the geometry generator via RPC.
6. The geometry generator creates additional grids on the PES based on the existing results and returns to step 2 to continue the workflow. Alternatively, it can choose to terminate the workflow if sufficient data points have been generated.
This workflow exhibits the following characteristics regarding the utilization of resources:
• The workflow primarily consists of CPU-intensive tasks.
• The demands for storage space and I/O operations are small.
• If the selected DFT program supports the use of GPU devices, GPU workers can be allocated.
Regarding the software stack, to support the tasks defined in this workflow, we can create a Docker image containing all the necessary components: a geometry generator, a DFT program, cloud management toolkits (such as awscli), and certain Python libraries for RPC services.
To design a complete and robust workflow, there are additional matters to consider, such as fault tolerance for DFT calculations, the authentication scheme, the workflow restart scheme, etc. Developing a workflow for cloud computing is analogous to the process of developing a program. It may require continuous testing and debugging before deploying it on the cloud. Once the workflow is fully operational, it is convenient to repeatedly execute similar jobs in the cloud.

7.1.3 Computing resources


Virtual machines (VMs), such as AWS EC2 and Azure VM, are the fundamental computational resources offered by cloud platforms. The simplest method to perform a computation task is to interactively execute the calculation in the cloud virtual machines. To make the runtime environment reproducible across each VM, we can customize the VM images [1] by pre-installing commonly used software stacks. Additionally, we can leverage block storage to persist the configuration files and data [2]. By combining custom VM images and persistent storage, we can easily spin up VMs for computation with identical environments.
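As a sketch of this approach, the snippet below launches an instance from a customized image with boto3. The AMI ID, key pair, and security group are placeholders that would be replaced with your own values:

import boto3

ec2 = boto3.client('ec2')

# Launch a VM from a customized image (placeholder AMI ID) so that the
# pre-installed software stack is available immediately after boot.
resp = ec2.run_instances(
    ImageId='ami-0123456789abcdef0',
    InstanceType='c5.2xlarge',
    MinCount=1, MaxCount=1,
    KeyName='my-key-pair',                     # hypothetical key pair
    SecurityGroupIds=['sg-0123456789abcdef0'], # hypothetical security group
)
print(resp['Instances'][0]['InstanceId'])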
To efficiently execute a batch of computation tasks, we need a non-interactive scheme to manage the computing resources and environments. There are several options to consider. One option is to use the Batch services provided by cloud platforms, such as AWS Batch, Azure Batch, or Google Batch. Another option is to use a containerized solution like AWS ECS, Azure AKS, or Google GKE. Although the UIs (user interfaces) of these cloud platforms differ, the design and concepts of the containerized approach are similar. In the following, we will take AWS ECS (Elastic Container Service) [3] as the example to demonstrate how to deploy containerized computational tasks.
If you have just signed up for a new account on AWS, you may need several preparation steps to configure the basic cloud computation environment, including:
• AWS VPC (Virtual Private Cloud) [4], the foundational infrastructure where you can allocate virtual networks, virtual machines, and private storage.
• Subnets [5], which provide a range of IP addresses within a VPC.
• S3 (Simple Storage Service) [6], an object storage service that offers persistent data storage.
• ECR (Elastic Container Registry) [7], which provides a repository for hosting private Docker images.
• Security Groups [8], acting as a firewall to control inbound and outbound network traffic for virtual machines.
• AWS credentials [9], providing the necessary keys and authentication to log into virtual machines.
• IAM (Identity and Access Management) roles [10], which manage the permissions for accessing various services in the cloud.
In the following code example for the DFT PES computation, we will utilize ECS, S3 storage, ECR, and other relevant services. It is straightforward to configure the S3 services using the AWS web console, as shown in the screenshots in Fig. 7.2. We have created an S3 bucket named python-qc-3vtl for later use. Please note that AWS S3 bucket names are unique across all AWS accounts globally. For instance, the bucket name python-qc was already in use by other users. Consequently, we could not use python-qc as the bucket name for our project. To ensure uniqueness, we added the suffix -3vtl. You would need to choose a different name in your own implementation.
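The bucket can also be created programmatically. A minimal sketch with boto3, using our bucket name (substitute your own globally unique name):

import boto3

s3 = boto3.client('s3')
# Bucket names are globally unique; this call fails if the name is taken.
# In regions other than us-east-1, a CreateBucketConfiguration with a
# LocationConstraint must also be provided.
s3.create_bucket(Bucket='python-qc-3vtl')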
The AWS IAM permission system is complicated. For simplicity, we have created an
IAM policy with access to a wide range of AWS services as follows:

FIGURE 7.2
Creating an AWS S3 bucket using the AWS web console.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:*",
        "ecs:*",
        "ecr:*",
        "s3:*",
        "lambda:*",
        "sqs:*"
      ],
      "Resource": "*"
    }
  ]
}

Please note that this AWS IAM configuration is insecure. In real applications, it is recommended to carefully configure the IAM rules and limit access to only the necessary services and resources. Please consult the AWS online documentation for more details.
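For reference, the policy above can also be registered and attached with boto3. The following is a hedged sketch; the policy and user names are hypothetical:

import boto3, json

iam = boto3.client('iam')

policy_doc = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["ec2:*", "ecs:*", "ecr:*", "s3:*", "lambda:*", "sqs:*"],
        "Resource": "*",
    }],
}
resp = iam.create_policy(
    PolicyName='python-qc-broad-access',   # hypothetical policy name
    PolicyDocument=json.dumps(policy_doc))
iam.attach_user_policy(
    UserName='python-qc-user',             # hypothetical user name
    PolicyArn=resp['Policy']['Arn'])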
The containerized approach requires a Docker image specifically designed for the computational task. Assuming the goal is to establish the DFT PES workflow mentioned in Section 7.1.2, the following Dockerfile can be created:

FROM nvidia/cuda:12.0.1-runtime-ubuntu22.04

# Install necessary dependencies
RUN apt-get update && apt-get install -y python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*
RUN pip install --no-cache-dir pyscf gpu4pyscf-cuda12x awscli boto3

COPY start.sh /app/start.sh
WORKDIR /app

To accommodate the scenarios that require GPU acceleration, the base image is set to the nvidia/cuda:12.0.1-runtime image, which provides the CUDA 12 runtime environment. For the DFT computation engine, we utilize the quantum chemistry program PySCF [11] and its GPU-accelerated variant gpu4pyscf-cuda12x [12,13]. The awscli toolkit and boto3 library are installed for managing and interacting with AWS services. The start.sh script is responsible for managing computation and data transfer tasks. This script is implemented as follows:
#!/bin/bash
# Set the default time limit to 1 hour
TIMEOUT=${2:-3600}
S3PATH=$1
aws s3 cp $S3PATH/job.py ./
timeout $TIMEOUT python job.py > job.log
aws s3 cp job.log $S3PATH/

The computation task is defined in a Python program, job.py, which should be created by the geometry generator service. To prevent the computation from running indefinitely, we use the GNU timeout tool to enforce the time limit on the calculation. Computational results are uploaded to the S3 storage. We then build this image and upload it to the AWS ECR service:
$ aws ecr create-repository --repository-name python-qc/dft-gpu
$ docker build -t 987654321000.dkr.ecr.us-east-1.amazonaws.com/python-qc/dft-gpu:1.0 .
$ docker push 987654321000.dkr.ecr.us-east-1.amazonaws.com/python-qc/dft-gpu:1.0

The 12-digit number 987654321000 is just a placeholder for the account ID. You
should replace it with your actual account ID.
To deploy the computation workflow on AWS ECS, the process begins with defining an ECS cluster. An ECS cluster encompasses specifications of the VM hardware resources, operating system, network configuration, and various access permissions. Using the command provided by the awscli toolkit, we can establish a basic cluster that operates on managed, serverless VMs, known as FARGATE mode [14], with default settings.

$ aws ecs create-cluster --cluster-name ecs-basic-cluster

The FARGATE-mode cluster is sufficient for exploring the basic functionalities of AWS ECS. However, this cluster does not support GPU acceleration. To utilize GPUs, we need to enable ASG (Auto Scaling Groups) [15] for EC2 GPU instances when creating ECS clusters. It is recommended to configure this type of ECS cluster using the AWS web console (Fig. 7.3). For detailed instructions on ECS cluster configuration, please refer to the AWS ECS documentation [3,16].

FIGURE 7.3
Creating an ECS cluster with ASG for EC2 GPU instances.

Let's assume that an ECS cluster named python-qc with GPU support is properly configured. The next step is to create a task definition on this ECS cluster. A task definition acts like a template for a specific computational task. It specifies the computational resources, such as CPU, memory, and the Docker image. Task definitions can be configured on the AWS web console.
To ease the configuration process for future work, we will instead generate and manage task definitions programmatically. The idea is to use code generation techniques along with the AWS Python SDK boto3 to manage the configuration. To enhance the readability of the configuration, we store it in YAML format. The boto3 library is complex to use, as it involves numerous API details of the AWS cloud. Thanks to the progress of AI tools, it is no longer necessary to struggle with the boto3 API documentation. Configuring cloud computing is a task particularly well suited to AI-assisted programming: a massive amount of open-source examples is available for cloud APIs, AI models have been trained effectively on these examples, and their output is quite accurate.
Here, we utilize GPT-4 to generate a configuration sample, which can then be modified to meet specific requirements. Different AI coding assistants might produce different code samples. However, the differences should be insignificant for the current task. Our prompt for GPT is:
Create a Python script to register an AWS ECS task definition:
- Use GPU instances on ECS
- Use boto3 library
- Follow the template below
```
config = '''
- ECS configuration in YAML.
- Include docker image, the CPU, memory and other necessary requirements in
  configuration.
'''
py_config = convert YAML config to Python code
ecs_client = boto3.client('ecs')
response = ecs_client.appropriate_ecs_function(**py_config)
```

The output from GPT-4 is as follows:

import boto3
import yaml

# Your YAML configuration for the ECS task
config = '''
taskDefinition:
  family: gpu-task-family
  networkMode: "awsvpc"
  containerDefinitions:
  - name: gpu-container
    image: your-docker-image
    cpu: 1024
    memory: 2048
    resourceRequirements:
    - type: GPU
      value: "1"
  requiresCompatibilities:
  - EC2
  memory: "4GB"
  cpu: "1024"
'''

# Convert YAML config to Python dictionary
py_config = yaml.safe_load(config)

# Initialize a boto3 client for ECS
ecs_client = boto3.client('ecs')

# Register the ECS task definition
response = ecs_client.register_task_definition(
    **py_config['taskDefinition'])

While it is possible to refine the prompt to generate improved code samples, we will not pursue this refinement in the current context. The code sample above is clear enough to illustrate the parameters of the register_task_definition function of the boto3 ECS client. The generated YAML configuration sample is also straightforward to understand. Based on this code sample, we can adjust certain settings and derive a script to register the task definition as follows:
import boto3
import yaml

config_tpl = '''
task_definition:
  family: dft-demo
  networkMode: "awsvpc"
  containerDefinitions:
  - name: dft-demo-1
    image: 987654321000.dkr.ecr.us-east-1.amazonaws.com/python-qc/dft-gpu:1.0
    cpu: 4
    memory: 2048
    resourceRequirements:
    - type: GPU
      value: "1"
  requiresCompatibilities:
  - EC2
  memory: "4GB"
  cpu: "4"
'''
py_config = yaml.safe_load(config_tpl)
ecs_client = boto3.client('ecs')
response = ecs_client.register_task_definition(
    **py_config['task_definition'])
print(yaml.dump(response))

If you have previously created a basic ECS cluster with FARGATE mode, you can remove the "resourceRequirements" setting from the task configuration.
By using a similar methodology, we can define runnable tasks based on the above task definition and submit them to ECS. Running a task requires the run_task function. Here, we omit the intermediate code samples generated by GPT-4 and demonstrate only the revised script.
import boto3
import jinja2
import yaml

task_config_tpl = jinja2.Template('''
cluster: python-qc
taskDefinition: dft-demo
count: 1
overrides:
  containerOverrides:
  - name: dft-demo-1
{%- if image %}
    image: {{ image }}
{%- endif %}
    command:
    - /app/start.sh
    - s3://python-qc-3vtl/{{ task }}/{{ job_id }}
    - "{{ timeout or 7200 }}"
    environment:
    - name: JOB_ID
      value: "{{ job_id }}"
    - name: OMP_NUM_THREADS
      value: "{{ threads or 1 }}"
{%- if threads %}
    memory: {{ threads * 2048 }}  # container memory in MiB
{%- else %}
    memory: 2048
{%- endif %}
networkConfiguration:
  awsvpcConfiguration:
    subnets:
{%- for v in subnets %}
    - {{ v }}
{%- endfor %}
''')

ec2_client = boto3.client('ec2')
resp = ec2_client.describe_subnets()
subnets = [v['SubnetId'] for v in resp['Subnets']]

config = task_config_tpl.render(
    task='dft-demo', job_id='219a3d6c', threads=2, subnets=subnets)
config = yaml.safe_load(config)

ecs_client = boto3.client('ecs')
resp = ecs_client.run_task(**config)
print(yaml.dump(resp))

If a task is successfully submitted, ECS will start an EC2 VM with GPU devices. It then executes the docker run command within the VM, based on the parameters defined in the task and the task definition. For example, the task in this case is effectively identical to the following docker run command:
$ docker run --runtime=nvidia-container-runtime \
    -e OMP_NUM_THREADS=2 --cpus 2 --memory 4G \
    987654321000.dkr.ecr.us-east-1.amazonaws.com/python-qc/dft-gpu:1.0 \
    /app/start.sh s3://python-qc-3vtl/dft-demo/219a3d6c 7200
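After submission, the task can also be tracked programmatically. The run_task response above contains the task ARN, which can be polled with describe_tasks until the task stops; a minimal sketch:

import time

task_arn = resp['tasks'][0]['taskArn']   # resp returned by run_task above

while True:
    desc = ecs_client.describe_tasks(cluster='python-qc', tasks=[task_arn])
    status = desc['tasks'][0]['lastStatus']
    print('task status:', status)
    if status == 'STOPPED':
        break
    time.sleep(30)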

The ECS task is not limited to just a single run-until-completion job. We can use ECS workers to run Dask, Ray, or other job schedulers. These schedulers can execute various types of tasks, simplifying the complex process of configuring individual ECS tasks. We will explore this approach in Section 7.2.

7.1.4 Communications among cloud services


In the PES workflow, we may wish to adjust the geometry grids dynamically based on the progress of the PES calculations. To accomplish this, it is necessary to establish interactions between the geometry generator program and the DFT calculation tasks. Instead of generating all grids on the PES at once, the geometry generator program can generate coarse-grained grids on the PES initially and then refine the grids based on the accomplished results. The function geometry_on_pes below outlines the prototype of this functionality.
def geometry_on_pes(molecule, pes_params, existing_results=None):
    '''Generates molecule geometries for the important grids on the PES

    Arguments:
    - molecule: Molecular formula
    - pes_params: Targets to scan, such as bonds, bond angles.
    - existing_results: Results of the accomplished calculations.
    '''
    # This function should produce molecule configurations
    # based on the results of accomplished calculations.
    # Here is a fake implementation to demonstrate the functionality.
    if existing_results and len(existing_results) > 1:
        return []
    h2o_xyz = 'O 0 0 0; H 0.757 0.587 0; H -0.757 0.587 0'
    return [h2o_xyz] * 3

Since the DFT tasks are executed in containers or VMs remotely, direct interaction between the geometry generator and the DFT tasks within the same memory space is not possible. We thus utilize an RPC framework to enable communication between the two programs. The geometry generator can act as a service that generates tasks and launches them in the cloud. Additionally, this geometry service provides an RPC endpoint to accept requests from DFT executors. Each DFT task can send its status or results to the geometry service through this RPC. As shown in Fig. 7.4, the PES workflow requires the collaboration of the following components:
• A geometry service to generate grids on the PES.
• A job launcher to initiate DFT tasks running on ECS.
• An RPC system for the communication between the geometry service and the DFT tasks.

FIGURE 7.4
The design of the PES workflow based on the RPC service.

The configuration of the DFT task executor

To facilitate communication with the RPC server, the server address can be provided as a parameter or an environment variable for the Docker container of the DFT task. Based on the ECS task developed in Section 7.1.3, we make slight adjustments to set up DFT tasks:

rpc_config_tpl = jinja2.Template('''
cluster: {{ cluster }}
taskDefinition: {{ task }}
count: 1
overrides:
  containerOverrides:
  - name: {{ task }}-1
    command:
    - /app/start.sh
    - {{ job_path }}
    - "{{ timeout or 7200 }}"
    environment:
    - name: JOB_ID
      value: "{{ job_id }}"
    - name: RPC_SERVER
      value: "{{ rpc_server }}"
    - name: OMP_NUM_THREADS
      value: "{{ threads }}"
    resourceRequirements:
    - type: GPU
      value: "1"
    cpu: {{ threads * 1024 }}     # CPU units; 1024 per vCPU
    memory: {{ threads * 2048 }}  # container memory in MiB
launchType: EC2
networkConfiguration:
  awsvpcConfiguration:
    subnets:
{%- for v in subnets %}
    - {{ v }}
{%- endfor %}
''')

The start.sh script for the DFT executor performs the following operations:
1. It retrieves the previously created job.py program from S3 storage.
2. It executes the DFT computation by calling job.py.
3. It sends the results to the RPC server using the rpc.py program.
A possible implementation for start.sh is provided below:
#!/bin/bash
# Set the default time limit to 1 hour
TIMEOUT=${2:-3600}
S3PATH=$1

aws s3 cp $S3PATH/job.py ./

if (timeout $TIMEOUT python job.py > job.log); then
    aws s3 cp job.log $S3PATH/
    python /app/rpc.py finished job.log
else
    python /app/rpc.py failed job.log
fi

The rpc.py program functions as an RPC client. It invokes the RPC endpoint
set_result of the geometry service to transmit messages, such as the DFT energy
or the computation status, back to the geometry service. The code snippet below
illustrates the basic functionality of rpc.py:
import os
import sys
from xmlrpc.client import ServerProxy

rpc_server = os.getenv('RPC_SERVER')
job_id = os.getenv('JOB_ID')
status = sys.argv[1]
logfile = sys.argv[2]

def parse_log(logfile):
    '''Reads the log file and finds the energy'''
    log = open(logfile, 'r').read()
    ...
    return energy

with ServerProxy(rpc_server) as proxy:
    if status == 'finished':
        proxy.set_result(job_id, parse_log(logfile))
    else:
        proxy.set_result(job_id, status)

The RPC communication can be implemented in terms of REST APIs, as demonstrated in Chapter 6. However, for simplicity, we use the Python standard library xmlrpc to create an RPC client here. The client requires the address of the RPC server, which is stored in the environment variable RPC_SERVER. This environment variable was configured previously in the ECS task.
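The parse_log function was left as a stub above. A minimal sketch of an implementation, assuming the energy is to be extracted from PySCF's standard log line ("converged SCF energy = ..."):

import re

def parse_log(logfile):
    '''Reads the log file and finds the energy (sketch for PySCF logs)'''
    log = open(logfile, 'r').read()
    # PySCF prints a line such as "converged SCF energy = -76.432834402"
    match = re.search(r'converged SCF energy = (-?\d+\.\d+)', log)
    if match is None:
        raise ValueError('No converged energy found in the log')
    return float(match.group(1))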

The configuration of the RPC server

It is not a trivial task to establish the network communication between the RPC server and the DFT executor containers. One approach is to bind an endpoint to the RPC server that is accessible by other Docker containers. This endpoint can be an IP address, a DNS-resolvable address, or a proxy that forwards messages to the machine hosting the RPC server. In this example, the geometry service and the DFT executors are configured to operate within the AWSVPC network mode [17]. In this network mode, all containers can directly access each other via IP addresses. The IP address of each container instance can be obtained through the metadata URI provided by AWS [18].
import os, requests

def self_Ip():
    '''IP address of the current container or EC2 instance'''
    metadata_uri = os.getenv('ECS_CONTAINER_METADATA_URI')
    resp = requests.get(metadata_uri).json()
    return resp['Networks'][0]['IPv4Addresses'][0]

To prepare DFT tasks on the RPC server, we use the following template to create
the Python script job.py:
rpc_job_tpl = jinja2.Template('''
import pyscf
from gpu4pyscf.dft import RKS
mol = pyscf.M(atom="""{{ geom }}""", basis='def2-tzvp', verbose=4)
mf = RKS(mol, xc='wb97x').density_fit().run()
''')

The job.py script, once generated, is uploaded to S3 storage at the location specified by job_path for remote access.
These task preparation steps can be integrated into the launch_tasks function as shown below:
import boto3, hashlib, jinja2, json, sys, yaml
from typing import List, Dict
from concurrent.futures import Future, TimeoutError

CLUSTER = 'python-qc'
TASK = 'dft-demo'
RPC_PORT = 5005
s3_client = boto3.client('s3')
ecs_client = boto3.client('ecs')

job_pool: Dict[str, Future] = {}

def launch_tasks(geom_grids, results_path, timeout=7200):
    assert results_path.startswith('s3://')
    ip = self_Ip()

    jobs = {}
    for geom in geom_grids:
        job_conf = rpc_job_tpl.render(geom=geom).encode()
        job_id = hashlib.md5(job_conf).hexdigest()

        bucket, key = results_path.replace('s3://', '').split('/', 1)
        job_path = f's3://{bucket}/{key}/{job_id}'
        s3_client.put_object(Bucket=bucket,
                             Key=f'{key}/{job_id}/job.py',
                             Body=job_conf)

        task_config = rpc_config_tpl.render(
            cluster=CLUSTER, task=TASK, job_id=job_id, job_path=job_path,
            rpc_server=f'http://{ip}:{RPC_PORT}', timeout=timeout, threads=2)
        try:
            ecs_client.run_task(**yaml.safe_load(task_config))
        except Exception:
            pass  # Skip geometries whose tasks fail to launch
        else:
            fut = Future()  # (1)
            fut._timeout = timeout
            job_pool[job_id] = fut
            jobs[job_id] = fut
    return jobs

def set_result(job_id, result):
    fut = job_pool[job_id]
    fut.set_result(result)  # (2)

Given a set of molecular geometries, the launch_tasks function launches ECS tasks and returns a collection of Future objects corresponding to the DFT tasks. These DFT tasks and the geometry generator service operate asynchronously. To track the outcome of a DFT task, we introduce the Future object at line (1), which represents the result of the unfinished DFT task. When the Future.result() method is invoked, the calling program blocks until a result is available. The Future.set_result method, at line (2), can be used to set the result, which unblocks the waiting Future object. The set_result method is exposed to the RPC service, allowing it to be executed remotely by the RPC client on the DFT workers. The combination of the Future object and the set_result method is a common technique in asynchronous programming, which we will explore in more detail in Chapter 10.
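A minimal, self-contained illustration of this blocking and unblocking pattern, using only the standard library:

from concurrent.futures import Future
from threading import Timer

fut = Future()
# Simulate a remote worker that delivers a result after two seconds
Timer(2.0, fut.set_result, args=(-76.43,)).start()

# .result() blocks the caller until set_result() is invoked
print(fut.result(timeout=10))   # prints -76.43 after about two seconds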

Deploying the RPC server

To prevent the launch_tasks function from being blocked by the RPC service, we run the RPC server in the background.
from contextlib import contextmanager
from threading import Thread
from xmlrpc.server import SimpleXMLRPCServer

@contextmanager
def rpc_service(funcs):
    '''Creates an RPC service in the background'''
    rpc_server = SimpleXMLRPCServer(("", RPC_PORT))
    for fn in funcs:
        rpc_server.register_function(fn, fn.__name__)
    server_thread = Thread(target=rpc_server.serve_forever)
    server_thread.start()
    try:
        yield
    finally:
        # Terminate the RPC service
        rpc_server.shutdown()
        server_thread.join()

We then develop the driver function pes_app to manage the RPC server and the workflow. This function is defined in the pes_scan.py file. Depending on the results from the completed DFT tasks, the driver can either proceed to generate the next batch of computational tasks or terminate the PES workflow if necessary.
def parse_config(config_file):
    assert config_file.startswith('s3://')
    bucket, key = config_file.replace('s3://', '').split('/', 1)
    obj = s3_client.get_object(Bucket=bucket, Key=key)
    config = json.loads(obj['Body'].read())
    return config

def pes_app(config_file):
    config = parse_config(config_file)
    molecule = config['molecule']
    pes_params = config['params']
    results_path = config_file.rsplit('/', 1)[0]

    with rpc_service([set_result]):
        # Scan geometries until enough data are generated
        results = {}
        geom_grids = geometry_on_pes(molecule, pes_params, results)
        while geom_grids:
            jobs = launch_tasks(geom_grids, results_path)
            for key, fut in jobs.items():
                try:
                    result = fut.result(fut._timeout)
                except TimeoutError:
                    result = 'timeout'
                results[key] = result
            geom_grids = geometry_on_pes(molecule, pes_params, results)
    return results

if __name__ == '__main__':
    pes_app(sys.argv[1])

The RPC server program can be deployed on ECS as a task.

rpc_server_config = '''
family: pes-rpc-server
networkMode: awsvpc
requiresCompatibilities:
- FARGATE
cpu: "1024"
memory: "2GB"
containerDefinitions:
- name: rpc-server
  image: 987654321000.dkr.ecr.us-east-1.amazonaws.com/python-qc/dft-gpu:1.0
  portMappings:
  - containerPort: 5005
    hostPort: 5005
  command:
  - python
  - pes_scan.py
  - config_file
'''
config = yaml.safe_load(rpc_server_config)
ecs_client.register_task_definition(**config)

In this configuration, the portMappings field must be configured to expose port 5005 (specified by RPC_PORT previously) for RPC communication.

7.1.5 Function-as-a-Service
Developing a workflow on the cloud from scratch requires some coding work and expertise with cloud platforms. Once a workflow is successfully configured, we may want to enhance its accessibility, making it more convenient for future use or available to other users. Here, developing a Function-as-a-Service (FaaS) is one viable option that can simplify the complex deployment procedures associated with cloud platforms. This service can be accessed using RESTful APIs over the HTTP protocol. One only needs to send an HTTP request that specifies the molecular structure and DFT parameters to initiate a PES scan task.
The PES scan service requires the coordination of multiple components:
• An endpoint to receive HTTP requests.
• Middleware to forward HTTP requests to the backend server.
• A backend server to process the request and launch the PES scan job on AWS
ECS. It then returns the execution results in an HTTP response.
On the AWS cloud platform, we can leverage various AWS components to implement this service (Fig. 7.5). The AWS API Gateway can provide an HTTP endpoint and act as a proxy to forward HTTP requests [19]. AWS Lambda can serve as the backend to parse the request and launch the PES scan workflow we developed in Section 7.1.4. The results of the PES scan job can be stored in the S3 storage. The path of the S3 storage is returned in the HTTP response. The configuration of the Lambda handler and the API Gateway trigger are illustrated in Figs. 7.6–7.9.

FIGURE 7.5
The architecture of the FaaS for the PES scan service.
Let's start the project from AWS Lambda. AWS Lambda is a serverless computing service that enables us to execute code without the need to provision or manage servers. It automatically allocates computing resources and executes predefined functions based on the incoming request. Here, we create a Lambda function using the configuration shown in Fig. 7.6. We name this Lambda function pes-scan. This name will be automatically applied to the configurations of other services. You may choose any other name for this function. If you choose a different name, be sure to update all instances of pes-scan in the subsequent settings accordingly.
In the AWS Lambda console (Fig. 7.7), we can deploy the following Python function handler [20] in Box 1 of Fig. 7.7.

import json, hashlib, boto3

s3_client = boto3.client('s3')
ec2_client = boto3.client('ec2')
ecs_client = boto3.client('ecs')

def lambda_handler(event, context):
    # The data structure of event can be found in
    # https://docs.aws.amazon.com/lambda/latest/dg/services-apigateway.html
    method = event['httpMethod']
    if method != 'POST':
        return {'statusCode': 400}

    body = json.loads(event['body'])
    job = body['job']
    if job.upper() != 'PES':
        return {'statusCode': 400, 'body': f'Unknown job {job}'}

    config = json.dumps({
        'molecule': body['molecule'],
        'params': body['params']
    }).encode()
    job_id = hashlib.md5(config).hexdigest()

    resp = ec2_client.describe_subnets()
    subnets = [v['SubnetId'] for v in resp['Subnets']]
    bucket, path = 'python-qc-3vtl', f'pes-faas/{job_id}'
    s3_client.put_object(Bucket=bucket, Key=f'{path}/config.json',
                         Body=config)
    s3path = f's3://{bucket}/{path}/config.json'
    resp = ecs_client.run_task(
        cluster='python-qc',
        taskDefinition='pes-rpc-server',
        count=1,
        launchType='FARGATE',
        networkConfiguration={'awsvpcConfiguration': {'subnets': subnets}},
        overrides={
            'containerOverrides': [{
                'name': 'rpc-server',
                'command': ['python', 'pes_scan.py', s3path]
            }]
        }
    )
    msg = json.dumps(resp, default=str)  # default=str handles datetimes
    results = f's3://{bucket}/{path}/results.log'
    return {
        'statusCode': 200,
        'body': json.dumps({
            'results': results,
            'detail': msg,
        })
    }

This handler decodes the HTTP request and launches the PES scan task on ECS. Furthermore, a trigger for the Lambda function can be configured in Box 3 of Fig. 7.7 in the AWS Lambda console. The API Gateway is assigned to the trigger, as shown in Box 1 of Fig. 7.8. For simplicity, we set the security mechanism of the API Gateway to Open, which allows the gateway to accept all requests and forward them to the Lambda function. Please note that this setting is insecure. It is recommended to configure an authentication method to enhance the security of the service [21]. By deploying the API Gateway trigger, an HTTP endpoint is generated. This endpoint can be accessed by clicking on Box 1 in Fig. 7.9, which indicates the following URL:

https://d8iah6mwib.execute-api.us-east-1.amazonaws.com/default/pes-scan

FIGURE 7.6
Creating an AWS Lambda function to trigger the PES scan service.

FIGURE 7.7
Configuring AWS Lambda.

FIGURE 7.8
Adding an AWS API Gateway trigger.

FIGURE 7.9
Checking the API Gateway endpoint and configuring permissions.

The Lambda function handler needs access to several AWS services, including read and write permissions for S3, EC2, and ECS. We can configure the permissions of the Lambda function in the AWS Lambda console, as shown in Fig. 7.9. For the configuration method of the IAM policy, refer to Section 7.1.3. After assigning the necessary permissions to the Lambda function, we are ready to provide a PES scan service through the API Gateway endpoint. Here is an example of invoking the HTTP API using a JSON document, which specifies the geometry of a molecule and the PES parameters:
import requests

molecule = 'O 0 0 0; H 0.757 0.587 0; H -0.757 0.587 0'
resp = requests.post(
    'https://d8iah6mwib.execute-api.us-east-1.amazonaws.com/default/pes-scan',
    json={'job': 'pes', 'molecule': molecule, 'params': 'O,H'}
)
print(resp.json()['detail'])

The response will include an S3 path that indicates where the results are stored.
The above example is the most basic framework of an FaaS, in which we have omitted many technical details. In a comprehensive FaaS, there are more engineering and usability issues to consider, such as authorization, authentication, network security, fault tolerance, monitoring, and data persistence. Nevertheless, this basic framework can serve as a starting point, from which you can gradually introduce more features to enhance the usability of the FaaS.

7.2 Distributed job executors


In high-performance computing (HPC) environments, job schedulers such as PBS and SLURM are used to allocate the computational resources for job execution. Although applicable, they are not as commonly used in cloud environments. In cloud settings, resources such as computing nodes and storage are allocated on demand. It is inconvenient to reconfigure a traditional HPC management system each time new computing resources are allocated. Consequently, simple and lightweight tools for job scheduling have become more popular in cloud environments.
To execute Python programs in parallel, Celery, Dask, and Ray are three popular distributed executors well suited for cloud computing. Each of these tools has its unique characteristics, advantages, and suitable scenarios:

• Celery [22] is a distributed computing tool that relies on message queues to distribute tasks across multiple computing nodes. It is easy to scale up workers to process a high volume of similar tasks. Celery is commonly used for handling background tasks in web applications. It can also effectively support quantum chemistry data generation workloads.
• Dask [23] is a library that focuses on large-scale data processing. It is often employed in data analysis and numerical computation jobs that are distributed across multiple computing nodes. Its distribution system is a pure Python application, which does not require any external components (such as message queues or databases). Dask is a flexible framework for distributing workloads. It is not limited to pre-defined functions; Python functions can be created at runtime and sent to workers for execution.
• Ray [24] is a high-performance distributed computing framework. It shares certain similarities with Dask in distributed task execution. The highlight of the Ray project is its native integration with cloud computing environments. It offers convenient cloud resource management and task scheduling capabilities. In addition to distributing workloads, Ray offers a comprehensive solution for memory sharing and parallel computation in a distributed system. If the goal is to develop a distributed application that does more than just workload offloading, the Ray framework is an excellent option.
Distributed job executors typically consist of servers, clients, workers, and other components. Each component requires a different deployment strategy. Servers and workers are usually deployed on the cloud. Clients can be installed in the local Python environment.
Distributed computing is often closely related to parallel computation techniques. In this section, we will focus only on the capabilities and the deployment of distributed job executors in the context of cloud computing. The design of parallel programs will be explored in Chapter 10.

7.2.1 Celery
The distributed execution in Celery is based on the producer-consumer parallel computing model. In this setup, the Celery client acts as the producer, pushing new tasks into a queue. Celery workers then retrieve and execute tasks from this queue. To deploy Celery, we need to set up a message broker to manage the task queue.
Celery supports a variety of message brokers, such as RabbitMQ, Redis, and Amazon SQS (Simple Queue Service). Each message broker specializes in different functionalities, including security, performance, priority handling, error tolerance, and crash recovery. These considerations are crucial for highly concurrent web applications. However, for tasks such as DFT data generation or other chemistry simulation calculations, the differences among these brokers are minimal, and all are generally sufficient for our use cases.
Let's choose Redis to serve as the broker. It can be deployed locally using the command:

$ docker run -d -p 6379:6379 redis:7.2

We can accordingly install the Celery library with the Redis extension:
$ pip install celery[redis]

The first step in applying Celery is to develop a Celery application. A Celery application is essentially a Python module that can be imported both locally and by Celery workers.
$ cat dft_app.py
import hashlib
import pyscf
from gpu4pyscf.dft import RKS
from celery import Celery

app = Celery('dft-pes', broker='redis://localhost:6379/0',  # (1)
             backend='redis://localhost:6379/1')  # (2)

@app.task
def dft_energy(molecule):
    job_id = hashlib.md5(f'{molecule=}'.encode()).hexdigest()
    mol = pyscf.M(atom=molecule, basis='def2-tzvp', verbose=4)
    mf = RKS(mol, xc='wb97x').density_fit().run()
    return job_id, mf.e_tot

The message broker is specified by the keyword broker when creating the Celery application, as shown in line (1). It only serves as a task queue for scheduling jobs. To receive output from the remote Celery workers, it is necessary to configure a database backend to store results for the Celery application. There are several options available for result backends in Celery [25]. For simplicity, we continue to use Redis as the result backend. Redis uses numbers in the URL to identify databases; different numbers correspond to different databases. Results are configured to be stored in a separate database, as shown in line (2).
In the Celery application code, we should use the decorator @app.task to register any remote functions, as required by the Celery library. In our local environment, we can import this application and invoke the .delay() method of the remote function to submit tasks.
In [1]: from dft_app import dft_energy
        results = [
            dft_energy.delay('O 0 0 0; H 0.757 0.5 0; H -0.757 0.5 0'),
            dft_energy.delay('O 0 0 0; H 0.757 0.6 0; H -0.757 0.6 0')]
        print(results)
        print(results[0].status, results[1].status)

Out[1]:
[<AsyncResult: 41100ef3-2fed-4809-b8e7-a599b161b5a8>, <AsyncResult: 17271cc9-6aba-4bef-a8ce-544a512f8e9a>]
PENDING PENDING

Celery returns an AsyncResult object to represent the result of the remote function.
Initially, the status of the submitted tasks is PENDING. This is because we have not yet
deployed any workers for this application.
In another terminal, we can start a Celery worker by executing the command:
$ celery --app=dft_app worker

The application, as specified by the --app argument, must be a module that Celery workers can import. In this local setup, the worker successfully loads the Celery application because the command is executed in the same directory where the application program file resides. When deploying workers remotely, the application should be properly installed to ensure that it is importable. The package distribution techniques we discussed in Chapter 1 can be employed in this regard.
The Celery worker retrieves tasks from the message broker and sends the results
to the backend database. We can access the running status, error messages, and results
via the AsyncResult objects.
In [2]: print(results[0].get())
print(results[1].status)
Out[2]:
[’2bb88ae9b074ca2c190ac9969b4e7397’, -76.43283440169702]
SUCCESS

This example illustrates the basic usage of Celery, which is sufficient for the current data generation tasks. In order to efficiently process a large volume of tasks, we plan to deploy Celery workers on the AWS cloud. We will use AWS ECS to launch Celery workers and AWS SQS to serve as the broker. We choose SQS over deploying a standalone message queue because SQS can be conveniently accessed from a local machine. This eliminates the complex network configuration that would otherwise be necessary for a Celery client to access the message queue.
Despite its accessibility, the use of AWS SQS faces a different restriction: Celery does not support SQS as a backend for result storage. An alternative solution is to use other AWS database services as the result backend. For simplicity, we will just upload the results to AWS S3 storage. The Celery object and the task function in dft_app.py are modified accordingly:

import tempfile
import pyscf
from gpu4pyscf.dft import RKS
from celery import Celery
import boto3

app = Celery('dft-pes', broker='sqs://')

s3 = boto3.client('s3')

@app.task
def dft_energy(molecule, s3_path):
    output = tempfile.mktemp()
    mol = pyscf.M(atom=molecule, basis='def2-tzvp',
                  verbose=4, output=output)
    mf = RKS(mol, xc='wb97x').density_fit().run()
    bucket, key = s3_path.replace('s3://', '').split('/', 1)
    s3.upload_file(output, bucket, key)

The next step is to configure AWS ECS for the Celery workers. This is similar to the process we discussed in Section 7.1.3. We need to create a Docker image and upload it to AWS ECR. The Dockerfile for the Docker image is specified as follows:
FROM nvidia/cuda:12.0.1-runtime-ubuntu22.04

RUN apt-get update && \
    apt-get install -y python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*
RUN pip install --no-cache-dir pyscf gpu4pyscf-cuda12x boto3 celery[sqs]

COPY dft_app.py /app/
WORKDIR /app
CMD ["celery", "-A", "dft_app", "worker", "--loglevel=info"]

This Docker image is labeled as python-qc/celery-dft:1.0.

$ aws ecr create-repository --repository-name python-qc/celery-dft
$ docker build -t 987654321000.dkr.ecr.us-east-1.amazonaws.com/python-qc/celery-dft:1.0 .
$ docker push 987654321000.dkr.ecr.us-east-1.amazonaws.com/python-qc/celery-dft:1.0

We then register an ECS task definition using the new Docker image.
config = yaml.safe_load('''
family: celery-gpu-worker
networkMode: awsvpc
containerDefinitions:
- name: celery-worker
  image: 987654321000.dkr.ecr.us-east-1.amazonaws.com/python-qc/celery-dft:1.0
  environment:
  - name: OMP_NUM_THREADS
    value: "4"
runtimePlatform:
  cpuArchitecture: X86_64
  operatingSystemFamily: LINUX
cpu: 4 vcpu
memory: 8GB
''')
ecs_client = boto3.client('ecs')
resp = ecs_client.register_task_definition(**config)
print(yaml.dump(resp))

We then create ECS tasks to launch Celery workers.


ec2_client = boto3.client('ec2')
resp = ec2_client.describe_subnets()
subnets = [v['SubnetId'] for v in resp['Subnets']]

config = yaml.safe_load('''
cluster: python-qc
taskDefinition: celery-gpu-worker
count: 2
overrides:
  containerOverrides:
  - name: celery-worker
    environment:
    - name: AWS_ACCESS_KEY_ID
      value: xxxxxxxxxxxxxxxxxxxx
    - name: AWS_SECRET_ACCESS_KEY
      value: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    resourceRequirements:
    - type: GPU
      value: "1"
networkConfiguration:
  awsvpcConfiguration:
    subnets: {}
'''.format(subnets))

ecs_client = boto3.client('ecs')
resp = ecs_client.run_task(**config)
print(yaml.dump(resp))

If more computational resources are required, the count field in the configuration can be modified to adjust the number of ECS workers. In the ECS task configuration, the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are configured to provide AWS access credentials [26], which are necessary for the Celery workers to access SQS and S3. It is crucial to avoid hardcoding the AWS access credentials directly into the source code or Dockerfile of the application, as they might be publicly accessible.
Now, we are ready to load the modified dft_app.py and run the DFT computation tasks in the AWS cloud.

In [3]: from dft_app import dft_energy
        results = [
            dft_energy.delay('O 0 0 0; H 0.757 0.5 0; H -0.757 0.5 0'),
            dft_energy.delay('O 0 0 0; H 0.757 0.6 0; H -0.757 0.6 0')]
        print(results)

Out[3]:
[<AsyncResult: 32f68705-04f8-483c-83eb-45bf2622ced3>, <AsyncResult: 5833cf35-7fd2-4d7d-9ef2-f9e1acd4579a>]

Please note that the Celery workers do not automatically shut down after completing all computational tasks. To terminate Celery workers, we can stop the corresponding ECS tasks using the aws ecs stop-task command via the CLI or by deleting the allocated resources through the web console. Don't forget this cleanup step, or you will be continually charged for the ECS computing resources. It is possible to implement functions that monitor the status of the task queue and automatically terminate Celery workers upon completion. However, implementing this feature requires the coordination of several AWS cloud services, which is beyond the scope of Python programming. We will not cover this topic in this book.
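The manual cleanup itself is straightforward, however. A hedged sketch that stops every running task in our cluster with boto3:

import boto3

ecs = boto3.client('ecs')

# Stop all running tasks in the cluster created earlier
for arn in ecs.list_tasks(cluster='python-qc')['taskArns']:
    ecs.stop_task(cluster='python-qc', task=arn,
                  reason='batch finished, releasing billable resources')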
As shown by this example, using Celery is generally not complicated. If we have a
large number of similar computational tasks to execute, Celery is an excellent choice
for batch processing.
Let's briefly summarize the strengths and disadvantages of using Celery in distributed computing. Celery has the following advantages:
• Efficient for batch-executing a large number of pre-defined jobs, such as the data generation tasks.
• Quick response. As long as the Celery workers are active, they can repeatedly execute tasks. New tasks incur only a small overhead when retrieving input variables from the broker.
• Thanks to the message-queue architecture, Celery provides resilience against worker failures. If any workers crash, the unfinished tasks can be picked up and continued by other workers.
The disadvantages of Celery include:
• Celery uses JSON serialization to encode the task parameters and the results. Consequently, a general Python object may not be used as an input argument of a task. If we need to distribute general Python functions or objects, other distributed executors, such as dask.distributed, can be considered.
• There is no native support for cloud deployment. Deploying Celery requires certain expertise and knowledge of the cloud architecture.
• Scaling workers up or down is not automatic.
• There is no built-in monitor in Celery for conveniently tracking and managing the status of tasks. The package provides only a CLI tool for monitoring tasks. To manage Celery tasks through a GUI, third-party libraries such as Flower [27] are required.

7.2.2 Dask
Dask offers the dask.distributed module for distributed computing over a Dask cluster. A Dask cluster consists of a scheduler and several workers, as shown in Fig. 7.10. To install the required components for Dask distributed executors, we can execute

$ pip install dask distributed bokeh

Here, the package distributed provides the necessary tools for managing a Dask cluster, and bokeh offers a visualization dashboard for monitoring the status of the Dask cluster.

FIGURE 7.10
The architecture of the Dask cluster.

Let's first explore the functionality of dask.distributed in a local environment. We can start a Dask cluster by running two commands:

$ dask scheduler
$ dask worker localhost:8786

The command dask scheduler initializes the scheduler for the Dask cluster. The command dask worker launches workers.
Once the Dask cluster is initialized, the scheduler functions as the server. We can then create a Client to connect to the scheduler. The .submit() method of the Dask client can be used to offload computation to Dask workers. This operation returns a Future object, which represents the pending result of the distributed task. This Future object offers functionality similar to the Python built-in concurrent.futures.Future class. Additionally, the Dask client provides a shortcut method, .gather(), to quickly retrieve computational results.
import pyscf
from dask.distributed import Client

# Connect to the scheduler started above.
client = Client('localhost:8786')

def dft_energy(molecule):
    mol = pyscf.M(atom=molecule, basis='def2-tzvp', verbose=4)
    mf = mol.RKS(xc='wb97x').density_fit().run()
    return mf.e_tot

molecules = ['O 0 0 0; H 0.757 0.5 0; H -0.757 0.5 0',
             'O 0 0 0; H 0.757 0.6 0; H -0.757 0.6 0']
# Each .submit() call offloads one task and returns a Future.
results = [client.submit(dft_energy, x) for x in molecules]
print(client.gather(results))
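The returned Future objects can also be inspected individually; a brief sketch of this usage:

# Check a single task's status, then block until its result is ready.
future = client.submit(dft_energy, molecules[0])
print(future.status)    # e.g., 'pending' or 'finished'
print(future.result())  # blocks until the task completes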

By default, starting a scheduler automatically serves a bokeh dashboard on port
8787 of the same machine. In a local setup, the dashboard can be accessed at:
http://localhost:8787/status

We can use this dashboard to monitor the status of the Dask cluster.
Next, let’s examine how to deploy the Dask cluster in the cloud. The Dask
scheduler and the Dask workers may require different configurations. The scheduler
needs to be accessible to the Client running on our local device, either through its
public IP address or a proxy. Since data transfers via public IP addresses are subject
to charges, communication between the workers and the scheduler should occur
over the internal network.
If we deploy the Dask scheduler using ECS, additional network configurations
will be required to ensure the accessibility of the scheduler. For simplicity, we can
instead launch an EC2 instance to host the Dask scheduler. An EC2 instance can
have both public and private IP addresses.1 After installing the essential packages,
including dask, distributed, and bokeh, on the EC2 instance, we can start the
scheduler as we did in the local environment.
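As a minimal sketch, such an instance could be launched with boto3 as follows. The AMI ID and key pair name are placeholders to adapt, and the network and security settings are omitted; tagging the instance as dask-server matches the lookup used later in this section:

import boto3

ec2 = boto3.client('ec2')
resp = ec2.run_instances(
    ImageId='ami-080e1f13689e07408',   # an Ubuntu 22.04 image; region specific
    InstanceType='m5.xlarge',
    KeyName='my-key-pair',             # hypothetical SSH key pair name
    MinCount=1, MaxCount=1,
    TagSpecifications=[{
        'ResourceType': 'instance',
        'Tags': [{'Key': 'Name', 'Value': 'dask-server'}],
    }],
)
print(resp['Instances'][0]['InstanceId'])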
On the other hand, for Dask workers, we can still utilize the containerized service.
Similar to the ECS deployment method for Celery, we can create a Docker image
python-qc/dask-dft:1.0 using the following Dockerfile:

1 Exposing public IP addresses can introduce security risks. It is recommended to configure the firewall
(Security Group) of the EC2 instance to only allow connections from trusted IP addresses.
FROM nvidia/cuda:12.0.1-runtime-ubuntu22.04
RUN apt-get update && \
apt-get install -y python3 python3-pip && \
rm -rf /var/lib/apt/lists/*
RUN pip install --no-cache-dir gpu4pyscf-cuda12x boto3 dask distributed

The ECS task definition for Dask workers is essentially the same as that for Celery
workers, which can be reused here.
An ECS worker might be assigned multiple IP addresses. To ensure that the Dask
worker connects to the scheduler through its private address, we should specify that
address when starting the dask-worker command. The address can be acquired from
the AWS web console or through the boto3 library. We then override the image and
command in the task definition, leading to the following ECS task configuration for
launching Dask workers:
import boto3
import yaml

# Look up the VPC subnets and the private IP of the scheduler instance.
ec2_client = boto3.client('ec2')
resp = ec2_client.describe_subnets()
subnets = [v['SubnetId'] for v in resp['Subnets']]
private_ip = ec2_client.describe_instances(
    Filters=[{'Name': 'tag:Name', 'Values': ['dask-server']}]
)['Reservations'][0]['Instances'][0]['PrivateIpAddress']

config = yaml.safe_load('''
cluster: python-qc
taskDefinition: celery-gpu-worker
count: 2
overrides:
  containerOverrides:
  - name: dask-gpu-worker
    image: 987654321000.dkr.ecr.us-east-1.amazonaws.com/python-qc/dask-dft:1.0
    command:
    - dask-worker
    - "{0}:8786"  # the scheduler's private IP and its default port
    resourceRequirements:
    - type: GPU
      value: "1"
networkConfiguration:
  awsvpcConfiguration:
    subnets: {1}
'''.format(private_ip, subnets))

ecs_client = boto3.client('ecs')
resp = ecs_client.run_task(**config)
print(yaml.dump(resp))

Compared to Celery, a major advantage of the Dask distributed solution is its
support for general Python functions and objects (subject to certain serialization con­
straints, as discussed in Chapter 6). However, deploying Dask distributed systems in
the cloud is complex. Additionally, similar to the autoscaling issues faced with Cel­
ery, managing Dask workers and scaling them up or down is relatively inconvenient
in cloud environments.

7.2.3 Ray
Ray is more accurately described as a distributed execution framework than merely
a distributed task executor. It is designed for building and running scalable
distributed applications, and it also offers toolkits for developing distributed
machine learning applications.
When Ray is used as a distributed task scheduler, its functionality has certain
overlaps with Dask. It can offload a general Python function to a remote worker for
execution. One of the key advantages of Ray is its native integration with various
cloud platforms, such as AWS, GCP, Azure, and Aliyun. Ray allows users to manage
a cluster for distributed computation without needing extensive cloud expertise.
A Ray cluster consists of a head node and several worker nodes, as shown in
Fig. 7.11. Unlike Dask, which runs a simple service on each node, Ray employs
a more complex architecture. On each node, Ray runs multiple services for com­
munication, job scheduling, distributed storage, and other functions. The head node
additionally operates an auto-scaling service, which can automatically scale worker
nodes according to the workloads. Deploying these complex services does not require
a manual, step-by-step setup. By specifying a concise cluster configuration, Ray can
automatically configure the virtual cluster on the cloud, including all necessary
network connections, message queues, schedulers, storage, the autoscaler, and other
cloud services required by Ray.

FIGURE 7.11
The architecture of the Ray cluster.

The Ray package integrates both distributed computation functionality and cluster
management capabilities. A Ray cluster can be managed directly from our local
machine. Taking AWS as an example, to deploy a Ray cluster on the cloud, the
following packages should be installed:
$ pip install -U ray[default,aws] boto3

The boto3 library is required because Ray relies on this library to access AWS APIs.
To offload a computation task to the Ray cluster, we need to prepare and submit a
Ray job. A Ray job is typically composed of the following components:
• Functions for distribution. These functions are annotated with the @ray.remote
decorator. This decorator adds the .remote() method to them.
• Calling the .remote() method on these functions to generate Future objects.
• Retrieving the results of the Future objects using the ray.get function, which
blocks until the tasks complete.

Below is an example of a Ray job for the aforementioned DFT calculation task.
$ cat dft_job.py
import pyscf
import ray

@ray.remote
def dft_energy(molecule):
    mol = pyscf.M(atom=molecule, basis='def2-tzvp', verbose=4)
    mf = mol.RKS(xc='wb97x').density_fit().run()
    return mf.e_tot

molecules = ['O 0 0 0; H 0.757 0.5 0; H -0.757 0.5 0',
             'O 0 0 0; H 0.757 0.6 0; H -0.757 0.6 0']
# Each .remote() call schedules a task and immediately returns a future.
futures = [dft_energy.remote(x) for x in molecules]
print(ray.get(futures))

To experience the basic usage of Ray jobs, we can start by running the following
command to initiate a Ray cluster in the local environment:
$ ray start --head

According to the instructions displayed in the output, we can submit a Ray job using the
command:
$ export RAY_ADDRESS=’http://127.0.0.1:8265’
$ ray job submit --working-dir . -- python3 dft_job.py

The double-dash notation -- is a delimiter that separates the Ray options from the
command of the job to be executed remotely. The --working-dir option points to a
local directory where we can put the job scripts and the necessary input data.
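The same submission can also be performed from Python through the Ray job submission SDK; a minimal sketch, assuming the local cluster started above:

from ray.job_submission import JobSubmissionClient

# Connect to the Ray dashboard address exported above.
client = JobSubmissionClient('http://127.0.0.1:8265')
job_id = client.submit_job(
    entrypoint='python3 dft_job.py',
    runtime_env={'working_dir': '.'},  # replicated to the worker nodes
)
print(job_id)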
Please note the --working-dir path, which can be a common source of confusion.
Cloud environments do not have a shared file system like the one on an HPC cluster.
Script files, such as the dft_job.py we develop locally, are not automatically
accessible on the worker nodes. To address this file sharing problem, the command

ray job submit

will replicate the working-dir onto the worker nodes. The command for the job (such
as python3 dft_job.py in our example) is executed within this directory. Therefore,
we should avoid using absolute paths specific to the local machine within the job
command. Additionally, we need to limit the amount of data stored in the working-dir,
as sending a large volume of data can lead to file transfer errors during the Ray job
submission process.
If our job requires large libraries or a significant amount of input data, how can
we transfer them to the Ray worker nodes?
One approach is to create a script within the working-dir that installs the required
libraries and downloads the necessary data from cloud object storage. This script
can be executed prior to running the main computation task.
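A minimal sketch of such a preparation script is shown below; the bucket and object names are hypothetical:

# prepare.py -- placed in the working-dir and run before the main task
import subprocess
import boto3

# Install libraries that are not baked into the worker image.
subprocess.run(['pip', 'install', 'gpu4pyscf-cuda12x'], check=True)

# Download input data from S3; the bucket and key names are hypothetical.
s3 = boto3.client('s3')
s3.download_file('python-qc-data', 'inputs/molecules.json', 'molecules.json')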
Alternatively, libraries or data can be pre-installed on the worker nodes during the
Ray cluster deployment. These preparation commands can be specified in the cluster
configuration, which will become clearer in the subsequent discussion.
Let’s move on to setting up a Ray cluster on AWS. Below is a sample configuration
for such a cluster, which we have saved in a YAML file named config.yaml.

cluster_name: dft-ray

provider:
  type: aws
  region: us-east-1
  availability_zone: us-east-1a

available_node_types:
  ray.head.default:
    node_config:
      InstanceType: m5.xlarge
      ImageId: ami-080e1f13689e07408  # Ubuntu 22.04
  ray.worker.default:
    max_workers: 3
    node_config:
      InstanceType: p3.2xlarge  # GPU instance
      ImageId: ami-0b54855df82eef3a3  # Ubuntu 22.04 with deep learning tools

setup_commands:
  - sudo apt install -y python3-pip
  - pip install ray[default] boto3 pyscf gpu4pyscf-cuda12x

This configuration specifies several key settings:

• The resources are provisioned on the AWS cloud platform, as indicated by the
provider field.
• An autoscaling scheme is configured, which allows Ray to launch one head node
and up to three worker nodes.
• The configuration specifies the EC2 instance type and the operating system
image for each node. We can search for a public OS image in the AWS
marketplace or use a custom image created by ourselves. In this example, ImageId:
ami-0b54855df82eef3a3 is a public deep learning image, which pre-installs the
CUDA 12 drivers and runtime tools.
• The setup_commands field specifies the initialization commands to run after the
nodes boot up. This allows for the installation of new packages and the preparation
of data. The software installed in this phase must be compatible with the CUDA
toolkit provided by the OS image.
More custom options are available in the Ray cluster configuration, such as
the disk size and the autoscaling strategy. For further configuration options, you can
refer to the Ray documentation [28] or the cluster configuration examples available
in the Ray source code.
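For illustration, the disk size and the autoscaler's idle timeout could be adjusted with options along the following lines; these keys are taken from typical Ray example configurations and should be verified against the documentation:

# Possible additions to config.yaml
idle_timeout_minutes: 5          # tear down idle worker nodes after 5 minutes
available_node_types:
  ray.worker.default:
    node_config:
      BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
            VolumeSize: 100      # root disk size in GB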
After defining the cluster configuration, we can use the ray up command to create
or update a cluster.
$ ray up -y config.yaml

Once all computations are finished, the Ray cluster can be shut down using the com­
mand:
$ ray down -y config.yaml

By executing this command, Ray sends requests to the cloud platform to free and
clean up the resources it has allocated. However, this command does not fully track
the success of these requests. To prevent potential resource waste and unexpected
bills, it is recommended to log into the web console and manually verify the status of
each resource.
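For instance, a quick boto3 check along the following lines lists any instances that are still running. It assumes Ray's default behavior of tagging the instances it launches with a ray-cluster-name tag; verify this against your deployment:

import boto3

ec2 = boto3.client('ec2')
# Find instances of the 'dft-ray' cluster that are still pending or running.
resp = ec2.describe_instances(Filters=[
    {'Name': 'tag:ray-cluster-name', 'Values': ['dft-ray']},
    {'Name': 'instance-state-name', 'Values': ['pending', 'running']},
])
leftover = [i['InstanceId']
            for r in resp['Reservations'] for i in r['Instances']]
print('Instances still running:', leftover)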

Summary
In this chapter, we have briefly demonstrated the design and implementation of a
quantum chemistry simulation workflow on the cloud. We utilized the containerized
cloud service to provide computing resources and the object storage service for data
exchange. When configuring cloud services, we chose to manage them by calling
their APIs. This approach was combined with Python templating techniques. Although
we have used the AWS cloud to demonstrate this process, similar approaches
can be applied to other cloud platforms.
Based on the knowledge of cloud computing, we explored several distributed job
executors, including the Celery, Dask, and Ray libraries. We illustrated how to de­
ploy them on the cloud and how to use them to execute computation tasks remotely.
Notably, Ray exhibits excellent integration with cloud environments, making it a con­
venient choice for cloud computing.
During the development of cloud programs, we leveraged an AI coding assistant,
specifically the GPT-4 model, to provide prototypes for computation tasks and
configuration programs. We then refined the functionalities based on the AI-generated
prototypes. The world of cloud computing remains complex, and the relevant tools
and technologies evolve rapidly. Using AI to assist the development of cloud
programs is an effective way to keep pace with this situation.

References
[1] Amazon Web Services, Amazon machine images (AMIs), https://docs.aws.amazon.com/
AWSEC2/latest/UserGuide/AMIs.html, 2024.
[2] Amazon Web Services, Amazon EBS volumes, https://docs.aws.amazon.com/ebs/latest/
userguide/ebs-volumes.html, 2024.
[3] Amazon Web Services, What is Amazon elastic container service?, https://docs.aws.
amazon.com/AmazonECS/latest/developerguide/Welcome.html, 2024.
[4] Amazon Web Services, What is Amazon VPC?, https://docs.aws.amazon.com/vpc/latest/
userguide/what-is-amazon-vpc.html, 2024.
[5] Amazon Web Services, How Amazon VPC works, https://docs.aws.amazon.com/vpc/
latest/userguide/how-it-works.html, 2024.
[6] Amazon Web Services, Get started with Amazon S3, https://docs.aws.amazon.com/
AmazonS3/latest/userguide/GetStartedWithS3.html, 2024.
[7] Amazon Web Services, Moving an image through its lifecycle in Amazon ECR, https://
docs.aws.amazon.com/AmazonECR/latest/userguide/getting-started-cli.html, 2024.
[8] Amazon Web Services, Control traffic to your AWS resources using security groups,
https://docs.aws.amazon.com/vpc/latest/userguide/vpc-security-groups.html, 2024.
[9] Amazon Web Services, Configuration and credential file settings, https://docs.aws.
amazon.com/cli/latest/userguide/cli-configure-files.html, 2024.
[10] Amazon Web Services, IAM roles, https://docs.aws.amazon.com/IAM/latest/UserGuide/
id_roles.html, 2024.
[11] The PySCF Developers, Quantum chemistry with Python, https://pyscf.org/, 2024.
[12] R. Li, Q. Sun, X. Zhang, G.K.-L. Chan, Introducing GPU-acceleration into the Python­
based simulations of chemistry framework, arXiv:2407.09700, https://arxiv.org/abs/
2407.09700, 2024.
[13] X. Wu, Q. Sun, Z. Pu, T. Zheng, W. Ma, W. Yan, X. Yu, Z. Wu, M. Huo, X. Li, W. Ren, S.
Gong, Y. Zhang, W. Gao, Enhancing GPU-acceleration in the Python-based simulations
of chemistry framework, arXiv:2404.09452, https://arxiv.org/abs/2404.09452, 2024.
[14] Amazon Web Services, AWS Fargate for Amazon ECS, https://docs.aws.amazon.com/
AmazonECS/latest/developerguide/AWS_Fargate.html, 2024.
[15] Amazon Web Services, What is Amazon EC2 Auto Scaling?, https://docs.aws.amazon.
com/autoscaling/ec2/userguide/what-is-amazon-ec2-auto-scaling.html, 2024.
[16] Amazon Web Services, Amazon ECS task definitions for GPU workloads, https://docs.
aws.amazon.com/AmazonECS/latest/developerguide/ecs-gpu.html, 2024.
[17] Amazon Web Services, Amazon ECS task networking options for the EC2 launch
type, https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-networking.
html, 2024.
[18] Amazon Web Services, Amazon ECS task metadata endpoint version 3, https://docs.aws.
amazon.com/AmazonECS/latest/developerguide/task-metadata-endpoint-v3.html, 2024.
[19] Amazon Web Services, Tutorial: create a REST API with a Lambda proxy in­
tegration, https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-
create-api-as-simple-proxy-for-lambda.html, 2024.
[20] Amazon Web Services, Define Lambda function handler in Python, https://docs.aws.
amazon.com/lambda/latest/dg/python-handler.html, 2024.
[21] Amazon Web Services, Authenticating using an API gateway method, https://
docs.aws.amazon.com/transfer/latest/userguide/authentication-api-gateway.html#
authentication-custom-ip, 2024.
[22] A. Solem, contributors, Celery - distributed task queue, https://docs.celeryq.dev/en/
stable/, 2024.
[23] Dask Development Team, Dask: library for dynamic task scheduling, http://dask.pydata.
org, 2016.
[24] P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, M. Elibol, Z. Yang, W.
Paul, M.I. Jordan, I. Stoica, Ray: a distributed framework for emerging AI applications,
in: 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI
18), USENIX Association, Carlsbad, CA, 2018, pp. 561–577, https://www.usenix.org/
conference/osdi18/presentation/moritz.
[25] A. Solem, contributors, Celery documentation - backends and brokers, https://docs.
celeryq.dev/en/stable/getting-started/backends-and-brokers/index.html, 2024.
[26] Amazon Web Services, Environment variables to configure the AWS CLI, https://docs.
aws.amazon.com/cli/latest/userguide/cli-configure-envvars.html, 2024.
[27] M. Movsisyan, Flower documentation, https://flower.readthedocs.io/en/latest/, 2024.
[28] The Ray Team, Ray documentation - cluster YAML configuration options, https://docs.ray.
io/en/latest/cluster/vms/references/ray-cluster-configuration.html#cluster-config, 2024.
