Cloud Computing Complete Notes
History:
After that, distributed computing came into the picture, where all the
computers are networked together and share their resources when needed.
As time passed, the technology caught up with that idea, and a few years
later:
In 2009, Google Apps also started to provide cloud computing
enterprise applications.
Of course, all the big players are present in the cloud computing
evolution, some were earlier, and some were later. In
2009, Microsoft launched Windows Azure, and companies like Oracle and
HP have all joined the game. This proves that today, cloud computing has
become mainstream.
Definition of cloud:
Cloud computing can be defined as delivering computing power (CPU,
RAM, network speed, storage, operating systems and software) as a service
over a network (usually the internet), rather than physically having the
computing resources at the customer's location.
1) Agility
3) High Scalability
4) Multi-Sharing
With the help of cloud computing, multiple users and applications can work
more efficiently with cost reductions by sharing common infrastructure.
5) Device and Location Independence
Cloud computing enables users to access systems using a web browser
regardless of their location or the device they use, e.g. PC, mobile phone,
etc. As the infrastructure is off-site (typically provided by a third party) and
accessed via the internet, users can connect from anywhere.
6) Maintenance
7) Low Cost
By using cloud computing, costs are reduced: to take up cloud services, an
IT company need not set up its own infrastructure, and it pays only for the
resources it actually uses.
Cloud Models:
The service models are categorized into three basic models:
1) Software-as-a-Service (SaaS)
2) Platform-as-a-Service (PaaS)
3) Infrastructure-as-a-Service (IaaS)
1) Software-as-a-Service (SaaS)
SaaS is a software distribution model in which applications are hosted by a
cloud service provider and made available to customers over the internet.
SaaS is also known as "On-Demand Software".
In SaaS, software and associated data are centrally hosted on the cloud
server. SaaS is accessed by users using a thin client via a web browser.
Advantages
1) SaaS is easy to buy
SaaS pricing is based on a monthly or annual fee. SaaS allows
organizations to access business functionality at a cost lower than
that of licensed applications.
2) Less hardware required for SaaS
The software is hosted remotely, so organizations don't need to
invest in additional hardware.
3) Low Maintenance required for SaaS
Software as a service removes the need for installation, set-up,
and often daily upkeep and maintenance for organizations. The initial
set-up cost for SaaS is typically lower than for enterprise software.
SaaS vendors price their applications based on usage parameters,
such as the number of users of the application, so SaaS is easy to
monitor and updates are automatic.
Disadvantages
1) Security
As data is stored in the cloud, security may be an issue for some users.
However, cloud computing is not more secure than in-house deployment.
2) Latency issue
Because the data and the application are stored in the cloud at a variable
distance from the end user, there may be more latency when interacting
with the application than with a local deployment. Hence, the SaaS model
is not suitable for applications that demand response times in milliseconds.
2) Infrastructure-as-a-Service (IaaS)
The IaaS cloud computing platform layer eliminates the need for every
organization to maintain its own IT infrastructure.
IaaS is offered in three models: public, private, and hybrid cloud. Private
cloud implies that the infrastructure resides at the customer's premises. In
the case of public cloud, it is located at the cloud vendor's data center; and
hybrid cloud is a combination of the two, with the customer choosing the
best of both worlds.
Advantages
2) You can easily access the vast computing power available on an IaaS
cloud platform.
Disadvantages
1) There is a risk that the IaaS cloud vendor may gain access to the
organization's data. This can be avoided by opting for a private cloud.
4) An IaaS cloud platform can limit user privacy and customization
options.
3) Platform-as-a-Service (PaaS):
PaaS is a cloud platform created for programmers to develop, test, run and
manage applications. The developer does not manage the underlying
infrastructure; all the infrastructure needed to run the applications is
provided over the internet.
Advantages
1) Simplified Development
2) Lower risk
4) Instant community
5) Scalability
Applications deployed can scale from one to thousands of users without any
changes to the applications.
Disadvantages
1) Vendor lock-in
2) Data Privacy
It may happen that some applications are local and some are in the cloud,
so there can be increased complexity when data in the cloud must be used
together with local data.
Examples of cloud services:
IaaS:
• Amazon EC2
• Google Compute Engine
• Windows Azure VMs
PaaS:
• Google App Engine
SaaS:
• Salesforce
Hypervisor:
A hypervisor is software that creates and runs virtual machines, allowing
multiple guest operating systems to share a single host machine.
• Type-1 Hypervisor
• Type-1 or native hypervisors run directly on the host hardware, and
control the hardware and monitor the guest operating systems.
• Type-2 Hypervisor
• Type-2 or hosted hypervisors run on top of a conventional (main/host)
operating system and monitor the guest operating systems.
Load Balancing:
Cloud computing resources can be scaled up on demand to meet the
performance requirements of applications.
• Load balancing distributes workloads across multiple servers to meet the
application workloads.
• The goals of load balancing techniques include:
• Achieving maximum utilization of resources
• Minimizing response times
• Maximizing throughput
Load Balancing Algorithms :
• Round Robin load balancing
• Weighted Round Robin load balancing
• Low Latency load balancing
• Least Connections load balancing
• Priority load balancing
• Overflow load balancing
• Persistence Approaches
• Sticky sessions
• Session Database
• Browser cookies
• URL re-writing
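As an illustration of the first two algorithms above, the sketch below simulates Round Robin and Weighted Round Robin dispatching in Python; the server names and weights are hypothetical.

# Minimal sketch of Round Robin and Weighted Round Robin dispatching.
# Server names and weights are hypothetical.
import itertools

servers = ['server-1', 'server-2', 'server-3']

# Round Robin: requests are assigned to servers in circular order.
rr = itertools.cycle(servers)
for i in range(6):
    print "Request-%d -> %s" % (i, next(rr))

# Weighted Round Robin: each server appears in the cycle in
# proportion to its weight (e.g. its capacity).
weights = {'server-1': 3, 'server-2': 1, 'server-3': 1}
wrr = itertools.cycle([s for s in servers for _ in range(weights[s])])
for i in range(5):
    print "Request-%d -> %s" % (i, next(wrr))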
Scaling Approaches:
• Vertical Scaling/Scaling up:
• Involves upgrading the hardware resources (adding additional computing,
memory, storage or network resources).
• Horizontal Scaling/Scaling out
• Involves addition of more resources of the same type.
Deployment:
• Cloud application deployment design is an iterative process that involves:
• Deployment Design
• The variables in this step include the number of servers in each tier; the
computing, memory and storage capacities of the servers; the server
interconnections; and the load balancing and replication strategies.
• Performance Evaluation
• To verify whether the application meets the performance requirements
with the deployment.
• Involves monitoring the workload on the application and measuring
various workload parameters such as response time and throughput.
• Utilization of servers (CPU, memory, disk, I/O, etc.) in each tier is also
monitored.
• Deployment Refinement
• Various alternatives can exist in this step, such as vertical scaling (scaling
up), horizontal scaling (scaling out), alternative server interconnections,
and alternative load balancing and replication strategies.
Replication:
Replication is used to create and maintain multiple copies of the data in the
cloud.
• Cloud enables rapid implementation of replication solutions for disaster
recovery for organizations.
• With cloud-based data replication, organizations can plan for disaster
recovery without making any capital expenditure on purchasing,
configuring and managing secondary site locations.
• Types:
• Array-based Replication
• Network-based Replication
• Host-based Replication
Monitoring:
• Monitoring services allow cloud users to collect and analyze the data
on various monitoring metrics.
Examples of Monitoring Metrics:
Type        Metrics
CPU         CPU-Usage, CPU-Idle
Disk        Disk-Usage, Bytes/sec (read/write), Operations/sec
Memory      Memory-Used, Memory-Free, Page-Cache
Interface   Packets/sec (incoming/outgoing), Octets/sec (incoming/outgoing)
Software Defined Networking:
• Software Defined Networking (SDN) is a networking architecture that
separates the control plane from the data plane and centralizes the network
controller.
• SDN Architecture
• The control and data planes are decoupled and the network controller is
centralized.
Network Function Virtualization:
• Network Function Virtualization (NFV) is a technology that leverages
virtualization to consolidate heterogeneous network devices onto
industry-standard high-volume servers, switches and storage.
• Relationship to SDN
• NFV is complementary to SDN, as NFV can provide the infrastructure on
which SDN can run.
• NFV and SDN are mutually beneficial but not dependent on each other.
• Network functions can be virtualized without SDN; similarly, SDN can
run without NFV.
• NFV comprises network functions implemented in software that run on
virtualized resources in the cloud.
• NFV enables a separation of the network functions, which are
implemented in software, from the underlying hardware.
NFV Architecture
• Key elements of the NFV architecture are:
• Virtualized Network Function (VNF): VNF is a software implementation
of a network function which is capable of running over the NFV
Infrastructure (NFVI).
• NFV Infrastructure (NFVI): NFVI includes compute, network and storage
resources that are virtualized.
• NFV Management and Orchestration: NFV Management and
Orchestration focuses on all virtualization-specific management tasks and
covers the orchestration and lifecycle management of physical and/or
software resources that support the infrastructure virtualization, and the
lifecycle management of VNFs.
Fig NFV Architecture
MapReduce:
• MapReduce is a parallel data processing model for processing and
analysis of massive scale data.
• MapReduce phases:
• Map Phase: In the Map phase, data is read from a distributed file system,
partitioned among a set of computing nodes in the cluster, and sent to the
nodes as a set of key-value pairs.
• The Map tasks process the input records independently of each other and
produce intermediate results as key-value pairs.
• The intermediate results are stored on the local disk of the node running
the Map task.
• Reduce Phase: When all the Map tasks are completed, the Reduce phase
begins in which the intermediate data with the same key is aggregated.
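The classic word count example illustrates the two phases. The sketch below simulates Map and Reduce in plain Python (no Hadoop cluster is assumed; the input documents are made up).

# Word count simulated in plain Python to illustrate the Map and
# Reduce phases; input documents are hypothetical.
from collections import defaultdict

def map_phase(document):
    # Map: emit an intermediate (word, 1) pair for each word.
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    # Reduce: aggregate all intermediate values with the same key.
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

documents = ["the cloud stores data", "the cloud scales"]
intermediate = []
for doc in documents:
    intermediate.extend(map_phase(doc))
print reduce_phase(intermediate)
#{'the': 2, 'cloud': 2, 'stores': 1, 'data': 1, 'scales': 1}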
Service Level Agreements (SLAs) are commonly of three types:
• Customer-based SLA
• Service-based SLA
• Multilevel SLA
Few Service Level Agreements are enforceable as contracts; most are
agreements or contracts more along the lines of an Operating Level
Agreement (OLA) and may not have the force of law. It is advisable to
have an attorney review the documents before making a major agreement
with a cloud service provider. Service Level Agreements usually
specify some parameters, which are mentioned below:
1. Availability of the Service (uptime)
2. Latency or the response time
3. Service components reliability
4. Accountability of each party
5. Warranties
Billing:
Cloud service providers offer a number of billing models described as
follows:
• Elastic Pricing
• In elastic pricing or pay-as-you-use pricing model, the customers are
charged based on the usage of cloud resources.
• Fixed Pricing
• In fixed pricing models, customers are charged a fixed amount per month
for the cloud resources.
• Spot Pricing
• Spot pricing models offer variable pricing for cloud resources which is
driven by market demand.
Compute Services
Compute Services – Amazon EC2
• Amazon Elastic Compute Cloud (EC2) is a compute service provided by
Amazon.
• Launching EC2 Instances
• To launch a new instance, click on the launch instance button. This opens
a wizard where you can select the Amazon Machine Image (AMI) with
which you want to launch the instance. You can also create your own
AMIs with custom applications, libraries and data. Instances can be
launched with a variety of operating systems.
• Instance Sizes
• When you launch an instance you specify the instance type (micro, small,
medium, large, extra-large, etc.), the number of instances to launch based
on the selected AMI and availability zones for the instances.
• Key-pairs
• When launching a new instance, the user selects a key-pair from existing
keypairs or creates a new keypair for the instance. Keypairs are used to
securely connect to an instance after it launches.
• Security Groups
• The security groups to be associated with the instance can be selected
from the instance launch wizard. Security groups are used to open or block
a specific network port for the launched instances.
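A minimal sketch of these steps using the boto library (the same library used in the Python examples later in these notes); the region, credentials, AMI-ID, key-pair and security group names are placeholders.

# Launching an EC2 instance using boto (sketch; credentials and
# names below are placeholders).
import boto.ec2

conn = boto.ec2.connect_to_region('us-east-1',
    aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY)

# Launch one instance from the selected AMI with the chosen
# instance type, key-pair and security group.
reservation = conn.run_instances('ami-xxxxxxxx',
    instance_type='t1.micro',
    key_name='my-keypair',
    security_groups=['my-security-group'])
instance = reservation.instances[0]
print instance.id, instance.state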
Storage Services:
• Cloud storage services allow storage and retrieval of any amount of data,
at any time from anywhere on the web.
• Most cloud storage services organize data into buckets or containers.
• Scalability
• Cloud storage services provide high capacity and scalability. Objects up to
several terabytes in size can be uploaded, and multiple buckets/containers
can be created on cloud storage.
• Replication
• When an object is uploaded it is replicated at multiple facilities and/or on
multiple devices within each facility.
• Access Policies
• Cloud storage services provide several security features such as Access
Control Lists (ACLs), bucket/container-level policies, etc. ACLs can be
used to selectively grant access permissions on individual objects.
Bucket/container-level policies can also be defined to allow or deny
permissions across some or all of the objects within a single
bucket/container.
• Encryption
• Cloud storage services provide Server Side Encryption (SSE) options to
encrypt all data stored in the cloud storage.
• Consistency
• Strong data consistency is provided for all upload and delete operations.
Therefore, any object that is uploaded can be immediately downloaded after
the upload is complete.
Storage Services – Amazon S3
• Amazon Simple Storage Service(S3) is an online cloud-based data storage
infrastructure for storing and retrieving any amount of data.
• S3 provides highly reliable, scalable, fast, fully redundant and affordable
storage infrastructure.
• Buckets
• Data stored on S3 is organized in the form of buckets. You must create a
bucket before you can store data on S3.
• Uploading Files to Buckets
• The S3 console provides simple wizards for creating a new bucket and
uploading files to buckets.
• You can upload any kind of file to S3.
• While uploading a file, you can specify the redundancy and encryption
options and access permissions.
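A minimal boto sketch of creating a bucket and uploading a file; the bucket and file names are placeholders.

# Creating an S3 bucket and uploading a file using boto (sketch;
# credentials, bucket name and file name are placeholders).
import boto
from boto.s3.key import Key

conn = boto.connect_s3(aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY)

# Buckets are the top-level containers for objects in S3.
bucket = conn.create_bucket('my-example-bucket')

# Each object in a bucket is identified by a key.
k = Key(bucket)
k.key = 'file.txt'
k.set_contents_from_filename('file.txt')
print "Uploaded %s to bucket %s" % (k.key, bucket.name)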
Database Services:
• Cloud database services allow you to set-up and operate relational or non-
relational databases in the cloud.
• Relational Databases
• Popular relational databases provided by various cloud service providers
include MySQL, Oracle, SQL Server, etc.
• Non-relational Databases
• The non-relational (No-SQL) databases provided by cloud service
providers are mostly proprietary solutions.
• Scalability
• Cloud database services allow provisioning as much compute and storage
resources as required to meet the application workload levels. Provisioned
capacity can be scaled up or down. For read-heavy workloads,
read-replicas can be created.
• Reliability
• Cloud database services are reliable and provide automated backup and
snapshot options.
• Performance
• Cloud database services provide guaranteed performance, with options
such as guaranteed input/output operations per second (IOPS) which can be
provisioned upfront.
• Security
• Cloud database services provide several security features to restrict access
to the database instances and stored data, such as network firewalls and
authentication mechanisms.
Database Services – Amazon RDS
• Amazon Relational Database Service (RDS) is a web service that makes
it easy to setup, operate and scale a relational database in the cloud.
• Launching DB Instances
• The console provides an instance launch wizard that allows you to select
the type of database to create (MySQL, Oracle or SQL Server), the
database instance size, allocated storage, DB instance identifier, and the
DB username and password. The status of launched DB instances can be
viewed from the console.
• Connecting to a DB Instance
• Once the instance is available, you can note the instance end point from
the instance properties tab. This end point can then be used for securely
connecting to the instance.
Application services:
Cloud computing has applications in almost all fields, such as business,
entertainment, data storage, social networking, management, education, art
and global positioning systems. Some of the widely known cloud
computing applications are discussed here:
Business Applications
Cloud computing has made businesses more collaborative and easy by
incorporating various apps such as MailChimp, Chatter, Google Apps
for business, and Quickbooks.
SN  Application                Description
1   MailChimp                  It offers an e-mail publishing platform. It is widely employed by businesses to design and send their e-mail campaigns.
2   Chatter                    The Chatter app helps employees share important information about the organization in real time. One can get an instant feed regarding any issue.
3   Google Apps for Business   Google offers creating text documents, spreadsheets, presentations, etc., on Google Docs, which allows business users to share them in a collaborative manner.
4   Quickbooks                 It offers online accounting solutions for a business. It helps in monitoring cash flow, creating VAT returns and creating business reports.
Data Storage and Backup Applications
SN  Application  Description
1   Box.com      Box.com offers a drag-and-drop service for files. Users simply drop their files into Box and can access them from anywhere.
2   Mozy         Mozy offers an online backup service for files to prevent data loss.
3   Joukuu       Joukuu is a web-based interface. It allows displaying a single list of contents for files stored in Google Docs, Box.net and Dropbox.
Management Applications
There are apps available for management task such as time tracking,
organizing notes. Applications performing such tasks are discussed
below:
SN  Application  Description
1   Toggl        It helps in tracking the time periods assigned to a particular project.
2   Evernote     It organizes sticky notes and can even read text from images, which helps the user locate the notes easily.
3   Outright     It is an accounting app. It helps to track income, expenses, profits and losses in real time.
Social Applications
There are several social networking services providing websites such as
Facebook, Twitter, etc.
SN  Application  Description
1   Facebook     It offers social networking services. One can share photos, videos, files, status updates and much more.
2   Twitter      It helps to interact with the public directly. One can follow any celebrity, organization or person on Twitter and get the latest updates about them.
Entertainment Applications
SN  Application  Description
1   AudioBox.fm  It offers a streaming service. The music files are stored online and can be played from the cloud using the service's own media player.
Art Applications
SN  Application  Description
1   Moo          It offers art services such as designing and printing business cards, postcards and mini cards.
Content Delivery Services:
• Cloud-based content delivery services include Content Delivery Networks
(CDNs).
• A CDN is a distributed system of servers located across multiple
geographic locations to serve content to end users with high availability
and high performance.
• CDNs are useful for serving static content such as text, images, scripts,
etc., and streaming media.
• CDNs have a number of edge locations deployed in multiple locations,
often over multiple backbones.
• Requests for static or streaming media content that is served by a CDN
are directed to the nearest edge location.
• Amazon CloudFront
• Amazon CloudFront is a content delivery service from Amazon.
CloudFront can be used to deliver dynamic, static and streaming content
using a global network of edge locations.
• Windows Azure Content Delivery Network
• Windows Azure Content Delivery Network (CDN) is the content delivery
service from Microsoft.
Analytics Services:
• Cloud-based analytics services allow analyzing massive data sets stored in
the cloud either in cloud storages or in cloud databases using programming
models such as MapReduce.
• Amazon Elastic MapReduce
• Amazon Elastic MapReduce (EMR) is the MapReduce service from
Amazon, based on the Hadoop framework running on Amazon EC2 and S3.
• EMR supports various job types such as Custom JAR, Hive programs,
Streaming jobs, Pig programs and HBase.
• Google MapReduce Service
• Google MapReduce Service is a part of the App Engine platform and can
be accessed using the Google MapReduce API.
• Google BigQuery
• Google BigQuery is a service for querying massive datasets. BigQuery
allows querying datasets using SQL-like queries.
• Windows Azure HDInsight
• Windows Azure HDInsight is an analytics service from Microsoft.
HDInsight deploys and provisions Hadoop clusters in the Azure cloud and
makes Hadoop available as a service.
Deployment & Management Services
• Cloud-based deployment & management services allow you to easily
deploy and manage applications in the cloud. These services automatically
handle deployment tasks such as capacity provisioning, load balancing,
auto-scaling, and application health monitoring.
• Amazon Elastic Beanstalk
• Amazon provides a deployment service called Elastic Beanstalk that
allows you to quickly deploy and manage applications in the AWS cloud.
• Elastic Beanstalk supports Java, PHP, .NET, Node.js, Python, and Ruby
applications.
• With Elastic Beanstalk you just need to upload the application and specify
configuration settings in a simple wizard, and the service automatically
handles instance provisioning, server configuration, load balancing and
monitoring.
• Amazon CloudFormation
• Amazon CloudFormation is a deployment management service from
Amazon.
• With CloudFormation you can create deployments from a collection of
AWS resources such as Amazon Elastic Compute Cloud, Amazon Elastic
Block Store, Amazon Simple Notification Service, Elastic Load Balancing
and Auto Scaling.
• A collection of AWS resources that you want to manage together are
organized into a stack.
Fig: OpenStack Architecture
2. Define virtualization.
5. Define scalability.
7. What is replication?
11. What are the various criteria for service level agreements?
12. What are the various layers in the cloud reference model?
13. What are the benefits of using a sandbox environment for a PaaS?
14. What is a push messaging service? What are its uses?
UNIT-II
• A functional HDFS filesystem has more than one DataNode, with data
replicated across them.
• DataNodes respond to requests from the NameNode for filesystem
operations.
• Client applications can talk directly to a DataNode, once the NameNode
has provided the location of the data.
• Similarly, MapReduce operations assigned to TaskTracker instances near
a DataNode talk directly to the DataNode to access the files.
• TaskTracker instances can be deployed on the same servers that host
DataNode instances, so that MapReduce operations are performed close to
the data.
MapReduce:
MapReduce job consists of two phases:
• Map: In the Map phase, data is read from a distributed file system and
partitioned among a set of computing nodes in the cluster. The data is sent
to the nodes as a set of key-value pairs. The Map tasks process the input
records independently of each other and produce intermediate results as
key-value pairs. The intermediate results are stored on the local disk of the
node running the Map task.
• Reduce: When all the Map tasks are completed, the Reduce phase begins
in which the intermediate data with the same key is aggregated.
• Optional Combine Task
• An optional Combine task can be used to perform data aggregation on the
intermediate data of the same key for the output of the mapper before
transferring the output to the Reduce task.
MapReduce job execution starts when the client applications submit jobs to
the JobTracker.
• The JobTracker returns a JobID to the client application. The JobTracker
talks to the NameNode to determine the location of the data.
• The JobTracker locates TaskTracker nodes with available slots at/or near
the data.
• The TaskTrackers send out heartbeat messages to the JobTracker, usually
every few minutes, to reassure the
JobTracker that they are still alive. These messages also inform the
JobTracker of the number of available
slots, so the JobTracker can stay up to date with where in the cluster, new
work can be delegated.
• The YARN architecture divides the two major functions of the JobTracker
- resource management and job life-cycle management - into separate
components:
• ResourceManager
• ApplicationMaster.
YARN Components
• Resource Manager (RM): RM manages the global assignment of
compute resources to applications. RM consists of two main services:
• Scheduler: Scheduler is a pluggable service that manages and enforces
the resource scheduling policy in the cluster.
• Applications Manager (AsM): AsM manages the running Application
Masters in the cluster. AsM is responsible for starting application masters
and for monitoring and restarting them on different nodes in case of
failures.
• Application Master (AM): A per-application AM manages the
application’s life cycle. AM is responsible for negotiating resources from
the RM and working with the NMs to execute and monitor the tasks.
• Node Manager (NM): A per-machine NM manages the user processes on
that machine.
• Containers: Container is a bundle of resources allocated by RM
(memory, CPU, network, etc.). A container is a conceptual entity
that grants an application the privilege to use a certain amount of resources
on a given machine to run a component task.
Hadoop Schedulers
• Hadoop scheduler is a pluggable component that makes it open to support
different scheduling algorithms.
• The default scheduler in Hadoop is FIFO.
• Two advanced schedulers are also available - the Fair Scheduler,
developed at Facebook, and the Capacity Scheduler, developed at Yahoo.
• The pluggable scheduler framework provides the flexibility to support a
variety of workloads with varying priority and performance constraints.
• Efficient job scheduling makes Hadoop a multi-tasking system that can
process multiple data sets for multiple jobs for multiple users
simultaneously.
FIFO Scheduler
• FIFO is the default scheduler in Hadoop that maintains a work queue in
which the jobs are queued.
• The scheduler pulls jobs in first in first out manner (oldest job first) for
scheduling.
• There is no concept of priority or size of job in FIFO scheduler.
Fair Scheduler
• The Fair Scheduler allocates resources evenly between multiple jobs and
also provides capacity guarantees.
• Fair Scheduler assigns resources to jobs such that each job gets an equal
share of the available resources on average over time.
• Task slots that are free are assigned to new jobs, so that each job gets
roughly the same amount of CPU time.
• Job Pools
• The Fair Scheduler maintains a set of pools into which jobs are placed.
Each pool has a guaranteed capacity.
• When there is a single job running, all the resources are assigned to that
job. When there are multiple jobs in the pools, each pool gets at least as
many task slots as guaranteed.
• Each pool receives at least the minimum share.
• When a pool does not require the guaranteed share the excess capacity is
split between other jobs.
• Fairness
• The scheduler computes periodically the difference between the
computing time received by each job and the time it should have received
in ideal scheduling.
• The job which has the highest deficit of the compute time received is
scheduled next.
Capacity Scheduler
• The Capacity Scheduler has similar functionality to the Fair Scheduler but
adopts a different scheduling philosophy.
• Queues
• In Capacity Scheduler, you define a number of named queues each with a
configurable number of map and reduce slots.
• Each queue is also assigned a guaranteed capacity.
• The Capacity Scheduler gives each queue its capacity when it contains
jobs, and shares any unused capacity between the queues. Within each
queue FIFO scheduling with priority is used.
• Fairness
• For fairness, it is possible to place a limit on the percentage of running
tasks per user, so that users share a cluster equally.
• A wait time for each queue can be configured. When a queue is not
scheduled for more than the wait time, it can preempt tasks of other queues
to get its fair share.
Follow the steps given below to have Hadoop Multi-Node cluster setup.
Installing Java
Java is the main prerequisite for Hadoop. First of all, you should verify the
existence of java in your system using “java -version”. The syntax of java version
command is given below.
$ java -version
If everything works fine it will give you the following output.
java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)
If java is not installed in your system, then follow the given steps for installing
java.
Step 1
Download java (JDK - X64.tar.gz) by visiting the following link
http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-
1880260.html Then jdk-7u71-linux-x64.tar.gz will be downloaded into your
system.
Step 2
Generally you will find the downloaded java file in the Downloads folder.
Verify it and extract the jdk-7u71-linux-x64.gz file using the following
commands.
$ cd Downloads/
$ ls
jdk-7u71-Linux-x64.gz
$ tar zxf jdk-7u71-Linux-x64.gz
$ ls
jdk1.7.0_71 jdk-7u71-Linux-x64.gz
Step 3
To make java available to all the users, you have to move it to the location
“/usr/local/”. Open the root, and type the following commands.
$ su
password:
# mv jdk1.7.0_71 /usr/local/
# exit
Step 4
For setting up PATH and JAVA_HOME variables, add the following commands
to ~/.bashrc file.
export JAVA_HOME=/usr/local/jdk1.7.0_71
export PATH=$PATH:$JAVA_HOME/bin
Now verify the java -version command from the terminal as explained above.
Follow the above process and install java in all your cluster nodes.
Creating User Account
Create a system user account on both master and slave systems to use the Hadoop
installation.
# useradd hadoop
# passwd hadoop
Mapping the nodes
You have to edit the hosts file in the /etc/ folder on all nodes and specify
the IP address of each system followed by its host name.
# vi /etc/hosts
enter the following lines in the /etc/hosts file.
192.168.1.109 hadoop-master
192.168.1.145 hadoop-slave-1
192.168.56.1 hadoop-slave-2
Configuring Key Based Login
Setup ssh in every node such that they can communicate with one another without
any prompt for password.
# su hadoop
$ ssh-keygen -t rsa
$ ssh-copy-id -i ~/.ssh/id_rsa.pub tutorialspoint@hadoop-master
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop_tp1@hadoop-slave-1
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop_tp2@hadoop-slave-2
$ chmod 0600 ~/.ssh/authorized_keys
$ exit
Installing Hadoop
Download and install Hadoop using the following commands.
# mkdir /opt/hadoop
# cd /opt/hadoop/
# wget http://apache.mesi.com.ar/hadoop/common/hadoop-1.2.1/hadoop-1.2.0.tar.gz
# tar -xzf hadoop-1.2.0.tar.gz
# mv hadoop-1.2.0 hadoop
# chown -R hadoop /opt/hadoop
# cd /opt/hadoop/hadoop/
Configuring Hadoop
You have to configure Hadoop server by making the following changes as
given below.
core-site.xml
Open the core-site.xml file and edit it as shown below.
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://hadoop-master:9000/</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>
Starting Hadoop Services
The following command is to start all the Hadoop services on the Hadoop-
Master.
$ cd $HADOOP_HOME/sbin
$ start-all.sh
• Performance
• Applications should be designed while keeping the performance
requirements in mind.
Reference Architectures –Content delivery apps
• Figure shows a typical deployment architecture for content delivery
applications such as online photo albums, video webcasting, etc.
• Both relational and non-relational data stores are shown in this
deployment.
• A content delivery network (CDN) which consists of a global network of
edge locations is used for media delivery.
• CDN is used to speed up the delivery of static content such as images and
videos.
• Data analysis jobs (such as MapReduce jobs) are submitted to the
analytics tier from the application servers.
• The jobs are queued for execution, and upon completion the analyzed data
is presented from the application servers.
• WSDL is an XML-based web services description language that is used to
create service descriptions containing information on the functions
performed by a service and the inputs and outputs of the service.
SOA Layers
• Business Systems
• This layer consists of custom built applications and legacy systems such
as Enterprise Resource Planning (ERP), Customer Relationship
Management (CRM), Supply Chain Management (SCM), etc.
• Service Components
• The service components allow the layers above to interact with the
business systems. The service components are responsible for realizing the
functionality of the services exposed.
• Composite Services
• These are coarse-grained services which are composed of two or more
service components. Composite services can be used to create
enterprise-scale components or business-unit-specific components.
• Orchestrated Business Processes
• Composite services can be orchestrated to create higher-level business
processes. In this layer the compositions and orchestrations of the
composite services are defined to create business processes.
• Presentation Services
• This is the topmost layer, which includes the user interfaces that expose
the services and the orchestrated business processes to the users.
• Enterprise Service Bus
• This layer integrates the services through adapters, routing,
transformation and messaging mechanisms.
• CCM is an architectural approach for cloud applications that is not tied to
any specific programming language or cloud platform.
• Cloud applications designed with CCM approach can have innovative
hybrid deployments in which different components of an application can be
deployed on cloud infrastructure and platforms of different cloud vendors.
• Applications designed using CCM have better portability and
interoperability.
• CCM based applications have better scalability by decoupling application
components and providing asynchronous communication mechanisms.
CCM Application Design Methodology
• CCM approach for application design involves:
• Component Design
• Architecture Design
• Deployment Design
CCM Component Design:
• Cloud Component Model is created for the application based on
comprehensive analysis of the application’s functions and building blocks.
• Cloud component model allows identifying the building blocks of a cloud
application which are classified based on the functions performed and type
of cloud resources required.
• Each building block performs a set of actions to produce the desired
outputs for other components.
• Each component takes specific inputs, performs a predefined set of
actions and produces the desired outputs.
• Components offer their functions as services through a functional
interface which can be used by other components.
• Components report their performance to a performance database through a
performance interface.
• With this flexibility in application design and deployment, application
developers can ensure that applications meet the performance and cost
requirements under changing contexts.
SOA vs CCM
Similarities:
• Standardization & Reuse: SOA advocates principles of reuse and a
well-defined relationship between service provider and service consumer.
CCM is based on reusable components which can be used by multiple
cloud applications.
• Loose coupling: SOA is based on loosely coupled services that minimize
dependencies. CCM is based on loosely coupled components that
communicate asynchronously.
• Statelessness: SOA services minimize resource consumption by deferring
the management of state information. CCM components are stateless; state
is stored outside of the components.
• Relations
• A relational database has a collection of relations (or tables). A relation is
a set of tuples (or rows).
• Schema
• Each relation has a fixed schema that defines the set of attributes (or
columns in a table) and the constraints on the attributes.
• Tuples
• Each tuple in a relation has the same attributes (columns). The tuples in a
relation can have any order and the relation is not sensitive to the ordering
of the tuples.
• Attributes
• Each attribute has a domain, which is the set of possible values for the
attribute.
• Insert/Update/Delete
• Relations can be modified using insert, update and delete operations.
Every relation has a primary key that uniquely identifies each tuple in the
relation.
• Primary Key
• An attribute can be made a primary key if it does not have repeated values
in different tuples.
ACID Guarantees:
• Relational databases provide ACID guarantees.
• Atomicity
• Atomicity property ensures that each transaction is either “all or nothing”.
An atomic transaction ensures that all parts of the transaction complete or
the database state is left unchanged.
• Consistency
• Consistency property ensures that each transaction brings the database
from one valid state to another. In other words, the data in a database
always conforms to the defined schema and constraints.
• Isolation
• Isolation property ensures that the database state obtained after a set of
concurrent transactions is the same as would have been if the transactions
were executed serially. This provides concurrency control, i.e. the results of
incomplete transactions are not visible to other transactions. The
transactions are isolated from each other until they finish.
• Durability
• Durability property ensures that once a transaction is committed, the data
remains as it is, i.e. it is not affected by system outages such as power loss.
Durability guarantees that the database can keep track of changes and can
recover from abnormal terminations.
Non-Relational Databases
• Non-relational databases (or popularly called No-SQL databases) are
becoming popular with the growth of cloud computing.
• Non-relational databases have better horizontal scaling capability and
improved performance for big data at the cost of less rigorous consistency
models.
• Unlike relational databases, non-relational databases do not provide ACID
guarantees.
• Most non-relational databases offer “eventual” consistency, which means
that given a sufficiently long period of time over which no updates are
made, all updates can be expected to propagate eventually through the
system and the replicas will be consistent.
• The driving force behind the non-relational databases is the need for
databases that can achieve high scalability, fault tolerance and availability.
• These databases can be distributed on a large cluster of machines. Fault
tolerance is provided by storing multiple replicas of data on different
machines.
Non-Relational Databases – Types
• Key-value store
• Key-value store databases are suited for applications that require storing
unstructured data without a fixed schema. Most key-value stores have
support for native programming language data types.
• Document store
• Document store databases store semi-structured data in the form of
documents which are encoded in different
standards such as JSON, XML, BSON, YAML, etc.
• Graph store
• Graph stores are designed for storing data that has graph structure (nodes
and edges). These solutions are suitable for applications that involve graph
data such as social networks, transportation systems, etc.
• Object store
• Object store solutions are designed for storing data in the form of objects
defined in an object-oriented programming language.
Python Basics
Python:
Python is a general-purpose, high-level programming language, suitable
for providing the reader a solid foundation in the area of cloud computing.
• The main characteristics of Python are:
• Multi-paradigm programming language
• Python supports more than one programming paradigms including object-
oriented programming and structured programming
• Interpreted Language
• Python is an interpreted language and does not require an explicit
compilation step. The Python interpreter executes the program source code
directly, statement by statement, as a processor or scripting engine does.
• Interactive Language
• Python provides an interactive mode in which the user can submit
commands at the Python prompt and interact with the interpreter directly.
Python – Benefits
• Easy-to-learn, read and maintain
• Python is a minimalistic language with relatively few keywords, uses
English keywords and has fewer syntactical constructions as compared to
other languages. Reading Python programs feels like English with pseudo-
code like constructs. Python is easy to learn yet an extremely powerful
language for a wide range of applications.
• Object and Procedure Oriented
• Python supports both procedure-oriented programming and
object-oriented programming. The procedure-oriented paradigm allows
programs to be written around procedures or functions that allow reuse of
code. The object-oriented paradigm allows programs to be written around
objects that include both data and functionality.
• Extendable
• Python is an extendable language and allows integration of low-level
modules written in languages such as C/C++. This is useful when you want
to speed up a critical portion of a program.
• Scalable
• Due to the minimalistic nature of Python, it provides a manageable
structure for large programs.
• Portable
• Since Python is an interpreted language, programmers do not have to
worry about compilation, linking and loading of programs. Python
programs can be directly executed from source.
• Broad Library Support
• Python has broad library support and works on various platforms such as
Windows, Linux, Mac, etc.
Python – Setup
• Windows
• Python binaries for Windows can be downloaded from
http://www.python.org/getit.
• For the examples and exercise in this book, you would require Python 2.7
which can be directly downloaded from:
http://www.python.org/ftp/python/2.7.5/python-2.7.5.msi
• Once the python binary is installed you can run the python shell at the
command prompt using > python
• Linux
#Install Dependencies
sudo apt-get install build-essential
sudo apt-get install libreadline-gplv2-dev libncursesw5-dev libssl-dev
libsqlite3-dev tk-dev libgdbm-dev libc6-dev libbz2-dev
#Download Python
wget http://python.org/ftp/python/2.7.5/Python-2.7.5.tgz
tar -xvf Python-2.7.5.tgz
cd Python-2.7.5
#Install Python
./configure
make
sudo make install
Python Data Types
The Python data types are as follows:
Numbers
Strings
Lists
Tuples
Dictionaries
Type Conversions
Numbers
Number data type is used to store numeric values. Numbers are
immutable data types, therefore changing the value of a number data
type results in a newly allocated object.
Eg1.
#Integer
>>>a=5
>>>type(a)
<type 'int'>
#Floating Point
>>>b=2.5
>>>type(b)
<type 'float'>
#Long
>>>x=9898878787676L
>>>type(x)
<type 'long'>
#Complex
>>>y=2+5j
>>>y
(2+5j)
>>>type(y)
<type 'complex'>
>>>y.real
2.0
>>>y.imag
5.0
Eg2.
#Addition
>>>c=a+b
>>>c
7.5
>>>type(c)
<type 'float'>
#Subtraction
>>>d=a-b
>>>d
2.5
>>>type(d)
<type 'float'>
#Multiplication
>>>e=a*b
>>>e
12.5
>>>type(e)
<type 'float'>
Eg3:
#Division
>>>f=b/a
>>>f
0.5
>>>type(f)
<type 'float'>
#Power
>>>g=a**2
>>>g
25
Strings
• A string is simply a list of characters in order. There are no limits to the
number of characters you can have in a string.
Eg1:
#Create string
>>>s="Hello World!"
>>>type(s)
<type 'str'>
#String concatenation
>>>t="This is sample program."
>>>r = s+t
>>>r
'Hello World!This is sample program.'
#Get length of string
>>>len(s)
12
#Convert string to integer
>>>x="100"
>>>type(x)
<type 'str'>
>>>y=int(x)
>>>y
100
Eg2:
#Print string
>>>print s
Hello World!
#Formatting output
>>>print "The string (%s) has %d characters" % (s, len(s))
The string (Hello World!) has 12 characters
#Convert to upper/lower case
>>>s.upper()
'HELLO WORLD!'
>>>s.lower()
'hello world!'
#Accessing sub-strings
>>>s[0]
'H'
>>>s[6:]
'World!'
>>>s[6:-1]
'World'
Lists
• A list is a compound data type used to group together other values. List
items need not all have the same type. A list contains items separated by
commas and enclosed within square brackets.
Eg1:
#Create List
>>>fruits=['apple','orange','banana','mango']
>>>type(fruits)
<type 'list'>
#Get Length of List
>>>len(fruits)
4
#Access List Elements
>>>fruits[1]
'orange'
>>>fruits[1:3]
['orange', 'banana']
>>>fruits[1:]
['orange', 'banana', 'mango']
#Appending an item to a list
>>>fruits.append('pear')
>>>fruits
['apple', 'orange', 'banana', 'mango', 'pear']
Eg2:
#Removing an item from a list
>>>fruits.remove('mango')
>>>fruits
['apple', 'orange', 'banana', 'pear']
#Inserting an item to a list
>>>fruits.insert(1,'mango')
>>>fruits
['apple', 'mango', 'orange', 'banana', 'pear']
#Combining lists
>>>vegetables=['potato','carrot','onion','beans','radish']
>>>vegetables
['potato', 'carrot', 'onion', 'beans', 'radish']
>>>eatables=fruits+vegetables
>>>eatables
['apple', 'mango', 'orange', 'banana', 'pear', 'potato', 'carrot', 'onion', 'beans', 'radish']
Tuples
• A tuple is a sequence data type that is similar to the list. A tuple consists
of a number of values separated by commas and enclosed within
parentheses. Unlike lists, the elements of tuples cannot be changed, so
tuples can be thought of as read-only lists.
Eg1:
#Create a Tuple
>>>fruits=("apple","mango","banana","pineapple")
>>>fruits
('apple', 'mango', 'banana', 'pineapple')
>>>type(fruits)
<type 'tuple'>
#Get length of tuple
>>>len(fruits)
4
Eg2:
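A short sketch showing that tuple elements, unlike list elements, cannot be modified:
#Attempting to modify a tuple raises an error
>>>fruits[0]="pear"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'tuple' object does not support item assignment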
Dictionaries
• Dictionary is a mapping data type or a kind of hash table that maps keys
to values. Keys in a dictionary can be of any data type, though numbers and
strings are commonly used for keys. Values in a dictionary can be any data
type or object.
Eg1:
#Create a dictionary
>>>student={'name':'Mary','id':'8776','major':'CS'}
>>>student
{'major': 'CS', 'name': 'Mary', 'id': '8776'}
>>>type(student)
<type 'dict'>
#Get length of a dictionary
>>>len(student)
3
#Get the value of a key in dictionary
>>>student['name']
'Mary'
#Get all items in a dictionary
>>>student.items()
[('major', 'CS'), ('name', 'Mary'), ('id', '8776')]
Type conversion
Using Python, we can easily convert data into different types. There are
different functions for Type Conversion. We can convert string type objects
to numeric values, perform conversion between different container types
etc.
Eg1:
#Convert to string
>>>a=10000
>>>str(a)
'10000'
#Convert to int
>>>b="2013"
>>>int(b)
2013
#Convert to float
>>>float(b)
2013.0
Eg2:
>>>long(b)
2013L
#Convert to list
>>>s="aeiou"
>>>list(s)
['a', 'e', 'i', 'o', 'u']
#Convert to set
>>>x=['mango','apple','banana','mango','banana']
>>>set(x)
set(['mango', 'apple', 'banana'])
Eg1:
>>>a = 25**5
>>>if a>10000:
    print "More"
else:
    print "Less"
More
Eg2:
>>>if a>10000:
    if a<1000000:
        print "Between 10k and 100k"
    else:
        print "More than 100k"
elif a==10000:
    print "Equal to 10k"
else:
    print "Less than 10k"
More than 100k
Eg1:
#Looping over characters in a string
helloString = "Hello World"
for c in helloString:
    print c
Eg2:
#Looping over items in a list
fruits=['apple','orange','banana','mango']
i=0
for item in fruits:
    print "Fruit-%d: %s" % (i,item)
    i=i+1
Eg3:
#Looping over keys in a dictionary
student = {'name': 'Mary', 'id': '8776', 'gender': 'female', 'major': 'CS'}
for key in student:
    print "%s: %s" % (key,student[key])
Eg1:
#Generate a list of numbers from 0 - 9
>>>range(10)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Eg2:
#Generate a list of numbers from 10 - 100 with increments of 10
>>>range(10,110,10)
[10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
Eg1:
students = { '1': {'name': 'Bob', 'grade': 2.5},
             '2': {'name': 'Mary', 'grade': 3.5},
             '3': {'name': 'David', 'grade': 4.2},
             '4': {'name': 'John', 'grade': 4.1},
             '5': {'name': 'Alex', 'grade': 3.8}}

def averageGrade(students):
    "This function computes the average grade"
    sum = 0.0
    for key in students:
        sum = sum + students[key]['grade']
    average = sum/len(students)
    return average

avg = averageGrade(students)
print "The average grade is: %0.2f" % (avg)
Eg1:
>>>def displayFruits(fruits=['apple','orange']):
    print "There are %d fruits in the list" % (len(fruits))
    for item in fruits:
        print item

#Using default arguments
>>>displayFruits()
There are 2 fruits in the list
apple
orange
>>>fruits = ['banana', 'pear', 'mango']
>>>displayFruits(fruits)
There are 3 fruits in the list
banana
pear
mango
Eg1:
>>>def displayFruits(fruits):
    print "There are %d fruits in the list" % (len(fruits))
    for item in fruits:
        print item
    print "Adding one more fruit"
    fruits.append('mango')

>>>fruits = ['banana', 'pear', 'apple']
>>>displayFruits(fruits)
There are 3 fruits in the list
banana
pear
apple
Adding one more fruit
#The list passed to the function is modified in place
>>>print "There are %d fruits in the list" % (len(fruits))
There are 4 fruits in the list
Functions - Keyword Arguments
• Functions can also be called using keyword arguments that identifies the
arguments by the parameter name when the function is called.
Eg1:
>>>def printStudentRecords(name,age=20,major='CS'):
    print "Name: " + name
    print "Age: " + str(age)
    print "Major: " + major

#This will give an error as name is a required argument
>>>printStudentRecords()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: printStudentRecords() takes at least 1 argument (0 given)
Eg2:
#Correct use
>>>printStudentRecords(name='Alex')
Name: Alex
Age: 20
Major: CS
>>>printStudentRecords(name='Bob',age=22,major='ECE')
Name: Bob
Age: 22
Major: ECE
>>>printStudentRecords(name='Alan',major='ECE')
Name: Alan
Age: 20
Major: ECE
Eg3:
#name is a formal argument.
#**kwargs is a keyword argument that receives all arguments
#except the formal argument as a dictionary.
>>>def student(name, **kwargs):
    print "Student Name: " + name
    for key in kwargs:
        print key + ': ' + kwargs[key]

>>>student(name='Bob', age='20', major = 'CS')
Student Name: Bob
age: 20
major: CS
Modules
• Python allows organizing the program code into different modules which
improves the code readability and management.
• A module is a Python file that defines some functionality in the form of
functions or classes.
73
• Modules can be imported using the import keyword.
• Modules to be imported must be present in the search path.
Eg1:
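A minimal sketch of defining and importing a module; the module name and function are made up for illustration.
#mymodule.py - a file defining one function
def sayHello(name):
    print "Hello " + name

#Importing and using the module (mymodule.py must be in the search path)
>>>import mymodule
>>>mymodule.sayHello('Alex')
Hello Alex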
Packages
• A package is a hierarchical directory structure consisting of modules and
subpackages. A directory is treated as a package when it contains an
__init__.py file.
Eg1:
# skimage package listing
skimage/              Top level package
    __init__.py       Treat directory as a package
    color/            color subpackage
        __init__.py
        colorconv.py
        colorlabel.py
        rgb_colors.py
    draw/             draw subpackage
        __init__.py
        draw.py
        setup.py
    exposure/         exposure subpackage
        __init__.py
        _adapthist.py
        exposure.py
    feature/          feature subpackage
        __init__.py
        _brief.py
        _daisy.py
File Handling
• Python allows reading and writing to files using the file object.
• The open(filename, mode) function is used to get a file object.
• The mode can be read (r), write (w), append (a), read and write (r+ or
w+), read-binary (rb), write-binary (wb), etc.
• After the file contents have been read the close function is called which
closes the file object.
Eg1:
# Example of reading an entire file
>>>fp = open('file.txt','r')
>>>content = fp.read()
>>>print content
This is a test file.
>>>fp.close()
Eg2:
# Example of reading line by line
>>>fp = open('file1.txt','r')
>>>print "Line-1: " + fp.readline()
Line-1: Python supports more than one programming paradigms.
>>>print "Line-2: " + fp.readline()
Line-2: Python is an interpreted language.
>>>fp.close()
Eg3:
# Example of writing to a file
>>>fo = open('file1.txt','w')
>>>content='This is an example of writing to a file in Python.'
>>>fo.write(content)
>>>fo.close()
Date/Time Operations
• Python provides several functions for date and time access and
conversions.
• The datetime module allows manipulating date and time in several ways.
• The time module in Python provides various time-related functions.
Eg1:
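A minimal sketch of common date/time operations; the printed values are illustrative.
#Current date and time using the datetime module
>>>from datetime import datetime
>>>now = datetime.now()
>>>print now.strftime("%Y-%m-%d %H:%M:%S")
2013-07-01 10:30:00
#Current time in seconds since the epoch using the time module
>>>import time
>>>time.time()
1372669800.0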
Classes
• Python is an Object-Oriented Programming (OOP) language. Python
provides all the standard features of Object Oriented Programming such as
classes, class variables, class methods, inheritance, function overloading,
and operator overloading.
Class
• A class is simply a representation of a type of object and user-defined
prototype for an object that is composed of three things: a name, attributes,
and operations/methods.
Instance/Object
• Object is an instance of the data structure defined by a class.
Inheritance
• Inheritance is the process of forming a new class from an existing class or
base class.
Function overloading
• Function overloading is a form of polymorphism that allows a function to
have different meanings, depending on its context.
Operator overloading
• Operator overloading is a form of polymorphism that allows assignment
of more than one function to a particular operator.
Function overriding
• Function overriding allows a child class to provide a specific
implementation of a function that is already provided by the base class.
Child class implementation of the overridden function has the same name,
parameters and return type as the function in the base class.
Class Example
The variable studentCount is a class variable that is shared by all instances
of the class
Student and is accessed by Student.studentCount.
• The variables name, id and grades are instance variables which are
specific to each instance of the class.
• There is a special method by the name __init__() which is the class
constructor.
• The class constructor initializes a new instance when it is created. The
function __del__() is the class destructor.
Eg1:
# Example of a class
class Student:
    studentCount = 0

    def __init__(self, name, id):
        print "Constructor called"
        self.name = name
        self.id = id
        Student.studentCount = Student.studentCount + 1
        self.grades={}

    def __del__(self):
        print "Destructor called"

    def getStudentCount(self):
        return Student.studentCount

    def addGrade(self,key,value):
        self.grades[key]=value

    def getGrade(self,key):
        return self.grades[key]

    def printGrades(self):
        for key in self.grades:
            print key + ": " + self.grades[key]

>>>s = Student('Steve','98928')
Constructor called
>>>s.addGrade('Math','90')
>>>s.addGrade('Physics','85')
>>>s.printGrades()
Physics: 85
Math: 90
>>>mathgrade = s.getGrade('Math')
>>>print mathgrade
90
>>>count = s.getStudentCount()
>>>print count
1
>>>del s
Destructor called
Class Inheritance
• In this example Shape is the base class and Circle is the derived class. The
class Circle inherits the attributes of the Shape class.
• The child class Circle overrides the methods and attributes of the base
class (e.g. the draw() function defined in the base class Shape is overridden
in the child class Circle).
Eg1:
# Example of class inheritance
class Shape:
    def __init__(self):
        print "Base class constructor"
        self.color = 'Green'
        self.lineWeight = 10.0

    def draw(self):
        print "Draw - to be implemented"

    def setColor(self, c):
        self.color = c

    def getColor(self):
        return self.color

    def setLineWeight(self,lwt):
        self.lineWeight = lwt

    def getLineWeight(self):
        return self.lineWeight

class Circle(Shape):
    def __init__(self, c, r):
        print "Child class constructor"
        self.center = c
        self.radius = r
        self.color = 'Green'
        self.lineWeight = 10.0
        self.__label = 'Hidden circle label'

    def setCenter(self,c):
        self.center = c

    def getCenter(self):
        return self.center

    def setRadius(self,r):
        self.radius = r

    def getRadius(self):
        return self.radius

    def draw(self):
        print "Draw Circle (overridden function)"

class Point:
    def __init__(self, x, y):
        self.xCoordinate = x
        self.yCoordinate = y

    def setXCoordinate(self,x):
        self.xCoordinate = x

    def getXCoordinate(self):
        return self.xCoordinate

    def setYCoordinate(self,y):
        self.yCoordinate = y

    def getYCoordinate(self):
        return self.yCoordinate

>>>p = Point(2,4)
>>>circ = Circle(p,7)
Child class constructor
>>>circ.getColor()
'Green'
>>>circ.setColor('Red')
>>>circ.getColor()
'Red'
>>>circ.getLineWeight()
10.0
>>>circ.getCenter().getXCoordinate()
2
>>>circ.getCenter().getYCoordinate()
4
>>>circ.draw()
Draw Circle (overridden function)
>>>circ.radius
7
9. Define Module.
4. Write an algorithm to accept two numbers, compute the sum and print
the result. Evaluate.
5. Explain Reference Architecture for Cloud Applications.
UNIT-III
Amazon AutoScaling – Python Example
• AutoScaling Service
• A connection to AutoScaling service is first established by calling
boto.ec2.autoscale.connect_to_region function.
• Launch Configuration
• After connecting to the AutoScaling service, a new launch configuration
is created by calling conn.create_launch_configuration. A launch
configuration contains instructions on how to launch new instances,
including the AMI-ID, instance type, security groups, etc.
• AutoScaling Group
• After creating a launch configuration, it is then associated with a new
AutoScaling group. AutoScaling group is created by calling
conn.create_auto_scaling_group. The settings for AutoScaling group such
as the maximum and minimum number of instances in the group, the launch
configuration, availability zones, optional load balancer to use with the
group, etc.
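Eg:- A minimal sketch of these calls with the boto 2.x library; the region, credentials, AMI-ID and group names are illustrative placeholders, not values from the notes.

import boto.ec2.autoscale
from boto.ec2.autoscale import LaunchConfiguration, AutoScalingGroup

# Connect to the AutoScaling service
conn = boto.ec2.autoscale.connect_to_region('us-east-1',
    aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY)

# Launch configuration: how new instances should be launched
lc = LaunchConfiguration(name='my-launch-config',
    image_id='ami-12345678',            # AMI-ID (placeholder)
    instance_type='t1.micro',
    security_groups=['my-security-group'])
conn.create_launch_configuration(lc)

# AutoScaling group: zones, min/max instances and the launch configuration
group = AutoScalingGroup(group_name='My-Group',
    availability_zones=['us-east-1a'],
    launch_config=lc, min_size=1, max_size=4)
conn.create_auto_scaling_group(group)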
• AutoScaling Policies
• After creating an AutoScaling group, the policies for scaling up and scaling down are defined.
• In this example, a scale-up policy with adjustment type ChangeInCapacity and scaling_adjustment = 1 is defined.
• Similarly, a scale-down policy with adjustment type ChangeInCapacity and scaling_adjustment = -1 is defined.
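Eg:- A minimal sketch of the two policies with boto 2.x (group and policy names are placeholders); the policies are re-fetched so that policy_arn is populated for the CloudWatch alarms below.

from boto.ec2.autoscale import ScalingPolicy

# Scale-up policy: add one instance
scale_up_policy = ScalingPolicy(name='scale_up',
    adjustment_type='ChangeInCapacity',
    as_name='My-Group', scaling_adjustment=1, cooldown=180)
conn.create_scaling_policy(scale_up_policy)

# Scale-down policy: remove one instance
scale_down_policy = ScalingPolicy(name='scale_down',
    adjustment_type='ChangeInCapacity',
    as_name='My-Group', scaling_adjustment=-1, cooldown=180)
conn.create_scaling_policy(scale_down_policy)

# Re-fetch the policies to obtain their ARNs
scale_up_policy = conn.get_all_policies(
    as_group='My-Group', policy_names=['scale_up'])[0]
scale_down_policy = conn.get_all_policies(
    as_group='My-Group', policy_names=['scale_down'])[0]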
# Connecting to CloudWatch (boto 2.x)
import boto.ec2.cloudwatch
from boto.ec2.cloudwatch import MetricAlarm

cloudwatch = boto.ec2.cloudwatch.connect_to_region(REGION,
    aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY)

# Scope the alarms to the AutoScaling group via this dimension
alarm_dimensions = {"AutoScalingGroupName": 'My-Group'}

# Creating scale-up alarm: fires when average CPU > 70% for two 60s periods
scale_up_alarm = MetricAlarm(
    name='scale_up_on_cpu', namespace='AWS/EC2',
    metric='CPUUtilization', statistic='Average',
    comparison='>', threshold='70',
    period='60', evaluation_periods=2,
    alarm_actions=[scale_up_policy.policy_arn],
    dimensions=alarm_dimensions)
cloudwatch.create_alarm(scale_up_alarm)

# Creating scale-down alarm: fires when average CPU < 40% for two 60s periods
scale_down_alarm = MetricAlarm(
    name='scale_down_on_cpu', namespace='AWS/EC2',
    metric='CPUUtilization', statistic='Average',
    comparison='<', threshold='40',
    period='60', evaluation_periods=2,
    alarm_actions=[scale_down_policy.policy_arn],
    dimensions=alarm_dimensions)
cloudwatch.create_alarm(scale_down_alarm)
Amazon RDS – Python Example
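Eg:- The notes do not reproduce this example; the following is an illustrative sketch with the boto 2.x RDS module, with all identifiers and credentials as placeholders.

import boto.rds

# Connect to the RDS service
conn = boto.rds.connect_to_region('us-east-1',
    aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY)

# Launch a database instance: 10 GB of storage, a small instance class
db = conn.create_dbinstance(id='mydbinstance',
    allocated_storage=10,
    instance_class='db.m1.small',
    master_username=DB_USER,
    master_password=DB_PASSWORD)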
Amazon DynamoDB – Python Example
• In this example, a connection to the DynamoDB service is first established by calling boto.dynamodb.connect_to_region.
• After connecting to the DynamoDB service, a schema for the new table is created by calling conn.create_schema.
• The schema includes the hash key and range key names and types.
• A DynamoDB table is then created by calling the conn.create_table function with the table schema, read units and write units as input parameters.
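Eg:- An illustrative sketch of these calls with boto 2.x (region, credentials, key names and the table name are placeholders).

import boto.dynamodb

# Connect to the DynamoDB service
conn = boto.dynamodb.connect_to_region('us-east-1',
    aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY)

# Schema with hash key and range key names and types
schema = conn.create_schema(hash_key_name='forum_name',
    hash_key_proto_value=str,
    range_key_name='subject',
    range_key_proto_value=str)

# Create the table with provisioned read and write capacity units
table = conn.create_table(name='messages', schema=schema,
    read_units=10, write_units=10)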
Google Compute Engine – Python Example
• After completing the OAuth authorization, an instance of the Google
Compute Engine service is obtained.
• To launch a new instance the instances().insert method of the Google
Compute Engine API is used.
• The request body to this method contains the properties such as instance
name, machine type, zone, network interfaces, etc., specified in JSON
format.
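Eg:- An illustrative sketch using the google-api-python-client library; the credentials object is assumed to come from the OAuth step above, and the project, zone and instance properties are placeholders.

from googleapiclient import discovery

# Instance of the Google Compute Engine service
compute = discovery.build('compute', 'v1', credentials=credentials)

# Instance properties specified in JSON/dict form
config = {
    'name': 'my-instance',
    'machineType': 'zones/us-central1-a/machineTypes/n1-standard-1',
    'disks': [{'boot': True, 'initializeParams': {
        'sourceImage': 'projects/debian-cloud/global/images/family/debian-11'}}],
    'networkInterfaces': [{'network': 'global/networks/default'}],
}

# Launch the new instance
response = compute.instances().insert(project='my-project',
    zone='us-central1-a', body=config).execute()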
Google Cloud SQL – Python Example
• To launch a new database instance, the instances().insert method of the Google Cloud SQL API is used. The request body of this method contains properties such as instance, project, tier, pricingPlan and replicationType.
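Eg:- An illustrative sketch with the google-api-python-client library; the v1beta4 API version and all property values here are assumptions, not values from the notes.

from googleapiclient import discovery

# Instance of the Google Cloud SQL administration service
sql = discovery.build('sqladmin', 'v1beta4', credentials=credentials)

# Request body with instance, project, tier, pricingPlan and replicationType
body = {
    'name': 'my-sql-instance',
    'project': 'my-project',
    'settings': {
        'tier': 'db-n1-standard-1',
        'pricingPlan': 'PER_USE',
        'replicationType': 'SYNCHRONOUS',
    },
}
response = sql.instances().insert(project='my-project', body=body).execute()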
• XML
• XML (Extensible Markup Language) is a data format for structured
document interchange. The Python minidom library provides a minimal
implementation of the Document Object Model interface and has an API
similar to that in other languages.
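Eg:- A short illustrative use of minidom to parse an XML string (the XML data is made up):

from xml.dom import minidom

doc = minidom.parseString('<devices><device id="12345"/></devices>')
for device in doc.getElementsByTagName('device'):
    print(device.getAttribute('id'))      # prints: 12345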
• HTTPLib & URLLib
• HTTPLib2 and URLLib2 are Python libraries used in network/internet programming.
• SMTPLib
• Simple Mail Transfer Protocol (SMTP) is a protocol which handles
sending email and routing e-mail between mail servers. The
Python smtplib module provides an SMTP client session object that can be
used to send email.
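Eg:- A minimal illustrative sketch (the server name, port and addresses are placeholders):

import smtplib

server = smtplib.SMTP('smtp.example.com', 587)
server.starttls()                       # switch to an encrypted connection
server.login('user', 'password')
server.sendmail('from@example.com', ['to@example.com'],
                'Subject: Test\n\nHello from smtplib')
server.quit()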
• NumPy
• NumPy is a package for scientific computing in Python. NumPy provides
support for large multi-dimensional arrays and matrices.
• Scikit-learn
• Scikit-learn is an open source machine learning library for Python that
provides implementations of various machine learning algorithms for
classification, clustering, regression and dimension reduction problems.
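Eg:- A small illustrative sketch combining both libraries (the data is random and only for demonstration):

import numpy as np
from sklearn.decomposition import PCA

data = np.random.rand(100, 5)          # 100 samples with 5 features (NumPy array)
print(data.shape, data.mean(axis=0))   # array operations with NumPy
reduced = PCA(n_components=2).fit_transform(data)   # dimension reduction
print(reduced.shape)                   # (100, 2)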
Django Architecture
• Model
• A Django model is a Python class that outlines the variables and methods for a particular type of data.
• Template
• In a typical Django web application, the template is simply an HTML
page with a few extra placeholders. Django’s template language can be
used to create various forms of text files (XML, email, CSS, Javascript,
CSV, etc.)
• View
• The view ties the model to the template. The view is where you write the
code that actually generates the web pages. View determines what data is to
be displayed, retrieves the data from the database and passes the data to the
template.
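Eg:- A minimal illustrative sketch of the model-view pattern; the Device model and template name are made-up examples, not from the notes.

# models.py - the model describes the data
from django.db import models

class Device(models.Model):
    name = models.CharField(max_length=100)
    status = models.CharField(max_length=20, default='active')

# views.py - the view retrieves the data and passes it to the template
from django.shortcuts import render
from .models import Device

def device_list(request):
    devices = Device.objects.all()
    return render(request, 'device_list.html', {'devices': devices})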
Identify Object Model
• Devices
• Configurations
Here a configuration is a sub-resource of a device. A device can have many configuration options.
Note that both objects/resources in our above model have a unique identifier, which is the integer id property.
Now that the object model is ready, it's time to decide the resource URIs. At this step, while designing the resource URIs, focus on the relationship between resources and their sub-resources. These resource URIs are the endpoints for the RESTful services.
/devices
/devices/{id}
/configurations
/configurations/{id}
/devices/{id}/configurations
/devices/{id}/configurations/{id}
Notice that these URIs do not use any verb or operation. It is very important not to include any verbs in URIs; URIs should contain only nouns.
Determine Representations
Now that the resource URIs have been decided, let's work on their representations. Mostly, representations are defined in either XML or JSON format. We will see XML examples, as XML is more expressive of how the data is composed.
Devices Collection Resource
<devices size="2">
  <link rel="self" href="/devices"/>
  <device id="12345">
    <link rel="self" href="/devices/12345"/>
    <deviceFamily>apple-es</deviceFamily>
    <OSVersion>10.3R2.11</OSVersion>
    <platform>SRX100B</platform>
    <serialNumber>32423457</serialNumber>
    <connectionStatus>up</connectionStatus>
    <ipAddr>192.168.21.9</ipAddr>
    <name>apple-srx_200</name>
    <status>active</status>
  </device>
  <device id="556677">
    <link rel="self" href="/devices/556677"/>
    <deviceFamily>apple-es</deviceFamily>
    <OSVersion>10.3R2.11</OSVersion>
    <platform>SRX100B</platform>
    <serialNumber>6453534</serialNumber>
    <connectionStatus>up</connectionStatus>
    <ipAddr>192.168.20.23</ipAddr>
    <name>apple-srx_200</name>
    <status>active</status>
  </device>
</devices>
Single Device Resource
<device id="12345">
  <link rel="self" href="/devices/12345"/>
  <id>12345</id>
  <deviceFamily>apple-es</deviceFamily>
  <OSVersion>10.0R2.10</OSVersion>
  <platform>SRX100-LM</platform>
  <serialNumber>32423457</serialNumber>
  <name>apple-srx_100_lehar</name>
  <hostName>apple-srx_100_lehar</hostName>
  <ipAddr>192.168.21.9</ipAddr>
  <status>active</status>
  <configurations size="2">
    <link rel="self" href="/configurations"/>
    <configuration id="42342">
      <link rel="self" href="/configurations/42342"/>
    </configuration>
    <configuration id="675675">
      <link rel="self" href="/configurations/675675"/>
    </configuration>
  </configurations>
</device>
Please note that the configurations collection representation inside the device is similar to the top-level configurations URI. The only difference is that this device has only two configurations, so only two configuration items are listed as sub-resources under the device.
Single Configuration Resource
<configuration id="42342">
  <link rel="self" href="/configurations/42342"/>
  <content><![CDATA[...]]></content>
  <status>active</status>
  <link rel="raw configuration content" href="/configurations/42342/raw"/>
</configuration>
Configurations Collection under a Device
<configurations size="2">
  <link rel="self" href="/devices/12345/configurations"/>
  <configuration id="53324">
    <link rel="self" href="/devices/12345/configurations/53324"/>
    <link rel="detail" href="/configurations/53324"/>
  </configuration>
  <configuration id="333443">
    <link rel="self" href="/devices/12345/configurations/333443"/>
    <link rel="detail" href="/configurations/333443"/>
  </configuration>
</configurations>
Notice that this subresource collection has two links: one for its direct representation inside the sub-collection, i.e. /devices/12345/configurations/333443, and the other pointing to its location in the primary collection, i.e. /configurations/333443.
Having two links is important, as you can provide access to a device-specific configuration in a more unique manner, and you will have the ability to mask some fields (if the design requires it) which shall not be visible in the secondary collection.
Single Configuration Resource under a Device
<configuration id="11223344">
  <link rel="self" href="/devices/12345/configurations/11223344"/>
  <link rel="detail" href="/configurations/11223344"/>
  <content><![CDATA[...]]></content>
  <status>active</status>
  <link rel="raw configuration content" href="/configurations/11223344/raw"/>
</configuration>
Now, before moving forward to the next section, let's note down a few observations so you don't miss them.
So our resource URIs and their representations are fixed now. Let's decide the possible operations in the application and map these operations onto the resource URIs. A user of the network application can perform browse, create, update or delete operations. So let's map them.
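As a representative mapping (shown for devices; configurations and device configurations follow the same pattern):

HTTP GET /devices – browse all devices
HTTP GET /devices/{id} – browse a single device
HTTP POST /devices – create a new device
HTTP PUT /devices/{id} – update a device
HTTP DELETE /devices/{id} – remove a device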
If the collection size is large, you can apply paging and filtering as well. For example, the requests below will fetch the first 20 records from the collection.
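(The query parameter names startIndex and size are illustrative, not fixed by the notes.)

GET /devices?startIndex=0&size=20
GET /configurations?startIndex=0&size=20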
Create a new configuration – on success the server responds with:
HTTP/1.1 201 Created
Content-Type: application/xml
Location: http://example.com/network-app/configurations/678678
<configuration id="678678">
<link rel="self" href="/configurations/678678" />
<content><![CDATA[...]]></content>
<status>active</status>
<link rel="raw configuration content" href="/configurations/678678/raw" />
</configuration>
Update a configuration – the server responds with the updated representation:
HTTP/1.1 200 OK
Content-Type: application/xml
<configuration id="678678">
<link rel="self" href="/configurations/678678" />
<content><![CDATA[. updated content here .]]></content>
<status>active</status>
<link rel="raw configuration content" href="/configurations/678678/raw" />
</configuration>
Remove a device or configuration
Please note that you should put enough analysis into deciding the behavior when a subresource is deleted from the system. Normally, you may want to SOFT DELETE a resource in these requests – in other words, set its status to INACTIVE. By following this approach, you will not need to find and remove its references from other places as well.
More Actions
So far we have designed the object model and URIs, and then decided the HTTP methods or operations on them. You need to work on other aspects of the application as well:
1) Logging
2) Security
3) Discovery etc.
Component Design
Architecture Design
Deployment Design
• Map the application components to specific cloud resources (such as
web servers, application servers, database servers, etc.)
Design methodology for PaaS service model
• For applications that use the Platform-as-a-Service (PaaS) cloud service model, the architecture and deployment design steps are not required since the platform takes care of the architecture and deployment.
• In the component design step, the developers have to take into
consideration the platform specific features.
• Platform Specific Software
• Different PaaS offerings such as Google App Engine, Windows Azure
Web Sites, etc., provide platform specific software development kits
(SDKs) for developing cloud applications.
• Sandbox Environments
• Applications designed for specific PaaS offerings run in sandbox
environments and are allowed to perform only those actions that do not
interfere with the performance of other applications.
• Deployment & Scaling
• The deployment and scaling is handled by the platform while the
developers focus on the application development using the platform-
specific SDKs.
• Portability
• Portability is a major constraint for PaaS-based applications, as it is difficult to move the application from one PaaS vendor to another.
Image Processing App – Case Study
• Functionality:
• A cloud-based Image Processing application.
• This application provides online image filtering capability.
• Users can upload image files and choose the filters to apply.
• The selected filters are applied to the image and the processed image can then be downloaded.
• Component Design
• Web Tier: The web tier for the image processing app has front ends for
image submission and displaying processed images.
• Application Tier: The application tier has components for processing the
image submission requests, processing the submitted image and processing
requests for displaying the results.
• Storage Tier: The storage tier comprises the storage for processed images.
Fig: Component design for Image Processing App
Image Processing App – Deployment Design
• Deployment for the app is a multi-tier architecture comprising a load balancer, application servers and cloud storage for processed images.
• For each resource in the deployment the corresponding Amazon Web
Services (AWS) cloud service is mentioned.
Fig: Architecture design for MapReduce App
Fig: Architecture design for Social Media Analytics App
Social Media Analytics App – Dashboard
8. What are the key features of Python?
9. What is the social media analytics app?
10. What is the document storage app?
UNIT-IV
Big Data Analytics
Introduction:
Volume
• Though there is no fixed threshold for the volume of data to be considered big data, the term is typically used for massive scale data that is difficult to store, manage and process using traditional databases and data processing architectures. The volumes of data generated by modern IT, industrial, healthcare and other systems are growing exponentially, driven by the lowering costs of data storage and processing and by the need to extract valuable insights from the data to improve business processes, efficiency and service to consumers.
Velocity
• Velocity is another important characteristic of big data and the primary reason for the exponential growth of data. Velocity of data refers to how fast the data is generated. Modern IT, industrial and other systems are generating data at increasingly higher speeds, producing big data.
Variety
• Variety refers to the forms of the data. Big data comes in different forms
such as structured or unstructured data, including text data, image, audio,
video and sensor data.
k-means Clustering
• k-means is a clustering algorithm that groups data items into k clusters,
where k is user defined.
• Each cluster is defined by a centroid point.
• k-means clustering begins with a set of k centroid points which are either
randomly chosen from the dataset or chosen using some initialization
algorithm such as canopy clustering.
• The algorithm proceeds by finding the distance between each data point in
the data set and the centroid points.
• Based on the distance measure, each data point is assigned to a cluster
belonging to the closest centroid.
• In the next step the centroids are recomputed by taking the mean value of
all the data points in a cluster.
• This process is repeated till the centroids no longer move more than a
specified threshold.
Fig: Example of clustering 300 points with k-means: (a) iteration 1, (b)
iteration 2, (c) iteration 3, (d) iteration 5, (e) iteration 10, (f) iteration 100.
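Eg:- A minimal k-means sketch with scikit-learn; the 300 random points mirror the figure above, and k is user defined.

import numpy as np
from sklearn.cluster import KMeans

points = np.random.rand(300, 2)             # 300 random 2-D data points
kmeans = KMeans(n_clusters=3).fit(points)   # k = 3 clusters
print(kmeans.cluster_centers_)              # the 3 centroid points
print(kmeans.labels_[:10])                  # cluster assignment of first 10 points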
• Document clustering problem occurs in many big data applications such
as finding similar news articles, finding similar patients using electronic
health records, etc.
• Before applying k-means algorithm for document clustering, the
documents need to be vectorized. Since documents contain textual
information, the process of vectorization is required for clustering
documents.
• The process of generating document vectors involves several steps:
• A dictionary of all words used in the tokenized records is generated. Each
word in the dictionary has a dimension number assigned to it which is used
to represent the dimension the word occupies in the document vector.
• The number of occurrences or term frequency (TF) of each word is computed.
• The Inverse Document Frequency (IDF) for each word is computed. The Document Frequency (DF) for a word is the number of documents (or records) in which the word occurs; IDF is commonly computed as IDF_i = log(N / DF_i), where N is the total number of documents.
• The weight for each word is computed, commonly as W_i = TF_i x IDF_i. The term weight W_i is used in the document vector as the value for dimension i.
• Similarity between documents is computed using a distance measure such
as Euclidean distance measure.
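Eg:- An illustrative sketch of vectorizing and clustering documents with scikit-learn (the documents are made up):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["cloud computing lecture notes",
        "big data analytics with hadoop",
        "hadoop and big data clusters"]
vectors = TfidfVectorizer().fit_transform(docs)     # documents -> TF-IDF vectors
labels = KMeans(n_clusters=2).fit_predict(vectors)  # cluster the document vectors
print(labels)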
Fig: Parallel implementation of K-means clustering with Map Reduce
DBSCAN clustering
• DBSCAN is a density clustering algorithm that works on the notions of
density reachability and density connectivity.
Density Reachability
• Density reachability is defined on the basis of the Eps-neighborhood: for every point p in a cluster C there is a point q in C such that p is inside the Eps-neighborhood of q, and there are at least a minimum number (MinPts) of points in the Eps-neighborhood of that point.
• A point p is called directly density-reachable from a point q if it is not farther away than a given distance (Eps) and if it is surrounded by at least a minimum number (MinPts) of points that may be considered to be part of a cluster.
Density Connectivity
• A point p is density-connected to a point q if there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts.
• A cluster is then defined based on the following two properties:
• Maximality: For all points p, q: if p belongs to cluster C and q is density-reachable from p (wrt. Eps and MinPts), then q also belongs to cluster C.
• Connectivity: For all points p, q in cluster C: p is density-connected to q (wrt. Eps and MinPts).
DBSCAN vs K-means
• DBSCAN can find irregularly shaped clusters, as seen in this example, and can even find a cluster completely surrounded by a different cluster.
• DBSCAN considers some points as noise and does not assign them to any cluster.
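Eg:- An illustrative comparison with scikit-learn on the same points as above; the eps and min_samples parameters correspond to Eps and MinPts.

from sklearn.cluster import DBSCAN

labels = DBSCAN(eps=0.1, min_samples=10).fit_predict(points)
# points labeled -1 are treated as noise and belong to no cluster
print(set(labels))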
Classification of Big Data
Accuracy: the proportion of samples that are classified correctly out of the total number of samples.
Naive Bayes
Naive Bayes is a probabilistic classification algorithm based on Bayes' theorem with a naive assumption about the independence of the feature attributes. Given a class variable C and feature variables F_1, ..., F_n, the conditional probability (posterior) according to Bayes' theorem is given as,

P(C | F_1, ..., F_n) = \frac{P(C) \, P(F_1, ..., F_n | C)}{P(F_1, ..., F_n)}

Since the evidence P(F_1, ..., F_n) is constant for a given input and does not depend on the class variable C, only the numerator of the posterior probability is important for classification. With this simplification and the independence assumption, classification can then be done as follows,

classify(f_1, ..., f_n) = \arg\max_{c} \; P(C = c) \prod_{i=1}^{n} P(F_i = f_i | C = c)
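Eg:- A minimal naive Bayes sketch with scikit-learn (the toy dataset is made up):

import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[1, 2], [2, 1], [8, 9], [9, 8]])   # feature vectors F_1, F_2
y = np.array([0, 0, 1, 1])                       # class labels C
model = GaussianNB().fit(X, y)
print(model.predict([[1.5, 1.5], [8.5, 8.5]]))   # -> [0 1]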
Decision Trees
• Decision Trees are a supervised learning method that use a tree created
from simple decision rules learned from the training data as a predictive
model.
• The predictive model is in the form of a tree that can be used to predict the value of a target variable based on several attribute variables.
• Each node in the tree corresponds to one attribute in the dataset on which
the “split” is performed.
• Each leaf in a decision tree represents a value of the target variable.
• The learning process involves recursively splitting on the attributes until
all the samples in the child node have the same value of the target variable
or splitting further results in no further information gain.
• To select the best attribute for splitting at each stage, different metrics can
be used.
Splitting Attributes in Decision Trees
To select the best attribute for splitting at each stage, different metrics can
be used such as:
Information Gain
• The information content (entropy) of a discrete random variable X with probability mass function (PMF) P(X) is defined as,

H(X) = - \sum_{x} P(x) \log_2 P(x)

The information gain of a split is the reduction in entropy achieved by that split.
Gini Coefficient
• The Gini coefficient measures the inequality, i.e. how often a randomly chosen sample that is labeled based on the distribution of labels would be labeled incorrectly. The Gini coefficient is defined as,

G = \sum_{i} p_i (1 - p_i) = 1 - \sum_{i} p_i^2

where p_i is the fraction of samples belonging to class i.
Decision Tree Algorithms
• There are different algorithms for building decisions trees, popular ones
being ID3 and C4.5.
ID3:
• Attributes are discrete. If not, discretize the continuous attributes.
• Calculate the entropy of every attribute using the dataset.
• Choose the attribute with the highest information gain.
• Create branches for each value of the selected attribute.
• Repeat with the remaining attributes.
• The ID3 algorithm can result in over-fitting to the training data and can be expensive to train, especially for continuous attributes.
C4.5
• The C4.5 algorithm is an extension of the ID3 algorithm. C4.5 supports
both discrete and continuous attributes.
• To support continuous attributes, C4.5 finds thresholds for the continuous
attributes and then splits based on the threshold values. C4.5 prevents over-
fitting by pruning trees after they have been created.
• Pruning involves removing or aggregating those branches which provide
little discriminatory power.
Random Forest
• Random Forest is an ensemble learning method that is based on
randomized decision trees.
• Random Forest trains a number decision trees and then takes the majority
vote by using the mode of the class predicted by the individual trees.
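Eg:- Illustrative decision tree and random forest sketches with scikit-learn (toy data; criterion='entropy' corresponds to information-gain-based splitting):

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X = [[0, 0], [1, 1], [0, 1], [1, 0]]
y = [0, 1, 0, 1]
tree = DecisionTreeClassifier(criterion='entropy').fit(X, y)
forest = RandomForestClassifier(n_estimators=10).fit(X, y)   # majority vote of trees
print(tree.predict([[1, 1]]), forest.predict([[1, 1]]))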
Breiman's Algorithm
Fig: Binary classification with RBF SVM
Recommendation Systems
• Recommendation systems are an important part of modern cloud
applications such as e-Commerce, social networks, content delivery
networks, etc.
• Item-based or Content-based Recommendation
• Provides recommendations to users (for items such as books, movies, songs, or restaurants) for unrated items based on the characteristics of the item.
• Collaborative Filtering
• Provides recommendations based on the ratings given by the user and
other users to similar items.
Multimedia Cloud
Multimedia Cloud Reference Architecture
Infrastructure Services
• In the Multimedia Cloud reference architecture, the first layer is the
infrastructure services layer that includes computing and storage resources.
Platform Services
• On top of the infrastructure services layer is the platform services layer that includes frameworks and services for streaming and associated tasks, such as transcoding and analytics, which can be leveraged for rapid development of multimedia applications.
Applications
• The topmost layer is the applications such as live video streaming, video
transcoding, video-on-demand, multimedia processing etc.
• Cloud-based multimedia applications alleviate the burden of installing and maintaining multimedia applications locally on the multimedia consumption devices (desktops, tablets, smartphones, etc.) and provide access to rich multimedia content.
Service Models
• A multimedia cloud can have various service models such as IaaS, PaaS
and SaaS that offer infrastructure, platform or application services.
Fig: Workflow for live video streaming using multimedia cloud
Streaming Protocols
• RTMP Dynamic Streaming (Unicast)
• High-quality, low-latency media streaming with support for live and on-
demand and full adaptive bitrate.
• RTMPE (encrypted RTMP)
• Real-time encryption of RTMP.
• RTMFP (multicast)
• IP multicast encrypted, with support for both ASM and SSM multicast on multicast-enabled networks.
• RTMFP (P2P)
• P2P live video delivery between Flash Player clients.
• RTMFP (multicast fusion)
• IP and P2P working together to support higher QoS within enterprise
networks.
• HTTP Dynamic Streaming (HDS)
• Enabling on-demand and live adaptive bitrate video streaming of
standards-based MP4 media over regular HTTP connections.
• Protected HTTP Dynamic Streaming (PHDS)
• Real-time encryption of HDS.
• HTTP Live Streaming (HLS)
• HTTP streaming to iOS devices or devices that support the HLS format;
optional encryption with AES128 encryption standard.
RTMP Streaming
• Real Time Messaging Protocol (RTMP) is a protocol for streaming audio,
video and data over the Internet.
• The plain version of RTMP protocol works on top of TCP. RTMPS is a
secure variation of RTMP that works over TLS/SSL.
• RTMP provides a bidirectional message multiplex service over a reliable
stream transport, such as TCP.
• RTMP maintains persistent TCP connections that allow low-latency
communication.
• RTMP is intended to carry parallel streams of video, audio, and data
messages, with associated timing information, between a pair of
communicating peers.
• Streams are split into fragments so that the streams can be delivered smoothly.
• The size of the stream fragments is either fixed or negotiated dynamically
between the client and server.
• Default fragment sizes used are 64-bytes for audio data, and 128 bytes for
video data.
• RTMP implementations typically assign different priorities to different
classes of messages, which can affect the order in which messages are
enqueued to the underlying stream transport when transport capacity is
constrained.
HTTP Live Streaming
• HTTP Live Streaming (HLS) can dynamically adjust playback quality to
match the available speed of wired or wireless networks.
• HLS supports multiple alternate streams at different bit rates, and the
client software can switch streams intelligently as network bandwidth
changes.
• HLS also provides for media encryption and user authentication over
HTTPS, allowing publishers to protect their work.
• The protocol works by splitting the stream into small chunks which are
specified in a playlist file.
• Playlist file is an ordered list of media URIs and informational tags.
• The URIs and their associated tags specify a series of media segments.
• To play the stream, the client first obtains the playlist file and then obtains and plays each media segment in the playlist.
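Eg:- An illustrative (simplified) HLS playlist file; each #EXTINF tag gives a segment's duration and is followed by the media URI:

#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:10
#EXT-X-MEDIA-SEQUENCE:0
#EXTINF:10.0,
segment0.ts
#EXTINF:10.0,
segment1.ts
#EXT-X-ENDLIST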
HTTP Dynamic Streaming
• HTTP Dynamic Streaming (HDS) enables on-demand and live adaptive bitrate video delivery of standards-based MP4 media (H.264 or VP6) over regular HTTP connections.
• HDS combines HTTP (progressive download) and RTMP (streaming download) to provide the ability to deliver video content in a streaming manner over HTTP.
• HDS supports adaptive bitrate which allows HDS to detect the client’s
bandwidth and computer resources and serve content fragments encoded at
the most appropriate bitrate for the best viewing experience.
• HDS supports high-definition video up to 1080p, with bitrates from 700 kbps up to and beyond 6 Mbps, using either the H.264 or VP6 video codecs and the AAC or MP3 audio codecs.
• HDS allows leveraging existing caching infrastructures, content delivery
networks (CDNs) and standard HTTP server hardware to deliver on-
demand and live content.
Fig: Screenshots of live video streaming application showing video streaming page
Video Transcoding App – Case Study
Functionality
• Video transcoding application is based on multimedia cloud.
• The transcoding application allows users to upload video files and choose
the conversion presets.
Development
• The application is built upon the Amazon Elastic Transcoder.
• Elastic Transcoder is highly scalable, relatively easy to use service from
Amazon that allows converting video files from their source format into
versions that will playback on mobile devices like smartphones, tablets and
PCs.
Cloud Application Benchmarking & Tuning
Benchmarking
• Benchmarking of cloud applications is important for the following reasons:
• Provisioning and capacity planning
• The process of provisioning and capacity planning for cloud applications
involves determining the amount of computing, memory and network
resources to provision for the application.
• Benchmarking can help in comparing alternative deployment architectures
and choosing the best and most cost effective deployment architecture that
can meet the application performance requirements.
• Ensure proper utilization of resources
• Benchmarking can help in determining the utilization of computing, memory and network resources for applications and identify resources which are either under-utilized or over-provisioned, and hence save deployment costs.
• Market readiness of applications
• Performance of an application depends on the characteristics of the workloads it experiences. Different types of workloads can lead to different performance for the same application.
• To ensure the market readiness of an application it is important to model
all types of workloads the application can experience and benchmark the
application with such workloads.
Cloud Application Benchmarking - Steps
• Trace Collection/Generation
• The first step in benchmarking cloud applications is to collect/generate
traces of real application workloads. For generating a trace of workload, the
application is instrumented to log information such as the requests
submitted by the users, the time-stamps of the requests, etc.
• Workload Modeling
• Workload modeling involves creation of mathematical models that can be
used for generation of synthetic workloads.
• Workload Specification
• Since the workload models of each class of cloud computing applications can have different workload attributes, a Workload Specification Language (WSL) is often used for specification of application workloads. WSL can provide a structured way for specifying the workload attributes that are critical to the performance of the applications. WSL can be used by synthetic workload generators for generating workloads with slightly varying characteristics.
• Synthetic Workload Generation
• Synthetic workloads are used for benchmarking cloud applications. An
important requirement for a synthetic workload generator is that the
generated workloads should be representative of the real workloads.
Flexibility
• A good benchmarking methodology should allow fine grained control
over the workload attributes such as think time, inter-session interval,
session length, workload mix, for instance, to perform sensitivity analysis.
• Sensitivity analysis is performed by varying one workload characteristic
at a time while keeping the others constant.
Wide Application Coverage
• A good benchmarking methodology is one that works for a wide range of applications and is not tied to the application architecture or workload types.
Types of Tests
Baseline Tests
• Baseline tests are done to collect the performance metrics data of the
entire application or a component of the application.
• The performance metrics data collected from baseline tests is used to
compare various performance tuning changes which are
subsequently made to the application or a component.
Load Tests
• Load tests evaluate the performance of the system with multiple users and
workload levels that are encountered in the production phase.
• The number of users and workload mix are usually specified in the load
test configuration.
Stress Tests
• Stress tests load the application to a point where it breaks down.
• These tests are done to determine how the application fails, the conditions
in which the application fails and the metrics to monitor
which can warn about impending failures under elevated workload levels.
Soak Tests
• Soak tests involve subjecting the application to a fixed workload level for
long periods of time.
• Soak tests help in determining the stability of the application under
prolonged use and how the performance changes with time.
Deployment Prototyping
• Deployment prototyping can help in making deployment architecture
design choices.
• By comparing the performance of alternative deployment architectures, deployment prototyping can help in choosing the best and most cost-effective deployment architecture that can meet the application performance requirements.
• Deployment design is an iterative process that involves the following
steps:
Deployment Design
• Create the deployment with various tiers as specified in the deployment
configuration and deploy the application.
Performance Evaluation
• Verify whether the application meets the performance requirements with
the deployment.
Deployment Refinement
• Deployments are refined based on the performance evaluations. Various alternatives can exist in this step, such as vertical scaling and horizontal scaling.
Performance Evaluation Workflow
Analysis
• Throughput continuously increases as the demanded request rate increases from 10 to 40 req/sec. Beyond a demanded request rate of 40 req/sec, we observe that throughput saturates, due to the high CPU utilization of the database server. From the analysis of the density plots of various system resources, we observe that the database CPU is the system bottleneck.
Fig: (a) Average throughput and response time results obtained from load testing with httperf, (b) application server-1 CPU utilization density, (c) database server CPU utilization density, (d) database server disk I/O bandwidth, (e) application server-1 network outgoing rate density, (f) database server network outgoing rate density
PART-A (2-Mark Questions)
1. What is the big data approach?
8. Define Throughput.
11. What are the various tests that can be done using the benchmarking tools?
UNIT-V
Cloud Security
Introduction:
“Security in the Cloud is much like security in your on-premises data
centers – only without the costs of maintaining facilities and hardware. In
the Cloud, you don’t have to manage physical servers or storage devices.
Instead, you use software-based security tools to monitor and protect the flow of information into and out of your Cloud resources.”
Auditing
• Auditing is very important for applications deployed in cloud computing
environments.
• In traditional in-house IT environments, organizations have complete
visibility of their applications and accesses to the protected information.
• For cloud applications appropriate auditing mechanisms are required to
get visibility into the application, data accesses and actions performed by
the application users, including mobile users and devices such as wireless
laptops and smartphones.
Fig: Security and risk management (SRM) domain.
Authentication
• Authentication refers to confirming the digital identity of the entity
requesting access to some protected information.
• The process of authentication involves, but is not limited to, validating at least one factor of identification of the entity to be authenticated.
• A factor can be something the entity or the user knows (password or pin),
something the user has (such as a smart card), or something that can
uniquely identify the user (such as fingerprints).
• In multifactor authentication more than one of these factors are used for
authentication.
• There are various mechanisms for authentication including:
• SSO
• SAML-Token
• OTP
Single Sign-on (SSO)
• Single Sign-on (SSO) enables users to access multiple systems or
applications after signing in only once, for the first time.
• When a user signs in, the user identity is recognized and there is no need
to sign in again and again to access related systems or applications.
• Since different systems or applications may internally use different authentication mechanisms, SSO, upon receiving the initial credential, translates it into the different credentials required by those systems or applications.
• The benefit of using SSO is that it reduces human error and saves time
spent in authenticating with different systems or applications for the same
identity.
There are different implementation mechanisms:
• SAML-Token
• Kerberos
SAML-Token
• Security Assertion Markup Language (SAML) is an XML-based open standard data format for exchanging security information (authentication and authorization data) between an identity provider and a service provider.
SAML-token based SSO authentication
• When a user tries to access the cloud application, a SAML request is
generated and the user is redirected to the identity provider.
• The identity provider parses the SAML request and authenticates the user. A SAML token is returned to the user, who then accesses the cloud application with the token.
• SAML prevents man-in-the-middle and replay attacks by requiring the use
of SSL encryption when transmitting assertions and messages.
• SAML also provides a digital signature mechanism that enables the
assertion to have a validity time range to prevent replay attacks.
Kerberos
• Kerberos is an open authentication protocol that was developed at MIT.
• Kerberos uses tickets for authenticating clients to services that communicate over an unsecured network.
• Kerberos provides mutual authentication, i.e. both the client and the server authenticate with each other.
• Time-based OTP algorithm (TOTP) is a popular time synchronization
based algorithm for generating OTPs.
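Eg:- A tiny illustrative sketch with the third-party pyotp library (the base32 secret is a placeholder):

import pyotp

totp = pyotp.TOTP('JBSWY3DPEHPK3PXP')   # shared secret (placeholder)
code = totp.now()                       # current time-based OTP
print(totp.verify(code))                # True within the validity window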
Authorization
• Authorization refers to specifying the access rights to the protected
resources using access policies.
OAuth
• OAuth is an open standard for authorization that allows resource owners
to share their private resources stored on one site with another site without
handing out the credentials.
• In the OAuth model, an application (which is not the resource owner)
requests access to resources controlled by the resource owner (but hosted
by the server).
• The resource owner grants permission to access the resources in the form
of a token and matching shared-secret.
• Tokens make it unnecessary for the resource owner to share its credentials
with the application.
• Tokens can be issued with a restricted scope and limited lifetime, and
revoked independently.
Identity & Access Management
• Identity management provides consistent methods for digitally identifying persons and maintaining associated identity attributes for the users across multiple organizations.
Access management deals with user privileges.
• Identity and access management deal with user identities, their
authentication, authorization and access policies.
Federated Identity Management
• Federated identity management allows users of one domain to securely
access data or systems of another domain seamlessly without the need for
maintaining identity information separately for multiple domains.
• Federation is enabled through the use of single sign-on mechanisms such as SAML tokens and Kerberos.
Role-based access control
• Used for restricting access to confidential information to authorized users.
• These access control policies allow defining different roles for different
users.
Securing Data at Rest
• Data at rest is the data that is stored in database in the form of
tables/records, files on a file server or raw data on a distributed storage or
storage area network (SAN).
• Data at rest is secured by encryption.
• Encryption is the process of converting data from its original form (i.e.,
plaintext) to a scrambled form (ciphertext) that is unintelligible. Decryption
converts data from ciphertext to plaintext.
Encryption can be of two types:
• Symmetric Encryption (symmetric-key algorithms)
• Asymmetric Encryption (public-key algorithms)
Symmetric Encryption
• Symmetric encryption uses the same secret key for both encryption and
decryption.
• The secret key is shared between the sender and the receiver.
• Symmetric encryption is best suited for securing data at rest since the data
is accessed by known entities from known locations.
• Popular symmetric encryption algorithms include:
• Advanced Encryption Standard (AES)
• Twofish
• Blowfish
• Triple Data Encryption Standard (3DES)
• Serpent
• RC6
• MARS
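Eg:- A short symmetric-encryption sketch using the third-party cryptography library (Fernet uses AES internally; the data is made up):

from cryptography.fernet import Fernet

key = Fernet.generate_key()                      # the shared secret key
f = Fernet(key)
ciphertext = f.encrypt(b'patient record #42')    # encrypt data at rest
plaintext = f.decrypt(ciphertext)                # -> b'patient record #42'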
Asymmetric Encryption
• Asymmetric encryption uses two keys, one for encryption (public key) and the other for decryption (private key).
• The two keys are linked to each other such that one key encrypts plaintext to ciphertext and the other decrypts ciphertext back to plaintext.
• Public key can be shared or published while the private key is known only
to the user.
• Asymmetric encryption is best suited for securing data that is exchanged
between two parties where symmetric encryption can be unsafe because the
secret key has to be exchanged between the parties and anyone who
manages to obtain the secret key can decrypt the data.
• In asymmetric encryption a separate key is used for decryption which is
kept private.
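Eg:- A short asymmetric-encryption sketch using the third-party cryptography library (key size and padding choices are illustrative):

from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes

private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)
ciphertext = public_key.encrypt(b'session key', oaep)   # anyone can encrypt
plaintext = private_key.decrypt(ciphertext, oaep)       # only the private key decrypts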
Fig: Asymmetric encryption using public/private keys
Encryption Levels
Encryption can be performed at various levels:
Application
• Application level encryption involves encrypting application data right at
the point where it originates i.e. within the application.
• Application-level encryption provides protection from threats at both the operating system level and from other applications.
• An application encrypts all data generated in the application before it flows to the lower levels, and presents decrypted data to the user.
Host
• In host-level encryption, encryption is performed at the file-level for all
applications running on the host.
• Host level encryption can be done in software in which case additional
computational resource is required for encryption or it can be performed
with specialized hardware such as a cryptographic accelerator card.
Network
• Network-level encryption is best suited for cases where the threats to data
are at the network or storage level and not at the application or host level.
• Network-level encryption is performed when moving the data from a creation point to its destination, using specialized hardware that encrypts all incoming data in real time.
Device
• Device-level encryption is performed on a disk controller or a storage
server.
• Device level encryption is easy to implement and is best suited for cases
where the primary concern about data security is to protect data residing on
storage media.
Fig: TLS Handshake
Key Management
• Management of encryption keys is critical to ensure security of encrypted
data.
The key management lifecycle involves different phases including:
• Creation
• Backup
• Deployment
• Monitoring
• Rotation
• Expiration
• Archival
• Destruction
Key Management Approach (example)
• All keys for encryption must be stored in a data store which is separate and distinct from the actual data store.
• Additional security features such as key rotation and key-encrypting keys can be used.
• Keys can be rotated automatically or manually.
• In the automated key change approach, the key is changed after a certain number of transactions.
• All keys can themselves be encrypted using a master key.
Fig: Example of a key management approach
Auditing
• Auditing is mandated by most data security regulations.
• Auditing requires that all read and write accesses to data be logged.
• Logs can include the user involved, type of access, timestamp, actions
performed and records accessed.
• The main purpose of auditing is to find security breaches, so that necessary changes can be made in the application and deployment to prevent further security breaches.
The objectives of auditing include:
• Verify the efficiency and compliance of identity and access management controls as per established access policies.
• Verify that authorized users are granted access to data and services based on their roles.
• Verify whether access policies are updated in a timely manner upon change in the roles of the users.
• Verify whether the data protection policies are sufficient.
• Assess support activities such as problem management.
THE CLOUD SERVICE OFFERINGS AND DEPLOYMENT
MODELS
Cloud computing has been an attractive proposition for both the CFO and the CTO of an enterprise, primarily due to its ease of usage. This has been achieved by large data center service vendors, now better known as cloud service vendors, primarily due to their scale of operations.
IaaS – Abstract Compute/Storage/Bandwidth Resources, e.g. Amazon Web Services [10, 9] – EC2, S3, SDB, CDN
Fig: The seven-step model of migration into the cloud (Assess, Isolate, Map, Re-architect, Augment, Test, Optimize)
costly data center operational activities; and
switch to focus on value-added activities
● Keep integration (technologies) simple
Fig: Unfreeze – Transition – Refreeze (phases of organizational change)
Fig: Plan-Do-Check-Act cycle for a better environment for cities. Plan: understand the gap between residents' expectations and what is being delivered; set standards. Act: study the result; redesign systems to reflect learning; change standards.
Elements of organizational culture may include:
● Stated values and beliefs
● Expectations for member behavior
● Customs and rituals
● Stories and myths about the history of the organization
subset of Davenport’s. They define a process as “a
collection of activities that takes one or more kinds of
input and creates an output that is of value to the
customer.”
Whereas
● Should AML consider cloud computing part of the solution?
● Is AML ready for cloud computing?
● What does “done” look like?
● How can the organization overcome these challenges of change?
Technical
Does your organization implement any industry management standards?
● ITIL
● COBIT
● ITSM
● others
Does your organization have a well-established policy to classify and manage the full lifecycle of all corporate data?
Can you tell which percentage of your applications is CPU-intensive, and which percentage of your applications is data-intensive?
LEGAL ISSUES
The legal issues that arise in cloud computing are wide ranging.
Significant issues regarding privacy of data and data security
exist, specifically as they relate to protecting personally identifiable
information of individuals, but also as they relate to protection of
sensitive and potentially confidential business information either
directly accessible through or gleaned from the cloud systems (e.g.,
identification of a company's customer by evaluating traffic across
the network).
Additionally, there are multiple contracting models under which
cloud services may be offered to customers (e.g., licensing, service
agreements, on-line agreements, etc.).
The appropriate model depends on the nature of the services as well
as the potential sensitivity of the systems being implemented or data
being released into the cloud. In this regard, the risk profile (i.e.,
which party bears the risk of harm in certain foreseeable and other
not-so-foreseeable situations) of the agreement and the cloud
provider's limits on its liability also require a careful look when
reviewing contracting models.
Additionally, complex jurisdictional issues may arise due to the
potential for data to reside in disparate or multiple geographies. This
geographical diversity is inherent in cloud service offerings. This
means that both virtualization of and physical locations of servers
storing and processing data may potentially impact which country's
law might govern in the event of a data breach or intrusion into
cloud systems.
Licensing Agreements
A traditional software license agreement is used when a licensor is
providing a copy of software to a licensee for its use (which is usually non-
exclusive). This copy is not being sold or transferred to the licensee, but a
physical copy is being conveyed to the licensee.
The software license is important because it sets forth the terms under
which the software may be used by the licensee. The license protects the
licensor against the inadvertent transfer of ownership of the software to the
person or company that holds the copy. It also provides a mechanism for
the licensor of the software to (among other things) retrieve the copy it
provided to the licensee in the event that the licensee (a) stops complying
with the terms of the license agreement or (b) stops paying the fee the licensor charges for the license.
Service Agreement
A service agreement, on the other hand, is not designed to protect against the perils of providing a copy of software to a user. It is primarily designed to
provide the terms under which a service can be accessed or used by a
customer. The service agreement may also set forth quality parameters around which the service will be provided to the users. Since there is no
transfer of possession of a copy of software and the service is controlled by
the company providing it, a service agreement does not necessarily need to
cover infringement risk, nor does it need to set forth the scenarios and
manner in which a copy of software is to be returned to the vendor when a
relationship is terminated.
information is virtually, not physically, separated from other users. The
major benefit of this model is cost-effectiveness for the cloud provider.
International Conflicts of Laws
The body of law known as “conflict of laws” acknowledges that the laws of
different countries may operate in opposition to each other, even as those
laws relate to the same subject matter. In such an event, it is necessary to
decide which country’s law will be applied.
Every nation is sovereign within its own territory. That means that the laws
of that nation affect all property and people within it, including all contracts
made and actions carried out within its borders. When there is either
(1) no statement of the law that governs a contract,
(2) no discussion of the rules regarding conflicts of laws in the agreement,
or
(3) a public policy in the jurisdiction which mandates that the governing
law in the agreement will be ignored, the question of which nation’s law
will apply to the transaction will be decided based on a number of factors
and circumstances surrounding the transaction. This cannot be reduced to a
simple or easy-to-apply rule.
Minimizing Risk
Maintaining Data Integrity. Data integrity ensures that data at rest are not
subject to corruption. Multi-tenancy is a core technological approach to
creating efficiencies in the cloud, but the technology, if implemented or
maintained improperly, can put a cloud user’s data at risk of corruption,
contamination, or unauthorized access. A cloud user should expect
contractual provisions obligating a cloud provider to protect its data, and
the user ultimately may be entitled to some sort of contract remedy if data
integrity is not maintained.
Accessibility and Availability of Data/SLAs.
Disaster Recovery. For the cloud user that has outsourced the processing of its data to a cloud provider, a relevant question is: what is the cloud provider's disaster recovery plan? What happens when an unanticipated,
catastrophic event affects the data center(s) where the cloud services are
being provided? It is important for both parties to have an understanding of
the cloud provider’s disaster recovery plan.
Though the ability for the cloud user to have continual access to the cloud
service is a top consideration, a close second, at least from a business
continuity standpoint, is keeping access to its data. This section introduces
three scenarios that a cloud user should contemplate when placing its data
into the cloud. There are no clear answers in any scenario. The most conservative or risk-averse cloud user may consider having a plan to keep a copy of its cloud-stored dataset in a location not affiliated with the cloud provider.
SPECIAL TOPICS
Litigation Issues/e-Discovery
From a U.S. law perspective, a significant effort must be made during the
course of litigation to produce electronically stored information (ESI). This
production of ESI is called “e-discovery.” The overall e-discovery process
has three basic components: (1) information management, where a
company decides where and how its information is processed and retained,
(2) identifying, preserving, collecting, and processing ESI once litigation
has been threatened or started, and (3) review, processing, analysis, and
production of the ESI for opposing counsel [26]. The Federal Rules of Civil
Procedure require a party to produce information within its “possession,
custody, or control.”
Cloud for Industry,Healthcare & Education
Healthcare Ecosystem
• The healthcare ecosystem consists of numerous entities including
healthcare providers (primary care physicians, private health insurance
companies, employers), pharmaceutical, device and medical service
companies, IT solutions and services firms, and patients.
Healthcare Data
• The process of provisioning healthcare involves massive healthcare data
that exists in different forms (structured or unstructured), is stored in
disparate data sources (such as relational databases, file servers, for
instance) and in many different formats.
• The cloud can provide several benefits to all the stakeholders in the healthcare ecosystem through systems such as the Health Information Management System (HIMS), Laboratory Information System (LIS), Radiology Information System (RIS) and Pharmacy Information System (PIS), for instance.
Benefits of Cloud for Healthcare
• The EHR data can be used for advanced healthcare applications such as
population-level health surveillance, disease detection, outbreak prediction,
public health mapping, similarity-based clinical decision intelligence,
medical prognosis, syndromic diagnosis, visual-analytics investigation, for
instance.
• To exploit the potential to aggregate data for advanced healthcare
applications there is a need for efficiently integrating information from
distributed and heterogeneous healthcare IT systems and analyzing the
integrated information.
Cloud EHRs
Save Infrastructure Costs
• Traditional client-server EHR systems with dedicated hosting require a
team of IT experts to install, configure, test, run, secure and update
hardware and software.
• With cloud-based EHR systems, organizations can save on the upfront
capital investments for setting up the computing infrastructure as well as
the costs of managing the infrastructure.
Data Integration & Interoperability
• Traditional EHR systems use different and often conflicting technical and semantic standards, which leads to data integration and interoperability problems.
• To address interoperability problems, several electronic health record
(EHR) standards that enable structured clinical content for the purpose of
exchange are currently under development.
• Interoperability of EHR systems will contribute to more effective and
efficient patient care by facilitating the retrieval and processing of clinical
information about a patient from different sites.
Scalability and Performance
• Traditional EHR systems are built on a client-server model with dedicated
hosting that involves a server which is installed within the organization’s
network and multiple clients that access the server. Scaling up such systems
requires additional hardware.
• Cloud computing is a hosting abstraction in which the underlying
computing infrastructure is provisioned on demand and can be scaled up or
down based on the workload.
• Scaling up cloud applications is easier as compared to client-server
applications.
Security for Cloud EHRs
Security Concerns for Cloud EHRs
• Security of patient information is one of the biggest obstacles in the
widespread adoption of cloud computing technology for EHR systems due
to the outsourced nature of cloud computing.
Government Regulations
• Government regulations require privacy protection and security of patient
health information.
• In the U.S., organizations called covered entities (CE), that create,
maintain, transmit, use, and disclose an individual’s protected health
information (PHI) are required to meet Health Insurance Portability and
Accountability Act (HIPAA) requirements.
• HIPAA requires covered entities (CE) to assure their customers that the
integrity, confidentiality, and availability of PHI information they collect,
maintain, use, or transmit is protected.
• HIPAA was expanded by the Health Information Technology for
Economic and Clinical Health Act (HITECH), which addresses the privacy
and security concerns associated with the electronic transmission of health
information.
• Zookeeper is used to provide a distributed coordination service for
maintaining configuration information, naming, providing distributed
synchronization, and providing group services.
Cloud Computing for Energy Systems
• Complex clean energy systems (such as smart grids, power plants, wind turbine farms, for
instance.) have a large number of critical components that must function correctly so that the
systems can perform their operations correctly.
• Energy systems have thousands of sensors that gather real-time maintenance data
continuously for condition monitoring and failure prediction purposes.
• Analyzing massive amounts of maintenance data collected from sensors in energy systems
and equipment can provide predictions for the impending failures (potentially in real-time) so
that their reliability and availability can be improved.
inexpensive commodity computers which are connected to work in parallel.
• Such systems are designed to work on commodity hardware which has high probability of
failure using techniques such as replication of file blocks on multiple machines in a cluster.
Collecting Sensor Data in Cloud
• Workflow for aggregating sensor data in a cloud:
• The first step in this workflow is data aggregation. Each incoming data stream is mapped to
a data aggregator.
• Since the raw sensor data comes from a large number of machines in the form of data
streams, the data has to be preprocessed to make the data analysis using cloud-based parallel
processing frameworks (such as Hadoop) more efficient. For example, the Hadoop
MapReduce data processing model works more efficiently with a small number of large files
rather than a large number of small files.
• The data aggregators buffer the streaming data into larger chunks.
• The next step is to filter the data and remove bad records in which some
sensor readings are missing.
• The filtered data is then compressed and archived to cloud storage, as the
sketch below illustrates.
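A minimal Python sketch of this aggregate-filter-compress-archive workflow; the
sensor field names, chunk size, and local gzip files (standing in for cloud
object storage) are assumptions for illustration.

import gzip
import json

# Records per archive file, tuned so the downstream Hadoop job sees a
# small number of large files rather than many small ones.
CHUNK_SIZE = 10_000

REQUIRED = ("machine_id", "timestamp", "temperature", "vibration")

def is_complete(record):
    # Filter step: drop records in which some sensor readings are missing.
    return all(record.get(field) is not None for field in REQUIRED)

class DataAggregator:
    # Buffers an incoming sensor stream into larger chunks, filters bad
    # records, then compresses and archives each chunk.

    def __init__(self):
        self.buffer = []
        self.chunk_id = 0

    def consume(self, record):
        if is_complete(record):          # filtering
            self.buffer.append(record)   # aggregation / buffering
        if len(self.buffer) >= CHUNK_SIZE:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        path = f"sensor-chunk-{self.chunk_id:06d}.json.gz"
        with gzip.open(path, "wt") as f:  # compression
            for record in self.buffer:
                f.write(json.dumps(record) + "\n")
        # In a real deployment the archive would now be uploaded to a
        # cloud object store for Hadoop-based analysis.
        self.buffer.clear()
        self.chunk_id += 1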
Case Based Reasoning for Fault Prediction
• Case-based reasoning (CBR) is a popular method that has been used for fault
prediction.
• CBR finds solutions to new problems based on past experience.
• CBR is an effective technique for problem solving in fields where it is hard
to establish a quantitative mathematical model, such as prognostic health
management.
• In CBR, past experience is organized and represented as cases in a case-base.
• The steps involved in CBR are:
• Retrieve: retrieving similar cases from case-base
• Reuse: reusing the information in the retrieved cases
• Revise: revising the solution
• Retain: retaining the new experience in the case-base (a sketch of the full
cycle follows this list).
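A minimal Python sketch of the Retrieve/Reuse/Revise/Retain cycle; the feature
vectors and fault labels below are hypothetical, for illustration only.

import math

# Toy case-base: each case pairs a sensor-feature vector with a known fault.
case_base = [
    {"features": [0.9, 0.2, 70.0], "fault": "bearing wear"},
    {"features": [0.1, 0.8, 95.0], "fault": "overheating"},
    {"features": [0.1, 0.1, 60.0], "fault": "healthy"},
]

def retrieve(query):
    # Retrieve: find the most similar past case by Euclidean distance.
    return min(case_base, key=lambda c: math.dist(c["features"], query))

def solve(query):
    case = retrieve(query)                 # Retrieve
    prediction = case["fault"]             # Reuse the retrieved solution
    # Revise: in practice an engineer confirms or corrects the prediction.
    confirmed = prediction
    case_base.append({"features": query, "fault": confirmed})  # Retain
    return confirmed

print(solve([0.85, 0.25, 72.0]))  # -> "bearing wear"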
• Storage, collection, and analysis of smart grid data in the cloud can help in
dynamic optimization of system operations, maintenance, and planning.
• Cloud-based monitoring of smart grid data can improve energy usage: users
receive energy feedback coupled with real-time pricing information, which
encourages them to reduce their energy consumption.
• Real-time demand response and management strategies can be used for lowering
peak demand and overall load via appliance control and energy storage
mechanisms.
• Condition monitoring data collected from power generation and transmission systems can
help in detecting faults and predicting outages.
Cloud Computing for Transportation Systems
• Condition Monitoring
• Condition monitoring solutions for transportation systems allow monitoring
the conditions inside containers.
• Planning, Operations & Services
• Different transportation solutions (such as fleet tracking, condition
monitoring, route generation, scheduling, cargo operations, fleet maintenance,
customer service, order booking, and billing & collection) can be moved to the
cloud to provide seamless integration between order management, tactical
planning & execution, and customer-facing processes & systems.
• Modern transportation systems are driven by data collected from multiple sources which is
processed to provide new services to the stakeholders.
• By collecting large amounts of data from various sources and processing the
data into useful information, data-driven transportation systems can provide
new services such as:
• Advanced route guidance
• Dynamic vehicle routing (sketched after this list)
• Anticipating customer demands for the pickup and delivery problem
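A minimal sketch of dynamic vehicle routing using a greedy nearest-neighbor
heuristic, one simple approach among many; real systems use far stronger
solvers and re-plan as new requests arrive. Coordinates are hypothetical.

import math

def nearest_neighbor_route(depot, stops):
    # Greedy route construction: repeatedly visit the nearest unvisited
    # stop. A dynamic routing service would re-run this whenever a new
    # pickup request arrives.
    route, current, remaining = [depot], depot, list(stops)
    while remaining:
        nxt = min(remaining, key=lambda p: math.dist(current, p))
        remaining.remove(nxt)
        route.append(nxt)
        current = nxt
    return route

# Hypothetical depot and pickup/delivery stop coordinates.
print(nearest_neighbor_route((0, 0), [(2, 3), (5, 1), (1, 1)]))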
Cloud Computing for Education
• Cloud-based auto-grading applications are used for grading exams and
assignments, as the sketch below illustrates. Cloud-based applications for peer
grading of exams and assignments are also used in some MOOCs.
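A minimal sketch of auto-grading in Python: run a submitted function against
instructor-defined test cases and return a score. The assignment and test cases
are hypothetical; a production cloud grader would also sandbox submissions and
enforce time and memory limits.

def grade(student_func, test_cases):
    # Score a submission by the fraction of test cases it passes.
    passed = 0
    for args, expected in test_cases:
        try:
            if student_func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing submission simply fails that test case
    return passed / len(test_cases)

# Hypothetical assignment: implement absolute value.
tests = [((3,), 3), ((-4,), 4), ((0,), 0)]
submission = lambda x: x if x >= 0 else -x
print(f"score: {grade(submission, tests):.0%}")  # score: 100%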
• Online Programs
• Many universities across the world are using cloud platforms for providing online degree
programs.
• Lectures are delivered through live/recorded video using cloud-based content
delivery networks to students across the world.
PART-A (2-Marks Questions)
1. What are the Cloud Security Challenges?
3. Define identity.
4. Define encryption.