
Cloud Computing
Lecture 2: AWS Basics

Infrastructure as a Service (IaaS): Amazon EC2
What is EC2?
 Amazon Elastic Compute Cloud (EC2) is a web service that provides resizable computing capacity that one uses to build and host different software systems.
 Designed to make web-scale computing easier for developers.
 A user can create, launch, and terminate server instances as needed, paying by the hour for active servers, hence the term "elastic".
 Provides scalable, pay-as-you-go compute capacity.
 Elastic: it scales in both directions, up and down.
EC2 Infrastructure Concepts
EC2 Concepts
 AMI & Instance
 Region & Zones
 Storage
 Networking and Security
 Monitoring
 Auto Scaling
 Load Balancer
Amazon Machine Images (AMI)
 An AMI is an immutable representation of a set of disks that contain an operating system, user applications, and/or data.
 From an AMI, one can launch multiple instances, which are running copies of the AMI.

AMI and Instance
 An Amazon Machine Image (AMI) is a template for a software configuration (operating system, application server, and applications).
 An instance is an AMI running on a virtual server in the cloud.
 Each instance type offers different compute and memory capacities.

Diagram Source: http://docs.aws.amazon.com
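As a quick illustration (not part of the original slides), the boto3 sketch below launches two instances from one AMI; the AMI ID and key-pair name are placeholders and AWS credentials are assumed to be configured:

```python
# Minimal sketch: launch multiple instances from a single AMI with boto3.
# The AMI ID and key pair below are placeholders, not real values.
import boto3

ec2 = boto3.resource("ec2", region_name="us-east-1")

instances = ec2.create_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder AMI ID
    InstanceType="t2.micro",           # instance type fixes the compute/memory profile
    MinCount=2,
    MaxCount=2,
    KeyName="my-key-pair",             # placeholder key pair for SSH access
)
for inst in instances:
    print(inst.id, inst.state["Name"])  # each instance is an independent running copy
```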


Region and Zones
 Amazon has data centers in different regions across the globe.
 An instance can be launched in different regions depending on the need:
 Closer to specific customers
 To meet legal or other requirements
 Each region has a set of zones:
 Zones are isolated from failures in other zones
 Inexpensive, low-latency connectivity between zones in the same region
Storage
 Amazon EC2 provides three types of storage options:
 Amazon EBS
 Amazon S3
 Instance storage

Diagram Source: http://docs.aws.amazon.com


Elastic Block Store (EBS) Volume
 An EBS volume is a read/write disk that can be created from an AMI and mounted by an instance.
 Volumes are suited for applications that require a database, a file system, or access to raw block-level storage.
Amazon S3
 S3 = Simple Storage Service
 A service-oriented architecture (SOA) that provides online storage using web services.
 Allows read, write, and delete operations on objects.
 Uses REST and SOAP protocols for messaging.
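A minimal boto3 sketch of the basic object operations (the bucket and key names are placeholders; configured AWS credentials are assumed):

```python
# Minimal sketch: write, read, and delete an S3 object.
import boto3

s3 = boto3.client("s3")

s3.put_object(Bucket="my-example-bucket", Key="notes/hello.txt", Body=b"Hello, S3!")
body = s3.get_object(Bucket="my-example-bucket", Key="notes/hello.txt")["Body"].read()
print(body.decode())
s3.delete_object(Bucket="my-example-bucket", Key="notes/hello.txt")
```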
Amazon SimpleDB
 Amazon SimpleDB is a highly available, flexible, and scalable non-relational data store that offloads the work of database administration.
 Automatically creates and manages multiple geographically distributed replicas of your data to enable high availability and data durability.
 The service charges you only for the resources actually consumed in storing your data and serving your requests.
Networking and Security
 Instances can be launched on one of two platforms:
 EC2-Classic
 EC2-VPC
 Each launched instance is assigned two addresses: a private IP address and a public IP address.
 A replacement instance has a different public IP address.
 Instance IP addresses are dynamic: a new IP address is assigned every time an instance is launched.
 Amazon EC2 offers Elastic IP addresses (static IP addresses) for dynamic cloud computing:
 Remap the Elastic IP to a new instance to mask failure.
 Separate address pools for EC2-Classic and VPC.
 Security Groups provide access control to instances.
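A minimal boto3 sketch of the Elastic IP remapping idea (the instance ID is a placeholder; configured credentials are assumed):

```python
# Minimal sketch: allocate an Elastic IP and associate it with an instance.
# Re-running associate_address against a replacement instance "remaps" the
# same public address, masking the failed instance.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

allocation = ec2.allocate_address(Domain="vpc")  # static IP from the VPC pool

ec2.associate_address(
    AllocationId=allocation["AllocationId"],
    InstanceId="i-0123456789abcdef0",            # placeholder instance ID
)
```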
Monitoring, Auto Scaling, and Load Balancing
 Monitoring: monitor statistics of instances and EBS volumes using CloudWatch.
 Auto Scaling: automatically scales Amazon EC2 capacity up and down based on rules.
 Adds and removes compute resources based on demand.
 Suitable for businesses experiencing variability in usage.
 Elastic Load Balancing: distributes incoming traffic across multiple instances.
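As a rough illustration of the monitoring side (not from the slides), the boto3 sketch below pulls the CloudWatch CPUUtilization metric that an Auto Scaling rule might act on; the instance ID is a placeholder:

```python
# Minimal sketch: fetch average CPU utilization for one instance over the last hour.
from datetime import datetime, timedelta, timezone
import boto3

cw = boto3.client("cloudwatch", region_name="us-east-1")
end = datetime.now(timezone.utc)

stats = cw.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder
    StartTime=end - timedelta(hours=1),
    EndTime=end,
    Period=300,                  # 5-minute datapoints
    Statistics=["Average"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])
```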
How to Access EC2
 AWS Console: http://console.aws.amazon.com
 Command Line Tools
 Programmatic Interface:
 EC2 APIs
 AWS SDK

AWS Management Console
EC2 Pricing
References
 Mobile Cloud Computing: Big Picture, M. Reza Rahimi.
 http://aws.amazon.com/ec2, http://docs.aws.amazon.com
 Amazon Elastic Compute Cloud – User Guide, API Version 2011-02-28.
 Above the Clouds: A Berkeley View of Cloud Computing, Michael Armbrust et al., 2009.
 International Telecommunication Union – Focus Group on Cloud Computing, Technical Report.
Hadoop, a Distributed Framework for Big Data

Introduction
 Introduction: Hadoop's history and advantages
 Architecture in detail
 Hadoop in industry
What is Hadoop?
 An Apache top-level project: an open-source implementation of frameworks for reliable, scalable, distributed computing and data storage.
 A flexible and highly available architecture for large-scale computation and data processing on a network of commodity hardware.
 Designed to answer the question: "How to process big data with reasonable cost and time?"
Search Engines in the 1990s
[Timeline of early search engines, 1996-1998]

Google Search Engines
[Timeline through 2013]
Hadoop's Developers
 2005: Doug Cutting and Michael J. Cafarella developed Hadoop to support distribution for the Nutch search engine project.
 The project was funded by Yahoo!.
 2006: Yahoo! gave the project to the Apache Software Foundation.

Google Origins
[Timeline: 2003, 2004, 2006]
Some Hadoop Milestones
 2008: Hadoop wins the Terabyte Sort Benchmark (sorted 1 terabyte of data in 209 seconds, compared to the previous record of 297 seconds).
 2009: Avro and Chukwa become new members of the Hadoop framework family.
 2010: Hadoop's HBase, Hive, and Pig subprojects completed, adding more computational power to the Hadoop framework.
 2011: ZooKeeper completed.
 2013: Hadoop 1.1.2 and Hadoop 2.0.3 alpha; Ambari, Cassandra, and Mahout have been added.
What is Hadoop?
 An open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license.
 Abstracts and facilitates the storage and processing of large and/or rapidly growing data sets:
 Structured and unstructured data
 Simple programming models
 High scalability and availability.
 Uses commodity (cheap!) hardware with little redundancy.
 Fault tolerance.
 Moves computation rather than data.
Hadoop Framework Tools
Hadoop MapReduce Engine
 A MapReduce process (org.apache.hadoop.mapred):
 JobClient
 Submits the job.
 JobTracker
 Manages and schedules the job, splitting it into tasks;
 Splits up the data into smaller tasks ("Map") and sends them to the TaskTracker process in each node.
 TaskTracker
 Starts and monitors task execution;
 Reports back to the JobTracker node on job progress, sends data ("Reduce") or requests new jobs.
 Child
 The process that actually executes the task.
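To make the Map and Reduce roles concrete, here is a minimal word-count sketch written for Hadoop Streaming (an illustration, not lecture material; it assumes Python is available on the task nodes, and the streaming jar path and invocation in the comment are placeholders):

```python
# wordcount.py: word count for Hadoop Streaming.
# Example (placeholder paths):
#   hadoop jar hadoop-streaming.jar \
#       -mapper "wordcount.py map" -reducer "wordcount.py reduce" \
#       -input /data/in -output /data/out
import sys


def mapper():
    # Emit (word, 1) for every word read from stdin.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")


def reducer():
    # Input arrives sorted by key, so counts for one word are contiguous.
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")


if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```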
Hadoop's Architecture: MapReduce Engine
Hadoop's MapReduce Architecture
 Distributed, with some centralization.
 The main nodes of the cluster are where most of the computational power and storage of the system lies.
 Main nodes run TaskTracker to accept and reply to MapReduce tasks, and DataNode to store the needed blocks as closely as possible.
 A central control node runs NameNode to keep track of HDFS directories and files, and JobTracker to dispatch compute tasks to TaskTrackers.
 Written in Java; also supports Python and Ruby.

Hadoop's Architecture

Hadoop Distributed File System (HDFS)
 Tailored to the needs of MapReduce.
 Targeted towards many reads of file streams; writes are more costly.
 Open data format, flexible schema, queryable database.
 Fault tolerance:
 High degree of data replication (3x by default)
 No need for RAID on normal nodes
 Large block size (64 MB).
 Location awareness of DataNodes in the network.
HDFS
NameNode:
• Stores metadata for the files, like the directory structure of a typical FS.
• The server holding the NameNode instance is quite crucial, as there is only one.
• Keeps a transaction log for file deletes/adds, etc.; does not use transactions for whole blocks or file streams, only metadata.
• Handles creation of more replica blocks when necessary after a DataNode failure.

DataNode:
• Stores the actual data in HDFS.
• Can run on any underlying filesystem (ext3/4, NTFS, etc.).
• Notifies the NameNode of what blocks it has.
• The NameNode replicates blocks 2x in the local rack, 1x elsewhere.
HDFS Replication
Replication strategy:
• One replica on the local node
• Second replica on a remote rack
• Third replica on the same remote rack
• Additional replicas are randomly placed
• Clients read from the nearest replica

Checksums (CRC32) are used to validate data:
• File creation: the client computes a checksum per 512 bytes; the DataNode stores the checksum.
• File access: the client retrieves the data and checksum from the DataNode; if validation fails, the client tries other replicas.

Write pipeline:
• The client retrieves a list of DataNodes on which to place replicas of a block.
• The client writes the block to the first DataNode.
• The first DataNode forwards the data to the next DataNode in the pipeline.
• When all replicas are written, the client moves on to write the next block in the file.
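As a small illustration of the checksum idea (plain Python, not HDFS client code), the sketch below computes a CRC32 per 512-byte chunk the way a client could at file-creation time:

```python
# Illustrative only: per-chunk CRC32 checksums, mirroring HDFS's 512-byte granularity.
import zlib

CHUNK = 512


def chunk_checksums(path):
    """Return one CRC32 per 512-byte chunk of the file."""
    sums = []
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK):
            sums.append(zlib.crc32(chunk))
    return sums

# On read, recomputing the checksums and comparing them with the stored values
# detects corruption; on a mismatch an HDFS client falls back to another replica.
```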
Hadoop Usage
• Hadoop is in use at most organizations that handle big data:
o Yahoo!: Yahoo!'s Search Webmap runs on a 10,000-core Linux cluster and powers Yahoo! Web search.
o Facebook: FB's Hadoop cluster hosts 100+ PB of data (July 2012), growing at ½ PB/day (Nov 2012).
o Amazon
o Netflix
• Key applications:
• Advertisement (mining user behavior to generate recommendations)
• Searches (grouping related documents)
• Security (searching for uncommon patterns)
Hadoop Usage
 Non-realtime large dataset computing:
 The NY Times was dynamically generating PDFs of articles from 1851-1922.
 It wanted to pre-generate and statically serve the articles to improve performance.
 Using Hadoop + MapReduce running on EC2/S3, it converted 4 TB of TIFFs into 11 million PDF articles in 24 hours.
Hadoop Usage: Facebook Messages
• Design requirements:
o Integrate display of email, SMS, and chat messages between pairs and groups of users
o Strong control over who users receive messages from
o Suited for production use by 500 million people immediately after launch
o Stringent latency & uptime requirements
• System requirements:
o High write throughput
o Cheap, elastic storage
o Low latency
o High consistency (within a single data center is good enough)
o Disk-efficient sequential and random read performance
Hadoop Usage: Facebook Messages
 Classic alternatives:
 These requirements are typically met using a large MySQL cluster and caching tiers using Memcache.
 Content on HDFS could be loaded into MySQL or Memcached if needed by the web tier.
 Problems with the previous solutions:
 MySQL has low random write throughput... a BIG problem for messaging!
 It is difficult to scale MySQL clusters rapidly while maintaining performance.
 MySQL clusters have high management overhead and require more expensive hardware.
Hadoop Usage: Facebook Messages
 Facebook's solution:
 Hadoop + HBase as the foundations.
 Improve and adapt HDFS and HBase to scale to FB's workload and operational considerations.
 A major concern was availability: the NameNode is a SPOF, and failover times are at least 20 minutes.
 A proprietary "AvatarNode" eliminates the SPOF and makes HDFS safe to deploy even with a 24/7 uptime requirement.
 Performance improvements for the realtime workload: on an RPC timeout, fail fast and try a different DataNode.
Cloud Computing for Mobile and Pervasive Applications
 Sensory-based applications
 Location-based services (LBS)
 Mobile music: 52.5%
 Mobile video: 25.2%
 Mobile gaming: 19.3%
 Augmented reality
 Mobile social networks and crowdsourcing
 Multimedia and data streaming

Due to limited resources on mobile devices, we need outside resources to empower mobile apps.
Mobile Cloud Computing Ecosystem
 Public cloud providers
 Local and private cloud providers
 Wired and wireless network providers
 Content and service providers
 Devices, users, and apps
2-Tier Cloud Architecture
 Tier 1: Public cloud
 (+) Scalable and elastic
 (-) Price, delay
 Tier 2: Local cloud
 (+) Low delay, low power
 (-) Not scalable or elastic
 Measured round-trip times (RTT): ~290 ms through a 3G access point vs. ~80 ms through a Wi-Fi access point.
 IBM: by 2017, 61% of enterprises are likely to be on a tiered cloud.
Mobility-Aware Service Allocation
How can we optimally and fairly assign services to mobile users in a 2-tier cloud architecture (knowing the user mobility pattern), considering the power consumed on the mobile device, the delay users experience, and the price as the main criteria for optimization?

Research components:
 Modeling mobile apps
 Algorithms
 Middleware architecture and system design
 Scalability
Modeling Mobile Applications as Workflows
 Model apps as a series of logical steps, known as services, combined with different composition patterns (see the sketch after this list):
 SEQ: sequential services
 LOOP: a service repeated multiple times
 AND: concurrent functions (parallel branches)
 XOR: conditional functions, with branch probabilities P1 + P2 = 1, P1, P2 ∈ {0, 1}
[Diagram: example workflows built from these patterns, from Start to End]
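To make the composition patterns concrete, here is a small illustrative sketch (the names and structure are assumptions for illustration, not taken from the lecture or the MAPCloud papers) of how an app could be represented as nested SEQ/AND/XOR/LOOP patterns:

```python
# Illustrative workflow model: services composed with SEQ / AND / XOR / LOOP patterns.
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Service:
    name: str


@dataclass
class Pattern:
    kind: str                                         # "SEQ", "AND", "XOR", or "LOOP"
    children: List[Union["Pattern", Service]] = field(default_factory=list)
    k: int = 1                                        # loop count, used when kind == "LOOP"


# Example: S1, then S2 and S3 in parallel, then S4 repeated twice.
app = Pattern("SEQ", [
    Service("S1"),
    Pattern("AND", [Service("S2"), Service("S3")]),
    Pattern("LOOP", [Service("S4")], k=2),
])
```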
Modeling Mobile Applications as Workflows
[Diagram: a Location-Time Workflow, recording which workflow W a user runs at each location l1, ..., ln over time intervals t1, ..., tN]
 It can be formally defined as:

$$W(u_k)_{\mathrm{T},L} \triangleq \big(\, w(u_k)^{l_{n_1}}_{t_{m_1}},\; w(u_k)^{l_{n_2}}_{t_{m_2}},\; \dots,\; w(u_k)^{l_{n_k}}_{t_{m_k}} \big)$$
Quality of Service (QoS)
• $q(u_k)^{power}_{s_i,l_j}$: the power consumed on $u_k$'s cellphone when the user is in location $l_j$ using service $s_i$.
• QoS can be defined at two different levels:
• Atomic service level, e.g. $q(u_k)^{power}_{s_i,l_j}$ for power.
• Composite service level, or workflow level.
• The workflow QoS is aggregated according to the composition patterns (shown for power, $W^{power}$):
• SEQ: $\sum_{i=1}^{n} q(u_k)^{power}_{s_i,l_j}$
• AND (PAR): $\sum_{i=1}^{n} q(u_k)^{power}_{s_i,l_j}$
• XOR (IF-ELSE-THEN): $\max_{s_i,l_j}\, q(u_k)^{power}_{s_i,l_j}$
• LOOP: $q(u_k)^{power}_{s_i,l_j} \times k_i$
Normalization
• Different QoS metrics have different dimensions (price → $, power → joules, delay → seconds).
• We need a normalization process to make them comparable.
• The normalized power, price, and delay are real numbers in the interval [0, 1]; the higher the normalized QoS, the better the execution plan.

$$\overline{W}(u_k)^{power} \triangleq \begin{cases} \dfrac{W(u_k)^{power}_{\max} - W(u_k)^{power}}{W(u_k)^{power}_{\max} - W(u_k)^{power}_{\min}}, & W(u_k)^{power}_{\max} \neq W(u_k)^{power}_{\min} \\[6pt] 1, & \text{else} \end{cases}$$

$$\overline{W}(u_k)_{\mathrm{T},L}^{power} \triangleq \begin{cases} \dfrac{W(u_k)_{\mathrm{T},L}^{power,\max} - W(u_k)_{\mathrm{T},L}^{power}}{W(u_k)_{\mathrm{T},L}^{power,\max} - W(u_k)_{\mathrm{T},L}^{power,\min}}, & W(u_k)_{\mathrm{T},L}^{power,\max} \neq W(u_k)_{\mathrm{T},L}^{power,\min} \\[6pt] 1, & \text{else} \end{cases}$$

M. Reza Rahimi, Nalini Venkatasubramanian, Sharad Mehrotra, and Athanasios Vasilakos, "MAPCloud: Mobile Applications on an Elastic and Scalable 2-Tier Cloud Architecture", in the 5th IEEE/ACM International Conference on Utility and Cloud Computing (UCC 2012), USA, Nov. 2012.
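The following sketch applies the aggregation rules and normalization above in plain Python (an illustration under those definitions, not code from the MAPCloud work):

```python
# Illustrative only: per-pattern power aggregation and min-max normalization.
def aggregate_power(pattern, q_values, k=1):
    """q_values: per-service power costs of the pattern's steps or branches."""
    if pattern in ("SEQ", "AND"):
        return sum(q_values)           # power adds up across sequential/parallel steps
    if pattern == "XOR":
        return max(q_values)           # worst-case branch
    if pattern == "LOOP":
        return k * q_values[0]         # loop body repeated k times
    raise ValueError(pattern)


def normalize(w, w_min, w_max):
    """Map a workflow QoS value into [0, 1]; higher means a better execution plan."""
    if w_max == w_min:
        return 1.0
    return (w_max - w) / (w_max - w_min)
```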

Optimal Service Allocation for Single Mobile User
• Fairness utility: maximize the minimum (normalized) saving of power, price, and delay across the mobile applications:

$$\max \; \frac{1}{|U|} \sum_{u_k} \min\Big( \overline{W}(u_k)_{\mathrm{T},L}^{power},\; \overline{W}(u_k)_{\mathrm{T},L}^{price},\; \overline{W}(u_k)_{\mathrm{T},L}^{delay} \Big)$$

Subject to:

$$\frac{1}{|U|} \sum_{u_k} W(u_k)_{\mathrm{T},L}^{power} \le B_{power}, \qquad \frac{1}{|U|} \sum_{u_k} W(u_k)_{\mathrm{T},L}^{price} \le B_{price}, \qquad \frac{1}{|U|} \sum_{u_k} W(u_k)_{\mathrm{T},L}^{delay} \le B_{delay},$$

$$\kappa \le Cap(\mathrm{Local\_Clouds}), \quad \kappa \triangleq \text{number of mobile users using services on the local cloud}, \quad \forall\, u_k \in \{u_1, \dots, u_{|U|}\}$$

• In this optimization problem, the goal is to maximize the minimum saving of power, price, and delay of the mobile applications.
Service Allocation Algorithms for Single Mobile User and Mobile Group-Ware Applications
 Brute-Force Search (BFS)
 Simulated Annealing based
 Genetic based
 Greedy based
 Random Service Allocation (RSA)

• MuSIC: Mobility-aware Service AllocatIon on Cloud.
• Based on a simulated annealing approach.
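As a rough illustration of the simulated-annealing idea behind MuSIC (this is not the MuSIC implementation; `services`, `tiers`, and `utility` are placeholder inputs), a generic annealer over service-to-tier assignments might look like:

```python
# Illustrative simulated annealing over service-to-cloud-tier assignments.
import math
import random


def simulated_annealing(services, tiers, utility, t0=1.0, cooling=0.95, steps=1000):
    # Start from a random assignment of each service to a cloud tier.
    assign = {s: random.choice(tiers) for s in services}
    best, best_u = dict(assign), utility(assign)
    t = t0
    for _ in range(steps):
        s = random.choice(services)
        old = assign[s]
        old_u = utility(assign)
        assign[s] = random.choice(tiers)           # propose a local change
        new_u = utility(assign)
        # Accept improvements always; accept worse moves with temperature-dependent probability.
        if new_u >= old_u or random.random() < math.exp((new_u - old_u) / t):
            if new_u > best_u:
                best, best_u = dict(assign), new_u
        else:
            assign[s] = old                        # reject the move
        t *= cooling                               # cool down over time
    return best, best_u
```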
MAPCloud Middleware Architecture
[Diagram] Main components of the MAPCloud middleware:
• Cloud Service Registry
• MAPCloud Web Service Interface
• MAPCloud Runtime
• Optimal Service Scheduler
• MAPCloud LTW Service DB and Log DB
• QoS-Aware Mobile Client (mobile user)
• Local and Public Cloud Pool
References
 M. Satyanarayanan, P. Bahl, R. Cáceres, N. Davies, "The Case for VM-Based Cloudlets in Mobile Computing", PerCom 2009.
 M. Reza Rahimi, Jian Ren, Chi Harold Liu, Athanasios V. Vasilakos, and Nalini Venkatasubramanian, "Mobile Cloud Computing: A Survey, State of Art and Future Directions", ACM/Springer Mobile Networks and Applications (MONET), Special Issue on Mobile Cloud Computing, Nov. 2013.
 M. Reza Rahimi, Nalini Venkatasubramanian, Athanasios Vasilakos, "MuSIC: On Mobility-Aware Optimal Service Allocation in Mobile Cloud Computing", in the IEEE 6th International Conference on Cloud Computing (CLOUD 2013), Silicon Valley, CA, USA, July 2013.
