Cloud Computing
Professor Mangal Sain
Lecture 6
Cloud Basics & Storage
Lecture 6 – Part 1
Anatomy of Cloud
HARNESSING BIG DATA
OLTP: Online Transaction Processing (DBMSs)
OLAP: Online Analytical Processing (Data Warehousing)
RTAP: Real-Time Analytics Processing (Big Data Architecture &
technology)
THE MODEL HAS CHANGED…
The Model of Generating/Consuming Data has Changed
Old Model: Few companies are generating data, all others are consuming data
New Model: all of us are generating data, and all of us are consuming data
WHAT’S DRIVING BIG DATA
Big data analytics:
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, from many sources
- Very large datasets
- More real-time processing
Traditional business intelligence:
- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets
THE EVOLUTION OF BUSINESS INTELLIGENCE
1990s: BI Reporting, OLAP & Data Warehouse
(Business Objects, SAS, Informatica, Cognos, other SQL reporting tools)
2000s: Big Data: Scale & Speed; Batch Processing & Distributed Data Store
(Hadoop/Spark; HBase/Cassandra)
2010s: Interactive Business Intelligence & In-memory RDBMS; Big Data: Real Time & Single View; Speed & Scale
(QlikView, Tableau, HANA; Graph Databases)
BIG DATA ANALYTICS
Big data is more real-time in
nature than traditional DW
applications
Traditional DW
architectures (e.g. Exadata,
Teradata) are not well-suited
for big data apps
Shared nothing, massively
parallel processing, scale out
architectures are well-suited
for big data apps
BIG DATA TECHNOLOGY
CLOUD COMPUTING
IT resources provided as a service
Compute, storage, databases, queues
Clouds leverage economies of scale of commodity hardware
Cheap storage, high bandwidth networks &
multicore processors
Geographically distributed data centers
Offerings from Microsoft, Amazon, Google, …
wikipedia:Cloud Computing
BENEFITS
Cost Savings
Security
Flexibility
Mobility
Insight
Increased Collaboration
Quality Control
Disaster Recovery
Loss Prevention
Automatic Software Updates
Competitive Edge
Sustainability
Lecture 6 – Part 2
Anatomy of Cloud contd.
TYPES OF CLOUD COMPUTING
Public Cloud: Computing infrastructure is
hosted at the vendor’s premises.
Private Cloud: Computing architecture is
dedicated to the customer and is not shared
with other organisations.
Hybrid Cloud: Organisations host some
critical, secure applications in private
clouds, while the not-so-critical applications are
hosted in the public cloud.
Cloud bursting: the organisation uses its own
infrastructure for normal usage, but cloud is used
for peak loads.
Community Cloud
CLASSIFICATION OF CLOUD COMPUTING BASED ON
SERVICE PROVIDED
Infrastructure as a service (IaaS)
Offering hardware related services using the principles of cloud computing. These
could include storage services (database or disk storage) or virtual servers.
Amazon EC2, Amazon S3, Rackspace Cloud Servers and Flexiscale.
Platform as a Service (PaaS)
Offering a development platform on the cloud.
Google’s App Engine, Microsoft’s Azure,
Salesforce.com’s Force.com.
Software as a service (SaaS)
Including a complete software offering on the cloud. Users
can access a software application hosted by the cloud
vendor on pay-per-use basis. This is a well-established
sector.
Salesforce.com’s offering in the online Customer
Relationship Management (CRM) space, Google’s Gmail,
Microsoft’s Hotmail, and Google Docs.
INFRASTRUCTURE AS A SERVICE (IAAS)
PLATFORM AS A SERVICE
SOFTWARE AS A SERVICE (SAAS)
MORE REFINED CATEGORIZATION
Storage-as-a-service
Database-as-a-service
Information-as-a-service
Process-as-a-service
Application-as-a-service
Platform-as-a-service
Integration-as-a-service
Security-as-a-service
Management/Governance-as-a-service
Testing-as-a-service
Infrastructure-as-a-service
InfoWorld Cloud Computing Deep Dive
KEY INGREDIENTS IN CLOUD COMPUTING
Service-Oriented Architecture (SOA)
Utility Computing (on demand)
Virtualization (P2P Network)
SAAS (Software As A Service)
PAAS (Platform As A Service)
IAAS (Infrastructure As A Service)
Web Services in Cloud
ENABLING TECHNOLOGY: VIRTUALIZATION
Traditional stack: apps run on a single operating system, which runs directly on the hardware
Virtualized stack: apps run on multiple guest OSes, which share the hardware through a hypervisor
EVERYTHING AS A SERVICE
Utility computing = Infrastructure as a Service (IaaS)
Why buy machines when you can rent cycles?
Examples: Amazon’s EC2, Rackspace
Platform as a Service (PaaS)
Give me nice API and take care of the maintenance,
upgrades, …
Example: Google App Engine
Software as a Service (SaaS)
Just run it for me!
Example: Gmail, Salesforce
CLOUD VERSUS CLOUD
Amazon Elastic Compute Cloud
Google App Engine
Microsoft Azure
GoGrid
AppNexus
Lecture 6 – Part 3
Anatomy of Cloud and Cloud Storage
RECAP: CLOUD BENEFITS
Elastic, just-in-time infrastructure
More efficient resource utilization
Pay for what you use
Potential to reduce processing time
Parallelization
Leverage multiple data centers
High availability, lower response times
TODAY’S CLOUD APPLICATIONS
Web applications
Client/server paradigm
Request/response messaging pattern
Interactive communication
Processing pipelines
Examples: Indexing, data mining, image processing,
video transcoding, document processing
Batch processing systems
Example: report generation, fraud detection, analytics,
backups, automated testing
MANY STYLES OF SYSTEM
Near the edge of the application focus is on vast numbers
of clients and rapid response
Inside we find data-intensive services that operate in a
pipelined manner, asynchronously
Deep inside the application we see a world of virtual
computer clusters that are scheduled to share resources
and on which applications like MapReduce (Hadoop) are
very popular
HOW ARE CLOUD APPS STRUCTURED?
Clients talk to the application using Web browsers or Web services standards
But this only gets us to the outer “skin” of
the data center, not the interior
Consider Amazon: it can host entire
company web sites (like Netflix.com), data
(S3), servers (EC2), databases (RDS) and
even virtual desktops!
BIG PICTURE OVERVIEW
Client requests are handled by front-end Web servers
Application servers are invoked for dynamic Web content generation and run app logic (PHP, Java, Python, ...)
Back-end databases manage and provide access to data
[Figure: Web servers in front of application servers, with caches, databases, an index, and shards behind them]
APPLICATIONS WITH MULTIPLE TIERS
Web servers
Application servers
Data store (or database)
REDUNDANCY AT EACH TIER
Web servers
Application servers
Data store
LOAD BALANCER
Load balancer
Web servers
Application servers
Data store
SCALING: STATELESS, CACHING, AND SHARDING
STATELESS SERVERS ARE EASIEST TO SCALE
Views a client request as an independent transaction and responds
to it
Advantages:
Simpler and easier to scale: does not maintain state
More robust: tolerating instance failures does not require the overhead of restoring state
CACHING
Load balancer
Web servers
Application servers
Caching
Data store
CACHING
Caching is central to responsiveness
Basic idea is to always use cached data if at all possible,
so the inner services (data stores) are shielded from
“online” load
Caching is only temporary storage, hence it is stateless
We can add multiple cache servers to spread loads
Must think hard about patterns of data access
Some data needs to be heavily replicated to offer very fast
access on vast numbers of nodes
In principle the level of replication should match level of
load and the degree to which the data is needed
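As an illustrative sketch (not from the lecture), the cache-aside pattern described above can be expressed in a few lines of Python; the `cache` and `database` dicts are stand-ins for a real cache tier (e.g., memcached) and the backing data store:

```python
# Cache-aside sketch: always try the cache first, and only fall
# through to the (slow) data store on a miss, shielding it from load.
cache = {}                       # fast, temporary (stateless) tier
database = {"user:1": "Alice"}   # authoritative data store

def get(key):
    if key in cache:             # cache hit: data store is shielded
        return cache[key]
    value = database.get(key)    # cache miss: read from the store
    if value is not None:
        cache[key] = value       # populate cache for future reads
    return value
```

Repeated reads of the same key are then served entirely from the cache tier.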
STATEFUL SERVERS REQUIRE ATTENTION
Scaling a relational database is challenging
Traditional approach is replication
Data is written to a master server and then replicated to one or
more slave servers (synchronously or asynchronously)
Read operations can be handled by the slaves
All writes happen on the master
[Figure: writes go to the master data store, which replicates to a slave]
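The master/slave read-write split above can be sketched as follows; the dicts standing in for master and slave servers are hypothetical, and replication here is done synchronously for simplicity, whereas real systems often replicate asynchronously:

```python
# All writes go to the master; reads are load-balanced (round-robin)
# across slave replicas.
import itertools

master = {}
slaves = [{}, {}]                       # two read replicas
next_slave = itertools.cycle(range(len(slaves)))

def write(key, value):
    master[key] = value                 # all writes hit the master
    for s in slaves:                    # replicate to every slave
        s[key] = value

def read(key):
    replica = slaves[next(next_slave)]  # spread reads across slaves
    return replica.get(key)
```

With asynchronous replication, a `read()` issued right after a `write()` could return stale data, which is exactly the drawback listed on the next slide.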
STATEFUL SERVERS REQUIRE ATTENTION
Cons:
Master becomes the write bottleneck
Master is a single point of failure
As load increases, cost of replication increases
Slaves may fall behind and serve stale data
SHARDING
Data partitioning strategy
Basic idea: split data between multiple machines and
have a way to make sure you always access data from the
right place
Typically define a sharding key and create a shard mapping
(e.g., shard_idx = hash(key) mod N)
Other partitioning schemes exist: e.g., allocate whole tables on
the same machine
[Figure: data split across partitions 1, 2, and 3]
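A minimal sketch of the shard mapping described above (shard_idx = hash(key) mod N); `hashlib` is used instead of Python's built-in `hash()`, which is randomized per process for strings:

```python
# Hash-based sharding: the shard mapping decides which of N
# machines holds a given key.
import hashlib

N = 3
shards = [{} for _ in range(N)]   # stand-ins for N database machines

def shard_idx(key):
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % N    # shard mapping: hash(key) mod N

def put(key, value):
    shards[shard_idx(key)][key] = value

def get(key):
    return shards[shard_idx(key)].get(key)
```

Every access for the same key deterministically lands on the same shard, so data is always read from the right place.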
BENEFITS OF SHARDING
Increased read and write throughput
High availability
Possibility of doing more work in parallel within the
application server
Challenge: picking a good partitioning scheme
Otherwise risk of having hotspots in the system due to
load imbalance
SHARDING USED IN MANY WAYS
Sharding is not only for partitioning data within a
database
Applies essentially to every application tier
Notion of sharding is cross-cutting
Example: partition data across caching servers
Two popular in-memory caching systems:
memcached: distributed object caching system
redis: distributed data structure server (also works as store)
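One common way to partition keys across cache servers is consistent hashing; the sketch below is an illustration (not tied to any particular memcached or redis client): servers sit on a hash ring and each key maps to the first server clockwise from its hash, so adding or removing a server remaps only a fraction of the keys, unlike plain hash(key) mod N.

```python
# Consistent hashing: map keys to cache servers via a hash ring.
import bisect
import hashlib

def h(s):
    # Stable hash (md5) so placement is identical across processes.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, servers):
        # Each server is placed on the ring at its hash position.
        self.ring = sorted((h(srv), srv) for srv in servers)

    def server_for(self, key):
        # First server clockwise from the key's hash (wrap around).
        idx = bisect.bisect(self.ring, (h(key), ""))
        return self.ring[idx % len(self.ring)][1]

ring = HashRing(["cache1", "cache2", "cache3"])
```

Production clients typically place many virtual nodes per server on the ring to even out the load; that refinement is omitted here.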
AND IT ISN’T JUST ABOUT UPDATES
We should also be thinking about patterns that arise
when doing reads (“queries”)
Some can just be performed by a single representative of a
service
But others might need the parallelism of having several (or
even a huge number) of machines do parts of the work
concurrently
The term sharding is used for data, but here we
might talk about “parallel computation on a shard”
FIRST-TIER PARALLELISM
Parallelism is vital for fast interactive services
Key question:
Request has reached some service instance X
Will it be faster…
… For X to just compute the response
… Or for X to subdivide the work by asking subservices to do parts of
the job?
Glimpse of an answer
When you make a search on Bing, the query is processed in
parallel by as many as thousands of servers that run in real time on
your request!
Parallel actions must focus on the critical path
WHAT DOES “CRITICAL PATH” MEAN?
Focus on the delay until a client receives a reply
The critical path consists of the actions that contribute to this delay
[Figure: a request travels through a service instance and back; the response delay seen by the end user would also include Internet latencies]
PARALLEL SPEEDUP
In this example of a parallel read-only request, the critical path
centers on the middle “subservice”
[Figure: a request fans out to three subservices in parallel; the middle subservice lies on the critical path, and the response delay seen by the end user would also include Internet latencies]
WITH REPLICAS WE JUST LOAD BALANCE
[Figure: with replicated subservices, each request is load-balanced across the replicas; the response delay seen by the end user would also include Internet latencies]
WHAT IF A REQUEST TRIGGERS UPDATES?
If updates are done “asynchronously” we might not experience
much delay on the critical path
Cloud systems often work this way
Avoids waiting for slow services to process the updates but may force the
tier-one service to “guess” the outcome
For example, store in the master database and replicate to the slave in
the background
Many cloud systems use these sorts of “tricks” to speed up response
time
WHAT IF WE SEND UPDATES WITHOUT WAITING?
Several issues now arise
Are all the replicas applying updates in the same order?
Might not matter unless the same data item is being changed
But then clearly we do need some “agreement” on order
What if the leader replies to the end user but then crashes and it
turns out that the updates were lost in the network?
Data center networks can be surprisingly lossy at times
Also, bursts of updates can queue up
Such issues result in inconsistency
IS INCONSISTENCY A BAD THING?
How much consistency is really needed in the first tier of the cloud?
Think about YouTube videos. Would consistency be an issue here?
What about the Amazon “number of units available” counters. Will
people notice if those are a bit off?
Puzzle: can you come up with a general policy for knowing how
much consistency a given thing needs?
CLOUD STORAGE
COMPLEX SERVICE, SIMPLE STORAGE
[Figure: applications see variable-size files with rich operations (read, write, append, move, rename, lock, unlock, ...); the operating system translates these onto fixed-size blocks that support only read and write]
PC users see a rich, powerful interface
Hierarchical namespace (directories); can move, rename,
append to, truncate, (de)compress, view, delete files, ...
But the actual storage device is very simple
HDD only knows how to read and write fixed-size data
blocks
Translation done by the operating system
ANALOGY TO CLOUD STORAGE
Shopping carts
Friend lists
User accounts
Profiles
...
Web service
Key/value store
- read, write
- delete
Many cloud services have a similar structure
Users see a rich interface (shopping carts, product
categories, searchable index, recommendations, ...)
But the actual storage service is very simple
Read/write 'blocks', similar to a giant hard disk
Translation done by the web service
WHAT’S WRONG WITH RELATIONAL DBS?
Most applications interact through a database
Recall RDBMS:
Manage data access, enforce data integrity, control concurrency, support
recovery after a failure
Many applications push traditional RDBMS solutions to the limit
by demanding:
High scalability
Very large amounts of data
Minimal latency
High availability
Solution is far from ideal
IDEAL DATA STORES ON THE CLOUD
Many situations need hosting of large data sets
Examples: Amazon catalog, eBay listings, Facebook pages, …
Ideal: Abstraction of a 'big disk in the clouds', which
would have:
Perfect durability – nothing would ever disappear in a
crash
100% availability – we could always get to the service
Zero latency from anywhere on earth – no delays!
Minimal bandwidth utilization – we only send across
the network what we absolutely need
Isolation under concurrent updates – make sure data
stays consistent
THE INCONVENIENCES OF THE REAL WORLD
Why isn't this feasible?
The “cloud” exists over a physical network
Communication takes time, esp. across the globe
Bandwidth is limited, both on the backbone and at the endpoints
The “cloud” has imperfect hardware
Hard disks crash
Servers crash
Software has bugs
FINDING THE RIGHT TRADEOFF
In practice, we can't have everything
... but most applications don't really need 'everything'!
Some observations:
1. Read-only (or read-mostly) data is easiest to support
Replicate it everywhere! No concurrency issues!
But only some kinds of data fit this pattern – examples?
2. Granularity matters: “Few large-object” tasks generally
tolerate longer latencies than “many small-object” tasks
Fewer requests, often more processing at the client
But it’s much more expensive to replicate or to update!
3. Maybe it makes sense to develop separate solutions for large
read-mostly objects vs. small read-write objects!
Different requirements → different technical solutions
KVS AND CURRENT SYSTEMS
KEY-VALUE STORES
Example (key, value) pairs:
(bob, bschmitt@foo.com)
(gettysburg, "Four score and seven years ago...")
(29ck2dxa1, 0128ckso1$9#*!!8349e)
(windows, )
The key-value store (KVS) is a simple abstraction
for managing persistent state
Data is organized as (key, value) pairs
Only three basic operations:
PUT(key, value)
GET(key) → value
DELETE(key)
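A minimal in-memory sketch of this abstraction, supporting exactly the three operations listed above:

```python
# A toy key-value store: persistent state reduced to three operations.
class KVStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value        # PUT(key, value)

    def get(self, key):
        return self._data.get(key)     # GET(key) -> value, None if absent

    def delete(self, key):
        self._data.pop(key, None)      # DELETE(key), ignores missing keys
```

A real cloud KVS replicates and partitions `_data` across machines, but exposes essentially this same interface.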
EXAMPLES OF KVS
Where have you seen this concept before?
Conventional examples outside the cloud:
In-memory associative arrays and hash tables – limited to a single
application, only persistent until program ends
On-disk indices (like BerkeleyDB)
"Inverted indices" behind search engines
Database management systems – multiple KVSs++
Distributed hashtables
Decentralized distributed systems inspired by P2P (see LSINF2345)
Examples: Chord/Pastry
SUPPORTING AN INTERNET SERVICE WITH A KVS
We’ll do this through a central server, e.g., a Web or application
server
Two main issues:
1. There may be multiple concurrent
requests from different clients
These might be GETs, PUTs, DELETEs, etc.
2. These requests may come from different
parts of the network, with message propagation delays
It takes a while for a request to make it to the server!
We’ll have to handle requests in the order received (why?)
[Figure: clients A and B send requests through the network to server S]
CONCURRENCY CONTROL
Most systems use locks on individual items
Each requestor asks for the lock
A lock manager processes these requests (typically
in FIFO order) as follows:
Lock manager grants the lock to a requestor
Requestor makes modifications
Then releases the lock when it’s done
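The lock-manager flow above can be sketched with per-key locks; this is only an illustration, with `threading.Lock` standing in for the lock manager (note that `threading.Lock` does not itself guarantee FIFO granting order):

```python
# Per-item concurrency control: one lock per key, so two clients
# updating the same key serialize, while different keys proceed
# in parallel.
import threading

store = {}
locks = {}
locks_guard = threading.Lock()         # protects creation of per-key locks

def lock_for(key):
    with locks_guard:                  # lock manager hands out per-key locks
        return locks.setdefault(key, threading.Lock())

def increment(key, amount):
    with lock_for(key):                # grant lock, modify, then release
        store[key] = store.get(key, 0) + amount
```

Without the per-key lock, two concurrent `increment` calls could read the same old value and one update would be lost.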
LIMITATIONS OF PER-KEY CONCURRENCY CONTROL
Suppose I want to transfer credits
from my WoW account to my friend’s?
… while someone else is doing a GET
on my (and her) credit amounts to see if
they want to trade?
This is where one needs a database
management system (DBMS) or transaction
processing manager (app server)
Allows for “locking” at a higher level, across keys and
possibly even systems (see LINGI2172 for more details)
Could you implement higher-level locks within the
KVS? If so, how?
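One possible answer to the closing question, as an illustrative sketch: higher-level cross-key operations can be built on per-key locks by acquiring every lock involved in a consistent (sorted) order, which avoids deadlock when two transfers touch the same accounts in opposite directions.

```python
# Cross-key "transaction" built from per-key locks: a credit transfer
# that reads and writes two keys atomically.
import threading

balances = {"me": 100, "friend": 20}
locks = {k: threading.Lock() for k in balances}

def transfer(src, dst, amount):
    first, second = sorted([src, dst])    # global lock order prevents deadlock
    with locks[first], locks[second]:
        if balances[src] >= amount:       # both keys are read/written atomically
            balances[src] -= amount
            balances[dst] += amount
            return True
        return False                      # insufficient funds: no change
```

While a transfer holds both locks, a concurrent GET on either balance waits, so no client can observe the credits "in flight".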
SPECIALIZED DATA STORES
Example: Amazon’s solutions
Dynamo [SOSP’07]
Many services only store and retrieve data by primary key
Examples: user preferences, shopping cart, best seller lists
Don’t require querying and management RDBMS functionality
Simple Storage Service (S3)
Need to store large objects that change infrequently
Examples: virtual machines, pictures
SPECIALIZED DATA STORES
Example: Google’s solutions
The Google File System [SOSP’03]
Distributed file system for large data-intensive applications
No POSIX API; focus on multi-GB files divided in fixed-size
chunks (64 MB); mostly mutated by appending new data
Single master node maintains all file metadata
Bigtable [OSDI’06]
Distributed storage system for structured data
Data model is a sparse multi-dimensional sorted map indexed
by row and column keys and a timestamp
Each value in the map is opaque to the storage system
SPECIALIZED DATA STORES
Example: Facebook’s solutions
Cassandra [Ladis’09]
A distributed storage system for large sets of structured data
Optimized for very high write throughput; no master nodes
Haystack [OSDI’10]
Object store system optimized for photos
In 2010, over 260 billion images; 20 PB of data; 60 TB/week
Data written once, read often, never modified, rarely deleted
TAO [ATC’13]
A read-optimized graph data store to serve the social graph
Sustains 1 billion reads/s on a changing data set of many PBs
Explicitly favors availability over consistency
SPECIALIZED DATA STORES
Example: LinkedIn’s solutions
Kafka [NetDB’11]
A high-throughput distributed messaging system
Pub/sub architecture designed for aggregating log data
Messages are persisted on disk for durability and replicated for fault
tolerance; guarantees at-least-once delivery
Voldemort
A distributed key-value store supporting only get/put/delete
Inspired by Amazon’s Dynamo: tunable consistency, highly available