GCP Storage Compute
Overview
Distributed frameworks such as Hadoop/MapReduce help process very large datasets across clusters of machines
Required Context
- Hive, HBase, Pig
Monolithic vs. Distributed
- Distributed systems use lots of cheap hardware (“nodes”)
A single coordinating software layer must:
- Partition data
- Co-ordinate computing tasks
- Handle fault tolerance and recovery
- Allocate capacity to processes
[Figure: raw data blocks partitioned across many nodes under a single coordinating software layer]
To solve distributed storage: Google File System
To solve distributed computing: MapReduce
Hadoop = HDFS + YARN + MapReduce
- HDFS: distributed storage
- YARN: resource management
- MapReduce: user defines map and reduce tasks using the MapReduce API
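The map and reduce tasks a user writes can be sketched in plain Python. This is a toy, single-process illustration of the programming model only, not the Hadoop API:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(documents):
    # Map: emit (key, value) pairs - here (word, 1) for every word
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle: group pairs by key (Hadoop does this between map and reduce)
    ordered = sorted(pairs, key=itemgetter(0))
    for word, group in groupby(ordered, key=itemgetter(0)):
        # Reduce: combine all values emitted for one key
        yield (word, sum(v for _, v in group))

docs = ["big data on gcp", "gcp storage and gcp compute"]
counts = dict(reduce_phase(map_phase(docs)))
print(counts["gcp"])  # 3
```

In a real cluster the map calls run in parallel on the nodes holding each data block, and the framework handles the shuffle and fault tolerance.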
Co-ordination Between Hadoop Blocks
- A job is triggered on the cluster (MapReduce)
- YARN figures out where and how to run the job, and allocates the required resources
- HDFS holds the job’s input and output data
Distributed Computing Infrastructure - Operational Implications
- On-premise: you own the machines, you deploy the software, you handle scaling
- Colocation: you still deploy the software and handle scaling
- Either way, over-provisioning risks ending up with a white elephant data centre
- Cloud Services: little chance of a white elephant data center
GCP Overview
Google Cloud Platform = Resources + Billing

Resources
Location Hierarchy
- “Regions”: e.g. Central US, Western Europe, and East Asia
- “Zones”: basically, data centres within regions
- Within a region, locations usually have network latencies of less than 5 ms
- A zone is a “single failure domain” within a region
Resources
- Regional resources: AppEngine instances
- Zonal resources: VM (Compute Engine) instances, disks
Resources
- Hardware: computers, hard disks
- Software: virtual machines (VMs), software services
- Offerings span compute choices, storage technologies, big data, machine learning…
Recall
- Big Data
- Storage Technologies: Cloud Storage, Cloud SQL, BigTable, Datastore
- Machine Learning: concepts, TensorFlow, Cloud ML
- Compute choices: AppEngine, Compute Engine, Containers
Minor Topics
- Logging and monitoring: Stackdriver
- Security, networking: API keys, load balancing…
Consumption Mechanisms
- GCP Console
- Command-line Interface
- Client Libraries
Billing
- Project ~ Namespace
- A project has a name, ID, and number
Getting Feet Wet
Summary
- Cloud services have important advantages over on-premise or colocated data centers
- GCP, Google’s Cloud Platform, offers a suite of storage and compute solutions
Overview
Google AppEngine is the PaaS option -
serverless and ops-free
- “Google Cloud Storage”: just buy storage
- “Google Compute Engine”: get VMs, manage yourself; need HTTPS serving, release management etc
- “Google App Engine”: just focus on code, forget the rest
- “Google Container Engine”: create containers, manage clusters
Hosting a Website - Static, No SSL
- Plain HTML files
- Hosting on Cloud Storage:
- store on GitHub, then use a WebHook to run an update script
- or use a CI/CD tool like Jenkins, with the Cloud Storage plug-in for a post-build step
Hosting a Website - Load balancing, scaling; currently on VMs or servers
- “Google Compute Engine” (cf. Heroku, Engine Yard): get VMs, manage yourself
- Need HTTPS serving, release management etc
Hosting a Website - SSL so that HTTPS serving is possible
- “Firebase Hosting + Storage”
Hosting with Compute Engine
- Auto-scaled
- DevOps: a laundry list
- See: “Automated Image Builds with Jenkins, Packer, and Kubernetes”; “Distributed Load Testing with Kubernetes”
Hosting a Website - Microservices
- Lots of dependencies
- Deployment is becoming painful
[Figures: Container, Container Cluster, Kubernetes Master with Pods, Containers vs. VMs (www.docker.com)]
Hosting with Container Engine
- DevOps need largely mitigated
- Can use Jenkins for CI/CD
- StackDriver for logging and monitoring
Hosting a Website - Microservices
- You run separate web server, database
- Lots of dependencies; deployment is becoming painful
- Use Container Engine for a rendering microservice that uses Compute Engine VMs running Windows to do the actual frame rendering.
- Use App Engine for your web front end, Cloud SQL as your database, and Container Engine for your big data processing.
App Engine Environments

Standard
- Pre-configured with: Java 7, Python 2.7, Go, PHP
- Serverless!
- Instance classes determine price, billing
- Laundry list of services - pay for what you use
- Based on container instances running on Google's infrastructure

Flexible
- More choices: Java 8, Python 3.x, .NET
- Allows you to customize your runtime and even the operating system of your virtual machine, using Dockerfiles
- Under the hood, merely instances of Google Compute Engine VMs
Cloud Functions
- Serverless execution environment for building and connecting cloud services
Compute Engine
- Machine types: High-memory, High-CPU
- Built-in redundancy

Local SSD
- The data that you store on a local SSD persists only until you stop or delete the instance
- Small: each local SSD is 375 GB in size, but you can attach up to eight local SSD devices for 3 TB of total local SSD storage space per instance
- Very high IOPS and low latency

Buckets
- Use when you must share data easily between multiple instances or zones
- Flexible, scalable, durable
Hosting a Website - the options compared
- “Google Cloud Storage”: just buy storage
- “Google Compute Engine”: get VMs, manage yourself (cf. Heroku, Engine Yard)
- “Google App Engine”: lots of code, languages; just focus on code, forget the rest
- “Firebase Hosting + Storage”: static sites with SSL
- “Google Container Engine”: for lots of dependencies, when deployment is becoming painful
Container
- A container image is a lightweight, stand-alone, executable package of a piece of software that includes everything needed to run it: code, runtime, system tools, system libraries, settings (www.docker.com)
[Figure: Containers vs. VMs (www.docker.com)]
Advantages
- Portability
- Rapid deployment
- Orchestration: Kubernetes clusters
Kubernetes
[Figure: a Container Cluster - a Kubernetes Master managing Pods]
Container Cluster
- Group of Compute Engine instances running Kubernetes.
- It consists of one or more node instances, and a managed Kubernetes master endpoint.

Node Instances
- Managed from the master
- Each node runs the Docker runtime and hosts a Kubelet agent, which manages the Docker containers scheduled on the host

Master Endpoint
- The managed master also runs the Kubernetes API server, which
- services REST requests
- schedules pod creation and deletion on worker nodes
- synchronizes pod information (such as open ports and location)

Node Pool
- Subset of machines within a cluster that all have the same configuration.

Container Builder
- Working: import source code from a variety of repositories or cloud storage spaces, execute a build to your specifications, and produce artifacts such as Docker containers or Java archives.

Container Registry
- Private registry for Docker images
Autoscaling
- Automatic resizing of clusters with Cluster Autoscaler
[Figure: Kubernetes Master scaling the number of Pods up as load grows]
Summary
Google AppEngine is the PaaS option -
serverless and ops-free
SQL interface atop file data: Hive (SQL-like, but MapReduce on HDFS)
Storage options overview
- Cloud Storage: choose by location and storage class (multi-regional for frequent access from anywhere in the world)
- BigQuery: SQL-like abstraction for non-relational data; superficially similar in use-case to Hive (recall that OLTP needs strict write consistency, OLAP does not)
- Cloud SQL, Cloud Spanner: Cloud Spanner is Google proprietary, more advanced than Cloud SQL; offers “horizontal scaling” - i.e. bigger data, more instances, replication etc; under the hood, Cloud Spanner has a surprising design - more later
- DataStore: query time depends only on result size - so returning 10 rows will take the same length of time whether the dataset is 10 rows or 10 billion rows
- Bigtable: similar to HBase
Cloud Storage Use-Cases
- Choose by Location and Storage Class

Multi-regional: frequent access from anywhere in the world
- Frequently accessed ("hot" objects), such as serving website content, interactive workloads, or mobile and gaming applications.
- Geo-redundant: Cloud Storage stores your data redundantly in at least two regions separated by at least 100 miles within the multi-regional location of the bucket.
Bucket Storage Classes

Coldline
- Infrequently accessed data, such as data stored for legal or regulatory reasons
- Unlike other "cold" storage services, same throughput and latency (i.e. not slower to access)
- 90-day minimum storage duration, costs for data access, and higher per-operation costs
- XML and JSON APIs
- Client SDK
- Cloud Storage considers bucket names that contain dots to be domain names
Relational Data

Students:
StudentID  Student Name
1          Jane Doe
2          John Walsh

Grades:
StudentID  CourseID  Term         Grade
2          CS294     Spring 2016  B+

Courses:
CourseID  Course Name
Cloud SQL
- MySQL - fast and the usual
- Not serverless

High Availability Configuration
- The failover replica must be in a different zone than the original instance, also called the master
- All changes made to the data on the master, including to user tables, are replicated to the failover replica using semisynchronous replication.
Cloud SQL Proxy
- Provides secure access to your Cloud SQL Second Generation instances without having to whitelist IP addresses or configure SSL.
- Secure connections: the proxy automatically encrypts traffic to and from the database; SSL certificates are used to verify client and server identities.
- The Cloud SQL Proxy works by having a local client, called the proxy, running in the local environment.
- Your application communicates with the proxy with the standard database protocol used by your database.
- The proxy uses a secure tunnel to communicate with its companion process running on the server.
- You can install the proxy anywhere in your local environment. The location of the proxy binaries does not impact where it listens for data from your application.
Cloud Spanner Use-Cases
- Recall that OLTP needs strict write consistency, OLAP does not
- Cloud Spanner is Google proprietary, more advanced than Cloud SQL
- Offers “horizontal scaling” - i.e. bigger data, more instances, replication etc
- Under the hood, Cloud Spanner has a surprising design - more later
- Use when you need strong consistency
- Don’t use if …

Spanner Data Model
- Tables ‘look’ relational - rows, columns, strongly typed schemas
- But…
- Usually query student and course grades together (e.g. John Walsh together with his Spring 2016 CS294 grade of B+)
- If you query Students and Grades together, make Grades child of Students

Child Interleaving
- “Interleaving”: child rows are inserted between parent rows with that key prefix
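The effect of interleaving can be sketched with lexicographically sorted keys; this is a schematic illustration of the key ordering, not Spanner's actual storage format:

```python
# Parent rows keyed by (StudentID,); child rows keyed by (StudentID, CourseID).
rows = {
    ("1",): "Jane Doe",
    ("2",): "John Walsh",
    ("2", "CS294"): "Spring 2016, B+",   # child row interleaved under student 2
    ("3",): "Raymond Wu",
}

# Sorting by key tuple places each child row right after its parent:
for key in sorted(rows):
    print(key, "->", rows[key])
# ("2", "CS294") sorts between ("2",) and ("3",), so reading student 2
# together with their grades is one contiguous key-range scan.
```

This is why interleaving helps exactly when parent and child are usually queried together.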
Hotspotting
- Use a hash of the key value if you have naturally monotonically ordered keys
- Under the hood, Cloud Spanner divides data among servers across key ranges
Hotspotting
- Change the key so that it is not monotonically increasing
- e.g. hash StudentID values

Students:
StudentID  Student Name
1          Jane Doe
2          John Walsh
3          Raymond Wu

Grades:
StudentID  CourseID  Term         Grade
2          CS294     Spring 2016  B+
3          ME101     Winter 2015  C+
1          CS183     Summer 2012  A+
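Key hashing can be sketched in a few lines; the `hashed_key` helper below is hypothetical, purely for illustration of the idea:

```python
import hashlib

def hashed_key(student_id: int) -> str:
    # Prefix the sequential ID with a short hash so that consecutive
    # inserts land in different key ranges (and hence different servers).
    prefix = hashlib.md5(str(student_id).encode()).hexdigest()[:4]
    return f"{prefix}_{student_id}"

keys = [hashed_key(i) for i in (1, 2, 3, 4)]
print(keys)
# The hash prefixes scatter the sequential IDs across the key space,
# so the keys are no longer in monotonically increasing order.
```

The original ID is kept as a suffix so the row is still addressable when you know the StudentID.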
Interleaved Representation (with hashed StudentIDs)
StudentID  Student Name
B99        Raymond Wu
Splits
- A split is a range of rows that can be moved around independent of others
- Splits are added to distribute high read-write data (to break up hotspots)
Data Types
- STRUCTs are not OK in tables, but can be returned by queries (e.g. if a query returns an ARRAY of ARRAYs)

Transactions
- Commit timestamps are "real time", so you can compare them to your watch
- Two transaction modes
- Cloud Spanner has a version-gc that reclaims versions older than 1 hour
Bigtable Use-Cases
- Similar to HBase
- BigTable is basically GCP’s managed HBase
- This is a much stronger link than between, say, Hive and BigQuery!
Row Store (notification data)
Id  To     Type   Content
1   mike   offer  Offer on mobiles
2   john   sale   Redmi sale

A cell is addressed by its row and column, e.g. Row = 3, Column = To.

Columnar Store
Id  Column   Value
1   To       mike
1   Type     offer
1   Content  Offer on mobiles
2   To       john
4   To       megan
4   Type     sale

This is the general structure of how columnar stores are constructed.
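Constructing the columnar form from row-oriented records is mechanical; a minimal sketch:

```python
# Row-oriented records, as in the notification table above
rows = {
    1: {"To": "mike", "Type": "offer", "Content": "Offer on mobiles"},
    2: {"To": "john"},
    4: {"To": "megan", "Type": "sale"},
}

# Columnar form: one (Id, Column, Value) triple per populated cell.
# Missing cells simply produce no triple - sparse rows cost nothing.
triples = [(rid, col, val)
           for rid, row in rows.items()
           for col, val in row.items()]

for t in triples:
    print(t)
```

Note how row 2, which has only a To value, contributes a single triple; this sparseness is one reason wide, mostly empty tables are cheap in columnar stores.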
Properties of HBase

Denormalized storage
- Instead of separate Employee Details, Employee Subordinates, and Employee Address tables, store everything together

Normalization optimizes storage - but storage is cheap in a distributed system!
Denormalized storage optimizes the number of disk seeks instead.

Employees:
Id  Name   Function  Grade
1   Emily  Finance   6
2   John   Finance   3
3   Ben    Finance   4

Subordinates:
Id  Subordinate Id
1   2
1   3
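The trade-off can be sketched in a few lines: a normalized layout needs a join (two lookups), while a denormalized row needs one. This is an in-memory analogy, not HBase code:

```python
# Normalized: "everything about employee 1" needs two lookups (a join)
employees = {1: {"Name": "Emily", "Function": "Finance", "Grade": 6}}
subordinates = {1: [2, 3]}

def profile_normalized(emp_id):
    rec = dict(employees[emp_id])                        # lookup 1
    rec["Subordinates"] = subordinates.get(emp_id, [])   # lookup 2 (the "join")
    return rec

# Denormalized: everything lives in one wide row -
# a single disk seek in HBase terms, at the cost of redundant storage.
employees_wide = {
    1: {"Name": "Emily", "Function": "Finance", "Grade": 6,
        "Subordinates": [2, 3]},
}

assert profile_normalized(1) == employees_wide[1]
```

On a distributed cluster each extra lookup can mean another network hop and disk seek, which is why denormalization wins when storage is cheap.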
NoSQL
- Only a limited set of operations are allowed in HBase
- Only CRUD operations: Create, Read, Update, Delete
- No operations involving multiple tables
- No indexes on tables
- No constraints
A denormalized row holds everything: Id, Name, Function, Grade, Subordinates, Address

RDBMS vs. HBase
- Complex queries such as grouping, aggregates, joins etc vs. only basic operations such as create, read, update and delete
- Normalized storage to minimize redundancy and optimize space vs. denormalized storage to minimize disk seeks
4-dimensional Data Model
- Row Key
- Column Family
- Column
- Timestamp (each cell can hold several timestamped values)
A Table for Employee Data
- Column families: Work, Personal
- Each row holds the data for a single employee
- e.g. the value “VP” stored with timestamp 24490982
Notification Data

Id  To     Type   Content
1   mike   offer  Offer on mobiles
2   john
4   megan  sale

In Bigtable/HBase terms:
- Row Key: the Id
- Column Family: a group of related columns
- Columns: To, Type, Content
- Value + Timestamp: what each cell stores
- Row Key: uniquely identifies a row
- Column: addressed as ColumnFamily:ColumnName, e.g. Work:Department
- Timestamp: used as the version number for the values stored in a column

Example: a table keyed by some ID, with column families Personal (name, gender, marital_status) and Professional (employed, field)
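The four-dimensional cell addressing maps naturally onto nested dictionaries. This is a schematic sketch, not the Bigtable or HBase client API; the `latest` helper mimics how reads default to the newest version:

```python
# table[row_key]["Family:Column"][timestamp] = value
table = {
    "emp#24490982": {
        "Work:Department": {1400000000: "Finance", 1500000000: "Strategy"},
        "Personal:marital_status": {1400000000: "single"},
    }
}

def latest(row_key, column):
    # Reads return the newest version unless an explicit timestamp is given
    versions = table[row_key][column]
    return versions[max(versions)]

print(latest("emp#24490982", "Work:Department"))  # Strategy
```

Older versions stay addressable by timestamp until garbage collection removes them, which is what makes the timestamp a genuine fourth dimension rather than just metadata.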
Filtering Rows Based on Conditions (SQL vs. HBase Shell Commands)
- Conditions on columns
- Timestamp range
BigTable Performance

Use BigTable When
- Don’t use if you need transaction support (OLTP) - use Cloud SQL or Cloud Spanner
- Don’t use for immutable blobs like movies each > 10 MB - use Cloud Storage instead
- Use for very fast scanning and high throughput
- Where each data item < 10 MB and total data > 1 TB
- Use where writes are infrequent/unimportant (no ACID) but fast scans are crucial
- Use Device ID # Time as row key if common query = “All data for a device”
- Use Time # Device ID as row key if common query = “All data for a period for all devices”

Hotspotting
- Like Cloud Spanner, data is stored in sorted lexicographic order of keys
- Salting breaks up hotspots

“Warming the Cache”
- BigTable will observe read and write patterns and redistribute data so that shards are evenly hit
- It will try to store roughly the same amount of data in different nodes
- This is why testing over hours is important to get a true sense of performance

SSD or HDD Disks
- Use SSD unless skimping on cost
- More predictable throughput too (no disk seek variance)
- Don’t even think about HDD unless storing > 10 TB and all batch queries
- The more random access, the stronger the case for SSD
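Why the row-key layout matters can be sketched with a toy prefix scan over sorted keys: since rows are stored in sorted order, the common query should map to one contiguous key range. The key format here is illustrative only:

```python
from bisect import bisect_left

# Device ID # Time layout: all readings for one device are adjacent
rows = sorted([
    "dev42#2016-01-01", "dev42#2016-01-02",
    "dev7#2016-01-01", "dev7#2016-01-02",
])

def prefix_scan(sorted_keys, prefix):
    # A contiguous slice of the sorted key space - the cheap access pattern
    start = bisect_left(sorted_keys, prefix)
    return [k for k in sorted_keys[start:] if k.startswith(prefix)]

print(prefix_scan(rows, "dev42#"))  # all data for device 42, one range read
```

With the reversed Time # Device ID layout, the same contiguity instead benefits "all devices for a period", which is the whole design choice in a nutshell.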
Common causes of poor performance
- Poor schema design (e.g. sequential keys)
- Inappropriate workload
DataStore
- Query time depends only on the size of the result set - so returning 10 rows will take the same length of time whether the dataset is 10 rows or 10 billion rows

Traditional RDBMS vs. DataStore
- Some queries use indices - not all vs. all queries use indices!
- Query time depends on both size of data set and size of result set vs. query time independent of data set, depends on result set alone
- Types of all values in a column are the same vs. types of different properties with the same name in an entity can be different
Use DataStore When
- Don’t use if the application has lots of writes and updates on key columns
- Use for crazy scaling of read performance - to virtually any size
- Use for hierarchical documents with key/value data

Full Indexing
- “Built-in” indices on each property (~field) of each entity kind (~table)
- If you are certain a property will never be queried, can explicitly exclude it from indexing

Implications of Full Indexing
- No joins possible
- From where?
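Full indexing is why query time tracks the result set rather than the dataset; a hypothetical in-memory sketch of per-property posting lists:

```python
from collections import defaultdict

# 10,000 entities; only every 1000th has grade "B+"
entities = {
    i: {"kind": "Student", "grade": "B+" if i % 1000 == 0 else "A"}
    for i in range(10_000)
}

# "Built-in" index: one posting list per (property, value) pair,
# maintained on every write - reads then never scan the dataset.
index = defaultdict(list)
for key, props in entities.items():
    for prop, value in props.items():
        index[(prop, value)].append(key)

matches = index[("grade", "B+")]
print(len(matches))  # 10 - found without scanning the other 9,990 entities
```

The flip side shown here is the write cost: every property of every entity must be indexed on insert, which is exactly why write-heavy workloads on key columns are a poor fit.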