GCP Storage Compute

The document provides an overview of cloud computing and distributed frameworks such as Hadoop and MapReduce, highlighting their importance in handling large data sets. It discusses the differences between roles such as Data Engineer and Cloud Architect, their respective tests, and the major topics covered in each. Additionally, it outlines Google Cloud Platform's resources, compute options, and the advantages of using cloud services over traditional data centers.

Module - Overview

Overview

- As data sets grow larger, in-memory processing on a single machine does not work well
- Distributed frameworks such as Hadoop/MapReduce help
- Configuring a cluster of distributed machines is complicated and expensive
- Cloud services such as GCP, AWS and Azure help


The Test

Two Tests: Data Engineer and Cloud Architect
Both - 2 hours, 50 questions, multiple-choice

Tests and Life

- Data Engineer: not easy to "game"; Big Data, ML, Hadoop… a tough test
- Cloud Architect: definitely easier, more theoretical; Compute, Networking, Security, …


Data Engineer Major Topics (also minor topics for Cloud Architect)

- Big Data: BigQuery, DataFlow, Pub/Sub
- Storage Technologies: Cloud Storage, Cloud SQL, BigTable, Datastore
- Machine Learning: Concepts, TensorFlow, Cloud ML

Data Engineer Minor Topics (also major topics for Cloud Architect)

- Compute choices: AppEngine, Compute Engine, Containers
- Logging and monitoring: Stackdriver
- Security, networking: API keys, load balancing…
Required Context

- Hadoop, Spark, MapReduce…
- Hive, HBase, Pig
- RDBMS, indexing, hashing


Drills and Labs

- Syntax is tested too
- Implementation knowledge essential
- Don't try to "prep for the test" (famous last words)
Why Cloud Computing

- Data is getting bigger; the world is getting smaller
- Nothing fits in-memory on a single machine

Big Data, Small World

- Super-computer vs. a cluster of generic computers
- Monolithic vs. Distributed

Distributed:
- Lots of cheap hardware
- Replication & fault tolerance (HDFS)
- Distributed computing (YARN, MapReduce)
- Also called "clusters", "nodes", "server farms"

All of these servers need to be co-ordinated by a single piece of software
Single Co-ordinating Software

• Partition data
• Co-ordinate computing tasks
• Handle fault tolerance and recovery
• Allocate capacity to processes

Google developed proprietary software to run on these distributed systems


First: store millions of records on multiple machines
Second: run processes on all these machines to crunch data

To solve distributed storage: Google File System
To solve distributed computing: MapReduce

Apache developed open source versions of these technologies:

Google File System -> HDFS
MapReduce -> MapReduce

Hadoop

- HDFS: a file system to manage the storage of data
- MapReduce: a framework to process data across multiple servers
In 2013, Apache released Hadoop 2.0: MapReduce was broken into two separate parts
Hadoop 2.0: HDFS + MapReduce + YARN

- MapReduce: a framework to define a data processing task
- YARN: a framework to run the data processing task
- Each of these components has corresponding configuration files
Co-ordination Between Hadoop Blocks

- The user defines map and reduce tasks using the MapReduce API
- A job is triggered on the cluster
- YARN figures out where and how to run the job, and stores the result in HDFS

A word-count sketch of the map/reduce model follows.
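To make the map/reduce split concrete, here is a minimal word-count sketch in plain Python - no Hadoop APIs involved, all function names are illustrative: the map phase emits (word, 1) pairs, a shuffle groups them by key, and the reduce phase sums each group.

```python
from collections import defaultdict

def map_phase(document):
    # "Map task": emit a (word, 1) pair for every word in the input split.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # The framework's job between map and reduce: group values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # "Reduce task": aggregate the grouped values - here, sum the counts.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data small world", "big data big clusters"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
print(reduce_phase(shuffle(pairs)))  # {'big': 3, 'data': 2, 'small': 1, ...}
```

In a real cluster, the map and reduce functions run on many nodes at once; the framework handles the shuffle, fault tolerance and data locality.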


Hadoop Ecosystem

An ecosystem of tools has sprung up around this core piece of software:
Hive, HBase, Pig, Kafka, Spark, Oozie

Hive
- Provides an SQL interface to Hadoop
- The bridge to Hadoop for folks who don't have exposure to OOP in Java

HBase
- A database management system on top of Hadoop
- Integrates with your application just like a traditional database

Pig
- A data manipulation language
- Transforms unstructured data into a structured format
- Query this structured data using interfaces like Hive

Spark
- A distributed computing engine used along with Hadoop
- Interactive shell to quickly process datasets
- Has a bunch of built-in libraries for machine learning, stream processing, graph processing etc.

Oozie
- A tool to schedule workflows on all the Hadoop ecosystem technologies

Kafka
- Stream processing for unbounded datasets
Setting up and running distributed software such as the Hadoop ecosystem is an expensive, complicated exercise


Distributed Computing Infrastructure

- Who owns the machines?
- Who deploys the software?
- How does scaling happen?

Three options: On-premise, Colocation Services, Cloud Services


On-premise

- Who owns the machines? You do.
- Who deploys the software? You do.
- How does scaling happen? You buy machines and scale them in.


Colocation Services

- Who owns the machines? You (usually and mostly) do.
- Who deploys the software? You do.
- How does scaling happen? You buy or lease machines.

On-premise / Colocation Services

- Utilisation planning is super-important; else, you end up with a white elephant data centre
Cloud Services

- Who owns the machines? Google, Amazon or Microsoft does.
- Who deploys the software? GCP, AWS or Azure does it for you.
- How does scaling happen? You dynamically add machines.

Cloud Services: Financial Implications

- OpEx rather than CapEx
- Conserve cash
- Watch for nickel-and-diming

Cloud Services: Operational Implications

- Utilisation planning very simple indeed
- Little chance of a white elephant data center
GCP Overview

Google Cloud Platform

- "Use resources": hardware (VMs, disks) and software (BigQuery, BigTable…)
- "Pay for resources": billed for usage per-project
Resources
Location Hierarchy

"Regions"
- e.g. Central US, Western Europe, and East Asia
- Within a region, locations usually have network latencies of less than 5ms

"Zones"
- Basically, data centres within regions
- "Single failure domain" within a region
- Identified as region-name + letter, e.g. asia-east1-a
Resources

- Hardware: computers, hard disks
- Software: virtual machines (VMs), software services

By scope:
- Global (aka multi-regional): Cloud Storage, DataStore, BigQuery
- Regional: AppEngine instances
- Zonal: VM (Compute Engine) instances, disks

- Hardware topics: compute choices, storage technologies…
- Software topics: big data, machine learning…
Recall

Major Topics
- Big Data: BigQuery, DataFlow, Pub/Sub
- Storage Technologies: Cloud Storage, Cloud SQL, BigTable, Datastore
- Machine Learning: Concepts, TensorFlow, Cloud ML

Minor Topics
- Compute choices: AppEngine, Compute Engine, Containers
- Logging and monitoring: Stackdriver
- Security, networking: API keys, load balancing…
Consumption Mechanisms

- GCP Console: web front-end
- Command-line Interface: gcloud utility (needs the SDK), or Cloud Shell
- Client Libraries: Python, Java, Go… (a sketch follows)
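As a taste of the client-library route, here is a minimal sketch using the Python library for Cloud Storage; it assumes the google-cloud-storage package is installed and Application Default Credentials are configured.

```python
from google.cloud import storage

# The client picks up the project and credentials from the environment
# (e.g. after running `gcloud auth application-default login`).
client = storage.Client()

# List the project's buckets - the same view the GCP Console gives you.
for bucket in client.list_buckets():
    print(bucket.name)
```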
Billing

- All resources consumed are associated with a project
- Projects are associated with accounts
- Billing happens per-project

Projects

- Resources + Settings + Metadata
- Resources within a project can easily interact
- Project ~ Namespace
- Name, ID, number
- Project ID is unique, forever
Getting Feet Wet
Summary

- Cloud services have important advantages over on-premise or colocated data centers
- GCP, Google's Cloud Platform, offers a suite of storage and compute solutions
- TensorFlow and Cloud ML give GCP an edge in machine learning applications
- The Google Cloud Data Engineer test is a fairly rigorous one
Module - Compute

Overview

- GCP offers three compute options for running cloud apps
- Google AppEngine is the PaaS option - serverless and ops-free
- Google Compute Engine is the IaaS option - fully controllable down to the OS
- Google Container Engine lies in between - clusters of machines running Kubernetes and hosting containers
Hosting Web Content: A Case Study of Compute Options

Cloud Use-Cases

- Hosting a website: relatively basic
- Running a Hadoop cluster: "Big data"
- Serving a TensorFlow model: "Machine learning"
Hosting a Website - five scenarios

- Static, no SSL: plain HTML files; just buy storage
- SSL, CDN, bells & whistles: still quite static, but rich content; need HTTPS serving, release management etc
- Load balancing, scaling: currently on VMs or servers; get VMs, manage yourself
- Lots of dependencies: deployment is becoming painful; create containers, manage clusters
- Heroku, Engine Yard: lots of code, languages; just focus on code, forget the rest
Hosting a Website: Static, No SSL

- Plain HTML files; just buy storage
- Buy disk space: "Google Cloud Storage"
- Automatic scaling
- No HTTPS, no CDN, no deployment help, nothing
Hosting on Cloud Storage

- To host a static site, create a Cloud Storage bucket and upload content
- Can use the "storage.googleapis.com" URL, or your own domain
- Could write your own HTML/CSS, or use static generators: Jekyll, Ghost, Hugo
- Can copy content over to the bucket directly (Web Console, Cloud Shell), or
  - store on GitHub, then use a WebHook to run an update script, or
  - use a CI/CD tool like Jenkins with the Cloud Storage plug-in as a post-build step

An upload sketch follows.
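A minimal upload sketch with the Python client library, assuming google-cloud-storage; the bucket name is a placeholder and must be globally unique.

```python
from google.cloud import storage

client = storage.Client()
bucket = client.create_bucket("my-static-site-example")  # placeholder name

# Upload the home page and make it publicly readable.
blob = bucket.blob("index.html")
blob.upload_from_filename("index.html", content_type="text/html")
blob.make_public()

# Served at https://storage.googleapis.com/my-static-site-example/index.html
print(blob.public_url)
```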
Hosting a Website: SSL, CDN, Bells & Whistles

- Still quite static, but rich content; need HTTPS serving, release management etc
- Use "Firebase Hosting + Google Cloud Storage"
- SSL so that HTTPS serving is possible
- CDN edges the world over
- Atomic deployment, one-click rollback
Hosting a Website: Load Balancing, Scaling

- Currently on VMs or servers; get VMs, manage yourself
- You'd like to control load balancing, scaling etc yourself
- IaaS ("Infrastructure-as-a-Service"): "Google Compute Engine"
- Configuration, administration, management - all on you
- No need to buy machines or install OS, dev stack, languages etc

Hosting with Compute Engine: Google Cloud Launcher

- LAMP stack or WordPress in a few minutes
- Cost estimates before deployment
- You choose machine types and disk sizes before deployment
- Can customise configuration, rename instances etc
- After deployment, you have full control of the machine instances
Hosting with Compute Engine: storage

Loads of storage options:
- Cloud Storage buckets
- Standard persistent disks
- SSD (solid state) persistent disks
- Local SSD

Loads of storage technologies if you prefer:
- Cloud SQL (MySQL, PostgreSQL)
- NoSQL: GCP NoSQL tools - BigTable, Datastore
Hosting with Compute Engine: load balancing options

- Network load balancing: forwarding rules based on address, port, protocol
- HTTP load balancing: looks into content, can examine cookies, route certain clients to one server…
- Internal: on a private network, not on the internet
- Auto-scaled
Hosting with Compute Engine: DevOps laundry list

- Compute Engine management with Puppet, Chef, Salt, and Ansible
- Automated image builds with Jenkins, Packer, and Kubernetes
- Distributed load testing with Kubernetes
- Continuous delivery with Travis CI
- Managing deployments with Spinnaker
- StackDriver for logging and monitoring

Hosting a Website: Lots of Dependencies

- You run a separate web server and database
- Separate containers to isolate them from each other: "Google Container Engine"
- Service-oriented architecture; microservices
- Deployment is becoming painful; create containers, manage clusters

Container

A container image is a lightweight, stand-alone, executable package of a piece of software that includes everything needed to run it: code, runtime, system tools, system libraries, settings (www.docker.com)
Container Cluster

[Diagram: a Kubernetes Master managing Node Instances #1-#3, each running a Kubelet and hosting Pods]
Containers and VMs (www.docker.com)

- Containers: virtualise the operating system; more portable; quick to boot; size - tens of MBs
- Virtual Machines: virtualise hardware; less portable; slow to boot; size - tens of GBs
Hosting with Container Engine

- DevOps - need largely mitigated
- Can use Jenkins for CI/CD
- StackDriver for logging and monitoring
Hosting a Website: Heroku / Engine Yard Use-Case

- Lots of code, languages; just focus on code, forget the rest
- Just write the code - leave the rest to the platform
- PaaS ("Platform-as-a-Service"): "Google AppEngine"

Hosting on AppEngine
Hosting a Website - all five options: "Google Cloud Storage"; "Firebase Hosting + Google Cloud Storage"; "Google Compute Engine"; "Google Container Engine"; "Google AppEngine"
Compute Choices

App Engine
- A flexible, zero-ops (serverless!) platform for building highly available apps
- You want to focus on writing code, and never want to touch a server, cluster, or infrastructure
- You neither know nor care about the OS running your code
- Support for Java, Python, PHP, Go, Ruby (beta) and Node.js (beta)… or bring your own app runtime
- Web sites; mobile app and gaming backends; RESTful APIs; Internet of Things (IoT) apps

Container Engine
- Logical infrastructure powered by Kubernetes, the open source container orchestration system
- You want to increase velocity and improve operability dramatically by separating the app from the OS
- You don't have dependencies on a specific operating system
- Run the same application on your laptop, on-premise and in the cloud
- Containerized workloads; cloud-native distributed systems; hybrid applications

Compute Engine
- Virtual machines running in Google's global data center network
- You need complete control over your infrastructure and direct access to high-performance hardware such as GPUs and local SSDs
- You need to make OS-level changes, such as providing your own network or graphic drivers, to squeeze out the last drop of performance
- Direct access to GPUs that you can use to accelerate specific workloads
- Any workload requiring a specific OS or OS configuration; currently deployed, on-premises software that you want to run in the cloud; workloads that can't be containerised easily, or that need existing VM images
Mix-and-Match

- Use App Engine for the front-end serving layer, while running Redis in Compute Engine.
- Use Container Engine for a rendering microservice that uses Compute Engine VMs running Windows to do the actual frame rendering.
- Use App Engine for your web front end, Cloud SQL as your database, and Container Engine for your big data processing.
App Engine

Environments

- Standard: pre-configured with Java 7, Python 2.7, Go, PHP
- Flexible: more choices - Java 8, Python 3.x, .NET

- Serverless!
- Instance classes determine price and billing
- Laundry list of services - pay for what you use
AppEngine Standard Environment

- Based on container instances running on Google's infrastructure
- Preconfigured with one of several available runtimes (Java 7, Python 2.7, Go and PHP)
- Each runtime also includes libraries that support App Engine Standard APIs - maybe all you need
- Applications run in a secure, sandboxed environment
- The App Engine standard environment distributes requests across multiple servers, and scales servers to meet traffic demands
- Your application runs within its own secure, reliable environment that is independent of the hardware, operating system, or physical location of the server
AppEngine Flexible Environment

- Allows you to customize your runtime and even the operating system of your virtual machine, using Dockerfiles
- Under the hood, merely instances of Google Compute Engine VMs
Cloud Functions

- Serverless execution environment for building and connecting cloud services
- Write simple, single-purpose functions
- Attached to events emitted from your cloud infrastructure and services
- A Cloud Function is triggered when an event being watched is fired
- Your code executes in a fully managed environment
- No need to provision any infrastructure or worry about managing any servers
- Cloud Functions are written in JavaScript and run in any standard Node.js runtime


Compute Engine

Recall - Hosting a Website: load balancing, scaling

- You'd like to control load balancing, scaling etc yourself
- IaaS ("Infrastructure-as-a-Service"): "Google Compute Engine"
- Configuration, administration, management - all on you
- No need to buy machines or install OS, dev stack, languages etc

Image Types

- Public images for Linux and Windows Server that Google provides
- Private images that you create or import to Compute Engine
- Images of other OSes OK too

Creation

- Creator has full root privileges, SSH capability
- Can share with other users
- While creating an instance, specify: zone, OS (image), machine type (a sketch follows)
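A hedged sketch of instance creation with the google-cloud-compute Python client; the project, zone, instance name, image and machine type below are placeholders, not values from the course.

```python
from google.cloud import compute_v1

project, zone = "example-project", "asia-east1-a"  # placeholders

# Boot disk initialised from a public Debian image.
disk = compute_v1.AttachedDisk(
    boot=True,
    auto_delete=True,
    initialize_params=compute_v1.AttachedDiskInitializeParams(
        source_image="projects/debian-cloud/global/images/family/debian-11",
        disk_size_gb=10,
    ),
)

instance = compute_v1.Instance(
    name="demo-instance",
    machine_type=f"zones/{zone}/machineTypes/e2-micro",
    disks=[disk],
    network_interfaces=[compute_v1.NetworkInterface(network="global/networks/default")],
)

op = compute_v1.InstancesClient().insert(
    project=project, zone=zone, instance_resource=instance)
op.result()  # block until the create operation finishes
```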
Projects and Instances

- Each instance belongs to a project
- Projects can have any number of instances
- Projects can have up to 5 VPCs (Virtual Private Cloud networks)
- Each instance belongs in one VPC
  - Instances within a VPC communicate on the LAN
  - Instances across VPCs communicate over the internet
Machine Types

- Standard
- High-memory
- High-CPU
- Shared-core (small, non-resource-intensive)
- Can attach GPU dies to most machine types


- Much much cheaper than regular Compute Engine
instances

Preemptible - But, might be terminated (preempted) at any time if


Compute Engine needs the resources
Instances
- Use for fault-tolerant applications
- Will definitely be terminated after running for 24
hours

Preemptible - Probability of termination varies by day/zone etc

Instances - Cannot live migrate (stay alive during updates) or


auto-restart on maintenance
Preemption sequence

- Step 1: Compute Engine sends a Soft Off signal
- Step 2: Hopefully, you have a shutdown script to clean up and give up control within 30 seconds
- Step 3: If not, Compute Engine sends a Mechanical Off signal
- Compute Engine transitions the instance to the Terminated state

A preemption-detection sketch follows.
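One hedged way to react to preemption from inside the VM is to poll the metadata server's instance/preempted key, which flips to TRUE when the Soft Off signal arrives; in practice a registered shutdown script serves the same purpose.

```python
import time
import urllib.request

# The metadata server is reachable from inside any Compute Engine VM.
URL = ("http://metadata.google.internal/computeMetadata/v1/"
       "instance/preempted")

def is_preempted() -> bool:
    req = urllib.request.Request(URL, headers={"Metadata-Flavor": "Google"})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode().strip() == "TRUE"

while not is_preempted():
    time.sleep(5)
print("Preempted - clean up and give up control within ~30 seconds")
```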


Storage Options

- Each instance comes with a small root persistent disk containing the OS
- Add additional storage options:
  - Persistent disks (Standard or SSD)
  - Local SSDs
  - Cloud Storage
Persistent Disks

- Durable network storage devices that instances can access like physical disks in a desktop or a server
- Compute Engine manages the physical disks and data distribution to ensure redundancy and optimize performance
- Encrypted (custom encryption possible)
- Built-in redundancy
- Restricted to the zone where the instance is located
- Two types: Standard and SSD
  - Standard persistent - regular hard disks - cheap - OK for sequential access
  - SSD persistent - expensive - fast for random access
Local SSD

- Physically attached to the server that hosts your virtual machine instance
- Higher throughput and lower latency than persistent disks; very high IOPS
- The data that you store on a local SSD persists only until you stop or delete the instance
- Small - each local SSD is 375 GB in size, but you can attach up to eight local SSD devices for 3 TB of total local SSD storage space per instance
- Unlike persistent disks, you must manage the striping on local SSDs yourself
- Encrypted; custom encryption not possible

Cloud Storage Buckets

- Use when latency and throughput are not a priority, and when you must share data easily between multiple instances or zones
- Flexible, scalable, durable
- ~Infinite size possible
- Performance depends on storage class: Multi-regional, Regional, Nearline, Coldline
Containers

Recall: with lots of dependencies and painful deployments, you create containers and manage clusters - "Google Container Engine". A container image is a lightweight, stand-alone, executable package of a piece of software that includes everything needed to run it: code, runtime, system tools, system libraries, settings. Containers virtualise the operating system, while VMs virtualise hardware; containers are more portable, quick to boot, and weigh tens of MBs rather than tens of GBs. (www.docker.com)
Advantages of Container Engine

- Componentization - microservices
- Portability
- Rapid deployment
- Orchestration - Kubernetes clusters
- Image registration - pull images from Container Registry
- Flexibility - mix-and-match with other cloud providers, on-premise
Storage options

- As with Compute Engine
- However, remember that container disks are ephemeral
- Need to use the gcePersistentDisk abstraction for persistent disks

Load balancing

- Network load balancing works out of the box with Container Engine
- For HTTP load balancing, need to integrate with Compute Engine load balancing
Container Cluster

[Diagram: a Kubernetes Master managing Node Instances #1-#3, each running a Kubelet and hosting Pods]
Container Cluster

- Group of Compute Engine instances running Kubernetes
- Consists of one or more node instances and a managed Kubernetes master endpoint

Node Instances

- Managed from the master
- Run the services necessary to support Docker containers
- Each node runs the Docker runtime and hosts a Kubelet agent, which manages the Docker containers scheduled on the host

Master Endpoint

- The managed master also runs the Kubernetes API server, which
  - services REST requests
  - schedules pod creation and deletion on worker nodes
  - synchronizes pod information (such as open ports and location)
Node Pool

- Subset of machines within a cluster that all have the same configuration
- Useful for customizing instance profiles in your cluster
- You can also run multiple Kubernetes node versions on each node pool in your cluster, update each node pool independently, and target different node pools for specific deployments
Container Builder

- Tool that executes your container image builds on Google Cloud Platform's infrastructure
- Working:
  - import source code from a variety of repositories or cloud storage spaces
  - execute a build to your specifications
  - produce artifacts such as Docker containers or Java archives
Container Registry

- Private registry for Docker images
- Can access Container Registry through secure HTTPS endpoints, which lets you push, pull, and manage images from any system, whether it's a Compute Engine instance or your own hardware
- Can use the Docker credential helper command-line tool to configure Docker to authenticate directly with Container Registry
- Can use third-party cluster management, continuous integration, or other solutions outside of Google Cloud Platform
Autoscaling

[Diagram: the cluster resizes - the Kubernetes Master removes a Node Instance when its pods fit elsewhere, and adds Node Instances when pods are waiting]
- Automatic resizing of clusters with Cluster Autoscaler
- Periodically checks whether there are any pods waiting, and resizes the cluster if needed
- Also monitors node usage and deletes a node if all of its pods can be scheduled elsewhere
Summary

- AppEngine, Compute Engine and Container Engine are GCP's 3 compute options
- Google AppEngine is the PaaS option - serverless and ops-free
- Google Compute Engine is the IaaS option - fully controllable down to the OS
- Google Container Engine lies in between - clusters of machines running Kubernetes and hosting containers
Module - Storage

Overview

- Block storage for compute VMs - persistent disks or SSDs
- Immutable blobs like video/images - Cloud Storage
- OLTP - Cloud SQL or Cloud Spanner
- NoSQL documents like HTML/XML - Datastore
- NoSQL key-values - BigTable (~HBase)
- Getting data into Cloud Storage - Transfer Service
Storage Options

Use-Cases - the traditional/Hadoop answers

- Storage for Compute, Block Storage: persistent (hard) disks, SSD
- Storing media, Blob Storage: file system - maybe HDFS
- SQL interface atop file data: Hive (SQL-like, but MapReduce on HDFS)
- Document database, NoSQL: CouchDB, MongoDB (key-value/indexed database)
- Fast scanning, NoSQL: HBase (columnar database)
- Transaction Processing (OLTP): RDBMS
- Analytics/Data Warehouse (OLAP): Hive (SQL-like, but MapReduce on HDFS)

Use-Cases - the GCP answers

- Storage for Compute, Block Storage: persistent (hard) disks, SSD
- Storing media, Blob Storage: Cloud Storage
- SQL interface atop file data: BigQuery
- Document database, NoSQL: DataStore
- Fast scanning, NoSQL: BigTable
- Transaction Processing (OLTP): Cloud SQL, Cloud Spanner
- Analytics/Data Warehouse (OLAP): BigQuery

Mobile-Specific Use-Cases

- Storage for Compute, Block Storage, along with mobile SDKs: Cloud Storage for Firebase
- Fast random access with mobile SDKs: Firebase Realtime DB

Recall: storage for Compute, Block Storage - persistent disks, SSD


Block Storage

- Data is not structured
- Lowest level of storage - no abstraction at all
- Meant for use from VMs
- Location tied to VM location
- Data stored in volumes (called blocks)
- Remember the options available on Compute Engine VMs:
  - Persistent disks (Standard or SSD)
  - Local SSDs
  - (Also Cloud Storage - more in a bit)
Recall: storing media, Blob Storage - Cloud Storage


Cloud Storage

- Create buckets to store data
- Buckets are globally unique: name (globally unique), location, storage class

Bucket Storage Classes

- Multi-regional - frequent access from anywhere in the world
- Regional - frequent access from a specific region
- Nearline - accessed once a month at max
- Coldline - accessed once a year at max

Recall: SQL interface atop file data, and Analytics/Data Warehouse (OLAP) - BigQuery


BigQuery

- Latency a bit higher than BigTable or DataStore - prefer those for low latency
- No ACID properties - can't use for transaction processing (OLTP)
- Great for analytics/business intelligence/data warehousing (OLAP)
- Recall that OLTP needs strict write consistency, OLAP does not
- Superficially similar in use-case to Hive: a SQL-like abstraction for non-relational data
- The underlying implementation is actually quite different from Hive though

A query sketch follows.
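A minimal query sketch with the BigQuery Python client, assuming google-cloud-bigquery is installed; the public Shakespeare sample dataset is used so the query runs in any project.

```python
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT word, SUM(word_count) AS total
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY word
    ORDER BY total DESC
    LIMIT 5
"""
# query() starts a job; iterating over it waits for and streams the results.
for row in client.query(query):
    print(row.word, row.total)
```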


Recall: Transaction Processing (OLTP) - Cloud SQL, Cloud Spanner


Cloud SQL, Cloud Spanner

- Relational databases - super-structured data, constraints etc
- ACID properties - use for transaction processing (OLTP)
- Too slow, and too many checks, for analytics/BI/warehousing (OLAP)
- Recall that OLTP needs strict write consistency, OLAP does not
- Cloud Spanner is Google proprietary, more advanced than Cloud SQL
- Cloud Spanner offers "horizontal scaling" - i.e. bigger data, more instances, replication etc
- Under the hood, Cloud Spanner has a surprising design - more later
Recall: document database, NoSQL - DataStore


DataStore

- Document data - e.g. XML or HTML - has a characteristic pattern: key-value structure, i.e. structured data
- Typically not used either for OLTP or OLAP
- Fast lookup on keys is the most common use-case
- Speciality of DataStore: query execution time depends on the size of the returned result, not the size of the data set
- So, a query returning 10 rows takes the same length of time whether the dataset is 10 rows or 10 billion rows
- Ideal for "needle-in-a-haystack" applications, i.e. lookups of non-sequential keys
- Indices are always fast to read, slow to write - so don't use DataStore for write-intensive data

A key-lookup sketch follows.
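A minimal sketch of the key-lookup pattern DataStore is built for, using the google-cloud-datastore Python client; the kind "Book" and key 1234 are placeholders.

```python
from google.cloud import datastore

client = datastore.Client()

# Write one entity under a known key.
key = client.key("Book", 1234)
entity = datastore.Entity(key=key)
entity.update({"title": "GCP Notes", "pages": 378})
client.put(entity)

# The "needle in a haystack" read: fetch directly by key, regardless of
# how many other entities exist.
print(client.get(key))
```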


Recall: fast scanning, NoSQL - BigTable


BigTable

- Fast scanning of sequential key values - use BigTable
- Columnar database, good for sparse data
- Sensitive to hot spotting - need to design the key structure carefully
- Similar to HBase
Cloud Storage
Recall: storing media, Blob Storage - Cloud Storage


Recall: create buckets to store data; each bucket has a globally unique name, a location, and a storage class (Multi-regional, Regional, Nearline, Coldline)


Bucket Storage Classes: Multi-Regional

- Frequently accessed ("hot" objects), such as serving website content, interactive workloads, or mobile and gaming applications
- Highest availability of the storage classes
- Geo-redundant - Cloud Storage stores your data redundantly in at least two regions separated by at least 100 miles within the multi-regional location of the bucket

Bucket Storage Classes: Regional

- Appropriate for storing data that is used by Compute Engine instances
- Better performance for data-intensive computations, as opposed to storing your data in a multi-regional location

Bucket Storage Classes: Nearline

- Slightly lower availability
- 30-day minimum storage duration
- Data you plan to read or modify on average once a month or less
- Data backup, disaster recovery, and archival storage

Bucket Storage Classes: Coldline

- Unlike other "cold" storage services, same throughput and latency (i.e. not slower to access)
- 90-day minimum storage duration, costs for data access, and higher per-operation costs
- Infrequently accessed data, such as data stored for legal or regulatory reasons

A bucket-creation sketch follows.
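A minimal sketch of creating a bucket with an explicit storage class via the Python client, assuming google-cloud-storage; the bucket name and location are placeholders.

```python
from google.cloud import storage

client = storage.Client()

bucket = client.bucket("example-archive-bucket")  # placeholder, must be globally unique
bucket.storage_class = "COLDLINE"                 # or MULTI_REGIONAL, REGIONAL, NEARLINE
client.create_bucket(bucket, location="us")

print(bucket.storage_class, bucket.location)
```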
Working with Cloud Storage

- XML and JSON APIs
- Command line (gsutil)
- GCP Console (web)
- Client SDK
Domain-Named Buckets

- Cloud Storage considers bucket names that contain dots to be domain names
- Must be syntactically valid DNS names (e.g. bucket…example.com is not valid because it contains three dots in a row)
- Must end with a currently-recognized top-level domain, such as .com
- Must pass domain ownership verification (e.g. the team member creating the bucket must be a domain owner or manager)

Domain Verification

- A number of ways to demonstrate ownership of a site or domain, including:
  - adding a special meta tag to the site's homepage
  - uploading a special HTML file to the site
  - verifying ownership directly from Search Console
  - adding a DNS TXT or CNAME record to the domain's DNS configuration


Cloud SQL
Recall: Transaction Processing (OLTP) - Cloud SQL, Cloud Spanner


Relational Data

Students:
StudentID | Student Name
1 | Jane Doe
2 | John Walsh
3 | Raymond Wu

Courses:
CourseID | Course Name
CS101 | Introduction to Computer Science
EE275 | Logic Circuits
CS183 | Computer Architecture

Grades:
StudentID | Student Name | CourseID | Term | Grade
1 | Jane Doe | CS101 | Fall 2015 | A-
2 | John Walsh | CS294 | Spring 2016 | B+
3 | Raymond Wu | ME101 | Winter 2015 | C+
1 | Jane Doe | CS183 | Summer 2012 | A+
Cloud SQL

- MySQL - fast and the usual
- PostgreSQL - complex queries

Instances

- Instances need to be created explicitly - not serverless
- Specify region while creating an instance
- First vs. second generation instances; second generation instances allow:
  - proxy support - no need to whitelist IP addresses or configure SSL
  - higher availability configuration
  - maintenance that won't take down the server

High Availability Configuration

- A Second Generation instance is in a high availability configuration when it has a failover replica
- The failover replica must be in a different zone than the original instance, also called the master
- All changes made to the data on the master, including to user tables, are replicated to the failover replica using semisynchronous replication
Cloud Proxy

- Provides secure access to your Cloud SQL Second Generation instances without having to whitelist IP addresses or configure SSL
- Secure connections: the proxy automatically encrypts traffic to and from the database; SSL certificates are used to verify client and server identities
- Easier connection management: the proxy handles authentication with Google Cloud SQL, removing the need to provide static IP addresses
Cloud Proxy

- The Cloud SQL Proxy works by having a local client, called the proxy, running in the local environment
- Your application communicates with the proxy using the standard database protocol used by your database
- The proxy uses a secure tunnel to communicate with its companion process running on the server
- When you start the proxy, you need to tell it:
  - what Cloud SQL instances it should establish connections to
  - where it will listen for data coming from your application to be sent to Cloud SQL
  - where it will find the credentials it will use to authenticate your application to Cloud SQL
- You can install the proxy anywhere in your local environment; the location of the proxy binaries does not impact where it listens for data from your application

A connection sketch follows.
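A hedged sketch of the application side, assuming the proxy is already running locally and listening on 127.0.0.1:3306 (the MySQL default) and that the pymysql package is installed; credentials and names are placeholders.

```python
import pymysql

# The app speaks the plain MySQL protocol to the local proxy; the proxy
# handles encryption and authentication with Cloud SQL.
conn = pymysql.connect(host="127.0.0.1", port=3306,
                       user="app_user", password="app_password",
                       database="app_db")
with conn.cursor() as cur:
    cur.execute("SELECT VERSION()")
    print(cur.fetchone())
conn.close()
```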
Cloud Spanner
Recall: Transaction Processing (OLTP) - Cloud SQL, Cloud Spanner


Cloud Spanner

Use when you need:
- high availability
- strong consistency
- transactional reads and writes (especially writes!)

Don't use if:
- data is not relational, or not even structured
- you want an open source RDBMS
- strong consistency and availability are overkill

Data Model

- Databases contain tables (as usual)
- Tables 'look' relational - rows, columns, strongly typed schemas
- But…
Relational Data (Students and Grades tables as above)

- Usually query student and course grades together
- Most common query = get transcript
- Specify a parent-child relationship for efficient storage
Interleaved Representation

1 | Jane Doe
    1 | Jane Doe | CS101 | Fall 2015 | A-
    1 | Jane Doe | CS183 | Summer 2012 | A+
2 | John Walsh
    2 | John Walsh | CS294 | Spring 2016 | B+
3 | Raymond Wu
    3 | Raymond Wu | ME101 | Winter 2015 | C+

Parent-Child

- Parent-child relationships between tables
- These cause physical co-location for fast access
- If you query Students and Grades together, make Grades a child of Students
- Data locality will be enforced between 2 independent tables!
- Every table must have primary keys
- To declare a table a child of another, prefix the parent's primary key onto the primary key of the child
- (This storage model resembles HBase, btw)

A DDL sketch of interleaving follows.
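A hedged sketch of declaring that relationship in Spanner DDL via the Python client, assuming google-cloud-spanner and an existing instance; instance and database IDs are placeholders. Note how the child's primary key is prefixed with the parent's.

```python
from google.cloud import spanner

client = spanner.Client()
instance = client.instance("example-instance")  # placeholder

database = instance.database("school", ddl_statements=[
    """CREATE TABLE Students (
           StudentId INT64 NOT NULL,
           StudentName STRING(MAX)
       ) PRIMARY KEY (StudentId)""",
    # Grades is interleaved in Students: its rows are stored physically
    # between the parent rows sharing the same StudentId prefix.
    """CREATE TABLE Grades (
           StudentId INT64 NOT NULL,
           CourseId STRING(16) NOT NULL,
           Term STRING(32),
           Grade STRING(4)
       ) PRIMARY KEY (StudentId, CourseId),
       INTERLEAVE IN PARENT Students ON DELETE CASCADE""",
])
database.create().result()  # create() returns a long-running operation
```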


Interleaving

- Rows are stored in sorted order of primary key values
- Child rows are inserted between parent rows with that key prefix - "interleaving"
- Fast sequential access - like HBase

Hotspotting

- As in HBase, need to choose the primary key carefully
- Do not use monotonically increasing values, else writes will hit the same locations - hot spotting
- Use a hash of the key value if your keys are naturally monotonically ordered (a sketch follows)
- Under the hood, Cloud Spanner divides data among servers across key ranges
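A small illustrative sketch of spreading monotonically increasing IDs by hashing, in the spirit of the X23/F23/B99 keys on the next slide; purely a toy, not a Spanner API.

```python
import hashlib

def spread_key(student_id: int) -> str:
    # Prefix the natural key with a short hash so that consecutive IDs land
    # in different key ranges (and therefore on different splits).
    prefix = hashlib.sha256(str(student_id).encode()).hexdigest()[:4]
    return f"{prefix}-{student_id}"

for sid in (1, 2, 3):
    print(spread_key(sid))  # e.g. 6b86-1, d473-2, 4e07-3
```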
Hotspotting - recall the interleaved representation above, with monotonically increasing StudentIDs (1, 2, 3)

Hotspotting

- Change the key so that it is not monotonically increasing
- Hash the StudentID values: 1, 2, 3 become e.g. X23, F23, B99
Interleaved Representation (hashed keys)

X23 | Jane Doe
    X23 | Jane Doe | CS101 | Fall 2015 | A-
    X23 | Jane Doe | CS183 | Summer 2012 | A+
F23 | John Walsh
    F23 | John Walsh | CS294 | Spring 2016 | B+
B99 | Raymond Wu
    B99 | Raymond Wu | ME101 | Winter 2015 | C+

Splits

- Parent-child relationships can get complicated - up to 7 layers deep!
- Cloud Spanner is distributed - it uses "splits"
- A split is a range of rows that can be moved around independently of others
- Splits are added to distribute high read-write data (to break up hotspots)
- Splits are, of course, influenced by parent-child relationships

Secondary Indices

- Like in HBase, key-based storage ensures fast sequential scans of keys
- Remember that tables must have primary keys
- Unlike in HBase, you can also add secondary indices
- Might cause the same data to be stored twice
- Fine-grained control over the use of indices:
  - force a query to use a specific index (index directives)
  - force a column to be copied into a secondary index (STORING clause)

Primary Index - the interleaved representation shown above, keyed on (hashed) StudentID

Relational Data (Courses and Grades tables as above)

- Usually query student and course grades together
- But also query courses and grades
- So, create a secondary index
Secondary Index - Interleaved Representation

CS101 | Introduction to Computer Science
    CS101 | Jane Doe | X23 | Fall 2015 | A-
    CS101 | John Walsh | F23 | Fall 2002 | B+
EE275 | Logic Circuits
    EE275 | John Walsh | F23 | Spring 2016 | A-
CS183 | Computer Architecture
    CS183 | Raymond Wu | B99 | Winter 2015 | C+

A DDL sketch for such an index follows.
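A hedged sketch of adding such a secondary index to the earlier placeholder database; the STORING clause copies Grade into the index so course-grade queries avoid a join back to the base table.

```python
from google.cloud import spanner

client = spanner.Client()
database = client.instance("example-instance").database("school")  # placeholders

# Schema changes go through update_ddl, a long-running operation.
op = database.update_ddl([
    "CREATE INDEX GradesByCourse ON Grades(CourseId) STORING (Grade)"
])
op.result()  # wait for the schema change to complete
```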


Data Types

- Remember that tables are strongly typed (schemas must have types)
- Non-normalized types such as ARRAY and STRUCT are available too
- STRUCTs are not OK in tables, but can be returned by queries (e.g. if a query returns an ARRAY of ARRAYs)
- ARRAYs are OK in tables, but ARRAYs of ARRAYs are not

Transactions

- Supports serialisability
- Cloud Spanner transaction support is super-strong, even stronger than traditional ACID
- Transactions commit in an order that is reflected in their commit timestamps
- These commit timestamps are "real time", so you can compare them to your watch
Transactions
- Two transaction modes
  - Locking read-write (slow)
  - Read-only (fast)
- If making a one-off read, use something known as a "Single Read Call"
  - Fastest - no transaction checks needed!
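The three options look roughly like this in the Python client (a sketch; the database IDs and table names are illustrative).

# Sketch only - illustrative IDs and names.
from google.cloud import spanner

database = spanner.Client().instance("test-instance").database("school-db")

# 1. Locking read-write transaction (slowest, full ACID).
def award_grade(transaction):
    transaction.execute_update(
        "UPDATE Grades SET Grade = 'A' "
        "WHERE StudentID = 'X23' AND CourseID = 'CS101'"
    )

database.run_in_transaction(award_grade)

# 2. Read-only transaction: no locks, one consistent snapshot shared by
#    several reads.
with database.snapshot(multi_use=True) as snapshot:
    students = list(snapshot.execute_sql("SELECT * FROM Students"))
    grades = list(snapshot.execute_sql("SELECT * FROM Grades"))

# 3. One-off read ("single read call") - cheapest, no transaction state.
with database.snapshot() as snapshot:
    results = snapshot.read(
        table="Students",
        columns=("StudentID", "StudentName"),
        keyset=spanner.KeySet(keys=[["X23"]]),
    )
    print(list(results))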


Staleness
- Can set timestamp bounds
  - Strong - "read the latest data"
  - Bounded Staleness - "read a version no later than ..."
  - Exact Staleness - "read at exactly ..." (could be in the past or future)
- Cloud Spanner has a version GC that reclaims versions older than one hour
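In the Python client these bounds are keyword arguments on read-only snapshots; the staleness values below are arbitrary examples.

# Sketch only - staleness values are arbitrary examples.
import datetime
from google.cloud import spanner

database = spanner.Client().instance("test-instance").database("school-db")

# Strong (the default): always read the latest data.
with database.snapshot() as snap:
    latest = list(snap.execute_sql("SELECT * FROM Students"))

# Exact staleness: read as of exactly 15 seconds ago.
with database.snapshot(
        exact_staleness=datetime.timedelta(seconds=15)) as snap:
    older = list(snap.execute_sql("SELECT * FROM Students"))

# Bounded staleness: any replica at most 10 seconds stale may answer.
with database.snapshot(
        max_staleness=datetime.timedelta(seconds=10)) as snap:
    recent = list(snap.execute_sql("SELECT * FROM Students"))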
Bigtable
Use-Cases

When you need                        | Use
Storage for Compute, Block Storage   | Persistent disks (hard disks), SSD
Storing media, Blob Storage          | Cloud Storage
SQL interface atop file data         | BigQuery
Document database, NoSQL             | Datastore
Fast scanning, NoSQL                 | BigTable
Transaction Processing (OLTP)        | Cloud SQL, Cloud Spanner
Analytics/Data Warehouse (OLAP)      | BigQuery




BigTable
- Fast scanning of sequential key values - use BigTable
- Columnar database, good for sparse data
- Sensitive to hotspotting - need to design the key structure carefully
- Similar to HBase

BigTable and HBase
- BigTable is basically GCP's managed HBase
- This is a much stronger link than between, say, Hive and BigQuery!
- Usual advantages of GCP:
  - scalability
  - low ops/admin burden
  - cluster resizing without downtime
  - many more column families before performance drops (~100 OK)


HBase vs. Relational Databases

Properties of HBase
- Columnar store
- Denormalized storage
- Only CRUD operations
- ACID at the row level


A Notification Service

Id | To    | Type  | Content
1  | mike  | offer | Offer on mobiles
2  | john  | sale  | Redmi sale
3  | jill  | order | Order delivered
4  | megan | sale  | Clothes sale

Layout of a traditional relational database




In this layout, a single value is addressed by its row and column - e.g. Row = 3, Column = To gives "jill".
Columnar Store

Row-oriented layout:
Id | To    | Type  | Content
1  | mike  | offer | Offer on mobiles
2  | john  | sale  | Redmi sale
3  | jill  | order | Order delivered
4  | megan | sale  | Clothes sale

Columnar layout (one row per cell):
Id | Column  | Value
1  | To      | mike
1  | Type    | offer
1  | Content | Offer on mobiles
2  | To      | john
2  | Type    | sale
2  | Content | Redmi sale
3  | To      | jill
3  | Type    | order
3  | Content | Order delivered
4  | To      | megan
4  | Type    | sale
4  | Content | Clothes sale




Advantages of a Columnar Store
- Sparse tables: no wastage of space when storing sparse data
- Dynamic attributes: update attributes dynamically without changing the storage structure
Sparse Tables

Id | To    | Type  | Content          | Expiry     | Order Status
1  | mike  | offer | Offer on mobiles | 2345689070 |
2  | john  | sale  | Redmi sale       |            |
3  | jill  | order | Order delivered  |            | Delivered
4  | megan | sale  | Clothes sale     | 2456123989 |

- Sale and offer notifications may have an expiry time
- Order-related notifications may have an order status
- In a traditional database this results in a change in database structure
- And empty cells when data is not applicable to certain rows - these cells still occupy space!


Id | Column  | Value
1  | To      | mike
1  | Type    | offer
1  | Content | Offer on mobiles
1  | Expiry  | 2345689070
2  | To      | john
2  | Type    | sale
2  | Content | Redmi sale
3  | To      | jill
3  | Type    | order
3  | Content | Order delivered
4  | To      | megan
4  | Type    | sale
4  | Content | Clothes sale
4  | Expiry  | 2456123989

Columnar store:
- Dynamically add new attributes as rows in this table
- No wastage of space with empty cells!

Note that this is not the exact layout of how data is stored in HBase - it is the general structure of how columnar stores are constructed.
Properties of HBase
- Columnar store
- Denormalized storage
- Only CRUD operations
- ACID at the row level


Denormalized storage: traditional databases use normalized database design to minimize redundancy.
Minimize Redundancy

- Employee Details
- Employee Subordinates
- Employee Address


Employee Details
Id | Name  | Function | Grade
1  | Emily | Finance  | 6

Employee Subordinates
Id | Subordinate Id
1  | 2
1  | 3

Employee Address
Id | City      | Zip Code
1  | Palo Alto | 94305
2  | Seattle   | 98101
Employee Details
Id | Name  | Function | Grade
1  | Emily | Finance  | 6
2  | John  | Finance  | 3
3  | Ben   | Finance  | 4

All employee details in one table


Employee Subordinates
Id | Subordinate Id
1  | 2
1  | 3

Employees referenced only by ids everywhere else
Employee Address
Id | City      | Zip Code
1  | Palo Alto | 94305
2  | Seattle   | 98101

Data is made more granular by splitting it across multiple tables
Id | Name  | Function | Grade
1  | Emily | Finance  | 6

Id | Subordinate Id
1  | 2
1  | 3

Id | City      | Zip Code
1  | Palo Alto | 94305
2  | Seattle   | 98101

Normalization
- Optimizes storage
- But storage is cheap in a distributed system!

Denormalized storage
- Optimizes the number of disk seeks instead
Denormalized Storage

Folding Employee Subordinates into Employee Details:

Id | Name  | Function | Grade | Subordinates
1  | Emily | Finance  | 6     | <ARRAY>
2  | John  | Finance  | 3     |
3  | Ben   | Finance  | 4     |
Denormalized Storage

Folding Employee Address in as well:

Id | Name  | Function | Grade | Subordinates | Address
1  | Emily | Finance  | 6     | <ARRAY>      | <STRUCT>
2  | John  | Finance  | 3     |              |
3  | Ben   | Finance  | 4     |              |
Denormalized Storage

Id | Name  | Function | Grade | Subordinates | Address
1  | Emily | Finance  | 6     | <ARRAY>      | <STRUCT>
2  | John  | Finance  | 3     |              |
3  | Ben   | Finance  | 4     |              |

- Store everything related to an employee in the same table
- Read a single record to get all details about an employee in one read operation
Properties of HBase
- Columnar store
- Denormalized storage
- Only CRUD operations
- ACID at the row level


Traditional Databases and SQL
- Joins: combining information across tables using keys
- Group By: grouping and aggregating data for the groups
- Order By: sorting rows by a certain column

Only CRUD operations
- HBase does not support SQL - it is a NoSQL database
- Only a limited set of operations is allowed in HBase
CRUD:
- Create
- Read
- Update
- Delete
- No operations involving multiple tables
- No indexes on tables
- No constraints
Id | Name | Function | Grade | Subordinates | Address

This is why all details need to be self-contained in one row.
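Since BigTable is essentially GCP's managed HBase, the CRUD-only surface can be sketched with the google-cloud-bigtable Python client; the project, instance, table and column-family names here are illustrative and assumed to already exist.

# Sketch only - project/instance/table/family names are illustrative.
from google.cloud import bigtable

client = bigtable.Client(project="my-project", admin=True)
table = client.instance("hr-instance").table("employees")

# Create / Update: both are just writes of cells into a row.
row = table.direct_row(b"emp#1")
row.set_cell("details", "name", b"Emily")
row.set_cell("details", "function", b"Finance")
row.commit()

# Read one row by key...
got = table.read_row(b"emp#1")
print(got.cells["details"][b"name"][0].value)

# ...or scan a contiguous key range. No joins, indexes or constraints.
for r in table.read_rows(start_key=b"emp#1", end_key=b"emp#9"):
    print(r.row_key)

# Delete the whole row.
row = table.direct_row(b"emp#1")
row.delete()
row.commit()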
Properties of HBase
- Columnar store
- Denormalized storage
- Only CRUD operations
- ACID at the row level


ACID at the row level
- Updates to a single row are atomic: all columns in a row are updated, or none are
- Updates to multiple rows are not atomic - even if the update is on the same column in multiple rows
Traditional RDBMS vs. HBase

Traditional RDBMS                                            | HBase
Data arranged in rows and columns                            | Data arranged in a column-wise manner
Supports SQL                                                 | NoSQL database
Complex queries such as grouping, aggregates, joins etc.     | Only basic operations such as create, read, update and delete
Normalized storage to minimize redundancy and optimize space | Denormalized storage to minimize disk seeks
ACID compliant                                               | ACID compliant at the row level


How Is Data Laid out in HBase?

Notification Data in a Traditional Database

Id | To    | Type  | Content
1  | mike  | offer | Offer on mobiles
2  | john  | sale  | Redmi sale
3  | jill  | order | Order delivered
4  | megan | sale  | Clothes sale

This is a 2-dimensional data model


Each value here is located by two coordinates: a unique row id and a column name.


4-dimensional Data Model

Row Key | Column Family | Column | Timestamp

A row consists of a Row Key plus one or more Column Families; each column family contains Columns; each column cell holds one (Timestamp, Value) pair per version.
A Table for Employee Data

         |       Work           |   Personal
EmpId    | Dept | Grade | Title | Name | SSN
23456987 |      |       | AVP   |      |
24490982 |      |       | VP    |      |

Each row is a record for a single employee; Work and Personal are column families.
Notification Data

Row-oriented layout:
Id | To    | Type  | Content
1  | mike  | offer | Offer on mobiles
2  | john  | sale  | Redmi sale
3  | jill  | order | Order delivered
4  | megan | sale  | Clothes sale

Columnar layout:
Id | Column  | Value
1  | To      | mike
1  | Type    | offer
1  | Content | Offer on mobiles
2  | To      | john
2  | Type    | sale
2  | Content | Redmi sale
3  | To      | jill
3  | Type    | order
3  | Content | Order delivered
4  | To      | megan
4  | Type    | sale
4  | Content | Clothes sale

Mapping this onto HBase's model:
- Id plays the role of the Row Key
- related columns are grouped into a Column Family
- To, Type and Content are the Columns
- each cell stores a Value together with a Timestamp


Data Layout in HBase

Row Key -> Column Families -> Columns -> (Timestamp, Value) versions
Row Key
- Uniquely identifies a row
- Can be primitives, structures, arrays
- Represented internally as a byte array
- Sorted in ascending order

Column Family
- All rows have the same set of column families
- Each column family is stored in a separate data file
- Set up at schema definition time
- Can have different columns for each row

Column
- Columns are units within a column family
- New columns can be added on the fly
- Referenced as ColumnFamily:ColumnName, e.g. Work:Department

Timestamp
- Used as the version number for the values stored in a column
- The value for any version can be accessed
Census Data Layout in HBase

        |            Personal             | Professional
Some ID | name | gender | marital_status | employed | field
Filtering Rows Based on Conditions

SQL vs. HBase Shell Commands

SQL                                   | HBase Shell
select * from census                  | scan 'census'
select name from census               | scan 'census', {COLUMNS => ['personal:name']}
select * from census limit 1          | scan 'census', {LIMIT => 1}
select * from census where rowkey = 1 | get 'census', 1

Filters allow you to control what data is returned from a scan operation.
Built-in Filters
- Conditions on row keys
- Conditions on columns
- Multiple conditions on columns
- Timestamp range
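The same four filter categories can be expressed with the row_filters module of the google-cloud-bigtable Python client; this is a sketch with illustrative patterns, not HBase shell syntax.

# Sketch only - illustrative patterns for each filter category.
import datetime
from google.cloud.bigtable import row_filters

# Condition on row keys (regex over the key bytes).
key_filter = row_filters.RowKeyRegexFilter(b"^3$")

# Condition on columns.
col_filter = row_filters.ColumnQualifierRegexFilter(b"name")

# Multiple conditions on columns, chained together (logical AND).
both = row_filters.RowFilterChain(filters=[key_filter, col_filter])

# Timestamp range (here: the last hour).
start = datetime.datetime.utcnow() - datetime.timedelta(hours=1)
ts_filter = row_filters.TimestampRangeFilter(
    row_filters.TimestampRange(start=start))

# Applied to a scan:  rows = table.read_rows(filter_=both)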
BigTable Performance

Avoid BigTable When
- Don't use if you need transaction support (OLTP) - use Cloud SQL or Cloud Spanner instead
- Don't use for data less than 1 TB (can't parallelize)
- Don't use for analytics/business intelligence/data warehousing - use BigQuery instead
- Don't use for documents or highly structured hierarchies - use Datastore instead
- Don't use for immutable blobs like movies each > 10 MB - use Cloud Storage instead
Use BigTable When
- Use for very fast scanning and high throughput
- Use for non-structured key/value data
- Where each data item is < 10 MB and total data is > 1 TB
- Use where writes are infrequent/unimportant (no ACID) but fast scans are crucial
- Use for Time Series data


Use BigTable for Time Series
- BigTable is a natural fit for timestamp data (range queries)
- Say an IoT sensor network emitting data at intervals
- Use "Device ID # Time" as the row key if the common query is "all data for a device over a period of time" (see the sketch below)
- Use "Time # Device ID" as the row key if the common query is "all data for a period for all devices"
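A sketch of the first key shape with the Python client; the key format, padding and names are illustrative choices.

# Sketch only - key format, padding and names are illustrative choices.
from google.cloud import bigtable

client = bigtable.Client(project="my-project", admin=True)
table = client.instance("iot-instance").table("sensor-readings")

def reading_key(device_id: str, epoch_seconds: int) -> bytes:
    # Zero-pad the time so lexicographic order == chronological order.
    return f"{device_id}#{epoch_seconds:012d}".encode()

row = table.direct_row(reading_key("device-42", 1700000000))
row.set_cell("metrics", "temp", b"21.5")
row.commit()

# "All data for a device over a period" becomes one contiguous scan.
for r in table.read_rows(
        start_key=reading_key("device-42", 1700000000),
        end_key=reading_key("device-42", 1700003600)):
    print(r.row_key)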
Hotspotting and Schema Design
- Like Cloud Spanner, data is stored in sorted lexicographic order of keys
- Data is distributed based on key values
- So performance will be really poor if reads/writes are concentrated in some ranges - for instance, if key values are sequential
- Use hashing of key values, or non-sequential keys


Avoiding Hotspotting
- Field Promotion: use keys in reverse URL order, like Java package names
  - This way keys have similar prefixes, but differing endings
- Salting
  - Hash the key value
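Both tricks in miniature; the bucket count and key shapes are illustrative assumptions.

# Sketch only - bucket count and key shapes are illustrative assumptions.
import hashlib

# Field promotion: reverse-URL ordering gives keys a shared prefix and a
# variable ending, e.g. b"com.example.sensors#device-42".
def promoted_key(hostname: str, device: str) -> bytes:
    reversed_host = ".".join(reversed(hostname.split(".")))
    return f"{reversed_host}#{device}".encode()

# Salting: prefix a hash-derived bucket so sequential ids fan out across
# splits. (Reads must now check every bucket.)
NUM_BUCKETS = 8

def salted_key(sequential_id: int) -> bytes:
    digest = hashlib.md5(str(sequential_id).encode()).hexdigest()
    bucket = int(digest, 16) % NUM_BUCKETS
    return f"{bucket:02d}#{sequential_id:012d}".encode()

print(promoted_key("sensors.example.com", "device-42"))
print(salted_key(12345))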


"Warming the Cache"
- BigTable will improve performance over time
- It will observe read and write patterns and redistribute data so that shards are evenly hit
- It will try to store roughly the same amount of data in different nodes
- This is why testing over hours is important to get a true sense of performance
SSD or HDD Disks
- Use SSD unless skimping on cost
- SSD can be 20x faster on individual row reads
- More predictable throughput too (no disk seek variance)
- Don't even think about HDD unless storing > 10 TB and all queries are batch
- The more random access, the stronger the case for SSD
Reasons for Poor Performance
- Poor schema design (e.g. sequential keys)
- Inappropriate workload
  - too small (< 300 GB)
  - used in short bursts (needs hours to tune performance internally)
- Cluster too small
- Cluster just fired up or scaled up
- HDD used instead of SSD
- Development vs. Production instance


Schema Design
- Each table has just one index - the row key. Choose it well
- Rows are sorted lexicographically by row key
- All operations are atomic at the row level
- Keep related entities in adjacent rows


Size Limits
- Row keys: 4 KB per key
- Column families: ~100 per table
- Column values: ~10 MB each
- Total row size: ~100 MB


Types of Row Keys (good choices)
- Reverse domain names
- String identifiers
- Timestamps as a suffix in the key

Row Keys to Avoid
- Domain names (not reversed)
- Sequential numeric values
- Timestamps alone
- Timestamps as a prefix of the row key
- Mutable or repeatedly updated values


Datastore

Use-Cases

When you need                        | Use
Storage for Compute, Block Storage   | Persistent disks (hard disks), SSD
Storing media, Blob Storage          | Cloud Storage
SQL interface atop file data         | BigQuery
Document database, NoSQL             | Datastore
Fast scanning, NoSQL                 | BigTable
Transaction Processing (OLTP)        | Cloud SQL, Cloud Spanner
Analytics/Data Warehouse (OLAP)      | BigQuery




DataStore
- Document data - e.g. XML or HTML - has a characteristic pattern
- Key-value structure, i.e. structured data
- Typically not used for either OLTP or OLAP
- Fast lookup on keys is the most common use-case


DataStore
- The speciality of DataStore: query execution time depends on the size of the returned result, not on the size of the data set
- So returning 10 rows will take the same length of time whether the dataset is 10 rows or 10 billion rows
- Ideal for "needle-in-a-haystack" applications, i.e. lookups of non-sequential keys
Traditional RDBMS vs. DataStore

Traditional RDBMS                                                   | DataStore
Atomic transactions                                                 | Atomic transactions
Indices for fast lookup                                             | Indices for fast lookup
Some queries use indices - not all                                  | All queries use indices!
Query time depends on both size of data set and size of result set  | Query time independent of data set size; depends on result set size alone
Traditional RDBMS vs. DataStore

Traditional RDBMS          | DataStore
Structured relational data | Structured hierarchical data (XML, HTML)
Rows stored in Tables      | Entities of different Kinds (think HTML tags)
Rows consist of fields     | Entities consist of Properties
Primary Keys for unique ID | Keys for unique ID


Traditional RDBMS vs. DataStore

Traditional RDBMS                                                       | DataStore
Rows of a table have the same properties (schema is strongly enforced)  | Entities of a kind can have different properties (think optional tags in HTML)
Types of all values in a column are the same                            | Properties with the same name can have different types on different entities
Traditional RDBMS vs. DataStore

Traditional RDBMS              | DataStore
Lots of joins                  | No joins
Filtering on subqueries        | No filtering on subqueries
Multiple inequality conditions | Only one inequality filter OK per query


Avoid DataStore When
- Don't use if you need very strong transaction support (OLTP) - it is OK for basic ACID support though
- Don't use for non-hierarchical or unstructured data - BigTable is better
- Don't use for analytics/business intelligence/data warehousing - use BigQuery instead
- Don't use for immutable blobs like movies each > 10 MB - use Cloud Storage instead
- Don't use if the application has lots of writes and updates on key columns
Use DataStore When
- Use for crazy scaling of read performance - to virtually any size
- Use for hierarchical documents with key/value data
Full Indexing
- "Built-in" indices on each property (~field) of each entity (~row) of each kind (~table)
- "Composite" indices on multiple property values
- If you are certain a property will never be queried, you can explicitly exclude it from indexing
- Each query is evaluated using its "perfect index"


Perfect Index
- Given a query, which index most optimally returns the query results?
- Depends on the following (in order):
  - equality filters
  - inequality filter (only 1 allowed)
  - sort conditions, if any specified


Implications of Full Indexing
- Updates are really slow
- No joins possible
- Can't filter results based on subquery results
- Can't include more than one inequality filter (one is OK) - see the sketch below
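A hedged sketch of a query shape Datastore can serve from a single perfect index - equality filters first, then the one allowed inequality, then the sort; the kind and property names are illustrative.

# Sketch only - kind and property names are illustrative.
from google.cloud import datastore

client = datastore.Client()

# An entity of kind "Notification"; properties may vary per entity.
key = client.key("Notification", 1)
entity = datastore.Entity(key=key)
entity.update({"to": "mike", "type": "offer", "priority": 3})
client.put(entity)

query = client.query(kind="Notification")
query.add_filter("type", "=", "offer")   # equality filters first
query.add_filter("priority", ">=", 2)    # only one inequality allowed
query.order = ["priority"]               # sort by the inequality property
results = list(query.fetch(limit=10))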


Multi-tenancy
- Separate data partitions for each client organization
- Can use the same schema for all clients, but vary the values
- Specified via a namespace (inside which kinds and entities can exist)
Transaction Support
- Can optionally use transactions - not required
- Not as strong as Cloud Spanner (which is ACID++), but stronger than BigQuery or BigTable
Consistency
- Two consistency levels possible for query results
  - Strongly consistent: returns up-to-date results, however long that takes
  - Eventually consistent: faster, but might return stale results
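A sketch tying the last three slides together - namespaces for multi-tenancy, an optional transaction, and the two consistency modes; the tenant, kind and property names are illustrative.

# Sketch only - tenant, kind and property names are illustrative.
from google.cloud import datastore

# Multi-tenancy: each tenant's kinds and entities live in a namespace.
client = datastore.Client(namespace="tenant-acme")

# Optional transaction (stronger than BigTable, weaker than Spanner).
with client.transaction():
    key = client.key("Counter", "signups")
    counter = client.get(key) or datastore.Entity(key=key)
    counter["count"] = counter.get("count", 0) + 1
    client.put(counter)

# Lookups by key are strongly consistent...
fresh = client.get(client.key("Counter", "signups"))

# ...while broad queries can trade freshness for speed.
maybe_stale = list(client.query(kind="Counter").fetch(eventual=True))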


Transfer Service

Importing Data
- The Transfer Service helps get data into Cloud Storage
- From where?
  - From AWS, i.e. an S3 bucket
  - From an HTTP/HTTPS location
  - From local files
  - From another Cloud Storage bucket


gsutil or Transfer Service?
- Recall that gsutil can be used to get data into Cloud Storage buckets
- Prefer the Transfer Service when transferring from AWS etc.
- If copying files over from on-premise, use gsutil
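For the on-premise case, a one-file upload looks like this with the google-cloud-storage Python client (the bucket name and paths are illustrative); gsutil cp does the same from the command line.

# Sketch only - bucket name and paths are illustrative.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-landing-bucket")

# Roughly equivalent to: gsutil cp ./sales.csv gs://my-landing-bucket/raw/
blob = bucket.blob("raw/sales.csv")
blob.upload_from_filename("./sales.csv")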
Transfer Service Bells & Whistles
- One-time vs. recurring transfers
- Delete objects from the destination if they don't exist in the source
- Delete objects from the source after copying them over
- Periodic synchronization of source and destination based on file filters


Summary

Block storage for compute VMs - persistent disks or SSDs
Immutable blobs like video/images - Cloud Storage
OLTP - Cloud SQL or Cloud Spanner
NoSQL documents like HTML/XML - Datastore
NoSQL key-values - BigTable (~HBase)
Getting data into Cloud Storage - Transfer Service
