Getting Started with Amazon Redshift

Maor Kleider, Sr. Product Manager, Amazon Redshift

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Agenda

• Introduction
• Benefits
• Use cases
• Getting started
• Q&A
What is Big Data?

When your data sets become so large and diverse that you have to start innovating around how to collect, store, process, analyze and share them.
It’s never been easier to generate vast amounts of data

• Data cycle: Generate, Collect & Store, Analyze, Collaborate & Act
• Individual AWS customers generate over a PB/day

Amazon S3 lets you collect and store all this data

• Data cycle: Generate, Collect & Store, Analyze, Collaborate & Act
• Individual AWS customers generate over a PB/day
• Collect & Store: store exabytes of data in S3

But how do you analyze it?

• Data cycle: Generate, Collect & Store, Analyze, Collaborate & Act
• Individual AWS customers generate over a PB/day
• Collect & Store: store exabytes of data in S3
• Analyze: highly constrained

The Dark Data Problem
Most generated data is unavailable for analysis
Chart: data volume by year (1990 to 2020), comparing generated data with data available for analysis
Sources:
Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011
IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
AWS Big Data Portfolio
Stages: Collect, Store, Analyze

Services shown:
• Amazon Kinesis Firehose
• AWS Direct Connect
• Amazon S3
• Amazon Glacier
• Amazon EMR
• Amazon EC2
• Amazon Kinesis Analytics
• Amazon Snowball
• Amazon DynamoDB
• Amazon RDS, Amazon Aurora
• Amazon Redshift
• Amazon Athena
• Amazon Kinesis Streams
• Amazon CloudSearch
• Amazon Elasticsearch
• Amazon QuickSight
• Amazon Machine Learning
• AWS Database Migration Service
• AWS Glue
Amazon Redshift

Fast, simple, petabyte-scale data warehousing for $1,000/TB/year

150+ features
• A lot faster
• A lot simpler
• A lot cheaper

Amazon Redshift
• Relational data warehouse
• Massively parallel; petabyte scale
• Fully managed
• HDD and SSD platforms
• $1,000/TB/year; starts at $0.25/hour
Selected Amazon Redshift customers
Use Case: Traditional Data Warehousing

• Business Reporting
• Advanced pipelines and queries
• Secure and Compliant
• Bulk Loads and Updates

Easy Migration – Point & Click using AWS Database Migration Service
Secure & Compliant – End-to-End Encryption. SOC 1/2/3, PCI-DSS, HIPAA and FedRAMP compliant
Large Ecosystem – Variety of cloud and on-premises BI and ETL tools

Customer examples:
• Japanese Mobile Phone Provider
• World’s Largest Children’s Book Publisher
• Powering 100 marketplaces in 50 countries
Use Case: Log Analysis

• Log & Machine IoT Data
• Clickstream Events Data
• Time-Series Data

Cheap – Analyze large volumes of data cost-effectively
Fast – Massively Parallel Processing (MPP) and columnar architecture for fast queries and parallel loads
Near real-time – Micro-batch loading and Amazon Kinesis Firehose for near-real-time analytics

Customer examples:
• Interactive data analysis and recommendation engine
• Ride analytics for pricing and product development
• Ad prediction and on-demand analytics
Use Case: Business Applications

• Multi-Tenant BI Applications
• Back-end services
• Analytics as a Service

Fully Managed – Provisioning, backups, upgrades, security, compression all come built-in so you can focus on your business applications
Ease of Chargeback – Pay as you go, add clusters as needed. A few big common clusters, several data marts
Service Oriented Architecture – Integrated with other AWS services. Easy to plug into your pipeline

Customer examples:
• Infosys Information Platform (IIP)
• Analytics-as-a-Service
• Product and Consumer Analytics
Amazon Redshift architecture
Leader node
• Simple SQL endpoint
• Stores metadata
• Optimizes query plan
• Coordinates query execution

Compute nodes
• Local columnar storage
• Parallel/distributed execution of all queries, loads, backups, restores, resizes

Start at just $0.25/hour, grow to 2 PB (compressed)
• DC1: SSD; scale from 160 GB to 326 TB
• DS2: HDD; scale from 2 TB to 2 PB

Diagram: BI tools, analytics tools, and SQL clients connect over JDBC/ODBC to the leader node; the leader node and three compute nodes are linked by 10 GigE (HPC); ingestion, backup, and restore connect the compute nodes to Amazon S3, Amazon EMR, Amazon DynamoDB, and SSH.

Benefit #1: Amazon Redshift is fast
Dramatically less I/O
• Column storage
• Data compression
• Zone maps
• Direct-attached storage
• Large data block sizes

Data compression example:

    analyze compression listing;

     Table   |     Column     | Encoding
    ---------+----------------+----------
     listing | listid         | delta
     listing | sellerid       | delta32k
     listing | eventid        | delta32k
     listing | dateid         | bytedict
     listing | numtickets     | bytedict
     listing | priceperticket | delta32k
     listing | totalprice     | mostly32
     listing | listtime       | raw

Zone maps example (each block of sorted values is tagged with its minimum and maximum):

    Block values                               Min   Max
    10 | 13 | 14 | 26 | … | 100 | 245 | 324     10   324
    375 | 393 | 417 | … | 512 | 549 | 623      375   623
    637 | 712 | 809 | … | 834 | 921 | 959      637   959
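The zone map idea above can be sketched in a few lines of Python. This is only an illustration of the min/max pruning principle, not Redshift's internal implementation; the `BlockZone` class and `blocks_to_scan` function are hypothetical names, and the three ranges are the ones shown in the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockZone:
    """Min/max metadata kept for one storage block (hypothetical structure)."""
    block_id: int
    min_value: int
    max_value: int


# Min/max pairs taken from the three blocks in the zone maps example.
ZONE_MAP = [
    BlockZone(0, 10, 324),
    BlockZone(1, 375, 623),
    BlockZone(2, 637, 959),
]


def blocks_to_scan(zone_map, low, high):
    """Return ids of blocks whose [min, max] range overlaps [low, high]."""
    return [
        zone.block_id
        for zone in zone_map
        if zone.max_value >= low and zone.min_value <= high
    ]


if __name__ == "__main__":
    # A filter such as "value BETWEEN 400 AND 600" only needs the middle block.
    print(blocks_to_scan(ZONE_MAP, 400, 600))
```

Blocks whose range cannot contain a matching value are skipped without being read, which is where the I/O saving comes from.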


Benefit #1: Amazon Redshift is fast

Parallel and distributed
• Query
• Load
• Export
• Backup
• Restore
• Resize
Benefit #1: Amazon Redshift is fast

Hardware optimized for I/O intensive workloads, 4 GB/sec/node

Enhanced networking, over 1 million packets/sec/node

Choice of storage type, instance size

Regular cadence of auto-patched improvements


Benefit #1: Amazon Redshift is fast

“Did I mention that it’s ridiculously fast? We’re using it to provide our analysts with an alternative to Hadoop”

“After investigating Redshift, Snowflake, and BigQuery, we found that Redshift offers top-of-the-line performance at best-in-market price points”

“On our previous big data warehouse system, it took around 45 minutes to run a query against a year of data, but that number went down to just 25 seconds using Amazon Redshift”

“…[Redshift] performance has blown away everyone here. We generally see 50-100X speedup over Hive”

“We regularly process multibillion row datasets and we do that in a matter of hours. We are heading to up to 10 times more data volumes in the next couple of years, easily

“We saw a 2X performance improvement on a wide variety of workloads. The more complex the queries, the higher the performance improvement”
And has gotten faster...

Fast: 5X query throughput improvement over the past year
• Memory allocation (launched)
• Improved commit and I/O logic (launched)
• Queue hopping (launched)
• Query monitoring rules (launched)

Efficient: 10X vacuuming performance improvement
• Ensures data is sorted for efficient and fast I/O
• Reclaims space from deleted rows
• Enhanced vacuum performance leads to better system throughput
The life of a query
Diagram: clients (BI tools, analytics tools, SQL clients) connect to the Amazon Redshift cluster. Inside the cluster, the leader node places queries in Queue 1 or Queue 2 ahead of three compute nodes. The stages are numbered 1 to 3.
Query monitoring rules
• Allows automatic handling of runaway (poorly written) queries

• Metrics with operators and values (e.g. query_cpu_time > 1000) create a predicate

• Multiple predicates can be AND-ed together to create a rule

• Multiple rules can be defined for a queue in WLM. These rules are OR-ed together

If { rule } then [action]

{ rule : metric operator value }, e.g. rows_scanned > 100000
• Metric : cpu_time, query_blocks_read, rows scanned, query execution time, cpu & io skew per slice, join_row_count, etc.
• Operator : <, >, ==
• Value : integer

[action] : hop, log, abort
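The rule structure described above (predicates AND-ed into a rule, rules on a queue evaluated independently) can be sketched as follows. This is a toy evaluator meant to show the logic, not how WLM is implemented; the class names, the example rules, and the sample metric values are all assumptions made for illustration.

```python
import operator
from dataclasses import dataclass

# Operators listed on the slide, mapped to Python comparison functions.
OPS = {"<": operator.lt, ">": operator.gt, "==": operator.eq}


@dataclass(frozen=True)
class Predicate:
    metric: str
    op: str
    value: int

    def matches(self, metrics):
        # A metric that was not reported never satisfies a predicate.
        observed = metrics.get(self.metric)
        return observed is not None and OPS[self.op](observed, self.value)


@dataclass(frozen=True)
class Rule:
    name: str
    predicates: tuple
    action: str  # one of: hop, log, abort

    def matches(self, metrics):
        # All predicates inside one rule must hold (AND).
        return all(p.matches(metrics) for p in self.predicates)


def triggered_actions(rules, metrics):
    """Each rule is checked on its own (OR across rules); return the matches."""
    return [(rule.name, rule.action) for rule in rules if rule.matches(metrics)]


# Hypothetical rules for one queue.
EXAMPLE_RULES = (
    Rule("big_scan",
         (Predicate("rows_scanned", ">", 100000),
          Predicate("query_cpu_time", ">", 1000)),
         "log"),
    Rule("runaway", (Predicate("query_execution_time", ">", 120),), "abort"),
)

if __name__ == "__main__":
    sample = {"rows_scanned": 250000, "query_cpu_time": 1500,
              "query_execution_time": 30}
    print(triggered_actions(EXAMPLE_RULES, sample))
```

With the sample metrics, only the first rule fires, because both of its predicates hold while the execution-time threshold of the second rule is not reached.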
Query monitoring rules
• Monitor and control cluster resources consumed by a query
• Get notified, abort and reprioritize long-running / bad queries
• Pre-defined templates for common use cases
Query monitoring rules
Common use cases:
• Protect interactive queues
INTERACTIVE = { “query_execution_time > 15 sec” or
                “query_cpu_time > 1500 uSec” or
                “query_blocks_read > 18000 blocks” } [HOP]

• Monitor ad-hoc queues for heavy queries
AD-HOC = { “query_execution_time > 120” or
           “query_cpu_time > 3000” or
           “query_blocks_read > 180000” or
           “memory_to_disk > 400000000000” } [LOG]

• Limit the number of rows returned to a client
MAXLINES = { “RETURN_ROWS > 50000” } [ABORT]
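The INTERACTIVE template joins its thresholds with "or", whereas predicates inside a single rule are AND-ed. One way to express it is therefore one single-predicate rule per threshold, all with the same action. The sketch below builds such a queue definition as JSON. The key names (`query_concurrency`, `rules`, `rule_name`, `predicate`, `metric_name`, `operator`, `value`, `action`) reflect my understanding of the WLM JSON configuration format and should be checked against the current Redshift documentation; the helper `or_rules` and the concurrency value are hypothetical.

```python
import json


def or_rules(prefix, thresholds, action):
    """Expand OR-ed thresholds into one single-predicate rule per metric."""
    return [
        {
            "rule_name": f"{prefix}_{metric}",
            "predicate": [
                {"metric_name": metric, "operator": ">", "value": value}
            ],
            "action": action,
        }
        for metric, value in thresholds.items()
    ]


# Thresholds copied from the INTERACTIVE template; units are as given there.
INTERACTIVE = or_rules(
    "interactive",
    {
        "query_execution_time": 15,
        "query_cpu_time": 1500,
        "query_blocks_read": 18000,
    },
    "hop",
)

# Assumed queue settings; the concurrency value is a placeholder.
QUEUE = {"query_concurrency": 5, "rules": INTERACTIVE}

if __name__ == "__main__":
    print(json.dumps([QUEUE], indent=2))
```

Keeping the thresholds in a plain dictionary makes it easy to review them against the template before the JSON is applied to a cluster parameter group.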
Benefit #2: Amazon Redshift is inexpensive

DS2 (HDD)             Price per hour for       Effective annual price
                      DS2.XL single node       per TB compressed
On-demand             $0.850                   $3,725
1 year reservation    $0.500                   $2,190
3 year reservation    $0.228                   $999

DC1 (SSD)             Price per hour for       Effective annual price
                      DC1.L single node        per TB compressed
On-demand             $0.250                   $13,690
1 year reservation    $0.161                   $8,795
3 year reservation    $0.100                   $5,500

Pricing is simple
• Number of nodes x price/hour
• No charge for leader node
• No upfront costs
• Pay as you go
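The "effective annual price per TB" column appears to follow from the hourly price, the number of hours in a year, and the storage per node. The sketch below reproduces that arithmetic, assuming 2 TB per DS2.XL node and 160 GB (0.16 TB) per DC1.L node, which are the smallest sizes quoted on the architecture slide. The function names are hypothetical, and some results differ slightly from the table, presumably because the published figures are rounded.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours


def annual_price_per_tb(price_per_hour, tb_per_node):
    """Hourly node price converted to a yearly price per TB of node storage."""
    return price_per_hour * HOURS_PER_YEAR / tb_per_node


def cluster_cost_per_hour(nodes, price_per_hour):
    """Number of nodes x price/hour; the leader node is not charged."""
    return nodes * price_per_hour


# Assumed storage per node, in TB.
DS2_XL_TB = 2.0
DC1_L_TB = 0.16

if __name__ == "__main__":
    for label, price, size in [
        ("DS2.XL on-demand", 0.850, DS2_XL_TB),
        ("DS2.XL 1 year", 0.500, DS2_XL_TB),
        ("DS2.XL 3 year", 0.228, DS2_XL_TB),
        ("DC1.L on-demand", 0.250, DC1_L_TB),
    ]:
        print(f"{label}: ${annual_price_per_tb(price, size):,.0f} per TB per year")
    print(f"4 x DS2.XL on-demand: ${cluster_cost_per_hour(4, 0.850):.2f} per hour")
```

The three-year DS2 figure is the source of the headline "$1,000/TB/year" number.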
Benefit #3: Amazon Redshift is fully managed

Continuous/incremental backups
• Multiple copies within cluster
• Continuous and incremental backups to Amazon S3
• Continuous and incremental backups across regions
• Streaming restore

Diagram: three compute nodes backing up to Amazon S3 in Region 1, with a copy to Amazon S3 in Region 2.
Benefit #3: Amazon Redshift is fully managed

Fault tolerance
• Disk failures
• Node failures
• Network failures
• Availability Zone/region level disasters

Diagram: three compute nodes backing up to Amazon S3 in Region 1, with a copy to Amazon S3 in Region 2.
Node fault tolerance
Diagram: a client connects to the leader node, which fronts three compute nodes watched by data-path monitoring agents. The slide sequence shows:
• Node-level monitoring can detect SW/HW issues and take action
• Failure is detected at one of the compute nodes
• Redshift parks the connections. Next, the node is replaced
• Queries are re-submitted
• Cluster-level monitoring agents add a monitoring layer for the leader node and network
Benefit #4: Security is built-in

• Load encrypted from S3
• SSL to secure data in transit
• ECDHE perfect forward secrecy
• Amazon VPC for network isolation
• Encryption to secure data at rest
• All blocks on disks and in S3 encrypted
• Block key, cluster key, master key (AES-256)
• On-premises HSM & AWS CloudHSM support
• Audit logging and AWS CloudTrail integration
• SOC 1/2/3, PCI-DSS, FedRAMP, BAA

Diagram: BI tools, analytics tools, and SQL clients in the customer VPC connect over JDBC/ODBC to the leader node in the internal VPC; the leader node and three compute nodes are linked by 10 GigE (HPC); ingestion, backup, and restore connect the compute nodes to Amazon S3, Amazon EMR, Amazon DynamoDB, and SSH.

Benefit #5: Amazon Redshift is powerful
• Approximate functions

• User defined functions

• Machine learning

• Data science
Benefit #6: Amazon Redshift has a large ecosystem

Data integration Business intelligence Systems integrators


Benefit #7: Service oriented architecture

Diagram: Amazon Redshift at the center, integrated with:
• EC2/SSH
• DynamoDB
• RDS/Aurora
• Amazon ML
• EMR
• CloudSearch
• Data Pipeline
• Amazon Mobile Analytics
• S3
• Amazon Kinesis
Amazon Redshift Spectrum
Run SQL queries directly against data in S3 using thousands of nodes

• Fast @ exabyte scale
• Elastic & highly available
• On-demand, pay-per-query
• High concurrency: Multiple clusters access same data
• No ETL: Query data in-place using open file formats
• Full Amazon Redshift SQL support
Life of a query

Query:
    SELECT COUNT(*)
    FROM S3.EXT_TABLE
    GROUP BY…

Diagram: the query arrives over JDBC/ODBC at Amazon Redshift, which hands work to Redshift Spectrum nodes 1, 2, 3, 4, ... N (fast @ exabyte scale). These read from Amazon S3 (exabyte-scale object storage) and use the Data Catalog (Apache Hive Metastore).
Amazon Redshift Spectrum – Current support

File formats
• Parquet
• CSV
• Sequence
• RCFile
• ORC (coming soon)
• RegExSerDe (coming soon)

Compression
• Gzip
• Snappy
• Lzo (coming soon)
• Bz2

Encryption
• SSE with AES256
• SSE KMS with default key

Column types
• Numeric: bigint, int, smallint, float, double and decimal
• Char/varchar/string
• Timestamp
• Boolean
• DATE type can be used only as a partitioning key

Table type
• Non-partitioned table (s3://mybucket/orders/..)
• Partitioned table (s3://mybucket/orders/date=YYYY-MM-DD/..)
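The partitioned layout above puts each day's files under its own `date=YYYY-MM-DD` prefix, so a date filter maps to a small, predictable set of S3 prefixes. The sketch below only illustrates that path convention using the example bucket from the slide; the function names are hypothetical, and it says nothing about how Redshift Spectrum prunes partitions internally.

```python
from datetime import date, timedelta

# Example table location taken from the slide.
BASE = "s3://mybucket/orders"


def partition_prefix(day):
    """Prefix holding the files for one date partition."""
    return f"{BASE}/date={day.isoformat()}/"


def prefixes_for_range(start, end):
    """Prefixes for every day from start to end, both inclusive."""
    days = (end - start).days
    return [partition_prefix(start + timedelta(days=i)) for i in range(days + 1)]


if __name__ == "__main__":
    for prefix in prefixes_for_range(date(2017, 1, 30), date(2017, 2, 1)):
        print(prefix)
```

A query restricted to three days would only need the three prefixes printed here, rather than every object under `s3://mybucket/orders/`.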
The Emerging Analytics Architecture

Storage
• Amazon S3: Exabyte-scale Object Storage
• AWS Glue Data Catalog: Hive-compatible Metastore

Serverless Compute
• Amazon Kinesis Firehose: Real-Time Data Streaming
• AWS Glue: ETL & Data Catalog
• Amazon Redshift Spectrum: Fast @ Exabyte scale
• AWS Lambda: Trigger-based Code Execution

Data Processing
• Amazon EMR: Managed Hadoop Applications
• Amazon Redshift: Petabyte-scale Data Warehousing
• Amazon Athena: Interactive Query
Over 20 customers helped preview Amazon Redshift Spectrum
Use cases
NTT Docomo: Japan’s largest mobile service provider

Background
• 68 million customers
• Tens of TBs per day of data across a mobile network
• 6 PB of total data (uncompressed)
• Data science for marketing operations, logistics, and so on
• Greenplum on-premises

Challenges
• Scaling challenges
• Performance issues
• Need same level of security
• Need for a hybrid environment
NTT Docomo: Japan’s largest mobile service provider

• 125 node DS2.8XL cluster
• 4,500 vCPUs, 30 TB RAM
• 2 PB compressed
• 10x faster analytic queries
• 50% reduction in time for new BI application deployment
• Significantly less operations overhead

Diagram components: Data Source, ET, AWS Direct Connect, Client, Forwarder, Loader, State Management, Sandbox, Amazon Redshift, S3.
Nasdaq: powering 100 marketplaces in 50 countries

Background
• Orders, quotes, trade executions, market “tick” data from 7 exchanges
• 7 billion rows/day
• Analyze market share, client activity, surveillance, billing, and so on
• Microsoft SQL Server on-premises

Challenges
• Expensive legacy DW ($1.16 M/yr.)
• Limited capacity (1 yr. of data online)
• Needed lower TCO
• Must satisfy multiple security and regulatory requirements
• Similar performance
Nasdaq: powering 100 marketplaces in 50 countries

• 23 node DS2.8XL cluster
• 828 vCPUs, 5 TB RAM
• 368 TB compressed
• 2.7 T rows, 900 B derived
• 8 tables with 100 B rows
• 7 man-month migration
• ¼ the cost, 2x storage, room to grow
• Faster performance, very secure
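The cluster totals quoted for NTT Docomo and Nasdaq line up with simple per-node multiplication. The sketch below assumes each DS2.8XL node provides 36 vCPUs, 244 GiB of RAM, and 16 TB of storage; those per-node figures are not stated on the slides and are my assumption, as are the names `NodeType` and `cluster_totals`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeType:
    name: str
    vcpus: int
    ram_gib: int
    storage_tb: int


# Assumed per-node figures for DS2.8XL; not stated on the slides.
DS2_8XL = NodeType("ds2.8xlarge", vcpus=36, ram_gib=244, storage_tb=16)


def cluster_totals(node, count):
    """Aggregate vCPUs, RAM (in TiB), and storage (in TB) for a cluster."""
    return {
        "vcpus": node.vcpus * count,
        "ram_tib": node.ram_gib * count / 1024,
        "storage_tb": node.storage_tb * count,
    }


if __name__ == "__main__":
    for customer, nodes in [("NTT Docomo", 125), ("Nasdaq", 23)]:
        totals = cluster_totals(DS2_8XL, nodes)
        print(f"{customer}: {nodes} nodes, {totals['vcpus']} vCPUs, "
              f"{totals['ram_tib']:.1f} TiB RAM, {totals['storage_tb']} TB storage")
```

Under these assumptions, 125 nodes give 4,500 vCPUs and 2,000 TB (2 PB), and 23 nodes give 828 vCPUs and 368 TB, matching the slides; the RAM totals come out close to the rounded 30 TB and 5 TB figures.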
Amazon.com clickstream analytics

Web log analysis for Amazon.com


• PBs workload, 2TB/day@67% YoY
• Largest table: 400 TB

Understand customer behavior

Previous solution
• Legacy DW (Oracle)—query across 1 week/hr
• Hadoop—query across 1 month/hr
Results with Amazon Redshift

• Query 15 months in 14 min
• Load 5B rows in 10 min
• 21B w/ 10B rows: 3 days to 2 hrs (Hive → Redshift)
• Load pipeline: 90 hrs to 8 hrs (Oracle → Redshift)

• 100 node DS2.8XL clusters
• Easy resizing
• Managed backups and restore
• Failure tolerance and recovery

• 20% time of one DBA
• Increased productivity
Resources

Detail Pages
• http://aws.amazon.com/redshift
• https://aws.amazon.com/marketplace/redshift/
• https://aws.amazon.com/redshift/developer-resources/
• Amazon Redshift Utilities - GitHub

Best Practices
• http://docs.aws.amazon.com/redshift/latest/dg/c_loading-data-best-practices.html
• http://docs.aws.amazon.com/redshift/latest/dg/c_designing-tables-best-practices.html
• http://docs.aws.amazon.com/redshift/latest/dg/c-optimizing-query-performance.html
Thank you!
