Modernize Your Analytics and Data Architecture
Sriram Kuravi
Partner Solutions Architect
© 2019, Amazon Web Services, Inc. or its Affiliates.
… but realizing value from data is challenging
• Unable to link data together
• Data collected too infrequently
• Data difficult to access
1 Set up storage
Visualization
Creating engaging visual and narrative journeys
Data Visualizer for analytical solutions
Dashboards
Reporting
Analytics
Data Warehousing: Redshift
Big Data Processing: EMR (Spark & Hadoop)
Serverless Data Processing: AWS Glue (Spark & Python)
Interactive Query: Athena
Operational Analytics: Elasticsearch Service
Real-Time Analytics: Kinesis Data Analytics
Data movement
Database Migration Service | Snowball | Snowmobile | Kinesis Data Firehose | Kinesis Data Streams | Managed Streaming for Apache Kafka
Most ways to move data to the data lake
Amazon Kinesis Data Streams: Collect and store data streams for analytics
Amazon Kinesis Data Firehose: Load data streams into AWS data stores
Amazon Kinesis Data Analytics: Analyze data streams with SQL or Java
Amazon Managed Streaming for Apache Kafka: Collect and store data streams for analytics
Amazon Kinesis Video Streams: Capture, process, and store media streams for playback and analytics
Producer writes to a partition
Consumer reads
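The producer and consumer labels above describe a partitioned stream. As a rough illustration only, here is a toy in-memory model. The `PartitionedStream` class, its method names, and the MD5-modulo routing are assumptions made for this sketch and are not the Kinesis or Kafka API.

```python
import hashlib
from collections import defaultdict


class PartitionedStream:
    """Toy in-memory stream (hypothetical, for illustration only)."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        # Each partition is an append-only list of (key, value) records.
        self.partitions = defaultdict(list)

    def partition_for(self, key):
        # Hash the key so the same key always lands in the same partition.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % self.num_partitions

    def put(self, key, value):
        """Producer side: append a record to the partition chosen by key."""
        partition = self.partition_for(key)
        self.partitions[partition].append((key, value))
        offset = len(self.partitions[partition]) - 1
        return partition, offset

    def read(self, partition, offset=0):
        """Consumer side: read records from one partition, starting at offset."""
        return self.partitions[partition][offset:]


stream = PartitionedStream(num_partitions=4)
p1, _ = stream.put("device-1", "temp=20")
p2, _ = stream.put("device-1", "temp=21")
records = stream.read(p1)
```

Because routing depends only on the key, records for one key keep their order within a partition, which is what lets a consumer read them back sequentially.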
* AWS DMS includes eight on-premises databases, one Azure database, five Amazon
RDS/Amazon Aurora database types, and Amazon Simple Storage Service (Amazon S3)
Challenge
2048-core SGI mainframe pain points:
• Complexity – required intricate job orchestration for
replication and distribution of processes
• Reliability issues – 12% of jobs timed out or failed
• Timeliness of analytics – 40 core-years for upgrades;
jobs could wait 2 weeks before execution began
Solution
Deploying a data lake on Amazon S3 with 100 TB of data,
using containers, serverless services, and Amazon EventBridge
for monitoring and usage reports
Benefit
Improved insight into their research, accuracy in
predicting costs for resource requirements, reduction in
time to science and cost
Processing and Querying In Place
Lambda Function
Before: 200 seconds and 11.2 cents

    # Download and process all keys
    for key in src_keys:
        response = s3_client.get_object(Bucket=src_bucket, Key=key)
        contents = response['Body'].read()
        for line in contents.split('\n')[:-1]:
            line_count += 1
            try:
                data = line.split(',')
                srcIp = data[0][:8]
                ….

After: 95 seconds and 2.8 cents

    # Select IP Address and Keys
    for key in src_keys:
        response = s3_client.select_object_content(Bucket=src_bucket, Key=key,
            expression="SELECT SUBSTR(obj._1, 1, 8), obj._2 FROM s3object as obj")
        contents = response['Body'].read()
        for line in contents:
            line_count += 1
            try:
                ….
2X faster at about 1/4 of the cost (11.2 cents down to 2.8 cents)
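To make the projection in the "After" query concrete, the sketch below applies the same column selection to a small CSV string locally. It is only a stand-in for what the SELECT expression returns; the function name and the sample rows are invented for illustration, and no S3 call is made.

```python
import csv
import io


def select_ip_prefix_and_second_column(csv_text):
    """Local stand-in for: SELECT SUBSTR(obj._1, 1, 8), obj._2 FROM s3object AS obj."""
    rows = []
    for record in csv.reader(io.StringIO(csv_text)):
        if len(record) < 2:
            continue  # skip blank or malformed lines
        # SUBSTR(col, 1, 8) is 1-based, so it matches the Python slice [:8],
        # the same prefix the "Before" code takes with data[0][:8].
        rows.append((record[0][:8], record[1]))
    return rows


# Hypothetical sample data: source IP, request, status code.
sample = "192.168.10.25,GET /index.html,200\n10.0.0.1,POST /login,302\n"
projected = select_ip_prefix_and_second_column(sample)
```

The point of querying in place is that only these two projected columns travel back to the function, instead of every byte of every object.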
Savings with Parquet: 87% less; 34x faster; 99% less data scanned; 99.7% cheaper
Refining algorithms
Other
• Flexible
• Powerful
• Unit Tests
• CI/CD
• Developer Tools …

• Brittle
• Error-Prone
• Laborious
• Sources Change
• Schemas Change
• Volume Changes
• EVERYTHING KEEPS CHANGING !!!
Serverless
First BI service built for the cloud with pay-per-session pricing & ML insights
• Auto-scale 10 to 10K+ users in minutes
• Create dashboards in minutes
• Secure, private access to AWS data
• Programmatically onboard users and manage content
• Pay-as-you-go
• Deploy globally without provisioning a single server
• Integrated S3 data lake permissions through AWS IAM
• Easily embed in your apps
Challenge
The UK Home Office needed to build a Cost Analytics service
that internal customers could use to consume reports on team
utilization of their shared Kubernetes infrastructure at the
pod level
Solution
Home Office implemented a custom-built Cost Analytics
solution using AWS Lambda, Amazon CloudWatch, Amazon
S3, AWS Glue, Amazon Athena, and Amazon QuickSight
Benefit
Reporting has driven behavioral changes for teams to reduce
costs by right-sizing the storage and compute, using
reserved instances, and scheduling. They are also working
on a Cost Efficiency Rating report that scores teams based
on various savings and efficiency techniques per service.
The solution is driving down costs for the Home Office and,
hence, for taxpayers.
Predictive insights with AWS ML & AI services
Broadest and most complete set of Machine Learning capabilities
AI SERVICES
VISION: Amazon Rekognition
SPEECH: Amazon Polly, Amazon Transcribe (+Medical)
TEXT: Amazon Comprehend (+Medical), Amazon Translate, Amazon Textract
SEARCH: Amazon Kendra
CHATBOTS: Amazon Lex
PERSONALIZATION: Amazon Personalize
FORECASTING: Amazon Forecast
FRAUD: Amazon Fraud Detector
DEVELOPMENT: Amazon CodeGuru
CONTACT CENTERS: Contact Lens for Amazon Connect
ML SERVICES
[Diagram: Amazon S3, AWS Glue Crawler, AWS Glue Data Catalog, Amazon Athena, Amazon EMR, Amazon Redshift Spectrum, QuickSight]
1 Crawlers scan your data sets and populate the Glue Data Catalog
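As a rough mental model of the crawler step, the sketch below scans small CSV data sets, infers a column type from sample values, and records the result in a dictionary standing in for a catalog. The `crawl` and `infer_type` functions, the type names, and the sample data are assumptions for illustration; this is not how AWS Glue classifiers are implemented.

```python
import csv
import io


def infer_type(values):
    """Pick the narrowest type that fits every sample value."""
    for caster, name in ((int, "bigint"), (float, "double")):
        try:
            for value in values:
                caster(value)
            return name
        except ValueError:
            continue
    return "string"


def crawl(datasets):
    """Build {table_name: [(column, type), ...]} from {table_name: csv_text}."""
    catalog = {}
    for table, text in datasets.items():
        rows = list(csv.reader(io.StringIO(text)))
        header, samples = rows[0], rows[1:]
        columns = []
        for index, column in enumerate(header):
            column_values = [row[index] for row in samples]
            columns.append((column, infer_type(column_values)))
        catalog[table] = columns
    return catalog


# Hypothetical data set keyed by table name.
catalog = crawl({"readings": "device_id,temp,site\n1,20.5,lab\n2,21,roof\n"})
```

Once the schema lives in one shared catalog, each query engine can read the same table definition rather than re-declaring it.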
Serverless analytics
Proof-of-concept estimation
AWS IoT: 10,000 devices, 8KB/device/hr, ~$35/month
AWS Glue Data Catalog: 30 partitions/month, $0/month
5,000 queries/month at $0.005/query
~$5/user/month
Data Lake
• Ingest = ~ $35
• Storage = ~ $23
• Query = ~ $25
• 5 BI users = ~ $25
• Total POC cost = ~ $108/month
• That’s $3.60/day
* This is a hypothetical example. Costs will vary based on actual workload.
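The proof-of-concept total can be reproduced from the per-item figures on the slide. The sketch below only restates that arithmetic; the variable names are made up here, and the 30-day month is an assumption used to arrive at the daily figure.

```python
# Monthly figures as listed on the slide (approximate, hypothetical workload).
ingest = 35                 # ingest, ~$35/month
storage = 23                # data lake storage, ~$23/month
queries_per_month = 5_000
price_per_query = 0.005     # dollars per query
bi_users = 5
price_per_bi_user = 5       # dollars per user per month

query_cost = queries_per_month * price_per_query
bi_cost = bi_users * price_per_bi_user

total_per_month = ingest + storage + query_cost + bi_cost
total_per_day = total_per_month / 30  # assumes a 30-day month

print(f"Query: ~${query_cost:.0f}, BI: ~${bi_cost:.0f}")
print(f"Total: ~${total_per_month:.0f}/month, ~${total_per_day:.2f}/day")
```

Swapping in your own device count, query volume, and user count gives a first-order estimate before any pricing calculator is involved.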
Serverless Analytics
Production workload estimation
20 x M4.Xlarge, 10 hr/day/month: ~ $1,900/month
AWS Glue Data Catalog
100,000 queries/month at $0.005 (1 GB)/query
~ $5/user/month
Data lake
ETL on-demand = ~ $1,000 (Spot discount)
>= 90% discount with Spot Instances
Potential problem:
1. Too many small files
2. Not necessarily optimized for analytics
[Diagram: Applications and Amazon Kinesis Firehose feeding an S3 Data Lake, consumed by Amazon Athena, Presto/Spark on EMR, and Amazon Redshift Data Warehouse for machine learning, data science, and reporting]
Hourly compactions to Parquet/ORC
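One way to picture an hourly compaction is as a planning step that groups the small objects written within each hour into a single batch for conversion to a columnar file. The sketch below covers only that grouping. The function name, the tuple layout, and the `compacted/...` output naming are hypothetical, and the actual Parquet/ORC conversion is left out.

```python
from collections import defaultdict
from datetime import datetime


def plan_hourly_compactions(objects):
    """Group (key, last_modified, size_bytes) tuples into one batch per hour."""
    batches = defaultdict(list)
    for key, last_modified, size_bytes in objects:
        # Truncate the timestamp to the start of its hour.
        hour = last_modified.replace(minute=0, second=0, microsecond=0)
        batches[hour].append((key, size_bytes))

    plan = []
    for hour in sorted(batches):
        keys = [key for key, _ in batches[hour]]
        total = sum(size for _, size in batches[hour])
        # Hypothetical output naming; the real layout is a design choice.
        target = hour.strftime("compacted/%Y/%m/%d/%H/part-0000.parquet")
        plan.append({"target": target, "inputs": keys, "bytes": total})
    return plan


# Hypothetical listing of small raw objects.
small_files = [
    ("raw/a.json", datetime(2019, 6, 1, 10, 5), 4_096),
    ("raw/b.json", datetime(2019, 6, 1, 10, 40), 8_192),
    ("raw/c.json", datetime(2019, 6, 1, 11, 2), 2_048),
]
plan = plan_hourly_compactions(small_files)
```

Fewer, larger columnar files address both problems listed above: less per-file overhead for query engines, and a format laid out for analytical scans.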
[Diagram: Databases and AWS DMS feeding an S3 Data Lake, consumed by Athena, Presto/Spark on EMR, and Amazon Redshift Data Warehouse for machine learning, data science, and reporting]
[Diagram: Databases and Kinesis Firehose feeding Tier 1 S3 Datalake: Raw Data, then Tier 2 S3 Datalake: Analytics, consumed by Athena, Presto/Spark on EMR, and Amazon Redshift Data Warehouse]
Potential problem:
Tier 1 raw data should have the least transformations
[Diagram: reference architecture]
Sources: Internet, Network xDRs, IoT, Social Media, Legacy Apps, Web / Logs, Internet of Things, OSS/BSS
Ingestion: Interfaces, AWS Direct Connect, AWS Database Migration, Amazon Kinesis
Data lake: Raw Data (Amazon S3), Hadoop (Amazon EMR), Staged Data (Data Lake) (Amazon S3)
Direct Query: Amazon Athena
Schemaless: Amazon ElasticSearch
Advanced Analytics: MLlib, Amazon EMR
Data Warehouse: Amazon Redshift
Stream Analysis: Amazon EMR, Amazon Kinesis, Amazon Machine Learning, Amazon S3, Amazon Athena
Event Capture: Amazon Kinesis
Near-Zero Latency Events: Amazon DynamoDB
Other labels: Semi/Unstructured, Amazon RDS, Speed (Real-Time)
Consumers: Data Analysts, Business Users, Engagement Platforms, Network Automation / Events, Machine Learning / Auditing
Additional resources
Whitepapers
Building Big Data Storage Solutions (Data Lakes) for Maximum Flexibility – https://d1.awsstatic.com/whitepapers/Storage/data-lake-on-aws.pdf
Big Data Analytics Options on AWS – http://d0.awsstatic.com/whitepapers/Big_Data_Analytics_Options_on_AWS.pdf
Thank You