Build and automate a modern serverless
data lake on AWS
Aditya Challa
AWS Solutions Architect
Amazon Web Services
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
A data lake is a system or repository of data stored in its natural/raw format, usually object blobs or files.
A data lake is usually a single store of all enterprise data including raw copies of source system data and
transformed data used for tasks such as reporting, visualization, advanced analytics and machine learning.
A data lake can include structured data from relational databases (rows and columns), semi-structured data
(CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video).
A data lake can be established "on premises" (within an organization's data centers) or "in the cloud" (using
cloud services from vendors such as Amazon Web Services).
-- Wikipedia
Serverless computing is a cloud computing execution model in which the cloud provider runs the server,
and dynamically manages the allocation of machine resources. Pricing is based on the actual amount of
resources consumed by an application, rather than on pre-purchased units of capacity. It can be a form of
utility computing.
-- Wikipedia
Typical steps of building a data lake
1. Set up storage
2. Move data
3. Cleanse, prep, and catalog data
4. Configure and enforce security and compliance policies
5. Make data available for analytics
Defining the AWS data lake
Data lakes provide:
• Relational and nonrelational data from sources such as OLTP, ERP, CRM, and LoB systems, devices, web, sensors, and social media
• Scale-out of data warehouse queries, big data processing, and interactive and real-time analytics
• A central catalog and a diverse set of analytics and machine learning tools (business intelligence, machine learning)
• Work on data without any data movement
• Designed for low-cost storage and analytics
Why use AWS for big data & analytics?
• Agility
• Scalability
• Broadest and deepest capabilities
• Low cost
• Get to insights faster
• Data migrations made easy
Data lake on AWS
• Central storage: Amazon S3 (scalable, secure, cost-effective)
• Catalog & search: AWS Glue, Amazon DynamoDB, Amazon Elasticsearch Service (Amazon ES)
• Access & user interfaces: AWS AppSync, Amazon API Gateway, Amazon Cognito
• Data ingestion: AWS Snowball, Amazon Kinesis Data Firehose, AWS Direct Connect, AWS Database Migration Service (AWS DMS), AWS Storage Gateway
• Analytics & serving: Amazon Athena, Amazon EMR, AWS Glue, Amazon Redshift, Amazon QuickSight, Amazon Kinesis, Amazon ES, Amazon Neptune, Amazon RDS, DynamoDB
• Manage & secure: AWS IAM, AWS KMS, AWS CloudTrail, Amazon CloudWatch
Modern serverless data lake components
Amazon S3, AWS Glue, AWS Lambda, Amazon CloudWatch Events
Amazon S3 is the best place for data lakes
• Unmatched durability, availability, and scalability
• Best security, compliance, and audit capabilities
• Object-level controls
• Business insights into your data
• Most ways to bring data in
Ingest methods: Rapidly ingest all data sources
• IoT, sensor data, clickstream data, social media feeds, streaming logs
• Oracle, MySQL, MongoDB, DB2, SQL Server, Amazon RDS
• On-premises ERP, mainframes, lab equipment, NAS storage
• Offline sensor data, NAS, on-premises Hadoop
• On-premises data lakes, EDW, large-scale data collection
A data lake needs to accommodate a wide variety of concurrent data sources.
AWS Transfer for SFTP
Fully managed service enabling transfer of data over SFTP, with the data stored in Amazon S3
• Seamless migration of existing workflows
• Fully managed in AWS
• Secure and compliant
• Native integration with AWS services
• Cost-effective
• Simple to use
AWS DataSync
Transfer service that simplifies, automates, and accelerates data movement
• Transfers up to 10 Gbps per agent
• Simple data movement to Amazon S3 or Amazon EFS
• Secure and reliable transfers
• AWS integrated
• Pay as you go
Combines the speed and reliability of network acceleration software with the cost-effectiveness of open-source tools.
Use cases: migrate active application data to AWS, transfer data for timely in-cloud analysis, and replicate data to AWS for business continuity.
Choosing the right data formats
There is no such thing as the “best” data format
• All involve tradeoffs, depending on workload & tools
• CSV, TSV, JSON are easy but not efficient
• Compress & store or archive as raw input
• Columnar compressed formats are generally preferred
  • Parquet or ORC
  • Smaller storage footprint = lower cost
  • More efficient scan & query
• Row-oriented (Avro) is good for full data scans
• Organize into partitions
• Coalescing to larger partitions over time
Key considerations are cost, performance, and support
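To make the columnar-format guidance concrete, here is a minimal PySpark sketch (the bucket paths and the partition column are hypothetical, not from the deck) that rewrites raw CSV as partitioned, Snappy-compressed Parquet:

```python
# Minimal PySpark sketch: convert raw CSV to partitioned, Snappy-compressed Parquet.
# Bucket names and the partition column ("event_date") are hypothetical examples.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Read the raw CSV input, letting Spark infer a schema from the data
raw = (spark.read
       .option("header", "true")
       .option("inferSchema", "true")
       .csv("s3://example-raw-bucket/clickstream/"))

# Write columnar Parquet, partitioned by date so queries can prune partitions
(raw.write
 .mode("overwrite")
 .partitionBy("event_date")
 .option("compression", "snappy")
 .parquet("s3://example-curated-bucket/clickstream_parquet/"))
```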
Serverless ETL using AWS Glue
Data prep is ~80% of data lake work
• Building training sets
• Cleaning and organizing data
• Collecting datasets
• Mining data for patterns
• Refining algorithms
• Other
Set up a catalog, ETL, and data prep
with AWS Glue
• Serverless provisioning, configuration, and scaling to run your ETL jobs on Apache Spark
• Pay only for the resources used for jobs
• Crawl your data sources, identify data formats, and suggest schemas and transformations
• Automates the effort of building, maintaining, and running ETL jobs
AWS Glue In Action
AWS Glue: Components
Data Catalog
§ Hive metastore compatible with enhanced functionality
§ Crawlers automatically extract metadata and create tables
§ Integrated with Athena and Amazon Redshift Spectrum
Job Authoring
§ Auto-generates ETL code
§ Builds on open frameworks—Python and Spark
§ Developer-centric—editing, debugging, sharing
Job Execution
§ Runs jobs on a serverless Spark platform
§ Provides flexible scheduling
§ Handles dependency resolution, monitoring, and alerting
AWS Glue Data Catalog
Manage table metadata through a Hive metastore API or Hive SQL. Supported by tools like Hive, Presto, Spark, etc.
We added a few extensions:
§ Search over metadata for data discovery
§ Connection info—JDBC URLs, credentials
§ Classification for identifying and parsing files
§ Versioning of table metadata as schemas evolve and other metadata are updated
Populate using Hive DDL, bulk import, or automatically through crawlers.
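As a small illustration, table metadata in the Data Catalog can also be read programmatically; a minimal boto3 sketch, with a hypothetical database and table name:

```python
# Minimal boto3 sketch: read table metadata from the AWS Glue Data Catalog.
# "clickstream_db" and "events" are hypothetical names, not from the deck.
import boto3

glue = boto3.client("glue")

table = glue.get_table(DatabaseName="clickstream_db", Name="events")["Table"]

# Print column names/types and any partition keys recorded for the table
for col in table["StorageDescriptor"]["Columns"]:
    print(col["Name"], col["Type"])
for key in table.get("PartitionKeys", []):
    print("partition key:", key["Name"], key["Type"])
```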
AWS Glue Data Catalog: Crawlers
Crawlers automatically build your Data Catalog and keep it in sync
§ Automatically discover new data, extract schema definitions
§ Detect schema changes and version tables
§ Detect Hive-style partitions on Amazon S3
§ Built-in classifiers for popular types; custom classifiers using Grok expressions
§ Run ad hoc or on a schedule; serverless—only pay when the crawler runs
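A minimal boto3 sketch of defining and running such a crawler (the role ARN, S3 path, database name, and schedule are hypothetical examples):

```python
# Minimal boto3 sketch: create a crawler over an S3 prefix and run it.
# The IAM role, S3 path, database, and schedule are hypothetical examples.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="raw-clickstream-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="clickstream_db",
    Targets={"S3Targets": [{"Path": "s3://example-raw-bucket/clickstream/"}]},
    Schedule="cron(0 2 * * ? *)",        # optional: run nightly at 02:00 UTC
    SchemaChangePolicy={                  # version tables when schemas change
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "LOG",
    },
)

# Crawlers can also be run ad hoc; you pay only while the crawler runs
glue.start_crawler(Name="raw-clickstream-crawler")
```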
Data Catalog: Detecting partitions
An S3 bucket hierarchy with Hive-style prefixes (e.g., month=Nov/date=10 … date=15, each holding file 1 … file N) maps to a table definition:

Column | Type
month  | str
date   | str
col 1  | int
col 2  | float

The crawler estimates schema similarity among files at each level (e.g., sim=.93, .95, .99) to handle semi-structured logs, schema evolution, etc.
Data Catalog: Table details
The table detail view shows table properties, the table schema, nested fields, and data statistics.
Job authoring in AWS Glue
You have choices on how to get started:
• Python code generated by AWS Glue
• Connect a notebook or IDE to AWS Glue
• Existing code brought into AWS Glue
Job authoring: Automatic code generation
1. Customize the mappings
2. AWS Glue generates transformation graph and Python code
3. Connect your notebook to development endpoints to customize your code
Job authoring: ETL code
§ Human-readable, editable, and portable PySpark code
§ Flexible: AWS Glue’s ETL library simplifies manipulating complex, semi-structured data
§ Customizable: Use native PySpark, import custom libraries, and/or leverage AWS Glue’s libraries
§ Collaborative: Share code snippets via GitHub, reuse code across jobs
Job authoring: AWS Glue Dynamic Frames
Like Spark’s DataFrames, but better for:
• Cleaning and (re)structuring semi-structured data sets, e.g., JSON, Avro, Apache logs
No upfront schema needed:
• Infers schema on the fly, enabling transformations in a single pass
Easy to handle the unexpected:
• Tracks new fields and inconsistent, changing data types with choices, e.g., integer or string
• Automatically marks and separates error records
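A minimal AWS Glue PySpark sketch in the spirit of the generated ETL code: it reads a table from the Data Catalog as a DynamicFrame, resolves an ambiguously typed column with a choice, and writes Parquet to S3. The database, table, column, and bucket names are hypothetical:

```python
# Minimal AWS Glue PySpark sketch using DynamicFrames; names are hypothetical.
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read straight from the Data Catalog; the schema is resolved on the fly
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="clickstream_db", table_name="events"
)

# "user_id" arrived as both int and string in different files; cast it to string
dyf = dyf.resolveChoice(specs=[("user_id", "cast:string")])

# Write curated, columnar output back to S3
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://example-curated-bucket/events/"},
    format="parquet",
)
job.commit()
```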
Job authoring: Leveraging the community
No need to start from scratch.
Use AWS Glue samples stored in GitHub to share, reuse,
contribute: https://github.com/awslabs/aws-glue-samples
• Migration scripts to import existing Hive metastore
data into AWS Glue Data Catalog
• Examples of how to use Dynamic Frames and
Relationalize() transform
• Examples of how to use arbitrary PySpark code with
AWS Glue’s Python ETL library
Download AWS Glue’s Python ETL library to start
developing code in your IDE:
https://github.com/awslabs/aws-glue-libs
Job execution: Scheduling and monitoring
Compose jobs globally with event-based dependencies
§ Easy to reuse and leverage work across organization boundaries
Multiple triggering mechanisms
§ Schedule-based: e.g., time of day
§ Event-based: e.g., job completion
§ On-demand: e.g., Lambda trigger
§ More: Amazon S3 notifications and Amazon CloudWatch Events
Logs and alerts are available in CloudWatch
(Example from the diagram: an event-based Lambda trigger starts the "Marketing: Ad spend by customer segment" job, a weekly schedule starts the "Sales: Revenue by customer segment" job, and data-based dependencies then start the "Central: ROI by customer segment" job.)
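A minimal boto3 sketch of an event-based (conditional) trigger that chains two hypothetical jobs, so the downstream job starts only when the upstream job succeeds:

```python
# Minimal boto3 sketch: chain two hypothetical jobs with a conditional trigger,
# so "aggregate-roi" starts only after "load-sales" succeeds.
import boto3

glue = boto3.client("glue")

glue.create_trigger(
    Name="start-roi-after-sales",
    Type="CONDITIONAL",
    StartOnCreation=True,
    Predicate={
        "Conditions": [{
            "LogicalOperator": "EQUALS",
            "JobName": "load-sales",
            "State": "SUCCEEDED",
        }]
    },
    Actions=[{"JobName": "aggregate-roi"}],
)
```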
Job execution: Serverless
There is no need to provision, configure, or manage servers
§ Auto-configures VPC and role-based access
§ Customers can specify the capacity that gets allocated to each job
§ Automatically scales resources (on the post-GA roadmap)
§ You pay only for the resources you consume, while you consume them
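A minimal boto3 sketch of starting a hypothetical job and specifying the capacity allocated to that run:

```python
# Minimal boto3 sketch: start a hypothetical Glue job and specify its capacity.
import boto3

glue = boto3.client("glue")

run = glue.start_job_run(
    JobName="events-to-parquet",
    MaxCapacity=10.0,               # DPUs for this run; billed only while it runs
    Arguments={"--stage": "prod"},  # hypothetical job argument
)
print("started run:", run["JobRunId"])
```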
Common customer use cases
• Log aggregation with AWS Glue ETL
• Real-time data collection with AWS Glue ETL
• Data import using AWS Glue database connectors
Each of these patterns is built around the AWS Glue Data Catalog.
Serverless processing using
Lambda
Benefits of Lambda
Productivity-focused compute platform to build powerful, dynamic, modular applications in the cloud
1. No infrastructure to manage: focus on business logic
2. Cost-effective and efficient: pay only for what you use
3. Bring your own code: run code in standard languages
Application components for serverless apps
EVENT SOURCE → FUNCTION → SERVICES (ANYTHING)
• Event sources: changes in data state, requests to endpoints, changes in resource state
• Function runtimes: Node, Python, Java ... more coming soon
Event sources that integrate with Lambda
• Data stores: Amazon S3, DynamoDB, Kinesis, Amazon Cognito, Amazon RDS, Amazon Aurora (new)
• Endpoints: Amazon API Gateway, Alexa, AWS IoT
• Repositories: AWS CloudTrail, CloudWatch, AWS CloudFormation
• Event/message services: Amazon SNS, Amazon SES, cron events
• Orchestration and state management: AWS Step Functions
… and the list will continue to grow!
Lambda use case for streaming data ingestion
Record producers (e.g., the Amazon Kinesis Agent) send raw records to an Amazon Kinesis Data Firehose delivery stream. A Lambda function performs transformations and enrichment, using DynamoDB lookup tables, and returns transformed records. Firehose buffers the output into Amazon S3 files and loads Amazon Redshift tables and Amazon ES domains, while backing up source records to Amazon S3 and publishing delivery metrics to CloudWatch.
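A minimal sketch of the transformation-and-enrichment Lambda in this pattern, following the Kinesis Data Firehose record-transformation contract; the DynamoDB lookup table and record fields are hypothetical:

```python
# Minimal Lambda sketch for Kinesis Data Firehose record transformation.
# The DynamoDB lookup table and record fields are hypothetical examples.
import base64
import json

import boto3

dynamodb = boto3.resource("dynamodb")
lookup_table = dynamodb.Table("device-metadata")  # hypothetical lookup table


def handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))

        # Enrich the raw record with an attribute from the lookup table
        item = lookup_table.get_item(Key={"device_id": payload.get("device_id", "")})
        payload["site"] = item.get("Item", {}).get("site", "unknown")

        output.append({
            "recordId": record["recordId"],
            "result": "Ok",   # "Dropped" or "ProcessingFailed" are also valid
            "data": base64.b64encode(json.dumps(payload).encode()).decode(),
        })
    return {"records": output}
```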
Amazon Kinesis Streams and Lambda
Streaming source → Amazon Kinesis stream → Lambda processor function → other AWS services
• The number of Amazon Kinesis stream shards corresponds to the number of concurrent invocations of the Lambda function
• Batch size sets the maximum number of records per Lambda function invocation
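A minimal sketch of the processor function; each invocation receives a batch of base64-encoded records from one shard:

```python
# Minimal Lambda sketch for processing a batch of Kinesis stream records.
# Each invocation receives up to "batch size" records from a single shard.
import base64
import json


def handler(event, context):
    for record in event["Records"]:
        # Kinesis record payloads arrive base64-encoded
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        # ... process or forward the record to other AWS services ...
        print(record["kinesis"]["partitionKey"], payload)
```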
Serverless data lake architecture
(Architecture diagram; automation is driven by CloudWatch Events rules.)
Steps in building a serverless data lake
1. Ingest data into Amazon S3
2. Configure an Amazon S3 event trigger
3. Automate the Data Catalog with an AWS Glue crawler
4. Author ETL jobs
5. Automate ETL job execution
6. Monitor with CloudWatch Events
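A minimal sketch of the automation between steps 2 and 3: a Lambda function wired to the S3 event trigger that starts a Glue crawler (the crawler name is a hypothetical example) so the Data Catalog stays in sync as new objects arrive:

```python
# Minimal Lambda sketch: on an S3 object-created event, start the Glue crawler
# that keeps the Data Catalog in sync. The crawler name is a hypothetical example.
import boto3

glue = boto3.client("glue")


def handler(event, context):
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        print(f"new object s3://{bucket}/{key}; starting crawler")

    try:
        glue.start_crawler(Name="raw-clickstream-crawler")
    except glue.exceptions.CrawlerRunningException:
        # The crawler is already running; a later event will pick up new files
        pass
```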
Serverless data lake blog post reference
https://aws.amazon.com/blogs/big-data/build-and-automate-a-serverless-data-lake-using-an-aws-glue-trigger-for-the-data-catalog-and-etl-jobs/
Data lakes and analytics
More than 10,000 data lakes on AWS
AWS Partners
Thank you!
Aditya Challa
aditchal@amazon.com
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.