Build and automate a modern serverless
data lake on AWS
Aditya Challa
AWS Solutions Architect
Amazon Web Services
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
A data lake is a system or repository of data stored in its natural/raw format, usually object blobs or files.
A data lake is usually a single store of all enterprise data including raw copies of source system data and
transformed data used for tasks such as reporting, visualization, advanced analytics and machine learning.
A data lake can include structured data from relational databases (rows and columns), semi-structured data
(CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video).
A data lake can be established "on premises" (within an organization's data centers) or "in the cloud" (using
cloud services from vendors such as Amazon Web Services).
-- Wikipedia
Serverless computing is a cloud computing execution model in which the cloud provider runs the server,
and dynamically manages the allocation of machine resources. Pricing is based on the actual amount of
resources consumed by an application, rather than on pre-purchased units of capacity. It can be a form of
utility computing.
-- Wikipedia
Typical steps of building a data lake
1. Set up storage
2. Move data
3. Cleanse, prep, and catalog data
4. Configure and enforce security and compliance policies
5. Make data available for analytics
Defining the AWS data lake
Data lakes provide:
• Relational and nonrelational data from sources such as OLTP, ERP, CRM, and LoB systems, devices, web, sensors, and social media
• Scale-out of data warehouse queries, big data processing, and interactive and real-time analytics
• A central catalog and a diverse set of analytics and machine learning tools (business intelligence, machine learning)
• Work on data without any data movement
• Designed for low-cost storage and analytics
Why use AWS for big data & analytics?
• Agility
• Scalability
• Broadest and deepest capabilities
• Low cost
• Get to insights faster
• Data migrations made easy
Data lake on AWS
• Central storage: Amazon S3 (scalable, secure, cost-effective)
• Catalog & search: AWS Glue, Amazon DynamoDB, Amazon Elasticsearch Service (Amazon ES)
• Access & user interfaces: AWS AppSync, Amazon API Gateway, Amazon Cognito
• Data ingestion: AWS Snowball, Amazon Kinesis Data Firehose, AWS Direct Connect, AWS Database Migration Service (AWS DMS), AWS Storage Gateway
• Analytics & serving: Amazon Athena, Amazon EMR, AWS Glue, Amazon Redshift, Amazon QuickSight, Amazon Kinesis, Amazon ES, Amazon Neptune, Amazon RDS, DynamoDB
• Manage & secure: AWS IAM, AWS KMS, AWS CloudTrail, Amazon CloudWatch
Modern serverless data lake components
Amazon S3, AWS Glue, AWS Lambda, Amazon CloudWatch Events
Amazon S3 is the best place for data lakes
• Unmatched durability, availability, and scalability
• Best security, compliance, and audit capabilities
• Object-level controls
• Business insights into your data
• Most ways to bring data in
Ingest methods: Rapidly ingest all data sources
• IoT, sensor data, clickstream data, social media feeds, streaming logs
• Oracle, MySQL, MongoDB, DB2, SQL Server, Amazon RDS
• On-premises ERP, mainframes, lab equipment, NAS storage
• Offline sensor data, NAS, on-premises Hadoop
• On-premises data lakes, EDW, large-scale data collection
A data lake needs to accommodate a wide variety of concurrent data sources.
AWS Transfer for SFTP
Fully managed service enabling transfer of data over SFTP, with the data stored in Amazon S3
• Seamless migration of existing workflows
• Fully managed in AWS
• Secure and compliant
• Native integration with AWS services
• Cost-effective
• Simple to use
AWS DataSync
Transfer service that simplifies, automates, and accelerates data movement
• Transfers up to 10 Gbps per agent
• Simple data movement to Amazon S3 or Amazon EFS
• Secure and reliable transfers
• AWS integrated
• Pay as you go
Combines the speed and reliability of network acceleration software with the cost-effectiveness of open-source tools.
Use cases: migrate active application data to AWS, transfer data for timely in-cloud analysis, and replicate data to AWS for business continuity.
Choosing the right data formats
There is no such thing as the “best” data format
• All involve tradeoffs, depending on workload & tools
• CSV, TSV, JSON are easy but not efficient
• Compress & store or archive as raw input
• Columnar compressed formats are generally preferred
  • Parquet or ORC
  • Smaller storage footprint = lower cost
  • More efficient scan & query
• Row-oriented (Avro) is good for full data scans
• Organize into partitions
• Coalescing to larger partitions over time
Key considerations are cost, performance, and support
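To make the columnar-format guidance concrete, here is a minimal PySpark sketch (the bucket paths and the partition column are hypothetical, not from the deck) that rewrites raw CSV as partitioned, Snappy-compressed Parquet:

```python
# Minimal PySpark sketch: convert raw CSV to partitioned, Snappy-compressed Parquet.
# Bucket names and the partition column ("event_date") are hypothetical examples.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Read the raw CSV input, letting Spark infer a schema from the data
raw = (spark.read
       .option("header", "true")
       .option("inferSchema", "true")
       .csv("s3://example-raw-bucket/clickstream/"))

# Write columnar Parquet, partitioned by date so queries can prune partitions
(raw.write
 .mode("overwrite")
 .partitionBy("event_date")
 .option("compression", "snappy")
 .parquet("s3://example-curated-bucket/clickstream_parquet/"))
```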
Serverless ETL using AWS Glue
Data prep is ~80% of data lake work
• Building training sets
• Cleaning and organizing data
• Collecting datasets
• Mining data for patterns
• Refining algorithms
• Other
Set up a catalog, ETL, and data prep
with AWS Glue
• Serverless provisioning, configuration, and scaling to run your ETL jobs on Apache Spark
• Pay only for the resources used for jobs
• Crawl your data sources, identify data formats, and suggest schemas and transformations
• Automates the effort of building, maintaining, and running ETL jobs
AWS Glue In Action
AWS Glue: Components
Data Catalog
§ Hive metastore compatible with enhanced functionality
§ Crawlers automatically extract metadata and create tables
§ Integrated with Athena and Amazon Redshift Spectrum
Job Authoring
§ Auto-generates ETL code
§ Builds on open frameworks—Python and Spark
§ Developer-centric—editing, debugging, sharing
Job Execution
§ Runs jobs on a serverless Spark platform
§ Provides flexible scheduling
§ Handles dependency resolution, monitoring, and alerting
AWS Glue Data Catalog
Manage table metadata through a Hive metastore API or Hive SQL. Supported by tools like Hive, Presto, Spark, etc.
We added a few extensions:
§ Search over metadata for data discovery
§ Connection info—JDBC URLs, credentials
§ Classification for identifying and parsing files
§ Versioning of table metadata as schemas evolve and other metadata are updated
Populate using Hive DDL, bulk import, or automatically through crawlers.
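As a small illustration, table metadata in the Data Catalog can also be read programmatically; a minimal boto3 sketch, with a hypothetical database and table name:

```python
# Minimal boto3 sketch: read table metadata from the AWS Glue Data Catalog.
# "clickstream_db" and "events" are hypothetical names, not from the deck.
import boto3

glue = boto3.client("glue")

table = glue.get_table(DatabaseName="clickstream_db", Name="events")["Table"]

# Print column names/types and any partition keys recorded for the table
for col in table["StorageDescriptor"]["Columns"]:
    print(col["Name"], col["Type"])
for key in table.get("PartitionKeys", []):
    print("partition key:", key["Name"], key["Type"])
```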
AWS Glue Data Catalog: Crawlers
Crawlers automatically build your Data Catalog and keep it in sync
§ Automatically discover new data, extract schema definitions
§ Detect schema changes and version tables
§ Detect Hive-style partitions on Amazon S3
§ Built-in classifiers for popular types; custom classifiers using Grok expressions
§ Run ad hoc or on a schedule; serverless—only pay when the crawler runs
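A minimal boto3 sketch of defining and running such a crawler (the role ARN, S3 path, database name, and schedule are hypothetical examples):

```python
# Minimal boto3 sketch: create a crawler over an S3 prefix and run it.
# The IAM role, S3 path, database, and schedule are hypothetical examples.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="raw-clickstream-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="clickstream_db",
    Targets={"S3Targets": [{"Path": "s3://example-raw-bucket/clickstream/"}]},
    Schedule="cron(0 2 * * ? *)",        # optional: run nightly at 02:00 UTC
    SchemaChangePolicy={                  # version tables when schemas change
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "LOG",
    },
)

# Crawlers can also be run ad hoc; you pay only while the crawler runs
glue.start_crawler(Name="raw-clickstream-crawler")
```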
Data Catalog: Detecting partitions
An S3 bucket hierarchy with Hive-style prefixes (e.g., month=Nov/date=10 … date=15, each holding file 1 … file N) maps to a table definition:

Column | Type
month  | str
date   | str
col 1  | int
col 2  | float

The crawler estimates schema similarity among files at each level (e.g., sim=.93, .95, .99) to handle semi-structured logs, schema evolution, etc.
Data Catalog: Table details
The table detail view shows table properties, the table schema, nested fields, and data statistics.
Job authoring in AWS Glue
You have choices on how to get started:
• Python code generated by AWS Glue
• Connect a notebook or IDE to AWS Glue
• Existing code brought into AWS Glue
Job authoring: Automatic code generation
1. Customize the mappings
2. AWS Glue generates transformation graph and Python code
3. Connect your notebook to development endpoints to customize your code
Job authoring: ETL code
§ Human-readable, editable, and portable PySpark code
§ Flexible: AWS Glue’s ETL library simplifies manipulating complex, semi-structured data
§ Customizable: Use native PySpark, import custom libraries, and/or leverage AWS Glue’s libraries
§ Collaborative: Share code snippets via GitHub, reuse code across jobs
Job authoring: AWS Glue Dynamic Frames
Like Spark’s DataFrames, but better for:
• Cleaning and (re)structuring semi-structured data sets, e.g., JSON, Avro, Apache logs
No upfront schema needed:
• Infers schema on the fly, enabling transformations in a single pass
Easy to handle the unexpected:
• Tracks new fields and inconsistent, changing data types with choices, e.g., integer or string
• Automatically marks and separates error records
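A minimal AWS Glue PySpark sketch in the spirit of the generated ETL code: it reads a table from the Data Catalog as a DynamicFrame, resolves an ambiguously typed column with a choice, and writes Parquet to S3. The database, table, column, and bucket names are hypothetical:

```python
# Minimal AWS Glue PySpark sketch using DynamicFrames; names are hypothetical.
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read straight from the Data Catalog; the schema is resolved on the fly
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="clickstream_db", table_name="events"
)

# "user_id" arrived as both int and string in different files; cast it to string
dyf = dyf.resolveChoice(specs=[("user_id", "cast:string")])

# Write curated, columnar output back to S3
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://example-curated-bucket/events/"},
    format="parquet",
)
job.commit()
```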
Job authoring: Leveraging the community
No need to start from scratch.
Use AWS Glue samples stored in GitHub to share, reuse,
contribute: https://github.com/awslabs/aws-glue-samples
• Migration scripts to import existing Hive metastore
data into AWS Glue Data Catalog
• Examples of how to use Dynamic Frames and
Relationalize() transform
• Examples of how to use arbitrary PySpark code with
AWS Glue’s Python ETL library
Download AWS Glue’s Python ETL library to start
developing code in your IDE:
https://github.com/awslabs/aws-glue-libs
Job execution: Scheduling and monitoring
Compose jobs globally with event-based dependencies
§ Easy to reuse and leverage work across organization boundaries
Multiple triggering mechanisms
§ Schedule-based: e.g., time of day
§ Event-based: e.g., job completion
§ On-demand: e.g., Lambda trigger
§ More: Amazon S3 notifications and Amazon CloudWatch Events
Logs and alerts are available in CloudWatch
(Example from the diagram: an event-based Lambda trigger starts the "Marketing: Ad spend by customer segment" job, a weekly schedule starts the "Sales: Revenue by customer segment" job, and data-based dependencies then start the "Central: ROI by customer segment" job.)
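A minimal boto3 sketch of an event-based (conditional) trigger that chains two hypothetical jobs, so the downstream job starts only when the upstream job succeeds:

```python
# Minimal boto3 sketch: chain two hypothetical jobs with a conditional trigger,
# so "aggregate-roi" starts only after "load-sales" succeeds.
import boto3

glue = boto3.client("glue")

glue.create_trigger(
    Name="start-roi-after-sales",
    Type="CONDITIONAL",
    StartOnCreation=True,
    Predicate={
        "Conditions": [{
            "LogicalOperator": "EQUALS",
            "JobName": "load-sales",
            "State": "SUCCEEDED",
        }]
    },
    Actions=[{"JobName": "aggregate-roi"}],
)
```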
Job execution: Serverless
There is no need to provision, configure, or manage servers
§ Auto-configures VPC and role-based access
§ Customers can specify the capacity that gets allocated to each job
§ Automatically scales resources (on the post-GA roadmap)
§ You pay only for the resources you consume, while you consume them
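A minimal boto3 sketch of starting a hypothetical job and specifying the capacity allocated to that run:

```python
# Minimal boto3 sketch: start a hypothetical Glue job and specify its capacity.
import boto3

glue = boto3.client("glue")

run = glue.start_job_run(
    JobName="events-to-parquet",
    MaxCapacity=10.0,               # DPUs for this run; billed only while it runs
    Arguments={"--stage": "prod"},  # hypothetical job argument
)
print("started run:", run["JobRunId"])
```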
Common customer use cases
• Log aggregation with AWS Glue ETL
• Real-time data collection with AWS Glue ETL
• Data import using AWS Glue database connectors
Each of these patterns is built around the AWS Glue Data Catalog.
Serverless processing using
Lambda
Benefits of Lambda
Productivity-focused compute platform to build powerful, dynamic, modular applications in the cloud
1. No infrastructure to manage: focus on business logic
2. Cost-effective and efficient: pay only for what you use
3. Bring your own code: run code in standard languages
Application components for serverless apps
EVENT SOURCE → FUNCTION → SERVICES (ANYTHING)
• Event sources: changes in data state, requests to endpoints, changes in resource state
• Function runtimes: Node, Python, Java ... more coming soon
Event sources that integrate with Lambda
• Data stores: Amazon S3, DynamoDB, Kinesis, Amazon Cognito, Amazon RDS, Amazon Aurora (new)
• Endpoints: Amazon API Gateway, Alexa, AWS IoT
• Repositories: AWS CloudTrail, CloudWatch, AWS CloudFormation
• Event/message services: Amazon SNS, Amazon SES, cron events
• Orchestration and state management: AWS Step Functions
… and the list will continue to grow!
Lambda use case for streaming data ingestion
Record producers (e.g., the Amazon Kinesis Agent) send raw records to an Amazon Kinesis Data Firehose delivery stream. A Lambda function performs transformations and enrichment, using DynamoDB lookup tables, and returns transformed records. Firehose buffers the output into Amazon S3 files and loads Amazon Redshift tables and Amazon ES domains, while backing up source records to Amazon S3 and publishing delivery metrics to CloudWatch.
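A minimal sketch of the transformation-and-enrichment Lambda in this pattern, following the Kinesis Data Firehose record-transformation contract; the DynamoDB lookup table and record fields are hypothetical:

```python
# Minimal Lambda sketch for Kinesis Data Firehose record transformation.
# The DynamoDB lookup table and record fields are hypothetical examples.
import base64
import json

import boto3

dynamodb = boto3.resource("dynamodb")
lookup_table = dynamodb.Table("device-metadata")  # hypothetical lookup table


def handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))

        # Enrich the raw record with an attribute from the lookup table
        item = lookup_table.get_item(Key={"device_id": payload.get("device_id", "")})
        payload["site"] = item.get("Item", {}).get("site", "unknown")

        output.append({
            "recordId": record["recordId"],
            "result": "Ok",   # "Dropped" or "ProcessingFailed" are also valid
            "data": base64.b64encode(json.dumps(payload).encode()).decode(),
        })
    return {"records": output}
```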
Amazon Kinesis Streams and Lambda
Streaming source → Amazon Kinesis stream → Lambda processor function → other AWS services
• The number of Amazon Kinesis stream shards corresponds to the number of concurrent invocations of the Lambda function
• Batch size sets the maximum number of records per Lambda function invocation
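A minimal sketch of the processor function; each invocation receives a batch of base64-encoded records from one shard:

```python
# Minimal Lambda sketch for processing a batch of Kinesis stream records.
# Each invocation receives up to "batch size" records from a single shard.
import base64
import json


def handler(event, context):
    for record in event["Records"]:
        # Kinesis record payloads arrive base64-encoded
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        # ... process or forward the record to other AWS services ...
        print(record["kinesis"]["partitionKey"], payload)
```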
Serverless data lake architecture
(Architecture diagram; automation is driven by CloudWatch Events rules.)
Steps in building a serverless data lake
1. Ingest data into Amazon S3
2. Configure an Amazon S3 event trigger
3. Automate the Data Catalog with an AWS Glue crawler
4. Author ETL jobs
5. Automate ETL job execution
6. Monitor with CloudWatch Events
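A minimal sketch of the automation between steps 2 and 3: a Lambda function wired to the S3 event trigger that starts a Glue crawler (the crawler name is a hypothetical example) so the Data Catalog stays in sync as new objects arrive:

```python
# Minimal Lambda sketch: on an S3 object-created event, start the Glue crawler
# that keeps the Data Catalog in sync. The crawler name is a hypothetical example.
import boto3

glue = boto3.client("glue")


def handler(event, context):
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        print(f"new object s3://{bucket}/{key}; starting crawler")

    try:
        glue.start_crawler(Name="raw-clickstream-crawler")
    except glue.exceptions.CrawlerRunningException:
        # The crawler is already running; a later event will pick up new files
        pass
```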
Serverless data lake blog post reference
https://aws.amazon.com/blogs/big-data/build-and-automate-a-serverless-data-lake-using-an-aws-glue-trigger-for-the-data-catalog-and-etl-jobs/
Data lakes and analytics
More than 10,000 data lakes on AWS
AWS Partners
Thank you!
Aditya Challa
aditchal@amazon.com
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.