
Build and automate a modern serverless data lake on AWS

Aditya Challa
AWS Solutions Architect
Amazon Web Services

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
A data lake is a system or repository of data stored in its natural/raw format, usually object blobs or files.
A data lake is usually a single store of all enterprise data including raw copies of source system data and
transformed data used for tasks such as reporting, visualization, advanced analytics and machine learning.
A data lake can include structured data from relational databases (rows and columns), semi-structured data
(CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video).
A data lake can be established "on premises" (within an organization's data centers) or "in the cloud" (using
cloud services from vendors such as Amazon Web Services).

-- Wikipedia

Serverless computing is a cloud computing execution model in which the cloud provider runs the server,
and dynamically manages the allocation of machine resources. Pricing is based on the actual amount of
resources consumed by an application, rather than on pre-purchased units of capacity. It can be a form of
utility computing.

-- Wikipedia
Typical steps of building a data lake

1. Set up storage
2. Move data
3. Cleanse, prep, and catalog data
4. Configure and enforce security and compliance policies
5. Make data available for analytics
Defining the AWS data lake

Data lakes provide:
• Relational and nonrelational data
• Scale-out to exabytes (EBs)
• Diverse set of analytics and machine learning tools
• Work on data without any data movement
• Designed for low-cost storage and analytics

[Diagram: data sources (OLTP, ERP, CRM, LoB, devices, web, sensors, social) flow into a cataloged data lake alongside a data warehouse, feeding DW queries, big data processing, interactive and real-time analytics, business intelligence, and machine learning]
Why use AWS for big data & analytics?

• Agility
• Scalability
• Broadest and deepest capabilities
• Low cost
• Get to insights faster
• Data migrations made easy

Data lake on AWS

Central storage (scalable, secure, cost-effective): Amazon S3
Data ingestion: AWS Snowball, Amazon Kinesis Data Firehose, Amazon Kinesis, AWS Direct Connect, AWS Database Migration Service (AWS DMS), AWS Storage Gateway
Catalog & search: AWS Glue, Amazon DynamoDB, Amazon Elasticsearch Service (Amazon ES)
Access & user interfaces: Amazon AppSync, Amazon API Gateway, Amazon Cognito
Analytics & serving: Amazon Athena, Amazon EMR, AWS Glue, Amazon Redshift, Amazon QuickSight, Amazon ES, Amazon Neptune, Amazon RDS, Amazon DynamoDB
Manage & secure: AWS IAM, AWS KMS, AWS CloudTrail, Amazon CloudWatch
Modern serverless data lake components

Amazon S3, AWS Glue, AWS Lambda, and Amazon CloudWatch Events
Amazon S3 is the best place for data lakes

• Unmatched durability, availability, and scalability
• Best security, compliance, and audit capabilities
• Object-level controls
• Business insights into your data
• Most ways to bring data in
Ingest methods

Rapidly ingest all data sources:
• IoT, sensor data, clickstream data, social media feeds, streaming logs
• Oracle, MySQL, MongoDB, DB2, SQL Server, Amazon RDS
• On-premises ERP, mainframes, lab equipment, NAS storage
• Offline sensor data, NAS, on-premises Hadoop
• On-premises data lakes, EDW, large-scale data collection

A data lake needs to accommodate a wide variety of concurrent data sources.
AWS Transfer for SFTP

Fully managed service enabling transfer of data over SFTP while stored in Amazon S3

• Seamless migration of existing workflows
• Fully managed in AWS
• Secure and compliant
• Native integration with AWS services
• Cost-effective
• Simple to use
AWS DataSync

Transfer service that simplifies, automates, and accelerates data movement

• Transfers up to 10 Gbps per agent
• Simple data movement to Amazon S3 or Amazon EFS
• Secure and reliable transfers
• AWS integrated
• Pay as you go

Combines the speed and reliability of network acceleration software with the cost-effectiveness of open-source tools.

Use cases: migrate active application data to AWS, transfer data for timely in-cloud analysis, and replicate data to AWS for business continuity.
Choosing the right data formats

There is no such thing as the “best” data format
• All involve tradeoffs, depending on workload & tools
• CSV, TSV, and JSON are easy but not efficient
  • Compress & store or archive as raw input
• Columnar compressed formats are generally preferred
  • Parquet or ORC
  • Smaller storage footprint = lower cost
  • More efficient scan & query
• Row-oriented (Avro) is good for full data scans
• Organize into partitions
  • Coalesce into larger partitions over time

Key considerations are cost, performance, and support.
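To make the tradeoff concrete, here is a minimal PySpark sketch that converts raw CSV into partitioned, Snappy-compressed Parquet; the bucket names, column names, and the partition key are hypothetical, not from the talk:

    # Minimal PySpark sketch: convert raw CSV to partitioned, compressed Parquet.
    # Bucket names, columns, and the partition key ("event_date") are illustrative only.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

    # Read the raw CSV input, inferring the schema for simplicity
    raw = (spark.read
           .option("header", "true")
           .option("inferSchema", "true")
           .csv("s3://example-raw-bucket/clickstream/"))

    # Write columnar, compressed output partitioned by date for cheaper, faster scans
    (raw.write
        .mode("overwrite")
        .partitionBy("event_date")
        .option("compression", "snappy")
        .parquet("s3://example-curated-bucket/clickstream_parquet/"))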
Serverless ETL using AWS Glue

Data prep is ~80% of data lake work

• Building training sets
• Cleaning and organizing data
• Collecting datasets
• Mining data for patterns
• Refining algorithms
• Other
Set up a catalog, ETL, and data prep with AWS Glue

• Serverless provisioning, configuration, and scaling to run your ETL jobs on Apache Spark
• Pay only for the resources used for jobs
• Crawl your data sources, identify data formats, and suggest schemas and transformations
• Automates the effort in building, maintaining, and running ETL jobs
AWS Glue in action
AWS Glue: Components

Data Catalog
§ Hive metastore compatible with enhanced functionality
§ Crawlers automatically extract metadata and create tables
§ Integrated with Athena and Amazon Redshift Spectrum

Job authoring
§ Auto-generates ETL code
§ Builds on open frameworks—Python and Spark
§ Developer-centric—editing, debugging, sharing

Job execution
§ Runs jobs on a serverless Spark platform
§ Provides flexible scheduling
§ Handles dependency resolution, monitoring, and alerting
AWS Glue Data Catalog

Manage table metadata through a Hive metastore API or Hive SQL. Supported by tools like Hive, Presto, Spark, etc.

We added a few extensions:
§ Search over metadata for data discovery
§ Connection info—JDBC URLs, credentials
§ Classification for identifying and parsing files
§ Versioning of table metadata as schemas evolve and other metadata are updated

Populate using Hive DDL, bulk import, or automatically through crawlers.
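Because the catalog is exposed through standard APIs, it can also be inspected programmatically. A minimal boto3 sketch, assuming a hypothetical database named "clickstream_db":

    # Minimal boto3 sketch: list tables in one Glue Data Catalog database and
    # print each table's columns. The database name is illustrative only.
    import boto3

    glue = boto3.client("glue")

    response = glue.get_tables(DatabaseName="clickstream_db")
    for table in response["TableList"]:
        columns = [c["Name"] for c in table["StorageDescriptor"]["Columns"]]
        print(table["Name"], columns)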


AWS Glue Data Catalog: Crawlers

Crawlers automatically build your Data Catalog and keep it in sync

§ Automatically discover new data, extract schema definitions
§ Detect schema changes and version tables
§ Detect Hive-style partitions on Amazon S3
§ Built-in classifiers for popular types; custom classifiers using Grok expressions
§ Run ad hoc or on a schedule; serverless—only pay when the crawler runs
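A crawler can be created and run through the AWS Glue API; a minimal boto3 sketch, where the crawler name, IAM role, database, and S3 path are all hypothetical:

    # Minimal boto3 sketch: define a crawler over an S3 prefix and run it once.
    # The crawler name, IAM role ARN, database, and S3 path are illustrative only.
    import boto3

    glue = boto3.client("glue")

    glue.create_crawler(
        Name="clickstream-crawler",
        Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
        DatabaseName="clickstream_db",
        Targets={"S3Targets": [{"Path": "s3://example-raw-bucket/clickstream/"}]},
        Schedule="cron(0 2 * * ? *)",  # optional: run nightly; omit to run on demand
    )

    # Kick off an ad hoc run immediately
    glue.start_crawler(Name="clickstream-crawler")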


Data Catalog: Detecting partitions

Crawlers estimate schema similarity among files at each level of the S3 bucket hierarchy to handle semi-structured logs and schema evolution, and roll compatible prefixes up into a single table definition whose partition columns (e.g., month, date) sit alongside the data columns.

[Diagram: an S3 hierarchy partitioned by month and date, with per-level similarity scores, mapped to one table definition containing month, date, and the data columns]
Data Catalog: Table details

[Console screenshot: table properties, table schema, nested fields, and data statistics]
Job authoring in AWS Glue

You have choices on how to get started:
• Python code generated by AWS Glue
• Connect a notebook or IDE to AWS Glue
• Existing code brought into AWS Glue
Job authoring: Automatic code generation

1. Customize the mappings
2. AWS Glue generates the transformation graph and Python code
3. Connect your notebook to development endpoints to customize your code
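The generated script typically follows the shape below; this is a simplified, hand-written sketch rather than actual generated output, and the database, table, mappings, and output path are hypothetical:

    # Simplified sketch in the style of a Glue-generated PySpark job.
    # Database, table, mappings, and output path are illustrative only.
    import sys
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.transforms import ApplyMapping
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read the source table that a crawler registered in the Data Catalog
    source = glue_context.create_dynamic_frame.from_catalog(
        database="clickstream_db", table_name="raw_events")

    # Rename and cast columns according to the chosen mappings
    mapped = ApplyMapping.apply(
        frame=source,
        mappings=[("ts", "string", "event_time", "timestamp"),
                  ("uid", "string", "user_id", "string")])

    # Write the transformed data back to S3 as Parquet
    glue_context.write_dynamic_frame.from_options(
        frame=mapped,
        connection_type="s3",
        connection_options={"path": "s3://example-curated-bucket/events/"},
        format="parquet")

    job.commit()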
Job authoring: ETL code
§ Human-readable, editable, and portable PySpark code

§ Flexible: AWS Glue’s ETL library simplifies manipulating complex, semi-structured data

§ Customizable: Use native PySpark, import custom libraries, and/or leverage AWS Glue’s libraries

§ Collaborative: Share code snippets via GitHub, reuse code across jobs
Job authoring: AWS Glue DynamicFrames

Like Spark’s DataFrames, but better for:
• Cleaning and (re)structuring semi-structured data sets, e.g., JSON, Avro, Apache logs

No upfront schema needed:
• Infers the schema on the fly, enabling transformations in a single pass

Easy to handle the unexpected:
• Tracks new fields and inconsistent data types with choice types, e.g., integer or string
• Automatically marks and separates error records
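A minimal sketch of how choice types and error records might be handled with the DynamicFrame API; the catalog names and the choice column are hypothetical:

    # Minimal DynamicFrame sketch: resolve a column that arrived as both int and
    # string, then check for malformed records. Names are illustrative only.
    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())

    events = glue_context.create_dynamic_frame.from_catalog(
        database="clickstream_db", table_name="raw_events")

    # "price" was ingested as int in some files and string in others; cast everything
    resolved = events.resolveChoice(specs=[("price", "cast:double")])

    # DynamicFrames keep track of records that failed to parse
    print("error records:", resolved.errorsCount())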
Job authoring: Leveraging the community

No need to start from scratch. Use AWS Glue samples stored in GitHub to share, reuse, and contribute: https://github.com/awslabs/aws-glue-samples
• Migration scripts to import existing Hive metastore data into the AWS Glue Data Catalog
• Examples of how to use DynamicFrames and the Relationalize() transform
• Examples of how to use arbitrary PySpark code with AWS Glue’s Python ETL library

Download AWS Glue’s Python ETL library to start developing code in your IDE: https://github.com/awslabs/aws-glue-libs
Job execution: Scheduling and monitoring

Compose jobs globally with event-based dependencies
§ Easy to reuse and leverage work across organization boundaries

Multiple triggering mechanisms
§ Schedule-based: e.g., time of day
§ Event-based: e.g., job completion
§ On-demand: e.g., Lambda
§ More: Amazon S3 notifications and Amazon CloudWatch Events

Logs and alerts are available in CloudWatch

[Diagram: a weekly schedule and an event-based Lambda trigger start data-based job chains that produce marketing (ad spend by customer segment), sales (revenue by customer segment), and central (ROI by customer segment) outputs]
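One way such a dependency can be expressed is through the Glue triggers API; a minimal boto3 sketch, where the trigger and job names are hypothetical:

    # Minimal boto3 sketch: run a downstream job only after an upstream job succeeds.
    # Trigger and job names are illustrative only.
    import boto3

    glue = boto3.client("glue")

    glue.create_trigger(
        Name="run-aggregation-after-cleansing",
        Type="CONDITIONAL",
        Predicate={
            "Conditions": [{
                "LogicalOperator": "EQUALS",
                "JobName": "cleanse-raw-events",
                "State": "SUCCEEDED",
            }]
        },
        Actions=[{"JobName": "aggregate-by-customer-segment"}],
        StartOnCreation=True,
    )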
Job execution: Serverless

There is no need to provision, configure, or manage servers

§ Auto-configure VPC and role-based access
§ Customers can specify the capacity that gets allocated to each job
§ Automatically scale resources (on post-GA roadmap)
§ You pay only for the resources you consume while consuming them
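Capacity is specified per job run; a minimal boto3 sketch, where the job name and the capacity figure are hypothetical (newer Glue versions express capacity as worker type and count instead):

    # Minimal boto3 sketch: start a Glue job run with an explicit capacity setting.
    # The job name and 10-DPU figure are illustrative only; on Glue 2.0+ you would
    # typically set WorkerType and NumberOfWorkers instead of MaxCapacity.
    import boto3

    glue = boto3.client("glue")

    response = glue.start_job_run(
        JobName="cleanse-raw-events",
        MaxCapacity=10.0,  # number of data processing units (DPUs) for this run
    )
    print("started run:", response["JobRunId"])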
Common customer use cases

Log aggregation with AWS Glue ETL

[Architecture diagram centered on the AWS Glue Data Catalog]

Real-time data collection with Glue ETL

[Architecture diagram centered on the AWS Glue Data Catalog]

Data import using Glue database connectors

[Architecture diagram centered on the AWS Glue Data Catalog]


Serverless processing using Lambda

Benefits of Lambda

Productivity-focused compute platform to build powerful, dynamic, modular applications in the cloud

1. No infrastructure to manage: focus on business logic
2. Cost-effective and efficient: pay only for what you use
3. Bring your own code: run code in standard languages
Application components for serverless apps

EVENT SOURCE → FUNCTION → SERVICES (ANYTHING)

Event sources: changes in data state, requests to endpoints, changes in resource state
Function runtimes: Node, Python, Java … more coming soon
Event sources that integrate with Lambda

Data stores: Amazon S3, DynamoDB, Kinesis, Amazon Cognito, Amazon Aurora, Amazon RDS
Endpoints: Amazon API Gateway, AWS IoT, Alexa
Repositories: AWS CloudTrail, CloudWatch, AWS CloudFormation
Event/message services: Amazon SNS, Amazon SES, cron events
Orchestration and state management: AWS Step Functions

… and the list will continue to grow!


Lambda use case for streaming data ingestion

[Diagram: record producers send raw records through the Amazon Kinesis Agent to an Amazon Kinesis Data Firehose delivery stream; a Lambda function performs transformations and enrichment using DynamoDB lookup tables; transformed records are delivered to Amazon S3 (buffered files), Amazon Redshift (table loads), and Amazon ES (domain loads), with source record backup to Amazon S3 and delivery metrics in CloudWatch]
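The transformation step in this pattern is a Lambda function that the Firehose delivery stream invokes with a batch of records; a minimal Python sketch, where the enrichment logic is hypothetical:

    # Minimal sketch of a Firehose transformation Lambda: decode each record,
    # enrich it, and return it marked "Ok". The enrichment field is illustrative only.
    import base64
    import json

    def lambda_handler(event, context):
        output = []
        for record in event["records"]:
            payload = json.loads(base64.b64decode(record["data"]))

            # Hypothetical enrichment step, e.g., from a DynamoDB lookup table
            payload["segment"] = "unknown"

            output.append({
                "recordId": record["recordId"],
                "result": "Ok",  # or "Dropped" / "ProcessingFailed"
                "data": base64.b64encode(
                    (json.dumps(payload) + "\n").encode("utf-8")).decode("utf-8"),
            })
        return {"records": output}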
Amazon Kinesis Streams and Lambda

[Diagram: streaming source → Amazon Kinesis stream → Lambda stream processor function → other AWS services]

• The number of Amazon Kinesis Streams shards corresponds to the number of concurrent invocations of the Lambda function
• Batch size sets the maximum number of records per Lambda function invocation
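A stream processor of this shape receives batches of Kinesis records whose payloads are base64-encoded; a minimal Python sketch, where the per-record processing is hypothetical:

    # Minimal sketch of a Lambda function consuming an Amazon Kinesis stream.
    import base64
    import json

    def lambda_handler(event, context):
        for record in event["Records"]:
            payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
            # Hypothetical processing step: just log the decoded record
            print("partition key:", record["kinesis"]["partitionKey"],
                  "payload:", payload)
        return {"processed": len(event["Records"])}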
Serverless data lake architecture

Serverless data lake architecture

[Architecture diagram; CloudWatch Events rules drive the automation]
Steps in building a serverless data lake
1. Ingest data into Amazon S3
2. Configure an Amazon S3 event trigger
3. Automate the Data Catalog with an AWS Glue crawler
4. Author ETL jobs
5. Automate ETL job execution
6. Monitor with CloudWatch Events
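Steps 2 and 3 can be wired together with a small Lambda function that an S3 event notification invokes and that starts the crawler; a minimal sketch, where the crawler name is hypothetical:

    # Minimal sketch of the S3-triggered automation step: when a new object lands
    # in the raw bucket, start the Glue crawler that refreshes the Data Catalog.
    # The crawler name is illustrative only.
    import boto3

    glue = boto3.client("glue")

    def lambda_handler(event, context):
        for record in event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            print(f"new object: s3://{bucket}/{key}")

        # CrawlerRunningException is raised if the crawler is already in progress
        try:
            glue.start_crawler(Name="clickstream-crawler")
        except glue.exceptions.CrawlerRunningException:
            print("crawler already running; skipping")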
Serverless data lake blog post reference

https://aws.amazon.com/blogs/big-data/build-and-automate-a-serverless-data-lake-using-an-aws-glue-trigger-for-the-data-catalog-and-etl-jobs/
Data lakes and analytics
More than 10,000 data lakes on AWS
AWS Partners
Thank you!
Aditya Challa
aditchal@amazon.com

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
