Data Architecture on AWS
David Tucker
TECHNICAL ARCHITECT & CTO CONSULTANT
@_davidtucker_ davidtucker.net
Reviewing approaches for integrating
Overview data from your own data center
Examining approaches for processing
data
Exploring data analysis approaches
Integrating machine learning and AI into
data analysis
Integrating On-premise Data
On-premise Data Integration Services
AWS Storage AWS
Gateway DataSync
Hybrid-cloud storage Automated data
service transfer service
AWS Storage Gateway
Integrates cloud storage into your local
network
Deployed as a VM or specific hardware
appliance
Integrates with S3 and EBS
Supports three different gateway types
- Tape Gateway
- Volume Gateway
- File Gateway
Gateway Types
File Gateway Tape Gateway Volume Gateway
Stores files in Amazon S3 Enables tape backup Provides cloud based
while providing cached processes to store data in iSCSI volumes to local
low-latency local access the cloud on virtual tapes applications
AWS DataSync
Leverages the DataSync agent deployed
as a VM on your network
Integrates with S3, EFS, and FSx for
Windows File Server on AWS
Greatly improved speed of transfer due
to custom protocol and optimizations
Charged per GB of data transferred
Processing Data
Data Processing Services
AWS Glue Amazon EMR AWS Data Pipeline
Managed Extract, Big-data cloud Data workflow
Transform, and Load processing using orchestration service
(ETL) Service popular tools across AWS services
AWS Glue
Fully managed ETL (extract, transform,
and load) service on AWS
Supports data in Amazon RDS,
DynamoDB, Redshift, and S3
Supports a serverless model of
execution
Amazon EMR
Enables big-data processing on
Amazon EC2 and S3
Supports popular open-source
frameworks and tools
Operates in a clustered environment
without additional configuration
Supports many different big-data use
cases
Supported Amazon EMR Frameworks
Apache Spark Apache Hive Apache HBase
Apache Flink Apache Hudi Presto
AWS Data Pipeline
Managed ETL (extract, transform, and
load) service on AWS
Manages data workflow through AWS
services
Supports S3, EMR, Redshift,
DynamoDB, and RDS
Can integrate on-premise data stores
Analyzing Data
Data Analysis Services
Amazon Athena Amazon Quicksight Amazon CloudSearch
Service that enables Business intelligence Managed search
querying of data service enabling data service for custom
stored in Amazon S3 dashboards applications
Amazon Athena
Fully-managed serverless service
Enables querying of large-scale data
stored within Amazon S3
Queries are written using standard SQL
Charged based on data scanned for
query
Amazon Quicksight
Fully managed business intelligence
service
Enables dynamic data dashboard
based on data stored in AWS
Charged on a per-user and per-session
pricing model
Multiple versions provided based on
needs
Amazon CloudSearch
Fully-managed search service on AWS
Support scaling of search
infrastructure to meet demand
Charged per hour and instance type of
search infrastructure
Enables developers to integrate search
into custom applications
Integrating AI and Machine Learning
AI and Machine Learning Services
Amazon Rekognition Amazon Translate Amazon Transcribe
Computer vision Text translation Speech to text
service powered by service powered by solution using
Machine Learning Machine Learning Machine Learning
Amazon Rekognition
Fully-managed image and video
recognition deep learning service
Identifies objects in images
Identifies objects and actions in videos
Can detect specific people using facial
analysis
Supports custom labels for your
business objects
Amazon Translate
Fully-managed service for translation
of text
Currently supports 54 languages
Can perform language identification
Works both in batch and real-time
Amazon Transcribe
Fully-managed speech recognition
services
Recorded speech is converted into text
in custom applications
Includes a specific sub-service for
medical use
Supports batch and real-time
transcription
Currently supports 31 languages
Scenario Based Review
Scenario 1
Ruth is a data scientist for a financial
services company
Large-scale data set needs to be
processed before analysis
Ruth doesn’t want to manage servers
but just wants to define processing
What service would you recommend to
Ruth?
Scenario 2
Jessi is a member of the IT team for a
biotech company
She is currently working to identify an
approach for controlled lab access
She wants leverage AI to determine
access based on facial imaging
Is there an AWS service that can help
with this approach?
Scenario 3
Roger’s company sells custom services
around machine learning
His head of sales is trying to find a great
way to visualize their sales data
This data is currently stored in Redshift
as their data warehouse
What AWS service would allow this
access to the data by non-technical
resources?
Summary
Reviewed approaches for integrating
Summary data from your own data center
Examined approaches for processing
data
Explored data analysis approaches
Integrated machine learning and AI into
data analysis
Scenario 1
Ruth is a data scientist for a financial
services company
Large-scale data set needs to be
processed before analysis
Ruth doesn’t want to manage servers
but just wants to define processing
What service would you recommend to
Ruth?
Solution: AWS Glue
Scenario 2
Jessi is a member of the IT team for a
biotech company
She is currently working to identify an
approach for controlled lab access
She wants leverage AI to determine
access based on facial imaging
Is there an AWS service that can help
with this approach?
Solution: Amazon Rekognition
Scenario 3
Roger’s company sells custom services
around machine learning
His head of sales is trying to find a great
way to visualize their sales data
This data is currently stored in Redshift
as their data warehouse
What AWS service would allow this
access to the data by non-technical
resources?
Solution: Amazon Quicksight