Make your data fly - Building data platform in AWS
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
NORDICS
Clarion Hotel Helsinki
March 21, 2018
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Make your data fly -
Building data platform
in AWS
Kimmo Kantojärvi & Roope Parviainen
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Today’s topics
● We are...
● Architectural evolution
● Making Data DevOps work
● How to cope with the data challenges
● Our experiences with a couple of the components/services
and some tips & tricks
● EMR, Redshift, Airflow, visualization tools
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
We are...
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Kimmo (@kimmokantojarvi)
● Coding architect
● 15 years in data business
● AWS Certified Solutions Architect -
Professional
● Ilves fan
Roope
● Data Architect #HandsDirty
● 5 years of professional love for data
● Software Development × DW × data
platforms × IoT
● AWS Certified Solutions Architect -
Professional
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
We are a data and
customer value driven
transformation company
▪ 96 % of our 186 clients recommend us
▪ Over 2 million daily users in maintained services
▪ Extensive partner network in tech and insight
1996
FOUNDED
650
EMPLOYEES
6
CITIES
4
COUNTRIES
76M
TURNOVER 2017
20%
AVG. PROFITABLE GROWTH PER ANNUM
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
We help our
customers
to create new
services by
understanding their
customers and
managing the change.
We build capabilities
and intelligence that
help develop and
create new business
opportunities.
We build and deliver
new business and
services technologies
and infrastructure.
We chase results
and take care of our
customers and their
services.
Offering
Consulting
and service
design
Data,
analytics and
AI
Digital
services
DevOps and
cloud
services
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Architectural evolution
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
It used to be so simple ;)
Source → ETL → DW → BI
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Today the architecture is much more versatile
and enabled by cloud
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
What happened?
From
● On-premise
● Few key technologies
● Closed solutions from big players
● Investments
● Compute & storage combined
● Data pull/batch
● Schema-on-write
● GUI
● Long projects, big lead times
To
● Cloud
● Various specific technologies
● Open source
● Flexible cost structure
● Separation of compute & storage
● Data push/stream
● Schema-on-read
● Code
● Agile methods, need to deliver fast
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Various options to load & process data
● Traditional
○ SQL
○ ETL tools
○ Integration tools
● APIs
● AWS Services
○ Glue
○ EMR
○ Kinesis
○ IoT
○ EC2/Lambda
○ S3
● Processing/streaming engines
○ Spark
○ Flink
○ Storm
○ Presto/Hive
● Custom code
○ R, Python, etc.
○ Machine learning
Make sure your new systems are
built to share data!
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Offloading data processing with EMR (+ Spark)
● Suitable for processing large amounts of data and complex calculations
● Java, Scala, Python
● Combine SQL, Python generators and Spark dataframes - Win-Win! (sketched below)
● Very cost-effective with spot instances
● Some learning curve (understanding configuration, behaviour and
metrics)
● Not all EC2 instance types available
● Ramp-up time ~10min - not ideal for short tasks unless run
continuously
● Testing code locally is challenging (e.g. pytest + a Spark plugin)
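A minimal PySpark sketch of the SQL + dataframe combination (bucket, paths and schema are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('offload-example').getOrCreate()

# Read source data from S3 and expose it to Spark SQL
orders = spark.read.parquet('s3://my-bucket/staging/orders/')
orders.createOrReplaceTempView('orders')

# Replace a hand-written SQL transformation with Spark SQL
daily_orders = spark.sql("""
    SELECT order_date, COUNT(*) AS order_count, SUM(amount) AS total_amount
    FROM orders
    GROUP BY order_date
""")

# Write results back to S3, e.g. for a Redshift COPY
daily_orders.write.mode('overwrite').parquet('s3://my-bucket/publish/daily_orders/')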
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
[Diagram: EMR cluster reading code.zip and job & environment configurations from S3, processing data from S3 and DynamoDB, and copying/unloading results to Redshift via S3]
60 x c3.xlarge process 10B rows in 1 hour ≈ 3.5€
1,000 SQL queries replaced with 1,000 lines of Python & Spark
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
import boto3

# Check recent spot prices before choosing instance types for the EMR cluster
ec2_client = boto3.client('ec2', region_name='eu-west-1')

response = ec2_client.describe_spot_price_history(
    AvailabilityZone='eu-west-1a',
    StartTime='2018-03-01',
    EndTime='2018-03-21',
    InstanceTypes=['c3.xlarge'],
    ProductDescriptions=['Linux/UNIX'],
    MaxResults=100
)
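The response can then be used, for example, to check the recent price range before bidding:

# Each entry in SpotPriceHistory carries the price as a string
prices = [float(p['SpotPrice']) for p in response['SpotPriceHistory']]
print('min %.4f / max %.4f USD per hour' % (min(prices), max(prices)))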
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
So many data storage options nowadays
● File/object storage
○ S3
● Data warehouses
○ Redshift, Snowflake
● Traditional databases
○ RDS (MySQL, Postgres, MariaDB,
MSSQL, Oracle)
● NoSQL databases
○ DynamoDB
○ MongoDB
○ Cassandra
● In-memory databases
○ Exasol
● GPU databases
○ MapD, BrytlytDB
● Time series databases
○ Kdb+, InfluxDB
● Caches
○ Redis, Memcached
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Redshift performance requires planning & design
● Redshift is a cluster and each node holds its
own slice of the data → data distribution affects
query performance and data loading (see the DDL sketch below)
● Optimal to query a few wide tables rather
than join many narrow tables together
○ E.g. data vault modeling is a bit challenging
from a query performance point of view
● Each table requires a minimum amount of storage
per node → more nodes → higher minimum
storage
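For example, rows that are joined or filtered together can be co-located with distribution and sort keys in the table DDL (table and columns are hypothetical):

# Hypothetical Redshift DDL: distribute on the join key, sort on the filter column
ORDERS_DDL = """
CREATE TABLE orders (
    order_id    BIGINT,
    customer_id BIGINT,
    order_date  DATE,
    amount      DECIMAL(12,2)
)
DISTKEY (customer_id)
SORTKEY (order_date);
"""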
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
In addition to data distribution, managing the query queue
(WLM) setup is important
● Max 500 concurrent connections per cluster, but only max 50 query
slots
● Each slot takes its own share of the memory; with 50 slots the memory is
split into 1/50 parts
● Can be used to control long-running (maybe not so smart) queries
made by users
○ E.g. failover after 5 min to a queue with fewer resources, as in the sketch below
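The queue setup lives in the wlm_json_configuration cluster parameter; a sketch of changing it with boto3 (the parameter group name and queue values are hypothetical, and the change typically requires a cluster reboot to take effect):

import json
import boto3

redshift = boto3.client('redshift', region_name='eu-west-1')

# Two queues: most of the memory for ETL, the rest for ad-hoc user queries
wlm_config = [
    {'query_group': ['etl'], 'query_concurrency': 5, 'memory_percent_to_use': 60},
    {'query_concurrency': 10, 'memory_percent_to_use': 40},
]

redshift.modify_cluster_parameter_group(
    ParameterGroupName='my-parameter-group',
    Parameters=[{
        'ParameterName': 'wlm_json_configuration',
        'ParameterValue': json.dumps(wlm_config),
    }]
)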
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
In Spectrum we trust
● Store part of the data in S3 (e.g.
Parquet + Snappy) and access it as an external
table with SQL (DDL sketched below)
● Separate Spectrum compute layer
● Read-only: you still need to process the data
into S3, and Redshift unload supports only
CSV at the moment
● Athena and Spectrum seem to be faster
if you have no joins, just a single table
● VPC support not available yet
https://aws.amazon.com/blogs/big-data/10-best-practices-for-amazon-redshift-spectrum/
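For reference, the external schema and table are declared roughly like this (role ARN, names and location are hypothetical; the DDL is run against the cluster with any SQL client):

SPECTRUM_DDL = """
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG DATABASE 'spectrumdb'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-spectrum-role'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

CREATE EXTERNAL TABLE spectrum.events (
    event_id   BIGINT,
    event_time TIMESTAMP,
    payload    VARCHAR(1024)
)
STORED AS PARQUET
LOCATION 's3://my-bucket/events/';
"""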
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Spectrum related wish list
● VPC support
● Write/delete also to allow schema-on-write
● Redshift unload to parquet/avro
● Some control over compute or control over cost structure
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Redshift still requires some maintenance
● Tasks taken care of by AWS
○ Backups
○ Resizing
○ Node/disk replacement
○ Query caching
● Built-in maintenance processes which the user controls (see the SQL sketch below)
○ Analyze → the query optimizer needs up-to-date table statistics
○ Vacuum → sorts data into the correct order and frees up storage from deleted data
○ Compression → optimize table compression
● https://github.com/awslabs/amazon-redshift-utils
○ Great toolset for maintenance and reviewing system status
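In practice the user-controlled processes are plain SQL commands run periodically, e.g. scheduled from Airflow (table name hypothetical):

MAINTENANCE_SQL = """
ANALYZE orders;      -- refresh statistics for the query optimizer
VACUUM FULL orders;  -- re-sort rows and reclaim space from deleted data
"""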
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Some other tips with Redshift
● With a 3-year full prepayment, break-even comes after 1 year = the commitment
is effectively only 1 year
○ 5.12 TB = 32 x dc2.xlarge = 2 x dc2.8xlarge ≈ $90k/year
○ All-upfront 3-year ≈ $31k/year
● Publish directly from staging and model later → faster visible results for
business users
● A lot of interesting development going on (especially Spectrum)
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Sharing your data
● APIs
● Integration tools
● BI tools
● AWS services
○ QuickSight
○ Athena
○ API Gateway
○ S3
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Visualizing the data
● The first phase in generating value from data is to visualize it
● A general-purpose BI/analytics tool does not (always) cope with e.g.
○ vast amounts of data
○ special visualization needs
● Right tool for the right purpose, “Mix and match”
○ PowerBI/Birst/QuickSight and custom d3.js / a trending tool / Grafana
/ Kibana
○ Multiple data sources
■ Virtualization of data sources
■ Data catalogs and understandability
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Visualizing the data
VS.
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Right tool for the right purpose
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Fast and slow data - same but different
● Platforms have to be able to ingest both slow and fast data
○ Batches are simply not enough
○ Data streams & event-driven data loads (see the Kinesis sketch below)
● Different endpoints / integrations (SFTP, HTTP REST, MQTT, data
dumps)
● Different data pipelines and databases
○ Even for the same data, based on usage needs
○ Orchestration of the whole becomes difficult
○ Parallelism when loading
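On the fast side, events are typically pushed into a stream; a minimal Kinesis sketch (stream name and payload are hypothetical):

import json
import boto3

kinesis = boto3.client('kinesis', region_name='eu-west-1')

# Push a single event into the stream; consumers (e.g. Lambda) load it onwards
kinesis.put_record(
    StreamName='my-event-stream',
    Data=json.dumps({'device_id': 'abc-123', 'temperature': 21.5}),
    PartitionKey='abc-123'
)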
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Managing the data flow
● Open source
○ Airflow
○ Oozie
○ Luigi
○ Jenkins
● Traditional ETL & Integration tools
● AWS services
○ Batch
○ Step Functions
● Custom code
○ Lambda
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Airflow
● Visualization and management of the whole data load
○ SQL
○ Command line
○ Python/Java/etc.
● Suitable for batch loading
● Loads can be generated programmatically based on metadata (sketched below)
● Parallel/multiple loads, managing parallelism
● Load history
● Logs available directly
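A minimal sketch of a metadata-generated DAG (table list and load command are hypothetical):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG('daily_load', start_date=datetime(2018, 3, 1), schedule_interval='@daily')

# Generate one load task per table from metadata; Airflow manages the parallelism
for table in ['customer', 'order', 'product']:
    BashOperator(
        task_id='load_%s' % table,
        bash_command='python load.py --table %s' % table,
        dag=dag,
    )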
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Airflow
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Airflow
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Airflow
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Things to consider
● Batch vs. streaming needs to be handled separately
● Airflow has some flaws
○ GUI is not always up-to-date
○ Scanning DAG statuses takes time
● If you have a lot of custom-code Lambdas running at different times,
how do you manage parallelism and how do you monitor them?
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Making Data DevOps work
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Data DevOps
● The target is to achieve deployment processes similar to software projects
● This was not even possible earlier because of poor support in traditional
tools
● To be effective and scalable, it should be metadata-driven
○ Code generated based on metadata, as in the sketch below
● Need to focus on following good coding practices
● Version management for everything
○ Infrastructure as code
○ Recursive schema changes
○ Data load changes
○ Report changes?
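As a simple illustration of metadata-driven generation (the metadata and SQL template are hypothetical):

# Metadata describing one staging-to-vault load
load_meta = {'source': 'staging.customer', 'target': 'dv.h_customer', 'key': 'customer_id'}

# The load SQL is generated from a template instead of being hand-written
LOAD_TEMPLATE = 'INSERT INTO {target} SELECT DISTINCT {key} FROM {source};'
print(LOAD_TEMPLATE.format(**load_meta))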
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Data DevOps - Agile Data Engine
● Based on our previous experience/projects,
now formalized and bundled as a product
● Enabled by AWS services, difficult to
implement on-premises
● Design once, deploy multiple runtime
environments
● Functionality
○ Data modelling, Load Mapping, Data Vault
Automation
○ Continuous Deployment Management
○ Metadata Driven ELT Execution and Concurrency
Control
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Data modeling and why data vault
● Data vault is a modeling and development method
● Hub = business entity, Satellite = all details, Link = join between
entities (DDL sketched below the diagram)
● Well-defined principles for development, naming conventions, etc.
[Diagram: hubs H_ORDER and H_CUSTOMER, each with a satellite (S_ORDER, S_CUSTOMER) in a 1-to-many relationship, joined through the link L_CUSTOMER_ORDER]
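A sketch of the corresponding structures (column conventions vary by implementation; this is one common style):

VAULT_DDL = """
CREATE TABLE h_customer (
    customer_hk   CHAR(32)    NOT NULL,  -- hash of the business key
    customer_bk   VARCHAR(50) NOT NULL,  -- business key
    load_ts       TIMESTAMP   NOT NULL,
    record_source VARCHAR(50) NOT NULL
);

CREATE TABLE s_customer (
    customer_hk   CHAR(32)    NOT NULL,  -- references h_customer
    load_ts       TIMESTAMP   NOT NULL,  -- one row per detected change
    name          VARCHAR(100),
    record_source VARCHAR(50) NOT NULL
);

CREATE TABLE l_customer_order (
    customer_order_hk CHAR(32)  NOT NULL,  -- hash of the two hub keys
    customer_hk       CHAR(32)  NOT NULL,
    order_hk          CHAR(32)  NOT NULL,
    load_ts           TIMESTAMP NOT NULL,
    record_source     VARCHAR(50) NOT NULL
);
"""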
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Data vault is one of the key enablers for increasing speed
with a schema-on-write approach
● Data model split into pieces allowing loads in multiple steps/parts
● Data loads can be auto-generated
● Many-to-many links allow representing any business situation
● Built-in history of changes through the satellite structure
● A standard development model makes personnel changes easier
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
How to survive the data challenges
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
#saddata
● Data you are forced to collect even though no one wants it as a customer,
no one needs it in your business, and no one can find or utilize it - Jarno
Kartela, AWS Summit Stockholm, 2017
● So basically: consider what data you are collecting; it all adds
maintenance overhead, and you need to keep GDPR in mind
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Handling malicious data
● Typically not considered
● The source could be a 3rd-party service or a system with poor data
validation/handling
● Probably best to create a separate landing account and run security
checks on the data before pushing it forward
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Simple tasks to secure data
● Encrypt
○ S3 buckets
○ RDS & Redshift
○ EBS volumes
● Just block access
○ Network ACL
○ Security groups
○ S3 bucket policies
● Set up notifications on changes
● Prevent opening access
{ "Version": "2008-10-17",
"Statement": [
"Effect": "Deny",
"Action": "*",
"Resource": "arn:aws:s3:::my-bucket/*",
"Condition": {
"StringNotEqualsIfExists": {
"aws:SourceVpc": "vpc-abcdefg"
},
"NotIpAddressIfExists": {
"aws:SourceIp": [
"1.1.1.1/32" ]
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
There is no single data platform to answer all your needs
● How do you remove customer data from Parquet files in S3 (as
required by GDPR)?
● How do you manage access to S3, Redshift, Tableau, etc. in a
centralized manner?
● No centralized metadata management (maybe Glue in the future)
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Credits
Harri Kallio
Tero Honko
Thank you!
Questions?
