Make your data fly - Building data platform in AWS
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
NORDICS
Clarion Hotel Helsinki
March 21, 2018
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Make your data fly -
Building data platform
in AWS
Kimmo Kantojärvi & Roope Parviainen
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Today’s topics
● We are...
● Architectural evolution
● Making Data DevOps work
● How to cope with the data challenges
● Our experiences with a couple of the components/services
and some tips & tricks
● EMR, Redshift, Airflow, visualization tools
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
We are...
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Kimmo (@kimmokantojarvi)
● Coding architect
● 15 years in data business
● AWS Certified Solutions Architect -
Professional
● Ilves fan
Roope
● Data Architect #HandsDirty
● 5 years of professional love for data
● Software Development × DW × data
platforms × IoT
● AWS Certified Solutions Architect -
Professional
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
We are a data and
customer value driven
transformation company
▪ 96 % of our 186 clients recommend us
▪ Over 2 million daily users in maintained services
▪ Extensive partner network in tech and insight
1996
FOUNDED
650
EMPLOYEES
6
CITIES
4
COUNTRIES
76M
TURNOVER 2017
20%
AVG. PROFITABLE GROWTH PER ANNUM
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
We help our
customers
to create new
services by
understanding their
customers and
managing the change.
We build capabilities
and intelligence that
help develop and
create new business
opportunities.
We build and deliver
new business and
services technologies
and infrastructure.
We chase results
and take care of our
customers and their
services.
Offering
Consulting
and service
design
Data,
analytics and
AI
Digital
services
DevOps and
cloud
services
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Architectural evolution
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
It used to be so simple ;)
Source → ETL → DW → BI
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Today the architecture is much more versatile
and enabled by cloud
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
What happened?
From
● On-premise
● Few key technologies
● Closed solutions from big players
● Investments
● Compute & storage combined
● Data pull/batch
● Schema-on-write
● GUI
● Long projects, big lead times
To
● Cloud
● Various specific technologies
● Open source
● Flexible cost structure
● Separation of compute & storage
● Data push/stream
● Schema-on-read
● Code
● Agile methods, need to deliver fast
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Various options to load & process data
● Traditional
○ SQL
○ ETL tools
○ Integration tools
● APIs
● AWS Services
○ Glue
○ EMR
○ Kinesis
○ IoT
○ EC2/Lambda
○ S3
● Processing/streaming engines
○ Spark
○ Flink
○ Storm
○ Presto/Hive
● Custom code
○ R, Python, etc.
○ Machine learning
Make sure your new systems are
built to share data!
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Offloading data processing with EMR (+ Spark)
● Suitable for processing large amounts of data and complex calculations
● Java, Scala, Python
● Combine SQL, Python generators and Spark dataframes - Win-Win! (sketched below)
● Very cost-effective with spot instances
● Some learning curve (understanding configuration, behaviour and
metrics)
● Not all EC2 instance types available
● Ramp-up time ~10min - not ideal for short tasks unless run
continuously
● Testing code locally is challenging (e.g. pytest + a Spark plugin)
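A minimal PySpark sketch of the SQL + dataframe combination (bucket, paths and schema are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('offload-example').getOrCreate()

# Read source data from S3 and expose it to Spark SQL
orders = spark.read.parquet('s3://my-bucket/staging/orders/')
orders.createOrReplaceTempView('orders')

# Replace a hand-written SQL transformation with Spark SQL
daily_orders = spark.sql("""
    SELECT order_date, COUNT(*) AS order_count, SUM(amount) AS total_amount
    FROM orders
    GROUP BY order_date
""")

# Write results back to S3, e.g. for a Redshift COPY
daily_orders.write.mode('overwrite').parquet('s3://my-bucket/publish/daily_orders/')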
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
[Diagram: EMR cluster reading code.zip and job & environment configurations from S3, processing data from S3 and DynamoDB, and copying/unloading results to Redshift via S3]
60 x c3.xlarge process 10B rows in 1 hour ≈ 3.5€
1,000 SQL queries replaced with 1,000 lines of Python & Spark
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
import boto3

# Check recent spot prices before choosing instance types for the EMR cluster
ec2_client = boto3.client('ec2', region_name='eu-west-1')

response = ec2_client.describe_spot_price_history(
    AvailabilityZone='eu-west-1a',
    StartTime='2018-03-01',
    EndTime='2018-03-21',
    InstanceTypes=['c3.xlarge'],
    ProductDescriptions=['Linux/UNIX'],
    MaxResults=100
)
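The response can then be used, for example, to check the recent price range before bidding:

# Each entry in SpotPriceHistory carries the price as a string
prices = [float(p['SpotPrice']) for p in response['SpotPriceHistory']]
print('min %.4f / max %.4f USD per hour' % (min(prices), max(prices)))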
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
So many data storage options nowadays
● File/object storage
○ S3
● Data warehouses
○ Redshift, Snowflake
● Traditional databases
○ RDS (MySQL, Postgres, MariaDB,
MSSQL, Oracle)
● NoSQL databases
○ DynamoDB
○ MongoDB
○ Cassandra
● In-memory databases
○ Exasol
● GPU databases
○ MapD, BrytlytDB
● Time series databases
○ Kdb+, InfluxDB
● Caches
○ Redis, Memcached
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Redshift performance requires planning & design
● Redshift is a cluster and each node holds its
own slice of the data → data distribution affects
query performance and data loading (see the DDL sketch below)
● Optimal to query a few wide tables rather
than join many narrow tables together
○ E.g. data vault modeling is a bit challenging
from a query performance point of view
● Each table requires a minimum amount of storage
per node → more nodes → higher minimum
storage
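For example, rows that are joined or filtered together can be co-located with distribution and sort keys in the table DDL (table and columns are hypothetical):

# Hypothetical Redshift DDL: distribute on the join key, sort on the filter column
ORDERS_DDL = """
CREATE TABLE orders (
    order_id    BIGINT,
    customer_id BIGINT,
    order_date  DATE,
    amount      DECIMAL(12,2)
)
DISTKEY (customer_id)
SORTKEY (order_date);
"""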
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
In addition to data distribution, managing the query queue
(WLM) setup is important
● Max 500 concurrent connections per cluster, but only max 50 query
slots
● Each slot takes its own share of the memory; with 50 slots the memory is
split into 1/50 parts
● Can be used to control long-running (maybe not so smart) queries
made by users
○ E.g. failover after 5 min to a queue with fewer resources, as in the sketch below
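The queue setup lives in the wlm_json_configuration cluster parameter; a sketch of changing it with boto3 (the parameter group name and queue values are hypothetical, and the change typically requires a cluster reboot to take effect):

import json
import boto3

redshift = boto3.client('redshift', region_name='eu-west-1')

# Two queues: most of the memory for ETL, the rest for ad-hoc user queries
wlm_config = [
    {'query_group': ['etl'], 'query_concurrency': 5, 'memory_percent_to_use': 60},
    {'query_concurrency': 10, 'memory_percent_to_use': 40},
]

redshift.modify_cluster_parameter_group(
    ParameterGroupName='my-parameter-group',
    Parameters=[{
        'ParameterName': 'wlm_json_configuration',
        'ParameterValue': json.dumps(wlm_config),
    }]
)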
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
In Spectrum we trust
● Store part of the data in S3 (e.g.
Parquet + Snappy) and access it as an external
table with SQL (DDL sketched below)
● Separate Spectrum compute layer
● Read-only: you still need to process the data
into S3, and Redshift unload supports only
CSV at the moment
● Athena and Spectrum seem to be faster
if you have no joins, just a single table
● VPC support not available yet
https://aws.amazon.com/blogs/big-data/10-best-practices-for-amazon-redshift-spectrum/
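For reference, the external schema and table are declared roughly like this (role ARN, names and location are hypothetical; the DDL is run against the cluster with any SQL client):

SPECTRUM_DDL = """
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG DATABASE 'spectrumdb'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-spectrum-role'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

CREATE EXTERNAL TABLE spectrum.events (
    event_id   BIGINT,
    event_time TIMESTAMP,
    payload    VARCHAR(1024)
)
STORED AS PARQUET
LOCATION 's3://my-bucket/events/';
"""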
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Spectrum related wish list
● VPC support
● Write/delete also to allow schema-on-write
● Redshift unload to parquet/avro
● Some control over compute or control over cost structure
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Redshift still requires some maintenance
● Tasks taken care of by AWS
○ Backups
○ Resizing
○ Node/disk replacement
○ Query caching
● Built-in maintenance processes which the user controls (see the SQL sketch below)
○ Analyze → the query optimizer needs up-to-date table statistics
○ Vacuum → sorts data into the correct order and frees up storage from deleted data
○ Compression → optimize table compression
● https://github.com/awslabs/amazon-redshift-utils
○ Great toolset for maintenance and reviewing system status
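In practice the user-controlled processes are plain SQL commands run periodically, e.g. scheduled from Airflow (table name hypothetical):

MAINTENANCE_SQL = """
ANALYZE orders;      -- refresh statistics for the query optimizer
VACUUM FULL orders;  -- re-sort rows and reclaim space from deleted data
"""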
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Some other tips with Redshift
● With a 3-year full prepayment, break-even comes after 1 year = the commitment
is effectively only 1 year
○ 5.12 TB = 32 x dc2.xlarge = 2 x dc2.8xlarge ≈ $90k/year
○ All-upfront 3-year ≈ $31k/year
● Publish directly from staging and model later → faster visible results for
business users
● A lot of interesting development going on (especially Spectrum)
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Sharing your data
● APIs
● Integration tools
● BI tools
● AWS services
○ QuickSight
○ Athena
○ API Gateway
○ S3
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Visualizing the data
● The first phase in generating value from data is to visualize it
● A general-purpose BI/analytics tool does not (always) cope with e.g.
○ vast amounts of data
○ special visualization needs
● Right tool for the right purpose, “Mix and match”
○ PowerBI/Birst/QuickSight and custom d3.js / a trending tool / Grafana
/ Kibana
○ Multiple data sources
■ Virtualization of data sources
■ Data catalogs and understandability
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Visualizing the data
VS.
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Right tool for the right purpose
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Fast and slow data - same but different
● Platforms have to be able to ingest both slow and fast data
○ Batches are simply not enough
○ Data streams & event-driven data loads (see the Kinesis sketch below)
● Different endpoints / integrations (SFTP, HTTP REST, MQTT, data
dumps)
● Different data pipelines and databases
○ Even for the same data, based on usage needs
○ Orchestration of the whole becomes difficult
○ Parallelism when loading
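On the fast side, events are typically pushed into a stream; a minimal Kinesis sketch (stream name and payload are hypothetical):

import json
import boto3

kinesis = boto3.client('kinesis', region_name='eu-west-1')

# Push a single event into the stream; consumers (e.g. Lambda) load it onwards
kinesis.put_record(
    StreamName='my-event-stream',
    Data=json.dumps({'device_id': 'abc-123', 'temperature': 21.5}),
    PartitionKey='abc-123'
)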
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Managing the data flow
● Open source
○ Airflow
○ Oozie
○ Luigi
○ Jenkins
● Traditional ETL & Integration tools
● AWS services
○ Batch
○ Step Functions
● Custom code
○ Lambda
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Airflow
● Visualization and management of the whole data load
○ SQL
○ Command line
○ Python/Java/etc.
● Suitable for batch loading
● Loads can be generated programmatically based on metadata (sketched below)
● Parallel/multiple loads, managing parallelism
● Load history
● Logs available directly
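A minimal sketch of a metadata-generated DAG (table list and load command are hypothetical):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG('daily_load', start_date=datetime(2018, 3, 1), schedule_interval='@daily')

# Generate one load task per table from metadata; Airflow manages the parallelism
for table in ['customer', 'order', 'product']:
    BashOperator(
        task_id='load_%s' % table,
        bash_command='python load.py --table %s' % table,
        dag=dag,
    )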
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Airflow
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Airflow
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Airflow
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Things to consider
● Batch vs. streaming needs to be handled separately
● Airflow has some flaws
○ GUI is not always up-to-date
○ Scanning DAG statuses takes time
● If you have a lot of custom-code Lambdas running at different times,
how do you manage parallelism and how do you monitor them?
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Making Data DevOps work
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Data DevOps
● The target is to achieve deployment processes similar to software projects
● This was not even possible earlier because of poor support in traditional
tools
● To be effective and scalable, it should be metadata-driven
○ Code generated based on metadata, as in the sketch below
● Need to focus on following good coding practices
● Version management for everything
○ Infrastructure as code
○ Recursive schema changes
○ Data load changes
○ Report changes?
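As a simple illustration of metadata-driven generation (the metadata and SQL template are hypothetical):

# Metadata describing one staging-to-vault load
load_meta = {'source': 'staging.customer', 'target': 'dv.h_customer', 'key': 'customer_id'}

# The load SQL is generated from a template instead of being hand-written
LOAD_TEMPLATE = 'INSERT INTO {target} SELECT DISTINCT {key} FROM {source};'
print(LOAD_TEMPLATE.format(**load_meta))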
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Data DevOps - Agile Data Engine
● Based on our previous experience/projects,
now formalized and bundled as a product
● Enabled by AWS services, difficult to
implement on-premises
● Design once, deploy multiple runtime
environments
● Functionality
○ Data modelling, Load Mapping, Data Vault
Automation
○ Continuous Deployment Management
○ Metadata Driven ELT Execution and Concurrency
Control
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Data modeling and why data vault
● Data vault is a modeling and development method
● Hub = business entity, Satellite = all details, Link = join between
entities (DDL sketched below the diagram)
● Well-defined principles for development, naming conventions, etc.
[Diagram: hubs H_ORDER and H_CUSTOMER, each with a satellite (S_ORDER, S_CUSTOMER) in a 1-to-many relationship, joined through the link L_CUSTOMER_ORDER]
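A sketch of the corresponding structures (column conventions vary by implementation; this is one common style):

VAULT_DDL = """
CREATE TABLE h_customer (
    customer_hk   CHAR(32)    NOT NULL,  -- hash of the business key
    customer_bk   VARCHAR(50) NOT NULL,  -- business key
    load_ts       TIMESTAMP   NOT NULL,
    record_source VARCHAR(50) NOT NULL
);

CREATE TABLE s_customer (
    customer_hk   CHAR(32)    NOT NULL,  -- references h_customer
    load_ts       TIMESTAMP   NOT NULL,  -- one row per detected change
    name          VARCHAR(100),
    record_source VARCHAR(50) NOT NULL
);

CREATE TABLE l_customer_order (
    customer_order_hk CHAR(32)  NOT NULL,  -- hash of the two hub keys
    customer_hk       CHAR(32)  NOT NULL,
    order_hk          CHAR(32)  NOT NULL,
    load_ts           TIMESTAMP NOT NULL,
    record_source     VARCHAR(50) NOT NULL
);
"""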
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Data vault is one of the key enablers for increasing speed
with a schema-on-write approach
● Data model split into pieces allowing loads in multiple steps/parts
● Data loads can be auto-generated
● Many-to-many links allow representing any business situation
● Built-in history of changes through the satellite structure
● A standard development model makes personnel changes easier
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
How to survive the data challenges
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
#saddata
● Data you are forced to collect even though no one wants it as a customer,
no one needs it in your business, and no one can find or utilize it - Jarno
Kartela, AWS Summit Stockholm, 2017
● So basically: consider what data you are collecting; it all adds
maintenance overhead, and you need to keep GDPR in mind
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Handling malicious data
● Typically not considered
● The source could be a 3rd-party service or a system with poor data
validation/handling
● Probably best to create a separate landing account and run security
checks on the data before pushing it forward
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Simple tasks to secure data
● Encrypt
○ S3 buckets
○ RDS & Redshift
○ EBS volumes
● Just block access
○ Network ACL
○ Security groups
○ S3 bucket policies
● Set up notifications on changes
● Prevent opening access
{ "Version": "2008-10-17",
"Statement": [
"Effect": "Deny",
"Action": "*",
"Resource": "arn:aws:s3:::my-bucket/*",
"Condition": {
"StringNotEqualsIfExists": {
"aws:SourceVpc": "vpc-abcdefg"
},
"NotIpAddressIfExists": {
"aws:SourceIp": [
"1.1.1.1/32" ]
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
There is no single data platform to answer all your needs
● How do you remove customer data from Parquet files in S3 (as
required by GDPR)?
● How do you manage access to S3, Redshift, Tableau, etc. in a
centralized manner?
● No centralized metadata management (maybe Glue in the future)
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Credits
Harri Kallio
Tero Honko
Thank you!
Questions?
