The Future of Data Engineering
Chris Riccomini / WePay / @criccomini / QCon SF / 2019-11-12
InfoQ.com: News & Community Site
• Over 1,000,000 software developers, architects and CTOs read the site worldwide every month
• 250,000 senior developers subscribe to our weekly newsletter
• Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese)
• Post content from our QCon conferences
• 2 dedicated podcast channels: The InfoQ Podcast, with a focus on Architecture, and The Engineering Culture Podcast, with a focus on building
• 96 deep dives on innovative topics packed as downloadable emags and minibooks
• Over 40 new content items per week
Watch the video with slide synchronization on InfoQ.com!
https://www.infoq.com/presentations/data-engineering-pipelines-warehouses/
Purpose of QCon
- to empower software development by facilitating the spread of knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Presented at QCon San Francisco
www.qconsf.com
This talk
• Context
• Stages
• Architecture
Context
Me
• WePay, LinkedIn, PayPal
• Data infrastructure, data engineering, service infrastructure, data science
• Kafka, Airflow, BigQuery, Samza, Hadoop, Azkaban, Teradata
Data engineering?
A data engineer’s job is to help an organization move and process data.
“…data engineers build tools, infrastructure, frameworks, and services.”
-- Maxime Beauchemin, The Rise of the Data Engineer
Why?
Six stages of data pipeline maturity
• Stage 0: None
• Stage 1: Batch
• Stage 2: Realtime
• Stage 3: Integration
• Stage 4: Automation
• Stage 5: Decentralization
You might be ready for a data warehouse if…
• You have no data warehouse
• You have a monolithic architecture
• You need a data warehouse up and running yesterday
• Data engineering isn’t your full time job
Stage 0: None
[Diagram: monolith writing to a single DB]
WePay circa 2014
[Diagram: PHP monolith backed by MySQL]
Problems
• Queries began timing out
• Users were impacting each other
• MySQL was missing complex analytical SQL functions
• Report generation was breaking
You might be ready for batch if…
• You have a monolithic architecture
• Data engineering is your part-time job
• Queries are timing out
• Exceeding DB capacity
• Need complex analytical SQL functions
• Need reports, charts, and business intelligence
Stage 1: Batch
[Diagram: monolith → DB → scheduler → DWH]
WePay circa 2016
[Diagram: PHP monolith → MySQL → Airflow → BigQuery]
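A stage-1 pipeline like this is usually just a scheduled dump-and-load per table. Below is a minimal Airflow sketch of that pattern (Python, Airflow 2.x import paths); the dump_table and load_table helpers and the table list are hypothetical placeholders, not WePay's actual jobs.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

TABLES = ["payments", "accounts"]  # hypothetical table list


def dump_table(table_name, **context):
    # Placeholder: SELECT from the MySQL replica and stage the rows somewhere loadable.
    ...


def load_table(table_name, **context):
    # Placeholder: load the staged rows into the corresponding BigQuery table.
    ...


with DAG(
    dag_id="mysql_to_bigquery_batch",
    start_date=datetime(2016, 1, 1),
    schedule_interval=timedelta(hours=1),  # periodic loads against the replica
    catchup=False,
) as dag:
    for table in TABLES:
        dump = PythonOperator(
            task_id=f"dump_{table}",
            python_callable=dump_table,
            op_kwargs={"table_name": table},
        )
        load = PythonOperator(
            task_id=f"load_{table}",
            python_callable=load_table,
            op_kwargs={"table_name": table},
        )
        dump >> load  # one dump → load pair per table, hence the "large number of jobs" below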
Problems
• Large number of Airflow jobs for loading all tables
• Missing and inaccurate create_time and modify_time
• DBA operations impacting pipeline
• Hard deletes weren’t propagating
• MySQL replication latency was causing data quality issues
• Periodic loads caused occasional MySQL timeouts
You might be ready for realtime if…
• Loads are taking too long
• Pipeline is no longer stable
• Many complicated workflows
• Data latency is becoming an issue
• Data engineering is your full-time job
• You already have Apache Kafka in your organization
Stage 2: Realtime
[Diagram: monolith → DB → streaming platform → DWH]
WePay circa 2017
[Diagram: PHP monolith on MySQL, plus service MySQL databases, each tailed by Debezium → Kafka → KCBQ → BigQuery]
Change data capture?
“…an approach to data integration that is based on the identification, capture and delivery of the changes made to enterprise data sources.”
https://en.wikipedia.org/wiki/Change_data_capture
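Concretely, CDC turns each row-level change into an event. The sketch below shows the general shape of one such event, simplified from the Debezium-style before/after envelope; the field values are illustrative.

# Illustrative shape of a single change event for one updated row (Python dict).
change_event = {
    "source": {"db": "payments", "table": "accounts"},  # where the change happened
    "op": "u",                                           # c = insert, u = update, d = delete
    "before": {"id": 42, "status": "pending"},           # row image before the change
    "after": {"id": 42, "status": "active"},             # row image after the change
    "ts_ms": 1573516800000,                              # when the change was captured
}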
Debezium sources
• MongoDB
• MySQL
• PostgreSQL
• SQL Server
• Oracle (Incubating)
• Cassandra (Incubating)
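Debezium connectors run inside Kafka Connect and are registered through Connect's REST API. A minimal sketch, assuming a Connect worker on localhost:8083; the connection details are placeholders, and some option names (e.g. database.server.name) vary across Debezium versions.

import requests

# Register a Debezium MySQL source connector with a Kafka Connect worker.
connector = {
    "name": "payments-db-source",  # hypothetical connector name
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql-replica.example.internal",  # placeholder host
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "********",
        "database.server.id": "184054",             # unique replication client id
        "database.server.name": "payments",         # topic prefix (renamed in newer versions)
        "table.include.list": "payments.accounts",  # which tables to capture
    },
}

resp = requests.post("http://localhost:8083/connectors", json=connector)
resp.raise_for_status()
print(resp.json())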
Kafka Connect BigQuery
• Open source connector that WePay wrote
• Stream data from Apache Kafka to Google BigQuery
• Supports GCS loads
• Supports realtime streaming inserts
• Automatic table schema updates
Problems
• Pipeline for Datastore was still on Airflow
• No pipeline at all for Cassandra or Bigtable
• BigQuery needed logging data
• Elasticsearch needed data
• Graph DB needed data
https://www.confluent.io/blog/building-real-time-streaming-etl-pipeline-20-minutes/
You might be ready for integration if…
• You have microservices
• You have a diverse database ecosystem
• You have many specialized derived data systems
• You have a team of data engineers
• You have a mature SRE organization
Stage 3: Integration
[Diagram: services backed by DB, NoSQL, and NewSQL stores → streaming platform → DWH, graph DB, and search]
WePay circa 2019
[Diagram: Kafka at the center — the PHP monolith (MySQL) and Cassandra/MySQL-backed services feed it via Debezium, Waltz-based services feed it directly, and connectors (KCBQ, KCW) fan data out to BigQuery and a graph DB]
Metcalfe’s law
(The number of possible connections grows with the square of the number of systems attached to the pipeline.)
Problems
• Add new channel to replica MySQL DB
• Create and configure Kafka topics
• Add new Debezium connector to Kafka Connect
• Create destination dataset in BigQuery
• Add new KCBQ connector to Kafka Connect
• Create BigQuery views
• Configure data quality checks for new tables
• Grant access to BigQuery dataset
• Deploy stream processors or workflows
You might be ready for automation if…
• Your SREs can’t keep up
• You’re spending a lot of time on manual toil
• You don’t have time for the fun stuff
Stage 4: Automation
[Diagram: the realtime data integration architecture (services on DB, NoSQL, and NewSQL → streaming platform → DWH, graph DB, search) with two new layers: Automated Operations (orchestration, monitoring, configuration, …) and Automated Data Management (data catalog, RBAC/IAM/ACL, DLP, …)]
Automated Operations
“If a human operator needs to touch your system
during normal operations, you have a bug.”
-- Carla Geisser, Google SRE
Normal operations?
• Add new channel to replica MySQL DB
• Create and configure Kafka topics
• Add new Debezium connector to Kafka Connect
• Create destination dataset in BigQuery
• Add new KCBQ connector to Kafka Connect
• Create BigQuery views
• Configure data quality checks for new tables
• Grant access
• Deploy stream processors or workflows
Automated operations
• Terraform
• Ansible
• Helm
• Salt
• CloudFormation
• Chef
• Puppet
• Spinnaker
Terraform
provider "kafka" {
  bootstrap_servers = ["localhost:9092"]
}

resource "kafka_topic" "logs" {
  name               = "systemd_logs"
  replication_factor = 2
  partitions         = 100

  config = {
    "segment.ms"     = "20000"
    "cleanup.policy" = "compact"
  }
}
Terraform
provider "kafka-connect" {
  url = "http://localhost:8083"
}

resource "kafka-connect_connector" "sqlite-sink" {
  name = "test-sink"

  config = {
    "name"            = "test-sink"
    "connector.class" = "io.confluent.connect.jdbc.JdbcSinkConnector"
    "tasks.max"       = "1"
    "topics"          = "orders"
    "connection.url"  = "jdbc:sqlite:test.db"
    "auto.create"     = "true"
  }
}
But we were doing this… why so much toil?
• We had Terraform and Ansible
• We were on the cloud
• We had BigQuery scripts and tooling
Spending time on data management
• Who gets access to this data?
• How long can this data be persisted?
• Is this data allowed in this system?
• Which geographies must data be persisted in?
• Should columns be masked?
Regulation is coming… here
GDPR, CCPA, PCI, HIPAA, SOX, SHIELD, …
(Photo by Darren Halstead)
Automated Data Management
Set up a data catalog (a minimal entry is sketched after this list)
• Location
• Schema
• Ownership
• Lineage
• Encryption
• Versioning
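A catalog entry per dataset can be as simple as a structured record covering the fields above. A minimal, purely hypothetical sketch — the field names and values are illustrative, not any particular catalog product's schema.

# Hypothetical catalog entry for one dataset.
catalog_entry = {
    "dataset": "payments.accounts",
    "location": "bigquery://analytics-project/payments/accounts",  # placeholder URI
    "schema": [
        {"name": "id", "type": "INT64"},
        {"name": "status", "type": "STRING"},
        {"name": "email", "type": "STRING", "pii": True},  # flag sensitive columns
    ],
    "ownership": {"team": "payments", "contact": "payments-eng@example.com"},
    "lineage": ["mysql://payments-db/accounts", "kafka://payments.accounts"],
    "encryption": {"at_rest": "CMEK", "masked_columns": ["email"]},
    "versioning": {"schema_version": 7, "updated_at": "2019-11-12T00:00:00Z"},
}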
Configure your access policies
• RBAC: role-based access control
• IAM: identity and access management
• ACLs: access control lists
Kafka ACLs with Terraform
provider "kafka" {
  bootstrap_servers = ["localhost:9092"]
  ca_cert           = file("../secrets/snakeoil-ca-1.crt")
  client_cert       = file("../secrets/kafkacat-ca1-signed.pem")
  client_key        = file("../secrets/kafkacat-raw-private-key.pem")
  skip_tls_verify   = true
}

resource "kafka_acl" "test" {
  resource_name       = "syslog"
  resource_type       = "Topic"
  acl_principal       = "User:Alice"
  acl_host            = "*"
  acl_operation       = "Write"
  acl_permission_type = "Deny"
}
Automate management (one possible revocation sweep is sketched after this list)
• New user access
• New data access
• Service account access
• Temporary access
• Unused access
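One way to automate the last two items is a periodic sweep that expires temporary grants and revokes long-unused ones. This is a sketch only; list_grants, last_used, and revoke are hypothetical adapters over whatever access system you use (BigQuery dataset ACLs, Kafka ACLs, an internal IAM service).

from datetime import datetime, timedelta, timezone

MAX_IDLE = timedelta(days=90)  # example policy: revoke grants unused for 90 days


def sweep_grants(list_grants, last_used, revoke):
    # Expire temporary grants and revoke grants that have gone unused.
    now = datetime.now(timezone.utc)
    for grant in list_grants():
        if grant.get("expires_at") and grant["expires_at"] < now:
            revoke(grant, reason="temporary grant expired")
        elif now - last_used(grant) > MAX_IDLE:
            revoke(grant, reason="unused access")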
Detect violations
• Auditing
• Data loss prevention
Detecting sensitive data
Request:
{
  "item": {
    "value": "My phone number is (415) 555-0890"
  },
  "inspectConfig": {
    "includeQuote": true,
    "minLikelihood": "POSSIBLE",
    "infoTypes": [
      { "name": "PHONE_NUMBER" }
    ]
  }
}

Response:
{
  "result": {
    "findings": [
      {
        "quote": "(415) 555-0890",
        "infoType": { "name": "PHONE_NUMBER" },
        "likelihood": "VERY_LIKELY",
        "location": {
          "byteRange": { "start": "19", "end": "33" }
        }
      }
    ]
  }
}
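The request above is Cloud DLP's content:inspect method. A minimal sketch of calling it over REST; PROJECT_ID and TOKEN are placeholders for your GCP project and an OAuth 2.0 access token.

import requests

PROJECT_ID = "my-project"  # placeholder GCP project
TOKEN = "..."              # placeholder OAuth 2.0 access token

inspect_request = {
    "item": {"value": "My phone number is (415) 555-0890"},
    "inspectConfig": {
        "includeQuote": True,
        "minLikelihood": "POSSIBLE",
        "infoTypes": [{"name": "PHONE_NUMBER"}],
    },
}

resp = requests.post(
    f"https://dlp.googleapis.com/v2/projects/{PROJECT_ID}/content:inspect",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=inspect_request,
)
resp.raise_for_status()
for finding in resp.json().get("result", {}).get("findings", []):
    print(finding["infoType"]["name"], finding["likelihood"])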
Progress
• Users can find the data that they need
• Automated data management and operations
Problems
• Data engineering still manages configuration and deployment
You might be ready for decentralization if…
• You have a fully automated realtime data pipeline
• People still come to you to get data loaded
If we have an automated data pipeline and data warehouse,
do we need a single team to manage this?
Stage 5: Decentralization
[Diagram: the same automated realtime integration architecture (automated operations and automated data management layers), but with multiple DWHs owned by different teams instead of a single warehouse]
From monolith to microservices, from data warehouse to microwarehouses
Partial decentralization
• Raw tools are exposed to other engineering teams
• Requires Git, YAML, JSON, pull requests, terraform commands, etc.
Full decentralization
• Polished tools are exposed to everyone
• Security and compliance manage access and policy
• Data engineering manages data tooling and infrastructure
• Everyone manages data pipelines and data warehouses
Modern Data Pipeline
[Diagram: fully decentralized realtime data integration — services on DB, NoSQL, and NewSQL stores, a streaming platform, graph DB, search, and multiple DWHs, wrapped by automated operations (orchestration, monitoring, configuration, …) and automated data management (data catalog, RBAC/IAM/ACL, DLP, …)]
Thanks!
(..and we’re hiring)
🙏