Data Engineering
Brian Njuguna
Agenda
▪ What is Data Engineering
▪ Data Engineering vs. Data Science
▪ Key Components of Data Engineering
▪ Key Skills for Data Engineers
▪ Case Study
▪ Bonus
▪ Learning Resources
Data Engineering
▪ Data Engineering focuses on designing, building, and managing data
pipelines.
▪ It's crucial for collecting, transforming, and storing large datasets
used for analysis.
▪ It plays a key role in enabling data-driven decision-making and in
powering technologies such as AI and machine learning.
Data Engineering vs. Data Science
▪ Data Engineers: Build and maintain the infrastructure for data.
▪ Data Scientists: Analyze data to extract insights and build models.
▪ Collaboration between both roles is critical to delivering value from
data.
Data Engineering vs. Other Domains
Key Components of Data Engineering
▪ Data Ingestion (Pipelines)
▪ Data Processing (Transforming, cleaning, and aggregating)
▪ Data Modeling (Structuring data for analysis, reporting, and decision-making)
▪ Data Storage (Database, Data Lake, Data Warehouse)
▪ Data Quality (Accurate, consistent, and complete)
▪ Data Catalog (Encyclopedia for your data platform)
▪ Access Management (Protect sensitive information)
▪ Data Observability and Orchestration (Detect and resolve pipeline issues; schedule and coordinate jobs)
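To make the Data Quality component concrete, here is a minimal sketch of two checks, completeness (no missing required fields) and consistency (no duplicate keys), on a small batch of records. The function name `quality_report`, the field names, and the sample orders are all hypothetical and chosen only for illustration; real platforms typically use a dedicated data quality tool rather than hand-written checks.

```python
from typing import Any


def quality_report(
    rows: list[dict[str, Any]], required: list[str], key: str
) -> dict[str, int]:
    """Count basic quality problems in a batch of records (illustrative only)."""
    # Completeness: a row fails if any required field is missing or None.
    missing = sum(1 for r in rows if any(r.get(col) is None for col in required))

    # Consistency: count rows whose key value was already seen in this batch.
    seen = set()
    duplicates = 0
    for r in rows:
        value = r.get(key)
        if value in seen:
            duplicates += 1
        seen.add(value)

    return {
        "rows": len(rows),
        "missing_required": missing,
        "duplicate_keys": duplicates,
    }


# Made-up sample data: one missing amount and one repeated order_id.
orders = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 2, "amount": None},
    {"order_id": 2, "amount": 5.0},
]

report = quality_report(orders, required=["order_id", "amount"], key="order_id")
print(report)
```

A pipeline could run a check like this after ingestion and stop, or raise an alert, when either count is above zero.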
Key Skills for Data Engineers
▪ Programming: Python, Java, Scala.
▪ Database Management: SQL, NoSQL (MongoDB, Cassandra).
▪ Big Data Tools: Hadoop, Spark.
▪ Cloud Platforms: AWS, Azure, GCP.
▪ Data Pipeline Tools: Apache Airflow, Kafka, dbt.
ETL Pipeline
ELT Pipeline
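The ETL and ELT pipelines named above differ mainly in where the transform step runs. The sketch below contrasts the two on the same tiny input, using an in-memory SQLite database as a stand-in for a warehouse. The table names, the sample rows, and the helper functions are assumptions made up for this example, and the "is this a number" filter is deliberately simplistic.

```python
import sqlite3

# Made-up raw input: amounts arrive as text, and one value is unusable.
raw = [("alice", "10.50"), ("bob", "n/a"), ("carol", "7.25")]


def is_number(text: str) -> bool:
    try:
        float(text)
        return True
    except ValueError:
        return False


def run_etl(conn: sqlite3.Connection, rows: list[tuple[str, str]]) -> None:
    # ETL: transform in application code first, then load only the cleaned rows.
    cleaned = [(name, float(amount)) for name, amount in rows if is_number(amount)]
    conn.execute("CREATE TABLE sales_etl (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales_etl VALUES (?, ?)", cleaned)


def run_elt(conn: sqlite3.Connection, rows: list[tuple[str, str]]) -> None:
    # ELT: load the raw data unchanged, then transform inside the database with SQL.
    conn.execute("CREATE TABLE sales_raw (name TEXT, amount TEXT)")
    conn.executemany("INSERT INTO sales_raw VALUES (?, ?)", rows)
    conn.execute(
        "CREATE TABLE sales_elt AS "
        "SELECT name, CAST(amount AS REAL) AS amount "
        "FROM sales_raw WHERE amount GLOB '[0-9]*'"
    )


conn = sqlite3.connect(":memory:")
run_etl(conn, raw)
run_elt(conn, raw)

etl_rows = conn.execute("SELECT name, amount FROM sales_etl ORDER BY name").fetchall()
elt_rows = conn.execute("SELECT name, amount FROM sales_elt ORDER BY name").fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM sales_raw").fetchone()[0]

print(etl_rows)
print(elt_rows)
print(raw_count)
```

Both approaches end with the same cleaned table here. The practical difference is that ELT keeps the raw rows in the database, so the transform can be changed and re-run later without re-extracting from the source.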
Data Streaming and Batch Process
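As a rough illustration of the batch and streaming styles named in this slide title, the sketch below computes the same total both ways. In the batch version the full dataset is available before processing starts; in the streaming version each event is handled as it arrives and an updated result is emitted every time. The event values and function names are hypothetical, and a plain Python generator stands in for a real streaming system such as Kafka.

```python
from typing import Iterable, Iterator

# Made-up event values, e.g. order amounts.
events = [3, 5, 2, 8]


def batch_total(values: list[int]) -> int:
    # Batch: process the complete dataset in one run.
    return sum(values)


def streaming_totals(values: Iterable[int]) -> Iterator[int]:
    # Streaming: keep running state and emit a new result per event.
    total = 0
    for value in values:
        total += value
        yield total


batch_result = batch_total(events)
stream_results = list(streaming_totals(iter(events)))

print(batch_result)
print(stream_results)
```

The last streamed value equals the batch result; the streaming version simply makes intermediate answers available earlier, at the cost of managing state between events.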
Data Pipeline Architecture Best Practices
▪ Map and understand the dependencies.
▪ Design your data pipeline so it is modular and automated.
▪ Create data pipeline SLAs (service level agreements).
▪ Let the data drive the data pipeline architecture.
▪ Create data products.
▪ Continuously review and optimize costs.
▪ Make pipelines idempotent.
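The last practice, idempotency, means that running the same load twice leaves the target in the same state as running it once, so retries after a failure are safe. One common way to get this is to write by key and overwrite rather than append. The sketch below assumes a hypothetical `daily_totals` table in an in-memory SQLite database and made-up sample rows; it is one possible approach, not the only one.

```python
import sqlite3


def load_daily_totals(conn: sqlite3.Connection, rows: list[tuple[str, float]]) -> None:
    """Load (day, total) rows so that re-running the same load changes nothing."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_totals (day TEXT PRIMARY KEY, total REAL)"
    )
    # INSERT OR REPLACE overwrites the row with the same primary key
    # instead of appending a duplicate. The `with` block commits on success
    # and rolls back if an error occurs.
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO daily_totals (day, total) VALUES (?, ?)", rows
        )


def snapshot(conn: sqlite3.Connection) -> list[tuple[str, float]]:
    return conn.execute("SELECT day, total FROM daily_totals ORDER BY day").fetchall()


conn = sqlite3.connect(":memory:")
batch = [("2024-01-01", 120.0), ("2024-01-02", 95.5)]  # made-up sample data

load_daily_totals(conn, batch)
first = snapshot(conn)

load_daily_totals(conn, batch)  # simulate a retry of the same run
second = snapshot(conn)

print(first == second)
print(len(second))
```

A plain `INSERT` into a table without a key would have doubled the rows on the retry, which is the failure mode this practice is meant to prevent.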
Storytelling with Data
▪ Understand the Context
▪ Choose an Appropriate Visual
▪ Eliminate Clutter
▪ Focus Attention
▪ Tell a Story
▪ Use Accessible and Intuitive Labels
▪ Iterate and Seek Feedback
▪ Balance Data and Design
Learning Resources
▪ “Data Modeling Made Simple” by Steve Hoberman
▪ “Designing Data-Intensive Applications” by Martin Kleppmann
▪ “The Data Warehouse Toolkit” by Ralph Kimball and Margy Ross
▪ “Clean Code” by Robert C. Martin
▪ “Principles of Distributed Database Systems” by M. Tamer Özsu
and Patrick Valduriez
▪ “Storytelling with Data” by Cole Nussbaumer Knaflic
Thank You