Total Hours – 45
15-day course with lab sessions
Outline: Data Engineering on Google Cloud Platform (DEGCP)
Module 01 - Introduction to Data Engineering
Topics:
• Explore the role of a data engineer
• Analyze data engineering challenges
• Introduction to BigQuery
• Data lakes and data warehouses
• Transactional databases versus data warehouses
• Partner effectively with other data teams
• Manage data access and governance
• Build production-ready pipelines
• Review Google Cloud customer case study
Objectives:
• Understand the role of a data engineer
• Discuss benefits of doing data engineering in the cloud
• Discuss challenges of data engineering practice and how building data pipelines in the
cloud helps to address these
• Review and understand the purpose of a data lake versus a data warehouse, and when
to use which
Activities:
• Lab: Using BigQuery to do Analysis
Module 02 - Building a Data Lake
Topics:
• Introduction to data lakes
• Data storage and ETL options on Google Cloud
• Building a data lake using Cloud Storage
• Securing Cloud Storage
• Storing all sorts of data types
• Cloud SQL as a relational data lake
Objectives:
• Understand why Cloud Storage is a great option for building a data lake on Google Cloud
• Learn how to use Cloud SQL for a relational data lake
Activities:
• Lab: Loading Taxi Data into Cloud SQL
Module 03 - Building a Data Warehouse
Topics:
• The modern data warehouse
• Introduction to BigQuery
• Getting started with BigQuery
• Loading data
• Exploring schemas
• Schema design
• Nested and repeated fields
• Optimizing with partitioning and clustering
Objectives:
• Discuss requirements of a modern data warehouse
• Understand why BigQuery is the scalable data warehousing solution on Google Cloud
• Understand core concepts of BigQuery and review options of loading data into BigQuery
Activities:
• Lab: Loading Data into BigQuery
• Lab: Working with JSON and Array Data in BigQuery
Module 04 - Introduction to Building Batch Data Pipelines
Topics:
• EL, ELT, ETL
• Quality considerations
• How to carry out operations in BigQuery
• Shortcomings
• ETL to solve data quality issues
Objectives:
• Review different methods of loading data into your data lakes and warehouses: EL, ELT,
and ETL
• Discuss data quality considerations and when to use ETL instead of EL and ELT
Module 05 - Executing Spark on Dataproc
Topics:
• The Hadoop ecosystem
• Run Hadoop on Dataproc
• Cloud Storage instead of HDFS
• Optimize Dataproc
Objectives:
• Review the parts of the Hadoop ecosystem
• Learn how to lift and shift your existing Hadoop workloads to the cloud using Dataproc
• Understand considerations around using Cloud Storage instead of HDFS for storage
• Learn how to optimize Dataproc jobs
Activities:
• Lab: Running Apache Spark jobs on Dataproc
Module 06 - Serverless Data Processing with Dataflow
Topics:
• Introduction to Dataflow
• Why customers value Dataflow
• Dataflow pipelines
• Aggregating with GroupByKey and Combine
• Side inputs and windows
• Dataflow templates
• Dataflow SQL
Objectives:
• Understand how to decide between Dataflow and Dataproc for processing data pipelines
• Understand the features that customers value in Dataflow
• Discuss core concepts in Dataflow
• Review the use of Dataflow templates and SQL
Activities:
• Lab: A Simple Dataflow Pipeline (Python/Java)
• Lab: MapReduce in Dataflow (Python/Java)
• Lab: Side inputs (Python/Java)
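Example: the "Aggregating with GroupByKey and Combine" topic can be previewed with a plain-Python sketch. It does not use the Apache Beam SDK. The helper names group_by_key and combine_per_key and the sample data are hypothetical, and the code only mimics the semantics of grouping values by key and then reducing each group to a single result.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Collect all values that share a key, similar in spirit to a GroupByKey step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def combine_per_key(pairs, combine_fn):
    """Reduce each key's values to one result, similar in spirit to a per-key Combine."""
    return {key: combine_fn(values) for key, values in group_by_key(pairs).items()}


# Hypothetical sample data: (pickup zone, fare) pairs.
rides = [("north", 12.0), ("south", 7.5), ("north", 8.0)]

grouped = group_by_key(rides)
totals = combine_per_key(rides, sum)

print(grouped)
print(totals)
```

In a distributed runner, a combine step can pre-aggregate on each worker before shuffling, which is one reason combining is often preferred over grouping followed by a manual reduction.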
Module 07 - Manage Data Pipelines with Cloud Data Fusion and Cloud Composer
Topics:
• Building batch data pipelines visually with Cloud Data Fusion
• Components
• UI overview
• Building a pipeline
• Exploring data using Wrangler
• Orchestrating work between Google Cloud services with Cloud Composer
• Apache Airflow environment
• DAGs and operators
• Workflow scheduling
• Monitoring and logging
Objectives:
• Discuss how to manage your data pipelines with Data Fusion and Cloud Composer
• Understand Data Fusion’s visual design capabilities
• Learn how Cloud Composer can help to orchestrate the work across multiple Google
Cloud services
Activities:
• Lab: Building and Executing a Pipeline Graph in Data Fusion
• Optional Lab: An introduction to Cloud Composer
Module 08 - Introduction to Processing Streaming Data
Topics:
• Processing streaming data
Objectives:
• Explain streaming data processing
• Describe the challenges with streaming data
• Identify the Google Cloud products and tools that can help address streaming data
challenges
Module 09 - Serverless Messaging with Pub/Sub
Topics:
• Introduction to Pub/Sub
• Pub/Sub push versus pull
• Publishing with Pub/Sub code
Objectives:
• Describe the Pub/Sub service
• Understand how Pub/Sub works
• Gain hands-on Pub/Sub experience with a lab that simulates real-time streaming sensor
data
Activities:
• Lab: Publish Streaming Data into Pub/Sub
Module 10 - Dataflow Streaming Features
Topics:
• Streaming data challenges
• Dataflow windowing
Objectives:
• Understand the Dataflow service
• Build a stream processing pipeline for live traffic data
• Demonstrate how to handle late data using watermarks, triggers, and accumulation
Activities:
• Lab: Streaming Data Pipelines
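Example: the windowing and late-data ideas in this module can be sketched in plain Python. This is a simplified illustration, not the Dataflow or Apache Beam API. The constants WINDOW_SIZE and ALLOWED_LATENESS, the process function, and the sample events are all assumptions, and triggers and accumulation modes are left out.

```python
from collections import defaultdict

WINDOW_SIZE = 60       # seconds per fixed window (assumed value)
ALLOWED_LATENESS = 30  # seconds a window accepts data past its end (assumed value)


def window_start(event_time):
    """Return the start of the fixed window that contains event_time."""
    return event_time - (event_time % WINDOW_SIZE)


def process(events):
    """Count events per fixed window and drop events that arrive too late.

    events: (event_time, watermark) tuples in arrival order, where watermark
    is the estimated event-time progress when the event arrives.
    """
    counts = defaultdict(int)
    dropped = []
    for event_time, watermark in events:
        start = window_start(event_time)
        window_end = start + WINDOW_SIZE
        if watermark > window_end + ALLOWED_LATENESS:
            # The watermark passed the window end plus the allowed lateness.
            dropped.append(event_time)
        else:
            counts[start] += 1
    return dict(counts), dropped


# Hypothetical events: the third is late but accepted, the fourth is too late.
events = [(5, 5), (62, 62), (50, 70), (10, 100)]
counts, dropped = process(events)

print(counts)
print(dropped)
```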
Module 11 - High-Throughput BigQuery and Bigtable Streaming Features
Topics:
• Streaming into BigQuery and visualizing results
• High-throughput streaming with Cloud Bigtable
• Optimizing Cloud Bigtable performance
Objectives:
• Learn how to perform ad hoc analysis on streaming data using BigQuery and dashboards
• Understand how Cloud Bigtable is a low-latency solution
• Describe how to architect for Bigtable and how to ingest data into Bigtable
• Highlight performance considerations for the relevant services
Activities:
• Lab: Streaming Analytics and Dashboards
• Lab: Streaming Data Pipelines into Bigtable
Module 12 - Advanced BigQuery Functionality and Performance
Topics:
• Analytic window functions
• Using WITH clauses
• GIS functions
• Performance considerations
Objectives:
• Review some of BigQuery’s advanced analysis capabilities
• Discuss ways to improve query performance
Activities:
• Lab: Optimizing your BigQuery Queries for Performance
• Optional Lab: Partitioned Tables in BigQuery
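Example: a WITH clause combined with an analytic window function can be tried locally before the BigQuery labs. The sketch below uses Python's built-in sqlite3 module as a stand-in for BigQuery, so dialect details differ, and it assumes the bundled SQLite is version 3.25 or later. The trips table and its rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for BigQuery (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (station TEXT, duration INTEGER)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [("A", 10), ("A", 30), ("B", 20), ("B", 5)],
)

query = """
WITH long_trips AS (
    -- Named subquery: keep only trips of at least 10 minutes.
    SELECT station, duration FROM trips WHERE duration >= 10
)
SELECT station,
       duration,
       RANK() OVER (PARTITION BY station ORDER BY duration DESC) AS rank_in_station
FROM long_trips
ORDER BY station, rank_in_station
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The window function ranks rows within each station without collapsing them, which is the main difference from a GROUP BY aggregate.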
Module 13 - Introduction to Analytics and AI
Topics:
• What is AI?
• From ad-hoc data analysis to data-driven decisions
• Options for ML models on Google Cloud
Objectives:
• Understand the proposition that ML adds value to your data
• Understand the relationship between ML, AI, and Deep Learning
• Identify ML options on Google Cloud
Module 14 - Prebuilt ML Model APIs for Unstructured Data
Topics:
• Unstructured data is hard
• ML APIs for enriching data
Objectives:
• Discuss challenges when working with unstructured data
• Learn the applications of ready-to-use ML APIs on unstructured data
Activities:
• Lab: Using the Natural Language API to Classify Unstructured Text
Module 15 - Big Data Analytics with Notebooks
Topics:
• What’s a notebook?
• BigQuery magic and ties to Pandas
Objectives:
• Introduce Notebooks as a tool for prototyping ML solutions
• Learn to execute BigQuery commands from Notebooks
Activities:
• Lab: BigQuery in Jupyter Labs on AI Platform
Module 16 - Production ML Pipelines
Topics:
• Ways to do ML on Google Cloud
• Vertex AI Pipelines
• AI Hub
Objectives:
• Describe options available for building custom ML models
• Understand the use of tools like Vertex AI Pipelines
Activities:
• Lab: Running Pipelines on Vertex AI
Module 17 - Custom Model Building with SQL in BigQuery ML
Topics:
• BigQuery ML for quick model building
• Supported models
Objectives:
• Learn how to create ML models by using SQL syntax in BigQuery
• Demonstrate building different kinds of ML models using BigQuery ML
Activities:
• Lab option 1: Predict Bike Trip Duration with a Regression Model in BigQuery ML
• Lab option 2: Movie Recommendations in BigQuery ML
Module 18 - Custom Model Building with AutoML
Topics:
• Why AutoML?
• AutoML Vision
• AutoML NLP
• AutoML Tables
Objectives:
• Explore various AutoML products used in machine learning
• Learn to use AutoML to create powerful models without coding