Road to Data Engineer 2.0!
medium.com/@o.anonthanasap/road-to-data-engineer-2-0-664ef352e0d8
October 10, 2021
Over the last two months, I had an excellent opportunity to take the Road to Data Engineer (R2DE
2.0) course from DATAth.com. Road to Data Engineer (R2DE) is a course that covers fundamental
to advanced knowledge in the Data Engineer track, with a workshop in every chapter. We can
apply that knowledge to build our own end-to-end data pipeline using cutting-edge technology
that is popular today.
This dashboard was inspired by the Road to Data Engineer course :)
LINK to my Google Data Studio Dashboard
My GitHub Code: https://github.com/orapinanon/dataengineer_proj
Tech Stack
✔ Python (Google Colab)
✔ SQL
✔ Pandas
✔ Apache Spark, PySpark
✔ Apache Airflow, DAG: Directed Acyclic Graph
✔ Google Cloud Platform (GCP)
✔ Google BigQuery
✔ Google Data Studio
✔ Databricks
✔ Basic and Advanced Git + CI/CD
✔ Docker + Airflow + Kubernetes
Course Curriculum
Data Engineer end-to-end flow
✔ CH0 Introduction to Data Engineering — Python & SQL
Basics of Data Engineering
Concept of Big Data & Hadoop
Database, Data Warehouse, Data Lake
✔ CH1 Data Pipeline & ETL
Data Pipeline (ETL/ELT)
Data Pipeline design
Data Integration
Workshop 1: Data Collection with Python (Google Colab), Pandas, REST API
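A minimal sketch of the Workshop 1 idea, pulling data from a REST API into a Pandas DataFrame. The endpoint and the output file name here are placeholders, not the course's actual API:

import pandas as pd
import requests

# Placeholder endpoint; the workshop uses the course's own REST API.
API_URL = "https://example.com/api/sales"

response = requests.get(API_URL, timeout=30)
response.raise_for_status()

# Turn the JSON payload into a DataFrame and save it for the next step.
df = pd.DataFrame(response.json())
print(df.head())
df.to_csv("raw_sales_data.csv", index=False)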
✔ CH2 Data Quality & Wrangling
Data Cleansing
Data Quality (Data Lineage and Data Dictionary)
EDA (Exploratory Data Analysis) and Data Profiling
Handling anomalies and missing data
Distributed Data Processing
Concept of Apache Spark
Workshop 2: Data Wrangling, Data Cleansing with Apache Spark (Colab and
PySpark)
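A rough sketch of the kind of cleansing Workshop 2 covers with PySpark; the file and column names are placeholders rather than the course's Audible dataset:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("data_cleansing").getOrCreate()

# Placeholder input file; the workshop reads its own dataset here.
df = spark.read.csv("raw_sales_data.csv", header=True, inferSchema=True)

# Typical cleansing steps: drop duplicates, drop rows missing a key field,
# and fix an obvious anomaly (e.g. negative prices).
cleaned = (
    df.dropDuplicates()
      .na.drop(subset=["customer_id"])
      .withColumn("price", F.abs(F.col("price")))
)

# For this small sketch, collect to Pandas and write a single CSV file.
cleaned.toPandas().to_csv("cleaned_sales_data.csv", index=False)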
✔ CH3 Basic Cloud — Google Cloud
Cloud Computing concept with Google Cloud Platform (GCP)
Concept of public / private / hybrid cloud
Cloud vs on-premise
Cloud computing concepts, e.g., managed services and serverless
GCP (Google Cloud Platform) services
Data Processing and storage on cloud
Basics of Bash commands
Workshop 3: Data Storage with GCS
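Workshop 3 lands the cleaned file in Google Cloud Storage (GCS). A minimal sketch with the official Python client; the bucket and object names are made up:

from google.cloud import storage

# Placeholder names; substitute your own bucket and paths.
BUCKET_NAME = "my-r2de-bucket"

client = storage.Client()
bucket = client.bucket(BUCKET_NAME)

# Upload the cleaned file so Airflow and BigQuery can pick it up later.
blob = bucket.blob("cleaned/cleaned_sales_data.csv")
blob.upload_from_filename("cleaned_sales_data.csv")
print(f"Uploaded to gs://{BUCKET_NAME}/{blob.name}")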
✔ CH4 Data Pipeline Orchestration with Airflow
Data Pipeline Orchestration
Data Pipeline tool
Concept of Apache Airflow
DAG: Directed Acyclic Graph
Cloud Composer (GCP)
Create data pipeline with Apache Airflow
Workshop 4: Automated Data Pipeline with Airflow
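A stripped-down Airflow DAG in the spirit of Workshop 4. The task bodies are stubs, and the DAG name and schedule are assumptions rather than the course's actual pipeline:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    # Placeholder: pull data from the REST API (Workshop 1).
    pass

def transform():
    # Placeholder: clean the data with Spark or Pandas (Workshop 2).
    pass

def load():
    # Placeholder: load the result into BigQuery (Workshop 5).
    pass

with DAG(
    dag_id="r2de_example_pipeline",  # hypothetical name
    start_date=datetime(2021, 10, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # The DAG (Directed Acyclic Graph): extract -> transform -> load
    t_extract >> t_transform >> t_load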
✔ CH5 Introduction to BigQuery
Data Warehouse
BigQuery and its core concepts
Load data into BigQuery
Design schema in BigQuery
Workshop 5: Building Data Warehouse with BigQuery (feeding data from Airflow)
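Workshop 5 feeds the cleaned file from GCS into a BigQuery table. A hedged sketch with the BigQuery Python client; the project, dataset, and table names are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder identifiers; replace with your own project, dataset, and table.
table_id = "my-project.r2de_dataset.audible_data"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,  # let BigQuery infer the schema in this sketch
)

# Load directly from the GCS object written in Workshop 3.
load_job = client.load_table_from_uri(
    "gs://my-r2de-bucket/cleaned/cleaned_sales_data.csv",
    table_id,
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish
print(f"Loaded {client.get_table(table_id).num_rows} rows into {table_id}")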
✔ CH6 Introduction to Google Data Studio
Data visualization
Google Data Studio
Connect Data Studio with data sources
Dimension and Metric in chart
Workshop 6: Building dashboard with Google Data Studio (Data from BigQuery)
✔ CH7 Advanced Data Engineering
Case study for using data pipeline
Data Architecture
Git and Docker container
Data Privacy
Introduction to Machine Learning Engineering
ML deployment pipeline (MLOps)
The future of Data Warehouse e.g., Snowflake
✔ Special Classes
Intro to Databricks
What I have learned working with data
Enterprise Data Architecture
Slowly Changing Dimension
What recruiters look for in a DE LinkedIn profile
Intro to PowerBI
Soft skills for Data Engineers
Intro to Data Monitoring
Basic and Advanced Git + CI/CD
Docker + Airflow + Kubernetes
The Data Engineer workshops
Data Engineer end-to-end flow
Workshop 1: Data Collection with Python (Google Colab), Pandas, REST API
Workshop 2: Data Wrangling, Data Cleansing with Apache Spark (Colab and
PySpark)
Example of our dataset (Audible book data)
Workshop 3: Data Storage with Google Cloud Storage
Workshop 4: Automated Data Pipeline with Airflow
Workshop 5: Building Data Warehouse with BigQuery (feeding data from Airflow)
Workshop 6: Building dashboard with Google Data Studio (Data from BigQuery)
Workshop 6: Building dashboard with Google Data Studio (Data from BigQuery)
The dashboard is the last step of the Data Engineer/Data Scientist workflow. This article
shows the final workshop: building a dashboard with Google Data Studio.
Google Data Studio: Data Studio is a free tool that turns your data into informative,
easy-to-read, easy-to-share, and fully customizable dashboards and reports.
Input: data stored in BigQuery -> Output: reports and dashboards in Google Data Studio
Example of our dataset (Audible book data)
Steps to build the dashboard:
1. Create a table in BigQuery
2. Create a view, filtering the data needed for the dashboard
-- Example view: purchase count per customer
CREATE VIEW vw_customer_purchase AS
SELECT customer AS customer_name, COUNT(*) AS purchase_count
FROM sales_table
GROUP BY customer_name;
Create a view so Data Analysts see only the information they need (a Python sketch of creating such a view follows the list below):
Income (Thai Baht)
Country
Book name
Customer ID (used to count total customers)
Book category
Purchase time
Book ID (for future use)
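A minimal Python sketch of creating such a view with the BigQuery client. Only the general idea comes from the workshop; the source table and column names here are assumptions:

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical names; substitute your own dataset and source table.
view_id = "my-project.r2de_dataset.vw_dashboard"

view = bigquery.Table(view_id)
# Expose only the columns the dashboard needs: income, country, book name,
# customer ID, book category, purchase time, and book ID.
view.view_query = """
    SELECT price AS income_thb,
           country,
           book_title AS book_name,
           user_id AS customer_id,
           book_category,
           purchase_timestamp AS purchase_time,
           book_id
    FROM `my-project.r2de_dataset.audible_data`
"""
client.create_table(view, exists_ok=True)
print(f"Created view {view_id}")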
3. Create Dashboard
LINK to my Google Data Studio Dashboard
Dashboard 1 “Overview” shows a summary of:
- Business income
- Number of customers
- Number of purchases in each country
- Bestsellers
- Best-selling book category
Dashboard 1: Overview
Dashboard 2 “Search book by revenue” is a system for searching books by sales:
- Users can select the country and the sales range they want to search
- Only books that meet the search criteria are displayed
Dashboard 2: Search book by revenue
Note 1: A Parameter lets the user enter their own input. We can create a variable (Parameter)
that users can fill in themselves or choose from.
Note 2: A Calculated Field lets you write your own formulas. If an existing column doesn't meet
our needs, we can write a formula over the existing data to create the column we need.
Certificate after graduation
After completing the course and passing the final exam, you will immediately receive a
certificate from the Road to Data Engineer instructor!
Certificate after graduation