Road To Data Engineer

The document summarizes the Road to Data Engineer 2.0 course from DATAth.com. The course provides fundamental and advanced knowledge in data engineering, and teaches students to build end-to-end data pipelines using modern technologies like Python, SQL, Apache Spark, Apache Airflow, Google Cloud Platform, and Google Data Studio. The curriculum covers topics such as data collection, data quality, data warehousing with BigQuery, data visualization with Data Studio, and data engineering best practices. Students complete workshops to apply their skills in building real data pipelines and dashboards. Upon completing the course and passing the final exam, students receive a certificate.


Road to Data Engineer 2.0!

medium.com/@o.anonthanasap/road-to-data-engineer-2-0-664ef352e0d8

October 10, 2021

Over the last two months, I had an excellent opportunity to take the Road to Data Engineer
(R2DE 2.0) course from DATAth.com. Road to Data Engineer (R2DE) is a course that covers
fundamental to advanced knowledge on the Data Engineer track, with a workshop in every
chapter. We can apply that knowledge to build our own end-to-end data pipeline using the
cutting-edge technologies that are popular today.

This dashboard was inspired by the Road to Data Engineer course :)

LINK to my Google Data Studio Dashboard

My GitHub Code: https://github.com/orapinanon/dataengineer_proj

Tech Stack
✔ Python (Google Colab)
✔ SQL
✔ Pandas
✔ Apache Spark, PySpark
✔ Apache Airflow, DAG: Directed Acyclic Graph
✔ Google Cloud Platform (GCP)
✔ Google BigQuery
✔ Google Data Studio
✔ Databricks
✔ Basic and Advanced Git + CI/CD
✔ Docker + Airflow + Kubernetes
Course Curriculum

Data Engineer end-to-end flow

✔ CH0 Introduction to Data Engineering — Python & SQL


Basics of Data Engineering
Concept of Big Data & Hadoop
Database, Data Warehouse, Data Lake

✔ CH1 Data Pipeline & ETL


Data Pipeline (ETL/ELT)
Data Pipeline design
Data Integration
Workshop 1: Data Collection with Python (Google Colab), Pandas, REST API

✔ CH2 Data Quality & Wrangling


Data Cleansing
Data Quality (Data Lineage and Data Dictionary)
EDA (Exploratory Data Analysis) and Data Profiling
Handling anomalies and missing data
Distributed Data Processing
Concept of Apache Spark
Workshop 2: Data Wrangling, Data Cleansing with Apache Spark (Colab and
PySpark)

✔ CH3 Basic Cloud — Google Cloud


Cloud Computing concept with Google Cloud Platform (GCP)
Concept of public / private / hybrid cloud
Cloud vs on-premise
Concept of Cloud computing e.g., Managed Service and serverless
GCP (Google Cloud Platform) services
Data Processing and storage on cloud

Basics of Bash commands
Workshop 3: Data Storage with GCS

✔ CH4 Data Pipeline Orchestration with Airflow


Data Pipeline Orchestration
Data Pipeline tool
Concept of Apache Airflow
DAG: Directed Acyclic Graph
Cloud Composer (GCP)
Create data pipeline with Apache Airflow
Workshop 4: Automated Data Pipeline with Airflow

✔ CH5 Introduction to BigQuery


Data Warehouse
BigQuery and the concept of BigQuery
Load data into BigQuery
Design schema in BigQuery
Workshop 5: Building Data Warehouse with BigQuery (feeding data from Airflow)

✔ CH6 Introduction to Google Data Studio


Data visualization
Google Data Studio
Connect Data Studio with data sources
Dimension and Metric in chart
Workshop 6: Building dashboard with Google Data Studio (Data from BigQuery)

✔ CH7 Advanced Data Engineering


Case study for using data pipeline
Data Architecture
Git and Docker container
Data Privacy
Introduction to Machine Learning Engineering
ML deployment pipeline (MLOps)
The future of Data Warehouse e.g., Snowflake

✔ Special Classes
Intro to Databricks
What I have learned working with data
Enterprise Data Architecture
Slowly Changing Dimension
What recruiters look for in a DE LinkedIn profile
Intro to PowerBI
Soft skills for Data Engineers

Intro to Data Monitoring
Basic and Advanced Git + CI/CD
Docker + Airflow + Kubernetes

The Data Engineer workshop

Data Engineer end-to-end flow

Workshop 1: Data Collection with Python (Google Colab), Pandas, REST API
Workshop 2: Data Wrangling, Data Cleansing with Apache Spark (Colab and
PySpark)

Example of our dataset (Audible book data)

Workshop 3: Data Storage with Google Cloud Storage


Workshop 4: Automated Data Pipeline with Airflow
Workshop 5: Building Data Warehouse with BigQuery (feeding data from Airflow)
Workshop 6: Building dashboard with Google Data Studio (Data from BigQuery)

Workshop 6: Building dashboard with Google Data Studio (Data from BigQuery)
The dashboard is the last step of Data Engineering/Data Science work. This article walks
through the last workshop: building a dashboard with Google Data Studio.

Google Data Studio: Data Studio is a free tool that turns your data into informative, easy
to read, easy to share, and fully customizable dashboards and reports.

Input: Datastore in BigQuery -> Output: Report and Dashboard on Google Data Studio

Example of our dataset (Audible book data)

Steps to build the dashboard:
1. Create a table in BigQuery
2. Create a view that filters the data for the dashboard

CREATE VIEW vw_customer_purchase AS
SELECT customer AS customer_name, COUNT(*) AS purchase_count
FROM sales_table
GROUP BY customer_name;

Create a view so that data analysts see only the information they need:

Income (Thai Baht)
Country
Book name
Customer ID (used to count total customers)
Book category
Purchase time
Book ID (for future use)
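
As a rough illustration of the idea, here is a minimal sketch of a view that exposes only the fields listed above. It uses Python's built-in sqlite3 module as a local stand-in for BigQuery, and every table name, column name, and sample row is a hypothetical placeholder, since the article does not show the real schema.

```python
import sqlite3

# In-memory SQLite database used as a local stand-in for BigQuery.
# All table/column names and rows below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_table (
    book_id TEXT, book_name TEXT, category TEXT,
    customer_id TEXT, country TEXT,
    purchased_at TEXT, price_thb REAL,
    internal_note TEXT  -- a column analysts do not need
);
INSERT INTO sales_table VALUES
    ('B1', 'Book A', 'Fiction', 'C1', 'Thailand', '2021-05-01 10:00:00', 350.0, 'x'),
    ('B2', 'Book B', 'Science', 'C2', 'Japan',    '2021-05-02 11:30:00', 420.0, 'y'),
    ('B1', 'Book A', 'Fiction', 'C2', 'Japan',    '2021-05-03 09:15:00', 350.0, 'z');

-- The view selects only the dashboard fields and hides everything else.
CREATE VIEW vw_dashboard AS
SELECT price_thb AS income_thb,
       country,
       book_name,
       customer_id,
       category  AS book_category,
       purchased_at,
       book_id
FROM sales_table;
""")

cur = conn.execute("SELECT * FROM vw_dashboard")
view_columns = [d[0] for d in cur.description]  # column names exposed by the view
rows = cur.fetchall()
print(view_columns)
print(len(rows), "rows")
```

Analysts who query the view never see internal_note, while the underlying table stays unchanged.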

3. Create Dashboard

LINK to my Google Data Studio Dashboard

Dashboard 1, “Overview”, contains:


Show summary:
- Business income
- Number of customers
- Number of purchases in each country
- Bestsellers
- Best-selling book category.
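
To make these summary figures concrete, the sketch below computes them in plain Python over a few made-up purchase rows. In the actual dashboard, Data Studio performs these aggregations itself; the row layout and the choice to rank bestsellers by purchase count are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical sample rows: (book_name, category, customer_id, country, price_thb)
purchases = [
    ("Book A", "Fiction", "C1", "Thailand", 350.0),
    ("Book B", "Science", "C2", "Japan", 420.0),
    ("Book A", "Fiction", "C2", "Japan", 350.0),
    ("Book C", "Fiction", "C3", "Thailand", 200.0),
]

# Business income: sum of all purchase prices.
total_income = sum(p[4] for p in purchases)

# Number of customers: count of distinct customer IDs.
customer_count = len({p[2] for p in purchases})

# Number of purchases in each country.
purchases_by_country = Counter(p[3] for p in purchases)

# Bestseller and best-selling category, ranked by purchase count (an assumption).
bestseller = Counter(p[0] for p in purchases).most_common(1)[0]
top_category = Counter(p[1] for p in purchases).most_common(1)[0]

print(total_income, customer_count, dict(purchases_by_country))
print(bestseller, top_category)
```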

Dashboard 1: Overview

Dashboard 2, “Search book by revenue”, contains:


System for searching books by sales:
- Users can select the country and the sales amount they want to search for
- Displays only the books that meet the search criteria.

Dashboard 2: Search book by revenue

Note 1: Parameters. A parameter lets users supply their own input. We can create a
variable (parameter) that users either fill in or choose from a list.

Note 2: Calculated fields. If an existing column doesn’t meet our needs, we can write a
formula over the existing data to create the column we need.
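
The two ideas can be mimicked outside Data Studio. The sketch below is plain Python, not Data Studio syntax: the function arguments play the role of parameters, and revenue_usd plays the role of a calculated field. The BookSales type, the sample rows, the minimum-revenue filter, and the exchange rate are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BookSales:
    book_name: str
    country: str
    revenue_thb: float


# Hypothetical exchange rate, used only to illustrate a derived column.
THB_PER_USD = 33.0


def revenue_usd(row: BookSales) -> float:
    """Calculated field: a new value derived from an existing column."""
    return round(row.revenue_thb / THB_PER_USD, 2)


def search_books(rows, country, min_revenue_thb):
    """Parameters: country and min_revenue_thb stand in for user input."""
    return [
        r for r in rows
        if r.country == country and r.revenue_thb >= min_revenue_thb
    ]


rows = [
    BookSales("Book A", "Thailand", 3300.0),
    BookSales("Book B", "Thailand", 990.0),
    BookSales("Book C", "Japan", 6600.0),
]

# Only the books that meet the search criteria are kept.
matches = search_books(rows, country="Thailand", min_revenue_thb=1000.0)
for r in matches:
    print(r.book_name, revenue_usd(r))
```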

Certificate after graduation


After completing the course and passing the final exam, you will immediately receive a
certificate from the Road to Data Engineer instructor!

Certificate after graduation

References Data
