Course ratings
INTRODUCTION TO DATA ENGINEERING
Vincent Vankrunkelsven
Data Engineer @ DataCamp
Ratings at DataCamp
Recommend using ratings
Get rating data
Clean and calculate top-recommended courses
Recalculate daily
Example usage: user's dashboard
As an ETL process
It's an ETL process!
The database
Course: course_id, title, description, programming_language
Rating: user_id, course_id, rating
The database relationship
Course: course_id, title, description, programming_language
Rating: user_id, course_id, rating
Rating.course_id refers to Course.course_id.
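The relationship between the two tables can be sketched with a pandas merge on course_id. This is only an illustration: the sample rows and course titles are made up, and only the column names come from the Course and Rating tables.

```python
import pandas as pd

# Hypothetical sample rows; only the column names follow the slides.
courses = pd.DataFrame({
    "course_id": [1, 21, 74],
    "title": ["Course A", "Course B", "Course C"],
    "programming_language": ["r", "sql", "sql"],
})
ratings = pd.DataFrame({
    "user_id": [1, 1, 2],
    "course_id": [1, 74, 21],
    "rating": [5, 4, 4],
})

# Each rating row points at exactly one course through course_id,
# so a left merge attaches the course attributes to every rating.
rated_courses = ratings.merge(courses, on="course_id", how="left")
print(rated_courses[["user_id", "course_id", "programming_language", "rating"]])
```

A left merge keeps every rating even if its course is missing from the Course table, which makes gaps visible instead of silently dropping rows.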
Let's practice!
From ratings to recommendations
The recommendations table
user_id course_id rating
1 1 4.8
1 74 4.78
1 21 4.5
2 32 4.9
The estimated rating of a course the user hasn't taken yet.
Recommendation techniques
Matrix factorization
Building Recommendation Engines with PySpark
Common sense transformation
Course: course_id, title, description, programming_language
Rating: user_id, course_id, rating
Course and Rating are transformed into Recommendations:
user_id course_id rating
1 1 4.8
1 74 4.78
1 21 4.5
2 32 4.9
Average course ratings
Average course rating
course_id avg_rating
1 4.8
74 4.78
21 4.5
32 4.9
We want to recommend highly rated courses
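A minimal sketch of how an average rating per course could be computed with pandas. The sample ratings are invented, and the groupby approach is an assumption about how a function such as transform_avg_rating() might work, not its actual code.

```python
import pandas as pd

# Hypothetical individual ratings given by users.
ratings = pd.DataFrame({
    "user_id": [1, 2, 3, 1, 2],
    "course_id": [1, 1, 1, 74, 74],
    "rating": [5, 4, 5, 4, 5],
})

# One row per course, holding the mean of all ratings for that course.
avg_course_rating = (
    ratings.groupby("course_id", as_index=False)["rating"]
    .mean()
    .rename(columns={"rating": "avg_rating"})
)
print(avg_course_rating)
```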
Use the right programming language
Rating
user_id course_id programming_language rating
1 1 r 4.8
1 74 sql 4.78
1 21 sql 4.5
1 32 python 4.9
User 1 has rated SQL courses most often, so recommend SQL courses to user 1
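One way to find the language a user has rated most is to count ratings per user and language. The sketch below reuses the rows from the table above; the counting approach and the variable names are assumptions, and ties are resolved arbitrarily.

```python
import pandas as pd

# Rows copied from the Rating table on this slide.
ratings = pd.DataFrame({
    "user_id": [1, 1, 1, 1],
    "course_id": [1, 74, 21, 32],
    "programming_language": ["r", "sql", "sql", "python"],
    "rating": [4.8, 4.78, 4.5, 4.9],
})

# Count the ratings per (user, language) pair.
language_counts = (
    ratings.groupby(["user_id", "programming_language"])
    .size()
    .reset_index(name="n_ratings")
)

# Keep the most frequently rated language for each user.
preferred_language = (
    language_counts.sort_values("n_ratings", ascending=False)
    .drop_duplicates("user_id")
    .set_index("user_id")["programming_language"]
)
print(preferred_language[1])
```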
Recommend new courses
Rating
user_id course_id programming_language rating
1 1 r 4.8
1 74 sql 4.78
1 21 sql 4.5
1 32 python 4.9
Don't recommend the combinations already in the rating table
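Excluding combinations that are already in the rating table is an anti-join. A possible pandas sketch, with hypothetical candidate pairs, uses a left merge with an indicator column:

```python
import pandas as pd

# Hypothetical candidate (user, course) pairs and the pairs already rated.
candidates = pd.DataFrame({"user_id": [1, 1, 1, 1], "course_id": [12, 52, 60, 61]})
already_rated = pd.DataFrame({"user_id": [1, 1], "course_id": [12, 52]})

# indicator=True adds a "_merge" column saying where each row was found.
merged = candidates.merge(
    already_rated, on=["user_id", "course_id"], how="left", indicator=True
)

# "left_only" rows exist in candidates but not in already_rated.
new_pairs = merged.loc[merged["_merge"] == "left_only", ["user_id", "course_id"]]
print(new_pairs)
```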
Our recommendation transformation
Use the technology the user has rated most
Don't recommend courses the user has already rated
Recommend the three highest rated courses from the remaining combinations
Rating
user_id course_id programming_language rating
1 12 sql 4.78
1 52 sql 4.5
1 32 r 4.9
Recommend three highest rated SQL courses which are not 12 and 52.
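The three rules can be combined in one function. The sketch below is an assumption about how they fit together, not the course's implementation: recommend_sketch, the extra course ids (60 to 63), and all average ratings are hypothetical.

```python
import pandas as pd

# Hypothetical catalogue, user ratings, and average ratings.
courses = pd.DataFrame({
    "course_id": [12, 52, 32, 60, 61, 62, 63],
    "programming_language": ["sql", "sql", "r", "sql", "sql", "sql", "sql"],
})
ratings = pd.DataFrame({"user_id": [1, 1, 1], "course_id": [12, 52, 32]})
avg_course_rating = pd.DataFrame({
    "course_id": [12, 52, 32, 60, 61, 62, 63],
    "avg_rating": [4.78, 4.5, 4.9, 4.1, 4.7, 4.6, 4.8],
})

def recommend_sketch(user_id, top_n=3):
    rated = ratings.loc[ratings["user_id"] == user_id].merge(courses, on="course_id")
    # Rule 1: the language the user has rated most (ties resolved arbitrarily).
    language = rated["programming_language"].value_counts().idxmax()
    # Rule 2: drop the courses the user has already rated.
    pool = courses[
        (courses["programming_language"] == language)
        & ~courses["course_id"].isin(rated["course_id"])
    ]
    # Rule 3: keep the top_n courses by average rating.
    ranked = pool.merge(avg_course_rating, on="course_id").nlargest(top_n, "avg_rating")
    return ranked["course_id"].tolist()

print(recommend_sketch(1))
```

With this sample data, courses 12 and 52 are excluded and the three best remaining SQL courses are returned in descending order of average rating.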
Let's practice!
Scheduling daily jobs
What you've done so far
Extract using extract_course_data() and extract_rating_data()
Clean up NA values using transform_fill_programming_language()
Average course ratings per course: transform_avg_rating()
Get eligible user and course id pairs: transform_courses_to_recommend()
Calculate the recommendations: transform_recommendations()
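The clean-up step could look roughly like the sketch below. The function name fill_programming_language_sketch and the placeholder value "unknown" are assumptions; the slides do not say which value replaces a missing programming language.

```python
import pandas as pd

# Hypothetical course data with one missing programming language.
courses = pd.DataFrame({
    "course_id": [1, 21, 74],
    "programming_language": ["r", None, "sql"],
})

def fill_programming_language_sketch(course_data, default="unknown"):
    # Passing a dict to fillna only touches the named column
    # and returns a new DataFrame, leaving the input unchanged.
    return course_data.fillna({"programming_language": default})

cleaned = fill_programming_language_sketch(courses)
print(cleaned)
```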
Loading to Postgres
Use the calculations in data products
Update daily
Example use case: sending out e-mails with recommendations
The loading phase
recommendations.to_sql(
    "recommendations",
    db_engine,
    if_exists="append",
)
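To see what if_exists="append" does across repeated runs, the sketch below loads the same DataFrame twice. It uses an in-memory SQLite connection as a stand-in for the Postgres engine, which is an assumption made so the example runs without a database server.

```python
import sqlite3

import pandas as pd

# Rows taken from the recommendations table shown earlier.
recommendations = pd.DataFrame({
    "user_id": [1, 1, 2],
    "course_id": [1, 74, 32],
    "rating": [4.8, 4.78, 4.9],
})

# In-memory SQLite connection standing in for the Postgres db_engine.
connection = sqlite3.connect(":memory:")

# Simulate two daily runs: "append" creates the table on the first run
# and adds the same rows again on the second.
for _ in range(2):
    recommendations.to_sql(
        "recommendations", connection, if_exists="append", index=False
    )

row_count = connection.execute("SELECT COUNT(*) FROM recommendations").fetchone()[0]
print(row_count)
```

Because the rows accumulate with every run, a daily recalculation may call for if_exists="replace" or a load date column, depending on how the table is consumed.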
def etl(db_engines):
    # Extract the data
    courses = extract_course_data(db_engines)
    rating = extract_rating_data(db_engines)
    # Clean up courses data
    courses = transform_fill_programming_language(courses)
    # Get the average course ratings
    avg_course_rating = transform_avg_rating(rating)
    # Get eligible user and course id pairs
    courses_to_recommend = transform_courses_to_recommend(
        rating,
        courses,
    )
    # Calculate the recommendations
    recommendations = transform_recommendations(
        avg_course_rating,
        courses_to_recommend,
    )
    # Load the recommendations into the database
    load_to_dwh(recommendations, db_engines)
Creating the DAG
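from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator

# Run once a day at midnight (cron: minute 0, hour 0, every day)
dag = DAG(dag_id="recommendations",
          schedule_interval="0 0 * * *")

task_recommendations = PythonOperator(
    task_id="recommendations_task",
    python_callable=etl,
    # etl() takes db_engines; it is assumed to be defined elsewhere in this file
    op_kwargs={"db_engines": db_engines},
    dag=dag,
)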
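The string "0 0 * * *" passed to the DAG is a cron expression with five fields. The helper below is a hypothetical illustration that only labels those fields; it is not part of Airflow and does not validate the values.

```python
# Standard order of the five cron fields.
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")

def describe_cron(expression):
    """Map each field of a five-field cron expression to its name."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError("expected five space-separated fields")
    return dict(zip(CRON_FIELDS, parts))

schedule = describe_cron("0 0 * * *")
print(schedule)
```

Minute 0 and hour 0 with wildcards everywhere else means the job runs once a day at midnight.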
Let's practice!
Congratulations
Introduction to data engineering
Identify the tasks of a data engineer
What kind of tools they use
Cloud service providers
Data engineering toolbox
Databases
Parallel computing & frameworks (Spark)
Workflow scheduling with Airflow
Extract, Transform and Load (ETL)
Extract: get data from several sources
Transform: perform transformations using parallel computing
Load: load data into target database
Case study: DataCamp
Fetch data from multiple sources
Transform to form recommendations
Load into target database
Good job!