MASSACHUSETTS INSTITUTE OF TECHNOLOGY
Case Study:
Build Your Own Recommendation
System for Movies
CASE STUDY: BUILD YOUR OWN RECOMMENDATION
SYSTEM FOR MOVIES (EXTRACTED FROM MIT'S ONLINE
COURSE, DATA SCIENCE AND BIG DATA ANALYTICS:
MAKING DATA-DRIVEN DECISIONS)
WHAT WILL YOU GET
OUT OF THIS CASE STUDY?
This case study is extracted from MIT's online
course for professionals, Data Science and Big Data
Analytics: Making Data-Driven Decisions.
After going through this case study, you'll be able to:
- Analyze data to develop your own version of a
  recommendation engine, which forms the basis
  of content systems used at companies like
  Netflix, Pandora, Spotify, etc.
- Experience a hands-on approach to advance
  your data science skills
- Access a series of resources and tools,
  including sample databases, that will enable you
  to build your recommendation system
- Get a sneak peek at the content included in
  MIT's online professional course on data science
IMPORTANT:
Don't get discouraged if some of the
steps described seem too complicated!
Remember, this is an extract of the online
course that will provide you with all the
background necessary to successfully
complete this case study.
Learn more about the course here.
WHY THIS CASE STUDY?
By following some simple steps, you can develop
your own version of a recommendation engine,
which forms the basis of several content
recommendation systems, e.g. Netflix, Pandora,
Spotify, etc. You can then apply this skill to
any domain of your choice, e.g. restaurant
recommendations.
Self-Help Documentation: In this document, we walk through some helpful tips to get you started with building your own Recommendation engine based on the case studies discussed in the Recommendation systems module. In this tutorial, we provide examples and some pseudo-code for the following programming environments: R, Python.

Time Required: The time required to do this activity varies depending on your experience with the required programming background. We suggest planning for somewhere between 1 and 3 hours. Remember, this is an optional activity for participants looking for hands-on experience.

Before You Start: Watch this video! It is also taken from the course, and it provides context and knowledge you will need to complete this activity.

If the link above doesn't work, copy and paste this into your browser: https://www.youtube.com/watch?v=m9gESMWWb5Q

MEET YOUR INSTRUCTOR, PROF. DEVAVRAT SHAH
Co-director of the MIT online course Data Science and Big Data Analytics: Making Data-Driven Decisions

As a professor in the department of electrical engineering and computer science at MIT, Dr. Shah's current research is on the theory of large complex networks. He is a member of the Laboratory for Information and Decision Systems and the Director of the Statistics and Data Science Center in MIT's Institute for Data, Systems, and Society. Dr. Shah received his Bachelor of Technology in Computer Science and Engineering from the Indian Institute of Technology, Bombay, in 1999. He received the President of India Gold Medal, awarded to the best graduating student across all engineering disciplines. He received his Ph.D. in CS from Stanford University. His doctoral thesis won the George B. Dantzig award from INFORMS for best dissertation. In 2005, he started teaching at MIT. In 2013, he co-founded Celect, Inc.
DISCLAIMER:
This case study will require some prior knowledge and experience with the programming language you choose to use for reproducing case
study results. Generally, participants with 6 months of experience using R or Python should be successful in going through these exercises.
MIT is not responsible for errors in these tutorials or in external, publicly available data sets, code, and implementation libraries. Please note
that any links to external, publicly available websites, data sets, code, and implementation libraries are provided as a courtesy for the student.
They should not be construed as an endorsement of the content or views of the linked materials.
INTRODUCTION
In this document, we walk through some helpful tips to get you started with building your own Recommendation engine based on the case studies discussed in the Recommendation systems module. In this tutorial, we provide examples and some pseudo-code for the following programming environments: R, Python. We cover the following:

1. Getting the data
2. Working with the dataset
3. Recommender libraries in R, Python
4. Data partitions (Train, Test)
5. Integrating a Popularity Recommender
6. Integrating a Collaborative Filtering Recommender
7. Integrating an Item-Similarity Recommender
8. Getting Top-K recommendations
9. Evaluation: RMSE
10. Evaluation: Confusion Matrix/Precision-Recall

Remember to watch this video first: https://www.youtube.com/watch?v=m9gESMWWb5Q

1 GETTING THE DATA
For this tutorial, we use the dataset(s) provided by MovieLens. MovieLens has several datasets, and you can choose any of them. For this tutorial, we will use the 100K dataset. This dataset consists of:
- 100,000 ratings (1-5) from 943 users on 1,682 movies.
- Each user has rated at least 20 movies.
- Simple demographic info for the users (age, gender, occupation, zip)

Download the "u.data" file. To view this file you can use Microsoft Excel, for example. It has the following tab-separated format: user id | item id | rating | timestamp. The timestamps are in Unix epoch format, i.e. seconds since 1/1/1970 UTC.

2 WORKING WITH THE DATA SET
The first task is to explore the dataset. You can do so using a programming environment of your choice, e.g. Python or R.

In R, you can read the data by simply calling the read.table() function:
data = read.table('u.data')

You can rename the columns as desired:
colnames(data) = c("user_id", "item_id", "rating", "timestamp")

Since we don't need the timestamps, we can drop them:
data = data[ , -which(names(data) %in% c("timestamp"))]

You can look at the data properties by using:
str(data)
summary(data)

Plot a histogram of the data:
hist(data$rating)
In Python, you can load the data into a Pandas
dataframe to organize the dataset. For plotting in
Python, you can use Matplotlib. You can do all the
operations above (described for R) in Python using
Pandas in the following way:
import matplotlib as mpl
mpl.use('TkAgg')  # selects an interactive backend; omit if your default backend already works
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
col_names = ["user_id", "item_id", "rating", "timestamp"]
data = pd.read_table('u.data', names=col_names)  # u.data is tab-separated with no header row
data = data.drop('timestamp', axis=1)  # the timestamps are not needed
data.info()
plt.hist(data['rating'])
plt.show()
DATA SPARSITY
The dataset sparsity can be calculated as:

$$\text{Sparsity} = \frac{\text{Number of Ratings in the Dataset}}{(\text{Number of Movies/Columns}) \times (\text{Number of Users/Rows})} \times 100\%$$

In R, you can calculate these quantities as follows:
Number_Ratings = nrow(data)
Number_Movies = length(unique(data$item_id))
Number_Users = length(unique(data$user_id))
In Python, using Pandas, you can do the same:
Number_Ratings = len(data)
Number_Movies = len(np.unique(data['item_id']))
Number_Users = len(np.unique(data['user_id']))
SUB-SETTING THE DATA:
If you want the data to be less sparse, a good way
to achieve that is to subset the data so that you
only keep Users/Movies that have at least a certain
number of observations in the dataset.
In R, for example, if you wanted to subset the data
such that only users with 50 or more ratings
remained, you would do the following:
data = data[ data$user_id %in% names(table(data$user_id))[table(data$user_id) >= 50] , ]
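
A similar subset can be taken in Pandas. The following is a minimal sketch, assuming the column names used earlier in this tutorial; the helper name keep_active_users and the toy data are hypothetical.

```python
import pandas as pd

def keep_active_users(ratings: pd.DataFrame, min_ratings: int = 50) -> pd.DataFrame:
    """Keep only the rows belonging to users with at least `min_ratings` ratings."""
    # For every row, count how many ratings that row's user has in total.
    counts = ratings.groupby("user_id")["rating"].transform("count")
    return ratings[counts >= min_ratings]

# Toy data: user 1 has 3 ratings, user 2 has 1, user 3 has 2.
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 3, 3],
    "item_id": [10, 20, 30, 10, 20, 30],
    "rating":  [5, 4, 3, 2, 5, 1],
})
subset = keep_active_users(toy, min_ratings=2)
print(sorted(subset["user_id"].unique().tolist()))  # prints [1, 3]
```

The same pattern, grouped by item_id instead of user_id, would keep only movies with enough ratings.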
3 RECOMMENDERS
If you want to build your own Recommenders from scratch, you can consult the vast amount of freely available academic literature. There are also several self-help guides which can be useful, such as these:
- Collaborative Filtering with R;
- How to build a Recommender System;

On the other hand, why build a recommender from scratch when there is a vast array of publicly available Recommenders (in all sorts of programming environments) ready for use? Some examples are:
- RecommenderLab for R;
- Graphlab-Create for Python (has a free license for personal and academic use);
- Apache Spark's Recommendation module;
- Apache Mahout;

For this tutorial, we will reference RecommenderLab and Graphlab-Create.

4 SPLITTING DATA RANDOMLY (TRAIN/TEST)
A random split can be created in R and Pandas (Python).

In R, you can do the following to create a 70/30 split for Train/Test:
library(caTools)
spl = sample.split(data$rating, 0.7)
train = subset(data, spl == TRUE)
test = subset(data, spl == FALSE)

In Pandas (Python), using the SciKit-Learn library, we can do the same via:
import pandas as pd
import numpy as np
# older scikit-learn releases exposed this function in sklearn.cross_validation
from sklearn.model_selection import train_test_split
# assuming pdf is the pandas dataframe with the data
train, test = train_test_split(pdf, test_size = 0.3)

Alternatively, one can use the Recommender libraries (discussed earlier) to create the data splits.

For RecommenderLab in R, the documentation in Section 5.6 provides examples that will allow random data splits.

GraphLab's SFrame objects also have a random_split() function which works similarly.

5 POPULARITY RECOMMENDER
RecommenderLab provides a popularity recommender out of the box. Section 5.5 of the RecommenderLab guide provides examples and sample code to help do this.

GraphLab-Create also provides a Popularity Recommender. If the dataset is in Pandas, it can easily integrate with GraphLab's SFrame datatype, as noted in the GraphLab documentation. Some more information on the Popularity Recommender and its usage is provided in the popularity recommender's online documentation.
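
To illustrate what a popularity recommender does conceptually, here is a minimal Pandas sketch that does not use RecommenderLab or GraphLab. It assumes that "popularity" means the mean rating of an item (with the number of ratings as a tie-breaker); the libraries above may define it differently. The function names and the toy data are hypothetical.

```python
import pandas as pd

def popularity_scores(train: pd.DataFrame) -> pd.DataFrame:
    """Rank items by mean rating, breaking ties by number of ratings."""
    stats = (train.groupby("item_id")["rating"]
                  .agg(mean_rating="mean", n_ratings="count")
                  .reset_index())
    return stats.sort_values(["mean_rating", "n_ratings"], ascending=False)

def recommend_popular(train: pd.DataFrame, user_id: int, k: int = 2) -> list:
    """Return the k most popular items the given user has not rated yet."""
    seen = set(train.loc[train["user_id"] == user_id, "item_id"])
    ranked = popularity_scores(train)
    unseen = ranked[~ranked["item_id"].isin(seen)]
    return unseen["item_id"].head(k).tolist()

# Toy training data (not MovieLens).
toy_train = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3, 3],
    "item_id": [10, 20, 10, 30, 20, 30, 40],
    "rating":  [5, 3, 4, 2, 4, 1, 5],
})
top_for_user_1 = recommend_popular(toy_train, user_id=1, k=2)
print(top_for_user_1)  # prints [40, 30]
```

Every user receives the same ranking, minus the items they have already rated, which is why this recommender is mainly useful as a baseline for the methods that follow.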
6 COLLABORATIVE FILTERING
Most recommender libraries will provide an implementation for Collaborative Filtering methods. The RecommenderLab in R and GraphLab in Python both provide implementations of Collaborative Filtering methods, as noted next:

For RecommenderLab, use the "UBCF" (user-based collaborative filtering) to train the model, as noted in the documentation.

For GraphLab, use the "Factorization Recommender".

Often, a regularization parameter is used with these models. The best value for this regularization parameter is chosen using a Validation set. Here is how this can be done:

1. If the Train/Test split has already been performed (as detailed earlier), split the Train set further (75%/25%) into Train/Validation sets. Now we have three sets: Train, Validation, Test.
2. Train several models, each using a different value of the regularization parameter (usually in the range 1e-5 to 1e-1).
3. Use the Validation set to determine which model results in the lowest RMSE (see Evaluation section below).
4. Use the regularization value that corresponds to the lowest Validation-set RMSE (see Evaluation section below).
5. Finally, with that parameter value fixed, use the trained model to get a final RMSE value on the Test set.
6. It can also help to plot the Validation-set RMSE values against the Regularization parameter values to determine the best one.

7 ITEM SIMILARITY FILTERING
Several recommender libraries will also provide Item-Item similarity based methods.

For RecommenderLab, use the "IBCF" (item-based collaborative filtering) to train the model.

For GraphLab, use the "Item-Item Similarity Recommender".

Item Similarity recommenders can use the "0/1" ratings model to train the algorithms (where 0 means the item was not rated by the user and 1 means it was). No other information is used. For these types of recommenders, a ranked list of items recommended for each user is made available as the output, based on "similar" items. Instead of RMSE, a Precision/Recall metric can be used to evaluate the accuracy of the model (see details in the Evaluation Section below).

8 TOP-K RECOMMENDATIONS
Based on scores assigned to User-Item pairs, each recommender algorithm makes available functions that will provide a sorted list of top-K items most highly recommended for each user (from among those items not already rated by the user).
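
The idea behind top-K recommendations (Section 8) can be sketched independently of any library: given a matrix of scores for User-Item pairs, mask out the items each user has already rated and keep the K highest-scoring ones. The function name top_k_items and the toy matrices below are hypothetical, and the sketch assumes every user has at least K unrated items.

```python
import numpy as np

def top_k_items(scores: np.ndarray, already_rated: np.ndarray, k: int) -> np.ndarray:
    """For each user (row), return the column indices of the k best unrated items."""
    # Already-rated items get a score of -inf so they can never rank first.
    masked = np.where(already_rated, -np.inf, scores)
    # Sort each row by descending score and keep the first k column indices.
    order = np.argsort(-masked, axis=1, kind="stable")
    return order[:, :k]

# Toy example: 2 users, 4 items.
scores = np.array([[0.9, 0.2, 0.7, 0.4],
                   [0.1, 0.8, 0.3, 0.6]])
already_rated = np.array([[True, False, False, False],
                          [False, True, False, False]])
top2 = top_k_items(scores, already_rated, k=2)
print(top2.tolist())  # prints [[2, 3], [3, 2]]
```

Here each user's highest raw score belongs to an item they have already rated, so it is excluded and the next-best items are returned instead.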
In RecommenderLab, the parameter type='topNlist' to the evaluate() function will produce such a list.

In GraphLab, the recommend(K) function for each type of recommender object will do the same.

9 EVALUATION: RMSE (ROOT MEAN SQUARED ERROR)
Once the model is trained on the Training data, and any parameters determined using a Validation set, if required, the model can be used to compute the error (RMSE) on predictions made on the Test data.

RecommenderLab in R uses the predict() and calcPredictionAccuracy() functions to compute the predictions (based on the trained model) and evaluate RMSE (and MSE and MAE).

GraphLab in Python also has a predict() function to get predictions. It provides a suite of functions to evaluate metrics such as RMSE (evaluation.rmse(), for example).

WANT TO KEEP LEARNING?
Join us! MIT's online course Data Science and Big Data Analytics: Making Data-Driven Decisions starts Oct 23, 2017. Enroll today!

10 EVALUATION: PRECISION/RECALL, CONFUSION MATRIX
For the top-K recommendations based evaluation, such as in Item Similarity recommenders, we can evaluate using a Confusion Matrix or Precision/Recall values. Specifically,
- Precision: out of the K top recommendations, how many are among the true best movies for the user.
- Recall: out of the true best movies for the user, how many were recommended.

For RecommenderLab, getConfusionMatrix(results), where results is the output of the evaluate() function discussed earlier, will provide the True Positives, False Negatives, False Positives and True Negatives matrix from which Precision and Recall can be calculated.

In GraphLab, the following function will also produce the Confusion Matrix: evaluation.confusion_matrix(). Also, if comparing models, e.g. Popularity Recommender and Item-Item Similarity Recommender, a precision/recall plot can be generated by using the following function: recommender.util.compare_models(metric='precision_recall'). This will produce a Precision/Recall plot (and list of values) for various values of K (the number of recommendations for each user).
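
For readers who want to see the evaluation metrics spelled out, here is a minimal plain-Python sketch of RMSE and of precision/recall for a single user's top-K list. It is not the RecommenderLab or GraphLab implementation; the function names and the toy numbers are hypothetical, and "relevant" stands for whatever set of items you treat as the true best ones for that user.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences of ratings."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the first k recommended items for one user."""
    top_k = list(recommended)[:k]
    relevant = set(relevant)
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Toy test-set ratings and predictions: squared errors are 1, 0, 1, 4.
test_rmse = rmse([4, 3, 5, 1], [3, 3, 4, 3])
# Toy ranked list: only item 10 of the top 2 is among the 3 relevant items.
precision, recall = precision_recall_at_k([40, 10, 30, 20], relevant={10, 20, 50}, k=2)
print(round(test_rmse, 3), precision, round(recall, 3))  # prints 1.225 0.5 0.333
```

Averaging precision and recall over all users, for several values of K, gives the kind of precision/recall curve mentioned above.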