Recommender Systems & Embeddings
Charles Ollion - Olivier Grisel
Outline
Embeddings
Dropout Regularization
Recommender Systems
Embeddings
Symbolic variable
Text: characters, words, bigrams...
Recommender Systems: item ids, user ids
Any categorical descriptor: tags, movie genres, visited
URLs, skills on a resume, product categories...
Notation:
Symbol s in vocabulary V
One-hot representation

onehot('salad') = [0, 0, 1, ..., 0] ∈ {0, 1}^|V|

(figure: one-hot vector with axes labeled pig, carrot, salad, door, roof, cake)

Sparse, discrete, large dimension |V|
Each axis has a meaning
Symbols are equidistant from each other: Euclidean distance = √2
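As a minimal illustration, a NumPy sketch of this encoding (the vocabulary and helper function are ours, not from the course code):

import numpy as np

vocabulary = ['pig', 'carrot', 'salad', 'door', 'roof', 'cake']

def onehot(symbol):
    # |V|-dimensional vector of zeros with a single 1 at the symbol's index
    vector = np.zeros(len(vocabulary))
    vector[vocabulary.index(symbol)] = 1.0
    return vector

onehot('salad')                                   # array([0., 0., 1., 0., 0., 0.])
np.linalg.norm(onehot('salad') - onehot('cake'))  # sqrt(2), as for any distinct pair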
Embedding

embedding('salad') = [3.28, −0.45, ..., 7.11] ∈ R^d

Continuous and dense
Can represent a huge vocabulary in low dimension, typically d ∈ {16, 32, ..., 4096}
Axes have no meaning a priori
Embedding metric can capture semantic distance

Neural Networks compute transformations on continuous vectors
Implementation with Keras

Size of vocabulary n = |V|, size of embedding d

# input: batch of integers
Embedding(output_dim=d, input_dim=n, input_length=1)
# output: batch of float vectors

Equivalent to one-hot encoding multiplied by a weight matrix W ∈ R^(n×d):

embedding(x) = onehot(x) ⋅ W

W is typically randomly initialized, then tuned by backprop
W are trainable parameters of the model
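A small NumPy sketch of this equivalence (the sizes are illustrative): the one-hot matrix product and the row lookup performed by an Embedding layer give the same vector.

import numpy as np

n, d = 6, 3                                 # |V| = 6 symbols, embeddings of size 3
W = np.random.uniform(-1, 1, size=(n, d))   # the trainable weight matrix

def onehot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

x = 2                                       # integer id of a symbol
embedding_dense = onehot(x, n) @ W          # onehot(x) . W
embedding_lookup = W[x]                     # what Embedding layers actually compute
assert np.allclose(embedding_dense, embedding_lookup)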
Distance and similarity in Embedding space

Euclidean distance:
d(x, y) = ||x − y||₂
Simple with good properties
Dependent on norm (embeddings usually unconstrained)

Cosine similarity:
cosine(x, y) = x ⋅ y / (||x|| ⋅ ||y||)
Angle between points, regardless of norm
cosine(x, y) ∈ [−1, 1]
Expected cosine similarity of random pairs of vectors is 0
Distance and similarity in Embedding space

If x and y both have unit norms:

||x − y||₂² = 2 ⋅ (1 − cosine(x, y))

or alternatively:

cosine(x, y) = 1 − ||x − y||₂² / 2

In practice, the unnormalized dot product is also used as a pseudo-similarity.
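A NumPy sketch of these quantities, checking the unit-norm identity above on normalized random vectors (the dimension is arbitrary):

import numpy as np

def euclidean(x, y):
    return np.linalg.norm(x - y)

def cosine(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

x, y = np.random.randn(16), np.random.randn(16)
dot = x @ y                                 # unnormalized pseudo-similarity

# With unit norms: ||x - y||_2^2 == 2 * (1 - cosine(x, y))
xn, yn = x / np.linalg.norm(x), y / np.linalg.norm(y)
assert np.isclose(euclidean(xn, yn) ** 2, 2 * (1 - cosine(xn, yn)))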
Visualizing Embeddings
Visualizing requires a projection in 2 or 3 dimensions
Objective: visualize which embedded symbols are similar
PCA
Limited by linear projection, embeddings usually have complex
high dimensional structure
t-SNE
Visualizing Data using t-SNE, L. van der Maaten and G. Hinton, Journal of Machine Learning Research, 2008
t-Distributed Stochastic Neighbor Embedding
Unsupervised, low-dimension, non-linear projection
Optimized to preserve relative distances between nearest
neighbors
Global layout is not necessarily meaningful
t-SNE projection is non-deterministic (it depends on the initialization)
Critical parameter: perplexity, usually set to 20-30
See http://distill.pub/2016/misread-tsne/
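A minimal scikit-learn sketch, assuming embeddings is an (n_symbols, d) array of learned vectors (here a random stand-in):

import numpy as np
from sklearn.manifold import TSNE

embeddings = np.random.randn(1000, 64)      # stand-in for learned embeddings

# Non-linear 2D projection; the output varies with the random initialization
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
embeddings_2d = tsne.fit_transform(embeddings)   # shape (1000, 2)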
Example word vectors
excerpt from work by J. Turian on a model trained by R. Collobert et al. 2008
Visualizing MNIST
Dropout Regularization
Regularization
Size of the embeddings
Depth of the network
L2 penalty on embeddings
Dropout
Randomly set activations to 0 with probability p
Bernoulli mask sampled for a forward pass / backward pass pair
Typically only enabled at training time
Dropout
Dropout: A Simple Way to Prevent Neural Networks from Overfitting, Srivastava et al.,
Journal of Machine Learning Research 2014
Dropout
Interpretation
Reduces the network's dependency on individual neurons
More redundant representation of data

Ensemble interpretation
Equivalent to training a large ensemble of shared-parameter, binary-masked models
Each model is only trained on a single data point
Dropout
At test time, dropout is disabled and the weights are multiplied by the keep probability (1 − p) to preserve the expected level of activation
Dropout: A Simple Way to Prevent Neural Networks from Overfitting, Srivastava et al.,
Journal of Machine Learning Research 2014
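A NumPy sketch of this train/test asymmetry, with p the drop probability as defined earlier (our own illustration, not the course code):

import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                     # drop probability

def dropout_train(activations):
    # Fresh Bernoulli mask for each forward/backward pass pair
    mask = rng.random(activations.shape) >= p
    return activations * mask

def dropout_test(activations):
    # No masking at test time: rescale by the keep probability instead,
    # so the expected activation level matches training
    return activations * (1 - p)

Note that modern frameworks usually implement "inverted dropout" instead: activations are divided by the keep probability at training time, so no rescaling is needed at test time.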
Overfitting Noise
A bit of Dropout
Too much: Underfitting
Implementation with Keras

from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(hidden_size, input_shape=input_shape, activation='relu'))
model.add(Dropout(0.5))   # drop 50% of the activations during training
model.add(Dense(hidden_size, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(output_size, activation='softmax'))
Recommender Systems
Recommender Systems

Recommend content and products
Movies on Netflix and YouTube, weekly playlists and related artists on Spotify, books on Amazon, related apps on app stores, "Who to Follow" on Twitter...

Prioritized social media status updates
Personalized search engine results
Personalized ads and real-time bidding (RTB)
RecSys 101

Content-based vs Collaborative Filtering (CF)
Content-based: user metadata (gender, age, location...) and item metadata (year, genre, director, actors...)
Collaborative Filtering: past user/item interactions: stars, plays, likes, clicks
Hybrid systems: CF + metadata to mitigate the cold-start problem
Explicit vs Implicit Feedback
Explicit: positive and negative feedback
Examples: review stars and votes
Regression metrics: Root Mean Squared Error (RMSE),
Mean Absolute Error (MAE)...
Implicit: positive feedback only
Examples: page views, plays, comments...
Ranking metrics: ROC AUC, precision at rank, NDCG...
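A scikit-learn sketch contrasting the two evaluation regimes above (the arrays are illustrative):

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, roc_auc_score

# Explicit feedback: predict ratings, evaluate with regression metrics
true_ratings = np.array([5.0, 3.0, 1.0, 4.0])
pred_ratings = np.array([4.5, 3.5, 2.0, 4.0])
rmse = np.sqrt(mean_squared_error(true_ratings, pred_ratings))
mae = mean_absolute_error(true_ratings, pred_ratings)

# Implicit feedback: score interactions, evaluate with ranking metrics
clicked = np.array([1, 0, 1, 0, 1])          # positive-only signal
scores = np.array([0.9, 0.2, 0.7, 0.4, 0.6])
auc = roc_auc_score(clicked, scores)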
Explicit vs Implicit Feedback

Implicit feedback is much more abundant than explicit feedback

Explicit feedback does not always reflect actual user behaviors:
a self-declared independent movie enthusiast may watch a majority of blockbusters

Implicit feedback can be negative:
Page view with very short dwell time
Click on the "next" button

The distribution of implicit (and explicit) feedback is impacted by UI/UX changes and by the RecSys deployment itself
Matrix Factorization for CF

L(U, V) = ∑_{(i,j)∈D} ||r_{i,j} − u_i^T ⋅ v_j||₂² + λ(||U||₂² + ||V||₂²)

with r_{i,j} the observed ratings, u_i the embedding of user i and v_j the embedding of item j

Train U and V on the observed ratings data r_{i,j}
Use U^T V to find the missing entries of the sparse rating data matrix R
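A hedged Keras sketch of this factorization using embedding layers (the sizes, names and regularization strength are illustrative; the Keras 2 functional API is assumed):

from keras.layers import Input, Embedding, Flatten, Dot
from keras.models import Model
from keras.regularizers import l2

n_users, n_items, d = 1000, 500, 32

user_id = Input(shape=(1,))
item_id = Input(shape=(1,))

# U and V as embedding tables, with the L2 penalty lambda from the loss above
u = Flatten()(Embedding(n_users, d, embeddings_regularizer=l2(1e-6))(user_id))
v = Flatten()(Embedding(n_items, d, embeddings_regularizer=l2(1e-6))(item_id))

rating = Dot(axes=1)([u, v])                # u_i^T . v_j predicts r_ij
model = Model(inputs=[user_id, item_id], outputs=rating)
model.compile(optimizer='adam', loss='mse')
# model.fit([user_ids, item_ids], ratings, ...) on observed ratings only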
Architecture and Regularization
RecSys with Explicit Feedback

(architecture diagram with user and item inputs)
Deep RecSys Architecture

(deeper architecture diagram with user and item inputs)
Deep RecSys with metadata

(architecture diagram with user and item inputs plus user and item metadata)
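A possible Keras sketch of such an architecture (the metadata dimensions and layer sizes are our own assumptions):

from keras.layers import Input, Embedding, Flatten, Concatenate, Dense
from keras.models import Model

n_users, n_items, d = 1000, 500, 32
user_meta_dim, item_meta_dim = 8, 16        # e.g. age/location, genre/year features

user_id = Input(shape=(1,))
item_id = Input(shape=(1,))
user_meta = Input(shape=(user_meta_dim,))
item_meta = Input(shape=(item_meta_dim,))

u = Flatten()(Embedding(n_users, d)(user_id))
v = Flatten()(Embedding(n_items, d)(item_id))

# Concatenate embeddings with metadata, then predict the rating with an MLP
h = Concatenate()([u, v, user_meta, item_meta])
h = Dense(64, activation='relu')(h)
rating = Dense(1)(h)

model = Model(inputs=[user_id, item_id, user_meta, item_meta], outputs=rating)
model.compile(optimizer='adam', loss='mse')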
Implicit Feedback: Triplet loss

(architecture diagram with user, positive item and negative item inputs)
Deep Triplet Networks

(deep architecture diagram with user, positive item and negative item inputs)
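A hedged sketch of a margin-based triplet network in Keras (a common formulation; the margin value and layer names are illustrative):

import keras.backend as K
from keras.layers import Input, Embedding, Flatten, Dot, Lambda
from keras.models import Model

n_users, n_items, d, margin = 1000, 500, 32, 1.0

user_id = Input(shape=(1,))
pos_item_id = Input(shape=(1,))
neg_item_id = Input(shape=(1,))

user_emb = Embedding(n_users, d)
item_emb = Embedding(n_items, d)            # shared between item+ and item-

u = Flatten()(user_emb(user_id))
vp = Flatten()(item_emb(pos_item_id))
vn = Flatten()(item_emb(neg_item_id))

pos_sim = Dot(axes=1)([u, vp])
neg_sim = Dot(axes=1)([u, vn])

# Hinge loss: push the positive similarity above the negative one by a margin
triplet_loss = Lambda(lambda s: K.maximum(0.0, margin - s[0] + s[1]))([pos_sim, neg_sim])

model = Model(inputs=[user_id, pos_item_id, neg_item_id], outputs=triplet_loss)
# The output is the loss itself: minimize its mean against a dummy target
model.compile(optimizer='adam', loss=lambda y_true, y_pred: K.mean(y_pred))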
Training a Triplet Model

Gather a set of positive pairs of user i and item j
While the model has not converged:
    Shuffle the set of pairs (i, j)
    For each (i, j):
        Sample an item k uniformly at random
        Call item k a negative item for user i
        Train the model on the triplet (i, j, k)
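A Python sketch of this loop, assuming the triplet model above and arrays pos_user_ids, pos_item_ids holding the positive pairs:

import numpy as np

rng = np.random.default_rng(0)
n_epochs = 10                               # "while not converged", simplified

for epoch in range(n_epochs):
    # Shuffle the set of positive pairs (i, j)
    perm = rng.permutation(len(pos_user_ids))
    users, pos_items = pos_user_ids[perm], pos_item_ids[perm]

    # Sample one negative item k uniformly at random for each positive pair
    neg_items = rng.integers(0, n_items, size=len(users))

    # Train on the triplets (i, j, k); the target is a dummy vector
    dummy = np.zeros(len(users))
    model.fit([users, pos_items, neg_items], dummy, batch_size=64, epochs=1)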
Deep Neural Networks for YouTube Recommendations
https://research.google.com/pubs/pub45530.html
Ethical Considerations of Recommender Systems
Ethical Considerations

Amplification of existing discriminatory and unfair behaviors / biases
Example: gender bias in ad clicks (fashion / jobs)
Using the first name as a predictive feature

Amplification of the filter bubble and opinion polarization
Personalization can amplify "people only follow people they agree with"
Optimizing for "engagement" promotes content that causes strong emotional reactions (and turns normal users into haters?)
RecSys can exploit the weaknesses of some users and lead to addiction
Addicted users' clicks are then over-represented in future training data
Call to action

Designing Ethical Recommender Systems
Wise modeling choices (e.g. use of "first name" as a feature)
Conduct internal audits to detect fairness issues: SHAP, Integrated Gradients, fairlearn.org
Learning representations that enforce fairness?

Transparency
Educate decision makers and the general public
How to allow users to assess fairness by themselves?
How to allow for independent audits while respecting the privacy of users?
Active Area of Research