Deep Learning Lectures - 3

Recommender Systems & Embeddings

Charles Ollion - Olivier Grisel

Outline

Embeddings

Dropout Regularization

Recommender Systems

Embeddings

Symbolic variable

Text: characters, words, bigrams...

Recommender Systems: item ids, user ids

Any categorical descriptor: tags, movie genres, visited URLs, skills on a resume, product categories...

Notation: symbol s in vocabulary V

One-hot representation

onehot('salad') = [0, 0, 1, ..., 0] ∈ {0, 1}^|V|

[figure: one-hot vector with one axis per vocabulary word: pig, carrot, salad, door, roof, cake]

Sparse, discrete, large dimension |V|

Each axis has a meaning

Symbols are equidistant from each other: Euclidean distance = √2

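A minimal numpy sketch of this encoding, using the toy vocabulary above:

import numpy as np

vocabulary = ['pig', 'carrot', 'salad', 'door', 'roof', 'cake']

def onehot(symbol):
    # The symbol's index in the vocabulary gives the position of the 1
    vec = np.zeros(len(vocabulary))
    vec[vocabulary.index(symbol)] = 1.0
    return vec

print(onehot('salad'))   # [0. 0. 1. 0. 0. 0.]
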
Embedding

embedding('salad') = [3.28, −0.45, ..., 7.11] ∈ ℝ^d

Continuous and dense

Can represent a huge vocabulary in low dimension, typically: d ∈ {16, 32, ..., 4096}

Axes have no meaning a priori

Embedding metric can capture semantic distance

Neural Networks compute transformations on continuous vectors

Implementation with Keras

Size of vocabulary n = |V|, size of embedding d

from tensorflow.keras.layers import Embedding

# input: batch of integers
Embedding(output_dim=d, input_dim=n, input_length=1)
# output: batch of float vectors

Equivalent to one-hot encoding multiplied by a weight matrix W ∈ ℝ^(n×d):

embedding(x) = onehot(x) ⋅ W

W is typically randomly initialized, then tuned by backprop

The entries of W are trainable parameters of the model

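The equivalence can be checked directly in numpy (sizes here are illustrative):

import numpy as np

n, d = 6, 3                      # toy vocabulary and embedding sizes
W = np.random.randn(n, d)        # the trainable weight matrix

x = 2                            # integer id of a symbol, e.g. 'salad'
onehot_x = np.eye(n)[x]

# Multiplying the one-hot vector by W just selects row x of W,
# so the layer can be implemented as a cheap row lookup
assert np.allclose(onehot_x @ W, W[x])

In practice the layer performs the row lookup directly and never materializes the one-hot vectors.
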
Distance and similarity in Embedding space

Euclidean distance:

d(x, y) = ||x − y||₂

Simple with good properties

Dependent on norm (embeddings usually unconstrained)

Cosine similarity:

cosine(x, y) = (x ⋅ y) / (||x|| ⋅ ||y||)

Angle between points, regardless of norm

cosine(x, y) ∈ [−1, 1]

Expected cosine similarity of random pairs of vectors is 0

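Both quantities are one-liners in numpy (a minimal sketch):

import numpy as np

def euclidean(x, y):
    return np.linalg.norm(x - y)

def cosine(x, y):
    # Angle between the vectors, independent of their norms
    return (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
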
Distance and similarity in Embedding space

If x and y both have unit norms:

||x − y||₂² = 2 ⋅ (1 − cosine(x, y))

or alternatively:

cosine(x, y) = 1 − ||x − y||₂² / 2

The (unnormalized) dot product is also used in practice as a pseudo similarity

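A quick numpy check of this identity (the dimension 8 is arbitrary):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=8); x /= np.linalg.norm(x)   # unit-norm vectors
y = rng.normal(size=8); y /= np.linalg.norm(y)

cos = x @ y                        # dot product of unit vectors is the cosine
sq_dist = np.sum((x - y) ** 2)
assert np.isclose(sq_dist, 2 * (1 - cos))
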
Visualizing Embeddings

Visualizing requires a projection in 2 or 3 dimensions
Objective: visualize which embedded symbols are similar

PCA

Limited by linear projection; embeddings usually have complex high dimensional structure

t-SNE

Visualizing data using t-SNE, L. van der Maaten, G. Hinton, The Journal of Machine Learning Research, 2008

t-Distributed Stochastic Neighbor Embedding

Unsupervised, low-dimension, non-linear projection
Optimized to preserve relative distances between nearest neighbors
Global layout is not necessarily meaningful

t-SNE projection is non-deterministic (depends on initialization)

Critical parameter: perplexity, usually set between 20 and 30

See http://distill.pub/2016/misread-tsne/

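A minimal scikit-learn sketch; the embeddings array below is a random stand-in for real learned vectors (e.g. the weights of a Keras Embedding layer):

import numpy as np
from sklearn.manifold import TSNE

embeddings = np.random.randn(1000, 64)     # placeholder (n_symbols, d) matrix

tsne = TSNE(n_components=2, perplexity=30, init='pca', random_state=0)
coords_2d = tsne.fit_transform(embeddings)  # (1000, 2), ready to scatter-plot
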
Example word vectors

[figure: 2D map of word vectors, excerpt from work by J. Turian on a model trained by R. Collobert et al., 2008]

Visualizing MNIST

[figure: t-SNE projection of MNIST digits]

Dropout Regularization

Regularization

Size of the embeddings

Depth of the network

L2 penalty on embeddings

Dropout

Randomly set activations to 0 with probability p

Bernoulli mask sampled for a forward pass / backward pass pair
Typically only enabled at training time

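A numpy sketch of the training-time masking described above (shapes are illustrative):

import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                   # dropout probability

activations = rng.normal(size=(4, 8))     # a toy batch of activations

# One Bernoulli mask per forward/backward pass pair:
# each unit is kept with probability 1 - p, zeroed with probability p
mask = rng.binomial(1, 1 - p, size=activations.shape)
dropped = activations * mask
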
Dropout

[figure: dropout illustration from the cited paper]

Dropout: A Simple Way to Prevent Neural Networks from Overfitting, Srivastava et al., Journal of Machine Learning Research, 2014

Dropout
Interpretation

Reduces the network's dependency on individual neurons

More redundant representation of data

Ensemble interpretation

Equivalent to training a large ensemble of shared-parameter, binary-masked models
Each model is only trained on a single data point

Dropout

At test time, multiply weights by the keep probability (1 − p in our notation) to keep the same expected level of activation

Dropout: A Simple Way to Prevent Neural Networks from Overfitting, Srivastava et al., Journal of Machine Learning Research, 2014

[figure sequence: "Overfitting Noise" (no dropout), "A bit of Dropout", "Too much: Underfitting"]

Implementation with Keras

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(hidden_size, input_shape=input_shape, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(hidden_size, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(output_size, activation='softmax'))

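Note that modern Keras implements "inverted" dropout: kept activations are scaled up by 1 / (1 − rate) at training time, so no weight rescaling is needed at test time and the layer behaves as the identity during inference.
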
Recommender Systems

Recommender Systems
Recommend contents and products

Movies on Netflix and YouTube, weekly playlists and related artists on Spotify, books on Amazon, related apps on app stores, "Who to Follow" on Twitter...

Prioritized social media status updates

Personalized search engine results

Personalized ads and RTB

RecSys 101
Content-based vs Collaborative Filtering (CF)

Content-based: user metadata (gender, age, location...) and item metadata (year, genre, director, actors)

Collaborative Filtering: past user/item interactions: stars, plays, likes, clicks

Hybrid systems: CF + metadata to mitigate the cold-start problem

Explicit vs Implicit Feedback
Explicit: positive and negative feedback

Examples: review stars and votes

Regression metrics: Root Mean Squared Error (RMSE), Mean Absolute Error (MAE)...

Implicit: positive feedback only

Examples: page views, plays, comments...

Ranking metrics: ROC AUC, precision at rank, NDCG...

Explicit vs Implicit Feedback
Implicit feedback is much more abundant than explicit feedback

Explicit feedback does not always reflect actual user behavior

Self-declared independent movie enthusiasts may watch a majority of blockbusters

Implicit feedback can be negative

Page view with very short dwell time

Click on "next" button

Implicit (and explicit) feedback distributions are impacted by UI/UX changes and by the RecSys deployment itself.

Matrix Factorization for CF

L(U, V) = Σ_{(i,j)∈D} ||r_ij − u_iᵀ ⋅ v_j||₂² + λ(||U||₂² + ||V||₂²)

with r_ij the observed rating, u_i the user embedding and v_j the item embedding

Train U and V on observed ratings data r_ij

Use the dot products u_iᵀ ⋅ v_j to fill missing entries in the sparse rating data matrix R

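A minimal Keras sketch of this factorization; the sizes, regularization strength and optimizer below are illustrative choices, not prescribed by the slide:

from tensorflow.keras import layers, regularizers, Model

n_users, n_items, d = 1000, 500, 32       # illustrative sizes

user_id = layers.Input(shape=(1,), dtype='int32')
item_id = layers.Input(shape=(1,), dtype='int32')

# L2-penalized user and item embeddings (the lambda terms of the loss)
u = layers.Embedding(n_users, d, embeddings_regularizer=regularizers.l2(1e-6))(user_id)
v = layers.Embedding(n_items, d, embeddings_regularizer=regularizers.l2(1e-6))(item_id)

# Predicted rating for pair (i, j) is the dot product u_i . v_j
pred = layers.Flatten()(layers.Dot(axes=-1)([u, v]))

model = Model(inputs=[user_id, item_id], outputs=pred)
model.compile(optimizer='adam', loss='mse')   # squared error on observed ratings
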
Architecture and Regularization

RecSys with Explicit Feedback

[architecture diagram: user and item inputs]

Deep RecSys Architecture

[architecture diagram: user and item inputs]

Deep RecSys with metadata

[architecture diagram: user and item inputs, plus user metadata and item metadata]

Implicit Feedback: Triplet loss

[architecture diagram: user, item+ and item- inputs]

Deep Triplet Networks

[architecture diagram: user, item+ and item- inputs]

Training a Triplet Model

Gather a set of positive pairs: user i and item j

While model has not converged:

    Shuffle the set of pairs (i, j)
    For each (i, j):
        Sample item k uniformly at random
        Call item k a negative item for user i
        Train model on triplet (i, j, k)

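A sketch of the margin-based loss commonly used on such triplets; the margin value and dot-product scoring are assumptions, the slides do not fix them:

import numpy as np

def triplet_loss(user, pos_item, neg_item, margin=1.0):
    # Push the positive item's score above the negative's by at least `margin`
    pos_score = user @ pos_item     # similarity with positive item j
    neg_score = user @ neg_item     # similarity with sampled negative k
    return max(0.0, margin - pos_score + neg_score)
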
Deep Neural Networks for YouTube Recommendations
https://research.google.com/pubs/pub45530.html

Ethical Considerations of Recommender Systems

Ethical Considerations
Amplification of existing discriminatory and unfair behaviors / bias

Example: gender bias in ad clicks (fashion / jobs)

Using the firstname as a predictive feature

Amplification of the filter bubble and opinion polarization

Personalization can amplify "people only follow people they agree with"
Optimizing for "engagement" promotes content that causes strong emotional reactions (and turns normal users into haters?)
RecSys can exploit weaknesses of some users and lead to addiction
Addicted users' clicks are over-represented in future training data

Call to action
Designing Ethical Recommender Systems

Wise modeling choices (e.g. use of "firstname" as a feature)

Conduct internal audits to detect fairness issues: SHAP, Integrated Gradients, fairlearn.org
Learning representations that enforce fairness?

Transparency

Educate decision makers and the general public

How to allow users to assess fairness by themselves?
How to allow for independent audits while respecting the privacy of users?

Active Area of Research