NLP Reading Notes
December 20, 2017
1 Neural Networks - Ch1. Introduction
• supervised machine learning: algorithms that attempt to infer usage patterns and regularities from a set
of pre-annotated input and output pairs
• Deep Learning/Neural Networks: learning of parameterized differentiable mathematical functions - often
many layers of these differentiable functions are chained together
• embedding layer: a mapping of discrete symbols to continuous vectors ina relatively low dimensional space
• Two Major kinds of neural network architectures: feed-forward networks and recurrent/recursive networks
• Feed-Forward Networks, in particular multi-layer perceptrons (MLPs) allow to work with fixed sized
inputs, or with variable length inputs in which we can disregard the order of the elements. When feeding the
network with a set of input components, it learns to combine them in a meaningful way.
– MLPs can be used wherever a linear model can
– Nonlinear
– easily integrate pre-0trained word embeddings
• Convolutional Feed-forward networks are specialized architectures that excel at extracting local patterns
in the data - sensitive word order. Work well for identifying indicative phrases or idioms up to a fixed length
in long documents.
• Recurrent Neural Networks(RNNs) are specialized models for sequential data. These are network com-
ponents that take as input a sequence of items, and produce a fixed size vector that summarized that sequence.
Rarely used as stand-alone but rather for input transformers.
• Language-Modeling: the task of predicting the probability of the next word in a sequence
• Structured Problems In NLP: Problems requiring the production of complex output structures such as
sequence or trees.
• Neural Network models can accommodate structured NLP problems by adapting structured-prediction
algorithms or sequence-to-sequence (encoder-decoder) models.
1
December 20, 2017
2 Neural Networks - Ch.2 Learning Basics and Linear Models
Supervised Learning and Parameterized Functions
• Hypothesis Classes: Families of functions, e.g., the space of all linear functions with din inputs and dout
outputs, or the space of all decision trees over din variables
• inductive bias: A set of assumptions about the form of the desired solution
• Hypothesis Class of High-Dimensional Linear Functions: Functions of the form
f (x) = x · W + b
x∈R din
W ∈ Rdin ×dout b ∈ Rdout
Here, the vector x is the input to the function, while the matrix W and the vector b are the parameters
• linear models have several desired properties: they are easy and efficient to train, they often result in convex
optimization objectives, the trained models are somewhat interpretable, and they are often very effective in
practice.
Train, Test, and Validation Sets
• leave-one-out cross-validation: train k functions f1:k , each time leaving out a different input example xi ,
and evaluating the resulting function fi () on its ability to predict xi . Then train another function f () on the
entire trainings set xi:k . Assuming that the training set is a representative sample of the population, this
percentage of functions fi () that produced correct prediction on the left-out samples is a good approximation
of the accuracy of f () on new inputs.
– very costly in terms of computation time
– used only in cases where the number of annotated examples is very small
• Held-out Set: Split the training set into two subsets, say in a 80%/20% split, train a model on the large set
(the training set) and test its accuracy on the small subset (the held-out set).
– Wasteful in terms of training samples
• Three-Way Split: Split the data into train, validation (also called development), and test sets. This gives
you two heldout sets: a validation set (also called development set) and a test set.
– All the experiments, tweaks, error analysis, and model selection should be performed based on the
validation set.
– A single run of the final model over the test set will give a good estimate of its expected quality on
unseen examples
Linear Models
• Binary Classification: Single output. dout = 1, w is a vector and b a scalar
f (x) = x · w + b
• Sign Function: Function that maps negative values to -1 (negative class) and non-negative values to +1
(positive class)
• A dataset is (binary) linearly separable: the two classes can be separated by a straight line
• Nonlinearly separable: Not possible to separate the data-points using a straight line. Beyond the hypothesis
class of linear classifiers.
• Feature: A property by which we classify the data points
• Feature Extraction: Function that maps a real world object to a vector of measurable quantities which can
be used as inputs to the model
• Feature Engineering: The design of the feature functions
• Deep learning promises to simplify the feature-engineering process by allowing the model designer to specify
a small set of core, basic, or ”natural” features, and letting the trainable neural network architecture combine
them into more meaningful higher-level features, or representations
2
December 20, 2017
• Sigmoid Function: Maps values from [−∞, ∞] to the range [0,1], by squashing output:
1
σ=
1 + exp −x
– It is monotonically increasing, with 0 being mapped to 21 .
– When used with a suitable loss function, binary predictions can be interpreted as class membership
probability estimates
– Closer value is to 0 or 1, the more certain the model is in prediction, with 0.5 indicating uncertainty
• Multi-class Classification: Classification problem in which we should assign an example to one of k classes
Representations
• Representation: (of document), vector that captures the properties of document that are important
One-Hot and Dense Vector Representations
• One-hot Vector: All entries are zero except a single entry.
• Bag-of-words (BOW): contains information about the identities of all the ”words” of the document, without
considering their order
• Continuous Bag of Words (CBOW): Representation composed of a sum of word representations, where
each ”word” representation is a low-dimensional, continuous vector.
Log-Linear Multi-Class Classification
• Softmax Function: forces the values in ŷ to be positive and sum to 1, making them interpretable as a
probability distribution
ex[i]
sof tmax(x)[i] = P x[j]
je
Training As Optimization
• loss function: function for quantifying the loss suffered when predicting ŷ while the true label is y.
– A loss function L(ŷ, y) assigns a numerical score (a scalar) to a predicted output ŷ given the true expected
output y.
– Bounded from below, with the minimum attained only for cases where prediction is correct.
• The parameters of the learned function (W and b) are set in order to minimize the loss L over the training
examples
• Regularization Term: A function R(Θ) that takes as input that parameters and returns a scalar that
reflects their ”complexity”. Low regularization helps counter over-fiting
• Hinge Loss: Loss is 0 when y and ỹ share the same sign and |ỹ| ≥ 1. Otherwise, the loss is linear.
• Hinge (Multi-class) Loss: Denote t = argmaxi y[i] the correct class, and by k = argmaxi6=t ŷ[i] the highest
scoring class such that k 6= t. The multi-class hinge loss is defined as:
Lhinge(multi−class) (ŷ, y) = max(0, 1 − (ŷ[t] − ŷ[k] ))
• Log Loss: A ”soft”version of the hinge loss with an infinite margin.
• Binary Cross Entropy: Also referred to as logistic loss is used in binary classification with conditional
probability outputs.
• Categorical Cross-Entropy Loss: The categorical cross-entropy loss (also referred to as a negative log
likelihood) is used when a probabilistic interpretation of the scores is desired.
• Ranking Loss: Setting where not given supervision in terms of labels, but rather as pairs of correct and
incorrect items x and x0 , and our goal is to score correct items above incorrect ones.
3
December 20, 2017
• In practice, the regularizers R equate complexity with large weights, and work to keep the parameter values
low.
• Regularizers R measure the norms of the parameter matrices, and drive the learner toward solutions with low
norms
• L2 Regularization: R takes the form of the squared L2 norm of the parameters, trying to keep the sum of
the squares of the parameter values low:
• L1 Regularization: R takes the form of the L1 norm of the parameters, trying to keep the sum of the
absolute values of the parameters low.
• Elastic Net Regularization: Combines both L1 and L2 regularization in weighted sum.
• Dropout: Another form of regularization which is very effective in neural networks.
Gradient-Based Optimization
• Gradient-based methods work by repeatedly computing an estimate of the loss L over the training set, com-
puting the gradients of the parameters Θ with respect to the loss estimate, and moving the parameters in the
opposite directions of the gradient.
• Convex Function: A function whose second-derivative is always non-negative, and hence has a single
minimum point
• Concave Function: A function whose second-derivative is always non-positive, and hence has a single
maximum point
• For functions that are neither convex nor concavee, a gradient-based optimization procedure may converge to
a local extremum point, missing the global optimum.
• SGD: Stochastic Gradient Descent, a general optimization algorithm
– Goal is to set the parameters of Θ so as to minimize the total loss over the training set.
– Works by repeatedly sampling a training example and computing the gradient of the error on the example
with respect to the parameters Θ.
– The loss is treated as a function of the parameters Θ.
– The parameters of Θ are updated in the opposite direction of the gradient, scaled by a learning rate ηt
4
December 20, 2017
3 Neural Networks - Ch.3 From Linear Models to Multi-layer Percep-
trons
Limitations of Linear Models: The Xor Problem
• The hypothesis class of linear (and log-linear) models is restricted - cannot represent the XOR function.
Nonlinear Input Transformations
• One can successfully train a linear classifier over a dataset which is not linearly separable by defining a function
that will map the data to a representation in which it is linearly separable, and then train a linear classifier
on the resulting representation
• Often to make data linearly separable,one needs to map it to a space with higher dimensions
Kernel Methods
• SVM: (Kernelized) Support Vector Machines
• SVMS define a set of generic mappings of data into high dimensional spaces, and then perform linear classi-
fication in the transformed space.
• Polynomial Mapping: φ(x) = xd is example of mapping that gives all combinations of d variables. For
d = 2, can solve XOR.
• High-dimensional spaces can become computationally prohibitive and risk overfitting
• Kernel Trick: allows one to work in transformed space without ever computing the transformed represen-
tation. Makes classification procedure for SVMs dependent linearly on size of training set.
Trainable Mapping Functions
• Trainable Nonlinear Mapping Function: Mapping function that is trained in conjunction with the linear
classifier.
• Finding the suitable representation becomes the responsibility of the training algorithm.
• Mapping function can take the form of a parameterized linear model, followed by a nonlinear activation
function g that is applied to each of the output dimensions.
ŷ = φ(x)W + b
φ(x) = g(xW 0 + b0 )
• This is the main idea behind deep learning and neural networks, specifically multi-layer perceptions (MLP.
5
December 20, 2017
4 Neural Networks - Ch.4 Feed-forward Neural Networks
A Brain-Inspired Metaphor
• For every neuron, each input has an associated weight. Neuron multiplies each input by its weight, and then
sums them, applies a nonlinear function to the result, and passes it to its output
• Neurons are arranged in layers, reflecting the flow of information.
• Bottom layer has no incoming arrows and is the input
• Top-most layer has no outgoing arrows and is the output
• Other layers of network are ”hidden layers”
• Fully Connected Layer / Affine Layer: Each neuron is connected to all of he neurons in the next layer
• A feed-forward network is just a stack of linear models separated by nonlinear functions.
• The values of each row in the network can be thought of as a vector
• A fully connected layer implements a vector-matrix multiplication, h = xW where the weight of the connection
from the ith neuron in the input row to the jth neuron in the output row is W[i,j] . The values of h are then
transformed by a nonlinear function g that is applied to each value before being passed on as input to the
next layer.
In Mathematical Notation
• Perceptron: The simplest neural network.Simply a linear model:
N NP erceptron (x) = xW + b
x∈R din
W ∈ Rdin ×dout b ∈ Rdout
where W is the weight matrix and b is the bias term.
• A Multi-Layer Perceptron with one hidden-layer (MLP1) is a feed-forward neural network with the form:
N NP erceptron (x) = g(xW 1 + b1 )W 2 + b2
• Activation Function: g, a nonlinear function that is applied element-wise
• Each hidden layer is followed by a nonlinear activavtion
• Deep Networks: Networks with several hidden layers
• For a fully connected layer l(x) = xW + b with input dimensionality din and output dimensionality dout , the
dimensions of x is 1 × din , of W is din × dout and of b is 1 × dout .
• The matrices and bias terms are the parameters of the network.
• The loss function of multi-layer neural networks with respect to their parameters is not convex, but gradient-
based optimization methods can still be applied and perform very well in practice.
Representation Power
• Theoretically, don’t need to go beyond MLP1 but the dimensionality is exponentially high and learnability is
bad.
Common Nonlinearities
• Sigmoid: Transforms each value x into the range [0, 1].
• Hyperbolic Tangent (tanh): Transforms the values x into the range [−11].
• Hard tanh: Approximation of the tanh function which is faster to compute and to find derivatives thereof.
• Rectifier (ReLU): (Rectified Linear Unit), clips each value x¡0 at 0. Performs well combined with dropout
regularization.
6
December 20, 2017
Loss Functions
• L(ŷ, y) assigns a numerical score (a scalar) to the network’s output ŷ given the true expected output y.
Regularization and Dropout
• Common Regularizers: L1 , L2 , elastic net
• L2 regularization, also called weight decay, is effective for achieving good generalization performance.
• Dropout Training: Designed to prevent the network from learning to rely on specific weights. Works by
randomly dropping (setting to 0) half of the neurons in the network (or in a specific layer) in each training
example in the stochastic-gradient training. (Do this by adding random masking vectors in between layers
during training).
Similarity and Distance Layers
• Sometimes we want scalar values based on two vectors that reflect similarity, compatibility, or distance.
• Common functions that take in two vectors and return a scalar: dot product and euclidean distance.
• Trainable Forms: Unlike the fixed functions of dot product and euclidean distance, a parameterized function
that can be trained to produced desired similarity (or dissimilarity) values by focusing on specific dimensions
of the vectors.
• Bilinear Form: A common trainable similarity function:
simbilinear (u, v) = uM v
where the matrix M is a parameter that needs to be trained.
• Trainable Distance Function:
dist(u, v) = (u − v)M (u − v)
• A multi-layer perceptron with a single output neuron can also be used for producing a scalar from two vectors,
by feeding it the concatenation of the two vectors
Embedding Layers
• Embedding Layer: Layer that performs the mapping from symbolic feature values to d-dimensional vectors
(also called a lookup layer).
7
December 20, 2017
5 Neural Networks - Ch.5 Neural Network Training
• neural networks are differentiable parameterized functions, and are trained using gadient-based optimization
• Gradient computation is simply following the chain-rule of differentiation
• Gradients can be efficiently and automatically computer using the backpropagation algorithm - method-
ically computing the derivatives of a complex expression using the chain rule, while caching intermediary
results.
The Computation Graph Abstraction
• The computation graph abstraction allows us to easily construct arbitrary networks, evaluate their predictions
for given inputs (forward pass), and computer gradients for their parameters with respect to arbitrary scalar
losses (backward pass)
• A computation graph is a representation of an arbitrary mathematical computation - it’s a DAG in which nodes
correspond to mathematical operations or (bound) variables and edges correspond to the flow of intermediary
values between nodes. It defines the order of the computations of the dependencies.
• A neural network is essentially a mathematical expression, so it can be represented as a computation graph
• Forward Pass: Computes the outputs of the nodes in the graph. Each node’s output depends only on itself
and on its incoming edges
• Backward Pass: Begins by designating a node N with scalar (1 × 1) output as a loss-node, and running
forward computation up to that node. The backward computation computes the gradients of the parameters
with respect to that node’s value.
• Dynamic Graph Construction: a different computation graph is created from scratch for each training
sample. Forward and backward propagation are then applied to this graph
• Static Graph Construction: the shape of the computation graph is defined once in the beginning of
the computation with place-holder variables indicating input and output values. Then, an optimizing graph
compiler produces an optimized computation graph, and each training example is fed into the (same) optimized
graph.
• New graph creation for every training example accommodates recurrent and recursive neural networks, in
which the network structure varies between training examples.
• For networks with fixed structures such as MLPs, computing one base computation graph is more efficient.
• As long as the network’s output is a vector (1 × k matrix), it is trivial to compose networks by making the
output of one network the input of another, creating arbitrary networks.
Practicalities
• The non-convexity of the objective function means the optimization procedure may get stuck in a local
minimum or a saddle point, and that starting from different initial points may result in different results.
• Random Restarts: Running the training process several times, each with a different random initialization,
and choosing the best one on the development set.
• Model Ensembles: Basing a prediction on an ensemble of several trained models, rather than on a single
one (done by taking majority vote or averaging output vectors).
• In deep networks, it is common for the error gradients to either vanish (become exceedingly close to 0) or
explode (become exceedingly high)
– Vanishing gradients are open research question, but LSTMs and GRUs assist in gradient flow
– To deal with exploding gradients, clip the gradients if their norm exceeds a given threshold.
• Saturated Neurons: Neurons with output values all close to their upper-limit and with very small gradients.
– Layers with tanh and sigmoid activations can become saturated with output values all close to 1
– Saturaed neurons are caused by too large values entering the layer
– Controlled for by changing initialization, scaling the range of the input values, or changing learning rate
– Layer normalization (normalizing output vector) will help but is computationally expensive
• Dead Neurons: Neurons with output values most or all values negative and thus clipped at zero for all
inputs, resulting in 0 gradient.
– Layers with ReLU activations can’t become saturated, but they can die.
– Caused by all signals entering the layer being negative
8
December 20, 2017
– Reducing the learning rate will help
• batch normalization: Activations at each layer are normalized so that they have mean 0 and variance 1
across each mini-batch.
• Order of training examples is important, so shuffling before each pass is advisable.
• Too large learning rates prevent network from converging on effective solution, but too small learning rates
take a long time to converge.
• Learning Rate Scheduling: Decrease the learning rate as a function of the number of observed minibatches.
• parameter updates occur either every training example (minibatches of size 1) or every k training examples.
9
December 20, 2017
6 Neural Networks - Ch.10 Pre-trained Word Representations
Random Initialization
• When enough supervised training data is available, one can just treat the feature embeddings the same as
other model parameters: initialize the embedding vectors to random values, and let the network-training
procedure tune them
• Method used by Word2Vec is to initalize the word vectors to uniformly sampled random numbers in the range
1 1
[− 2d , 2d ], where d is the number of dimensions
• In practice, random initialization is used to initialize the embedding vectors of commonly occurring features
and supervised or unsupervised pre-training is used to initialize the potentially rare features.
Supervised Task-Specific Pre-Training
• If we are interested in taks A, for which we only have a limited amount of labeled data, but there is an
auxiliary task B for which we have much more labeled data, we may want to pre-train our word vectors so
that they perform well as predictors for task B, and then use the trained vectors for training task A.
Unsupervised Pre-Training
• Want embedding vectors of ”similar” words to have similar vectors.
• Distributional Hypothesis: words are similar if they appear in similar contexts
• An important benefit of training word embeddings on large amounts of unannotated data is that it provides
vector representations for words that do not appear in the supervised traiing set.
• When using pre-trained embeddings, there are 2 choices:
– Should the pre-trained word vectors be used as is, or should each vector be normalized to unit length
– How do we fine-tune the pre-trained vectors for the task? Do we treat embedding matrix E as parameter
and train it?
Word Embedding Algorithms
• Distributed Representation: Each entity is represented as a vector of value (”a pattern of activations”),
and the meaning of the entity and its relation to other entities are captured by the activations in the vector,
and the similarities between different vectors. Each word will be associated with a d-dimensional vector, and
the meaning of the word will be captured by its relation to other words and the activation values in its vector.
• Distributional Semantics: A meaning of a word could be derived from its distribution in a corpus, i.e.,
from the aggregate of the contexts in which it is being used. Words that tend to occur in similar contexts
have similar meanings.
• Word-context Matrices: Captures distributional properties of words. Each row i represents a word, each
column j represents a linguistic context in which words can occur, and a matrix entry M[i,j] quantifies the
strength of association between a word and a context in a large corpus.
• Cosine Similarity: Word similarity metric that measures the cosine of the angle between the corresponding
vectors.
• Pointwise Mutual Information (PMI): P M I(w, c) measures the association between a word w and a
context c by calculating the log of the ration between their joint probability (the frequency in which they
co-occur together) and their marginal probabilities (the frequencies in which they occur individually)
• Single Value Decomposition(SVD): An algebraic technique by which an m × n real or complex matrix
M is factorized into three matrices.
• In a distributed representation, each word is associated with a vector in Rd , where the ”meaning” of the word
with respect to some task is captured in the different dimensions of the vector, as well as in the dimensions
of other words.
• The language modeling task poses two important requirements:
– the need to produce a probability distribution over words
10
December 20, 2017
– the need to condition on contexts that can be combined using the chain-rule of probability to produce
sentence-level probability estimates
• The use of randomly sampled words to produce negative examples of incorrect word-context to drive the
optimization is at the core of the Word2Vec algorithm. item For a multi-word context c1:k , the CBOW
variant of Word2Vec Pk defined the context vector c to be a sum of the embedding vectors of the context
components: c = i=1 ci . It then defines the score to be s(w, c) = w · c
• Skip-Gram: For a k-elements context c1:k , the skip-gram variant assumes that the elements ci in the context
are independent from each other, essentially treating them as k different contexts, i.e., a word-context pair
(w, ci:k ) will be represented in D as k different contexts: (w, c1 ), ..., (w, ck ).
• GLoVe Algorithm: constructs an explicit word-context matrix, and trains the word and context vectors w
and c attempting to satisfy:
w · c + b[w] + b[c] = log#(w, c) ∀(w, c) ∈ D
where b[w] and b[c] are word-specific and context-specific trained biases.
The Choice of Contexts
• The most common approach is a sliding window approach. The middle word is called the focus word and
the m words to each side are the contexts.
• Large windows tend to produce more topical similarities, while smaller windows tend to produce more func-
tional and syntactic similarities.
• position contexts: indicating for each context word also its relative position to the focus words.
Dealing With Multi-Word Units and Word Inflections
• In the multi-toekn units case, one can derive a list of such multi-token items, and replace them in the text
with single entities
• In the inflections case, one can mitigate the problem by pre-processing the corpus by lemmatizing some or all
of the words, embedding the lemmas instead of the inflected forms
• A related pre-processing is POS-tagging the corpus, and replacing the words with (word,POS)
Limitations of Distributional Methods
• Definition of Similarity: the definition of similarity in distributional approaches is completely operational:
words are similar if used in similar contexts. But in practice, there are many facets of similarity.
• Black Sheeps: When using texts as the conditioning contexts, many of the more ”trivial” properties of the
word may not be reflected in the text,and thus not captured in the representation.
• Antonyms: Words that are the opposite of each other. Tend to appear in similar contexts. As a consequence,
models based on the distributional hypothesis tend to judge antonyms as very similar to each other.
• Corpus Biases: The distributional methods reflect the usage patterns in the corpora on which they are
based, and the corpora in turn reflect human biases in the real world.
• Lack of Context: The distributional approaches aggregate the contexts in which a term occurs in a large
corpus. The result is a word representation which is context independent. In reality, there is no such thing
as a context-independent meaning for a word
• Polysemy: Words that have multiple senses.
11
December 20, 2017
7 Neural Networks - Ch.11 Using Word Embeddings
Obtaining Word Vectors
Word Similarity
• to compute similarity between two words using a similarity function over vectors, use cosine similarity,
corresponding to the cosine of the angle between the vectors
Word Clustering
• Word vectors can be easily clustered.
• Clusters can then be used as features in learning algorithms that work with discrete features or require discrete
symbols.
Finding Similar Words
• With row-normalized embeddings matrix, the cosine between two words w1 and w2 is given by:
simcos (w1 , w2 ) = E[w1 ] · E[w2 ]
• Similarity to all other words can be computed by the matrix-vector multiplication s = Ew. The result s is a
vector of similarities, where s[i] is the similarity of w to the ith word in the vocabulary. Can use this to find
k most similar words to a word.
• to find the most similar word to a group of words, we define similarity to a group of words to be the average
similarity to the items in the group. Can be computed as:
s = E(w1 + w2 + · · · wk )/k
Odd-One Out
• Odd-one-Out question can be answered by computing the similarity between each word to the average word
vector of the group, and returning the least similar word.
Short Document Similarity
• To find similarity between documents, represent each document as a bag-of-words, and define the similarity
between the documents to be the sum of the pairwise similarities between the words in the documents
m X
X n
simdoc (D1 , D2 ) = cos(wi1 , wj2 )
i=1 j=1
X m Xn
simdoc (D1 , D2 ) =( wi1 ) ·( wj2 )
i=1 j=1
Word Analogies
• Analogy Solving Task: Different word embeddings are evaluated on their ability to answer analogy ques-
tions.
• 3CosAdd: Maximize similarity to analogy end-word, dissimilarity to analogy start-word, and similarity to
input start-word.
12
December 20, 2017
Retrofitting and Projections
• Often have a representative and relatively large list of word pairs that reflects the desired similarity better
than the word embeddings, but has worse coverage. Retrofitting allows to use such data in order to improve
the quality of the word embeddings matrix.
• One can bridge the gap between two embedding spaces using a linear projection
Practicalities and Pitfalls
• Source of training corpus, contexts that were used for defining distributional similarities, and hyper-parameters
of the learning can all greatly influence the results.
• Similarities induced by word vectors are based on distributional signals and are susceptible to all the limitations
of distributional similarity methods.
13