KEMBAR78
Cs 419 Notes | PDF | Bayesian Inference | Cross Validation (Statistics)
0% found this document useful (0 votes)
10 views36 pages

Cs 419 Notes

Notes in english

Uploaded by

kvedha997
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
10 views36 pages

Cs 419 Notes

Notes in english

Uploaded by

kvedha997
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 36

Introduction to Machine Learning (CS419)

Instructor: Saketh
Contents

Contents i

1 Basic Concepts and Definitions 3


1.1 Human Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.1 Non-Bayesian Set-up for Learning . . . . . . . . . . . . . . . . 6
1.2.2 Bayesian Set-up for Learning . . . . . . . . . . . . . . . . . . 7

2 Overview of Machine Learning Algorithms 9


2.1 Supervised Learning with Deterministic Models . . . . . . . . . . . . 9
2.2 Unsupervised Learning with Probabilistic Models . . . . . . . . . . . 10
2.2.1 Non-Bayesian Set-up . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.2 Bayesian Set-up . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.3 Other Hybrids . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Supervised Learning with Probabilistic Models . . . . . . . . . . . . . 13
2.4 Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

3 Examples of Various Models 19


3.1 Unsupervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.1.1 Bernoulli and Beta-Bernoulli Models . . . . . . . . . . . . . . 19
3.1.2 Multinoulli and Dirichlet-Multinoulli Models . . . . . . . . . . 19
3.1.3 Gaussian and Gaussian-inverse-Gamma-Gaussian Models . . . 19

i
3.1.4 Multivariate Gaussian and Gaussian-inverse-Wishart-Gaussian
Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.1.5 Mixture Models . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.1.6 Hidden Markov Models . . . . . . . . . . . . . . . . . . . . . . 21
3.2 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.1 Probabilistic Models . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.1.1 Multivariate Bernoulli Naive Bayes Classifier . . . . 22
3.2.1.2 Gaussian Discriminant Analysis . . . . . . . . . . . . 23
3.2.1.3 Logistic Regression . . . . . . . . . . . . . . . . . . . 23
3.2.1.4 Linear Regression . . . . . . . . . . . . . . . . . . . . 23
3.2.2 Deterministic Models . . . . . . . . . . . . . . . . . . . . . . . 23
3.2.2.1 Support Vector Machine . . . . . . . . . . . . . . . . 24
3.2.2.2 Support Vector Regression . . . . . . . . . . . . . . . 24
3.2.2.3 Kernel Machines . . . . . . . . . . . . . . . . . . . . 25

1
2
Chapter 1

Basic Concepts and Definitions

Machine learning aims at developing algorithms that mimic the ability in humans to
learn i.e., improve their “performance” with experience. By performance, we mean
their various cognitive abilities. A short note about this is presented below. It is
easy to observe that machine learning algorithms will have far reaching consequences
in all aspects of living and hence are worth developing.

1.1 Human Learning


Humans seem to have various cognitive abilities. Some of which1 are: i) finding
similarities or patterns or groupings in data, ii) categorizing objects not seen before
as novel iii) sequence completion iv) recognition abilities including speech and object
recognition v) Summarizing or abstracting text/images/multi-media vi) Decision
making vii) Problem solving etc. A little thought will convince that all these abilities
improve2 with increasing experience.
Above discussion convinces that associated with learning there is always a
target/unknown concept/ability. Hence-forth, we will call this as the unknown-
concept. For e.g., in the phenomenon of learning to group data, the unknown
concept is the relationship between data/objects and group-ids. In the phenomenon
of speech recognition, it is the relationship between speech utterances and English
transcriptions etc.
Needless to say, learning happens through experience. Hence experience is
an important aspect of learning. It is easy to see that experience is simply a finite-
sized realization (sampling) of the truth in the unknown-concept. Depending on
1
Examples are picked based on the settings that machine researchers have formally studied.
2
Not necessarily strictly monotonically improving.

3
the need/application humans express3 the unknown-concept through some input-
output pairs. For e.g., in the grouping/clustering of data example, the input is the
datum and the output is the group-id etc. This clarifies what we mean by input
and output.
Also, how well or fast will a person learn will definitely depend on his current
state/capacity/bias/background. We will refer to this as background hence-forth.
Given this terminology, one could say that the objective in human learning is to
determine the unknown-concept, but it seems enough to say that the goal is to be
able to provide the “right” output for any input (because this is how usually humans
express their ability/unknown-concept that has been learnt). Summarizing, here are
the five important aspects in human learning:

1. Input and Output

2. Unknown-concept

3. Experience

4. Background

5. Objective of learning

1.2 Machine Learning


Though humans possess very many abilities, they are currently far from understand-
ing how they learn/acquire/improve these abilities. So the idea in machine learning
is to develop mathematical models and algorithms that mimic human learning rather
than understanding the phenomenon of human learning and replicating it. Hence,
we begin by writing down the mathematical concepts by which we represent each of
the five aspects in human learning mentioned above.

1. Input by x ∈ X ⊂ Rn and4 Output by y ∈ Y ⊂ R.

2. Unknown-concept by a Probability distribution [Ross, 2002] over the in-


puts (hence-forth denoted by UX ) or over the input-output pairs (hence-forth
denoted by UXY ). For e.g., in case of support/mean/density estimation or
3
If humans could express the unknown-concept directly, rather than in terms of input-output
pairs, then perhaps, humans would also understand how they learn!
4
Euclidean model for input space and real space for output is what we employ in CS419.
Machine learning, in general, is not restricted to these.

4
sequence filling problems the probability distribution is over the inputs alone
and in case of grouping/clustering data and speech/object recognition prob-
lems the probability distribution is over the input-output pairs.

3. Experience by:

(a) set of input-output pairs. In this case learning is said to be supervised.


(b) set of inputs. In this case learning is said5 to be unsupervised.

In either case, we call this set as the training set. Most importantly, the
training set and the unknown distribution have a relation: we assume that
the training set is an instantiation of a set of iid random samples from the
unknown distribution. We denote the set of iid random variables gener-
ating the training set as D = {X1 , . . . , Xm } in un-supervised case and as
D = {(X1 , Y1 ), . . . , (Xm , Ym )} in the supervised case. The training set is an
instantiation of D (hence-forth denoted by D).

4. Background by (mathematical) Model. Models are of two types:

(a) Deterministic Models: Linear models (set of all linear functions [Shel-
don Axler, 1997] over the input space), Quadratic models (set of all
quadratic functions over the input space), Analytic models (set of all
analytic functions over the input space), Convex models (set of all con-
vex functions [Boyd and Vandenberghe, 2004] over the input space) etc.
(b) Probabilistic Models [Ross, 2006]6 : Bernoulli models (set of all Bernoulli
distributions over the input space), Gaussian models (set of all Gaussian
distributions over the input space).

Note that the members of each model differ only in terms of their parameters.
These parameters are hence called as model parameters. For eg. for linear
models θ ∈ Rn are the parameters and for Gaussian models the mean vector
and covariance matrix are the parameters. We will reserve M for denoting
model and θ for denoting the vector of model parameters.

5. Objective of learning by some mathematical statement (discussed below).


5
In lecture, we attempted to categorize some eg. of human learning into supervised and unsu-
pervised and realized that in some cases this is clear and some cases it is not clear. This means that
one should consider other paradigms too in machine learning than these two. However in CS419
we will only focus on these two. Other paradigms are semi-supervised learning, active learning,
reinforcement learning etc.
6
With these models, one can give answers like it is very likely to rain today, which humans do.

5
Broadly, there are two schools of thought when it comes to formalizing the
objective of learning. The first is to view learning as the process of selecting the
“best” model parameter after observing the training set (given the model). In this
case the output for a given input is predicted/estimated/inferred only using this
best model parameter. The second is to view learning as the process of estimating
the uncertainty of each model parameter being the “right” one and then the output
is estimated/predicted/inferred (for a given input) by taking an aggregate opinion
of each model parameter7 . The latter approach is called as Bayesian learning. We
begin with the former set-up of non-Bayesian learning and then provide a description
of Bayesian learning.

1.2.1 Non-Bayesian Set-up for Learning


Recall that here, the goal is to pick the “best” model parameter that gives good
“performance”. Here is a natural ways of writing down this goal mathematically in
the case of deterministic models for supervised learning:
(1.1) min R[gθ ],
θ∈ϑ

where ϑ is the set of possible values the model parameter, θ, could take, and gθ
is the particular member of the deterministic model (i.e., member of the set of
functions) with θ as the parameter8 . R[gθ ] is the risk incurred by employing gθ as
the “best” hypothesis for prediction. Further, we define risk as the expected loss,
where loss is any non-negative function, l : Y × Y 7→ R+ that measures how far
the predicted and true output are for the given input with the given function i.e.,
R[gθ ] ≡ E [l(gθ (X), Y )]. Examples of loss function9 are: i) 0-1 loss: l(gθ (X), Y ) ≡
1gθ (X)=Y ii) square-loss: l(gθ (X), Y ) ≡ (gθ (X) − Y )2 . Needless to say, one cannot
even compute this expectation as the distribution (UXY ) is unknown and hence
minimizing it is impossible. Later on we will see how to approximate this goal using
the training set.
In this case, given a x, the corresponding y is predicted using y = gθ∗ (x), where

θ is the minimizer of the objective in (1.1).
In case of probabilistic models for unsupervised learning with UX as the un-
known distribution, here is a natural way of writing down the goal:
Z
(1.2) min (Fθ (x) − UX (x))2 dx,
θ∈ϑ X
7
To better understand this statement, you may want to view a model as a set of hypothesis
and each model parameter corresponds to a particular hypothesis in this set.
8
In other words, the model is the set of all gθ for various θ ∈ ϑ.
9
Later on we will provide more examples of loss functions.

6
where Fθ is the particular distribution in the model with the parameter as θ. In
plain words, this simply picks that θ that induces the closest distribution to the
unknown one. Further, assuming that for all the distributions in the model and
UX the moment-generating-function (mgf) exists, here is another way of writing the
goal.

X 2
(1.3) min EUX [X i ] − EFθ [X i ]
θ∈ϑ
i=1

In plain words, this simply picks that θ that induces the distribution whose moments
are closest to the unknown one.

1.2.2 Bayesian Set-up for Learning


Recall that here, the goal is to get the “right” distribution over the model paramters
given the training set such that the distribution reflects the probability that a model
parameter is “correct”. In other words, obtaining the “right” F (θ/D) is the objec-
tive. Ofcourse now one can again resort to the set-up in non-Bayesian and assume
a model over θ etc. and then try to pick the “best”. However, such a method would
not come under Bayesian set-up. In the subsequent chapter, we will show that Bayes
rule can be employed to compute this posterior distribution10 of θ.
In the subsequent chapter we will present a summary of well celebrated al-
gorithms that approximate/achieve the objective of learning in the various settings
illustrated above.

10
Posterior distribution because it is the distribution after observing something .... (here it is
the distribution of θ after seeing the training set).

7
8
Chapter 2

Overview of Machine Learning


Algorithms

Note that the objective in case of Bayesian set-up is achievable/realizable (exactly),


ofcourse assuming that the prior distribution over the model parameters is known.
However the objective in the non-Bayesian set-up is not achievable (impossible to
achieve). Hence the non-Bayesian algorithms need a special introduction.

2.1 Supervised Learning with Deterministic Mod-


els
Here we present algorithms for supervised learning with deterministic models. In
this case the non-Bayesian set-up is natural1 .
Lets start with the problem (1.1). One way to devise a algorithm is by
looking at a quantity computable from the training set and approximates the ob-
jective in (1.1). From weak law of large numbers it is clear that the so-called
empirical risk,Pm which is defined as the average loss over the training set i.e.,
1
R̂m [gθ ] ≡ m i=1 l(gθ (xi ), yi ), is a probably-approximately-correct (PAC) approx-
imation of the
Pmrisk i.e., for a given θ the sequence of random variables (with increas-
1
ing m) m i=1 l(gθ (Xi ), Yi ) converge in probability to R[gθ ]. This observation
motivates the so-called Empirical Risk Minimization (ERM):

(2.1) min R̂m [gθ ],


θ∈ϑ

1
Bayesian set-up is natural with probabilistic models only.

9
Note that, unlike (1.1), this minimization is not impossible to compute2 .
While this is encouraging, the immediate question is whether the minimum of
empirical risk is PAC version of minimum of risk ? We say ERM is statistically
consistent iff this happens. Fortunately, for the loss functions and models we use
in this course, this can be shown [Schölkopf and Smola, 2002, Vapnik, 1998].

2.2 Unsupervised Learning with Probabilistic Mod-


els
We begin with a discussion for non-Bayesian set-up and study Bayesian set-up sub-
subsequently.

2.2.1 Non-Bayesian Set-up


Now lets consider the problem (1.3). Again motivated by law of large numbers the
following algorithm, often called as the method of moments, is immediate:

∞ m
!2
X 1 X i
(2.2) min [xj ] − EFθ [X i ]
θ∈ϑ
i=1
m j=1

Under some mild regularity conditions one can show that the method of moments
is statistically consistent [Hansen, 1982].
Infact, one can easily generalize this algorithm: suppose UX ∈ M i.e., there
exists θ∗ ∈ ϑ such that Fθ∗ = UX . Lets also assume no two distributions in the model
are the same.
 And, suppose there exist n functions gi : X × ϑ 7→ R, i = 1, . . . , n

g1
 g2 
and g =  ..  such that EUX [g(X, θ)] = 0 if and3 only if θ = θ∗ . We define the
 
 . 
gn
method of generalized moments as:

n m
!2
X 1 X
(2.3) min gi (xj , θ)
θ∈ϑ
i=1
m j=1

2
This does not mean that (2.1) can be solved in polynomial time. Infact even with 0-1 loss and
linear model, this problem is computationally hard [Feldman et al., 2009].
3
Note that we define expectation of a vector to be the vector of expectations.

10
Note that if we choose gi (X, θ) ≡ X i − EFθ [X i ], i = 1, 2, . . ., then we will get back
the method of moments. Hence this method indeed generalized the earlier one.
From the literature of information theory, we know that g(X, θ̄) = ∇θ ln(fθ (X))|θ=θ̄ ,
known as the Fisher information4 , satisfies5 the required zero expectation criteria
mentioned above i.e., the expectation of gradient of log-likelihood function is zero
only with the correct model parameters (θ∗ ). Here, and hence-forth, we denote the
pmf/pdf of Fθ by fθ and call it the likelihood6 . Note that with this choice of g,
the method of generalized moments is:
(2.4) min k∇θ ln(fθ (D))k2
θ∈ϑ

where fθ (D) is the likelihood of the training set, D, wrt. the Fθ distribution.
Motivated by the above, we write the following algorithm, popular as the
maximum likelihood algorithm:
(2.5) max ln(fθ (D))
θ∈ϑ

While the above algorithm7 is very intuitive, it simply says the best distribution
is the one with which the likelihood of seeing the training data is the most, it is
interesting that under some sufficient conditions (2.4), the generalized method of
moments, and (2.5), the maximum likelihood , are the same: i) log-likelihood as
function of θ is concave and ii) ϑ is an open set.
We finish this section with another algorithm that again is closely related
to the maximum likelihood algorithm. Given two likelihood functions, f1 , f2 , one
way of computing inner-product Sheldon Axler [1997] is: hf1 , f2 i ≡ Ef1 [f2 (X)] =
Ef2 [f1 (X)], the expected likelihood Jebara et al. [2004]. Needless to say, given an
inner-product, a distance/norm function is induced: kf1 −f2 k2 ≡ hf1 , f1 i+hf2 , f2 i−
2hf1 , f2 i. This motivates the following objective for learning: minθ∈ϑ kfθ − Uθ k2 =
minθ∈ϑ Efθ [fθ (X)] − 2EUX [fθ (X)]. Again, motivated by law of large numbers, we
propose the following algorithm for minimizing the above objective:
m
X
(2.6) min Efθ [fθ (X)] − 2 fθ (xi ).
θ∈ϑ
i=1

Interestingly, the first term is a term independent of training data and is fixed given
the model. For e.g., for exponential distribution, f (x) = λ exp −λx, λ > 0, this
4
For this course the Wikipedia article is enough: http://en.Wikipedia.org/wiki/Fisher_
information.
5
∇x g(x) denotes the gradient of g wrt x i.e., the vector of partial derivatives of g wrt. xi s.
6
Instead of saying probability with discrete RVs and density with conts. RV, machine learn-
ing/statistics community uses likelihood term for both kinds of RVs.
7
Note that the θ∗ that maximizes likelihood also maximizes log-likelihood and vice-versa.

11
term is simply λ/2. While this first term is minimized, the second term (without
the minus sign) is the sum of likelihoods of training examples, which is maximized.
Hence the objective is not only performing some kind of maximum likelihood, but
also preferring low-energy8 models. Such terms in the objective that do not depend
on training set are called as regularizers. We also, note that, without the first
term, this algorithm maximizes the arithmetic mean of likelihoods, whereas the
maximum likelihood algorithm maximizes the geometric mean of the likelihoods
(and AM≥GM). This is another way to motivate the popular maximum likelihood
algorithm. In the subsequent section, we shift our focus to the Bayesian set-up.
Note that all methods/algorithms mentioned above approximate the true ex-
pectation/mean/risk wrt. the unknown distribution with the sample average/mean/empirical-
risk. And then hope9 for statistical consistency. Such algorithms that depend on
the notion of probability as frequency as called as Frequentist algorithms10 .

2.2.2 Bayesian Set-up


As mentioned earlier, unlike Frequentists, here we will employ Bayes rule11 to obtain
F (θ/D):

fD/Θ (D/θ)fΘ (θ)


(2.7) fΘ/D (θ/D) = .
fD (D)

Observe that given fΘ , usually called as prior12 , fΘ/D can be computed. Also, for
prediction, we aggregate over all the model parameters
P using the total probability
rule: consider the UX case, then fX/D (x/D) = f
θ∈ϑ RX/Θ (x/θ)fΘ/D (θ/D) if the
posterior distribution over θ is discrete and fX/D (x/D) = ϑ fX/Θ (x/θ)fΘ/D (θ/D) dθ
if the posterior distribution over θ is continuous. Hence-forth we will refer to fX/D
as the Bayesian Average Model (BAM).

2.2.3 Other Hybrids


Once the posterior distribution for the model parameters is determined, instead of
using it for computing the posterior of x given D, sometimes it used to determined
8
Energy is simply integral of the function-squared, the first term. For exponential distribution,
the energy is simply λ/2. The algorithm thus prefers exponential distributions that decay slowly!
9
Ofcourse, the goal is to prove consistency.
10
Note that ERM is also a frequentist algorithm.
11
Recall that Bayes rule can be applied to any joint distribution where all the conditionals and
marginals are either discrete or continuous distributions.
12
The distribution prior to seeing the training data.

12
the “best” parameter and used for prediction (like in non-Bayesian set-up). For
e.g., compute the mode of the posterior fΘ/D and use it as the “best” parameter,
popularly known as the Maximum Aposteriori Algorithm (MAP); or compute
the mean or median etc. Such half-way Bayesian approaches are often called as
posterior plug-in approximations.
If one now asks the question which of these methods MLE, MAP, BAM may
have better predictive performance, then my answer would be: simply because MAP
and BAM utilize domain knowledge, which is a superior form of knowledge/wisdom,
they may perform better (ofcourse their performance is doubtful when there is no
such explicit domain knowledge) than MLE. In between MAP and BAM, I would
intuitively say, if the model has the distribution that is equal or close to the unknown
distribution being modeled, then MAP or some other such plug-in approximation
based on the posterior is perhaps better. In anycase, we will be able to answer this
question better only in section 2.4.

2.3 Supervised Learning with Probabilistic Mod-


els
Algorithms for this case are exactly same as those in the previous section. However,
the modeling is different and infact multiple choices for modeling exist when dealing
with supervised learning. The following text explains these choices.
In supervised learning, UXY is the unknown distribution, D = {(x1 , y1 ), . . . , (xm , ym )},
and the usual goal is to estimate the conditional distribution UY /X . Now this can
be estimated by two fundamentally different types of models:

Discriminative Models: Model UY /X directly13 . Needless to say, this is very nat-


ural and intuitive.

Generative Models: Model UXY , the unknown distribution and then use Bayes
rule for obtaining UY /X from it14 . At first this seems an overkill, but as sum-
marized in section 8.6 in Murphy [2012], both generative and discriminative
models have their own trade-offs. Further, there are two different choices in
modeling UXY :

1. Model UXY directly.


2. Model two things: UX/Y and UY . Needless to say, given these two, UXY
is fixed and vice-versa.
13
Model for UY /X is a set of distributions for Y /X for all X = x.
14
Note that from UXY one can obtain UY /X however the converse is impossible.

13
Ofcourse, any of Frequentist, Bayesian and other plug-in approximations may be
employed for learning with any of the above types of models. Through various
examples we will later realize the conveniences with each of these three different
types of supervised models.
We end this section with some terminology: the supervised problem where Y
is a discrete set15 is called as classification problem. And more specifically, if Y
is a discrete set with exactly two members, then it is called as a binary classi-
fication problem. Typically, when employing a probabilistic model for predicting
the label/output for a given x with a learnt model, one resorts to what is pop-
ularly known as Maximum Aposteriori Inference/Prediction (MAP Infer-
ence/Prediction) i.e. the predicted label of x is given by argmaxy∈Y f (y/x), where
f is the estimate for UY /X=x . This MAP derived function that provides the predic-
tion g(x) = ŷ is called the classifier. If in a classification problem one employs the
third type of model i.e., model both UX/Y and UY , then the resulting classifier is
called as the Bayes classifier. Further, if one employs the Bayesian average model
for both UX/Y and UY , then we qualify the classifier as Bayesian Bayes Classifier.
Sometimes, in addition to making the assumption that the examples in the train-
ing set are independent, one makes the assumption that the features/dimensions of
the data X ∈ Rn are independent. If a Bayes classifier makes such an assumption
i.e., assume UX/Y = UX1 /Y . . . UXn /Y , then the resulting classifier is called a Naive
Bayes classifier.
If Y = R (with the usual Euclidean space for R), then the supervised problem
is called as a regression problem.

2.4 Model Selection


Till now we assumed that a model is given (and we were mimicking the learning in
a particular human being). The obvious question to be answered is what to do if
multiple models are available or if the model under consideration does not contain
the unkown distribution? This section is dedicated to a discussion on this question.
In the following text, we use model and hyperparameter interchangeably.
We begin with the scenario where multiple models m1 , . . . , mr are given or
equivalently, hyperparameters α1 , . . . , αr are given. In this case, you may already
observe the following fact: previously we assumed hyperparameter/model was given
i.e., the set of parameters to choose from was given and we discussed how to choose
the parameters from this fixed set. Now we are fixing the set of models or equiva-
lently assuming that the set of hyperparameters to choose from is known and asking
15
It is assumed that the members of this discrete set are NOT comparable.

14
how to choose the hyperparameter (and parameter) given this fixed set. So essen-
tially, whatever ideas we used for parameter selection or Bayesian averaging, can
be used again; with the only difference being that the hyperparameters now play
the role of parameters and perhaps we should call those playing the role of hyper-
parameters are hyperhyperparameters. Needless to say, this process can be done
recursively. In the following we will begin with some methods that are essentially
equivalent to the ideas in parameter selection. However, later we will present a
statistical learning theory result that introduces a new component to the usual ob-
jective of training-data fit. This will motivate us to devise a new class of algorithms
(please refer to structural risk minimization in the following for that).
Let’s first start with simple algorithms for model/hyperparameter selection/averaging
that are similar in spirit to parameter selection/averaging: a Bayesian’s answer
would be to simply take a weighted opinion of all the models under consider-
ation. For this you may use domain knowledge that gives a prior16 over mod-
els/hyperparameters. We can write an expression17 for the Bayesian average model:
p(x/D) = i=1 p(x/mi , D)p(mi /D) or equivalently, p(x/D) = ri=1 p(x/αi , D)p(αi /D),
Pr P
where p(αi /D) is the posterior distribution of the hyperparameters that ∝ p(D/αi )p(αi ),
i.e., proportional to the product of the marginal likelihood of the data under
a model/hyperparameter and the prior of the hyperparameter/model. Further,
p(x/αi , D) is essentially the BAM for a given hyperparameter/model (already fa-
miliar
R to you) and the expression for marginal likelihood is ofcourse: p(D/αi ) =
θ∈mi
p(D/θ)p(θ) dθ (average opinion of all distributions in the model mi ). Hence
this algorithm can be called Bayesian average of Bayesian average models.
While this is the case with a Bayesian, a non-Bayesian would argue for selecting
the “best” hyperparameter. Here are some obvious alternatives:

• An algorithm that is analogous to the MAP estimation of parameter selection:


the idea is to use the posterior distribution of hyperparameters: p(αi /D) and
pick the one that maximizes it, say α̂. Now given this “best” hyperparameter
there are again options to predict using this model: we can do the usual BAM
or MAP or MLE based parameter selection and subsequent prediction.

• An algorithm that is analogous to the MLE estimation of parameter selection:


the idea is to use the marginal likelihood of hyperparameters: p(D/αi ) and
pick the one that maximizes it, say α̂. Now given this “best” hyperparameter
there are again options to predict using this model: we can do the usual BAM
or MAP or MLE based parameter selection and subsequent prediction. Such a
16
The parameters of this prior distribution over hyperparameters may be called as hyperhyper-
parameters.
17
We can also write the expression for the case where the set of hyperparameters is an interval
etc.; the sum needs to be replaced by an integral.

15
model selection algorithm is known as the maximum marginal likelihood
algorithm.
• Another simple algorithm is to pick that parameter from all (the union of) the
models that maximizes the likelihood of the training data. This is equivalent
to doing MLE in the union of all models.

Note that each of the four types of algorithms stated above may give different
predictive functions. In particular, the method of maximum marginal likelihood is
not equivalent to the last one (simple MLE over the union). This is because the
marginal likelihood is the likelihood’s average over the entire model and hence the
model for which it is maximum need not be same as that model which contains the
parameter/distribution that maximizes the likelihood of the data. Infact, it is easy
to see that the maximum marginal likelihood and the algorithm mentioned above it
inherently penalize big/complex models! Please read section 5.3 in Murphy [2012]
for a detailed discussion regarding this phenomenon.
This is a key observation that merits a discussion: the simple MLE algorithm
achieves the best fit to the training data because it essentially searches a bigger
model; whereas the others penalize big models; in the sense that bigger the model,
it is more probable that the likelihood, with the corresponding predictive distribution
obtained by the other three algorithms, is less. While this is the case, at the first
look, one might argue that since bigger models have a better chance to contain
the unknown distribution/function being modeled, than smaller models, the simple
MLE algorithm is superior. However, results such as the following18 hint that bigger
models need not necessarily be better:
Theorem 2.4.1. Let G be a set of functions/distributions parametrized by θ ∈ ϑ.
Denote a typical member of this model by gθ . Recall that R[gθ ] and R̂m [gθ ] denote
the true and empirical risks with gθ (computed using a loss function l). Then with
probability 1 − δ, we have:
r
C(G, l) 1 2
(2.8) R[gθ ] ≤ R̂m [gθ ] + 2 √ +3 log ∀ θ ∈ ϑ.
m 2m δ
where, C(G, l) denotes the complexity of the model and loss combination that satisfies
the property that C(G1 , l) ≤ C(G2 , l) whenever G1 ⊂ G2 .

Later on we will give an expression for this complexity for a particular model,
loss combination.
At a first look this theorem’s result may not seem great because after all R̂m is
simply the sample average, and R is the expectation, so it should be easy to derive
18
Refer http://www.cse.iitb.ac.in/saketh/teaching/cs729Notes.pdf for details.

16
such a bound (for example, the proof of weak law of large numbers will itself provide
a bound). Moreover, such a bound will not have anything to do with G. If this is
your thought process, then you should note that this is true if the gθ if fixed apriori.
However, note that if one needs to ascertain how good the ERM candidate, denoted
by θERM , is i.e., how close R̂m [gθERM ] and R[gθERM ] are, then such simple bounding
techniques will not work simply because the θERM itself is a function of the training
data and hence the R̂m is no more a simple average of iid random variables. The
coolest thing about the theorem’s result it that it holds uniformly for every θ ∈ ϑ
and hence can be used to compare R̂m [gθERM ] and R[gθERM ].
Also, in general, this bound cannot be improved. From this theorem it is very
clear that big models are not necessarily better if we desire that the empirical risk of
our candidate is close to its true risk. On the other hand, with smaller models, even
though the empirical risk is perhaps close to the true risk, just because the model
is perhaps too small to contain the unknown distribution, the predictive function
that ERM/MLE/MAP/BAM chooses may be rendered useless. Hence an optimum
balance needs to be striken where perhaps the bound in (2.8), which is usually called
as the guaranteed risk, itself is minimized.
Based on the above discussion, one might hurry to write down the follow-
ing algorithm for model selection and parameter selection, which is usually called
as structural risk minimization (SRM). For reasons of convenience, usually
while talking about this algorithm, one assumes that the given models are nicely
arranged in a structure19 such as m1 ⊂ m2 ⊂ m3 . . .. SRM simply selects the
model/hyperparameter and the parameter in it that minimizes the guaranteed risk.
While this seems flawless at first sight, one has to note that the theorem holds when
the model is fixed/given and perhaps not when the model is to be selected. One may
go ahead and re-derive a theorem for this case where the structure/hierarchy of mod-
els is given (and not the model itself). You can believe me, this will simply lead to
an additive term to (2.8) that is a function of complexity of this hierarchy/structure
of models. Hence SRM is indeed a “safe” algorithm for model selection; however
not so if one further wants to choose the hierarchy/structure itself. It should now
be clear that one can keep on recursively keep minimizing additive variants of the
bound in (2.8). In practice, users usually stop at the level of hyperparameter/model
selection, but somehow ensure that they use some (finite) structure/hierarchy of
models that will certainly contain or approximate the unknown distribution.
We will end this discussion with a slight variant of the simple MLE based model
selection algorithm (the fourth one in the above list). It so happens that practitioners
19
It so happens that this hierarchy/structure is a way to encode prior knowledge about which
models or which parameters are expected to be “good” candidates. The idea is to simply put
the most likely candidates in/as the smallest subset/model so on. etc. since SRM will prefer the
smaller models this is like using prior knowledge that they are good.

17
more commonly employ different subsets of training data to perform parameter
selection and hyperparameter selection using the same criteria of ERM/MLE/fit-to-
the-data. Such algorithms as known as validation based algorithms. Here are some
popular variants:

• The method of Validation for hyperparameter selection. Here, the idea is to


randomly partition the training set into two parts, the validation set and the
“new” training set. For each hyperparameter, using MLE/ERM etc., the pa-
rameter/candidate selection is performed using the “new” training set alone.
Then each hyperparameter’s performance is approximated by the performance
on the validation set20 . The hyperparameter/model that achieves highest val-
idation set accuracy is chosen.

• The extreme case of the above called as leave-one-out cross-validation. Here,


m iterations are performed: at ith iteration, the ith training example alone
is used as the validation set and the rest are used for parameter selection
and the validation accuracy on the left out example is computed for every
hyperparameter. After all rounds, these validation accuracies are averaged
(this average is called as the leave-one-out cross-validation accuracy). The
hyperparameter/model that achieves the highest leave-one-out cross-validation
accuracy is chosen.

• The k-fold cross-validation algorithm, which can be viewed as a hybrid of the


above two: the initial dataset is randomly partitioned into k subsets, hence-
forth called as folds. Here, k iterations are performed: at ith iteration, the
ith fold alone is used as the validation set and the rest are used for parameter
selection and the validation accuracy on the left out fold is computed for ev-
ery hyperparameter. After all rounds, these validation accuracies are averaged
(this average is called as the k-fold cross-validation accuracy). The hyperpa-
rameter/model that achieves the highest k-fold cross-validation accuracy is
chosen.

The important point to note however, owing to the discussion earlier, is that with any
of these model selection algorithms, one pre-fixes the (finite) set of models/hyperparameters.
And it is not necessarily better if more and more hyperparameters/models are con-
sidered. The only conscious choice that can be made for choosing the set of mod-
els/hyperparameters is to somehow ensure that the unknown distribution can be
well approximated by this set.

20
Again, this is intuitive from law of large numbers. However, one should still prove statistical
consistency. Since in practice always one takes a finite set of hyperparameters, it turns out that
statistically consistent is guaranteed.

18
Chapter 3

Examples of Various Models

This this chapter we work out the various algorithms presented earlier for various
models. We begin with unsupervised learning examples.

3.1 Unsupervised Learning


We begin with the case where UX is the unknown distribution, D = {x1 , . . . , xm },
and the goal is to estimate the entire unknown distribution UX .

3.1.1 Bernoulli and Beta-Bernoulli Models

Refer section 3.3 in Murphy [2012].

3.1.2 Multinoulli and Dirichlet-Multinoulli Models

Refer section 3.4 in Murphy [2012].

3.1.3 Gaussian and Gaussian-inverse-Gamma-Gaussian Mod-


els

Refer sections 4.1 and 4.6 in Murphy [2012].

19
3.1.4 Multivariate Gaussian and Gaussian-inverse-Wishart-
Gaussian Models
Refer sections 4.1 and 4.6 in Murphy [2012].

3.1.5 Mixture Models


Refer sections 11.2.1-11.2.3, 11.3, 11.4.1-11.4.2.5, 11.4.7 in Murphy [2012] and Ap-
pendix 1 appearing near the end of this notes.
Here are some further comments:

• Please refer Wu [1983] for details on sufficiency conditions for convergence to


stationary point by EM algorithm and proof details.

• Note that from the theorems in the above paper, it is clear that EM algorithm
directly implements the generalized method of moments algorithm (2.4), where
the goal is infact to find stationary points of log-likelihood. Hence, the EM
algorithm is better motivated by the generalized method of moments algorithm
rather than by the maximum likelihood algorithm. Somehow strangely, many
books introduce EM algorithm for solving the maximum likelihood problem
instead.

• For a novice there seems to be the following paradox: since it is known that
(under some mild conditions) the generalized method of moments is statisti-
cally consistent, you may think that just by providing millions and millions
of samples from UX , through EM algorithm, you are able to recover the joint
UXY ! However a closer look will convince you that the mixture model itself
never involves Y and it is we who are interpreting1 the component densities
are class conditionals and the mixing probabilities as the class prior. Hence
the paradox is resolved.

• Moreover, two different mixing probabilities of component distributions may


give exactly the same mixture distribution2 .

• However, the above never happens when the components are Gaussians with
distinct means. A more comprehensive result is that Gaussian mixture model,
1
We could arrive at the same distribution with a totally different re-parameterization.
2
It is very easy to see this if the component distributions are discrete and hence equivalent to
Euclidean vectors. In this case this essentially is the same as realizing that two different convex
combinations of a set of vectors may give the same vector.

20
which is the set of all Gaussian mixture models with various number of com-
ponents, is universal i.e., given any continuous distribution (over X ), there
is a Gaussian mixture distribution (over X ) with finite number of components
that closely approximates the given distribution. Please refer McLachlan and
Peel [2000] for details. In other words, the Gaussian mixture model is rich
enough for any learning problem (over X ).
• A Gaussian mixture model hence need not only be used for a clustering prob-
lem but also for example to model class conditionals in a classification problem
(which is supervised learning).

3.1.6 Hidden Markov Models


Refer sections 17.3.1-17.5.2.3 in Murphy [2012] and Appendix 2 appearing at the
end of this notes.
Here are some further comments:

• For a tutorial on speech recognition, which was the motivating application for
HMMs in the lectures, please refer Rabiner [1989].
• We defined a HMM as a family of distributions over X , Y (calX is a space of
(finite) sequences of vectors3 and Y is a space of (finite) sequences of objects
in a discrete set4 ) that satisfies the following conditional independence and
homogeneity assumptions: fX1 ,...,Xt ,Y1 ,...,Yt (x1 , . . . , xt , y1 , . . . , yt ) =
fX1 ,...,Xt /Y1 ,...,Yt (x1 , . . . , xt /y1 , . . . , yt )fY1 ,...,Yt (y1 , . . . , yt )
= fX1 /Y1 (x1 /y1 ) . . . fXt /Yt (xt /yt )fY1 (y1 )fY2 /Y1 (y2 /y1 ) . . . fYt /Yt−1 (yt /yt−1 )
(∵ the conditional independence assumptions. The first set say given the Yt , Xt is conditionally independent of other Xi )
(∵ while the second set is simply the Markovian assumption described in section 17.2.)
= f (x1 /y1 ) . . . f (xt /yt )f (y1 )f (y2 /y1 ) . . . f (yt /yt−1 )
(∵ Homogenity assumption that says the (conditional) distribution does not depend on the position in the sequence.)

• We also noted two graphical ways of representing HMMs:


– One is by using Directed Graphical Models (DGMs). Here the conditional
independences only can be represented (not the homogeneity assump-
tions). Please refer sections 10.1-10.2.2 in Murphy [2012]. The section
3
In the speech recognition example, a member of X is a sequence of 12-dimensional MFCC
cepstral coefficients computed for every frame in a speech signal.
4
In the speech recognition example, a member of Y is a sequence of phonemes, one for every
frame in a speech signal.

21
10.2.2 provides the representation for HMMs. By this it should be clear
how HMMs are to be generalized.
– Another is by representing the Markov chain part in the HMM (i.e., the
distribution over Y ) through state transition diagrams (example figure
17.1 in Murphy [2012]).

• We next commented that there are other conditional independences that follow
from those made above in a HMM. One of them was eqn. (17.50) in the book.
We noted that this implied conditional independence is crucial for the EM
algorithm.

• Though we motivated HMMs through the application of speech recognition,


now we can use it model say, the class conditional densities in any classification
problem involving inputs that are sequences. For e.g., speaker recognition
where the training set is pairs of speech signals labeled with the speaker’s
name etc. (see also section 17.5.4 in Murphy [2012]).

• In lectures we also considered training HMMs in a supervised fashion. Though


the maximum likelihood algorithm turns out to be simple, we commented that
collecting such training data, where each time frame in a speech signal has to
be labelled with the phoneme, is extremely tedious and hence usually we talk
about HMMs trained in an unsupervised fashion.

• Infact, the most popular way of using HMMs is with loose supervision where
for each speech signal the label is simply given as the word uttered. It turns
out that one can tweak the unsupervised HMM a bit to utilize such weak
supervision.

3.2 Supervised Learning


We begin with the case where UXY is the unknown distribution, Y is discrete,
D = {(x1 , y1 ), . . . , (xm , ym )}, and the goal is to estimate the conditional distribu-
tion UY /X . We begin with probabilistic models and then finish with a section on
deterministic models.

3.2.1 Probabilistic Models


3.2.1.1 Multivariate Bernoulli Naive Bayes Classifier

Refer sections 3.5.1-3.5.2 in Murphy [2012].

22
3.2.1.2 Gaussian Discriminant Analysis

Refer section 4.2 in Murphy [2012].

3.2.1.3 Logistic Regression

Refer sections 8.1-8.3.1 in Murphy [2012].

3.2.1.4 Linear Regression

Refer sections 7.1-7.3 and 7.5.1 in Murphy [2012].

3.2.2 Deterministic Models


The basic model in the deterministic case is the linear model (set of all linear func-
tions over the input space= Rn ), denoted by L = {f | ∃w ∈ Rn 3 f (x) = w> x ∀ x ∈
Rn }. It is evident that w ∈ Rn is the parameter for this model. The following loss
functions are popularly employed:

• Binary Classification:

– Hinge-loss: l(w> x, y) ≡ max(0, 1 − yw> x). We commented that this loss


is piece-wise linear, convex (wrt.w) and hence attractive. The zero-one
loss on the other hand is non-convex and hence rarely employed5 .
– Squared-hinge-loss: l(w> x, y) ≡ max(0, 1 − yw> x)2 . This loss is further
differentiable wrt. w, hence preferred over hinge-loss sometimes.
– Logistic-loss: l(w> x, y) ≡ log(1 + exp(−yw> x)). This is also convex and
differentiable (wrt. w).

• Regression:

– Square-loss: l(w> x, y) ≡ (y − w> x)2 .


– -insensitive loss: l(w> x, y) ≡ max(0, |y − w> x| − ).

With each of these one can implement ERM given by (2.1). It is easy to see that
the logistic-loss ERM problem is exactly same as the MLE problem with logistic
5
Refer Feldman et al. [2009] to know why the non-convexity in the zero-one loss makes ERM a
computationally hard problem.

23
regression. Also, the square-loss ERM problem is exactly same as the MLE problem
with linear regression.
While the above provide examples for equivalence with MLE, it so happens
that there are examples where there is equivalence with MAP in logistic and linear
regression. For this one usually employs the following subset of the linear model:
FW ≡ {f | ∃w ∈ Rn , kwk ≤ W 3 f (x) = w> x ∀ x ∈ Rn }. Here W is the
hyperparameter.
This model together with logistic-loss is equivalent to MAP based logistic
regression. Here is why: the ERM problem in this case is
Pm >
(3.1) minn i=1 log(1 + exp(−yi w xi ))
w∈R
(3.2) s.t. kwk ≤ W
From optimization theory (Lagrange duality), one can show that for every W there
exists some C > 0 such that the above problem and the following have the same
optimal solution set6 :
m
1 X
(3.3) min kwk2 + C log(1 + exp(−yi w> xi ))
w∈Rn 2 i=1

It is now easy to see that the above problem is exactly the same form of MAP based
logistic regression with a zero-mean Gaussian prior over w.
Similarly, (after using Lagrange duality) it is easy to see that FW together with
square-loss is equivalent to MAP based linear regression with a zero-mean Gaussian
prior over w (in other words, ridge regression).
The predictive function that results from FW together with hinge-loss is called
as the (soft-margin) Support Vector Machine (refer section 3.2.2.1). And that with
F and -insensitive loss is called as Support Vector Regression (refer section 3.2.2.2).

3.2.2.1 Support Vector Machine

Refer section 14.5.2 in Murphy [2012].

3.2.2.2 Support Vector Regression

Refer section 14.5.1 in Murphy [2012].


Now one can similarly talk about various combinations of quadratic or higher-
order non-linear models with the various losses. It so happens that there is a very
6
C is now the hyperparameter that determines the model.

24
convenient way of generalizing these non-linear models, which is discussed in the
subsequent section:

3.2.2.3 Kernel Machines

Consider the model FW,φ ≡ {f | ∃w ∈ H, kwk ≤ W 3 f (x) = hw, φ(x)i ∀ x ∈ Rn },


where H is some Euclidean space or its infinite-dimensional generalization, called
Hilbert space, h·, ·i denotes the inner-product7 in that space and φ : Rn 7→ H.
For example, let φ(x) be defined as the vector of all possible monomials with
components of x upto degree two, let the inner-product be the dot-product and let
W → ∞. Then FW,φ in this case is nothing but the set of all quadratic functions (i.e.,
the quadratic model). The advantage with this representation however is that w is
still the parameter. Needless to say, W, φ are the hyperparameters that determine
the model.
Now we claim that with such a model and any loss function,P the corresponding
ERM problem will have an optimal solution of the form w∗ = m Pα
i=1 i φ(xi ). In other
∗ m
words, there will exist αi ∈ R ∀ i = 1, . . . , m such that w = i=1 αi φ(xi ). The
proof is simple:

Pm
• any w can be written as i=1 αi φ(xi ) plus some vector in the orthogonal
complement of the space spanned by φ(x1 ), . . . , φ(xm ). Denote this orthogonal
complement by w⊥ .

• Re-write the ERM problem in terms of αs and w⊥ :


(3.4)
m
! 2 m
* m ! + !
1 X X X
min αi φ(xi ) + w⊥ +C l αj φ(xj ) + w⊥ , φ(xi ) , yi ,
α∈Rm ,w⊥ ∈H 2
i=1 i=1 j=1

which is same as (∵ hw⊥ , φ(xi )i = 0 for any i = 1, . . . , m by the very definition


of orthogonal complement):
(3.5)
m
! 2 m
* m + !
1 X X X
min αi φ(xi ) + w⊥ + C l αj φ(xj ), φ(xi ) , yi ,
α∈Rm ,w⊥ ∈H 2 i=1 i=1 j=1

7
For the purposes of this course it is enough to understand that Hilbert space and inner-product
are simply the generalizations of the Euclidean space and dot-product. Interested students may
refer http://www.cse.iitb.ac.in/saketh/teaching/cs723Scans3.pdf for further details.

25
which is same as (∵ Pythogorus theorem holds in H):
m 2 m
* m + !
1 X X X
(3.6) min αi φ(xi ) + C l αj φ(xj ), φ(xi ) , yi ,
α∈Rm 2 i=1 i=1 j=1

which is same as (∵ using properties of inner product):


(3.7) !
m m m m
1 XX X X
min αi αj hφ(xi ), φ(xj )i + C l αj hφ(xj ), φ(xi )i , yi ,
α∈Rm 2 i=1 j=1 i=1 j=1

The above result is known as the representer theorem. In addition to the result,
from the proof it is clear that ERM, which is originally an optimization problem in
H (that potentially could be infinite dimensional), is essentially a problem in Rm .
Secondly, from the proof it is clear that neither solving ERM (training phase) nor for
computing the prediction function hw, φ(x)i (inference phase) the hyperparameter φ
is not explicitly needed. What is needed is only inner-product between φ(xi ), φ(x).
It so happens that mathematicians (in 1950s) have well-understood functions
that may represent inner products in some Hilbert space. Here is the result:
 
k(x1 , x1 ) . . . k(x1 , xm )
 k(x2 , x1 ) . . . k(x2 , xm ) 
Suppose k : X ×X 7→ R is a function such that G = 
 
.. .. .. 
 . . . 
k(xm , x1 ) . . . k(xm , xm )
is a symmetric positive semi-definite matrix for any {x1 , x2 , . . . , xm } ⊂ X and for
any m ∈ N, then there exists a Hilbert space Hk and a map φk : X 7→ Hk such that
k(x1 , x2 ) = hφ(x1 ), φ(x2 )i ∀ x1 , x2 ∈ X . Such functions k are called as kernels.
Some examples of kernels are linear kernel k(x1 , x2 ) ≡ x> 1 x2 , polynomial
>
d
kernel k(x1 , x2 ) ≡ 1 + x1 x2 (here d is the hyperparameter that controls the
degree of the polynomial),
 Gaussian
 kernel or Radial Basis Function (RBF) kernel
kx1 −x2 k2
k(x1 , x2 ) ≡ exp − 2σ2 (here σ > 0 is the hyperparameter). Please verify if the
definition is satisfied for all these examples. Your book Murphy [2012], in section
14.2, provides examples of kernels on non-Euclidean spaces too!
This discussion makes it clear that, instead of using the model FW,φk , we may
now use the equivalent FW,k and the ERM problem is of size m. The later model is
more convenient because of the following reasons:

• In many applications, it is easier for domain experts to say how similar two
inputs are rather than to provide a Euclidean representation. For e.g., if the
inputs are graphs, it is easy to say how similar two graphs are rather than to

26
give a Euclidean vector representation for a graph. Such similarity functions
can be directly used as kernels (provide the mathematical conditions are satis-
fied); whereas the former model will explicitly require φ i.e., the representation
of the input.

• If the dimensionality of H is high (or say ∞), then FW,k is an efficient repre-
sentation. Moreover, the ERM in the form (3.7) is of manageable size.

• Useful for building non-linear


 models: for example, with the Gaussian kernel,
Pm kxi −xk2
hw, φk (x)i = i=1 αi exp − 2σ2 , which is a non-linear function of x.

• One can show that the (infinite) hierarchy F10−10 ,k ⊂ F10−9 ,k ⊂ . . . ⊂ F1,k ⊂
F10,k ⊂ . . ., where k is the Gaussian kernel, is universal in the sense that the
union of all these can approximate any (measurable) function over Rn . For
practitioners, this simply means that they can start with Gaussian kernel and
SVM, tune hyperparameters using some model selection, then the classifier
will perform well (given enough training data). This is important because no
such guarantee, even with lots of training examples, can be given for say SVM
with linear or polynomial kernel or for most of the models discussed in the
course.

27
28
Bibliography

Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge Univer-


sity Press, 2004.

V. Feldman, V. Guruswami, P. Raghavendra, and Yi Wu. Agnostic learning of


monomials by halfspaces is hard. IEEE Symposium on Foundations of Computer
Science (FOCS), pages 385–394, 2009.

Lars Peter Hansen. Large sample properties of generalized method of moments


estimators. Econometrica, 50:1029–1054, 1982.

Tony Jebara, Risi Kondor, and Andrew Howard. Probability product kernels. Jour-
nal of Machine Learning Research, 5:819–844, 2004. ISSN 1532-4435.

G. McLachlan and D. Peel. Finite Mixture Models. John Wiley and Sons, New York,
2000.

Kevin Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 1 edition,


2012.

Lawrence R. Rabiner. A tutorial on hidden markov models and selected applications


in speech recognition. In Proceedings of the IEEE, pages 257–286, 1989.

S. M. Ross. A First Course in Probability. Pearson Education, 6 edition, 2002.

S. M. Ross. Introduction to Probability Models. Academic Press, 9 edition, 2006.

Bernhard Schölkopf and Alex Smola. Learning with Kernels. MIT press, Cambridge,
2002.

Sheldon Axler. Linear Algebra Done Right. Springer-Verlag, 1997.

V. N. Vapnik. Statistical Learning Theory. John Wiley and Sons, Inc., 1998.

C. F. Jeff Wu. On the Convergence Properties of the EM Algorithm. Annals of


Statistics, 11(1):95–103, 1983.

29

You might also like