0 ratings0% found this document useful (0 votes) 21 views26 pagesMachine Learning Unit3
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content,
claim it here.
Available Formats
Download as PDF or read online on Scribd
UNIT Il
Ensemble Techniques
and Unsupervised Learning
syllabus
Combining mutiple learners : Model combination schemes, Voting, Ensemble Learning - bagging,
boosting. stacking, Unsupervised learning : K-means, Instance Based Learning KNN, Gaussian
enure niodels and Expectation maximization
Contents
34 Combining Multiple Learners
32. Ensemble Learning
33 Clustering
34. Instance Based Learning : KNN
35 Gaussian Mixture Models
3.6 Two Marks Questions with AnswersoN
3-2 Ensemble Techniques and Unsupervisey oom
Machine Leaming
El Combining Multiple Learners
* When designing a learning machine, we generally en choices ly
s of machine, training data, representation, ete. This implies some
Parameters ol erformance. For example, in a classification setting, we CaN tig '
a fier or in a multilayer perceptron, we should also decide on ie
ing
number of hidden units.
* Each learning algorithm dictates a certain model that comes with a set,
assumptions. This inductive bias leads to error if the assumptions do not hold fp
the data,
* Different learning algorithms have different accuracies. The no free lunch theorem
asserts that no single learning algorithm always achieves the best Performance in
any domain. They can be combined to attain higher accuracy.
* Data fusion is the process of fusing multiple records representing the same
real-world object into a single, consistent, and clean representation. Fusion of data
for improving prediction accuracy and reliability is an important problem in
machine learning.
* Combining different models is done to improve the performance of deep earning
models. Building a new model by combination requires less time, data, and
computational resources. The most common method to combine models is by
averaging multiple models, where taking a weighted average improves the
accuracy,
lead to different classifiers,
* Different Hyper-parameters : We can use the same learning algorithm but use it
with different hyper-parameters,
* Different Input Representations: Different Tepresentations make different
characteristics explicit allowing better identification,
* Different Training Sets ; Another Possibility is to train
different base-learners by
different subsets of the training set.
A Model Combination Schemes
* Different methods are used for Benerating final output for multiple base learners
are Multiexpert and multistage combination.
1. Multiexpert combination,
TECHNICAL PUBLICATIONS® - an up.thrust for knowiedgerg
eaming -
oct 3-3 Ensemble Techniques and Unsupervised Learning
. Mali Sra nee methods have base-learners that work in parallel
a) Globa : Ee rf i Samer fusion) : given an input, all base-learners generate an
output and all these outputs ate used, such as voting and stacking
b) Local approach (learner selection) : in mixture of experts, there is a gating
model, which looks at the input and chooses one (or very few) of the learners
as responsible for generating the output,
+ Multistage combination ; Multistage combination methods use a serial approach
where the next multistage combination base-learner is trained with or tested on
only the instances where the previous base-learners are not accurate enough.
«+ Let's assume that we want to construct a function that maps inputs to outputs
from a set of known Nyrain input-output pairs.
« Let's assume that we want to construct a function that maps inputs to outputs
from a set of known N train input-output pairs.
Nea
Durain = {i-ya)l
where xj € X is a D dimensional feature input vector, y; € Y is the output.
+ Classification : When the output takes values in a discrete set of class labels
Y = {c1;€p/...¢,], Where K is the number of different classes. Regression consists
in predicting continuous ordered outputs, Y = R.
ERE] Voting
« The simplest way to combine multiple classifiers is by voting, which corresponds
to taking a linear combination of the learners. Voting is an ensemble machine
learning algorithm.
+ For regression, a voting ensemble involves making a prediction that is the average
of multiple other regression models.
* In classification, a hard voting ensemble
involves summing the votes for crisp
class labels from other models and
Predicting the class with the most votes.
A soft voting ensemble _ involves
summing the predicted probabilities for
class labels and predicting the class label
with the largest sum probability.
* Fig. 3.1.1 shows Base-learers with their
outputs.
Fig. 3.1.1 Base-learners with their
outputs
TECHNICAL PUBLICATIONS® - an up-thrust for knowledge
—N
Machine Learning a4 Ensomble Techniques and Unsupervised 4,
ming
0
multiple classification/ regression
del can be created using different
using the same dataset
«In this methods, the first step is to create
using some training dataset. Each base mo
of the same training dataset and same algorithm, or
els
SPlits
; : with
different algorithms, or any other method.
Learn multiple alternative definitions of a conce t using different training dat,
different learning algorithms. It combines decisions of multiple definitions
using weighted voting.
Fig. 3.1.2 shows general idea of Base-
1,
Jearners with model combiner.
Training data set
[Learner ty
I I
Model combiner
Final model
TECHNICAL PUBLICATIONS®
ONS” - an upcthrust for
inowledgeMachine Learning S25) Ensemble Techniques and Unsupervised Leaming
+ Use a single, arbitrary learning algorithm b
oneal ut manipulate training data to make it
learn multiple models.
[SES Error-Correcting Output Codes
«In Error-Correcting Output Codes main classification task is defined in terms of a
number of subtasks that are implemented by the base-learners. The idea is that the
original task of separating one class from all other classes may be a difficult
problem.
+ So, we want to define a set of simpler classification problems,
one aspect of the task, and combining these simpler classifie
classifier.
each specializing in
rs, we get the final
+ Base-learners are binary classifiers having output - 1/ + 1, and there is a code
matrix W of K x L whose K rows are the binary codes of classes in terms of the L
base-learners dj.
* Code matrix W codes classes in terms of learners
© One per class L = K
410-1 -1 41
aH -71-4
Wed a a i
A-a-t Aj ee
+ The problem here is that if there is an error with one of the base-learners, there
may be a misclassification because the class code words are so similar. So the
approach in error-correcting codes is to have L > K and increase the Hamming
distance between the code words.
One possibility is pairwise separation of classes where there is a separate
base-learner to separate C; from C, for i a
Fig. 3.2.4
aL
AdaBoost : : /
* AdaBoost, short for "Adaptive Boosting", is a machine learning meta - algorithm
formulated by Yoav Freund and Robert Schapire who won the prestigious "Gédel
Prize" in 2003 for their work. It can be used in conjunction with many other types
of learning algorithms to improve their performance.
* It can be used to learn weak classifiers and final classification based on weighted
vote of weak classifiers. 3
© It is linear classifier with all its desirable properties. It has good generalization
Properties.
TECHNICAL PUBLICATIONS® - an up-thrust for knowledge, oF
10 Ensomble Techniques and Unsupervised Lo arg |
te prediction rule bY calling tn,
raining examples,
accural
jons over the #
the weights of incorrectly
tions that the Previously
ration.
or to form a highly
# To use the weak learner fe i '
wwaak leamer repeatedly’ oF different distribu me
cet equally, but each
‘tially, atl weights are set equally k a
classified ew a ; ‘
classifier poorly prvicts receive greater weight on th
Advantages of AdaBoost }
to implement
1. Very simple
2. Fairly good generalization
The prior error need not be
s known ahead of time.
Disadvantages of AdaBoost :
1, Suboptimal solution
2 Can over fit in presence of noise.
Boosting Steps :
Draw a random subset of training samples d1 without replacement from the
training set D to train a weak learner CL
Draw second random training subset d2 without replacement from the training set
that were previously falsely
and add 50 percent of the samples
lassified/misclassified to train a weak learner C2
Find the training samples d3 in the training set D on which C1 and C2 disagree to
train a third weak learner C3
4. Combine all the weak learners via majority voting.
‘Advantages of Boosting :
1. Supports different loss function.
2 Works well with interactions.
Disadvantages of Boosting :
1. Prone to over-fitting.
2 a ; ,
Requires careful tuning of different hyper - parameters,
Stacking
* Stacking, sometimes called stacked eneralization, achine learning
8 ization, is an ensemble machieaming
v 3-11 Ensemble Tochniquos and Unsupervised Leaming
model is trai
+ The an on the nares on the complete training data, and then the meta-model
is wane ae ra a tie Of the base models. The advantage of stacking is the
ability ‘tion space with different models in the same problem.
, The stacking based model can be visualized in levels and has at least two levels of
the models. The first level typically trains the two or more base learners(can be
heterogeneous) and the second level might be a single meta learner that utilizes
the base models predictions as input and gives the final result as output. A
sacked model can have more than two such levels but increasing the levels
doesn't always guarantee better performance,
» Inthe classification tasks, often logistic regression is used as a meta learner, while
linear regression is more suitable as a meta leamer for regression-based tasks.
» stacking is concerned with combining multiple classifiers generated by different
learning algorithms Lj,...,.Ly)y on a single dataset S, which is composed by a
feature vector Sj =(x;,t))
« The stacking process can be broken into two phases :
1. Generate a set of base - level classifiers C1,...,Cyy where C; = L\(S)
2, Train a meta - level classifier to combine the outputs of the base - level
dassifiers.
Fig. 32.2 shows stacking frame.
reaining set[ 1] 2 [3 4] |
Hypotheses |) 3) G3) se G) | Training observations
Predictions on
training observations a
eee H_ |+—Least observation
Final prediction] P-
Fig. 3.2.2 Stacking frame
TECHNICAL PUBLICATIONS® - an up-thrust for knowiedge
—i id Unsupervised L
ible Techniques an amir
Machine Leaming 3-12 Ensem 7
ted through a leave - one |
© The training set for the meta - level classifier is general
out cross validation process.
Vy = 1, mand V_ Cy
= L,(S-si)
icti for Ss
‘The leamed classifiers are then used to generate predictions for sj
= Chx;)
oh yi
vel dataset consists of examples of the form ((¥j--- yi) where
© The meta - let
s of the base - level classifiers and the class is the
the features are the prediction:
correct class of the example in hand.
«Why do ensemble methods work ?
* Based on one of two basic observations : : oe
1. Variance reduction : If the training sets are completely independent, it will
always helps to average an ensemble because this will reduce variance without
affecting bias (e.g. - bagging) and reduce sensitivity to individual data points.
2. Bias reduction : For simple models, average of models has much greater
capacity than single model Averaging models can reduce bias substantially by
increasing capacity and control variance by Citting one component at a time.
E23 Adaboost
* AdaBoost also referred to as adaptive boosting
is a method in Machine Learning used as an
ensemble method. The maximum not unusual
algorithm used with AdaBoost is selection trees
with one stage meaning with decision trees with
most effective 1 split. These trees also are
referred to as decision stumps.
Stump.
* The working of the AdaBoost version follows Fig. 3.2.3 Adaboost
the beneath-referred to path :
+ Creation of the base leamer.
= Calculation of the total error via the beneath formulation.
* Calculation of performance of the decision stumps.
«Updating the weights in line with the misclassified factors.
Creation of a new database :
AdaBoost ensemble :
* In the ensemble approach, we upload the susceptible fashions sequentially and
then teach them the use of weighted schooling records.
TECHNICAL PUBLICATIONS® - an up-thrust for knowledgeMachine Learning 3-13 Ensemble Techniques and Unsupervised Learning
+ We hold to iterate the process till we gain the advent of a pre-set range of
vulnerable learners or we can not look at further improvement at the dataset. At
the end of the algorithm, we are left with some vulnerable learners with a stage
B ifee.
"BFAD Difference between Bagging and Boosting
Bagging . Boosting
Bagging is a technique that builds Boosting refers to a group of algorithms |
multiple homogeneous models from that utilize weighted averages to make
different subsamples of the same weak learning algorithms stronger
training dataset to obtain more accurate _ learning algorithms.
predictions than its individual models
Learns them independently from each __Learns them sequentially in a very
other in parallel adaptative way
It helps in reducing variance. It helps in reducing bias and variance.
Every model receives an equal weight. _ Models are weighted by their
performance.
Clustering
+ © Given a set of objects, place them in groups such that the objects in a group are
similar (or related) to one another and different from (or unrelated to) the objects
in other groups.
© Cluster analysis can be a powerful data-mining tool for any organization that
needs to identity discrete groups of customers, sales transactions, or other types of
behaviors and things. For example, insurance providers use cluster analysis to
detect fraudulent claims and banks used it for credit scoring.
* Cluster analysis uses mathematical models to discover groups of similar customers
based on the smallest variations among customers within each group.
* Cluster is a group of objects that belong to the same class, In another words the
_ similar object are grouped in one cluster and dissimilar are grouped in other
cluster,
Clustering is a process of partitioning a set of data in a set of meaningful
subclasses. Every data in the sub class shares a common trait. It helps a user
Understand the natural grouping or structure in a data set.
parious types of clustering methods are partitioning methods, hierarchical
lustering, fuzzy clustering, density based clustering and model based clustering.
Cluster anlysis is process of grouping a set of data objects into clusters.
TECHNICAL PUBLICATIONS® - an up-thrust for knowledgeTechniques 2nd UNSUPerV0 Lo4m,
g-14___ Ensemble ;
Machine Leeming .
stering algorithm are as fol
i clu
+ Desirable properties of @
Fig. 3.3.1
1. Scalability (in terms of both time and space)
2. Ability to deal with different data types _
' uirements for domain knowledge to determine input parameters,
3. Minimal req}
4. Interpretability and usability.
eee of data is a method by which large sets of data are grouped into
, clusters of smaller sets of similar data. Clustering can be considered the most
important unspervised learning problem.
A cluster is therefore a collection of objects which are "similar" between them and
are dissimilar" to the objects belonging to other clusters. Fig. 3.3.1 shows cluster.
In this case we easily identify the 4 clusters into which the data can be divided;
the similarity criterion is distance : two or more objects belong to the same cluster
if they are “close” according to a given distance (in this case geometrical distance),
This is called distance-based clustering.
Rew data Clustering algorithm Clusters of data
Fig. 3.3.2
* Clustering means grouping of data or dividing a large
data set into smaller data sets of some similarity.
. " ‘ - Centroid
A clustering algorithm attempts to find natural groups =
components or data based on some similarity. Also, the
clustering algorithm finds the centroid of a group of
TECHNICAL PUBLICATIONS® - an Up-thrust for knowl
owledgeMachine Leaming 3-15 Ensemble Techniques and Unsupervised Learning
basically a statistical description of the cluster centroids with the number of
components in each cluster.
+ Cluster centroid: The centroid of a cluster is a point whose parameter values are
the mean of the parameter values of all the points in the cluster. Each cluster has
a well defined centroid.
« Distance : The distance between two points is taken as a common metric to as see
the similarity among the components of population. The commonly used distance
measure is the euclidean metric which defines the distance between two points
P=(P1/P2---) and q =(q1,q2,...)is given by,
k
d = Yi-ai)?
a
+ The goal of clustering is to determine the intrinsic grouping in a set of unlableled
data. But how to decide what constitutes a good clustering ? It can be shown that
there is no absolute "best" criterion which would be independent of the final aim
of the clustering. Consequently, it is the user which must supply criterion, in such
a way that the result of the clustering will suit their needs.
« Clustering analysis helps construct meaninful partitioning of a large set of objects
Cluster analysis has been widely used in numerous applications, including pattern
recognition, data analysis, image processing etc.
* Clustering algorithms may be classified as listed below :
1. Exclusive clustering
2. Overlapping clustering
3. Hierarchical clustering
4, Probabilisitic clustering.
* A good clustering method will produce high quality clusters high intra - class
similarlity and low inter - class similarity. The quality of a clustering result
depends on both the similarity measure used by the method and its
implementation. The quality of a clustering method is also measured by it's ability
to discover some or all of the hidden patterns.
* Clustering techniques types : The major clustering techniques are,
a) Partitioning methods
>) Hierarchical methods
°) Density - based methods.
TECHNICAL PUBLICATIONS® - an up-hrust for knowledgeET |
Machine Learning
means
ach cluster is represented
4. Here © it is typicall bY the
usters, it is typically a user in,
‘Put
matically estimate K.
unsupervised Learning *
F tho
«js heuristic Me
K-Means clustering 1 h stands for number of ch
center of the cluster. can be used to auto
thm; some ;
to the algorithm; s°! £ components of the population equa a
This method ity take o! this step itself ts fal required numbe
E ee
De ee al such that the points are mutually farthes' apart.
us apart
o! each component in the population and assigns it to one of the
8 minimum distance. The centroid’s position j,
o the cluster and this continue
onent is added t
‘he final required number of clusters,
criteria
akes the number ©
f clusters. In
Next, it examine:
clusters dependi
recalculated everytime a CO
until all the components are 8}
Given K, the K-means algorithm consists of f
1. Select initial centroids at random.
2. Assign each object to the cluster wii
3. Compute each centroid as the mean of the obj
ling on the
mp:
rouped into #
four steps :
ith the nearest centroid.
jects assigned to it.
4. Repeat previous 2 steps until no change.
The x1,..,xy are data points or vectors of observations. Each observation
(vector x;) will Es assigned to one and only one cluster. The C(i) denotes cluster
number for the i observation. K-means minimizes within-cluster point scatter :
K
wo =>
22 cece Ix;
K
K
= ZNK OY linmg |?
Kel C=K
i IP
where
mx is the mean vector of the K"®: clustey
r.
Nx is the numbe
er of observations in Kth
fons in K cluster.
K-Means Algorithm Properties
1. There are always K clusters,
2. There i:
is always at least one item in ch
each cluster,
3. The clusters are i
Non-hierarchical and they d
cluster js closer t eee Weis
ways j 0 its ch
YS involve the “omter’ ofc than any other cluster bea
lusters,
4. Every member of a
closeness does not al
rejing Learning: 3-17 __Ensomble Techniques and Unsupervised Leaning
_ Mech
K-Means Algorithm Process
The dataset is partitioned into K clusters and the data points are randomly
assigned to the clusters resulting in clusters that have roughly the same number of
data points.
For each data point.
Calculate the distance from the data point to each cluster.
p, If the data point is closest to its own cluster, leave it where it is.
If the data point is not closest to its own cluster, move it into the closest
cluster.
Repeat the above step until a complete pass through all the data points results in
no data point moving from one cluster to another. At this point the clusters are
stable and the clustering process ends.
The choice of initial partition can greatly affect the final clusters that result, in
terms of inter - cluster and intracluster distances and cohesion.
Kemeans algorithm is iterative in nature. It converges, however only a local
minimum is obtained. It works only for numerical data. This method easy to
c
- implement.
+ Advantages of K-Means Algorithm :
1, Efficient in computation
2, Easy to implement.
+ Weaknesses
1. Applicable only when mean is defined.
Need to specify K, the number of clusters, in advance.
2
3. Trouble with noisy data and outliers.
4. Not suitable to discover clusters with non-convex shapes.
"© kNearest Neighbour is one of the only Machine Learning algorithms based totally
on supervised learning approach.
* NN algorithm assumes the similarity between the brand new case/facts and
available instances and placed the brand new case into the category that is
Maximum similar to the to be had classes.
: NN set of rules shops all of the to be had facts and classifies a new statistics
Point based at the similarity. This means when new data seems then it may be
effortlessly categorised into a properly suite class by using k-NN algorithm.
TECHNICAL PUBLICATIONS® - an up-thrust for knowledge18 Ensemble Techniques and Unsupervised Learning
Machine Learning SI
nas well as for classification howeye,
18801
> .s can be used for regres
ane we ification troubles.
cause of this it does no longer makes any
normally it's miles used for the class
K-NN is a non-parametric algorithm, be
assumption on underlying data.
because it does no lon,
learner set of rules ger
is also referred to as a lazy pape on
- a trom the training set immediately as a substitute it shops ataset ang
ese: .
at the time of class, it plays an movement at the datase'
‘The KNN set of rules at the schooling section simply stores the dataset and when
then it classifies that statistics into a class that is an awful Jot
it gets new data,
similar to the brand new data. ;
Example ; Suppose, we've an picture of a creature that ole oe like cat and
dog, but we want to know both it is a cat or dog. So for this identity, we are able
to use the KNN algorithm, because it works on a similarity degree. Our kKNN
version will discover the similar features of the new facts set to the cats and dogs
snap shots and primarily based on the most similar functions it will place it in
both cat or canine class.
EEE why Do We Need KNN ?
* Suppose there are two categories, ile, category A and category B and we've a
brand new statistics point x1, so this fact point will lie within of these classes. To
solve this sort of problem, we need a k-NN set of rules. With the help of k-NN,
we will without difficulty discover the category or class of a selected dataset.
Consider the underneath diagram :
x,
% %
oo
0 20
o%o a
Q 0° 0 0% 0
Category B ° °
. ‘ \ Category B
°
O80 New data point ° AC
ee ©0600 New data point
oo assigned to
Category A 5 cates
I oo Category A
X | _
Before k-NN x,
After k-NN
Fig. 3.4.1 Why do we need KNN ?
TECHNICAL PUBLicatinuin®chine Learning 3.
E Ma 19 Ensemble Techniques and Unsupervised Learning
po How Does KNN Work 2
The k-NN working can b is
. ‘SB explained on the basis of the below algorithm :
step- 1? Select the wide variety k of the acquaintances.
B step 2? Calculate the Euclidean distance of k variety of friends.
_ step + 3: Take the k nearest neighbors as according to the calculated Euclidean
- distance.
step - 4: Among these ok pals, count number the number of the data points in each
dass.
step -5: Assign the brand new records points to that cate; for which th tit
"of the neighbor is maximum. P ce ha ee
step - 6: Our model is ready.
+ Suppose we've got a brand new information point and we want to place it in the
required category. Consider the under image
X2
& 6
° 4°
. @ °
Category B
o\
Category A
x
Fig. 3.4.2 KNN example
we are able to select the
_ ® Firstly,
ok = 5.
we are able to pick the number of friends, so
Euclidean distance between the facts points. The
* Next, we will calculate the
which we've got already studied in
Euclidean distance is the gap between points,
geometry. It may be calculated as :
TECHNICAL PUBLICATIONS® - an up-thrust for knowledgeand Unsupervised Leary
Techniques 9
Ensemble
3-20
Machine Leaming
Best hyperplane
Euclidean distance between A, and By %
+(Y2-Y4)'
Fig. 3.4.3 KNN example continue
By calculating the Euclidean distance we Bot the nearest acquaintances, as 3
egory A and two Nearest associates in class B, Consider
TECHNICAL PUBLICATIONS® . an up.tnrust for knowledge3-21
°
°
@° 0
\.@ %o
Category 8
New data
point
°
oo
ee?
Category A
Ensemble Techniques and Unsupervised Leaming
Category A: 3 neighbours
Category B : 2 neighbours
%
Fig. 3.4.4 KNN example continue
« As we are able to see the three nearest acquaintances are from category A,
subsequently this new fact point must belong to category A.
_ EEE] Difference between K-means and kNN
Sr. No. K-means
i K-Means is an unsupervised
machine learning algorithm used
for clustering.
as K-Means is an eager learner
3. It is used for Clustering
4 K' in K-Means is the number of
clusters the algorithm is trying to
identify/learn from the data
5. K-means require unlabelled data.
It gathers and groups data into k
number of clusters.
K-NN is a lazy learner
kKNN
KNN is a supervised machine
learning algorithm used for
classification.
It is used mostly for
Classification, and sometimes
even for Regression
K in KNN is the number of
nearest neighbours used to :
classify or predict a test sample |
KNN require labelled data and
will give new data points
accordingly to the k number or
the closest data points.semble Techniques 2d UNSUPESe¢ Loon,
0
3-22 En:
Machine Learning
7 ixture Models
Gaussian Mixtu' |
Es . odels is a “soft” clustering algorithm, where each po;
© Gaussian Sa ianeT to all clusters. This is different than k-means where g,4,
bilistical
proba ster.
point belongs to one clu
« The Gaussian mixture mo :
points are generated from a mix of
parameters. ; oe .
« For example, in modeling human height data, height is pea oaned as a
normal distribution for each gender with a mean of approximately 5'10" for males
and 5'5" for females. Given only the height data and not the gender assignments
for each data point, the distribution of all heights would follow the sum of two
scaled (different variance) and shifted (different mean) normal distributions. 4
model making this assumption is an example of a Gaussian mixture model.
* Gaussian mixture models do not rigidly classify each and every instance into one
class or the other. The algorithm attempts to produce K-Gaussian distributions that
would take into account the entire training space. Every point can be associated
with one or more distributions. Consequently, the deterministic factor would be
the probability that each point belongs to a certain Gaussian distribution.
Mean vectors and covariance
del is a probabilistic model that assumes all the datg
f Gaussian distributions with unknoys
* Gaussian mixture model consists of two parts :
matrices.
+ A Gaussian distribution is defined as a continuous probability distribution that
takes on a bell-shaped curve. Another name for Gaussian distribution is the
normal distribution.
* In a one dimensional space, the probability density function of a Gaussian
distribution is given by :
bow)?
f(x|p,02) =
where 1 is the mean and o? is the variance
. ae mixture models can be used for a variety of use cases, including
fying customer segments, detecting fraudulent activity and clustering image
* GMMs have a variety of real-wor i
I-world applicati ' ,
a) Used for signal processing applications. Some of them are listed below,
b) Used for customer churn analysis
©) Used for language identification
eeeporing Le0rind 3-23 Ensemble Techniques and Unsupervised Leaming
4) Used in video game industry
e) Genre classification of songs,
al Expectation - Maximization
ein Gaussian ae models, an expectation - maximization method is a powerful
tool for estimating the parameters of a Gaussian mixture model. The expectation
is termed E and maximization is termed M.
«Expectation is used to find the Gaussian parameters which are used to represent
each component of Gaussian mixture models. Maximization is termed M and it is
involved in determining whether new data points can be added or not.
The Expectation-Maximization (EM) algorithm is used in maximum likelihood
estimation where the problem involves two.sets of random variables of which one,
X, is observable and the other Z, is hidden.
«The goal of the algorithm is to find the parameter vector @ that maximizes the
likelihood of the observed values of X, L($1X).
‘© But in cases where this is not feasible, we associate the extra hidden variables Z
and express the underlying model using both, to maximize the likelihood of the
joint distribution of X and Z, the complete likelihood L.(@1X,Z).
[EE@] EM Algorithm
_« Expectation-Maximization (EM) is an iterative method used to find maximum
likelihood estimates of parameters in probabilistic models, where the model
depends on unobserved, also called latent, variables.
EM alternates between performing an expectation (E) step, which computes an
expectation of the likelihood by including the latent variables as if they were
observed, and maximization (M) step, which computes the maximum likelihood
estimates of the parameters by maximizing the expected likelihood found in the E
step.
© The parameters found on the M step are then used to start another E step, and the
"process is repeated until some criterion is satisfied. EM is frequently used for data
clustering like for example in Gaussian mixtures.
In the Expectation step, find the expected values of the latent variables (here you
need to use the current parameter values).
In the Maximization step, first plug in the expected values of
in the log-likelihood of the augmented data. Then maximize this log-likelihood to
the latent variables
Teevaluate the parameters.
X
|
|
|ae |
Machine Leaming
9-24 Ensemble Techniques and Unsupervised Leaming
tion (EM) is a technique used in point estimation. Given
al is
Expectation-Maximiz: latent) variables Z we want to
set of observable variables X and unknown (
estimate parameters @ in a model. /
The Expectation Maximization (EM) algorithm SE ees asim
likeli-hood estimation procedure for statistical models when © of
the variables in the model are not observed ;
The EM algorithm is an elegant and powerful method for finding the maximum
likelihood of models with hidden variables. The key concept in the EM algorithm
is that it iterates between the expectation step (E-step) and maximization step
(M-step) until convergence.
In the E-step, the algorithm estimates the posterior distribution of the hidden
variables Q given the observed data and the current parameter settings; and in the
M-step the algorithm calculates the ML parameter settings with Q fixed.
At the end of each iteration the lower bound on the likelihood is optimized for the
given parameter setting (M-step) and the likelihood is set to that bound (E-step),
which guarantees an increase in the likelihood and convergence to a local
maximum, or global maximum if the likelihood function is unimodal.
Generally, EM works best when the fraction of missing information is small and
the dimensionality of the data is not too large. EM can require many iterations,
and higher dimensionality can dramatically slow down the E-step.
EM is useful for several reasons: conceptual simplicity, ease of implementation,
and the fact that each iteration improves 1(@). The rate of convergence on the first
few steps is typically quite good, but can become excruciatingly slow as you
approach local optima.
Sometimes the M-step is a constrained maximization, which means that there are
constraints on valid solutions not encoded in the function itself.
Expectation maximization is an effective technique that is often used in data
analysis to manage missing data. Indeed, expectation maximization overcomes
Some of the limitations of other techniques, such as mean substitution or
aca substitution. These alternative techniques generate biased estimates-and,
Specifically, underestimate the standard e i 1
rors. Expectati imizati
overcomes this problem. ee
Two Marks Questions with Answers
Q1
What Is unsupervised learning 7
TECHNICAL PUBLICATIONS® « an up.tnnsy for knowled
190Machine Learning 3-25 Ensemble Techniques and Unsupervised Learning
What is semi-supervised learning 7
a2
Semi-supervised learning uses both labeled and unlabeled data to improve
Ans.
supervised learning.
what is ensemble method ?
a3
‘Ans. : Ensemble methods is a machine learning technique that combines several base
models in order to produce one optimal predictive model. It combine the insights
obtained from multiple learning models to facilitate accurate and improved decisions.
Q4 What is cluster ?
‘Ans. : Cluster is a group of objects that belong to the same class. In other words the
similar object are grouped in one cluster and dissimilar are grouped in other cluster.
a5 Explain clustering.
‘Ans. : Clustering is a process of partitioning a set of data in a set of meaningful
subclasses. Every data in the subclass shares a common trait. It helps a user understand
the natural grouping or structure in a data set.
Q6 What is bagging ?
‘Ans. : Bagging is also known as Bootstrap aggregation, ensemble method works by
training multiple models independently and combining later to result in a strong
model.
Q7 Define boosting.
Ans. : Boosting refers to a group of algorithms that utilize weighted averages to make
weak learning algorithms stronger learning algorithms
Q8 What is k-Nearest Neighbour Methods ?
Ans.: © The k-Nearest Neighbor (KNN) is a classical classification method and
requires no training effort, critically depends on the quality of the distance
measures among examples.
* The KNN classifier uses mahalanobis distance function. A sample is classified
according to the majority vote of the its nearest k training samples in the
feature’ space. Distance of a sample to its neighbors is defined using a
distance function,
Q9 Which are the performance factors that Influence KNN algorithm ?
Ans. : The performance of the kNN algorithm is influenced by three main factors :
1. The distance function or distance metric used to determine the nearest
neighbors.
2. The decision rule used to derive a classification from the k-nearest neighbors.
3. The number of neighbors used to classify the new example.
TECHNICAL PUBLICATIONS®
= an up-thrust for knowledge6 Ensemble Technigue’ and Unsupenisgg
3-2
Machine Leaming
fe Komeans clustering ?
heuristic me!
kemeans algoti
k-clusters $0
shod. Here each cluster is representeg
thm takes the input parameter, the
that the resulting intracluster aia ang
tity
is
Q.10 What Ii
Ans. : k-means clustering is
center of the cluster. The
n objects into
Juster similarity is low.
perties of K-Means algorithm.
partitions a set of
high but the intercl
Q.11 List the pro
‘Ans. : 1. There are always k clusters.
2, There is always at least one item in each cluster.
3, The clusters are non-hierarchical and they do not overlap.
Q.12 What is stacking ?
aa Stacking, sometimes called stacked generalization, is an ensemble mach
g method that combines multiple heterogeneous base or component mises
a meta-model.
Q.13 How do GMMs differentiate from K-means clustering ?
Ans.: GMMs and K-means, both i pervis
anaes , both are clustering algorithms used ;
lamin ng tas. However, the basic difference between the is — ects
clusteri: i i ee :
gears ing method while GMMs is a distribution based ae :
tering
Quo