Learning Associations
• Basket analysis:
P(Y | X): the probability that somebody who buys X also buys Y, where X and Y are products/services.
Market-Basket transactions
Example: P(chips | beer) = 0.7
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
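As a rough illustration (not part of the original slides), P(Y | X) can be estimated directly from transaction counts; the sketch below uses the five transactions above, with function and variable names of my own choosing.

```python
# Estimate P(Y | X) = (#baskets containing X and Y) / (#baskets containing X).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def conditional(y, x, baskets):
    """Empirical estimate of P(y | x) from a list of baskets."""
    with_x = [b for b in baskets if x in b]
    return sum(y in b for b in with_x) / len(with_x)

print(conditional("Diaper", "Beer", transactions))  # 3/3 = 1.0 for this toy data
```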
Training and testing
[Diagram: from the universal set (unobserved), data acquisition yields the training set (observed); practical usage is evaluated on the testing set (unobserved).]
Training and testing
• Training is the process of making the system able to learn.
• No free lunch rule:
– The training set and the testing set come from the same distribution
– We need to make some assumptions or introduce a bias
Performance
• There are several factors affecting the performance:
– Types of training provided
– The form and extent of any initial background
knowledge
– The type of feedback provided
– The learning algorithms used
• Two important factors:
– Modeling
– Optimization
Classification: Applications
• Face recognition: Pose, lighting, occlusion (glasses,
beard), make-up, hair style
• Character recognition: Different handwriting styles.
• Speech recognition: Temporal dependency.
– Use of a dictionary or the syntax of the language.
– Sensor fusion: Combine multiple modalities, e.g., visual (lip image) and acoustic, for speech
• Medical diagnosis: From symptoms to illnesses
• Web Advertising: Predict if a user clicks on an ad on the
Internet.
Steps
[Diagram: Training: training images + training labels → image features → training → learned model. Testing: test image → image features → learned model → prediction.]
Prediction: Regression
• Example: price of a used car
y = wx + w0
• x: car attributes
y: price
• In general, y = g(x | θ), where g(·) is the model and θ its parameters
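A minimal sketch of fitting the linear model above by least squares, assuming made-up car data (x = age in years, y = price); np.polyfit stands in for whichever estimator the course actually uses.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 5.0, 7.0, 10.0])     # hypothetical car ages
y = np.array([18.0, 16.5, 14.0, 11.0, 8.5, 5.0])  # hypothetical prices (k$)

w, w0 = np.polyfit(x, y, deg=1)   # least-squares fit of y = w*x + w0
print(f"g(x | theta): y = {w:.2f}*x + {w0:.2f}")
print("predicted price of a 4-year-old car:", w * 4 + w0)
```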
Regression Applications
• Navigating a car: Angle of the steering
wheel (CMU NavLab)
• Kinematics of a robot arm
(x,y)
α2
α1= g1(x,y)
α2= g2(x,y)
α1
Inductive Learning
• Given examples of a function (X, F(X))
• Predict function F(X) for new examples X
– Discrete F(X): Classification
– Continuous F(X): Regression
– F(X) = Probability(X): Probability estimation
Supervised Learning: Uses
Example: decision trees are tools that create rules
• Prediction of future cases: Use the rule to
predict the output for future inputs
• Knowledge extraction: The rule is easy to
understand
• Compression: The rule is simpler than the
data it explains
• Outlier detection: Exceptions that are not
covered by the rule, e.g., fraud
Algorithms
• The success of a machine learning system also depends
on the algorithms.
• The algorithms control the search to find and build the
knowledge structures.
• The learning algorithms should extract useful information
from training examples.
Algorithms
• Supervised learning
– Prediction
– Classification (discrete labels), Regression (real values)
• Unsupervised learning
– Clustering
– Probability distribution estimation
– Finding association (in features)
– Dimension reduction
• Semi-supervised learning
• Reinforcement learning
– Decision making (robot, chess machine)
Algorithms
[Diagram: supervised, unsupervised, and semi-supervised learning.]
What are we seeking?
• Supervised: low E-out, or maximize probabilistic terms
– E-in: error on the training set
– E-out: error on the testing set
• Unsupervised: minimum quantization error, minimum distance, MAP, MLE (maximum likelihood estimation)
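A small sketch, with hypothetical data and a simple linear model, of the distinction above: the same error measure is E-in when evaluated on the (observed) training set and E-out when evaluated on a held-out testing set.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, 200)         # hypothetical noisy targets

x_train, y_train = x[:150], y[:150]                 # observed training set
x_test, y_test = x[150:], y[150:]                   # held-out testing set

w, w0 = np.polyfit(x_train, y_train, deg=1)         # fit on training data only
e_in = np.mean((w * x_train + w0 - y_train) ** 2)   # E-in: error on training set
e_out = np.mean((w * x_test + w0 - y_test) ** 2)    # E-out: error on testing set
print(e_in, e_out)
```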
Machine learning structure
• Supervised learning
Unsupervised Learning
• Learning “what normally happens”
• No output
• Clustering: Grouping similar instances
• Other applications: Summarization,
Association Analysis
• Example applications
– Customer segmentation in CRM
– Image compression: Color quantization
– Bioinformatics: Learning motifs
Machine learning structure
• Unsupervised learning
Clustering Analysis
• Definition
Grouping unlabeled data into clusters, for the purpose of
inference of hidden structures or information
• Dissimilarity measurement (a small sketch follows this list)
– Distance: Euclidean (L2), Manhattan (L1), …
– Angle: inner product, …
– Non-metric: rank, intensity, …
• Types of Clustering
– Hierarchical
• Agglomerative or divisive
– Partitioning
• K-means, VQ, MDS, …
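For concreteness, a tiny sketch of the dissimilarity measures listed above; the vectors are illustrative only.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.0, 4.0])

euclidean = np.linalg.norm(a - b)      # L2 distance
manhattan = np.abs(a - b).sum()        # L1 distance
# Angle-based (inner-product) similarity: cosine of the angle between a and b
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(euclidean, manhattan, cosine)
```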
K-means clustering (figure: Matlab help page)
• Find K partitions with the total intra-cluster variance minimized
• Iterative method
– Initialization: randomized y_i
– Assignment of x (y_i fixed)
– Update of y_i (x fixed)
• Problem? Trapped in local minima (MacKay, 2003)
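A minimal NumPy sketch of the iterative procedure just outlined (random initialization of the centers y_i, assignment of each x, update of y_i); the function name and defaults are my own.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # init: random y_i
    for _ in range(n_iter):
        # Assignment step: each x goes to its nearest center (y_i fixed)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its points (x fixed)
        for i in range(k):
            if np.any(labels == i):
                centers[i] = X[labels == i].mean(axis=0)
    return centers, labels

# Because the initialization is random, the result can indeed be trapped in a
# local minimum, which motivates the deterministic annealing approach below.
```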
Deterministic Annealing (DA)
• Deterministically avoid local minima
• No stochastic process (random walk)
• Trace the global solution by changing the level of randomness
• Statistical mechanics (figure: Maxima and Minima, Wikipedia)
– Gibbs distribution
– Helmholtz free energy F = D − TS
▪ Average energy D = <E_x>
▪ Entropy S = − Σ_x P(E_x) ln P(E_x)
▪ F = − T ln Z
• In DA, we minimize F
Analogy to the physical annealing process
• Control energy (randomness) by temperature (high → low)
• Start with a high temperature (T = 1)
▪ Soft (or fuzzy) association probabilities
▪ Smooth cost function with one global minimum
• Lower the temperature (T → 0)
▪ Hard association
▪ The full complexity is revealed; clusters emerge
• Minimize F iteratively, using E(x, y_j) = ||x − y_j||^2
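A rough sketch of deterministic-annealing clustering under the Gibbs distribution with E(x, y_j) = ||x − y_j||^2; the cooling schedule and inner-loop count are assumptions of mine, not the slides' exact update equations.

```python
import numpy as np

def da_cluster(X, k, T_start=1.0, T_min=1e-3, cooling=0.9, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Start with all centers near the data mean; they separate as T is lowered
    centers = X.mean(axis=0) + 1e-3 * rng.standard_normal((k, X.shape[1]))
    T = T_start
    while T > T_min:
        for _ in range(20):                      # inner loop at fixed temperature
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            # Gibbs distribution: soft association probabilities P(y_j | x)
            p = np.exp(-(d2 - d2.min(axis=1, keepdims=True)) / T)
            p /= p.sum(axis=1, keepdims=True)
            # Centroid update: weighted mean under the soft associations
            centers = (p.T @ X) / (p.sum(axis=0)[:, None] + 1e-12)
        T *= cooling                             # lower the temperature: associations harden
    return centers
```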
Dimension Reduction
• Definition
Process of transforming high-dimensional data into low-dimensional data, to improve accuracy or understanding, or to remove noise
• Curse of dimensionality
Complexity grows exponentially in volume as extra dimensions are added
• Types (Koppen, 2000)
– Feature selection: choose representatives (e.g., filter, …)
– Feature extraction: map to a lower dimension (e.g., PCA, MDS, …)
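A minimal sketch of feature extraction by PCA (one of the methods named above): project the centered data onto the top-d principal directions.

```python
import numpy as np

def pca_project(X, d):
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # center the data
    # Principal directions from the SVD of the centered data matrix
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:d].T                         # low-dimensional representation
```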
Machine Learning in a
Nutshell
• Tens of thousands of machine learning
algorithms
• Hundreds of new ones every year
• Every machine learning algorithm has
three components:
– Representation
– Evaluation
– Optimization
Generative vs. Discriminative
Classifiers
Generative models
• Represent both the data and the labels
• Often make use of conditional independence and priors
• Examples: Naïve Bayes classifier, Bayesian network
• Models of the data may apply to future prediction problems

Discriminative models
• Learn to directly predict the labels from the data
• Often assume a simple boundary (e.g., linear)
• Examples: logistic regression, SVM, boosted decision trees
• Often easier to predict a label from the data than to model the data
Slide credit: D. Hoiem
Classifiers: Logistic Regression
• Maximize the likelihood of the label given the data, assuming a log-linear model
[Figure: two classes (male, female) in a 2-D feature space; x1 = pitch of voice]
log [ P(x1, x2 | y = 1) / P(x1, x2 | y = −1) ] = w^T x
P(y = 1 | x1, x2) = 1 / (1 + exp(−w^T x))
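A minimal sketch of the model above: the posterior P(y = 1 | x) = 1 / (1 + exp(−w^T x)), trained by a few gradient-ascent steps on the log-likelihood; the learning rate and iteration count are arbitrary choices of mine.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=1000):
    """X: (n, d) features (add a constant column for a bias term), y in {0, 1}."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ w)                       # current P(y = 1 | x)
        w += lr * X.T @ (y - p) / len(y)         # gradient of the log-likelihood
    return w
```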
Classifiers: Nearest neighbor
[Figure: training examples from class 1 and class 2, and a test example to be classified.]
f(x) = label of the training example nearest to x
All we need is a distance function for our inputs
No training required!
Slide credit: L. Lazebnik
Nearest Neighbor Classifier
• Assign label of nearest training data point to each test data point
[Figure: partitioning of the feature space for two-category 2-D and 3-D data.]
Source: D. Lowe
K-nearest neighbor
• It can be used for both classification and regression problems.
• However, it is more widely used in classification problems in the
industry.
• K-nearest neighbours is a simple algorithm that
– stores all available cases and
– classifies new cases by a majority vote of its k neighbours.
– The case is assigned to the class most common amongst its K nearest neighbours, measured by a distance function.
– These distance functions can be Euclidean, Manhattan, Minkowski, or Hamming distance.
• The first three are used for continuous variables and
• the fourth (Hamming) for categorical variables.
– If K = 1, then the case is simply assigned to the class of its nearest neighbour.
– At times, choosing K turns out to be a challenge when performing KNN modelling.
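A minimal sketch of K-nearest-neighbour classification by majority vote with Euclidean distance; with k = 1 it reduces to the nearest-neighbour rule f(x) from the earlier slide. The function names are my own.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    X_train = np.asarray(X_train, dtype=float)
    dists = np.linalg.norm(X_train - np.asarray(x, dtype=float), axis=1)  # Euclidean
    nearest = np.argsort(dists)[:k]               # indices of the k closest cases
    votes = Counter(y_train[i] for i in nearest)  # majority vote of the neighbours
    return votes.most_common(1)[0][0]
```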
Naïve Bayes
• Bayes' theorem provides a way to calculate
– the posterior probability P(c|x) from P(c), P(x), and P(x|c):
P(c|x) = P(x|c) P(c) / P(x)
– Here,
– P(c|x) is the posterior probability of class (target)
given predictor (attribute).
– P(c) is the prior probability of class.
– P(x|c) is the likelihood which is the probability of predictor given class.
– P(x) is the prior probability of predictor
Naïve Bayes Example
• Let's understand it using an example.
– We have a training data set of weather and the corresponding target variable 'Play'.
– We need to classify whether players will play or not based on the weather conditions.
– Let's follow the steps below.
• Step 1: Convert the data set to frequency table
• Step 2: Create Likelihood table by finding the probabilities like
– Overcast probability = 0.29 and
– probability of playing is 0.64.
Naïve Bayes
– Step 3: Now, use Naive Bayesian equation to calculate the posterior
probability for each class.
• The class with the highest posterior probability is the outcome of prediction.
• Problem:
– Players will play if the weather is sunny. Is this statement correct?
– We can solve it using the method discussed above:
• P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
• Here we have P(Sunny | Yes) = 3/9 = 0.33,
P(Sunny) = 5/14 = 0.36,
P(Yes) = 9/14 = 0.64
• Now, P(Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60,
• which is the higher probability.
• Naïve Bayes uses a similar method to
– predict the probability of different classes based on various attributes.
– This algorithm is mostly used in text classification and
– in problems having multiple classes.
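The slide's worked example, written out as a few lines of arithmetic using the frequencies quoted above:

```python
# P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
p_sunny_given_yes = 3 / 9     # 0.33
p_yes = 9 / 14                # 0.64
p_sunny = 5 / 14              # 0.36

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print(round(p_yes_given_sunny, 2))   # 0.6, so "play" is the more probable class
```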
EM algorithm
• Problems in ML estimation
– The observation X is often not complete
– A latent (hidden) variable Z exists
– It is hard to explore the whole parameter space
• Expectation-Maximization algorithm
– Objective: find the ML estimate of θ over the latent distribution P(Z | X, θ)
– Steps
0. Init – choose a random θ_old
1. E-step – compute the expectation under P(Z | X, θ_old)
2. M-step – find θ_new which maximizes the expected likelihood
3. Go to step 1 after updating θ_old → θ_new
Problem
• Estimate the hidden parameters θ = {μ, σ} from the given data, drawn from k Gaussian distributions
• Gaussian distribution (Mitchell, 1997)
• Maximum likelihood
– With Gaussian densities (P = N), solve either by brute force or by a numerical method
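A minimal sketch of the EM loop for a 1-D mixture of k Gaussians, estimating θ = {μ, σ}; keeping the mixing weights uniform is my simplification for brevity, not part of the slides.

```python
import numpy as np

def em_gaussian_mixture(x, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    mu = rng.choice(x, size=k, replace=False)       # init: random theta_old
    sigma = np.full(k, x.std() + 1e-6)
    for _ in range(n_iter):
        # E-step: responsibilities P(Z | X, theta_old)
        dens = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: theta_new maximizing the expected log-likelihood
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-6
    return mu, sigma
```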