Artificial Intelligence & Data Mining
Week 12
WEN Bihan (Asst Prof)
Homepage: https://personal.ntu.edu.sg/bihan.wen/
1
Recap: Bias and Variance
• Model Complexity Analysis:
• Underfitting: high bias and low variance
• Overfitting: low bias and high variance
2
Recap: Overfitting and Underfitting
• Simple Model
• High Bias
• Causes an algorithm to miss relevant relations between the input features and the target outputs.
• Complex Model
• High Variance
• Causes an algorithm to model the noise in the training set.
3
Recap: Overfitting and Underfitting
1. High complexity -> large gap between training and validation error -> overfitting
2. Low complexity -> small gap -> underfitting
4
Overfitting and Underfitting - Examples
• Suppose that you are training a NN for classification. When you
reduce the model complexity by removing a layer, your validation
accuracy increases.
• How does the model bias and variance change?
• Does your model suffer from overfitting or underfitting?
• Now, if you make your NN even deeper (higher complexity)
• Will your training accuracy increase?
• Will your validation accuracy increase?
5
Regularization to prevent overfitting
• Solutions, in the context of learning neural networks:
1. Limit the model complexity by reducing the model expressiveness.
2. Increase the training data complexity / size, to reduce the variance.
3. Simplify data distribution and dimensionality.
• Dimensionality Reduction
[Figure: dimensionality reduction with k = 200, k = 50, and k = 2]
6
Dimensionality Reduction
WEN Bihan (Asst Prof)
Homepage: https://personal.ntu.edu.sg/bihan.wen/
7
Outline
• Concept of Dimensionality Reduction
• Principal Component Analysis (PCA)
• How to derive PCA (not examined)
• Examples
8
Carry-on Questions
• Why do we need dimensionality reduction?
• What are two objectives achieved by PCA?
• How to calculate PCA (not examined)?
9
Unsupervised Learning
• Clustering
• Exploit similar structures amongst the data themselves.
• One way to summarize various data points with a single categorical variable,
i.e., the cluster centroid.
10
Unsupervised Learning
• Clustering
• Exploit similar structures amongst the data themselves.
• One way to summarize various data points with a single categorical variable,
i.e., the cluster centroid.
• Dimensionality Reduction
• Another way to simplify the complex and high-dimensional data.
• Summarize data with a lower dimensional real valued vector.
11
Dimensionality Reduction
• Goal: find the best-fitting low-dimensional subspace to represent $\{x_i\}_{i=1}^{N}$.
• Fit 2-d data with a 1-d line.
• Fit 3-d data with a 2-d plane.
12
Dimensionality Reduction
• Goal: find the best-fitting low-dimensional subspace to represent $\{x_i\}_{i=1}^{N}$.
• Given data points in d dimensions
• Convert them to data points in r<d dimensions
• With minimal loss of information (a minimal sketch follows below)
13
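To make the d-to-r mapping above concrete before PCA is introduced, here is a minimal NumPy sketch in Python. The 2-D basis here is an arbitrary orthonormal one chosen only for illustration, not the best-fitting subspace that PCA would return.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # 100 data points in d = 3 dimensions

# An arbitrary orthonormal basis for an r = 2 subspace (columns of V_r).
# PCA would instead pick the basis that preserves the most variance.
V_r, _ = np.linalg.qr(rng.normal(size=(3, 2)))

Z = X @ V_r                            # r-dimensional codes z_i = V_r^T x_i
X_hat = Z @ V_r.T                      # back-projection into the original space

mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print(Z.shape, round(mse, 3))          # (100, 2) and the information lost
```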
Example: Reduce Data from 2D to 1D
[Figure: 2-D data points with one axis in cm and the other in inches, lying close to a line.]
14
Example: Reduce Data from 2D to 1D
Reduce data from 2D to 1D
[Figure: the 2-D data (axes in cm and inches) projected onto a 1-D line.]
15
Example: Reduce Data from 2D to 1D
Reduce data from 2D to 1D
[Figure: the projected data, now described by a single coordinate along the line.]
16
Example: Reduce Data from 3D to 2D
• 3D data points $(x_1, x_2, x_3)$: approximated by a 2D plane.
• 2D data points $(z_1, z_2)$: projected onto the plane.
17
Why Dimensionality Reduction?
• What is wrong with high dimensions?
• High Dimensions = Lots of Features
• Complex system to process
• Inefficient algorithm
• Overfitting to noise or other data corruptions.
18
Dimensionality Reduction
• Why do we need dimensionality reduction?
• Remove Feature Redundancy + Noise
• Prevent Overfitting
• Reduced Model Complexity
• Simplified Data distribution
• Better Visualization
19
Principal Component Analysis
• Principal Component Analysis (PCA)
• Simple and popular method.
• Unsupervised learning.
• To learn the “best” low-dimensional subspace for
data projection.
• What does the “best” mean here?
20
Principal Component Analysis
• Two ways to interpret the “Best”:
1. The mean square error (MSE) of the projected data is
minimized.
2. The variance of the projected data is maximized.
• These two objectives are achieved simultaneously by
applying PCA for dimensionality reduction.
• Why and How?
21
Principal Component Analysis
1. The mean square error (MSE) of the projected data
is minimized.
• $x_i$ ( black ): the original data.
• $v$ ( red ): the PCA subspace.
• $(v^T x_i)\,v$ ( blue ): the projected data.
• Minimize the MSE ( green ), i.e., the perpendicular offset, over all $\{x_i\}$.
22
Principal Component Analysis
2. The variance of the projected data is maximized.
• $(v^T x_i)\,v$ ( blue ): the projected data.
• The variance is the average squared magnitude of the blue projections.
• PCA chooses $v$ so that this blue variance is maximized.
23
Principal Component Analysis
• Minimizing MSE <=> Maximizing Projected Variance
• blue² + green² = black²
• black is fixed (given the data)
• So maximizing blue (the projected variance) is equivalent to minimizing green (the MSE). A short derivation sketch follows below.
24
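A one-line justification of the slide above, assuming a unit vector $v$ and zero-mean data, with the slide's color labels:

$$\|x_i\|^2 = \underbrace{\|(v^T x_i)\,v\|^2}_{\text{blue}^2} + \underbrace{\|x_i - (v^T x_i)\,v\|^2}_{\text{green}^2}$$

$$\frac{1}{N}\sum_{i=1}^{N} \|x_i\|^2 = \underbrace{\frac{1}{N}\sum_{i=1}^{N} (v^T x_i)^2}_{\text{projected variance}} + \underbrace{\frac{1}{N}\sum_{i=1}^{N} \|x_i - (v^T x_i)\,v\|^2}_{\text{MSE}}$$

The left-hand side is fixed by the data, so maximizing the projected-variance term is the same as minimizing the MSE term.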
Example: PCA
• Which of the following 𝒗 generates higher
projected variance?
25
Example: PCA
• Which of the following 𝒗 generates higher
projected variance?
• Find the projected mean $u$.
• Calculate the distance $d_i$ from each projected point to the mean.
• The variance of the projected data:
$$\mathrm{var} = \frac{1}{N}\sum_{i=1}^{N} d_i^2$$
(See the sketch below for this recipe in code.)
26
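A minimal sketch of the recipe on this slide in Python; the elongated point cloud and the two candidate directions are made up for illustration.

```python
import numpy as np

def projected_variance(X, v):
    """Variance of the data projected onto the unit vector v."""
    v = v / np.linalg.norm(v)
    z = X @ v                      # scalar projections v^T x_i
    d = z - z.mean()               # distance of each projection to the projected mean
    return np.mean(d ** 2)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [0.0, 0.5]])  # elongated cloud

v_a = np.array([1.0, 0.0])         # roughly along the elongated direction
v_b = np.array([0.0, 1.0])         # across it
print(projected_variance(X, v_a), projected_variance(X, v_b))
# The direction aligned with the spread gives the larger projected variance.
```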
Example: PCA
• The subspace $v$ on the left generates the higher projected variance.
[Figure: left — larger projected variance; right — smaller projected variance.]
27
Example: PCA
• Which of the following 𝒗 generates smaller
projected MSE?
28
Example: PCA
• The subspace $v$ on the left generates the smaller projected MSE, i.e., the smaller perpendicular offsets.
[Figure: the two candidate subspaces and their perpendicular offsets.]
29
Derivation of PCA - Optional
1. The mean square error (MSE) of
the projected data is minimized.
• For all $\{x_i\}_{i=1}^{N}$, the average green² (squared perpendicular offset) is minimized.
• Find the best red unit vector $v$ that minimizes the green².
• The principal component: $z_i = (v^T x_i)\,v$
30
Derivation of PCA - Optional
1. The mean square error (MSE) of
the projected data is minimized.
• The principal component: $z_i = (v^T x_i)\,v$
• The MSE to be minimized:
$$MSE = \frac{1}{N}\sum_{i=1}^{N} \|z_i - x_i\|^2$$
31
Derivation of PCA - Optional
2. The variance of the projected
data is maximized.
• Variance of the projected data:
$$Var = \frac{1}{N}\sum_{i=1}^{N} \|z_i - \bar{z}\|^2$$
• $\bar{z}$ denotes the mean of $\{z_i\}$.
32
Derivation of PCA - Optional
2. The variance of the projected
data is maximized.
• For zero-mean data $\{x_i\}_{i=1}^{N}$, we have $\bar{z} = 0$, then
$$Var = \frac{1}{N}\sum_{i=1}^{N} \|z_i\|^2 = \frac{1}{N}\sum_{i=1}^{N} (v^T x_i)^2$$
33
Derivation of PCA - Optional
• Mathematical Problem Formulation:
• PCA is to maximize the projected data variance.
• Consider a set of zero-mean $\{x_i\}_{i=1}^{N}$,
$$Var = \frac{1}{N}\sum_{i=1}^{N} \|z_i\|^2 = \frac{1}{N}\sum_{i=1}^{N} (v^T x_i)^2$$
• Note that this variance is different from the variance & bias in the overfitting analysis.
34
Derivation of PCA - Optional
• Mathematical Problem Formulation:
• PCA is to maximize the projected data variance.
• Consider a set of zero-mean $\{x_i\}_{i=1}^{N}$; the first unit weight vector $v$ satisfies:
$$\max_{\|v\|_2 = 1} \frac{1}{N}\sum_{i=1}^{N} (v^T x_i)^2$$
35
Derivation of PCA - Optional
• Recap: represent a set of data vectors $\{x_i\}_{i=1}^{N}$ as a data matrix $X \in \mathbb{R}^{N \times d}$, whose $i$-th row is $x_i^T$:
$$X = \begin{bmatrix} x_1^T \\ \vdots \\ x_i^T \\ \vdots \\ x_N^T \end{bmatrix}$$
($N$: number of data points, $d$: feature dimension)
36
Derivation of PCA - Optional
• Mathematical Problem Formulation:
• PCA is to maximize the projected data variance.
• Consider a set of zero-mean $\{x_i\}_{i=1}^{N}$; the first unit weight vector $v$ satisfies:
$$\max_{\|v\|_2 = 1} \frac{1}{N}\sum_{i=1}^{N} (v^T x_i)^2 = \max_{\|v\|_2 = 1} \frac{1}{N}\, v^T X^T X v$$
37
Derivation of PCA - Optional
• Mathematical Problem Formulation:
• PCA is to maximize the projected data variance.
• Consider a set of zero-mean $\{x_i\}_{i=1}^{N}$; the first unit weight vector $v$ satisfies:
$$\max_{\|v\|_2 = 1} \frac{1}{N}\sum_{i=1}^{N} (v^T x_i)^2 = \max_{\|v\|_2 = 1} \frac{1}{N}\, v^T X^T X v$$
• Here $\frac{1}{N}\sum_{i=1}^{N} (v^T x_i)^2 = \frac{1}{N} v^T X^T X v$ is the variance of the projected data.
• The optimal $v^*$ is the first unit weight vector for PCA.
• The first Principal Component: $z_i = (v^{*T} x_i)\,v^*$
38
Derivation of PCA - Optional
• Mathematical Problem Formulation:
• The optimal $v^*$ is the first unit weight vector for PCA:
$$\max_{\|v\|_2 = 1} \frac{1}{N}\sum_{i=1}^{N} (v^T x_i)^2 = \max_{\|v\|_2 = 1} \frac{1}{N}\, v^T X^T X v$$
• You can calculate the other $v$ and $z_i$ in an incremental fashion:
  • Substitute $x_i$ with the residual $(x_i - z_i)$.
  • Calculate the next unit weight vector and principal component.
• How to obtain the optimal $v$ here?
39
Derivation of PCA - Optional
• Algorithm:
• We need to use Singular Value Decomposition (SVD):
40
Derivation of PCA - Optional
• Algorithm:
1. Apply SVD to the data matrix $X$: $\mathrm{svd}(X) = U \begin{bmatrix} \Sigma \\ \mathbf{0} \end{bmatrix} V^T$.
2. $V = [v_1, v_2, \ldots, v_k, \ldots, v_d]$, where $v_k$ is the k-th unit weight vector for PCA.
3. The k-th Principal Component of $x_i$: $(v_k^T x_i)\,v_k$
4. Project onto the k-dimensional subspace $V_k = [v_1, v_2, \ldots, v_k]$:
$$V_k^T x_i = \begin{bmatrix} v_1^T x_i \\ \vdots \\ v_k^T x_i \end{bmatrix}$$
(A minimal NumPy sketch of these steps follows below.)
41
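A minimal NumPy sketch of steps 1–4 above, assuming the data matrix has already been made zero-mean; note that np.linalg.svd returns $V^T$, so its rows are the unit weight vectors.

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X onto the first k PCA directions.

    X : (N, d) data matrix, assumed zero-mean along each column.
    Returns (Z, V_k): the (N, k) codes V_k^T x_i and the (d, k) basis.
    """
    U, S, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(S) V^T
    V_k = Vt[:k].T                                     # first k unit weight vectors
    Z = X @ V_k                                        # k-dimensional projections
    return Z, V_k

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
X -= X.mean(axis=0)                                    # center the data first
Z, V_k = pca_project(X, k=2)
print(Z.shape, V_k.shape)                              # (500, 2) (10, 2)
```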
Derivation of PCA - Example
• Given data points $x_1 = (4, 6, 10)$, $x_2 = (3, 10, 13)$, $x_3 = (-2, -6, -8)$.
• Calculate the first unit weight vector $v$, and the first Principal Component.
1. Data Matrix: $X = \begin{bmatrix} 4 & 6 & 10 \\ 3 & 10 & 13 \\ -2 & -6 & -8 \end{bmatrix}$
2. SVD: $X = U \Sigma V^T$
3. $V = \begin{bmatrix} -0.22 & 0.78 & -0.58 \\ -0.57 & -0.59 & -0.58 \\ -0.79 & 0.20 & 0.58 \end{bmatrix}$, so $v_1 = \begin{bmatrix} -0.22 \\ -0.57 \\ -0.79 \end{bmatrix}$ and $z_{i,k} = (v_k^T x_i)\,v_k$.
(A numeric check of this example follows below.)
42
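The example can be checked numerically. As on the slide, the SVD is applied to $X$ directly (no centering); singular vectors are only defined up to sign, so NumPy may return $-v_1$.

```python
import numpy as np

X = np.array([[4.0, 6.0, 10.0],
              [3.0, 10.0, 13.0],
              [-2.0, -6.0, -8.0]])

U, S, Vt = np.linalg.svd(X)
v1 = Vt[0]                       # first unit weight vector (sign may be flipped)
print(np.round(v1, 2))           # approximately [-0.22 -0.57 -0.79], up to sign

# First Principal Component of each x_i: (v1^T x_i) v1, one row per data point
Z1 = np.outer(X @ v1, v1)
print(np.round(Z1, 2))
```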
PCA Example 1
• Representation of handwritten digits in the MNIST dataset:
[Figure: sample digits, showing variance with rotations, scales, etc.]
43
PCA Example 1
• Representation of handwritten digits in the MNIST dataset:
• PCA provides more robust and invariant features.
[Figure: reconstructions of a digit x using k = 2, 10, 50, and 200 principal components.]
44
PCA Example 2
• Image Compression
Original Image
• Divide the original 372x492 image into patches:
• Each patch is an instance that contains 12x12 pixels on a grid
• View each patch as a 144-D vector (see the sketch below)
45
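A rough sketch of this patch-based compression in NumPy; a synthetic array stands in for the 372x492 image, so only the mechanics (144-D patches, a k-dimensional code per patch, reconstruction) are meaningful here.

```python
import numpy as np

# Synthetic grayscale image standing in for the 372x492 example in the slides.
rng = np.random.default_rng(0)
img = rng.normal(size=(372, 492))

p, k = 12, 60                                  # patch size and compressed dimension
H, W = (img.shape[0] // p) * p, (img.shape[1] // p) * p
patches = (img[:H, :W]
           .reshape(H // p, p, W // p, p)
           .swapaxes(1, 2)
           .reshape(-1, p * p))                # each row is a 144-D patch vector

mean = patches.mean(axis=0)
U, S, Vt = np.linalg.svd(patches - mean, full_matrices=False)
V_k = Vt[:k].T                                 # 144 x 60 PCA basis

codes = (patches - mean) @ V_k                 # 60 numbers per patch instead of 144
recon = codes @ V_k.T + mean                   # approximate patches from the codes
print(patches.shape, codes.shape,
      np.mean((patches - recon) ** 2).round(4))
```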
PCA Example 2
• Image Compression
PCA compression: 144D → 60D
46
PCA Example 2
• Image Compression
PCA compression: 144D → 16D
47
PCA Example 2
• Image Compression
16 most important eigenvectors
48
PCA Example 2
• Image Compression
PCA compression: 144D → 6D
49
PCA Example 2
• Image Compression
PCA compression: 144D → 1D
50
Carry-on Questions
• Why do we need dimensionality reduction?
• To remove redundancy, prevent overfitting, remove noise, etc.
• What are two objectives achieved by PCA?
• Minimizing MSE and Maximizing Projected Variance
• How to calculate PCA?
• Apply SVD to the data matrix. Select the unit weight vectors $v_k$ from the $V$ matrix, and calculate $(v_k^T x_i)\,v_k$.
51
Bayesian Inference
WEN Bihan (Asst Prof)
Homepage: https://personal.ntu.edu.sg/bihan.wen/
52
Outline
• Probability and Conditional Probability
• Bayes’ Theorem
• Naïve Bayes
• Examples
53
Carry-on Questions
• What is Bayes’ Theorem?
• What is the Naïve Bayes assumption?
54
From deterministic to probabilistic learning
• Training a deep neural network for classification:
• Deterministic model: always gives you the same output for the same input.
55
From deterministic to probabilistic learning
• Training a deep neural network for classification:
• Deterministic inference: gives you the same output for the same input.
• Bayesian Inference
• Tossing a coin twice, what outcomes will we get?
• There are 4 possible outcomes of this experiment
𝑆 = {𝐻𝐻, 𝐻𝑇, 𝑇𝐻, 𝑇𝑇}
• Probabilistic model: 𝑃𝑟𝑜𝑏 𝐻𝐻 = 0.25
56
Basic Concepts in Probability Theory
• Concepts and Notations:
• h = A hypothesis or an event (e.g., tossing a coin).
57
Basic Concepts in Probability Theory
• Concepts and Notations:
• h = A hypothesis or an event.
• D = A collection of data (e.g., training data).
Example: we tossed the coin 8 times and got 4 heads and then 4 tails: {H,H,H,H,T,T,T,T}.
58
Basic Concepts in Probability Theory
• Concepts and Notations:
• h = A hypothesis or an event.
• D = A collection of data (e.g., training data).
• P(h) = Probability that Hypothesis h holds.
• P(D) = Probability of observing the training data D.
P(h) = probability of getting heads
59
Basic Concepts in Probability Theory
• Concepts and Notations:
• h = A hypothesis or an event.
• D = A collection of data (e.g., training data).
• P(h) = Probability that Hypothesis h holds.
• P(D) = Probability of observing the training data D.
• P(D|h) = Probability of observing D when h holds. (read
“probability of D given h”, “probability of D conditioned on h” ).
60
Basic Concepts in Probability Theory
• We are interested in P(h|D), because
• We always learn from history / knowledge.
• We normally have access to the training data D.
• We need P(h|D) to find the most probable hypothesis given that data.
61
Basic Concepts in Probability Theory
• Concepts and Notations:
• h = A hypothesis or an event.
• D = A collection of data (e.g., training data).
• P(h) = Probability that Hypothesis h holds.
• P(D) = Probability of observing the training data D.
• P(D|h) = Probability of observing D when h holds (read "probability of D given h").
• Can we calculate P(h|D)?
62
Joint and Conditional Probability
• Conditional Probability $P(A \mid B)$:
The probability that A happens given that B happened.
• Joint probability $P(A, B)$:
The probability that A and B both happen.
• $P(A, B) = P(A \mid B) \cdot P(B)$ (a small numeric check follows below)
63
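A small numeric check of $P(A, B) = P(A \mid B)\,P(B)$ on the two-coin-flip sample space from the earlier slide; the events A and B are chosen just for illustration.

```python
from fractions import Fraction

# Sample space of two fair coin flips, as on the earlier slide.
S = ["HH", "HT", "TH", "TT"]
P = Fraction(1, 4)                      # each outcome is equally likely

A = {"HT", "TH", "TT"}                  # A: at least one flip is tails
B = {"HH", "HT"}                        # B: the first flip is heads

P_B = len(B) * P
P_AB = len(A & B) * P                   # joint probability P(A, B)
P_A_given_B = P_AB / P_B                # conditional probability P(A | B)

print(P_AB, P_A_given_B, P_A_given_B * P_B == P_AB)   # 1/4, 1/2, True
```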
Bayes’ Theorem
• The simple form of Bayes’ Theorem:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
• How to derive the theorem?
• Use joint probability: $P(A, B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$.
• Therefore:
$$P(A \mid B) = \frac{P(A, B)}{P(B)} = \frac{P(B \mid A)\,P(A)}{P(B)}$$
64
Bayes’ Theorem
• Given that P(B) is a constant, the proportional form: $P(A \mid B) \propto P(B \mid A)\,P(A)$.
• Sometimes P(B) can also be calculated as $P(B) = \sum_i P(B \mid A_i)\,P(A_i)$, where the events $A_i$ partition the sample space.
• This is based on the law of total probability. (A small numeric sketch follows below.)
65
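A tiny sketch of the two formulas above with made-up numbers (a binary hypothesis A and an observation B).

```python
# Hypothetical numbers: an observation B seen under two hypotheses A and not-A.
P_A = 0.3                       # prior P(A); P(not A) = 0.7
P_B_given_A = 0.8               # likelihood P(B | A)
P_B_given_notA = 0.1            # likelihood P(B | not A)

# Law of total probability: P(B) = P(B|A) P(A) + P(B|not A) P(not A)
P_B = P_B_given_A * P_A + P_B_given_notA * (1 - P_A)

# Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B)
P_A_given_B = P_B_given_A * P_A / P_B
print(round(P_B, 3), round(P_A_given_B, 3))   # 0.31 and about 0.774
```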
Bayes’ Theorem
• Now, we can predict $P(h \mid D)$:
$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$
• Some terminology we are going to use:
• Prior probability $P(h)$: prior knowledge of $h$ before observing $D$.
• Posterior $P(h \mid D)$: the probability of $h$ after we have observed $D$.
• Likelihood $P(D \mid h)$: likelihood of observing $D$ given $h$.
66
Bayes’ Theorem
• Maximum A Posteriori (MAP) hypothesis:
$$h_{MAP} = \arg\max_h P(h \mid D)$$
• We just learned that $P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$. Given $D$, $P(D)$ is a constant, so
$$h_{MAP} = \arg\max_h P(D \mid h)\,P(h)$$
• If $P(h)$ is constant, MAP is equivalent to Maximum Likelihood (ML):
$$h_{ML} = \arg\max_h P(D \mid h)$$
(A small sketch contrasting MAP and ML follows below.)
67
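A minimal sketch contrasting MAP and ML, with a hypothetical set of coin-bias hypotheses, a made-up prior, and a made-up flip sequence.

```python
# Hypothetical hypotheses about a coin's probability of heads, with a prior over them.
hypotheses = {0.3: 0.2, 0.5: 0.6, 0.7: 0.2}        # h -> P(h)
D = "HHHTHHTH"                                      # observed flips (made up)

def likelihood(h, data):
    """P(D | h) for independent flips with P(heads) = h."""
    p = 1.0
    for flip in data:
        p *= h if flip == "H" else (1 - h)
    return p

h_map = max(hypotheses, key=lambda h: likelihood(h, D) * hypotheses[h])
h_ml = max(hypotheses, key=lambda h: likelihood(h, D))
print(h_map, h_ml)    # MAP weighs the prior as well; ML uses the likelihood alone
```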
Example Questions
Quiz 1: Flipping a coin
• Suppose we are flipping a fair coin twice.
What is the probability that both flips are heads?
68
Example Questions
• Quiz 2: Flipping a coin
• Suppose we are flipping a fair coin twice.
Given that the outcome of the first flip is heads, what is the
probability that both flips are heads?
69
Example Questions
• Quiz 3: Our IE4483 is attended by students from both EEE and IEM.
Only 50% of the IEM students and 30% of the EEE students pass the
exam. Given that 60% of the entire class are EEE students, what is the
percentage of IEM students amongst those who pass the exam?
70
Example Questions
• Quiz 4: Monty Hall Problem
• You’re a contestant on a game show. You see three closed doors, and
behind one of them is a prize. You choose one door, and the host opens
one of the other doors and reveals that there is no prize behind it. Then
he offers you a chance to switch to the remaining door. Should you take
it?
http://en.wikipedia.org/wiki/Monty_Hall_problem
71
Naïve Bayes
• The basic form of Bayes’ Theorem:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
• It is straightforward if A and B are both single attributes.
• But what if we have multiple conditions or query events?
• How to represent their conditional probabilities?
72
Naïve Bayes
• Extend from the simple example to large training datasets.
• Single attribute → multiple attributes.
• How does $P(a_1, a_2 \mid v_j)$ relate to $P(a_1 \mid v_j)$ and $P(a_2 \mid v_j)$?
• Naïve Bayes Assumption:
$$P(a_1, a_2 \mid v_j) = P(a_1 \mid v_j)\,P(a_2 \mid v_j)$$
73
Naïve Bayes
• Naïve Bayes assumption:
• The conditional independence assumption
• The feature values are conditionally independent of one another, given the class.
• Mathematical Form:
$$P(a_1, \ldots, a_d \mid v_j) = P(a_1 \mid v_j) \cdots P(a_d \mid v_j) = \prod_{i=1}^{d} P(a_i \mid v_j)$$
74
Naïve Bayes - Example
• Play Tennis:
• New Observation: <Sunny,Cool,High,Strong>, do you play tennis?
75
Naïve Bayes - Example
• Play Tennis:
• Compare $P(yes \mid <S,C,H,S>)$ and $P(no \mid <S,C,H,S>)$.
• According to Bayes' Theorem, this amounts to comparing $P(yes, <S,C,H,S>)$ and $P(no, <S,C,H,S>)$.
• $P(yes, <S,C,H,S>) = P(yes)\,P(<S,C,H,S> \mid yes)$
Legend: S = Sunny (Outlook), C = Cool (Temp.), H = High (Humidity), S = Strong (Wind)
76
Bayes’ Theorem
• Conditional probability to joint probability:
$$P(A, B) = P(A \mid B) \cdot P(B)$$
$$P(yes, <S,C,H,S>) = P(yes \mid <S,C,H,S>) \cdot P(<S,C,H,S>)$$
$$P(no, <S,C,H,S>) = P(no \mid <S,C,H,S>) \cdot P(<S,C,H,S>)$$
• $P(<S,C,H,S>)$ is a constant, i.e., the same for both choices.
• Let us denote $P(<S,C,H,S>)$ as $P(Obser.)$.
77
Bayes’ Theorem
• Bayes’ Theorem:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
$$P(yes \mid Obser.) = \frac{P(Obser. \mid yes)\,P(yes)}{P(Obser.)}$$
• Again, $P(Obser.)$ is a constant, i.e., the same for both choices.
• Just compare $P(Obser. \mid yes)\,P(yes)$ and $P(Obser. \mid no)\,P(no)$.
78
Naïve Bayes - Example
• Play Tennis:
• Compare $P(yes \mid <S,C,H,S>)$ and $P(no \mid <S,C,H,S>)$.
• According to Bayes' Theorem, this amounts to comparing $P(yes, <S,C,H,S>)$ and $P(no, <S,C,H,S>)$.
• $P(yes, <S,C,H,S>) = P(yes)\,P(<S,C,H,S> \mid yes)$
• Based on the Naïve Bayes assumption, we have
$$P(<S,C,H,S> \mid yes) = P(S \mid yes)\,P(C \mid yes)\,P(H \mid yes)\,P(S \mid yes) = \frac{2}{9} \times \frac{3}{9} \times \frac{3}{9} \times \frac{3}{9} = 0.00823$$
• By simply counting, $P(yes) = \#yes / \#total = 9/14$.
• Thus, $P(yes, <S,C,H,S>) = \frac{9}{14} \times 0.00823 = \mathbf{0.0053}$
Legend: S = Sunny (Outlook), C = Cool (Temp.), H = High (Humidity), S = Strong (Wind)
79
Naïve Bayes - Example
• Play Tennis:
• $P(yes, <S,C,H,S>) = \mathbf{0.0053}$
• Similarly, $P(no, <S,C,H,S>) = P(no)\,P(<S,C,H,S> \mid no)$
$$= P(no)\,P(S \mid no)\,P(C \mid no)\,P(H \mid no)\,P(S \mid no) = \frac{5}{14} \times 0.6 \times 0.2 \times 0.8 \times 0.6 = \mathbf{0.0206}$$
• Do you play tennis?
• $P(yes, <S,C,H,S>) < P(no, <S,C,H,S>)$
• Given the observation <Sunny,Cool,High,Strong>, it is more likely that you will NOT play tennis. (A short counting sketch follows below.)
80
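The counting above, written out as a short sketch; the class priors and conditional fractions are taken from the classic Play Tennis table used in the slides (9 "yes" days and 5 "no" days).

```python
from fractions import Fraction as F

# Class priors and conditional probabilities counted from the Play Tennis table;
# the fractions match the values used on the slides.
P_yes, P_no = F(9, 14), F(5, 14)
cond_yes = {"Sunny": F(2, 9), "Cool": F(3, 9), "High": F(3, 9), "Strong": F(3, 9)}
cond_no  = {"Sunny": F(3, 5), "Cool": F(1, 5), "High": F(4, 5), "Strong": F(3, 5)}

obs = ["Sunny", "Cool", "High", "Strong"]

def joint(prior, cond):
    """Naive Bayes: P(class) times the product of P(attribute | class)."""
    p = prior
    for a in obs:
        p *= cond[a]
    return p

score_yes = joint(P_yes, cond_yes)
score_no = joint(P_no, cond_no)
print(float(score_yes), float(score_no))   # about 0.0053 vs 0.0206
print("Play tennis?", "yes" if score_yes > score_no else "no")
```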
Carry-on Questions
• What is Bayes’ Theorem?
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
• What is the Naïve Bayes assumption?
$$P(a_1, \ldots, a_d \mid v_j) = \prod_{i=1}^{d} P(a_i \mid v_j)$$
81
What we have learned
• Probability and Conditional Probability
• Hypothesis, joint / conditional probability, etc.
• Bayes’ Theorem
• Various forms of Bayes’ Theorem, prior, likelihood, Posterior, etc.
• Naïve Bayes
• Multi-attributes, Naïve Bayes assumption, examples.
82