Artificial Intelligence & Data Mining
Week 12
WEN Bihan (Asst Prof)
Homepage: https://personal.ntu.edu.sg/bihan.wen/
1
Recap: Bias and Variance
• Model Complexity Analysis:
• Underfitting: high bias and low variance
• Overfitting: low bias and high variance
2
Recap: Overfitting and Underfitting
• Simple Model
• High Bias
• Causes an algorithm to miss relevant relations between the input features and the target outputs.
• Complex Model
• High Variance
• Causes an algorithm to model the noise in the training set.
3
Recap: Overfitting and Underfitting
1. High complexity -> large gap between training and validation error -> overfitting
2. Low complexity -> small gap -> underfitting
4
Overfitting and Underfitting - Examples
• Suppose that you are training a NN for classification. When you
reduce the model complexity by removing a layer, your validation
accuracy increases.
• How does the model bias and variance change?
• Does your model suffer from overfitting or underfitting?
• Now, if you make your NN even deeper (higher complexity)
• Will your training accuracy increase?
• Will your validation accuracy increase?
5
Regularization to prevent overfitting
• Solutions, in the context of learning neural networks:
1. Limit the model complexity by reducing the model expressiveness.
2. Increase the training data complexity / size, to reduce the variance.
3. Simplify data distribution and dimensionality.
• Dimensionality Reduction
[Figure: dimensionality reduction with k = 200, k = 50, and k = 2]
6
Dimensionality Reduction
WEN Bihan (Asst Prof)
Homepage: https://personal.ntu.edu.sg/bihan.wen/
7
Outline
• Concept of Dimensionality Reduction
• Principal Component Analysis (PCA)
• How to derive PCA (not examined)
• Examples
8
Carry-on Questions
• Why do we need dimensionality reduction?
• What are two objectives achieved by PCA?
• How to calculate PCA (not examined)?
9
Unsupervised Learning
• Clustering
• Exploit similar structures amongst the data themselves.
• One way to summarize various data points with a single categorical variable,
i.e., the cluster centroid.
10
Unsupervised Learning
• Clustering
• Exploit similar structures amongst the data themselves.
• One way to summarize various data points with a single categorical variable,
i.e., the cluster centroid.
• Dimensionality Reduction
• Another way to simplify the complex and high-dimensional data.
• Summarize data with a lower dimensional real valued vector.
11
Dimensionality Reduction
• Goal: find the best-fitting low-dimensional subspace to represent $\{x_i\}_{i=1}^{N}$.
• Fit 2-d data with a 1-d line.
• Fit 3-d data with a 2-d plane.
12
Dimensionality Reduction
• Goal: find the best-fitting low-dimensional subspace to represent $\{x_i\}_{i=1}^{N}$.
• Given data points in d dimensions
• Convert them to data points in r<d dimensions
• With minimal loss of information (a minimal sketch follows below)
13
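To make the d-to-r mapping above concrete before PCA is introduced, here is a minimal NumPy sketch in Python. The 2-D basis here is an arbitrary orthonormal one chosen only for illustration, not the best-fitting subspace that PCA would return.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # 100 data points in d = 3 dimensions

# An arbitrary orthonormal basis for an r = 2 subspace (columns of V_r).
# PCA would instead pick the basis that preserves the most variance.
V_r, _ = np.linalg.qr(rng.normal(size=(3, 2)))

Z = X @ V_r                            # r-dimensional codes z_i = V_r^T x_i
X_hat = Z @ V_r.T                      # back-projection into the original space

mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print(Z.shape, round(mse, 3))          # (100, 2) and the information lost
```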
Example: Reduce Data from 2D to 1D
[Figure: 2-D data points with one axis in cm and the other in inches, lying close to a line.]
14
Example: Reduce Data from 2D to 1D
Reduce data from 2D to 1D
[Figure: the 2-D data (axes in cm and inches) projected onto a 1-D line.]
15
Example: Reduce Data from 2D to 1D
Reduce data from 2D to 1D
[Figure: the projected data, now described by a single coordinate along the line.]
16
Example: Reduce Data from 3D to 2D
• 3D data points $(x_1, x_2, x_3)$: approximated by a 2D plane.
• 2D data points $(z_1, z_2)$: projected onto the plane.
17
Why Dimensionality Reduction?
• What is wrong with high dimensions?
• High Dimensions = Lots of Features
• Complex system to process
• Inefficient algorithm
• Overfitting to noise or other data corruptions.
18
Dimensionality Reduction
• Why do we need dimensionality reduction?
• Remove Feature Redundancy + Noise
• Prevent Overfitting
• Reduced Model Complexity
• Simplified Data distribution
• Better Visualization
19
Principal Component Analysis
• Principal Component Analysis (PCA)
• Simple and popular method.
• Unsupervised learning.
• To learn the “best” low-dimensional subspace for
data projection.
• What does the “best” mean here?
20
Principal Component Analysis
• Two ways to interpret the “Best”:
1. The mean square error (MSE) of the projected data is
minimized.
2. The variance of the projected data is maximized.
• These two objectives are achieved simultaneously by
applying PCA for dimensionality reduction.
• Why and How?
21
Principal Component Analysis
1. The mean square error (MSE) of the projected data
is minimized.
• $x_i$ ( black ): the original data.
• $v$ ( red ): the PCA subspace.
• $(v^T x_i)\,v$ ( blue ): the projected data.
• Minimize the MSE ( green ), i.e., the perpendicular offset, over all $\{x_i\}$.
22
Principal Component Analysis
2. The variance of the projected data is maximized.
• $(v^T x_i)\,v$ ( blue ): the projected data.
• The variance is the average squared magnitude of the blue projections.
• PCA chooses $v$ so that this blue variance is maximized.
23
Principal Component Analysis
• Minimizing MSE <=> Maximizing Projected Variance
• blue² + green² = black²
• black is fixed (given the data)
• So maximizing blue (the projected variance) is equivalent to minimizing green (the MSE). A short derivation sketch follows below.
24
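A one-line justification of the slide above, assuming a unit vector $v$ and zero-mean data, with the slide's color labels:

$$\|x_i\|^2 = \underbrace{\|(v^T x_i)\,v\|^2}_{\text{blue}^2} + \underbrace{\|x_i - (v^T x_i)\,v\|^2}_{\text{green}^2}$$

$$\frac{1}{N}\sum_{i=1}^{N} \|x_i\|^2 = \underbrace{\frac{1}{N}\sum_{i=1}^{N} (v^T x_i)^2}_{\text{projected variance}} + \underbrace{\frac{1}{N}\sum_{i=1}^{N} \|x_i - (v^T x_i)\,v\|^2}_{\text{MSE}}$$

The left-hand side is fixed by the data, so maximizing the projected-variance term is the same as minimizing the MSE term.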
Example: PCA
• Which of the following 𝒗 generates higher
projected variance?
25
Example: PCA
• Which of the following 𝒗 generates higher
projected variance?
• Find the projected mean $u$.
• Calculate the distance $d_i$ from each projected point to the mean.
• The variance of the projected data:
$$\mathrm{var} = \frac{1}{N}\sum_{i=1}^{N} d_i^2$$
(See the sketch below for this recipe in code.)
26
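A minimal sketch of the recipe on this slide in Python; the elongated point cloud and the two candidate directions are made up for illustration.

```python
import numpy as np

def projected_variance(X, v):
    """Variance of the data projected onto the unit vector v."""
    v = v / np.linalg.norm(v)
    z = X @ v                      # scalar projections v^T x_i
    d = z - z.mean()               # distance of each projection to the projected mean
    return np.mean(d ** 2)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [0.0, 0.5]])  # elongated cloud

v_a = np.array([1.0, 0.0])         # roughly along the elongated direction
v_b = np.array([0.0, 1.0])         # across it
print(projected_variance(X, v_a), projected_variance(X, v_b))
# The direction aligned with the spread gives the larger projected variance.
```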
Example: PCA
• The subspace $v$ on the left generates the higher projected variance.
[Figure: left — larger projected variance; right — smaller projected variance.]
27
Example: PCA
• Which of the following 𝒗 generates smaller
projected MSE?
28
Example: PCA
• The subspace $v$ on the left generates the smaller projected MSE, i.e., the smaller perpendicular offsets.
[Figure: the two candidate subspaces and their perpendicular offsets.]
29
Derivation of PCA - Optional
1. The mean square error (MSE) of
the projected data is minimized.
• For all $\{x_i\}_{i=1}^{N}$, the average green² (squared perpendicular offset) is minimized.
• Find the best red unit vector $v$ that minimizes the green².
• The principal component: $z_i = (v^T x_i)\,v$
30
Derivation of PCA - Optional
1. The mean square error (MSE) of
the projected data is minimized.
• The principal component: $z_i = (v^T x_i)\,v$
• The MSE to be minimized:
$$MSE = \frac{1}{N}\sum_{i=1}^{N} \|z_i - x_i\|^2$$
31
Derivation of PCA - Optional
2. The variance of the projected
data is maximized.
• Variance of the projected data:
$$Var = \frac{1}{N}\sum_{i=1}^{N} \|z_i - \bar{z}\|^2$$
• $\bar{z}$ denotes the mean of $\{z_i\}$.
32
Derivation of PCA - Optional
2. The variance of the projected
data is maximized.
• For zero-mean data $\{x_i\}_{i=1}^{N}$, we have $\bar{z} = 0$, then
$$Var = \frac{1}{N}\sum_{i=1}^{N} \|z_i\|^2 = \frac{1}{N}\sum_{i=1}^{N} (v^T x_i)^2$$
33
Derivation of PCA - Optional
• Mathematical Problem Formulation:
• PCA is to maximize the projected data variance.
• Consider a set of zero-mean $\{x_i\}_{i=1}^{N}$,
$$Var = \frac{1}{N}\sum_{i=1}^{N} \|z_i\|^2 = \frac{1}{N}\sum_{i=1}^{N} (v^T x_i)^2$$
• Note that this variance is different from the variance & bias in the overfitting analysis.
34
Derivation of PCA - Optional
• Mathematical Problem Formulation:
• PCA is to maximize the projected data variance.
• Consider a set of zero-mean $\{x_i\}_{i=1}^{N}$; the first unit weight vector $v$ satisfies:
$$\max_{\|v\|_2 = 1} \frac{1}{N}\sum_{i=1}^{N} (v^T x_i)^2$$
35
Derivation of PCA - Optional
• Recap: represent a set of data vectors $\{x_i\}_{i=1}^{N}$ as a data matrix $X \in \mathbb{R}^{N \times d}$, whose $i$-th row is $x_i^T$:
$$X = \begin{bmatrix} x_1^T \\ \vdots \\ x_i^T \\ \vdots \\ x_N^T \end{bmatrix}$$
($N$: number of data points, $d$: feature dimension)
36
Derivation of PCA - Optional
• Mathematical Problem Formulation:
• PCA is to maximize the projected data variance.
• Consider a set of zero-mean $\{x_i\}_{i=1}^{N}$; the first unit weight vector $v$ satisfies:
$$\max_{\|v\|_2 = 1} \frac{1}{N}\sum_{i=1}^{N} (v^T x_i)^2 = \max_{\|v\|_2 = 1} \frac{1}{N}\, v^T X^T X v$$
37
Derivation of PCA - Optional
• Mathematical Problem Formulation:
• PCA is to maximize the projected data variance.
• Consider a set of zero-mean $\{x_i\}_{i=1}^{N}$; the first unit weight vector $v$ satisfies:
$$\max_{\|v\|_2 = 1} \frac{1}{N}\sum_{i=1}^{N} (v^T x_i)^2 = \max_{\|v\|_2 = 1} \frac{1}{N}\, v^T X^T X v$$
• Here $\frac{1}{N}\sum_{i=1}^{N} (v^T x_i)^2 = \frac{1}{N} v^T X^T X v$ is the variance of the projected data.
• The optimal $v^*$ is the first unit weight vector for PCA.
• The first Principal Component: $z_i = (v^{*T} x_i)\,v^*$
38
Derivation of PCA - Optional
• Mathematical Problem Formulation:
• The optimal $v^*$ is the first unit weight vector for PCA:
$$\max_{\|v\|_2 = 1} \frac{1}{N}\sum_{i=1}^{N} (v^T x_i)^2 = \max_{\|v\|_2 = 1} \frac{1}{N}\, v^T X^T X v$$
• You can calculate the other $v$ and $z_i$ in an incremental fashion:
  • Substitute $x_i$ with the residual $(x_i - z_i)$.
  • Calculate the next unit weight vector and principal component.
• How to obtain the optimal $v$ here?
39
Derivation of PCA - Optional
• Algorithm:
• We need to use Singular Value Decomposition (SVD):
40
Derivation of PCA - Optional
• Algorithm:
1. Apply SVD to the data matrix $X$: $\mathrm{svd}(X) = U \begin{bmatrix} \Sigma \\ \mathbf{0} \end{bmatrix} V^T$.
2. $V = [v_1, v_2, \ldots, v_k, \ldots, v_d]$, where $v_k$ is the k-th unit weight vector for PCA.
3. The k-th Principal Component of $x_i$: $(v_k^T x_i)\,v_k$
4. Project onto the k-dimensional subspace $V_k = [v_1, v_2, \ldots, v_k]$:
$$V_k^T x_i = \begin{bmatrix} v_1^T x_i \\ \vdots \\ v_k^T x_i \end{bmatrix}$$
(A minimal NumPy sketch of these steps follows below.)
41
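A minimal NumPy sketch of steps 1–4 above, assuming the data matrix has already been made zero-mean; note that np.linalg.svd returns $V^T$, so its rows are the unit weight vectors.

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X onto the first k PCA directions.

    X : (N, d) data matrix, assumed zero-mean along each column.
    Returns (Z, V_k): the (N, k) codes V_k^T x_i and the (d, k) basis.
    """
    U, S, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(S) V^T
    V_k = Vt[:k].T                                     # first k unit weight vectors
    Z = X @ V_k                                        # k-dimensional projections
    return Z, V_k

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
X -= X.mean(axis=0)                                    # center the data first
Z, V_k = pca_project(X, k=2)
print(Z.shape, V_k.shape)                              # (500, 2) (10, 2)
```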
Derivation of PCA - Example
• Given data points $x_1 = (4, 6, 10)$, $x_2 = (3, 10, 13)$, $x_3 = (-2, -6, -8)$.
• Calculate the first unit weight vector $v$, and the first Principal Component.
1. Data Matrix: $X = \begin{bmatrix} 4 & 6 & 10 \\ 3 & 10 & 13 \\ -2 & -6 & -8 \end{bmatrix}$
2. SVD: $X = U \Sigma V^T$
3. $V = \begin{bmatrix} -0.22 & 0.78 & -0.58 \\ -0.57 & -0.59 & -0.58 \\ -0.79 & 0.20 & 0.58 \end{bmatrix}$, so $v_1 = \begin{bmatrix} -0.22 \\ -0.57 \\ -0.79 \end{bmatrix}$ and $z_{i,k} = (v_k^T x_i)\,v_k$.
(A numeric check of this example follows below.)
42
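The example can be checked numerically. As on the slide, the SVD is applied to $X$ directly (no centering); singular vectors are only defined up to sign, so NumPy may return $-v_1$.

```python
import numpy as np

X = np.array([[4.0, 6.0, 10.0],
              [3.0, 10.0, 13.0],
              [-2.0, -6.0, -8.0]])

U, S, Vt = np.linalg.svd(X)
v1 = Vt[0]                       # first unit weight vector (sign may be flipped)
print(np.round(v1, 2))           # approximately [-0.22 -0.57 -0.79], up to sign

# First Principal Component of each x_i: (v1^T x_i) v1, one row per data point
Z1 = np.outer(X @ v1, v1)
print(np.round(Z1, 2))
```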
PCA Example 1
• Representation of handwritten digits in the MNIST dataset:
[Figure: sample digits, showing variance with rotations, scales, etc.]
43
PCA Example 1
• Representation of handwritten digits in the MNIST dataset:
• PCA provides more robust and invariant features.
[Figure: reconstructions of a digit x using k = 2, 10, 50, and 200 principal components.]
44
PCA Example 2
• Image Compression
Original Image
• Divide the original 372x492 image into patches:
• Each patch is an instance that contains 12x12 pixels on a grid
• View each patch as a 144-D vector (see the sketch below)
45
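A rough sketch of this patch-based compression in NumPy; a synthetic array stands in for the 372x492 image, so only the mechanics (144-D patches, a k-dimensional code per patch, reconstruction) are meaningful here.

```python
import numpy as np

# Synthetic grayscale image standing in for the 372x492 example in the slides.
rng = np.random.default_rng(0)
img = rng.normal(size=(372, 492))

p, k = 12, 60                                  # patch size and compressed dimension
H, W = (img.shape[0] // p) * p, (img.shape[1] // p) * p
patches = (img[:H, :W]
           .reshape(H // p, p, W // p, p)
           .swapaxes(1, 2)
           .reshape(-1, p * p))                # each row is a 144-D patch vector

mean = patches.mean(axis=0)
U, S, Vt = np.linalg.svd(patches - mean, full_matrices=False)
V_k = Vt[:k].T                                 # 144 x 60 PCA basis

codes = (patches - mean) @ V_k                 # 60 numbers per patch instead of 144
recon = codes @ V_k.T + mean                   # approximate patches from the codes
print(patches.shape, codes.shape,
      np.mean((patches - recon) ** 2).round(4))
```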
PCA Example 2
• Image Compression
PCA compression: 144D → 60D
46
PCA Example 2
• Image Compression
PCA compression: 144D → 16D
47
PCA Example 2
• Image Compression
16 most important eigenvectors
48
PCA Example 2
• Image Compression
PCA compression: 144D → 6D
49
PCA Example 2
• Image Compression
PCA compression: 144D → 1D
50
Carry-on Questions
• Why do we need dimensionality reduction?
• To remove redundancy, prevent overfitting, remove noise, etc.
• What are two objectives achieved by PCA?
• Minimizing MSE and Maximizing Projected Variance
• How to calculate PCA?
• Apply SVD to the data matrix. Select the unit weight vectors $v_k$ from the $V$ matrix, and calculate $(v_k^T x_i)\,v_k$.
51
Bayesian Inference
WEN Bihan (Asst Prof)
Homepage: https://personal.ntu.edu.sg/bihan.wen/
52
Outline
• Probability and Conditional Probability
• Bayes’ Theorem
• Naïve Bayes
• Examples
53
Carry-on Questions
• What is Bayes’ Theorem?
• What is the Naïve Bayes assumption?
54
From deterministic to probabilistic learning
• Training a deep neural network for classification:
• Deterministic model: always gives you the same output for the same input.
55
From deterministic to probabilistic learning
• Training a deep neural network for classification:
• Deterministic inference: gives you the same output for the same input.
• Bayesian Inference
• Tossing a coin twice, what outcomes will we get?
• There are 4 possible outcomes of this experiment
𝑆 = {𝐻𝐻, 𝐻𝑇, 𝑇𝐻, 𝑇𝑇}
• Probabilistic model: 𝑃𝑟𝑜𝑏 𝐻𝐻 = 0.25
56
Basic Concepts in Probability Theory
• Concepts and Notations:
• h = A hypothesis or an event (e.g., tossing a coin).
57
Basic Concepts in Probability Theory
• Concepts and Notations:
• h = A hypothesis or an event.
• D = A collection of data (e.g., training data).
Example: we tossed the coin 8 times and got 4 heads and then 4 tails: {H,H,H,H,T,T,T,T}.
58
Basic Concepts in Probability Theory
• Concepts and Notations:
• h = A hypothesis or an event.
• D = A collection of data (e.g., training data).
• P(h) = Probability that Hypothesis h holds.
• P(D) = Probability of observing the training data D.
P(h) = probability of getting heads
59
Basic Concepts in Probability Theory
• Concepts and Notations:
• h = A hypothesis or an event.
• D = A collection of data (e.g., training data).
• P(h) = Probability that Hypothesis h holds.
• P(D) = Probability of observing the training data D.
• P(D|h) = Probability of observing D when h holds. (read
“probability of D given h”, “probability of D conditioned on h” ).
60
Basic Concepts in Probability Theory
• We are interested in P(h|D), because
• We always learn from history / knowledge.
• We normally have access to the training data D.
• We need P(h|D) to find the most probable hypothesis given that data.
61
Basic Concepts in Probability Theory
• Concepts and Notations:
• h = A hypothesis or an event.
• D = A collection of data (e.g., training data).
• P(h) = Probability that Hypothesis h holds.
• P(D) = Probability of observing the training data D.
• P(D|h) = Probability of observing D when h holds (read "probability of D given h").
• Can we calculate P(h|D)?
62
Joint and Conditional Probability
• Conditional Probability $P(A \mid B)$:
The probability that A happens given that B happened.
• Joint probability $P(A, B)$:
The probability that A and B both happen.
• $P(A, B) = P(A \mid B) \cdot P(B)$ (a small numeric check follows below)
63
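A small numeric check of $P(A, B) = P(A \mid B)\,P(B)$ on the two-coin-flip sample space from the earlier slide; the events A and B are chosen just for illustration.

```python
from fractions import Fraction

# Sample space of two fair coin flips, as on the earlier slide.
S = ["HH", "HT", "TH", "TT"]
P = Fraction(1, 4)                      # each outcome is equally likely

A = {"HT", "TH", "TT"}                  # A: at least one flip is tails
B = {"HH", "HT"}                        # B: the first flip is heads

P_B = len(B) * P
P_AB = len(A & B) * P                   # joint probability P(A, B)
P_A_given_B = P_AB / P_B                # conditional probability P(A | B)

print(P_AB, P_A_given_B, P_A_given_B * P_B == P_AB)   # 1/4, 1/2, True
```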
Bayes’ Theorem
• The simple form of Bayes’ Theorem:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
• How to derive the theorem?
• Use joint probability: $P(A, B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$.
• Therefore:
$$P(A \mid B) = \frac{P(A, B)}{P(B)} = \frac{P(B \mid A)\,P(A)}{P(B)}$$
64
Bayes’ Theorem
• Given that P(B) is a constant, the proportional form: $P(A \mid B) \propto P(B \mid A)\,P(A)$.
• Sometimes P(B) can also be calculated as $P(B) = \sum_i P(B \mid A_i)\,P(A_i)$, where the events $A_i$ partition the sample space.
• This is based on the law of total probability. (A small numeric sketch follows below.)
65
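A tiny sketch of the two formulas above with made-up numbers (a binary hypothesis A and an observation B).

```python
# Hypothetical numbers: an observation B seen under two hypotheses A and not-A.
P_A = 0.3                       # prior P(A); P(not A) = 0.7
P_B_given_A = 0.8               # likelihood P(B | A)
P_B_given_notA = 0.1            # likelihood P(B | not A)

# Law of total probability: P(B) = P(B|A) P(A) + P(B|not A) P(not A)
P_B = P_B_given_A * P_A + P_B_given_notA * (1 - P_A)

# Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B)
P_A_given_B = P_B_given_A * P_A / P_B
print(round(P_B, 3), round(P_A_given_B, 3))   # 0.31 and about 0.774
```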
Bayes’ Theorem
• Now, we can predict $P(h \mid D)$:
$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$
• Some terminology we are going to use:
• Prior probability $P(h)$: prior knowledge of $h$ before observing $D$.
• Posterior $P(h \mid D)$: the probability of $h$ after we have observed $D$.
• Likelihood $P(D \mid h)$: likelihood of observing $D$ given $h$.
66
Bayes’ Theorem
• Maximum A Posteriori (MAP) hypothesis:
$$h_{MAP} = \arg\max_h P(h \mid D)$$
• We just learned that $P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$. Given $D$, $P(D)$ is a constant, so
$$h_{MAP} = \arg\max_h P(D \mid h)\,P(h)$$
• If $P(h)$ is constant, MAP is equivalent to Maximum Likelihood (ML):
$$h_{ML} = \arg\max_h P(D \mid h)$$
(A small sketch contrasting MAP and ML follows below.)
67
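A minimal sketch contrasting MAP and ML, with a hypothetical set of coin-bias hypotheses, a made-up prior, and a made-up flip sequence.

```python
# Hypothetical hypotheses about a coin's probability of heads, with a prior over them.
hypotheses = {0.3: 0.2, 0.5: 0.6, 0.7: 0.2}        # h -> P(h)
D = "HHHTHHTH"                                      # observed flips (made up)

def likelihood(h, data):
    """P(D | h) for independent flips with P(heads) = h."""
    p = 1.0
    for flip in data:
        p *= h if flip == "H" else (1 - h)
    return p

h_map = max(hypotheses, key=lambda h: likelihood(h, D) * hypotheses[h])
h_ml = max(hypotheses, key=lambda h: likelihood(h, D))
print(h_map, h_ml)    # MAP weighs the prior as well; ML uses the likelihood alone
```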
Example Questions
Quiz 1: Flipping a coin
• Suppose we are flipping a fair coin twice.
What is the probability that both flips are heads?
68
Example Questions
• Quiz 2: Flipping a coin
• Suppose we are flipping a fair coin twice.
Given that the outcome of the first flip is heads, what is the
probability that both flips are heads?
69
Example Questions
• Quiz 3: Our IE4483 is attended by students from both EEE and IEM.
Only 50% of the IEM students and 30% of the EEE students pass the
exam. Given that 60% of the entire class are EEE students, what is the
percentage of IEM students amongst those who pass the exam?
70
Example Questions
• Quiz 4: Monty Hall Problem
• You’re a contestant on a game show. You see three closed doors, and
behind one of them is a prize. You choose one door, and the host opens
one of the other doors and reveals that there is no prize behind it. Then
he offers you a chance to switch to the remaining door. Should you take
it?
http://en.wikipedia.org/wiki/Monty_Hall_problem
71
Naïve Bayes
• The basic form of Bayes’ Theorem:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
• It is straightforward if A and B are both single attributes.
• But what if we have multiple conditions or query events?
• How to represent their conditional probabilities?
72
Naïve Bayes
• Extend from the simple example to large training datasets.
• Single attribute → multiple attributes.
• How does $P(a_1, a_2 \mid v_j)$ relate to $P(a_1 \mid v_j)$ and $P(a_2 \mid v_j)$?
• Naïve Bayes Assumption:
$$P(a_1, a_2 \mid v_j) = P(a_1 \mid v_j)\,P(a_2 \mid v_j)$$
73
Naïve Bayes
• Naïve Bayes assumption:
• The conditional independence assumption
• The feature values are conditionally independent of one another, given the class.
• Mathematical Form:
$$P(a_1, \ldots, a_d \mid v_j) = P(a_1 \mid v_j) \cdots P(a_d \mid v_j) = \prod_{i=1}^{d} P(a_i \mid v_j)$$
74
Naïve Bayes - Example
• Play Tennis:
• New Observation: <Sunny,Cool,High,Strong>, do you play tennis?
75
Naïve Bayes - Example
• Play Tennis:
• Compare $P(yes \mid <S,C,H,S>)$ and $P(no \mid <S,C,H,S>)$.
• According to Bayes' Theorem, this amounts to comparing $P(yes, <S,C,H,S>)$ and $P(no, <S,C,H,S>)$.
• $P(yes, <S,C,H,S>) = P(yes)\,P(<S,C,H,S> \mid yes)$
Legend: S = Sunny (Outlook), C = Cool (Temp.), H = High (Humidity), S = Strong (Wind)
76
Bayes’ Theorem
• Conditional probability to joint probability:
$$P(A, B) = P(A \mid B) \cdot P(B)$$
$$P(yes, <S,C,H,S>) = P(yes \mid <S,C,H,S>) \cdot P(<S,C,H,S>)$$
$$P(no, <S,C,H,S>) = P(no \mid <S,C,H,S>) \cdot P(<S,C,H,S>)$$
• $P(<S,C,H,S>)$ is a constant, i.e., the same for both choices.
• Let us denote $P(<S,C,H,S>)$ as $P(Obser.)$.
77
Bayes’ Theorem
• Bayes’ Theorem:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
$$P(yes \mid Obser.) = \frac{P(Obser. \mid yes)\,P(yes)}{P(Obser.)}$$
• Again, $P(Obser.)$ is a constant, i.e., the same for both choices.
• Just compare $P(Obser. \mid yes)\,P(yes)$ and $P(Obser. \mid no)\,P(no)$.
78
Naïve Bayes - Example
• Play Tennis:
• Compare $P(yes \mid <S,C,H,S>)$ and $P(no \mid <S,C,H,S>)$.
• According to Bayes' Theorem, this amounts to comparing $P(yes, <S,C,H,S>)$ and $P(no, <S,C,H,S>)$.
• $P(yes, <S,C,H,S>) = P(yes)\,P(<S,C,H,S> \mid yes)$
• Based on the Naïve Bayes assumption, we have
$$P(<S,C,H,S> \mid yes) = P(S \mid yes)\,P(C \mid yes)\,P(H \mid yes)\,P(S \mid yes) = \frac{2}{9} \times \frac{3}{9} \times \frac{3}{9} \times \frac{3}{9} = 0.00823$$
• By simply counting, $P(yes) = \#yes / \#total = 9/14$.
• Thus, $P(yes, <S,C,H,S>) = \frac{9}{14} \times 0.00823 = \mathbf{0.0053}$
Legend: S = Sunny (Outlook), C = Cool (Temp.), H = High (Humidity), S = Strong (Wind)
79
Naïve Bayes - Example
• Play Tennis:
• $P(yes, <S,C,H,S>) = \mathbf{0.0053}$
• Similarly, $P(no, <S,C,H,S>) = P(no)\,P(<S,C,H,S> \mid no)$
$$= P(no)\,P(S \mid no)\,P(C \mid no)\,P(H \mid no)\,P(S \mid no) = \frac{5}{14} \times 0.6 \times 0.2 \times 0.8 \times 0.6 = \mathbf{0.0206}$$
• Do you play tennis?
• $P(yes, <S,C,H,S>) < P(no, <S,C,H,S>)$
• Given the observation <Sunny,Cool,High,Strong>, it is more likely that you will NOT play tennis. (A short counting sketch follows below.)
80
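The counting above, written out as a short sketch; the class priors and conditional fractions are taken from the classic Play Tennis table used in the slides (9 "yes" days and 5 "no" days).

```python
from fractions import Fraction as F

# Class priors and conditional probabilities counted from the Play Tennis table;
# the fractions match the values used on the slides.
P_yes, P_no = F(9, 14), F(5, 14)
cond_yes = {"Sunny": F(2, 9), "Cool": F(3, 9), "High": F(3, 9), "Strong": F(3, 9)}
cond_no  = {"Sunny": F(3, 5), "Cool": F(1, 5), "High": F(4, 5), "Strong": F(3, 5)}

obs = ["Sunny", "Cool", "High", "Strong"]

def joint(prior, cond):
    """Naive Bayes: P(class) times the product of P(attribute | class)."""
    p = prior
    for a in obs:
        p *= cond[a]
    return p

score_yes = joint(P_yes, cond_yes)
score_no = joint(P_no, cond_no)
print(float(score_yes), float(score_no))   # about 0.0053 vs 0.0206
print("Play tennis?", "yes" if score_yes > score_no else "no")
```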
Carry-on Questions
• What is Bayes’ Theorem?
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
• What is the Naïve Bayes assumption?
$$P(a_1, \ldots, a_d \mid v_j) = \prod_{i=1}^{d} P(a_i \mid v_j)$$
81
What we have learned
• Probability and Conditional Probability
• Hypothesis, joint / conditional probability, etc.
• Bayes’ Theorem
• Various forms of Bayes’ Theorem, prior, likelihood, Posterior, etc.
• Naïve Bayes
• Multi-attributes, Naïve Bayes assumption, examples.
82