Bayesian decision theory
Assume a two-class problem.
Example: an automatic system for quality inspection of a product in an industry.
Acceptance class = w_1, rejection class = w_2.
Based on previous records, the prior probabilities are known:
probability of acceptance = p(w_1), probability of rejection = p(w_2).
We can make a simple decision rule:
If p(w_1) > p(w_2), then decide class w_1.
If p(w_2) > p(w_1), then decide class w_2.
We can also find the probability density function (PDF) of the feature variable x separately for objects belonging to class w_1 and to class w_2:
p(x | w_1) and p(x | w_2)   (the class-conditional PDFs).
Our objective is to calculate:
p(w_1 | x) and p(w_2 | x)   (the posterior probabilities).
Joint probability density function:

p(w_i, x) = p(w_i | x)\, p(x) = p(x | w_i)\, p(w_i)

\Rightarrow p(w_i | x)\, p(x) = p(x | w_i)\, p(w_i)

\Rightarrow p(w_i | x) = \frac{p(x | w_i)\, p(w_i)}{p(x)}   (Bayes rule)

Posterior = (Likelihood × Prior) / Evidence, where the evidence is

p(x) = \sum_{i=1}^{2} p(x | w_i)\, p(w_i)

Decision rule:
If p(w_1 | x) > p(w_2 | x), then decide class w_1.
If p(w_2 | x) > p(w_1 | x), then decide class w_2.
By expanding,
If p(x | w_1)\, p(w_1) > p(x | w_2)\, p(w_2), then decide w_1.
If p(x | w_2)\, p(w_2) > p(x | w_1)\, p(w_1), then decide w_2.
If p(x | w_1)\, p(w_1) = p(x | w_2)\, p(w_2), then the decision is based on the priors p(w_1) and p(w_2) alone.
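As a concrete illustration of this rule, here is a minimal Python sketch assuming Gaussian class-conditional densities; the priors, means, and standard deviations are purely illustrative values, not taken from the example above.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical priors and Gaussian class-conditional densities (illustrative values only)
prior = {"w1": 0.7, "w2": 0.3}                   # p(w1), p(w2)
likelihood = {"w1": norm(loc=5.0, scale=1.0),    # p(x | w1)
              "w2": norm(loc=8.0, scale=1.5)}    # p(x | w2)

def decide(x):
    """Decide w1 if p(x|w1) p(w1) > p(x|w2) p(w2), otherwise decide w2."""
    s1 = likelihood["w1"].pdf(x) * prior["w1"]
    s2 = likelihood["w2"].pdf(x) * prior["w2"]
    return "w1" if s1 > s2 else "w2"

for x in [4.0, 6.5, 9.0]:
    print(x, "->", decide(x))
```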
Error in this case:
[Figure: the posteriors p(w_1 | x) and p(w_2 | x) (obtained from the PDFs of class w_1 and class w_2) plotted against x, with sample points x_1, x_2 and the decision boundary marked; the overlap region is the error.]
If x_1 is assigned to w_2, the error at x_1 is p(w_1 | x_1); if x_2 is assigned to w_1, the error at x_2 is p(w_2 | x_2).
If I decide in favour of class w_1, the probability of error = p(w_2 | x).
If I decide in favour of class w_2, the probability of error = p(w_1 | x).

Total error = \int_{-\infty}^{\infty} p(\text{error}, x)\, dx = \int_{-\infty}^{\infty} p(\text{error} | x)\, p(x)\, dx

where, under the Bayes decision rule,

p(\text{error} | x) = \min\{\, p(w_1 | x),\ p(w_2 | x) \,\}
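To make the total-error integral concrete, the sketch below evaluates it numerically on a grid, again assuming hypothetical Gaussian class-conditional densities and priors.

```python
import numpy as np
from scipy.stats import norm

# Assumed priors and class-conditional densities (illustrative only)
p_w1, p_w2 = 0.5, 0.5
pdf1 = norm(loc=5.0, scale=1.0)   # p(x | w1)
pdf2 = norm(loc=8.0, scale=1.5)   # p(x | w2)

x = np.linspace(-5.0, 20.0, 20001)
evidence = pdf1.pdf(x) * p_w1 + pdf2.pdf(x) * p_w2    # p(x)
post1 = pdf1.pdf(x) * p_w1 / evidence                 # p(w1 | x)
post2 = pdf2.pdf(x) * p_w2 / evidence                 # p(w2 | x)

# p(error | x) = min(p(w1|x), p(w2|x)); total error = integral of p(error|x) p(x) dx
p_error_given_x = np.minimum(post1, post2)
total_error = np.sum(p_error_given_x * evidence) * (x[1] - x[0])
print("Bayes error (numerical):", total_error)
```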
Summary (Bayes rule and decision rule):

p(w_i | x) = \frac{p(x | w_i)\, p(w_i)}{\sum_{i=1}^{2} p(x | w_i)\, p(w_i)} = \frac{p(x | w_i)\, p(w_i)}{p(x)}

If p(w_1 | x) > p(w_2 | x), then decide class w_1.
If p(w_2 | x) > p(w_1 | x), then decide class w_2.
Generalized Bayes classifier
Use more than two states of nature (classes).
Use more than one feature.
Allow more actions to be considered.
Introduce a loss function.

w_1, w_2, ..., w_c are the c classes (states of nature).
Actions: \alpha_1, \alpha_2, ..., \alpha_a.
Loss function: \lambda(\alpha_i | w_j) is the loss incurred for taking action \alpha_i when the true class is w_j.
x is a d-dimensional feature vector.
Risk function (conditional risk / expected loss):

R(\alpha_i | x) = \sum_{j=1}^{c} \lambda(\alpha_i | w_j)\, p(w_j | x)
Minimum-risk classifier:
Two-category case: classes w_1 and w_2, actions \alpha_1 and \alpha_2.
For simplicity, write \lambda(\alpha_i | w_j) = \lambda_{ij}.
In general,

R(\alpha_i | x) = \sum_{j=1}^{c} \lambda_{ij}\, p(w_j | x)
For the two-class problem:
For action \alpha_1:  R(\alpha_1 | x) = \lambda_{11}\, p(w_1 | x) + \lambda_{12}\, p(w_2 | x), where \lambda_{11} = \lambda(\alpha_1 | w_1) and \lambda_{12} = \lambda(\alpha_1 | w_2).
For action \alpha_2:  R(\alpha_2 | x) = \lambda_{21}\, p(w_1 | x) + \lambda_{22}\, p(w_2 | x), where \lambda_{21} = \lambda(\alpha_2 | w_1) and \lambda_{22} = \lambda(\alpha_2 | w_2).
If R(\alpha_1 | x) < R(\alpha_2 | x), decide in favour of \alpha_1.
If R(\alpha_1 | x) > R(\alpha_2 | x), decide in favour of \alpha_2.
Thus, for a decision in favour of w_1 (action \alpha_1) we need

\lambda_{21}\, p(w_1 | x) + \lambda_{22}\, p(w_2 | x) > \lambda_{11}\, p(w_1 | x) + \lambda_{12}\, p(w_2 | x)

\Rightarrow (\lambda_{21} - \lambda_{11})\, p(w_1 | x) > (\lambda_{12} - \lambda_{22})\, p(w_2 | x)

If both (\lambda_{21} - \lambda_{11}) > 0 and (\lambda_{12} - \lambda_{22}) > 0, the rule compares the posteriors weighted by the loss differences; when the two loss differences are equal, it reduces to: if p(w_1 | x) > p(w_2 | x), decide class w_1.
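A minimal sketch of this two-action minimum-risk rule, with an assumed loss matrix \lambda_{ij} and assumed posterior values (all numbers are hypothetical):

```python
# Hypothetical 2x2 loss matrix: lam[i][j] = loss for action alpha_(i+1) when true class is w_(j+1)
lam = [[0.0, 2.0],   # lambda_11, lambda_12
       [1.0, 0.0]]   # lambda_21, lambda_22

def min_risk_action(post_w1, post_w2):
    """Return the action with the smaller conditional risk R(alpha_i | x)."""
    r1 = lam[0][0] * post_w1 + lam[0][1] * post_w2   # R(alpha_1 | x)
    r2 = lam[1][0] * post_w1 + lam[1][1] * post_w2   # R(alpha_2 | x)
    return "alpha_1 (decide w1)" if r1 < r2 else "alpha_2 (decide w2)"

# Example: assumed posteriors p(w1|x) = 0.6, p(w2|x) = 0.4
print(min_risk_action(0.6, 0.4))
```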
Multi-category case
[Figure: a classifier computes discriminant functions g_1(x), g_2(x), ..., g_c(x) from the input X and assigns X to the class with the largest value.]
c = number of classes; g(x) = discriminant function.
w_1, w_2, ..., w_c are the c classes.
g_i(x), i = 1, 2, ..., c.
If g_i(x) > g_j(x) \ \forall\, j \neq i, decide x \in w_i.
Minimum-risk classifier:
We can let g_i(x) = -R(\alpha_i | x).
For the zero-one loss, R(\alpha_i | x) = 1 - p(w_i | x),
so g_i(x) = p(w_i | x).
More generally, f(g_i(x)) is an equivalent discriminant for any monotonically increasing function f.
Minimum error-rate classification:

g_i(x) = p(w_i | x) = \frac{p(x | w_i)\, p(w_i)}{\sum_{j=1}^{c} p(x | w_j)\, p(w_j)} = \frac{p(x | w_i)\, p(w_i)}{p(x)}

Since the evidence p(x) is the same for every class, we may simply use

g_i(x) = p(x | w_i)\, p(w_i)

The logarithm is a monotonically increasing function, so equivalently

g_i(x) = \ln p(x | w_i) + \ln p(w_i)
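The following sketch applies this logarithmic discriminant to three hypothetical classes with assumed Gaussian class-conditional densities and priors, deciding by the largest g_i(x):

```python
import numpy as np
from scipy.stats import norm

# Three hypothetical classes: Gaussian class-conditionals and priors (illustrative values)
means, sigmas = [2.0, 5.0, 9.0], [1.0, 1.5, 1.0]
priors = [0.2, 0.5, 0.3]

def g(x, i):
    """Minimum error-rate discriminant g_i(x) = ln p(x|w_i) + ln p(w_i)."""
    return norm(means[i], sigmas[i]).logpdf(x) + np.log(priors[i])

def classify(x):
    scores = [g(x, i) for i in range(len(priors))]
    return int(np.argmax(scores)) + 1   # 1-based class index, matching w_1 ... w_c

print([classify(x) for x in (1.0, 4.0, 8.0)])
```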
Two-category case:
We now have two discriminant functions, g_1(x) and g_2(x).
If g_1(x) > g_2(x), decide class w_1.
If g_1(x) < g_2(x), decide class w_2.
Decision boundary: g_1(x) - g_2(x) = 0 (the surface separating the region of class w_1 from that of class w_2).
Single discriminant function:

g(x) = g_1(x) - g_2(x) = \ln p(w_1 | x) - \ln p(w_2 | x)
     = \ln p(x | w_1) + \ln p(w_1) - \ln p(x | w_2) - \ln p(w_2)
     = \ln \frac{p(x | w_1)}{p(x | w_2)} + \ln \frac{p(w_1)}{p(w_2)}
The Normal Density
Univariate density: We begin with the continuous univariate normal or Gaussian density,

p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right]

\mu = expected value of x:

\mu = E[x] = \int_{-\infty}^{\infty} x\, p(x)\, dx

\sigma^2 = variance:

\sigma^2 = E[(x - \mu)^2] = \int_{-\infty}^{\infty} (x - \mu)^2\, p(x)\, dx

We write p(x) \sim N(\mu, \sigma^2).
[Figure: the univariate normal density p(x), with its peak at x = \mu.]
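As a quick check of the formula, this sketch evaluates the univariate normal density directly and compares it with SciPy's implementation; \mu and \sigma are illustrative values:

```python
import numpy as np
from scipy.stats import norm

def gaussian_pdf(x, mu, sigma):
    """Univariate normal density: (1 / (sqrt(2*pi)*sigma)) * exp(-0.5*((x-mu)/sigma)**2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

mu, sigma = 3.0, 2.0            # illustrative parameters
x = np.linspace(-7.0, 13.0, 5)
print(gaussian_pdf(x, mu, sigma))
print(norm(mu, sigma).pdf(x))   # should agree with the hand-written formula
```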
Multivariate Density:- The general multivariate normal density in 𝑑 dimensions is written as
p(x) = \frac{1}{(2\pi)^{d/2}\, |\Sigma|^{1/2}} \exp\left[ -\frac{1}{2} (x - \mu)^t\, \Sigma^{-1}\, (x - \mu) \right]
x = feature vector of dimension d
\mu = mean (expected value) vector of dimension d:

\mu = E[x] = \int_{-\infty}^{\infty} x\, p(x)\, dx

\Sigma = covariance matrix:

\Sigma = E[(x - \mu)(x - \mu)^t] = \int_{-\infty}^{\infty} (x - \mu)(x - \mu)^t\, p(x)\, dx

Here (x - \mu) is (d \times 1) and (x - \mu)^t is (1 \times d), so \Sigma is (d \times d).
i-th component of the mean: \mu_i = E[x_i]
(i, j)-th component of the covariance: \sigma_{ij} = E[(x_i - \mu_i)(x_j - \mu_j)]
Diagonal components: \sigma_{ii} = E[(x_i - \mu_i)^2] = \sigma_i^2
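The sketch below evaluates the general d-dimensional normal density directly from the formula and cross-checks it against SciPy; the mean vector and covariance matrix are illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal

def mvn_pdf(x, mu, Sigma):
    """General d-dimensional normal density, evaluated directly from the formula."""
    d = mu.shape[0]
    diff = x - mu
    norm_const = 1.0 / ((2.0 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))
    quad = diff @ np.linalg.inv(Sigma) @ diff        # (x-mu)^t Sigma^{-1} (x-mu)
    return norm_const * np.exp(-0.5 * quad)

mu = np.array([1.0, 2.0])                            # illustrative mean vector
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])                       # illustrative covariance matrix
x = np.array([1.5, 1.0])
print(mvn_pdf(x, mu, Sigma))
print(multivariate_normal(mu, Sigma).pdf(x))          # cross-check with SciPy
```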
Bivariate normal density function:-
X = two-dimensional feature vector,

X = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
p(x) = \frac{1}{2\pi\, |\Sigma|^{1/2}} \exp\left[ -\frac{1}{2} \left\{ \left( \frac{x_1 - \mu_1}{\sigma_1} \right)^2 + \left( \frac{x_2 - \mu_2}{\sigma_2} \right)^2 \right\} \right]

with (for statistically independent components)

\Sigma = \begin{pmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{pmatrix}, \qquad \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}

Since |\Sigma|^{1/2} = \sigma_1 \sigma_2, the density factors into a product of two univariate normals:

p(x) = \frac{1}{2\pi\, \sigma_1 \sigma_2} \exp\left[ -\frac{1}{2} \left( \frac{x_1 - \mu_1}{\sigma_1} \right)^2 \right] \exp\left[ -\frac{1}{2} \left( \frac{x_2 - \mu_2}{\sigma_2} \right)^2 \right]
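A short numerical check of this factorization for a diagonal covariance matrix (all parameter values are illustrative):

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

# Diagonal covariance: the bivariate density factors into two univariate densities
mu1, mu2 = 0.0, 1.0
s1, s2 = 1.0, 2.0                        # sigma_1, sigma_2 (illustrative)
Sigma = np.diag([s1 ** 2, s2 ** 2])

x1, x2 = 0.5, -0.5
joint = multivariate_normal([mu1, mu2], Sigma).pdf([x1, x2])
product = norm(mu1, s1).pdf(x1) * norm(mu2, s2).pdf(x2)
print(joint, product)                    # the two values coincide
```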
Physical interpretation:
(i) First case: \sigma_1 = \sigma_2
[Figure: the bivariate density p(x) over the (x_1, x_2) plane, with the loci of points having constant density drawn around the peak.]
For the bivariate density we can trace the loci of constant density, i.e. all values of x for which p(x) is constant; these loci are circles.
Points drawn at random from this single population are most likely to occur near the centre of these circles, around the mean.
(ii) Second case: \sigma_1^2 \neq \sigma_2^2
In general,

\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{21} & \sigma_2^2 \end{pmatrix}

If the two components are statistically independent, \sigma_{12} = \sigma_{21} = 0, and

\Sigma = \begin{pmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{pmatrix}

so the loci of constant density are axis-aligned ellipses.
(iii) Third case: the data are not statistically independent.
The loci of constant density are ellipses whose principal axes lie along e_i, the eigenvectors of the covariance matrix \Sigma.
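A small sketch showing how the eigenvectors and eigenvalues of an (illustrative) covariance matrix give the principal axes and spreads of the constant-density ellipses:

```python
import numpy as np

# Illustrative covariance matrix with correlated (not independent) components
Sigma = np.array([[3.0, 1.2],
                  [1.2, 1.0]])

# Eigenvectors give the principal axes of the constant-density ellipses;
# eigenvalues give the variances along those axes.
eigvals, eigvecs = np.linalg.eigh(Sigma)
print("spread along axes ~ sqrt(eigenvalues):", np.sqrt(eigvals))
print("principal axes (columns):\n", eigvecs)
```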
Discriminant Functions for the Normal Density
We know that the discriminant functions are given by

g_i(x) = \ln p(x | w_i) + \ln p(w_i)   (dropping the term -\ln p(x), which is common to all classes)
Multivariate Density:-
p(x | w_i) = \frac{1}{(2\pi)^{d/2}\, |\Sigma_i|^{1/2}} \exp\left[ -\frac{1}{2} (x - \mu_i)^t\, \Sigma_i^{-1}\, (x - \mu_i) \right]
Discriminant Functions:-
g_i(x) = -\frac{1}{2} (x - \mu_i)^t\, \Sigma_i^{-1}\, (x - \mu_i) - \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln |\Sigma_i| + \ln p(w_i)
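A minimal sketch of this discriminant function for two hypothetical classes with assumed means, covariance matrices, and priors:

```python
import numpy as np

def gaussian_discriminant(x, mu, Sigma, prior):
    """g_i(x) = -0.5 (x-mu)^t Sigma^{-1} (x-mu) - (d/2) ln(2*pi) - 0.5 ln|Sigma| + ln p(w_i)."""
    d = mu.shape[0]
    diff = x - mu
    return (-0.5 * diff @ np.linalg.inv(Sigma) @ diff
            - 0.5 * d * np.log(2.0 * np.pi)
            - 0.5 * np.log(np.linalg.det(Sigma))
            + np.log(prior))

# Two hypothetical classes; decide for the larger discriminant value
mu_1, Sigma_1, p_1 = np.array([0.0, 0.0]), np.eye(2), 0.6
mu_2, Sigma_2, p_2 = np.array([3.0, 3.0]), 2.0 * np.eye(2), 0.4
x = np.array([1.0, 1.2])
g1 = gaussian_discriminant(x, mu_1, Sigma_1, p_1)
g2 = gaussian_discriminant(x, mu_2, Sigma_2, p_2)
print("decide w1" if g1 > g2 else "decide w2")
```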
Let us examine this discriminant function and the resulting classification for a number of special cases.
Case: \Sigma_i = \sigma^2 I, where I is the (d \times d) identity matrix.
Here \sigma_{ij} = 0 for i \neq j, i.e. the different components are statistically independent, and

|\Sigma_i| = \sigma^{2d}, \qquad \Sigma_i^{-1} = \frac{1}{\sigma^2} I

In

g_i(x) = -\frac{1}{2} (x - \mu_i)^t\, \Sigma_i^{-1}\, (x - \mu_i) - \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln |\Sigma_i| + \ln p(w_i)

the terms -\frac{d}{2}\ln 2\pi and -\frac{1}{2}\ln|\Sigma_i| are constants independent of i, so they can be ignored.
Thus we obtain the simple discriminant functions:

g_i(x) = -\frac{\|x - \mu_i\|^2}{2\sigma^2} + \ln p(w_i)

where \|\cdot\| is the Euclidean norm, that is,

\|x - \mu_i\|^2 = (x - \mu_i)^t (x - \mu_i)
Expansion of the quadratic form (x - \mu_i)^t (x - \mu_i) yields

g_i(x) = -\frac{1}{2\sigma^2} \left[ x^t x - 2\mu_i^t x + \mu_i^t \mu_i \right] + \ln p(w_i)

which appears to be a quadratic function of x. However, the quadratic term x^t x is the same for all i, making it an ignorable additive constant. Then

g_i(x) = -\frac{1}{2\sigma^2} \left[ -2\mu_i^t x + \mu_i^t \mu_i \right] + \ln p(w_i)
Thus we obtain the equivalent linear discriminant functions

g_i(x) = w_i^t x + w_{i0}

where

w_i = \frac{1}{\sigma^2} \mu_i, \qquad w_{i0} = -\frac{1}{2\sigma^2} \mu_i^t \mu_i + \ln p(w_i)

If g_i(x) > g_j(x), then x \in class w_i.
If g_j(x) > g_i(x), then x \in class w_j.
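A minimal sketch of these linear discriminant functions, with an assumed common variance \sigma^2 and assumed class means and priors:

```python
import numpy as np

sigma2 = 1.5                                          # common variance sigma^2 (illustrative)
mus = [np.array([0.0, 0.0]), np.array([4.0, 1.0])]   # class means mu_1, mu_2 (illustrative)
priors = [0.5, 0.5]

def linear_g(x, mu, prior):
    """g_i(x) = w_i^t x + w_i0 with w_i = mu_i / sigma^2, w_i0 = -mu_i^t mu_i / (2 sigma^2) + ln p(w_i)."""
    w = mu / sigma2
    w0 = -mu @ mu / (2.0 * sigma2) + np.log(prior)
    return w @ x + w0

x = np.array([2.5, 0.0])
scores = [linear_g(x, mu, p) for mu, p in zip(mus, priors)]
print("x assigned to class", int(np.argmax(scores)) + 1)
```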
The decision boundary is given by g_i(x) = g_j(x), i.e. g_i(x) - g_j(x) = 0.
With

g_i(x) = w_i^t x + w_{i0}, \qquad g_j(x) = w_j^t x + w_{j0},

g_i(x) - g_j(x) = 0

\Rightarrow (w_i - w_j)^t x + (w_{i0} - w_{j0}) = 0

\Rightarrow \frac{1}{\sigma^2} (\mu_i - \mu_j)^t x - \frac{1}{2\sigma^2} \mu_i^t \mu_i + \ln p(w_i) + \frac{1}{2\sigma^2} \mu_j^t \mu_j - \ln p(w_j) = 0

\Rightarrow (\mu_i - \mu_j)^t x - \frac{1}{2} (\mu_i^t \mu_i - \mu_j^t \mu_j) + \sigma^2 \ln \frac{p(w_i)}{p(w_j)} = 0

\Rightarrow (\mu_i - \mu_j)^t \left[ x - \frac{1}{2} (\mu_i + \mu_j) + \frac{\sigma^2}{\|\mu_i - \mu_j\|^2} \ln \frac{p(w_i)}{p(w_j)}\, (\mu_i - \mu_j) \right] = 0

\Rightarrow w^t (x - x_0) = 0
where

w = \mu_i - \mu_j, \qquad x_0 = \frac{1}{2} (\mu_i + \mu_j) - \frac{\sigma^2}{\|\mu_i - \mu_j\|^2} \ln \frac{p(w_i)}{p(w_j)}\, (\mu_i - \mu_j)

If p(w_i) = p(w_j), then x_0 = \frac{1}{2} (\mu_i + \mu_j), the midpoint of the line joining \mu_i and \mu_j.
The decision boundary is thus a hyperplane through x_0 with normal vector w = \mu_i - \mu_j; it is orthogonal to the line joining \mu_i and \mu_j, and for equal priors it is the perpendicular bisector of that line.
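A short sketch that computes w and x_0 for assumed means, variance, and priors, and verifies that x_0 lies on the decision boundary (the two linear discriminants agree there):

```python
import numpy as np

sigma2 = 1.0                                   # sigma^2 (illustrative)
mu_i, mu_j = np.array([0.0, 0.0]), np.array([4.0, 2.0])
p_i, p_j = 0.7, 0.3                            # priors (illustrative)

w = mu_i - mu_j
diff_norm2 = np.dot(w, w)                      # ||mu_i - mu_j||^2
x0 = 0.5 * (mu_i + mu_j) - (sigma2 / diff_norm2) * np.log(p_i / p_j) * w

# x0 lies on the boundary: both linear discriminants take the same value there
def g(x, mu, prior):
    return (mu / sigma2) @ x - mu @ mu / (2.0 * sigma2) + np.log(prior)

print("x0 =", x0)
print("g_i(x0) - g_j(x0) =", g(x0, mu_i, p_i) - g(x0, mu_j, p_j))   # ~ 0
```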