DASC7606-1B

Deep Learning
Linear Models

Dr Bethany Chan
Professor Francis Chin

2023
1
Outline
• Supervised learning, classification &
regression problems, linear models
• Linear regression and gradient descent
• Logistic regression and classification

2
How did we learn in school?
Asked questions, given answers

Questions & Answers
→ Practice exercises (training data)
→ Mock exams (validation)
→ Final exams (testing)
Supervised Learning

3
Classification problems (discrete answers)

• Example 1: Given an image, determine whether the image is a dog or not
• Example 2: Given a loan applicant, approve or deny the loan
      salary                 150,000
      current debt            75,000
      age                     28 years old
      years in current job    3
      …                       …
• Example 3: Given an image of a handwritten digit, determine which digit it is

Binary vs multi-class classification
4
Regression problems (continuous answers)
• Example 1: Given the car camera view, predict how much you
should turn the steering wheel
• Example 2: Given the features of a flat (e.g. size, number of
bedrooms, number of bathrooms, age of building), predict the
rent of the flat
       1st feature    2nd feature      3rd feature       4th feature            label
       Size (sq.ft.)  No. of Bedrooms  No. of Bathrooms  Age of Building (yrs)  Rent per Month ($)
x1     1700           4                3                 10                     70k
x2     1420           3                2                 12                     54k
x3     1290           4                1.5               8                      45k
x4     880            2                2                 2                      40k
x5     510            2                2                 3                      26.5k
5
Learning Model for supervised learning
Questions (x1, x2, …, xM) → System (unknown function f: X → Y) → Answers (y1, y2, …, yM)

Past data with known answers form the Training Data Set
    (x1, y1), (x2, y2), …, (xM, yM)
    e.g. picture → dog or not dog;  car camera view → turn right 15%

Training Data Set → Learning Algorithm → trained machine (hypothesis function h ≈ f)

New question → trained machine → predicted answer

6
Perspective on Classification and Regression
7
Linear models for supervised learning

Questions (x1, x2, …, xM) → System (unknown function f: X → Y) → Answers (y1, y2, …, yM)

Training Data Set
    (x1, y1), (x2, y2), …, (xM, yM)

Training Data Set → Learning Algorithm → trained machine (hypothesis function h ≈ f)
    Learning = finding the coefficients or weights θ of a line or hyperplane to minimize error

New question → trained machine → predicted answer

8
Outline
• Supervised learning, classification &
regression problems, linear models
• Linear regression and gradient descent
• Logistic regression and classification

9
Linear Regression
Predict h(x) given x = (x_1, …, x_N)
Find a hyperplane or line
    h(x) = θ_0 + θ_1 x_1 + ⋯ + θ_N x_N
    [or h_θ(x) = θx with x_0 = 1]
to minimize error.

Loss function vs cost function:
A loss function is for a single training example (sometimes called error function).
A cost function J(θ) is the average loss over the entire training dataset, which is to be minimized.
10
Linear Regression cost functions
 x      y
 1.00   1.00
 2.00   2.00
 3.00   1.30
 4.00   3.75
 5.00   2.25

Mean square error
    = (1/M) Σ_{i=1..M} (h(x_i) − y_i)²      (norm 2)

Absolute error loss
    = (1/M) Σ_{i=1..M} |h(x_i) − y_i|       (norm 1)
11
Look at cost J(θ) = (1/M) Σ_{i=1..M} (h(x_i) − y_i)²   (norm 2)
where h(x_i) = θ_0 + θ_1 x_i = 0.785 + 0.425 x_i

 x      y      h(x)    h(x) − y   (h(x) − y)²
 1.00   1.00   1.210    0.210     0.044
 2.00   2.00   1.635   -0.365     0.133
 3.00   1.30   2.060    0.760     0.578
 4.00   3.75   2.485   -1.265     1.600
 5.00   2.25   2.910    0.660     0.436

J(θ) = (0.044 + 0.133 + 0.578 + 1.600 + 0.436) / 5 ≈ 0.558
12
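As a quick check of these numbers, a minimal NumPy sketch (the data and θ values are taken from the slide above) that evaluates the two cost functions:

```python
import numpy as np

# Training data from the slide
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 2.0, 1.3, 3.75, 2.25])

theta0, theta1 = 0.785, 0.425      # the line fitted on the slide
h = theta0 + theta1 * x            # predictions h(x_i)

mse = np.mean((h - y) ** 2)        # mean square error (norm 2)
mae = np.mean(np.abs(h - y))       # absolute error loss (norm 1)
print(h)      # [1.21  1.635 2.06  2.485 2.91 ]
print(mse)    # ≈ 0.558
```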
Iterative way to learn 𝜽

Make an initial guess for θ
Calculate the error J(θ)
Repeat until error is small enough:
    make a better guess for θ
    calculate the error J(θ)
Return θ

Technique: Gradient Descent

13
Gradient Descent (one feature)
Objective is to minimize the cost function:
    J(θ_0, θ_1) = (1/M) Σ_{i=1..M} (h_θ(x_i) − y_i)²

Random guess of (θ_0, θ_1)

14
Random guess of (θ_0, θ_1)
The gradient is the slope of the tangent at (θ_0, θ_1)

Move in the opposite direction of the gradient (steepest descent)


15
[Plot of J(θ) against θ: tangent line at the current θ, gradient ∇J(θ), learning rate α, step −α∇J(θ)]

Move in the opposite direction of the gradient (steepest descent).
The step size is determined by the gradient magnitude ∇J(θ): a larger step for a steeper tangent line.
Steps are small when close to the minimum, where ∇J(θ) is small as the tangent line is almost horizontal.
16
Random guess of (θ_0, θ_1)
The gradient is the slope of the tangent at (θ_0, θ_1)

Make steps repeatedly, moving in the opposite direction of the gradient (steepest descent)
17
Gradient Descent (one feature)
Make steps of size α (learning rate) down the cost function J in the direction of steepest descent
(as determined by the slope of the tangent at (θ_0, θ_1)):

Repeat until error is small enough:
    θ_0 ← θ_0 − α ∂J(θ_0, θ_1)/∂θ_0
    θ_1 ← θ_1 − α ∂J(θ_0, θ_1)/∂θ_1
18
Cost Function for Gradient Descent
(one variable)

Mean-square error:
    J(θ_0, θ_1) = (1/M) Σ_{i=1..M} (h_θ(x_i) − y_i)²

Half mean-square error:
    J(θ_0, θ_1) = (1/(2M)) Σ_{i=1..M} (h_θ(x_i) − y_i)²

19
Taking derivatives (one variable)
J(θ_0, θ_1) = (1/(2M)) Σ_{i=1..M} (h_θ(x_i) − y_i)²    and    h_θ(x_i) = θ_0 + θ_1 x_i

∂J(θ_0, θ_1)/∂θ_0 = (1/M) Σ_{i=1..M} (h_θ(x_i) − y_i)
∂J(θ_0, θ_1)/∂θ_1 = (1/M) Σ_{i=1..M} (h_θ(x_i) − y_i) x_i

Repeat until error is small enough:
    θ_0 ← θ_0 − α ∂J(θ_0, θ_1)/∂θ_0  =  θ_0 − α (1/M) Σ_{i=1..M} (h_θ(x_i) − y_i)
    θ_1 ← θ_1 − α ∂J(θ_0, θ_1)/∂θ_1  =  θ_1 − α (1/M) Σ_{i=1..M} (h_θ(x_i) − y_i) x_i
20
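A minimal NumPy sketch of these update rules, reusing the five (x, y) pairs from the earlier slide; the learning rate and iteration count are arbitrary illustrative choices:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 2.0, 1.3, 3.75, 2.25])

theta0, theta1 = 0.0, 0.0    # initial guess
alpha = 0.05                 # learning rate (illustrative value)

for _ in range(5000):        # "repeat until error is small enough"
    h = theta0 + theta1 * x            # h_theta(x_i)
    err = h - y                        # h_theta(x_i) - y_i
    grad0 = err.mean()                 # dJ/dtheta_0 of the half mean-square error
    grad1 = (err * x).mean()           # dJ/dtheta_1 of the half mean-square error
    theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1   # simultaneous update

print(theta0, theta1)   # approaches the slide's line 0.785 + 0.425 x
```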
Problem: Learning Rate

Small learning rate:
- Many iterations till convergence
- Trapped in local minimum

Large learning rate:
- Overshooting
- No convergence
21
Learning Rate and No. of Iterations
• Plot no. of iterations on the x-axis and cost J(θ) on the y-axis
• If J(θ) increases, then decrease the learning rate α
• Stop when ΔJ(θ) is smaller than a chosen threshold

22
Gradient Descent (multiple features)
Repeat until error is small enough:
    θ_0 ← θ_0 − α (1/M) Σ_{i=1..M} (h_θ(x_i) − y_i) x_{i,0}
    θ_1 ← θ_1 − α (1/M) Σ_{i=1..M} (h_θ(x_i) − y_i) x_{i,1}
    θ_2 ← θ_2 − α (1/M) Σ_{i=1..M} (h_θ(x_i) − y_i) x_{i,2}
    ⋮
    θ_N ← θ_N − α (1/M) Σ_{i=1..M} (h_θ(x_i) − y_i) x_{i,N}

⇒ Repeat until error is small enough:
    θ_j ← θ_j − α (1/M) Σ_{i=1..M} (h_θ(x_i) − y_i) x_{i,j}    for j = 0, …, N

23
Multiple feature example
       1st feature    2nd feature      3rd feature       4th feature            label
       Size (sq.ft.)  No. of Bedrooms  No. of Bathrooms  Age of Building (yrs)  Rent per Month ($)
x1     1700           4                3                 10                     70k
x2     1420           3                2                 12                     54k
x3     1290           4                1.5               8                      45k
x4     880            2                2                 2                      40k
x5     510            2                2                 3                      26.5k

x_{i,j} = j-th feature of the i-th example, with x_{i,0} = 1
y_i = label associated with the i-th example

x_1 = (x_{1,0}, x_{1,1}, x_{1,2}, x_{1,3}, x_{1,4}) = (1, 1700, 4, 3, 10)

24
Feature Scaling
• Note that the value ranges of different features may be very different, e.g., the size of a
  flat is in hundreds or thousands of sq. ft., while the age of a flat is in tens of years.
• As the same learning rate α is used across all N features (variables), we would like the
  input values to be roughly in the same range
  ⇒ feature scaling or mean normalization:
        x_{i,j} ← (x_{i,j} − mean_j) / (max_j − min_j)

25
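A small NumPy sketch of this mean normalization, applied column by column to the four features of the rent table above:

```python
import numpy as np

# Feature matrix from the rent example: size, bedrooms, bathrooms, age
X = np.array([
    [1700, 4, 3.0, 10],
    [1420, 3, 2.0, 12],
    [1290, 4, 1.5,  8],
    [ 880, 2, 2.0,  2],
    [ 510, 2, 2.0,  3],
])

# Mean normalization: (x - mean_j) / (max_j - min_j), computed per feature column j
X_scaled = (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(X_scaled)   # every column now lies roughly in [-1, 1]
```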
Gradient Descent (multiple variables)
Make steps of size α (learning rate) down the cost function J in the direction of steepest descent
(as determined by the slope of the tangent at (θ_0, θ_1, …, θ_N)):

Repeat until error is small enough:
    θ_j ← θ_j − α (1/M) Σ_{i=1..M} (h_θ(x_i) − y_i) x_{i,j}    for j = 0, …, N

[ Or θ ← θ − (α/M) Xᵀ(Xθ − y) in vectorized form ]
26
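A minimal NumPy sketch of the vectorized update θ ← θ − (α/M) Xᵀ(Xθ − y); the tiny data set at the bottom is made up purely for illustration:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iters=2000):
    """Vectorized linear-regression gradient descent.
    X is M x (N+1) with a leading column of ones (x_{i,0} = 1)."""
    M = X.shape[0]
    theta = np.zeros(X.shape[1])             # initial guess for all N+1 weights
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / M     # (1/M) X^T (X theta - y)
        theta -= alpha * grad                # step against the gradient
    return theta

# Made-up example: y = 2 + 3*x1, with a bias column of ones prepended to the feature
X = np.column_stack([np.ones(5), np.array([0.0, 0.25, 0.5, 0.75, 1.0])])
y = 2 + 3 * X[:, 1]
print(gradient_descent(X, y))   # ≈ [2.0, 3.0]
```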
Outline
• Supervised learning, classification &
regression problems, linear models
• Linear regression and gradient descent
• Logistic regression and classification

27
Logistic Regression
Predict the probability of an event occurring
(e.g. prediction of a heart attack)

Prediction: h(x) = σ(s) = σ(θx)
h(x) = σ(θx) maps ℝ^N → [0, 1] and is interpreted as a probability

θx gives a sort of “risk score” that gets passed through the sigmoid function
(a.k.a. logistic function) σ in order to determine the probability of the event
(e.g. heart attack)
28
Logistic Function (sigmoid)
σ(s) = e^s / (1 + e^s) = 1 / (1 + e^{−s})

[Plot of σ(s): σ(s) → 1 as s → ∞, σ(s) → 0 as s → −∞, σ(0) = 0.5]

29
Logistic Regression and classification
σ(s) = e^s / (1 + e^s) = 1 / (1 + e^{−s})

For classification:
    If the output > 0.5 (50% probability), it is classified as the positive class (predict y = 1).
    If the output < 0.5, it is classified as the negative class (predict y = 0).
30
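A small sketch of the sigmoid and the 0.5-threshold classification rule (the function names and example numbers are illustrative):

```python
import numpy as np

def sigmoid(s):
    """Logistic function: sigma(s) = 1 / (1 + e^{-s})."""
    return 1.0 / (1.0 + np.exp(-s))

def classify(theta, x, threshold=0.5):
    """Return (probability, predicted class) for one example x (with x_0 = 1)."""
    p = sigmoid(np.dot(theta, x))     # h(x) = sigma(theta x)
    return p, int(p > threshold)      # positive class if the output exceeds 0.5

theta = np.array([-1.0, 2.0])         # illustrative weights
x = np.array([1.0, 0.8])              # bias term plus one feature
print(classify(theta, x))             # (≈ 0.646, 1)
```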
Logistic Regression Loss Function
 y    h(x)     loss
 1    → 1      Loss is 0
 1    → 0      Loss is very high
 0    → 1      Loss is very high
 0    → 0      Loss is 0
(cross entropy)

• To predict the positive class (y = 1), loss = − log(h(x))
• To predict the negative class (y = 0), loss = − log(1 − h(x))

[Plots: for the positive class (y = 1), loss = − log(h(x)) falls to 0 as h(x) → 1;
 for the negative class (y = 0), loss = − log(1 − h(x)) falls to 0 as h(x) → 0]
Logistic Regression Loss Function

J(h_θ(x), y) = − log(h_θ(x))        if y = 1
               − log(1 − h_θ(x))    if y = 0

J(h_θ(x), y) = −y log(h_θ(x)) − (1 − y) log(1 − h_θ(x))
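A one-line sketch of this per-example loss; the small eps clip is a numerical-stability assumption added for the sketch, not part of the slide:

```python
import numpy as np

def logistic_loss(h, y, eps=1e-12):
    """Cross-entropy loss -y*log(h) - (1-y)*log(1-h) for a single example."""
    h = np.clip(h, eps, 1 - eps)      # guard against log(0)
    return -y * np.log(h) - (1 - y) * np.log(1 - h)

print(logistic_loss(0.9, 1))   # small loss: confident and correct
print(logistic_loss(0.9, 0))   # large loss: confident and wrong
```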
Gradient Descent (Logistic Regression)
J(θ) = −(1/M) Σ_{i=1..M} [ y_i log h(x_i) + (1 − y_i) log(1 − h(x_i)) ]

∂J(θ)/∂θ_j = −(1/M) Σ_{i=1..M} ∂/∂θ_j [ y_i log h(x_i) + (1 − y_i) log(1 − h(x_i)) ]
             (write the bracketed per-example term as J_i(θ))

where h(x_i) = σ(s_i);  s_i = x_i θ;  σ(s_i) = 1 / (1 + e^{−s_i})
      s_i = x_i θ = θ_0 x_{i,0} + … + θ_j x_{i,j} + … + θ_N x_{i,N};  N = dimension

∂J_i(θ)/∂θ_j = (∂J_i/∂σ(x_i)) (∂σ(x_i)/∂s_i) (∂s_i/∂θ_j)       (chain rule)

∂J_i/∂σ(x_i) = y_i/σ(x_i) − (1 − y_i)/(1 − σ(x_i))             [ using d(log x)/dx = 1/x ]
             = (y_i − σ(x_i)) / (σ(x_i)(1 − σ(x_i)))
33
Gradient Descent (Logistic Regression)
J_i(θ) = y_i log h(x_i) + (1 − y_i) log(1 − h(x_i))
where h(x_i) = σ(s_i),  σ(s_i) = 1 / (1 + e^{−s_i})
and   s_i = x_i θ = θ_0 x_{i,0} + … + θ_j x_{i,j} + … + θ_N x_{i,N}

∂J_i(θ)/∂θ_j = (∂J_i/∂σ(x_i)) (∂σ(x_i)/∂s_i) (∂s_i/∂θ_j)       (chain rule)

1)  ∂J_i/∂σ(x_i) = (y_i − σ(x_i)) / (σ(x_i)(1 − σ(x_i)))
2)  ∂σ(x_i)/∂s_i = e^{−s_i} / (1 + e^{−s_i})²
                 = (1 / (1 + e^{−s_i})) (1 − 1 / (1 + e^{−s_i})) = σ(s_i)(1 − σ(s_i))
3)  ∂s_i/∂θ_j = x_{i,j}

∂J_i(θ)/∂θ_j = (y_i − h(x_i)) x_{i,j}         as h(x_i) = σ(s_i)
∂J(θ)/∂θ_j = −(1/M) Σ_{i=1..M} (y_i − h(x_i)) x_{i,j}

Compare Linear Regression:
    J(θ) = (1/(2M)) Σ_{i=1..M} (h(x_i) − y_i)²
    ∂J(θ)/∂θ_j = (1/M) Σ_{i=1..M} (h(x_i) − y_i) x_{i,j}

Repeat until error is small:
    θ_j ← θ_j − α ∂J(θ)/∂θ_j    for j = 0, …, N

Same update equation as Linear Regression
34
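Putting the pieces together, a minimal NumPy sketch of gradient descent for logistic regression; the toy data set and hyperparameters are illustrative assumptions, not from the slides:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def logistic_gd(X, y, alpha=0.5, iters=2000):
    """Gradient descent for logistic regression.
    X is M x (N+1) with a leading column of ones; y holds 0/1 labels."""
    M = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        h = sigmoid(X @ theta)          # h(x_i) = sigma(x_i theta)
        grad = X.T @ (h - y) / M        # same form as the linear-regression gradient
        theta -= alpha * grad
    return theta

# Toy 1-D data: label is 1 once the feature exceeds roughly 2.5
X = np.column_stack([np.ones(6), np.arange(6.0)])
y = np.array([0, 0, 0, 1, 1, 1])
theta = logistic_gd(X, y)
print((sigmoid(X @ theta) > 0.5).astype(int))   # recovers [0 0 0 1 1 1]
```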
Logistic Function and classification
σ(s) = e^s / (1 + e^s) = 1 / (1 + e^{−s})

For classification, choose a threshold on the output σ(s):
    predict y = 1 if the output is above the threshold
    predict y = 0 otherwise

35
Outcomes of Binary Classification
predict 1 if f(x) > threshold
predict 0 otherwise

              Actual 0          Actual 1
Predicted 0   true negative     false negative
Predicted 1   false positive    true positive

• True positives:
data points predicted as positive that are actually positive
• False positives:
data points predicted as positive that are actually negative
• True negatives:
data points predicted as negative that are actually negative
• False negatives:
data points predicted as negative that are actually positive
Accuracy of Predictions
predict 1 if f(x) > threshold
predict 0 otherwise

              Actual 0          Actual 1
Predicted 0   true negative     false negative
Predicted 1   false positive    true positive

precision means how “accurate” the answer is:
    precision = true positives / predicted positives
              = true positives / (false positives + true positives)

recall means how “good” the answer is, i.e., the ability to find the correct ones:
    recall = true positives / actual positives
           = true positives / (false negatives + true positives)
37
Accuracy of Predictions
predict 1 if f(x) > threshold
predict 0 otherwise

              Actual 0          Actual 1
Predicted 0   true negative     false negative
Predicted 1   false positive    true positive

precision = true positives / predicted positives = true positives / (false positives + true positives)

↑ threshold  ↑ precision
↓ threshold  ↓ precision

38
Accuracy of Predictions
predict 1 if f(x) > threshold
predict 0 otherwise

              Actual 0          Actual 1
Predicted 0   true negative     false negative
Predicted 1   false positive    true positive

recall = true positives / actual positives = true positives / (false negatives + true positives)

↑ threshold  ↑ precision  ↓ recall
↓ threshold  ↓ precision  ↑ recall

39
Combining Precision and Recall
• Ideally both precision and recall are 1
• With different thresholds, we can trade higher precision for higher recall and vice versa
• Depending on the application:
    – Disease screening: higher recall
    – Prosecution of criminals: higher precision
    – Identifying terrorists: both
• F1 = 2 ∗ (precision ∗ recall) / (precision + recall)
• Harmonic mean instead of simple average to penalize extreme values
40
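A short pure-Python sketch that computes precision, recall, and F1 from predicted and actual 0/1 labels (the example labels are made up):

```python
def precision_recall_f1(predicted, actual):
    """Precision, recall and F1 for binary 0/1 label lists."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# 3 true positives, 1 false positive, 1 false negative, 1 true negative
print(precision_recall_f1([1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 1, 0]))   # (0.75, 0.75, 0.75)
```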
Logistic Regression, diagrammatically

Prediction: h(x) = σ(s) = σ(θx)

[Diagram: inputs 1, x_1, …, x_N with weights θ_0, θ_1, …, θ_N feed a sum s = θx,
 which is passed through σ to give the output h(x)]

41
Multi-classification example
MNIST (Modified National Institute of Standards and Technology) database

42
Multi-Classification
[Diagram: K scores are passed through Softmax to give Pr(Class 1), Pr(Class 2), …, Pr(Class K)]

Pr(Class k) = probability of Class k being the correct class
The class with the highest probability is the predicted class

Softmax function: not only normalizes a set of scores to numbers in [0,1]
but also makes sure that the numbers all add up to 1
Sigmoid vs. Softmax
For sigmoid (2-class):
    Class 1 probability: σ(s) = 1 / (1 + e^{−s})
    Class 2 probability: 1 − σ(s) = e^{−s} / (1 + e^{−s}) = 1 / (1 + e^{s})

For softmax (K-class):
    Given scores s_1, s_2, …, s_K
    Class k probability: e^{s_k} / Σ_j e^{s_j}

For softmax (2-class):
    Class 1 probability: e^{s_1} / (e^{s_1} + e^{s_2}) = 1 / (1 + e^{s_2 − s_1})
    Class 2 probability: e^{s_2} / (e^{s_1} + e^{s_2}) = 1 / (1 + e^{s_1 − s_2})
44
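A small NumPy sketch of softmax; subtracting the maximum score first is a standard numerical-stability trick added here, not something stated on the slide:

```python
import numpy as np

def softmax(scores):
    """Normalize a vector of scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)   # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

scores = np.array([0.3, 2.1, -1.0])
print(softmax(scores))          # ≈ [0.137, 0.826, 0.037]: one value dominates
print(softmax(scores).sum())    # 1.0
```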
• Question: Why do we have to pass each value
through an exponential before normalizing them?
Why can’t we just normalize the values themselves?
• Answer: This is because the goal of softmax is to
make sure one value is very high (close to 1) and
all other values are very low (close to 0).
• Softmax uses exponential to make sure this happens.

[Figure: softmax applied to example scores, e.g. 0.3 and 2.1]
45
Loss function for multi-classification
Categorical Cross Entropy loss:
    Loss(h(x), y) = − Σ_k y_k log h(x)_k

[Inset plot: for the positive class (y = 1), loss = − log(h(x)) against h(x)]

Ex:   Predicted h(x)         Label y
      [0.94, 0.01, 0.05]     [0, 1, 0]

Loss = −[ 0 ∗ log(0.94) + 1 ∗ log(0.01) + 0 ∗ log(0.05) ] = − log(0.01)

46
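A short sketch of categorical cross entropy that reproduces the example above:

```python
import numpy as np

def categorical_cross_entropy(h, y):
    """Loss(h(x), y) = -sum_k y_k * log(h(x)_k) for a one-hot label y."""
    return -np.sum(y * np.log(h))

h = np.array([0.94, 0.01, 0.05])   # predicted probabilities
y = np.array([0, 1, 0])            # one-hot label
print(categorical_cross_entropy(h, y))   # -log(0.01) ≈ 4.605
```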
Another Loss Function: Hinge Loss
• The probability for the correct class (p_{i,c_i}) should be greater than
  the other probabilities by at least a margin Δ
• For instance i:  L_i = Σ_{j ≠ c_i} max(0, p_{i,j} − p_{i,c_i} + Δ)
• Cost function = Σ_{i=1..M} L_i

Ex:   Predicted h(x)        Label y
      [0.25, 0.35, 0.4]     [0, 0, 1]

With Δ = 0.1:
    L_i = max(0, 0.25 − 0.4 + Δ) + max(0, 0.35 − 0.4 + Δ)
        = 0 + 0.05 = 0.05
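A small sketch of this per-instance hinge loss that reproduces the worked example:

```python
import numpy as np

def hinge_loss(p, y_onehot, delta=0.1):
    """L_i = sum over wrong classes j of max(0, p_j - p_correct + delta)."""
    correct = int(np.argmax(y_onehot))              # index c_i of the correct class
    margins = np.maximum(0.0, p - p[correct] + delta)
    margins[correct] = 0.0                          # exclude the correct class from the sum
    return margins.sum()

p = np.array([0.25, 0.35, 0.4])
y = np.array([0, 0, 1])
print(hinge_loss(p, y, delta=0.1))   # 0.05
```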
