DASC7606-1B
Deep Learning
Linear Models
Dr Bethany Chan
Professor Francis Chin
2023
1
Outline
• Supervised learning, classification &
regression problems, linear models
• Linear regression and gradient descent
• Logistic regression and classification
2
How did we learn in school?
Asked Questions
given Answers
Questions & Answers
→ Practice exercises (training data)
→ Mock exams (validation)
→ Final exams (testing)
Supervised Learning
3
Classification problems (discrete answers)
• Example 1: Given an image, determine whether the image is a dog or not
• Example 2: Given a loan applicant, approve or deny the loan
      salary                 150,000
      current debt           75,000
      age                    28 years old
      years in current job   3
      …                      …
• Example 3: Given an image of a handwritten digit, determine which digit it is
Binary vs. multi-class classification
4
Regression problems (continuous answers)
• Example 1: Given the car camera view, predict how much you
should turn the steering wheel
• Example 2: Given the features of a flat (e.g. size, number of
bedrooms, number of bathrooms, age of building), predict the
rent of the flat
      1st feature    2nd feature      3rd feature       4th feature            label
      Size (sq.ft.)  No. of Bedrooms  No. of Bathrooms  Age of Building (yrs)  Rent per Month ($)
x1    1700           4                3                 10                     70k
x2    1420           3                2                 12                     54k
x3    1290           4                1.5               8                      45k
x4    880            2                2                 2                      40k
x5    510            2                2                 3                      26.5k
5
Learning Model for supervised learning
System (unknown function) f: X → Y
Questions (x1, x2, … , xM) → Answers (y1, y2, … , yM)

Training Data Set: (x1, y1), (x2, y2), … , (xM, yM)
Past data → Known answers, e.g.
  Picture → Dog or not dog
  Car camera view → Turn right 15%
  … → …

Learning Algorithm → Trained machine (hypothesis function h ≈ f)
New Question → Trained machine → Predicted Answer
6
Perspective on Classification vs. Regression
7
Linear models for supervised learning
System (unknown function) f: X → Y
Questions (x1, x2, … , xM) → Answers (y1, y2, … , yM)

Training Data Set: (x1, y1), (x2, y2), … , (xM, yM)

Learning Algorithm: learn coefficients or weights θ of a line or
hyperplane to minimize error

New Question → Trained machine (hypothesis function h ≈ f) → Predicted Answer
8
Outline
• Supervised learning, classification &
regression problems, linear models
• Linear regression and gradient descent
• Logistic regression and classification
9
Linear Regression
Predict h(x) given x = (x_1, … , x_N)
Find hyperplane or line
  h(x) = θ_0 + θ_1 x_1 + ⋯ + θ_N x_N
  [or h_θ(x) = θx with x_0 = 1]
to minimize error.

Loss function vs. cost function:
A loss function is for a single training example (sometimes called error function).
A cost function J(θ) is the average loss over the entire training dataset, which is to be minimized.

[figure: data points and a fitted line h(x) plotted against x]
10
Linear Regression cost functions
  x      y
  1.00   1.00
  2.00   2.00
  3.00   1.30
  4.00   3.75
  5.00   2.25

[figure: the data points (x, y) together with a candidate line h_θ(x)]

Mean square error (norm 2):
  J(θ) = (1/M) Σ_{i=1}^{M} (h(x_i) − y_i)²

Absolute Error Loss (norm 1):
  J(θ) = (1/M) Σ_{i=1}^{M} |h(x_i) − y_i|
11
Look at cost J(θ) = (1/M) Σ_{i=1}^{M} (h(x_i) − y_i)²  (norm 2)
where h(x_i) = θ_0 + θ_1 x_i = 0.785 + 0.425 x_i

  x      y      h(x)     h(x) − y   (h(x) − y)²
  1.00   1.00   1.210     0.210      0.044
  2.00   2.00   1.635    −0.365      0.133
  3.00   1.30   2.060     0.760      0.578
  4.00   3.75   2.485    −1.265      1.600
  5.00   2.25   2.910     0.660      0.436

[figure: cost surface J plotted over (θ_0, θ_1)]
12
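As a sanity check on the table above, here is a minimal NumPy sketch (variable names are illustrative, not from the slides) that recomputes h(x), the residuals, and the norm-2 and norm-1 costs for θ_0 = 0.785, θ_1 = 0.425.

```python
import numpy as np

# Toy dataset from the slide
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 2.0, 1.3, 3.75, 2.25])

# Candidate parameters from the worked example
theta0, theta1 = 0.785, 0.425

h = theta0 + theta1 * x           # hypothesis h(x) = theta_0 + theta_1 * x
residuals = h - y                 # the h(x) - y column of the table

mse = np.mean(residuals ** 2)     # norm-2 cost J(theta)
mae = np.mean(np.abs(residuals))  # norm-1 (absolute error) cost

print(h)          # [1.21  1.635 2.06  2.485 2.91 ]
print(residuals)  # [ 0.21  -0.365  0.76  -1.265  0.66 ]
print(mse, mae)
```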
Iterative way to learn 𝜽
Make an initial guess for 𝜽
Calculate the error J(θ)
Repeat until error is small enough:
  make a better guess for θ
  calculate the error J(θ)
Return θ
Technique: Gradient Descent
13
Gradient Descent (one feature)
Objective is to minimize cost function:
  J(θ_0, θ_1) = (1/M) Σ_{i=1}^{M} (h_θ(x_i) − y_i)²

[figure: cost surface with a random initial guess of (θ_0, θ_1)]
14
[figure: cost surface; a random guess of (θ_0, θ_1), where the gradient is the slope of the tangent at (θ_0, θ_1)]
Move in the opposite direction of the gradient (steepest descent)
15
[figure: 1-D cost curve J(θ) with its tangent line at θ; the update step is −αΔJ(θ), where α is the learning rate]
Move in the opposite direction of the gradient (steepest descent).
Step size is determined by the gradient magnitude ΔJ(θ):
larger step size for a steeper tangent line.
Small step when close to the minimum, where ΔJ(θ) is small
as the tangent line is almost horizontal.
16
[figure: repeated gradient steps on the cost surface, starting from a random guess of (θ_0, θ_1); the gradient is the slope of the tangent at (θ_0, θ_1)]
Make steps repeatedly.
Move in the opposite direction of the gradient (steepest descent)
17
Gradient Descent (one feature)
Make steps of size α (learning rate) down the cost
function J in the direction of steepest descent
(as determined by the slope of the tangent at (θ_0, θ_1))

Repeat until error is small enough:
  θ_0 ← θ_0 − α ∂J(θ_0, θ_1)/∂θ_0
  θ_1 ← θ_1 − α ∂J(θ_0, θ_1)/∂θ_1
18
Cost Function for Gradient Descent
(one variable)
Mean-square error:
  J(θ_0, θ_1) = (1/M) Σ_{i=1}^{M} (h_θ(x_i) − y_i)²

Half mean-square error (the factor ½ cancels the 2 that appears when differentiating):
  J(θ_0, θ_1) = (1/(2M)) Σ_{i=1}^{M} (h_θ(x_i) − y_i)²
19
Taking derivatives (one variable)
J(θ_0, θ_1) = (1/(2M)) Σ_{i=1}^{M} (h_θ(x_i) − y_i)²   and   h_θ(x_i) = θ_0 + θ_1 x_i

∂J(θ_0, θ_1)/∂θ_0 = (1/M) Σ_{i=1}^{M} (h_θ(x_i) − y_i)

∂J(θ_0, θ_1)/∂θ_1 = (1/M) Σ_{i=1}^{M} (h_θ(x_i) − y_i) x_i

Repeat until error is small enough:
  θ_0 ← θ_0 − α ∂J(θ_0, θ_1)/∂θ_0 = θ_0 − α (1/M) Σ_{i=1}^{M} (h_θ(x_i) − y_i)
  θ_1 ← θ_1 − α ∂J(θ_0, θ_1)/∂θ_1 = θ_1 − α (1/M) Σ_{i=1}^{M} (h_θ(x_i) − y_i) x_i
20
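Putting the two update rules into code: below is a minimal, illustrative gradient-descent loop for the one-feature case. The function name, initial guess, stopping tolerance and learning rate are my own choices, not from the slides; the stopping rule (change in J(θ) below a threshold) anticipates a later slide.

```python
import numpy as np

def gradient_descent_1d(x, y, alpha=0.01, max_iters=10_000, tol=1e-9):
    """Fit h(x) = theta0 + theta1*x by gradient descent on the half mean-square error."""
    M = len(x)
    theta0, theta1 = 0.0, 0.0                 # initial guess for theta
    prev_cost = np.inf
    for _ in range(max_iters):
        h = theta0 + theta1 * x               # current predictions h_theta(x_i)
        error = h - y                         # h_theta(x_i) - y_i
        cost = np.sum(error ** 2) / (2 * M)
        if abs(prev_cost - cost) < tol:       # stop when the error barely changes
            break
        prev_cost = cost
        # simultaneous update of both parameters
        theta0 -= alpha * np.sum(error) / M
        theta1 -= alpha * np.sum(error * x) / M
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 2.0, 1.3, 3.75, 2.25])
print(gradient_descent_1d(x, y))              # roughly (0.785, 0.425), the line used earlier
```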
Problem: Learning Rate
Small learning rate:
  - Many iterations till convergence
  - Trapped in local minimum
Large learning rate:
  - Overshooting
  - No convergence
21
Learning Rate and No. of Iterations
• Plot cost J(θ) (y-axis) against no. of iterations (x-axis)
• If J(θ) increases, then decrease the learning rate α
• Stop when ΔJ(θ) is smaller than a chosen threshold
22
Gradient Descent (multiple features)
Repeat until error is small enough:
  θ_0 ← θ_0 − α (1/M) Σ_{i=1}^{M} (h_θ(x_i) − y_i) x_{i,0}
  θ_1 ← θ_1 − α (1/M) Σ_{i=1}^{M} (h_θ(x_i) − y_i) x_{i,1}
  θ_2 ← θ_2 − α (1/M) Σ_{i=1}^{M} (h_θ(x_i) − y_i) x_{i,2}
  …
  θ_N ← θ_N − α (1/M) Σ_{i=1}^{M} (h_θ(x_i) − y_i) x_{i,N}

⇒ Repeat until error is small enough:
  θ_j ← θ_j − α (1/M) Σ_{i=1}^{M} (h_θ(x_i) − y_i) x_{i,j}   for j = 0, …, N
23
Multiple feature example
      1st feature    2nd feature      3rd feature       4th feature            label
      Size (sq.ft.)  No. of Bedrooms  No. of Bathrooms  Age of Building (yrs)  Rent per Month ($)
x1    1700           4                3                 10                     70k
x2    1420           3                2                 12                     54k
x3    1290           4                1.5               8                      45k
x4    880            2                2                 2                      40k
x5    510            2                2                 3                      26.5k

x_{i,j} = j-th feature of the i-th example, with x_{i,0} = 1
y_i = label associated with the i-th example
x_1 = (x_{1,0}, x_{1,1}, x_{1,2}, x_{1,3}, x_{1,4}) = (1, 1700, 4, 3, 10)
24
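To use these updates in code, the examples are usually stacked into a matrix with the constant feature x_{i,0} = 1 prepended. A small sketch with the rent data (the array names are my own):

```python
import numpy as np

# One row per flat; columns are the 4 features from the table
features = np.array([
    [1700, 4, 3.0, 10],   # x1
    [1420, 3, 2.0, 12],   # x2
    [1290, 4, 1.5,  8],   # x3
    [ 880, 2, 2.0,  2],   # x4
    [ 510, 2, 2.0,  3],   # x5
])
y = np.array([70_000, 54_000, 45_000, 40_000, 26_500])   # rent per month ($)

# Prepend the constant feature x_{i,0} = 1 so that theta_0 acts as the intercept
X = np.hstack([np.ones((features.shape[0], 1)), features])

print(X[0])   # the vector x_1 = (1, 1700, 4, 3, 10) from the slide
```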
Feature Scaling
• Note that the value ranges of different features may
be very different, e.g., size of flat is in hundreds or
thousands sq. ft., age of flat is in tens.
• As the same α is used across all N features (variables), we would like the
  input values to be roughly in the same range
  ⇒ feature scaling or mean normalization:
    x_{i,j} ← (x_{i,j} − mean_j) / (max_j − min_j)
25
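A minimal sketch of this normalization, assuming the design matrix X from the previous sketch (its leading column of ones is left untouched):

```python
import numpy as np

def mean_normalize(X):
    """x <- (x - mean) / (max - min), applied column-wise to every feature
    except the constant x_{i,0} = 1 column."""
    X = X.astype(float).copy()
    cols = X[:, 1:]
    col_mean = cols.mean(axis=0)
    col_range = cols.max(axis=0) - cols.min(axis=0)
    X[:, 1:] = (cols - col_mean) / col_range
    return X, col_mean, col_range

X_scaled, mu, rng = mean_normalize(X)    # X from the previous sketch
print(X_scaled[:, 1:].mean(axis=0))      # scaled feature columns now have mean ~0
```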
Gradient Descent (multiple variables)
Make steps of size α (learning rate) down the cost
function J in the direction of steepest descent
(as determined by the slope of the tangent at (θ_0, θ_1, …, θ_N))

Repeat until error is small enough:
  θ_j ← θ_j − α (1/M) Σ_{i=1}^{M} (h_θ(x_i) − y_i) x_{i,j}   for j = 0, …, N

  [or θ ← θ − (α/M) Xᵀ(Xθ − y) in vectorized form]
26
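The vectorized update translates almost line-for-line into NumPy. This sketch assumes the scaled design matrix X_scaled and labels y from the previous sketches; the learning rate and iteration count are arbitrary choices.

```python
import numpy as np

def linear_regression_gd(X, y, alpha=0.1, num_iters=5000):
    """Vectorized gradient descent: theta <- theta - (alpha/M) * X^T (X theta - y)."""
    M, num_params = X.shape
    theta = np.zeros(num_params)
    for _ in range(num_iters):
        gradient = X.T @ (X @ theta - y) / M
        theta -= alpha * gradient
    return theta

theta = linear_regression_gd(X_scaled, y)   # X_scaled, y from the previous sketches
predictions = X_scaled @ theta              # h_theta(x_i) for every training example
```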
Outline
• Supervised learning, classification &
regression problems, linear models
• Linear regression and gradient descent
• Logistic regression and classification
27
Logistic Regression
Predict the probability of an event occurring
(e.g. prediction of a heart attack)
Prediction: h(x) = σ(s) = σ(θx)
h : ℝ^N → [0, 1], interpreted as a probability
θx gives a sort of "risk score" that gets passed
through the sigmoid function (a.k.a. logistic
function) σ in order to determine the probability of
the event (e.g. heart attack)
28
Logistic Function (sigmoid)
σ(s) = e^s / (1 + e^s) = 1 / (1 + e^{−s})

[figure: sigmoid curve; σ(s) → 1 as s → ∞, σ(s) → 0 as s → −∞, and σ(0) = 0.5]
29
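A small sketch of the logistic function. The two algebraically equivalent forms above are used on different branches so that exp() never overflows; this numerical detail is my addition, not something on the slide.

```python
import numpy as np

def sigmoid(s):
    """sigma(s) = e^s / (1 + e^s) = 1 / (1 + e^{-s})."""
    s = np.asarray(s, dtype=float)
    out = np.empty_like(s)
    pos = s >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-s[pos]))      # safe for large positive s
    exp_s = np.exp(s[~pos])
    out[~pos] = exp_s / (1.0 + exp_s)             # safe for large negative s
    return out

print(sigmoid(np.array([-5.0, 0.0, 5.0])))        # [0.0067 0.5    0.9933]
```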
Logistic Regression and classification
σ(s) = e^s / (1 + e^s) = 1 / (1 + e^{−s})

For classification:
  If the output > 0.5 (50% probability), predict y = 1 (positive class).
  If the output < 0.5, predict y = 0 (negative class).
30
Logistic Regression Loss Function
  y = 1 and h(x) ≈ 1: loss is 0        y = 1 and h(x) ≈ 0: loss is very high
  y = 0 and h(x) ≈ 0: loss is 0        y = 0 and h(x) ≈ 1: loss is very high
(cross entropy)

• To predict positive class (y = 1), loss = − log(h(x))
• To predict negative class (y = 0), loss = − log(1 − h(x))

[figure: loss curves against h(x): − log(h(x)) for the positive class (y = 1) and − log(1 − h(x)) for the negative class (y = 0)]
Logistic Regression Loss Function
J(h_θ(x), y) = − log(h_θ(x))       if y = 1
               − log(1 − h_θ(x))   if y = 0

J(h_θ(x), y) = −y log(h_θ(x)) − (1 − y) log(1 − h_θ(x))
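The single-formula version of the loss maps directly to code. A minimal sketch (the eps clamp is my addition, to keep log() finite when h is exactly 0 or 1):

```python
import numpy as np

def logistic_loss(h, y, eps=1e-12):
    """Per-example loss: -y*log(h) - (1-y)*log(1-h)."""
    h = np.clip(h, eps, 1 - eps)
    return -(y * np.log(h) + (1 - y) * np.log(1 - h))

print(logistic_loss(np.array([0.9, 0.1]), np.array([1, 0])))   # confident and correct: small losses
print(logistic_loss(np.array([0.1, 0.9]), np.array([1, 0])))   # confident and wrong: large losses
```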
Gradient Descent (Logistic Regression)
J(θ) = −(1/M) Σ_{i=1}^{M} [ y_i log(h(x_i)) + (1 − y_i) log(1 − h(x_i)) ]

∂J(θ)/∂θ_j = −(1/M) Σ_{i=1}^{M} ∂/∂θ_j [ y_i log(h(x_i)) + (1 − y_i) log(1 − h(x_i)) ]
           = −(1/M) Σ_{i=1}^{M} ∂J_i(θ)/∂θ_j,
             where J_i(θ) = y_i log(h(x_i)) + (1 − y_i) log(1 − h(x_i))

where h(x_i) = σ(s_i);  σ(s_i) = 1 / (1 + e^{−s_i});
s_i = x_i θ = θ_0 x_{i,0} + … + θ_j x_{i,j} + … + θ_N x_{i,N};  N = dimension

∂J_i(θ)/∂θ_j = (∂J_i/∂σ(x_i)) · (∂σ(x_i)/∂s_i) · (∂s_i/∂θ_j)    (chain rule)

∂J_i/∂σ(x_i) = y_i/σ(x_i) − (1 − y_i)/(1 − σ(x_i))        [ using d(log q)/dq = 1/q ]
             = ( y_i − σ(x_i) ) / ( σ(x_i)(1 − σ(x_i)) )
33
Gradient Descent (Logistic Regression)
J_i(θ) = y_i log(h(x_i)) + (1 − y_i) log(1 − h(x_i))
where h(x_i) = σ(s_i),  σ(s_i) = 1 / (1 + e^{−s_i})
and s_i = x_i θ = θ_0 x_{i,0} + … + θ_j x_{i,j} + … + θ_N x_{i,N}

∂J_i(θ)/∂θ_j = (∂J_i/∂σ(x_i)) · (∂σ(x_i)/∂s_i) · (∂s_i/∂θ_j)    (chain rule)

1) ∂J_i/∂σ(x_i) = ( y_i − σ(x_i) ) / ( σ(x_i)(1 − σ(x_i)) )

2) ∂σ(x_i)/∂s_i = e^{−s_i} / (1 + e^{−s_i})²
                = (1 / (1 + e^{−s_i})) · (1 − 1 / (1 + e^{−s_i})) = σ(s_i)(1 − σ(s_i))

3) ∂s_i/∂θ_j = x_{i,j}

Multiplying the three factors, and using h(x_i) = σ(s_i):
  ∂J_i(θ)/∂θ_j = ( y_i − h(x_i) ) x_{i,j}

so ∂J(θ)/∂θ_j = −(1/M) Σ_{i=1}^{M} ( y_i − h(x_i) ) x_{i,j}

Compare Linear Regression:
  J(θ) = (1/(2M)) Σ_{i=1}^{M} (h(x_i) − y_i)²
  ∂J(θ)/∂θ_j = (1/M) Σ_{i=1}^{M} (h(x_i) − y_i) x_{i,j}

Repeat until error is small:
  θ_j ← θ_j − α ∂J(θ)/∂θ_j    for j = 0, …, N

Same equation as Linear Regression.
34
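Since the gradient has the same form as in linear regression, the training loop is nearly identical; only the hypothesis changes to σ(Xθ). A minimal sketch on a made-up, linearly separable toy set (the data, learning rate and iteration count are illustrative choices, not from the slides):

```python
import numpy as np

def logistic_regression_gd(X, y, alpha=0.1, num_iters=5000):
    """Gradient descent for logistic regression:
    theta <- theta - (alpha/M) * X^T (h - y), with h = sigma(X theta)."""
    M, num_params = X.shape
    theta = np.zeros(num_params)
    for _ in range(num_iters):
        h = 1.0 / (1.0 + np.exp(-(X @ theta)))    # sigma(x_i theta) for every example
        theta -= alpha * X.T @ (h - y) / M
    return theta

# Toy data (not from the slides): a bias column plus one feature
X_toy = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y_toy = np.array([0.0, 0.0, 1.0, 1.0])
theta = logistic_regression_gd(X_toy, y_toy)
probs = 1.0 / (1.0 + np.exp(-(X_toy @ theta)))
print(probs > 0.5)                                # [False False  True  True]
```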
Logistic Function and classification
σ(s) = e^s / (1 + e^s) = 1 / (1 + e^{−s})

For classification:
[figure: sigmoid curve with a decision threshold; predict y = 1 above the threshold, predict y = 0 below it]
35
Outcomes of Binary Classification
predict 1 if f(x) > threshold
predict 0 otherwise
Actual 0 Actual 1
Predicted 0 true negative false negative
Predicted 1 false positive true positive
• True positives:
data points predicted as positive that are actually positive
• False positives:
data points predicted as positive that are actually negative
• True negatives:
data points predicted as negative that are actually negative
• False negatives:
data points predicted as negative that are actually positive
Accuracy of Predictions
predict 1 if f(x) > threshold
predict 0 otherwise
Actual 0 Actual 1
Predicted 0 true negative false negative
Predicted 1 false positive true positive
precision means how "accurate" the answer is:
  precision = true positives / predicted positives
            = true positives / (false positives + true positives)

recall means how "good" the answer is, i.e., the ability to find the correct ones:
  recall = true positives / actual positives
         = true positives / (false negatives + true positives)
37
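A minimal sketch that counts the outcomes and computes the two ratios; the labels and predictions below are made up purely to exercise the formulas.

```python
import numpy as np

def precision_recall(y_true, y_pred):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall

y_true = [1, 1, 1, 0, 0, 0, 1, 0]   # hypothetical actual labels
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]   # hypothetical predictions
print(precision_recall(y_true, y_pred))   # (0.75, 0.75)
```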
Accuracy of Predictions
predict 1 if f(x) > threshold
predict 0 otherwise
Actual 0 Actual 1
Predicted 0 true negative false negative
Predicted 1 false positive true positive
precision = true positives / predicted positives
          = true positives / (false positives + true positives)

↑ threshold ⇒ ↑ precision
↓ threshold ⇒ ↓ precision
38
Accuracy of Predictions
predict 1 if f(x) > threshold
predict 0 otherwise
Actual 0 Actual 1
Predicted 0 true negative false negative
Predicted 1 false positive true positive
recall = true positives / actual positives
       = true positives / (false negatives + true positives)

↑ threshold ⇒ ↑ precision, ↓ recall
↓ threshold ⇒ ↓ precision, ↑ recall
39
Combining Precision and Recall
• Ideally both precision and recall are 1
• With different thresholds, we can trade higher
  precision for higher recall, or vice versa
• Depending on applications
– Disease screening - higher recall
– Prosecution of criminals – higher precision
– Identify terrorists - both
  F1 = 2 · (precision · recall) / (precision + recall)
• Harmonic mean instead of simple average to
penalize extreme values
40
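A tiny sketch of F1, compared against the simple average, to show why the harmonic mean penalizes an extreme precision/recall imbalance (the example numbers are invented):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Extreme imbalance: perfect precision but almost no recall
print(f1_score(1.0, 0.01))     # ~0.0198, dominated by the weaker score
print((1.0 + 0.01) / 2)        # 0.505, the simple average hides the weakness
```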
Logistic Regression, diagrammatically
Prediction: h(x) = σ(s) = σ(θx)

[diagram: inputs 1, x_1, …, x_N are weighted by θ_0, θ_1, …, θ_N and summed (Σ) to give s,
which is passed through σ to produce the output h(x)]
41
Multi-classification example
MNIST (Mixed National Institute of Standards and Technology) database
42
Multi-Classification
[diagram: K raw scores are passed through Softmax to give Pr(Class 1), Pr(Class 2), …, Pr(Class K)]

Pr(Class k) = probability of Class k being the correct class
The class with the highest probability is the predicted class

Softmax function: not only normalizes a set of scores to
numbers in [0, 1] but also makes sure that the
numbers all add up to 1
Sigmoid vs. Softmax
For sigmoid (2-class):
  Class 1 probability: σ(s) = 1 / (1 + e^{−s})
  Class 2 probability: 1 − σ(s) = e^{−s} / (1 + e^{−s}) = 1 / (1 + e^{s})

For softmax (K-class):
  Given scores s_1, s_2, …, s_K
  Class k probability: e^{s_k} / Σ_j e^{s_j}

For softmax (2-class):
  Class 1 probability: e^{s_1} / (e^{s_1} + e^{s_2}) = 1 / (1 + e^{s_2 − s_1})
  Class 2 probability: e^{s_2} / (e^{s_1} + e^{s_2}) = 1 / (1 + e^{s_1 − s_2})
44
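A quick numerical check of the 2-class equivalence above (the max-subtraction inside softmax is a standard stability trick, my addition rather than something on the slide):

```python
import numpy as np

def softmax(scores):
    z = np.asarray(scores, dtype=float)
    e = np.exp(z - z.max())          # subtracting the max does not change the result
    return e / e.sum()

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

s1, s2 = 2.0, 0.5                    # arbitrary pair of class scores
print(softmax([s1, s2])[0])          # class-1 probability: 1 / (1 + e^{s2 - s1})
print(sigmoid(s1 - s2))              # same value via the sigmoid
```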
• Question: Why do we have to pass each value
through an exponential before normalizing them?
Why can’t we just normalize the values themselves?
• Answer: This is because the goal of softmax is to
make sure one value is very high (close to 1) and
all other values are very low (close to 0).
• Softmax uses exponential to make sure this happens.
45
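To see the effect described above, compare softmax with plain normalization on a made-up score vector (the numbers are illustrative only):

```python
import numpy as np

def softmax(scores):
    e = np.exp(np.asarray(scores, dtype=float))
    return e / e.sum()

scores = np.array([1.0, 5.0, 2.0])       # hypothetical raw class scores

print(scores / scores.sum())             # plain normalization: [0.125 0.625 0.25 ]
print(softmax(scores))                   # softmax: ~[0.017 0.936 0.047]
# The exponential exaggerates the gaps, pushing the winner toward 1 and the rest toward 0.
# Plain normalization also breaks down when scores are negative or sum to zero.
```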
Softmax
[figure: example scores s (0.3 and 2.1) are passed through Softmax to give probabilities h(x)]
Loss function for multi-classification
Categorical Cross Entropy loss:
  Loss(h(x), y) = − Σ_k y_k log(h(x)_k)

[figure: loss = − log(h(x)) curve for the positive class (y = 1), plotted against h(x)]

Ex:  Predicted h(x) = [0.94, 0.01, 0.05],  label y = [0, 1, 0]
Loss = −[0 · log(0.94) + 1 · log(0.01) + 0 · log(0.05)] = − log(0.01)
46
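A minimal sketch that reproduces the worked example (the function name is my own):

```python
import numpy as np

def categorical_cross_entropy(h, y):
    """Loss = -sum_k y_k * log(h_k) for predicted probabilities h and a one-hot label y."""
    h, y = np.asarray(h, dtype=float), np.asarray(y, dtype=float)
    return -np.sum(y * np.log(h))

h = [0.94, 0.01, 0.05]   # predicted h(x) from the slide
y = [0, 1, 0]            # one-hot label: the second class is correct
print(categorical_cross_entropy(h, y))   # 4.605... = -log(0.01)
```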
Another Loss Function: Hinge Loss
• Probability for the correct class (p_{i,c_i}) should be
  > the other probabilities by at least margin Δ
• For instance i: L_i = Σ_{j≠c_i} max(0, p_{i,j} − p_{i,c_i} + Δ)
• Cost function = Σ_{i=1}^{M} L_i

Ex:  Predicted h(x) = [0.25, 0.35, 0.4],  label y = [0, 0, 1]
With Δ = 0.1:
  L_i = max(0, 0.25 − 0.4 + Δ) + max(0, 0.35 − 0.4 + Δ)
      = 0 + 0.05 = 0.05
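A minimal sketch of the per-instance hinge loss, reproducing the example above (the function name and argument order are my own choices):

```python
import numpy as np

def hinge_loss(p, correct_class, delta=0.1):
    """L_i = sum over j != c_i of max(0, p_j - p_{c_i} + delta)."""
    p = np.asarray(p, dtype=float)
    margins = np.maximum(0.0, p - p[correct_class] + delta)
    margins[correct_class] = 0.0       # the j = c_i term is excluded from the sum
    return margins.sum()

print(hinge_loss([0.25, 0.35, 0.4], correct_class=2, delta=0.1))   # 0.05 (up to rounding)
```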