
18.4 Evaluating and Choosing the Best Hypothesis

• We assume that there is a probability distribution over examples that remains stationary over time
• Each observed example is sampled from that distribution and is independent of the previous examples
• Each example has an identical prior probability distribution
• Examples that satisfy these assumptions are called independent and identically distributed (i.i.d.)
• The error rate of a hypothesis h is the proportion of mistakes it makes, i.e., the proportion of times that h(x) ≠ y for an (x, y) example
• A low error rate on the training set does not by itself mean that a hypothesis h will generalize well


Model selection: Complexity vs. goodness of fit

• We can think of finding the best hypothesis as two tasks:
• Model selection defines the hypothesis space, and
• Optimization finds the best hypothesis within that space
• How do we select among models that are parameterized by size?
• With polynomials we have size = 1 for linear functions, size = 2 for quadratics, and so on
• For decision trees, the size could be the number of nodes in the tree
• We want to find the value of the size parameter that best balances underfitting and overfitting to give the best test-set accuracy


• A wrapper takes a learning algorithm as an argument (decision-tree learning, for example)
• The wrapper enumerates models according to the size parameter
• For each size, it uses cross-validation (say) on the learner to compute the average error rate on the training and test sets
• We start with the smallest, simplest models (which probably underfit the data) and iterate, considering more complex models at each step, until the models start to overfit
• Cross-validation picks the value of size with the lowest validation-set error
• We then generate a hypothesis of that size using all the data (without holding out any of it; eventually we should evaluate the returned hypothesis on a separate test set); a sketch of such a wrapper is given below
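As a concrete illustration, here is a minimal sketch of such a wrapper in Python. The helper names (cross_validation_wrapper, cv_error), the choice of k = 5 folds, and trying every size up to a fixed maximum (rather than stopping as soon as overfitting sets in) are my own simplifying assumptions, not part of the slides; learner(size, examples) is assumed to return a hypothesis h(x).

```python
import random

def cross_validation_wrapper(learner, examples, k=5, max_size=20):
    """Pick the model size with the lowest average validation error,
    then retrain on all the data with that size (a sketch)."""
    def cv_error(size):
        shuffled = examples[:]
        random.shuffle(shuffled)
        folds = [shuffled[i::k] for i in range(k)]
        fold_errors = []
        for i in range(k):
            validation = folds[i]
            training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
            h = learner(size, training)
            fold_errors.append(sum(h(x) != y for x, y in validation) / len(validation))
        return sum(fold_errors) / k

    best_size = min(range(1, max_size + 1), key=cv_error)
    # Finally, train on all the data using the chosen size.
    return learner(best_size, examples)
```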


From error rates to loss

• Consider the problem of classifying emails as spam or non-spam
• It is worse to classify non-spam as spam than to classify spam as non-spam
• So a classifier with a 1% error rate, where almost all the errors were classifying spam as non-spam, would be better than a classifier with only a 0.5% error rate, if most of those errors were classifying non-spam as spam
• Utility is what learners – like decision makers – should maximize
• In machine learning it is traditional to express utilities by means of loss functions
• The loss function L(x, y, ŷ) is defined as the amount of utility lost by predicting h(x) = ŷ when the correct answer is f(x) = y:
  L(x, y, ŷ) = U(result of using y given an input x) – U(result of using ŷ given an input x)


• Often a simplified version of the loss function is used: it is 10 times worse to classify non-spam as spam than vice versa:
  L(spam, nonspam) = 1, L(nonspam, spam) = 10
• Note that L(y, y) is always zero
• In general, for real-valued data small errors are better than large ones
• Two functions that implement that idea are the absolute value of the difference (called the L1 loss) and the square of the difference (called the L2 loss)
• Minimizing error rate is formulated in the L0/1 loss function

  Absolute value loss: L1(y, ŷ) = |y – ŷ|
  Squared error loss: L2(y, ŷ) = (y – ŷ)²
  0/1 loss: L0/1(y, ŷ) = 0 if y = ŷ, else 1
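These three loss functions translate directly into code. The sketch below is a plain rendering of the definitions above; the function names are my own.

```python
def l1_loss(y, y_hat):
    # Absolute value loss: |y - ŷ|
    return abs(y - y_hat)

def l2_loss(y, y_hat):
    # Squared error loss: (y - ŷ)²
    return (y - y_hat) ** 2

def l01_loss(y, y_hat):
    # 0/1 loss: 0 if the prediction is exactly right, else 1
    return 0 if y == y_hat else 1
```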


• Let P(X, Y) be a prior probability distribution over examples
• Let E be the set of all possible input–output examples
• Then the expected generalization loss for a hypothesis h (w.r.t. loss function L) is
  GenLossL(h) = Σ(x,y)∈E L(y, h(x)) P(x, y)
• The best hypothesis h* is the one with the minimum expected generalization loss:
  h* = argminh∈H GenLossL(h)
• Because P(x, y) is not known, the learning agent can only estimate the generalization loss with the empirical loss on a set E of N examples:
  EmpLossL,E(h) = (1/N) Σ(x,y)∈E L(y, h(x))
• The estimated best hypothesis ĥ* is then the one with minimum empirical loss:
  ĥ* = argminh∈H EmpLossL,E(h)
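Since P(x, y) is unknown, what we actually compute is the empirical loss. A minimal sketch, assuming a list of (x, y) pairs and the loss functions defined above:

```python
def empirical_loss(loss, h, examples):
    """Average loss of hypothesis h over a finite set of (x, y) examples."""
    return sum(loss(y, h(x)) for x, y in examples) / len(examples)

def estimated_best_hypothesis(loss, hypotheses, examples):
    """The estimate ĥ*: the candidate hypothesis with minimum empirical loss."""
    return min(hypotheses, key=lambda h: empirical_loss(loss, h, examples))
```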


Regularization

• Earlier on we did model selection with cross-validation on model size
• An alternative approach is to search for a hypothesis that directly minimizes the weighted sum of empirical loss and the complexity of the hypothesis, which we call the total cost:
  Cost(h) = EmpLoss(h) + λ Complexity(h)
  ĥ* = argminh∈H Cost(h)
• Here λ is a parameter, a positive number that serves as a conversion rate between loss and hypothesis complexity
• We still need to do a cross-validation search to find the hypothesis that generalizes best, but this time with different values of λ
• This process of explicitly penalizing complex hypotheses is called regularization (a sketch is given below)
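A minimal sketch of this idea, reusing the empirical_loss helper above; the complexity measure (e.g., the number of parameters of h) and the candidate λ values are illustrative assumptions.

```python
def total_cost(h, examples, loss, complexity, lam):
    # Cost(h) = EmpLoss(h) + λ · Complexity(h)
    return empirical_loss(loss, h, examples) + lam * complexity(h)

def regularized_selection(hypotheses, train, validation, loss, complexity,
                          lambdas=(0.01, 0.1, 1.0, 10.0)):
    """For each candidate λ, pick the hypothesis minimizing total cost on the
    training data; keep the λ whose winner does best on held-out data."""
    best_h, best_val = None, float("inf")
    for lam in lambdas:
        h = min(hypotheses,
                key=lambda cand: total_cost(cand, train, loss, complexity, lam))
        val_error = empirical_loss(loss, h, validation)
        if val_error < best_val:
            best_h, best_val = h, val_error
    return best_h
```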


18.5 The Theory of Learning

• Computational Learning Theory (COLT) [Valiant 1984]
• Valiant received the ACM Turing Award in 2010
• Any hypothesis that is seriously wrong will almost certainly be "found out" with high probability after a small number of examples, because it will make an incorrect prediction
• Thus, any hypothesis that is consistent with a sufficiently large set of training examples is unlikely to be seriously wrong: that is, it must be probably approximately correct (PAC)
• PAC-learning is based on the stationarity assumption: future examples are going to be drawn from the same fixed distribution P(E) = P(X, Y) as past examples
• We do not necessarily know the distribution
• Without this connection between the past and the future, learning would be practically impossible


• Let X be the instance space from which the examples are drawn with a stationary distribution
• H is the hypothesis class and N denotes the number of training examples
• We assume that the true function f is a member of H
• The error rate of a hypothesis h is defined as the expected generalization error for examples drawn from the stationary distribution:
  error(h) = GenLossL0/1(h) = Σ(x,y) L0/1(y, h(x)) P(x, y)
• A hypothesis h is called approximately correct if error(h) ≤ ε, where ε is a small constant


• Let us examine a hypothesis hb, which is not an ε-approximation of the true function
• Then error(hb) > ε and the probability that it agrees with a given example is at most 1 – ε
• Since the examples are independent, the bound for N examples is
  P(hb agrees with N examples) ≤ (1 – ε)^N
• The probability that the hypothesis class H contains at least one such hypothesis is at most |H|(1 – ε)^N
• We would like to reduce the probability of this event below some small number δ:
  |H|(1 – ε)^N ≤ δ


• Then
  N ≥ (1 / ln(1/(1 – ε))) (ln|H| + ln(1/δ))
• Because ln(1 + x) ≤ x, and hence ln(1/(1 – ε)) ≥ ε, we can achieve this if we allow the algorithm to see
  N ≥ (1/ε)(ln|H| + ln(1/δ))
  examples
• Thus, if a learning algorithm returns a hypothesis that is consistent with this many examples, then with probability at least 1 – δ, it has error at most ε
• In other words, it is probably approximately correct
• The number of required examples, as a function of ε and δ, is called the sample complexity of the hypothesis space
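As a worked illustration (the numbers are my own, not from the slides), this bound can be evaluated directly; for a hypothesis space of |H| = 2^20 hypotheses with ε = 0.05 and δ = 0.01:

```python
import math

def sample_complexity(h_size, epsilon, delta):
    # N ≥ (1/ε)(ln|H| + ln(1/δ))
    return math.ceil((1 / epsilon) * (math.log(h_size) + math.log(1 / delta)))

# With |H| = 2**20, ε = 0.05 and δ = 0.01, roughly 370 examples
# suffice under this bound.
print(sample_complexity(2 ** 20, 0.05, 0.01))
```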


Learning decision lists

• A decision list consists of a series of tests, each of which is a conjunction of literals
• The tests (rules) are attempted in order
• If a test succeeds, the decision list specifies the value to be returned
• If a test fails, processing continues with the next test in the list

[Figure: a decision list for the restaurant domain: if Patrons = Some then Yes; else if Patrons = Full ∧ Fri/Sat then Yes; else No]
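To make the ordered-test semantics concrete, here is a minimal sketch of how such a decision list could be represented and evaluated; the (test, outcome) representation and the example attributes are my own illustration.

```python
# A decision list is an ordered sequence of (test, outcome) pairs plus a default.
# Each test stands for a conjunction of literals, rendered here as a predicate.
restaurant_dl = [
    (lambda ex: ex["Patrons"] == "Some", "Yes"),
    (lambda ex: ex["Patrons"] == "Full" and ex["Fri/Sat"], "Yes"),
]

def evaluate(decision_list, example, default="No"):
    """Try the tests in order; return the outcome of the first test that succeeds."""
    for test, outcome in decision_list:
        if test(example):
            return outcome
    return default

print(evaluate(restaurant_dl, {"Patrons": "Full", "Fri/Sat": True}))  # -> Yes
```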

• Decision lists resemble decision trees, but their overall structure is simpler: they branch only in one direction
• In contrast, the individual tests are more complex
• Decision lists generalize decision trees
• If we allow tests of arbitrary size, then decision lists can represent any Boolean function
• On the other hand, if we restrict the size of each test to at most k literals (k-DL), then it is possible for the learning algorithm to generalize successfully from a small number of examples
• The language k-DL includes as a subset the language k-DT, the set of all decision trees of depth at most k
• Let k-DL(n) denote a k-DL language using n Boolean attributes


• Let Conj(n, k) denote the conjunctions of at most k literals using n attributes
• The number of conjunctions of at most k literals from n attributes is given by
  |Conj(n, k)| = Σi=0..k C(2n, i) = O(n^k)
• In a decision list, each conjunction can be attached to either a Yes or a No outcome, or can be absent from the decision list
• Hence, there are at most 3^|Conj(n,k)| distinct sets of component tests
• Each of these sets of tests can be in any order, so
  |k-DL(n)| ≤ 3^|Conj(n,k)| · |Conj(n,k)|!
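As a quick numerical check of this count (the numbers are my own illustration), with n = 10 Boolean attributes and tests of at most k = 2 literals the number of candidate conjunctions stays small:

```python
import math

def conj_count(n, k):
    # |Conj(n, k)| = Σ over i = 0..k of C(2n, i): choose up to k literals
    # out of the 2n possible literals (each attribute or its negation).
    return sum(math.comb(2 * n, i) for i in range(k + 1))

# With n = 10 and k = 2: 1 + 20 + 190 = 211 candidate conjunctions.
print(conj_count(10, 2))
```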


• Combining the above estimates yields
  |k-DL(n)| = 2^O(n^k log2(n^k))
• We plug this into the formula for sample complexity to show that the number of examples needed for PAC-learning is
  N ≥ (1/ε)(O(n^k log2(n^k)) + ln(1/δ))
• Hence, a k-DL function is PAC-learnable from a number of examples polynomial in n
• Therefore, any algorithm that returns a consistent decision list will PAC-learn a k-DL function in a reasonable number of examples, for small k
• The following greedy algorithm will output a consistent decision list (if one exists)


Algorithm DL-learner(S)
Input: S, a set of training examples
Output: a decision list, or failure
1. if S = Ø then return the trivial decision list No;
2. choose a test t that matches a subset of examples St ≠ Ø such that the members of St are either all positive or all negative;
3. if there is no such t then fail;
4. if all examples in St are positive
   then o ← Yes
   else o ← No;
5. L ← DL-learner(S \ St);
6. return a decision list with initial test t and outcome o, and remaining tests given by L;
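The pseudocode above translates fairly directly into Python. The sketch below is my own rendering under simplifying assumptions: examples are (attribute-dict, label) pairs, and candidate tests are single-attribute equality checks rather than arbitrary k-literal conjunctions.

```python
def dl_learner(examples):
    """Greedy decision-list learner: returns a list of ((attr, val), outcome)
    rules, or None if it fails to build a consistent decision list."""
    if not examples:
        return []  # step 1: the trivial decision list (fall through to "No")
    # Candidate tests: single-attribute equality checks, a simplification of
    # the k-literal conjunctions discussed in the slides.
    first_x, _ = examples[0]
    candidates = [(attr, val) for attr in first_x
                  for val in {x[attr] for x, _ in examples}]
    for attr, val in candidates:
        matched = [(x, y) for x, y in examples if x[attr] == val]
        labels = {y for _, y in matched}
        if matched and len(labels) == 1:            # step 2: matched examples agree
            outcome = labels.pop()                  # step 4: Yes or No
            rest = dl_learner([(x, y) for x, y in examples if x[attr] != val])
            if rest is None:                        # greedy, no backtracking
                return None
            return [((attr, val), outcome)] + rest  # step 6
    return None                                     # step 3: no suitable test exists

# The structure of the recursion mirrors steps 1-6 above; with tests restricted
# to single attributes it may fail where a richer k-DL language would succeed.
```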
