KEMBAR78
ML Lecture Notes Unit-1 | PDF | Support Vector Machine | Machine Learning
0% found this document useful (0 votes)
73 views45 pages

ML Lecture Notes Unit-1

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
73 views45 pages

ML Lecture Notes Unit-1

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 45

Buddha Institute of Technology

GIDA
Gorakhpur

Machine Learning
(LECTURE NOTE)
CSE/IT (VII Sem)

UNIT- I

Topic Covered:
INTRODUCTION – Well defined learning problems, Designing a
Learning System, Issues in Machine Learning; THE CONCEPT
LEARNING TASK - General-to-specific ordering of hypotheses, Find-
S, List then eliminate algorithm, Candidate elimination algorithm,
Inductive bias.
Submitted by:
Ranjeet Singh
(Asstt. Prof)
CSE
Introduction of Machine Learnimg:
The field of machine learning is concerned with the question of how to construct computer
programs that automatically improve with experience.
In recent years many successful machine learning applications have been developed,
ranging from data-mining programs that learn to detect fraudulent credit card
transactions, to information-filtering systems that learn users' reading preferences, to
autonomous vehicles that learn to drive on public highways.

At the same time, there have been important advances in the theory and algorithms that form the
foundations of Machine Learning.

Machine learning draws on concepts and results from many fields, including statistics,
artificial intelligence, philosophy, information theory, biology, cognitive science,
computational complexity, and control theory, to learn about machine learning is to view it
from all of these perspectives and to understand the problem settings, algorithms, and
assumptions that underlie each.

Ever since computers were invented, we have wondered whether they might be made to learn. If
we could understand how to program them to learn-to improve automatically with experience-
the impact would be dramatic. Imagine computers learning from medical records which
treatments are most effective for new diseases, houses learning from experience to optimize
energy costs based on the particular usage patterns of their occupants, or personal software
assistants learning the evolving interests of their users in order to highlight especially relevant
stories from the online morning newspaper.

We do not yet know how to make computers learn nearly as well as people learn. However,
algorithms have been invented that are effective for certain types of learning tasks, and a
theoretical understanding of learning is beginning to emerge. Many practical computer programs
have been developed to exhibit useful types of learning, and significant commercial applications
have begun to appear.
For problems such as speech recognition, algorithms based on machine learning outperform all
other approaches that have been attempted to date. In the field known as data mining, machine
learning algorithms are being used routinely to discover valuable knowledge from large
commercial databases containing equipment maintenance records, loan applications, financial
transactions, medical records, and the like. As our understanding of computers continues to
mature, it seems inevitable that machine learning will play an increasingly central role in
computer science and computer technology.
Key Point of Machine Learning Dataset:
A machine learning dataset is a structured collection of data that are carefully selecting,
preparing, and organizing the data for training, evaluating, and testing machine learning
algorithms and models. These datasets serve as the foundation for teaching machines to learn and
make predictions or decisions based on patterns and information within the data.

Here are some key characteristics and components of a typical machine learning dataset:

1- Data Points or Samples or instances: Each row or observation in the dataset represents
a single data point or sample. These data points can pertain to anything from images and
text documents to numerical measurements and categorical information.
2- Features or Attributes: Columns within the dataset are known as features or attributes.
These features provide the relevant information about each data point that the machine
learning algorithm uses to make predictions. Features can be numerical, categorical, or
even text-based.
3- Target Variable: In supervised learning, a dataset often includes a target variable (or
label) that the machine learning model aims to predict based on the input features. This is
the variable that the model learns to map from the input data.
4- Training, Validation, and Test Sets: A typical machine learning dataset is divided into
at least two or three subsets: a training set, a validation set (optional), and a test set. The
training set is used to train the model, the validation set helps tune model
hyperparameters and assess its performance during training, and the test set is used to
evaluate the model's performance after training.
5- Data Preprocessing: Before using the dataset for training, it often requires preprocessing
steps such as to ensure that the data is suitable for machine learning algorithms.

 Data cleaning: Data cleaning is the process of identifying and correcting errors,
inconsistencies, and inaccuracies in a dataset to ensure its quality and reliability
for analysis or machine learning, often involving tasks like handling missing
values, removing duplicates, and addressing outliers.
 Data normalization: Data normalization is rescaling data to fit within a specific
range (usually between 0 and 1 or -1 and 1) to remove scale differences.
Example: Consider two features in a dataset, "Age" (ranging from 0 to 100) and
"Income" (ranging from $20,000 to $100,000). Normalizing these features would
transform "Age" values to, say, [0.2, 0.7] and "Income" values to [0.2, 1.0],
making them comparable on the same scale for machine learning algorithms.
 Feature scaling: Feature scaling is the process of standardizing the range of
independent variables (features) in a dataset, making them comparable and
preventing certain features from dominating others.
 Handling missing values: Handling missing values involves dealing with data
points where certain values are absent or undefined in a dataset to prevent their
negative impact on analysis or modeling.
Example: In a dataset of patient records, some entries may lack values for
"Cholesterol Level." Handling missing values could involve methods like
imputing the missing cholesterol levels with the mean value of the available data
or using advanced techniques like predictive modeling to estimate the missing
values based on other patient characteristics.
6- Label Encoding or One-Hot Encoding: For categorical features, encoding methods like
label encoding or one-hot encoding are applied to convert categorical data into a
numerical format that can be processed by machine learning algorithms.
7- Dataset Size: The size of a dataset can vary widely, from small datasets with a few
hundred samples to large datasets with millions or even billions of data points. The
dataset size often impacts the complexity and training time of machine learning models.
8- Data Sources: Datasets can be sourced from various places, including research
institutions, government agencies, open data repositories, or generated through
simulations or data collection efforts.
Machine learning datasets play a crucial role in the development and evaluation of machine
learning models. The choice of dataset is critical, as it can significantly impact the model's
performance, generalization, and applicability to real-world tasks. Researchers and practitioners
select or create datasets that align with the specific problem they aim to solve and the type of
machine learning algorithm they plan to use.
What is machine learning (ML)?

• Data is being produced and stored continuously (“big data”):


– science: genomics, astronomy, materials science, particle accelerators. . .
– sensor networks: weather measurements, traffic. . .
– people: social networks, blogs, mobile phones, purchases, bank transactions. . .
– etc.
• Data is not random; it contains structure that can be used to predict outcomes, or gain
knowledge in some way.
Ex: patterns of Amazon purchases can be used to recommend items.
• It is more difficult to design algorithms for such ML tasks Such algorithms need data.
Ex: construct a spam filter, using a collection of email messages labeled as spam/not
spam.
• Data mining: the application of ML methods to large databases.
• Ex of ML applications: fraud detection, medical diagnosis, speech or face recognition. . .
• ML is programming computers using data (past experience) to optimize a performance
criterion.

• ML relies on:
– Statistics: making inferences from sample data.
– Numerical algorithms (linear algebra, optimization): optimize criteria,
manipulate models.
– Computer science: data structures and programs that solve a ML
problem efficiently.
• A model:
– is a compressed version of a database;
– extracts knowledge from it;
– does not have perfect performance but is a useful approximation to the
data.
Definition of ML:
Arthur Samuel described it as: “the field of study that gives computers the ability
to learn without being explicitly programmed.” This is an older, informal
definition.
OR

Tom Mitchell provides a more modern definition: “A computer program is said to


learn from experience E with respect to some class of tasks T and performance
measure P, if its performance at tasks in T, as measured by P, improves with
experience E. ”
A checkers learning problem:
Task T: playing checkers
Performance measure P: percent of games won against opponents
Training experience E: playing practice games against itself

A handwriting recognition learning problem:


Task T: recognizing and classifying handwritten words within images
Performance measure P: percent of words correctly classified

A robot driving learning problem:


Task T: driving on public four-lane highways using vision sensors
Performance measure P: average distance traveled before an error (as judged by human overseer)
Training experience E: a sequence of images and steering commands recorder while observing a
human driver

DATA SET:
Statistical Techniques
The development of today’s AI applications started with using the age-old
traditional statistical techniques. You must have used straight-line
interpolation in schools to predict a future value. There are several other such
statistical techniques which are successfully applied in developing so-called AI
programs. We say “so-called” because the AI programs that we have today are
much more complex and use techniques far beyond the statistical techniques used
by the early AI programs.
Some of the examples of statistical techniques that are used for developing AI
applications in those days and are still in practice are listed here:
 Regression
 Classification
 Clustering
 Probability Theories
 Decision Trees

Types of Machine Learning: Machine Learning is broadly categorized under the


following headings

Machine learning evolved from left to right as shown in the above diagram.
Supervised Learning
Supervised learning is a type of machine learning where an algorithm learns from a labeled dataset,
which means it's provided with input data along with corresponding output labels. The algorithm's goal
is to learn a mapping or relationship between the inputs and the desired outputs so that it can make
predictions or classify new, unseen data points accurately.

Example: Handwritten Digit Recognition

Here's an explanation of supervised learning with an example: Imagine you want to build a system that
can automatically recognize handwritten digits (0 through 9). This is a common task in optical character
recognition (OCR) and is a classic example of supervised learning.

There are mainly two types of Supervised learning, Regression and Classification

1- Regression
Regression algorithms are a class of supervised learning techniques used for predicting continuous values
or numerical outcomes. They establish a relationship between input features and a target variable. Here
are some common types of regression algorithms along with examples:

 Simple Linear Regression: A simple linear regression allows you to model the relationship between
a dependent variable (target) and a independent variables (features) by fitting a linear equation to the
observed data.
Where: 𝑋 is Mean of 𝑋 & Where: 𝑌 is Mean of 𝑌
Here b1 is known as Regression Coefficient
Now, you can calculate the intercept b0 using the mean values:
 Multiple Linear Regression: Multiple linear regression is an extension of simple linear regression
that allows you to model the relationship between a dependent variable (target) and two or more
independent variables (features) by fitting a linear equation to the observed data. Instead of a straight
line, it finds the best-fitting hyperplane in higher-dimensional space.
 Polynomial Regression: Polynomial Regression is a type of regression analysis used when the
relationship between the independent variable (feature) and the dependent variable (target) is
nonlinear. It models this relationship using a polynomial equation rather than a straight line. The
polynomial equation can have various degrees (e.g., quadratic, cubic) to capture different levels of
complexity in the data.

Example: Predicting Ice Cream Sales


Imagine you want to predict the number of ice creams sold (Y) based on the temperature (X) outside. The
relationship might not be linear; people tend to buy more ice cream as it gets warmer, but this relationship
may curve. So, you decide to use polynomial regression.

Let's say you have the following dataset:


 Ridge Regression: Ridge Regression, also known as L2 regularization, is a linear regression
technique used to address the problem of multicollinearity (high correlation among independent
variables) and overfitting in regression models. It adds a penalty term to the linear regression
equation, forcing the model to keep the coefficients of the independent variables relatively small,
thus reducing the complexity of the model.

Example: Predicting a person's income based on various features like education, experience, and
location while avoiding overfitting the model by using ridge regression to balance feature
importance.
 Lasso Regression: Lasso Regression, short for Least Absolute Shrinkage and Selection Operator, is
a linear regression technique used to address multicollinearity and perform feature selection by
adding a penalty term to the linear regression equation. It can be used for feature selection as it
encourages some of the coefficients of independent variables to become exactly zero, effectively
selecting a subset of the most important features while also reducing the complexity of the model.

 Support Vector Regression (SVR): Support Vector Regression (SVR) is a type of regression
algorithm that extends the concepts of Support Vector Machines (SVMs) from classification to
regression problems. It is used for modeling and predicting continuous or numerical values,
making it suitable for tasks where the relationship between input features and the target variable
is non-linear or complex.

In SVR, the goal is to find a hyperplane that best fits the data while minimizing the margin violations.
This hyperplane is often referred to as the "epsilon tube." SVR works by defining an epsilon tube around
the regression line. Data points within this tube are considered to have zero error, while those outside the
tube incur a penalty proportional to the distance they deviate from the tube.

Example: Predicting the price of a stock based on historical data and other market indicators,
using SVR to capture complex non-linear relationships.

Here are some key components and concepts of Support Vector Regression:

Kernel Trick: Similar to SVMs, SVR can use a kernel function to map the input features into a higher-
dimensional space, making it easier to find non-linear relationships in the data.

Epsilon (ϵ): The width of the tube around the regression line. It defines the acceptable margin of error for
the predictions. Data points within this margin do not contribute to the loss function, while those outside
the margin are penalized.

Support Vectors: These are the data points that lie on the margin boundary or within the tube. They are
the most critical data points for defining the SVR model.

Loss Function: The loss function in SVR aims to minimize the deviation of data points from the
regression line while considering the width of the epsilon tube.

Regularization Parameter (C): Similar to SVMs, SVR has a regularization parameter C that controls the
trade-off between maximizing the margin and minimizing the loss. A smaller C allows for a wider margin
but permits more margin violations, while a larger C enforces a narrower margin with fewer violations.

Hyperparameters: SVR has hyperparameters that need to be tuned, such as the choice of kernel function
(e.g., linear, polynomial, radial basis function), kernel-specific parameters (e.g., degree in polynomial
kernel, gamma in RBF kernel), and the value of C.

Advantages of SVR:

 Can capture complex non-linear relationships between features and the target variable.
 Effective in high-dimensional spaces.
 Handles outliers well by limiting their impact through the epsilon tube.
 Provides flexibility through kernel functions.
Disadvantages of SVR:
 Requires careful tuning of hyperparameters.
 Can be computationally expensive, especially with large datasets.
 Interpretability may be limited when using complex kernels.
 Decision Tree Regression: Decision tree regression splits the data into subsets based on the
values of input features and then predicts the average or median of the target variable for each
subset.

Example: Predicting the expected delivery time of a pizza based on factors like distance, traffic,
and weather conditions using a decision tree.

 Random Forest Regression: Random forest regression is an ensemble method that combines
multiple decision trees to improve predictive accuracy and reduce overfitting.

Example: Predicting housing prices by aggregating predictions from multiple decision trees trained
on various subsets of data and features.

The choice of above regression algorithm depends on the nature of the data and the problem you are
trying to solve. Experimentation and model evaluation are essential to determine which regression
technique works best for a specific task.

Applications of Machine Learning:

Type of
Application Type of Algorithm Example
Dataset
Labeled Image Convolutional Neural Handwriting digit
Image Classification
Data Network (CNN) recognition (MNIST)
Naive Bayes, Logistic
Spam Detection Text Data Email spam filtering
Regression
Recurrent Neural
Voice-to-text systems
Speech Recognition Audio Data Network (RNN), Hidden
(Siri, Google Assistant)
Markov Model (HMM)
User-Item Product
Recommendation Collaborative Filtering,
Interaction recommendations
Systems Matrix Factorization
Data (Amazon, Netflix)
Transaction Random Forest, Support Credit card fraud
Fraud Detection
Data Vector Machine (SVM) detection
Analyzing reviews on
Sentiment Analysis Text Data LSTM, Naive Bayes social media
(positive/negative)
Image, Sensor CNN, Reinforcement Self-driving cars (Tesla,
Autonomous Vehicles
Data Learning Waymo)
Time Series Predicting machine
Predictive Maintenance LSTM, ARIMA
Data failures in factories
Medical
Detecting cancer from
Medical Diagnosis Imaging, Random Forest, CNN
MRI scans
Tabular Data
Type of
Application Type of Algorithm Example
Dataset
Financial Time
Market Forecasting LSTM, ARIMA Stock price prediction
Series
Customer service
Chatbots/Conversational
Text Data Transformer (e.g., GPT) chatbots (OpenAI,
AI
Google BERT)
Facebook facial
Face Recognition Image Data CNN, FaceNet recognition, iPhone Face
ID
Labeled Image Detecting objects in
Object Detection CNN (YOLO, SSD)
Data images or videos
Transformer (e.g.,
Language Translation Text Data Google Translate, DeepL
BERT, GPT)
Sensor,
Detecting unusual
Anomaly Detection Transaction Isolation Forest, SVM
patterns in network traffic
Data
Chemical, Random Forest, Neural Predicting the efficacy of
Drug Discovery
Biological Data Networks drug molecules

Components of Learning

Basic components of learning process The learning process, whether by a human or a machine, can be
divided into four components, namely, data storage, abstraction, generalization and evaluation. Figure 1.1
illustrates the various components and the steps involved in the learning process.

Fig-4: Components of Learning process

1. Data storage Facilities for storing and retrieving huge amounts of data are an important component of
the learning process. Humans and computers alike utilize data storage as a foundation for advanced
reasoning. • In a human being, the data is stored in the brain and data is retrieved using electrochemical
signals. • Computers use hard disk drives, flash memory, random access memory and similar devices to
store data and use cables and other technology to retrieve data.
2. Abstraction The second component of the learning process is known as abstraction. Abstraction is the
process of extracting knowledge about stored data. This involves creating general concepts about the
data as a whole. The creation of knowledge involves application of known models and creation of new
models. The process of fitting a model to a dataset is known as training. When the model has been
trained, the data is transformed into an abstract form that summarizes the original information.

3. Generalization The third component of the learning process is known as generalisation. The term
generalization describes the process of turning the knowledge about stored data into a form that can
be utilized for future action. These actions are to be carried out on tasks that are similar, but not
identical, to those what have been seen before. In generalization, the goal is to discover those properties
of the data that will be most relevant to future tasks.

4. Evaluation Evaluation is the last component of the learning process. It is the process of giving
feedback to the user to measure the utility of the learned knowledge. This feedback is then utilised to
effect improvements in the whole learning process

DESIGNING A LEARNING SYSTEM

1- Choosing the Training Experience


 The first design choice we face is to choose the type of training experience from which our
system will learn.
 The type of training experience available can have a significant impact on success or failure of
the learner.
2 Choosing the Target Function
The next design choice is to determine exactly what type of knowledge will be learned and how this will
be used by the performance program.
The program needs only to learn how to choose the best move from among these legal moves.

For checkers-playing problem, given this setting where we must learn to choose among the legal moves,
the most obvious choice for the type of information to be learned is a program, or function, that chooses
the best move for any given board state. Let us call this function ChooseMove and use the notation
ChooseMove : B → M to indicate that this function accepts as input any board from the set of legal
board states B and produces as output some move from the set of legal moves M.

What exactly should be the value of the target function V for any given board state?

3 Choosing a Representation for the Target Function


Now that we have specified the ideal target function V, we must choose a representation that the
learning program will use to describe the function c that it will learn.

4 Choosing a Learning Algorithm for Approximation Target Function


In order to learn the target function f we require a set of training examples, each describing a specific
board state b and the training value Vtrain(b) for b.
4.1 ESTIMATING TRAINING VALUES

4.2 ADJUSTING THE WEIGHTS

5 The Final Design


The final design of our checkers learning system can be naturally described by four distinct program
modules that represent the central components in many learning systems. These four modules are
 The Performance System is the module that must solve the given performance task, in this case
playing checkers, by using the learned target function(s).
It takes an instance of a new problem (new game) as input and produces a trace of its solution (game
history) as output. In our case, the strategy used by the Performance System to select its next move at
each step is determined by the learned evaluation function. Therefore, we expect its performance to
improve as this evaluation function becomes increasingly accurate.
 The Critic takes as input the history or trace of the game and produces as output a set of training
examples of the target function. As shown in the diagram, each training example in this case
corresponds to some game state in the trace, along with an estimate the target function value for
this example.
 The Generalizer takes as input the training examples and produces an output hypothesis that is
its estimate of the target function. It generalizes from the specific training examples,
hypothesizing a general function that covers these examples and other cases beyond the training
examples.
 The Experiment Generator takes as input the current hypothesis (currently learned function)
and outputs a new problem (i.e., initial board state) for the Performance System to explore. Its
role is to pick new practice problems that will maximize the learning rate of the overall system.
Fig: Summary of choices in designing the checkers learning program.

Issues in Machine Learning


Our checkers example raises a number of generic questions about machine learning. The field of machine
learning, and much of this book, is concerned with answering questions such as the following:
 What algorithms exist for learning general target functions from specific training examples? In
what settings will particular algorithms converge to the desired function, given sufficient training
data? Which algorithms perform best for which types of problems and representations?
 How much training data is sufficient? What general bounds can be found to relate the
confidence in learned hypotheses to the amount of training experience and the character of the
learner's hypothesis space?
 When and how can prior knowledge held by the learner guide the process of generalizing from
examples? Can prior knowledge be helpful even when it is only approximately correct?
 What is the best strategy for choosing a useful next training experience, and how does the
choice of this strategy alter the complexity of the learning problem?
 What is the best way to reduce the learning task to one or more function approximation
problems? Put another way, what specific functions should the system attempt to learn? Can this
process itself be automated?
 How can the learner automatically alter its representation to improve its ability to represent and
learn the target function?

Concept learning: Concept learning, also known as category learning, concept


attainment, and concept formation
it is a search for and listing of attributes that can be used to distinguish exemplars from non
exemplars of various categories".

Concept learning. Inferring a boolean-valued function from training examples


of its input and output.

Example for task of learning:

The target concept is "days on which my friend Aldo enjoys his favorite water sport."
Solution:
The attribute EnjoySport indicates whether or not Aldo enjoys his favorite water sport on this
day. The task is to learn to predict the value of EnjoySport for an arbitrary day, based on the
values of its other attributes.
What hypothesis representation shall we provide to the learner in this case? Let us begin by
considering a simple representation in which each hypothesis consists of a conjunction of
constraints on the instance attributes. In particular, let each hypothesis be a vector of six
constraints, specifying the values of the six attributes Sky, AirTemp, Humidity, Wind, Water,
and Forecast. For each attribute, the hypothesis will either

 indicate by a "?" that any value is acceptable for this attribute,


 specify a single required value (e.g., Warm) for the attribute, or
 indicate by a "ϕ" that no value is acceptable.
If some instance x satisfies all the constraints of hypothesis h, then h classifies x as a positive
example (h(x) = 1). To illustrate, the hypothesis that Aldo enjoys his favorite sport only on cold
days with high humidity (independent of the values of the other attributes) is represented by the
expression
<?, Cold, High, ?, ?, ?>

TABLE-2.1: Positive and negative training examples for the target concept EnjoySport.

The most general hypothesis-that every day is a positive example-is represented by


<?, ?, ?, ?, ?, ?>
and the most specific possible hypothesis-that no day is a positive example-is represented by
< ϕ , ϕ, ϕ, ϕ, ϕ, ϕ >
To summarize, the Enjoy Sport concept learning task requires learning the set of days for which
Enjoy Sport = yes, describing this set by a conjunction of constraints over the instance
attributes. In general, any concept learning task can be described by the set of instances over
which the target function is defined, the target function, the set of candidate hypotheses
considered by the learner, and the set of available training examples. The definition of the Enjoy
Sport concept learning task in this general form is given in below.

Fig: The EnjoySport concept learning task.


In Above Enjoy Sport concept learning task in which The set of items over which the concept is
defined is called the set of instances, which we denote by X. In the current example, X is the set
of all possible days, each represented by the attributes Sky, AirTemp, Humidity, Wind, Water,
and Forecast. The concept or function to be learned is called the target concept, which we denote
by c. In general, c can be any Boolean valued function defined over the instances X; that is,
c : X→ {0,1}. In the current example, the target concept corresponds to the value of the attribute
Enjoy Sport (i.e., c(x) = 1 if Enjoy Sport = Yes, and c(x) = 0 if Enjoy Sport = No).

When learning the target concept, the learner is presented a set of training examples, each
consisting of an instance x from X, along with its target concept value c(x) (e.g., the training
examples in Table-2.1). Instances for which c(x) = 1 are called positive examples, or members of
the target concept. Instances for which c(x) = 0 are called negative examples, or nonmembers of
the target concept. We will often write the ordered pair <x, c(x)> to describe the training
example consisting of the instance x and its target concept value c(x). We use the symbol D to
denote the set of available training examples.

Given a set of training examples of the target concept c, the problem faced by the learner is to
hypothesize, or estimate, c. We use the symbol H to denote the set of all possible hypotheses that
the learner may consider regarding the identity of the target concept. Usually H is determined by
the human designer's choice of hypothesis representation. In general, each hypothesis h in H
represents a boolean-valued function defined over X; that is, h : X→ {0,1}. The goal of the
learner is to find a hypothesis h such that h(x) = c(x) for all x in X.
NOTE- Hypothesis: Hypothesis is a consists of a conjunction of constraints on the instance
attributes.

Some Important Definitions-

Hypothesis space 'h' is described by a conjunction of constraints on the attribute, the constraints
may General hypothesis "?" ( any value is acceptable), Specific hypothesis "φ" (a specific value
or no value is accepted) or any specific value.

Instance Space: It is a subset of all possible example or instance.

Version Space: The Version Space denotes VSHD (with respect to hypothesis space H and
training example D) is the subset of hypothesis from H consistent with training example in D.

CONCEPT LEARNING AS SEARCH (General-to-Specific Ordering of Hypotheses)


Concept learning can be viewed as the task of searching through a large space of hypotheses
implicitly defined by the hypothesis representation. The goal of this search is to find the
hypothesis that best fits the training examples. It is important to note that by selecting a
hypothesis representation, the designer of the learning algorithm implicitly defines the space of
all hypotheses that the program can ever represent and therefore can ever learn. Consider, for
example, the instances X and hypotheses H in the EnjoySport learning task. Given that the
attribute Sky has three possible values, and that AirTemp, Humidity, Wind, Water, and
Forecast each have two possible values, the instance space X contains exactly 3.2.2.2.2.2 = 96
distinct instances.
A similar calculation shows that there are 5.4.4.4.4.4 = 5120 syntactically distinct hypotheses
within H. however, that every hypothesis containing one or more "ϕ" symbols represents the
empty set of instances; that is, it classifies every instance as negative. Therefore, the number of
semantically distinct hypotheses is only 1 + (4.3.3.3.3.3) = 973. Our EnjoySport example is a
very simple learning task, with a relatively small, finite hypothesis space. Most practical learning
tasks involve much larger, sometimes infinite, hypothesis spaces.

Example for Concept Learning-


Now Determine---

For given Attributes


 Size of Instance Space (X) = 3*2*2*2*2*2 = 96
General-to-Specific Ordering of Hypotheses
To illustrate the general-to-specific ordering, consider the two hypotheses

Now consider the sets of instances that are classified positive by hl and by h2. Because h2
imposes fewer constraints on the instance, it classifies more instances as positive. In fact, any
instance classified positive by hl will also be classified positive by h2. Therefore, we say that h2
is more general than hl.
This intuitive "more general than" relationship between hypotheses can be defined more
precisely as follows. First, for any instance x in X and hypothesis h in H, we say that x satisfies
h if and only if h(x) = 1. We now define the more-general than or equal to relation in terms of
the sets of instances that satisfy the two hypotheses: Given hypotheses hj and hk, hj is more-
general-than equal to hk if and only if any instance that satisfies hk also satisfies hj.
Instances, hypotheses, and the more-general-than relation. The box on the left represents the set
X of all instances, the box on the right the set H of all hypotheses. Each hypothesis corresponds
to some subset of X-the subset of instances that it classifies positive. The arrows connecting
hypotheses represent the m o r e - g e n e r a l - t h a n relation, with the arrow pointing toward
the less general hypothesis.

FIND-S: FINDING A MAXIMALLY SPECIFIC HYPOTHESIS


The FIND-S algorithm is a First approach of concept learning,

The Question is How can we use the more-general-than partial ordering to organize the search
for a hypothesis consistent with the observed training examples?
One way is to begin with the most specific possible hypothesis in H, then generalize this
hypothesis each time it fails to cover an observed positive training example. (We say that a
hypothesis "covers" a positive example if it correctly classifies the example as positive.) To be
more precise about how the partial ordering is used, consider the FIND-S algorithm.

FIND-S algorithm defined as-

To illustrate FIND-S algorithm, assume the learner is given the sequence of training examples
given in below Table for the Enjoy-Sport task.
The first step of FIND-S algorithm is to initialize h to the most specific hypothesis in H.

Upon observing the first training example from above given Table, which happens to be a
positive example, it becomes clear that our hypothesis is too specific. In particular, none of the
"ϕ" constraints in h are satisfied by this example, so each is replaced by the next more general
constraint that fits the example; namely, the attribute values for this training example.
This h is still very specific; it asserts that all instances are negative except for the single positive
training example we have observed. Next, the second training example (also positive in this
case) forces the algorithm to further generalize h, this time substituting a "?" in place of any
attribute value in h that is not satisfied by the new example. The refined hypothesis in this case
is

Upon encountering the third training example-in this case a negative example- the algorithm
makes no change to h. In fact, the FIND-S algorithm simply ignores every negative example!

While this may at first seem strange, notice that in the current case our hypothesis h is already
consistent with the new negative example (i.e., h correctly classifies this example as negative),
and hence no revision is needed. In the general case, as long as we assume that the hypothesis
space H contains a hypothesis that describes the true target concept c and that the training data
contains no errors, then the current hypothesis h can never require a revision in response to a
negative example. To see why, recall that the current hypothesis h is the most specific hypothesis
in H consistent with the observed positive examples.
Because the target concept c is also assumed to be in H and to be consistent with the positive
training examples, c must be more_general_than-or-equal to h . But the target concept c will
never cover a negative example, thus neither will h (by the definition of more-general-than).
Therefore, no revision to h will be required in response to any negative example.

To complete our trace of FIND-S, the fourth (positive) example leads to a further generalization
of h.
Final Maximally Specific Hypothesis is for given data set is:

The FIND-S algorithm illustrates one way in which the more-general-than partial ordering can
be used to organize the search for an acceptable hypothesis.
The search moves from hypothesis to hypothesis, searching from the most specific to
progressively more general hypotheses along one chain of the partial ordering. As given in below
fig illustrates this search in terms of the instance and hypothesis spaces.

The key property of the FIND-S algorithm is that for hypothesis spaces described by
conjunctions of attribute constraints, FIND-S is guaranteed to output the most specific
hypothesis within H that is consistent with the positive training examples. Its final hypothesis
will also be consistent with the negative examples provided the correct target concept is
contained in H, and provided the training examples are correct.

However, there are several Limitations of Find-S algorithm, such as:


 Has the learner converged to the correct target concept?
 Why prefer the most specific hypothesis?
 Are the training examples consistent?
 What if there are several maximally specific consistent hypotheses?

KEY POINTS FOR, THE CANDIDATE-ELIMINATION


ALGORITHM

 Consistent Hypothesis-

Notice the key difference between this definition of consistent and our earlier definition of
satisfies. An example x is said to satisfy hypothesis h when h(x) = 1, regardless of whether x is
a positive or negative example of the target concept.
However, whether such an example is consistent with h depends on the target concept, and in
particular, whether h(x) = c(x).
 The LIST-THEN-ELIMINATE algorithm

 Version Space- The CANDIDATE-ELIMINATIN Algorithm represents the set


of all hypotheses consistent with the observed training examples. This subset of all
hypotheses is called the version space with respect to the hypothesis space H and the
training examples D, because it contains all possible versions of the target concept.

In above figure, given version space with its general and specific boundary sets. The version
space includes all six hypotheses shown here, but can be represented more simply by S and G.
Arrows indicate instances of the more-general-than relation. This is the version space for the
Enjoy sport concept learning problem and training examples described in below Table.
In fact, this is just one of six different hypotheses from H that are consistent with these training
examples. All six hypotheses are shown in above Figure. They constitute the version space
relative to this set of data and this hypothesis representation.
The arrows among these six hypotheses in above Figure indicate instances of the more-general-
than-relation. The CANDIDATE-ELIMINATION algorithm represents the version space by
storing only its most general members (labeled G in Figure) and its most specific (labeled S in
the figure).
CANDIDAT-ELIMINATION Algorithm- The CANDIDAT-ELIMINATION
algorithm is a second approach of concept learning, , that addresses several of the
limitations of FIND-S Algorithm. Notice that although FIND-S outputs a hypothesis from H,
that is consistent with the training examples, this is just one of many hypotheses from H that
might fit the training data equally well.
The CANDIDATE-ELIMINAN-algorithm LIST the set of all hypotheses (called as version space)
consistent with the training examples.
Surprisingly, the CANDIDAT-ELIMINATION algorithm computes the description of this set
without explicitly enumerating all of its members.
This is accomplished by again using the more-general-than partial ordering, this time to
maintain a compact representation of the set of consistent hypotheses and to incrementally refine
this representation as each new training example is encountered.

NOTE- Nevertheless, practical applications of the CANDIDAT-ELIMINATION algorithm


and FIND-S algorithms are limited by the fact that they both perform poorly when given noisy
training data.

Example of Candidate Elimination Algorithm


Here CANDIDATE-ELIMINATION Trace-1. So and Go are the initial boundary sets
corresponding to the most specific and most general hypotheses. Training examples 1 and 2 force
the S boundary to become more general, as in the FIND-S algorithm. They have no effect on the G
boundary.
Here CANDIDATE-ELIMINATION Trace-2. Training example 3 is a negative example that forces
the G2 boundary to be specialized to G3. Note several alternative maximally general hypotheses are
included in G3.

.
Here CANDIDATE-ELIMINATION Trace-3. The positive training example generalizes the S
boundary, from S3 to S4. One member of G3 must also be deleted, because it is no longer more
general than the S4 boundary.

In above figure, given version space with its general and specific boundary sets. The version
space includes all six hypotheses shown here, but can be represented more simply by S and G.
Concept Learning - Summary
• Concept learning can be seen as a problem of searching through a large predefined space of
potential hypotheses.
• The general-to-specific partial ordering of hypotheses provides a useful structure for organizing
the search through the hypothesis space.
• The FIND-S algorithm utilizes this general-to-specific ordering, performing a specific-to-
general search through the hypothesis space along one branch of the partial ordering, to find the
more specific hypothesis consistent with the training examples.
• The CANDIDATE-ELIMINATION algorithm utilizes this general-to-specific ordering to
compute the version space (the set of all hypotheses consistent with the training data) by
incrementally computing the sets of maximally specific (S) and maximally general (G)
hypotheses.

• Because the S and G sets delimit the entire set of hypotheses consistent with the data, they
provide the learner with a description of its uncertainty regarding the exact identity of the target
concept. This version space of alternative hypotheses can be examined
– to determine whether the learner has converged to the target concept,
– to determine when the training data are inconsistent,
– to generate informative queries to further refine the version space, and
– to determine which unseen instances can be unambiguously classified based on the
partially learned concept.
• The CANDIDATE-ELIMINATION algorithm is not robust to noisy data or to situations in
which the unknown target concept is not expressible in the provided hypothesis space.
• Inductive learning algorithms are able to classify unseen examples only because of their
implicit inductive bias for selecting one consistent hypothesis over another.
• If the hypothesis space is enriched to the point where there is a hypothesis corresponding to
every possible subset of instances (the power set of the instances), this will remove any inductive
bias from the CANDIDATE-ELIMINATION algorithm .
– Unfortunately, this also removes the ability to classify any instance beyond the
observed training examples.
– An unbiased learner cannot make inductive leaps to classify unseen examples.

INDUCTIVE BIAS

You might also like