KEMBAR78
Learning Problems & Concept Learning | PDF | Machine Learning | Cluster Analysis
0% found this document useful (0 votes)
122 views41 pages

Learning Problems & Concept Learning

UNIT 1 MACHINE LEARNING TECHNIQUES Learning Problems – Perspectives and Issues – Concept Learning – Version Spaces and Candidate Eliminations – Inductive bias –Decision Tree learning–Representation–Algorithm –Heuristic Space Search.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
122 views41 pages

Learning Problems & Concept Learning

UNIT 1 MACHINE LEARNING TECHNIQUES Learning Problems – Perspectives and Issues – Concept Learning – Version Spaces and Candidate Eliminations – Inductive bias –Decision Tree learning–Representation–Algorithm –Heuristic Space Search.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 41

Department of Artificial Intelligence and Data Science T.

Kalaiselvi

191AIC502T
Machine Learning Techniques
UNIT 1
INTRODUCTION

Learning Problems – Perspectives and Issues – Concept Learning – Version Spaces and
Candidate Eliminations –Inductive bias –Decision Tree learning–Representation–
Algorithm –Heuristic Space Search.

191AIC502T Machine Learning Techniques


Department of Artificial Intelligence and Data Science T.Kalaiselvi

UNIT 1
MACHINE LEARNING TECHNIQUES

Learning Problems – Perspectives and Issues – Concept Learning – Version Spaces and Candidate
Eliminations – Inductive bias –Decision Tree learning–Representation–Algorithm –Heuristic Space
Search.

1. Learning :
Learning denotes changes in a system that ... enable a system to do the same task … more efficiently the
next time.”- Herbert Simon
“Learning is constructing or modifying representations of what is being experienced.” - Ryszard
Michalski
“Learning is making useful changes in our minds.” - Marvin Minsky

1.1 Definition of learning


Definition
A computer program is said to learn from experience E with respect to some class of tasks T and
performance measure P, if its performance at tasks T, as measured by P, improves with experience E.
Examples
i) Handwriting recognition learning problem
• Task T: Recognising and classifying handwritten words within images
• Performance P: Percent of words correctly classified
• Training experience E: A dataset of handwritten words with given classifications
ii) A robot driving learning problem
• Task T: Driving on highways using vision sensors
• Performance measure P: Average distance traveled before an error
• training experience: A sequence of images and steering commands recorded while
observing a human driver
iii) A chess learning problem
• Task T: Playing chess
• Performance measure P: Percent of games won against opponents
• Training experience E: Playing practice games against itself

1.2 Machine Learning:

Machine learning is programming computers to optimize a performance criterion using


example data or past experience.
Machine learning refers to a system capable of the autonomous acquisition and integration
of knowledge.

191AIC502T Machine Learning Techniques


Department of Artificial Intelligence and Data Science T.Kalaiselvi

We have a model defined up to some parameters, and learning is the execution of a


computer program to optimize the parameters of the model using the training data or past
experience.
The model may be predictive to make predictions in the future, or descriptive to
gain knowledge from data, or both. Arthur Samuel, an early American leader in the field of
computer gaming and artificial intelligence,coined the term “Machine Learning” in 1959 while at
IBM. He defined machine learning as “the field of study that gives computers the ability to learn
without being explicitly programmed.” However, there is no universally accepted definition for
machine learning. Different authors define the term differently.
Why Machine Learning?
No human experts,Industrial/manufacturing control,Mass spectrometer analysis, drug
design, astronomic discovery,Black-box human expertise,Face/handwriting/speech recognition
Driving a car, flying a plane,Rapidly changing phenomena,Credit scoring, financial modeling
Diagnosis, fraud detection,Need for customization/personalization,Personalized news reader
Movie/book recommendation
Definition
A computer program which learns from experience is called a machine learning program or
simply a learning program. Such a program is sometimes also referred to as a learner.
Components of Learning
Basic components of learning process
The learning process, whether by a human or a machine, can be divided into four components,
namely, data storage, abstraction, generalization and evaluation. Figure 1.1 illustrates the
various components and the steps involved in the learning process.

1. Data storage
Facilities for storing and retrieving huge amounts of data are an important component of the
learning process. Humans and computers alike utilize data storage as a foundation for advanced
reasoning.
• In a human being, the data is stored in the brain and data is retrieved using electrochemical
signals.
• Computers use hard disk drives, flash memory, random access memory and similar devices to
stored data and use cables and other technology to retrieve data.

2. Abstraction
The second component of the learning process is known as abstraction.Abstraction is the process
of extracting knowledge about stored data. This involves creating general concepts about the data
as a whole. The creation of knowledge involves application of known models and creation of new
models.The process of fitting a model to a dataset is known as training. When the model has been
trained, the data is transformed into an abstract form that summarizes the original information.

191AIC502T Machine Learning Techniques


Department of Artificial Intelligence and Data Science T.Kalaiselvi

3. Generalization
The third component of the learning process is known as generalisation.The term generalization
describes the process of turning the knowledge about stored data into a form that can be utilized
for future action. These actions are to be carried out on tasks that are similar, but not identical, to
those what have been seen before. In generalization, the goal is to discover those properties of
the data that will be most relevant to future tasks.
4. Evaluation
Evaluation is the last component of the learning process.It is the process of giving feedback to the
user to measure the utility of the learned knowledge. This feedback is then utilised to effect
improvements in the whole learning process.

Applications of machine learning


Application of machine learning methods to large databases is called data mining. In data
mining, a large volume of data is processed to construct a simple model with valuable use, for
example, having high predictive accuracy.The following is a list of some of the typical applications
of machine learning.
1. In retail business, machine learning is used to study consumer behaviour.
2. In finance, banks analyze their past data to build models to use in credit applications, fraud
detection, and the stock market.
3. In manufacturing, learning models are used for optimization, control, and troubleshooting.
4. In medicine, learning programs are used for medical diagnosis.
5. In telecommunications, call patterns are analyzed for network optimization and maximizing
the quality of service.
6. In science, large amounts of data in physics, astronomy, and biology can only be analyzed fast
enough by computers. The World Wide Web is huge; it is constantly growing and searching for
relevant information cannot be done manually.
7. In artificial intelligence, it is used to teach a system to learn and adapt to changes so that the
system designer need not foresee and provide solutions for all possible situations.
8. It is used to find solutions to many problems in vision, speech recognition, and robotics.
9. Machine learning methods are applied in the design of computer-controlled vehicles to steer
correctly when driving on a variety of roads.
10. Machine learning methods have been used to develop programmes for playing games such as
chess, backgammon and Go.

1.3 Learning Models:

Geometric models
Here we consider models that define similarity by considering the geometry of the instance
space. In Geometric models, features could be described as points in two dimensions (x- and y-
axis) or a three-dimensional space (x, y, and z). Even when features are not intrinsically
geometric, they could be modelled in a geometric manner (for example, temperature as a function
of time can be modelled in two axes). In geometric models, there are two ways we could impose
similarity.
We could use geometric concepts like lines or planes to segment (classify) the instance space.
These are called Linear models.
Alternatively, we can use the geometric notion of distance to represent similarity. In this
case, if two points are close together, they have similar values for features and thus can be classed
as similar. We call such models as Distance-based models.

191AIC502T Machine Learning Techniques


Department of Artificial Intelligence and Data Science T.Kalaiselvi

Linear models
Linear models are relatively simple. In this case, the function is represented as a linear
combination of its inputs. Thus, if x1 and x2 are two scalars or vectors of the same dimension and
a and b are arbitrary scalars, then ax1 + bx2 represents a linear combination of x1 and x2. In the
simplest case where f(x) represents a straight line, we have an equation of the form f (x) = mx + c
where c represents the intercept and m represents the slope.

Linear models are parametric, which means that they have a fixed form with a small number of
numeric parameters that need to be learned from data. For example, in f (x) = mx + c, m and c are
the parameters that we are trying to learn from the data. This technique is different from tree or
rule models, where the structure of the model (e.g., which features to use in the tree, and where)
is not fixed in advance.
Linear models are stable, i.e., small variations in the training data have only a limited impact on
the learned model. In contrast, tree models tend to vary more with the training data, as the
choice of a different split at the root of the tree typically means that the rest of the tree is different
as well. As a result of having relatively few parameters, Linear models have low variance and high
bias. This implies that Linear models are less likely to overfit the training data than some
other models. However, they are more likely to underfit. For example, if we want to learn the
boundaries between countries based on labelled data, then linear models are not likely to give a
good approximation.

Distance-based models
Distance-based models are the second class of Geometric models. Like Linear models, distance
based models are based on the geometry of data. As the name implies, distance-based models
work on the concept of distance. In the context of Machine learning, the concept of distance is not
based on merely the physical distance between two points. Instead, we could think of the distance
between two points considering the mode of transport between two points. Travelling between
two cities by plane covers less distance physically than by train because a plane is unrestricted.
Similarly, in chess, the concept of distance depends on the piece used – for example, a Bishop can
move diagonally. Thus, depending on the entity and the mode of travel, the concept of distance
can be experienced differently. The distance metrics commonly used are Euclidean, Minkowski,
Manhattan, and Mahalanobis.

191AIC502T Machine Learning Techniques


Department of Artificial Intelligence and Data Science T.Kalaiselvi

Distance is applied through the concept of neighbours and exemplars. Neighbours are points in
proximity with respect to the distance measure expressed through exemplars. Exemplars are
either centroids that find a centre of mass according to a chosen distance metric or medoids that
find the most centrally located data point. The most commonly used centroid is the arithmetic
mean, which minimises squared Euclidean distance to all other points.
Notes:

position of all the points in the figure from the centroid point. This definition extends to any
object in n-dimensional space: its centroid is the mean position of all the points.

data when a mean or centroid cannot be defined. They are used in contexts where the centroid
is not representative of the dataset, such as in image data.
Examples of distance-based models include the nearest-neighbour models, which use the training
data as exemplars – for example, in classification. The K-means clustering algorithm also uses
exemplars to create clusters of similar data points.
Probabilistic models
The third family of machine learning algorithms is the probabilistic models. We have seen
before that the k-nearest neighbour algorithm uses the idea of distance (e.g., Euclidian distance)
to classify entities, and logical models use a logical expression to partition the instance space. In
this section, we see how the probabilistic models use the idea of probability to classify new
entities. Probabilistic models see features and target variables as random variables. The process
of modelling represents and manipulates the level of uncertainty with respect to these variables.
There are two types of probabilistic models: Predictive and Generative. Predictive probability
models use the idea of a conditional probability distribution P (Y |X) from which Y can be
predicted from X. Generative models
estimate the joint distribution P (Y, X). Once we know the joint distribution for the generative
models, we can derive any conditional or marginal distribution involving the same variables.
Thus, the generative model is capable of creating new data points and their labels, knowing the
joint probability distribution. The joint distribution looks for a relationship between two
variables. Once this relationship is inferred, it is possible to infer new data points.
Naïve Bayes is an example of a probabilistic classifier.
We can do this using the Bayes rule defined as

P(A/B) = P(B/A)P(A)
P(B)

191AIC502T Machine Learning Techniques


Department of Artificial Intelligence and Data Science T.Kalaiselvi

The Naïve Bayes algorithm is based on the idea of Conditional Probability. Conditional probability
is based on finding the probability that something will happen, given that something else has
already happened. The task of the algorithm then is to look at the evidence and to determine the
likelihood of a specific class and assign a label accordingly to each entity.
Logical models:
Logical models use a logical expression to divide the instance space into segments and hence
construct grouping models. A logical expression is an expression that returns a Boolean value, i.e.,
a True or False outcome. Once the data is grouped using a logical expression, the data is divided
into homogeneous groupings for the problem we are trying to solve. For example, for a
classification problem, all the instances in the group belong to one class.
There are mainly two kinds of logical models:
Tree models and Rule models.
Rule models consist of a collection of implications or IF-THEN rules. For tree-based models, the
‘if-part’ defines a segment and the ‘then-part’ defines the behaviour of the model for this segment.
Rule models follow the same reasoning.

Logical models and Concept learning:


To understand logical models further, we need to understand the idea of Concept Learning.
Concept Learning involves learning logical expressions or concepts from examples. The idea of
Concept Learning fits in well with the idea of Machine learning, i.e., inferring a general function
from specific training examples. Concept learning forms the basis of both tree-based and rule-
based models. More formally, Concept Learning involves acquiring the definition of a general
category from a given set of positive and negative training examples of the category. A Formal
Definition for Concept Learning is
“The inferring of a Boolean-valued function from training examples of its input and output.” In
concept learning, we only learn a description for the positive class and label everything that
doesn’t satisfy that description as negative.
The following example explains this idea in more detail.

A Concept Learning Task called “Enjoy Sport” as shown above is defined by a set of data from
some example days. Each data is described by six attributes. The task is to learn to predict the
value of Enjoy Sport for an arbitrary day based on the values of its attribute values. The problem
can be represented by a series of hypotheses. Each hypothesis is described by a conjunction of
constraints on the attributes. The training data represents a set of positive and negative examples
of the target function. In the example above, each hypothesis is a vector of six constraints,
specifying the values of the six attributes – Sky, AirTemp, Humidity, Wind, Water, and Forecast.

191AIC502T Machine Learning Techniques


Department of Artificial Intelligence and Data Science T.Kalaiselvi

The training phase involves learning the set of days (as a conjunction of attributes) for which
Enjoy Sport = yes.
Thus, the problem can be formulated as:
Given instances X which represent a set of all possible days, each described by the attributes:
Sky – (values: Sunny, Cloudy, Rainy),
AirTemp – (values: Warm, Cold),
Humidity – (values: Normal, High),
Wind – (values: Strong, Weak),
Water – (values: Warm, Cold),
Forecast – (values: Same, Change).
Try to identify a function that can predict the target variable Enjoy Sport as yes/no, i.e., 1 or 0.

Grouping and Grading


Grading vs grouping is an orthogonal categorization to geometric-probabilistic-logical-
compositional.
Grouping models break the instance space up into groups or segments and in each segment
apply a very simple method (such as majority class) E.g. decision tree, KNN.
Grading models form one global model over the instance space. E.g. Linear classifiers – Neural
networks
Grading Model:
Grading models in machine learning (ML) refer to models used for evaluating and assigning a
grade, score, or rank to a specific output or performance, based on a set of pre-defined criteria or
metrics. These models are typically used in education, hiring, performance evaluation, and similar
contexts. They can be either supervised or unsupervised models and can use various techniques,
such as regression, classification, clustering, and decision trees, to make predictions or evaluate
results. The accuracy of grading models depends on the quality of the training data and the
appropriate selection of metrics and algorithms.

Grouping Model:
A grouping model in machine learning (ML) refers to a type of model that is used to categorize or
segment data points into different groups or clusters, based on their similarity or proximity. This
process is also known as clustering. The goal of clustering is to find structure in the data and
discover natural groupings or patterns within it.
There are several algorithms used for clustering in ML, including K-Means, Hierarchical
Clustering, DBSCAN, and Gaussian Mixture Models, among others. The choice of algorithm
depends on the nature of the data and the desired outcome. Clustering models are commonly
used in a variety of applications, such as customer segmentation, market research, anomaly
detection, and image segmentation.

1.4 Types of Learning:


There are three main types of machine learning:
Supervised Learning: In supervised learning, the model is trained on labeled data, where the
correct output is known. The model learns to map inputs to the corresponding outputs and then
uses this mapping to make predictions on new, unseen data. Examples of supervised learning
tasks include regression and classification problems.

Unsupervised Learning: In unsupervised learning, the model is trained on unlabeled data,


where the correct output is unknown. The goal of unsupervised learning is to discover the

191AIC502T Machine Learning Techniques


Department of Artificial Intelligence and Data Science T.Kalaiselvi

underlying structure of the data, such as grouping similar data points together or finding patterns.
Examples of unsupervised learning tasks include clustering and dimensionality reduction.

Reinforcement Learning: In reinforcement learning, an agent interacts with an environment to


maximize a reward signal. The agent receives feedback in the form of rewards or penalties, and
it learns from this feedback over time to make better decisions. Reinforcement learning is used
in a variety of applications, such as game playing, robotics, and recommendation systems.

There are also semi-supervised and active learning techniques that are hybrids of supervised
and unsupervised learning.

Based on different business goals and data sets, there are three learning models for algorithms.
In general, machine learning algorithms can be classified into three types:
Supervised Learning
X,Y (Pre classified training examples)
Given an observation x, what is the best label for y?
Unsupervised Learning
X -Given a set of X’s , cluster or summarize them
Reinforcement Learning
Determine what to do based on rewards and punishments
Supervised Learning:

Supervised learning is a type of machine learning (ML) where the model is trained on labeled
data, where the correct output is known. The goal of supervised learning is to learn a mapping
from input features to the corresponding output labels. This mapping is then used to make
predictions on new, unseen data.

Supervised learning algorithms can be divided into two main categories: regression and
classification. In regression problems, the goal is to predict a continuous value (e.g. price,
temperature), while in classification problems, the goal is to predict a discrete label (e.g. spam or
not spam, positive or negative sentiment).

Supervised learning algorithms include linear regression, logistic regression, decision trees,
random forests, support vector machines (SVMs), neural networks, and many others. The choice
of algorithm depends on the nature of the data and the desired outcome.

191AIC502T Machine Learning Techniques


Department of Artificial Intelligence and Data Science T.Kalaiselvi

Supervised learning is widely used in various applications, including natural language processing,
computer vision, recommendation systems, and predictive analytics, among others.

A training set of examples with the correct responses (targets) is provided and, based on this
trainingset, the algorithm generalises to respond correctly to all possible inputs. This is also called
learning from exemplars. Supervised learning is the machine learning task of learning a function
that maps an input to an output based on example input-output pairs. In supervised learning,
each example in the training set is a pair consisting of an input object (typically a vector) and an
output value. A supervised learning algorithm analyzes the training data and produces a function,
which can be used for mapping new examples. In the optimal case, the function will correctly
determine the class labels for unseen instances. Both classification and regression problems are
supervised learning problems. A wide range of supervised learning algorithms are available, each
with its strengths and weaknesses. There is no single learning algorithm that works best on all
supervised learning problems.

Remarks

191AIC502T Machine Learning Techniques


Department of Artificial Intelligence and Data Science T.Kalaiselvi

A “supervised learning” is so called because the process of an algorithm learning from the training
dataset can be thought of as a teacher supervising the learning process. We know the correct
answers (that is, the correct outputs), the algorithm iteratively makes predictions on the training
data and is corrected by the teacher. Learning stops when the algorithm achieves an acceptable
level of performance. Consider the following data regarding patients entering a clinic. The data
consists of the gender and age of the patients and each patient is labeled as “healthy” or “sick”.

Unsupervised learning:
Unsupervised learning is a type of machine learning (ML) where the model is trained on
unlabeled data, where the correct output is unknown. The goal of unsupervised learning is to
discover the underlying structure of the data, such as grouping similar data points together,
identifying patterns or relationships, or finding lower-dimensional representations of the data.

Unsupervised learning algorithms can be divided into two main categories: clustering and
dimensionality reduction. In clustering, the goal is to divide the data into groups or clusters based
on similarity. In dimensionality reduction, the goal is to reduce the number of features or
dimensions in the data while preserving as much information as possible.

Unsupervised learning algorithms include K-Means, hierarchical clustering, DBSCAN, Gaussian


Mixture Models, Principal Component Analysis (PCA), Autoencoders, and many others. The
choice of algorithm depends on the nature of the data and the desired outcome.

Unsupervised learning is widely used in various applications, including image segmentation,


market segmentation, anomaly detection, and feature engineering, among others.

191AIC502T Machine Learning Techniques


Department of Artificial Intelligence and Data Science T.Kalaiselvi

Unsupervised learning is a type of machine learning algorithm used to draw inferences from
datasets consisting of input data without labeled responses. Correct responses are not provided,
but instead the algorithm tries to identify similarities between the inputs so that inputs that have
something in common are categorized together. The statistical approach to unsupervised
learning is known as density estimation.
Unsupervised learning is a type of machine learning algorithm used to draw inferences from
datasets consisting of input data without labeled responses. In unsupervised learning algorithms,
a classification or categorization is not included in the observations. There are no output values
and so there is no estimation of functions. Since the examples given to the learner are unlabeled,
the accuracy of the structure that is output by the algorithm cannot be evaluated. The most
common unsupervised learning method is cluster analysis, which is used for exploratory data
analysis to find hidden patterns or grouping in data.

Unlike supervised learning algorithms, where we deal with labelled data for training, the training
data will be unlabelled for Unsupervised Machine Learning Algorithms. The clustering of data
into a specific group will be done on the basis of the similarities between the variables.

Clustering: Unsupervised learning problem that involves finding groups in data.


Density estimation: Unsupervised learning problem that involves summarizing the distribution
of data.
Visualization: Unsupervised learning problem that involves creating plots of data.
Projection: Unsupervised learning problem that involves creating lower- dimensional
representations of data.
Examples: K-means clustering, neural networks

Example
Consider the following data regarding patients entering a clinic. The data consists of the gender
and age of the patients.

Based on this data, can we infer anything regarding the patients entering the clinic?

Semi supervised Learning:

supervised and unsupervised learning. In semi-supervised learning, the model is trained on a


mixture of labeled and unlabeled data. This approach can be useful when labeled data is scarce
or expensive to obtain.

The goal of semi-supervised learning is to leverage the information in the unlabeled data to
improve the performance of the model on the labeled data. The model uses the labeled data to
make predictions, and then it uses the predictions to propagate labels to the unlabeled data. This

191AIC502T Machine Learning Techniques


Department of Artificial Intelligence and Data Science T.Kalaiselvi

process can help the model learn additional information from the unlabeled data, leading to
improved performance.

Semi-supervised learning algorithms include semi-supervised support vector machines (S3VMs),


generative adversarial networks (GANs), co-training, and many others. The choice of algorithm
depends on the nature of the data and the desired outcome.

Semi-supervised learning is widely used in various applications, including natural language


processing, computer vision, and sentiment analysis, among others.

Reinforcement learning:

Reinforcement learning (RL) is a type of machine learning (ML) where an agent interacts with an
environment to maximize a reward signal. The agent learns from the environment through trial
and error by taking actions and observing the resulting rewards or penalties. Over time, the agent
learns to make better decisions that lead to higher rewards.

In RL, the agent's goal is to find an optimal policy, which is a mapping from states to actions that
maximizes the expected cumulative reward over time. The policy is learned through a process of
exploration and exploitation, where the agent balances taking actions it knows will lead to high
rewards with taking actions that may lead to even higher rewards in the future.

Reinforcement learning algorithms include Q-learning, SARSA, policy gradients, and deep
reinforcement learning, among others. The choice of algorithm depends on the nature of the
environment and the desired outcome.

Reinforcement learning is widely used in various applications, including game playing, robotics,
recommendation systems, and autonomous systems, among others.

191AIC502T Machine Learning Techniques


Department of Artificial Intelligence and Data Science T.Kalaiselvi

Reinforcement Learning is a type of Machine Learning in which the machine is required to


determine the ideal behaviour within a specific context, in order to maximize its rewards. It
works on the rewards and punishment principle which means that for any decision which a
machine takes, it will be either be rewarded or punished. Thus, it will understand whether or not
the decision was correct. This is how the machine will learn to take the correct decisions to
maximize the reward in the long run.

This is somewhere between supervised and unsupervised learning. The algorithm gets told
when the answer is wrong, but does not get told how to correct it. It has to explore and try out
different possibilities until it works out how to get the answer right. Reinforcement learning is
sometime called learning with a critic because of this monitor that scores the answer, but does
not suggest improvements.

Reinforcement learning is the problem of getting an agent to act in the world so as to maximize
its rewards. A learner (the program) is not told what actions to take as in most forms of machine
learning, but instead must discover which actions yield the most reward by trying them. In the
most interesting and challenging cases, actions may affect not only the immediate reward but also
the next situations and, through that, all subsequent rewards.

191AIC502T Machine Learning Techniques


Department of Artificial Intelligence and Data Science T.Kalaiselvi

Example
Consider teaching a dog a new trick: we cannot tell it what to do, but we can reward/punish it if
it does the right/wrong thing. It has to find out what it did that made it get the
reward/punishment. We can use a similar method to train computers to do many tasks, such as
playing backgammon or chess, scheduling jobs, and controlling robot limbs. Reinforcement
learning is different from supervised learning. Supervised learning is learning from examples
provided by a knowledgeable expert.
Examples: Q-learning, temporal-difference learning, and deep reinforcement learning

1.5 Error and Noise:


Error measures are a tool in ML that quantify the question “how wrong was our estimation”. It is
a function that compares the output of a learned hypothesis with the output of the real target
function.we compare the prediction of our model with the real value in data.

An error measure is expressed as E(h, f) (a hypothesis h ∈ H, and f is the target function).


E is almost always pointwise. It is defined by the difference at two points, therefore, we use the
pointwise definition of the error measure e() to compute this error in the different points: e(h(x),
f(x)). Examples:
Squared error : e(h(x), f(x)) = (h(x)- f(x))²
Binary error : e(h(x),f(x)) = ⟦h(x) ≠ f(x)⟧
(the number of wrong classifications)
Noise :
It refers to the irrelevant information or randomness in a dataset.
We can express noisy target as follows:
Noisy target= deterministic target + noise = ➪[y|x] + ε
where ε = (y - f(x)) is the difference between the outcome and the predicted value.
➪[y|x] is the expected value of y knowing x, y is our prediction using the target function h(x) and
f(x) is the real value of the data point. We introduced P(y|x) into our learning scheme to account
for the fact that there will always be noise in the relationship between x and y while P(x)
represents the random variable x and is necessary for us to use Hoeffding‘s inequality.

1.6 Training versus Testing:

In a dataset, a training set is implemented to build up a model, while a test (or validation) set is
to validate the model built. Data points in the training set are excluded from the test (validation)
set.
In Machine Learning, we basically try to create a model to predict the test data Usually, a dataset
is divided into a training set, a validation set (some people use ‘test set’ instead) in each iteration,
or divided into a training set, a validation set and a test set in each iteration.

• Training Set: Here, you have the complete training dataset. You can extract features and
train to fit a model and so on.
• Validation Set: This is crucial to choose the right parameters for your estimator. We can
divide the training set into a train set and validation set. Based on the validation test
results, the model can be trained(for instance, changing parameters, classifiers)
• Testing Set: Here, once the model is obtained, you can predict using the model obtained
on the training set

191AIC502T Machine Learning Techniques


Department of Artificial Intelligence and Data Science T.Kalaiselvi

1.7 Designing a Learning System

Designing a machine learning (ML) system involves several steps, including:

Problem definition: Clearly define the problem that you want to solve and identify the
performance metrics that will be used to evaluate the system's success.

Data collection: Gather and preprocess the data that will be used to train and test the ML model.
Ensure that the data is clean, well-labeled, and relevant to the problem.

Feature engineering: Identify and extract relevant features from the data that can be used as input
to the ML model. Feature engineering is a critical step that can greatly impact the performance of
the ML system.

Model selection: Choose an appropriate ML algorithm that suits the problem and the data.
Consider factors such as the size and complexity of the data, the desired outcome, and the
computational resources available.

Model training: Train the ML model using the pre-processed data. Fine-tune the model
parameters to achieve the best performance.

Model evaluation: Evaluate the performance of the ML model using a validation dataset. Compare
the performance to the baseline and the performance metrics defined in step 1.

Model deployment: Deploy the ML model in a production environment, ensuring that it can
handle real-world scenarios and changing data distributions.

Monitoring and maintenance: Monitor the performance of the ML system and perform regular
maintenance to keep it up-to-date with new data and changing requirements.

It is important to iterate through these steps and make changes as necessary to improve the
performance of the ML system. Machine learning is a constantly evolving field, and it is important

191AIC502T Machine Learning Techniques


Department of Artificial Intelligence and Data Science T.Kalaiselvi

to stay up-to-date with new techniques and algorithms that may be more suitable for your
problem.
Just now we looked into the learning process and also understood the goal of the learning. When we
want to design a learning system that follows the learning process, we need to consider a few design
choices. The design choices will be to decide the following key components:
1. Choosing the training experience
2. Choosing the Target Function
3. Choosing a representation for the Target Function
4. Choosing a Function Approximation Algorithm
5. The final Design
We will look into the game - checkers learning problem and apply the above design choices. For a
checkers learning problem, the three elements will be,
1. Task T: To play checkers
2. Performance measure P: Total percent of the game won in the tournament.
3. Training experience E: A set of games played against itself
Choosing the training experience:

The first design choice we face is to choose the type of training experience from which our system will
learn. The type of training experience available can have a significant impact on success or failure of
the learner. One key attribute is whether the training experience provides direct or indirect feedback
regarding the choices made by the performance system.

During the design of the checker's learning system, the type of training experience available for a
learning system will have a significant effect on the success or failure of the learning.
1. Direct or Indirect training experience — In the case of direct training experience, an individual
board states and correct move for each board state are given.
In case of indirect training experience, the move sequences for a game and the final result (win, loss
or draw) are given for a number of games. How to assign credit or blame to individual moves is the
credit assignment problem.
2. Teacher or Not — Supervised — The training experience will be labeled, which means, all the board
states will be labeled with the correct move. So the learning takes place in the presence of a
supervisor or a teacher.
Unsupervised — The training experience will be unlabeled, which means, all the board states will not
have the moves. So the learner generates random games and plays against itself with no supervision
or teacher involvement.
Semi-supervised — Learner generates game states and asks the teacher for help in finding the
correct move if the board state is confusing.
3. Is the training experience good — Do the training examples represent the distribution of examples
over which the final system performance will be measured? Performance is best when training
examples and test examples are from the same/a similar distribution.
The checker player learns by playing against oneself. Its experience is indirect. It may not encounter
moves that are common in human expert play. Once the proper training experience is available, the
next design step will be choosing the Target Function.
Choosing the Target Function
The next design choice is to determine exactly what type of knowledge will be learned and how this
will be used by the performance program.When you are playing the checkers game, at any moment of
time, you make a decision on
choosing the best move from different possibilities. You think and apply the learning that you have

191AIC502T Machine Learning Techniques


Department of Artificial Intelligence and Data Science T.Kalaiselvi

gained from the experience. Here the learning is, for a specific board, you move a checker such that
your board state tends towards the winning situation. Now the same learning has to be defined in
terms of the target function.
Here there are 2 considerations — direct and indirect experience.

During the direct experience, the checkers learning system, it needs only to learn how to choose
the best move among some large search space. We need to find a target function that will help
us choose the best move among alternatives. Let us call this function ChooseMove and use the
notation ChooseMove : B →M to indicate that this function accepts as input any board from the
set of legal board states B and produces as output some move from the set of legal moves M.

When there is an indirect experience, it becomes difficult to learn such function. How about
assigning a real score to the board state. So the function be V : B →R indicating that this accepts as
input any board from the set of legal board states B and produces an output a real score. This function
assigns the higher scores to better board states.

If the system can successfully learn such a target function V, then it can easily use it to select the best
move from any board position. Let us therefore define the target value V(b) for an arbitrary board state
b in B, as follows:
1. if b is a final board state that is won, then V(b) = 100
2. if b is a final board state that is lost, then V(b) = -100
3. if b is a final board state that is drawn, then V(b) = 0
4. if b is a not a final state in the game, then V (b) = V (b’), where b’ is the best final board state that can
be achieved starting from b and playing optimally until the end of the game.
The (4) is a recursive definition and to determine the value of V(b) for a particular board state, it
performs the search ahead for the optimal line of play, all the way to the end of the game. So this
definition is not efficiently computable by our checkers playing program, we say that it is a non
operational definition.

The goal of learning, in this case, is to discover an operational description of V ; that is, a description
that can be used by the checkers-playing program to evaluate states and select moves within realistic
time bounds.
It may be very difficult in general to learn such an operational form of V perfectly. We expect learning
algorithms to acquire only some approximation to the target function ^V.

Choosing a representation for the Target Function


Now that we have specified the ideal target function V, we must choose a representation that the
learning program will use to describe the function ^V that it will learn. As with earlier design choices,

191AIC502T Machine Learning Techniques


Department of Artificial Intelligence and Data Science T.Kalaiselvi

we again have many options. We could, for example, allow the program to represent using a large table
with a distinct entry specifying the value for each distinct board state. Or we could allow it to represent
using a collection of rules that match against features of the board state, or a quadratic polynomial
function of predefined board features, or an artificial neural network. In general, this choice of
representation involves a crucial tradeoff. On one hand, we wish to pick a very expressive
representation to allow representing as close an approximation as possible to the ideal target function
V. On the other hand, the more expressive the representation, the more training data the program will
require in order to choose among the alternative hypotheses it can represent. To keep the discussion
brief, let us choose a simple representation: for any given board state, the function ^V will be calculated
as a linear combination of the following board features:
x1(b) — number of black pieces on board b
x2(b) — number of red pieces on b
x3(b) — number of black kings on b
x4(b) — number of red kings on b
x5(b) — number of red pieces threatened by black (i.e., which can be taken on black’s next turn)
x6(b) — number of black pieces threatened by red
^V (b)= w0 + w1 · x1(b) + w2 · x2(b) + w3 · x3(b) + w4 · x4(b) +w5 · x5(b) + w6 · x6(b)
Where w0 through w6 are numerical coefficients or weights to be obtained by a learning algorithm.
Weights w1 to w6 will determine the relative importance of different board features.

Specification of the Machine Learning Problem at this time — Till now we worked on choosing the type
of training experience, choosing the target function and its representation. The checkers learning task
can be summarized as below.
Task T : Play Checkers
Performance Measure : % of games won in world tournament
Training Experience E : opportunity to play against itself
Target Function : V : Board → R
Target Function Representation : ^V = w0 + w1 · x1(b) + w2 · x2(b) + w3 · x3(b) + w4 · x4(b) +w5 ·
x5(b) + w6 · x6(b)
The first three items above correspond to the specification of the learning task, whereas the final two
items constitute design choices for the implementation of the learning program.
Choosing a Function Approximation Algorithm
Generating training data :
To train our learning program, we need a set of training data, each describing a specific board state b
and the training value V_train (b) for b. Each training example is an ordered pair <b,V_train(b)>
For example, a training example may be <(x1 = 3, x2 = 0, x3 = 1, x4 = 0, x5 = 0, x6 = 0), +100">. This is
an example where black has won the game since x2 = 0 or red has no remaining pieces. However, such
clean values of V_train (b) can be obtained only for board value b that are clear win, loss or draw.

Estimating Training Values:


In above case, assigning a training value V_train(b) for the specific boards b that are clean win, loss or
draw is direct as they are direct training experience. But in the case of indirect training experience,
assigning a training value V_train(b) for the intermediate boards is difficult. In such case, the training
values are updated using temporal difference learning. Temporal difference (TD) learning is a concept
central to reinforcement learning, in which learning happens through the iterative correction of your
estimated returns towards a more accurate target return.
Let Successor(b) denotes the next board state following b for which it is again the program’s turn to
move. ^V is the learner’s current approximation to V. Using these information, assign the training value
of V_train(b) for any intermediate board state b as below :

191AIC502T Machine Learning Techniques


Department of Artificial Intelligence and Data Science T.Kalaiselvi

Rules For Estimating Training Values

V_train(b) ← ^V(Successor(b))

Adjusting the weights:


Now its time to define the learning algorithm for choosing the weights and best fit the set of
training examples. One common approach is to define the best hypothesis as that which minimizes the
squared error E between the training values and the values predicted by the hypothesis ^V. The
learning algorithm should incrementally refine weights as more training examples become available
and it needs to be robust to errors in training data Least Mean Square (LMS) training rule is the one
training algorithm that will adjust weights a small amount in the direction that reduces the error.

The LMS algorithm is defined as follows:

Final Design for Checkers Learning system


The final design of our checkers learning system can be naturally described by four distinct
program modules that represent the central components in many learning systems.

1. The performance System — Takes a new board as input and outputs a trace of the game it played
against itself.
2. The Critic — Takes the trace of a game as an input and outputs a set of training examples of the

191AIC502T Machine Learning Techniques


Department of Artificial Intelligence and Data Science T.Kalaiselvi

target function.
3. The Generalizer — Takes training examples as input and outputs a hypothesis that estimates the
target function. Good generalization to new cases is crucial.
4. The Experiment Generator — Takes the current hypothesis (currently learned function) as input and
outputs a new problem (an initial board state) for the performance system to explore.

2. Perspectives and Issues

2.1 Perspectives in Machine Learning


One useful perspective on machine learning is that it involves searching a very large space of
possible hypotheses to determine one that best fits the observed data and any prior knowledge held by
the learner.
For example, consider the space of hypotheses that could in principle be output by the above checkers
learner. This hypothesis space consists of all evaluation functions that can be represented by some
choice of values for the weights wo through w6. The learner's task is thus to search through this vast
space to locate the hypothesis that is most consistent with the available training examples. The LMS
algorithm for fitting weights achieves this goal by iteratively tuning the weights, adding a correction to
each weight each time the hypothesized evaluation function predicts a value that differs from the
training value. This algorithm works well when the hypothesis representation considered by the learner
defines a continuously parameterized space of potential hypotheses.

2.2 Issues in Machine Learning

Our checkers example raises a number of generic questions about machine learning. The field of
machine learning, and much of this book, is concerned with answering questions such as the following:

what settings will particular algorithms converge to the desired function, given sufficient
training data? Which algorithms perform best for which types of problems and representations?
How much training data is sufficient? What general bounds can be found to relate the confidence in
learned hypotheses to the amount of training experience and the character of the learner's hypothesis
space?
When and how can prior knowledge held by the learner guide the process of generalizing from examples?
Can prior knowledge be helpful even when it is only approximately correct?
What is the best strategy for choosing a useful next training experience, and how does the choice of this
strategy alter the complexity of the learning problem?
What is the best way to reduce the learning task to one or more function approximation
problems? Put another way, what specific functions should the system attempt to learn? Can
this process itself be automated?
How can the learner automatically alter its representation to improve its ability to represent
and learn the target function?

3. Concept Learning
Concept learning is the process of learning to recognize and categorize
objects or situations based on their attributes and relations. For example, a
concept learning system might learn to identify different types of animals based
on their shape, size, color, and behavior.
Concept Learning in Machine Learning can be thought of as a boolean-
valued function defined over a large set of training data. Taking a very simple

191AIC502T Machine Learning Techniques


Department of Artificial Intelligence and Data Science T.Kalaiselvi

example, one possible target concept may be to Find the day when my friend
Ramesh enjoys his favourite sport. We have some attributes/features of the day
like, Sky, Air Temperature, Humidity, Wind, Water, Forecast and based on this we
have a target Concept named Enjoy Sport. Let’s Design the problem formally with
TPE(Task, Performance, Experience):

Problem: Leaning the day when Ramesh enjoys the sport.

Task T: Learn to predict the value of EnjoySport for an arbitrary day, based
on the values of the attributes of the day.

Performance measure P: Total percent of days (EnjoySport) correctly


predicted.

Training experience E: A set of days with given labels (EnjoySport: Yes/No)

Let us take a very simple hypothesis representation which consists of a


conjunction of constraints in the instance attributes. We get a hypothesis h_i with
the help of example i for our training set as below:

hi(x) := <x1, x2, x3, x4, x5, x6>

where x1, x2, x3, x4, x5 and x6 are the values of Sky, AirTemp, Humidity,
Wind, Water and Forecast.

Hence h1 will look like(the first row of the table above):

h1(x=1): <Sunny, Warm, Normal, Strong, Warm, Same > Note: x=1
represents a positive hypothesis / Positive example

We want to find the most suitable hypothesis which can represent the
concept. For example, Ramesh enjoys his favorite sport only on cold days with high
humidity (This seems independent of the values of the other attributes present in
the training examples).

h(x=1) = <?, Cold, High, ?, ?, ?>

Here? Indicates that any value of the attribute is acceptable. Note: The most
generic hypothesis will be <?, ?, ?, ?, ?, ?> where every day is a positive example
and the most specific hypothesis will be <?,?,?,?,?,? > where no day is a positive
example.

We will discuss the two most popular approaches to find a suitable


hypothesis, they are:

Find-S Algorithm and List-Then-Eliminate Algorithm

191AIC502T Machine Learning Techniques


Department of Artificial Intelligence and Data Science T.Kalaiselvi

3.1 Find-S Algorithm:

The find-S algorithm is a basic concept learning algorithm in machine


learning. The find-S algorithm finds the most specific hypothesis that fits all the
positive examples. We have to note here that the algorithm considers only those
positive training example. The find-S algorithm starts with the most specific
hypothesis and generalizes this hypothesis each time it fails to classify an
observed positive training data. Hence, the Find-S algorithm moves from the most
specific hypothesis to the most general hypothesis.
In the vast landscape of machine learning algorithms, the "Find S
Algorithm" emerges as a foundational method for inducing hypotheses from
training data. Concept learning, a fundamental aspect of machine learning, often
relies on algorithms like Find S to discern patterns and regularities within data. At
its core, the Find S Algorithm operates on a principle of iterative refinement,
gradually honing in on a hypothesis that accurately represents the underlying
concept being learned. This article provides an in-depth exploration of the Find S
Algorithm, elucidating its critical representations, steps, and significance in the
realm of machine learning. For example, in a dataset of email messages labeled as
spam or not spam, the Find-S algorithm could iteratively refine a hypothesis that
identifies common features of spam emails, ultimately providing a rule to classify
new emails as spam or not spam based on those features.
Important Representations in Find S Algorithm

Important Representation:

1. ? indicates that any value is acceptable for the attribute.


2. Specify a single required value ( e.g., Cold ) for the attribute.
3. Φ indicates that no value is acceptable.
4. The most general hypothesis is represented by: {?, ?, ?, ?, ?, ?}
5. The most specific hypothesis is represented by: {ϕ, ϕ, ϕ, ϕ, ϕ, ϕ}

Steps Involved in Find S Algorithm

1. Start with the most specific hypothesis.


h = {ϕ, ϕ, ϕ, ϕ, ϕ, ϕ}
2. Take the next example and if it is negative, then no changes occur to the
hypothesis.
3. If the example is positive and we find that our initial hypothesis is too
specific then we update our current hypothesis to a general condition.
4. Keep repeating the above steps till all the training examples are
complete.
5. After we have completed all the training examples we will have the final
hypothesis when can use to classify the new examples.

191AIC502T Machine Learning Techniques


Department of Artificial Intelligence and Data Science T.Kalaiselvi

The Find S Algorithm operates through a systematic series of steps aimed


at progressively refining hypotheses based on provided training examples. This
iterative process facilitates the extraction of underlying patterns and regularities
from data. Here's a comprehensive breakdown of the steps involved:

Initialization:
The algorithm commences by initializing the hypothesis to the most
specific representation, {ϕ, ϕ, ϕ, ϕ, ϕ, ϕ}. This starting hypothesis encapsulates
the absence of any prior knowledge about the concept being learned.

Iteration through Training Examples:

For each training example encountered: If the example is negative (i.e.,


does not conform to the concept being learned), no changes occur to the
hypothesis. The algorithm maintains its specificity in light of conflicting evidence.
If the example is positive (i.e., aligns with the concept being learned):
The algorithm evaluates whether the current hypothesis is too specific to
accommodate the positive example.
If deemed too specific, the hypothesis is generalised to encompass the
observed attribute values of the positive example.
This generalisation process involves replacing null (ϕ) values with specific
attribute values from the positive example or introducing wildcard (?) symbols
where appropriate.
Iterative Refinement:
The algorithm iterates through the training examples, progressively
refining the hypothesis with each encounter. By assimilating both positive and
negative instances, the algorithm hones in on a hypothesis that accurately
captures the underlying concept.

Convergence:
The iterative process continues until all training examples have been
processed. At this juncture, the algorithm yields a final hypothesis that
encapsulates the learned concept based on the provided data.

Final Hypothesis:
The culmination of the algorithm's iterative refinement is the derivation of
a final hypothesis. This hypothesis represents a synthesized abstraction of the
underlying concept, encapsulating the observed patterns and regularities within
the training data.

By systematically navigating through these steps, the Find S Algorithm


adeptly traverses the space of hypotheses, culminating in the extraction of
actionable knowledge from data

Algorithm

1. Initialize h to the most specific hypothesis in H

191AIC502T Machine Learning Techniques


Department of Artificial Intelligence and Data Science T.Kalaiselvi

2. For each positive training example,

 For each attribute, constraint ai in h

a. If the constraints ai is satisfied by x

b. Then do nothing

c. Else replace ai in h by the next more general


constraint that is satisfied by x

3. Output hypothesis h
Example:
Consider the following data set having the data about which particular
seeds are poisonous.

First, we consider the hypothesis to be a more specific hypothesis. Hence, our


hypothesis would be :
h = {ϕ, ϕ, ϕ, ϕ, ϕ, ϕ}

191AIC502T Machine Learning Techniques


Department of Artificial Intelligence and Data Science T.Kalaiselvi

Consider example 1 :
The data in example 1 is {GREEN, HARD, NO, WRINKLED }. We see that our initial
hypothesis is more specific and we have to generalize it for this example. Hence,
the hypothesis becomes :
h = {GREEN, HARD, NO, WRINKLED}
Consider example 2 :
Here we see that this example has a negative outcome. Hence we neglect this
example and our hypothesis remains the same.
h = { GREEN, HARD, NO, WRINKLED }
Consider example 3 :
Here we see that this example has a negative outcome. Hence we neglect this
example and our hypothesis remains the same.
h = { GREEN, HARD, NO, WRINKLED }
Consider example 4 :
The data present in example 4 is { ORANGE, HARD, NO, WRINKLED }. We compare
every single attribute with the initial data and if any mismatch is found we replace
that particular attribute with a general case ( ” ? ” ). After doing the process the
hypothesis becomes :
h = { ?, HARD, NO, WRINKLED }
Consider example 5 :
The data present in example 5 is { GREEN, SOFT, YES, SMOOTH }. We compare
every single attribute with the initial data and if any mismatch is found we replace
that particular attribute with a general case ( ” ? ” ). After doing the process the
hypothesis becomes :
h = { ?, ?, ?, ? }
Since we have reached a point where all the attributes in our hypothesis
have the general condition, example 6 and example 7 would result in the same
hypothesizes with all general attributes.
h = { ?, ?, ?, ? }

Hence, for the given data the final hypothesis would be :


Final Hyposthesis: h = { ?, ?, ?, ? }

3.2 The LIST-THEN-ELIMINATE Algorithm:

191AIC502T Machine Learning Techniques


Department of Artificial Intelligence and Data Science T.Kalaiselvi

Following are the steps for the LIST-THE-ELIMINATE algorithm:


VersionSpace <- a list containing every hypothesis in H
For each training example, <x, c(x)>
Remove from VersionSpace any hypothesis h for which h(x) != c(x)
Output the list of hypotheses in VersionSpace.
The most suitable way to find a good hypothesis will be to start with both the
directions, by taking the most general and the most specific boundaries. This approach is
called a CANDIDATE-ELIMINATION Learning Algorithm.
4. Version Spaces and Candidate Eliminations
4.1 Version Spaces
A version space is a hierarchial representation of knowledge that enables you to keep
track of all the useful information supplied by a sequence of learning examples without
remembering any of the examples.
The version space method is a concept learning process accomplished by managing
multiple models within a version space.
Version Space Characteristics:
Tentative heuristics are represented using version spaces.
A version space represents all the alternative plausible descriptions of a heuristic.
A plausible description is one that is applicable to all known positive examples and no
known negative example.
A version space description consists of two complementary trees:
1. One that contains nodes connected to overly general models, and
2. One that contains nodes connected to overly specific models.
Node values/attributes are discrete.
Fundamental Assumptions
1. The data is correct; there are no erroneous instances.
2. A correct description is a conjunction of some of the attributes with values.
Diagrammatical Guidelines
There is a generalization tree and a specialization tree.Each node is connected to a model.
Nodes in the generalization tree are connected to a model that matches everything in its
subtree. Nodes in the specialization tree are connected to a model that matches only one
thing in its subtree.
Links between nodes and their models denote

191AIC502T Machine Learning Techniques


Department of Artificial Intelligence and Data Science T.Kalaiselvi

1. generalization relations in a generalization tree, and


2. specialization relations in a specialization tree.
Diagram of a Version Space
In the diagram below, the specialization tree is colored red, and the generalization tree is
colored green.

Generalization and Specialization Leads to Version Space Convergence


The key idea in version space learning is that specialization of the general models and
generalization of the specific models may ultimately lead to just one correct model that
matches all observed positive examples and does not match any negative examples. That
is, each time a negative example is used to specialilize the general models, those specific
models that match the negative example are eliminated and each time a positive example
is used to generalize the specific models, those general models that fail to match the
positive example are eliminated. Eventually, the positive and negative examples may be
such that only one general model and one identical specific model survive.

4.2 Version Space Method Learning Algorithm: Candidate-Elimination:


The version space method handles positive and negative examples symmetrically.
Given:
1. A representation language.
2. A set of positive and negative examples expressed in that language.
Compute: a concept description that is consistent with all the positive examples and none
of the negative examples.

191AIC502T Machine Learning Techniques


Department of Artificial Intelligence and Data Science T.Kalaiselvi

Method:
 Initialize G, the set of maximally general hypotheses, to contain one element: the
null description (all features are variables).
 Initialize S, the set of maximally specific hypotheses, to contain one element: the
first positive example.
 Accept a new training example.
a. If the example is positive:
1. Generalize all the specific models to match the positive example,
but ensure the following:
 The new specific models involve minimal changes.
 Each new specific model is a specialization of some general
model.
 No new specific model is a generalization of some other
specific model.
2. Prune away all the general models that fail to match the positive
example.

b. If the example is negative:


1. Specialize all general models to prevent match with the negative
example, but ensure the following:
 The new general models involve minimal changes.
 Each new general model is a generalization of some specific
model.
 No new general model is a specialization of some other general
model.
2. Prune away all the specific models that match the negative
example.
c. If S and G are both singleton sets, then:
 if they are identical, output their value and halt.
 if they are different, the training cases were inconsistent.
Output this result and halt.
 else continue accepting new training examples.
The algorithm stops when:
1. It runs out of data.
2. The number of hypotheses remaining is:
0 - no consistent description for the data in the language.
1 - answer (version space converges).
2+ - all descriptions in the language are implicitly included.
Comments on the Version Space Method

191AIC502T Machine Learning Techniques


Department of Artificial Intelligence and Data Science T.Kalaiselvi

The version space method is still a trial and error method.


The program does not base its choice of examples, or its learned heuristics, on an analysis
of what works or why it works, but rather on the simple assumption that what works will
probably work again.
Unlike the decision tree ID3 algorithm,

 Candidate-elimination searches an incomplete set of hypotheses (ie. only a subset


of the potentially teachable concepts are included in the hypothesis space).
 Candidate-elimination finds every hypothesis that is consistent with the training
data, meaning it searches the hypothesis space completely.
 Candidate-elimination's inductive bias is a consequence of how well it can
represent the subset of possible hypotheses it will search. In other words, the bias
is a product of its search space.
 No additional bias is introduced through Candidate-eliminations's search
strategy.
Advantages of the version space method:
 Can describe all the possible hypotheses in the language consistent with the data.
 Fast (close to linear).
Disadvantages of the version space method:
 Inconsistent data (noise) may cause the target concept to be pruned.
 Learning disjunctive concepts is challenging.

5. Inductive Bias

Every machine learning model requires some type of architecture design and
possibly some initial assumptions about the data we want to analyze. Generally,
every building block and every belief that we make about the data is a form of
inductive bias.

Inductive biases play an important role in the ability of machine learning models
to generalize to the unseen data. A strong inductive bias can lead our model to
converge to the global optimum. On the other hand, a weak inductive bias can
cause the model to find only the local optima and be greatly affected by random
changes in the initial states.

We can categorize inductive biases into two different groups called relational and
non-relational. The former represents the relationship between entities in the
network, while the latter is a set of techniques that further constrain the learning
algorithm.

191AIC502T Machine Learning Techniques


Department of Artificial Intelligence and Data Science T.Kalaiselvi

5.1 Inductive Biases in Machine Learning


In traditional machine learning, every algorithm has its own inductive biases. In
this section, we mention some of these algorithms.

 Bayesian Models
Inductive bias in Bayesian models shows itself in the form of the prior
distributions that we choose for the variables. Consequently, the prior can shape
the posterior distribution in a way that the latter can turn out to be a similar
distribution to the former. In addition, we assume that the variables are
conditionally independent, meaning that given the parents of a node in the
network, it’ll be independent from its ancestors. As a result, we can make use of
conditional probability to make the inference. Also, the structure of the Bayesian
net can facilitate the analysis of causal relationships between entities.

 k-Nearest Neighbors (k-NN) Algorithm


The k-Nearest Neighbors (k-NN) algorithm assumes that entities belonging to a
particular category should appear near each other, and those that are part of
different groups should be distant. In other words, we assume that similar data
points are clustered near each other away from the dissimilar ones.

 Linear Regression
Given the (X, Y) data points, in linear regression, we assume that the variable (Y)
is linearly dependent on the explanatory variables (X). Therefore, the resulting
model linearly fits the training data. However, this assumption can limit the
model’s capacity to learn non-linear functions.

 Logistic Regression
In logistic regression, we assume that there’s a hyperplane that separates the two
classes from each other. This simplifies the problem, but one can imagine that if
the assumption is not valid, we won’t have a good model.
6. Decision Tree Algorithm
Decision Tree is a Supervised learning technique that can be used for both classification
and Regression problems, but mostly it is preferred for solving Classification problems.
It is a tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules and each leaf node represents the outcome.
7. Representation
In a Decision tree, there are two nodes, which are the Decision Node and Leaf Node.
Decision nodes are used to make any decision and have multiple branches, whereas Leaf
nodes are the output of those decisions and do not contain any further branches.
The decisions or the test are performed on the basis of features of the given dataset.

191AIC502T Machine Learning Techniques


Department of Artificial Intelligence and Data Science T.Kalaiselvi

It is a graphical representation for getting all the possible solutions to a


problem/decision based on given conditions.
It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.
A decision tree simply asks a question, and based on the answer (Yes/No), it further split
the tree into subtrees.

Why use Decision Trees?


There are various algorithms in Machine learning, so choosing the best algorithm for the
given dataset and problem is the main point to remember while creating a machine
learning model. Below are the two reasons for using the Decision tree:
Decision Trees usually mimic human thinking ability while making a decision, so it is easy
to understand.
The logic behind the decision tree can be easily understood because it shows a tree-like
structure.
Decision Tree Terminologies

191AIC502T Machine Learning Techniques


Department of Artificial Intelligence and Data Science T.Kalaiselvi

Root Node: Root node is from where the decision tree starts. It represents the entire
dataset, which further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further
after getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other nodes
are called the child nodes.

191AIC502T Machine Learning Techniques


Department of Artificial Intelligence and Data Science T.Kalaiselvi

8. Algorithm
In a decision tree, for predicting the class of the given dataset, the algorithm starts from
the root node of the tree. This algorithm compares the values of root attribute with the
record (real dataset) attribute and, based on the comparison, follows the branch and
jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-
nodes and move further. It continues the process until it reaches the leaf node of the tree.
The complete process can be better understood using the below algorithm:
Step-1: Begin the tree with the root node, says S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using Attribute Selection Measure (ASM).
Step-3: Divide the S into subsets that contains possible values for the best attributes.
Step-4: Generate the decision tree node, which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in
step -3. Continue this process until a stage is reached where you cannot further classify
the nodes and called the final node as a leaf node.
Example: Suppose there is a candidate who has a job offer and wants to decide whether
he should accept the offer or Not. So, to solve this problem, the decision tree starts with
the root node (Salary attribute by ASM). The root node splits further into the next
decision node (distance from the office) and one leaf node based on the corresponding
labels. The next decision node further gets split into one decision node (Cab facility) and
one leaf node. Finally, the decision node splits into two leaf nodes (Accepted offers and
Declined offer).
Consider the below diagram:

191AIC502T Machine Learning Techniques


Department of Artificial Intelligence and Data Science T.Kalaiselvi

Attribute Selection Measures


While implementing a Decision tree, the main issue arises that how to select the best
attribute for the root node and for sub-nodes. So, to solve such problems there is a
technique which is called as Attribute selection measure or ASM. By this measurement,
we can easily select the best attribute for the nodes of the tree. There are two popular
techniques for ASM, which are: Information Gain and Gini Index
1. Information Gain:

 Information gain is the measurement of changes in entropy after the


segmentation of a dataset based on an attribute.
 It calculates how much information a feature provides us about a class.
 According to the value of information gain, we split the node and build the
decision tree.
 A decision tree algorithm always tries to maximize the value of information
gain, and a node/attribute having the highest information gain is split first.
It can be calculated using the below formula:
Information Gain= Entropy(S)- [(Weighted Avg) *Entropy(each feature)
Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies
randomness in data. Entropy can be calculated as:
Entropy(s)= -P(yes)log2 P(yes)- P(no) log2 P(no)
Where, S= Total number of samples P(yes)= probability of yes P(no)= probability
of no
2.Gini Index:
 Gini index is a measure of impurity or purity used while creating a decision tree in
the CART(Classification and Regression Tree) algorithm.
 An attribute with the low Gini index should be preferred as compared to the high
Gini index.
 It only creates binary splits, and the CART algorithm uses the Gini index to create
binary splits.
 Gini index can be calculated using the below formula: Gini Index= 1- ∑jPj2
Pruning: Getting an Optimal Decision tree
Pruning is a process of deleting the unnecessary nodes from a tree in order to get the
optimal decision tree.
A too-large tree increases the risk of overfitting, and a small tree may not capture all the
important features of the dataset. Therefore, a technique that decreases the size of the
learning tree without reducing accuracy is known as Pruning. There are mainly two types
of tree pruning technology used: Cost Complexity Pruning and Reduced Error Pruning.

191AIC502T Machine Learning Techniques


Department of Artificial Intelligence and Data Science T.Kalaiselvi

Advantages of the Decision Tree


 It is simple to understand as it follows the same process which a human follow
while making any decision in real-life.
 It can be very useful for solving decision-related problems.
 It helps to think about all the possible outcomes for a problem.
 There is less requirement of data cleaning compared to other algorithms.
Disadvantages of the Decision Tree
 The decision tree contains lots of layers, which makes it complex.
 It may have an overfitting issue, which can be resolved using the Random Forest
algorithm.
 For more class labels, the computational complexity of the decision tree may
increase.

Steps will also remain the same, which are given below:
 Data Pre-processing step
 Fitting a Decision-Tree algorithm to the Training set
 Predicting the test result
 Test accuracy of the result(Creation of Confusion matrix)
 Visualizing the test set result.
Data set given is
+----+-------+-------+------------------+----------+
| ID | Fever | Cough | Breathing issues | Infected |
+----+-------+-------+------------------+----------+
| 1 | No | No | No | No |
+----+-------+-------+------------------+----------+
| 2 | Yes | Yes | Yes | Yes |
+----+-------+-------+------------------+----------+
| 3 | Yes | Yes | No | No |
+----+-------+-------+------------------+----------+
| 4 | Yes | No | Yes | Yes |
+----+-------+-------+------------------+----------+
| 5 | Yes | Yes | Yes | Yes |
+----+-------+-------+------------------+----------+
| 6 | No | Yes | No | No |
+----+-------+-------+------------------+----------+
| 7 | Yes | No | Yes | Yes |
+----+-------+-------+------------------+----------+
| 8 | Yes | No | Yes | Yes |
+----+-------+-------+------------------+----------+
| 9 | No | Yes | Yes | Yes |
+----+-------+-------+------------------+----------+

191AIC502T Machine Learning Techniques


Department of Artificial Intelligence and Data Science T.Kalaiselvi

| 10 | Yes | Yes | No | Yes |


+----+-------+-------+------------------+----------+
| 11 | No | Yes | No | No |
+----+-------+-------+------------------+----------+
| 12 | No | Yes | Yes | Yes |
+----+-------+-------+------------------+----------+
| 13 | No | Yes | Yes | No |
+----+-------+-------+------------------+----------+
| 14 | Yes | Yes | No | No |
+----+-------+-------+------------------+----------+
The columns are self-explanatory. Y and N stand for Yes and No respectively. The values
or classes in Infected column Y and N represent Infected and Not Infected respectively.
The columns used to make decision nodes viz. ‘Breathing Issues’, ‘Cough’ and ‘Fever’ are
called feature columns or just features and the column used for leaf nodes i.e. ‘Infected’
is called the target column.
Metrics in ID3
ID3 algorithm selects the best feature at each step while building a Decision tree. ‘How
does ID3 select the best feature?’ is that ID3 uses Information Gain or just Gain to find
the best feature.Information Gain calculates the reduction in the entropy and measures
how well a given feature separates or classifies the target classes. The feature with the
highest Information Gain is selected as the best one. In simple words, Entropy is the
measure of disorder and the Entropy of a dataset is the measure of disorder in the
target feature of the dataset. In the case of binary classification (where the target
column has only two types of classes) entropy is 0 if all values in the target column are
homogenous(similar) and will be 1 if the target column has equal number values for
both the classes.
We denote our dataset as S, entropy is calculated as:
Entropy(S) = - ∑ pᵢ * log₂(pᵢ) ; i = 1 to n
where, n is the total number of classes in the target column (in our case n = 2 i.e YES and
NO) pᵢ is the probability of class ‘i’ or the ratio of “number of rows with class i in the
target column” to the “total number of rows” in the dataset.
Information Gain for a feature column A is calculated as:
IG(S, A) = Entropy(S) - ∑((|Sᵥ| / |S|) * Entropy(Sᵥ))
where Sᵥ is the set of rows in S for which the feature column A has value v, |Sᵥ| is the
number of rows in Sᵥ and likewise |S| is the number of rows in S.
ID3 Steps
1. Calculate the Information Gain of each feature.
2. Considering that all rows don’t belong to the same class, split the dataset S into
subsets using the feature for which the Information Gain is maximum.
3. Make a decision tree node using the feature with the maximum Information gain.
4. If all rows belong to the same class, make the current node as a leaf node with
the class as its label.
5. Repeat for the remaining features until we run out of all features, or the decision
tree has all leaf nodes.
Implementation on our Dataset

191AIC502T Machine Learning Techniques


Department of Artificial Intelligence and Data Science T.Kalaiselvi

As stated in the previous section the first step is to find the best feature i.e. the one that
has the maximum Information Gain(IG). We’ll calculate the IG for each of the features
now, but for that, we first need to calculate the entropy of S
From the total of 14 rows in our dataset S, there are 8 rows with the target value YES
and 6 rows with the target value NO. The entropy of S is calculated as:
Entropy(S) = — (8/14) * log₂(8/14) — (6/14) * log₂(6/14) = 0.99
Note: If all the values in our target column are same the entropy will be zero (meaning
that it has no or zero randomness).
We now calculate the Information Gain for each feature:
IG calculation for Fever:
In this(Fever) feature there are 8 rows having value YES and 6 rows having value NO.
As shown below, in the 8 rows with YES for Fever, there are 6 rows having target value
YES and 2 rows having target value NO.
+-------+-------+------------------+----------+
| Fever | Cough | Breathing issues | Infected |
+-------+-------+------------------+----------+
| Yes | Yes | Yes | Yes |
+-------+-------+------------------+----------+
| Yes | Yes | No | No |
+-------+-------+------------------+----------+
| Yes | No | Yes | Yes |
+-------+-------+------------------+----------+
| Yes | Yes | Yes | Yes |
+-------+-------+------------------+----------+
| Yes | No | Yes | Yes |
+-------+-------+------------------+----------+
| Yes | No | Yes | Yes |
+-------+-------+------------------+----------+
| Yes | Yes | No | Yes |
+-------+-------+------------------+----------+
| Yes | Yes | No | No |
+-------+-------+------------------+----------+
As shown below, in the 6 rows with NO, there are 2 rows having target value YES and 4
rows having target value NO.
+-------+-------+------------------+----------+
| Fever | Cough | Breathing issues | Infected |
+-------+-------+------------------+----------+
| No | No | No | No |
+-------+-------+------------------+----------+
| No | Yes | No | No |
+-------+-------+------------------+----------+
| No | Yes | Yes | Yes |
+-------+-------+------------------+----------+
| No | Yes | No | No |
+-------+-------+------------------+----------+
| No | Yes | Yes | Yes |

191AIC502T Machine Learning Techniques


Department of Artificial Intelligence and Data Science T.Kalaiselvi

+-------+-------+------------------+----------+
| No | Yes | Yes | No |
+-------+-------+------------------+----------+
The block, below, demonstrates the calculation of Information Gain for Fever.
# total rows |S| = 14For v = YES, |Sᵥ| = 8
Entropy(Sᵥ) = - (6/8) * log₂(6/8) - (2/8) * log₂(2/8) = 0.81For v = NO, |Sᵥ| = 6
Entropy(Sᵥ) = - (2/6) * log₂(2/6) - (4/6) * log₂(4/6) = 0.91# Expanding the summation
in the IG formula:
IG(S, Fever) = Entropy(S) - (|Sʏᴇꜱ| / |S|) * Entropy(Sʏᴇꜱ) -
(|Sɴᴏ| / |S|) * Entropy(Sɴᴏ)∴ IG(S, Fever) = 0.99 - (8/14) * 0.81 - (6/14) * 0.91 = 0.13
Next, we calculate the IG for the features “Cough” and “Breathing issues”.
You can use this free online tool to calculate the Information Gain.
IG(S, Cough) = 0.04
IG(S, BreathingIssues) = 0.40
Since the feature Breathing issues have the highest Information Gain it is used to create
the root node.
Hence, after this initial step our tree looks like this:

Next, from the remaining two unused features, namely, Fever and Cough, we decide
which one is the best for the left branch of Breathing Issues.
Since the left branch of Breathing Issues denotes YES, we will work with the subset of
the original data i.e the set of rows having YES as the value in the Breathing Issues
column. These 8 rows are shown below:
+-------+-------+------------------+----------+
| Fever | Cough | Breathing issues | Infected |
+-------+-------+------------------+----------+
| Yes | Yes | Yes | Yes |
+-------+-------+------------------+----------+
| Yes | No | Yes | Yes |
+-------+-------+------------------+----------+
| Yes | Yes | Yes | Yes |
+-------+-------+------------------+----------+
| Yes | No | Yes | Yes |
+-------+-------+------------------+----------+
| Yes | No | Yes | Yes |
+-------+-------+------------------+----------+
| No | Yes | Yes | Yes |
+-------+-------+------------------+----------+
| No | Yes | Yes | Yes |
+-------+-------+------------------+----------+
| No | Yes | Yes | No |
+-------+-------+------------------+----------+
Next, we calculate the IG for the features Fever and Cough using the subset Sʙʏ (Set
Breathing Issues Yes) which is shown above :
Note: For IG calculation the Entropy will be calculated from the subset Sʙʏ and not the
original dataset S.

191AIC502T Machine Learning Techniques


Department of Artificial Intelligence and Data Science T.Kalaiselvi

IG(Sʙʏ, Fever) = 0.20


IG(Sʙʏ, Cough) = 0.09
IG of Fever is greater than that of Cough, so we select Fever as the left branch of
Breathing Issues:
Next, we find the feature with the maximum IG for the right branch of Breathing Issues.
But, since there is only one unused feature left we have no other choice but to make it
the right branch of the root node.
So our tree now looks like this:
There are no more unused features, so we stop here and jump to the final step of
creating the leaf nodes.
For the left leaf node of Fever, we see the subset of rows from the original data set that
has Breathing Issues and Fever both values as YES.
+-------+-------+------------------+----------+
| Fever | Cough | Breathing issues | Infected |
+-------+-------+------------------+----------+
| Yes | Yes | Yes | Yes |
+-------+-------+------------------+----------+
| Yes | No | Yes | Yes |
+-------+-------+------------------+----------+
| Yes | Yes | Yes | Yes |
+-------+-------+------------------+----------+
| Yes | No | Yes | Yes |
+-------+-------+------------------+----------+
| Yes | No | Yes | Yes |
+-------+-------+------------------+----------+
Since all the values in the target column are YES, we label the left leaf node as YES, but
to make it more logical we label it infected. Similarly, for the right node of Fever we see
the subset of rows from the original data set that have Breathing Issues value as YES
and Fever as NO.

+-------+-------+------------------+----------+
| Fever | Cough | Breathing issues | Infected |
+-------+-------+------------------+----------+
| No | Yes | Yes | Yes |
+-------+-------+------------------+----------+
| No | Yes | Yes | No |
+-------+-------+------------------+----------+
| No | Yes | Yes | No |
+-------+-------+------------------+----------+

Here not all but most of the values are NO, hence NO or Not Infected becomes our right
leaf node.
We repeat the same process for the node Cough, however here both left and right leaves
turn out to be the same i.e. NO or Not Infected as shown below:

191AIC502T Machine Learning Techniques


Department of Artificial Intelligence and Data Science T.Kalaiselvi

Find S algorithm

https://www.youtube.com/watch?v=O6vwN74aSGY&ab_channel=MaheshHuddar

191AIC502T Machine Learning Techniques

You might also like