Learning Problems & Concept Learning
Kalaiselvi
191AIC502T
Machine Learning Techniques
UNIT 1
INTRODUCTION
Learning Problems – Perspectives and Issues – Concept Learning – Version Spaces and
Candidate Eliminations –Inductive bias –Decision Tree learning–Representation–
Algorithm –Heuristic Space Search.
1. Learning :
“Learning denotes changes in a system that ... enable a system to do the same task ... more efficiently the
next time.” - Herbert Simon
“Learning is constructing or modifying representations of what is being experienced.” - Ryszard
Michalski
“Learning is making useful changes in our minds.” - Marvin Minsky
1. Data storage
Facilities for storing and retrieving huge amounts of data are an important component of the
learning process. Humans and computers alike utilize data storage as a foundation for advanced
reasoning.
• In a human being, the data is stored in the brain and data is retrieved using electrochemical
signals.
• Computers use hard disk drives, flash memory, random access memory and similar devices to
store data, and use cables and other technology to retrieve data.
2. Abstraction
The second component of the learning process is known as abstraction. Abstraction is the process
of extracting knowledge about stored data. This involves creating general concepts about the data
as a whole. The creation of knowledge involves the application of known models and the creation of new
models. The process of fitting a model to a dataset is known as training. When the model has been
trained, the data is transformed into an abstract form that summarizes the original information.
3. Generalization
The third component of the learning process is known as generalization. The term generalization
describes the process of turning the knowledge about stored data into a form that can be utilized
for future action. These actions are to be carried out on tasks that are similar, but not identical, to
those that have been seen before. In generalization, the goal is to discover those properties of
the data that will be most relevant to future tasks.
4. Evaluation
Evaluation is the last component of the learning process. It is the process of giving feedback to the
user to measure the utility of the learned knowledge. This feedback is then utilised to effect
improvements in the whole learning process.
Geometric models
Here we consider models that define similarity by considering the geometry of the instance
space. In Geometric models, features could be described as points in two dimensions (x- and y-
axis) or a three-dimensional space (x, y, and z). Even when features are not intrinsically
geometric, they could be modelled in a geometric manner (for example, temperature as a function
of time can be modelled in two axes). In geometric models, there are two ways we could impose
similarity.
We could use geometric concepts like lines or planes to segment (classify) the instance space.
These are called Linear models.
Alternatively, we can use the geometric notion of distance to represent similarity. In this
case, if two points are close together, they have similar values for features and thus can be classed
as similar. We call such models Distance-based models.
Linear models
Linear models are relatively simple. In this case, the function is represented as a linear
combination of its inputs. Thus, if x1 and x2 are two scalars or vectors of the same dimension and
a and b are arbitrary scalars, then ax1 + bx2 represents a linear combination of x1 and x2. In the
simplest case where f(x) represents a straight line, we have an equation of the form f (x) = mx + c
where c represents the intercept and m represents the slope.
Linear models are parametric, which means that they have a fixed form with a small number of
numeric parameters that need to be learned from data. For example, in f (x) = mx + c, m and c are
the parameters that we are trying to learn from the data. This technique is different from tree or
rule models, where the structure of the model (e.g., which features to use in the tree, and where)
is not fixed in advance.
Linear models are stable, i.e., small variations in the training data have only a limited impact on
the learned model. In contrast, tree models tend to vary more with the training data, as the
choice of a different split at the root of the tree typically means that the rest of the tree is different
as well. As a result of having relatively few parameters, Linear models have low variance and high
bias. This implies that Linear models are less likely to overfit the training data than some
other models. However, they are more likely to underfit. For example, if we want to learn the
boundaries between countries based on labelled data, then linear models are not likely to give a
good approximation.
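The sketch below shows this parameter learning for f(x) = mx + c using ordinary least squares in NumPy. The data points are hypothetical, generated around a known line, purely to illustrate the idea.

import numpy as np

# Hypothetical training data: y is roughly 2x + 1 plus a little noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.shape)

# Least-squares fit of degree 1 returns the slope m and intercept c.
m, c = np.polyfit(x, y, deg=1)
print(f"learned slope m = {m:.2f}, intercept c = {c:.2f}")

# The learned parametric model can now generalise to unseen inputs.
print(f"prediction at x = 12: {m * 12 + c:.2f}")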
Distance-based models
Distance-based models are the second class of Geometric models. Like Linear models, distance
based models are based on the geometry of data. As the name implies, distance-based models
work on the concept of distance. In the context of Machine learning, the concept of distance is not
based on merely the physical distance between two points. Instead, we could think of the distance
between two points considering the mode of transport between two points. Travelling between
two cities by plane covers less distance physically than by train because a plane is unrestricted.
Similarly, in chess, the concept of distance depends on the piece used – for example, a Bishop can
move diagonally. Thus, depending on the entity and the mode of travel, the concept of distance
can be experienced differently. The distance metrics commonly used are Euclidean, Minkowski,
Manhattan, and Mahalanobis.
Distance is applied through the concept of neighbours and exemplars. Neighbours are points in
proximity with respect to the distance measure expressed through exemplars. Exemplars are
either centroids that find a centre of mass according to a chosen distance metric or medoids that
find the most centrally located data point. The most commonly used centroid is the arithmetic
mean, which minimises squared Euclidean distance to all other points.
Notes:
• The centroid of a plane figure is the arithmetic mean position of all the points in the figure. This definition extends to any
object in n-dimensional space: its centroid is the mean position of all the points.
• Medoids are similar in concept to centroids, but they are always restricted to be members of the data set; they are used for
data when a mean or centroid cannot be defined. They are used in contexts where the centroid
is not representative of the dataset, such as in image data.
Examples of distance-based models include the nearest-neighbour models, which use the training
data as exemplars – for example, in classification. The K-means clustering algorithm also uses
exemplars to create clusters of similar data points.
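The sketch below illustrates these distance metrics and a nearest-neighbour lookup in NumPy. The points, exemplars, and labels are hypothetical, chosen only to show the computations.

import numpy as np

a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))          # straight-line distance: 5.0
manhattan = np.sum(np.abs(a - b))                  # taxicab/grid distance: 7.0
minkowski = np.sum(np.abs(a - b) ** 3) ** (1 / 3)  # p = 3 generalisation: ~4.5

# Nearest-neighbour classification: label a query by its closest exemplar.
exemplars = np.array([[0, 0], [5, 5], [10, 0]])
labels = ["A", "B", "C"]
query = np.array([4, 4])
nearest = np.argmin(np.linalg.norm(exemplars - query, axis=1))
print(labels[nearest])  # "B", since [5, 5] is closest to [4, 4]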
Probabilistic models
The third family of machine learning algorithms is the probabilistic models. We have seen
before that the k-nearest neighbour algorithm uses the idea of distance (e.g., Euclidian distance)
to classify entities, and logical models use a logical expression to partition the instance space. In
this section, we see how the probabilistic models use the idea of probability to classify new
entities. Probabilistic models see features and target variables as random variables. The process
of modelling represents and manipulates the level of uncertainty with respect to these variables.
There are two types of probabilistic models: Predictive and Generative. Predictive probability
models use the idea of a conditional probability distribution P(Y|X), from which Y can be
predicted from X. Generative models estimate the joint distribution P(Y, X). Once we know the
joint distribution for the generative models, we can derive any conditional or marginal
distribution involving the same variables.
Thus, the generative model is capable of creating new data points and their labels, knowing the
joint probability distribution. The joint distribution looks for a relationship between two
variables. Once this relationship is inferred, it is possible to infer new data points.
Naïve Bayes is an example of a probabilistic classifier.
We can do this using the Bayes rule, defined as
P(A|B) = P(B|A) · P(A) / P(B)
The Naïve Bayes algorithm is based on the idea of Conditional Probability. Conditional probability
is based on finding the probability that something will happen, given that something else has
already happened. The task of the algorithm then is to look at the evidence and to determine the
likelihood of a specific class and assign a label accordingly to each entity.
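A small numeric illustration of the Bayes rule follows. The setting (spam filtering) and all the probabilities are hypothetical, chosen only to show the mechanics of the rule.

# Bayes rule: P(A|B) = P(B|A) * P(A) / P(B), with hypothetical numbers.
# A = "email is spam", B = "email contains the word 'offer'".
p_a = 0.30              # prior P(spam)
p_b_given_a = 0.80      # likelihood P('offer' | spam)
p_b_given_not_a = 0.10  # P('offer' | not spam)

# Total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Posterior P(A|B): the evidence raises the spam probability from 0.30 to ~0.77.
print(p_b_given_a * p_a / p_b)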
Logical models:
Logical models use a logical expression to divide the instance space into segments and hence
construct grouping models. A logical expression is an expression that returns a Boolean value, i.e.,
a True or False outcome. Once the data is grouped using a logical expression, the data is divided
into homogeneous groupings for the problem we are trying to solve. For example, for a
classification problem, all the instances in the group belong to one class.
There are mainly two kinds of logical models:
Tree models and Rule models.
Rule models consist of a collection of implications or IF-THEN rules. For tree-based models, the
‘if-part’ defines a segment and the ‘then-part’ defines the behaviour of the model for this segment.
Rule models follow the same reasoning.
A Concept Learning Task called “Enjoy Sport”, as shown above, is defined by a set of data from
some example days. Each example is described by six attributes. The task is to learn to predict the
value of Enjoy Sport for an arbitrary day based on the values of its attributes. The problem
can be represented by a series of hypotheses. Each hypothesis is described by a conjunction of
constraints on the attributes. The training data represents a set of positive and negative examples
of the target function. In the example above, each hypothesis is a vector of six constraints,
specifying the values of the six attributes – Sky, AirTemp, Humidity, Wind, Water, and Forecast.
The training phase involves learning the set of days (as a conjunction of attributes) for which
Enjoy Sport = yes.
Thus, the problem can be formulated as:
Given instances X which represent a set of all possible days, each described by the attributes:
Sky – (values: Sunny, Cloudy, Rainy),
AirTemp – (values: Warm, Cold),
Humidity – (values: Normal, High),
Wind – (values: Strong, Weak),
Water – (values: Warm, Cold),
Forecast – (values: Same, Change).
Try to identify a function that can predict the target variable Enjoy Sport as yes/no, i.e., 1 or 0.
Grouping Model:
A grouping model in machine learning (ML) refers to a type of model that is used to categorize or
segment data points into different groups or clusters, based on their similarity or proximity. This
process is also known as clustering. The goal of clustering is to find structure in the data and
discover natural groupings or patterns within it.
There are several algorithms used for clustering in ML, including K-Means, Hierarchical
Clustering, DBSCAN, and Gaussian Mixture Models, among others. The choice of algorithm
depends on the nature of the data and the desired outcome. Clustering models are commonly
used in a variety of applications, such as customer segmentation, market research, anomaly
detection, and image segmentation.
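A minimal clustering sketch with scikit-learn's KMeans is given below. The two-dimensional points and the choice of two clusters are hypothetical, picked only to make the grouping obvious.

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],      # one group of points near x = 1
              [10, 2], [10, 4], [10, 0]])  # another group near x = 10

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster index assigned to each point
print(kmeans.cluster_centers_)  # the two centroids (arithmetic means)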
In unsupervised learning, the goal is to discover the underlying structure of the data, such as
grouping similar data points together or finding patterns. Examples of unsupervised learning
tasks include clustering and dimensionality reduction.
There are also semi-supervised and active learning techniques that are hybrids of supervised
and unsupervised learning.
Based on different business goals and data sets, there are three learning models for algorithms.
In general, machine learning algorithms can be classified into three types:
Supervised Learning: X, Y (pre-classified training examples). Given an observation x, what is the best label for y?
Unsupervised Learning: X (unlabeled inputs). Given a set of X's, cluster or summarize them.
Reinforcement Learning: Determine what to do based on rewards and punishments.
Supervised Learning:
Supervised learning is a type of machine learning (ML) where the model is trained on labeled
data, where the correct output is known. The goal of supervised learning is to learn a mapping
from input features to the corresponding output labels. This mapping is then used to make
predictions on new, unseen data.
Supervised learning algorithms can be divided into two main categories: regression and
classification. In regression problems, the goal is to predict a continuous value (e.g. price,
temperature), while in classification problems, the goal is to predict a discrete label (e.g. spam or
not spam, positive or negative sentiment).
Supervised learning algorithms include linear regression, logistic regression, decision trees,
random forests, support vector machines (SVMs), neural networks, and many others. The choice
of algorithm depends on the nature of the data and the desired outcome.
Supervised learning is widely used in various applications, including natural language processing,
computer vision, recommendation systems, and predictive analytics, among others.
A training set of examples with the correct responses (targets) is provided and, based on this
training set, the algorithm generalises to respond correctly to all possible inputs. This is also called
learning from exemplars. Supervised learning is the machine learning task of learning a function
that maps an input to an output based on example input-output pairs. In supervised learning,
each example in the training set is a pair consisting of an input object (typically a vector) and an
output value. A supervised learning algorithm analyzes the training data and produces a function,
which can be used for mapping new examples. In the optimal case, the function will correctly
determine the class labels for unseen instances. Both classification and regression problems are
supervised learning problems. A wide range of supervised learning algorithms are available, each
with its strengths and weaknesses. There is no single learning algorithm that works best on all
supervised learning problems.
Remarks
Supervised learning is so called because the process of an algorithm learning from the training
dataset can be thought of as a teacher supervising the learning process. We know the correct
answers (that is, the correct outputs); the algorithm iteratively makes predictions on the training
data and is corrected by the teacher. Learning stops when the algorithm achieves an acceptable
level of performance. Consider the following data regarding patients entering a clinic. The data
consists of the gender and age of the patients and each patient is labeled as “healthy” or “sick”.
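Since the clinic table itself is not reproduced here, the sketch below uses hypothetical stand-in rows (gender encoded as M = 0, F = 1) to show how such labeled data trains a classifier; the decision tree classifier is just one possible choice.

from sklearn.tree import DecisionTreeClassifier

# Hypothetical [gender, age] rows and "healthy"/"sick" labels.
X = [[0, 48], [1, 60], [0, 26], [1, 33], [0, 70], [1, 55]]
y = ["sick", "healthy", "healthy", "healthy", "sick", "sick"]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.predict([[1, 40]]))  # predicted label for an unseen patient (F, 40)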
Unsupervised learning:
Unsupervised learning is a type of machine learning (ML) where the model is trained on
unlabeled data, where the correct output is unknown. The goal of unsupervised learning is to
discover the underlying structure of the data, such as grouping similar data points together,
identifying patterns or relationships, or finding lower-dimensional representations of the data.
Unsupervised learning algorithms can be divided into two main categories: clustering and
dimensionality reduction. In clustering, the goal is to divide the data into groups or clusters based
on similarity. In dimensionality reduction, the goal is to reduce the number of features or
dimensions in the data while preserving as much information as possible.
Unsupervised learning is a type of machine learning algorithm used to draw inferences from
datasets consisting of input data without labeled responses. Correct responses are not provided,
but instead the algorithm tries to identify similarities between the inputs so that inputs that have
something in common are categorized together. The statistical approach to unsupervised
learning is known as density estimation.
In unsupervised learning algorithms,
a classification or categorization is not included in the observations. There are no output values
and so there is no estimation of functions. Since the examples given to the learner are unlabeled,
the accuracy of the structure that is output by the algorithm cannot be evaluated. The most
common unsupervised learning method is cluster analysis, which is used for exploratory data
analysis to find hidden patterns or grouping in data.
Unlike supervised learning algorithms, where we deal with labelled data for training, the training
data will be unlabelled for Unsupervised Machine Learning Algorithms. The clustering of data
into a specific group will be done on the basis of the similarities between the variables.
Example
Consider the following data regarding patients entering a clinic. The data consists of the gender
and age of the patients.
Based on this data, can we infer anything regarding the patients entering the clinic?
Semi-supervised learning:
Semi-supervised learning trains a model on a mix of labeled and unlabeled data. The goal of
semi-supervised learning is to leverage the information in the unlabeled data to
improve the performance of the model on the labeled data. The model uses the labeled data to
make predictions, and then it uses the predictions to propagate labels to the unlabeled data. This
process can help the model learn additional information from the unlabeled data, leading to
improved performance.
Reinforcement learning:
Reinforcement learning (RL) is a type of machine learning (ML) where an agent interacts with an
environment to maximize a reward signal. The agent learns from the environment through trial
and error by taking actions and observing the resulting rewards or penalties. Over time, the agent
learns to make better decisions that lead to higher rewards.
In RL, the agent's goal is to find an optimal policy, which is a mapping from states to actions that
maximizes the expected cumulative reward over time. The policy is learned through a process of
exploration and exploitation, where the agent balances taking actions it knows will lead to high
rewards with taking actions that may lead to even higher rewards in the future.
Reinforcement learning algorithms include Q-learning, SARSA, policy gradients, and deep
reinforcement learning, among others. The choice of algorithm depends on the nature of the
environment and the desired outcome.
Reinforcement learning is widely used in various applications, including game playing, robotics,
recommendation systems, and autonomous systems, among others.
This is somewhere between supervised and unsupervised learning. The algorithm gets told
when the answer is wrong, but does not get told how to correct it. It has to explore and try out
different possibilities until it works out how to get the answer right. Reinforcement learning is
sometimes called learning with a critic because of this monitor that scores the answer but does
not suggest improvements.
Reinforcement learning is the problem of getting an agent to act in the world so as to maximize
its rewards. A learner (the program) is not told what actions to take as in most forms of machine
learning, but instead must discover which actions yield the most reward by trying them. In the
most interesting and challenging cases, actions may affect not only the immediate reward but also
the next situations and, through that, all subsequent rewards.
Example
Consider teaching a dog a new trick: we cannot tell it what to do, but we can reward/punish it if
it does the right/wrong thing. It has to find out what it did that made it get the
reward/punishment. We can use a similar method to train computers to do many tasks, such as
playing backgammon or chess, scheduling jobs, and controlling robot limbs. Reinforcement
learning is different from supervised learning. Supervised learning is learning from examples
provided by a knowledgeable expert.
Examples: Q-learning, temporal-difference learning, and deep reinforcement learning
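The sketch below shows tabular Q-learning on a hypothetical five-state corridor: states 0 to 4, actions left/right, and a reward of +1 for reaching state 4. The learning rate, discount factor, exploration rate, and episode count are illustrative choices.

import random

N_STATES, ACTIONS = 5, (0, 1)          # actions: 0 = left, 1 = right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1
Q = [[0.0, 0.0] for _ in range(N_STATES)]

def step(state, action):
    """Move left or right; reward +1 only on reaching the goal state 4."""
    nxt = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    return nxt, (1.0 if nxt == N_STATES - 1 else 0.0)

for _ in range(500):                   # episodes of trial and error
    s = 0
    while s != N_STATES - 1:
        # Epsilon-greedy: usually exploit; explore (and break ties) randomly.
        if random.random() < EPSILON or Q[s][0] == Q[s][1]:
            a = random.choice(ACTIONS)
        else:
            a = 0 if Q[s][0] > Q[s][1] else 1
        s2, r = step(s, a)
        # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a').
        Q[s][a] += ALPHA * (r + GAMMA * max(Q[s2]) - Q[s][a])
        s = s2

print([round(max(q), 2) for q in Q])   # learned values rise toward the goal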
In a dataset, a training set is implemented to build up a model, while a test (or validation) set is
to validate the model built. Data points in the training set are excluded from the test (validation)
set.
In Machine Learning, we basically try to create a model to predict on the test data. Usually, a dataset
is divided into a training set and a validation set (some people use 'test set' instead) in each iteration,
or into a training set, a validation set and a test set in each iteration.
• Training Set: Here, you have the complete training dataset. You can extract features and
train to fit a model and so on.
• Validation Set: This is crucial for choosing the right parameters for your estimator. We can
divide the training set into a train set and a validation set. Based on the validation
results, the model can be tuned (for instance, by changing parameters or classifiers).
• Testing Set: Here, once the model is obtained, you can make predictions using the model
trained on the training set.
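A short sketch of carving out these three sets with scikit-learn's train_test_split follows; the 60/20/20 proportions and the stand-in data are illustrative choices.

from sklearn.model_selection import train_test_split

X = list(range(100))    # stand-in features
y = [i % 2 for i in X]  # stand-in labels

# First hold out 20% as the final test set...
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# ...then split the remainder 75/25, giving a validation set (0.25 * 0.8 = 0.2 overall).
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20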
The main steps in designing a machine learning system are:
Problem definition: Clearly define the problem that you want to solve and identify the
performance metrics that will be used to evaluate the system's success.
Data collection: Gather and preprocess the data that will be used to train and test the ML model.
Ensure that the data is clean, well-labeled, and relevant to the problem.
Feature engineering: Identify and extract relevant features from the data that can be used as input
to the ML model. Feature engineering is a critical step that can greatly impact the performance of
the ML system.
Model selection: Choose an appropriate ML algorithm that suits the problem and the data.
Consider factors such as the size and complexity of the data, the desired outcome, and the
computational resources available.
Model training: Train the ML model using the pre-processed data. Fine-tune the model
parameters to achieve the best performance.
Model evaluation: Evaluate the performance of the ML model using a validation dataset. Compare
the performance to the baseline and the performance metrics defined in step 1.
Model deployment: Deploy the ML model in a production environment, ensuring that it can
handle real-world scenarios and changing data distributions.
Monitoring and maintenance: Monitor the performance of the ML system and perform regular
maintenance to keep it up-to-date with new data and changing requirements.
It is important to iterate through these steps and make changes as necessary to improve the
performance of the ML system. Machine learning is a constantly evolving field, and it is important
to stay up-to-date with new techniques and algorithms that may be more suitable for your
problem.
Just now we looked into the learning process and also understood the goal of the learning. When we
want to design a learning system that follows the learning process, we need to consider a few design
choices. The design choices will be to decide the following key components:
1. Choosing the training experience
2. Choosing the Target Function
3. Choosing a representation for the Target Function
4. Choosing a Function Approximation Algorithm
5. The final Design
We will look into the game - checkers learning problem and apply the above design choices. For a
checkers learning problem, the three elements will be,
1. Task T: To play checkers
2. Performance measure P: Percent of games won in the tournament.
3. Training experience E: A set of games played against itself
Choosing the training experience:
The first design choice we face is to choose the type of training experience from which our system will
learn. The type of training experience available can have a significant impact on success or failure of
the learner. One key attribute is whether the training experience provides direct or indirect feedback
regarding the choices made by the performance system.
During the design of the checker's learning system, the type of training experience available for a
learning system will have a significant effect on the success or failure of the learning.
1. Direct or Indirect training experience — In the case of direct training experience, individual
board states and the correct move for each board state are given.
In the case of indirect training experience, the move sequences for a game and the final result (win, loss
or draw) are given for a number of games. How to assign credit or blame to individual moves is the
credit assignment problem.
2. Teacher or Not — Supervised — The training experience will be labeled, which means, all the board
states will be labeled with the correct move. So the learning takes place in the presence of a
supervisor or a teacher.
Unsupervised — The training experience will be unlabeled, which means, all the board states will not
have the moves. So the learner generates random games and plays against itself with no supervision
or teacher involvement.
Semi-supervised — Learner generates game states and asks the teacher for help in finding the
correct move if the board state is confusing.
3. Is the training experience good — Do the training examples represent the distribution of examples
over which the final system performance will be measured? Performance is best when training
examples and test examples are from the same/a similar distribution.
The checkers player learns by playing against itself. Its experience is indirect. It may not encounter
moves that are common in human expert play. Once the proper training experience is available, the
next design step will be choosing the Target Function.
Choosing the Target Function
The next design choice is to determine exactly what type of knowledge will be learned and how this
will be used by the performance program. When you are playing the checkers game, at any moment of
time, you make a decision on
choosing the best move from different possibilities. You think and apply the learning that you have
gained from the experience. Here the learning is, for a specific board, you move a checker such that
your board state tends towards the winning situation. Now the same learning has to be defined in
terms of the target function.
Here there are 2 considerations — direct and indirect experience.
With direct experience, the checkers learning system needs only to learn how to choose
the best move from a large search space. We need to find a target function that will help
us choose the best move among the alternatives. Let us call this function ChooseMove and use the
notation ChooseMove : B → M to indicate that this function accepts as input any board from the
set of legal board states B and produces as output some move from the set of legal moves M.
With indirect experience, it becomes difficult to learn such a function. An alternative is to
assign a real score to each board state. So let the function be V : B → R, indicating that it accepts as
input any board from the set of legal board states B and produces as output a real score. This function
assigns higher scores to better board states.
If the system can successfully learn such a target function V, then it can easily use it to select the best
move from any board position. Let us therefore define the target value V(b) for an arbitrary board state
b in B, as follows:
1. if b is a final board state that is won, then V(b) = 100
2. if b is a final board state that is lost, then V(b) = -100
3. if b is a final board state that is drawn, then V(b) = 0
4. if b is not a final state in the game, then V(b) = V(b’), where b’ is the best final board state that can
be achieved starting from b and playing optimally until the end of the game.
Definition (4) is recursive: to determine the value of V(b) for a particular board state, it requires
searching ahead for the optimal line of play, all the way to the end of the game. Since this
definition is not efficiently computable by our checkers-playing program, we say that it is a
nonoperational definition.
The goal of learning, in this case, is to discover an operational description of V ; that is, a description
that can be used by the checkers-playing program to evaluate states and select moves within realistic
time bounds.
It may be very difficult in general to learn such an operational form of V perfectly. We expect learning
algorithms to acquire only some approximation to the target function ^V.
Choosing a Representation for the Target Function
In choosing a representation, we again have many options. We could, for example, allow the program
to represent ^V using a large table with a distinct entry specifying the value for each distinct board
state. Or we could allow it to represent ^V
using a collection of rules that match against features of the board state, or a quadratic polynomial
function of predefined board features, or an artificial neural network. In general, this choice of
representation involves a crucial tradeoff. On one hand, we wish to pick a very expressive
representation to allow representing as close an approximation as possible to the ideal target function
V. On the other hand, the more expressive the representation, the more training data the program will
require in order to choose among the alternative hypotheses it can represent. To keep the discussion
brief, let us choose a simple representation: for any given board state, the function ^V will be calculated
as a linear combination of the following board features:
x1(b) — number of black pieces on board b
x2(b) — number of red pieces on b
x3(b) — number of black kings on b
x4(b) — number of red kings on b
x5(b) — number of red pieces threatened by black (i.e., which can be taken on black’s next turn)
x6(b) — number of black pieces threatened by red
^V(b) = w0 + w1 · x1(b) + w2 · x2(b) + w3 · x3(b) + w4 · x4(b) + w5 · x5(b) + w6 · x6(b)
Where w0 through w6 are numerical coefficients or weights to be obtained by a learning algorithm.
Weights w1 to w6 will determine the relative importance of different board features.
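A minimal sketch of evaluating ^V(b) for a given board follows. The feature values and weights here are hypothetical, since finding good weights is exactly the job of the learning algorithm.

def v_hat(features, weights):
    """features = (x1..x6); weights = (w0..w6), with w0 the intercept."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))

board_features = (3, 0, 1, 0, 0, 0)                # x1..x6 for some board b
weights = (0.5, 1.0, -1.0, 3.0, -3.0, 0.4, -0.4)   # hypothetical w0..w6
print(v_hat(board_features, weights))              # 0.5 + 3 + 3 = 6.5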
Specification of the Machine Learning Problem at this time — Till now we worked on choosing the type
of training experience, choosing the target function and its representation. The checkers learning task
can be summarized as below.
Task T : Play Checkers
Performance Measure : % of games won in world tournament
Training Experience E : opportunity to play against itself
Target Function : V : Board → R
Target Function Representation : ^V = w0 + w1 · x1(b) + w2 · x2(b) + w3 · x3(b) + w4 · x4(b) +w5 ·
x5(b) + w6 · x6(b)
The first three items above correspond to the specification of the learning task, whereas the final two
items constitute design choices for the implementation of the learning program.
Choosing a Function Approximation Algorithm
Generating training data :
To train our learning program, we need a set of training examples, each describing a specific board
state b and the training value V_train(b) for b. Each training example is an ordered pair <b, V_train(b)>.
For example, a training example may be <(x1 = 3, x2 = 0, x3 = 1, x4 = 0, x5 = 0, x6 = 0), +100>. This is
an example where black has won the game, since x2 = 0 means red has no remaining pieces. However, such
clean values of V_train(b) can be obtained only for board states b that are a clear win, loss or draw.
For intermediate board states, one useful rule is to set the training value of b equal to the estimated
value ^V of its successor state:
V_train(b) ← ^V(Successor(b))
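To fit the weights, Mitchell's treatment uses the LMS (least mean squares) training rule, which nudges each weight in proportion to the prediction error and the corresponding feature value. A sketch, reusing v_hat and the hypothetical weights from the sketch above:

ETA = 0.01  # learning rate (illustrative choice)

def lms_update(weights, features, v_train):
    """w_i <- w_i + eta * (V_train(b) - ^V(b)) * x_i, with x0 = 1 for w0."""
    error = v_train - v_hat(features, weights)
    xs = (1,) + tuple(features)  # x0 = 1 pairs with the intercept w0
    return tuple(w + ETA * error * x for w, x in zip(weights, xs))

# Training example <b, V_train(b)>: a won final board state scored +100.
weights = lms_update(weights, (3, 0, 1, 0, 0, 0), 100.0)
print(weights)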
The Final Design
The final design of our checkers learning system can be described by four distinct program modules:
1. The Performance System — Takes a new board as input and outputs a trace of the game it played
against itself.
2. The Critic — Takes the trace of a game as an input and outputs a set of training examples of the
target function.
3. The Generalizer — Takes training examples as input and outputs a hypothesis that estimates the
target function. Good generalization to new cases is crucial.
4. The Experiment Generator — Takes the current hypothesis (currently learned function) as input and
outputs a new problem (an initial board state) for the performance system to explore.
Our checkers example raises a number of generic questions about machine learning. The field of
machine learning is concerned with answering questions such as the following:
What algorithms exist for learning general target functions from specific training examples? In what
settings will particular algorithms converge to the desired function, given sufficient
training data? Which algorithms perform best for which types of problems and representations?
How much training data is sufficient? What general bounds can be found to relate the confidence in
learned hypotheses to the amount of training experience and the character of the learner's hypothesis
space?
When and how can prior knowledge held by the learner guide the process of generalizing from examples?
Can prior knowledge be helpful even when it is only approximately correct?
What is the best strategy for choosing a useful next training experience, and how does the choice of this
strategy alter the complexity of the learning problem?
What is the best way to reduce the learning task to one or more function approximation
problems? Put another way, what specific functions should the system attempt to learn? Can
this process itself be automated?
How can the learner automatically alter its representation to improve its ability to represent
and learn the target function?
3. Concept Learning
Concept learning is the process of learning to recognize and categorize
objects or situations based on their attributes and relations. For example, a
concept learning system might learn to identify different types of animals based
on their shape, size, color, and behavior.
Concept Learning in Machine Learning can be thought of as a boolean-
valued function defined over a large set of training data. Taking a very simple
example, one possible target concept may be to Find the day when my friend
Ramesh enjoys his favourite sport. We have some attributes/features of the day
like, Sky, Air Temperature, Humidity, Wind, Water, Forecast and based on this we
have a target concept named Enjoy Sport. Let's design the problem formally with
TPE (Task, Performance, Experience):
Task T: Learn to predict the value of EnjoySport for an arbitrary day, based
on the values of the attributes of the day.
Performance measure P: The percentage of days for which the value of EnjoySport is
predicted correctly.
Training experience E: A set of example days, each described by its attribute values
and labeled with EnjoySport = yes/no.
Each hypothesis is a conjunction of constraints <x1, x2, x3, x4, x5, x6>,
where x1, x2, x3, x4, x5 and x6 are the values of Sky, AirTemp, Humidity,
Wind, Water and Forecast.
h1(x=1): <Sunny, Warm, Normal, Strong, Warm, Same > Note: x=1
represents a positive hypothesis / Positive example
We want to find the most suitable hypothesis which can represent the
concept. For example, Ramesh enjoys his favorite sport only on cold days with high
humidity (This seems independent of the values of the other attributes present in
the training examples).
Here, '?' indicates that any value of the attribute is acceptable, and 'ϕ' indicates
that no value is acceptable. Note: the most generic hypothesis will be
<?, ?, ?, ?, ?, ?>, where every day is a positive example, and the most specific
hypothesis will be <ϕ, ϕ, ϕ, ϕ, ϕ, ϕ>, where no day is a positive example.
Find-S Algorithm:
Initialization:
The algorithm commences by initializing the hypothesis to the most
specific representation, {ϕ, ϕ, ϕ, ϕ, ϕ, ϕ}. This starting hypothesis encapsulates
the absence of any prior knowledge about the concept being learned.
Convergence:
The iterative process continues until all training examples have been
processed. At this juncture, the algorithm yields a final hypothesis that
encapsulates the learned concept based on the provided data.
Final Hypothesis:
The culmination of the algorithm's iterative refinement is the derivation of
a final hypothesis. This hypothesis represents a synthesized abstraction of the
underlying concept, encapsulating the observed patterns and regularities within
the training data.
Algorithm:
1. Initialize h to the most specific hypothesis in H.
2. For each positive training instance x, and for each attribute constraint aᵢ in h:
a. If the constraint aᵢ is satisfied by x,
b. Then do nothing.
c. Else replace aᵢ in h by the next more general constraint that is satisfied by x.
3. Output hypothesis h.
Example:
Consider the following data set having the data about which particular
seeds are poisonous.
Consider example 1 :
The data in example 1 is {GREEN, HARD, NO, WRINKLED }. We see that our initial
hypothesis is more specific and we have to generalize it for this example. Hence,
the hypothesis becomes :
h = {GREEN, HARD, NO, WRINKLED}
Consider example 2 :
Here we see that this example has a negative outcome. Hence we neglect this
example and our hypothesis remains the same.
h = { GREEN, HARD, NO, WRINKLED }
Consider example 3 :
Here we see that this example has a negative outcome. Hence we neglect this
example and our hypothesis remains the same.
h = { GREEN, HARD, NO, WRINKLED }
Consider example 4 :
The data present in example 4 is { ORANGE, HARD, NO, WRINKLED }. We compare
every single attribute of the example with the current hypothesis and, if any
mismatch is found, we replace that particular attribute with the general case
( " ? " ). After doing this, the hypothesis becomes:
h = { ?, HARD, NO, WRINKLED }
Consider example 5 :
The data present in example 5 is { GREEN, SOFT, YES, SMOOTH }. We compare
every single attribute of the example with the current hypothesis and, if any
mismatch is found, we replace that particular attribute with the general case
( " ? " ). After doing this, the hypothesis becomes:
h = { ?, ?, ?, ? }
Since we have reached a point where all the attributes in our hypothesis
have the general condition, examples 6 and 7 would result in the same
hypothesis with all general attributes.
h = { ?, ?, ?, ? }
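A compact Find-S sketch for this example follows. The attribute values of the two negative examples are not listed in the notes, so a single hypothetical negative row stands in for them (negatives are ignored by Find-S anyway).

def find_s(examples):
    h = ['ϕ'] * len(examples[0][0])       # start from the most specific hypothesis
    for x, label in examples:
        if label != 'YES':                # Find-S ignores negative examples
            continue
        for i, (hv, xv) in enumerate(zip(h, x)):
            if hv == 'ϕ':
                h[i] = xv                 # first positive example: copy the value
            elif hv != xv:
                h[i] = '?'                # mismatch: generalise to '?'
    return h

examples = [
    (('GREEN', 'HARD', 'NO', 'WRINKLED'), 'YES'),   # example 1
    (('RED', 'SOFT', 'YES', 'SMOOTH'), 'NO'),       # hypothetical negative row
    (('ORANGE', 'HARD', 'NO', 'WRINKLED'), 'YES'),  # example 4
    (('GREEN', 'SOFT', 'YES', 'SMOOTH'), 'YES'),    # example 5
]
print(find_s(examples))  # ['?', '?', '?', '?'], as derived above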
4. Version Spaces and Candidate Elimination
Method:
Initialize G, the set of maximally general hypotheses, to contain one element: the
null description (all features are variables).
Initialize S, the set of maximally specific hypotheses, to contain one element: the
first positive example.
Accept a new training example.
a. If the example is positive:
1. Generalize all the specific models to match the positive example,
but ensure the following:
The new specific models involve minimal changes.
Each new specific model is a specialization of some general
model.
No new specific model is a generalization of some other
specific model.
2. Prune away all the general models that fail to match the positive
example.
b. If the example is negative:
1. Specialize all the general models to exclude the negative example,
but ensure the following:
The new general models involve minimal changes.
Each new general model is a generalization of some specific model.
No new general model is a specialization of some other general model.
2. Prune away all the specific models that match the negative example.
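A simplified candidate-elimination sketch is given below, over the EnjoySport attributes with the domains listed in these notes (so Water takes Warm/Cold). The four training rows follow the standard EnjoySport illustration, with the Water value of the last row adapted to this domain; the sketch specializes G one attribute at a time and omits some pruning refinements.

DOMAINS = [('Sunny', 'Cloudy', 'Rainy'), ('Warm', 'Cold'), ('Normal', 'High'),
           ('Strong', 'Weak'), ('Warm', 'Cold'), ('Same', 'Change')]

def matches(h, x):
    """True if hypothesis h covers instance x ('?' accepts any value)."""
    return all(hv in ('?', xv) for hv, xv in zip(h, x))

def covers(general, specific):
    """True if `general` is more general than or equal to `specific`."""
    return all(gv == '?' or gv == sv for gv, sv in zip(general, specific))

def min_generalize(s, x):
    """Minimal generalization of s that covers the positive example x."""
    return tuple(xv if sv == 'ϕ' else (sv if sv == xv else '?')
                 for sv, xv in zip(s, x))

def min_specializations(g, x):
    """All single-attribute specializations of g that exclude the negative x."""
    return [g[:i] + (v,) + g[i + 1:]
            for i, gv in enumerate(g) if gv == '?'
            for v in DOMAINS[i] if v != x[i]]

def candidate_elimination(examples):
    S = [('ϕ',) * 6]   # maximally specific boundary
    G = [('?',) * 6]   # maximally general boundary
    for x, positive in examples:
        if positive:
            G = [g for g in G if matches(g, x)]        # prune G
            S = [min_generalize(s, x) for s in S]      # generalize S minimally
            S = [s for s in S if any(covers(g, s) for g in G)]
        else:
            S = [s for s in S if not matches(s, x)]    # prune S
            G = [g2 for g in G
                 for g2 in (min_specializations(g, x) if matches(g, x) else [g])
                 if any(covers(g2, s) for s in S)]
    return S, G

data = [(('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), True),
        (('Sunny', 'Warm', 'High', 'Strong', 'Warm', 'Same'), True),
        (('Rainy', 'Cold', 'High', 'Strong', 'Warm', 'Change'), False),
        (('Sunny', 'Warm', 'High', 'Strong', 'Cold', 'Change'), True)]
S, G = candidate_elimination(data)
print('S:', S)  # [('Sunny', 'Warm', '?', 'Strong', '?', '?')]
print('G:', G)  # [('Sunny', '?', '?', '?', '?', '?'), ('?', 'Warm', '?', '?', '?', '?')]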
5. Inductive Bias
Every machine learning model requires some type of architecture design and
possibly some initial assumptions about the data we want to analyze. Generally,
every building block and every belief that we make about the data is a form of
inductive bias.
Inductive biases play an important role in the ability of machine learning models
to generalize to the unseen data. A strong inductive bias can lead our model to
converge to the global optimum. On the other hand, a weak inductive bias can
cause the model to find only the local optima and be greatly affected by random
changes in the initial states.
We can categorize inductive biases into two different groups called relational and
non-relational. The former represents the relationship between entities in the
network, while the latter is a set of techniques that further constrain the learning
algorithm.
Bayesian Models
Inductive bias in Bayesian models shows itself in the form of the prior
distributions that we choose for the variables. Consequently, the prior can shape
the posterior distribution in a way that the latter can turn out to be a similar
distribution to the former. In addition, we assume that the variables are
conditionally independent, meaning that given the parents of a node in the
network, it’ll be independent from its ancestors. As a result, we can make use of
conditional probability to make the inference. Also, the structure of the Bayesian
net can facilitate the analysis of causal relationships between entities.
Linear Regression
Given the (X, Y) data points, in linear regression, we assume that the variable (Y)
is linearly dependent on the explanatory variables (X). Therefore, the resulting
model linearly fits the training data. However, this assumption can limit the
model’s capacity to learn non-linear functions.
Logistic Regression
In logistic regression, we assume that there’s a hyperplane that separates the two
classes from each other. This simplifies the problem, but one can imagine that if
the assumption is not valid, we won’t have a good model.
6. Decision Tree Algorithm
Decision Tree is a Supervised learning technique that can be used for both classification
and Regression problems, but mostly it is preferred for solving Classification problems.
It is a tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules and each leaf node represents the outcome.
7. Representation
In a decision tree, there are two kinds of nodes: the Decision Node and the Leaf Node.
Decision nodes are used to make decisions and have multiple branches, whereas leaf
nodes are the outputs of those decisions and do not contain any further branches.
The decisions or tests are performed on the basis of the features of the given dataset.
Root Node: Root node is from where the decision tree starts. It represents the entire
dataset, which further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further
after getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.
Branch/Sub-Tree: A subtree formed by splitting a node.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other nodes
are called the child nodes.
8. Algorithm
In a decision tree, for predicting the class of a given record, the algorithm starts from
the root node of the tree. It compares the value of the root attribute with the corresponding
attribute of the record (from the real dataset) and, based on the comparison, follows the
branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the sub-nodes
and moves further. It continues the process until it reaches a leaf node of the tree.
The complete process can be better understood using the below algorithm:
Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
Step-3: Divide S into subsets, one for each possible value of the best attribute.
Step-4: Generate the decision tree node, which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in
Step-3. Continue this process until a stage is reached where the nodes cannot be classified
further; these final nodes are the leaf nodes.
Example: Suppose there is a candidate who has a job offer and wants to decide whether
he should accept the offer or Not. So, to solve this problem, the decision tree starts with
the root node (Salary attribute by ASM). The root node splits further into the next
decision node (distance from the office) and one leaf node based on the corresponding
labels. The next decision node further gets split into one decision node (Cab facility) and
one leaf node. Finally, the decision node splits into two leaf nodes (Accepted offers and
Declined offer).
Steps will also remain the same, which are given below:
Data Pre-processing step
Fitting a Decision-Tree algorithm to the Training set
Predicting the test result
Test accuracy of the result (creation of a confusion matrix)
Visualizing the test set result.
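A hedged sketch of these five steps with scikit-learn follows, using the library's built-in Iris data as a stand-in dataset (the job-offer example above exists only as a diagram).

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

X, y = load_iris(return_X_y=True)                    # pre-processed stand-in data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X_tr, y_tr)                                  # fit the tree to the training set
y_pred = clf.predict(X_te)                           # predict the test result

print(confusion_matrix(y_te, y_pred))                # test accuracy via confusion matrix
print("accuracy:", accuracy_score(y_te, y_pred))
# For visualisation, sklearn.tree.plot_tree(clf) renders the fitted tree.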
The data set given is:
+----+-------+-------+------------------+----------+
| ID | Fever | Cough | Breathing issues | Infected |
+----+-------+-------+------------------+----------+
| 1  | No    | No    | No               | No       |
+----+-------+-------+------------------+----------+
| 2  | Yes   | Yes   | Yes              | Yes      |
+----+-------+-------+------------------+----------+
| 3  | Yes   | Yes   | No               | No       |
+----+-------+-------+------------------+----------+
| 4  | Yes   | No    | Yes              | Yes      |
+----+-------+-------+------------------+----------+
| 5  | Yes   | Yes   | Yes              | Yes      |
+----+-------+-------+------------------+----------+
| 6  | No    | Yes   | No               | No       |
+----+-------+-------+------------------+----------+
| 7  | Yes   | No    | Yes              | Yes      |
+----+-------+-------+------------------+----------+
| 8  | Yes   | No    | Yes              | Yes      |
+----+-------+-------+------------------+----------+
| 9  | No    | Yes   | Yes              | Yes      |
+----+-------+-------+------------------+----------+
| 10 | Yes   | Yes   | No               | Yes      |
+----+-------+-------+------------------+----------+
| 11 | Yes   | Yes   | No               | No       |
+----+-------+-------+------------------+----------+
| 12 | No    | Yes   | No               | No       |
+----+-------+-------+------------------+----------+
| 13 | No    | Yes   | Yes              | Yes      |
+----+-------+-------+------------------+----------+
| 14 | No    | Yes   | Yes              | No       |
+----+-------+-------+------------------+----------+
As stated in the previous section the first step is to find the best feature i.e. the one that
has the maximum Information Gain(IG). We’ll calculate the IG for each of the features
now, but for that, we first need to calculate the entropy of S
From the total of 14 rows in our dataset S, there are 8 rows with the target value YES
and 6 rows with the target value NO. The entropy of S is calculated as:
Entropy(S) = -(8/14) * log₂(8/14) - (6/14) * log₂(6/14) = 0.99
Note: If all the values in our target column are the same, the entropy will be zero (meaning
that it has no randomness).
We now calculate the Information Gain for each feature:
IG calculation for Fever:
In this (Fever) feature there are 8 rows having the value YES and 6 rows having the value NO.
As shown below, in the 8 rows with YES for Fever, there are 6 rows having target value
YES and 2 rows having target value NO.
+-------+-------+------------------+----------+
| Fever | Cough | Breathing issues | Infected |
+-------+-------+------------------+----------+
| Yes   | Yes   | Yes              | Yes      |
+-------+-------+------------------+----------+
| Yes   | Yes   | No               | No       |
+-------+-------+------------------+----------+
| Yes   | No    | Yes              | Yes      |
+-------+-------+------------------+----------+
| Yes   | Yes   | Yes              | Yes      |
+-------+-------+------------------+----------+
| Yes   | No    | Yes              | Yes      |
+-------+-------+------------------+----------+
| Yes   | No    | Yes              | Yes      |
+-------+-------+------------------+----------+
| Yes   | Yes   | No               | Yes      |
+-------+-------+------------------+----------+
| Yes   | Yes   | No               | No       |
+-------+-------+------------------+----------+
As shown below, in the 6 rows with NO, there are 2 rows having target value YES and 4
rows having target value NO.
+-------+-------+------------------+----------+
| Fever | Cough | Breathing issues | Infected |
+-------+-------+------------------+----------+
| No    | No    | No               | No       |
+-------+-------+------------------+----------+
| No    | Yes   | No               | No       |
+-------+-------+------------------+----------+
| No    | Yes   | Yes              | Yes      |
+-------+-------+------------------+----------+
| No    | Yes   | No               | No       |
+-------+-------+------------------+----------+
| No    | Yes   | Yes              | Yes      |
+-------+-------+------------------+----------+
| No    | Yes   | Yes              | No       |
+-------+-------+------------------+----------+
The block, below, demonstrates the calculation of Information Gain for Fever.
# total rows |S| = 14
For v = YES, |Sᵥ| = 8
Entropy(Sᵥ) = - (6/8) * log₂(6/8) - (2/8) * log₂(2/8) = 0.81
For v = NO, |Sᵥ| = 6
Entropy(Sᵥ) = - (2/6) * log₂(2/6) - (4/6) * log₂(4/6) = 0.91
# Expanding the summation in the IG formula:
IG(S, Fever) = Entropy(S) - (|Sʏᴇꜱ| / |S|) * Entropy(Sʏᴇꜱ) - (|Sɴᴏ| / |S|) * Entropy(Sɴᴏ)
∴ IG(S, Fever) = 0.99 - (8/14) * 0.81 - (6/14) * 0.91 = 0.13
Next, we calculate the IG for the features “Cough” and “Breathing issues”.
IG(S, Cough) = 0.04
IG(S, BreathingIssues) = 0.40
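The sketch below reproduces the entropy and information-gain arithmetic above from the class counts alone (8 YES / 6 NO overall, and the per-branch counts for Fever).

from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

e_s = entropy([8, 6])                            # entropy of S: ~0.99
e_yes, e_no = entropy([6, 2]), entropy([2, 4])   # Fever = YES / NO branches

ig_fever = e_s - (8 / 14) * e_yes - (6 / 14) * e_no
print(round(e_s, 2), round(e_yes, 2), round(e_no, 2), round(ig_fever, 2))
# 0.99 0.81 0.92 0.13  (the notes round 0.92 down to 0.91)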
Since the feature Breathing issues has the highest Information Gain, it is used to create
the root node.
Hence, after this initial step, our tree has Breathing issues at the root, with a YES branch and a NO branch.
Next, from the remaining two unused features, namely, Fever and Cough, we decide
which one is the best for the left branch of Breathing Issues.
Since the left branch of Breathing issues denotes YES, we will work with the subset of
the original data, i.e. the set of rows having YES as the value in the Breathing issues
column. These 8 rows are shown below:
+-------+-------+------------------+----------+
| Fever | Cough | Breathing issues | Infected |
+-------+-------+------------------+----------+
| Yes   | Yes   | Yes              | Yes      |
+-------+-------+------------------+----------+
| Yes   | No    | Yes              | Yes      |
+-------+-------+------------------+----------+
| Yes   | Yes   | Yes              | Yes      |
+-------+-------+------------------+----------+
| Yes   | No    | Yes              | Yes      |
+-------+-------+------------------+----------+
| Yes   | No    | Yes              | Yes      |
+-------+-------+------------------+----------+
| No    | Yes   | Yes              | Yes      |
+-------+-------+------------------+----------+
| No    | Yes   | Yes              | Yes      |
+-------+-------+------------------+----------+
| No    | Yes   | Yes              | No       |
+-------+-------+------------------+----------+
Next, we calculate the IG for the features Fever and Cough using the subset Sʙʏ (Set
Breathing Issues Yes) which is shown above :
Note: For IG calculation the Entropy will be calculated from the subset Sʙʏ and not the
original dataset S.
Fever has the higher Information Gain in Sʙʏ, so it becomes the next node on the left
branch. The rows of Sʙʏ with Fever = YES are all Infected = YES, so YES (Infected)
becomes its left leaf node. The rows with Fever = NO are shown below:
+-------+-------+------------------+----------+
| Fever | Cough | Breathing issues | Infected |
+-------+-------+------------------+----------+
| No    | Yes   | Yes              | Yes      |
+-------+-------+------------------+----------+
| No    | Yes   | Yes              | No       |
+-------+-------+------------------+----------+
| No    | Yes   | Yes              | No       |
+-------+-------+------------------+----------+
Here not all but most of the values are NO, hence NO or Not Infected becomes our right
leaf node.
We repeat the same process for the node Cough; however, here both the left and right leaves
turn out to be the same, i.e. NO or Not Infected.
Find S algorithm
https://www.youtube.com/watch?v=O6vwN74aSGY&ab_channel=MaheshHuddar