
Machine Learning

(BE Computer 2019 PAT)


A.Y. 2025-26 SEM-I

Prof. Minal P. Jungare

Indira College of Engineering & Management, Pune


Unit-1 Introduction to ML Syllabus
• Introduction to Machine Learning, Comparison of Machine learning with
traditional programming.
• ML vs AI vs Data Science.
• Types of learning: Supervised, Unsupervised, and semi-supervised,
reinforcement learning techniques.
• Models of Machine learning: Geometric model, Probabilistic Models, Logical
Models.
• Grouping and grading models, Parametric and non-parametric models.
• Important Elements of Machine Learning- Data formats,
• Learnability.
• Statistical learning approaches.



Need of Machine Learning
• Ever since the technological revolution, we have been generating an immeasurable amount of data. As per research, we generate around 2.5 quintillion bytes of data every single day! It was estimated that by 2020, 1.7 MB of data would be created every second for every person on earth.
• With the availability of so much data, it is finally possible to build
predictive models that can study and analyze complex data to find
useful insights and deliver more accurate results.
• Top Tier companies such as Netflix and Amazon build such Machine
Learning models by using tons of data in order to identify profitable
opportunities and avoid unwanted risks.



Why is Machine Learning so important?



What is Machine Learning?
• It is the field of study that gives computers "the ability to learn without being explicitly programmed."



• A Machine Learning process begins by feeding the machine lots of data. Using this data, the machine is trained to detect hidden insights and trends. These insights are then used to build a Machine Learning model, by applying an algorithm, in order to solve a problem.



Machine Learning Process
• The Machine Learning process involves building a Predictive model
that can be used to find a solution for a Problem Statement. To
understand the Machine Learning process let’s assume that you have
been given a problem that needs to be solved by using Machine
Learning.



Traditional programming and machine learning
• Traditional computer programming has been around for more than a century, with the first known computer program dating back to the mid-1800s.
• Traditional programming is a manual process: a person (the programmer) creates the program by explicitly formulating and coding the rules (the logic). We have the input data, and the programmer writes a program that uses that data and runs on a computer to produce the desired output.



Traditional programming and machine learning
• In Machine Learning, on the other hand, the input data and the corresponding output are fed to an algorithm, which creates the program.

• This is the basic difference between traditional programming and machine learning: in traditional programming one has to manually formulate/code the rules, while in Machine Learning the algorithm automatically formulates the rules from the data, which is very powerful.
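A minimal sketch of this contrast (the word counts, labels, and the ">5 blacklisted words" threshold below are invented purely for illustration):

```python
# Traditional programming vs. machine learning on a toy spam task
# (data, labels, and the ">5 words" threshold are illustrative assumptions).
from sklearn.linear_model import LogisticRegression

# Traditional programming: a person hand-codes the rule (the logic).
def is_spam_rule_based(blacklisted_word_count):
    return blacklisted_word_count > 5            # rule formulated manually

# Machine learning: inputs AND outputs are fed to an algorithm,
# which formulates the rule from the data.
X = [[0], [1], [2], [6], [8], [10]]              # input: blacklisted-word counts
y = [0, 0, 0, 1, 1, 1]                           # output: 0 = regular, 1 = spam
model = LogisticRegression().fit(X, y)           # the "program" is learned

print(is_spam_rule_based(7))                     # True -> logic written by a person
print(model.predict([[7]])[0])                   # 1    -> logic induced from the data
```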



APPLICATIONS OF MACHINE LEARNING
• Google Search
• Stock predictions
• Robotics – 'Sophia', a humanoid robot that can behave much like a human
• Social media services – face recognition, friend suggestions ('People You May Know') on Facebook
• Email spam and malware filtering – for example, C4.5 decision tree induction
• Over 325,000 new malware samples are detected every day, and each piece of code is 90–98% similar to its previous versions.

Machine Learning Matters
• Machine learning is the study, engineering, and improvement of mathematical models that can be trained (once or continuously) with context-related data (provided by a generic environment) to infer the future and to make decisions without complete knowledge of all influencing elements (external factors).
• In other words, an agent (a software entity that receives information from an environment, picks the best action to reach a specific goal, and observes the results of it) adopts a statistical learning approach, trying to determine the right probability distributions and using them to compute the action (value or decision) that is most likely to be successful (with the least error).

Supervised Learning
• Supervised learning is where you have input variables (x) and an
output variable (Y) and you use an algorithm to learn the
mapping function from the input to the output.
Y = f(X)
• The goal is to approximate the mapping function so well that, when you have new input data (x), you can predict the output variable (Y) for that data.

• It is called supervised learning because the process of an algorithm
learning from the training dataset can be thought of as a teacher
supervising the learning process.
• We know the correct answers; the algorithm iteratively makes predictions on the training data and is corrected by the teacher.
• Learning stops when the algorithm achieves an acceptable level of
performance.

• Supervised learning problems can be further grouped into regression and
classification problems.
• Classification: A classification problem is when the output variable is a
category, such as “red” or “blue” or “disease” and “no disease”.
• Regression: A regression problem is when the output variable is a real value,
such as “dollars” or “weight”.
• Some common types of problems built on top of classification and regression
include recommendation and time series prediction respectively.
• Some popular examples of supervised machine learning algorithms are:
• Linear regression for regression problems.
• Random forest for classification and regression problems.
• Support vector machines for classification problems.
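A short sketch of the three algorithms listed above on small synthetic datasets (all numbers are generated only for illustration):

```python
# Supervised learning sketch: regression and classification on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Regression: the output is a real value (here y ~ 3x + noise).
X_reg = rng.uniform(0, 10, size=(100, 1))
y_reg = 3 * X_reg.ravel() + rng.normal(0, 1, 100)
reg = LinearRegression().fit(X_reg, y_reg)
print("learned slope (~3):", reg.coef_[0])

# Classification: the output is a category (two well-separated blobs).
X_cls = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y_cls = np.array([0] * 50 + [1] * 50)
print("random forest:", RandomForestClassifier().fit(X_cls, y_cls).predict([[4, 4]]))
print("svm:          ", SVC().fit(X_cls, y_cls).predict([[0, 0]]))
```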

Classification example
• Sometimes, instead of predicting the actual category, it's better to
determine its probability distribution.
• For example, an algorithm can be trained to recognize a handwritten
alphabetical letter, so its output is categorical (in English, there'll be 26
allowed symbols).
• On the other hand, even for human beings, such a process can lead to
more than one probable outcome when the visual representation of a
letter isn't clear enough to belong to a single category.
• That means that the actual output is better described by a discrete
probability distribution (for example, with 26 continuous values
normalized so that they always sum up to 1).

Common Supervised Learning Applications include:
• Predictive analysis based on regression or categorical classification
• Spam detection
• Pattern detection
• Natural Language Processing
• Sentiment analysis
• Automatic image classification
• Automatic sequence processing (for example, music or speech)

Unsupervised Machine Learning
• Unsupervised learning is where you only have input data (X)
and no corresponding output variables.
• The goal for unsupervised learning is to model the underlying
structure or distribution in the data in order to learn more about
the data.
• This is called unsupervised learning because, unlike supervised learning above, there are no correct answers and there is no teacher. Algorithms are left to their own devices to discover and present the interesting structure in the data.

• Unsupervised learning problems can be further grouped into
clustering and association problems.
• Clustering: A clustering problem is where you want to discover
the inherent groupings in the data, such as grouping customers
by purchasing behavior.
• Association: An association rule learning problem is where you
want to discover rules that describe large portions of your data,
such as people that buy X also tend to buy Y.
• Some popular examples of unsupervised learning algorithms are:
• k-means for clustering problems.
• Apriori algorithm for association rule learning problems.
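A minimal sketch of the clustering case, grouping made-up customers by purchasing behaviour with k-means (association-rule mining with Apriori is not included in scikit-learn, so only clustering is shown):

```python
# Unsupervised learning sketch: k-means on unlabeled customer data (invented numbers).
import numpy as np
from sklearn.cluster import KMeans

# Each row: [purchases per month, average basket value] - no labels are given.
customers = np.array([[2, 15], [3, 20], [2, 18],          # low-spend behaviour
                      [20, 150], [22, 160], [19, 140]])   # high-spend behaviour

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)           # groupings discovered without any teacher
print(kmeans.cluster_centers_)  # one centroid per discovered group
```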

Common Unsupervised Applications include:
• Object segmentation (for example, users, products, movies, songs, and so on)
• Similarity detection
• Automatic labeling

Semi-Supervised Machine Learning
• Problems where you have a large amount of input data (X) and only
some of the data is labeled (Y) are called semi-supervised learning
problems.
• These problems sit in between both supervised and unsupervised
learning.
• A good example is a photo archive where only some of the images are
labeled, (e.g. dog, cat, person) and the majority are unlabeled.
• Many real world machine learning problems fall into this area. This is
because it can be expensive or time-consuming to label data as it
may require access to domain experts. Whereas unlabeled data is
cheap and easy to collect and store.
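A small sketch of this setting with scikit-learn's SelfTrainingClassifier, where unlabeled samples are marked with -1 (the one-dimensional data is invented for illustration):

```python
# Semi-supervised sketch: a few labeled points, several unlabeled ones (-1).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X = np.array([[0.0], [0.2], [0.4], [0.6], [5.0], [5.2], [5.4], [5.6],  # labeled
              [0.1], [0.3], [5.1], [5.3]])                             # unlabeled
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, -1, -1, -1, -1])                 # -1 = "no label"

# The self-training wrapper pseudo-labels the unlabeled points iteratively.
model = SelfTrainingClassifier(LogisticRegression()).fit(X, y)
print(model.predict([[0.25], [5.25]]))   # -> [0 1]
```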

Summary
• Supervised: All data is labeled and the
algorithms learn to predict the output from
the input data.
• Unsupervised: All data is unlabeled and the algorithms learn the inherent structure from the input data.
• Semi-supervised: Some data is labeled but
most of it is unlabeled and a mixture of
supervised and unsupervised techniques
can be used.
Reinforcement learning

• Reinforcement learning is also based on feedback provided by the environment. However, in this case, the information is more qualitative and doesn't help the agent in determining a precise measure of its error.
• This feedback is usually called a reward (sometimes a negative one is defined as a penalty), and it is useful for understanding whether a certain action performed in a given state is positive or not.

• Reinforcement Learning is a framework for learning where an agent interacts
with an environment and receives a reward for each interaction. The goal is to
learn to accumulate as much reward as possible over time.
• The real advantage these systems have over conventional supervised learning is
illustrated by this example I like a lot:
• Supervised Learning: Let us say that you know how to play chess. We record
you playing games against a lot of people. Now we train a system in the
supervised fashion to learn from your examples and call it KidPlayer. Let us say
that we train another system on Viswanathan Anand's games and call this ProPlayer. Obviously, the "policy" learned by KidPlayer will be inferior to the policy learned by ProPlayer because of the different capabilities of the teacher.
• Reinforcement Learning: In this setting, you make an agent play Chess against
someone (usually against another copy of itself) and give it a reward for every
time it wins a game.
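The reward-driven idea can be sketched with tabular Q-learning on a tiny, made-up corridor environment (a deliberate simplification; it is not the chess example above nor the Atari setup discussed next):

```python
# Reinforcement learning sketch: tabular Q-learning on a 5-state corridor.
# The agent gets a reward of 1 only when it reaches the rightmost state.
import numpy as np

n_states, n_actions = 5, 2                 # actions: 0 = move left, 1 = move right
Q = np.zeros((n_states, n_actions))        # action-value table, learned from rewards
alpha, gamma, epsilon = 0.5, 0.9, 0.1
rng = np.random.default_rng(0)

for episode in range(200):
    s = 0
    while s != n_states - 1:
        # epsilon-greedy: mostly exploit the best known action, sometimes explore.
        a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[s].argmax())
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        reward = 1.0 if s_next == n_states - 1 else 0.0
        # Q-learning update: move Q(s, a) toward reward + discounted best future value.
        Q[s, a] += alpha * (reward + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q.argmax(axis=1))   # learned policy: move right in every non-terminal state
```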

• Reinforcement learning has been used, for example, to learn the best policy for playing Atari video games and to teach an agent how to associate the right action with an input representing the state (usually a screenshot or a memory dump).
• In the following figure, there's a schematic representation of a deep neural network trained to play a famous Atari game.
• As input, there are one or more subsequent screenshots (this can often be enough to capture the temporal dynamics as well).
• They are processed using different layers (discussed briefly later) to produce an output that represents the policy for a specific state transition.
• After applying this policy, the game produces feedback (a reward or penalty), and this result is used to refine the output until it becomes stable (so the states are correctly recognized and the suggested action is always the best one) and the total reward overcomes a predefined threshold.

Atari Video Game (figure)

Schematic representation of a deep neural network trained to play a famous Atari game (figure).
Difference between AI and Machine Learning
• AI aims to make a smart computer system work just like humans to solve complex problems; ML allows machines to learn from data so they can provide accurate output.
• Based on capability, AI can be categorized into Weak AI, General AI, and Strong AI; ML can be categorized into Supervised Learning, Unsupervised Learning, and Reinforcement Learning.
• AI systems are concerned with maximizing the chances of success; Machine Learning is primarily concerned with accuracy and patterns.
• AI enables a machine to emulate human behaviour; Machine Learning is a subset of AI.
• AI mainly deals with structured, semi-structured, and unstructured data; ML deals with structured and semi-structured data.
• Some applications of AI are virtual assistants such as Siri, chatbots, and intelligent humanoid robots; applications of ML include recommendation systems, search algorithms, Facebook's automatic friend tagging system, etc.



Difference Between Data Science and Machine Learning
• Data Science helps with creating insights from data that deal with real-world complexities; Machine Learning helps in accurately predicting or classifying outcomes for new data points by learning patterns from historical data.
• Preferred skill set for Data Science: domain expertise, strong SQL, ETL and data profiling, NoSQL systems, standard reporting, visualization. Preferred skill set for Machine Learning: Python/R programming, strong mathematics knowledge, data wrangling, SQL, model-specific visualization.
• Data Science prefers horizontally scalable systems to handle massive data; Machine Learning prefers GPUs for intensive vector operations.
• Data Science needs components for handling unstructured raw data; in Machine Learning, the major complexity lies in the algorithms and the mathematical concepts behind them.
• In Data Science, most of the input data is in human-consumable form; in Machine Learning, the input data is transformed specifically for the type of algorithm used.
Relationship between Data Science, Artificial
Intelligence and Machine Learning
• Artificial Intelligence and Data Science cover a wide field of applications and systems that aim at replicating human intelligence through machines. Artificial Intelligence can be represented as a loop of perception, planning, action, and feedback of perception:
• Perception > Planning > Action > Feedback of Perception
Data Science uses different parts of this pattern or loop to solve specific
problems. For instance, in the first step, i.e. Perception, data scientists try to
identify patterns with the help of the data. Similarly, in the next step, i.e.
planning, there are two aspects:
• Finding all possible solutions
• Finding the best solution among all solutions



• Data science creates a system that interrelates both
the aforementioned points and helps businesses move
forward.
• Although it’s possible to explain machine learning by
taking it as a standalone subject, it can best be
understood in the context of its environment, i.e., the
system it’s used within.
• Simply put, machine learning is the link that connects
Data Science and AI. That is because it’s the process
of learning from data over time. So, AI is the tool
that helps data science get results and solutions for
specific problems. However, machine learning is what
helps in achieving that goal. A real-life example of
this is Google’s Search Engine.



Artificial Intelligence vs. Machine Learning vs. Data Science
• Artificial Intelligence includes Machine Learning; Machine Learning is a subset of Artificial Intelligence; Data Science includes various data operations.
• Artificial Intelligence combines large amounts of data through iterative processing and intelligent algorithms to help computers learn automatically; Machine Learning uses efficient programs that can use data without being explicitly told to do so; Data Science works by sourcing, cleaning, and processing data to extract meaning out of it for analytical purposes.
• Popular tools used in AI: TensorFlow, Scikit-learn, Keras. Popular tools used in Machine Learning: Amazon Lex, IBM Watson Studio, Microsoft Azure ML Studio. Popular tools used in Data Science: SAS, Tableau, Apache Spark, MATLAB.
• Artificial Intelligence uses logic and decision trees; Machine Learning uses statistical models; Data Science deals with structured and unstructured data.
• Chatbots and voice assistants are popular applications of AI; recommendation systems such as Spotify and facial recognition are popular examples of ML; fraud detection and healthcare analysis are popular examples of Data Science.
Machine learning model
• A machine learning model is a program that can find patterns or make
decisions from a previously unseen dataset. For example, in natural
language processing, machine learning models can parse and correctly
recognize the intent behind previously unheard sentences or
combinations of words. In image recognition, a machine learning
model can be taught to recognize objects - such as cars or dogs. A
machine learning model can perform such tasks by having it ‘trained’
with a large dataset. During training, the machine learning algorithm is
optimized to find certain patterns or outputs from the dataset,
depending on the task. The output of this process - often a computer
program with specific rules and data structures - is called a machine
learning model.



Learning model

• Geometric model.
• Probabilistic Models.
• Logical Models.
• Grouping and grading models.
• Parametric and non-parametric models.



Geometric model
• In Geometric models, features could be described as points in two
dimensions (x- and y-axis) or a three-dimensional space (x, y, and z).
Even when features are not intrinsically geometric, they could be
modelled in a geometric manner (for example, temperature as a
function of time can be modelled in two axes). In geometric models,
there are two ways we could impose similarity.
• We could use geometric concepts like lines or planes to segment
(classify) the instance space. These are called Linear models.
• Alternatively, we can use the geometric notion of distance to represent
similarity. In this case, if two points are close together, they have
similar values for features and thus can be classed as similar. We call
such models as Distance-based models.



1. Linear models
• Linear models are relatively simple. In this case, the function is
represented as a linear combination of its inputs. Thus, if x1 and x2 are
two scalars or vectors of the same dimension and a and b are arbitrary
scalars, then ax1 + bx2 represents a linear combination of x1 and x2. In
the simplest case where f(x) represents a straight line, we have an
equation of the form f (x) = mx + c where c represents the intercept
and m represents the slope.
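A minimal sketch of fitting such a line: the slope m and intercept c are recovered from noisy synthetic points with an ordinary least-squares fit (the true values 2.0 and 1.0 are chosen only for illustration):

```python
# Linear model sketch: estimate m and c in f(x) = mx + c from noisy data.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, x.size)   # true m = 2.0, true c = 1.0

m, c = np.polyfit(x, y, deg=1)                   # least-squares fit of a straight line
print(round(m, 2), round(c, 2))                  # approximately 2.0 and 1.0
```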



2. Distance-based models
Distance-based models are the second class of Geometric models. Like Linear models, distance-based models are based on the geometry of data. As the name implies, distance-based models work on the concept of distance. In the context of Machine Learning, the concept of distance is not based merely on the physical distance between two points. Instead, we could think of the distance between two points considering the mode of transport between them. Travelling between two cities by plane covers less distance physically than by train because a plane is unrestricted. Similarly, in chess, the concept of distance depends on the piece used – for example, a Bishop can move diagonally. Thus, depending on the entity and the mode of travel, the concept of distance can be experienced differently. The distance metrics commonly used are Euclidean, Minkowski, Manhattan, and Mahalanobis.



Distance is applied through the concept of neighbours and exemplars. Neighbours are
points in proximity with respect to the distance measure expressed through exemplars.
Exemplars are either centroids that find a centre of mass according to a chosen
distance metric or medoids that find the most centrally located data point. The most
commonly used centroid is the arithmetic mean, which minimises squared Euclidean
distance to all other points.



• Examples of distance-based models include the nearest-
neighbour models, which use the training data as exemplars – for
example, in classification. The K-means clustering algorithm also
uses exemplars to create clusters of similar data points.
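A short sketch of some of the distance metrics named above, plus a nearest-neighbour classifier built on one of them (the points are made up for illustration):

```python
# Distance-based models sketch: distance metrics and a k-nearest-neighbour classifier.
import numpy as np
from scipy.spatial import distance
from sklearn.neighbors import KNeighborsClassifier

a, b = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(distance.euclidean(a, b))            # 5.0
print(distance.cityblock(a, b))            # 7.0  (Manhattan distance)
print(distance.minkowski(a, b, p=3))       # Minkowski distance with p = 3

# Nearest neighbours: the training points themselves act as exemplars.
X = [[0, 0], [0, 1], [5, 5], [6, 5]]
y = [0, 0, 1, 1]
knn = KNeighborsClassifier(n_neighbors=3, metric="manhattan").fit(X, y)
print(knn.predict([[5, 6]]))               # -> [1], the class of the closest exemplars
```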



Probabilistic Models

• Probabilistic models see features and target variables as random variables. The process of modelling represents and manipulates the level of uncertainty with respect to these variables. There are two types of probabilistic models: Predictive and Generative. Predictive probability models use the idea of a conditional probability distribution P(Y | X), from which Y can be predicted from X. Generative models estimate the joint distribution P(Y, X). Once we know the joint distribution, we can derive any conditional or marginal distribution involving the same variables. Thus, a generative model is capable of creating new data points and their labels, knowing the joint probability distribution. The joint distribution looks for a relationship between two variables; once this relationship is inferred, it is possible to infer new data points.



• Naïve Bayes is an example of a probabilistic classifier.
• The goal of any probabilistic classifier is, given a set of features (x_0 through x_n) and a set of classes (c_0 through c_k), to determine the probability of the features occurring in each class and to return the most likely class. Therefore, for each class, we need to calculate P(c_i | x_0, …, x_n).
• We can do this using the Bayes rule, defined as:
P(c_i | x_0, …, x_n) = P(x_0, …, x_n | c_i) · P(c_i) / P(x_0, …, x_n)


• The Naïve Bayes algorithm is based on the idea of Conditional
Probability. Conditional probability is based on finding
the probability that something will happen, given that something
else has already happened. The task of the algorithm then is to look at
the evidence and to determine the likelihood of a specific class and
assign a label accordingly to each entity.
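A minimal sketch of a Naïve Bayes classifier on invented spam-like features, returning both the most likely class and the full posterior distribution over classes:

```python
# Probabilistic model sketch: Gaussian Naive Bayes on toy spam features.
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Features per message: [blacklisted-word count, message length]; 0 = regular, 1 = spam.
X = np.array([[0, 120], [1, 200], [2, 150], [7, 15], [9, 10], [6, 18]])
y = np.array([0, 0, 0, 1, 1, 1])

nb = GaussianNB().fit(X, y)
print(nb.predict([[8, 12]]))           # most likely class for the new message
print(nb.predict_proba([[8, 12]]))     # posterior P(c_i | x), normalized to sum to 1
```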



Logical Models
• Logical models use a logical expression to divide the instance space
into segments and hence construct grouping models. A logical
expression is an expression that returns a Boolean value, i.e., a True or
False outcome. Once the data is grouped using a logical expression,
the data is divided into homogeneous groupings for the problem we
are trying to solve. For example, for a classification problem, all the
instances in the group belong to one class.
• There are mainly two kinds of logical models: Tree models and Rule
models.



• Rule models consist of a collection of implications or IF-THEN rules.
For tree-based models, the ‘if-part’ defines a segment and the ‘then-
part’ defines the behaviour of the model for this segment. Rule models
follow the same reasoning.
• Tree models can be seen as a particular type of rule model where the
if-parts of the rules are organised in a tree structure. Both Tree models
and Rule models use the same approach to supervised learning. The
approach can be summarised in two strategies: we could first find the
body of the rule (the concept) that covers a sufficiently homogeneous
set of examples and then find a label to represent the body. Alternately,
we could approach it from the other direction, i.e., first select a class
we want to learn and then find rules that cover examples of the class.
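A small sketch showing how a learned tree can be read back as IF-THEN rules, using scikit-learn's decision tree on the standard Iris dataset:

```python
# Logical model sketch: a decision tree printed as a set of if-then rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Every root-to-leaf path is an IF-THEN rule over the feature values.
print(export_text(tree, feature_names=list(iris.feature_names)))
```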



Grouping and grading models
• Grouping models divide the instance space into groups or segments, the number of which is determined at training time. One could say that grouping models have a fixed and finite 'resolution' and cannot distinguish between individual instances beyond this resolution.



• Grading vs. grouping is an orthogonal categorization to geometric-probabilistic-logical-compositional.
• Grouping models break the instance space up into groups or segments and, in each segment, apply a very simple method (such as majority class). Examples: decision trees, k-NN.
• Grading models form one global model over the instance space. Examples: linear classifiers, neural networks.



What is a parameter in a machine learning
model?
A model parameter is a configuration variable that is internal to the model and whose value
can be estimated from the given data.
• They are required by the model when making predictions.
• Their values define the skill of the model on your problem.
• They are estimated or learned from historical training data.
• They are often not set manually by the practitioner.

• They are often saved as part of the learned model.


• The examples of model parameters include:
• The weights in an artificial neural network.
• The support vectors in a support vector machine.
• The coefficients in linear regression or logistic regression.



Parametric and non-parametric models

• A parametric learning model summarizes the data with a set of fixed-size parameters (independent of the number of training instances). Parametric machine learning algorithms optimize the function to a known form.



• In a parametric model, you know exactly which model you are going to fit to the data, for example, a linear regression line:
b0 + b1*x1 + b2*x2 = 0
where
b0, b1, b2 → the coefficients of the line that control the intercept and slope
x1, x2 → input variables
• Assuming the functional form of a line simplifies the learning process greatly. Now all we have to do is estimate the coefficients of the line equation and we have a predictive model for the problem. With the intercept and the coefficients, one can predict any value along the regression line.



• The assumed functional form is often a linear combination of the input variables, and as such parametric machine learning algorithms are frequently referred to as 'linear machine learning algorithms.'
• The form of the equation is pre-defined. Feeding more data might just change the coefficients in the equation, and increasing the number of instances will not make the model more complex: it remains stable.



Some more examples of parametric machine learning algorithms
include:
• Logistic Regression
• Linear Discriminant Analysis
• Perceptron
• Naive Bayes
• Simple Neural Networks



• Nonparametric machine learning algorithms are those which do not make specific assumptions about the form of the mapping function. They are free to learn any functional form from the training data because they do not make such assumptions.
• The word nonparametric does not mean that the model has no parameters at all, but rather that the number and nature of the parameters are flexible and not fixed in advance. When dealing with ranked data, one may turn to nonparametric modelling, in which the order in which the values are ranked carries part of the significance.



• A simple-to-understand nonparametric model is the k-nearest neighbors algorithm, which makes predictions for a new data instance based on the k most similar training patterns. The only assumption it makes about the dataset is that the training patterns that are most similar are most likely to have a similar result.
• Some more examples of popular nonparametric machine learning
algorithms are:
• k-Nearest Neighbors
• Decision Trees like CART and C4.5
• Support Vector Machines



Data Format
• Labeled data: Data consisting of a set of training
examples, where each example is a pair consisting of
an input and a desired output value (also called
the supervisory signal, labels, etc)
• Classification: The goal is to predict discrete values,
e.g. {1,0}, {True, False}, {spam, not spam}.
• Regression: The goal is to predict continuous values,
e.g. home prices.

Important Elements in Machine Learning
• Data formats
• In a supervised learning problem, there will always be a dataset, defined as a finite set of real vectors with m features each:
X = {x1, x2, …, xn}, with xi ∈ R^m

• Feature vector: A typical setting for machine learning
is to be given a collection of objects (or data points),
each of which is characterised by several different
features.
• Features can be of different sorts: e.g., they might be
continuous (say, real- or integer-valued) or categorical
(for instance, a feature for colour can have values like
green, blue, red ).
• A vector containing all of the feature values for a
given data point is called the feature vector;
• If this is a vector of length m, then one can think of each data point as being mapped to an m-dimensional vector space (in the case of real-valued features, this is R^m), called the feature space.
• This means all variables belong to the same distribution D, and, considering an arbitrary subset of m values, it happens that:
P(x1, x2, …, xm) = P(x1) · P(x2) · … · P(xm)
• The corresponding output values can be either numerical-continuous or categorical. In the first case, the process is called regression, while in the second, it is called classification. Examples of numerical outputs are:

• Categorical examples are

• We define a generic regressor as a vector-valued function that associates an input value with a continuous output, and a generic classifier as a vector-valued function whose predicted output is categorical (discrete).
• If they also depend on an internal parameter vector which determines the actual instance of a generic predictor, the approach is called parametric learning:

This interpretation can be expressed in terms of additive noise: y = f(x) + n, where n is a noise term.

In unsupervised learning, we normally only have an input set X of m-length vectors, and we define a clustering function (with n target clusters) that assigns each input vector to one of the n clusters.

In most scikit-learn models, there is an instance variable coef_ which contains all the trained parameters.
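A minimal sketch of this: after fitting a linear regression on synthetic data generated with known coefficients, the trained parameters are exposed through coef_ and intercept_:

```python
# Inspecting trained parameters in scikit-learn (synthetic data, known coefficients).
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 6]], dtype=float)
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + 1.0          # generated with coefficients 2.0, 0.5

model = LinearRegression().fit(X, y)
print(model.coef_)        # ~[2.0, 0.5]  - the trained parameter vector
print(model.intercept_)   # ~1.0         - the trained intercept
```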
Multiclass strategies
• When the number of output classes is greater than one,
there are two main possibilities to manage a
classification problem:
• One-vs-all- If there are n output classes, n
classifiers will be trained in parallel considering there
is always a separation between an actual class and the
remaining ones.
• This approach is relatively lightweight (at most, n-1
checks are needed to find the right class, so it has an
O(n) complexity) and, for this reason, it's normally the
default choice and there's no need for further actions.

• One-vs-one
• The alternative to one-vs-all is training a
model for each pair of classes.
• The complexity is no longer linear (it's O(n²) indeed) and the right class is determined by a majority vote.
• In general, this choice is more expensive
and should be adopted only when a full
dataset comparison is not preferable.
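A sketch forcing the two strategies explicitly with scikit-learn's wrappers around a binary classifier (many estimators already apply one of them automatically, so this is only to make the difference visible):

```python
# Multiclass sketch: one-vs-rest and one-vs-one wrappers around a binary classifier.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)                              # 3 classes

ovr = OneVsRestClassifier(LinearSVC(max_iter=5000)).fit(X, y)  # n binary classifiers
ovo = OneVsOneClassifier(LinearSVC(max_iter=5000)).fit(X, y)   # n(n-1)/2 binary classifiers
print(len(ovr.estimators_), len(ovo.estimators_))              # 3 and 3 for n = 3 classes
print(ovr.predict(X[:1]), ovo.predict(X[:1]))
```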

Learnability

• The figure shows an example of a dataset whose points must be classified as red (Class A) or blue (Class B).
• Three hypotheses are shown: the first one (the middle line starting from the left) misclassifies one sample, while the lower and upper ones misclassify 13 and 23 samples respectively.
• The first hypothesis is optimal and should be selected; however, it's important to understand an essential concept which can lead to potential overfitting.

• The blue classifier is linear while the red one is cubic. At a glance, the non-linear strategy seems to perform better, because it can capture more expressivity thanks to its concavities.
• However, if new samples are added following the trend defined by the last four ones (from the right), they'll be completely misclassified.
• In fact, the linear function is globally better but cannot capture the initial oscillation between 0 and 4, while the cubic approach can fit this data almost perfectly but, at the same time, loses its ability to keep the global linear trend.
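A small sketch of the same effect on synthetic data: a straight line versus a very flexible polynomial fitted to noisy, globally linear points, then evaluated beyond the training range:

```python
# Overfitting sketch: a linear fit generalises, a high-degree polynomial does not.
import numpy as np

rng = np.random.default_rng(2)
x_train = np.linspace(0, 10, 12)
y_train = 1.5 * x_train + rng.normal(0, 2.0, x_train.size)   # globally linear trend + noise

linear = np.poly1d(np.polyfit(x_train, y_train, deg=1))
wiggly = np.poly1d(np.polyfit(x_train, y_train, deg=9))      # very expressive model

x_new = np.array([12.0, 15.0])                  # points outside the training range
print("linear fit:  ", linear(x_new))           # keeps following the global trend
print("degree-9 fit:", wiggly(x_new))           # typically diverges far from the data
```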

Statistical learning approaches
• Imagine that you need to design a spam-filtering algorithm starting from this initial (over-simplistic) classification based on two parameters:

Parameter                              Spam emails (X1)    Regular emails (X2)
p1 – Contains > 5 blacklisted words    80                  20
p2 – Message length < 20 characters    75                  25

• We have collected 200 email messages (X) (for simplicity, we consider p1 and p2 mutually exclusive) and we need to find a couple of probabilistic hypotheses (expressed in terms of p1 and p2) to determine P(Spam | hi) and P(Regular | hi) for each hypothesis.

• For example, we could think about rules (hypotheses)
like: "If there are more than five blacklisted words" or
"If the message is less than 20 characters in length"
then "the probability of spam is high" (for example,
greater than 50 percent). However, without assigning
probabilities, it's difficult to generalize when the
dataset changes (like in a real world antispam filter).
We also want to determine a partitioning threshold
(such as green, yellow, and red signals) to help the
user in deciding what to keep and what to trash.
• As the hypotheses are determined through the dataset
X, we can also write (in a discrete form):

• In this example, it's quite easy to determine the value of each term. However, in general, it's necessary to introduce the Bayes formula:
P(h | X) = P(X | h) · P(h) / P(X)
• In this equation, the left-hand term is called the a posteriori (which comes after) probability, because it's determined by a marginal a priori (which comes first) probability multiplied by a factor which is called the likelihood.

MAP learning(maximum a posteriori )
• When selecting the right hypothesis, a Bayesian approach is normally one of the best choices.
• For example, a real coin is a very short cylinder, so, in tossing a coin, we should also consider the probability of it landing on its edge.
• Let's say it's 0.001. This means that we have three possible outcomes: P(head) = P(tail) = (1.0 - 0.001) / 2.0 and P(edge) = 0.001. The latter event is obviously unlikely, but in Bayesian learning it must be considered (even if it'll be squeezed by the strength of the other terms).
• An alternative is picking the most probable hypothesis in terms of the a posteriori probability:
h_MAP = argmax_h P(X | h) · P(h)

Maximum-likelihood learning
• We have defined the likelihood as a filtering term in the Bayes formula. In general, it has the form:
L(h | X) = P(X | h)
• Here the term expresses the actual likelihood of a hypothesis, given a dataset X. As you can imagine, in this formula there are no more a priori probabilities, so maximizing it doesn't imply accepting a theoretical preferential hypothesis, nor considering unlikely ones. A very common approach, known as expectation-maximization, is used in many algorithms to maximize the likelihood iteratively.

• The log-likelihood (normally called L) is a useful trick that can simplify gradient calculations. A generic likelihood expression is:
L(h | X) = P(x1 | h) · P(x2 | h) · … · P(xn | h)
• As all parameters are inside hi, the gradient is a complex expression which isn't very manageable. However, our goal is maximizing the likelihood; by applying the logarithm, it's easier to minimize the negative log-likelihood instead:
−log L(h | X) = −Σi log P(xi | h)
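A minimal sketch of maximum-likelihood estimation done this way: the mean of a Gaussian is recovered by minimizing the negative log-likelihood of synthetic data (the true mean of 4.0 and unit variance are assumptions made only for the example):

```python
# Maximum-likelihood sketch: estimate a Gaussian mean by minimizing -log L.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(3)
data = rng.normal(loc=4.0, scale=1.0, size=500)       # samples from N(4, 1)

def negative_log_likelihood(mu):
    # -sum_i log P(x_i | mu), with the standard deviation fixed at 1 for simplicity.
    return -np.sum(norm.logpdf(data, loc=mu, scale=1.0))

best = minimize_scalar(negative_log_likelihood, bounds=(0, 10), method="bounded")
print(best.x)            # ~4.0 - and equal to data.mean() for this simple model
```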

Elements of information theory
• A machine learning problem can also be analyzed in
terms of information transfer or exchange. Our dataset
is composed of n features, which are considered
independent (for simplicity, even if it's often a realistic
assumption) drawn from n different statistical
distributions.
• Therefore, there are n probability density functions
pi(x) which must be approximated through other n
qi(x) functions.
• In any machine learning task, it's very important to
understand how two corresponding distributions
diverge and what is the amount of information we lose
when approximating the original dataset.
The most useful measure is called entropy:
H(X) = −Σx p(x) log2 p(x)

This value is proportional to the uncertainty of X and it's measured in bits (if the logarithm has another base, this unit can change too). For many purposes, a high entropy is preferable, because it means that a certain feature contains more information. For example, in tossing a coin (two possible outcomes), H(X) = 1 bit, but if the number of outcomes grows, even with the same probability, H(X) also grows because of the higher number of different values and therefore the increased variability. It's possible to prove that for a Gaussian distribution (using the natural logarithm):
H(X) = ½ ln(2πeσ²)
• So, the entropy grows with the variance, which is a measure of the amount of information carried by a single feature.
• Low variance implies a low information level, and a model could often discard all such features.
• If we have a target probability distribution p(x), which is approximated by another distribution q(x), a useful measure is the cross-entropy between p and q:
H(p, q) = −Σx p(x) log q(x)

• In order to understand how a machine learning approach is performing, it's also useful to introduce the conditional entropy, or the uncertainty of X given the knowledge of Y:
H(X | Y) = H(X, Y) − H(Y)
• It's also possible to introduce the idea of mutual information, which is the amount of information shared by both variables and therefore the reduction of uncertainty about X provided by the knowledge of Y:
I(X; Y) = H(X) − H(X | Y)

• Intuitively, when X and Y are independent, they don't
share any information. However, in machine learning
tasks, there's a very tight dependence between an
original feature and its prediction, so we want to
maximize the information shared by both
distributions.
• If the conditional entropy is small enough (so Y is
able to describe X quite well), the mutual information
gets close to the marginal entropy H(X), which
measures the amount of information we want to learn.
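A short sketch computing entropy, cross-entropy, and mutual information directly from their definitions, for small made-up discrete distributions:

```python
# Information theory sketch: entropy, cross-entropy, and mutual information in bits.
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log2(p))

def cross_entropy(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log2(q))

p_coin = [0.5, 0.5]
print(entropy(p_coin))                    # 1.0 bit, as in the coin-toss example above
print(cross_entropy(p_coin, [0.9, 0.1]))  # > H(p): approximating p with q costs extra bits

# Mutual information from a joint distribution: I(X; Y) = H(X) + H(Y) - H(X, Y).
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])            # P(X, Y); X and Y are clearly dependent here
p_x, p_y = joint.sum(axis=1), joint.sum(axis=0)
print(entropy(p_x) + entropy(p_y) - entropy(joint.ravel()))   # > 0 shared bits
```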

References

• Russell S., Norvig P., Artificial Intelligence: A Modern Approach, Pearson
• Valiant L., A Theory of the Learnable, Communications of the ACM, Vol. 27, No. 11 (Nov. 1984)
• Hastie T., Tibshirani R., Friedman J., The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer
• Aleksandrov A.D., Kolmogorov A.N., Lavrent'ev M.A., Mathematics: Its Content, Methods and Meaning, Courier Corporation
• https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-algorithms

