Introduction to Machine Learning:
Machine learning
• Machine learning is a type of artificial intelligence that enables
computers to detect patterns and establish baseline behavior using
algorithms that learn through training or observation.
• It can process and analyze vast amounts of data that are simply
impractical for humans.
• Machine learning tasks are classified into two main categories:
• Supervised learning – the machine is presented with a set of inputs
and their expected outputs; later, given a new input, it predicts the
output.
• Unsupervised learning – the machine aims to find patterns within a
dataset without explicit input from a human as to what these
patterns might look like.
What is supervised learning?
• Supervised machine learning is a branch of artificial intelligence that
focuses on training models to make predictions or decisions based on
labeled training data.
• It involves a learning process where the model learns from known
examples to predict or classify unseen or future instances accurately.
• Supervised machine learning has two key components: the input
data and the corresponding output labels.
• The goal is to build a model that can learn from this labeled data to
make predictions or classifications on new, unseen data.
• The labeled data consists of input features (also known as
independent variables or predictors) and the corresponding output
labels (also known as dependent variables or targets).
• The model’s objective is to capture patterns and relationships
between the input features and the output labels, allowing it to
generalize and make accurate predictions on unseen data.
How Does Supervised Learning Work?
• Supervised machine learning typically follows a series of steps to train
a model and make predictions.
Data Collection and Labeling
• The first step in supervised machine learning is collecting a
representative and diverse dataset. This dataset should include a
sufficient number of labeled examples that cover the range of inputs
and outputs the model will encounter in real-world scenarios.
• The labeling process involves assigning the correct output label to
each input example in the dataset. This can be a time-consuming and
labor-intensive task, depending on the complexity and size of the
dataset.
Training and Test Sets
• Once the dataset is collected and labeled, it is divided into two
subsets: the training set and the test set. The training set is used to
train the model, while the test set is used to evaluate its performance
on unseen data.
• The training set serves as the basis for the model to learn patterns
and relationships between the input features and the output labels.
The test set, on the other hand, helps assess the model’s
generalization ability and its performance on new, unseen data.
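A minimal sketch of the split described above, using only the Python standard library. The function name `train_test_split`, the 20% test fraction, and the toy dataset are assumptions made for illustration; they are not prescribed by these notes.

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle labeled examples and split them into (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(examples)                  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)      # seeded shuffle for reproducibility
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical labeled dataset: (input feature, output label) pairs.
dataset = [(x, 2 * x + 1) for x in range(10)]
train_set, test_set = train_test_split(dataset, test_fraction=0.2, seed=42)
print(len(train_set), len(test_set))  # 8 2
```

Shuffling before splitting avoids a test set that only contains the last-collected examples; the test set is then kept aside until evaluation.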
Feature Extraction
• Before training the model, it is essential to extract relevant features
from the input data.
• Feature extraction involves selecting or transforming the input
features to capture the most relevant information for the learning
task.
• This process can enhance the model’s predictive performance and
reduce the dimensionality of the data.
Model Selection and Training
• Choosing an appropriate machine learning algorithm is crucial for the
success of supervised learning.
• Different algorithms have different strengths and weaknesses,
making it important to select the one that best fits the problem at
hand.
• Once the algorithm is selected, the model is trained using the
labeled training data.
• During the training process, the model learns the underlying
patterns and relationships in the data by adjusting its internal
parameters. The objective is to minimize the difference between the
predicted outputs and the true labels in the training data.
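To make "adjusting internal parameters" concrete, here is a small sketch that fits a straight line by gradient descent on the mean squared error. The model form (y ≈ w·x + b), the learning rate, the number of epochs, and the toy data are all assumptions chosen for illustration; real algorithms differ in their parameters and loss functions.

```python
def train_linear_model(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y ~ w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0                     # internal parameters, adjusted during training
    n = len(xs)
    for _ in range(epochs):
        # Difference between predicted outputs and true labels.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Hypothetical training data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = train_linear_model(xs, ys)
print(round(w, 3), round(b, 3))  # 2.0 1.0
```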
Prediction and Evaluation
• Once the model is trained, it can be used to make predictions on new,
unseen data.
• The input features of the unseen data are fed into the trained
model, which generates predictions or classifications based on the
learned patterns.
• To evaluate the model’s performance, the predicted outputs are
compared against the true labels of the unseen data.
• Common evaluation metrics for classification include accuracy,
precision, recall, and F1 score; regression tasks typically use error
measures such as mean squared error.
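The classification metrics named above can be computed directly from the predicted and true labels. The sketch below assumes binary labels with 1 as the positive class; the function name and the example labels are hypothetical.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical true labels and model predictions for eight test examples.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)  # every metric is 0.75 for this example
```

Precision asks "of the examples predicted positive, how many were right?", while recall asks "of the truly positive examples, how many were found?"; F1 is their harmonic mean.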
Regression vs. Classification in Machine Learning
• Regression and Classification algorithms are Supervised Learning
algorithms.
• Both kinds of algorithm are used for prediction in machine learning
and work with labeled datasets.
• They differ in the kind of machine learning problem they are used
for.
• The main difference is that Regression algorithms are used to predict
continuous values such as price, salary, age, etc., while Classification
algorithms are used to predict or classify discrete values such as
Male or Female, True or False, Spam or Not Spam, etc.
Classification:
• Classification is a process of finding a function which helps in dividing
the dataset into classes based on different parameters.
• In Classification, a computer program is trained on the training
dataset and based on that training, it categorizes the data into
different classes.
• Example: The best example to understand the Classification problem
is Email Spam Detection.
• The model is trained on the basis of millions of emails on different
parameters, and whenever it receives a new email, it identifies
whether the email is spam or not. If the email is spam, then it is
moved to the Spam folder.
Types of ML Classification Algorithms:
• Logistic Regression
• K-Nearest Neighbours
• Support Vector Machines
• Kernel SVM
• Naïve Bayes
• Decision Tree Classification
• Random Forest Classification
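As an illustration of one algorithm from this list, the following is a minimal K-Nearest Neighbours classifier. The two features (number of links and number of exclamation marks per email) and the tiny training set are invented for the example; a real spam filter would use far more data and richer features.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbours."""
    # Sort training examples by Euclidean distance to the query point.
    by_distance = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical features: (number of links, number of exclamation marks).
train_points = [(0, 0), (1, 0), (0, 1), (5, 6), (6, 5), (7, 7)]
train_labels = ["not spam", "not spam", "not spam", "spam", "spam", "spam"]

print(knn_predict(train_points, train_labels, (6, 6)))  # spam
print(knn_predict(train_points, train_labels, (1, 1)))  # not spam
```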
Advantages and Disadvantages of Supervised Learning
Advantages:
• Since supervised learning works with a labelled dataset, we can
have an exact idea about the classes of objects.
• These algorithms are helpful in predicting the output on the basis of
prior experience.
Disadvantages:
• These algorithms may struggle with complex tasks for which labelled
examples are scarce or hard to define.
• They may predict the wrong output if the test data differs from the
training data.
• Training can require a lot of computation time.
Applications of Supervised Learning
• Image Segmentation:
Supervised Learning algorithms are used in image segmentation. In this process, image
classification is performed on different image data with pre-defined labels.
• Medical Diagnosis:
Supervised algorithms are also used in the medical field for diagnosis. Models are trained
on medical images and past data labelled with disease conditions. With such a process,
the machine can identify a disease in new patients.
• Fraud Detection - Supervised Learning classification algorithms are used for identifying
fraudulent transactions, fraudulent customers, etc. This is done by using historic data to
identify the patterns that can indicate possible fraud.
• Spam detection - In spam detection & filtering, classification algorithms are used. These
algorithms classify an email as spam or not spam. The spam emails are sent to the spam
folder.
• Speech Recognition - Supervised learning algorithms are also used in speech recognition.
The algorithm is trained with voice data, and various identifications can be done using
the same, such as voice-activated passwords, voice commands, etc.
Difference between Regression and Classification
Regression Algorithm | Classification Algorithm
In Regression, the output variable must be continuous in nature or a real value. | In Classification, the output variable must be a discrete value.
The task of the regression algorithm is to map the input value (x) to the continuous output variable (y). | The task of the classification algorithm is to map the input value (x) to the discrete output variable (y).
Regression algorithms are used with continuous data. | Classification algorithms are used with discrete data.
In Regression, we try to find the best-fit line, which can predict the output more accurately. | In Classification, we try to find the decision boundary, which can divide the dataset into different classes.
Regression algorithms can be used to solve regression problems such as weather prediction, house price prediction, etc. | Classification algorithms can be used to solve classification problems such as identification of spam emails, speech recognition, identification of cancer cells, etc.
Regression algorithms can be further divided into Linear and Non-linear Regression. | Classification algorithms can be divided into Binary Classifier and Multi-class Classifier.
What is Unsupervised Learning?
• There may be many cases in which we do not have labeled data and need
to find the hidden patterns from the given dataset.
• So, to solve such types of cases in machine learning, we need unsupervised
learning techniques.
• Unsupervised learning is a type of machine learning in which
models are trained using an unlabeled dataset and are allowed to act
on that data without any supervision.
• In other words, the models are not supervised using a labeled
training dataset.
• Instead, the models themselves find the hidden patterns and insights
in the given data.
• It can be compared to learning which takes place in the human brain while
learning new things.
• Unsupervised learning cannot be directly applied to a regression or
classification problem because unlike supervised learning, we have
the input data but no corresponding output data.
• The goal of unsupervised learning is to find the underlying structure
of dataset, group that data according to similarities, and represent
that dataset in a compressed format.
• Example:
• Suppose the unsupervised learning algorithm is given an input
dataset containing images of different types of cats and dogs.
• The algorithm is never trained on labeled examples from the dataset,
which means it has no prior idea about the features of the dataset.
• The task of the unsupervised learning algorithm is to identify the
image features on its own.
• The unsupervised learning algorithm will perform this task by
clustering the image dataset into groups according to similarities
between images.
Why use Unsupervised Learning?
• Unsupervised learning is helpful for finding useful insights from the
data.
• Unsupervised learning is similar to how a human learns to think from
their own experiences, which makes it closer to real AI.
• Unsupervised learning works on unlabeled and uncategorized data,
which makes it all the more important.
• In the real world, we do not always have input data with the
corresponding output, so to solve such cases we need unsupervised
learning.
Working of Unsupervised Learning
• Here, we take unlabeled input data, which means it is not
categorized and the corresponding outputs are not given.
• This unlabeled input data is fed to the machine learning model
in order to train it.
• First, the model interprets the raw data to find hidden patterns, and
then a suitable algorithm such as k-means clustering or hierarchical
clustering is applied.
Types of Unsupervised Learning Algorithm:
• Clustering:
• Clustering is a method of grouping objects into clusters such that
objects with the most similarities remain in one group and have few or no
similarities with the objects of another group. Cluster analysis finds the
commonalities between the data objects and categorizes them according to the
presence or absence of those commonalities.
• Association:
• Association rule learning is an unsupervised learning method used for
finding relationships between variables in a large database. It
determines the sets of items that occur together in the dataset.
• Association rules make marketing strategy more effective. For example,
people who buy item X (say, bread) also tend to purchase item Y
(butter or jam). A typical example of association rule learning is Market
Basket Analysis.
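The bread and butter example can be quantified with two standard association-rule measures, support and confidence. The sketch below is a simplified illustration on an invented list of shopping baskets; it does not implement a full rule-mining algorithm such as Apriori.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated probability of `consequent` given `antecedent` (rule X -> Y)."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the antecedent never occurs, so the rule has no evidence
    return support(transactions, set(antecedent) | set(consequent)) / base

# Hypothetical shopping baskets.
transactions = [
    {"bread", "butter"},
    {"bread", "jam"},
    {"bread", "butter", "milk"},
    {"milk"},
    {"bread", "butter", "jam"},
]

bread_support = support(transactions, {"bread"})                 # 4 of 5 baskets
rule_confidence = confidence(transactions, {"bread"}, {"butter"})  # 3 of the 4 bread baskets
print(bread_support, round(rule_confidence, 2))  # 0.8 0.75
```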
Unsupervised Learning algorithms:
• K-means clustering
• KNN (k-nearest neighbors; unsupervised only when used for neighbor
search rather than classification)
• Hierarchical clustering
• Anomaly detection
• Neural Networks
• Principal Component Analysis
• Independent Component Analysis
• Apriori algorithm
• Singular value decomposition
Advantages of Unsupervised Learning
• Unsupervised learning is used for more complex tasks than
supervised learning because, in unsupervised learning, we don't have
labeled input data.
• Unsupervised learning is preferable as it is easy to get unlabeled data
in comparison to labeled data.
Disadvantages of Unsupervised Learning
• Unsupervised learning is intrinsically more difficult than supervised
learning as it does not have corresponding output.
• The result of the unsupervised learning algorithm might be less
accurate as input data is not labeled, and algorithms do not know the
exact output in advance.
Applications of Unsupervised Learning
• Network Analysis: Unsupervised learning is used in document network
analysis of text data, for example to identify plagiarism and copyright
issues in scholarly articles.
• Recommendation Systems: Recommendation systems widely use
unsupervised learning techniques for building recommendation
applications for different web applications and e-commerce websites.
• Anomaly Detection: Anomaly detection is a popular application of
unsupervised learning, which can identify unusual data points within the
dataset. It is used to discover fraudulent transactions.
• Singular Value Decomposition: Singular Value Decomposition or SVD is
used to extract particular information from the database. For example,
extracting information about each user located at a particular location.
Difference between Supervised and Unsupervised Learning
Supervised Learning | Unsupervised Learning
Input data is labelled. | Input data is not labelled.
Uses a training data set. | Uses just the input data set.
Data is classified based on the training data set. | Uses properties of the given data to classify it.
Used for prediction. | Used for analysis.
Divided into two types: Regression and Classification. | Divided into two types: Clustering and Association.
Uses offline analysis of data. | Uses real-time data.
The goal is to train the model so that it can predict the output when it is given new data. | The goal is to find the hidden patterns and useful insights from the unknown data set.
Algorithms include: Decision Trees, Logistic Regression, Support Vector Machine, etc. | Algorithms include: k-means clustering, hierarchical clustering, etc.
Clustering
• Clustering is the assignment of objects to homogeneous groups
(called clusters) while making sure that objects in different groups
are not similar.
• Clustering is considered an unsupervised task as it aims to describe
the hidden structure of the objects.
• Each object is described by a set of characteristics called features.
• The first step of dividing objects into clusters is to define the distance
between the different objects. Defining an adequate distance
measure is crucial for the success of the clustering process.
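A compact sketch of k-means clustering shows how the chosen distance drives the grouping. It assumes Euclidean distance and, to stay deterministic, uses the first k points as initial centroids; both are simplifications, and practical implementations usually initialize randomly and run several restarts. The points are invented.

```python
import math

def k_means(points, k, iterations=10):
    """Cluster points with Lloyd's algorithm; returns (centroids, assignments)."""
    centroids = [tuple(p) for p in points[:k]]   # assumption: first k points as seeds
    assignments = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return centroids, assignments

# Hypothetical 2-D objects forming two visibly separate groups.
points = [(1, 1), (8, 8), (1, 2), (2, 1), (9, 8), (8, 9)]
centroids, assignments = k_means(points, k=2)
print(assignments)  # [0, 1, 0, 0, 1, 1]
```

Swapping `math.dist` for another distance function changes which objects count as "similar", which is why the choice of distance measure matters so much.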
Dimensionality Reduction
• In the field of machine learning, it is useful to apply a process called
dimensionality reduction to highly dimensional data.
• The purpose of this process is to reduce the number of features
under consideration, where each feature is a dimension that partly
represents the objects.
• Why is dimensionality reduction important? As more features are
added, the data becomes very sparse and analysis suffers from the
curse of dimensionality. Additionally, it is easier to process smaller
data sets.
Dimensionality reduction can be executed using two different methods:
• Selecting from the existing features (feature selection)
• Extracting new features by combining the existing features (feature
extraction)
• The main technique for feature extraction is Principal Component
Analysis (PCA).
• PCA guarantees finding the best linear transformation that reduces
the number of dimensions with a minimum loss of information.
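The idea behind PCA can be sketched without any libraries by finding the direction of largest variance with power iteration on the covariance matrix. This is only an illustration of the first principal component under simplifying assumptions (a fixed starting vector that must not be orthogonal to the answer, a fixed number of iterations, invented data); library implementations normally use an eigendecomposition or SVD instead.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (feature means, unit vector of largest variance) via power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the features (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0 / (j + 1) for j in range(d)]        # arbitrary starting direction
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]                # renormalize to unit length
    return means, v

def project(rows, means, direction):
    """Reduce each row to one number: its coordinate along `direction`."""
    return [sum((r[j] - means[j]) * direction[j] for j in range(len(direction)))
            for r in rows]

# Hypothetical 2-D data lying close to the line y = x.
data = [(2.0, 1.9), (0.5, 0.7), (3.0, 3.1), (1.0, 0.8), (4.0, 4.2)]
means, direction = first_principal_component(data)
scores = project(data, means, direction)   # 1-D representation of the 2-D data
print([round(x, 2) for x in direction])
```

Here two features are reduced to one, and little information is lost because the points vary mostly along a single direction.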
Machine learning generally uses Feature Extraction for pre-processing
the data. The most common techniques used are:
• Principal Component Analysis
• For feature extraction, Principal Component Analysis (PCA)
uses a linear transformation to produce a set of new
principal components, reducing the number of dimensions
with minimal information loss.
• The process is repeated to find further components, each
orthogonal to and uncorrelated with the previous ones, and
each capturing as much of the remaining variance of the
data set as possible.
• Singular Value Decomposition
• Singular Value Decomposition (SVD) factorizes a matrix
into three matrices. It is based on the formula
$A = U S V^{T}$, where U and V represent orthogonal
matrices, and S represents a diagonal matrix.
• Like PCA, it is generally used to reduce noise and compress
data, such as in image files.
• Random Forest
• Another popular dimensionality reduction method in
machine learning, the random forest technique, has a
built-in mechanism for estimating feature importance. It
uses statistics of each attribute to find a subset of features.
• However, many implementations only accept numerical
variables. Therefore, categorical data first has to be
processed using one-hot encoding.
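One-hot encoding, mentioned above as the preprocessing step for categorical data, can be sketched as follows. The function name and the colour values are assumptions for illustration; the categories are sorted alphabetically so that the column order is reproducible.

```python
def one_hot_encode(values):
    """Map each categorical value to a 0/1 vector containing a single 1."""
    categories = sorted(set(values))                       # one column per category
    column = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[column[value]] = 1                             # mark the matching column
        encoded.append(row)
    return categories, encoded

# Hypothetical categorical feature.
colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot_encode(colors)
print(categories)  # ['blue', 'green', 'red']
print(encoded)     # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```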
The two main applications of unsupervised learning are clustering and dimensionality
reduction. For clustering, points with high similarity (i.e. near each other) are grouped
together in clusters to identify patterns in data. The field of dimensionality reduction
tries to transform high-dimensional data into efficient low-dimensional representations
by finding appropriate transformations ϕ.
Bias-Variance Trade Off
• What is Bias?
• Bias is the difference between the model's average prediction and
the correct value.
• High bias gives a large error on both training and testing data.
• It is recommended that an algorithm be low-biased to avoid the
problem of underfitting.
• With high bias, the predictions follow a straight line and thus do not
fit the data in the data set accurately. Such fitting is known as
Underfitting of Data.
• This happens when the hypothesis is too simple or linear in nature.
• When the bias is high, the assumptions made by our model are too
basic, and the model can’t capture the important features of our data.
• This means that our model hasn’t captured patterns in the training
data and hence cannot perform well on the testing data too. If this is
the case, our model cannot perform on new data and cannot be sent
into production.
• This instance, where the model cannot find patterns in our training
set and hence fails for both seen and unseen data, is called
Underfitting.
Underfitting
• The figure below shows an example of Underfitting. As we can see,
the model has found no patterns in our data and the line of best fit is
a straight line that does not pass through any of the data points. The
model has failed to train properly on the given data and cannot
predict new data either.
• Underfitting occurs when our machine learning model is not able to
capture the underlying trend of the data.
• To avoid overfitting, the feeding of training data can be stopped at
an early stage; if it is stopped too early, however, the model may not
learn enough from the training data.
• As a result, it may fail to find the best fit of the dominant trend in the
data.
What is Variance?
• Variance is the counterpart of bias.
• The variance of the model is the variability of its prediction
for a given data point, that is, how much the prediction
changes when the model is trained on different training sets.
• The model with high variance has a very complex fit to the
training data and thus is not able to fit accurately on the
data which it hasn’t seen before.
• As a result, such models perform very well on training data but
have high error rates on test data.
• When a model has high variance, it is said to overfit the
data (Overfitting of Data).
In the above figure, we can see that our model has learned extremely well from our
training data, which has taught it to identify cats. But when given new data, such as
the picture of a fox, our model predicts it as a cat, as that is what it has learned. This
happens when the variance is high: our model captures all the features of the data
given to it, including the noise, and tunes itself to that data. It predicts the training data
very well, but it cannot predict new data well, as it is too specific to the training data.
• During training, we allow our model to ‘see’ the data a certain number
of times to find patterns in it. If it does not work on the data for long
enough, it will not find patterns and high bias results. On the other
hand, if our model is allowed to view the data too many times, it will
learn very well for only that data. It will capture most patterns in the
data, but it will also learn from the unnecessary data present, or from
the noise.
Overfitting
• Our model will perform really well on training data and get high
accuracy but will fail to perform on new, unseen data. New data may
not have the exact same features and the model won’t be able to
predict it very well. This is called Overfitting.
• Overfitting occurs when our machine learning model tries to cover all
the data points or more than the required data points present in the
given dataset.
• Because of this, the model starts capturing noise and inaccurate values
present in the dataset, and all these factors reduce the efficiency and
accuracy of the model.
• The overfitted model has low bias and high variance.
• Noise: Noise is unnecessary and irrelevant data that reduces the
performance of the model.
• The chance of overfitting increases the longer we train our model:
the more we train, the more likely the model is to become
overfitted.
• Overfitting is one of the main problems that occur in supervised learning.
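The contrast between underfitting and overfitting can be seen by comparing training and test error for two deliberately extreme models. Both models and the data are assumptions chosen for illustration: a constant predictor (too simple, high bias) and a 1-nearest-neighbour predictor that memorizes every training point (too flexible, high variance). The data roughly follow y = 2x with hand-picked noise in the training labels.

```python
def mean_squared_error(model, xs, ys):
    """Average squared difference between model(x) and the true y."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit_constant(xs, ys):
    """High-bias model: always predicts the mean training label."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_one_nearest_neighbor(xs, ys):
    """High-variance model: returns the label of the closest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda pair: abs(pair[0] - x))[1]

# Hypothetical data: y = 2x, with noise added to the training labels only.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [0.5, 1.4, 4.6, 5.5, 8.7, 9.6]
test_x = [0.5, 1.5, 2.5, 3.5, 4.5]
test_y = [1.0, 3.0, 5.0, 7.0, 9.0]

constant = fit_constant(train_x, train_y)
memorizer = fit_one_nearest_neighbor(train_x, train_y)

const_train = mean_squared_error(constant, train_x, train_y)
const_test = mean_squared_error(constant, test_x, test_y)
memo_train = mean_squared_error(memorizer, train_x, train_y)
memo_test = mean_squared_error(memorizer, test_x, test_y)

print("constant :", round(const_train, 3), round(const_test, 3))
print("memorizer:", round(memo_train, 3), round(memo_test, 3))
```

The constant model has a large error on both sets (underfitting), while the memorizer has zero training error but a clearly non-zero test error because it has also memorized the noise (overfitting).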
How to Avoid Overfitting in a Model
• Cross-Validation
• Training with more data
• Removing features
• Early stopping of training
• Regularization
• Ensembling
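Of the techniques listed, cross-validation is the easiest to show in a few lines. The sketch below only generates the index splits for k-fold cross-validation; the function name is hypothetical, and it assumes the data have already been shuffled, since folds are taken in order.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(10, 3))
for train, validation in folds:
    print(validation)
# [0, 1, 2, 3]
# [4, 5, 6]
# [7, 8, 9]
```

Each example is used for validation exactly once, so averaging the k validation scores gives a less optimistic estimate than the training score alone.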
The Bias-Variance Tradeoff
• The bias-variance tradeoff illustrates the relationship between bias
and variance in machine learning models. As we decrease bias,
variance tends to increase, and vice versa. Finding the optimal
tradeoff is crucial to achieve good model performance.
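For squared-error loss the tradeoff can be written as a standard decomposition of the expected prediction error. The notation is an assumption introduced here, not taken from these notes: the data follow $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, $\hat{f}$ is the trained model, and the expectation is over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Making the model more flexible usually shrinks the first term and grows the second; the third term cannot be reduced by any model.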
1. High Bias, Low Variance: Underfitting
• When a model has high bias and low variance, it tends to underfit the
data. Underfitting occurs when the model is too simple to capture the
underlying patterns in the data. It leads to poor performance on both
the training and testing data, as the model fails to generalize.
Underfitting can be addressed by increasing the model’s complexity
or incorporating more relevant features.
2. Low Bias, High Variance: Overfitting
• Conversely, a model with low bias and high variance tends to overfit
the data. Overfitting happens when the model becomes too complex,
capturing noise or random fluctuations in the training data. It
performs exceptionally well on the training data but fails to generalize
to unseen data. To address overfitting, techniques like regularization,
feature selection, or collecting more training data can be employed.
Finding the Optimal Balance
• To achieve optimal model performance, we aim to strike a balance
between bias and variance.
1. Cross-Validation: Utilize techniques like cross-validation to estimate model
performance on unseen data. This helps assess the model’s bias and
variance and make informed decisions about model complexity.
2. Regularization: Regularization techniques, such as L1 or L2 regularization,
help control model complexity and prevent overfitting. They add a penalty
term to the model’s objective function, discouraging overly complex
solutions.
3. Feature Engineering: Carefully selecting relevant features and removing
redundant or noisy ones can help reduce model complexity and enhance
generalization.
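The effect of an L2 penalty can be illustrated on the simplest possible model, a line through the origin (y ≈ w·x). Minimizing the sum of squared errors plus `l2_penalty * w**2` has the closed-form solution used below; the function name, the no-intercept model, and the data are assumptions made for this sketch.

```python
def fit_ridge_slope(xs, ys, l2_penalty):
    """Slope w minimizing sum((w*x - y)**2) + l2_penalty * w**2."""
    # Setting the derivative to zero gives w = sum(x*y) / (sum(x*x) + penalty).
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + l2_penalty)

# Hypothetical data generated from y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

slopes = [fit_ridge_slope(xs, ys, penalty) for penalty in (0.0, 1.0, 10.0)]
print([round(w, 3) for w in slopes])  # [2.0, 1.867, 1.167]
```

With no penalty the fit recovers the slope 2 exactly; increasing the penalty shrinks the slope toward zero, trading a little bias for lower variance.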
• The bias-variance tradeoff is a critical concept in machine learning
that guides us in finding the optimal balance between model
simplicity and complexity. Understanding and managing this tradeoff
is crucial for developing models that generalize well to unseen data.
By carefully assessing bias and variance, utilizing techniques like
cross-validation, regularization, and feature engineering, we can strike
the right balance and create models that achieve optimal
performance in real-world scenarios.