The document provides an introduction to machine learning, defining it as a field that enables computers to learn from data without explicit programming. It discusses the increasing need for machine learning due to its ability to handle complex tasks and highlights various applications, advantages, and disadvantages of machine learning techniques. Additionally, it covers different types of machine learning, including supervised and unsupervised learning, along with key concepts such as training and testing datasets, overfitting, and performance measures.
Introduction to Machine learning

Module 1
What is Machine Learning?
● “ A Field of study that gives computers the ability to learn without being
explicitly programmed”
-Arthur Samuel (1959)

● Machine learning is a branch of AI in which algorithms use data to
analyse information and make decisions automatically, without human
intervention.
● It describes how computers perform tasks on their own based on previous
experience.
● The difference between conventional software and machine learning is
that a developer does not write code instructing the system how to react to
every situation; instead, the system is trained on a large amount of data.
Need for Machine Learning

● The need for machine learning is increasing day by day, because it is capable of doing
tasks that are too complex for a person to implement directly.
● As humans, we have limitations: we cannot process huge amounts of data manually. For
this we need computer systems, and machine learning makes things easy for us.
● We can train machine learning algorithms by providing them with huge amounts of data,
letting them explore the data, construct models, and predict the required output
automatically.
● The performance of a machine learning algorithm depends on the amount of data, and it
can be measured by the cost function. With the help of machine learning, we can save both
time and money.
Applications of Machine learning
● Virtual personal assistants
● Speech recognition
● Email spam and malware filtering
● Bioinformatics
● Natural language processing
● Traffic prediction
● Online transportation
● Social media services
● Product recommendation
● Online fraud detection
Advantages of ML
● Fast, accurate, and efficient.
● Automation of most applications.
● Wide range of real-life applications.
● Enhanced cyber security and spam detection.
● No human intervention is needed.
● Handles multi-dimensional data.
Disadvantages of ML
● It is difficult to identify and rectify errors.
● Acquiring enough high-quality data is hard.
● Interpretation of results requires more time and space.
Difference Between Machine Learning And Artificial Intelligence

● Artificial Intelligence is the concept of creating intelligent machines that simulate human
behaviour, whereas machine learning is a subset of Artificial Intelligence that allows machines
to learn from data without being explicitly programmed.

Difference between Signal Processing and Machine learning

● Signal processing is a branch of electrical engineering used to model and analyse analog
and digital data representations of physical events. All the technology we use today and
even rely on in our everyday lives (computers, radios, videos, mobile phones) is enabled
by signal processing. Hence, it truly represents the science behind our digital lives.
● Machine learning is the study of computer algorithms that learn to do prediction and/or
classification based on just a set of collected data.
Steps involved in developing a machine learning application
Types of Machine learning
1. Supervised Machine Learning
2. Unsupervised Machine Learning
3. Semi-Supervised Machine Learning
4. Reinforcement Learning
Supervised Learning
● Supervised learning is a machine learning method in which we provide sample
labeled data to the machine learning system in order to train it; on that basis, it predicts
the output.
● The system creates a model using labeled data to understand the dataset. Once training
and processing are done, we test the model by providing sample data to check whether it
predicts the correct output.
● As input data is fed into the model, it adjusts its weights until the model has been
fitted appropriately, which occurs as part of the cross-validation process. Supervised
learning helps organizations solve a variety of real-world problems at scale, such
as classifying spam into a separate folder from your inbox.
Supervised Machine Learning
Steps Involved in Supervised Learning:

● First, determine the type of training dataset.
● Collect/gather the labelled training data.
● Split the dataset into a training dataset, a test dataset, and a validation dataset.
● Determine the input features of the training dataset, which should carry enough information
for the model to accurately predict the output.
● Determine a suitable algorithm for the model, such as a support vector machine or a decision tree.
● Execute the algorithm on the training dataset. Sometimes we need validation sets as control
parameters; these are subsets of the training dataset.
● Evaluate the accuracy of the model on the test set. If the model predicts the correct
output, the model is accurate.
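The steps above can be sketched end to end with scikit-learn. This is a minimal illustration, not part of the original slides: the iris dataset and the decision tree classifier are arbitrary choices made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Gather a labelled dataset (iris is used purely for illustration)
X, y = load_iris(return_X_y=True)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Choose an algorithm and execute it on the training dataset
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Evaluate the model's accuracy on the held-out test set
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f'Test accuracy: {accuracy:.2f}')
```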
How supervised learning works
● Supervised learning uses a training set to teach models to
yield the desired output. This training dataset includes
inputs and correct outputs, which allow the model to learn
over time. The algorithm measures its accuracy through the
loss function, adjusting until the error has been sufficiently
minimized.
● Supervised learning can be separated into two types of
problems when data mining—classification and regression:
● Classification uses an algorithm to accurately assign test
data into specific categories. It recognizes specific entities
within the dataset and attempts to draw some conclusions
on how those entities should be labeled or defined.
● Common classification algorithms are linear classifiers,
support vector machines (SVM), decision trees, k-nearest
neighbors, and random forests.
● Regression algorithms are used when there is a relationship between the input variable and the
output variable. They are used for the prediction of continuous variables, such as weather
forecasting or market trends. Common algorithms include Linear Regression, Regression Trees,
Non-Linear Regression, Bayesian Linear Regression, and Polynomial Regression.

● Classification algorithms are used when the output variable is categorical, e.g. classes such as
Yes-No, Male-Female, or True-False, as in spam filtering. Common algorithms include
Random Forest, Decision Trees, Logistic Regression, and Support Vector Machines.

● Disadvantages of Supervised Learning:

● Supervised learning models are not suitable for handling very complex tasks.
● Supervised learning cannot predict the correct output if the test data differs from the
training dataset.
● Training requires a lot of computation time.
● In supervised learning, we need enough knowledge about the classes of objects.
Unsupervised Learning
● As the name suggests, unsupervised learning is a
machine learning technique in which models are not
supervised using a labeled training dataset. Instead, the
models themselves find hidden patterns and insights in the given data.
● The goal of unsupervised learning is to find the
underlying structure of the dataset, group the data
according to similarities, and represent the dataset in a
compressed format.
● For example, the task of an unsupervised learning algorithm may be to
identify image features on its own. The algorithm performs this task
by clustering the image dataset into groups according to the
similarities between images.
Unsupervised Learning
Clustering
Clustering: Clustering is a method of grouping objects into clusters such that
objects with the most similarities remain in one group and have few or no similarities with
the objects of another group.

The clustering technique can be widely used in various tasks. Some most common uses of this technique are:

● Market Segmentation
● Statistical data analysis
● Social network analysis
● Image segmentation

It is used by Amazon in its recommendation system to provide recommendations based on a user's past
product searches. Netflix also uses this technique to recommend movies and web series to its users based
on their watch history.
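As a sketch of clustering, k-means can group unlabeled points by similarity. The two synthetic blobs below are an assumption made purely for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated blobs of 2-D points (synthetic, for illustration only)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
group_b = rng.normal(loc=[5, 5], scale=0.5, size=(50, 2))
X = np.vstack([group_a, group_b])

# Group the unlabeled points into 2 clusters by similarity
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Each point gets a cluster label; points from the same blob share a label
labels = kmeans.labels_
print(labels[:5], labels[-5:])
```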
Unsupervised Learning

● Association: An association rule is an unsupervised learning method used for
finding relationships between variables in a large database. It determines the set of
items that occur together in the dataset. Association rules make marketing strategy more
effective: for example, people who buy item X (say, bread) also tend to purchase item Y
(butter/jam). A typical example of association rules is Market Basket Analysis.
● Example: Search engines also work on the clustering technique. The search
results appear based on the objects closest to the search query, grouping similar data
objects into one group that is far from other, dissimilar objects. The accuracy of a query's
results depends on the quality of the clustering algorithm used.
Some popular unsupervised learning algorithms:

● K-means clustering
● Hierarchical clustering
● Anomaly detection
● Neural networks
● Principal Component Analysis
● Independent Component Analysis
● Apriori algorithm
● Singular value decomposition
Advantages of Unsupervised Learning

● Unsupervised learning can be used for more complex tasks than supervised
learning, because it does not require labeled input data.
● Unsupervised learning is often preferable because unlabeled data is easier to obtain than
labeled data.

Disadvantages of Unsupervised Learning

● Unsupervised learning is intrinsically more difficult than supervised learning because
there is no corresponding output to learn from.
● The results of an unsupervised learning algorithm may be less accurate because the input
data is not labeled, so the algorithm does not know the exact output in advance.
Supervised Vs Unsupervised Learning
Training and Testing Data Set

● In machine learning projects, we generally divide the original dataset into


training data and test data.
● We train our model over a subset of the original dataset, i.e., the training
dataset, and then evaluate whether it can generalize well to the new or unseen
dataset or test set.
Therefore, train and test datasets are the two key concepts of machine learning,
where the training dataset is used to fit the model, and the test dataset is used to
evaluate the model.

What is a Training Dataset?

The training data is the largest subset of the original dataset and is
used to train or fit the machine learning model. The training data is fed to the
ML algorithm, which lets it learn how to make predictions for the given task.
What is a Testing Dataset?
● For Unsupervised learning, the training data contains unlabeled data points, i.e., inputs
are not tagged with the corresponding outputs. Models are required to find the patterns
from the given training datasets in order to make predictions.
● On the other hand, for supervised learning, the training data contains labels in order to
train the model and make predictions.
● The type of training data that we provide to the model is highly responsible for the
model's accuracy and prediction ability. It means that the better the quality of the
training data, the better will be the performance of the model.
● Training data is approximately more than or equal to 60% of the total data for an ML
project.
● The test dataset is another subset of original data, which is independent of the
training dataset. Usually, the test dataset is approximately 20-25% of the total original data
for an ML project.
Overfitting and Underfitting
● At this stage, we can also check and compare the testing accuracy with the training
accuracy, which means how accurate our model is with the test dataset against the
training dataset.
● If the accuracy of the model on training data is greater than that on testing data, then
the model is said to have overfitting.
● On the other hand, the model is said to be under-fitted when it is not able to capture the
underlying trend of the data.
● It means the model shows poor performance even with the training dataset.
● In most cases, underfitting issues occur when the model is not perfectly suitable for
the problem that we are trying to solve.
● To avoid the underfitting issue, we can increase the training time of the model or
increase the number of features in the dataset; overfitting is addressed with techniques
such as cross-validation, early stopping, and regularization.
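Comparing training and testing accuracy, as described above, is a quick overfitting check. A hedged sketch (the breast-cancer dataset and the unconstrained decision tree are illustrative choices, not from the slides):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained decision tree can memorise the training data
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)

# Training accuracy noticeably above test accuracy suggests overfitting
print(f'train={train_acc:.2f} test={test_acc:.2f}')
```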
How to detect underfitting and overfitting
Cross-Validation
● There are various ways by which we can avoid overfitting in the
model, such as Using the Cross-Validation method, early stopping
the training, or by regularization etc.
● Cross-validation is a technique for validating the model efficiency by
training it on the subset of input data and testing on previously
unseen subset of the input data.
● We can also say that it is a technique to check how a statistical model
generalizes to an independent dataset.
Hence the basic steps of cross-validations are:

● Reserve a subset of the dataset as a validation set.
● Train the model using the training dataset.
● Evaluate model performance using the validation set. If the model
performs well on the validation set, proceed to the next step; otherwise,
check for issues.
Methods of cross validation
There are some common methods that are used for cross-validation. These methods are
given below:
1. Leave one out cross-validation
2. K-fold cross-validation
3. Stratified k-fold cross-validation
4. Time series Cross Validation
● Cross-validation dataset: It is used to overcome the disadvantage of a single train/test split
by splitting the dataset into several groups of train/test splits and averaging the results.
● It can be used if we want to optimize a model that has been trained on the training
dataset for the best performance.
● It is more efficient than a single train/test split, as every observation is used for both
training and testing.
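The k-fold idea above can be sketched with scikit-learn's `cross_val_score`; the dataset and model here are assumptions for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: every observation is used for both training and testing
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print(f'Fold accuracies: {scores}')
print(f'Mean accuracy: {scores.mean():.2f}')
```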
Hypothesis Testing in Machine learning
● To trust a model and make
predictions, we use hypothesis
testing. When we use sample data
to train a model, we make
assumptions about the population. By
performing hypothesis testing, we
validate these assumptions at a
desired significance level.
● ML professionals and data scientists make an initial assumption about the solution of
the problem. This assumption in machine learning is known as a hypothesis.
● A hypothesis is defined as a supposition or proposed explanation based on
insufficient evidence or assumptions. It is just a guess based on some known facts
but has not yet been proven.
● A Hypothesis is an assumption made by scientists, whereas a model is a
mathematical representation that is used to test the hypothesis.
● Hypothesis space (H):
● Hypothesis space is defined as a set of all possible legal hypotheses; hence it is
also known as a hypothesis set. It is used by supervised machine learning algorithms to
determine the best possible hypothesis to describe the target function or best maps input
to output.
● Hypothesis (h):
● It is defined as the approximate function that best describes the target in supervised
machine learning algorithms. It is primarily based on data as well as bias and restrictions
applied to data.
Key steps to perform a hypothesis test are as follows:

1. Formulate a hypothesis.
2. Determine the significance level.
3. Determine the type of test.
4. Calculate the test statistic and the p value. (The p value is a probability
between 0 and 1, computed under the assumption that the null hypothesis is true.)
5. Make a decision.
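A hedged sketch of these steps using a one-sample t-test from SciPy; the assumed population mean, the synthetic sample, and the significance level are all invented for illustration:

```python
import numpy as np
from scipy import stats

# Null hypothesis (assumed for illustration): the population mean is 50.
# We observe a synthetic sample drawn around 52.
rng = np.random.default_rng(1)
sample = rng.normal(loc=52, scale=2, size=40)

# Step 2: choose a significance level
alpha = 0.05

# Steps 3-4: a one-sample t-test gives the test statistic and p value
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)

# Step 5: decide -- reject the null hypothesis if p < alpha
decision = 'reject H0' if p_value < alpha else 'fail to reject H0'
print(f't={t_stat:.2f}, p={p_value:.4f}, decision={decision}')
```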
Reference: https://www.youtube.com/watch?v=gP3tFs2mArw
Performance Measures
1. Performance Metrics for Classification

In machine learning, each task or problem is divided into classification and Regression.

Different evaluation metrics are used for both Regression and Classification tasks.

● Accuracy
● Confusion Matrix
● Precision
● Recall
● F-Score
● AUC(Area Under the Curve)-ROC
1. ACCURACY: Accuracy is the ratio of correctly predicted observations to the total
number of observations, i.e. (TP + TN) / (TP + TN + FP + FN).

2. CONFUSION MATRIX: A confusion matrix is a tabular representation of the prediction
outcomes of a binary classifier, used to describe the performance of the
classification model on a set of test data for which the true values are known.
from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]

accuracy = accuracy_score(y_true, y_pred)
print(f'Accuracy: {accuracy}')

If a model correctly predicts 90 out of 100 instances, the accuracy is 90%.


● Precision:

● Definition: Precision is the ratio of correctly predicted positive observations to


the total predicted positives.
● Example: If a model predicted 20 instances as positive and 18 of them were
actually positive, the precision is 18/20.

from sklearn.metrics import precision_score

# y_true and y_pred as defined in the accuracy example above
precision = precision_score(y_true, y_pred)
print(f'Precision: {precision}')
● Recall (Sensitivity or True Positive Rate):

● Definition: Recall is the ratio of correctly predicted positive observations to the


all observations in the actual class.
● Example: If there were a total of 30 actual positive instances, and the model
predicted 25 of them correctly, the recall is 25/30.

from sklearn.metrics import recall_score

# y_true and y_pred as defined in the accuracy example above
recall = recall_score(y_true, y_pred)
print(f'Recall: {recall}')
● F1 Score:

● Definition: The F1 Score is the harmonic mean of precision and recall. It ranges
from 0 to 1, where 1 is the best possible F1 Score.
● Example: If a model has precision of 0.8 and recall of 0.7, the F1 Score is 2 *
(0.8 * 0.7) / (0.8 + 0.7) ≈ 0.75.

from sklearn.metrics import f1_score

# y_true and y_pred as defined in the accuracy example above
f1 = f1_score(y_true, y_pred)
print(f'F1 Score: {f1}')
1. True Positive (TP): the model predicts positive, and the actual value is positive.
2. True Negative (TN): the model predicts negative, and the actual value is negative.
3. False Positive (FP): the model predicts positive, but the actual value is negative.
4. False Negative (FN): the model predicts negative, but the actual value is positive.
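Using the same example labels as in the accuracy snippet earlier, the four counts can be read off scikit-learn's confusion matrix:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]

# For binary labels, ravel() flattens the 2x2 matrix into (TN, FP, FN, TP)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f'TP={tp} TN={tn} FP={fp} FN={fn}')  # TP=3 TN=1 FP=1 FN=1
```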

III. Precision

Precision determines the proportion of positive predictions that were actually correct. It is
calculated as the True Positives divided by the total positive predictions (True
Positives + False Positives).

IV. Recall or Sensitivity

Recall is calculated as the True Positives divided by the total number of actual positives,
whether correctly predicted as positive or incorrectly predicted as negative (True Positives +
False Negatives).
When to use Precision and Recall?

From the above definitions of Precision and Recall, we can say that recall determines the
performance of a classifier with respect to a false negative, whereas precision gives
information about the performance of a classifier with respect to a false positive.

So, if we want to minimize false negatives, then recall should be as near to 100% as possible,
and if we want to minimize false positives, then precision should be as close to 100% as possible.

In simple words, maximizing precision minimizes FP errors, and maximizing recall
minimizes FN errors.
V. F-Scores
● F-score or F1 Score is a metric to evaluate a binary classification model on the basis of
predictions that are made for the positive class.
● It is calculated with the help of Precision and Recall.
● It is a type of single score that represents both Precision and Recall.
● So, the F1 Score can be calculated as the harmonic mean of both precision and Recall,
assigning equal weight to each of them.

● When to use F-Score?

● As the F-score makes use of both precision and recall, it should be used when both of them
are important for evaluation, but one (precision or recall) is slightly more important to
consider than the other: for example, when false negatives are comparatively more important
than false positives, or vice versa.
● To calculate the value at any point on a ROC curve, we could evaluate a logistic regression
model multiple times with different classification thresholds, but this would not be
very efficient.
● So, an efficient method is used instead, which is known as AUC.

● AUC: Area Under the ROC curve

AUC stands for Area Under the ROC Curve. As its name suggests, AUC
measures the two-dimensional area under the entire ROC curve.
ROC example:
● True Positives = Radar Operator interpreted signal as Enemy Planes and there were
Enemy planes (Good Result: No wasted Resources)
● True Negatives = Radar Operator said no planes and there were none (Good Result:
No wasted resources)
● False Positives = Radar Operator said planes, but there were none (Geese: wasted
resources)
● False Negatives = Radar Operator said no plane, but there were planes (Bombs
dropped: very bad outcome)
● Sensitivity = probability of correctly interpreting the radar signal as enemy planes
among those times when enemy planes were actually coming
○ SE = True Positives / (True Positives + False Negatives)
● Specificity = probability of correctly interpreting the radar signal as no enemy
planes among those times when no enemy planes were actually coming
○ SP = True Negatives / (True Negatives + False Positives)
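Sensitivity and specificity follow directly from the four counts. The counts below are invented for illustration, not taken from real radar data:

```python
# Illustrative confusion-matrix counts for the radar example (assumed values)
tp, tn, fp, fn = 80, 90, 10, 20

sensitivity = tp / (tp + fn)  # SE: true positive rate
specificity = tn / (tn + fp)  # SP: true negative rate

print(f'SE={sensitivity:.2f} SP={specificity:.2f}')  # SE=0.80 SP=0.90
```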
Performance Metrics for Regression
● Regression is a supervised learning technique that aims to find the relationships
between the dependent and independent variables.
● A predictive regression model predicts a numeric (continuous) value.
● The performance of a Regression model is reported as errors in the prediction.
Following are the popular metrics that are used to evaluate the performance of
Regression models.
• Mean Absolute Error.
• Mean Squared Error.
• R2 Score
• Adjusted R2.
R2 score
● The R2 score, or coefficient of determination, is a statistical measure that evaluates how well a
regression model's predictions align with actual data.
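The regression metrics listed above can be computed with scikit-learn. The actual and predicted values below are invented for illustration, and the adjusted R2 formula assumes a single predictor:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Illustrative actual vs predicted values (assumed, not from the slides)
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 3.0, 8.0]

mae = mean_absolute_error(y_true, y_pred)   # average absolute error
mse = mean_squared_error(y_true, y_pred)    # average squared error
r2 = r2_score(y_true, y_pred)               # 1.0 means a perfect fit

# Adjusted R2 penalises R2 for the number of predictors p (assumed p=1 here)
n, p = len(y_true), 1
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(f'MAE={mae:.3f} MSE={mse:.3f} R2={r2:.3f} AdjR2={adj_r2:.3f}')
```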
REFERENCES:
● https://www.javatpoint.com/
● https://www.analyticsvidhya.com/blog/
● Introduction to Machine Learning by Ethem Alpaydin
