
Machine Learning Basics for Beginners

Machine learning enables computers to learn from data without being explicitly programmed. It works by building mathematical models from sample data known as "training data" to make predictions or decisions without being programmed with rules. The document then discusses the different types of machine learning including supervised learning, unsupervised learning, and reinforcement learning. It also provides a brief history of machine learning from its early concepts in the 1930s-1950s to modern applications of deep learning.

Uploaded by

PaNkaj Sonwani

Machine Learning Tutorial

Machine learning is a growing technology that enables computers to learn
automatically from past data. It uses various algorithms to build mathematical
models and make predictions from historical data or information. It is currently
used for tasks such as image recognition, speech recognition, email filtering,
Facebook auto-tagging, recommender systems, and many more.

What is Machine Learning


In the real world, we are surrounded by humans, who can learn from their experiences,
and by computers or machines, which work on our instructions. But can a machine also
learn from experience or past data as a human does? This is where machine learning
comes in.

Machine learning is a subset of artificial intelligence that is mainly concerned
with the development of algorithms that allow a computer to learn from data and
past experience on its own. The term machine learning was first introduced
by Arthur Samuel in 1959. We can define it in a summarized way as:

Machine learning enables a machine to automatically learn from data, improve
performance from experiences, and predict things without being explicitly programmed.

With the help of sample historical data, known as training data, machine
learning algorithms build a mathematical model that helps in making predictions or
decisions without being explicitly programmed. Machine learning brings computer
science and statistics together to create predictive models. It constructs or uses
algorithms that learn from historical data. In general, the more relevant data we
provide, the better the performance tends to be.

A machine has the ability to learn if it can improve its performance by gaining
more data.

How does Machine Learning work


A machine learning system learns from historical data, builds prediction models,
and, whenever it receives new data, predicts the output for it. The accuracy of
the predicted output depends partly on the amount of data, since a larger amount of
representative data usually helps to build a model that predicts more accurately.
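The learn-then-predict loop described above can be illustrated with a very small example. This is only a sketch under stated assumptions: the data values, the choice of a straight-line model, and the names fit_line and predict are made up for illustration and are not part of the original tutorial.

```python
# Minimal sketch: "learn" a straight line y = a*x + b from historical data,
# then predict the output for a new input. All data values are invented.

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b (closed-form solution)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

def predict(model, x):
    """Apply the learned line to a new input."""
    a, b = model
    return a * x + b

# Hypothetical historical data: hours studied -> exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 56, 58, 60]

model = fit_line(hours, scores)    # training on historical data
new_score = predict(model, 6)      # prediction for data not seen before
print(model, new_score)
```

Here the "model" is just the pair (a, b). No rule for computing the score was written by hand; the slope and intercept were derived from the data.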

Suppose we have a complex problem in which we need to make some predictions. Instead
of writing code for it, we just feed the data to generic algorithms, and with the
help of these algorithms, the machine builds the logic from the data and predicts
the output. Machine learning has changed our way of thinking about such problems. The
block diagram below explains the working of a machine learning algorithm:

Features of Machine Learning:


o Machine learning uses data to detect various patterns in a given dataset.
o It can learn from past data and improve automatically.
o It is a data-driven technology.
o Machine learning is similar to data mining in that it also deals with large
amounts of data.

Need for Machine Learning


The need for machine learning is increasing day by day. The reason is that it is
capable of doing tasks that are too complex for a person to implement directly. As
humans, we have limitations: we cannot process huge amounts of data manually, so we
need computer systems, and this is where machine learning makes things easier for us.

We can train machine learning algorithms by providing them with large amounts of data
and letting them explore the data, construct models, and predict the required output
automatically. The performance of a machine learning algorithm depends on the
amount of data, and it can be measured with a cost function. With the help of
machine learning, we can save both time and money.
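As an illustration of what a cost function measures, the sketch below uses mean squared error, one common choice. The tutorial does not say which cost function it has in mind, so this choice, the sample numbers, and the function name are all assumptions.

```python
# Minimal sketch of a cost function: mean squared error (MSE).
# A lower cost means the predictions are closer to the actual values.

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

actual = [3.0, 5.0, 7.0]           # made-up target values
model_a = [2.5, 5.0, 7.5]          # predictions of a hypothetical model A
model_b = [1.0, 6.0, 10.0]         # predictions of a hypothetical model B

cost_a = mean_squared_error(actual, model_a)
cost_b = mean_squared_error(actual, model_b)
print(cost_a, cost_b)              # model A has the lower cost
```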
The importance of machine learning can be easily understood from its use cases.
Currently, machine learning is used in self-driving cars, cyber fraud detection, face
recognition, friend suggestions on Facebook, and so on. Various top companies such as
Netflix and Amazon have built machine learning models that use vast amounts of
data to analyze user interests and recommend products accordingly.

Following are some key points which show the importance of Machine Learning:

o Rapid growth in the production of data
o Solving complex problems that are difficult for a human
o Decision making in various sectors, including finance
o Finding hidden patterns and extracting useful information from data

Classification of Machine Learning


At a broad level, machine learning can be classified into three types:

1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning
1) Supervised Learning
Supervised learning is a type of machine learning in which we provide sample
labeled data to the machine learning system in order to train it, and on that basis it
predicts the output.

The system creates a model from the labeled data to understand the dataset and learn
about each example. Once training and processing are done, we test the model by
providing sample data to check whether it predicts the correct output.

The goal of supervised learning is to map input data to output data. Supervised
learning is based on supervision, much as a student learns under the supervision of
a teacher. Spam filtering is an example of supervised learning.

Supervised learning can be grouped further in two categories of algorithms:

o Classification
o Regression
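To make the idea of learning from labeled data concrete, here is a minimal classification sketch. It assumes a 1-nearest-neighbour rule and two invented numeric features per email; neither the algorithm nor the features come from the original text.

```python
# Minimal sketch of supervised classification with a 1-nearest-neighbour rule.
# The labeled training data below is invented for illustration.

def nearest_neighbor_predict(train, x):
    """Return the label of the training example closest to x."""
    def squared_distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    _, label = min(train, key=lambda pair: squared_distance(pair[0], x))
    return label

# Each item is (features, label); the hypothetical features are
# (number of links in the email, number of times "free" appears).
training_data = [
    ((0, 0), "ham"),
    ((1, 0), "ham"),
    ((5, 3), "spam"),
    ((7, 4), "spam"),
]

print(nearest_neighbor_predict(training_data, (6, 3)))  # closest to (5, 3)
print(nearest_neighbor_predict(training_data, (0, 1)))  # closest to (0, 0)
```

The labels play the role of the teacher: the model only knows what "spam" looks like because the training examples say so.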

2) Unsupervised Learning
Unsupervised learning is a learning method in which a machine learns without any
supervision.

The machine is trained on a set of data that has not been labeled,
classified, or categorized, and the algorithm needs to act on that data without any
supervision. The goal of unsupervised learning is to restructure the input data into new
features or groups of objects with similar patterns.

In unsupervised learning, we don't have a predetermined result. The machine tries to
find useful insights from large amounts of data. It can be further classified into two
categories of algorithms:

o Clustering
o Association
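Clustering can be sketched with a tiny one-dimensional k-means. The choice of k-means, the starting centers, and the sample points are assumptions made for illustration; the tutorial names clustering only as a category.

```python
# Minimal sketch of clustering: 1-D k-means with two clusters on unlabeled numbers.

def kmeans_1d(points, centers, iterations=10):
    """Alternate between assigning points to centers and recomputing centers."""
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[idx].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]     # no labels are provided
centers, clusters = kmeans_1d(points, centers=[0.0, 5.0])
print(centers)    # [1.5, 10.0]
print(clusters)   # [[1.0, 1.5, 2.0], [9.0, 10.0, 11.0]]
```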

3) Reinforcement Learning
Reinforcement learning is a feedback-based learning method in which a learning agent
gets a reward for each right action and a penalty for each wrong action. The agent
learns automatically from this feedback and improves its performance. In
reinforcement learning, the agent interacts with the environment and explores it. The
goal of the agent is to collect the most reward points, and hence it improves its
performance.

A robotic dog that automatically learns the movement of its limbs is an example of
reinforcement learning.
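The reward-and-penalty idea can be sketched with tabular Q-learning on a made-up corridor of four cells. The environment, the reward values, and the learning-rate and discount settings are all assumptions. To keep the result reproducible, the agent sweeps over every state and action instead of exploring randomly, which real agents usually do not do.

```python
# Minimal sketch of tabular Q-learning on an invented 4-cell corridor.
# States are 0..3; entering state 3 gives a reward of +1 (the goal),
# every other move gives a penalty of -0.1.

N_STATES = 4
ACTIONS = ("left", "right")
ALPHA, GAMMA = 0.5, 0.9    # learning rate and discount factor (assumed values)

def step(state, action):
    """Environment: return (next_state, reward) for a move."""
    if action == "right":
        nxt = min(state + 1, N_STATES - 1)
    else:
        nxt = max(state - 1, 0)
    reward = 1.0 if nxt == N_STATES - 1 else -0.1
    return nxt, reward

# Q-table: estimated long-term value of each action in each non-terminal state.
q = {(s, a): 0.0 for s in range(N_STATES - 1) for a in ACTIONS}

for _ in range(50):
    for s in range(N_STATES - 1):              # state 3 is terminal
        for a in ACTIONS:
            nxt, reward = step(s, a)
            if nxt == N_STATES - 1:
                future = 0.0                   # nothing to gain after the goal
            else:
                future = max(q[(nxt, b)] for b in ACTIONS)
            # Q-learning update: move the estimate toward reward + discounted future.
            q[(s, a)] += ALPHA * (reward + GAMMA * future - q[(s, a)])

# The learned policy picks the highest-valued action in each state.
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)}
print(policy)
```

After training, the policy moves right in every state, because that is the direction that collects the most reward.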

Note: We will learn about the above types of machine learning in detail in later chapters.

History of Machine Learning


Some 40 to 50 years ago, machine learning was science fiction, but today
it is part of our daily life. Machine learning makes our day-to-day life easier,
from self-driving cars to Amazon's virtual assistant "Alexa". However, the idea behind
machine learning is old and has a long history. Below are some milestones
in the history of machine learning:

The early history of Machine Learning (Pre-1940):


o 1834: Charles Babbage, the father of the computer, conceived a device
that could be programmed with punch cards. The machine was never
built, but all modern computers rely on its logical structure.
o 1936: Alan Turing proposed a theory of how a machine can determine and
execute a set of instructions.

The era of stored program computers:

o 1943: A human neural network was modeled with an electrical circuit. In
1950, scientists started applying this idea and analyzed how human
neurons might work.
o 1945: "ENIAC", the first electronic general-purpose computer, was completed.
After that, stored-program computers such as EDSAC in 1949 and EDVAC in 1951
were built.

Computing machinery and intelligence:

o 1950: Alan Turing published a seminal paper, "Computing Machinery
and Intelligence," on the topic of artificial intelligence. In this paper, he asked,
"Can machines think?"

Machine intelligence in Games:

o 1952: Arthur Samuel, a pioneer of machine learning, created a
program that enabled an IBM computer to play checkers. The more it played,
the better it performed.
o 1959: The term "Machine Learning" was first coined by Arthur Samuel.

The first "AI" winter:

o The period from 1974 to 1980 was a tough time for AI and ML researchers, and
it came to be called the AI winter.
o During this period, machine translation failed to meet expectations and interest
in AI declined, which led to reduced government funding for research.

Machine Learning from theory to reality

o 1959: The first neural network was applied to a real-world problem:
removing echoes over phone lines using an adaptive filter.
o 1985: Terry Sejnowski and Charles Rosenberg invented the neural
network NETtalk, which was able to teach itself how to correctly pronounce
20,000 words in one week.
o 1997: IBM's Deep Blue computer won a chess match against the reigning world
champion Garry Kasparov, becoming the first computer to defeat a reigning world
chess champion in a match.

Machine Learning in the 21st century

o 2006: Computer scientist Geoffrey Hinton gave neural net research a new name,
"deep learning," and it has since become one of the most prominent technologies.
o 2012: Google created a deep neural network that learned to recognize
images of humans and cats in YouTube videos.
o 2014: The chatbot "Eugene Goostman" was reported to have passed a Turing test
by convincing 33% of the human judges that it was not a machine.
o 2014: DeepFace, a deep neural network created by Facebook, was
claimed to recognize a person with the same precision as a human.
o 2016: AlphaGo beat Lee Sedol, one of the world's top-ranked Go players. In
2017 it beat Ke Jie, the number one player of the game.
o 2017: Alphabet's Jigsaw team built an intelligent system that was
able to learn about online trolling. It read millions of comments from
different websites to learn to stop online trolling.

Machine Learning at present:


Machine learning research has now advanced greatly, and it is present
everywhere around us, in self-driving cars, Amazon Alexa, chatbots,
recommender systems, and many more. It includes supervised, unsupervised,
and reinforcement learning, with clustering, classification, decision tree,
and SVM algorithms, among others.

Modern machine learning models can be used for making various predictions,
including weather prediction, disease prediction, stock market analysis, etc.

Applications of Machine learning


Machine learning is a buzzword in today's technology, and it is growing very rapidly.
We use machine learning in our daily life even without knowing it, through
Google Maps, Google Assistant, Alexa, etc. Below are some of the most popular
real-world applications of machine learning:

1. Image Recognition:
Image recognition is one of the most common applications of machine learning. It is
used to identify objects, persons, places, etc. in digital images. A popular use case of
image recognition and face detection is automatic friend tagging suggestion:
Facebook provides a feature that suggests friend tags automatically. Whenever we upload
a photo with our Facebook friends, we automatically get a tagging suggestion with
names, and the technology behind this is machine learning's face
detection and recognition algorithm.

It is based on the Facebook project named "Deep Face," which is responsible for face
recognition and person identification in the picture.

2. Speech Recognition
While using Google, we get an option to "Search by voice." This comes under speech
recognition, a popular application of machine learning.

Speech recognition is the process of converting voice instructions into text; it is also
known as "speech to text" or "computer speech recognition." At present, machine
learning algorithms are widely used in speech recognition
applications. Google Assistant, Siri, Cortana, and Alexa use speech recognition
technology to follow voice instructions.

3. Traffic prediction:
If we want to visit a new place, we take the help of Google Maps, which shows us the
shortest route and predicts the traffic conditions.

It predicts whether traffic is clear, slow-moving, or
heavily congested in two ways:

o Real-time location of vehicles from the Google Maps app and sensors
o Average time taken on past days at the same time of day

Everyone who uses Google Maps is helping to make the app better. It takes
information from the user and sends it back to its database to improve performance.

4. Product recommendations:
Machine learning is widely used by various e-commerce and entertainment companies
such as Amazon, Netflix, etc., for product recommendations. Whenever we
search for a product on Amazon, we start getting advertisements for the
same product while surfing the internet in the same browser, and this is because of
machine learning.
Google understands the user's interests using various machine learning algorithms and
suggests products accordingly.

Similarly, when we use Netflix, we find recommendations for entertainment
series, movies, etc., and this is also done with the help of machine learning.

5. Self-driving cars:
One of the most exciting applications of machine learning is self-driving cars. Machine
learning plays a significant role in self-driving cars. Tesla, a well-known car
manufacturer, is working on self-driving cars. It uses an unsupervised learning
method to train the car models to detect people and objects while driving.

6. Email Spam and Malware Filtering:


Whenever we receive a new email, it is filtered automatically as important, normal, or
spam. Important mail arrives in our inbox marked with the important symbol, and
spam emails go to our spam box; the technology behind this is machine learning.
Below are some spam filters used by Gmail:

o Content filter
o Header filter
o General blacklists filter
o Rules-based filters
o Permission filters

Some machine learning algorithms such as the Multi-Layer Perceptron, Decision Tree,
and Naïve Bayes classifier are used for email spam filtering and malware detection.

7. Virtual Personal Assistant:


We have various virtual personal assistants such as Google
Assistant, Alexa, Cortana, and Siri. As the name suggests, they help us find
information using our voice instructions. These assistants can help us in various ways
just by voice instruction, such as playing music, calling someone, opening an email,
scheduling an appointment, etc.

Machine learning algorithms are an important part of these virtual assistants.
These assistants record our voice instructions, send them to a server in the cloud,
decode them using ML algorithms, and act accordingly.

8. Online Fraud Detection:


Machine learning makes our online transactions safe and secure by detecting fraudulent
transactions. Whenever we perform an online transaction, there are various ways
that fraud can take place, such as fake accounts, fake IDs, and theft of
money in the middle of a transaction. To detect this, a feed-forward neural
network helps us by checking whether a transaction is genuine or fraudulent.

For each genuine transaction, the output is converted into hash values, and these
values become the input for the next round. Each genuine transaction follows a
specific pattern, which changes for a fraudulent transaction; the network detects
this change and makes our online transactions more secure.

9. Stock Market trading:


Machine learning is widely used in stock market trading. In the stock market, there is
always a risk of ups and downs in share prices, so a long short-term
memory (LSTM) neural network is used for the prediction of stock market trends.

10. Medical Diagnosis:


In medical science, machine learning is used for disease diagnosis. With this, medical
technology is growing very fast and is able to build 3D models that can predict the
position of lesions in the brain.

It helps in finding brain tumors and other brain-related diseases.

11. Automatic Language Translation:


Nowadays, if we visit a new place and do not know the language, it is not a
problem, because machine learning helps us by converting text into a language
we know. Google's GNMT (Google Neural Machine Translation) provides this
feature; it is a neural machine translation system that translates text into our
familiar language, and this is called automatic translation.

The technology behind automatic translation is a sequence-to-sequence learning
algorithm, which is used together with image recognition to translate text from one
language to another.

Machine learning Life cycle
Machine learning has given computer systems the ability to learn automatically
without being explicitly programmed. But how does a machine learning system work?
It can be described using the machine learning life cycle, a cyclic process for
building an efficient machine learning project. The main purpose
of the life cycle is to find a solution to the problem or project.

Machine learning life cycle involves seven major steps, which are given below:

o Gathering Data
o Data preparation
o Data Wrangling
o Analyse Data
o Train the model
o Test the model
o Deployment
The most important thing in the complete process is to understand the problem and to
know its purpose. Therefore, before starting the life cycle, we need to
understand the problem, because a good result depends on a good understanding
of the problem.

In the complete life cycle process, to solve a problem, we create a machine learning
system called a "model", and this model is created by "training" it. But to train a
model, we need data; hence, the life cycle starts with collecting data.

1. Gathering Data:
Data gathering is the first step of the machine learning life cycle. The goal of this
step is to identify and obtain all the data related to the problem.

In this step, we need to identify the different data sources, as data can be collected from
various sources such as files, databases, the internet, or mobile devices. It is one of the
most important steps of the life cycle. The quantity and quality of the collected data
determine the quality of the output. In general, the more data we have, the more
accurate the prediction tends to be.

This step includes the below tasks:

o Identify various data sources
o Collect data
o Integrate the data obtained from different sources

By performing the above tasks, we get a coherent set of data, also called a dataset. It
will be used in further steps.
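The collect-and-integrate tasks above can be sketched with Python's standard csv and json modules. The two sources are inlined as strings so the example runs on its own; the field names, the values, and the helper names load_csv and load_json are hypothetical.

```python
import csv
import io
import json

# Source 1: a CSV export (inlined as a string instead of a real file).
csv_text = "country,age,salary\nIndia,38,48000\nFrance,43,45000\n"
# Source 2: a JSON payload, for example from a web API (made-up content).
json_text = '[{"country": "Germany", "age": 30, "salary": 54000}]'

def load_csv(text):
    """Read CSV rows and convert numeric columns from strings to integers."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({"country": row["country"],
                     "age": int(row["age"]),
                     "salary": int(row["salary"])})
    return rows

def load_json(text):
    """Parse a JSON array of records."""
    return json.loads(text)

# Integrate both sources into one coherent dataset (a list of records).
dataset = load_csv(csv_text) + load_json(json_text)
print(len(dataset), dataset[-1]["country"])
```

Converting both sources to the same record shape before combining them is what makes the result a single coherent dataset.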

2. Data preparation
After collecting the data, we need to prepare it for further steps. Data preparation is a
step where we put our data into a suitable place and prepare it for use in machine
learning training.

In this step, we first put all the data together and then randomize its ordering.
This step can be further divided into two processes:

o Data exploration:
It is used to understand the nature of the data that we have to work with. We need to
understand the characteristics, format, and quality of the data.
A better understanding of the data leads to a more effective outcome. Here we look for
correlations, general trends, and outliers.
o Data pre-processing:
The next step is preprocessing the data for analysis.
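The randomizing and exploration steps described above might look like the following sketch. The records, the fixed seed, the two-standard-deviation outlier rule, and the helper names are assumptions chosen for illustration.

```python
import random
import statistics

# Made-up (age, salary) records; the last one is a deliberate outlier.
records = [(25, 30000), (30, 36000), (35, 42000), (40, 48000),
           (45, 54000), (50, 60000), (38, 400000)]

# Randomize the ordering; a fixed seed keeps the shuffle reproducible.
shuffled = records[:]
random.Random(42).shuffle(shuffled)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def find_outliers(values, k=2.0):
    """Values lying more than k standard deviations away from the mean."""
    mean, sd = statistics.mean(values), statistics.pstdev(values)
    return [v for v in values if abs(v - mean) > k * sd]

salaries = [salary for _, salary in shuffled]
salary_outliers = find_outliers(salaries)
print(salary_outliers)              # [400000]

# Correlation between age and salary once the outlier is set aside.
ages_ok = [a for a, s in records if s not in salary_outliers]
salaries_ok = [s for a, s in records if s not in salary_outliers]
correlation = pearson(ages_ok, salaries_ok)
print(correlation)
```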

3. Data Wrangling
Data wrangling is the process of cleaning and converting raw data into a usable format.
It involves cleaning the data, selecting the variables to use, and transforming the
data into a proper format to make it more suitable for analysis in the next step. It is
one of the most important steps of the complete process. Cleaning the data is required
to address quality issues.

The data we have collected is not always useful in its entirety, as some of it
may be irrelevant. In real-world applications, collected data may have various issues,
including:

o Missing values
o Duplicate data
o Invalid data
o Noise

So, we use various filtering techniques to clean the data.

It is essential to detect and remove the above issues because they can negatively affect
the quality of the outcome.
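A simple filter for three of the issues listed above (missing values, duplicates, and invalid data) could be sketched as follows. The sample rows and the rule that a valid age lies between 0 and 120 are assumptions; dropping rows is only one possible strategy.

```python
# Minimal sketch of data cleaning on invented records.
raw = [
    {"country": "India", "age": 38, "salary": 48000},
    {"country": "India", "age": 38, "salary": 48000},     # duplicate
    {"country": "France", "age": -5, "salary": 45000},    # invalid age
    {"country": "Germany", "age": 40, "salary": None},    # missing value
    {"country": "France", "age": 48, "salary": 65000},
]

def clean(rows):
    """Drop rows with missing values, impossible ages, or exact duplicates."""
    seen = set()
    result = []
    for row in rows:
        if any(value is None for value in row.values()):
            continue                       # missing value
        if not 0 <= row["age"] <= 120:
            continue                       # invalid data (assumed valid range)
        key = tuple(sorted(row.items()))
        if key in seen:
            continue                       # duplicate of an earlier row
        seen.add(key)
        result.append(row)
    return result

cleaned = clean(raw)
print(len(cleaned))    # 2
```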

4. Data Analysis
The cleaned and prepared data is now passed on to the analysis step. This step involves:

o Selection of analytical techniques
o Building models
o Reviewing the result

The aim of this step is to build a machine learning model to analyze the data using
various analytical techniques and to review the outcome. It starts with determining
the type of problem, where we select a machine learning technique such
as classification, regression, cluster analysis, association, etc., then build the model
using the prepared data, and evaluate the model.

Hence, in this step, we take the data and use machine learning algorithms to build the
model.

5. Train Model
The next step is to train the model. In this step, we train our model to improve its
performance and obtain a better outcome for the problem.

We use datasets to train the model using various machine learning algorithms. Training
a model is required so that it can learn the various patterns, rules, and features.

6. Test Model
Once our machine learning model has been trained on a given dataset, we test the
model. In this step, we check the accuracy of our model by providing a test dataset
to it.

Testing the model determines its percentage accuracy with respect to the
requirements of the project or problem.
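Percentage accuracy on a test set can be computed as in the sketch below. The labels and predictions are invented, and accuracy_percent is a hypothetical helper name.

```python
# Minimal sketch: percentage accuracy of predictions on a test dataset.

def accuracy_percent(y_true, y_pred):
    """Share of predictions that match the true labels, as a percentage."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return 100.0 * correct / len(y_true)

test_labels = ["spam", "ham", "ham", "spam", "ham"]     # true labels
predictions = ["spam", "ham", "spam", "spam", "ham"]    # model output

acc = accuracy_percent(test_labels, predictions)
print(acc)    # 80.0, since 4 of the 5 predictions are correct
```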

7. Deployment
The last step of the machine learning life cycle is deployment, where we deploy the
model in a real-world system.

If the model prepared above produces accurate results as per our requirements
with acceptable speed, then we deploy it in the real system. But before
deploying the project, we check whether it is improving its performance using the
available data. The deployment phase is similar to making the final report for a
project.

Difference between Artificial intelligence and Machine learning
Artificial intelligence and machine learning are parts of computer science that are
closely related to each other. These two technologies are among the most prominent
technologies used for creating intelligent systems.

AI is the broader concept of creating intelligent machines that can simulate human
thinking capability and behavior, whereas machine learning is an application or subset
of AI that allows machines to learn from data without being programmed explicitly.

Artificial Intelligence
Artificial intelligence is a field of computer science that builds computer systems that
can mimic human intelligence. The name is composed of two words, "Artificial" and
"intelligence", which together mean "a human-made thinking power." Hence we can define
it as,

Artificial intelligence is a technology using which we can create intelligent systems that can
simulate human intelligence.

An artificial intelligence system does not need to be pre-programmed for every
situation; instead, it uses algorithms that can work with their own intelligence. It
involves machine learning algorithms such as reinforcement learning algorithms and deep
learning neural networks. AI is being used in multiple places such as Siri, Google's
AlphaGo, AI in chess playing, etc.

Based on capabilities, AI can be classified into three types:

o Weak AI
o General AI
o Strong AI

Currently, we are working with weak AI and general AI. The future of AI is strong AI,
which, it is said, will be more intelligent than humans.

Machine learning
Machine learning is about extracting knowledge from the data. It can be defined as,

Machine learning is a subfield of artificial intelligence, which enables machines to learn from
past data or experiences without being explicitly programmed.

Machine learning enables a computer system to make predictions or take
decisions using historical data without being explicitly programmed. Machine learning
uses a massive amount of structured and semi-structured data so that a machine
learning model can generate accurate results or give predictions based on that data.

Machine learning works with algorithms that learn on their own using historical data. It
works only for specific domains: for example, if we create a machine learning model to
detect pictures of dogs, it will only give results for dog images; if we provide new
data such as a cat image, it will not give a meaningful result. Machine learning is
being used in various places, such as online recommender systems, Google search
algorithms, email spam filters, Facebook auto friend tagging suggestions, etc.

It can be divided into three types:

o Supervised learning
o Reinforcement learning
o Unsupervised learning

Key differences between Artificial Intelligence (AI) and Machine learning (ML):

Artificial Intelligence: Artificial intelligence is a technology which enables a machine to simulate human behavior.
Machine learning: Machine learning is a subset of AI which allows a machine to automatically learn from past data without being explicitly programmed.

Artificial Intelligence: The goal of AI is to make a smart computer system that, like humans, can solve complex problems.
Machine learning: The goal of ML is to allow machines to learn from data so that they can give accurate output.

Artificial Intelligence: In AI, we make intelligent systems to perform any task like a human.
Machine learning: In ML, we teach machines with data to perform a particular task and give an accurate result.

Artificial Intelligence: Machine learning and deep learning are the two main subsets of AI.
Machine learning: Deep learning is a main subset of machine learning.

Artificial Intelligence: AI has a very wide scope.
Machine learning: Machine learning has a limited scope.

Artificial Intelligence: AI is working to create an intelligent system which can perform various complex tasks.
Machine learning: Machine learning is working to create machines that can perform only those specific tasks for which they are trained.

Artificial Intelligence: An AI system is concerned with maximizing the chances of success.
Machine learning: Machine learning is mainly concerned with accuracy and patterns.

Artificial Intelligence: The main applications of AI are Siri, customer support using chatbots, expert systems, online game playing, intelligent humanoid robots, etc.
Machine learning: The main applications of machine learning are online recommender systems, Google search algorithms, Facebook auto friend tagging suggestions, etc.

Artificial Intelligence: On the basis of capabilities, AI can be divided into three types: Weak AI, General AI, and Strong AI.
Machine learning: Machine learning can also be divided into three main types: Supervised learning, Unsupervised learning, and Reinforcement learning.

Artificial Intelligence: It includes learning, reasoning, and self-correction.
Machine learning: It includes learning and self-correction when introduced to new data.

Artificial Intelligence: AI deals with structured, semi-structured, and unstructured data.
Machine learning: Machine learning deals with structured and semi-structured data.

How to get datasets for Machine Learning


The key to success in the field of machine learning, or to becoming a great data
scientist, is to practice with different types of datasets. But discovering a suitable
dataset for each kind of machine learning project is a difficult task. So, in this topic,
we provide details of the sources from which you can easily get a dataset for your
project.

Before looking at the sources of machine learning datasets, let's discuss datasets.

What is a dataset?
A dataset is a collection of data in which the data is arranged in some order. A dataset can
contain any data, from a series or an array to a database table. The table below shows an
example of a dataset:

Country   Age   Salary   Purchased
India     38    48000    No
France    43    45000    Yes
Germany   30    54000    No
France    48    65000    No
Germany   40             Yes
India     35    58000    Yes

A tabular dataset can be understood as a database table or matrix, where each column
corresponds to a particular variable, and each row corresponds to one record of the
dataset. The most widely supported file type for a tabular dataset is the "Comma
Separated Values" file, or CSV. To store tree-like data, we can use a JSON file
more efficiently.
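As a sketch of how such a table looks as CSV and how it can be read, the example below inlines the sample rows from the table above and parses them with Python's standard csv module. Treating the empty salary cell as None is an assumption about how missing values should be represented.

```python
import csv
import io

# The sample table written as CSV text; the second Germany row has no salary.
csv_text = """Country,Age,Salary,Purchased
India,38,48000,No
France,43,45000,Yes
Germany,30,54000,No
France,48,65000,No
Germany,40,,Yes
India,35,58000,Yes
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))

# CSV values are read as strings; empty cells arrive as empty strings.
for row in rows:
    row["Age"] = int(row["Age"])
    row["Salary"] = int(row["Salary"]) if row["Salary"] else None

missing = [row["Country"] for row in rows if row["Salary"] is None]
print(len(rows), missing)    # 6 ['Germany']
```

Each column name becomes a key and each line becomes one record, which matches the column-as-variable, row-as-record view described above.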

Types of data in datasets

o Numerical data: Such as house price, temperature, etc.
o Categorical data: Such as Yes/No, True/False, Blue/Green, etc.
o Ordinal data: These data are similar to categorical data, but the categories have a
meaningful order and can therefore be compared.
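The difference between categorical and ordinal data shows up when values are converted to numbers. The sketch below is one common way to do it; the size ordering, the color list, and the function names are assumptions.

```python
# Ordinal data: the categories have an order, so increasing integers fit.
SIZE_ORDER = {"small": 0, "medium": 1, "large": 2}    # assumed ordering

def encode_ordinal(value):
    """Map an ordered category to its rank."""
    return SIZE_ORDER[value]

# Categorical data without order: one-hot vectors avoid implying a ranking.
def one_hot(value, categories):
    """Return a list with 1 at the position of value and 0 elsewhere."""
    return [1 if value == c else 0 for c in categories]

colors = ["blue", "green"]
print(encode_ordinal("medium"))     # 1
print(one_hot("green", colors))     # [0, 1]
```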

Note: A real-world dataset is of huge size, which is difficult to manage and
process at the initial level. Therefore, to practice machine learning algorithms, we
can use any dummy dataset.

Need of Dataset
To work on machine learning projects, we need a large amount of data, because
without data one cannot train ML/AI models. Collecting and preparing the dataset
is one of the most crucial parts of creating an ML/AI project.

The technology behind any ML project cannot work properly if the dataset is
not well prepared and pre-processed.

During the development of an ML project, developers rely completely on the
datasets. In building ML applications, datasets are divided into two parts:

o Training dataset
o Test dataset

Note: The datasets are of large size, so to download these datasets, you must
have a fast internet connection.
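Dividing a dataset into a training part and a test part can be sketched as follows. The 80/20 ratio, the fixed seed, and the function name train_test_split are assumptions made for this illustration.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    rows = rows[:]                          # leave the caller's list untouched
    random.Random(seed).shuffle(rows)       # fixed seed for reproducibility
    n_test = max(1, round(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]

data = list(range(10))                      # stand-in for ten records
train, test = train_test_split(data)
print(len(train), len(test))                # 8 2
```

Shuffling before splitting helps keep both parts representative when the original data is sorted in some way.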

Data Preprocessing in Machine learning


Data preprocessing is the process of preparing raw data and making it suitable for a
machine learning model. It is the first and a crucial step in creating a machine learning
model.

When creating a machine learning project, we do not always come across
clean and formatted data. Before doing any operation with data, it is necessary
to clean it and put it in a formatted way. For this, we use data preprocessing.

Why do we need Data Preprocessing?


Real-world data generally contains noise and missing values, and may be in an unusable
format which cannot be directly used for machine learning models. Data preprocessing
is the set of tasks required for cleaning the data and making it suitable for a machine learning
model, which also increases the accuracy and efficiency of the model.

It involves the below steps:

o Getting the dataset
o Importing libraries
o Importing datasets
o Finding missing data
o Encoding categorical data
o Splitting the dataset into training and test sets
o Feature scaling

1) Get the Dataset


To create a machine learning model, the first thing we require is a dataset, as a machine
learning model works entirely on data. The data collected for a particular problem in
a proper format is known as the dataset.

Datasets come in different formats for different purposes. For example, the dataset for a
machine learning model built for a business purpose will differ from the
dataset required for a liver-patient model. So each dataset is different from another dataset. To
use the dataset in our code, we usually put it into a CSV file. However, sometimes we
may also need to use an HTML or xlsx file.

What is a CSV File?


CSV stands for "Comma-Separated Values"; it is a file format which allows us to
save tabular data, such as spreadsheets. It is useful for huge datasets, and programs can
read these datasets easily.

Here we will use a demo dataset for data preprocessing. For practice, it can be
downloaded from https://www.superdatascience.com/pages/machine-learning.
For real-world problems, we can download datasets online from various sources such
as https://www.kaggle.com/uciml/datasets, https://archive.ics.uci.edu/ml/index.php, etc.

We can also create our own dataset by gathering data through various APIs with Python and putting
that data into a .csv file.

2) Importing Libraries
To perform data preprocessing using Python, we need to import some
predefined Python libraries, each of which performs a specific job.
We will use three libraries for data preprocessing:

Numpy: The Numpy library is used for mathematical operations
in the code. It is the fundamental package for scientific computing in Python and
supports large, multidimensional arrays and matrices. In Python, we can
import it as:

import numpy as nm

Here we have used nm as a short name for Numpy (np is the more common alias), and it will be used in the
whole program.

Matplotlib: The second library is matplotlib, which is a Python 2D plotting library.
From this library, we need to import the sub-library pyplot, which is used to plot
charts in Python. It will be imported as below:

import matplotlib.pyplot as mtp

Here we have used mtp as a short name for this library.

Pandas: The last library is the Pandas library, which is one of the most famous Python
libraries and is used for importing and managing datasets. It is an open-source data
manipulation and analysis library. It will be imported as below:

import pandas as pd

Here, we have used pd as a short name for this library.

3) Importing the Datasets
Now we need to import the dataset which we have collected for our machine learning
project. But before importing a dataset, we need to set the directory that contains it as the
working directory. To set a working directory in Spyder IDE, we need to follow the below
steps:

1. Save your Python file in the directory which contains the dataset.
2. Go to the File explorer option in Spyder IDE, and select the required directory.
3. Press F5 or click the run option to execute the file.

Note: We can set any directory as a working directory, but it must contain the required
dataset.

Once the Python file and the required dataset sit in the same folder, that
folder is set as the working directory.

read_csv() function:
Now to import the dataset, we will use the read_csv() function of the pandas library, which
reads a CSV file into a DataFrame. Using this function, we can
read a CSV file locally as well as through a URL.

We can use the read_csv() function as below:

data_set= pd.read_csv('Dataset.csv')

Here, data_set is the name of the variable that stores our dataset, and inside the function we
have passed the file name of our dataset. Once we execute the above line of code, the
dataset is imported into our code. In Spyder, we can check the imported dataset by
opening the variable explorer and double-clicking data_set.

Row indexing starts from 0, which is the default indexing in
Python. We can also change the display format of the dataset by clicking on the format option.
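To try read_csv() without a file on disk, the sketch below reads the same ten rows from an in-memory string. The row values are the ones shown in the outputs of this section; the header names follow the variable names used in the text (Country, Age, Salary, Purchased), and the use of io.StringIO in place of Dataset.csv is an assumption made only for the example.

```python
import io
import pandas as pd

# In-memory stand-in for Dataset.csv; two cells are left empty on purpose.
csv_text = """Country,Age,Salary,Purchased
India,38,68000,No
France,43,45000,Yes
Germany,30,54000,No
France,48,65000,No
Germany,40,,Yes
India,35,58000,Yes
Germany,,53000,No
France,49,79000,Yes
India,50,88000,No
France,37,77000,Yes
"""

data_set = pd.read_csv(io.StringIO(csv_text))

print(data_set.shape)          # rows and columns
print(data_set.isna().sum())   # missing values per column
```

Empty fields are read as NaN, which is why the missing-data step later in this section is needed.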

Extracting dependent and independent variables:

In machine learning, it is important to distinguish the matrix of features (independent
variables) from the dependent variable in the dataset. In our dataset, there are three
independent variables, Country, Age, and Salary, and one dependent
variable, Purchased.

Extracting independent variable:

To extract the independent variables, we will use the iloc[] indexer of the Pandas library. It is
used to extract the required rows and columns from the dataset.

x= data_set.iloc[:,:-1].values

In the above code, the first colon (:) takes all the rows, and the second part (:-1)
takes all the columns except the last one, because the last
column contains the dependent variable. By doing this, we get the matrix of
features.

By executing the above code, we will get output as:

[['India' 38.0 68000.0]
 ['France' 43.0 45000.0]
 ['Germany' 30.0 54000.0]
 ['France' 48.0 65000.0]
 ['Germany' 40.0 nan]
 ['India' 35.0 58000.0]
 ['Germany' nan 53000.0]
 ['France' 49.0 79000.0]
 ['India' 50.0 88000.0]
 ['France' 37.0 77000.0]]

As we can see in the above output, there are only three variables.

Extracting dependent variable:

To extract the dependent variable, we will again use the Pandas iloc[] indexer.

y= data_set.iloc[:,3].values

Here we have taken all the rows of the last column only (index 3). It gives the array of
dependent variable values.

By executing the above code, we will get output as:

Output:
array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
dtype=object)

Note: If you are using Python language for machine learning, then extraction is mandatory,
but for R language it is not required.
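The two extractions can be checked on a small frame. This is a sketch using the first four rows of the dataset shown above; selecting the last column with -1 instead of 3 is a choice made for the example, and both select the same column here.

```python
import pandas as pd

# First four rows of the dataset used in this section.
data_set = pd.DataFrame({
    "Country": ["India", "France", "Germany", "France"],
    "Age": [38.0, 43.0, 30.0, 48.0],
    "Salary": [68000.0, 45000.0, 54000.0, 65000.0],
    "Purchased": ["No", "Yes", "No", "No"],
})

# All rows, every column except the last: the matrix of features.
x = data_set.iloc[:, :-1].values

# All rows, last column only: the dependent variable.
y = data_set.iloc[:, -1].values

print(x.shape, y.shape)
```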

4) Handling Missing data:


The next step of data preprocessing is to handle missing data in the dataset. If our
dataset contains missing data, it may create a huge problem for our machine
learning model. Hence it is necessary to handle the missing values present in the dataset.

Ways to handle missing data:

There are mainly two ways to handle missing data:

By deleting the particular row: The first way is commonly used to deal with null values.
Here we simply delete the specific row or column which contains null values. But
this way is not very efficient, and removing data may lead to a loss of information, so the
model may not give accurate output.

By calculating the mean: In this way, we calculate the mean of the column or row
which contains a missing value and put it in the place of the missing value. This
strategy is useful for features which have numeric data such as age, salary, year, etc.
Here, we will use this approach.
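The two strategies can be compared side by side with pandas. The small frame below is illustrative only; its values are not the tutorial's full dataset.

```python
import numpy as np
import pandas as pd

# Illustrative frame with one missing value in each column.
df = pd.DataFrame({
    "Age": [38.0, 43.0, np.nan, 48.0],
    "Salary": [68000.0, np.nan, 54000.0, 65000.0],
})

# Strategy 1: delete every row that contains a missing value.
dropped = df.dropna()

# Strategy 2: replace each missing value with the mean of its column.
filled = df.fillna(df.mean())

print(len(df), len(dropped))
print(filled)
```

Dropping rows halves this tiny dataset, while mean imputation keeps every row, which is the trade-off described above.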

To handle missing values, we will use the Scikit-learn library in our code, which contains
various modules for building machine learning models. Here we will use the SimpleImputer class
of the sklearn.impute module (it replaces the Imputer class that older releases provided in
sklearn.preprocessing). Below is the code for it:

#handling missing data (Replacing missing data with the mean value)
from sklearn.impute import SimpleImputer
imputer= SimpleImputer(missing_values= nm.nan, strategy='mean')
#Fitting imputer object to the numeric columns (Age and Salary) of x.
imputer= imputer.fit(x[:, 1:3])
#Replacing missing data with the calculated mean value
x[:, 1:3]= imputer.transform(x[:, 1:3])

Output:
array([['India', 38.0, 68000.0],
['France', 43.0, 45000.0],
['Germany', 30.0, 54000.0],
['France', 48.0, 65000.0],
['Germany', 40.0, 65222.22222222222],
['India', 35.0, 58000.0],
['Germany', 41.111111111111114, 53000.0],
['France', 49.0, 79000.0],
['India', 50.0, 88000.0],
['France', 37.0, 77000.0]], dtype=object)

As we can see in the above output, each missing value has been replaced with the
mean of the remaining values in its column.

5) Encoding Categorical data:


Categorical data is data which takes one of a limited set of categories. In our dataset, there are two
categorical variables, Country and Purchased.

Since a machine learning model works entirely on mathematics and numbers, a
categorical variable in the dataset may create trouble while building
the model. So it is necessary to encode these categorical variables into numbers.

For Country variable:

Firstly, we will convert the country names into numeric labels. To do this, we will
use the LabelEncoder() class from the preprocessing library.

#Categorical data
#for Country Variable
from sklearn.preprocessing import LabelEncoder
label_encoder_x= LabelEncoder()
x[:, 0]= label_encoder_x.fit_transform(x[:, 0])

Output:

Out[15]:
array([[2, 38.0, 68000.0],
[0, 43.0, 45000.0],
[1, 30.0, 54000.0],
[0, 48.0, 65000.0],
[1, 40.0, 65222.22222222222],
[2, 35.0, 58000.0],
[1, 41.111111111111114, 53000.0],
[0, 49.0, 79000.0],
[2, 50.0, 88000.0],
[0, 37.0, 77000.0]], dtype=object)

Explanation:

In the above code, we have imported the LabelEncoder class of the sklearn library. This class has
successfully encoded the country names into digits.

But in our case, there are three countries, and as we can see in the above output,
they are encoded as 0, 1, and 2. From these values, the machine learning
model may assume that the countries have an order or magnitude (for example, that 2 is greater than 0), which will
produce the wrong output. To remove this issue, we will use dummy encoding.
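To see where the codes 0, 1, and 2 come from, the sketch below inspects the mapping that LabelEncoder learns. The short country list is a sample taken from the dataset; the mapping dictionary is a helper built only for this illustration.

```python
from sklearn.preprocessing import LabelEncoder

# A few country values from the dataset.
countries = ["India", "France", "Germany", "France"]

encoder = LabelEncoder()
codes = encoder.fit_transform(countries)

# classes_ holds the categories in sorted order; the position is the code.
mapping = {name: index for index, name in enumerate(encoder.classes_)}

print(codes)
print(mapping)
```

The codes follow alphabetical order of the names, so the numbers carry no real meaning about the countries, which is the reason for switching to dummy variables.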

Dummy Variables:

Dummy variables are variables which take the value 0 or 1. A 1 marks the
presence of a category in its column, and the remaining columns are 0. With
dummy encoding, we will have a number of columns equal to the number of categories.

In our dataset, we have 3 categories, so it will produce three columns having 0 and 1
values. For dummy encoding, we will use the OneHotEncoder class
of the preprocessing library, applied to the Country column through ColumnTransformer
(the categorical_features argument used by older scikit-learn releases has been removed).

#for Country Variable
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
#Encoding for dummy variables: one-hot encode column 0, keep Age and Salary unchanged
column_transformer= ColumnTransformer([('country', OneHotEncoder(), [0])], remainder='passthrough', sparse_threshold=0)
x= column_transformer.fit_transform(x).astype(float)

Output:

array([[0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.80000000e+01,
6.80000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.30000000e+01,
4.50000000e+04],
[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 3.00000000e+01,
5.40000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.80000000e+01,
6.50000000e+04],
[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.00000000e+01,
6.52222222e+04],
[0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.50000000e+01,
5.80000000e+04],
[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.11111111e+01,
5.30000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.90000000e+01,
7.90000000e+04],
[0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 5.00000000e+01,
8.80000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.70000000e+01,
7.70000000e+04]])

As we can see in the above output, all the variables are encoded into numbers 0 and 1
and divided into three columns.
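When the data is still in a DataFrame, pandas offers a shorter route to the same kind of dummy columns. This is a sketch of an alternative technique, pd.get_dummies, and not the approach used in this tutorial; the three-row frame is a sample of the dataset.

```python
import pandas as pd

# Three rows of the dataset, one per country.
df = pd.DataFrame({
    "Country": ["India", "France", "Germany"],
    "Age": [38, 43, 30],
})

# One 0/1 column is created per category of the Country column.
dummies = pd.get_dummies(df, columns=["Country"], dtype=int)

print(dummies)
```

Each row has exactly one 1 across the three Country_ columns, matching the description of dummy variables above.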

It can be seen more clearly in the variable explorer section by clicking on x.

For Purchased Variable:

labelencoder_y= LabelEncoder()
y= labelencoder_y.fit_transform(y)

For the second categorical variable, we will only use a labelencoder object
of the LabelEncoder class. Here we are not using the OneHotEncoder class because the
Purchased variable has only two categories, yes and no, which are automatically
encoded into 1 and 0.

Output:

Out[17]: array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])


6) Splitting the Dataset into the Training set and Test set
In machine learning data preprocessing, we divide our dataset into a training set and a
test set. This is one of the crucial steps of data preprocessing, because it lets us
measure how well the model performs on data it has not seen during training.

Suppose we train our machine learning model on one dataset and then
test it on a completely different dataset. The model may then struggle, because the
relationships it learned from the first dataset may not hold in the second.

A model can be trained very well, with very high training accuracy, and still
perform poorly on new data. So we always try to build a
machine learning model which performs well on the training set and also on the test
set. Here, we can define these datasets as:

Training set: A subset of the dataset used to train the machine learning model; its outputs are already
known to the model.

Test set: A subset of the dataset used to test the machine learning model; the
model predicts the outputs for the test set, and the predictions are compared with the known outputs.

For splitting the dataset, we will use the below lines of code:

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)

Explanation:

o In the above code, the first line imports the train_test_split() function, which splits arrays of the dataset into
random train and test subsets.
o In the second line, we have used four variables for our output:
o x_train: features for the training data
o x_test: features for the testing data
o y_train: dependent variable for the training data
o y_test: dependent variable for the testing data
o In the train_test_split() function, we have passed four parameters, of which the first two
are the arrays of data, and test_size specifies the size of the test set. The
test_size may be .5, .3, or .2, which gives the fraction of the data placed in the test
set.
o The last parameter, random_state, sets the seed of the random generator
so that you always get the same split. Any fixed integer works; 0 and 42 are common choices.
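The effect of test_size and random_state can be checked on a toy array. This is a minimal sketch: the ten rows of x are generated numbers, and y reuses the encoded Purchased values shown earlier.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten samples with two generated features each.
x = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

# test_size=0.2 places 2 of the 10 rows in the test set.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

# Repeating the call with the same seed gives the same split.
again = train_test_split(x, y, test_size=0.2, random_state=0)

print(x_train.shape, x_test.shape)
print(np.array_equal(x_test, again[1]))
```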

Output:

By executing the above code, we will get 4 different variables, which can be seen under
the variable explorer section. The x and y variables are divided into 4 different
variables with corresponding values.

7) Feature Scaling
Feature scaling is the final step of data preprocessing in machine learning. It is a
technique to bring the independent variables of the dataset into a comparable range. In
feature scaling, we put our variables on the same scale so that no
variable dominates the others.

Consider the Age and Salary columns of our dataset. Their values are not on the same scale:
ages are in the tens, while salaries are in the tens of thousands. Many machine
learning models rely on the Euclidean distance between data points, and if we do not scale the variables,
the larger-valued variable will distort the result.

The Euclidean distance between two points (x1, y1) and (x2, y2) is given as:

d = sqrt((x2 - x1)^2 + (y2 - y1)^2)

If we compute the distance between any two rows using age and salary, the salary values will dominate the
age values and produce a misleading result. To remove this issue, we need to
perform feature scaling.
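A quick calculation shows how strongly salary dominates the distance. In this sketch the first point is a row of the dataset (age 30, salary 54000); the second point is hypothetical and chosen so that the ages differ a lot while the salaries differ little.

```python
import numpy as np

# Each point is (age, salary).
a = np.array([30.0, 54000.0])   # row from the dataset
b = np.array([50.0, 55000.0])   # hypothetical second person

# Euclidean distance between the two unscaled points.
distance = np.linalg.norm(a - b)

# Share of the squared distance contributed by each feature.
squared_parts = (a - b) ** 2
salary_share = squared_parts[1] / squared_parts.sum()

print(distance)
print(salary_share)
```

Even though a 20-year age gap is large and a 1000 salary gap is small, almost all of the distance comes from salary, which is the problem feature scaling addresses.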

There are two ways to perform feature scaling in machine learning:

Standardization: x' = (x - mean(x)) / standard deviation(x)

Normalization: x' = (x - min(x)) / (max(x) - min(x))

Here, we will use the standardization method for our dataset.
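The two methods can be compared on a single column. This sketch uses three illustrative ages and scikit-learn's StandardScaler and MinMaxScaler; MinMaxScaler is used here as one common form of normalization and is not part of the tutorial's code.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One feature column with three illustrative ages.
ages = np.array([[30.0], [40.0], [50.0]])

# Standardization: subtract the mean, divide by the standard deviation.
standardized = StandardScaler().fit_transform(ages)

# Normalization: rescale to the range from 0 to 1.
normalized = MinMaxScaler().fit_transform(ages)

print(standardized.ravel())
print(normalized.ravel())
```

Normalized values always stay between 0 and 1, while standardized values are centered on 0 and can fall outside -1 to 1.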

For feature scaling, we will import the StandardScaler class
of the sklearn.preprocessing library as:

from sklearn.preprocessing import StandardScaler

Now, we will create the object of the StandardScaler class for the independent variables or
features. Then we will fit and transform the training dataset.

st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)

For the test dataset, we will directly apply the transform() function instead
of fit_transform(), because the scaler has already been fitted on the training set and the
test set must be scaled with the same mean and standard deviation.

x_test= st_x.transform(x_test)
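The difference between fit_transform() and transform() can be made visible by inspecting what the scaler learned. In this sketch the small training and test arrays are illustrative (age, salary) rows; mean_ is the attribute in which StandardScaler stores the training means.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative (age, salary) rows.
x_train = np.array([[30.0, 54000.0], [40.0, 65000.0], [50.0, 88000.0]])
x_test = np.array([[35.0, 58000.0]])

scaler = StandardScaler()

# fit_transform learns the mean and standard deviation from the training set.
x_train_scaled = scaler.fit_transform(x_train)

# transform reuses those training statistics on the test set.
x_test_scaled = scaler.transform(x_test)

print(scaler.mean_)
print(x_test_scaled)
```

Because the test row is scaled with the training mean and standard deviation, no information from the test set leaks into the preprocessing.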

Output:

By executing the above lines of code, we will get the scaled values for x_train and x_test.
After standardization, each column of x_train has mean 0 and standard deviation 1, so the
values are centered on 0 and are not confined to a fixed range such as -1 to 1.

Note: Here, we have not scaled the dependent variable because it has only two values,
0 and 1. If the dependent variable had a wider range of values, we might also need to
scale it.

Combining all the steps:

Now, in the end, we can combine all the steps together to make our complete code
more understandable.

# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('Dataset.csv')

#Extracting Independent Variable
x= data_set.iloc[:, :-1].values

#Extracting Dependent variable
y= data_set.iloc[:, 3].values

#handling missing data(Replacing missing data with the mean value)
from sklearn.impute import SimpleImputer
imputer= SimpleImputer(missing_values= nm.nan, strategy='mean')

#Fitting imputer object to the numeric columns of x.
imputer= imputer.fit(x[:, 1:3])

#Replacing missing data with the calculated mean value
x[:, 1:3]= imputer.transform(x[:, 1:3])

#for Country Variable
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

#Encoding for dummy variables (one-hot encode column 0, keep Age and Salary)
column_transformer= ColumnTransformer([('country', OneHotEncoder(), [0])], remainder='passthrough', sparse_threshold=0)
x= column_transformer.fit_transform(x).astype(float)

#encoding for purchased variable
labelencoder_y= LabelEncoder()
y= labelencoder_y.fit_transform(y)

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)

#Feature Scaling of datasets
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)

In the above code, we have included all the data preprocessing steps together. But there
are some steps or lines of code which are not necessary for all machine learning models.
So we can exclude them from our code to make it reusable for all models.
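One way to make the preprocessing steps reusable is to bundle them into a single scikit-learn object. The sketch below is an alternative arrangement, not the tutorial's code: it combines imputation, scaling, and one-hot encoding with Pipeline and ColumnTransformer, and reads the dataset from an in-memory string whose header names are assumed from the variable names used in this section.

```python
import io
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# In-memory stand-in for Dataset.csv (values from the outputs above).
csv_text = """Country,Age,Salary,Purchased
India,38,68000,No
France,43,45000,Yes
Germany,30,54000,No
France,48,65000,No
Germany,40,,Yes
India,35,58000,Yes
Germany,,53000,No
France,49,79000,Yes
India,50,88000,No
France,37,77000,Yes
"""
data_set = pd.read_csv(io.StringIO(csv_text))
features = data_set[["Country", "Age", "Salary"]]

# Numeric columns: fill missing values with the mean, then standardize.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Country is one-hot encoded; Age and Salary go through the numeric steps.
preprocess = ColumnTransformer(
    [
        ("country", OneHotEncoder(handle_unknown="ignore"), ["Country"]),
        ("numeric", numeric_steps, ["Age", "Salary"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

prepared = preprocess.fit_transform(features)
print(prepared.shape)
```

In a real project the object would be fitted on the training split only and then applied to the test split with transform(); it is fitted on all rows here only to keep the sketch short.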

Supervised Machine Learning


Supervised learning is the type of machine learning in which machines are trained
using well "labelled" training data, and on the basis of that data, machines predict the
output. Labelled data means that the input data is already tagged with the correct
output.

In supervised learning, the training data provided to the machine works as the
supervisor that teaches the machine to predict the output correctly. It applies the same
concept as a student learning under the supervision of a teacher.

Supervised learning is a process of providing input data as well as correct output data to
the machine learning model. The aim of a supervised learning algorithm is to find a
mapping function to map the input variable (x) to the output variable (y).

In the real world, supervised learning can be used for risk assessment, image
classification, fraud detection, spam filtering, etc.

How Supervised Learning Works?


In supervised learning, models are trained using a labelled dataset, from which the model
learns about each type of data. Once the training process is completed, the model is
tested on test data (a held-out subset of the dataset that was not used for training), and it predicts the
output for that data.

The working of supervised learning can be easily understood by the below example:

Suppose we have a dataset of different types of shapes which includes square,
rectangle, triangle, and polygon. Now the first step is that we need to train the model
for each shape.

o If the given shape has four sides, and all the sides are equal, then it will be
labelled as a square.
o If the given shape has three sides, then it will be labelled as a triangle.
o If the given shape has six equal sides, then it will be labelled as a hexagon.

Now, after training, we test our model using the test set, and the task of the model is to
identify the shape.

The machine is already trained on all types of shapes, and when it finds a new shape, it
classifies the shape on the basis of its number of sides and predicts the output.
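The shape example can be sketched with a small classifier. Everything here is an assumption made for illustration: each shape is described by two made-up features (number of sides, and whether all sides are equal), and a decision tree stands in for "the model".

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical features: [number_of_sides, all_sides_equal (1 or 0)].
training_shapes = [[4, 1], [3, 0], [3, 1], [6, 1], [4, 1], [6, 1]]
labels = ["square", "triangle", "triangle", "hexagon", "square", "hexagon"]

# Train on the labelled shapes.
model = DecisionTreeClassifier(random_state=0)
model.fit(training_shapes, labels)

# Ask the trained model to identify shapes from their features.
predictions = model.predict([[3, 0], [6, 1]])
print(predictions)
```

The labels play the role of the supervisor: the model only learns which side counts go with which name because every training row comes with its correct answer.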

Steps Involved in Supervised Learning:


o First, determine the type of training dataset.
o Collect/gather the labelled training data.
o Split the dataset into a training dataset, a test dataset, and a validation
dataset.
o Determine the input features of the training dataset, which should carry enough
information for the model to accurately predict the output.
o Determine a suitable algorithm for the model, such as support vector machine,
decision tree, etc.
o Execute the algorithm on the training dataset. Sometimes we need validation sets
to tune the control parameters; these are held out from the training data.
o Evaluate the accuracy of the model on the test set. If the model
predicts the correct output for most test examples, the model is accurate.
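The steps above can be sketched end to end. The choices in this example are assumptions, not the tutorial's: it uses scikit-learn's bundled iris dataset as the labelled data, a decision tree as the algorithm, and max_depth as the control parameter tuned on the validation set.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Labelled data: 150 flowers with 4 features each.
features, labels = load_iris(return_X_y=True)

# Hold out 20% as the test set, then 25% of the rest as the validation set.
x_rest, x_test, y_rest, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=0, stratify=labels
)
x_train, x_val, y_train, y_val = train_test_split(
    x_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest
)

# Tune the control parameter on the validation set.
best_depth, best_score = None, -1.0
for depth in (1, 2, 3, 4):
    candidate = DecisionTreeClassifier(max_depth=depth, random_state=0)
    candidate.fit(x_train, y_train)
    score = accuracy_score(y_val, candidate.predict(x_val))
    if score > best_score:
        best_depth, best_score = depth, score

# Retrain with the chosen depth and evaluate once on the test set.
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
final_model.fit(x_rest, y_rest)
test_accuracy = accuracy_score(y_test, final_model.predict(x_test))
print(best_depth, test_accuracy)
```

The test set is touched only once, at the very end, so its accuracy is an honest estimate of performance on new data.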

Types of supervised Machine learning Algorithms:


Supervised learning can be further divided into two types of problems:

1. Regression

Regression algorithms are used when there is a relationship between the input variable and
a continuous output variable. They are used for the prediction of continuous variables, such as
weather forecasting, market trends, etc. Below are some popular regression algorithms
which come under supervised learning:

o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression
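As a minimal illustration of the first algorithm in the list, the sketch below fits a linear regression to made-up points that lie exactly on the line y = 3x + 2, so the learned slope and intercept are easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up points on the line y = 3x + 2.
x = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([5.0, 8.0, 11.0, 14.0])

model = LinearRegression().fit(x, y)

# Predict a continuous value for an unseen input.
predicted = model.predict(np.array([[5.0]]))[0]
print(model.coef_[0], model.intercept_, predicted)
```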

2. Classification
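Classification algorithms are used when the output variable is categorical, which means
it takes one of a fixed set of classes such as Yes-No, Male-Female, True-False, etc.
A typical application is spam filtering. Below are some popular classification algorithms
which come under supervised learning: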

o Random Forest
o Decision Trees
o Logistic Regression
o Support vector Machines
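A two-class problem can be sketched with logistic regression from the list above. The single feature and its values are made up so that the two classes are clearly separated.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up feature values; small values belong to class 0, large to class 1.
x = np.array([[1.0], [2.0], [3.0], [8.0], [9.0], [10.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(x, y)

# Predict the class of two unseen inputs.
predictions = model.predict(np.array([[2.5], [9.5]]))

# predict_proba gives the estimated probability of each class.
probabilities = model.predict_proba(np.array([[2.5]]))[0]
print(predictions, probabilities)
```

Unlike regression, the output is one of the known classes rather than a continuous number.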

Note: We will discuss these algorithms in detail in later chapters.

Advantages of Supervised learning:


o With the help of supervised learning, the model can predict the output on the
basis of prior experience.
o In supervised learning, we can have an exact idea about the classes of objects.
o Supervised learning models help us to solve various real-world problems such
as fraud detection, spam filtering, etc.

Disadvantages of supervised learning:

o Supervised learning models are not suitable for handling very complex tasks.
o Supervised learning cannot predict the correct output if the test data is very different
from the training dataset.
o Training requires a lot of computation time.
o In supervised learning, we need enough knowledge about the classes of objects.

Unsupervised Machine Learning
In the previous topic, we learned supervised machine learning in which models are
trained using labeled data under the supervision of training data. But there may be
many cases in which we do not have labeled data and need to find the hidden patterns
from the given dataset. So, to solve such types of cases in machine learning, we need
unsupervised learning techniques.

What is Unsupervised Learning?


As the name suggests, unsupervised learning is a machine learning technique in which
models are not supervised using a labelled training dataset. Instead, the models themselves find the hidden
patterns and insights in the given data. It can be compared to the learning which takes
place in the human brain while learning new things. It can be defined as:

Unsupervised learning is a type of machine learning in which models are trained using
an unlabeled dataset and are allowed to act on that data without any supervision.

Unsupervised learning cannot be directly applied to a regression or classification
problem because, unlike supervised learning, we have the input data but no
corresponding output data. The goal of unsupervised learning is to find the underlying
structure of the dataset, group the data according to similarities, and represent the
dataset in a compressed format.

Example: Suppose the unsupervised learning algorithm is given an input dataset
containing images of different types of cats and dogs. The algorithm is never trained
on labelled examples, which means it does not have any idea about the features of the
dataset. The task of the unsupervised learning algorithm is to identify the image
features on its own. The algorithm will perform this task by
clustering the image dataset into groups according to similarities between images.

Why use Unsupervised Learning?
Below are some main reasons which describe the importance of unsupervised learning:

o Unsupervised learning is helpful for finding useful insights from the data.
o Unsupervised learning is similar to how a human learns to think from their own
experiences, which makes it closer to real AI.
o Unsupervised learning works on unlabeled and uncategorized data, which makes
it more widely applicable.
o In the real world, we do not always have input data with the corresponding output, so
to solve such cases, we need unsupervised learning.

Working of Unsupervised Learning


The working of unsupervised learning can be understood as follows:
Here, we take unlabeled input data, which means it is not categorized and
corresponding outputs are not given. This unlabeled input data is fed to the
machine learning model in order to train it. First, the model interprets the raw data to find
the hidden patterns in the data, and then applies a suitable algorithm such as k-means
clustering, hierarchical clustering, etc.

Once the suitable algorithm is applied, it divides the data objects into
groups according to the similarities and differences between the objects.
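The grouping step can be sketched with k-means clustering. The six two-dimensional points are made up so that they form two well-separated groups; no labels are given to the algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up unlabeled points: three near (1, 1) and three near (8, 8).
points = np.array([
    [1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
    [8.0, 8.0], [8.2, 7.9], [7.9, 8.1],
])

# Ask for two clusters; the algorithm decides which point goes where.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
cluster_labels = model.labels_

print(cluster_labels)
print(model.cluster_centers_)
```

The cluster numbers themselves are arbitrary; what matters is that nearby points end up with the same number.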

Types of Unsupervised Learning Algorithm:


Unsupervised learning algorithms can be further categorized into two types of
problems:
o Clustering: Clustering is a method of grouping objects into clusters such that
objects with the most similarities remain in one group and have few or no similarities
with the objects of another group. Cluster analysis finds the commonalities
between the data objects and categorizes them according to the presence and absence
of those commonalities.
o Association: Association rule learning is an unsupervised learning method which is
used for finding relationships between variables in a large database. It
determines the sets of items that occur together in the dataset. Association rules
make marketing strategy more effective. For example, people who buy item X
(say, bread) also tend to purchase item Y (butter/jam). A typical
example of association rule learning is Market Basket Analysis.
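The bread-and-butter rule can be quantified with two standard measures, support and confidence. This is a plain-Python sketch on five hypothetical shopping baskets; the helper functions support() and confidence() are written for this example and do not come from a library.

```python
# Five hypothetical shopping baskets.
baskets = [
    {"bread", "butter"},
    {"bread", "jam"},
    {"bread", "butter", "milk"},
    {"milk"},
    {"bread", "butter", "jam"},
]


def support(itemset):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent):
    """Of the baskets containing antecedent, the fraction that also contain consequent."""
    return support(antecedent | consequent) / support(antecedent)


bread_support = support({"bread"})
rule_confidence = confidence({"bread"}, {"butter"})
print(bread_support, rule_confidence)
```

A high confidence for the rule bread -> butter is the kind of pattern Market Basket Analysis looks for.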

Note: We will learn these algorithms in later chapters.

Unsupervised Learning algorithms:


Below is the list of some popular unsupervised learning algorithms:

o K-means clustering
o KNN (k-nearest neighbors)
o Hierarchical clustering
o Anomaly detection
o Neural Networks
o Principal Component Analysis
o Independent Component Analysis
o Apriori algorithm
o Singular value decomposition

Advantages of Unsupervised Learning


o Unsupervised learning can be used for more complex tasks than supervised
learning because it does not require labeled input data.
o Unsupervised learning is often preferable because unlabeled data is easier to get
than labeled data.

Disadvantages of Unsupervised Learning

o Unsupervised learning is intrinsically more difficult than supervised learning, as it
does not have corresponding output to learn from.
o The result of an unsupervised learning algorithm might be less accurate, as the input
data is not labeled and the algorithm does not know the exact output in advance.

Difference between Supervised and Unsupervised Learning

Supervised and unsupervised learning are two techniques of machine
learning, but they are used in different scenarios and with
different datasets. Both learning methods are explained below, along with
a table of their differences.

Supervised Machine Learning:

Supervised learning is a machine learning method in which models are trained
using labeled data. In supervised learning, models need to find the mapping
function that maps the input variable (X) to the output variable (Y).

Supervised learning needs supervision to train the model, similar to how a
student learns things in the presence of a teacher. Supervised learning can be
used for two types of problems: Classification and Regression.

Example: Suppose we have images of different types of fruits. The task of our
supervised learning model is to identify the fruits and classify them accordingly.
To identify the images in supervised learning, we give the input data as well
as the output for it, which means we train the model on the shape, size, color,
and taste of each fruit. Once the training is completed, we test the model by
giving it a new set of fruits. The model will identify each fruit and predict the output
using a suitable algorithm.

| Supervised Learning | Unsupervised Learning |
|---|---|
| Supervised learning algorithms are trained using labeled data. | Unsupervised learning algorithms are trained using unlabeled data. |
| Supervised learning model takes direct feedback to check if it is predicting correct output or not. | Unsupervised learning model does not take any feedback. |
| Supervised learning model predicts the output. | Unsupervised learning model finds the hidden patterns in data. |
| In supervised learning, input data is provided to the model along with the output. | In unsupervised learning, only input data is provided to the model. |
| The goal of supervised learning is to train the model so that it can predict the output when it is given new data. | The goal of unsupervised learning is to find the hidden patterns and useful insights from the unknown dataset. |
| Supervised learning needs supervision to train the model. | Unsupervised learning does not need any supervision to train the model. |
| Supervised learning can be categorized in Classification and Regression problems. | Unsupervised learning can be classified in Clustering and Association problems. |
| Supervised learning can be used for those cases where we know the input as well as corresponding outputs. | Unsupervised learning can be used for those cases where we have only input data and no corresponding output data. |
| Supervised learning model produces an accurate result. | Unsupervised learning model may give less accurate result as compared to supervised learning. |
| Supervised learning is not close to true Artificial Intelligence as in this, we first train the model for each data, and then only it can predict the correct output. | Unsupervised learning is more close to the true Artificial Intelligence as it learns similarly as a child learns daily routine things by his experiences. |
| It includes various algorithms such as Linear Regression, Logistic Regression, Support Vector Machine, Multi-class Classification, Decision tree, Bayesian Logic, etc. | It includes various algorithms such as Clustering, KNN, and Apriori algorithm. |

o Unsupervised Machine Learning:


o Unsupervised learning is another machine learning method in which patterns
inferred from the unlabeled input data. The goal of unsupervised learning is to
find the structure and patterns from the input data. Unsupervised learning does
not need any supervision. Instead, it finds patterns from the data by its own.
o Learn more Unsupervised Machine Learning
o Unsupervised learning can be used for two types of
problems: Clustering and Association.
o Example: To understand the unsupervised learning, we will use the example
given above. So unlike supervised learning, here we will not provide any
supervision to the model. We will just provide the input dataset to the model and
allow the model to find the patterns from the data. With the help of a suitable
algorithm, the model will train itself and divide the fruits into different groups
according to the most similar features between them.
o The main differences between supervised and unsupervised learning are summarized
in the comparison above.
o Note: Supervised and unsupervised learning are both machine learning
methods, and the choice between them depends on factors related to the
structure and volume of your dataset and the use case of the problem.

Regression Analysis in Machine learning


Regression analysis is a statistical method to model the relationship between a
dependent (target) variable and one or more independent (predictor) variables.
More specifically, regression analysis helps us understand how the value of
the dependent variable changes with one independent variable when the
other independent variables are held fixed. It predicts continuous/real values such
as temperature, age, salary, price, etc.

We can understand the concept of regression analysis using the below example:

Example: Suppose there is a marketing company A, which runs various advertisements
every year and earns sales from them. The list below shows the advertising spend of the
company in the last 5 years and the corresponding sales:
Now, the company wants to spend $200 on advertising in the year 2019 and wants
to predict the sales for that year. To solve this type of
prediction problem in machine learning, we need regression analysis.

Regression is a supervised learning technique which helps in finding the relationship
between variables and enables us to predict a continuous output variable based on
one or more predictor variables. It is mainly used for prediction, forecasting, time
series modeling, and exploring cause-and-effect relationships between variables.

In regression, we fit a line or curve to the given datapoints on a graph of the
variables; using this fit, the machine learning model can make predictions about the
data. In simple words, "Regression finds a line or curve that passes as close as possible to the
datapoints on the target-predictor graph, in such a way that the vertical distance
between the datapoints and the regression line is minimum." The distance between the
datapoints and the line tells whether a model has captured a strong relationship or not.

Some examples of regression are:

o Prediction of rain using temperature and other factors
o Determining market trends
o Prediction of road accidents due to rash driving.

Terminologies Related to the Regression Analysis:
o Dependent Variable: The main factor in Regression analysis which we want to
predict or understand is called the dependent variable. It is also called target
variable.
o Independent Variable: The factors which affect the dependent variable, or
which are used to predict its values, are called
independent variables, also called predictors.
o Outliers: An outlier is an observation with either a very low or a very
high value in comparison to the other observed values. An outlier may distort the
result, so it should be detected and handled carefully.
o Multicollinearity: If the independent variables are highly correlated with each
other, the condition is called multicollinearity. It
should be avoided in the dataset, because it creates problems while ranking
the most influential variables.
o Underfitting and Overfitting: If our algorithm works well with the training
dataset but not well with test dataset, then such problem is called Overfitting.
And if our algorithm does not perform well even with training dataset, then such
problem is called underfitting.

Why do we use Regression Analysis?


As mentioned above, regression analysis helps in the prediction of a continuous
variable. There are various scenarios in the real world where we need future
predictions, such as weather conditions, sales, and marketing trends. For such
cases we need a technique which can make accurate predictions: regression
analysis, a statistical method used in
machine learning and data science. Below are some other reasons for using regression
analysis:

o Regression estimates the relationship between the target and the independent
variable.
o It is used to find the trends in data.
o It helps to predict real/continuous values.
o By performing the regression, we can estimate the most important
factor, the least important factor, and how each factor affects the
target.

Types of Regression
There are various types of regression which are used in data science and machine
learning. Each type has its own importance in different scenarios, but at the core, all
regression methods analyze the effect of the independent variables on the dependent
variable. Here we discuss some important types of regression, which are given
below:

o Linear Regression
o Logistic Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
o Ridge Regression
o Lasso Regression

Linear Regression:

o Linear regression is a statistical regression method which is used for predictive
analysis.
o It is one of the simplest algorithms; it works on regression and
shows the relationship between continuous variables.
o It is used for solving regression problems in machine learning.
o Linear regression shows the linear relationship between the independent variable
(X-axis) and the dependent variable (Y-axis), hence called linear regression.
o If there is only one input variable (x), then such linear regression is called simple
linear regression. And if there is more than one input variable, then such linear
regression is called multiple linear regression.
o The relationship between variables in the linear regression model can be
illustrated by predicting the salary of an
employee on the basis of years of experience.
o Below is the mathematical equation for linear regression:

Y = aX + b

Here, Y = dependent variable (target variable),
X = independent variable (predictor variable),
a and b are the linear coefficients (a is the slope and b is the intercept).

Some popular applications of linear regression are:

o Analyzing trends and sales estimates
o Salary forecasting
o Real estate prediction
o Arriving at ETAs in traffic.

Logistic Regression:

o Logistic regression is another supervised learning algorithm which is used to
solve classification problems. In classification problems, the dependent
variable is in a binary or discrete format such as 0 or 1.
o The logistic regression algorithm works with categorical target variables such as 0 or 1,
Yes or No, True or False, Spam or not spam, etc.
o It is a predictive analysis algorithm which works on the concept of probability.
o Logistic regression is a type of regression, but it differs from the linear
regression algorithm in how it is used.
o Logistic regression uses the sigmoid function (also called the logistic function), which
maps any real-valued input to a value between 0 and 1. This sigmoid function is used
to model the data in logistic regression. The function can be represented as:

f(x) = 1 / (1 + e^(-x))

o f(x) = output between 0 and 1
o x = input to the function
o e = base of the natural logarithm.

When we provide the input values (data) to the function, it gives an S-shaped curve.

o It uses the concept of a threshold level: values above the threshold are
mapped to class 1, and values below the threshold are mapped to class 0.
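
The sigmoid and the threshold rule described above can be sketched in a few lines of Python. This is only an illustration: the function names `sigmoid` and `classify` and the default threshold of 0.5 are assumptions made for the example, not part of the original tutorial.

```python
import math

def sigmoid(x):
    """Logistic (sigmoid) function: maps any real x into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def classify(x, threshold=0.5):
    """Return class 1 if the sigmoid output reaches the threshold, else class 0."""
    # Assumption: a value exactly at the threshold is assigned to class 1.
    return 1 if sigmoid(x) >= threshold else 0

for value in (-4.0, 0.0, 4.0):
    print(value, round(sigmoid(value), 3), classify(value))
```

Large negative inputs give outputs near 0 and large positive inputs give outputs near 1, which produces the S-shaped curve.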

There are three types of logistic regression:

o Binary (0/1, pass/fail)
o Multinomial (cats, dogs, lions)
o Ordinal (low, medium, high)

Polynomial Regression:

o Polynomial Regression is a type of regression which models a non-linear
dataset using a linear model.
o It is similar to multiple linear regression, but it fits a non-linear curve between the
value of x and the corresponding conditional values of y.
o Suppose there is a dataset whose datapoints are arranged in a
non-linear fashion; in such a case, linear regression will not fit those
datapoints well. To cover such datapoints, we need Polynomial regression.
o In Polynomial regression, the original features are transformed into
polynomial features of a given degree and then modeled using a linear
model, which means the datapoints are fitted using a polynomial curve.
o The equation for polynomial regression is also derived from the linear regression
equation: the linear regression equation Y = b0 + b1x is transformed into the
polynomial regression equation Y = b0 + b1x + b2x^2 + b3x^3 + ... + bnx^n.
o Here Y is the predicted/target output, b0, b1, ..., bn are the regression
coefficients, and x is our independent/input variable.
o The model is still linear because it is linear in the coefficients, even though it
contains quadratic and higher-degree terms of x.

Note: This is different from Multiple Linear regression in that, in Polynomial
regression, a single variable appears with different degrees instead of multiple variables with the
same degree.
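
The idea of transforming one feature into polynomial features and then fitting an ordinary linear model can be sketched as follows. The data is hypothetical and noise-free (generated from y = 1 + 2x + 3x^2), and NumPy's least-squares solver stands in for whichever linear model would be used in practice.

```python
import numpy as np

# Hypothetical noise-free data generated from y = 1 + 2x + 3x^2.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x + 3.0 * x**2

# Transform the single feature x into polynomial features [1, x, x^2].
degree = 2
features = np.column_stack([x**p for p in range(degree + 1)])

# Fit an ordinary linear model (least squares) on the transformed features.
coefficients, *_ = np.linalg.lstsq(features, y, rcond=None)
print(coefficients)  # estimates of b0, b1, b2
```

The fit is linear in b0, b1, b2 even though the curve is non-linear in x, which is why the polynomial model still counts as a linear model.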

Support Vector Regression:


Support Vector Machine is a supervised learning algorithm which can be used for
regression as well as classification problems. When it is used for regression problems,
it is termed Support Vector Regression (SVR).

Support Vector Regression is a regression algorithm which works for continuous
variables. Below are some keywords which are used in Support Vector Regression:

o Kernel: It is a function used to map lower-dimensional data into higher-dimensional
data.
o Hyperplane: In a general SVM, it is the separation line between two classes, but in
SVR, it is the line which helps to predict the continuous variable and covers most of
the datapoints.
o Boundary line: Boundary lines are the two lines on either side of the hyperplane, which
create a margin for the datapoints.
o Support vectors: Support vectors are the datapoints that determine the position of the
hyperplane and boundary lines; in classification, they are the points of each class
nearest to the hyperplane.

In SVR, we try to determine a hyperplane with a margin such that the
maximum number of datapoints are covered within that margin. The main goal of SVR is
to keep as many datapoints as possible within the boundary lines, with the hyperplane
(best-fit line) running through them. In the accompanying
figure, the blue line is the hyperplane, and the other two lines are the boundary
lines.

Decision Tree Regression:

o Decision Tree is a supervised learning algorithm which can be used for solving
both classification and regression problems.
o It can solve problems for both categorical and numerical data
o Decision Tree regression builds a tree-like structure in which each internal node
represents the "test" for an attribute, each branch represent the result of the test,
and each leaf node represents the final decision or result.
o A decision tree is constructed starting from the root node/parent node (dataset),
which splits into left and right child nodes (subsets of the dataset). These child nodes
are further divided into their own child nodes, and themselves become the parent
nodes of those nodes.

The accompanying image shows an example of Decision Tree regression, in which the model is trying
to predict the choice of a person between a sports car and a luxury car.

Random Forest Regression:

o Random forest is one of the most powerful supervised learning algorithms which
is capable of performing regression as well as classification tasks.
o Random Forest regression is an ensemble learning method which combines
multiple decision trees and predicts the final output as the average of
the tree outputs. The combined decision trees are called base models, and the prediction
can be represented more formally as:

g(x) = (f0(x) + f1(x) + f2(x) + ... + f(N-1)(x)) / N

where f0, f1, ... are the N base models.

o Random forest uses the Bagging (Bootstrap Aggregation) technique of ensemble
learning, in which the decision trees are trained in parallel and do not interact
with each other.
o With the help of Random Forest regression, we can reduce overfitting in the
model by training each tree on a random subset of the dataset.
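
The averaging step can be illustrated with a very small sketch. The three "trees" below are hypothetical stand-in functions, not trained decision trees, and `ensemble_predict` is a name chosen for this example; a real random forest would also train each tree on its own bootstrap sample.

```python
def ensemble_predict(base_models, x):
    """Average the predictions of all base models for input x."""
    predictions = [model(x) for model in base_models]
    return sum(predictions) / len(predictions)

# Three hypothetical base models, each a simple function of x.
base_models = [
    lambda x: 2.0 * x,
    lambda x: 2.0 * x + 1.0,
    lambda x: 2.0 * x - 1.0,
]

print(ensemble_predict(base_models, 3.0))
```

Because the individual errors (+1 and -1 here) partly cancel in the average, the combined prediction is usually more stable than any single tree.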
Ridge Regression:

o Ridge regression is one of the most robust versions of linear regression, in which
a small amount of bias is introduced so that we can get better long-term
predictions.
o The amount of bias added to the model is known as the Ridge Regression penalty.
We compute this penalty term by multiplying lambda by the sum of the squared
weights of the individual features.
o The equation for ridge regression will be:

Cost = Σ (yi - ŷi)^2 + λ Σ (wj)^2

where yi is the actual value, ŷi is the predicted value, wj are the feature weights, and λ
is the penalty strength.

o A general linear or polynomial regression becomes unreliable if there is high collinearity
between the independent variables; to solve such problems, Ridge regression
can be used.
o Ridge regression is a regularization technique, which is used to reduce the
complexity of the model. It is also called L2 regularization.
o It helps to solve problems in which we have more parameters than samples.
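
As a rough illustration of how the penalty shrinks the weights, the sketch below uses the standard closed-form ridge solution w = (X^T X + λI)^(-1) X^T y. It assumes a model without an intercept, and the data and the helper name `ridge_fit` are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge weights (no intercept): solve (X^T X + lam * I) w = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Hypothetical data with two nearly collinear features.
X = np.array([[1.0, 1.01], [2.0, 1.98], [3.0, 3.02], [4.0, 3.99]])
y = np.array([2.0, 4.1, 5.9, 8.0])

w_ols = ridge_fit(X, y, lam=0.0)    # lam = 0 reduces to ordinary least squares
w_ridge = ridge_fit(X, y, lam=1.0)  # lam > 0 shrinks the weights

print("OLS weights:  ", w_ols)
print("Ridge weights:", w_ridge)
```

With nearly collinear features the unpenalized weights can become large and unstable, while the ridge weights have a smaller overall magnitude.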

Lasso Regression:

o Lasso regression is another regularization technique to reduce the complexity of
the model.
o It is similar to Ridge Regression except that the penalty term contains the
absolute values of the weights instead of their squares.
o Because it uses absolute values, it can shrink a coefficient exactly to 0, whereas Ridge
Regression can only shrink it close to 0.
o It is also called L1 regularization. The equation for Lasso regression will be:

Cost = Σ (yi - ŷi)^2 + λ Σ |wj|

where yi is the actual value, ŷi is the predicted value, wj are the feature weights, and λ
is the penalty strength.

Linear Regression in Machine Learning


Linear regression is one of the easiest and most popular Machine Learning algorithms. It
is a statistical method that is used for predictive analysis. Linear regression makes
predictions for continuous/real or numeric variables such as sales, salary, age, product
price, etc.

The linear regression algorithm shows a linear relationship between a dependent (y) variable and
one or more independent (x) variables, hence it is called linear regression. Because linear
regression models a linear relationship, it finds how the value of the
dependent variable changes according to the value of the independent variable.

The linear regression model provides a sloped straight line representing the relationship
between the variables.
Mathematically, we can represent a linear regression as:

y = a0 + a1x + ε

Here,

y = Dependent variable (target variable)
x = Independent variable (predictor variable)
a0 = Intercept of the line (gives an additional degree of freedom)
a1 = Linear regression coefficient (scale factor applied to each input value)
ε = Random error

The observed values of the x and y variables form the training dataset for the Linear
Regression model.

Types of Linear Regression


Linear regression can be further divided into two types of algorithm:

o Simple Linear Regression:
If a single independent variable is used to predict the value of a numerical
dependent variable, then such a Linear Regression algorithm is called Simple
Linear Regression.
o Multiple Linear Regression:
If more than one independent variable is used to predict the value of a numerical
dependent variable, then such a Linear Regression algorithm is called Multiple
Linear Regression.

Linear Regression Line


A straight line showing the relationship between the dependent and independent
variables is called a regression line. A regression line can show two types of
relationship:

o Positive Linear Relationship:
If the dependent variable increases on the Y-axis as the independent variable
increases on the X-axis, then such a relationship is termed a positive linear
relationship.
o Negative Linear Relationship:
If the dependent variable decreases on the Y-axis as the independent variable
increases on the X-axis, then such a relationship is called a negative linear
relationship.

Finding the best fit line:
When working with linear regression, our main goal is to find the best fit line, which means
the error between the predicted values and actual values should be minimized. The best fit
line will have the least error.

Different values for the weights or coefficients of the line (a0, a1) give different lines of
regression, so we need to calculate the best values for a0 and a1 to find the best fit line.
To calculate this, we use a cost function.

Cost function-

o The cost function is used to estimate the values of the
coefficients (a0, a1) for the best fit line.
o The regression coefficients or weights are chosen by minimizing the cost function,
which measures how well a linear regression model is performing.
o We can use the cost function to measure the accuracy of the mapping function,
which maps the input variable to the output variable. This mapping function is
also known as the hypothesis function.

For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is
the average of the squared errors between the predicted values and the actual values.
For the above linear equation, MSE can be calculated as:

MSE = (1/N) Σ (Yi - (a1xi + a0))^2, summed over i = 1, ..., N

Where,

N = Total number of observations
Yi = Actual value
(a1xi+a0) = Predicted value.

Residuals: The distance between the actual value and the predicted value is called the residual.
If the observed points are far from the regression line, then the residuals will be high, and
so the cost function will be high. If the scatter points are close to the regression line, then the
residuals will be small and hence the cost function will be small.
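
The MSE computation described above can be written directly in plain Python. This is a minimal sketch with hypothetical data lying exactly on the line y = 2x + 1; the function name `mean_squared_error` is chosen for this example.

```python
def mean_squared_error(xs, ys, a0, a1):
    """MSE of the line y = a1 * x + a0 over the observations (xs, ys)."""
    residuals = [y - (a1 * x + a0) for x, y in zip(xs, ys)]
    return sum(r ** 2 for r in residuals) / len(residuals)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # hypothetical data lying exactly on y = 2x + 1

print(mean_squared_error(xs, ys, a0=1.0, a1=2.0))  # line through every point
print(mean_squared_error(xs, ys, a0=0.0, a1=2.0))  # same slope, shifted down by 1
```

The first line has zero residuals and therefore zero cost, while the shifted line has a residual of 1 at every point and a cost of 1.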

Gradient Descent:

o Gradient descent is used to minimize the MSE by calculating the gradient of the
cost function.
o A regression model uses gradient descent to update the coefficients of the line
by reducing the cost function.
o It starts from randomly selected coefficient values and then iteratively
updates them to reach the minimum of the cost function.
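
A minimal sketch of gradient descent on the MSE cost for the line y = a1x + a0 is shown below. The learning rate, iteration count, and data are assumptions chosen so that this small example converges; the coefficients start at zero rather than at random values so that the run is reproducible.

```python
def gradient_descent(xs, ys, learning_rate=0.05, iterations=2000):
    """Fit y = a1 * x + a0 by gradient descent on the MSE cost."""
    a0, a1 = 0.0, 0.0  # fixed start for reproducibility (the text uses random values)
    n = len(xs)
    for _ in range(iterations):
        # Prediction error for every observation with the current coefficients.
        errors = [(a1 * x + a0) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the MSE with respect to a0 and a1.
        grad_a0 = (2.0 / n) * sum(errors)
        grad_a1 = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        # Move each coefficient a small step against its gradient.
        a0 -= learning_rate * grad_a0
        a1 -= learning_rate * grad_a1
    return a0, a1

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # hypothetical data on the line y = 2x + 1
a0, a1 = gradient_descent(xs, ys)
print(round(a0, 4), round(a1, 4))
```

If the learning rate is too large the updates can diverge, and if it is too small many more iterations are needed, so both values usually have to be tuned for the data at hand.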

Model Performance:
The goodness of fit determines how well the line of regression fits the set of observations.
The process of finding the best model out of various models is called optimization. It
can be assessed with the method below:

1. R-squared method:

o R-squared is a statistical measure that determines the goodness of fit.
o It measures the strength of the relationship between the dependent and
independent variables on a scale of 0-100%.
o A high value of R-squared indicates a small difference between the predicted
values and actual values and hence represents a good model.
o It is also called the coefficient of determination, or the coefficient of multiple
determination for multiple regression.
o It can be calculated from the formula below:

R-squared = Explained variation / Total variation
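
A minimal sketch of the R-squared computation follows. It assumes the commonly used form 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the total sum of squares around the mean; the data values are hypothetical.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

actual = [3.0, 5.0, 7.0, 9.0]
print(r_squared(actual, [3.0, 5.0, 7.0, 9.0]))  # perfect predictions
print(r_squared(actual, [6.0, 6.0, 6.0, 6.0]))  # always predicting the mean
```

Perfect predictions give 1 (100%), and a model that always predicts the mean gives 0. The function is undefined when all actual values are identical, because SS_tot is then zero.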

Assumptions of Linear Regression


Below are some important assumptions of Linear Regression. These are formal
checks to make while building a Linear Regression model, which help to get the best possible
result from the given dataset.

o Linear relationship between the features and target:
Linear regression assumes a linear relationship between the dependent and
independent variables.
o Small or no multicollinearity between the features:
Multicollinearity means high correlation between the independent variables. Due
to multicollinearity, it may be difficult to find the true relationship between the
predictors and the target variable. In other words, it is difficult to determine which
predictor variable is affecting the target variable and which is not. So, the model
assumes either little or no multicollinearity between the features or independent
variables.
o Homoscedasticity Assumption:
Homoscedasticity is a situation in which the variance of the error term is the same for
all values of the independent variables. With homoscedasticity, there should be no clear
pattern in the scatter plot of the residuals.
o Normal distribution of error terms:
Linear regression assumes that the error term follows the normal
distribution. If the error terms are not normally distributed, then confidence
intervals will become either too wide or too narrow, which may cause difficulties
in interpreting the coefficients.
It can be checked using a q-q plot. If the plot shows a straight line without any
deviation, the error is normally distributed.
o No autocorrelations:
The linear regression model assumes no autocorrelation in the error terms. If there
is any correlation in the error term, it will drastically reduce the
accuracy of the model. Autocorrelation usually occurs if there is a dependency
between residual errors.

Simple Linear Regression in Machine Learning


Simple Linear Regression is a type of regression algorithm that models the relationship
between a dependent variable and a single independent variable. The relationship
shown by a Simple Linear Regression model is linear, i.e. a sloped straight line, hence it is
called Simple Linear Regression.

The key point in Simple Linear Regression is that the dependent variable must be a
continuous/real value. However, the independent variable can be
continuous or categorical.

Simple Linear regression algorithm has mainly two objectives:

o Model the relationship between the two variables, such as the relationship
between income and expenditure, experience and salary, etc.
o Forecast new observations, such as weather forecasting according to
temperature, revenue of a company according to the investments in a year, etc.

Simple Linear Regression Model:


The Simple Linear Regression model can be represented using the below equation:

y= a0+a1x+ ε

Where,

a0 = the intercept of the regression line (the value of y obtained by putting x=0)
a1 = the slope of the regression line, which tells whether the line is increasing
or decreasing.
ε = the error term (for a good model it will be negligible)
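
For a single predictor, a0 and a1 can be estimated with the standard least-squares formulas: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. The sketch below uses hypothetical toy values and a helper name chosen for this example.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (a0, a1) of the least-squares line y = a0 + a1 * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    a1 = s_xy / s_xx           # slope
    a0 = mean_y - a1 * mean_x  # intercept: value of y at x = 0
    return a0, a1

# Hypothetical toy values: years of experience and salary (in arbitrary units).
experience = [1.0, 2.0, 3.0, 4.0, 5.0]
salary = [3.0, 5.0, 7.0, 9.0, 11.0]

a0, a1 = fit_simple_linear_regression(experience, salary)
print(a0, a1)
```

A positive a1 means the line is increasing, and a negative a1 means it is decreasing.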

Implementation of Simple Linear Regression Algorithm using Python
Problem Statement example for Simple Linear Regression:

Here we are taking a dataset that has two variables: salary (dependent variable) and
experience (independent variable). The goals of this problem are:

o To find out if there is any correlation between these two variables.
o To find the best fit line for the dataset.
o To see how the dependent variable changes as the independent
variable changes.

In this section, we will create a Simple Linear Regression model to find out the best
fitting line for representing the relationship between these two variables.

To implement the Simple Linear regression model in machine learning using Python, we
need to follow the below steps:

Step-1: Data Pre-processing

The first step for creating the Simple Linear Regression model is data pre-processing.
We have already done it earlier in this tutorial. But there will be some changes, which are
given in the below steps:

o First, we will import the three important libraries, which will help us for loading
the dataset, plotting the graphs, and creating the Simple Linear Regression
model.

import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

o Next, we will load the dataset into our code:

data_set = pd.read_csv('Salary_Data.csv')

By executing the above line of code (Ctrl+Enter), we can inspect the dataset in the
Spyder IDE by clicking on the variable explorer option.
The dataset has two variables: Salary and Experience.

Note: In Spyder IDE, the folder containing the code file must be saved as a working
directory, and the dataset or csv file should be in the same folder.

o After that, we need to extract the dependent and independent variables from the
given dataset. The independent variable is years of experience, and the
dependent variable is salary. Below is code for it:

x = data_set.iloc[:, :-1].values
y = data_set.iloc[:, 1].values

In the above lines of code, for the x variable, we have used :-1 as the column selector, since
we want to take every column except the last one. For the y variable, we have used 1 as the
column index, since we want to extract the second column and indexing starts from
zero.

By executing the above lines of code, we will get the X (independent) and Y (dependent)
variables extracted from the given dataset.
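
To see what the two iloc selections return without the CSV file, the sketch below builds a small stand-in DataFrame. The column names and values are hypothetical; only the two-column layout matches the description above.

```python
import pandas as pd

# Hypothetical stand-in for Salary_Data.csv with the same two-column layout.
data_set = pd.DataFrame({
    "YearsExperience": [1.1, 2.0, 3.2],
    "Salary": [39000.0, 43000.0, 54000.0],
})

x = data_set.iloc[:, :-1].values  # all rows, every column except the last -> 2-D array
y = data_set.iloc[:, 1].values    # all rows, second column only -> 1-D array

print(x.shape, y.shape)
```

The feature array x stays two-dimensional (one column per feature), which is the shape scikit-learn estimators expect, while y is a flat one-dimensional array of targets.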

o Next, we will split both variables into the test set and training set. We have 30
observations, so we will take 20 observations for the training set and 10
observations for the test set. We are splitting our dataset so that we can train our
model using a training dataset and then test the model using a test dataset. The
code for this is given below:

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=1/3, random_state=0)

By executing the above code, we will get the x_test, x_train, y_test, and y_train datasets,
which can be inspected in the variable explorer.

o For Simple Linear Regression, we will not use feature scaling, because rescaling the
single feature does not change the predictions of an ordinary least-squares fit. Now,
our dataset is well prepared and we are going to start building a
Simple Linear Regression model for the given problem.
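
The 20/10 split described above can be checked on stand-in data. In this sketch the 30 observations are hypothetical; the call to train_test_split uses the same test_size and random_state as the tutorial code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 30 observations of one feature.
x = np.arange(30, dtype=float).reshape(-1, 1)
y = 2.0 * x.ravel() + 1.0

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=1/3, random_state=0
)

print(len(x_train), len(x_test))
```

Fixing random_state makes the shuffle reproducible, so the same rows land in the training and test sets on every run.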

Step-2: Fitting the Simple Linear Regression to the Training Set:

Now the second step is to fit our model to the training dataset. To do so, we will import
the LinearRegression class of the linear_model module from scikit-learn. After
importing the class, we are going to create an object of the class named regressor.
The code for this is given below:

# Fitting the Simple Linear Regression model to the training dataset
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)

In the above code, we have used the fit() method to fit our Simple Linear Regression
object to the training set. To the fit() method, we have passed x_train and y_train,
which are the training data for the independent and dependent variables. We have
fitted our regressor object to the training set so that the model can learn the
relationship between the predictor and target variables. After executing the above lines
of code, we will get the below output.

Output:

Out[7]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

The exact text depends on the scikit-learn version; recent versions print only
LinearRegression().

Step-3: Prediction of the test set result:

Our model has now learned the relationship between the
dependent (salary) and independent (experience) variables. So, now, our model is
ready to predict the output for new observations. In this step, we will provide the
test dataset (new observations) to the model to check whether it can predict the correct
output or not.

We will create two prediction vectors, y_pred and x_pred, which will contain the
predictions for the test dataset and the training set respectively.

# Prediction of Test and Training set result
y_pred = regressor.predict(x_test)
x_pred = regressor.predict(x_train)

On executing the above lines of code, two variables named y_pred and x_pred will
appear in the variable explorer; they contain the salary predictions for the test
set and the training set respectively.

Output:

You can check the variables by clicking on the variable explorer option in the IDE, and
compare the results by comparing the values of y_pred and y_test. By comparing
these values, we can check how well our model is performing.
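
Instead of comparing y_pred and y_test by eye, the comparison can be summarized with standard metrics. The sketch below is self-contained and uses hypothetical noise-free data, so the scores are close to perfect by construction; with real salary data they would be lower.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Hypothetical noise-free data: salary = 25000 + 9000 * years of experience.
x = np.linspace(1.0, 10.0, 30).reshape(-1, 1)
y = 25000.0 + 9000.0 * x.ravel()

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=1/3, random_state=0
)

regressor = LinearRegression()
regressor.fit(x_train, y_train)
y_pred = regressor.predict(x_test)

# Average absolute gap between predicted and actual values, and goodness of fit.
print("MAE:", mean_absolute_error(y_test, y_pred))
print("R^2:", r2_score(y_test, y_pred))
```

A small mean absolute error and an R-squared value close to 1 on the test set indicate that the model generalizes well to new observations.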

Step-4: Visualizing the Training set results:

Now in this step, we will visualize the training set result. To do so, we will use the
scatter() function of the pyplot library, which we have already imported in the
pre-processing step. The scatter() function will create a scatter plot of the observations.

In the x-axis, we will plot the Years of Experience of employees and on the y-axis, salary
of employees. In the function, we will pass the real values of training set, which means a
year of experience x_train, training set of Salaries y_train, and color of the observations.
Here we are taking a green color for the observation, but it can be any color as per the
choice.

Now, we need to plot the regression line, so for this, we will use the plot() function of
the pyplot library. In this function, we will pass the years of experience for training set,
predicted salary for training set x_pred, and color of the line.

Next, we will give the title for the plot. So here, we will use the title() function of
the pyplot library and pass the name "Salary vs Experience (Training Dataset)".

After that, we will assign labels for x-axis and y-axis using xlabel() and ylabel()
function.

Finally, we will represent all above things in a graph using show(). The code is given
below:

mtp.scatter(x_train, y_train, color="green")
mtp.plot(x_train, x_pred, color="red")
mtp.title("Salary vs Experience (Training Dataset)")
mtp.xlabel("Years of Experience")
mtp.ylabel("Salary(In Rupees)")
mtp.show()

Output:

By executing the above lines of code, we will get a graph plot as the output.
In the plot, we can see the real observations as green dots, and the predicted
values are covered by the red regression line. The regression line shows the relationship
between the dependent and independent variables.

How well the line fits can be judged by calculating the difference between the actual
values and the predicted values. As seen in the plot, most of the
observations are close to the regression line, hence our model is good for the
training set.

Step-5: Visualizing the Test set results:

In the previous step, we have visualized the performance of our model on the training
set. Now, we will do the same for the test set. The code remains the same
as the above code, except that the scatter plot uses x_test and y_test instead of x_train and
y_train. The regression line is still drawn from x_train and x_pred, because it is the same
fitted line.

Here we are also changing the color of the observations to differentiate
between the two plots, but it is optional.

# Visualizing the Test set results
mtp.scatter(x_test, y_test, color="blue")
mtp.plot(x_train, x_pred, color="red")
mtp.title("Salary vs Experience (Test Dataset)")
mtp.xlabel("Years of Experience")
mtp.ylabel("Salary(In Rupees)")
mtp.show()

Output:

By executing the above lines of code, we will get a plot in which the observations are
shown in blue and the prediction is given
by the red regression line. As we can see, most of the observations are close to the
regression line, hence we can say our Simple Linear Regression is a good model and is
able to make good predictions.

Multiple Linear Regression


In the previous topic, we have learned about Simple Linear Regression, where a single
independent/predictor (X) variable is used to model the response variable (Y). But there
may be various cases in which the response variable is affected by more than one
predictor variable; for such cases, the Multiple Linear Regression algorithm is used.

Moreover, Multiple Linear Regression is an extension of Simple Linear Regression, as it
takes more than one predictor variable to predict the response variable. We can define it
as:

Multiple Linear Regression is one of the important regression algorithms which models the
linear relationship between a single dependent continuous variable and more than one
independent variable.

Example:

Prediction of CO2 emission based on engine size and number of cylinders in a car.


Some key points about MLR:

o For MLR, the dependent or target variable (Y) must be continuous/real, but the
predictor or independent variables may be continuous or categorical.
o Each feature variable must have a linear relationship with the dependent
variable.
o MLR tries to fit a regression plane (a hyperplane when there are more than two
predictors) through a multidimensional space of data points.

MLR equation:
In Multiple Linear Regression, the target variable (Y) is a linear combination of multiple
predictor variables x1, x2, x3, ..., xn. Since it is an extension of Simple Linear Regression,
the same form applies, and the equation becomes:

$$Y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_n x_n \qquad \text{(a)}$$

Where,

Y = Output/Response variable

b0, b1, b2, b3, ..., bn = Coefficients of the model (b0 is the intercept).

x1, x2, x3, ..., xn = The independent/feature variables
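
The equation above can be evaluated directly for one observation. The following is a minimal sketch; the function name predict_mlr and the coefficient and feature values are hypothetical and chosen only for illustration, not taken from the dataset used later.

```python
# Minimal sketch of the MLR equation Y = b0 + b1*x1 + ... + bn*xn.
# All numbers below are hypothetical illustration values.

def predict_mlr(intercept, coefficients, features):
    """Return b0 + b1*x1 + ... + bn*xn for a single observation."""
    if len(coefficients) != len(features):
        raise ValueError("need exactly one coefficient per feature")
    return intercept + sum(b * x for b, x in zip(coefficients, features))

b0 = 2.0                 # intercept
b = [0.5, -1.0, 3.0]     # b1, b2, b3
x = [4.0, 2.0, 1.0]      # x1, x2, x3
y_hat = predict_mlr(b0, b, x)
print(y_hat)
```

A fitted model does the same arithmetic, with the coefficients estimated from the training data instead of chosen by hand.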

Assumptions for Multiple Linear Regression:

o A linear relationship should exist between the target and predictor variables.
o The regression residuals must be normally distributed.
o MLR assumes little or no multicollinearity (correlation between the independent
variables) in the data.

Implementation of Multiple Linear Regression model using Python:

To implement MLR using Python, we have the below problem:

Problem Description:

We have a dataset of 50 start-up companies. This dataset contains five variables:
R&D Spend, Administration Spend, Marketing Spend, State, and Profit for a financial
year. Our goal is to create a model that can easily determine which company has the
maximum profit and which factor affects the profit of a company the most.

Since we need to find the Profit, it is the dependent variable, and the other four
variables are independent variables. Below are the main steps of building the MLR
model:

1. Data Pre-processing Steps
2. Fitting the MLR model to the training set
3. Predicting the result of the test set

Step-1: Data Pre-processing Step:

The very first step is data pre-processing, which we have already discussed in this
tutorial. This process contains the below steps:

o Importing libraries: Firstly we will import the libraries which will help in building
the model. Below is the code for it:

# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

o Importing dataset: Now we will import the dataset (50_CompList), which
contains all the variables. Below is the code for it:

#importing datasets
data_set= pd.read_csv('50_CompList.csv')

Output: We will get the dataset as:


In above output, we can clearly see that there are five variables, in which four variables
are continuous and one is categorical variable.

o Extracting dependent and independent Variables:

#Extracting Independent and dependent Variable
x= data_set.iloc[:, :-1].values
y= data_set.iloc[:, 4].values

Output:

Out[5]:

array([[165349.2, 136897.8, 471784.1, 'New York'],
[162597.7, 151377.59, 443898.53, 'California'],
[153441.51, 101145.55, 407934.54, 'Florida'],
[144372.41, 118671.85, 383199.62, 'New York'],
[142107.34, 91391.77, 366168.42, 'Florida'],
[131876.9, 99814.71, 362861.36, 'New York'],
[134615.46, 147198.87, 127716.82, 'California'],
[130298.13, 145530.06, 323876.68, 'Florida'],
[120542.52, 148718.95, 311613.29, 'New York'],
[123334.88, 108679.17, 304981.62, 'California'],
[101913.08, 110594.11, 229160.95, 'Florida'],
[100671.96, 91790.61, 249744.55, 'California'],
[93863.75, 127320.38, 249839.44, 'Florida'],
[91992.39, 135495.07, 252664.93, 'California'],
[119943.24, 156547.42, 256512.92, 'Florida'],
[114523.61, 122616.84, 261776.23, 'New York'],
[78013.11, 121597.55, 264346.06, 'California'],
[94657.16, 145077.58, 282574.31, 'New York'],
[91749.16, 114175.79, 294919.57, 'Florida'],
[86419.7, 153514.11, 0.0, 'New York'],
[76253.86, 113867.3, 298664.47, 'California'],
[78389.47, 153773.43, 299737.29, 'New York'],
[73994.56, 122782.75, 303319.26, 'Florida'],
[67532.53, 105751.03, 304768.73, 'Florida'],
[77044.01, 99281.34, 140574.81, 'New York'],
[64664.71, 139553.16, 137962.62, 'California'],
[75328.87, 144135.98, 134050.07, 'Florida'],
[72107.6, 127864.55, 353183.81, 'New York'],
[66051.52, 182645.56, 118148.2, 'Florida'],
[65605.48, 153032.06, 107138.38, 'New York'],
[61994.48, 115641.28, 91131.24, 'Florida'],
[61136.38, 152701.92, 88218.23, 'New York'],
[63408.86, 129219.61, 46085.25, 'California'],
[55493.95, 103057.49, 214634.81, 'Florida'],
[46426.07, 157693.92, 210797.67, 'California'],
[46014.02, 85047.44, 205517.64, 'New York'],
[28663.76, 127056.21, 201126.82, 'Florida'],
[44069.95, 51283.14, 197029.42, 'California'],
[20229.59, 65947.93, 185265.1, 'New York'],
[38558.51, 82982.09, 174999.3, 'California'],
[28754.33, 118546.05, 172795.67, 'California'],
[27892.92, 84710.77, 164470.71, 'Florida'],
[23640.93, 96189.63, 148001.11, 'California'],
[15505.73, 127382.3, 35534.17, 'New York'],
[22177.74, 154806.14, 28334.72, 'California'],
[1000.23, 124153.04, 1903.93, 'New York'],
[1315.46, 115816.21, 297114.46, 'Florida'],
[0.0, 135426.92, 0.0, 'California'],
[542.05, 51743.15, 0.0, 'New York'],
[0.0, 116983.8, 45173.06, 'California']], dtype=object)

As we can see in the above output, the last column contains categorical variables which
are not suitable to apply directly for fitting the model. So we need to encode this
variable.

Encoding Dummy Variables:


As we have one categorical variable (State), which cannot be directly applied to the
model, we will encode it. Encoding the categories as integers with the LabelEncoder
class alone is not sufficient, because the integers imply a relational order between the
states, which may create a wrong model. To remove this problem, we will
use OneHotEncoder, which creates the dummy variables. Recent scikit-learn versions
select the column to encode with ColumnTransformer (the older categorical_features
argument of OneHotEncoder was removed in version 0.22). Below is code for it:

#Categorical data
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# One-hot encode column 3 (State); the other columns pass through unchanged.
# The dummy columns come first in the result, followed by the remaining columns.
ct= ColumnTransformer([('state', OneHotEncoder(), [3])], remainder='passthrough', sparse_threshold=0)
x= nm.array(ct.fit_transform(x), dtype=float)

Here we are only encoding one independent variable, State, as the other variables
are continuous.

Output:
As we can see in the above output, the state column has been converted into dummy
variables (0 and 1). Here each dummy variable column is corresponding to the one
State. We can check by comparing it with the original dataset. The first column
corresponds to the California State, the second column corresponds to the Florida
State, and the third column corresponds to the New York State.
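
To make the column order concrete, here is a small pure-Python sketch of one-hot encoding. The helper one_hot is hypothetical and only imitates the idea; it is not the scikit-learn implementation. The four states are those of the first four rows printed above.

```python
# Hypothetical helper that imitates one-hot encoding with alphabetically
# sorted categories, so each category gets its own 0/1 column.

def one_hot(values):
    """Return (categories, rows) with one 0/1 column per category."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows

states = ["New York", "California", "Florida", "New York"]
categories, dummies = one_hot(states)
print(categories)
for state, row in zip(states, dummies):
    print(state, row)
```

Because the categories are sorted, the first dummy column stands for California, the second for Florida, and the third for New York, matching the description above.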

Note: We should not use all the dummy variables at the same time; the number used must be
one less than the total number of dummy variables, otherwise we fall into the dummy variable trap.

o Now, we are writing a single line of code just to avoid the dummy variable trap:

#avoiding the dummy variable trap:
x = x[:, 1:]

If we do not remove the first dummy variable, then it introduces multicollinearity in
the model, because the dummy columns of every row always sum to 1.

As we can see in the above output image, the first column has been removed.
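
The reason one dummy column is redundant can be checked by hand. The sketch below uses a hypothetical four-row dummy matrix (columns in the order California, Florida, New York) and shows that the first column is fully determined by the other two, which is exactly the linear dependence behind the dummy variable trap.

```python
# Hypothetical dummy matrix: columns are California, Florida, New York.
dummies = [
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
]

# Every row sums to 1, so the three columns together duplicate the
# constant (intercept) column of the regression.
row_sums = [sum(row) for row in dummies]

# The first column can be recovered from the other two: c0 = 1 - c1 - c2.
first_column = [row[0] for row in dummies]
recovered = [1 - row[1] - row[2] for row in dummies]

# Dropping the first column (the equivalent of x = x[:, 1:]) loses no information.
reduced = [row[1:] for row in dummies]

print(row_sums, first_column == recovered, reduced)
```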

o Now we will split the dataset into training and test set. The code for this is given
below:

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)

Output: The above code will split the dataset into a training set and a test set. With
test_size= 0.2, 10 of the 50 rows go to the test set and 40 to the training set. You can
check the output by clicking on the variable explorer option given in Spyder IDE. The
test set and training set will look like the below image:

Test set:

Training set:
Note: In MLR, we will not do feature scaling. Ordinary least squares gives the same predictions
whether or not the features are rescaled (only the coefficients change), so we don't need to do it manually.
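
One way to see why unscaled features are acceptable for least-squares regression is to fit the same data twice, once with a rescaled feature. The sketch below is an illustration under simplifying assumptions: it uses a single feature with the closed-form least-squares solution, a hypothetical helper fit_line, and made-up data. The same invariance holds with several features.

```python
# Hypothetical helper: closed-form least-squares fit of y = b0 + b1*x.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

xs = [1.0, 2.0, 3.0, 4.0, 5.0]          # made-up feature values
ys = [2.1, 3.9, 6.2, 7.8, 10.1]         # made-up targets
xs_scaled = [x * 1000 for x in xs]      # same feature in different units

b0, b1 = fit_line(xs, ys)
c0, c1 = fit_line(xs_scaled, ys)

pred = [b0 + b1 * x for x in xs]
pred_scaled = [c0 + c1 * x for x in xs_scaled]

# The slope shrinks by the scaling factor, but the predictions agree.
print(b1, c1 * 1000)
print(max(abs(p - q) for p, q in zip(pred, pred_scaled)))
```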

Step: 2- Fitting our MLR model to the Training set:


Now that our dataset is prepared, we will fit our regression model to the training set.
It is similar to what we did in the Simple Linear Regression model. The code for this
will be:

#Fitting the MLR model to the training set:
from sklearn.linear_model import LinearRegression
regressor= LinearRegression()
regressor.fit(x_train, y_train)

Output:
Out[9]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
normalize=False)

Now, we have successfully trained our model using the training dataset. In the next step,
we will test the performance of the model using the test dataset.

Step: 3- Prediction of Test set results:


The last step for our model is checking the performance of the model. We will do it by
predicting the test set result. For prediction, we will create a y_pred vector. Below is the
code for it:

#Predicting the Test set result
y_pred= regressor.predict(x_test)

By executing the above lines of code, a new vector will be generated under the variable
explorer option. We can test our model by comparing the predicted values and test set
values.

Output:
In the above output, we have the predicted result set and the test set. We can check model
performance by comparing these two values index by index. For example, the first index
has a predicted profit of $103015 and a test/real profit of $103282. The difference is
only $267, which is a good prediction, so our model is complete.

o We can also check the score for training dataset and test dataset. Below is the
code for it:

print('Train Score: ', regressor.score(x_train, y_train))
print('Test Score: ', regressor.score(x_test, y_test))

Output: The score is:


Train Score: 0.9501847627493607
Test Score: 0.9347068473282446

The score returned here is the coefficient of determination (R²), not a classification
accuracy: the model explains about 95% of the variance in profit on the training dataset
and about 93% on the test dataset.
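
For a regressor, the score method returns the coefficient of determination R², defined as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand; the function name r2_manual and the numbers are hypothetical and are not the start-up data.

```python
# Hand-computed R^2 = 1 - SS_res / SS_tot on made-up values.
def r2_manual(y_true, y_pred):
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total variation
    return 1 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.5]
r2 = r2_manual(y_true, y_pred)
print(r2)
```

An R² of 1 means the predictions match the targets exactly, while a value near 0 means the model does no better than always predicting the mean.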

Note: In the next topic, we will see how we can improve the performance of the model using
the Backward Elimination process.

Applications of Multiple Linear Regression:


There are mainly two applications of Multiple Linear Regression:

o Effectiveness of Independent variable on prediction:


o Predicting the impact of changes:

ML Polynomial Regression
o Polynomial Regression is a regression algorithm that models the relationship
between a dependent(y) and independent variable(x) as nth degree polynomial.
The Polynomial Regression equation is given below:

$$y = b_0 + b_1 x_1 + b_2 x_1^2 + b_3 x_1^3 + \dots + b_n x_1^n$$

o It is also called a special case of Multiple Linear Regression in ML, because we
add polynomial terms to the Multiple Linear Regression equation to convert
it into Polynomial Regression.
o It is a linear model with some modification in order to increase the accuracy.
o The dataset used in Polynomial regression for training is of non-linear nature.
o It makes use of a linear regression model to fit the complicated and non-linear
functions and datasets.
o Hence, "In Polynomial regression, the original features are converted into
Polynomial features of required degree (2,3,..,n) and then modeled using a
linear model."
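
The conversion of an original feature into polynomial features can be sketched in a few lines. The helper names polynomial_features and predict_poly and the coefficients below are hypothetical; the sketch only illustrates that, once the powers of x are computed, the prediction is an ordinary linear combination.

```python
# Hypothetical helpers illustrating polynomial feature expansion.
def polynomial_features(x, degree):
    """Return [1, x, x^2, ..., x^degree] for a single feature value."""
    return [x ** k for k in range(degree + 1)]

def predict_poly(coefficients, x):
    """Linear combination b0*1 + b1*x + b2*x^2 + ... of the expanded features."""
    features = polynomial_features(x, len(coefficients) - 1)
    return sum(b * f for b, f in zip(coefficients, features))

print(polynomial_features(2.0, 3))
print(predict_poly([1.0, 0.0, 2.0], 3.0))   # 1 + 0*3 + 2*3^2
```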

Need for Polynomial Regression:


The need of Polynomial Regression in ML can be understood in the below points:

o If we apply a linear model on a linear dataset, then it gives a good result,
as we have seen in Simple Linear Regression. But if we apply the same model
without any modification on a non-linear dataset, it will fit poorly: the loss
function will increase, the error rate will be high, and accuracy will decrease.
o So for such cases, where data points are arranged in a non-linear fashion, we
need the Polynomial Regression model. We can understand it in a better way
using the below comparison diagram of the linear dataset and non-linear dataset.

o In the above image, we have taken a dataset which is arranged non-linearly. If
we try to cover it with a linear model, we can clearly see that it hardly covers
any data point. On the other hand, a curve, which is what the Polynomial model
provides, is suitable to cover most of the data points.
o Hence, if the datasets are arranged in a non-linear fashion, then we should use the
Polynomial Regression model instead of Simple Linear Regression.

Note: A Polynomial Regression algorithm is also called Polynomial Linear Regression
because the model is linear in its coefficients, even though it is non-linear in the variable x.

Equation of the Polynomial Regression Model:


Simple Linear Regression equation:         $y = b_0 + b_1 x$         .........(a)

Multiple Linear Regression equation:         $y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_n x_n$         .........(b)

Polynomial Regression equation:         $y = b_0 + b_1 x + b_2 x^2 + b_3 x^3 + \dots + b_n x^n$         .........(c)


When we compare the above three equations, we can see that all three are polynomial
equations that differ in the degree of the variables. The Simple and Multiple Linear
equations are polynomial equations of degree 1, and the Polynomial Regression equation
is a linear equation (linear in its coefficients) of the nth degree. So if we add
higher-degree terms to our linear equation, it is converted into a Polynomial Linear equation.

Note: To better understand Polynomial Regression, you must have knowledge of Simple
Linear Regression.

Implementation of Polynomial Regression using Python:

Here we will implement the Polynomial Regression using Python. We will understand it
by comparing Polynomial Regression model with the Simple Linear Regression model.
So first, let's understand the problem for which we are going to build the model.

Problem Description: There is a Human Resource company which is going to hire a
new candidate. The candidate has said that his previous salary was 160K per annum, and
HR has to check whether he is telling the truth or bluffing. To check this, they only have a
dataset from his previous company in which the salaries of the top 10 positions are
listed with their levels. By checking the available dataset, we have found that there
is a non-linear relationship between the Position levels and the salaries. Our goal is
to build a bluffing detector regression model, so HR can hire an honest candidate.
Below are the steps to build such a model.

Steps for Polynomial Regression:


The main steps involved in Polynomial Regression are given below:

o Data Pre-processing
o Build a Linear Regression model and fit it to the dataset
o Build a Polynomial Regression model and fit it to the dataset
o Visualize the result for Linear Regression and Polynomial Regression model.
o Predicting the output.

Note: Here, we will build the Linear regression model as well as Polynomial Regression to
see the results between the predictions. And Linear regression model is for reference.

Data Pre-processing Step:

The data pre-processing step will remain the same as in previous regression models,
except for some changes. In the Polynomial Regression model, we will not use feature
scaling, and we will not split our dataset into a training and test set. There are two
reasons:

o The dataset contains very little data (only 10 rows), so it is not suitable to divide
it into a test and training set; otherwise our model will not be able to find the
correlations between the salaries and levels.
o In this model, we want very accurate predictions for salary, so the model should
have enough information.

The code for pre-processing step is given below:

# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('Position_Salaries.csv')

#Extracting Independent and dependent Variable
x= data_set.iloc[:, 1:2].values
y= data_set.iloc[:, 2].values

Explanation:

o In the above lines of code, we have imported the important Python libraries to
import dataset and operate on it.
o Next, we have imported the dataset 'Position_Salaries.csv', which contains three
columns (Position, Levels, and Salary), but we will consider only two columns
(Salary and Levels).
o After that, we have extracted the dependent (Y) and independent (X) variables from
the dataset. For the x-variable, we have used [:, 1:2] because we want the column at
index 1 (Levels); writing it as the slice 1:2 rather than the single index 1 keeps x
as a two-dimensional matrix, which the model's fit method expects.
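
The difference between the slice 1:2 and the plain index 1 shows up in the shape of the result. The sketch below uses a small hypothetical NumPy array in place of the real dataset; positional slicing with iloc followed by .values behaves the same way.

```python
import numpy as nm

# Hypothetical stand-in table: columns are (row id, level, salary).
table = nm.array([
    [0, 1, 45000],
    [1, 2, 50000],
    [2, 3, 60000],
])

x_matrix = table[:, 1:2]   # slice keeps two dimensions: shape (3, 1)
x_vector = table[:, 1]     # single index drops a dimension: shape (3,)

print(x_matrix.shape, x_vector.shape)
```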

Output:
By executing the above code, we can read our dataset as:

As we can see in the above output, there are three columns present (Positions, Levels,
and Salaries). But we are only considering two columns because Positions are equivalent
to the levels or may be seen as the encoded form of Positions.

Here we will predict the output for level 6.5 because the candidate has 4+ years'
experience as a regional manager, so he must be somewhere between levels 7 and 6.

Building the Linear regression model:

Now, we will build and fit the Linear regression model to the dataset. In building
polynomial regression, we will take the Linear regression model as reference and
compare both the results. The code is given below:

#Fitting the Linear Regression to the dataset
from sklearn.linear_model import LinearRegression
lin_regs= LinearRegression()
lin_regs.fit(x,y)

In the above code, we have created the Simple Linear model using the lin_regs object
of the LinearRegression class and fitted it to the dataset variables (x and y).

Output:

Out[5]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Building the Polynomial regression model:

Now we will build the Polynomial Regression model, which is a little different from
the Simple Linear model because here we will use the PolynomialFeatures class
of the sklearn.preprocessing module. We are using this class to add extra polynomial
features to our dataset.

#Fitting the Polynomial regression to the dataset
from sklearn.preprocessing import PolynomialFeatures
poly_regs= PolynomialFeatures(degree= 2)
x_poly= poly_regs.fit_transform(x)
lin_reg_2 =LinearRegression()
lin_reg_2.fit(x_poly, y)

In the above lines of code, we have used poly_regs.fit_transform(x) because we first
convert our feature matrix into a polynomial feature matrix and then fit the linear
model to it. The parameter value (degree= 2) is our choice; we can set it according to
the degree of polynomial features we want.

After executing the code, we will get another matrix x_poly, which can be seen under
the variable explorer option:

Next, we have used another LinearRegression object, namely lin_reg_2, to fit
our x_poly matrix to the linear model.

Output:

Out[11]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Visualizing the result for Linear regression:

Now we will visualize the result for Linear regression model as we did in Simple Linear
Regression. Below is the code for it:

#Visualizing the result for Linear Regression model
mtp.scatter(x,y,color="blue")
mtp.plot(x,lin_regs.predict(x), color="red")
mtp.title("Bluff detection model(Linear Regression)")
mtp.xlabel("Position Levels")
mtp.ylabel("Salary")
mtp.show()

Output:

In the above output image, we can clearly see that the regression line is far from the
data points. Predictions are on the red straight line, and the blue points are actual values.
If we use this output to predict the salary of the CEO, it will give approx. $600,000,
which is far away from the real value.

So we need a curved model to fit the dataset rather than a straight line.

Visualizing the result for Polynomial Regression

Here we will visualize the result of Polynomial regression model, code for which is little
different from the above model.

Code for this is given below:

#Visualizing the result for Polynomial Regression
mtp.scatter(x,y,color="blue")
mtp.plot(x, lin_reg_2.predict(poly_regs.transform(x)), color="red")
mtp.title("Bluff detection model(Polynomial Regression)")
mtp.xlabel("Position Levels")
mtp.ylabel("Salary")
mtp.show()

In the above code, we have passed poly_regs.transform(x) to lin_reg_2.predict() instead of
x, because the linear regressor was trained on the polynomial feature matrix and so must
predict from polynomial features as well. Since poly_regs has already been fitted,
transform() is enough here; for the same x it gives the same matrix as x_poly.

Output:

As we can see in the above output image, the predictions are close to the real values.
The above plot will vary as we will change the degree.

For degree= 3:

If we change the degree to 3, then we will get a more accurate plot, as shown in the
below image.

So, as we can see in the above output image, the predicted salary for level 6.5 is
near $170K-$190K, which suggests that the future employee is telling the truth about his
salary.

Degree= 4: Let's again change the degree to 4, and now we will get the most accurate plot of the three.
Hence we can fit the given data points more closely by increasing the degree of the
polynomial, although a very high degree can overfit a small dataset like this one.
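
The effect of the degree on the fit can be checked numerically. The sketch below is an illustration on hypothetical data (an exponential-shaped curve over ten levels, not the Position_Salaries file) and uses NumPy's polyfit and polyval rather than the scikit-learn objects above. It reports the error on the training points only, which is why a lower number here does not by itself mean better predictions on new data.

```python
import numpy as nm

# Hypothetical non-linear data: ten levels and a curve-shaped salary.
levels = nm.arange(1, 11, dtype=float)
salaries = 40000.0 * nm.exp(0.3 * levels)

def training_mse(degree):
    """Mean squared error on the training points for a given polynomial degree."""
    coefficients = nm.polyfit(levels, salaries, degree)
    fitted = nm.polyval(coefficients, levels)
    return float(nm.mean((fitted - salaries) ** 2))

errors = {degree: training_mse(degree) for degree in (1, 2, 3, 4)}
for degree, mse in errors.items():
    print(degree, mse)
```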

Predicting the final result with the Linear Regression model:

Now, we will predict the final output using the Linear regression model to see whether
the employee is telling the truth or bluffing. For this, we will use the predict() method and
pass the value 6.5. Below is the code for it:

lin_pred = lin_regs.predict([[6.5]])
print(lin_pred)

Output:

[330378.78787879]

Predicting the final result with the Polynomial Regression model:

Now, we will predict the final output using the Polynomial Regression model to
compare with Linear model. Below is the code for it:

# poly_regs is already fitted, so transform() is sufficient for the new value
poly_pred = lin_reg_2.predict(poly_regs.transform([[6.5]]))
print(poly_pred)

Output:

[158862.45265153]

As we can see, the predicted output for the Polynomial Regression is
[158862.45265153], which is much closer to the 160K salary the candidate claimed; hence
we can say that the future employee is telling the truth.

Logistic Regression in Machine Learning


o Logistic regression is one of the most popular Machine Learning algorithms,
which comes under the Supervised Learning technique. It is used for predicting
a categorical dependent variable using a given set of independent variables.
o Since logistic regression predicts a categorical dependent variable, the outcome
must be a categorical or discrete value. It can be either Yes or No, 0 or 1, True
or False, etc., but instead of giving the exact values 0 and 1, it gives
probabilistic values which lie between 0 and 1.
o Logistic Regression is very similar to Linear Regression except in how they
are used. Linear Regression is used for solving regression problems,
whereas Logistic Regression is used for solving classification problems.
o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped
logistic function, whose output is bounded by two limiting values (0 and 1).
o The curve from the logistic function indicates the likelihood of something such as
whether the cells are cancerous or not, a mouse is obese or not based on its
weight, etc.
o Logistic Regression is a significant machine learning algorithm because it has the
ability to provide probabilities and classify new data using continuous and
discrete datasets.
o Logistic Regression can be used to classify the observations using different types
of data and can easily determine the most effective variables used for the
classification. The below image is showing the logistic function:

Note: Logistic regression uses the concept of predictive modeling in the same way as
regression, which is why it is called logistic regression; but it is used to classify samples,
so it falls under the classification algorithms.

Logistic Function (Sigmoid Function):

o The sigmoid function is a mathematical function used to map the predicted
values to probabilities.
o It maps any real value into another value within the range of 0 and 1.
o The output of logistic regression must lie between 0 and 1 and cannot go
beyond this limit, so it forms a curve like the "S" form. The S-form curve is called
the Sigmoid function or the logistic function.
o In logistic regression, we use the concept of a threshold value to turn the
probability into a class of either 0 or 1: values above the threshold are assigned
to class 1, and values below the threshold are assigned to class 0.
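
The sigmoid mapping and the threshold rule can be sketched as follows. The function names sigmoid and classify and the default threshold of 0.5 are assumptions made for this illustration; the two-branch form is used only to avoid overflow for large negative inputs.

```python
import math

def sigmoid(z):
    """Map any real value z into the interval between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form e^z / (1 + e^z) to avoid overflow.
    e = math.exp(z)
    return e / (1.0 + e)

def classify(z, threshold=0.5):
    """Assign class 1 when the probability reaches the threshold, else class 0."""
    return 1 if sigmoid(z) >= threshold else 0

for z in (-4.0, 0.0, 4.0):
    print(z, sigmoid(z), classify(z))
```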

Assumptions for Logistic Regression:

o The dependent variable must be categorical in nature.
o The independent variables should not have multi-collinearity.

Logistic Regression Equation:


The Logistic regression equation can be obtained from the Linear Regression equation.
The mathematical steps to get Logistic Regression equations are given below:

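o We know the equation of the straight line can be written as:

$$y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$$

o In Logistic Regression y can be between 0 and 1 only, so let's divide y by (1-y);
the ratio is 0 for y = 0 and grows towards infinity as y approaches 1:

$$\frac{y}{1-y}$$

o But we need a range between -[infinity] and +[infinity], so we take the logarithm of
this ratio, and the equation becomes:

$$\log\left[\frac{y}{1-y}\right] = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$$

The above equation is the final equation for Logistic Regression.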
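
Assuming the final equation referred to above is the usual log-odds form, solving it for y recovers the sigmoid function of the previous section. In the short derivation below, z is shorthand introduced here for the linear part b0 + b1x1 + ... + bnxn.

```latex
\begin{aligned}
\log\left[\frac{y}{1-y}\right] &= z, \qquad z = b_0 + b_1 x_1 + \dots + b_n x_n \\
\frac{y}{1-y} &= e^{z} \\
y &= e^{z}\,(1-y) \\
y\,(1+e^{z}) &= e^{z} \\
y &= \frac{e^{z}}{1+e^{z}} = \frac{1}{1+e^{-z}}
\end{aligned}
```

So the linear model predicts the log-odds, and the sigmoid converts that value back into a probability between 0 and 1.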

Type of Logistic Regression:


On the basis of the categories, Logistic Regression can be classified into three types:
o Binomial: In binomial Logistic regression, there can be only two possible types of
the dependent variables, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as "cat", "dogs", or "sheep"
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered
types of dependent variables, such as "low", "Medium", or "High".

Python Implementation of Logistic Regression (Binomial)


To understand the implementation of Logistic Regression in Python, we will use the
below example:


Example: There is a dataset which contains information about various users
obtained from social networking sites. A car-making company has recently launched
a new SUV, and the company wants to check how many users from the dataset
want to purchase the car.

For this problem, we will build a Machine Learning model using the Logistic regression
algorithm. The dataset is shown in the below image. In this problem, we will predict
the purchased variable (Dependent Variable) by using age and salary (Independent
variables).
Steps in Logistic Regression: To implement the Logistic Regression using Python, we
will use the same steps as we have done in previous topics of Regression. Below are the
steps:

o Data Pre-processing step
o Fitting Logistic Regression to the Training set
o Predicting the test result
o Test accuracy of the result (Creation of Confusion matrix)
o Visualizing the test set result.

1. Data Pre-processing step: In this step, we will pre-process/prepare the data so that
we can use it in our code efficiently. It will be the same as we have done in Data pre-
processing topic. The code for this is given below:

#Data Pre-processing Step
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('user_data.csv')

By executing the above lines of code, we will get the dataset as the output. Consider the
given image:

Now, we will extract the dependent and independent variables from the given dataset.
Below is the code for it:
#Extracting Independent and dependent Variable
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values

In the above code, we have taken [2, 3] for x because our independent variables are age
and salary, which are at index 2, 3. And we have taken 4 for y variable because our
dependent variable is at index 4. The output will be:

Now we will split the dataset into a training set and test set. Below is the code for it:

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)

The output for this is given below:

For test set:

For training set:


In logistic regression, we will do feature scaling because age and salary are measured on
very different scales, and putting them on a comparable scale helps the model. Here we will
only scale the independent variables because the dependent variable has only 0 and 1 values.
Below is the code for it:

#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
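
What the scaler does in the lines above can be sketched by hand: the mean and standard deviation are learned from the training column only and then reused for the test column. The helper names fit_scaler and apply_scaler and the age values are hypothetical; the population standard deviation is used, which is also what StandardScaler uses.

```python
import statistics

def fit_scaler(column):
    """Learn the mean and population standard deviation from training data."""
    return statistics.mean(column), statistics.pstdev(column)

def apply_scaler(column, mean, std):
    """Standardize values with previously learned statistics."""
    return [(value - mean) / std for value in column]

train_age = [20.0, 30.0, 40.0, 50.0]   # made-up training values
test_age = [35.0, 60.0]                # made-up test values

mean, std = fit_scaler(train_age)                     # fit on training data only
train_scaled = apply_scaler(train_age, mean, std)
test_scaled = apply_scaler(test_age, mean, std)       # reuse training statistics

print(train_scaled)
print(test_scaled)
```

Reusing the training statistics for the test set keeps information about the test data out of the fitting step.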

The scaled output is given below:


2. Fitting Logistic Regression to the Training set:

We have prepared our dataset well, and now we will train the model using the training
set. To fit the model to the training set, we will import
the LogisticRegression class of the sklearn library.

After importing the class, we will create a classifier object and use it to fit the
logistic regression model to the training set. Below is the code for it:

#Fitting Logistic Regression to the training set
from sklearn.linear_model import LogisticRegression
classifier= LogisticRegression(random_state=0)
classifier.fit(x_train, y_train)
Output: By executing the above code, we will get the below output:

Out[5]:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=0, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

Hence our model is well fitted to the training set.

3. Predicting the Test Result

Our model is well trained on the training set, so we will now predict the result by using
test set data. Below is the code for it:

#Predicting the test set result
y_pred= classifier.predict(x_test)

In the above code, we have created a y_pred vector to predict the test set result.

Output: By executing the above code, a new vector (y_pred) will be created under the
variable explorer option. It can be seen as:
The above output image shows the corresponding predicted users who want to
purchase or not purchase the car.

4. Test Accuracy of the result

Now we will create the confusion matrix here to check the accuracy of the classification.
To create it, we need to import the confusion_matrix function of the sklearn library.
After importing the function, we will call it and store the result in a new variable cm.
The function takes two main parameters, y_true (the actual values) and y_pred (the
predicted values returned by the classifier). Below is the code for it:

#Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm= confusion_matrix(y_test, y_pred)
Output:

By executing the above code, a new confusion matrix will be created. Consider the
below image:

We can find the accuracy of the predicted result by interpreting the confusion matrix. From
the above output, we can see that 65+24= 89 predictions are correct and 8+3= 11 are
incorrect, i.e. an accuracy of 89% on the 100 test observations.
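
The accuracy can be read off the confusion matrix with a few lines of arithmetic. In the sketch below the four counts are the ones quoted above; which of the two off-diagonal cells holds the 3 and which the 8 is an assumption, since the output image is not reproduced here, and it does not affect the accuracy.

```python
# Counts quoted in the text; the off-diagonal placement is an assumption.
cm = [
    [65, 3],    # actual class 0: predicted 0, predicted 1
    [8, 24],    # actual class 1: predicted 0, predicted 1
]

correct = cm[0][0] + cm[1][1]              # diagonal = correct predictions
total = sum(sum(row) for row in cm)        # all test observations
accuracy = correct / total

print(correct, total - correct, accuracy)
```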

5. Visualizing the training set result

Finally, we will visualize the training set result. To visualize the result, we will
use ListedColormap class of matplotlib library. Below is the code for it:

#Visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
# Grid covering both (scaled) features, from 1 below the minimum to 1 above the maximum
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
# Color every grid point by the class the classifier predicts for it
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple','green' )))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
# Plot the actual training observations, one color per class
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
        color = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Logistic Regression (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

In the above code, we have imported the ListedColormap class of the Matplotlib library to
create the colormap for visualizing the result. We have created two new
variables, x_set and y_set, to stand in for x_train and y_train. After that, we have used
the nm.meshgrid command to create a rectangular grid that extends from 1 below the
minimum to 1 above the maximum of each feature. The grid points are spaced at a
resolution of 0.01.

To create a filled contour, we have used the mtp.contourf command, which creates regions
of the provided colors (purple and green). In this function, we have passed
classifier.predict, evaluated on every grid point, so that each region shows the class
predicted by the classifier at that location.
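The grid idea can be sketched without any plotting library: build a coarse grid, predict a label for every grid point, and print the labels as a character map. Here toy_predict is a hypothetical stand-in for classifier.predict, and the grid range and step are made-up values chosen to keep the output small.

```python
def toy_predict(age, salary):
    """Hypothetical stand-in for classifier.predict: a fixed linear rule."""
    return 1 if age + salary > 0 else 0

def frange(start, stop, step):
    """Evenly spaced values from start up to (but excluding) stop."""
    n = int(round((stop - start) / step))
    return [start + i * step for i in range(n)]

xs = frange(-2.0, 2.0, 0.5)   # grid along the first (scaled) feature
ys = frange(-2.0, 2.0, 0.5)   # grid along the second (scaled) feature

# One predicted label per grid point; rows follow ys, columns follow xs.
region = [[toy_predict(x, y) for x in xs] for y in ys]

for row in reversed(region):  # print with the largest y at the top
    print("".join("G" if label == 1 else "P" for label in row))
```

The printed block of G and P characters plays the role of the two colored regions; a finer step gives a smoother picture at the cost of more predictions.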

Output: By executing the above code, we will get the below output:

The graph can be explained in the below points:


o In the above graph, we can see that most Green points lie within the
green region and most Purple points within the purple region.
o All these data points are the observation points from the training set, and
their colors show the actual value of the Purchased variable.
o This graph is made by using two independent variables, i.e., Age on the x-axis
and Estimated Salary on the y-axis.
o The purple point observations are those for which Purchased (dependent variable) is
0, i.e., users who did not purchase the SUV car.
o The green point observations are those for which Purchased (dependent variable) is
1, i.e., users who purchased the SUV car.
o We can also estimate from the graph that younger users with a low salary
mostly did not purchase the car, whereas older users with a high estimated salary
mostly purchased the car.
o But there are some purple points in the green region (Buying the car) and some
green points in the purple region (Not buying the car). So we can say that some
users do not follow the general trend: for example, some younger users with a high
estimated salary purchased the car, whereas some older users with a low estimated
salary did not purchase the car.

The goal of the classifier:

We have successfully visualized the training set result for the logistic regression. Our
goal for this classification is to separate the users who purchased the SUV car from
those who did not. From the output graph, we can clearly see the two regions
(Purple and Green) with the observation points. The Purple region is for those users who
didn't buy the car, and the Green region is for those users who purchased the car.

Linear Classifier:

As we can see from the graph, the decision boundary of the classifier is a straight line,
i.e., linear in nature, because Logistic Regression is a linear model. In further topics,
we will learn about non-linear classifiers.
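A short sketch of why the boundary is a straight line: logistic regression predicts class 1 when sigmoid(z) >= 0.5, which happens exactly when z = w1*x1 + w2*x2 + b >= 0, and z = 0 is the equation of a line. The weights and bias below are hypothetical values chosen only for illustration, not the ones learned by the model above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights and bias, chosen only for illustration.
w_age, w_salary, bias = 1.5, 1.0, -0.25

def predict(age, salary):
    z = w_age * age + w_salary * salary + bias
    return 1 if sigmoid(z) >= 0.5 else 0

def boundary_salary(age):
    """Salary at which z == 0 for a given age, i.e. a point on the boundary line."""
    return -(w_age * age + bias) / w_salary

for age in (-1.0, 0.0, 1.0):
    s = boundary_salary(age)
    # Slightly above the line the prediction is 1, slightly below it is 0.
    print(age, s, predict(age, s + 0.1), predict(age, s - 0.1))
```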

Visualizing the test set result:

Our model is well trained using the training dataset. Now, we will visualize the result for
new observations (Test set). The code for the test set remains the same as above, except
that here we use x_test and y_test instead of x_train and y_train. Below is the code
for it:

#Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple','green' )))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
        c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Logistic Regression (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

The above graph shows the test set result. As we can see, the graph is divided into two
regions (Purple and Green). Most Green observations are in the green region, and most
Purple observations are in the purple region, so we can say that the model predicts well.
A few green and purple data points fall in the opposite region; these are the 11 incorrect
predictions that we have already counted using the confusion matrix.

Hence our model performs reasonably well and is ready to make new predictions for this
classification problem.

Classification Algorithm in Machine Learning


As we know, the Supervised Machine Learning algorithm can be broadly classified into
Regression and Classification Algorithms. In Regression algorithms, we have predicted
the output for continuous values, but to predict the categorical values, we need
Classification algorithms.

What is the Classification Algorithm?


The Classification algorithm is a Supervised Learning technique that is used to identify
the category of new observations on the basis of training data. In Classification, a
program learns from the given dataset or observations and then classifies new
observations into a number of classes or groups, such as Yes or No, 0 or 1, Spam or
Not Spam, cat or dog, etc. Classes can also be called targets, labels, or categories.

Unlike regression, the output variable of Classification is a category, not a value, such as
"Green or Blue", "fruit or animal", etc. Since the Classification algorithm is a Supervised
learning technique, it takes labeled input data, which means each input comes with
the corresponding output.

In a classification algorithm, a discrete output variable (y) is modeled as a function of the input variable (x):

y = f(x), where y = categorical output

The best example of an ML classification algorithm is Email Spam Detector.

The main goal of the Classification algorithm is to identify the category of a given
dataset, and these algorithms are mainly used to predict the output for the categorical
data.

Classification algorithms can be better understood using the below diagram. In the
below diagram, there are two classes, Class A and Class B. The members of each class have
features that are similar to each other and dissimilar to those of the other class.

The algorithm which implements the classification on a dataset is known as a classifier.
There are two types of classification:

o Binary Classifier: If the classification problem has only two possible outcomes,
then it is called a Binary Classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
o Multi-class Classifier: If a classification problem has more than two outcomes,
then it is called a Multi-class Classifier.
Examples: Classification of types of crops, classification of types of music.

Learners in Classification Problems:


In the classification problems, there are two types of learners:

1. Lazy Learners: A Lazy Learner first stores the training dataset and waits until it
receives the test dataset. In the Lazy Learner case, classification is done on the basis of
the most related data stored in the training dataset. It takes less time in training
but more time for predictions.
Example: K-NN algorithm, Case-based reasoning
2. Eager Learners: Eager Learners develop a classification model based on a training
dataset before receiving a test dataset. In contrast to Lazy Learners, an Eager Learner
takes more time in learning and less time in prediction.
Example: Decision Trees, Naïve Bayes, ANN.

Types of ML Classification Algorithms:


Classification algorithms can be further divided into mainly two categories:

o Linear Models
    o Logistic Regression
    o Support Vector Machines
o Non-linear Models
    o K-Nearest Neighbours
    o Kernel SVM
    o Naïve Bayes
    o Decision Tree Classification
    o Random Forest Classification

Note: We will learn the above algorithms in later chapters.

Evaluating a Classification model:


Once our model is completed, it is necessary to evaluate its performance, whether it is a
Classification or a Regression model. For evaluating a Classification model, we have the
following ways:

1. Log Loss or Cross-Entropy Loss:

o It is used for evaluating the performance of a classifier whose output is a
probability value between 0 and 1.
o For a good binary Classification model, the value of log loss should be near 0.
o The value of log loss increases as the predicted probability deviates from the actual
value.
o A lower log loss indicates a better model.
o For Binary classification, cross-entropy can be calculated as:

$-\left(y \log(p) + (1 - y) \log(1 - p)\right)$

Where y = actual output (0 or 1), p = predicted probability of class 1.
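As a minimal sketch, the formula can be evaluated in plain Python. Averaging the per-sample values over the dataset and clipping p away from 0 and 1 are common conventions assumed here, and the labels and probabilities are made up for illustration.

```python
import math

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all samples."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)   # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

y_true = [1, 0, 1, 0]                    # made-up actual labels
confident_right = [0.9, 0.1, 0.8, 0.2]   # probabilities close to the labels
confident_wrong = [0.1, 0.9, 0.2, 0.8]   # probabilities far from the labels

print(binary_log_loss(y_true, confident_right))
print(binary_log_loss(y_true, confident_wrong))
```

The first value is much smaller than the second, which matches the points above: the loss grows as the predicted probability moves away from the actual label.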

2. Confusion Matrix:

o The confusion matrix gives us a matrix/table as output that describes the
performance of the model.
o It is also known as the error matrix.
o The matrix presents the prediction results in a summarized form, showing the total
number of correct predictions and incorrect predictions. The matrix looks like the
table below:

                      Actual Positive     Actual Negative

Predicted Positive    True Positive       False Positive

Predicted Negative    False Negative      True Negative
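The four cells can be counted directly from the actual and predicted labels. The sketch below uses made-up labels and a hypothetical helper name, confusion_counts.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count the four confusion-matrix cells for a binary problem."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            counts["TP" if actual == positive else "FP"] += 1
        else:
            counts["FN" if actual == positive else "TN"] += 1
    return counts

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # made-up actual labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # made-up predictions

counts = confusion_counts(y_true, y_pred)
print(counts)
```

Note that scikit-learn's confusion_matrix function puts the actual classes on the rows and the predicted classes on the columns, which is the transpose of the layout in the table above.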

3. AUC-ROC curve:

o ROC curve stands for Receiver Operating Characteristics Curve and AUC stands
for Area Under the Curve.
o It is a graph that shows the performance of the classification model at different
thresholds.
o The AUC-ROC curve is used to visualize the performance of a classification model;
for a multi-class model, it is typically drawn once per class (one class versus the rest).
o The ROC curve is plotted with TPR and FPR, where TPR (True Positive Rate) is on the
Y-axis and FPR (False Positive Rate) is on the X-axis.
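A minimal sketch of how the curve is built for a binary classifier: for each threshold, every score at or above it is treated as a positive prediction, and the resulting (FPR, TPR) pair becomes one point of the curve. The labels and scores are made up for illustration, and the trapezoidal rule used for the area is one common choice.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per threshold, from strictest to loosest."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

y_true = [0, 0, 1, 1]             # made-up actual labels
scores = [0.1, 0.4, 0.35, 0.8]    # made-up predicted probabilities

points = roc_points(y_true, scores)
print(points)
print(auc(points))
```

An area of 1.0 would mean the scores separate the two classes perfectly, while 0.5 corresponds to random guessing.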

Use cases of Classification Algorithms


Classification algorithms can be used in many different areas. Below are some popular use
cases of Classification Algorithms:

o Email Spam Detection
o Speech Recognition
o Identification of cancer tumor cells
o Drug Classification
o Biometric Identification, etc.

K-Nearest Neighbor(KNN) Algorithm for Machine Learning

o K-Nearest Neighbour is one of the simplest Machine Learning algorithms based
on the Supervised Learning technique.
o The K-NN algorithm assumes similarity between the new case/data and the available
cases and puts the new case into the category that is most similar to the available
categories.
o The K-NN algorithm stores all the available data and classifies a new data point based
on similarity. This means that when new data appears, it can be easily
classified into a well-suited category by using the K-NN algorithm.
o The K-NN algorithm can be used for Regression as well as for Classification, but
mostly it is used for Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any
assumption about the underlying data distribution.
o It is also called a lazy learner algorithm because it does not learn from the
training set immediately; instead, it stores the dataset and, at the time of
classification, performs an action on the dataset.
o At the training phase, the KNN algorithm just stores the dataset, and when it gets new
data, it classifies that data into the category that is most similar to the new
data.
o Example: Suppose we have an image of a creature that looks similar to both a cat and
a dog, and we want to know whether it is a cat or a dog. For this identification, we
can use the KNN algorithm, as it works on a similarity measure. Our KNN model
will compare the features of the new image with the cat and dog images and,
based on the most similar features, put it in either the cat or the dog category.

Why do we need a K-NN Algorithm?


Suppose there are two categories, i.e., Category A and Category B, and we have a new
data point x1. In which of these categories will this data point lie? To solve this type of
problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the
category or class of a particular data point. Consider the below diagram:

How does K-NN work?

The K-NN working can be explained on the basis of the below algorithm:

o Step-1: Select the number K of the neighbors.
o Step-2: Calculate the Euclidean distance from the new data point to each point in
the training data.
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these K neighbors, count the number of the data points in each
category.
o Step-5: Assign the new data point to the category for which the number of
neighbors is maximum.
o Step-6: Our model is ready.
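The steps above can be sketched in a few lines of plain Python. This is a minimal illustration on made-up two-dimensional points, not the scikit-learn implementation used later; knn_predict is a hypothetical helper name.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, new_point, k=5):
    # Step 2: distance from the new point to every training point.
    distances = [
        (math.dist(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Step 3: keep the k nearest neighbors.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 4: count the neighbors in each category.
    votes = Counter(label for _, label in nearest)
    # Step 5: assign the category with the most neighbors.
    return votes.most_common(1)[0][0]

# Made-up training data: two small clusters.
train_points = [(1, 1), (1, 2), (2, 1), (2, 2), (6, 6), (6, 7), (7, 6), (7, 7)]
train_labels = ["A", "A", "A", "A", "B", "B", "B", "B"]

print(knn_predict(train_points, train_labels, (3, 3), k=5))  # prints A
```

For the point (3, 3) with k=5, four of the five nearest neighbors belong to category A and one to category B, so the point is assigned to A.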

Suppose we have a new data point and we need to put it in the required category.
Consider the below image:

o Firstly, we will choose the number of neighbors, so we will choose k=5.
o Next, we will calculate the Euclidean distance between the data points. The
Euclidean distance is the distance between two points, which we have already
studied in geometry. For two points $(x_1, y_1)$ and $(x_2, y_2)$, it can be calculated as:
$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$
o By calculating the Euclidean distance, we get the nearest neighbors: three
nearest neighbors in category A and two nearest neighbors in category B.
Consider the below image:
o As we can see, 3 of the 5 nearest neighbors are from category A, hence this new data
point is assigned to category A.

How to select the value of K in the K-NN Algorithm?


Below are some points to remember while selecting the value of K in the K-NN
algorithm:

o There is no particular way to determine the best value for "K", so we need to try
some values to find the best out of them. A commonly used value for K is 5.
o A very low value for K, such as K=1 or K=2, can be noisy and make the model
sensitive to outliers.
o Larger values for K smooth out the noise, but if K is too large, the neighborhood may
include many points from other categories, and the prediction takes more computation.
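One simple way to "try some values" is to score each candidate K on held-out data. The sketch below uses leave-one-out evaluation on a made-up dataset with one deliberately mislabeled point; the helper names and data are hypothetical, and this is only one of several possible validation schemes.

```python
import math
from collections import Counter

def knn_predict(points, labels, query, k):
    """Majority vote among the k training points closest to query."""
    nearest = sorted(zip(points, labels), key=lambda pl: math.dist(pl[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def leave_one_out_accuracy(points, labels, k):
    """Classify each point using all the other points and report the hit rate."""
    hits = 0
    for i, (point, label) in enumerate(zip(points, labels)):
        rest_points = points[:i] + points[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        hits += knn_predict(rest_points, rest_labels, point, k) == label
    return hits / len(points)

# Toy data (made up): two clusters plus one noisy label inside the first cluster.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (1.5, 1.5),
          (6, 6), (6, 7), (7, 6), (7, 7), (6.5, 6.5)]
labels = ["A", "A", "A", "A", "B",    # (1.5, 1.5) carries the noisy label
          "B", "B", "B", "B", "B"]

for k in (1, 3, 5):
    print(k, leave_one_out_accuracy(points, labels, k))
```

On this toy data, K=1 scores 0.5 because the single noisy point misleads all of its neighbors, while K=3 and K=5 both score 0.9, which illustrates why a very low K can be noisy.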

Advantages of KNN Algorithm:


o It is simple to implement.
o It is robust to noisy training data, provided K is not too small.
o It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:

o The value of K always needs to be determined, which can sometimes be complex.
o The computation cost is high because the distance from the new data point to
every training sample has to be calculated.

Python implementation of the KNN algorithm


To do the Python implementation of the K-NN algorithm, we will use the same problem
and dataset which we have used in Logistic Regression. But here we will improve the
performance of the model. Below is the problem description:

Problem for K-NN Algorithm: There is a car manufacturer company that has
manufactured a new SUV car. The company wants to show ads to the users who are
interested in buying that SUV. For this problem, we have a dataset that contains
information about multiple users, collected through a social network. The dataset contains
lots of information, but we will consider the Estimated Salary and Age as the independent
variables and the Purchased variable as the dependent variable. Below is the dataset:

Steps to implement the K-NN algorithm:

o Data Pre-processing step
o Fitting the K-NN algorithm to the Training set
o Predicting the test result
o Test accuracy of the result (Creation of Confusion matrix)
o Visualizing the test set result.

Data Pre-Processing Step:

The Data Pre-processing step will remain exactly the same as Logistic Regression. Below
is the code for it:

# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('user_data.csv')

#Extracting Independent and dependent Variable
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)

#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)

By executing the above code, our dataset is imported to our program and pre-processed.
After feature scaling, our test dataset will look like:

From the above output image, we can see that our data is successfully scaled.
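To make the scaling step concrete, here is a minimal plain-Python sketch of standardization on one made-up feature column. It mirrors what the code above does with fit_transform on the training set and transform on the test set: the mean and standard deviation are learned from the training data only and then reused for the test data. The helper names and values are hypothetical.

```python
import statistics

def fit_scaler(column):
    """Learn the mean and (population) standard deviation of one feature."""
    return statistics.fmean(column), statistics.pstdev(column)

def transform(column, mean, std):
    """Standardize values using a previously learned mean and std."""
    return [(value - mean) / std for value in column]

train_ages = [20, 30, 40, 50]   # made-up training values
test_ages = [25, 60]            # made-up test values

mean, std = fit_scaler(train_ages)                # statistics from training data only
scaled_train = transform(train_ages, mean, std)
scaled_test = transform(test_ages, mean, std)     # reuse the training statistics

print(scaled_train)
print(scaled_test)
```

After scaling, the training column has mean 0 and standard deviation 1. The test column is expressed on the same scale, so distances computed by K-NN are comparable between the two sets.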

o Fitting K-NN classifier to the Training data:
Now we will fit the K-NN classifier to the training data. To do this, we will import
the KNeighborsClassifier class from the sklearn.neighbors module. After importing the
class, we will create the classifier object of the class. The parameters of this class
will be:
    o n_neighbors: To define the required number of neighbors for the algorithm.
    The default value is 5.
    o metric='minkowski': This is the default value, and it decides how the
    distance between the points is measured.
    o p=2: With the Minkowski metric, it is equivalent to the standard Euclidean metric.
And then we will fit the classifier to the training data. Below is the code for it:

#Fitting K-NN classifier to the training set
from sklearn.neighbors import KNeighborsClassifier
classifier= KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2 )
classifier.fit(x_train, y_train)

Output: By executing the above code, we will get the output as:

Out[10]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=5, p=2,
weights='uniform')
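The relationship between the metric and p parameters can be illustrated with a small plain-Python sketch of the Minkowski distance. The function below is written for illustration only and is not scikit-learn's internal implementation.

```python
def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

a, b = (0, 0), (3, 4)

print(minkowski_distance(a, b, p=1))  # Manhattan distance: 7.0
print(minkowski_distance(a, b, p=2))  # Euclidean distance: 5.0
```

With p=2 the formula reduces to the Euclidean distance, which is why the classifier above, created with metric='minkowski' and p=2, measures ordinary straight-line distance.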

o Predicting the Test Result: To predict the test set result, we will create
a y_pred vector as we did in Logistic Regression. Below is the code for it:

#Predicting the test set result
y_pred= classifier.predict(x_test)

Output:

The output for the above code will be:


o Creating the Confusion Matrix:
Now we will create the Confusion Matrix for our K-NN model to see the accuracy
of the classifier. Below is the code for it:

#Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm= confusion_matrix(y_test, y_pred)

In the above code, we have imported the confusion_matrix function and stored its result
in the variable cm.

Output: By executing the above code, we will get the matrix as below:

In the above image, we can see there are 64+29 = 93 correct predictions and 3+4 = 7
incorrect predictions, whereas in Logistic Regression there were 11 incorrect
predictions. So the accuracy on the test set rises from 89% to 93%, and we can say that
the performance of the model is improved by using the K-NN algorithm.

o Visualizing the Training set result:
Now, we will visualize the training set result for the K-NN model. The code
remains the same as in Logistic Regression, except for the name of the graph
and the colors used. Below is the code for it:

#Visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red','green' )))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
        c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('K-NN Algorithm (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

By executing the above code, we will get the below graph:

The output graph is different from the graph which we obtained in Logistic
Regression. It can be understood in the below points:

o As we can see, the graph shows red points and green points. The
green points are for Purchased(1) and the red points for not Purchased(0).
o The graph shows an irregular boundary instead of a straight line or a
smooth curve because K-NN classifies each location according to its
nearest neighbors.
o The graph has classified users in the correct categories, as most of the
users who didn't buy the SUV are in the red region and users who bought
the SUV are in the green region.
o The graph shows a good result, but there are still some green points in
the red region and red points in the green region. This is not a big issue:
a boundary that tolerates a few misclassified training points is less likely
to overfit.
o Hence our model is well trained.

o Visualizing the Test set result:
After the training of the model, we will now test the result on a new
dataset, i.e., the Test dataset. The code remains the same except for some minor
changes: x_train and y_train are replaced by x_test and y_test.
Below is the code for it:

#Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red','green' )))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
        c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('K-NN algorithm(Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

The above graph shows the output for the test data set. As we can see in the graph,
the predicted output is good, as most of the red points are in the red region and
most of the green points are in the green region.

However, there are a few green points in the red region and a few red points in the green
region. These are the incorrect predictions that we have already counted in the confusion
matrix (7 incorrect outputs).
