UNIT-2 - Data Science (Partial)

Unit 2 of the Data Science course covers Exploratory Data Analysis (EDA) and various machine learning algorithms, including supervised and unsupervised learning. EDA is essential for data scientists to identify patterns and clean data, while machine learning enables predictions and decision-making based on data. The unit also discusses the roles of data scientists and the importance of EDA in business decision-making.

IV B.Tech – I Sem | Data Science UNIT-2

UNIT - 2
DATA ANALYSIS AND ALGORITHMS: Exploratory Data Analysis (EDA), tools for EDA,
The Data Science Process, role of data scientists, case study. Algorithms: Machine Learning
Algorithms, Three Basic Algorithms - Linear Regression - k-Nearest Neighbors (k-NN) - k-
means – SVM, Naïve Bayes, Logistic Regression.

What is Exploratory Data Analysis?

Exploratory Data Analysis (EDA) is one of the techniques used for extracting vital features
and trends used by machine learning and deep learning models in Data Science.

Exploratory Data Analysis (EDA) is widely used by Data Scientists to analyze and investigate
data sets and summarize their main characteristics, often employing data visualization methods.
It helps the Data Scientist discover data patterns, spot anomalies, and test hypotheses and
assumptions. It can be defined as a method that helps the Data Scientist determine the best way
to manipulate a given data source to get the answers that are needed.

Exploratory Data Analysis is majorly performed using the following methods:


 Univariate visualization — provides summary statistics for each field in the
raw data set
 Bivariate visualization — is performed to find the relationship between each
variable in the dataset and the target variable of interest
 Multivariate visualization — is performed to understand interactions
between different fields in the dataset
 Dimensionality reduction — helps to understand the fields in the data that
account for the most variance between observations and allows for the
processing of a reduced volume of data.

Steps Involved in Exploratory Data Analysis
1. Data Collection
Data collection is an essential part of exploratory data analysis. It refers to the process of
finding and loading data into our system.
2. Data Cleaning
Data cleaning refers to the process of removing unwanted variables and values from your
dataset and getting rid of any irregularities in it. Such anomalies can disproportionately skew
the data and hence adversely affect the results. Some steps that can be done to clean data are:
 Removing missing values, outliers, and unnecessary rows/columns.
 Re-indexing and reformatting our data.
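As a minimal sketch, the two cleaning steps above might look like the following in plain Python (the column values are hypothetical):

```python
from statistics import quantiles

# Hypothetical raw column with missing values (None) and one outlier.
ages = [34, 45, None, 52, 41, 39, None, 47, 250, 44]

# Step 1: remove missing values.
cleaned = [a for a in ages if a is not None]

# Step 2: remove outliers using the common 1.5 * IQR rule.
q1, _, q3 = quantiles(cleaned, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
cleaned = [a for a in cleaned if lower <= a <= upper]

print(cleaned)  # the 250 outlier and the None entries are gone
```

In practice this is usually done with pandas (`dropna`, boolean masks), but the logic is the same.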
3. Univariate Analysis
In Univariate Analysis, you analyze data of just one variable. A variable in your dataset refers
to a single feature/ column. You can do this either with graphical or non-graphical means by
finding specific mathematical values in the data. Some visual methods include:
 Histograms: Bar plots in which the frequency of data is represented with rectangle
bars.
 Box-plots: Here the information is represented in the form of boxes.
4. Bivariate Analysis
Here, you use two variables and compare them. This way, you can find how one feature
affects the other. It is done with scatter plots, which plot individual data points
or correlation matrices that plot the correlation in hues. You can also use boxplots.
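A minimal plain-Python sketch of univariate and bivariate analysis, using hypothetical temperature and humidity readings:

```python
from statistics import mean, median, stdev

# Hypothetical feature columns from a small dataset.
temperature = [30, 32, 35, 37, 40, 42, 45]
humidity    = [80, 76, 70, 66, 58, 52, 45]

# Univariate analysis: summary statistics for a single variable.
print(mean(temperature), median(temperature), stdev(temperature))

# Bivariate analysis: Pearson correlation between two variables.
def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

r = pearson(temperature, humidity)
print(round(r, 3))  # strongly negative: humidity falls as temperature rises
```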

Page 1 of 21 By Dr. V.SATHYENDRA KUMAR



TOOLS REQUIRED FOR EXPLORATORY DATA ANALYSIS:


Some of the most common tools used to create an EDA are:
1. R: An open-source programming language and free software environment for statistical
computing and graphics supported by the R foundation for statistical computing. The R
language is widely used among statisticians in developing statistical observations and data
analysis.
2. Python: An interpreted, object-oriented programming language with dynamic semantics.
Its high-level, built-in data structures, combined with dynamic binding, make it very
attractive for rapid application development, as well as for use as a scripting or glue
language to connect existing components together.

Example: Polymer Search: Polymer Search is a tool that allows users to harness the power
of AI to generate insights from their data and create interactive databases that allow for easy
filtering and data exploration.

Importance of EDA in Data Science


The Data Science field is now very important in the business world, as it provides many
opportunities to make vital business decisions by analyzing huge volumes of gathered data.
However, raw data often contains anomalies and noise; a model built directly on such data
results in sub-optimal performance, which is why EDA is performed first.

Data set description


The dataset contains cases from the research carried out between the years 1958 and 1970 at
the University of Chicago’s Billings Hospital on the survival of patients who had undergone
surgery for breast cancer.
Attribute information :
1. Patient’s age at the time of operation (numerical).
2. Year of operation (year — 1900, numerical).
3. Number of positive axillary nodes detected (numerical).
4. Survival status (class attribute)
1: the patient survived 5 years or longer post-operation.
2: the patient died within 5 years post-operation.
Attributes 1, 2, and 3 form our features (independent variables), while attribute 4 is our class
label (dependent variable).
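To make the layout concrete, here is a sketch with a few illustrative (not real) records in the attribute order above:

```python
# Illustrative records in the Haberman-survival layout (values hypothetical):
# (age, year of operation - 1900, positive axillary nodes, survival status)
records = [
    (30, 64, 1, 1),
    (42, 63, 0, 1),
    (57, 65, 5, 2),
    (65, 58, 24, 2),
]

# Split into features (attributes 1-3) and class label (attribute 4).
X = [r[:3] for r in records]
y = [r[3] for r in records]

survived = sum(1 for label in y if label == 1)
print(f"{survived}/{len(y)} patients survived 5+ years")
```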

Objective of Exploratory Data Analysis


The overall objective of exploratory data analysis is to obtain vital insights and hence usually
includes the following sub-objectives:
 Identifying and removing data outliers
 Identifying trends in time and space
 Uncovering patterns related to the target
 Creating hypotheses and testing them through experiments
 Identifying new sources of data

Data Scientist Roles and Responsibilities


A data scientist’s job is to gather a large amount of data, analyze it, separate out the essential
information, and then utilize tools like SAS, R programming, Python, etc. to extract insights
that may be used to increase the productivity and efficiency of the business.

Data scientist roles and responsibilities include:


 Data mining or extracting usable data from valuable data sources

 Using machine learning tools to select features, create and optimize classifiers
 Carrying out preprocessing of structured and unstructured data
 Enhancing data collection procedures to include all relevant information for
developing analytic systems
 Processing, cleansing, and validating the integrity of data to be used for analysis
 Analyzing large amounts of information to find patterns and solutions
 Developing prediction systems and machine learning algorithms
 Presenting results in a clear manner
 Propose solutions and strategies to tackle business challenges
 Collaborate with Business and IT teams

Machine Learning Algorithms


Machine Learning is the science of making computers learn and act like humans by feeding
data and information without being explicitly programmed.
Machine learning algorithms are trained with training data. When new data comes in, they can
make predictions and decisions accurately based on past data.
Machine learning enables a machine to automatically learn from data, improve performance
from experiences, and predict things without being explicitly programmed.

How Does Machine Learning Work?


Machine Learning is, undoubtedly, one of the most exciting subsets of Artificial Intelligence.
It completes the task of learning from data with specific inputs to the machine. It’s important
to understand what makes Machine Learning work and, thus, how it can be used in the future.

Features of Machine Learning:


o Machine learning uses data to detect various patterns in a given dataset.
o It can learn from past data and improve automatically.
o It is a data-driven technology.
o Machine learning is similar to data mining, as both deal with huge amounts of data.
There are two types of machine learning:
1. Supervised Learning
2. Unsupervised Learning

Supervised Learning
In supervised learning, we use known or labeled data for the training data. Since the data is
known, the learning is, therefore, supervised, i.e., directed into successful execution. The input
data goes through the Machine Learning algorithm and is used to train the model. Once the
model is trained based on the known data, you can feed unknown data into the model and get a
new response.


The goal of supervised learning is to map input data to the output data. Supervised learning is
based on supervision, just as a student learns things under the supervision of a teacher.
An example of supervised learning is spam filtering.

Advantages of Supervised learning:


o With the help of supervised learning, the model can predict the output on the basis of
prior experiences.
o In supervised learning, we can have an exact idea about the classes of objects.
o Supervised learning model helps us to solve various real-world problems such as fraud
detection, spam filtering, etc.
Disadvantages of supervised learning:
o Supervised learning models are not suitable for handling complex tasks.
o Supervised learning cannot predict the correct output if the test data is different from
the training dataset.
o Training requires a lot of computation time.
o In supervised learning, we need enough knowledge about the classes of objects.

Types of supervised Machine learning Algorithms:


Supervised learning can be further divided into two types:
1. Classification
2. Regression

1. Classification - Supervised Learning


Classification is used when the output variable is categorical i.e. with 2 or more classes. For
example, yes or no, male or female, true or false, etc.

In order to predict whether a mail is spam or not, we need to first teach the machine what a
spam mail is. This is done based on a lot of spam filters - reviewing the content of the mail,
reviewing the mail header, and then searching if it contains any false information.
Based on the content, label, and the spam score of the new incoming mail, the algorithm decides
whether it should land in the inbox or spam folder.


Below are some popular Classification algorithms


o Random Forest
o Decision Trees
o Logistic Regression
o Support vector Machines

2. Regression - Supervised Learning


Regression is used when the output variable is a real or continuous value. In this case, there is
a relationship between two or more variables i.e., a change in one variable is associated with a
change in the other variable. For example, salary based on work experience or weight based on
height, etc.

Let’s consider two variables - humidity and temperature. Here, ‘temperature’ is the
independent variable and ‘humidity' is the dependent variable. If the temperature increases,
then the humidity decreases.

Below are some popular Regression algorithms which come under supervised learning:
o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression

Real-Life Applications of Supervised Learning


 Risk Assessment
Supervised learning is used to assess the risk in financial services or insurance
domains in order to minimize the risk portfolio of the companies.
 Image Classification
Image classification is one of the key use cases of demonstrating supervised machine
learning. For example, Facebook can recognize your friend in a picture from an
album of tagged photos.
 Fraud Detection
To identify whether the transactions made by the user are authentic or not.


 Visual Recognition
The ability of a machine learning model to identify objects, places, people, actions,
and images.

What is Unsupervised Learning?


In Unsupervised Learning, the machine uses unlabeled data and learns on its own without any
supervision. The machine tries to find patterns in the unlabeled data and gives a response.
Unsupervised learning is a type of machine learning in which models are trained using an
unlabeled dataset and are allowed to act on that data without any supervision.

Use of Unsupervised Learning?


Below are some main reasons which describe the importance of Unsupervised Learning:
o Unsupervised learning is helpful for finding useful insights from the data.
o Unsupervised learning closely resembles how a human learns to think from their own
experiences, which makes it closer to real AI.
o Unsupervised learning works on unlabeled and uncategorized data, which makes it all
the more important.
o In the real world, we do not always have input data with corresponding output, so to
solve such cases, we need unsupervised learning.

Unsupervised learning can be further grouped into types:


1. Clustering
2. Association

1. Clustering - Unsupervised Learning


Clustering is the method of dividing objects into clusters that are similar to each other and
dissimilar to the objects belonging to other clusters. For example, finding out which
customers made similar product purchases.

Suppose a telecom company wants to reduce its customer churn rate by providing personalized
call and data plans. The behavior of the customers is studied and the model segments the
customers with similar traits. Several strategies are adopted to minimize churn rate and
maximize profit through suitable promotions and campaigns.


Group A customers use more data and also have high call durations. Group B customers are
heavy Internet users, while Group C customers have high call durations. So, Group B will be
given more data benefit plans, while Group C will be given cheaper call rate plans, and
Group A will be given the benefit of both.

2. Association - Unsupervised Learning

Association is a rule-based machine learning method used to discover the probability of the
co-occurrence of items in a collection. For example, finding out which products were
purchased together.

Let’s say that a customer goes to a supermarket and buys bread, milk, fruits, and wheat. Another
customer comes and buys bread, milk, rice, and butter. Now, when another customer comes, it
is highly likely that if he buys bread, he will buy milk too. Hence, a relationship is established
based on customer behavior and recommendations are made.

An association rule is an unsupervised learning method which is used for finding
relationships between variables in a large database. It determines the sets of items that occur
together in the dataset. Association rules make marketing strategies more effective; for
example, people who buy item X (say, bread) also tend to purchase item Y (butter/jam). A
typical example of association rule learning is Market Basket Analysis.
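The supermarket example above can be sketched by counting item co-occurrences and computing the support and confidence of the rule "bread → milk" (the transactions are hypothetical):

```python
from itertools import combinations
from collections import Counter

# Hypothetical supermarket transactions.
transactions = [
    {"bread", "milk", "fruits", "wheat"},
    {"bread", "milk", "rice", "butter"},
    {"bread", "milk"},
    {"bread", "butter"},
]

# Count how often each item and each item pair occurs.
pair_counts = Counter()
item_counts = Counter()
for t in transactions:
    item_counts.update(t)
    pair_counts.update(frozenset(p) for p in combinations(sorted(t), 2))

# Rule "bread -> milk": support and confidence.
n = len(transactions)
support = pair_counts[frozenset({"bread", "milk"})] / n
confidence = pair_counts[frozenset({"bread", "milk"})] / item_counts["bread"]
print(support, confidence)  # 0.75 0.75
```

Full association-rule miners (e.g. Apriori) automate this counting over all frequent item sets.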

Real-Life Applications of Unsupervised Learning


 Market Basket Analysis
It is a machine learning model based on the algorithm that if you buy a certain group
of items, you are less or more likely to buy another group of items.
 Semantic Clustering
Semantically similar words share a similar context. People post their queries on
websites in their own ways. Semantic clustering groups all these responses with the
same meaning in a cluster to ensure that the customer finds the information they
want quickly and easily. It plays an important role in information retrieval, good
browsing experience, and comprehension.


 Delivery Store Optimization


Machine learning models are used to predict demand and keep up with supply.
They are also used to open stores where demand is higher and to optimize routes
for more efficient deliveries according to past data and behavior.
 Identifying Accident Prone Areas
Unsupervised machine learning models can be used to identify accident-prone areas
and introduce safety measures based on the intensity of those accidents.

Unsupervised Learning algorithms:


Below is the list of some popular unsupervised learning algorithms:
o K-means clustering
o KNN (k-nearest neighbors)
o Hierarchical clustering
o Anomaly detection
o Neural Networks
o Principal Component Analysis
o Independent Component Analysis
o Apriori algorithm
o Singular value decomposition

Advantages of Unsupervised Learning


o Unsupervised learning is used for more complex tasks as compared to supervised
learning because, in unsupervised learning, we don't have labeled input data.
o Unsupervised learning is preferable as it is easy to get unlabeled data in comparison to
labeled data.

Disadvantages of Unsupervised Learning


o Unsupervised learning is intrinsically more difficult than supervised learning as it does
not have corresponding output.
o The result of the unsupervised learning algorithm might be less accurate as input data
is not labeled, and algorithms do not know the exact output in advance.

Major differences between Supervised and Unsupervised Learning

Parameters | Supervised machine learning technique | Unsupervised machine learning technique

Process | In a supervised learning model, input and output variables will be given. | In an unsupervised learning model, only input data will be given.
Input Data | Algorithms are trained using labeled data. | Algorithms are used against data which is not labeled.
Algorithms Used | Support vector machine, neural network, linear and logistic regression, random forest, and classification trees. | Unsupervised algorithms can be divided into different categories: cluster algorithms, k-means, hierarchical clustering, etc.
Computational Complexity | Supervised learning is a simpler method. | Unsupervised learning is computationally complex.
Use of Data | The supervised learning model uses training data to learn a link between the input and the outputs. | Unsupervised learning does not use output data.
Accuracy of Results | Highly accurate and trustworthy method. | Less accurate and trustworthy method.
Real Time Learning | Learning takes place offline. | Learning takes place in real time.
Number of Classes | Number of classes is known. | Number of classes is not known.
Main Drawback | Classifying big data can be a real challenge in supervised learning. | You cannot get precise information regarding data sorting, and the output is less reliable as the data used in unsupervised learning is not labeled.

The differences between Parametric and Non-Parametric Methods are as follows:

Parametric Methods | Non-Parametric Methods

Parametric methods use a fixed number of parameters to build the model. | Non-parametric methods use a flexible number of parameters to build the model.
Parametric analysis tests group means. | Non-parametric analysis tests medians.
It is applicable only for variables. | It is applicable for both variables and attributes.
It always makes strong assumptions about the data. | It generally makes fewer assumptions about the data.
Parametric methods require less data than non-parametric methods. | Non-parametric methods require much more data than parametric methods.
Parametric methods assume a normal distribution. | There is no assumed distribution in non-parametric methods.
Parametric methods handle interval or ratio data. | Non-parametric methods handle original data.
Results generated by parametric methods can be easily affected by outliers. | Results generated by non-parametric methods cannot be seriously affected by outliers.
Parametric methods can perform well in many situations, but performance is at its peak when the spread of each group is different. | Non-parametric methods can perform well in many situations, but performance is at its peak when the spread of each group is the same.
Parametric methods have more statistical power than non-parametric methods. | Non-parametric methods have less statistical power than parametric methods.
Computationally, parametric methods are faster than non-parametric methods. | Computationally, non-parametric methods are slower than parametric methods.
Examples: Logistic Regression, Naïve Bayes model, etc. | Examples: KNN, Decision Tree model, etc.

K-Nearest Neighbor(KNN) Algorithm for Machine Learning

The K-Nearest Neighbors (KNN) algorithm is one of the most used learning algorithms in
Machine Learning due to its simplicity. KNN is a lazy-learning, non-parametric algorithm. It
uses data with several classes to predict the classification of a new sample point. KNN is
non-parametric since it doesn't make any assumptions about the data being studied; i.e., the
model structure is determined from the data itself.

Working of KNN Algorithm

K-nearest neighbors (KNN) algorithm uses ‘feature similarity’ to predict the values of new
datapoints which further means that the new data point will be assigned a value based on how
closely it matches the points in the training set. We can understand its working with the help
of following steps −

Step 1 − For implementing any algorithm, we need a dataset. So during the first step of KNN,
we must load the training as well as the test data.
Step 2 − Next, we need to choose the value of K, i.e., the number of nearest data points to
consider. K can be any integer.
Step 3 − For each point in the test data, do the following −
 3.1 − Calculate the distance between the test data point and each row of training data
using any of the methods, namely Euclidean, Manhattan, or Hamming distance.
The most commonly used method to calculate distance is Euclidean.
 3.2 − Now, based on the distance values, sort them in ascending order.

 3.3 − Next, it will choose the top K rows from the sorted array.
 3.4 − Now, it will assign a class to the test point based on most frequent class of
these rows.
Step 4 − End
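The steps above can be sketched in Python as follows (the training points and labels are hypothetical illustrations):

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+

def knn_predict(train, new_point, k=3):
    """Classify new_point by majority vote among its k nearest neighbours."""
    # Steps 3.1-3.2: compute distances to every training row, sort ascending.
    by_distance = sorted(train, key=lambda row: dist(row[0], new_point))
    # Step 3.3: take the top k rows.
    top_k = by_distance[:k]
    # Step 3.4: assign the most frequent class among them.
    votes = Counter(label for _, label in top_k)
    return votes.most_common(1)[0][0]

# Hypothetical training data: (point, class), echoing the (60, 60) example.
train = [((58, 62), "red"), ((61, 59), "red"), ((55, 40), "blue"),
         ((30, 30), "blue"), ((63, 64), "blue"), ((70, 70), "red")]

print(knn_predict(train, (60, 60), k=3))  # red
```

With k = 3, two of the three nearest neighbours of (60, 60) are red, so the new point is assigned to the red class.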
Example
The following is an example to understand the concept of K and working of KNN algorithm –

Suppose we have a new data point that we need to put in the required category.

o Firstly, we will choose the number of neighbors, so we will choose k = 5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean
distance is the straight-line distance between two points; for points (x1, y1) and (x2, y2)
it is:

   d = √((x2 − x1)² + (y2 − y1)²)

o By calculating the Euclidean distance we get the nearest neighbors: three nearest
neighbors in category A and two nearest neighbors in category B.


o As we can see, three of the five nearest neighbors are from category A; hence this new
data point must belong to category A.
Example -2
Suppose we have a dataset which can be plotted as follows −

Now, we need to classify a new data point with a black dot (at point 60,60) into the blue or
red class. We are assuming K = 3, i.e., the algorithm finds the three nearest data points.

Among the three nearest neighbors of the black-dot data point, two lie in the red class, hence
the black dot will also be assigned to the red class.
Advantages of KNN Algorithm:
o It is simple to implement.


o It is robust to noisy training data.


o It can be more effective if the training data is large.
Disadvantages of KNN Algorithm:
o We always need to determine the value of K, which may be complex at times.
o The computation cost is high because of calculating the distance between the data points
for all the training samples.

What is Logistic Regression?


Logistic regression is a statistical method used for building machine learning models where
the dependent variable is dichotomous, i.e., binary. Logistic regression is used to describe
data and the relationship between one dependent variable and one or more independent
variables. The independent variables can be nominal, ordinal, or of interval type.

The name “logistic regression” is derived from the concept of the logistic function that it uses.
The logistic function is also known as the sigmoid function. The value of this logistic function
lies between zero and one.

The goal of Logistic Regression is to discover a link between characteristics and the likelihood
of a specific outcome. For example, when predicting whether a student passes or fails an exam
based on the number of hours spent studying, the response variable has two values: pass and
fail.

A Logistic Regression model is similar to a Linear Regression model, except that the Logistic
Regression utilizes a more sophisticated cost function, which is known as the “Sigmoid
function” or “logistic function” instead of a linear function.
It’s called ‘Logistic Regression’ since the technique behind it is quite similar to Linear
Regression. The name “Logistic” comes from the Logit function, which is utilized in this
categorization approach.

Logistic Regression is still considered a regression model: it builds a regression model to
predict the likelihood that a given data entry belongs to the category labeled "1".

Logistic regression models the data using the sigmoid function, much as linear regression
assumes that the data follows a linear relationship.

Advantages of the Logistic Regression Algorithm


 Logistic regression performs better when the data is linearly separable
 It does not require too many computational resources as it’s highly interpretable
 There is no problem scaling the input features; it does not require tuning
 It is easy to implement and train a model using logistic regression
 It gives a measure of how relevant a predictor (coefficient size) is, and its direction
of association (positive or negative)


Logistic Regression can be used to classify observations using different types of data and
can easily determine the most effective variables for the classification.

Logistic Function (Sigmoid Function):


o The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
o It maps any real value into another value within a range of 0 and 1.
o The value of the logistic regression must be between 0 and 1, which cannot go beyond
this limit, so it forms a curve like the "S" form. The S-form curve is called the Sigmoid
function or the logistic function.
o In logistic regression, we use the concept of a threshold value, which decides between
the classes 0 and 1: values above the threshold tend to 1, and values below the
threshold tend to 0.
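A minimal sketch of the sigmoid function and the threshold rule described above:

```python
from math import exp

def sigmoid(z):
    """Logistic function: maps any real value into the range (0, 1)."""
    return 1 / (1 + exp(-z))

print(sigmoid(0))    # 0.5, the midpoint of the S-curve
print(sigmoid(6))    # close to 1
print(sigmoid(-6))   # close to 0

# Threshold rule: probabilities at or above 0.5 are classified as 1, below as 0.
predict = lambda z, threshold=0.5: 1 if sigmoid(z) >= threshold else 0
print(predict(2.0), predict(-2.0))  # 1 0
```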

Example: We are given a dataset containing information about various users, obtained
from social networking sites. A car-making company has recently launched a new SUV.
The company wants to check how many users from the dataset want to purchase the car.
o For this problem, we will build a Machine Learning model using the Logistic Regression
algorithm. We will predict the purchased variable (dependent variable) using age and
salary (independent variables).

Steps in Logistic Regression: To implement the Logistic Regression using Python, we will
use the same steps as we have done in previous topics of Regression. Below are the steps:
o Data Pre-processing step
o Fitting Logistic Regression to the Training set
o Predicting the test result
o Test accuracy of the result (creation of a confusion matrix)
o Visualizing the test set result.
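The steps above can be sketched end-to-end with a from-scratch gradient-descent fit on hypothetical, already pre-processed data (a real workflow would typically use scikit-learn instead):

```python
from math import exp

def sigmoid(z):
    return 1 / (1 + exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights w and bias b by stochastic gradient descent on the log-loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of log-loss w.r.t. the linear score
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict(w, b, xi):
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5 else 0

# Hypothetical pre-processed data: [age, salary] scaled to 0-1, purchased 0/1.
X = [[0.2, 0.1], [0.3, 0.2], [0.1, 0.3], [0.8, 0.9], [0.7, 0.8], [0.9, 0.7]]
y = [0, 0, 0, 1, 1, 1]

w, b = fit_logistic(X, y)
preds = [predict(w, b, xi) for xi in X]

# Confusion matrix: rows = actual class, columns = predicted class.
cm = [[0, 0], [0, 0]]
for actual, p in zip(y, preds):
    cm[actual][p] += 1
print(cm)  # perfectly separable data -> [[3, 0], [0, 3]]
```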

Applications of Logistic Regression


 Using the logistic regression algorithm, banks can predict whether a customer would
default on loans or not
 To predict the weather conditions of a certain place (sunny, windy, rainy, humid,
etc.)
 Ecommerce companies can identify buyers if they are likely to purchase a certain
product


 Companies can predict whether they will gain or lose money in the next quarter,
year, or month based on their current performance
 To classify objects based on their features and attributes

How Does the Logistic Regression Algorithm Work?


Consider the following example: An organization wants to determine an employee’s salary
increase based on their performance.
For this purpose, a linear regression algorithm will help them decide. Plotting a regression line
by considering the employee’s performance as the independent variable, and the salary increase
as the dependent variable will make their task easier.

Now, what if the organization wants to know whether an employee would get a promotion or
not based on their performance? The above linear graph won’t be suitable in this case. As such,
we clip the line at zero and one, and convert it into a sigmoid curve (S curve).

Based on the threshold values, the organization can decide whether an employee will get a
salary increase or not.

Linear Regression vs. Logistic Regression

Linear Regression | Logistic Regression

Used to solve regression problems. | Used to solve classification problems.
The response variables are continuous in nature. | The response variable is categorical in nature.
It helps estimate the dependent variable when there is a change in the independent variable. | It helps calculate the possibility of a particular event taking place.
It is a straight line. | It is an S-curve (S = Sigmoid).

Naïve Bayes Classifier Algorithm


o The Naïve Bayes algorithm is a supervised learning algorithm, based on Bayes' theorem and used for solving classification problems.
o It is mainly used in text classification with high-dimensional training datasets.
o The Naïve Bayes classifier is one of the simplest and most effective classification algorithms; it helps build fast machine learning models that can make quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the probability of an object belonging to a class.
o Some popular applications of the Naïve Bayes algorithm are spam filtering, sentiment analysis, and classifying articles.
Naïve Bayes is used for the following applications:
o Face Recognition: as a classifier, it is used to identify faces or facial features such as the nose, mouth, and eyes.
o Weather Prediction: it can be used to predict whether the weather will be good or bad.
o Medical Diagnosis: doctors can diagnose patients using the information the classifier provides. Healthcare professionals can use Naïve Bayes to indicate whether a patient is at high risk for certain diseases and conditions, such as heart disease, cancer, and other ailments.
o News Classification: with the help of a Naïve Bayes classifier, Google News recognizes whether a news item is political, world news, and so on.
Advantages of Naive Bayes Classifier
The following are some of the benefits of the Naive Bayes classifier:
 It is simple and easy to implement.
 It doesn't require as much training data as many other classifiers.
 It handles both continuous and discrete data.
 It is highly scalable with the number of predictors and data points.
 It is fast and can be used to make real-time predictions.
 It is not sensitive to irrelevant features.
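As an illustration of the text-classification use case, here is a minimal from-scratch Naïve Bayes spam filter. The toy messages and labels are purely illustrative, and Laplace (add-one) smoothing is an implementation choice, not something prescribed by the text:

```python
from collections import Counter
import math

# Toy training data: (message, label) pairs; purely illustrative.
train = [
    ("win money now", "spam"),
    ("free money offer", "spam"),
    ("meeting at noon", "ham"),
    ("project status meeting", "ham"),
]

def fit(data):
    # Count word frequencies per class, class frequencies, and the vocabulary.
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    vocab = set()
    for text, label in data:
        class_counts[label] += 1
        for word in text.split():
            word_counts[label][word] += 1
            vocab.add(word)
    return word_counts, class_counts, vocab

def predict(text, word_counts, class_counts, vocab):
    # Pick the class maximizing log P(class) + sum of log P(word | class),
    # with Laplace (add-one) smoothing to avoid zero probabilities.
    total = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total)
        n_words = sum(word_counts[label].values())
        for word in text.split():
            score += math.log((word_counts[label][word] + 1) / (n_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = fit(train)
print(predict("free money", *model))
```

Working in log-probabilities avoids numeric underflow when multiplying many small per-word probabilities together.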


What is Meant by the K-Means Clustering Algorithm?


K-Means clustering is an unsupervised learning algorithm: unlike in supervised learning, there is no labeled data. K-Means divides objects into clusters whose members are similar to each other and dissimilar to the objects in other clusters.
The term 'K' is a number: you must tell the system how many clusters to create. For example, K = 2 refers to two clusters. There are methods, such as the elbow method, for finding the optimum value of K for a given dataset.
For a better understanding of k-means, let's take an example from cricket. Imagine you received data on many cricket players from all over the world, giving the runs scored and the wickets taken by each player in their last ten matches. Based on this information, we need to group the data into two clusters, namely batsmen and bowlers.
Solution:
Assign data points
Here, we have our data set plotted on ‘x’ and ‘y’ coordinates. The information on the y-axis
is about the runs scored, and on the x-axis about the wickets taken by the players.
If we plot the data, each player appears as a point on these axes (scatter plot omitted).

Perform Clustering
We need to create the two clusters (figure omitted).


Considering the same data set, let us solve the problem using K-Means clustering (taking K =
2).
The first step in k-means clustering is the allocation of two centroids randomly (as K=2).
Two points are assigned as centroids. Note that the points can be anywhere, as they are
random points. They are called centroids, but initially, they are not the central point of a
given data set.

The next step is to compute the distance from each data point to both of the randomly assigned centroids. Each point is assigned to whichever centroid is closer. In the original figure, the data points attached to the two centroids are shown in blue and yellow.

The next step is to determine the actual centroid for these two clusters. The original randomly
allocated centroid is to be repositioned to the actual centroid of the clusters.


This process of calculating the distance and repositioning the centroid continues until we
obtain our final cluster. Then the centroid repositioning stops.

As seen above, the centroids no longer need repositioning, which means the algorithm has converged, and we have two clusters, each with its centroid.
Applications of K-Means Clustering
K-Means clustering is used in a variety of examples or business cases in real life, like:
 Academic performance
 Diagnostic systems
 Search engines
 Wireless sensor networks
Academic Performance
Based on the scores, students are categorized into grades like A, B, or C.
Diagnostic systems
The medical profession uses k-means in creating smarter medical decision support systems,
especially in the treatment of liver ailments.
Search engines
Clustering forms a backbone of search engines. When a search is performed, the search
results need to be grouped, and the search engines very often use clustering to do this.
Wireless sensor networks
The clustering algorithm plays the role of finding the cluster heads, which collect all the data in their respective clusters.
How Does K-Means Clustering Work?
The steps below describe how k-means clustering works (flowchart omitted):

K-Means Clustering Algorithm


Let's say we have x1, x2, x3……… x(n) as our inputs, and we want to split this into K
clusters.


The steps to form clusters are:


Step 1: Choose K random points as cluster centers, called centroids.
Step 2: Assign each x(i) to the closest cluster by computing its Euclidean distance to each centroid.
Step 3: Identify the new centroids by taking the average of the assigned points.
Step 4: Keep repeating steps 2 and 3 until convergence is achieved.

The k-means clustering algorithm mainly performs two tasks:


o Determines the best value for K center points or centroids by an iterative process.
o Assigns each data point to its closest k-center. Those data points which are near to the
particular k-center, create a cluster.
How Does the K-Means Algorithm Work?
The working of the K-Means algorithm is explained in the steps below:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as centroids. (They need not come from the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid of its cluster.
Step-6: If any reassignment occurred, go to Step-4; otherwise, go to FINISH.
Step-7: The model is ready.
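The steps above can be sketched directly in Python. The toy "runs vs. wickets" points below are illustrative, echoing the cricket example; the fixed random seed is just for reproducibility:

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    # Step 1: pick K random points as the initial centroids.
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Step 2: assign each point to its closest centroid (Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        # Step 3: move each centroid to the mean of its assigned points
        # (keep the old centroid if its cluster is empty).
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        # Step 4: stop when the centroids no longer move (convergence).
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Toy (runs, wickets) data: two obvious groups (batsmen and bowlers).
players = [(80, 1), (75, 0), (90, 2), (10, 8), (15, 9), (5, 7)]
centroids, clusters = kmeans(players, k=2)
```

On this well-separated data the algorithm converges in a couple of iterations, with one centroid near the high-runs group and the other near the high-wickets group.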

What is Linear Regression in Machine Learning?


Linear Regression is a supervised machine learning algorithm. It models a relation that predicts the outcome of an event from independent variable data points. The relation is usually a straight line that fits the different data points as closely as possible. The output is continuous, i.e., a numerical value.
For example, the output could be revenue or sales in currency, the number of products
sold, etc. In the above example, the independent variable can be single or multiple.
Linear regression is used in many different fields, including finance, economics, and
psychology, to understand and predict the behavior of a particular variable. For example, in
finance, linear regression might be used to understand the relationship between a
company’s stock price and its earnings, or to predict the future value of a currency based on
its past performance.


Types of Linear Regression


Linear Regression can be broadly classified into two types of algorithms:
1. Simple Linear Regression
Simple Linear Regression uses a straight-line equation involving a slope and an intercept. Its form is:
y = mx + c, where y denotes the output, x is the independent variable, m is the slope, and c is the intercept (the value of y when x = 0). With this equation, the algorithm trains the machine learning model and gives the most accurate output.
2. Multiple Linear Regression
When the number of independent variables is more than one, the governing linear equation takes a different form:
y = c + m1x1 + m2x2 + … + mnxn, where m1, m2, …, mn are the coefficients representing the impact of the independent variables x1, x2, etc. When applied, this machine learning algorithm finds the values of the coefficients m1, m2, etc., and gives the best-fitting line.
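A minimal sketch of simple linear regression using the standard ordinary least-squares formulas for the slope m and intercept c. The performance-rating and salary-increase numbers are hypothetical, chosen only to illustrate the fit:

```python
def fit_simple_linear(xs, ys):
    # Ordinary least squares for y = m*x + c:
    #   m = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2)
    #   c = y_mean - m * x_mean
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    m = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
         / sum((x - x_mean) ** 2 for x in xs))
    c = y_mean - m * x_mean
    return m, c

# Hypothetical data: performance rating vs. salary increase (%).
performance = [1, 2, 3, 4, 5]
increase = [2.1, 3.9, 6.2, 7.8, 10.1]
m, c = fit_simple_linear(performance, increase)
prediction = m * 6 + c  # predicted increase for a rating of 6
```

The same closed-form idea generalizes to multiple linear regression, where the coefficients m1, m2, … are usually found with matrix methods (the normal equations) rather than these scalar formulas.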
