ML & DA Unit 2 - Notes

Unit 2 covers supervised learning in machine learning, focusing on training models with labeled data to make predictions. It discusses regression techniques, particularly linear and polynomial regression, their applications, and evaluation metrics like Mean Absolute Error and R-squared. The document provides insights into the workings, advantages, and limitations of these regression methods, along with implementation examples using Python.

Unit 2: Machine Learning and Data Analytics Using Python

Supervised Learning
Introduction
Supervised machine learning is a fundamental approach for machine learning and
artificial intelligence. It involves training a model using labeled data, where each input
comes with a corresponding correct output.
The process is like a teacher guiding a student, hence the term "supervised" learning. In
this unit, we'll explore the key components of supervised learning, the different types
of supervised machine learning algorithms used, and some practical examples of how it
works.


Working of Supervised Machine Learning


• Training Data: The model is provided with a training dataset that includes input data
(features) and corresponding output data (labels or target variables).

• Learning Process: The algorithm processes the training data, learning the relationships
between the input features and the output labels. This is achieved by adjusting the model's
parameters to minimize the difference between its predictions and the actual labels.

After training, the model is evaluated using a test dataset to measure its accuracy and
performance. Then the model's performance is optimized by adjusting parameters and
using techniques like cross-validation to balance bias and variance. This ensures the model
generalizes well to new, unseen data.
In summary, supervised machine learning involves training a model on labeled data to
learn patterns and relationships, which it then uses to make accurate predictions on new
data.

Regression
Regression in machine learning refers to a supervised learning technique where the
goal is to predict a continuous numerical value based on one or more independent
features. It finds relationships between variables so that predictions can be made.
In regression, we have two types of variables:
Dependent Variable (Target): The variable we are trying to predict, e.g., house price.
Independent Variables (Features): The input variables that influence the prediction, e.g., locality, number of rooms.

Linear Regression
Linear regression is a type of supervised machine-learning algorithm that learns
from the labelled datasets and maps the data points with most optimized linear
functions which can be used for prediction on new datasets. It assumes that there
is a linear relationship between the input and output, meaning the output changes
at a constant rate as the input changes. This relationship is represented by a
straight line.
For example, suppose we want to predict a student's exam score based on how many hours
they studied. We observe that as students study more hours, their scores go up. In this example:
Independent variable (input): Hours studied, because it's the factor we control or observe.
Dependent variable (output): Exam score, because it depends on how many hours were studied.
Importance of Linear Regression
• Simplicity and Interpretability: It’s easy to understand and interpret, making it
a starting point for learning about machine learning.
• Predictive Ability: Helps predict future outcomes based on past data, making it
useful in various fields like finance, healthcare and marketing.
• Basis for Other Models: Many advanced algorithms, like logistic regression or
neural networks, build on the concepts of linear regression.
• Efficiency: It’s computationally efficient and works well for problems with a
linear relationship.
• Widely Used: It’s one of the most widely used techniques in both statistics and
machine learning for regression tasks.
• Analysis: It provides insights into relationships between variables (e.g., how much
one variable influences another).
Best Fit Line in Linear Regression
In linear regression, the best-fit line is the straight line that most accurately
represents the relationship between the independent variable (input) and the
dependent variable (output). It is the line that minimizes the difference between
the actual data points and the predicted values from the model.
1. Goal of the Best-Fit Line
The goal of linear regression is to find a straight line that minimizes the error (the
difference) between the observed data points and the predicted values. This line
helps us predict the dependent variable for new, unseen data.

Here Y is called the dependent or target variable and X is called the independent variable,
also known as the predictor of Y. There are many types of functions or modules that can
be used for regression; a linear function is the simplest type of function. Here, X may be a
single feature or multiple features representing the problem.
2. Equation of the Best-Fit Line
For simple linear regression (with one independent variable), the best-fit line is
represented by the equation
y = mx + b
Where:
• y is the predicted value (dependent variable)
• x is the input (independent variable)
• m is the slope of the line (how much y changes when x changes)
• b is the intercept (the value of y when x = 0)
The best-fit line will be the one that optimizes the values of m (slope) and b (intercept)
so that the predicted y values are as close as possible to the actual data points.
3. Minimizing the Error: The Least Squares Method
To find the best-fit line, we use a method called Least Squares. The idea behind this
method is to minimize the sum of squared differences between the actual values (data
points) and the predicted values from the line. These differences are called residuals.
The formula for residuals is:
Residual = yᵢ − ŷᵢ
Where:
• yᵢ is the actual observed value
• ŷᵢ is the predicted value from the line for that xᵢ
The least squares method minimizes the sum of the squared residuals:
Sum of squared errors (SSE) = Σ(yᵢ − ŷᵢ)²
This method ensures that the line best represents the data where the sum of the squared
differences between the predicted values and actual values is as small as possible.
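As a quick illustration of the least squares idea, the following sketch computes the slope and intercept directly with NumPy using the closed-form formulas; the hours/scores values are made up purely for illustration.

import numpy as np

# Illustrative data: hours studied (x) and exam scores (y)
x = np.array([1, 2, 3, 4, 5])
y = np.array([52, 58, 65, 71, 74])

# Closed-form least squares estimates for simple linear regression:
# m = sum((x - x_mean)*(y - y_mean)) / sum((x - x_mean)^2),  b = y_mean - m * x_mean
x_mean, y_mean = x.mean(), y.mean()
m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b = y_mean - m * x_mean

print("Slope (m):", m)
print("Intercept (b):", b)
print("SSE:", np.sum((y - (m * x + b)) ** 2))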
4. Interpretation of the Best-Fit Line
• Slope (m): The slope of the best-fit line indicates how much the dependent
variable (y) changes with each unit change in the independent variable (x). For
example if the slope is 5, it means that for every 1-unit increase in x, the value of y
increases by 5 units.
• Intercept (b): The intercept represents the predicted value of y when x = 0. It’s
the point where the line crosses the y-axis.
In linear regression, some assumptions are made to ensure the reliability of the model's results.
Limitations
• Assumes Linearity: The method assumes the relationship between the variables is
linear. If the relationship is non-linear, linear regression might not work well.
• Sensitivity to Outliers: Outliers can significantly affect the slope and intercept,
skewing the best-fit line.
Types of Linear Regression
When there is only one independent feature, it is known as Simple Linear Regression
or Univariate Linear Regression; when there is more than one feature, it is known as
Multiple Linear Regression or Multivariate Regression.
1. Simple Linear Regression


Simple linear regression is used when we want to predict a target value (dependent
variable) using only one input feature (independent variable). It assumes a straight-line
relationship between the two.

Example:
Predicting a person’s salary (y) based on their years of experience (x).
2. Multiple Linear Regression
Multiple linear regression involves more than one independent variable and one
dependent variable. The equation for multiple linear regression is:
y = b₀ + b₁x₁ + b₂x₂ + … + bₙxₙ
The goal of the algorithm is to find the best-fit equation that can predict the values
based on the independent variables.
In regression, a set of records with X and Y values is used to learn a function, so that if
you want to predict Y from an unknown X, this learned function can be used. Since regression
has to find a continuous value of Y, a function is required that predicts continuous Y given
X as independent features.
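Below is a minimal sketch of multiple linear regression with scikit-learn; the house sizes, room counts and prices are assumed values used only for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data: [size in sq. ft., number of rooms] -> price
X = np.array([[1000, 2], [1500, 3], [1800, 3], [2400, 4], [3000, 5]])
y = np.array([200000, 270000, 300000, 380000, 450000])

model = LinearRegression()
model.fit(X, y)

print("Coefficients:", model.coef_)     # one coefficient per feature
print("Intercept:", model.intercept_)
print("Prediction for [2000, 3]:", model.predict([[2000, 3]]))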
Use Case of Multiple Linear Regression
Multiple linear regression allows us to analyze the relationship between multiple
independent variables and a single dependent variable. Here are some use cases:
• Real Estate Pricing: In real estate MLR is used to predict property prices based
on multiple factors such as location, size, number of bedrooms, etc. This helps
buyers and sellers understand market trends and set competitive prices.
• Financial Forecasting: Financial analysts use MLR to predict stock prices or
economic indicators based on multiple influencing factors such as interest rates,
inflation rates and market trends. This enables better investment strategies and
risk management.
• Agricultural Yield Prediction: Farmers can use MLR to estimate crop yields
based on several variables like rainfall, temperature, soil quality and fertilizer
usage. This information helps in planning agricultural practices for optimal
productivity
• E-commerce Sales Analysis: An e-commerce company can utilize MLR to assess
how various factors such as product price, marketing promotions and seasonal
trends impact sales.
Implementation of Linear Regression
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Sample data
X = np.array([1,2,3,4,5]).reshape(-1,1)
y = np.array([2,4,5,4,5])

# Model
model = LinearRegression()
model.fit(X, y)

# Prediction
y_pred = model.predict(X)

# Plot
plt.scatter(X, y, color='blue')
plt.plot(X, y_pred, color='red')
plt.title("Linear Regression")
plt.xlabel("X")
plt.ylabel("y")
plt.show()

Polynomial Regression
Polynomial regression is a regression technique used to model the relationship
between an independent variable x and a dependent variable y as an nth-
degree polynomial. Unlike linear regression, which assumes a straight-line (linear)
relationship, polynomial regression can capture non-linear, curved patterns in
data by including higher-degree terms of the independent variable.
Application of Polynomial Regression


The reason behind the wide range of use cases for polynomial regression is that much
real-world data is non-linear in nature; when we fit a non-linear model or a curvilinear
regression line to such data, the results we obtain are far better than what we can achieve
with standard linear regression. Some of the use cases of polynomial regression are stated below:
• The growth rate of tissues.
• Progression of disease epidemics
• Distribution of carbon isotopes in lake sediments
Advantages & Disadvantages of using Polynomial Regression
Advantages of using Polynomial Regression
• A broad range of functions can be fit under it.
• Polynomial basically fits a wide range of curvatures.
• Polynomial provides the best approximation of the relationship between
dependent and independent variables.
Disadvantages of using Polynomial Regression
• These are too sensitive to outliers.
• The presence of one or two outliers in the data can seriously affect the results of
nonlinear analysis.
• In addition, there are unfortunately fewer model validation tools for the detection
of outliers in nonlinear regression than there are for linear regression.
Implementation of Polynomial Regression

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt

X = np.array([1,2,3,4,5]).reshape(-1,1)
y = np.array([1,4,9,16,25])

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

model = LinearRegression()
model.fit(X_poly, y)

y_pred = model.predict(X_poly)

plt.scatter(X, y, color='blue')
plt.plot(X, y_pred, color='red')
plt.title("Polynomial Regression")
plt.xlabel("X")
plt.ylabel("y")
plt.show()

Model Evaluation metrics


• Evaluation metrics for regression are essential for assessing the performance of
regression models specifically.
• These metrics help in measuring how well a regression model is able to predict
continuous outcomes.
• Common regression evaluation metrics include Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), R-squared (Coefficient of Determination), and Mean Absolute Percentage Error (MAPE).
• By utilizing these regression-specific metrics, data scientists and machine learning
engineers can evaluate the accuracy and effectiveness of their regression models in
making predictions.
• Some common regression metrics available in scikit-learn, with examples, are:
• Mean Absolute Error (MAE)
• Mean Squared Error (MSE)
• R-squared (R²) Score
• Root Mean Squared Error (RMSE)
Mean Absolute Error (MAE)
• In the fields of statistics and machine learning, the Mean Absolute Error (MAE) is
a frequently employed metric.
• It's a measurement of the typical absolute discrepancies between a dataset's
actual values and projected values.
• Mathematical Formula
• The formula to calculate MAE for a dataset with "n" data points is:
MAE = (1/n) Σ |yᵢ − ŷᵢ|
Example
from sklearn.metrics import mean_absolute_error
true_values = [2.5, 3.7, 1.8, 4.0, 5.2]
predicted_values = [2.1, 3.9, 1.7, 3.8, 5.0]
mae = mean_absolute_error(true_values, predicted_values)
print("Mean Absolute Error:", mae)
Output: Mean Absolute Error: 0.22000000000000003
Mean Squared Error (MSE)


• A popular metric in statistics and machine learning is the Mean Squared
Error (MSE).
• It measures the average of the squared differences between a dataset's
actual values and predicted values.
• MSE is frequently utilized in regression issues and is used to assess how well
predictive models work.
• Mathematical Formula
• For a dataset containing 'n' data points, the MSE calculation formula is:
MSE = (1/n) Σ (yᵢ − ŷᵢ)²
Example
from sklearn.metrics import mean_squared_error
true_values = [2.5, 3.7, 1.8, 4.0, 5.2]
predicted_values = [2.1, 3.9, 1.7, 3.8, 5.0]
mse = mean_squared_error(true_values, predicted_values)
print("Mean Squared Error:", mse)
Output: Mean Squared Error: 0.057999999999999996
R-squared (R²) Score
• A statistical metric frequently used to assess the goodness of fit of a regression
model is the R-squared (R2) score, also referred to as the coefficient of
determination.
• It quantifies the proportion of the dependent variable's variance that is explained by
the model's independent variables. R² is a useful statistic for evaluating the
overall effectiveness and explanatory power of a regression model.
• Mathematical Formula
The formula to calculate the R-squared score is as follows:
R² = 1 − (Σ(yᵢ − ŷᵢ)² / Σ(yᵢ − ȳ)²)
Example
from sklearn.metrics import r2_score
true_values = [2.5, 3.7, 1.8, 4.0, 5.2]
predicted_values = [2.1, 3.9, 1.7, 3.8, 5.0]
r2 = r2_score(true_values, predicted_values)
print("R-squared (R²) Score:", r2)
Output: R-squared (R²) Score: 0.9588769143505389
Root Mean Squared Error (RMSE)
• RMSE stands for Root Mean Squared Error. It is a commonly used metric in regression
analysis and machine learning to measure the accuracy or goodness of fit of a
predictive model, especially when the predictions are continuous numerical
values.
• The RMSE quantifies how well the predicted values from a model align with the
actual observed values in the dataset. Here's how it works:
• Calculate the Squared Differences: For each data point, subtract the
predicted value from the actual (observed) value, square the result, and
sum up these squared differences.
• Compute the Mean: Divide the sum of squared differences by the number
of data points to get the mean squared error (MSE).
• Take the Square Root: To obtain the RMSE, simply take the square root of
the MSE.
The formula for RMSE for a dataset with 'n' data points is as follows:
RMSE = √( (1/n) Σ (yᵢ − ŷᵢ)² )
RMSE – Example
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np
# Sample data
true_prices = np.array([250000, 300000, 200000, 400000, 350000])
predicted_prices = np.array([240000, 310000, 210000, 380000, 340000])
# Calculate RMSE
rmse = np.sqrt(mean_squared_error(true_prices, predicted_prices))
print("Root Mean Squared Error (RMSE):", rmse)

Output: Root Mean Squared Error (RMSE): 12649.110640673518

Classification of Supervised learning

Classification
• Classification is a supervised machine learning method where the model tries to
predict the correct label of a given input data.
• In classification, the model is fully trained using the training data, and then it is
evaluated on test data before being used to perform prediction on new unseen
data.
• For instance, an algorithm can learn to predict whether a given email is spam or
ham (not spam).
Types of Classification
1. Binary Classification
This is the simplest kind of classification. In binary classification, the goal is to sort
the data into two distinct categories. Think of it like a simple choice between two
options. Imagine a system that sorts emails into either spam or not spam. It
works by looking at different features of the email like certain keywords or
sender details, and decides whether it’s spam or not. It only chooses between these
two options.
2. Multiclass Classification
Here, instead of just two categories, the data needs to be sorted into more than
two categories. The model picks the one that best matches the input. Think of an
image recognition system that sorts pictures of animals into categories
like cat, dog, and bird.
Basically, machine looks at the features in the image (like shape, color, or
texture) and chooses which animal the picture is most likely to be based on
the training it received.
Binary classification vs Multi class classification


3. Multi-Label Classification
In multi-label classification single piece of data can belong to multiple
categories at once. Unlike multiclass classification where each data point belongs
to only one class, multi-label classification allows datapoints to belong to
multiple classes. A movie recommendation system could tag a movie as
both action and comedy. The system checks various features (like movie plot,
actors, or genre tags) and assigns multiple labels to a single piece of data, rather
than just one.
Multilabel classification is relevant in specific use cases, but not as crucial for a
starting overview of classification.
How does Classification in Machine Learning Work?
Classification involves training a model using a labeled dataset, where each input is paired
with its correct output label. The model learns patterns and relationships in the data, so
it can later predict labels for new, unseen inputs.
In machine learning, classification works by training a model to learn patterns from
labeled data, so it can predict the category or class of new, unseen data. Here's how it
works:
1. Data Collection: You start with a dataset where each item is labeled with the
correct class (for example, "cat" or "dog").
2. Feature Extraction: The system identifies features (like color, shape, or texture)
that help distinguish one class from another. These features are what the model
uses to make predictions.
3. Model Training: The classification algorithm uses the labeled data
to learn how to map the features to the correct class. It looks for patterns and
relationships in the data.
4. Model Evaluation: Once the model is trained, it's tested on new, unseen data to
check how accurately it can classify the items.
5. Prediction: After being trained and evaluated, the model can be used to predict
the class of new data based on the features it has learned.
6. Model Evaluation: Evaluating a classification model is a key step in machine
learning. It helps us check how well the model performs and how good it is at
handling new, unseen data. Depending on the problem and needs we can use
different metrics to measure its performance.

Examples of Machine Learning Classification in Real Life


• Credit risk assessment: Algorithms predict whether a loan applicant is likely to
default by analyzing factors such as credit score, income, and loan history. This
helps banks make informed lending decisions and minimize financial risk.
• Medical diagnosis : Machine learning models classify whether a patient has a
certain condition (e.g., cancer or diabetes) based on medical data such as test
results, symptoms, and patient history. This aids doctors in making quicker, more
accurate diagnoses, improving patient care.
• Image classification : Applied in fields such as facial recognition, autonomous
driving, and medical imaging.
• Sentiment analysis: Determining whether the sentiment of a piece of text is
positive, negative, or neutral. Businesses use this to understand customer
opinions, helping to improve products and services.
• Fraud detection : Algorithms detect fraudulent activities by analyzing transaction
patterns and identifying anomalies crucial in protecting against credit card fraud
and other financial crimes.
• Recommendation systems : Used to recommend products or content based on
past user behavior, such as suggesting movies on Netflix or products on Amazon.
This personalization boosts user satisfaction and sales for businesses.
Classification Algorithms
For the implementation of any classification model, it is essential to
understand Logistic Regression, which is one of the most fundamental and
widely used algorithms in machine learning for classification tasks. There are
various types of classifier algorithms.

Logistic Regression
• Logistic Regression is a supervised machine learning algorithm used for
classification problems. Unlike linear regression which predicts continuous values
it predicts the probability that an input belongs to a specific class.
• It is used for binary classification where the output can be one of two possible
categories such as Yes/No, True/False or 0/1.
• It uses sigmoid function to convert inputs into a probability value between 0 and
1.
Types of Logistic Regression
Logistic regression can be classified into three main types based on the nature of the
dependent variable:
1. Binomial Logistic Regression: This type is used when the dependent variable
has only two possible categories. Examples include Yes/No, Pass/Fail or 0/1. It is
the most common form of logistic regression and is used for binary classification
problems.
2. Multinomial Logistic Regression: This is used when the dependent variable has
three or more possible categories that are not ordered. For example, classifying
animals into categories like "cat," "dog" or "sheep." It extends the binary logistic
regression to handle multiple classes.
3. Ordinal Logistic Regression: This type applies when the dependent variable has
three or more categories with a natural order or ranking. Examples include ratings
like "low," "medium" and "high." It takes the order of the categories into account
when modeling.
Understanding Sigmoid Function


1. The sigmoid function is an important part of logistic regression, used to convert
the raw output of the model into a probability value between 0 and 1.
2. This function takes any real number and maps it into the range 0 to 1 forming an "S"
shaped curve called the sigmoid curve or logistic curve. Because probabilities must lie
between 0 and 1, the sigmoid function is perfect for this purpose.
3. In logistic regression, we use a threshold value usually 0.5 to decide the class label.
• If the sigmoid output is same or above the threshold, the input is classified as Class
1.
• If it is below the threshold, the input is classified as Class 0.
This approach helps to transform continuous input values into meaningful class
predictions.
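The following short NumPy sketch illustrates the sigmoid function and the 0.5 threshold rule described above; the raw model outputs are made-up values.

import numpy as np

def sigmoid(z):
    # Maps any real number into the range (0, 1)
    return 1 / (1 + np.exp(-z))

raw_outputs = np.array([-3.0, -0.5, 0.0, 1.2, 4.0])     # illustrative raw model outputs
probabilities = sigmoid(raw_outputs)
predicted_classes = (probabilities >= 0.5).astype(int)  # threshold at 0.5

print("Probabilities:", probabilities)
print("Predicted classes:", predicted_classes)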
Key Terminologies of Logistic Regression
• Independent Variables
These are the features or factors used to predict the model's outcome. They are
the inputs that help determine the value of the dependent variable. For instance,
independent variables might include age, income, and past buying behavior in a
model predicting whether a customer will purchase a product.
• Dependent Variable
This is the outcome the model is trying to predict. In logistic regression, the
dependent variable is binary, meaning it has two possible values, such as "yes" or
"no," "spam" or "not spam." The goal is to estimate the probability of this variable
being in one category versus the other.
• Logistic Function
The logistic function is a formula that converts the model’s input into a probability
score between 0 and 1. This score indicates the likelihood of the dependent
variable being 1. It’s what turns the raw predictions into meaningful probabilities
that can be used for classification.
• Odds
Odds represent the ratio of the probability of an event happening to the probability
of it not happening. For example, if there’s a 75% chance of an event occurring, the
odds are 3 to 1. This concept helps to understand how likely an event is compared
to it not happening.
• Log-Odds
Log-odds, or the logit function, is the natural logarithm of the odds. In logistic
regression, the relationship between the independent variables and the
dependent variable is expressed through log-odds. This helps model how changes
in the independent variables affect the likelihood of the outcome.
• Coefficient
Coefficients are the values that show how each independent variable influences
the dependent variable. They indicate the strength and direction of the
relationship. For example, a positive coefficient means that as the independent
variable increases, the likelihood of the dependent variable being 1 also increases.
• Intercept
The intercept is a constant term in the model representing the dependent
variable's log odds when all the independent variables are zero. It provides a
baseline level of the dependent variable’s probability before considering the
effects of the independent variables.
• Maximum Likelihood Estimation
Maximum likelihood estimation (MLE) is the method used to find the best-fitting
coefficients for the model. It determines the values that make the observed data
most probable under the logistic regression framework, ensuring the model
provides the most accurate predictions based on the given data.
Working of Logistic Regression
Consider the following example: An organization wants to determine an
employee’s salary increase based on their performance.
For this purpose, a linear regression algorithm will help them decide. Plotting a
regression line by considering the employee’s performance as the independent
variable, and the salary increase as the dependent variable will make their task
easier.

Now, what if the organization wants to know whether an employee would get a
promotion or not based on their performance? The above linear graph won’t be
suitable in this case. As such, we clip the line at zero and one, and convert it into a
sigmoid curve (S curve).

Based on the threshold values, the organization can decide whether an employee
will get a salary increase or not.
To understand logistic regression, let’s go over the odds of success.
Odds (θ) = Probability of an event happening / Probability of an event not happening
θ = p / (1 - p)
The values of odds range from zero to ∞, while the values of probability lie between
zero and one.
Consider the equation of a straight line:
y = β0 + β1x
Here, β0 is the y-intercept,
β1 is the slope of the line,
x is the value of the x coordinate, and
y is the value of the prediction.
Now to predict the odds of success, we use the following formula:
log(p(x) / (1 - p(x))) = β0 + β1x
Exponentiating both sides, we have:
p(x) / (1 - p(x)) = e^(β0 + β1x)
Let Y = e^(β0 + β1x)
Then p(x) / (1 - p(x)) = Y
p(x) = Y(1 - p(x))
p(x) = Y - Y·p(x)
p(x) + Y·p(x) = Y
p(x)(1 + Y) = Y
p(x) = Y / (1 + Y)

The equation of the sigmoid function is:
p(x) = 1 / (1 + e^(-(β0 + β1x)))
The curve obtained from this equation is the S-shaped sigmoid curve, which stays between 0 and 1.


Advantages of the Logistic Regression Algorithm


• Logistic regression performs better when the data is linearly separable
• It does not require too many computational resources and is highly interpretable
• It does not require scaling of the input features and needs little tuning
• It is easy to implement and train a model using logistic regression
• It gives a measure of how relevant a predictor (coefficient size) is, and its direction
of association (positive or negative).
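A minimal scikit-learn sketch of binary logistic regression is shown below; the toy dataset (hours studied vs. pass/fail) is assumed purely for illustration.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative data: hours studied -> pass (1) / fail (0)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)

print("Predicted class for 4.5 hours:", clf.predict([[4.5]])[0])
print("Probability of passing:", clf.predict_proba([[4.5]])[0][1])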

Differences Between Linear and Logistic Regression

• Linear regression is used to predict a continuous dependent variable using a given set of independent variables; logistic regression is used to predict a categorical dependent variable.
• Linear regression is used for solving regression problems; logistic regression is used for solving classification problems.
• In linear regression we predict the value of continuous variables; in logistic regression we predict the values of categorical variables.
• In linear regression we find the best-fit straight line; in logistic regression we find an S-curve.
• Linear regression uses the least squares estimation method to estimate the model parameters; logistic regression uses the maximum likelihood estimation method.
• The output of linear regression must be a continuous value, such as price or age; the output of logistic regression must be a categorical value, such as 0 or 1, Yes or No.
• Linear regression requires a linear relationship between the dependent and independent variables; logistic regression does not require a linear relationship.
• In linear regression there may be collinearity between the independent variables; in logistic regression there should be little to no collinearity between the independent variables.

K-Nearest Neighbor (KNN) Algorithm


• K-Nearest Neighbors (KNN) is a supervised machine learning algorithm generally
used for classification but can also be used for regression tasks.
• It works by finding the "k" closest data points (neighbors) to a given input and
makes predictions based on the majority class (for classification) or the average
value (for regression).
• Since KNN makes no assumptions about the underlying data distribution it makes
it a non-parametric and instance-based learning method.

K-Nearest Neighbors is also called a lazy learner algorithm because it does not learn
from the training set immediately; instead, it stores the dataset and performs an action
on it at the time of classification.
KNN Algorithm working visualization


The new point is classified as Category 2 because most of its closest neighbors are
blue squares. KNN assigns the category based on the majority of nearby points. The
image shows how KNN predicts the category of a new data point based on its closest
neighbours.
• The red diamonds represent Category 1 and the blue squares represent
Category 2.
• The new data point checks its closest neighbors (circled points).
• Since the majority of its closest neighbors are blue squares (Category 2) KNN
predicts the new data point belongs to Category 2.
KNN works by using proximity and majority voting to make predictions.
What is 'K' in K Nearest Neighbour?
In the k-Nearest Neighbours algorithm k is just a number that tells the algorithm
how many nearby points or neighbors to look at when it makes a decision.
Example: Imagine you're deciding which fruit it is based on its shape and size. You
compare it to fruits you already know.
• If k = 3, the algorithm looks at the 3 closest fruits to the new one.
• If 2 of those 3 fruits are apples and 1 is a banana, the algorithm says the
new fruit is an apple because most of its neighbors are apples.
How to choose the value of k for KNN Algorithm?


• The value of k in KNN decides how many neighbors the algorithm looks at
when making a prediction.
• Choosing the right k is important for good results.
• If the data has lots of noise or outliers, using a larger k can make the predictions
more stable.
• But if k is too large the model may become too simple and miss important
patterns and this is called underfitting.
• So k should be picked carefully based on the data.
Statistical Methods for Selecting k
• Cross-Validation: A good way to find the best value of k is k-fold cross-validation.
This means dividing the dataset into several parts, training the model on some of
these parts and testing it on the remaining ones, and repeating this process for each
part. The k value that gives the highest average accuracy during these tests is
usually the best one to use (see the sketch after this list).
• Elbow Method: In Elbow Method we draw a graph showing the error rate or
accuracy for different k values. As k increases the error usually drops at first.
But after a certain point error stops decreasing quickly. The point where the
curve changes direction and looks like an "elbow" is usually the best choice for
k.
• Odd Values for k: It’s a good idea to use an odd number for k especially in
classification problems. This helps avoid ties when deciding which class is the
most common among the neighbors.
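A small cross-validation sketch for choosing k is given below, assuming scikit-learn and its built-in Iris dataset; the candidate k values are arbitrary.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try several odd k values and keep the one with the best cross-validated accuracy
for k in [1, 3, 5, 7, 9, 11]:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=5)
    print(f"k={k}: mean accuracy = {scores.mean():.3f}")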
Distance Metrics Used in KNN Algorithm
KNN uses distance metrics to identify the nearest neighbors; these neighbors are then used
for the classification or regression task. To identify the nearest neighbors, we use the
distance metrics below:
1. Euclidean Distance
Euclidean distance is defined as the straight-line distance between two points in a plane
or space. You can think of it like the shortest path you would walk if you were to go directly
from one point to another.
d(x, y) = √( Σᵢ (xᵢ − yᵢ)² )
2. Manhattan Distance
This is the total distance you would travel if you could only move along horizontal and
vertical lines like a grid or city streets. It’s also called "taxicab distance" because a taxi can
only drive along the grid-like streets of a city.
d(x, y) = Σᵢ |xᵢ − yᵢ|
3. Minkowski Distance
Minkowski distance is like a family of distances, which includes both Euclidean and
Manhattan distances as special cases.
d(x, y) = ( Σᵢ |xᵢ − yᵢ|^p )^(1/p)
From the formula above, when p=2, it becomes the same as the Euclidean distance
formula and when p=1, it turns into the Manhattan distance formula. Minkowski distance
is essentially a flexible formula that can represent either Euclidean or Manhattan distance
depending on the value of p.
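The following NumPy sketch compares the three distance metrics on two example points; the points are arbitrary.

import numpy as np

def minkowski_distance(p1, p2, p=2):
    # p=1 gives Manhattan distance, p=2 gives Euclidean distance
    return np.sum(np.abs(np.array(p1) - np.array(p2)) ** p) ** (1 / p)

a, b = [1, 2], [4, 6]
print("Manhattan (p=1):", minkowski_distance(a, b, p=1))   # 3 + 4 = 7
print("Euclidean (p=2):", minkowski_distance(a, b, p=2))   # sqrt(9 + 16) = 5
print("Minkowski (p=3):", minkowski_distance(a, b, p=3))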
Working of KNN algorithm
The K-Nearest Neighbors (KNN) algorithm operates on the principle of similarity, where
it predicts the label or value of a new data point by considering the labels or values of its
K nearest neighbors in the training dataset.

Step 1: Selecting the optimal value of K


• K represents the number of nearest neighbors that need to be considered while
making a prediction.
Step 2: Calculating distance
• To measure the similarity between target and training data points Euclidean
distance is used. Distance is calculated between data points in the dataset and
target point.
Step 3: Finding Nearest Neighbors
• The k data points with the smallest distances to the target point are nearest
neighbors.
Step 4: Voting for Classification or Taking Average for Regression
• When you want to classify a data point into a category like spam or not spam, the
KNN algorithm looks at the K closest points in the dataset. These closest points are
called neighbors. The algorithm then looks at which category the neighbors belong
to and picks the one that appears the most. This is called majority voting.
• In regression, the algorithm still looks for the K closest points. But instead of voting
for a class in classification, it takes the average of the values of those K neighbors.
This average is the predicted value for the new point for the algorithm.
For example, a test point is classified based on its nearest neighbours: the algorithm
identifies the closest k data points (say k = 5) and assigns the test point the majority
class label among them.
Python Implementation of KNN Algorithm
1. Importing Libraries
Counter is used to count the occurrences of elements in a list or iterable. In KNN after
finding the k nearest neighbor labels Counter helps count how many times each label
appears.
import numpy as np
from collections import Counter
2. Defining the Euclidean Distance Function
euclidean_distance is used to calculate the Euclidean distance between two points.
def euclidean_distance(point1, point2):
    return np.sqrt(np.sum((np.array(point1) - np.array(point2))**2))
3. KNN Prediction Function
• distances.append saves how far each training point is from the test point, along
with its label.
• distances.sort is used to sort the list so the nearest points come first.
• k_nearest_labels picks the labels of the k closest points.
• Counter is used to find which label appears most often among those k labels; that label becomes the prediction.
def knn_predict(training_data, training_labels, test_point, k):
    distances = []
    for i in range(len(training_data)):
        dist = euclidean_distance(test_point, training_data[i])
        distances.append((dist, training_labels[i]))
    distances.sort(key=lambda x: x[0])
    k_nearest_labels = [label for _, label in distances[:k]]
    return Counter(k_nearest_labels).most_common(1)[0][0]
4. Training Data, Labels and Test Point
training_data = [[1, 2], [2, 3], [3, 4], [6, 7], [7, 8]]
training_labels = ['A', 'A', 'A', 'B', 'B']
test_point = [4, 5]
k=3
5. Prediction
prediction = knn_predict(training_data, training_labels, test_point, k)
print(prediction)
Output:
A
The algorithm calculates the distances from the test point [4, 5] to all training points,
selects the 3 closest points (since k = 3) and determines their labels. Since the majority of
the closest points are labelled 'A', the test point is classified as 'A'.
Note: In practice we can also use the scikit-learn Python library, which has built-in
functions for building a KNN model, as shown in the sketch below.
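A minimal sketch using scikit-learn's KNeighborsClassifier on the same toy data as above:

from sklearn.neighbors import KNeighborsClassifier

training_data = [[1, 2], [2, 3], [3, 4], [6, 7], [7, 8]]
training_labels = ['A', 'A', 'A', 'B', 'B']

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(training_data, training_labels)

# Expected to agree with the manual implementation above: ['A']
print(knn.predict([[4, 5]]))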
Applications of KNN
• Recommendation Systems: Suggests items like movies or products by finding
users with similar preferences.
• Spam Detection: Identifies spam emails by comparing new emails to known spam
and non-spam examples.
• Customer Segmentation: Groups customers by comparing their shopping
behavior to others.
• Speech Recognition: Matches spoken words to known patterns to convert them
into text.
Advantages of KNN
• Simple to use: Easy to understand and implement.
• No training step: No need to train as it just stores the data and uses it during
prediction.
• Few parameters: Only needs to set the number of neighbors (k) and a distance
method.
• Versatile: Works for both classification and regression problems.
Disadvantages of KNN
• Slow with large data: Needs to compare every point during prediction.
• Struggles with many features: Accuracy drops when data has too many features.
• Can Overfit: It can overfit especially when the data is high-dimensional or not
clean.
Decision Tree in Machine Learning
A decision tree is a supervised learning algorithm used for both classification and
regression tasks. It has a hierarchical tree structure which consists of a root node,
branches, internal nodes and leaf nodes. It works like a flowchart that helps make decisions
step by step, where:
• Internal nodes represent attribute tests
• Branches represent attribute values
• Leaf nodes represent final decisions or predictions.
Decision trees are widely used due to their interpretability, flexibility and low
preprocessing needs.
How Does a Decision Tree Work?
A decision tree splits the dataset based on feature values to create pure subsets, where ideally all
items in a group belong to the same class. Each leaf node of the tree corresponds to a class
label and the internal nodes are feature-based decision points. Let’s understand this with
an example.
Let’s consider a decision tree for predicting whether a customer will buy a product based
on age, income and previous purchases: Here's how the decision tree works:
1. Root Node (Income)
First Question: "Is the person’s income greater than $50,000?"
• If Yes, proceed to the next question.
• If No, predict "No Purchase" (leaf node).
2. Internal Node (Age):
If the person’s income is greater than $50,000, ask: "Is the person’s age above 30?"
• If Yes, proceed to the next question.
• If No, predict "No Purchase" (leaf node).
3. Internal Node (Previous Purchases):
• If the person is above 30 and has made previous purchases, predict "Purchase"
(leaf node).
• If the person is above 30 and has not made previous purchases, predict "No
Purchase" (leaf node).

Decision Making with Two Decision Trees


Example: Predicting Whether a Customer Will Buy a Product Using Two Decision Trees
Tree 1: Customer Demographics
The first tree asks two questions:


1. "Income > $50,000?"
• If Yes, Proceed to the next question.
• If No, "No Purchase"
2. "Age > 30?"
• Yes: "Purchase"
• No: "No Purchase"
Tree 2: Previous Purchases
"Previous Purchases > 0?"
• Yes: "Purchase"
• No: "No Purchase"
Once we have predictions from both trees, we can combine the results to make a final
prediction. If Tree 1 predicts "Purchase" and Tree 2 predicts "No Purchase", the final
prediction might be "Purchase" or "No Purchase" depending on the weight or confidence
assigned to each tree. This can be decided based on the problem context.

Understanding Decision Trees with a Real-Life Use Case

So far we have covered the attributes and components of a decision tree. Now let's walk
through a real-life use case showing how a decision tree works step by step.
Step 1. Start with the Whole Dataset
We begin with all the data which is treated as the root node of the decision tree.
Step 2. Choose the Best Question (Attribute)
Pick the best question to divide the dataset. For example ask: "What is the outlook?"
Possible answers: Sunny, Cloudy or Rainy.
Step 3. Split the Data into Subsets
Divide the dataset into groups based on the question:
• If Sunny go to one subset.
• If Cloudy go to another subset.
• If Rainy go to the last subset.
Step 4. Split Further if Needed (Recursive Splitting)


For each subset ask another question to refine the groups. For example If the Sunny subset
is mixed ask: "Is the humidity high or normal?"
• High humidity → "Swimming".
• Normal humidity → "Hiking".
Step 5. Assign Final Decisions (Leaf Nodes)
When a subset contains only one activity, stop splitting and assign it a label:
• Cloudy → "Hiking".
• Rainy → "Stay Inside".
• Sunny + High Humidity → "Swimming".
• Sunny + Normal Humidity → "Hiking".
Step 6. Use the Tree for Predictions
To predict an activity follow the branches of the tree. Example: If the outlook is Sunny and
the humidity is High follow the tree:
• Start at Outlook.
• Take the branch for Sunny.
• Then go to Humidity and take the branch for High Humidity.
• Result: "Swimming".
A decision tree works by breaking down data step by step, asking the best possible
question at each point and stopping once it reaches a clear decision. It's an easy and
understandable way to make choices. Because of their simple and clear structure, decision
trees are very helpful in machine learning for tasks like sorting data into categories or
making predictions.
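For completeness, here is a minimal scikit-learn decision tree sketch; it uses the library's built-in Iris dataset rather than the weather example above.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

print("Test accuracy:", tree.score(X_test, y_test))
print(export_text(tree, feature_names=iris.feature_names))  # prints the learned decision rules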

Random Forest Algorithm in Machine Learning


Random Forest is a machine learning algorithm that uses many decision trees to make
better predictions. Each tree looks at different random parts of the data and their results
are combined by voting for classification or averaging for regression. This helps in
improving accuracy and reducing errors.

Working of Random Forest Algorithm


• Create Many Decision Trees: The algorithm makes many decision trees each
using a random part of the data. So every tree is a bit different.
• Pick Random Features: When building each tree it doesn’t look at all the features
(columns) at once. It picks a few at random to decide how to split the data. This
helps the trees stay different from each other.
• Each Tree Makes a Prediction: Every tree gives its own answer or prediction
based on what it learned from its part of the data.
• Combine the Predictions:
o For classification, the final answer is the category that most trees agree on, i.e., majority voting.
o For regression, the final answer is the average of all the trees' predictions.
• Why It Works Well: Using random data and features for each tree helps avoid
overfitting and makes the overall prediction more accurate and trustworthy.
Random forest is an example of an ensemble learning technique.
Key Features of Random Forest
• Handles Missing Data: It can work even if some data is missing so you don’t
always need to fill in the gaps yourself.
• Shows Feature Importance: It tells you which features (columns) are most useful
for making predictions which helps you understand your data better.
• Works Well with Big and Complex Data: It can handle large datasets with many
features without slowing down or losing accuracy.
• Used for Different Tasks: You can use it for both classification like predicting
types or labels and regression like predicting numbers or amounts.
Assumptions of Random Forest
• Each tree makes its own decisions: Every tree in the forest makes its own
predictions without relying on others.
• Random parts of the data are used: Each tree is built using random samples and
features to reduce mistakes.
• Enough data is needed: Sufficient data ensures the trees are different and learn
unique patterns and variety.
• Different predictions improve accuracy: Combining the predictions from
different trees leads to a more accurate final result.
Implementing Random Forest for Classification Tasks
Here we will predict whether a passenger survived the Titanic disaster.
• Import libraries and load the Titanic dataset.
• Remove rows with missing target values ('Survived').
• Select features like class, sex, age, etc and convert 'Sex' to numbers.
• Fill missing age values with the median.
• Split the data into training and testing sets, then train a Random Forest model.
• Predict on test data, check accuracy and print a sample prediction result.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
import warnings
warnings.filterwarnings('ignore')

url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
titanic_data = pd.read_csv(url)

titanic_data = titanic_data.dropna(subset=['Survived'])

X = titanic_data[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']].copy()
y = titanic_data['Survived']

X['Sex'] = X['Sex'].map({'female': 0, 'male': 1})   # encode 'Sex' as 0/1
X['Age'] = X['Age'].fillna(X['Age'].median())       # fill missing ages with the median

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)

y_pred = rf_classifier.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy:.2f}")
print("\nClassification Report:\n", classification_rep)

sample = X_test.iloc[0:1]
prediction = rf_classifier.predict(sample)
sample_dict = sample.iloc[0].to_dict()
print(f"\nSample Passenger: {sample_dict}")
print(f"Predicted Survival: {'Survived' if prediction[0] == 1 else 'Did Not Survive'}")

We evaluated the model's performance using a classification report to see how well it
predicts the outcomes, and used a random sample to check the model's prediction.
Implementing Random Forest for Regression Tasks
We will do house price prediction here.
• Load the California housing dataset and create a DataFrame with features and
target.
• Separate the features and the target variable.
• Split the data into training and testing sets (80% train, 20% test).
• Initialize and train a Random Forest Regressor using the training data.
• Predict house values on test data and evaluate using MSE and R² score.
• Print a sample prediction and compare it with the actual value.
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

california_housing = fetch_california_housing()
california_data = pd.DataFrame(california_housing.data,
                               columns=california_housing.feature_names)
california_data['MEDV'] = california_housing.target

X = california_data.drop('MEDV', axis=1)
y = california_data['MEDV']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)
rf_regressor.fit(X_train, y_train)

y_pred = rf_regressor.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

single_data = X_test.iloc[0].values.reshape(1, -1)
predicted_value = rf_regressor.predict(single_data)

print(f"Predicted Value: {predicted_value[0]:.2f}")
print(f"Actual Value: {y_test.iloc[0]:.2f}")
print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared Score: {r2:.2f}")

We evaluated the model's performance using the Mean Squared Error and R-squared
score, which show how accurate the predictions are, and used a random sample to
check the model's prediction.
Advantages of Random Forest
• Random Forest provides very accurate predictions even with large datasets.
• Random Forest can handle missing data well without compromising accuracy.
• It doesn't require normalization or standardization of the dataset.
• When we combine multiple decision trees it reduces the risk of overfitting of the
model.
Limitations of Random Forest
• It can be computationally expensive especially with a large number of trees.
• It’s harder to interpret the model compared to simpler models like decision trees.

Evaluation Metrics in Machine Learning


• When building machine learning models, it’s important to understand how well
they perform.
• Evaluation metrics help us to measure the effectiveness of our models. Whether
we are solving a classification problem, predicting continuous values or clustering
data, selecting the right evaluation metric allows us to assess how well the model
meets our goals.
Classification Metrics
Classification problems aim to predict discrete categories. To evaluate the
performance of classification models, we use the following metrics:
1. Accuracy
Accuracy is a fundamental metric used for evaluating the performance of a
classification model. It tells us the proportion of correct predictions made
by the model out of all predictions.
Accuracy = (Number of correct predictions) / (Total number of predictions) = (TP + TN) / (TP + TN + FP + FN)
While accuracy provides a quick snapshot, it can be misleading in cases of
imbalanced datasets. For example, in a dataset with 90% class A and 10%
class B, a model predicting only class A will still achieve 90% accuracy, but
it will fail to identify any class B instances.
Accuracy alone can therefore give a false sense of good performance, because
the chance of misclassifying minority-class samples is high.
2. Precision
It measures how many of the positive predictions made by the model are
actually correct. It's useful when the cost of false positives is high such as
in medical diagnoses where predicting a disease when it’s not present can
have serious consequences.
Precision = TP / (TP + FP)
Where:
• TP = True Positives
• FP = False Positives
Precision helps ensure that when the model predicts a positive outcome, it’s likely
to be correct.
3. Recall
Recall or Sensitivity measures how many of the actual positive cases were
correctly identified by the model. It is important when missing a positive
case (false negative) is more costly than false positives.
Recall = TP / (TP + FN)
Where:
• FN = False Negatives
4. F1 Score
The F1 Score is the harmonic mean of precision and recall. It is useful when we
need a balance between precision and recall as it combines both into a single
number. A high F1 score means the model performs well on both metrics. Its range
is [0,1].
A model with higher precision but lower recall misses a large number of actual positive
instances. The higher the F1 score, the better the model's overall performance. It can be
expressed mathematically as:
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
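A short scikit-learn sketch computing these classification metrics on made-up true and predicted labels:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:", recall_score(y_true, y_pred))
print("F1 Score:", f1_score(y_true, y_pred))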
Area Under Curve (AUC) and ROC Curve


It is useful for binary classification tasks. The AUC value represents the probability
that the model will rank a randomly chosen positive example higher than a
randomly chosen negative example. AUC ranges from 0 to 1 with higher values
showing better model performance.
1. True Positive Rate(TPR)
Also known as sensitivity or recall, the True Positive Rate measures how many
actual positive instances were correctly identified by the model. It answers the
question: "Out of all the actual positive cases, how many did the model correctly
identify?"
Formula:
TPR = TP / (TP + FN)
Where:
• TP = True Positives (correctly predicted positive cases)
• FN = False Negatives (actual positive cases incorrectly predicted as negative)
2. True Negative Rate(TNR)
Also called specificity, the True Negative Rate measures how many actual negative
instances were correctly identified by the model. It answers the question: "Out of
all the actual negative cases, how many did the model correctly identify as
negative?"
Formula:
TNR = TN / (TN + FP)
Where:
• TN = True Negatives (correctly predicted negative cases)
• FP = False Positives (actual negative cases incorrectly predicted as positive)
3. False Positive Rate(FPR)


It measures how many actual negative instances were incorrectly classified as
positive. It’s a key metric when the cost of false positives is high such as in fraud
detection.
Formula:
FPR = FP / (FP + TN)
Where:
• FP = False Positives (incorrectly predicted positive cases)
• TN = True Negatives (correctly predicted negative cases)
4. False Negative Rate(FNR)
It measures how many actual positive instances were incorrectly classified as
negative. It answers: "Out of all the actual positive cases, how many were
misclassified as negative?"
Formula:
FNR = FN / (FN + TP)
Where:
• FN = False Negatives (incorrectly predicted negative cases)
• TP = True Positives (correctly predicted positive cases)
ROC Curve
It is a graphical representation of the True Positive Rate (TPR) vs the False
Positive Rate (FPR) at different classification thresholds. The curve helps
us visualize the trade-offs between sensitivity (TPR) and specificity (1 -
FPR) across various thresholds. Area Under Curve (AUC) quantifies the
overall ability of the model to distinguish between positive and negative
classes.
• AUC = 1: Perfect model (always correctly classifies positives and negatives).
• AUC = 0.5: Model performs no better than random guessing.
• AUC < 0.5: Model performs worse than random guessing (showing that the
model is inverted).
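A minimal sketch of computing the ROC curve and AUC with scikit-learn, assuming a synthetic binary dataset and a logistic regression classifier chosen only for illustration:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical binary classification data
X, y = make_classification(n_samples=500, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_scores = model.predict_proba(X_test)[:, 1]        # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, y_scores)  # FPR and TPR at each threshold
print("AUC:", roc_auc_score(y_test, y_scores))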
Model Training and Evaluation


Train Test Split
• Train test split is a model validation procedure that allows you to simulate how
a model would perform on new/unseen data by splitting a dataset into a
training set and a testing set.
• The training set is the data used to train the model, and the testing set (data which
is new to the model) is used to test the model’s performance and accuracy.
• A train test split can also involve splitting data into a validation set, which is
data used to fine-tune hyperparameters and optimize the model during the
training process.
Train test split procedure


1. Arrange the Data
• Make sure your data is arranged into a format acceptable for train test split.
In scikit-learn, this consists of separating your full dataset into “Features” and
“Target.”
2. Split the Data
• Split the dataset into two pieces: a training set and a testing set. This consists
of randomly sampling, without replacement, about 75 percent of the rows (you can
vary this) and putting them into your training set. The remaining 25 percent goes
into your test set. The “Features” and “Target” data are split into “X_train,”
“X_test,” “y_train,” and “y_test” for a particular train test split.
3. Train the Model
• Train the model on the training set (“X_train” and “y_train”).
4. Test the Model
• Test the model on the testing set (“X_test” and “y_test”) and evaluate the
performance.
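A minimal sketch of these four steps using scikit-learn's train_test_split; the Iris dataset and the logistic regression model are only illustrative stand-ins:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Arrange the data into Features (X) and Target (y)
X, y = load_iris(return_X_y=True)

# 2. Split the data: 75% training, 25% testing (random sampling without replacement)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# 3. Train the model on the training set
model = LogisticRegression(max_iter=200).fit(X_train, y_train)

# 4. Test the model on the testing set and evaluate performance
y_pred = model.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))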
Methods for Splitting Data in a Train Test Split
1. Random Splitting
Random splitting involves randomly shuffling data and splitting it into training and
testing sets based on given percentages (like 75% training and 25% testing). This
is one of the most popular methods for splitting data because it is simple, easy to
implement, and used by default in the scikit-learn train_test_split() method.
Random splitting is most effective for large,
diverse datasets where categories are generally represented equally in the data.
2. Stratified Splitting
Stratified splitting divides a dataset in a way that preserves its proportion of
classes or categories. This creates training and testing sets with class proportions
representative of the original dataset. Using stratified splitting can prevent
model bias, and is most effective for imbalanced datasets or datasets where
categories aren’t represented equally. In a scikit-learn train test split, stratified
splitting can be used by specifying the stratify parameter in
the train_test_split() method.
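A minimal sketch of stratified splitting, assuming a small hypothetical imbalanced label array; passing the target to the stratify parameter keeps the original class proportions in both sets:
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 samples of class 0, 10 samples of class 1
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)   # dummy feature, only for illustration

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Both sets preserve roughly the 90/10 class ratio
print("Train class counts:", np.bincount(y_train))
print("Test class counts :", np.bincount(y_test))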
3. Time-Based Splitting
Time-based splitting involves organizing data in a set by points in time, ensuring
past data is in the training set and future or later data is in the testing set. Splitting
data based on time works to simulate real-world scenarios (for example,
predicting future financial or market trends) and allows for time series
analysis on time series datasets. However, one drawback to time-based splitting is
that it may not fully capture trends for non-stationary data (data that continually
changes over time). In scikit-learn, time series data can be split into training and
testing sets by using the TimeSeriesSplit() method.
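A minimal sketch of time-based splitting with scikit-learn's TimeSeriesSplit, assuming the rows are already ordered by time:
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical time-ordered data: 10 observations
X = np.arange(10).reshape(-1, 1)
y = np.arange(10)

tscv = TimeSeriesSplit(n_splits=3)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X), start=1):
    # Training indices always come before the test indices in time
    print(f"Fold {fold}: train={train_idx}, test={test_idx}")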

Cross Validation in Machine Learning


• Cross-validation is a technique used to check how well a machine learning model
performs on unseen data.
• It splits the data into several parts, trains the model on some parts and tests it on
the remaining part repeating this process multiple times.
• Finally the results from each validation step are averaged to produce a more
accurate estimate of the model's performance.
• The main purpose of cross-validation is to get a reliable estimate of how well the
model generalizes to unseen data and to help detect overfitting.
Types of Cross-Validation
1. Holdout Validation
In holdout validation we train on 50% of the given dataset and use the remaining 50%
for testing. It's a simple and quick way to evaluate a model. The major drawback is
that, because we train on only 50% of the dataset, the other 50% may contain important
information that the model never sees, which can lead to higher bias.
2. LOOCV (Leave One Out Cross Validation)
In this method we train on the whole dataset but leave out only one data point, and we
iterate this for each data point. In LOOCV the model is trained on n−1 samples and
tested on the one omitted sample, repeating the process for every data point in the
dataset. It has both advantages and disadvantages.
An advantage of using this method is that we make use of all data points and hence
it is low bias.
The major drawback is that it leads to higher variance in the test estimate because each
test set contains only one data point; if that point is an outlier, the variation
increases further.
Another drawback is it takes a lot of execution time as it iterates over the number
of data points we have.
3. Stratified Cross-Validation
It is a technique used in machine learning to ensure that each fold of the cross-
validation process maintains the same class distribution as the entire dataset. This
is particularly important when dealing with imbalanced datasets where certain
classes may be underrepresented. In this method:
• The dataset is divided into k folds while maintaining the proportion of classes in
each fold.
• During each iteration, one fold is used for testing and the remaining folds are used
for training.
• The process is repeated k times, with each fold serving as the test set exactly once.
Stratified Cross-Validation is essential when dealing with classification problems
where maintaining the balance of class distribution is crucial for the model to
generalize well to unseen data.
4. K-Fold Cross Validation
In K-Fold Cross Validation we split the dataset into k subsets, known as folds, then
train the model on k−1 of the folds and keep the remaining fold for evaluating the
trained model. We repeat this process k times, with a different fold reserved for
testing each time.
Note: A value of k = 10 is commonly suggested; a lower value of k behaves more like a
simple holdout validation, while a higher value of k approaches the LOOCV method.
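A minimal sketch of k-fold cross-validation using scikit-learn's cross_val_score; the Iris dataset, the decision tree classifier, and k = 5 are illustrative choices:
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: train on 4 folds, test on the remaining fold, repeat 5 times
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=kf)

print("Fold accuracies:", scores)
print("Mean accuracy  :", scores.mean())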
Hyperparameter Tuning using Grid Search CV
• Data-Driven decision-making has large involvement of Machine Learning
Algorithms.
• For a business problem, professionals never rely on a single algorithm. They apply
multiple relevant algorithms suited to the problem and select the best model based on
the performance metrics the models achieve. But this is not the end.
• One can increase the model performance using hyperparameters. Thus, finding
the optimal hyperparameters would help us achieve the best-performing model.
There are several techniques for choosing a model’s hyperparameters, including
Random Search, sklearn’s GridSearchCV, Manual Search, and Bayesian
Optimization. Among these, GridSearchCV is widely recognized for its efficiency in
tuning parameters.
Hyperparameter Tuning
• Hyperparameters are the parameters that are not learned from data. They are set
before the training process starts.
• Examples:
• C and kernel in SVM
• max_depth in Decision Trees
• k in KNN
• learning_rate in Gradient Boosting


Tuning hyperparameters means finding the best combination of these to improve
model performance.
Why use Grid Search CV?
• Manually trying different hyperparameter values is tedious and inefficient.
• Grid Search CV (Grid Search with Cross-Validation):
• Automates the process.
• Tries all possible combinations from the specified set of hyperparameters.
• Uses cross-validation to estimate how well each combination generalizes.
• It helps select the hyperparameter values that give the best performance on
unseen data.
How does Grid Search CV work?
• You define a parameter grid: a dictionary where the keys are hyperparameter
names and values are lists of values to try.
• It trains models on every possible combination of the parameter grid.
• For each combination, it performs cross-validation (e.g., 5-fold).
• It finds the combination that results in the best cross-validation score.
Example: Hyperparameter tuning of SVM using Grid Search CV
• Steps:
• Load data
• Define parameter grid
• Run GridSearchCV
• Get the best parameters & evaluate
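A minimal sketch of these steps, assuming the Iris dataset and an illustrative grid over C, kernel, and gamma for an SVM:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

# 1. Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# 2. Define parameter grid: hyperparameter names mapped to lists of candidate values
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf'], 'gamma': ['scale', 'auto']}

# 3. Run GridSearchCV with 5-fold cross-validation over every combination
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X_train, y_train)

# 4. Get the best parameters and evaluate on the held-out test set
print("Best parameters:", grid.best_params_)
print("Best CV score  :", grid.best_score_)
print("Test accuracy  :", grid.score(X_test, y_test))

By default GridSearchCV refits the best combination on the full training data, so grid.score() evaluates that refitted model.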
Advantages of Grid Search CV
• Finds optimal hyperparameter combinations.
• Performs systematic search.
• Cross-validation ensures model generalizes well.
Overfitting and Underfitting
• In Machine Learning, our goal is to build models that generalize well —
meaning they perform well not just on the training data, but also on new,
unseen data.
• Two common problems that prevent models from generalizing well are:
• Overfitting: Model learns too much, including noise and details that don’t matter.
• Underfitting: Model learns too little, missing important patterns.

Overfitting
• Overfitting occurs when a model learns not just the underlying patterns in the
training data, but also the random noise.
As a result, it performs extremely well on training data but fails to generalize to
new data.
• Imagine you are trying to predict student marks based on study hours.
If your model creates a very complex curve that passes through every single data
point exactly, it might capture random variations (like one student who studied
little but scored high).
On new data, this complex model fails badly.
Underfitting
• Underfitting occurs when a model is too simple to capture the underlying
structure of the data.
It fails to even perform well on the training data.
• Suppose you try to use a straight line (linear regression) to predict outcomes
where the data actually follows a curve.
The model misses important trends.
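A minimal sketch of both problems on hypothetical noisy, curved data: a degree-1 polynomial underfits, while a very high-degree polynomial tends to overfit (low training error, higher test error):
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

# Hypothetical curved data with noise
rng = np.random.RandomState(42)
X = np.sort(rng.uniform(0, 3, 60)).reshape(-1, 1)
y = np.sin(2 * X).ravel() + rng.normal(scale=0.2, size=60)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

for degree in [1, 4, 15]:   # too simple, reasonable, too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")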
Unit -3
Unsupervised Learning
Unsupervised learning is a branch of machine learning that deals with unlabeled data.
Unlike supervised learning, where the data is labeled with a specific category or outcome,
unsupervised learning algorithms are tasked with finding patterns and relationships
within the data without any prior knowledge of the data's meaning.
Unsupervised machine learning algorithms find hidden patterns in data without any
human intervention, i.e., we don't give outputs to our model. The training data contains
only input values, and the model discovers the groups or patterns on its own.

The image shows a set of animals: elephants, camels, and cows, representing the raw data
that the unsupervised learning algorithm will process.
• The "Interpretation" stage signifies that the algorithm doesn't have predefined
labels or categories for the data. It needs to figure out how to group or organize
the data based on inherent patterns.
• The algorithm is the core of the unsupervised learning process, using techniques
like clustering, dimensionality reduction, or anomaly detection to identify patterns
and structures in the data.
• The processing stage shows the algorithm working on the data.
The output shows the results of the unsupervised learning process. In this case, the
algorithm might have grouped the animals into clusters based on their species (elephants,
camels, cows).
Unsupervised Learning Algorithms
There are mainly 3 types of Algorithms which are used for Unsupervised dataset.
• Clustering
• Association Rule Learning
• Dimensionality Reduction
1. Clustering Algorithms
Clustering in unsupervised machine learning is the process of grouping unlabeled data into
clusters based on their similarities. The goal of clustering is to identify patterns and
relationships in the data without any prior knowledge of the data's meaning.
Broadly this technique is applied to group data based on different patterns, such as
similarities or differences, our machine model finds. These algorithms are used to process
raw, unclassified data objects into groups. For example, in the above figure, we have not
given output parameter values, so this technique will be used to group clients based on the
input parameters provided by our data.
Some common clustering algorithms:
• K-means Clustering: Groups data into K clusters based on how close the points are
to each other.
• Hierarchical Clustering: Creates clusters by building a tree step-by-step, either
merging or splitting groups.
• Density-Based Clustering (DBSCAN): Finds clusters in dense areas and treats
scattered points as noise.
• Mean-Shift Clustering: Discovers clusters by moving points toward the most
crowded areas.
• Spectral Clustering: Groups data by analyzing connections between points using
graphs.
2. Association Rule Learning
Association rule learning, also known as association rule mining, is a common technique
used to discover associations in unsupervised machine learning. This rule-based ML
technique finds useful relations between parameters of a large data set. It is mainly
used for market basket analysis, which helps to better understand the relationship
between different products.
For example, shopping stores use algorithms based on this technique to find the
relationship between the sale of one product and the sales of others based on customer
behavior: if a customer buys milk, they may also buy bread, eggs, or butter. Once trained
well, such models can be used to increase sales by planning different offers.
Some common Association Rule Learning algorithms:
• Apriori Algorithm: Finds patterns by exploring frequent item combinations step-
by-step.
• FP-Growth Algorithm: An Efficient Alternative to Apriori. It quickly identifies
frequent patterns without generating candidate sets.
• Eclat Algorithm: Uses intersections of itemsets to efficiently find frequent patterns.
• Efficient Tree-based Algorithms: Scale to handle large datasets by organizing data in
tree structures.
3. Dimensionality Reduction
Dimensionality reduction is the process of reducing the number of features in a dataset
while preserving as much information as possible. This technique is useful for improving
the performance of machine learning algorithms and for data visualization.
Imagine a dataset of 100 features about students (height, weight, grades, etc.). To focus on
key traits, you reduce it to just 2 features: height and grades, making it easier to visualize
or analyze the data.
Here are some popular Dimensionality Reduction algorithms:
• Principal Component Analysis (PCA): Reduces dimensions by transforming data
into uncorrelated principal components.
• Linear Discriminant Analysis (LDA): Reduces dimensions while maximizing class
separability for classification tasks.
• Non-negative Matrix Factorization (NMF): Breaks data into non-negative parts to
simplify representation.
• Locally Linear Embedding (LLE): Reduces dimensions while preserving the
relationships between nearby points.
• Isomap: Captures global data structure by preserving distances along a manifold.
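A minimal sketch of dimensionality reduction with PCA, using the Iris dataset (4 features reduced to 2 principal components) purely for illustration:
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)            # 4 original features
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)                    # keep 2 uncorrelated principal components
X_reduced = pca.fit_transform(X_scaled)

print("Reduced shape:", X_reduced.shape)                      # (150, 2)
print("Explained variance ratio:", pca.explained_variance_ratio_)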
Challenges of Unsupervised Learning
Here are the key challenges of unsupervised learning:
• Noisy Data: Outliers and noise can distort patterns and reduce the effectiveness
of algorithms.
• Assumption Dependence: Algorithms often rely on assumptions (e.g., cluster
shapes), which may not match the actual data structure.
• Overfitting Risk: Overfitting can occur when models capture noise instead of
meaningful patterns in the data.
• Limited Guidance: The absence of labels restricts the ability to guide the
algorithm toward specific outcomes.
• Cluster Interpretability: Results, such as clusters, may lack clear meaning or
alignment with real-world categories.
• Sensitivity to Parameters: Many algorithms require careful tuning of
hyperparameters, such as the number of clusters in k-means.
• Lack of Ground Truth: Unsupervised learning lacks labeled data, making it
difficult to evaluate the accuracy of results.
Applications of Unsupervised learning
Unsupervised learning has diverse applications across industries and domains. Key
applications include:
• Customer Segmentation: Algorithms cluster customers based on purchasing
behavior or demographics, enabling targeted marketing strategies.
• Anomaly Detection: Identifies unusual patterns in data, aiding fraud detection,
cybersecurity, and equipment failure prevention.
• Recommendation Systems: Suggests products, movies, or music by analyzing
user behavior and preferences.
• Image and Text Clustering: Groups similar images or documents for tasks like
organization, classification, or content recommendation.
• Social Network Analysis: Detects communities or trends in user interactions on
social media platforms.
• Astronomy and Climate Science: Classifies galaxies or groups weather patterns
to support scientific research.

K means Clustering
K-Means Clustering is an Unsupervised Machine Learning algorithm which groups
unlabeled dataset into different clusters. It is used to organize data into groups
based on their similarity.
Understanding K-means Clustering
For example online store uses K-Means to group customers based on purchase
frequency and spending creating segments like Budget Shoppers, Frequent Buyers
and Big Spenders for personalised marketing.
The algorithm works by first randomly picking some central points
called centroids and each data point is then assigned to the closest centroid
forming a cluster. After all the points are assigned to a cluster the centroids are
updated by finding the average position of the points in each cluster. This process
repeats until the centroids stop changing forming clusters. The goal of clustering
is to divide the data points into clusters so that similar data points belong to same
group.
How k-means clustering works?
We are given a data set of items with certain features and values for these
features like a vector. The task is to categorize those items into groups. To
achieve this we will use the K-means algorithm. 'K' in the name of the
algorithm represents the number of groups/clusters we want to classify
our items into.
The algorithm will categorize the items into k groups or clusters of similarity. To
calculate that similarity we will use the Euclidean distance as a measurement. The
algorithm works as follows:
1. First we randomly initialize k points called means or cluster centroids.
2. We categorize each item to its closest mean and we update the mean's coordinates,
which are the averages of the items categorized in that cluster so far.
3. We repeat the process for a given number of iterations and at the end, we have our
clusters.
The "points" mentioned above are called means because they are the mean values
of the items categorized in them. To initialize these means, we have a lot of options.
An intuitive method is to initialize the means at random items in the data set.
Another method is to initialize the means at random values between the
boundaries of the data set. For example, if a feature x takes values in [0, 3], we
initialize each mean's value for x at a random value within [0, 3].
Selecting the right number of clusters is important for meaningful segmentation. To do
this we use the Elbow Method, a graphical tool for determining the optimal value of k in
K-means; a sketch of it follows the implementation below.
Implementation of K-means clustering
# Import required libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Load the Iris dataset
iris = load_iris()
X = iris.data

# Optional: Normalize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply KMeans clustering
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X_scaled)

# Predicted cluster labels
labels = kmeans.labels_

# Print cluster centers
print("Cluster Centers:\n", kmeans.cluster_centers_)

# Plotting the clusters (using first two features for 2D plot)
plt.figure(figsize=(8, 5))
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=labels, cmap='viridis', s=50)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=200, c='red', label='Centroids', marker='X')
plt.title("K-Means Clustering (Iris Dataset)")
plt.xlabel("Feature 1 (Standardized)")
plt.ylabel("Feature 2 (Standardized)")
plt.legend()
plt.grid(True)
plt.show()
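A minimal sketch of the Elbow Method mentioned earlier, reusing the standardized Iris features: the inertia (within-cluster sum of squared distances) is plotted against k, and the point where the decrease in inertia slows sharply suggests a suitable number of clusters.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)

inertias = []
k_values = range(1, 11)
for k in k_values:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X_scaled)
    inertias.append(km.inertia_)   # within-cluster sum of squared distances

plt.plot(k_values, inertias, marker='o')
plt.title("Elbow Method for Optimal k")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia")
plt.grid(True)
plt.show()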
Hierarchical Clustering in Machine Learning


Hierarchical clustering is used to group similar data points together based on their similarity
creating a hierarchy or tree-like structure. The key idea is to begin with each data point as its
own separate cluster and then progressively merge or split them based on their similarity. Let's
understand this with the help of an example.
Imagine you have four fruits with different weights: an apple (100g), a banana (120g), a cherry
(50g) and a grape (30g). Hierarchical clustering starts by treating each fruit as its own group.
• It then merges the closest groups based on their weights.
• First the cherry and grape are grouped together because they are the lightest.
• Next the apple and banana are grouped together.
Finally all the fruits are merged into one large group, showing how hierarchical clustering
progressively combines the most similar data points.
Types of Hierarchical Clustering
Now we understand the basics of hierarchical clustering. There are two main types of hierarchical
clustering.
1. Agglomerative Clustering
2. Divisive clustering
Hierarchical Agglomerative Clustering
It is also known as the bottom-up approach or hierarchical agglomerative clustering (HAC).
Unlike flat clustering, hierarchical clustering provides a structured way to group data. This
clustering algorithm does not require us to prespecify the number of clusters. Bottom-up
algorithms treat each data point as a singleton cluster at the outset and then successively
agglomerate pairs of clusters until all of them have been merged into a single cluster that
contains all the data.

Workflow for Hierarchical Agglomerative clustering


1. Start with individual points: Each data point is its own cluster. For example if you have
5 data points you start with 5 clusters each containing just one data point.
2. Calculate distances between clusters: Calculate the distance between every pair of
clusters. Initially since each cluster has one point this is the distance between the two data
points.
3. Merge the closest clusters: Identify the two clusters with the smallest distance and
merge them into a single cluster.
4. Update distance matrix: After merging you now have one less cluster. Recalculate the
distances between the new cluster and the remaining clusters.
5. Repeat steps 3 and 4: Keep merging the closest clusters and updating the distance matrix
until you have only one cluster left.
6. Create a dendrogram: As the process continues you can visualize the merging of clusters
using a tree-like diagram called a dendrogram. It shows the hierarchy of how clusters are
merged.
Python implementation of the above algorithm using the scikit-learn library:
from sklearn.cluster import AgglomerativeClustering
import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

clustering = AgglomerativeClustering(n_clusters=2).fit(X)
print(clustering.labels_)

Output :
[1, 1, 1, 0, 0, 0]
Hierarchical Divisive clustering
It is also known as a top-down approach. This algorithm also does not require us to prespecify
the number of clusters. Top-down clustering requires a method for splitting a cluster that
contains the whole dataset and proceeds by splitting clusters recursively until individual data
points have been split into singleton clusters.
Workflow for Hierarchical Divisive clustering :
1. Start with all data points in one cluster: Treat the entire dataset as a single large cluster.
2. Split the cluster: Divide the cluster into two smaller clusters. The division is typically
done by finding the two most dissimilar points in the cluster and using them to separate
the data into two parts.
3. Repeat the process: For each of the new clusters, repeat the splitting process:
1. Choose the cluster with the most dissimilar points.
2. Split it again into two smaller clusters.
4. Stop when each data point is in its own cluster: Continue this process until every data
point is its own cluster, or the stopping condition (such as a predefined number of
clusters) is met.

Computing Distance Matrix
While merging two clusters we check the distance between every pair of clusters and
merge the pair with the least distance (greatest similarity). But how is that distance
determined? There are different ways of defining inter-cluster distance/similarity. Some of
them are:
1. Min Distance: Find the minimum distance between any two points of the cluster.
2. Max Distance: Find the maximum distance between any two points of the cluster.
3. Group Average: Find the average distance between every two points of the clusters.
4. Ward's Method: The similarity of two clusters is based on the increase in squared error
when two clusters are merged.
Implementation code for Distance Matrix Comparison


import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

Z = linkage(X, 'ward')  # Ward distance

dendrogram(Z)  # plotting the dendrogram
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Data point')
plt.ylabel('Distance')
plt.show()

Output:
[Figure: Hierarchical Clustering Dendrogram]
Hierarchical clustering is a widely used unsupervised learning technique that organizes data
into a tree-like structure, allowing us to visualize relationships between data points using a
dendrogram. Unlike flat clustering methods, it does not require a predefined number of clusters
and provides a structured way to explore data similarity.
