Linear Regression with Python
Implementation
Table of Contents
1. What is Regression?
2. What is Linear Regression (LR)?
3. Basic assumptions of Linear Regression (LR)
4. Implementation of Linear Regression
1. What is Regression?
Regression analysis is a statistical method that
helps us understand the relationship between a
dependent variable and one or more independent variables.
Dependent Variable
This is the main factor that we are trying to predict.
Independent Variable
These are the variables that have a relationship
with the dependent variable.
2. What is Linear Regression?
In machine learning lingo, Linear Regression (LR)
simply means finding the best-fitting line that explains
the variability between the dependent and independent
features. In other words, it describes the linear
relationship between the independent and dependent
features. In linear regression, the algorithm
predicts continuous values (e.g. salary, price)
rather than categorical ones (e.g. cat, dog).
Types of Regression Analysis
There are many types of regression analysis, but in this
article we will deal with:
1. Simple Linear Regression
2. Multiple Linear Regression
Simple Linear Regression
Simple Linear Regression uses the slope-intercept
(weight-bias) form, where our model needs to find
the optimal values for both slope and intercept. With
the optimal values, the model can capture the
variability between the independent and
dependent features and produce accurate results.
In simple linear regression, the model takes a
single independent variable and a single dependent variable.
There are many equations to represent a straight
line; we will stick with the common equation:
$$y = b_0 + b_1 x \quad \text{(equivalently } y = mx + c\text{)}$$
Here, y and x are the dependent and independent
variables respectively. b1 (m) and b0 (c) are the
slope and y-intercept respectively.
Slope (m) tells us how many units y increases for a
one-unit increase in x. The steeper the line, the
higher the slope; the slope is lower for a less
steep line.
Constant (c) is the value of y when x is zero.
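To make the roles of slope and intercept concrete, here is a minimal sketch. The function name `predict` and the numeric values are assumptions chosen only for illustration; they do not come from any dataset in this deck.

```python
def predict(x, slope, intercept):
    """Return y for a given x using the line y = intercept + slope * x."""
    return intercept + slope * x


# Hypothetical slope (m / b1) and intercept (c / b0), for illustration only.
slope, intercept = 2.0, 5.0

# When x is zero, the prediction is just the intercept.
y_at_zero = predict(0, slope, intercept)

# Moving x up by one unit changes y by exactly the slope.
change_per_unit = predict(4, slope, intercept) - predict(3, slope, intercept)
```

Here `y_at_zero` matches the constant c, and `change_per_unit` matches the slope m.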
How the Model will Select the Best Fit Line?
First, our model will try a bunch of
different straight lines; from those, it
finds the optimal line that predicts
our data points well.
In the nearby picture, you can
see there are 4 lines. Any
guess which will be our best fit line?
To find the best fit line, our model uses the cost function. In
machine learning, every algorithm has a cost function, and in simple
linear regression the goal of our algorithm is to find a minimal
value for the cost function.
In linear regression (LR) there are many cost functions, but the
most commonly used one is MSE (Mean Squared Error):
$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
where
$y_i$ is the actual value,
$\hat{y}_i$ is the predicted value,
$n$ is the number of records.
$(y_i - \hat{y}_i)$ is the loss for a single record. You will find
that people often use the words loss and cost function
interchangeably, but they are different. We square the terms so
that negative differences do not cancel out positive ones.
Image source: https://corporatefinanceinstitute.com/multiple-linear-regression
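The difference between the per-record loss and the MSE cost can be sketched in a few lines. The function names and the sample values below are assumptions made for illustration only.

```python
def squared_loss(actual, predicted):
    """Loss for a single record: the squared difference."""
    return (actual - predicted) ** 2


def mean_squared_error(actuals, predictions):
    """Cost over all records: the average of the per-record squared losses."""
    if len(actuals) != len(predictions):
        raise ValueError("actuals and predictions must have the same length")
    n = len(actuals)
    return sum(squared_loss(a, p) for a, p in zip(actuals, predictions)) / n


# Hypothetical actual and predicted values.
actual_values = [3.0, 5.0, 7.0]
predicted_values = [2.0, 5.0, 9.0]

mse = mean_squared_error(actual_values, predicted_values)
```

Because each difference is squared, an error of -2 contributes as much to the cost as an error of +2.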
Loss Function
It is a calculation of the loss for a single training example.
Cost Function
It is a calculation of the average loss over the training data.
Steps
Conceptually, our model evaluates the candidate lines and finds an
overall average error between the actual and
predicted values for each line.
It selects the line which has the lowest overall
error. That will be the best fit line.
In the nearby picture, the blue data
points represent the actual
values from the training data, and the red
line (vector) from each point marks the predicted value
for that actual data point. We
can see a random error, which is the
actual value minus the predicted value. The
model tries to minimize the
error between the actual and
predicted values, because in the
real world we need a model that
predicts well. So
our model finds the loss
between all the actual and
predicted values, and
it selects the line whose
average error over all points is lowest.
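The idea of scoring several candidate lines and keeping the one with the lowest average error can be sketched as below. The data points and the four candidate (slope, intercept) pairs are made up for illustration. In practice, libraries solve for the best line directly instead of enumerating candidates.

```python
def mse_for_line(xs, ys, slope, intercept):
    """Average squared error of the line y = intercept + slope * x on the data."""
    errors = [(y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


# Hypothetical training data.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]

# Four hypothetical candidate lines, each given as (slope, intercept).
candidate_lines = [(1.0, 1.0), (2.0, 1.0), (3.0, 0.0), (0.5, 4.0)]

# The best fit line among the candidates is the one with the lowest cost.
best_line = min(candidate_lines, key=lambda line: mse_for_line(xs, ys, *line))
```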
Multiple Linear Regression
In multiple linear regression, our model will apply the same steps.
Instead of having a single independent variable, the model
has multiple independent variables to predict the
dependent variable.
$$y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_n x_n$$
where b0 is the y-intercept, b1, b2, b3, b4, …, bn are the slopes of the
independent variables x1, x2, x3, x4, …, xn, and y is the
dependent variable.
Here, instead of finding a line, our model will find the best plane
(a hyperplane when there are more than two independent variables).
3. Assumptions of Linear Regression
Linearity
The first and most obvious assumption
of our model is linearity.
• It means there must be a linear
relationship between the dependent
and independent features.
• Without a linear relationship,
accurate predictions won’t be
possible. The most commonly used
methods to find the linear relationship
are correlation and a scatterplot.
• A correlation provides information on
the strength and direction of the
linear relationship between two
variables.
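As a rough illustration of checking linearity with correlation, the sketch below computes the Pearson correlation coefficient by hand. The function name and the experience/salary numbers are assumptions for illustration only.

```python
from math import sqrt


def pearson_correlation(xs, ys):
    """Pearson r: strength and direction of the linear relationship."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(var_x * var_y)


# Hypothetical data: years of experience vs. salary (in thousands).
experience = [1, 2, 3, 4, 5]
salary = [30, 35, 41, 44, 50]

r = pearson_correlation(experience, salary)
```

A value of r close to +1 or -1 suggests a strong linear relationship, and the sign gives its direction. A value near 0 suggests little linear relationship.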
Normality
Normality doesn’t mean our independent variable should be
normally distributed. Linear Regression can
work perfectly well with non-normally distributed independent variables.
Normality means our errors (residuals) should
be normally distributed.
The errors of a fitted model are available in
statsmodels.
We can use a histogram and the statsmodels Q-Q
plot to check the probability distribution of the errors.
Multicollinearity
The independent variables
should not be correlated with
each other. When they are
correlated with each other,
we can conclude that
one variable explains another
variable well, so we don’t
need two variables doing the
same thing. E.g. you have two
pens, and both pens do the
same thing, which is writing, so you
don’t need two pens for
writing (consider that the two pens are
magical pens, so you won’t
run out of ink). Before
dropping one of the two
variables, you must also see
how much both
independent variables are ...
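The residual check described above can be sketched with the standard library alone. This is an illustrative stand-in, not the statsmodels code the slides refer to. In statsmodels itself, a fitted OLS results object exposes the residuals as `resid`, and `statsmodels.api.qqplot` draws the Q-Q plot. The function names and data below are assumptions.

```python
from statistics import NormalDist, mean, stdev


def residuals(actuals, predictions):
    """Errors of the model: actual value minus predicted value."""
    return [a - p for a, p in zip(actuals, predictions)]


def qq_points(errors):
    """Pair each sorted residual with the matching normal quantile.

    If the residuals are roughly normal, these pairs lie close to a
    straight line when plotted.
    """
    n = len(errors)
    fitted = NormalDist(mean(errors), stdev(errors))
    theoretical = [fitted.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, sorted(errors)))


# Hypothetical actual and predicted values.
actual = [3.1, 4.9, 7.2, 8.8, 11.1]
predicted = [3.0, 5.0, 7.0, 9.0, 11.0]

errors = residuals(actual, predicted)
points = qq_points(errors)
```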
4. Implementation of Linear
Regression
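A minimal sketch of a simple linear regression implementation, assuming the ordinary least-squares closed-form solution for the slope and intercept. The function name and the experience/salary data are hypothetical and used only for illustration.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) of the least-squares line through the data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = sum of co-deviations / sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    slope = sxy / sxx
    # The least-squares line always passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: years of experience vs. salary (in thousands).
years_experience = [1, 2, 3, 4, 5]
salary = [35, 40, 45, 50, 55]

slope, intercept = fit_simple_linear_regression(years_experience, salary)

# Predict the salary for a new, unseen value of x.
predicted_salary = intercept + slope * 6
```

The closed-form solution gives the same line that minimizes the MSE cost, without trying candidate lines one by one.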
An Introduction to K-Means Clustering
K-means clustering is a method that comes from signal processing.
It aims to group a set of n
observations into k clusters. Each observation is placed in the
cluster whose mean (the cluster center) is closest to
it, and that center serves as the cluster's representative.
K-means clustering is one of the most popular ways to group data.
Learning Objectives
- Get introduced to K-Means Clustering.
- Understand the properties of clusters and the various
evaluation metrics for clustering.
- Get acquainted with some of the many real-world
applications of K-Means Clustering.
- Implement K-Means Clustering in Python on a real-world
dataset.
Table of Contents
1. What is K-Means Clustering?
2. How K-Means Clustering Works?
3. Objective of K-Means Clustering
4. What is Clustering?
5. Properties of K-Means Clustering
6. Understanding the Different Evaluation Metrics for Clustering
7. How to Apply K-Means Clustering Algorithm?
8. Implementing K-Means Clustering in Python From Scratch
9. Challenges With the K-Means Clustering Algorithm
10. K-Means++ to Choose Initial Cluster Centroids for K-Means Clustering
1. What is K-Means Clustering?
• K-means clustering is a popular unsupervised
machine learning algorithm used for partitioning a dataset
into a pre-defined number of clusters. The goal is to group
similar data points together and discover underlying patterns
or structures within the data.
• Recall the first property of clusters – it states that the points
within a cluster should be similar to each other. So, our aim
here is to minimize the distance between the points within a
cluster.
• The main objective of the K-Means algorithm is to minimize
the sum of distances between the points and their respective
cluster centroids.
2. How K-Means Clustering Works?
Here’s how it works:
Initialization: Start by randomly selecting K points from the dataset.
These points will act as the initial cluster centroids.
Assignment: For each data point in the dataset, calculate the
distance between that point and each of the K centroids. Assign the
data point to the cluster whose centroid is closest to it. This step
effectively forms K clusters.
Update centroids: Once all data points have been assigned to
clusters, recalculate the centroids of the clusters by taking the
mean of all data points assigned to each cluster.
Repeat: Repeat steps 2 and 3 until convergence. Convergence
occurs when the centroids no longer change significantly or when a
specified number of iterations is reached.
Final Result: Once convergence is achieved, the algorithm outputs
the final cluster centroids and the assignment of each data point to
a cluster.
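The steps above (initialization, assignment, centroid update, repeat) can be sketched in plain Python as follows. This is a minimal illustration, not a production implementation. The function name, the `seed` parameter, and the sample points are assumptions.

```python
import random
from math import dist


def k_means(points, k, max_iterations=100, seed=0):
    """Cluster 'points' (a list of tuples) into k groups."""
    rng = random.Random(seed)
    # Initialization: pick k distinct data points as the starting centroids.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iterations):
        # Assignment: each point goes to the cluster with the closest centroid.
        new_assignments = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Convergence: stop when no point changes cluster.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Update: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(
                    sum(coords) / len(members) for coords in zip(*members)
                )
    return centroids, assignments


# Hypothetical 2-D data with two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
centroids, assignments = k_means(points, k=2)
```

Which cluster gets index 0 or 1 depends on the random initialization, so only the grouping is meaningful, not the labels themselves.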
3. Objective of K-Means Clustering
The main objective of k-means clustering is to partition your data
into a specific number (k) of groups, where data points within each
group are similar to each other and dissimilar to points in other groups. It
achieves this by minimizing the distance between data points and
their assigned cluster’s center, called the centroid.
Here are the objectives:
Grouping similar data points: K-means aims to identify patterns in
your data by grouping data points that share similar
characteristics together. This allows you to discover underlying
structures within the data.
Minimizing within-cluster distance: The algorithm strives to make
sure data points within a cluster are as close as possible to each
other, as measured by a distance metric (usually Euclidean
distance). This ensures tight-knit clusters with high cohesiveness.
Maximizing between-cluster distance: Conversely, k-means also
tries to maximize the separation between clusters.
4. What is Clustering?
Cluster analysis is a technique in data mining and machine learning
that groups similar objects into clusters. K-means clustering, a
popular method, aims to divide a set of objects into K clusters,
minimizing the sum of squared distances between the objects and
their respective cluster centers.
Let’s try understanding this with a simple example. A bank wants to give
credit card offers to its customers. Currently, they look at the details of each
customer and, based on this information, decide which offer should be given
to which customer.
Now, the bank can potentially have millions of customers. Does it make sense
to look at the details of each customer separately and then make a decision?
Certainly not! It is a manual process and will take a huge amount of time.
5. Properties of K-Means Clustering
How about another example of the k-means clustering algorithm?
We’ll take the same bank as before, which wants to segment
its customers. For simplicity, let’s say the bank only
wants to use income and debt to make the segmentation.
They collected the customer data and used a scatter plot to
visualize it:
On the x-axis, we have the income of the
customer, and the y-axis represents the amount of
debt. Here, we can clearly see that these
customers can be segmented into 4 different
clusters, as shown nearby:
• All the data points in a cluster
should be similar to each other.
• The data points from different
clusters should be as different as
possible.
6. Understanding the Different Evaluation Metrics for Clustering
Inertia
Recall the first property of clusters we covered above. Inertia evaluates exactly this:
• It tells us how far apart the points within a cluster are.
So, inertia calculates the sum of distances of all the
points within a cluster from the centroid of that cluster.
Normally, we use Euclidean distance as the distance metric, as
long as most of the features are numeric; otherwise, Manhattan
distance is used in case most of the features are categorical.
We calculate this for all the clusters; the final inertia value is
the sum of all these distances. This distance within the clusters
is known as intracluster distance. So, inertia gives us the sum of
intracluster distances.
Keeping this in mind, we can say that the lower the inertia
value, the better our clusters are.
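A small sketch of the inertia calculation follows. It assumes the squared Euclidean distance, which matches the sum of squared distances mentioned earlier in this deck and is also how scikit-learn defines its `inertia_` attribute. The function names and the two example clusterings are hypothetical.

```python
from math import dist


def centroid_of(members):
    """Mean of the points in one cluster."""
    return tuple(sum(coords) / len(members) for coords in zip(*members))


def inertia(clusters):
    """Sum of squared distances from each point to its cluster centroid.

    'clusters' is a list of clusters, each a list of points (tuples).
    """
    total = 0.0
    for members in clusters:
        center = centroid_of(members)
        total += sum(dist(p, center) ** 2 for p in members)
    return total


# Hypothetical clusterings of the same four points.
tight = [[(1.0, 1.0), (1.0, 2.0)], [(8.0, 8.0), (8.0, 9.0)]]
loose = [[(1.0, 1.0), (8.0, 8.0)], [(1.0, 2.0), (8.0, 9.0)]]

tight_inertia = inertia(tight)
loose_inertia = inertia(loose)
```

The grouping that keeps nearby points together has a much lower inertia than the grouping that mixes distant points.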
Dunn Index
Inertia makes sure that the first property of
clusters is satisfied. But it does not care about
the second property: that different clusters should be
as different from each other as possible.
This is where the Dunn index comes into action.
Along with the distance between the centroid and
points, the Dunn index also takes into account the
distance between two clusters. This distance
between the centroids of two different clusters is
known as inter-cluster distance. Let’s look at the
formula of the Dunn index:
$$\text{Dunn index} = \frac{\min(\text{inter-cluster distance})}{\max(\text{intracluster distance})}$$
The Dunn index is the ratio of the minimum of the
inter-cluster distances to the maximum of the
intracluster distances.
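The ratio described above can be sketched as follows. This follows the simplified definition used in this deck: inter-cluster distance is measured between centroids, and intracluster distance from each point to its own centroid. Other formulations of the Dunn index use different distance definitions. The function names and example clusterings are assumptions.

```python
from itertools import combinations
from math import dist


def centroid_of(members):
    """Mean of the points in one cluster."""
    return tuple(sum(coords) / len(members) for coords in zip(*members))


def dunn_index(clusters):
    """Minimum inter-cluster distance divided by maximum intracluster distance.

    Needs at least two clusters and at least one point that is not
    exactly on its centroid (otherwise the denominator is zero).
    """
    centers = [centroid_of(members) for members in clusters]
    min_inter = min(dist(a, b) for a, b in combinations(centers, 2))
    max_intra = max(
        dist(p, center)
        for members, center in zip(clusters, centers)
        for p in members
    )
    return min_inter / max_intra


# Hypothetical clusterings of the same four points.
tight = [[(1.0, 1.0), (1.0, 2.0)], [(8.0, 8.0), (8.0, 9.0)]]
loose = [[(1.0, 1.0), (8.0, 8.0)], [(1.0, 2.0), (8.0, 9.0)]]

tight_dunn = dunn_index(tight)
loose_dunn = dunn_index(loose)
```

A higher Dunn index indicates compact clusters that are far apart, so the first grouping scores higher.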
7. How to Apply K-Means Clustering Algorithm?
Let’s now take an example to understand how K-Means
actually works.
We have these 8 points, and we want to apply k-means
to create clusters for these points. Here’s how
we can do it.
1. Choose the number of clusters k
The first step in k-means is to pick the number of
clusters, k.
2. Select k random points from the data as centroids
Next, we randomly select the centroid for each
cluster. Let’s say we want to have 2 clusters, so k is
equal to 2 here. We then randomly select the
centroid:
Here, the red and green circles represent the
centroid for these clusters.
3. Assign all the points to the closest cluster centroid
Once the centroids are initialized, we assign each point to the
cluster whose centroid is closest to it.
4. Recompute the centroids of newly formed clusters
Now, once we have assigned all of the points to either cluster, the next
step is to compute the centroids of newly formed clusters:
Here, the red and green crosses are the new centroids.
5. Repeat steps 3 and 4
We then repeat steps 3 and 4:
The step of computing the centroid and assigning all the points
to the cluster based on their distance from the centroid is a
single iteration. But wait – when should we stop this process? It
can’t run till eternity, right?
Stopping Criteria for K-Means Clustering
There are essentially three stopping criteria that can be adopted to stop
the K-means algorithm:
• Centroids of newly formed clusters do not change
• Points remain in the same cluster
• Maximum number of iterations is reached
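The stopping criteria can be expressed as a single check, sketched below. It also includes the iteration limit mentioned earlier in the convergence step. The function name, its parameters, and the `tolerance` value are assumptions made for illustration.

```python
from math import dist


def should_stop(old_centroids, new_centroids, old_assignments, new_assignments,
                iteration, max_iterations=100, tolerance=1e-6):
    """Return True if any of the stopping criteria is met."""
    # Criterion 1: centroids of the newly formed clusters do not change.
    centroids_unchanged = all(
        dist(old, new) <= tolerance
        for old, new in zip(old_centroids, new_centroids)
    )
    # Criterion 2: every point remains in the same cluster.
    points_unchanged = old_assignments == new_assignments
    # Criterion 3: the iteration limit has been reached.
    limit_reached = iteration >= max_iterations
    return centroids_unchanged or points_unchanged or limit_reached
```

The small tolerance avoids running extra iterations when the centroids move only by floating-point noise.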
Decision Tree
1. What is a Decision Tree?
2. Types of Decision Tree
3. Decision Tree Terminologies
4. How decision tree algorithms work?
5. Decision Tree Assumptions
6. Entropy
7. How do Decision Trees use Entropy?
8. Information Gain
9. When to Stop Splitting?
10. Pruning
11. Decision tree example
1. What is a Decision Tree?
A decision tree is a hierarchical, flowchart-like
tree structure that shows
the predictions that result from a series of
feature-based splits.
2. Types of Decision Tree
ID3: This algorithm measures how mixed up
the data is at a node using something called
entropy. It then chooses the feature that
helps to clarify the data the most.
C4.5: This is an improved version of ID3 that
can handle missing data and continuous
attributes.
CART: This algorithm uses a different
measure called Gini impurity to decide how to
split the data. It can be used for both
classification (sorting data into categories)
and regression (predicting continuous values).
3. Decision Tree Terminologies
Before learning more about decision trees let’s get familiar
with some of the terminologies:
Root Node:
The initial node at the beginning of a decision tree, where the entire
population or dataset starts dividing based on various features or conditions.
Decision Nodes:
Nodes resulting from the splitting of root nodes are known as decision nodes.
These nodes represent intermediate decisions or conditions within the tree.
Leaf Nodes:
Nodes where further splitting is not possible, often indicating the final
classification or outcome. Leaf nodes are also referred to as terminal nodes.
Sub-Tree:
Similar to a subsection of a graph being called a sub-graph, a sub-section of a
decision tree is referred to as a sub-tree.
Pruning:
The process of removing or cutting down specific nodes in a decision
tree to prevent overfitting and simplify the model.
Branch / Sub-Tree:
A subsection of the entire decision tree is referred to as a branch or
sub-tree. It represents a specific path of decisions and outcomes within
the tree.
Parent and Child Node:
In a decision tree, a node that is divided into sub-nodes is known as a
parent node, and the sub-nodes emerging from it are referred to as
child nodes. The parent node represents a decision or condition, while
the child nodes represent the potential outcomes or further decisions.
Example of Decision Tree
Let’s understand decision trees with the help of an example:
In the diagram below, the tree first asks: what is the weather? Is
it sunny, cloudy, or rainy? Depending on the answer, it goes to the next feature,
which is humidity or wind. It then checks whether there is a strong
wind or a weak one; if it’s a weak wind and it’s rainy, then the person may
go and play.
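A decision tree like this one is equivalent to a set of nested conditions. The sketch below is hypothetical. Only the rainy and weak-wind branch comes from the description above. The sunny and cloudy branches are assumptions borrowed from the classic play-tennis example and may differ from the original diagram.

```python
def will_play(weather, humidity, wind):
    """Walk the example tree from the root (weather) down to a leaf."""
    if weather == "sunny":
        # Assumed branch: on sunny days the decision depends on humidity.
        return humidity == "normal"
    if weather == "cloudy":
        # Assumed branch: cloudy days always lead to playing.
        return True
    if weather == "rainy":
        # From the text: rainy with a weak wind means the person may play.
        return wind == "weak"
    raise ValueError(f"unknown weather value: {weather!r}")


rainy_weak_wind = will_play("rainy", "high", "weak")
rainy_strong_wind = will_play("rainy", "high", "strong")
```

Each `if` corresponds to a decision node, and each `return` corresponds to a leaf node.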
4. How decision tree algorithms work?
The decision tree algorithm works in a few simple steps:
Starting at the Root: The algorithm begins at the top,
called the “root node,” representing the entire dataset.
Asking the Best Questions: It looks for the most
important feature or question that splits the data into the
most distinct groups. This is like asking a question at a fork
in the tree.
Branching Out: Based on the answer to that question, it
divides the data into smaller subsets, creating new
branches. Each branch represents a possible route through
the tree.
Repeating the Process: The algorithm continues asking
questions and splitting the data at each branch until it
reaches the final leaf nodes, which represent the predicted
outcomes or classifications.
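The "asking the best questions" step can be sketched for a single numeric feature. The sketch assumes a CART-style criterion: try every threshold and keep the one with the lowest weighted Gini impurity. The function names and the humidity data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a group: 0.0 means all labels are the same."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best split 'value <= threshold'.

    Returns None if all values are identical (no split is possible).
    """
    best = None
    n = len(values)
    # The largest value is skipped because it would leave the right side empty.
    for threshold in sorted(set(values))[:-1]:
        left = [lab for val, lab in zip(values, labels) if val <= threshold]
        right = [lab for val, lab in zip(values, labels) if val > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical data: humidity readings and whether the person played.
humidity = [60, 65, 70, 85, 90, 95]
play = ["yes", "yes", "yes", "no", "no", "no"]

threshold, impurity = best_split(humidity, play)
```

A full tree builder would apply the same search to every feature and then repeat it recursively on each resulting subset.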
5. Decision Tree Assumptions
Several assumptions are made to build effective models when
creating decision trees. These assumptions help guide the tree’s
construction and impact its performance. Here are some common
assumptions and considerations when creating decision trees:
Binary Splits
Decision trees typically make binary splits, meaning each node
divides the data into two subsets based on a single feature or
condition. This assumes that each decision can be represented as a
binary choice.
Recursive Partitioning
Decision trees use a recursive partitioning process, where each node
is divided into child nodes, and this process continues until a stopping
criterion is met. This assumes that data can be effectively subdivided
into smaller, more manageable subsets.
Feature Independence
Decision trees often assume that the features used for splitting nodes
are independent. In practice, feature independence may not hold, but
decision trees can still perform well even if features are correlated.
Homogeneity
Decision trees aim to create homogeneous subgroups in each node,
meaning that the samples within a node are as similar as possible
regarding the target variable. This assumption helps in achieving
clear decision boundaries.
Top-Down Greedy Approach
Decision trees are constructed using a top-down, greedy approach,
where each split is chosen to maximize information gain or minimize
impurity.
Categorical and Numerical Features
Decision trees can handle both categorical and numerical features.
However, they may require different splitting strategies for each type.
Overfitting
Decision trees are prone to overfitting when they capture noise in the
data. Pruning and setting appropriate stopping criteria are used to
address this issue.
Impurity Measures
Decision trees use impurity measures such as Gini impurity or entropy
to evaluate how well a split separates classes. The choice of impurity
measure can impact tree construction.
No Missing Values
Decision trees assume that there are no missing values in the dataset
or that missing values have been appropriately handled through
imputation or other methods.
Equal Importance of Features
Decision trees may assume equal importance for all features unless
feature scaling or weighting is applied to emphasize certain features.
No Outliers
Decision trees are sensitive to outliers, and extreme values can
influence their construction. Preprocessing or robust methods may be
needed to handle outliers effectively.
Sensitivity to Sample Size
Small datasets may lead to overfitting, and large datasets may result
in overly complex trees. The sample size and tree depth should be
balanced.
6. Entropy
Entropy is nothing but the uncertainty in our dataset, or a measure of
disorder. Let me try to explain this with the help of an example.
Suppose you have a group of friends deciding which movie to
watch together on Sunday. There are 2 choices, “Lucy” and
“Titanic”, and everyone has to state their
choice.
After everyone gives their answer, we see that “Lucy” gets 4 votes
and “Titanic” gets 5 votes. Which movie do we watch now? Isn’t it
hard to choose one movie when the votes for both movies
are nearly equal?
This is exactly what we call disorder: the votes for the two movies
are nearly equal, and we can’t really decide which movie we
should watch. It would have been much easier if “Lucy” had received
8 votes and “Titanic” 2. Here we could easily say that the
majority of votes are for “Lucy”, hence everyone will be watching this
movie.
In a decision tree, the output is mostly “yes” or “no”.
The formula for entropy is shown below:
$$E(S) = -p_{+} \log_2 p_{+} - p_{-} \log_2 p_{-}$$
where $p_{+}$ is the proportion of “yes” outcomes and $p_{-}$ is the
proportion of “no” outcomes in the node.
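To connect the movie example with the entropy formula, here is a small sketch that computes entropy from vote counts. The function name is an assumption. The vote counts are the ones from the example above.

```python
from math import log2


def entropy(counts):
    """Entropy (in bits) of a group, given the count of each outcome."""
    total = sum(counts)
    result = 0.0
    for count in counts:
        if count > 0:  # an outcome with zero votes contributes nothing
            p = count / total
            result -= p * log2(p)
    return result


# "Lucy" 4 votes vs. "Titanic" 5 votes: a nearly even split.
close_vote = entropy([4, 5])

# "Lucy" 8 votes vs. "Titanic" 2 votes: a clear majority.
clear_vote = entropy([8, 2])
```

The nearly even vote has an entropy close to the maximum of 1 bit, while the clear majority has a noticeably lower entropy. This matches the intuition that it is easier to decide in the second case.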