ML & DA Unit2 - Notes
Unit 2
Supervised Learning
Introduction
Supervised machine learning is a fundamental approach for machine learning and
artificial intelligence. It involves training a model using labeled data, where each input
comes with a corresponding correct output.
The process is like a teacher guiding a student—hence the term "supervised" learning. In
this article, we'll explore the key components of supervised learning, the different types
of supervised machine learning algorithms used, and some practical examples of how it
works.
• Learning Process: The algorithm processes the training data, learning the relationships
between the input features and the output labels. This is achieved by adjusting the model's
parameters to minimize the difference between its predictions and the actual labels.
After training, the model is evaluated using a test dataset to measure its accuracy and
performance. Then the model's performance is optimized by adjusting parameters and
using techniques like cross-validation to balance bias and variance. This ensures the model
generalizes well to new, unseen data.
Regression
Regression in machine learning refers to a supervised learning technique where the
goal is to predict a continuous numerical value based on one or more independent
features. It finds relationships between variables so that predictions can be made.
We have two types of variables present in regression:
Dependent Variable (Target): The variable we are trying to predict e.g house
price.
Independent Variables (Features): The input variables that influence the
prediction e.g locality, number of rooms.
Linear Regression
Linear regression is a type of supervised machine-learning algorithm that learns
from labelled datasets and maps the data points to the most optimized linear
function, which can then be used for prediction on new datasets. It assumes that there
is a linear relationship between the input and output, meaning the output changes
at a constant rate as the input changes. This relationship is represented by a
straight line.
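In its simplest, single-feature form this line can be written as
y = β0 + β1·x
where β0 is the intercept and β1 is the slope learned from the data.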
For example, suppose we want to predict a student's exam score based on how many hours
they studied. We observe that as students study more hours, their scores go up. In
this example:
Independent variable (input): Hours studied because it's the factor we control
or observe.
Dependent variable (output): Exam score because it depends on how many
hours were studied.
Importance of Linear Regression
• Simplicity and Interpretability: It’s easy to understand and interpret, making it
a starting point for learning about machine learning.
• Predictive Ability: Helps predict future outcomes based on past data, making it
useful in various fields like finance, healthcare and marketing.
• Basis for Other Models: Many advanced algorithms, like logistic regression or
neural networks, build on the concepts of linear regression.
• Efficiency: It’s computationally efficient and works well for problems with a
linear relationship.
• Widely Used: It’s one of the most widely used techniques in both statistics and
machine learning for regression tasks.
• Analysis: It provides insights into relationships between variables (e.g., how much
one variable influences another).
Best Fit Line in Linear Regression
In linear regression, the best-fit line is the straight line that most accurately
represents the relationship between the independent variable (input) and the
dependent variable (output). It is the line that minimizes the difference between
the actual data points and the predicted values from the model.
1. Goal of the Best-Fit Line
The goal of linear regression is to find a straight line that minimizes the error (the
difference) between the observed data points and the predicted values. This line
helps us predict the dependent variable for new, unseen data.
Example:
Predicting a person’s salary (y) based on their years of experience (x).
2. Multiple Linear Regression
Multiple linear regression involves more than one independent variable and one
dependent variable. The equation for multiple linear regression is:
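y = β0 + β1x1 + β2x2 + … + βnxn
where y is the dependent variable, x1 … xn are the independent variables (features) and β0 … βn are the coefficients learned from the data.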
The goal of the algorithm is to find the best Fit Line equation that can predict the values
based on the independent variables.
In regression, a set of records is present with X and Y values, and these values are used to
learn a function so that if you want to predict Y from an unknown X, this learned function can
be used. In regression we have to find the value of Y, so a function is required that predicts a
continuous Y given X as independent features.
Use Case of Multiple Linear Regression
Multiple linear regression allows us to analyze the relationship between multiple
independent variables and a single dependent variable. Here are some use cases:
• Real Estate Pricing: In real estate MLR is used to predict property prices based
on multiple factors such as location, size, number of bedrooms, etc. This helps
buyers and sellers understand market trends and set competitive prices.
• Financial Forecasting: Financial analysts use MLR to predict stock prices or
economic indicators based on multiple influencing factors such as interest rates,
inflation rates and market trends. This enables better investment strategies and
risk management.
• Agricultural Yield Prediction: Farmers can use MLR to estimate crop yields
based on several variables like rainfall, temperature, soil quality and fertilizer
usage. This information helps in planning agricultural practices for optimal
productivity
• E-commerce Sales Analysis: An e-commerce company can utilize MLR to assess
how various factors such as product price, marketing promotions and seasonal
trends impact sales.
Implementation of Linear Regression
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# Sample data
X = np.array([1,2,3,4,5]).reshape(-1,1)
y = np.array([2,4,5,4,5])
# Model
model = LinearRegression()
model.fit(X, y)
# Prediction
y_pred = model.predict(X)
# Plot
plt.scatter(X, y, color='blue')
plt.plot(X, y_pred, color='red')
plt.title("Linear Regression")
plt.xlabel("X")
plt.ylabel("y")
plt.show()
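As a small extension of the example above, the learned slope and intercept can be inspected after fitting:
# The fitted line is y = intercept_ + coef_[0] * x
print("Slope (coefficient):", model.coef_[0])
print("Intercept:", model.intercept_)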
Polynomial Regression
Polynomial regression is a regression technique used to model the relationship
between an independent variable x and a dependent variable y as an nth-
degree polynomial. Unlike linear regression, which assumes a straight-line (linear)
relationship, polynomial regression can capture non-linear, curved patterns in
data by including higher-degree terms of the independent variable. If we fit such a
curvilinear regression line to the data, the results that we obtain are far better
than what we can achieve with standard linear regression. Some of the use cases of
polynomial regression are stated below:
• The growth rate of tissues.
• Progression of disease epidemics
• Distribution of carbon isotopes in lake sediments
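For reference, a polynomial regression model of degree n has the general form
y = β0 + β1x + β2x² + … + βnxⁿ
so ordinary linear regression is just the special case n = 1.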
Advantages & Disadvantages of using Polynomial Regression
Advantages of using Polynomial Regression
• A broad range of functions can be fit under it.
• Polynomial basically fits a wide range of curvatures.
• Polynomial provides the best approximation of the relationship between
dependent and independent variables.
Disadvantages of using Polynomial Regression
• These are too sensitive to outliers.
• The presence of one or two outliers in the data can seriously affect the results of
nonlinear analysis.
• In addition, there are unfortunately fewer model validation tools for the detection
of outliers in nonlinear regression than there are for linear regression.
Implementation of Polynomial Regression
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([1, 4, 9, 16, 25])
# Expand the features with squared terms (degree-2 polynomial)
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
model = LinearRegression()
model.fit(X_poly, y)
y_pred = model.predict(X_poly)
plt.scatter(X, y, color='blue')
plt.plot(X, y_pred, color='red')
plt.title("Polynomial Regression")
plt.xlabel("X")
plt.ylabel("y")
plt.show()
Evaluation Metrics for Regression
Mean Absolute Error (MAE) – Example
from sklearn.metrics import mean_absolute_error
true_values = [2.5, 3.7, 1.8, 4.0, 5.2]
predicted_values = [2.1, 3.9, 1.7, 3.8, 5.0]
mae = mean_absolute_error(true_values, predicted_values)
print("Mean Absolute Error:", mae)
Output: Mean Absolute Error: 0.22000000000000003
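For reference, MAE is the average of the absolute differences between the true and predicted values:
MAE = (1/n) Σ |yᵢ − ŷᵢ|
For the example above this is (0.4 + 0.2 + 0.1 + 0.2 + 0.2) / 5 = 0.22.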
Mean Squared Error (MSE) – Example
from sklearn.metrics import mean_squared_error
true_values = [2.5, 3.7, 1.8, 4.0, 5.2]
predicted_values = [2.1, 3.9, 1.7, 3.8, 5.0]
mse = mean_squared_error(true_values, predicted_values)
print("Mean Squared Error:", mse)
Output: Mean Squared Error: 0.057999999999999996
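For reference, MSE is the average of the squared differences between the true and predicted values:
MSE = (1/n) Σ (yᵢ − ŷᵢ)²
For the example above this is (0.16 + 0.04 + 0.01 + 0.04 + 0.04) / 5 = 0.058.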
R-squared (R²) Score
• A statistical metric frequently used to assess the goodness of fit of a regression
model is the R-squared (R2) score, also referred to as the coefficient of
determination.
• It quantifies the percentage of the dependent variable's variation that the model's
independent variables contribute to. R2 is a useful statistic for evaluating the
overall effectiveness and explanatory power of a regression model.
• Mathematical Formula
The formula to calculate the R-squared score is as follows:
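R² = 1 − (SSres / SStot) = 1 − Σ(yᵢ − ŷᵢ)² / Σ(yᵢ − ȳ)²
where SSres is the residual sum of squares, SStot is the total sum of squares and ȳ is the mean of the actual values.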
Example
from sklearn.metrics import r2_score
true_values = [2.5, 3.7, 1.8, 4.0, 5.2]
predicted_values = [2.1, 3.9, 1.7, 3.8, 5.0]
r2 = r2_score(true_values, predicted_values)
print("R-squared (R²) Score:", r2)
Output: R-squared (R²) Score: 0.9588769143505389
Root Mean Squared Error (RMSE)
• RMSE stands for Root Mean Squared Error. It is a commonly used metric in regression
analysis and machine learning to measure the accuracy or goodness of fit of a
predictive model, especially when the predictions are continuous numerical
values.
• The RMSE quantifies how well the predicted values from a model align with the
actual observed values in the dataset. Here's how it works:
• Calculate the Squared Differences: For each data point, subtract the
predicted value from the actual (observed) value, square the result, and
sum up these squared differences.
• Compute the Mean: Divide the sum of squared differences by the number
of data points to get the mean squared error (MSE).
• Take the Square Root: To obtain the RMSE, simply take the square root of
the MSE.
The formula for RMSE for a data with 'n' data points is as follows:
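RMSE = √( (1/n) Σ (yᵢ − ŷᵢ)² ) = √MSE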
RMSE – Example
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np
# Sample data
true_prices = np.array([250000, 300000, 200000, 400000, 350000])
predicted_prices = np.array([240000, 310000, 210000, 380000, 340000])
# Calculate RMSE
rmse = np.sqrt(mean_squared_error(true_prices, predicted_prices))
print("Root Mean Squared Error (RMSE):", rmse)
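For reference, this example works out to an RMSE of roughly 12,649.11: the squared errors are 1×10⁸, 1×10⁸, 1×10⁸, 4×10⁸ and 1×10⁸, their mean is 1.6×10⁸, and its square root is about 12,649, i.e. the price predictions are off by roughly $12,650 on average in the root-mean-square sense.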
Classification
• Classification is a supervised machine learning method where the model tries to
predict the correct label of a given input data.
• In classification, the model is fully trained using the training data, and then it is
evaluated on test data before being used to perform prediction on new unseen
data.
• For instance, an algorithm can learn to predict whether a given email is spam or
ham (no spam), as illustrated below.
Types of Classification
1. Binary Classification
This is the simplest kind of classification. In binary classification, the goal is to sort
the data into two distinct categories. Think of it like a simple choice between two
options. Imagine a system that sorts emails into either spam or not spam. It
works by looking at different features of the email like certain keywords or
sender details, and decides whether it’s spam or not. It only chooses between these
two options.
2. Multiclass Classification
Here, instead of just two categories, the data needs to be sorted into more than
two categories. The model picks the one that best matches the input. Think of an
image recognition system that sorts pictures of animals into categories
like cat, dog, and bird.
Basically, the machine looks at the features in the image (like shape, color or
texture) and chooses which animal the picture is most likely to be, based on
the training it received.
4. Model Evaluation: Once the model is trained, it's tested on new, unseen data to
check how accurately it can classify the items.
5. Prediction: After being trained and evaluated, the model can be used to predict
the class of new data based on the features it has learned.
6. Model Evaluation: Evaluating a classification model is a key step in machine
learning. It helps us check how well the model performs and how good it is at
handling new, unseen data. Depending on the problem and needs we can use
different metrics to measure its performance.
Logistic Regression
• Logistic Regression is a supervised machine learning algorithm used for
classification problems. Unlike linear regression which predicts continuous values
it predicts the probability that an input belongs to a specific class.
• It is used for binary classification where the output can be one of two possible
categories such as Yes/No, True/False or 0/1.
• It uses sigmoid function to convert inputs into a probability value between 0 and
1.
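The sigmoid (logistic) function maps any real-valued input z, typically a linear combination of the input features, to a value between 0 and 1:
σ(z) = 1 / (1 + e^(−z))
The input is usually assigned to class 1 if this probability is above a chosen threshold (commonly 0.5) and to class 0 otherwise.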
Types of Logistic Regression
Logistic regression can be classified into three main types based on the nature of the
dependent variable:
1. Binomial Logistic Regression: This type is used when the dependent variable
has only two possible categories. Examples include Yes/No, Pass/Fail or 0/1. It is
the most common form of logistic regression and is used for binary classification
problems.
2. Multinomial Logistic Regression: This is used when the dependent variable has
three or more possible categories that are not ordered. For example, classifying
animals into categories like "cat," "dog" or "sheep." It extends the binary logistic
regression to handle multiple classes.
3. Ordinal Logistic Regression: This type applies when the dependent variable has
three or more categories with a natural order or ranking. Examples include ratings
like "low," "medium" and "high." It takes the order of the categories into account
when modeling.
Key Terminologies in Logistic Regression
• Log-Odds
The relationship between the independent variables and the dependent variable is expressed through log-odds. This helps model how changes
in the independent variables affect the likelihood of the outcome.
• Coefficient
Coefficients are the values that show how each independent variable influences
the dependent variable. They indicate the strength and direction of the
relationship. For example, a positive coefficient means that as the independent
variable increases, the likelihood of the dependent variable being 1 also increases.
• Intercept
The intercept is a constant term in the model representing the dependent
variable's log odds when all the independent variables are zero. It provides a
baseline level of the dependent variable’s probability before considering the
effects of the independent variables.
• Maximum Likelihood Estimation
Maximum likelihood estimation (MLE) is the method used to find the best-fitting
coefficients for the model. It determines the values that make the observed data
most probable under the logistic regression framework, ensuring the model
provides the most accurate predictions based on the given data.
Working of Logistic Regression
Consider the following example: An organization wants to determine an
employee’s salary increase based on their performance.
For this purpose, a linear regression algorithm will help them decide. Plotting a
regression line by considering the employee’s performance as the independent
variable, and the salary increase as the dependent variable will make their task
easier.
Now, what if the organization wants to know whether an employee would get a
promotion or not based on their performance? The above linear graph won’t be
suitable in this case. As such, we clip the line at zero and one, and convert it into a
sigmoid curve (S curve).
Based on the threshold values, the organization can decide whether an employee
will get a salary increase or not.
To understand logistic regression, let’s go over the odds of success.
Odds (θ) = Probability of the event happening / Probability of the event not happening
θ = p / (1 − p)
The values of the odds range from zero to ∞, while the values of probability lie between
zero and one.
Consider the equation of a straight line:
y = β0 + β1·x
Let Y = e^(β0 + β1·x). Setting the odds of success equal to Y:
p(x) / (1 − p(x)) = Y
p(x) = Y · (1 − p(x))
p(x) = Y − Y·p(x)
p(x) + Y·p(x) = Y
p(x) · (1 + Y) = Y
p(x) = Y / (1 + Y)
Substituting Y = e^(β0 + β1·x) back in gives p(x) = e^(β0 + β1·x) / (1 + e^(β0 + β1·x)), which is exactly the sigmoid function applied to the linear term β0 + β1·x.
K-Nearest Neighbors (KNN) Algorithm
K-Nearest Neighbors is also called a lazy learner algorithm because it does not learn
from the training set immediately; instead, it stores the dataset and performs an action
on it at the time of classification.
For example, consider the following table of data points containing two features:
1. Euclidean Distance
Euclidean distance is the straight-line distance between two points and is the most commonly used distance measure in KNN.
2. Manhattan Distance
This is the total distance you would travel if you could only move along horizontal and
vertical lines like a grid or city streets. It’s also called "taxicab distance" because a taxi can
only drive along the grid-like streets of a city.
3. Minkowski Distance
Minkowski distance is like a family of distances, which includes both Euclidean and
Manhattan distances as special cases.
From the formula above, when p=2, it becomes the same as the Euclidean distance
formula and when p=1, it turns into the Manhattan distance formula. Minkowski distance
is essentially a flexible formula that can represent either Euclidean or Manhattan distance
depending on the value of p.
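In symbols, for two points x = (x1, …, xn) and y = (y1, …, yn):
• Euclidean distance: d(x, y) = √( Σ (xᵢ − yᵢ)² )
• Manhattan distance: d(x, y) = Σ |xᵢ − yᵢ|
• Minkowski distance: d(x, y) = ( Σ |xᵢ − yᵢ|ᵖ )^(1/p)
Setting p = 2 in the Minkowski formula gives the Euclidean distance, and p = 1 gives the Manhattan distance.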
Working of KNN algorithm
Thе K-Nearest Neighbors (KNN) algorithm operates on the principle of similarity where
it predicts the label or value of a new data point by considering the labels or values of its
K nearest neighbors in the training dataset.
• The k data points with the smallest distances to the target point are nearest
neighbors.
Step 4: Voting for Classification or Taking Average for Regression
• When you want to classify a data point into a category like spam or not spam, the
KNN algorithm looks at the K closest points in the dataset. These closest points are
called neighbors. The algorithm then looks at which category the neighbors belong
to and picks the one that appears the most. This is called majority voting.
• In regression, the algorithm still looks for the K closest points. But instead of voting
for a class in classification, it takes the average of the values of those K neighbors.
This average is the predicted value for the new point for the algorithm.
A typical illustration shows how a test point is classified based on its nearest neighbors: as the test point
moves, the algorithm identifies the closest 'k' data points (5 in this case) and assigns the test
point the majority class label among them (the grey class in that illustration).
Python Implementation of KNN Algorithm
1. Importing Libraries
Counter is used to count the occurrences of elements in a list or iterable. In KNN after
finding the k nearest neighbor labels Counter helps count how many times each label
appears.
import numpy as np
from collections import Counter
2. Defining the Euclidean Distance Function
The euclidean_distance function calculates the Euclidean distance between two points.
def euclidean_distance(point1, point2):
    return np.sqrt(np.sum((np.array(point1) - np.array(point2)) ** 2))
3. KNN Prediction Function
• distances.append saves how far each training point is from the test point, along
with its label.
• distances.sort is used to sort the list so the nearest points come first.
• k_nearest_labels picks the labels of the k closest points.
• Counter is used to find which label appears most among those k labels; that label becomes
the prediction.
def knn_predict(training_data, training_labels, test_point, k):
    distances = []
    for i in range(len(training_data)):
        # Save how far each training point is from the test point, along with its label
        dist = euclidean_distance(test_point, training_data[i])
        distances.append((dist, training_labels[i]))
    # Sort so the nearest points come first
    distances.sort(key=lambda pair: pair[0])
    # Pick the labels of the k closest points
    k_nearest_labels = [label for _, label in distances[:k]]
    # The label that appears most often among the k neighbours becomes the prediction
    return Counter(k_nearest_labels).most_common(1)[0][0]
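A quick usage sketch of the two functions above; the sample points and labels below are invented purely for illustration:
# Tiny illustrative dataset: two features per point, two classes 'A' and 'B'
training_data = [[1, 2], [2, 3], [3, 3], [6, 7], [7, 8], [8, 8]]
training_labels = ['A', 'A', 'A', 'B', 'B', 'B']
# Classify new points using their 3 nearest neighbours
print(knn_predict(training_data, training_labels, test_point=[2, 2], k=3))  # expected 'A'
print(knn_predict(training_data, training_labels, test_point=[7, 7], k=3))  # expected 'B'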
Advantages of KNN
• Few parameters: Only needs to set the number of neighbors (k) and a distance
method.
• Versatile: Works for both classification and regression problems.
Disadvantages of KNN
• Slow with large data: Needs to compare every point during prediction.
• Struggles with many features: Accuracy drops when data has too many features.
• Can Overfit: It can overfit especially when the data is high-dimensional or not
clean.
Decision Tree in Machine Learning
A decision tree is a supervised learning algorithm used for both classification and
regression tasks. It has a hierarchical tree structure which consists of a root node,
branches, internal nodes and leaf nodes. It works like a flowchart that helps to make decisions
step by step, where:
• Internal nodes represent attribute tests
• Branches represent attribute values
• Leaf nodes represent final decisions or predictions.
Decision trees are widely used due to their interpretability, flexibility and low
preprocessing needs.
How Does a Decision Tree Work?
A decision tree splits the dataset based on feature values to create pure subsets ideally all
items in a group belong to the same class. Each leaf node of the tree corresponds to a class
label and the internal nodes are feature-based decision points. Let’s understand this with
an example.
Let’s consider a decision tree for predicting whether a customer will buy a product based
on age, income and previous purchases: Here's how the decision tree works:
1. Root Node (Income)
First Question: "Is the person’s income greater than $50,000?"
• If Yes, proceed to the next question.
• If No, predict "No Purchase" (leaf node).
2. Internal Node (Age):
If the person’s income is greater than $50,000, ask: "Is the person’s age above 30?"
• If Yes, proceed to the next question.
• If No, predict "No Purchase" (leaf node).
3. Internal Node (Previous Purchases):
• If the person is above 30 and has made previous purchases, predict "Purchase"
(leaf node).
• If the person is above 30 and has not made previous purchases, predict "No
Purchase" (leaf node).
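This buy/no-buy flow can be sketched with scikit-learn's DecisionTreeClassifier. The tiny dataset below (columns: income in thousands of dollars, age, previous purchases) is invented purely for illustration and follows the rules above:
from sklearn.tree import DecisionTreeClassifier, export_text
# Illustrative data: [income_in_thousands, age, previous_purchases]
X = [[30, 25, 0], [60, 45, 1], [80, 35, 1], [55, 28, 0], [90, 50, 0], [45, 40, 1]]
y = ['No Purchase', 'Purchase', 'Purchase', 'No Purchase', 'No Purchase', 'No Purchase']
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X, y)
# Print the learned, flowchart-like rules and classify a new customer
print(export_text(tree, feature_names=['income_k', 'age', 'prev_purchases']))
print(tree.predict([[70, 36, 1]]))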
Random Forest Algorithm
Random Forest is an ensemble learning method that builds many decision trees, each trained on a
random sample of the data, and combines their predictions. Its main steps are:
• Pick Random Features: When building each tree it doesn't look at all the features
(columns) at once. It picks a few at random to decide how to split the data. This
helps the trees stay different from each other.
• Each Tree Makes a Prediction: Every tree gives its own answer or prediction
based on what it learned from its part of the data.
• Combine the Predictions:
o For classification, the final answer is the category that most of the trees agree on,
i.e. majority voting.
o For regression, the final answer is the average of all the trees' predictions.
• Why It Works Well: Using random data and features for each tree helps avoid
overfitting and makes the overall prediction more accurate and trustworthy.
Random forest is an ensemble learning technique.
Key Features of Random Forest
• Handles Missing Data: It can work even if some data is missing so you don’t
always need to fill in the gaps yourself.
• Shows Feature Importance: It tells you which features (columns) are most useful
for making predictions which helps you understand your data better.
• Works Well with Big and Complex Data: It can handle large datasets with many
features without slowing down or losing accuracy.
• Used for Different Tasks: You can use it for both classification like predicting
types or labels and regression like predicting numbers or amounts.
Assumptions of Random Forest
• Each tree makes its own decisions: Every tree in the forest makes its own
predictions without relying on others.
• Random parts of the data are used: Each tree is built using random samples and
features to reduce mistakes.
• Enough data is needed: Sufficient data ensures the trees are different and learn
unique patterns and variety.
• Different predictions improve accuracy: Combining the predictions from
different trees leads to a more accurate final result.
Implementing Random Forest for Classification Tasks
Here we will predict whether a person survived the Titanic disaster.
• Import libraries and load the Titanic dataset.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
import warnings
warnings.filterwarnings('ignore')
# Load the Titanic dataset and drop rows with a missing target
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
titanic_data = pd.read_csv(url)
titanic_data = titanic_data.dropna(subset=['Survived'])
# Minimal feature preparation (an assumed, simple choice of columns and encoding)
X = titanic_data[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']].copy()
X['Sex'] = X['Sex'].map({'male': 0, 'female': 1})
X['Age'] = X['Age'].fillna(X['Age'].median())
y = titanic_data['Survived']
# Train/test split, model training and prediction
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)
y_pred = rf_classifier.predict(X_test)
# Evaluation and a single sample prediction
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print(classification_report(y_test, y_pred))
sample = X_test.iloc[0:1]
prediction = rf_classifier.predict(sample)
sample_dict = sample.iloc[0].to_dict()
print(f"Sample Passenger: {sample_dict}, Predicted Survived: {prediction[0]}")
Output:
We evaluated the model's performance using a classification report to see how well it predicts
the outcomes, and used a random sample to check the model's prediction.
Implementing Random Forest for Regression Tasks
We will do house price prediction here.
• Load the California housing dataset and create a DataFrame with features and
target.
• Separate the features and the target variable.
• Split the data into training and testing sets (80% train, 20% test).
• Initialize and train a Random Forest Regressor using the training data.
• Predict house values on test data and evaluate using MSE and R² score.
• Print a sample prediction and compare it with the actual value.
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
california_housing = fetch_california_housing()
california_data = pd.DataFrame(california_housing.data,
columns=california_housing.feature_names)
california_data['MEDV'] = california_housing.target
X = california_data.drop('MEDV', axis=1)
y = california_data['MEDV']
# Split into training and test sets (80% train, 20% test), then train the regressor
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)
rf_regressor.fit(X_train, y_train)
y_pred = rf_regressor.predict(X_test)
# Evaluate with MSE and R² score
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared Score: {r2:.2f}")
# Compare one sample prediction with its actual value
single_data = X_test.iloc[0:1]
predicted_value = rf_regressor.predict(single_data)
print(f"Predicted Value: {predicted_value[0]:.2f}, Actual Value: {y_test.iloc[0]:.2f}")
Output:
We evaluated the model's performance using Mean Squared Error and R-squared
Score which show how accurate the predictions are and used a random sample to
check model prediction.
Advantages of Random Forest
• Random Forest provides very accurate predictions even with large datasets.
• Random Forest can handle missing data well without compromising accuracy.
• It doesn’t require normalization or standardization on dataset.
• When we combine multiple decision trees it reduces the risk of overfitting of the
model.
Limitations of Random Forest
• It can be computationally expensive especially with a large number of trees.
• It’s harder to interpret the model compared to simpler models like decision trees.
2. Precision
Precision measures how many of the cases predicted as positive were actually positive. It is calculated as:
Precision = TP / (TP + FP)
Where:
• TP = True Positives
• FP = False Positives
Precision helps ensure that when the model predicts a positive outcome, it’s likely
to be correct.
3. Recall
Recall or Sensitivity measures how many of the actual positive cases were
correctly identified by the model. It is important when missing a positive
case (false negative) is more costly than false positives.
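In symbols:
Recall = TP / (TP + FN)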
Where:
• FN = False Negatives
4. F1 Score
The F1 Score is the harmonic mean of precision and recall. It is useful when we
need a balance between precision and recall as it combines both into a single
number. A high F1 score means the model performs well on both metrics. Its range
is [0,1].
Lower recall with higher precision makes the positive predictions very reliable, but the model
then misses a large number of actual positive instances. The higher the F1 score, the better
the balance between precision and recall. It can be expressed mathematically in this way:
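F1 Score = 2 × (Precision × Recall) / (Precision + Recall) = 2TP / (2TP + FP + FN)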
1. True Positive Rate (TPR)
Also called sensitivity or recall, the True Positive Rate measures how many actual positive instances were correctly identified by the model.
Formula:
TPR = TP / (TP + FN)
Where:
• TP = True Positives (correctly predicted positive cases)
• FN = False Negatives (actual positive cases incorrectly predicted as negative)
2. True Negative Rate(TNR)
Also called specificity, the True Negative Rate measures how many actual negative
instances were correctly identified by the model. It answers the question: "Out of
all the actual negative cases, how many did the model correctly identify as
negative?"
Formula:
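TNR (Specificity) = TN / (TN + FP)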
Where:
• TN = True Negatives (correctly predicted negative cases)
• FP = False Positives (actual negative cases incorrectly predicted as positive)
3. False Positive Rate (FPR)
It measures how many actual negative instances were incorrectly classified as positive. It answers: "Out of all the actual negative cases, how many were misclassified as positive?"
Formula:
FPR = FP / (FP + TN)
Where:
• FP = False Positives (incorrectly predicted positive cases)
• TN = True Negatives (correctly predicted negative cases)
4. False Negative Rate(FNR)
It measures how many actual positive instances were incorrectly classified as
negative. It answers: "Out of all the actual positive cases, how many were
misclassified as negative?"
Formula:
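FNR = FN / (FN + TP)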
Where:
• FN = False Negatives (incorrectly predicted negative cases)
• TP = True Positives (correctly predicted positive cases)
ROC Curve
It is a graphical representation of the True Positive Rate (TPR) vs the False
Positive Rate (FPR) at different classification thresholds. The curve helps
us visualize the trade-offs between sensitivity (TPR) and specificity (1 -
FPR) across various thresholds. Area Under Curve (AUC) quantifies the
overall ability of the model to distinguish between positive and negative
classes.
• AUC = 1: Perfect model (always correctly classifies positives and negatives).
• AUC = 0.5: Model performs no better than random guessing.
• AUC < 0.5: Model performs worse than random guessing (showing that the
model is inverted).
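A small, self-contained sketch of plotting a ROC curve and computing AUC with scikit-learn. The breast cancer dataset and the logistic regression classifier below are used purely as an example and are not part of these notes:
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt
# Train a simple binary classifier on example data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)
# Probability of the positive class for each test sample
y_scores = model.predict_proba(X_test)[:, 1]
# TPR and FPR at every threshold, plus the overall AUC
fpr, tpr, thresholds = roc_curve(y_test, y_scores)
auc = roc_auc_score(y_test, y_scores)
plt.plot(fpr, tpr, label=f"AUC = {auc:.2f}")
plt.plot([0, 1], [0, 1], linestyle='--', label="Random guessing")
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.title("ROC Curve")
plt.legend()
plt.show()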
Overfitting and Underfitting
Overfitting means the model learns too much, including noise and details that don't matter; underfitting means it learns too little to capture the underlying structure.
Overfitting
• Overfitting occurs when a model learns not just the underlying patterns in the
training data, but also the random noise.
As a result, it performs extremely well on training data but fails to generalize to
new data.
• Imagine you are trying to predict student marks based on study hours.
If your model creates a very complex curve that passes through every single data
point exactly, it might capture random variations (like one student who studied
little but scored high).
On new data, this complex model fails badly.
Underfitting
• Underfitting occurs when a model is too simple to capture the underlying
structure of the data.
It fails to even perform well on the training data.
• Suppose you try to use a straight line (linear regression) to predict outcomes
where the data actually follows a curve.
The model misses important trends.
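A small illustrative sketch of both problems: fitting polynomials of increasing degree to noisy, curved synthetic data. The data and the chosen degrees are invented for illustration; degree 1 tends to underfit, while a very high degree tends to overfit (high train score, lower test score):
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
# Synthetic curved data with noise
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, 60)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=60)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# Compare a too-simple, a moderate and a too-complex model
for degree in [1, 4, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree,
          "train R2:", round(r2_score(y_train, model.predict(X_train)), 3),
          "test R2:", round(r2_score(y_test, model.predict(X_test)), 3))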
Unit 3
Unsupervised Learning
Unsupervised learning is a branch of machine learning that deals with unlabeled data.
Unlike supervised learning, where the data is labeled with a specific category or outcome,
unsupervised learning algorithms are tasked with finding patterns and relationships
within the data without any prior knowledge of the data's meaning.
Unsupervised machine learning algorithms find hidden patterns in data without any
human intervention, i.e., we don't give outputs to our model. The model is trained on
input parameter values only and discovers the groups or patterns on its own.
The image shows a set of animals (elephants, camels and cows) that represents the raw data
the unsupervised learning algorithm will process.
• The "Interpretation" stage signifies that the algorithm doesn't have predefined
labels or categories for the data. It needs to figure out how to group or organize
the data based on inherent patterns.
• Algorithm represents the core of unsupervised learning process using techniques
like clustering, dimensionality reduction, or anomaly detection to identify patterns
and structures in the data.
• Processing stage shows the algorithm working on the data.
The output shows the results of the unsupervised learning process. In this case, the
algorithm might have grouped the animals into clusters based on their species (elephants,
camels, cows).
Unsupervised Learning Algorithms
There are mainly three types of algorithms used for unsupervised datasets:
• Clustering
• Association Rule Learning
• Dimensionality Reduction
1. Clustering Algorithms
Clustering in unsupervised machine learning is the process of grouping unlabeled data into
clusters based on their similarities. The goal of clustering is to identify patterns and
relationships in the data without any prior knowledge of the data's meaning.
Broadly, this technique is applied to group data based on different patterns, such as
similarities or differences, that our machine learning model finds. These algorithms are used to process
raw, unclassified data objects into groups. For example, when we have not
given output parameter values, this technique can be used to group clients based on the
input parameters provided by our data.
Some common clustering algorithms:
• K-means Clustering: Groups data into K clusters based on how close the points are
to each other.
• Hierarchical Clustering: Creates clusters by building a tree step-by-step, either
merging or splitting groups.
• Density-Based Clustering (DBSCAN): Finds clusters in dense areas and treats
scattered points as noise.
• Mean-Shift Clustering: Discovers clusters by moving points toward the most
crowded areas.
• Spectral Clustering: Groups data by analyzing connections between points using
graphs.
2. Association Rule Learning
Association rule learning, also known as association rule mining, is a common technique
used to discover associations in unsupervised machine learning. It is a rule-based ML
technique that finds useful relations between the parameters of a large data set. It is
commonly used for market basket analysis, which helps to better understand the relationship
between different products.
For example, shopping stores use algorithms based on this technique to find the relationship
between the sale of one product and the sales of other products, based on customer behaviour.
If a customer buys milk, they may also buy bread, eggs or butter. Once trained well,
such models can be used to increase sales by planning different offers.
Some common Association Rule Learning algorithms:
• Apriori Algorithm: Finds patterns by exploring frequent item combinations step-
by-step.
• FP-Growth Algorithm: An Efficient Alternative to Apriori. It quickly identifies
frequent patterns without generating candidate sets.
• Eclat Algorithm: Uses intersections of itemsets to efficiently find frequent patterns.
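A minimal market-basket sketch using the third-party mlxtend library (not mentioned in these notes; install it with pip install mlxtend). The tiny one-hot basket table below is invented for illustration:
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
# Each row is one shopping basket, one-hot encoded (True = item was bought)
baskets = pd.DataFrame({
    'milk':   [True,  True,  False, True,  True],
    'bread':  [True,  True,  True,  True,  False],
    'eggs':   [False, True,  True,  True,  False],
    'butter': [True,  False, False, True,  True],
})
# Itemsets that appear in at least 60% of the baskets
frequent_itemsets = apriori(baskets, min_support=0.6, use_colnames=True)
# Rules such as "if milk then bread" with at least 70% confidence
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
print(rules[['antecedents', 'consequents', 'support', 'confidence']])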
Unsupervised learning has diverse applications across industries and domains. Key
applications include:
• Customer Segmentation: Algorithms cluster customers based on purchasing
behavior or demographics, enabling targeted marketing strategies.
• Anomaly Detection: Identifies unusual patterns in data, aiding fraud detection,
cybersecurity, and equipment failure prevention.
• Recommendation Systems: Suggests products, movies, or music by analyzing
user behavior and preferences.
• Image and Text Clustering: Groups similar images or documents for tasks like
organization, classification, or content recommendation.
• Social Network Analysis: Detects communities or trends in user interactions on
social media platforms.
• Astronomy and Climate Science: Classifies galaxies or groups weather patterns
to support scientific research
K means Clustering
K-Means Clustering is an Unsupervised Machine Learning algorithm which groups
unlabeled dataset into different clusters. It is used to organize data into groups
based on their similarity.
Understanding K-means Clustering
For example, an online store uses K-Means to group customers based on purchase
frequency and spending, creating segments like Budget Shoppers, Frequent Buyers
and Big Spenders for personalised marketing.
The algorithm works by first randomly picking some central points
called centroids and each data point is then assigned to the closest centroid
forming a cluster. After all the points are assigned to a cluster the centroids are
updated by finding the average position of the points in each cluster. This process
repeats until the centroids stop changing forming clusters. The goal of clustering
is to divide the data points into clusters so that similar data points belong to same
group.
How k-means clustering works?
We are given a data set of items with certain features and values for these
features like a vector. The task is to categorize those items into groups. To
achieve this we will use the K-means algorithm. 'K' in the name of the
algorithm represents the number of groups/clusters we want to classify
our items into.
The algorithm will categorize the items into k groups or clusters of similarity. To
calculate that similarity we will use the Euclidean distance as a measurement. The
algorithm works as follows:
1. First we randomly initialize k points called means or cluster centroids.
2. We categorize each item to its closest mean and we update the mean's coordinates,
which are the averages of the items categorized in that cluster so far.
3. We repeat the process for a given number of iterations and at the end, we have our
clusters.
The "points" mentioned above are called means because they are the mean values
of the items categorized in them. To initialize these means, we have a lot of options.
An intuitive method is to initialize the means at random items in the data set.
Another method is to initialize the means at random values between the
boundaries of the data set. For example, if for a feature x the items have values in
[0, 3], we will initialize the means with values for x within [0, 3].
Selecting the right number of clusters is important for meaningful segmentation. To
do this we use the Elbow Method, a graphical tool used to determine the optimal number
of clusters (k) in K-means.
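A short sketch of the Elbow Method on the Iris data (the same data used in the implementation that follows): run K-means for several values of k and plot the inertia (within-cluster sum of squared distances); the bend, or elbow, in the curve suggests a reasonable k.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
X_scaled = StandardScaler().fit_transform(load_iris().data)
# Inertia for k = 1..10
inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X_scaled)
    inertias.append(km.inertia_)
plt.plot(range(1, 11), inertias, marker='o')
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia")
plt.title("Elbow Method")
plt.show()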
Implementation of K-means clustering
# Import required libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
# Load and standardize the Iris data
iris = load_iris()
X = iris.data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Fit K-means with k = 3 clusters
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans.fit(X_scaled)
labels = kmeans.labels_
# Plot the first two (standardized) features coloured by cluster
plt.figure(figsize=(8, 5))
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=labels, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c='red', marker='X', s=200, label='Centroids')
plt.xlabel("Feature 1 (Standardized)")
plt.ylabel("Feature 2 (Standardized)")
plt.legend()
plt.grid(True)
plt.show()
Hierarchical Agglomerative Clustering
Agglomerative clustering is a bottom-up approach: each data point starts as its own cluster, and the
closest clusters are merged step by step until the desired number of clusters remains.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
# Six sample points forming two groups (x ≈ 1 and x ≈ 4); this data reproduces the output shown below
X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
clustering = AgglomerativeClustering(n_clusters=2).fit(X)
print(clustering.labels_)
Output :
[1, 1, 1, 0, 0, 0]
Hierarchical Divisive clustering
It is also known as a top-down approach. This algorithm does not require prespecifying the
number of clusters. Top-down clustering requires a method for splitting a cluster that contains the
whole data, and proceeds by splitting clusters recursively until individual data points have been split into
singleton clusters.
Workflow for Hierarchical Divisive clustering :
1. Start with all data points in one cluster: Treat the entire dataset as a single large cluster.
2. Split the cluster: Divide the cluster into two smaller clusters. The division is typically
done by finding the two most dissimilar points in the cluster and using them to separate
the data into two parts.
3. Repeat the process: For each of the new clusters, repeat the splitting process:
1. Choose the cluster with the most dissimilar points.
2. Split it again into two smaller clusters.
4. Stop when each data point is in its own cluster: Continue this process until every data
point is its own cluster, or the stopping condition (such as a predefined number of
clusters) is met.
# Visualize the hierarchy as a dendrogram (using SciPy; X as defined in the agglomerative example above)
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
Z = linkage(X, method='ward')
dendrogram(Z)
plt.xlabel('Data point')
plt.ylabel('Distance')
plt.show()
Output: