Data Mining Theory and Python Project.pptx
Restaurant Data Analysis
Course Name: Software Development Project – II
Course Code : ICT-3112
3/3/2024 Presented by: Sadika, Noor & Rakib 2
Team members
No Name ID
01 Sadika Khatun Jhinu IT20029
02 Gazi Md. Noor Hossain IT20030
03 Rakibul Islam IT20031
Supervisor
Md. Tanvir Rahman
Assistant Professor
Dept. of ICT
MBSTU
 Dataset
 Data Mining
 Python Programming Language
 Binary & Discrete Classification
 Euclidean Distance
 Minkowski Distance
 Regression Analysis
 Linear Regression
 Covariance
 Deviation
 Prediction Using SVM
 ROC Curve
Contents
Our aim is to
 Collect a Dataset from Kaggle
 Implement the knowledge that we learned in the Data Mining course
 Implement it using the Python programming language
Project Proposal
Dataset
 We collected this restaurant dataset from Kaggle. Kaggle is a popular
online platform, founded in 2010, for data science competitions, machine
learning challenges, and datasets.
 It contains customer details, their personal ratings, and their payment method.
 It is a primarily numerical dataset.
 It contains 2,000 records for analysis.
 The dataset file is in .csv (Comma-Separated Values) format, which
stores data in a tabular form.
• The attributes of this file:
1. CustomerID
2. Height
3. Weight
4. Age
5. annual_income
6. ratings
7. Price
8. Payment
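As a sketch, a CSV with these attributes can be loaded with pandas. The rows below are copied from the sample table shown later in the slides; the actual Kaggle file name is not given here, so the snippet reads from an inline string instead.

```python
import io
import pandas as pd

# A few rows taken from the sample table shown later in the slides
csv_text = """CustomerID,height,weight,age,annual_income,rate,price,payment
1,65,112,19,15000,3.4,1325,cash
2,71,136,21,35000,3.9,1600,cash
3,69,153,20,86000,3.7,1850,VISA
"""

# In the project this would be pd.read_csv(...) on the downloaded Kaggle file
data = pd.read_csv(io.StringIO(csv_text))
print(data.columns.tolist())
```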
Data Mining
Data mining is a process of extracting meaningful patterns, trends, and insights
from large volumes of data. It involves the use of advanced algorithms and
statistical techniques to discover hidden relationships within datasets.
Key features of Data Mining:
 Classification and Clustering: Data mining allows for the categorization
of data into distinct groups through classification. Clustering involves
grouping similar data points together without predefined categories.
 Anomaly Detection: It can identify unusual or anomalous data points.
This feature is valuable for fraud detection, outlier identification, and
quality control.
 Regression Analysis: This involves the estimation of relationships
between variables.
 Association Rule Mining: It identifies relationships between different
items in a dataset.
 Predictive Modeling: Data mining enables the creation of predictive
models that can forecast future trends or outcomes based on historical data.
Python Programming Language
 Python is a high-level, versatile, and dynamically-typed programming language
known for its simplicity, readability, and extensive standard library.
 Python programming language is being used in web development, Machine Learning
applications, along with all cutting-edge technology in Software Industry.
 Python's simplicity, readability, extensive libraries, and versatility have made it a
favored language across a wide range of industries and applications, from web
development to scientific research and artificial intelligence.
Applications of Python
Python can be used on a server to create web applications.
Python can be used alongside software to create workflows.
Python can connect to database systems and can also read and modify files.
Python can be used to handle big data and perform complex mathematics.
Classification
Classification is a process of categorizing data or objects into predefined classes
or categories based on their features or attributes. In machine learning,
classification is a type of supervised learning technique where an algorithm is
trained on a labeled dataset to predict the class or category of new, unseen data.
Classification is of two types:
1. Binary Classification: In binary classification, the goal is to classify the input into one of two
classes or categories.
2. Multiclass Classification: In multi-class classification, the goal is to classify the input into one
of several classes or categories.
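A minimal sketch of the two settings, using scikit-learn's SVC on toy labels (the classifier choice and the toy data are assumptions for illustration; the slides introduce SVM later):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y_binary = np.array([0, 0, 1, 1])   # binary: exactly two classes
y_multi = np.array([0, 1, 2, 2])    # multiclass: several classes

clf_bin = SVC().fit(X, y_binary)    # same API handles both settings
clf_multi = SVC().fit(X, y_multi)
```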
Binarization
• A simple technique to binarize a categorical attribute is the following: if
there are m categorical values, then uniquely assign each original value
to an integer in the interval [0, m−1].
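A small illustration of that mapping, using the dataset's payment column as a hypothetical example:

```python
# Assign each of the m distinct categorical values an integer in [0, m-1]
payments = ['cash', 'VISA', 'VISA', 'cash']
categories = sorted(set(payments))                   # m = 2 distinct values
mapping = {c: i for i, c in enumerate(categories)}   # {'VISA': 0, 'cash': 1}
encoded = [mapping[p] for p in payments]             # [1, 0, 0, 1]
```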
• Here, if we split the Weight attribute of the dataset by applying the
conditions below, the code is:
condition1 = data['weight'] < 30
condition2 = (data['weight'] >= 30) & (data['weight'] <= 60)
condition3 = data['weight'] > 60
data['Below_30'] = condition1.astype(int)
data['Between_30_and_60'] = condition2.astype(int)
data['Above_60'] = condition3.astype(int)
print(data)
Binarization (output screenshot of the code above)
Discretization
Discretization is typically applied to attributes that are used in classification or
association analysis. Transformation of a continuous attribute to a categorical attribute
involves two subtasks: deciding how many categories, n, to have and determining how
to map the values of the continuous attribute to these categories.
Here, for threshold = 3, we can split our Weight attribute into 3 specific categories.
num_bins = 3
bin_labels = ['Less', 'Medium', 'More']
data['New Weight'] = pd.cut(data['weight'], bins=num_bins, labels=bin_labels)
print(data)
Discretization (output screenshot of the code above)
Euclidean Distance
The Euclidean distance is a measure of the straight-line distance between two
points in Euclidean space. It is the most commonly used distance metric in
geometry and machine learning.
Properties:
1. It is always non-negative (d≥0).
2. It is symmetric, meaning the distance from point A to point B is the same as from point B
to point A.
3. It satisfies the triangle inequality, which means the shortest distance between two points
is a straight line.
Euclidean distance, d = √( Σᵢ₌₁ⁿ (xᵢ − yᵢ)² )
point1 = data['weight']
point2 = data['height']
distance = np.linalg.norm(point1 - point2)
Euclidean distance: 2698.051
Minkowski Distance
The Minkowski distance is a metric used to measure the distance between two points in
a multidimensional space. It is a generalization of other distance metrics like Euclidean
distance and Manhattan distance.
Minkowski distance, d = ( Σᵢ₌₁ⁿ |xᵢ − yᵢ|ᵖ )^(1/p)
Some properties of the Minkowski distance:
1. When p=1, it is called the Manhattan distance or L1 norm.
2. When p=2, it is called the Euclidean distance or L2 norm.
3. As p approaches infinity, the Minkowski distance approaches the Chebyshev
distance.
point1 = data['weight']
point2 = data['height']
p = 2
distance = np.power(np.sum(np.abs(point1 - point2) ** p), 1 / p)
Minkowski distance (p=2): 2698.0517
Regression Analysis
Regression analysis is a statistical method that shows the relationship between
two or more variables.
 Usually expressed in a graph, the method tests the relationship between a
dependent variable and one or more independent variables.
 Typically, the dependent variable changes with the independent variable(s),
and the regression analysis attempts to answer which factors matter most to
that change.
 Generally, regression analysis is used to:
 Try and explain a phenomenon
 Predict future events
 Optimize manufacturing and delivery processes
 Resolve errors
 Provide new insights
Linear Regression
• Linear regression is a type of supervised machine learning algorithm that
computes the linear relationship between a dependent variable and one or
more independent features.
• The equation for linear regression: y = a + bx
Here, x is the independent variable
y is the dependent variable
a = intercept of the regression line
b = slope of the regression line
Again,
b = ( Σxy − (Σx · Σy) / n ) / ( Σx² − (Σx)² / n )
and, a = ȳ − b · x̄
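As a sketch, the closed-form slope and intercept above can be computed directly in NumPy; the points below are made up so that they lie exactly on y = 1 + 2x:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])   # exactly y = 1 + 2x
n = len(x)

# Slope b and intercept a from the closed-form expressions above
b = (np.sum(x * y) - np.sum(x) * np.sum(y) / n) / (np.sum(x ** 2) - np.sum(x) ** 2 / n)
a = np.mean(y) - b * np.mean(x)
# b -> 2.0 (slope), a -> 1.0 (intercept)
```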
Linear Regression
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X, Y)   # X, Y: feature and target columns from the dataset
slope = model.coef_[0]
intercept = model.intercept_

Slope (Coefficient): 2.889
Intercept: -68.252
Covariance
Covariance is a measure of the joint variability of two random variables:
it indicates the extent to which they change together. A positive covariance
means the two variables tend to increase or decrease together; a negative
covariance means one tends to increase when the other decreases.
X = data['weight']
Y = data['height']
mean_X = np.mean(X)
mean_Y = np.mean(Y)
covariance = np.sum((X - mean_X) * (Y - mean_Y)) / (len(X) - 1)
Covariance of Height and Weight: 11.17
Sample covariance formula:
Cov(x, y) = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / (n − 1)
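As a sketch, the manual sample-covariance formula agrees with NumPy's built-in np.cov; the heights and weights below are taken from the first five rows of the sample table shown later in the slides:

```python
import numpy as np

x = np.array([65, 71, 69, 68, 67], dtype=float)       # height (sample rows)
y = np.array([112, 136, 153, 142, 144], dtype=float)  # weight (sample rows)

# Sample covariance, dividing by (n - 1)
cov_manual = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)
cov_numpy = np.cov(x, y)[0, 1]   # np.cov also uses (n - 1) by default
```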
Standard Deviation
Standard deviation is a statistical measure that quantifies the amount
of variation or dispersion in a set of data points. It provides a way to
understand how spread out the values in a dataset are around the
mean.
Standard deviation, σ = √( Σᵢ₌₁ⁿ (xᵢ − x̄)² / n )
X = data['height']
Y = data['weight']
mean_X = np.mean(X)
std_dev_X = np.sqrt(np.mean((X - mean_X)**2))
Standard Deviation of Height: 1.97
Standard Deviation of Weight: 11.50
Prediction Algorithm
Prediction refers to the process of estimating or forecasting future events,
outcomes, or values based on existing data and patterns.
Key points:
 Methodology: Predictions are made using various techniques and models. These
may include statistical methods, machine learning algorithms, regression analysis,
time series analysis, and more.
 Training Data: To make accurate predictions, models are typically trained on
historical or existing data, so that the relationships or patterns they learn
carry over to new or unseen data.
 Accuracy and Performance: The accuracy of predictions is a critical metric.
Models are evaluated based on how well they can generalize to new data.
 Applications: Prediction is widely used across various domains. For instance, in
finance, predictions are made about stock prices; in healthcare, predictions are made
about disease progression; in weather forecasting, predictions are made about future
weather conditions.
Support Vector Machine (SVM)
 Support Vector Machine (SVM) is a powerful machine learning algorithm
used for linear or nonlinear classification, regression, and even outlier
detection tasks.
 SVMs can be used for a variety of tasks, such as text classification, image
classification, spam detection, handwriting identification, gene expression
analysis, face detection, and anomaly detection.
Support Vector Machine (SVM)
 Drop function: In pandas, drop is a DataFrame method (not a Python
built-in) used to remove rows or columns; here it is used to remove columns.
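A minimal sketch of drop on a toy DataFrame (the column names follow the dataset's attributes; the two rows are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'CustomerID': [1, 2],
                   'weight': [112, 136],
                   'payment': ['cash', 'cash']})

features = df.drop(columns=['CustomerID'])   # remove the ID column
# features now contains only 'weight' and 'payment'; df is unchanged
```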
CustomerID height weight age annual_income rate price payment
1 65 112 19 15000 3.4 1325 cash
2 71 136 21 35000 3.9 1600 cash
3 69 153 20 86000 3.7 1850 VISA
4 68 142 23 59000 2.7 2075 VISA
5 67 144 31 38000 2.8 1600 VISA
6 68 123 22 58000 3.4 2075 VISA
7 69 141 35 31000 4.1 1650 VISA
8 70 136 23 84000 2.8 2075 VISA
9 67 112 64 97000 3.2 1650 cash
Support Vector Machine (SVM)
 Label Encoder: Label encoding is a technique used to convert
categorical columns into numerical ones so that they can be fitted by machine
learning models that only accept numerical data. It is an important
pre-processing step in a machine learning project.
 fillna: fillna is a method used in Python for filling missing values in a pandas
DataFrame or Series. It's a common operation when working with data, as
missing values can cause issues when performing calculations or visualizing
data.
 mean: mean refers to the average of a set of numbers.
mean = sum(numbers) / len(numbers)
mean = np.mean(numbers)
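A hedged sketch combining the three steps on toy data (which columns the project actually encodes or fills is not shown in the slides, so the column names and values below are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'payment': ['cash', 'VISA', 'VISA'],
                   'rate': [3.4, None, 3.8]})

# fillna: replace a missing rating with the column mean
df['rate'] = df['rate'].fillna(df['rate'].mean())

# Label encoding: categorical payment values -> integers
df['payment'] = LabelEncoder().fit_transform(df['payment'])
```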
Support Vector Machine (SVM)
Test Data & Training data:
In machine learning and statistical modeling, datasets are typically divided into
two main subsets: training data and test data. These subsets serve distinct
purposes in developing and evaluating predictive models:
• Training Data: The training data is used to train or build the predictive
model; it teaches the model how to make predictions or classifications.
• Test Data: The test data is used to evaluate the model's performance and
assess how well it generalizes to new, unseen data.
Here, 20% of the data is used for testing, with random_state = 57.
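The split and fit described above can be sketched with scikit-learn; the features and labels below are synthetic stand-ins, but the test_size and random_state match the values stated in the slides:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))              # stand-in features
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # stand-in binary label

# 20% held out for testing, random_state = 57 as in the slides
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=57)

clf = SVC().fit(X_train, y_train)
y_pred = clf.predict(X_test)
```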
Support Vector Machine (SVM)
 Accuracy: This is the ratio of correctly predicted instances (both true positives and
true negatives) to the total instances in the dataset.
Accuracy : 0.615
 Precision: Also known as Positive Predictive Value, it is the ratio of true positives to
the sum of true positives and false positives. It measures the accuracy of the positive
predictions.
Precision : 1.0
 Recall: Also known as Sensitivity, Hit Rate, or True Positive Rate, it is the ratio of
true positives to the sum of true positives and false negatives. It measures the
sensitivity to detect the positive class.
Recall : 0.615
 F1-measure: The harmonic mean of precision and recall. It provides a balance
between precision and recall and is particularly useful when dealing with imbalanced
datasets.
F1-measure : 0.761
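These four metrics can be computed with scikit-learn; the toy labels below are illustrative, not the project's actual predictions:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 1, 0, 1]
y_pred = [1, 0, 0, 1]   # 2 true positives, 1 false negative, 1 true negative

acc = accuracy_score(y_true, y_pred)    # 3/4 = 0.75
prec = precision_score(y_true, y_pred)  # 2/2 = 1.0
rec = recall_score(y_true, y_pred)      # 2/3
f1 = f1_score(y_true, y_pred)           # harmonic mean of prec and rec = 0.8
```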
ROC Curve
The Receiver Operating Characteristic (ROC) curve is a graphical representation
that illustrates the diagnostic ability of a binary classification model. It plots the
True Positive Rate against the False Positive Rate for different classification
thresholds.
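A minimal sketch of computing the curve's points and its area with scikit-learn, on illustrative scores (not the project's model outputs):

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]   # illustrative predicted probabilities

# fpr/tpr give one (False Positive Rate, True Positive Rate) point per threshold
fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)   # area under the curve; 0.75 here
```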