Introduction to ML_Data Preprocessing.pptx
Introduction to ML
Machine Learning Definition
• Definition by Arthur Samuel (1959): Machine Learning is a field of study that gives computers the ability to learn without being explicitly programmed.
Data Preprocessing
Data scientists process and analyse data using a number of methods and tools, such
as statistical models, machine learning algorithms, and data visualisation software.
Data science seeks to uncover patterns in data that can help with decision-making,
process improvement, and the creation of new opportunities. Business, engineering,
and the social sciences are all included in this interdisciplinary field.
Data Preprocessing
Pre-processing refers to the transformations applied to our data before feeding it to the algorithm. Data preprocessing is a technique that is used to convert raw data into a clean data set. In other words, whenever data is gathered from different sources it is collected in a raw format that is not feasible for analysis.
Need of Data Preprocessing
• To achieve better results from the applied model in Machine Learning projects, the data has to be in a proper format.
• Some Machine Learning models need information in a specific format; for example, the Random Forest algorithm does not support null values, so null values have to be handled in the original raw data set before the algorithm can be executed (a sketch follows this list).
• Another aspect is that the data set should be formatted so that more than one Machine Learning or Deep Learning algorithm can be executed on the same data set, and the best of them chosen.
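As a quick illustration (not from the original slides; the file name and the choice of mean imputation here are hypothetical), null values can be handled in pandas before fitting a model:

import pandas as pd

# Hypothetical dataset with missing values
df = pd.read_csv('data.csv')

# Impute numeric nulls with the column mean...
df = df.fillna(df.mean(numeric_only=True))

# ...or simply drop rows that contain any null value
# df = df.dropna()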
Steps in Data Preprocessing
• Step 1: Import the necessary libraries
• Step 2: Load the dataset
• Step 3: Statistical Analysis
• Step 4: Check the outliers
• Step 5: Correlation
• Step 6: Separate independent features and Target Variables
• Step 7: Normalization or Standardization
Steps in Data Preprocessing
Step 1: Import the necessary libraries
# importing libraries
import pandas as pd : Pandas is a powerful library used for data manipulation and analysis,
especially for handling structured data like spreadsheets or databases.
import scipy : SciPy builds on top of NumPy and provides functions for optimization, signal
processing, linear algebra, statistics, and more.
import numpy as np : NumPy is a fundamental package for scientific computing in Python. It
provides support for arrays, matrices, mathematical functions, and operations that are essential for
numerical computations.
from sklearn.preprocessing import MinMaxScaler : This line imports the MinMaxScaler
class from the scikit-learn library (sklearn). Scikit-learn is a popular machine learning library in Python.
The MinMaxScaler is used for scaling features in a dataset to a specific range, usually between 0 and
1.
import seaborn as sns : Seaborn is a data visualization library built on top of Matplotlib that
provides a higher-level interface for creating aesthetically pleasing and informative statistical
graphics.
import matplotlib.pyplot as plt : This line imports the pyplot submodule from the Matplotlib library. Matplotlib is a widely used plotting library in Python, and pyplot provides a convenient interface for creating figures and charts.
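Collected together, the imports above form the following block (a minimal sketch; scipy is listed on the slide but is not used in the later steps):

# importing libraries
import pandas as pd                             # data loading and manipulation
import numpy as np                              # arrays and numerical operations
import scipy                                    # scientific computing utilities (unused below)
from sklearn.preprocessing import MinMaxScaler  # feature scaling to a fixed range
import seaborn as sns                           # statistical plotting on top of Matplotlib
import matplotlib.pyplot as plt                 # basic plotting interface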
Step 2: Load the dataset
# Load the dataset
df = pd.read_csv('/content/drive/MyDrive/Data/diabetes.csv')
print(df.head())
Check the data info
df.info()
As we can see from the info above, our dataset has 9 columns and each column has 768 non-null values, so there are no null values in the dataset.
• We can also check for null values using df.isnull()
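As a short sketch, both checks can be run together; df.isnull().sum() condenses df.isnull() into a per-column count:

df.info()                  # column types and non-null counts
print(df.isnull().sum())   # per-column count of null values (all zeros here)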
Step 3: Statistical Analysis
For statistical analysis, we first use df.describe(), which gives a descriptive overview of the dataset.
df.describe()
The above table shows the count, mean, standard deviation, min, 25%, 50%, 75%, and max values for each column. When we observe the table carefully, we find that the Insulin, Pregnancies, BMI, and BloodPressure columns have outliers.
Let’s plot the boxplot for each column for easy understanding.
Step 4: Check the outliers
# Box Plots
fig, axs = plt.subplots(9, 1, dpi=95, figsize=(7, 17))
i = 0
for col in df.columns:
    axs[i].boxplot(df[col], vert=False)
    axs[i].set_ylabel(col)
    i += 1
plt.show()
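To quantify what the boxplots show, one common follow-up (not on the original slides) is the 1.5 × IQR rule used by the boxplot whiskers; this sketch counts the values outside the whiskers for each column:

# Count values outside the 1.5 * IQR whiskers for each column
for col in df.columns:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_outliers = ((df[col] < lower) | (df[col] > upper)).sum()
    print(f'{col}: {n_outliers} outliers')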
Step 5: Correlation
Correlation refers to the statistical relationship between two entities. It measures the extent to which two variables are linearly related. For example, the height and weight of a person are related: taller people tend to be heavier than shorter people.
# correlation
corr = df.corr()
plt.figure(dpi=130)
sns.heatmap(corr, annot=True, fmt='.2f')
plt.show()

corr['Outcome'].sort_values(ascending=False)
We can also compare the correlations against a single column, such as the target, sorted in descending order.
Step 6: Separate independent features and Target Variables
# separate array into input and output components
X = df.drop(columns =['Outcome'])
Y = df.Outcome
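A quick sanity check on the split (Step 2 reported 768 rows and 9 columns, so dropping Outcome leaves 8 feature columns):

print(X.shape, Y.shape)   # expected: (768, 8) (768,)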
Step 7: Normalization or Standardization
Normalization
• MinMaxScaler scales the data so that each feature is in the range [0, 1].
• It works well when the features have different scales and the algorithm being used is sensitive to the
scale of the features, such as k-nearest neighbors or neural networks.
• Rescale your data using scikit-learn using the MinMaxScaler.
Standardization
• Standardization is a useful technique to transform attributes with a Gaussian distribution and
differing means and standard deviations to a standard Gaussian distribution with a mean of 0 and a
standard deviation of 1.
• We can standardize data using scikit-learn with the StandardScaler class.
• It works well when the features have a normal distribution or when the algorithm being used is not sensitive to the scale of the features.
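A minimal sketch of both options applied to the X from Step 6 (StandardScaler is an extra import beyond the MinMaxScaler import from Step 1):

from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Normalization: rescale each feature to the range [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization: rescale each feature to mean 0, standard deviation 1
X_std = StandardScaler().fit_transform(X)

print(X_minmax.min(), X_minmax.max())   # 0.0 1.0
print(X_std.mean(axis=0).round(2))      # each column approximately 0.0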