Experiment No. 1
Aim: To apply various data-preprocessing techniques on a data set to prepare it for machine learning algorithms.
Theory
What is Data Preprocessing?
Data preprocessing is a data mining technique that involves transforming raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues.
Steps
1. Check for missing values
2. Handle categorical variables (Label Encoding and One-Hot Encoding)
3. Split data into training and testing sets
4. Feature scaling
Code and Output
In [1]:
import numpy as np
import pandas as pd
In [2]:
data = pd.read_csv('50_Startups.csv')
In [3]:
data.head(5)
Out[3]:
R&D Spend Administration Marketing Spend State Profit
0 165349.20 136897.80 471784.10 New York 192261.83
1 162597.70 151377.59 443898.53 California 191792.06
2 153441.51 101145.55 407934.54 Florida 191050.39
3 144372.41 118671.85 383199.62 New York 182901.99
4 142107.34 91391.77 366168.42 Florida 166187.94
In [4]:
data.shape
Out[4]:
(50, 5)
In [5]:
data.columns #features
Out[5]:
Index(['R&D Spend', 'Administration', 'Marketing Spend', 'State', 'Profit'], dtype='object')
Checking missing values
In [6]:
#check for missing values
data.isnull().any()
# It is observed that no column has missing values (all entries are False)
Out[6]:
R&D Spend False
Administration False
Marketing Spend False
State False
Profit False
dtype: bool
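If a column did report missing values, a per-column count is more informative than a boolean flag when deciding between dropping and filling. A minimal sketch on the same DataFrame (no output shown, since this data set has no missing values):
In [ ]:
# Count missing values per column
data.isnull().sum()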
Handling missing values
1. Drop rows having null values
2. Fill missing values with mean/median/mode or any relevant value
In [7]:
# Dropping null rows
data.dropna(inplace=True)
data.isnull().any()
#No null values now
Out[7]:
R&D Spend False
Administration False
Marketing Spend False
State False
Profit False
dtype: bool
In [8]:
print(data.shape)
(50, 5)
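Option 2 from the list above (filling rather than dropping) can be sketched as follows. The column names come from this data set; the fill strategy shown, the column mean for numeric features and the mode for the categorical State column, is one common choice:
In [ ]:
# Fill numeric columns with their mean, the categorical column with its mode
num_cols = ['R&D Spend', 'Administration', 'Marketing Spend', 'Profit']
data[num_cols] = data[num_cols].fillna(data[num_cols].mean())
data['State'] = data['State'].fillna(data['State'].mode()[0])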
Handling categorical variables
In [17]:
data2 = pd.read_csv('50_Startups.csv')
data2.head()
Out[17]:
R&D Spend Administration Marketing Spend State Profit
0 165349.20 136897.80 471784.10 New York 192261.83
1 162597.70 151377.59 443898.53 California 191792.06
2 153441.51 101145.55 407934.54 Florida 191050.39
3 144372.41 118671.85 383199.62 New York 182901.99
4 142107.34 91391.77 366168.42 Florida 166187.94
In [18]:
data2['Profit'].unique()
Out[18]:
array([192261.83, 191792.06, 191050.39, 182901.99, 166187.94, 156991.12,
156122.51, 155752.6 , 152211.77, 149759.96, 146121.95, 144259.4 ,
141585.52, 134307.35, 132602.65, 129917.04, 126992.93, 125370.37,
124266.9 , 122776.86, 118474.03, 111313.02, 110352.25, 108733.99,
108552.04, 107404.34, 105733.54, 105008.31, 103282.38, 101004.64,
99937.59, 97483.56, 97427.84, 96778.92, 96712.8 , 96479.51,
90708.19, 89949.14, 81229.06, 81005.76, 78239.91, 77798.83,
71498.49, 69758.98, 65200.33, 64926.08, 49490.75, 42559.73,
35673.41, 14681.4 ])
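Before encoding, it also helps to inspect the categorical column itself; based on the head of the data set, the expected output is a small array of state names:
In [ ]:
# Distinct categories in the column to be encoded
data2['State'].unique()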
In [160]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
In [19]:
data_LE = data2.copy()
data_LE['State'] = label_encoder.fit_transform(data_LE['State'])
In [20]:
data_LE.head()
Out[20]:
R&D Spend Administration Marketing Spend State Profit
0 165349.20 136897.80 471784.10 2 192261.83
1 162597.70 151377.59 443898.53 0 191792.06
2 153441.51 101145.55 407934.54 1 191050.39
3 144372.41 118671.85 383199.62 2 182901.99
4 142107.34 91391.77 366168.42 1 166187.94
In [21]:
data_LE_df = pd.DataFrame(data_LE)
In [22]:
data_LE_df.dropna(inplace=True)
In [23]:
data_LE_df
Out[23]:
R&D Spend Administration Marketing Spend State Profit
0 165349.20 136897.80 471784.10 2 192261.83
1 162597.70 151377.59 443898.53 0 191792.06
2 153441.51 101145.55 407934.54 1 191050.39
3 144372.41 118671.85 383199.62 2 182901.99
4 142107.34 91391.77 366168.42 1 166187.94
5 131876.90 99814.71 362861.36 2 156991.12
6 134615.46 147198.87 127716.82 0 156122.51
7 130298.13 145530.06 323876.68 1 155752.60
8 120542.52 148718.95 311613.29 2 152211.77
9 123334.88 108679.17 304981.62 0 149759.96
10 101913.08 110594.11 229160.95 1 146121.95
11 100671.96 91790.61 249744.55 0 144259.40
12 93863.75 127320.38 249839.44 1 141585.52
13 91992.39 135495.07 252664.93 0 134307.35
14 119943.24 156547.42 256512.92 1 132602.65
15 114523.61 122616.84 261776.23 2 129917.04
16 78013.11 121597.55 264346.06 0 126992.93
17 94657.16 145077.58 282574.31 2 125370.37
18 91749.16 114175.79 294919.57 1 124266.90
19 86419.70 153514.11 0.00 2 122776.86
20 76253.86 113867.30 298664.47 0 118474.03
21 78389.47 153773.43 299737.29 2 111313.02
22 73994.56 122782.75 303319.26 1 110352.25
23 67532.53 105751.03 304768.73 1 108733.99
24 77044.01 99281.34 140574.81 2 108552.04
25 64664.71 139553.16 137962.62 0 107404.34
26 75328.87 144135.98 134050.07 1 105733.54
27 72107.60 127864.55 353183.81 2 105008.31
28 66051.52 182645.56 118148.20 1 103282.38
29 65605.48 153032.06 107138.38 2 101004.64
30 61994.48 115641.28 91131.24 1 99937.59
31 61136.38 152701.92 88218.23 2 97483.56
32 63408.86 129219.61 46085.25 0 97427.84
33 55493.95 103057.49 214634.81 1 96778.92
34 46426.07 157693.92 210797.67 0 96712.80
35 46014.02 85047.44 205517.64 2 96479.51
36 28663.76 127056.21 201126.82 1 90708.19
37 44069.95 51283.14 197029.42 0 89949.14
38 20229.59 65947.93 185265.10 2 81229.06
39 38558.51 82982.09 174999.30 0 81005.76
40 28754.33 118546.05 172795.67 0 78239.91
41 27892.92 84710.77 164470.71 1 77798.83
42 23640.93 96189.63 148001.11 0 71498.49
43 15505.73 127382.30 35534.17 2 69758.98
44 22177.74 154806.14 28334.72 0 65200.33
45 1000.23 124153.04 1903.93 2 64926.08
46 1315.46 115816.21 297114.46 1 49490.75
47 0.00 135426.92 0.00 0 42559.73
48 542.05 51743.15 0.00 2 35673.41
49 0.00 116983.80 45173.06 0 14681.40
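The Steps above also mention One-Hot Encoding, which avoids imposing an artificial order (0 < 1 < 2) on the states. A minimal sketch using pandas' get_dummies on the raw data:
In [ ]:
# One-hot encode the State column; each state becomes its own 0/1 indicator column
data_OHE = pd.get_dummies(data2, columns=['State'])
data_OHE.head()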
Splitting into training and testing sets
In [26]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data_LE_df, data_LE_df['Profit'], test_size=0.2)
In [27]:
X_train.head()
Out[27]:
R&D Spend Administration Marketing Spend State Profit
25 64664.71 139553.16 137962.62 0 107404.34
0 165349.20 136897.80 471784.10 2 192261.83
10 101913.08 110594.11 229160.95 1 146121.95
14 119943.24 156547.42 256512.92 1 132602.65
35 46014.02 85047.44 205517.64 2 96479.51
In [28]:
y_train.head()
Out[28]:
25 107404.34
0 192261.83
10 146121.95
14 132602.65
35 96479.51
Name: Profit, dtype: float64
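Note that the split above keeps the Profit column inside X_train as well as in y_train, so the target leaks into the features. A cleaner sketch on the same DataFrame drops the target from the feature matrix and fixes the random seed for reproducibility:
In [ ]:
# Separate features from the target before splitting
X = data_LE_df.drop(columns=['Profit'])
y = data_LE_df['Profit']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)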
Feature Scaling
In [29]:
from sklearn.preprocessing import StandardScaler
standard_X = StandardScaler()
In [30]:
X_train = standard_X.fit_transform(X_train)
X_test = standard_X.transform(X_test)  # fit only on training data to avoid leaking test-set statistics
In [31]:
pd.DataFrame(X_train) #SCALED
Out[31]:
0 1 2 3 4
0 -0.147778 0.768777 -0.732925 -1.248168 -0.078585
1 2.099133 0.672035 2.246595 1.187282 2.114855
2 0.683470 -0.286287 0.081064 -0.030443 0.922208
3 1.085838 1.387929 0.325194 -0.030443 0.572754
4 -0.563993 -1.217028 -0.129964 1.187282 -0.360975
5 -0.949166 0.003426 -0.422023 -1.248168 -0.832442
6 -1.590858 -0.053492 -1.561117 -1.248168 -2.475335
7 0.158509 1.286864 0.710993 1.187282 0.022449
8 -0.730373 -1.292275 -0.402355 -1.248168 -0.760949
9 0.521545 0.970048 0.557805 1.187282 0.385810
10 1.316921 0.986533 0.926449 -0.030443 1.171146
11 -0.607378 -2.447162 -0.205725 -1.248168 -0.529776
12 -0.352435 -0.560869 -0.048589 -0.030443 -0.353236
13 0.964891 0.151737 0.372172 1.187282 0.503335
14 -1.244827 0.325357 -1.647149 1.187282 -1.051661
15 -1.578762 -2.430403 -1.964309 1.187282 -1.932723
16 -1.139408 -1.912880 -0.310728 1.187282 -0.755177
17 -1.561502 -0.096031 0.687583 -0.030443 -1.575565
18 -0.968390 -1.229294 -0.496328 -0.030443 -0.843843
19 1.631008 0.008009 1.455935 1.187282 1.872917
20 0.018321 0.342926 1.188029 1.187282 -0.140519
21 -1.095932 1.324489 -1.711408 -1.248168 -1.169496
22 -0.083778 -0.462735 0.755901 -0.030443 -0.044215
23 0.090208 0.935743 -0.767847 -0.030443 -0.121772
24 0.150110 0.114601 0.395109 -1.248168 0.427751
25 0.462077 0.620929 0.290849 -1.248168 0.616818
26 1.099212 1.102714 0.816992 1.187282 1.079621
27 1.413268 1.047333 -0.824374 -1.248168 1.180707
28 1.580460 -0.985886 1.303923 -0.030443 1.440884
29 1.352154 -0.679013 1.274406 1.187282 1.203160
30 -0.951188 0.313476 -0.169154 -0.030443 -0.510155
31 0.655773 -0.971355 0.264783 -1.248168 0.874064
32 -0.554798 1.429699 -0.082837 -1.248168 -0.354945
33 -1.063279 -0.811085 -0.643327 -1.248168 -1.006698
34 -1.568537 0.207705 -1.947316 1.187282 -1.176585
35 0.503839 0.323101 0.265630 -0.030443 0.804948
36 0.060431 0.157781 0.742964 -0.030443 -0.002386
37 0.110850 -0.167035 0.701417 -1.248168 0.207550
38 0.456649 -0.155796 0.667992 -0.030443 0.357287
39 0.337715 1.277416 -1.964309 1.187282 0.318772
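Standardisation is one scaling choice; normalization to the [0, 1] range is another. A minimal sketch using scikit-learn's MinMaxScaler, assuming it is applied to the split from In [26] before standardisation (again fitting only on the training data):
In [ ]:
from sklearn.preprocessing import MinMaxScaler

minmax_X = MinMaxScaler()
X_train_norm = minmax_X.fit_transform(X_train)  # learn min/max from training data only
X_test_norm = minmax_X.transform(X_test)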
Result
The data set was pre-processed by checking for missing values, label-encoding the categorical State column, splitting the data into training and testing sets, and applying feature scaling (standardisation).
Conclusion
Real-world data are generally incomplete (lacking attribute values, lacking certain attributes of interest, or containing only aggregate data), noisy (containing errors or outliers), and inconsistent (containing discrepancies in codes or names). Hence, it is essential to pre-process data so that machine learning algorithms can be applied without any hindrance.