Experiment No. 1
Aim: To apply various data-preprocessing techniques on a data set to prepare it for machine learning algorithms.
Theory
What is Data Preprocessing?
Data preprocessing is a data mining technique that involves transforming raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues.
Steps
1. Check for missing values
2. Handle categorical variables (Label Encoding and One-Hot Encoding)
3. Split data into training and testing sets
4. Feature scaling
Code and Output
In [1]:
import numpy as np
import pandas as pd
In [2]:
data = pd.read_csv('50_Startups.csv')
In [3]:
data.head(5)
Out[3]:
R&D Spend Administration Marketing Spend State Profit
0 165349.20 136897.80 471784.10 New York 192261.83
1 162597.70 151377.59 443898.53 California 191792.06
2 153441.51 101145.55 407934.54 Florida 191050.39
3 144372.41 118671.85 383199.62 New York 182901.99
4 142107.34 91391.77 366168.42 Florida 166187.94
In [4]:
data.shape
Out[4]:
(50, 5)
In [5]:
data.columns #features
Out[5]:
Index(['R&D Spend', 'Administration', 'Marketing Spend', 'State', 'Profit'], dtype='object')
Checking missing values
In [6]:
#check for missing values
data.isnull().any()
# It is observed that no column has missing values (all entries are False)
Out[6]:
R&D Spend False
Administration False
Marketing Spend False
State False
Profit False
dtype: bool
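If a column did report missing values, a per-column count is more informative than a boolean flag when deciding between dropping and filling. A minimal sketch on the same DataFrame (no output shown, since this data set has no missing values):
In [ ]:
# Count missing values per column
data.isnull().sum()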
Handling missing values
1. Drop rows having null values
2. Fill missing values with mean/median/mode or any relevant value
In [7]:
# Dropping null rows
data.dropna(inplace=True)
data.isnull().any()
#No null values now
Out[7]:
R&D Spend False
Administration False
Marketing Spend False
State False
Profit False
dtype: bool
In [8]:
print(data.shape)
(50, 5)
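Option 2 from the list above (filling rather than dropping) can be sketched as follows. The column names come from this data set; the fill strategy shown, the column mean for numeric features and the mode for the categorical State column, is one common choice:
In [ ]:
# Fill numeric columns with their mean, the categorical column with its mode
num_cols = ['R&D Spend', 'Administration', 'Marketing Spend', 'Profit']
data[num_cols] = data[num_cols].fillna(data[num_cols].mean())
data['State'] = data['State'].fillna(data['State'].mode()[0])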
Handling categorical variables
In [17]:
data2 = pd.read_csv('50_Startups.csv')
data2.head()
Out[17]:
R&D Spend Administration Marketing Spend State Profit
0 165349.20 136897.80 471784.10 New York 192261.83
1 162597.70 151377.59 443898.53 California 191792.06
2 153441.51 101145.55 407934.54 Florida 191050.39
3 144372.41 118671.85 383199.62 New York 182901.99
4 142107.34 91391.77 366168.42 Florida 166187.94
In [18]:
data2['Profit'].unique()
Out[18]:
array([192261.83, 191792.06, 191050.39, 182901.99, 166187.94, 156991.12,
156122.51, 155752.6 , 152211.77, 149759.96, 146121.95, 144259.4 ,
141585.52, 134307.35, 132602.65, 129917.04, 126992.93, 125370.37,
124266.9 , 122776.86, 118474.03, 111313.02, 110352.25, 108733.99,
108552.04, 107404.34, 105733.54, 105008.31, 103282.38, 101004.64,
99937.59, 97483.56, 97427.84, 96778.92, 96712.8 , 96479.51,
90708.19, 89949.14, 81229.06, 81005.76, 78239.91, 77798.83,
71498.49, 69758.98, 65200.33, 64926.08, 49490.75, 42559.73,
35673.41, 14681.4 ])
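Before encoding, it also helps to inspect the categorical column itself; based on the head of the data set, the expected output is a small array of state names:
In [ ]:
# Distinct categories in the column to be encoded
data2['State'].unique()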
In [160]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
In [19]:
data_LE = data2.copy()
data_LE['State'] = label_encoder.fit_transform(data_LE['State'])
In [20]:
data_LE.head()
Out[20]:
R&D Spend Administration Marketing Spend State Profit
0 165349.20 136897.80 471784.10 2 192261.83
1 162597.70 151377.59 443898.53 0 191792.06
2 153441.51 101145.55 407934.54 1 191050.39
3 144372.41 118671.85 383199.62 2 182901.99
4 142107.34 91391.77 366168.42 1 166187.94
In [21]:
data_LE_df = pd.DataFrame(data_LE)
In [22]:
data_LE_df.dropna(inplace=True)
In [23]:
data_LE_df
Out[23]:
R&D Spend Administration Marketing Spend State Profit
0 165349.20 136897.80 471784.10 2 192261.83
1 162597.70 151377.59 443898.53 0 191792.06
2 153441.51 101145.55 407934.54 1 191050.39
3 144372.41 118671.85 383199.62 2 182901.99
4 142107.34 91391.77 366168.42 1 166187.94
5 131876.90 99814.71 362861.36 2 156991.12
6 134615.46 147198.87 127716.82 0 156122.51
7 130298.13 145530.06 323876.68 1 155752.60
8 120542.52 148718.95 311613.29 2 152211.77
9 123334.88 108679.17 304981.62 0 149759.96
10 101913.08 110594.11 229160.95 1 146121.95
11 100671.96 91790.61 249744.55 0 144259.40
12 93863.75 127320.38 249839.44 1 141585.52
13 91992.39 135495.07 252664.93 0 134307.35
14 119943.24 156547.42 256512.92 1 132602.65
15 114523.61 122616.84 261776.23 2 129917.04
16 78013.11 121597.55 264346.06 0 126992.93
17 94657.16 145077.58 282574.31 2 125370.37
18 91749.16 114175.79 294919.57 1 124266.90
19 86419.70 153514.11 0.00 2 122776.86
20 76253.86 113867.30 298664.47 0 118474.03
21 78389.47 153773.43 299737.29 2 111313.02
22 73994.56 122782.75 303319.26 1 110352.25
23 67532.53 105751.03 304768.73 1 108733.99
24 77044.01 99281.34 140574.81 2 108552.04
25 64664.71 139553.16 137962.62 0 107404.34
26 75328.87 144135.98 134050.07 1 105733.54
27 72107.60 127864.55 353183.81 2 105008.31
28 66051.52 182645.56 118148.20 1 103282.38
29 65605.48 153032.06 107138.38 2 101004.64
30 61994.48 115641.28 91131.24 1 99937.59
31 61136.38 152701.92 88218.23 2 97483.56
32 63408.86 129219.61 46085.25 0 97427.84
33 55493.95 103057.49 214634.81 1 96778.92
34 46426.07 157693.92 210797.67 0 96712.80
35 46014.02 85047.44 205517.64 2 96479.51
36 28663.76 127056.21 201126.82 1 90708.19
37 44069.95 51283.14 197029.42 0 89949.14
38 20229.59 65947.93 185265.10 2 81229.06
39 38558.51 82982.09 174999.30 0 81005.76
40 28754.33 118546.05 172795.67 0 78239.91
41 27892.92 84710.77 164470.71 1 77798.83
42 23640.93 96189.63 148001.11 0 71498.49
43 15505.73 127382.30 35534.17 2 69758.98
44 22177.74 154806.14 28334.72 0 65200.33
45 1000.23 124153.04 1903.93 2 64926.08
46 1315.46 115816.21 297114.46 1 49490.75
47 0.00 135426.92 0.00 0 42559.73
48 542.05 51743.15 0.00 2 35673.41
49 0.00 116983.80 45173.06 0 14681.40
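The Steps above also mention One-Hot Encoding, which avoids imposing an artificial order (0 < 1 < 2) on the states. A minimal sketch using pandas' get_dummies on the raw data:
In [ ]:
# One-hot encode the State column; each state becomes its own 0/1 indicator column
data_OHE = pd.get_dummies(data2, columns=['State'])
data_OHE.head()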
Splitting into training and testing sets
In [26]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data_LE_df, data_LE_df['Profit'], test_size=0.2)
In [27]:
X_train.head()
Out[27]:
R&D Spend Administration Marketing Spend State Profit
25 64664.71 139553.16 137962.62 0 107404.34
0 165349.20 136897.80 471784.10 2 192261.83
10 101913.08 110594.11 229160.95 1 146121.95
14 119943.24 156547.42 256512.92 1 132602.65
35 46014.02 85047.44 205517.64 2 96479.51
In [28]:
y_train.head()
Out[28]:
25 107404.34
0 192261.83
10 146121.95
14 132602.65
35 96479.51
Name: Profit, dtype: float64
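Note that the split above keeps the Profit column inside X_train as well as in y_train, so the target leaks into the features. A cleaner sketch on the same DataFrame drops the target from the feature matrix and fixes the random seed for reproducibility:
In [ ]:
# Separate features from the target before splitting
X = data_LE_df.drop(columns=['Profit'])
y = data_LE_df['Profit']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)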
Feature Scaling
In [29]:
from sklearn.preprocessing import StandardScaler
standard_X = StandardScaler()
In [30]:
X_train = standard_X.fit_transform(X_train)
X_test = standard_X.transform(X_test)  # fit only on training data to avoid leaking test-set statistics
In [31]:
pd.DataFrame(X_train) #SCALED
Out[31]:
0 1 2 3 4
0 -0.147778 0.768777 -0.732925 -1.248168 -0.078585
1 2.099133 0.672035 2.246595 1.187282 2.114855
2 0.683470 -0.286287 0.081064 -0.030443 0.922208
3 1.085838 1.387929 0.325194 -0.030443 0.572754
4 -0.563993 -1.217028 -0.129964 1.187282 -0.360975
5 -0.949166 0.003426 -0.422023 -1.248168 -0.832442
6 -1.590858 -0.053492 -1.561117 -1.248168 -2.475335
7 0.158509 1.286864 0.710993 1.187282 0.022449
8 -0.730373 -1.292275 -0.402355 -1.248168 -0.760949
9 0.521545 0.970048 0.557805 1.187282 0.385810
10 1.316921 0.986533 0.926449 -0.030443 1.171146
11 -0.607378 -2.447162 -0.205725 -1.248168 -0.529776
12 -0.352435 -0.560869 -0.048589 -0.030443 -0.353236
13 0.964891 0.151737 0.372172 1.187282 0.503335
14 -1.244827 0.325357 -1.647149 1.187282 -1.051661
15 -1.578762 -2.430403 -1.964309 1.187282 -1.932723
16 -1.139408 -1.912880 -0.310728 1.187282 -0.755177
17 -1.561502 -0.096031 0.687583 -0.030443 -1.575565
18 -0.968390 -1.229294 -0.496328 -0.030443 -0.843843
19 1.631008 0.008009 1.455935 1.187282 1.872917
20 0.018321 0.342926 1.188029 1.187282 -0.140519
21 -1.095932 1.324489 -1.711408 -1.248168 -1.169496
22 -0.083778 -0.462735 0.755901 -0.030443 -0.044215
23 0.090208 0.935743 -0.767847 -0.030443 -0.121772
24 0.150110 0.114601 0.395109 -1.248168 0.427751
25 0.462077 0.620929 0.290849 -1.248168 0.616818
26 1.099212 1.102714 0.816992 1.187282 1.079621
27 1.413268 1.047333 -0.824374 -1.248168 1.180707
28 1.580460 -0.985886 1.303923 -0.030443 1.440884
29 1.352154 -0.679013 1.274406 1.187282 1.203160
30 -0.951188 0.313476 -0.169154 -0.030443 -0.510155
31 0.655773 -0.971355 0.264783 -1.248168 0.874064
32 -0.554798 1.429699 -0.082837 -1.248168 -0.354945
33 -1.063279 -0.811085 -0.643327 -1.248168 -1.006698
34 -1.568537 0.207705 -1.947316 1.187282 -1.176585
35 0.503839 0.323101 0.265630 -0.030443 0.804948
36 0.060431 0.157781 0.742964 -0.030443 -0.002386
37 0.110850 -0.167035 0.701417 -1.248168 0.207550
38 0.456649 -0.155796 0.667992 -0.030443 0.357287
39 0.337715 1.277416 -1.964309 1.187282 0.318772
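Standardisation is one scaling choice; normalization to the [0, 1] range is another. A minimal sketch using scikit-learn's MinMaxScaler, assuming it is applied to the split from In [26] before standardisation (again fitting only on the training data):
In [ ]:
from sklearn.preprocessing import MinMaxScaler

minmax_X = MinMaxScaler()
X_train_norm = minmax_X.fit_transform(X_train)  # learn min/max from training data only
X_test_norm = minmax_X.transform(X_test)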
Result
The data set was pre-processed by checking for missing values, label-encoding the categorical State column, splitting the data into training and testing sets, and applying feature scaling (standardisation).
Conclusion
Real-world data are generally incomplete (lacking attribute values, lacking certain attributes of interest, or containing only aggregate data), noisy (containing errors or outliers), and inconsistent (containing discrepancies in codes or names). Hence, it is essential to pre-process data so that machine learning algorithms can be applied without any hindrance.