PHUONG NGUYEN
THE DATA SCIENCE PROCESS
How to Embed Machine Learning into Business
CONTENT
1. A SIMPLE EXAMPLE IN PYTHON
2. STANDARD DATA SCIENCE PROCESSES
3. MACHINE LEARNING CANVAS
DATA SCIENCE PROCESS
https://github.com/nnbphuong/datascience4biz/blob/master/Overview_of_the_Data_Science_Process.ipynb
THE DATA SCIENCE PROCESS
1. DETERMINE THE PURPOSE
2. OBTAIN THE DATA
import pandas as pd
# Load data
housing_df = pd.read_csv('WestRoxbury.csv')
housing_df.shape #find dimension of data frame
housing_df.head() #show the 1st five rows
print(housing_df) #show all the data
# Rename columns: replace spaces with '_'
housing_df = housing_df.rename(
    columns={'TOTAL VALUE ': 'TOTAL_VALUE'})  # explicit
housing_df.columns = [s.strip().replace(' ', '_')
                      for s in housing_df.columns]  # all columns
# Show first four rows of the data
housing_df.loc[0:3] # loc[a:b] gives rows a to b, inclusive
housing_df.iloc[0:4] # iloc[a:b] gives rows a to b-1
# Different ways of showing the first 10
# values in column TOTAL_VALUE
housing_df['TOTAL_VALUE'].iloc[0:10]
housing_df.iloc[0:10]['TOTAL_VALUE']
housing_df.iloc[0:10].TOTAL_VALUE
# use dot notation if the column name has no spaces
# Show the fifth row of the first 10 columns
housing_df.iloc[4][0:10]
housing_df.iloc[4, 0:10]
housing_df.iloc[4:5, 0:10]
# use a slice to return a data frame
# Use pd.concat to combine non-consecutive columns into a
# new data frame. Axis argument specifies dimension along
# which concatenation happens, 0=rows, 1=columns.
pd.concat([housing_df.iloc[4:6,0:2],
housing_df.iloc[4:6,4:6]], axis=1)
# To specify a full column, use:
housing_df.iloc[:, 0:1]
housing_df.TOTAL_VALUE
# show the first 10 rows of the first column
housing_df['TOTAL_VALUE'][0:10]
# Descriptive statistics
# show length of first column
print('Number of rows ', len(housing_df['TOTAL_VALUE']))
# show mean of column
print('Mean of TOTAL_VALUE ',
housing_df['TOTAL_VALUE'].mean())
# show summary statistics for each column
housing_df.describe()
# random sample of 5 observations
housing_df.sample(5)
# oversample houses with over 10 rooms
weights = [0.9 if rooms > 10 else 0.01
for rooms in housing_df.ROOMS]
housing_df.sample(5, weights=weights)
3. EXPLORE, CLEAN, AND PRE-PROCESS THE DATA
housing_df.columns # print a list of variables
Index(['TOTAL_VALUE', 'TAX', 'LOT_SQFT',
'YR_BUILT', 'GROSS_AREA','LIVING_AREA',
'FLOORS', 'ROOMS', 'BEDROOMS', 'FULL_BATH',
'HALF_BATH','KITCHEN', 'FIREPLACE',
'REMODEL'], dtype='object')
HANDLING VARIABLES
# REMODEL needs to be converted to a categorical variable
housing_df.REMODEL = housing_df.REMODEL.astype('category')
housing_df.REMODEL.cat.categories # Show the categories
housing_df.REMODEL.dtype # Check type of converted variable
# use drop_first=True to drop the first dummy variable
housing_df = pd.get_dummies(housing_df,
prefix_sep='_', drop_first=True)
housing_df.columns
housing_df.loc[:,'REMODEL_Old':'REMODEL_Recent'].head(5)
['None', 'Old', 'Recent']

   REMODEL_Old  REMODEL_Recent
0            0               0
1            0               1
2            0               0
3            0               0
4            0               0
DETECTING OUTLIERS
housing_df.plot.scatter(x='ROOMS', y='FLOORS', legend=False)
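Beyond visual inspection with a scatter plot, a simple rule-based check can flag candidate outliers. A minimal sketch using the common 1.5×IQR fence on toy data (`iqr_outliers` is a hypothetical helper, not from the notebook; on the housing data you would pass `housing_df['ROOMS']`):

```python
import pandas as pd

# Flag values outside the 1.5*IQR fences -- a common rule of
# thumb for univariate outliers, not the only possible choice.
def iqr_outliers(s: pd.Series) -> pd.Series:
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return s[(s < lower) | (s > upper)]

# Toy room counts: 25 rooms is far outside the fences.
rooms = pd.Series([5, 6, 6, 7, 7, 8, 8, 9, 25])
print(iqr_outliers(rooms).tolist())  # -> [25]
```

Flagged values still need domain judgment: a 25-room house may be a data-entry error or a genuine mansion.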
HANDLING MISSING DATA
# To illustrate missing data procedures,
# we first convert a few entries for bedrooms to NAs.
# Then we impute these missing values
# using the median of the remaining values.
import numpy as np

missingRows = housing_df.sample(10).index
housing_df.loc[missingRows, 'BEDROOMS'] = np.nan
print('Number of rows with valid BEDROOMS values after setting to NAN: ',
      housing_df['BEDROOMS'].count())
medianBedrooms = housing_df['BEDROOMS'].median()
housing_df.BEDROOMS = housing_df.BEDROOMS.fillna(value=medianBedrooms)
print('Number of rows with valid BEDROOMS values after filling NA values: ',
      housing_df['BEDROOMS'].count())
NORMALIZING/RESCALING DATA
# Normalizing (z-score) a data frame
norm_df = (housing_df - housing_df.mean()) / housing_df.std()
# Rescaling (min-max) a data frame
norm_df = (housing_df - housing_df.min()) / (housing_df.max() - housing_df.min())
4. REDUCE THE DATA DIMENSION
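A common technique for this step is principal component analysis (PCA), which projects correlated numeric features onto a few uncorrelated components. A minimal sketch with scikit-learn on synthetic data (the synthetic features, the standardization step, and the choice of two components are illustrative assumptions, not part of the notebook):

```python
import numpy as np
from sklearn.decomposition import PCA

# Build 200 rows of three features where the first two are
# strongly correlated, so most variance lies in few directions.
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
X = np.column_stack([x1,
                     2 * x1 + rng.normal(scale=0.1, size=200),
                     rng.normal(size=200)])

# Standardize first: PCA is scale-sensitive, and housing columns
# such as TAX and ROOMS live on very different scales.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

pca = PCA(n_components=2)
scores = pca.fit_transform(X_std)
print(scores.shape)                          # (200, 2)
print(pca.explained_variance_ratio_.sum())   # most variance retained
```

In practice you would fit PCA on the training partition only and choose the number of components from the explained-variance ratios.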
5. DETERMINE THE DATA SCIENCE TASK
6. PARTITION THE DATA
from sklearn.model_selection import train_test_split

# set random_state for reproducibility
# training (60%) and validation (40%)
trainData, validData = train_test_split(housing_df,
                                        test_size=0.40, random_state=1)
# produces Training: 3481 Validation: 2321
# training (50%), validation (30%), and test (20%)
trainData, temp = train_test_split(housing_df, test_size=0.5, random_state=1)
# now split temp into validation and test
validData, testData = train_test_split(temp, test_size=0.4, random_state=1)
# produces Training: 2901 Validation: 1741 Test: 1160
7. CHOOSE THE TECHNIQUES
8. PERFORM THE TASK
from sklearn.linear_model import LinearRegression
# create list of predictors and outcome
excludeColumns = ('TOTAL_VALUE', 'TAX')
predictors = [s for s in housing_df.columns if s
not in excludeColumns]
outcome = 'TOTAL_VALUE'
# partition data
X = housing_df[predictors]
y = housing_df[outcome]
train_X, valid_X, train_y, valid_y = train_test_split(
    X, y, test_size=0.4, random_state=1)
model = LinearRegression()
model.fit(train_X, train_y)
train_pred = model.predict(train_X)
train_results = pd.DataFrame({
'TOTAL_VALUE': train_y,
'predicted': train_pred,
'residual': train_y - train_pred
})
# show sample of predictions
train_results.head()
TOTAL_VALUE predicted residual
2024 392.0 387.726258 4.273742
5140 476.3 430.785540 45.514460
5259 367.4 384.042952 -16.642952
421 350.3 369.005551 -18.705551
1401 348.1 314.725722 33.374278
valid_pred = model.predict(valid_X)
valid_results = pd.DataFrame({
'TOTAL_VALUE': valid_y,
'predicted': valid_pred,
'residual': valid_y - valid_pred
})
valid_results.head()
TOTAL_VALUE predicted residual
1822 462.0 406.946377 55.053623
1998 370.4 362.888928 7.511072
5126 407.4 390.287208 17.112792
808 316.1 382.470203 -66.370203
4034 393.2 434.334998 -41.134998
# import the utility function regressionSummary
from dmba import regressionSummary
# training set
regressionSummary(train_results.TOTAL_VALUE,
train_results.predicted)
# validation set
regressionSummary(valid_results.TOTAL_VALUE,
valid_results.predicted)
OUTPUT
Regression statistics (training)
Mean Error (ME) : -0.0000
Root Mean Squared Error (RMSE) : 43.0306
Mean Absolute Error (MAE) : 32.6042
Mean Percentage Error (MPE) : -1.1116
Mean Absolute Percentage Error (MAPE) : 8.4886
Regression statistics (validation)
Mean Error (ME) : -0.1463
Root Mean Squared Error (RMSE) : 42.7292
Mean Absolute Error (MAE) : 31.9663
Mean Percentage Error (MPE) : -1.0884
Mean Absolute Percentage Error (MAPE) : 8.3283
9. ASSESS AND INTERPRET THE RESULTS
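The metrics that `regressionSummary` prints (ME, RMSE, MAE, MPE, MAPE) can be cross-checked with scikit-learn and NumPy. A minimal sketch on toy actual/predicted values, not the West Roxbury results:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Toy actual and predicted TOTAL_VALUE figures.
y_true = np.array([392.0, 476.3, 367.4])
y_pred = np.array([387.7, 430.8, 384.0])

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# MAE ~ 22.13, RMSE ~ 28.07, MAPE ~ 5.06 (in %)
print(round(mae, 2), round(rmse, 2), round(mape, 2))
```

Comparing these metrics between training and validation partitions, as the output above does, is the standard check for overfitting: similar values suggest the model generalizes.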
10. DEPLOY THE BEST MODEL
new_data = pd.DataFrame({
'LOT_SQFT': [4200, 6444, 5035],
'YR_BUILT': [1960, 1940, 1925],
'GROSS_AREA': [2670, 2886, 3264],
'LIVING_AREA': [1710, 1474, 1523],
'FLOORS': [2.0, 1.5, 1.9],
'ROOMS': [10, 6, 6],
'BEDROOMS': [4, 3, 2],
'FULL_BATH': [1, 1, 1],
'HALF_BATH': [1, 1, 0],
'KITCHEN': [1, 1, 1],
'FIREPLACE': [1, 1, 0],
'REMODEL_Old': [0, 0, 0],
'REMODEL_Recent': [0, 0, 1],
})
print('Predictions: ', model.predict(new_data))
> Predictions: [384.47210285 378.06696706 386.01773842]
THE STANDARD DATA SCIENCE PROCESS
THE STANDARD DATA SCIENCE PROCESS: CRISP-DM
THE STANDARD DATA SCIENCE PROCESS: SEMMA
THE STANDARD DATA SCIENCE PROCESS: KDD
THE MACHINE LEARNING CANVAS
THE MACHINE LEARNING CANVAS: GOAL
THE MACHINE LEARNING CANVAS: LEARN
THE MACHINE LEARNING CANVAS: PREDICT
THE MACHINE LEARNING CANVAS: EVALUATE