ML Module 2
Data Cleaning (Missing values, Outliers)
Exploratory Data Analysis (Descriptive Statistics, Visualization)
Feature Engineering (Data Transformation (Encoding, Skew, Scale), Feature Selection)
“Data is the fuel for ML algorithms”
Case Study: a classification model for diagnosing breast cancer in women.
A sample of 1,000 women was studied in a given population: 100 of them had breast cancer, while the remaining 900 did not. The dataset was split 70/30 into train/test sets.
The accuracy was 90%, which looked excellent.
A couple of months after deployment, some of the women whom the model had diagnosed as having “no breast cancer” started showing symptoms of breast cancer.
Confusion matrix on the 300-sample test set (predicted vs. actual; here H0 is “has breast cancer”):

Predicted “has disease” (accept H0):  TP = 0,  FP = 0 (a false positive: X might feel she will die soon);  row total = 0
Predicted “no disease” (reject H0):   FN = 30 (a false negative: X thinks she is healthy while suffering from the disease),  TN = 270;  row total = 300
Column totals: 30 actually have breast cancer, 270 do not; 300 overall.

The model has conveniently classified all the test data as “no breast cancer”.
Accuracy = (TP + TN) / (TP + TN + FP + FN) = 90%
Precision (predicting disease correctly) = TP / (TP + FP) = 0%
Recall = TP / (TP + FN) = 0%
Isn’t it better to think you have breast cancer and not have it than to think you don’t have breast cancer when you’ve got it?
https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
https://towardsdatascience.com/fraud-detection-with-cost-sensitive-machine-learning-24b8760d35d9
https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/
Observed accuracy = (TP + TN) / (TP + TN + FP + FN) = (10 + 8) / (10 + 7 + 5 + 8) = 0.6
Expected accuracy = [(TP + FN)(TP + FP) / N + (FP + TN)(FN + TN) / N] / N, where N = TP + TN + FP + FN
= [(15 · 17) / 30 + (15 · 13) / 30] / 30 = (8.5 + 6.5) / 30 = 0.5
Kappa = (observed accuracy − expected accuracy) / (1 − expected accuracy) = (0.6 − 0.5) / (1 − 0.5) = 0.20
These counts come from the following cats/dogs confusion matrix (model classification vs. actual class):

                  Actual cats   Actual dogs   Total
Predicted cats    10            7             17
Predicted dogs    5             8             13
Total             15            15            30

Precision = TP / (TP + FP)     Recall = TP / (TP + FN)

TASK: repeat the computation for the matrix with predicted cats = (60, 125) and predicted dogs = (5, 5000); its Kappa works out to about 0.47.
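A minimal sketch cross-checking these numbers with scikit-learn; the label vectors are reconstructed from the confusion-matrix counts (10 TP, 7 FP, 5 FN, 8 TN), which is the only assumption here.

from sklearn.metrics import cohen_kappa_score, precision_score, recall_score

y_true = [1] * 10 + [0] * 7 + [1] * 5 + [0] * 8   # 1 = cat, 0 = dog
y_pred = [1] * 10 + [1] * 7 + [0] * 5 + [0] * 8   # model's classifications
print(precision_score(y_true, y_pred))    # 10/17 ≈ 0.588
print(recall_score(y_true, y_pred))       # 10/15 ≈ 0.667
print(cohen_kappa_score(y_true, y_pred))  # 0.20, matching the hand computation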
https://towardsdatascience.com/the-best-classification-metric-youve-never-heard-of-the-matthews-correlation-coefficient-3bf50a2f3e9a
TNR = 1 − FPR (the true negative rate is the complement of the false positive rate)
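As a sketch of the MCC from the link above, applied to the breast cancer example: the label vectors below are reconstructed from its confusion matrix (all 300 test samples predicted negative), so they are illustrative only.

from sklearn.metrics import matthews_corrcoef

y_true = [1] * 30 + [0] * 270   # 30 with the disease, 270 without
y_pred = [0] * 300              # model predicts "no breast cancer" for everyone
print(matthews_corrcoef(y_true, y_pred))  # 0.0: despite 90% accuracy, no skill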
“No one size fits all”
https://machinelearningmastery.com/handle-missing-data-python/
Simple Imputer: https://machinelearningmastery.com/statistical-imputation-for-missing-values-in-machine-learning/
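A minimal sketch of statistical imputation with scikit-learn's SimpleImputer; the toy array is hypothetical.

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])
imputer = SimpleImputer(strategy='mean')  # alternatives: 'median', 'most_frequent', 'constant'
print(imputer.fit_transform(X))           # NaNs replaced by each column's mean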
https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/
Pearson and ANOVA (parametric)
Spearman and Kendall’s rank (non-parametric)
Chi-squared (χ²) test, Mutual Information
I(X; Y) = H(X) − H(X | Y)
χ² = Σ (O − E)² / E
F = MST / MSE, where MST = SST / (p − 1) and MSE = SSE / (N − p)
SSE = Σ (n − 1)s²
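A sketch of these filter tests driven through scikit-learn's SelectKBest; the iris dataset and k=2 are placeholders.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, f_classif, mutual_info_classif

X, y = load_iris(return_X_y=True)
for score_func in (f_classif, chi2, mutual_info_classif):  # ANOVA F, chi-squared, MI
    X_new = SelectKBest(score_func=score_func, k=2).fit_transform(X, y)
    print(score_func.__name__, X_new.shape)  # the two best-scoring features are kept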
REVERSE CORRELATION
X   Y   x − x̄   y − ȳ   (x − x̄)²   (y − ȳ)²   (x − x̄)(y − ȳ)
3   6    1        2        1           4           2
2   3    0       −1        0           1           0
2   5    0        1        0           1           0
1   2   −1       −2        1           4           2

Means: x̄ = 2, ȳ = 4.  Column sums: Σ(x − x̄)² = 2, Σ(y − ȳ)² = 10, Σ(x − x̄)(y − ȳ) = 4.
r = Σ(x − x̄)(y − ȳ) / √(Σ(x − x̄)² · Σ(y − ȳ)²) = 4 / √(2 · 10) = 4 / √20 = 0.8944 > 0, a strong positive correlation.
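Verifying the hand computation with scipy (the x and y values are taken from the table above):

from scipy.stats import pearsonr

x = [3, 2, 2, 1]
y = [6, 3, 5, 2]
r, p = pearsonr(x, y)
print(r)  # ≈ 0.8944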
One-way ANOVA example. Independent variable: type of domestic animal.

Group      # of animals   Average   S.D.   S.D.²
Dog        5              12        2      4
Cat        5              16        1      1
Hamster    5              20        4      16

Assumptions: the different groups must have equal sample sizes; there is no relationship between subjects in each sample; the test is used for more than two levels of an independent variable.

p = 3 (number of groups), n = 5 (observations per group), N = 15 (total number of observations)
SST = 5 · [(12 − 16)² + (16 − 16)² + (20 − 16)²] = 160
MST = SST / (p − 1) = 160 / (3 − 1) = 80
SSE = (4 + 1 + 16) · (n − 1) = 84
MSE = SSE / (N − p) = 84 / (15 − 3) = 7
F = MST / MSE = 80 / 7 = 11.429
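Cross-checking with scipy: the raw samples below are hypothetical, constructed only to match the group means (12, 16, 20) and variances (4, 1, 16) above, so f_oneway reproduces F = 80/7.

from scipy.stats import f_oneway

dog     = [10, 10, 12, 14, 14]   # mean 12, variance 4
cat     = [15, 15, 16, 17, 17]   # mean 16, variance 1
hamster = [16, 16, 20, 24, 24]   # mean 20, variance 16
F, p = f_oneway(dog, cat, hamster)
print(F, p)  # F ≈ 11.429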
τ = (concordant − discordant) / (n(n − 1)/2) = (15 − 6) / 21 = 0.4286
Interpretation: the degree of agreement between two experts’ rankings.
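A sketch with scipy's kendalltau; the two expert rankings are hypothetical, chosen so that 7 items give 15 concordant and 6 discordant pairs, reproducing τ = 9/21.

from scipy.stats import kendalltau

expert_a = [1, 2, 3, 4, 5, 6, 7]
expert_b = [4, 3, 2, 1, 5, 6, 7]
tau, p = kendalltau(expert_a, expert_b)
print(tau)  # ≈ 0.4286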
Observed counts:
         Cat    Dog    Total
Men      207    282    489
Women    231    242    473
Total    438    524    962

Expected counts (row total · column total / grand total):
         Cat                          Dog                          Total
Men      489 · 438 / 962 = 222.64     489 · 524 / 962 = 266.36     489
Women    473 · 438 / 962 = 215.36     473 · 524 / 962 = 257.64     473
Total    438                          524                          962

(O − E)² / E:
         Cat                                 Dog
Men      (207 − 222.64)² / 222.64 = 1.099    (282 − 266.36)² / 266.36 = 0.918
Women    (231 − 215.36)² / 215.36 = 1.136    (242 − 257.64)² / 257.64 = 0.949

χ² = 1.099 + 0.918 + 1.136 + 0.949 = 4.102
Degrees of freedom = (rows − 1) · (columns − 1) = (2 − 1) · (2 − 1) = 1
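Verifying with scipy (counts from the observed table above; correction=False disables the Yates continuity correction so the result matches the hand computation):

import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[207, 282],
                     [231, 242]])
chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, dof)  # ≈ 4.10, dof = 1
print(expected)   # matches the expected-count table above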
https://machinelearningmastery.com/calculate-feature-importance-with-python/
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Sequential forward selection: start empty and add one feature at a time
sfs = SFS(LinearRegression(), k_features=11, forward=True, floating=False, scoring='r2', cv=0)

# Sequential backward selection: start with all features and drop one at a time
sbs = SFS(LinearRegression(), k_features=11, forward=False, floating=False, cv=0)
sbs.fit(X, y)
sbs.k_feature_names_  # names of the selected features

# Recursive feature elimination with a decision tree as the estimator
rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=5)
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression, ElasticNet
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train.fillna(0))
# L1 logistic regression zeroes out weak coefficients; 'liblinear' supports the l1 penalty
sel_ = SelectFromModel(LogisticRegression(C=1, penalty='l1', solver='liblinear'))
sel_.fit(scaler.transform(X_train.fillna(0)), y_train)

regr = ElasticNet(random_state=0)  # Elastic Net combines the l1 and l2 penalties
https://machinelearningmastery.com/standardscaler-and-minmaxscaler-transforms-in-python/
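A minimal sketch of the two scalers covered at the link above; the toy column is hypothetical.

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0], [5.0], [10.0]])
print(StandardScaler().fit_transform(X).ravel())  # zero mean, unit variance
print(MinMaxScaler().fit_transform(X).ravel())    # rescaled to the range [0, 1]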
https://machinelearningmastery.com/one-hot-encoding-for-categorical-data/
df_dummies = pd.get_dummies(df, columns=['sex'])
https://www.marsja.se/how-to-use-pandas-get_dummies-to-create-dummy-variables-in-python/
Assumptions made by models:
1. Linear relationship between predictors and the target variable
2. No noise, i.e. there are no outliers in the data
3. No collinearity
4. Normal distribution of the predictors and the target variable
5. Scaled features, if it is a distance-based algorithm
Solutions (a sketch follows this list):
1. Log transform (log(x))
2. Square root (a special case of the power transform)
3. Power transform, e.g. Box-Cox (stabilizes variance)
Reverse the transformation when making predictions.
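A sketch of the three transforms, assuming a right-skewed, strictly positive feature (Box-Cox requires positive values); the data is synthetic.

import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
X = rng.exponential(scale=2.0, size=(100, 1)) + 1e-3  # right-skewed, positive

log_X = np.log(X)    # 1. log transform
sqrt_X = np.sqrt(X)  # 2. square root
pt = PowerTransformer(method='box-cox')
bc_X = pt.fit_transform(X)           # 3. Box-Cox (also standardizes by default)
X_back = pt.inverse_transform(bc_X)  # reverse the transform when predicting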
https://towardsdatascience.com/data-visualization-for-machine-learning-and-data-science-a45178970be7
https://towardsdatascience.com/the-art-of-effective-visualization-of-multi-dimensional-data-6c7202990c57
• Displays information as a series of data points connected by straight line segments
• Used to visualize the directional movement of one or more series over time, i.e. time-series data
• The X axis is usually datetime and the Y axis contains the measured quantity, e.g. monthly sales
• Examples: simple, multiple, time-series line plots
Source: https://www.machinelearningplus.com/plots/matplotlib-line-plot/
• Shows categorical data as rectangular bars, with the height of each bar proportional to the value it represents
• For example, data on the height of persons grouped as ‘Tall’, ‘Medium’, ‘Short’, etc.
• Used to compare values across different categories in the data
• Categorical data is simply a grouping of the data into different logical groups
• Types include: simple, horizontal, grouped, and stacked
https://www.machinelearningplus.com/plots/bar-plot-in-python/
• Visualizes the frequency distribution of a numeric array by splitting it into small equal-sized bins
• A histogram is drawn on large arrays: it computes the frequency distribution of the array and plots it
• Types include: basic, grouped, density curve, facets
https://www.machinelearningplus.com/plots/matplotlib-histogram-python-examples/
https://towardsdatascience.com/ways-to-detect-and-remove-the-outliers-404d16608dba
To obtain the Winsorized mean, you sort the data and replace the smallest k values by the (k+1)st smallest value. You do the same for the largest values, replacing the k largest values with the (k+1)st largest value.
Isolation Forest: a normal point (left in the figure) requires more partitions to be isolated than an abnormal point (right).
https://towardsdatascience.com/outlier-detection-with-isolation-forest-3d190448d45e
• Visualizes how a given variable is distributed, using quartiles
• Shows the minimum, maximum, median, first quartile, and third quartile of the data set
• A method to graphically show the spread of a numerical variable through quartiles
• The middle 50% of all data points: IQR = Q3 − Q1
• The upper and lower whiskers mark 1.5 times the IQR from the top and bottom of the box
• Points that lie outside the whiskers, i.e. beyond 1.5 × IQR in either direction (< Q1 − 1.5·IQR or > Q3 + 1.5·IQR), are generally considered outliers (a sketch of this rule follows below)
• Types include: basic, notched, violin plot
https://www.khanacademy.org/math/statistics-probability/summarizing-quantitative-data/box-whisker-plots/a/box-plot-review
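A sketch of the 1.5 × IQR rule from the list above; x is a hypothetical sample.

import numpy as np

x = np.array([2, 3, 4, 5, 5, 6, 7, 8, 30])
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(x[(x < lower) | (x > upper)])  # flags 30 as an outlier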
TASK
• The values of two variables are plotted along two axes
• Used to visualize the relationship between two variables
• Types include: basic, correlation, linear fit plot, bubble plot
https://www.machinelearningplus.com/plots/python-scatter-plot/
• Correlation between variables indicates how the variables are inter-related
• Correlation is not causation
1. Each cell in the grid represents the value of the correlation coefficient between two variables.
2. It is a square, symmetric matrix.
3. All diagonal elements are 1.
4. The axis ticks denote the feature each row and column represents.
5. A large positive value (near 1.0) indicates a strong positive correlation.
6. A large negative value (near −1.0) indicates a strong negative correlation.
7. A value near 0 (positive or negative) indicates the absence of any linear correlation between the two variables (note that zero correlation does not by itself guarantee full independence).
8. Each cell in the matrix is also represented by a shade of a color: darker shades indicate smaller values, while brighter shades correspond to larger values (near 1).
9. This scale is given with the help of a color bar beside the plot.
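A sketch of such a correlation heatmap with seaborn; the DataFrame is random placeholder data.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 4)), columns=['A', 'B', 'C', 'D'])
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', vmin=-1, vmax=1)  # square, symmetric, 1s on the diagonal
plt.show()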
• E.g. a person’s height and weight, age and sale price of a car, or years of education and annual income
• Doesn’t affect decision trees
• kNN is affected
• Causes:
  • Insufficient data
  • Dummy variables
  • Including a variable in the regression that is actually a combination of two other variables
• Identify: correlation > 0.4, or a Variance Inflation Factor (VIF) score > 5, signals high correlation (a VIF sketch follows below)
• Solutions:
  • Feature selection
  • PCA
  • More data
  • Ridge regression, which reduces the magnitude of model coefficients
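A sketch of the VIF check using statsmodels; the DataFrame is synthetic, with column b deliberately built from column a so that their VIFs come out high.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
a = rng.normal(size=100)
df = pd.DataFrame({'a': a, 'b': 2 * a + rng.normal(size=100), 'c': rng.normal(size=100)})
X = sm.add_constant(df)  # VIF is computed against a model with an intercept
vif = {col: variance_inflation_factor(X.values, i) for i, col in enumerate(X.columns)}
print(vif)  # VIF > 5 flags high collinearity (a and b here)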
TASK matrix (predicted vs. actual):
                  Actual cats   Actual dogs
Predicted cats    60            125
Predicted dogs    5             5000
1. Explain the essential Python libraries: numpy, pandas, scipy, scikit-learn, statsmodels.
2. Find Accuracy, Precision, Recall, Kappa score, MCC, F1-score, and ROC AUC on the confusion matrix above.
3. How is a missing value represented? What are the types of missing values and the ways of dealing with them?
4. Discuss data transformation methods for categorical data and numerical data.
5. Explain the Python visualization tools matplotlib, pandas, seaborn, bokeh, and plotly.
6. Discuss imbalanced-data handling mechanisms and the problems that arise if imbalance is not handled.
7. How can you determine which features are most important in your model? Which feature selection algorithm should be used when? State with an example.
8. Discuss wrapper-based feature selection methods with an example diagram.
9. Describe the various categories of filter-based feature selection methods, based on the type of features, with their mathematical equations.
10. Compute the Karl Pearson and Spearman coefficients of correlation.
11. Find Kendall’s rank correlation coefficient tau.
12. Indicate the different types of transformations that data has to be subjected to before dimensionality reduction techniques can be applied.