
DATA ANALYTICS IN R:

A CASE STUDY BASED APPROACH

Dr. R.S. Kamath


Dr. S.S. Jamsandekar
K.G. Kharade
Dr. R.K. Kamat

ISO 9001:2015 CERTIFIED


© Authors
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any
means, electronic, mechanical, photocopying, recording and/or otherwise without the prior written permission of the
authors and the publisher.

First Edition : 2020

Published by : Mrs. Meena Pandey for Himalaya Publishing House Pvt. Ltd.,
“Ramdoot”, Dr. Bhalerao Marg, Girgaon, Mumbai - 400 004.
Phone: 022-23860170, 23863863; Fax: 022-23877178
E-mail: himpub@bharatmail.co.in ; Website: www.himpub.com
Branch Offices :
New Delhi : “Pooja Apartments”, 4-B, Murari Lal Street, Ansari Road, Darya Ganj, New Delhi - 110 002.
Phone: 011-23270392, 23278631; Fax: 011-23256286
Nagpur : Kundanlal Chandak Industrial Estate, Ghat Road, Nagpur - 440 018.
Phone: 0712-2721215, 3296733; Telefax: 0712-2721216
Bengaluru : Plot No. 91-33, 2nd Main Road, Seshadripuram, Behind Nataraja Theatre,
Bengaluru - 560 020. Phone: 080-41138821; Mobile: 09379847017, 09379847005
Hyderabad : No. 3-4-184, Lingampally, Besides Raghavendra Swamy Matham, Kachiguda,
Hyderabad - 500 027. Phone: 040-27560041, 27550139
Chennai : New No. 48/2, Old No. 28/2, Ground Floor, Sarangapani Street, T. Nagar,
Chennai - 600 017. Mobile: 09380460419
Pune : “Laksha” Apartment, First Floor, No. 527, Mehunpura,
Shaniwarpeth (Near Prabhat Theatre), Pune - 411 030.
Phone: 020-24496323, 24496333; Mobile: 09370579333
Lucknow : House No. 731, Shekhupura Colony, Near B.D. Convent School, Aliganj,
Lucknow - 226 022. Phone: 0522-4012353; Mobile: 09307501549
Ahmedabad : 114, “SHAIL”, 1st Floor, Opp. Madhu Sudan House, C.G. Road, Navrang Pura,
Ahmedabad - 380 009. Phone: 079-26560126; Mobile: 09377088847
Ernakulam : 39/176 (New No. 60/251), 1st Floor, Karikkamuri Road, Ernakulam,
Kochi - 682 011. Phone: 0484-2378012, 2378016; Mobile: 09387122121
Cuttack : New LIC Colony, Behind Kamala Mandap, Badambadi,
Cuttack - 753 012, Odisha. Mobile: 09338746007
Kolkata : 108/4, Beliaghata Main Road, Near ID Hospital, Opp. SBI Bank,
Kolkata - 700 010. Phone: 033-32449649; Mobile: 07439040301

DTP by : Nilima

Printed at : SAP Print Solutions Pvt. Ltd., Mumbai. On behalf of HPH.


PREFACE
In the era of ubiquitous computerization, an abundance of data is generated in day-to-day life. Today's world has therefore become data driven and customer led, and the brand value of a business is now shaped by its data. Extracting valid, novel, useful and understandable patterns from these data still poses challenges, creating a wide gap between data storage and knowledge discovery. Due to the interwoven nature of application domains, Data Analytics is emerging as a multi-disciplinary discipline requiring expertise from allied domains such as Statistics, Machine Learning, Databases, Information Retrieval and Visualization.
Case studies are an important tool for active learning and student participation. The present book features a collection of selected case studies illustrating Data Analytics in R. Routine descriptions of algorithms are avoided; familiarity with them is assumed as a prerequisite for the intended audience. The book is an excursion for its readers through areas including, but not limited to, data pre-processing, association rule mining, classification and clustering. Post-processing of the analytics results to derive useful patterns is exemplified. The scripting is done with the open source tool R, and the scripts are provided for learners to practice. Results are presented through systematic visualization techniques.
We have used publicly available data repositories on the web. One such platform, for example, is the UCI Machine Learning Repository, which provides datasets on various topics.

... Dr. R.S. Kamath, Dr. S.S. Jamsandekar, K.G. Kharade and Dr. R.K. Kamat
CONTENTS
1. Data Preprocessing in R 1 – 12
1.1 Need of Data Preprocessing
1.2 Handling Missing Data
1.2.1 Data Description and Exploration
1.2.2 Techniques of Handling Missing Data
1.2.3 R Program
1.3 Data Transformation and Normalization
1.3.1 R Program
1.4 Data Discretization
1.5 Summary
1.6 R Program
1.7 References

2. K-means Cluster Analysis for Indian Liver Patients: Unsupervised Learning Approach 13 – 17
2.1 Introduction
2.2 Exploring Data
2.3 K-means Cluster: Theoretical Approach
2.4 Indian Liver Patient Cluster Analysis
2.5 Results and Discussions
2.6 Summary
2.7 R Program for K-means Cluster Analysis for Indian Liver
Patients
2.8 References

3. Sales Prediction by Principal Component Analysis and Recursive Partitioning Regression Tree Method 18 – 28
3.1 Problem Description and Objectives
3.2 Data Description
3.3 Building PCA Model
3.4 Analysis and Visualization of Principal Components
3.5 Prediction Model using rpart with PC
3.6 Results
3.7 Summary
3.8 R Program
3.9 References

4. Analysis of Variable Importance Measures for Parkinson’s Data: Random Forest Approach 29 – 33
4.1 Problem Description
4.2 Data Description and Exploration
4.3 Random Forest Technique
4.4 Analysis of Variable Importance Measures
4.5 Results and Discussions
4.6 Summary
4.7 References

5. Classification Model for Thoracic Surgery Data: A Decision Tree Based Approach 34 – 38
5.1 Problem Description
5.2 Data Exploration
5.3 Theory on Decision Tree
5.4 Decision Tree Model for Thoracic Surgery Data
5.5 Summary
5.6 R Program for the Design of Decision Tree Model on
Thoracic Surgery Data
5.7 References

6. Fertility Data Analysis with MXNet in R: A Feedforward Neural Net Approach 39 – 43
6.1 Introduction
6.2 Data Description and Exploration
6.3 Feedforward Neural Net Theory
6.4 Fertility Data Analysis with MXNet
6.5 Summary
6.6 R Program for Fertility Data Analysis with MXNet using
Feedforward Neural Net
6.7 References
7. Analysis of Breast Tissue using KNN Classification: A Lazy
Learning Approach 44 – 48
7.1 Introduction
7.2 Data Set Explanation and Exploration
7.3 Lazy Learning Theory
7.4 KNN Classification for Breast Tissue Data
7.5 Summary
7.6 R Program for the Analysis of Breast Tissue Based on
Electrical Impedance Measurements using KNN
Classification
7.7 References

8. Frequent Item Set Generation and Correlational Analysis for Supermarket Transactional Data using Associative Rule Mining with Equivalence Class Transformation 49 – 53
8.1 Problem Description and Objectives
8.2 Data Description
8.3 Theory on Frequent Item Set Mining and Association
Mining
8.4 Correlational Analysis with ECLAT and Apriori Method
8.5 Visualization of Frequent Item Set and Association Rules
8.6 Summary
8.7 R Program
8.8 References

9. Forecasting Infant Mortality Rate in India: A Time Series Modelling Approach 54 – 57
9.1 Problem Description
9.2 The Available Data
9.3 Time Series Modelling
9.4 Forecasting Infant Mortality Rate
9.5 Summary
9.6 R Program for Forecasting Infant Mortality Rate in India
Using Time Series Modelling
9.7 References
10. Hierarchical Cluster Analysis for Immunotherapy Data:
An Unsupervised Approach 58 – 62
10.1 Introduction
10.2 Material and Methods
10.3 Computational Details, Results and Discussion
10.4 Summary
10.5 R Program for the Design of Hierarchical Clusters on
Immunotherapy Data
10.6 References

11. Predictive Model for Diabetic Retinopathy Debrecen Data: A Deep Learning Approach 63 – 69
11.1 Introduction
11.2 Material and Methods
11.3 Computational Details: Deep Neural Network Architecture
11.4 Results and Discussion
11.5 Summary
11.6 R Program for the Construction of DL Model
11.7 References
LIST OF FIGURES
1.1 Aggregation Plot Showing the Variables with Missing Values and the Missing Value Frequency 3
1.2 Automobile Data Set with Missing Values before Applying Central Imputation Method and After Imputation 4
1.3 Automobile Data Set before knn Imputation with Missing Values 5
1.4 Automobile Data Set After knn Imputation 5
1.5 Statistical Summary of iris Dataset 7
1.6 Summary Results of Scale Transformation on iris Dataset 7
1.7 Summary Results of Center Transformation on iris Dataset 8
1.8 Summary Results of Scale and Center Combination Transformation on iris Dataset 8
1.9 Summary Results of Range Normalization 8
1.10 Histogram Plot using Different Discretize Methods 10
2.1 Basic Statistical Computations Indian Liver Patients Dataset 13
2.2 R Code for Determining No. of Clusters by Plotting 14
2.3 Number of Clusters Against within Clusters Sum of Squares 14
2.4 Cluster Means for Four Clusters 15
2.5 Cluster Plot for 1st and 2nd Principle Components 16
2.6 Cluster Plot for 1st and 2nd Discriminant Functions 16
3.1 Pre-processed Supermarket Sale dataset Structure 20
3.2 Screeplot of PCA 20
3.3 Plot of Correlational Variables of PC 21
3.4 Contribution of Variables to PC1 22
3.5 Contribution of Variables to PC2 22
3.6 CP Plot 23
3.7 CP Table with Relative Error and Standard Deviation 23
4.1 Variable Importance Measures for Parkinson’s Data 30
4.2 Random Forest Model for the Classification of Parkinson’s Data 30
4.3 Relative Importance of Variables 31
4.4 Error Rate Progressively for the Number of Trees Built 32
4.5 ROC Curve based on Out-of-bag Predictions 32
4.6 Error Matrix for Random Forest on Validation Data 32
5.1 Decision Tree for Thoracic surgery data 35
5.2 Summary of the Decision Tree model for Classification 36
5.3 Graphical Representation of Cross Validated Error Summary 36
5.4 Partial Structure of Node and Split Details 36
5.5 Predicted Values for Test Data 36
6.1 Basic Statistical Analysis of Fertility Dataset 40
6.2 Model Accuracy During Training 41
7.1 Exploratory Analysis of Electrical Impedance Measurements 44
7.2 Trained KNN Model for Breast Tissue Classification 45
7.3 Accuracy versus K-value Plot 46
7.4 Results of Prediction on Test Data 46
8.1 Count of Top 10 Frequent Groceries item 50
8.2 Top 10 Rules using Apriori 50
8.3 Grouped Matrix of Top 10 Rules 51
8.4 Graph Based Plot of Association Rules 52
9.1 Time-series Plot for Infant Mortality Rate 55
9.2 Fitting Time-series Plot into Line 55
9.3 ACF plot 55
9.4 PACF plot 55
9.5 ARMA Prediction Model Output in R 56
9.6 Time-series Model Predicting Future 10 Values 56
10.1 Basic Statistical Computations on Immunotherapy Data 58
10.2 Dendrogram Plot for Hierarchical Clustering on Immunotherapy Data 59
10.3 Cluster Dendrogram on Immunotherapy Data – Four Clusters 60
10.4 Scatter Plot for Cluster Visualization 61
11.1 Deep Neural Network Architecture 64
11.2 Summary of DL Model for the Present Study 65
11.3 Performance of the DL Model for Four Hidden Layers with 15, 12, 9 and 6 Hidden Neurons 65
11.4 DL Model Accuracy for Training and Testing Data 65
11.5 DL Model loss for Training and Testing Data 66
11.6 Confusion Matrix for the Test Data 66
LIST OF TABLES
1.1 Automobile Dataset Attribute Information with Variable Mapping 2–3
1.2 Min – max Values after Transformation 9
1.3 Boundary values and item count in each bin for different discretize method 11
2.1 Performance Evaluation for Accuracy of Cluster Designs 15
3.1 Attribute Information of Supermarket Sales Dataset 18 – 19
3.2 Principal Components with Percentage of Variance 20 – 21
5.1 Variable Importance Measures of Thoracic Data 36
5.2 Actual and Predicted Values of Test Set 37
7.1 Electrical Impedance Measurements Dataset 44 – 45
7.2 Actual and Predicted values of Test Set 46
9.1 Infant Mortality Rate in India 54
10.1 Agglomeration Methods and their Coefficients 59
10.2 Details of Clusters 60
11.1 Diabetic Retinopathy Debrecen Dataset 63
11.2 Deep Neural Network properties of present study 64 – 65
Case Study 1

Data Preprocessing in R

1.1 Need of Data Preprocessing


Data in real cases usually have many impurities; in data mining or data science terminology, such data are called dirty data. There are several reasons for data being dirty: the data could be incomplete, noisy, inconsistent, and so on.
Data is said to be incomplete when its attributes have missing values (usually blank), when attributes required for further analysis or of some interest are missing, or when only summarized or aggregate data is available without the details, e.g., Gender = " ".
Noisy data contains erroneous values or values outside the scope and domain range, referred to as outliers, e.g., Age = "-10".
Data is inconsistent if it contains discrepancies among the attributes of an entity's records, e.g., Age = "45" with Birthday = "19/08/1998".
Inconsistency can also arise between duplicate records: if one source stores an ordinal attribute with rating style "1, 2, 3" while another source of the same data uses rating style "A, B, C", a discrepancy is raised between the duplicates.
If dirty data is used for further analysis, it leads to dirty results (Garbage In, Garbage Out). It is therefore important to preprocess data to obtain quality results. Some data mining and data analytics methods support only particular attribute data types or are sensitive to scale, so the data needs to be transformed before the method is applied.
At times the whole dataset is too large to be handled by a particular algorithm, so either the dimensions (i.e., attribute columns) or the number of records need to be reduced.
Data preprocessing includes
● Data cleaning
● Data integration
● Data transformation
● Data dimension and numerosity reduction.
● Data discretisation (numerical data).

1.2 Handling Missing Data


Data Cleaning in R: Dealing with Missing/Unknown Values.
Missing variable values are a frequent problem in real world data sets.

Some Possible Strategies are


● Delete or remove all rows in the data set with an unknown value.
● Replace or fill in the unknowns with the most common value (a statistic of centrality, such as the mean).
● Fill in with the most common value among the cases that are most "similar" to the one with unknowns.
● Explore eventual correlations between variables, etc.

1.2.1 Data Description and Exploration


The data used in this case study for handling missing data is the Automobile Data Set retrieved from the UCI repository. The dataset contains 206 instances and 26 attributes, with missing values in some instances.
The data set consists of three types of entities:
(a) the automobile specification in terms of various characteristic features,
(b) its assigned insurance risk rating,
(c) its normalized value for average loss of payment per insured vehicle year in comparison to other cars.
The second entity is the risk rating, which corresponds to the degree to which the auto is more risky than its price indicates. A "symboling" process is used to associate a risk with each car, initially based on its price, which is then adjusted by moving the scale up (more risky) or down (less risky): +3 indicates a very risky car, while -3 indicates negligible risk, i.e., a pretty safe one.
The third entity represents the relative average loss per insured car per year. This value is normalized over all autos within a particular size classification (two-door small, station wagons, sports/speciality, etc.).
The 26 attributes of the automobile dataset are renamed v1, v2, …, v26. Table 1.1 shows the mapping of variable names.
Table 1.1: Automobile Dataset Attribute Information with Variable Mapping
Variable Attribute Type Value range
V1 Symboling Numeric -3, -2, -1, 0, 1, 2, 3.
V2 Normalized-losses Continuous 65 to 256.
V3 Make Categorical alfa-romero, audi, bmw, chevrolet, dodge, honda,
isuzu, jaguar, mazda, mercedes-benz, mercury,
mitsubishi, nissan, peugot, plymouth,
porsche,renault, saab, subaru, toyota, volkswagen,
volvo
V4 Fuel-type Categorical diesel, gas
V5 Aspiration Categorical std, turbo
V6 Num-of-doors Categorical four, two.
V7 Body-style Categorical hardtop, wagon, sedan, hatchback, convertible.
V8 Drive-wheels Categorical 4wd, fwd, rwd

V9 Engine-location Categorical front, rear


V10 Wheel-base Continuous 86.6 to 120.9.
V11 Length Continuous 141.1 to 208.1.
V12 Width Continuous 60.3 to 72.3
V13 Height Continuous 47.8 to 59.8.
V14 Curb-weight Continuous 1488 to 4066
V15 Engine-type Categorical dohc, dohcv, l, ohc, ohcf, ohcv, rotor.
V16 Num-of-cylinders Categorical eight, five, four, six, three, twelve, two
V17 Engine-size Continuous 61 to 326.
V18 Fuel-system Categorical 1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi.
V19 Bore Continuous 2.54 to 3.94
V20 Stroke Continuous 2.07 to 4.17
V21 Compression-ratio Continuous 7 to 23
V22 Horsepower Continuous 48 to 288
V23 Peak-rpm Continuous 4150 to 6600.
V24 City-mpg Continuous 13 to 49
V25 Highway-mpg Continuous 16 to 54
V26 Price Continuous 5118 to 45400
Exploration of the dataset is done to find which variables have missing observations and how many. We have used aggregation plots to answer these questions. Figure 1.1 clearly shows that missing values constitute about 20% of V2 (normalized-losses) and a very small proportion, 1% to 2%, of V6, V19, V20, V22, V23 and V26. In the combinations plot on the right-hand side, the grid presents all combinations of missing (red) and observed (blue) values present in the data. There are 159 complete observations, 34 rows have a missing value only for V2, and the remaining variable combinations account for very few rows: 1 row is missing V2 and V6, 2 rows are missing V2, V22 and V23, and 4 rows are missing V2 and V26.

Fig. 1.1: Aggregation Plot Showing the Variables with Missing Values and the Missing Value Frequency

1.2.2 Techniques for Handling Missing Data


The first method is deleting the records with missing values. If a variable has missing values for many records, it is better to remove the variable from the dataset; but if the variable is of importance, then delete only those records having missing values.
For the current dataset this method is suitable for deleting only those rows where V6, V22, V23 and V26 are missing, which affects just 2 to 4 rows each.
The second method is to replace the missing values with a central value of the column. This can be done using the centralImputation() function, which replaces a missing value with the median of all values of that column if the variable is numeric or integer, and with the mode (the value that occurs the maximum number of times) for character, factor, labelled, Date and logical variables.
In the current dataset this method is applicable to variable V2. Figure 1.2 shows the Automobile data before and after applying the central imputation method to the missing values.
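As a rough base R illustration of this idea (a toy sketch with made-up vectors, not the DMwR implementation), a numeric column is filled with its median and a categorical column with its most frequent value:

# toy vectors (hypothetical values, for illustration only)
v_num <- c(3, NA, 5, 7)
v_num[is.na(v_num)] <- median(v_num, na.rm = TRUE)     # numeric: fill with the median (5)
v_cat <- c("gas", NA, "gas", "diesel")
v_cat[is.na(v_cat)] <- names(which.max(table(v_cat)))  # categorical: fill with the mode ("gas")
v_num
v_cat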

Fig. 1.2: Automobile Data Set with Missing Values before Applying Central Imputation Method and After
Imputation
Replacing missing values by the mean, mode or median is a rather crude way of handling them, as it results in a rough approximation; nevertheless, it can give satisfactory results depending on the context.
Another way of handling a missing value is replacing it with a predicted value; this can be done using knnImputation(). For every record to be imputed, it identifies the 'k' closest complete observations based on Euclidean distance and computes a weighted average of these 'k' observations.
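The weighting idea can be illustrated with a small hand-rolled sketch (toy numbers; DMwR's exact weighting scheme may differ):

# toy example: fill a missing y using the k nearest complete rows on the observed x
complete <- data.frame(x = c(1, 2, 3, 8), y = c(1.1, 2.0, 2.9, 8.2))
target_x <- 2.5                      # the record to impute has x observed, y missing
k <- 3
d  <- abs(complete$x - target_x)     # distance on the observed attribute
nn <- order(d)[1:k]                  # indices of the k closest complete rows
w  <- 1 / (d[nn] + 1e-6)             # closer neighbours get larger weights
sum(w * complete$y[nn]) / sum(w)     # weighted average used as the imputed y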
The third method used here is replacing the missing values by knnImputation(). The knnImputation() method requires the variables to be numeric, therefore only columns 1, 2 and 10:14 are retained, and the other columns, where numeric conversion could not be applied, are dropped. Figure 1.3 and Figure 1.4 show the dataset before and after imputation.

Fig. 1.3: Automobile Data Set before knn Imputation with Missing Values

Fig. 1.4: Automobile Data Set After knn Imputation

1.2.3 R Program
library(DMwR)
library(VIM)
# load the dataset in data frame car
car<-data.frame(Automobile.data)
# replace all missing value markers '?' with NA
for(i in 1:26) car[,i][car[,i] == '?'] <- NA

car
# count the number of rows with missing values
# complete.cases returns a logical vector indicating which cases in dataset are complete
#
aggr(car, numbers = TRUE, prop = c(TRUE, FALSE))
nrow(car[!complete.cases(car),])
# option 1 delete the records with missing values
noNA.car <- na.omit(car) # Option 1
nrow(noNA.car[!complete.cases(noNA.car),])
# Option 2 centralimputation
noNA.car <- centralImputation(car)
nrow(noNA.car[!complete.cases(noNA.car),])
newcar<-data.frame(noNA.car)
# data before replace
head(car)
#data after replace
head(newcar)
## [1] 0
# Option 3 knnImputation (requires numeric columns)
for(i in c(1,2,10:14)) car[,i] <- as.numeric(as.character(car[,i]))
noNA.car <- knnImputation(car[,c(1,2,10,11,12,13,14)], 10)
nrow(noNA.car[!complete.cases(noNA.car),])
newcar<-data.frame(noNA.car)
# data before replace
car
#data after replace
newcar

1.3 Data Transformation - Normalization


In data analytics and machine learning, one of the crucial preprocessing steps is data transformation.
There are different data transformation techniques:
● Smoothing: Remove noise from data
● Attribute/feature construction: New attributes constructed from the given ones
● Aggregation: Summarization, data cube construction - basically used in data warehousing
● Normalization: Scaled to fall within a smaller, specified range

– min-max normalization
– z-score normalization
– normalization by decimal scaling
Among the listed methods, normalization is given particular importance because it brings all the data onto the same scale; if the scales of different features are wildly different, this can have adverse effects on the results obtained.
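As a base R sketch of the three listed methods (using the built-in iris data; the caret-based version appears in Section 1.3.1):

x <- iris$Sepal.Length
minmax  <- (x - min(x)) / (max(x) - min(x))        # min-max normalization to [0, 1]
zscore  <- (x - mean(x)) / sd(x)                   # z-score normalization
decimal <- x / 10^ceiling(log10(max(abs(x))))      # normalization by decimal scaling
summary(minmax)
summary(zscore)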
The data used here is the iris data set downloaded from the UCI repository [2]. Figure 1.5 shows the statistical summary of the iris data set.

Fig. 1.5: Statistical Summary of iris Dataset


The dataset consists of 5 variables: sepal.length, sepal.width, petal.length, petal.width and the species class of the iris flower. Since species is a categorical variable, it is not considered in scaling. Figure 1.5 shows the min, max, mean and median values of each of the 4 numeric variables.
Method 1 - The scale transform calculates the standard deviation of an attribute and divides each value by that standard deviation.
The results after the scale transform are shown in Figure 1.6.

Fig. 1.6: Summary Results of Scale Transformation on iris Dataset


Method 2 is the center transformation, which calculates the mean of an attribute and subtracts it from each value. Figure 1.7 shows the results after the center transform.

Fig. 1.7: Summary Results of Center Transformation on iris Dataset


Method 3 combines the scale and center transformations; Figure 1.8 shows its result.

Fig. 1.8: Summary Results of Scale and Center Combination Transformation on iris Dataset
Method 4 - Data values can be scaled into the range [0, 1], which is called range normalization; the result is shown in Figure 1.9.

Fig. 1.9: Summary Results of Range Normalization


Table 1.2 compares the min and max values of the 4 attributes under the four methods mentioned above.

Table 1.2: Min – max Values after Transformation

Sepal.length Sepal.width Petal.length Petal.width


Method Max Min Max Min Max Min Max Min
None 7.900 4.300 4.400 2.000 6.900 1.000 2.500 .100
Scale 9.540 5.193 10.095 4.589 3.908 0.566 3.279 0.131
Center 2.056 1.543 1.3426 1.057 3.142 2.758 1.300 1.099
Scale and Center 2.483 -1.863 3.080 -2.425 1.779 -1.562 1.706 -1.442
Range 1.000 0.000 1.000 0.000 1.000 0.000 1.000 0.000

1.3.1 R Program
library(caret)
ir<-data.frame(read.csv("D:/DM-R/iris.csv"))
summary(ir[,1:4])
#The scale transform calculates the standard deviation for an attribute and
#divides each value by that standard deviation.
preparams<-preProcess(ir[,1:4], method=c("scale"))
print(preparams)
transformed<-predict(preparams,ir[,1:4])
summary(transformed)
#The center transform calculates the mean for an attribute and subtracts it from each value.
preparams<-preProcess(ir[,1:4], method=c("center"))
print(preparams)
transformed<-predict(preparams,ir[,1:4])
summary(transformed)
#Combining the scale and center transforms will standardize your data.
preparams<-preProcess(ir[,1:4], method=c("center","scale"))
print(preparams)
transformed<-predict(preparams,ir[,1:4])
summary(transformed)
# Data values can be scaled into the range of [0, 1] which is called normalization.
preparams<-preProcess(ir[,1:4], method=c("range"))
print(preparams)
transformed<-predict(preparams,ir[,1:4])
summary(transformed)

1.4 Data Discretization


Many a time it is required to convert numerical data to categorical or nominal form, because not all data analytics and mining methods work with numerical values. In such cases, data discretization methods can be deployed to convert the required attributes to nominal values. Discretization can also be used for data size reduction. Basically, discretization divides the range of a continuous attribute into intervals, and these interval labels can later be used to replace the actual data values.
Discretization techniques include the following:
● Data discretization by binning: This is a top-down unsupervised splitting technique based
on a specified number of bins.
● Data discretization by histogram analysis: In this technique, a histogram partitions the
values of an attribute into disjoint ranges called buckets or bins. It is also an unsupervised
method.
● Data discretization by cluster analysis: In this technique, a clustering algorithm can be
applied to discretize a numerical attribute by partitioning the values of that attribute into
clusters or groups.
● Data discretization by decision tree analysis: In this approach a decision tree is created by applying a top-down splitting approach; it is a supervised method. It is a hierarchical discretization in which the numeric attribute is discretized by selecting the value of the attribute that has minimum entropy as a split-point, and then recursively partitioning the resulting intervals.
● Data discretization by correlation analysis: This employs a bottom-up approach by finding the best neighboring intervals and then merging them to form larger intervals, recursively. It is a supervised method.
The discretize() method of the arules package in R is used to perform discretization of the iris dataset based on equal intervals, equal frequency, k-means cluster analysis and user-specified intervals.
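The equal-width boundaries, for instance, can be reproduced in base R (a sketch on the built-in iris data; discretize() automates this and the other methods):

x <- iris$Petal.Width
breaks <- seq(min(x), max(x), length.out = 3 + 1)   # 4 cut points define 3 equal-width bins
breaks                                              # 0.1 0.9 1.7 2.5
table(cut(x, breaks, include.lowest = TRUE))        # item count per bin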
(Histogram panels: Equal Interval Length, Equal Frequency, K-Means, Fixed.)

Fig. 1.10: Histogram Plot using Different Discretize Methods


Figure 1.10 shows histogram plots for the different discretize methods together with their interval boundaries. It is observed from the plots that k-means cluster analysis gives the optimum interval boundaries, shown in Table 1.3, as its bins group the data points more accurately than the other methods, where boundary points are shifted into a neighbouring bin.
Table 1.3: Boundary Values and Item Count in Each Bin for Different Discretize Methods

Discretize method Bin 1 boundaries/count Bin 2 boundaries/count Bin 3 boundaries/count
Equal length [0.1,0.9)/50 [0.9,1.7)/52 [1.7,2.5]/48
Equal frequency [0.1,0.867)/50 [0.867,1.6)/48 [1.6,2.5]/52
Cluster analysis [0.1,0.785)/50 [0.785,1.69)/54 [1.69,2.5]/46

1.5 Summary
The current case study has exhibited how preprocessing of dirty or incomplete data can be done using different R data preprocessing functions. It also showed that knnImputation() is comparatively the better imputation method among the missing data handling techniques considered. The authors have also demonstrated the different normalization and discretization methods, which can be used to scale and transform data respectively.

1.6 R Program
library(arules)
data(iris)
x <- iris[,4]
hist(x, breaks=20, main="Data")
plot(x)
plot(x,type='p',main="Equal Interval length")
def.par <- par(no.readonly = TRUE) # save default
layout(mat=rbind(1:2,3:4))
### convert continuous variables into categories (there are 3 types of flowers)
### default is equal interval width
table(discretize(x, "interval", breaks=3))

hist(x, breaks=20, main="Equal Interval length")


#plot(x,type='p',main="Equal Interval length")
abline(v=discretize(x, method= "interval",breaks=3, onlycuts=TRUE),col="red")
### equal frequency
table(discretize(x, "frequency", breaks=3))
hist(x, breaks=20, main="Equal Frequency")
#plot(x,type='p',main="Equal Frequency")
abline(v=discretize(x, method="frequency", breaks=3, onlycuts=TRUE),col="red")
### k-means clustering
table(discretize(x, "cluster", breaks=3))

hist(x, breaks=20, main="K-Means")


#plot(x,type='p',main="kmeans")
abline(v=discretize(x, method="cluster", breaks=3, onlycuts=TRUE), col="red")
## user-specified
table(discretize(x, "fixed", breaks = c(-Inf,.8,1.6, Inf),
labels=c("small", "medium","large")))
hist(x, breaks=20, main="Fixed")
#plot(x,type='p',main="Fixed")
abline(v=discretize(x, method="fixed", breaks = c(-Inf,.8,1.6,Inf),
onlycuts=TRUE), col="red")

1.7 References
1. Automobile Data Set – UCI Repository: https://archive.ics.uci.edu/ml/datasets/automobile.
2. Iris Data Set – UCI Repository: https://archive.ics.uci.edu/ml/datasets/iris.
3. Jiawei Han, Micheline Kamber and Jian Pei, "Data Mining: Concepts and Techniques", 3rd ed., The Morgan Kaufmann Series in Data Management Systems, Morgan Kaufmann Publishers, July 2011. ISBN 978-012381479.
4. S.S. Jamsandekar, R.R. Mudholkar, "Performance Evaluation by Fuzzy Inference Technique", International Journal of Soft Computing and Engineering (IJSCE), Volume-3, Issue-2, May 2013.
5. S.S. Jamsandekar, R.R. Mudholkar, "Fuzzy Classification System by Self Generated Membership Function using Clustering Technique", BIJIT - BVICAM's International Journal of Information Technology, Special Issue on Fuzzy Logic, Vol. 6, No. 1, Issue 11, January-June 2014.


Case Study 2

K-means Cluster Analysis for Indian Liver Patients:


Unsupervised Learning Approach

2.1 Introduction
Cluster analysis refers to a series of techniques that allow the subdivision of a dataset into subgroups based on their similarities. This case study explores unsupervised classification of Indian liver patients' data using K-means clustering. The dataset for the present study is retrieved from the UCI repository. The objective is to exhibit the performance of K-means clustering by tuning the number of clusters and iterations on the Indian Liver Patients data. Performance is evaluated with reference to the within-cluster sum of squared errors.

2.2 Exploring Data


The dataset for this cluster analysis is retrieved from the UCI data repository [1]. It comprises details of 416 liver patients collected from the north east of Andhra Pradesh, India. Unsupervised classification is carried out on this data considering nine attributes: Age of the patient, Total Bilirubin (TB), Direct Bilirubin (DB), Alkphos (Alkaline Phosphotase), Sgpt (Alamine Aminotransferase), Sgot (Aspartate Aminotransferase), Total Proteins (TP), Albumin (ALB) and A/G Ratio (Albumin and Globulin Ratio). Figure 2.1 shows basic statistical computations on the dataset.

Fig. 2.1: Basic Statistical Computations Indian Liver Patients Dataset

2.3 K-means Cluster: Theoretical Approach


For the present study, we have employed K-means clustering, an unsupervised classification technique. The K-means technique derives a collection of k clusters using a heuristic search, starting with k randomly chosen points, each of which initially represents a cluster mean. The cluster analysis is then based on measuring the similarity between data items by computing the distance between each pair; similarity is measured with respect to the mean value of the data items in a cluster.

2.4 Indian Liver Patient Cluster Analysis


The present investigation of visualization and analysis of liver patient clusters is carried out in R, an open source data mining platform. The R program is given at the end of this case study. Liver patient records are clustered based on nine parameters. A single record L_i is represented as a multidimensional data vector defined as:
L_i = [Age_i, TB_i, DB_i, Alk_i, Sgpt_i, Sgot_i, TP_i, Alb_i, AG_i], where i = 1 to 416
Data preprocessing is carried out by removing null values from the dataset with the na.omit() function. K-means clustering requires the analyst to specify the number of clusters to retrieve, so we plotted the number of clusters against the within-cluster sum of squares, the quantity to be minimized during the clustering process. The R code for this is shown in Figure 2.2. Figure 2.3 shows the within-cluster sum of squares for different numbers of clusters; the plot reveals that this quantity decreases up to 6 clusters, after which there is no significant further decrease.

Fig. 2.2: R Code for Determining No. of Clusters by Plotting


Fig. 2.3: Number of Clusters Against within Clusters Sum of Squares


The K-means clustering technique divides the data items into k partitions in such a way that the within-cluster similarity is high while the between-cluster similarity is low. The R function kmeans() is applied to obtain clusters of liver patients. The experiment is tuned by varying the number of clusters while keeping the number of iterations constant at 15. The experiment is summarized in Table 2.1.

Performance is evaluated with reference to the within-cluster sum of squared errors and the BetweenSS/TotalSS ratio. Ideally, a clustering has the properties of internal cohesion and external separation, i.e., the BSS/TSS ratio should approach 1.
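As a short sketch (assuming the preprocessed matrix lp from the program in Section 2.7 is already in memory), the ratio reported in Table 2.1 can be read directly off each kmeans() fit:

for (k in 3:6) {
  km <- kmeans(lp, centers = k, iter.max = 15)
  cat(k, "clusters:", round(100 * km$betweenss / km$totss, 1), "%\n")
}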
Table 2.1: Performance Evaluation for Accuracy of Cluster Designs

No. of clusters Cluster Size Between_SS/Total_SS


3 10, 370, 34 57.9%
4 359, 20, 33, 2 75%
5 9, 59, 19, 2, 325 81.2%
6 2, 30, 59, 8, 306, 9 86.2%

2.5 Results and Discussions


Considering the cluster performance summarized in Table 2.1 and Figure 2.3, it has been decided to design four clusters of liver patients. The table shows that for 4 clusters the BSS/TSS ratio is 75%, indicating a good fit; the cluster sizes are also taken into consideration. The K-means algorithm is applied with four clusters, and the derived clustering results in 4 clusters of sizes 359, 20, 32 and 2. The corresponding cluster means are shown in Figure 2.4. The within-cluster variation measures the extent to which each item in a cluster differs from the others in the same cluster. Figures 2.5 and 2.6 show the cluster plots for the 1st and 2nd principal components and discriminant functions respectively. The plots reveal that cluster 1 is dense and that some of the data points of cluster 2 are close to cluster 1.

Fig. 2.4: Cluster Means for Four Clusters


Analyzing the cluster means, we can relate each group to one of four classes of patients:
● Cluster 4 has the highest values for Total Bilirubin, Direct Bilirubin, Sgpt Alamine Aminotransferase and Sgot Aspartate Aminotransferase. There are only two data points in cluster 4.
● Cluster 1 has the lowest values for Total Bilirubin, Direct Bilirubin, Alkphos Alkaline Phosphotase, Sgpt Alamine Aminotransferase and Sgot Aspartate Aminotransferase. There are 359 data points in cluster 1.
● The Alkphos Alkaline Phosphotase value is larger for cluster 3 than for the other clusters, and its average Age is also higher.
● The values for Total Proteins (TP), Albumin (ALB) and A/G Ratio (Albumin and Globulin Ratio) are the lowest for the remaining cluster.

(The first two components explain 51.97% of the point variability.)

Fig. 2.5: Cluster Plot for 1st and 2nd Principal Components
Fig. 2.6: Cluster Plot for 1st and 2nd Discriminant Functions

2.6 Summary
Authors have demonstrated the K-means clustering as tool for anlysing liver patients dataset
together with visulaization for interpreting the results. This case study has demonstrated cluster
analysis by modulating number of clusters. Cluster analysis is accomplished by resorting to a series of
techniques that allow the subdivision of a dataset into subgroups, based on their similarities. Thus
derived k-means clustering results into 4 clusters of sizes 359, 20 33 and 2. The result suggests that k-
means has the potential to exhibit the pre-eminent tool for liver patients cluster analysis.

2.7 R Program for K-means Cluster Analysis for Indian Liver


Patients
#Loading required libraries
library(stats)
library(purrr)
library(cluster)
library(fpc)
#Loading Indian liver patients data into R environment
lp<- read.csv("D:/AI HA/Dataset/liverPatient.csv")
set.seed(123)
# Data Preprocessing
lp <- na.omit(lp)
lp <- scale(lp)
# Determine number of clusters
wss <- (nrow(lp)-1)*sum(apply(lp,2,var))
for (i in 2:10) wss[i] <- sum(kmeans(lp, centers=i)$withinss)
plot(1:10, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")

# Designing three clusters


km <- kmeans(lp, 3, iter.max = 15)
print(km)

# Cluster Plot against 1st 2 principal components


clusplot(lp, km$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)

# Centroid Plot against 1st 2 discriminant functions


plotcluster(lp, km$cluster)

2.8 References
1. Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository
[http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and
Computer Science.
2. R. S. Kamath, R. K. Kamat, S.M. Pujar, Correlating R & D Expenditure and Scholarly Publication
Output Using K-Means Clustering, International Journal of Information Technology, Modeling
and Computing (IJITMC), Vol. 5, No.1, February, Page 1-7, 2017.
3. Graham Williams, Data Mining with Rattle and R, The Art of Excavating Data for Knowledge
Discovery, Springer, New York, 2011.
4. R.K. Kamat, R.S. Kamath, Visualization of Earthquake Clusters over Space: K-Means Approach,
Journal of Chemical and Pharmaceutical Sciences (JCHPS),Volume 10, Issue 1, 2017, Page 250-
253.
5. R.S. Kamath and R.K. Kamat , K-Means Clustering for Analyzing Productivity in Light of R & D
Spillover, International Journal of Information Technology, Modeling and Computing (IJITMC),
Vol.4, No. 2, 2016, 55-64.


Case Study 3

Sales Prediction by Principal Component Analysis and


Recursive Partitioning Regression Tree Method

3.1 Problem Description and Objectives


Most supermarket companies like to estimate upcoming sales. Good forecasting keeps them from overestimating or underestimating future sales, either of which can cause great harm to the company. Using a reliable sales prediction model, supermarkets can purchase products more sensibly, leading to profitable revenues. This is a need for every business in the current highly competitive marketplace.

Objectives
● Principal component analysis for dataset dimensionality reduction and feature selection.
● Building a predictive regression tree model for supermarket sales prediction.

3.2 Data Description


In this case study we have used the supermarket sales data set retrieved from Kaggle [1]. The dataset contains the historical sales of a supermarket company, recorded in 3 different branches for the first 3 months of 2019; the dimension of the dataset is 1000 × 17. Table 3.1 shows the attribute information with the range of values.
Table 3.1: Attribute Information of Supermarket Sales Dataset

Sr. No Attribute Name Description and Range of Values


1 Invoice id: Computer generated sales slip invoice identification number.
2 Branch: Branch of supercenter (3 branches are available identified by A, B and
C).
3 City: Location of super centers.
4 Gender: Gender type of customer.
5 Customer type: Type of customers, recorded by Members for customers using member
card and Normal for without member card.
6 Product line: General item categorization groups Electronic accessories, Fashion
accessories, Food and beverages, Health and beauty, Home and
lifestyle, Sports and travel
7 Unit price: Price of each product in $.
8 Total: Total price including tax.
9 Tax: 5% tax fee for customer buying.

10 Quantity: Number of products purchased by customer.


11 Date: Date of purchase (Record available from January 2019 to March 2019.
12 Time: Purchase time (10am to 9pm).
13 Payment: Payment used by customer for purchase (3 methods are available –
Cash, Credit card and Ewallet).
14 COGS: Cost of goods sold.
15 Gross margin percentage: Gross margin percentage of each product.
16 Gross income: Gross Income
17 Rating: Customer stratification rating on their overall shopping experience (On
a scale of 1 to 10).

3.3 Building PCA Model


Data Preprocessing: The dataset has 17 variables and not all of them are predictive factors, so instead of using all variables in the predictive analysis, the factors contributing to sales prediction can be extracted. Principal component analysis has proved to be a good dimension reduction and feature selection technique [2], [3].
The main objective of Principal Component Analysis (PCA) as a feature selection/extraction technique is to reduce the number of dimensions that describe our data without losing information. The first step in PCA is to decide upon the number of principal components, or factors, we want to retain.
To use PCA, all variables in the dataset must be numeric, so data transformation is required for the non-numeric variables. In the current working dataset, branch, city, customer type, product line, gender and payment are categorical variables and are converted to numeric form.
The c() function of R is used for the categorical-to-numeric conversion of the above listed variables, after which the categorical columns are dropped and replaced with the generated numeric vectors.
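As a toy illustration (hypothetical vector) of what this conversion yields when the column is a factor: in the R version used here, wrapping a factor in c() drops the labels and leaves the integer level codes, and as.integer() is the explicit equivalent:

branch <- factor(c("A", "C", "B", "A"))   # made-up branch labels
as.integer(branch)                        # 1 3 2 1 -- codes follow the factor's level order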
The first variable, invoice_id, is also dropped from the dataset. For the Date variable, since all dates are from the same year and the day of the month is not that important, only the month of the sale is retained. For the Time variable only the hour value is retained. The gross margin variable has a constant value, and PCA cannot be applied to constant columns, so gross margin is dropped.
The preprocessed data contains 15 variables, all in numeric format, as shown in Figure 3.1.
The prcomp() function of R is used to perform PCA, and a scree plot of the prcomp() results is used to find the number of factors/principal components to retain in a factor analysis (FA) or principal component analysis (PCA). A scree plot displays the eigenvalues as a downward curve, in descending order, showing the percentage of variance explained by each principal component.

Fig. 3.1: Pre-processed Supermarket Sale dataset Structure

3.4 Analysis and Visualization of Principal Components


According to the scree test for the supermarket dataset, the "elbow" of the graph where the eigenvalues level off occurs at 2, so we have retained PC1 and PC2.
Using the standard deviations of the principal components (PCs), the variance and percentage of variance are computed as shown in Table 3.2, which lists the first 10 PCs and their percentages of variance. PC1 and PC2 have the highest percentages of variance.
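The computation behind Table 3.2 is a short sketch once the PCA fit exists (assuming prin_comp is the prcomp() object built in the program of Section 3.8):

pr_var     <- prin_comp$sdev^2          # variance of each principal component
prop_varex <- pr_var / sum(pr_var)      # each PC's share of the total variance
round(100 * prop_varex[1:10], 3)        # percentage of variance for the first 10 PCs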
Fig. 3.2: Scree Plot of PCA


Table 3.2: Principal Components with Percentage of Variance

Principal component Percentage of variance


PC1 32.814
PC2 10.547
PC3 7.783
PC4 7.179

PC5 6.762
PC6 6.646
PC7 6.508
PC8 6.200
PC9 6.068
PC10 5.689
The variable correlation plot shown in Figure 3.3 depicts the relationships between all variables. It can be interpreted as follows:
● Positively correlated variables are grouped together.
● Negatively correlated variables are plotted on opposite sides of the plot origin.
● The distance between the origin and a plotted variable measures the quality of the variable; variables far from the origin are well represented on the factor map.
The factor map also gives the contribution of each variable to a principal component: the larger the value, the more the variable contributes to that component.
Thus the city, branch, quantity, unit price, gross income, cogs, tax and total variables of the supermarket sales dataset contribute the most, and these variables lie closer to the correlation circle of the plot. Figure 3.4 and Figure 3.5 show the contribution of variables to PC1 (Dim1) and PC2 (Dim2) respectively.

Fig. 3.3: Plot of Correlational Variables of PC



Fig. 3.4: Contribution of Variables to PC1



Fig. 3.5: Contribution of Variables to PC2



3.5 Prediction Model using rpart with PC


The principal components obtained are used to build the predictive model for sales prediction. A regression tree is built with the decision tree function rpart() using the anova method, with a 70:30 split of the data into training and testing sets respectively.

3.6 Results
plotcp() and printcp() provide graphical and tabular representations of the cross-validated error summary. The cp values are plotted against the geometric mean to depict the deviation until the minimum value is reached. The regression tree model is built with complexity parameter cp = 0.01, as derived from the cp plot and cp table shown in Figure 3.6 and Figure 3.7, with a prediction accuracy of 92%; the root node split alone reduces the relative error by 70%.
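A hedged sketch of how the complexity parameter can be used (assuming the rpart.model object from the program in Section 3.8): the cp with the lowest cross-validated error is taken from the CP table and the tree is pruned to it:

library(rpart)
cp_best <- rpart.model$cptable[which.min(rpart.model$cptable[, "xerror"]), "CP"]
pruned  <- prune(rpart.model, cp = cp_best)   # prune the tree to the selected complexity
printcp(pruned)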

Fig. 3.6: CP Plot

Fig. 3.7: CP Table with Relative Error and Standard Deviation

3.7 Summary
In this case study of a product sales dataset, the authors have investigated how Principal Component Analysis (PCA) can be applied efficiently to reduce the dimensionality of the dataset and to identify the major contributing factors/components. Subsequently, building a prediction model on the principal components obtained shows higher prediction accuracy than using the raw variables, and the root node split reduces the relative error by 70%.

3.8 R Program
#super market sales data set analysis and prediction model
# read dataset
superMSales = read.csv("C:/Users/S.S.JAMSANDKAR/Downloads/supermarket-sales/supermarket_sales-Sheet1.csv")
# preprocessing of data is required
# drop i - id column ( first column)
superMSales<-superMSales[-1]
# data transformation
#converting all categorical data variables to integers as we need them for PCA
# create a branch vector
branch_v<-c(superMSales$Branch)
class(branch_v)
# create a city vector
city_v<-c(superMSales$City)
# create a customer_type vector
cust_type_v<-c(superMSales$Customer.type)
#Create gender vector (factor level codes: 1 for Female, 2 for Male)
gender_v <- c(superMSales$Gender)
# create product.line vector
product_v<-c(superMSales$Product.line)
# create a payment type vector
payment_v<-c(superMSales$Payment)
# drop all categorical columns and replace them with the generated vectors of the respective categorical columns
# drop branch,city, customer.type,gender, product.line and payment columns
superMSales<-superMSales[-c(1:5,12)]
# load the tibble library to use its add_column() method to add a vector at the required position
superMSales<-cbind(superMSales,branch_v,city_v)
superMSales<-cbind(superMSales,cust_type_v,gender_v,product_v,payment_v)
library(tibble)
head(superMSales)
# since PCA works on numeric variables, let's see if we have any variable other than numeric.
#check variable class
str(superMSales)
# still, out of 16 variables, 2 variables are categorical

# use lubridate package to convert to numeric


library(lubridate)
#The Date class means dates are stored as the number of days since January 1, 1970, with
#negative values for earlier dates. We can use the as.numeric function to view the raw values.
# since all dates are of same year and also date of month is not that important
# Month of sale could be a prominent feature
date_n<-as.numeric(month(mdy(superMSales$Date)))
class(superMSales$Time)
s<-strsplit(x=as.character(superMSales$Time), split = ":")
v1<-unlist(s)
time_n <- numeric(0)   # will hold the hour component of each Time value
j <- 1
i <- 1
while (i <= length(v1)) {
  time_n[j] <- as.numeric(v1[i])
  i <- i + 2
  j <- j + 1
}

superMSales<-superMSales[-c(5:6)]
#add the derived date_n and time_n columns at the required positions
library(tibble)
superMSales<-add_column(superMSales,date_n, .after = "Total")
superMSales<-add_column(superMSales,time_n,.after="date_n")
# gross.margin.percentage column has a constant value; PCA won't be able to rescale it, so we drop the column
superMSales<-superMSales[-8]

# we now have all the numerical values. Let's divide the data into test and train.
# 70:30 train and test data
train= 70/100*(nrow(superMSales))
print(train)
#divide the superMSales data
pca.train <- superMSales[1:train,]
pca.test <- superMSales[-(1:train),]
#We can now go ahead with PCA.
#The base R function prcomp() is used to perform PCA. By default, it centers the variables to have
#mean equal to zero. With the parameter scale. = T, we normalize the variables to have standard
#deviation equal to 1.
str(pca.train)

#principal component analysis


prin_comp <- prcomp(pca.train, scale. = T)
library(factoextra)
#The proportion of variances retained by the principal components can be extracted as follow :
std_dev <- prin_comp$sdev
#compute variance
pr_var <- std_dev^2
#proportion of variance explained
prop_varex <- pr_var/sum(pr_var)
prop_varex[1:10]
fviz_pca_var(prin_comp,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
fviz_pca_var(prin_comp, col.var = "cos2",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
names(prin_comp)
#The prcomp() function results in 5 useful measures:
#1. center and scale refer to the mean and standard deviation of the variables that are
#   used for normalization prior to implementing PCA
#outputs the mean of variables
prin_comp$center
#outputs the standard deviation of variables
prin_comp$scale
#2. The rotation measure provides the principal component loadings. Each column of the rotation
#   matrix contains a principal component loading vector. This is the most important measure we
#   should be interested in.
prin_comp$rotation
#
prin_comp$rotation[1:5,1:4]
# Contributions of variables to PC1
fviz_contrib(prin_comp, choice = "var", axes = 1, top = 10)
# Contributions of variables to PC2
fviz_contrib(prin_comp, choice = "var", axes = 2, top = 10)
# prediction Model
#principal component analysis
pca.test

prin_comp_Test<- prcomp(pca.test, scale. = T)


fviz_pca_var(prin_comp_Test,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
#add a training set with principal components
train.data <- data.frame(Sales = pca.train$Total, prin_comp$x)
train.data
#we are interested in first 5 PCAs
train.data <- train.data[,1:5]
#run a decision tree
install.packages("rpart")
library(rpart)
rpart.model <- rpart(Sales ~ .,data = train.data, method = "anova")
rpart.model
plot(rpart.model, uniform=TRUE, main="Regression Tree")
text(rpart.model, use.n=TRUE, all=TRUE, cex=.8)
plotcp(rpart.model)
# print CP table
rpart.model$cptable
printcp(rpart.model)
#transform test into PCA
test.data <- predict(prin_comp_Test, newdata = pca.test)
test.data <- as.data.frame(test.data)
#select the first 5 components
test.data <- test.data[,1:5]
#make prediction on test data
rpart.prediction <- predict(rpart.model, test.data)
rpart.prediction
actuals <- pca.test$Total
# create confusion matrix
confMat <- table(actuals,rpart.prediction)
confMat
# find the diagonal of confusion matrix
di<-diag(confMat)
# calculate accuracy
accuracy <- sum(di)/sum(confMat)
print(accuracy)

3.9 References
1. Supermarket data set : https://www.kaggle.com/aungpyaeap/supermarket-sales
2. Ana Lúcia Silva Margarida GMS Cardoso, “Predicting supermarket sales: The use of regression
trees”, Journal of Targeting, Measurement and Analysis for Marketing, Springer, April
2005, Volume 13, Issue 3, pp 239–249.
3. Manjuprasad Shetty, Vishwasa Nawada, Kirthan Pai, Ranjan Kumar, “Predicting Sales In
Supermarkets”, International Research Journal of Engineering and Technology (IRJET), Volume:
05, Issue: 03, Mar-2018.
4. S.S. Jamsandekar, K.S. Mahajan, “ Machine Learning Approach for Marketing Intelligence:
Managerial Application”, International Journal of Engineering and Computer Science, Vol. 6(7),
July 2017.
5. James W. Taylor, "Forecasting Daily Supermarket Sales Using Exponentially Weighted Quantile Regression", European Journal of Operational Research, 2007, Vol. 178, pp. 154-167.


Case Study 4

Analysis of Variable Importance Measures for


Parkinson’s Data: Random Forest Approach

4.1 Problem Description


This case study portrays Random Forest (RF) modelling on Parkinson's data for the measurement of variable importance and for classification. RF is a collection of unpruned decision trees. For the present investigation, the optimum random forest architecture is derived by tuning the number of trees of the RF model. The importance of predictors is measured by considering "Mean Decrease Accuracy" and "Mean Decrease Gini". Moreover, the performance of the model is evaluated with reference to the Out-of-bag (OOB) estimate of the error rate.

4.2 Data Description and Exploration


The Parkinson's dataset for the present study is retrieved from the UCI machine learning repository [1]. It consists of 22 biomedical voice measurements from 31 people, with six recordings per person. The derived dataset contains 195 records; each column in the dataset is a particular voice measure and each row corresponds to a voice recording from one of these individuals. Out of the 31 individuals, 23 have Parkinson's disease, which is indicated by the value "1" in the "status" column [1].

4.3 Random Forest Technique


In the present investigation, we have carried out variable importance measurement and classification on Parkinson's data using the Random Forest approach. RF is a versatile machine learning method capable of performing regression, classification, dimension reduction and other machine learning tasks. It is a type of ensemble learning method, where a group of weak models combine to form a powerful model.
The variable importance measures of the RF model considered in this analysis are:
● Mean Decrease Accuracy
● Mean Decrease Gini
● Mean raw importance score of variable x for class 0
● Mean raw importance score of variable x for class 1
The higher the values, the more important the variables are.
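A minimal sketch of how these measures are obtained with the randomForest package (object and column names are assumptions: the data is taken to be in a data frame park with the class label in the status column):

library(randomForest)
set.seed(42)
rf <- randomForest(factor(status) ~ ., data = park, ntree = 200, importance = TRUE)
importance(rf)     # MeanDecreaseAccuracy, MeanDecreaseGini and per-class raw scores
varImpPlot(rf)     # dot plots of the two main importance measures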

4.4 Analysis of Variable Importance Measures


This section details the experiment conducted for the present analysis. R and Rattle are used to analyze the model structure, the number of trees in the forest and the variable importance measures used for partitioning the dataset. The randomForest package is used for the present investigation. Figure 4.1 summarizes the variable importance measures for Parkinson's data. It shows that the predictor "PPE", the nonlinear measure of fundamental frequency variation, stands at the top, whereas "NHR", the ratio of noise to tonal components in the voice, is the least important variable of the dataset.

Fig. 4.1: Variable Importance Measures for Parkinson’s Data


In the present investigation, the model is tuned with different RF architectures. The training dataset is used for parameter adjustment and the validation set to control the process. RF builds many decision trees using random subsets of the data and variables. The optimized RF architecture is derived and the textual representation of the corresponding model is shown in figure 4.2. This model entails 200 trees in the forest with 4 partitioning variables, and 136 observations are used for the construction of the forest. The classification error of 0.17 is significantly low.
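Continuing the hedged sketch from section 4.3, the number of partitioning variables tried at each split can be tuned against the OOB error with tuneRF; the parameter values below are illustrative, not the exact settings of this study.
x <- pk[, setdiff(names(pk), "status")]   # the voice predictors (from the earlier sketch)
y <- pk$status
set.seed(42)
tuneRF(x, y,
       ntreeTry   = 200,    # trees grown for each candidate mtry
       stepFactor = 1.5,    # factor by which mtry is changed at each step
       improve    = 0.01)   # minimum relative OOB improvement required to continue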

Fig. 4.2: Random Forest Model for the Classification of Parkinson’s Data

4.5 Results and Discussion


The "randomForest" package of R is used to analyze variable importance and classification on Parkinson's data. Figure 4.3 shows the relative importance of the predictors of the dataset taken under the study in terms of Mean Decrease Accuracy, Mean Decrease Gini, and the mean raw importance score of each variable for class 0 and class 1. It reveals that:
● "PPE", the nonlinear measure of fundamental frequency variation, stands at the top
● "NHR", the ratio of noise to tonal components in the voice, is the least important variable of the dataset

Fig. 4.3: Relative Importance of Variables



Higher values indicate more important variables. The predictors with lower values are dropped while building the classification model. The study concludes that RF performs dimension reduction proficiently on Parkinson's data.
The second objective of the present study is to design an RF model as a classifier for Parkinson's data. Figure 4.2 elaborates the design of the classification model in the R environment. Figure 4.4 shows the error plot, which represents the error rate progressively for the number of trees built; this helps in deciding the optimum number of trees while building the RF model.
Fig. 4.4: Error Rate Progressively for the Number of Trees Built
Fig. 4.5: ROC Curve based on Out-of-bag Predictions

The plot shown in figure 4.5 is the Receiver Operating Characteristic (ROC) curve based on the out-of-bag (OOB) predictions for each observation in the training dataset. This curve plots the true positive rate against the false positive rate. The area under the curve is 0.969, which is close to 1; a better model is one with a larger area under the curve. The performance accuracy reveals that the RF model is appropriate for classifying Parkinson's data.
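A hedged sketch of how these two diagnostics can be reproduced from a fitted randomForest object is given below; the objects rf and pk come from the sketch in section 4.3, and pROC is an assumption, not necessarily the package used in this study.
library(pROC)
plot(rf)                                   # OOB error rate versus the number of trees grown
roc_obj <- roc(pk$status, rf$votes[, 2])   # OOB vote fraction for class "1" used as the score
plot(roc_obj)                              # ROC curve
auc(roc_obj)                               # area under the curve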
The RF model designed in this study is able to classify Parkinson's data with very little error. Figure 4.6 shows the confusion matrix obtained by applying the validation dataset to the derived RF model. The average class error is 2.1%, which is very low.

Fig. 4.6: Error Matrix for Random Forest on Validation Data

4.6 Summary
This case study illustrated Random Forest (RF) modelling on Parkinson's data for measuring variable importance and for classification. The reported investigation depicts the optimum random forest architecture achieved by tuning the number of trees in the RF model. The importance of predictors is measured by considering "Mean Decrease Accuracy" and "Mean Decrease Gini". The derived RF entails 200 trees in the forest with 4 partitioning variables. Moreover, the performance of the model is evaluated with reference to the Out-of-bag (OOB) estimate of error rate. The predictor "PPE" stands at the top of the variable list.

4.7 References
1. Little, M.A., McSharry, P.E., Roberts, S.J., Costello, D.A.E., Moroz, I.M., "Exploiting Nonlinear Recurrence and Fractal Scaling Properties for Voice Disorder Detection", BioMedical Engineering OnLine, 2007, 6:23.
2. R.S. Kamath, R.K. Kamat, "Modelling Fetal Morphologic Patterns through Cardiotocography Data: A Random Forest based Approach", Research Journal of Pharmaceutical, Biological and Chemical Sciences (RJPBCS), Volume 7, Issue 5, 2016.
3. Liaw, A., & Wiener, M. (2002), "Classification and Regression by randomForest", R News, 2(3).
4. Graham, W., "Data Mining with Rattle and R: The Art of Excavating Data for Knowledge Discovery", Springer, DOI 10.1007/978-1-4419-9890-3.
5. Kamath, R., & Kamat, R. (2016), "Educational Data Mining with R and Rattle", River Publishers Series in Information Science and Technology, River Publishers, Netherland, ISBN Print: 978-87-93379-31-2, E-book: 978-87-93379-30-5.
6. Breiman, L. (2001), "Random Forests", Machine Learning, 45(1): 5-32.


Case Study 5
Classification Model for Thoracic Surgery Data: A Decision Tree Based Approach

5.1 Problem Description


This case study presents Decision Tree (DT) modelling of Thoracic surgery data for the classification of lung cancer patients' post-operative life expectancy. The dataset employed in the present study comprises two classes of life expectancy with a sample size of 470 records. The decision tree model entails the recursive partitioning approach implemented in the "rpart" package of R. The optimum model is derived by tuning parameters such as minimum split, minimum bucket, maximum depth and complexity. The performance of the model is evaluated in terms of the Mean Square Error (MSE) estimate of error rate.

5.2 Data Exploration


The Thoracic surgery data retrieved from the UCI machine learning repository relates to post-operative life expectancy in lung cancer patients [1]. This dataset consists of 470 instances of lung cancer patients classified into two classes of life expectancy: death within one year after surgery and survival. This classification is based on 16 attributes: Diagnosis, Forced vital capacity, Volume that has been exhaled, Performance status, Pain before surgery, Haemoptysis before surgery, Dyspnoea before surgery, Cough before surgery, Weakness before surgery, Size of the original tumour, Diabetes mellitus, MI up to 6 months, Peripheral arterial diseases, Smoking, Asthma and Age.

5.3 Theory on Decision Tree


The proposed research presents decision tree based classification of Thoracic surgery data into two classes of post-operative life expectancy. The DT is designed by splitting the population into sub-populations based on differentiators in the input variables. It uses one of the following techniques to identify the most significant variable and obtain homogeneous sets of sub-populations (a small numeric illustration of the Gini index follows the list):
● Gini Index
● Chi-Square
● Information Gain
● Reduction in Variance
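As a small hedged illustration (hypothetical counts, not taken from the Thoracic dataset), the Gini index of a node and of a candidate split can be computed as follows; a split is preferred when the weighted impurity of the child nodes falls below the impurity of the parent node.
# Gini impurity of a node from its class counts
gini <- function(counts) 1 - sum((counts / sum(counts))^2)
gini(c(70, 30))      # parent node: 70 survivors, 30 deaths within one year -> 0.42
left  <- c(55, 10)   # child nodes produced by a candidate split
right <- c(15, 20)
(sum(left) * gini(left) + sum(right) * gini(right)) / (sum(left) + sum(right))
                     # weighted child impurity, about 0.34, lower than the parent node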
The DT model is simulated in the R environment; the corresponding R programme is given at the end of this case study. The model is conceived as a sixteen-input, single-output arrangement. Figure 5.1 shows the decision tree derived in the present investigation, representing the post-operative life expectancy of lung cancer patients. If the root node test on the Diagnosis value yields "yes", traversal continues down the left side of the tree, otherwise down the right side. The next test down this left side of the tree is on the PRE14 value. The traversal proceeds in this way until a class value is retrieved for the cancer patient. The performance of the model is measured in terms of the Mean Square Error between predicted output and actual output.

5.4 Decision Tree Model for Thoracic Surgery Data


This section portrays details of the experiment carried out for the classification of Thoracic surgery data using the decision tree approach. We used the "rpart" function of the "rpart" package. Figure 5.2 summarizes the DT model thus obtained for classification. This textual view highlights the key elements of the decision tree construction. The variable "CP" stands for Complexity Parameter. Figure 5.3 shows the CP values plotted against the geometric mean to depict the deviation until the minimum value is reached.
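Section 5.1 mentions tuning the minimum split, minimum bucket, maximum depth and complexity parameters. A hedged sketch of such tuning with rpart.control is given below; the parameter values are illustrative, and ds and train.y are the objects built in the program of section 5.6.
ctrl <- rpart.control(minsplit  = 20,    # minimum observations needed to attempt a split
                      minbucket = 7,     # minimum observations in any terminal node
                      maxdepth  = 10,    # maximum depth of the tree
                      cp        = 0.01)  # complexity parameter
dt_tuned <- rpart(train.y ~ ., data = ds, method = "class", control = ctrl)
printcp(dt_tuned)    # cross-validated error for each CP value
# prune back to the CP value with the lowest cross-validated error (xerror)
best_cp <- dt_tuned$cptable[which.min(dt_tuned$cptable[, "xerror"]), "CP"]
dt_pruned <- prune(dt_tuned, cp = best_cp)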

Fig. 5.1: Decision Tree for Thoracic surgery data



Fig. 5.2: Summary of the Decision Tree Model for Classification
Fig. 5.3: Graphical Representation of Cross-Validated Error Summary

Table 5.1: Variable Importance Measures of Thoracic Data


Variable Importance
Diagnosis 40
PRE4 32
PRE14 14
PRE5 12
PRE6 2
Table 5.1 gives the variable importance computed by the decision tree model. Figure 5.4 gives the partial structure of the node and split details. Each rule corresponds to one path through the tree, starting at the root node and ending at a leaf node.

Fig. 5.4: Partial Structure of Node and Split Details

The derived DT model efficiently classifies the test data with very little error. Figure 5.5 shows the predicted values for the decision tree model on the test data. The first column denotes the tuple IDs of the observations belonging to the test set, and the cell values represent the percentage belongingness to class "0" and class "1". For example, for the fifth record of the dataset, 94% denotes "False" for death within one year after surgery, implying that survival is "true". Table 5.2 gives a comparison of the actual values versus the values predicted by the DT model.

Fig. 5.5: Predicted Values for Test Data

Table 5.2: Actual and Predicted Values of Test Set

Death within one year after surgery Survival


Actual values 28 138
DT predicted values 47 119

5.5 Summary
This case study presented modelling of Thoracic surgery data for the classification of lung cancer
patients ’ post operative life expectancy. Result concludes that DT modeling is a suitable approach
since the resulting analysis is more precise.

5.6 R Program for the Design of Decision Tree Model on Thoracic Surgery Data
#Loading required libraries
library(rpart)
library(rpart.plot)

# Loading Thoracic data into R environment


ths<- read.csv("D:/AI HA/Dataset/Thoracic.csv")
summary(ths)
# Data preprocessing
ths <- na.omit(ths)

# Determine sample size


ind <- sample(2, nrow(ths), replace=TRUE, prob=c(0.67, 0.33))
train.x = ths[ind==1, 1:16]
train.y = ths[ind==1, 17]
test.x = ths[ind==2, 1:16]
test.y = ths[ind==2, 17]
ds <- cbind(train.x, train.y)

# Decision tree model using “rpart”


dt <- rpart(train.y ~ ., data = ds, method="class")
printcp(dt)
plotcp(dt)
summary(dt)
rpart.plot(dt)

#Predict output on test data


predicted= predict(dt,test.x)
predicted
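#As a hedged addition (not part of the original program), the per-class probabilities in
#"predicted" can be turned into class labels and tabulated against the actual classes,
#which is how a comparison such as Table 5.2 can be produced
pred.class <- colnames(predicted)[max.col(predicted)]
table(actual = test.y, predicted = pred.class)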

5.7 References
1. Zięba, M., Tomczak, J.M., Lubicz, M., & Świątek, J. (2013), "Boosted SVM for extracting rules from imbalanced data in application to prediction of the post-operative life expectancy in the lung cancer patients", Applied Soft Computing.
2. R.S. Kamath, R.K. Kamat, “Modelling Fetal Morphologic Patterns Through Cardiotocography
Data: Decision Tree Based Approach”, Journal of Pharmacy Research, Vol 12, Issue 1, 2017, Page
9-12.
3. R.S. Kamath, R.K. Kamat, “Modelling of Random Textured Tandem Silicon Solar Cells
Characteristics: Decision Tree Approach”, Journal of Nano and Electronic Physics, Vol. 8 No 4(1),
2016, 04021(4pp).
4. C.C. Fischer, K.J. Tibbetts, D. Morgan, and G. Ceder, "Predicting crystal structure by merging data mining with quantum mechanics", Nature Mater., 5(8), 641-646 (2006).
5. R.S. Kamath, R.K. Kamat, “Modeling Mice Down Syndrome through Protein Expression: A
Decision Tree based Approach”, Research Journal of Pharmaceutical, Biological and Chemical
Sciences (RJPBCS), volume 7, Issue 4, 2016.
6. Kamath, R., & Kamat, R. (2016), “Educational Data Mining with R and Rattle”, River Publishers
Series in Information Science and Technology, River Publishers, Netherland, ISBN Print: 978-87-
93379-31-2 E-book: 978-87-93379-30-5.


Case Study 6
Fertility Data Analysis with MXNet in R: A Feedforward Neural Net Approach

6.1 Introduction
This case study illustrates supervised learning with the MXNet package for the analysis of fertility data by designing a feedforward neural network (FFNN) model. The dataset for the present study is retrieved from the UCI data repository. The optimum neural net architecture is derived by tuning network parameters: the number of hidden layers, the number of neurons in the hidden layer, the activation function, the evaluation metric, the number of rounds, the batch size and the learning rate. The result concludes that the FFNN designed with the MXNet package classifies the test data accurately and precisely.

6.2 Data Description and Exploration


The fertility dataset taken from the UCI data repository provides 100 semen samples analyzed according to the WHO 2010 criteria [1]. These 100 samples are diagnosed into two classes, "normal" and "altered". There are nine predictors in the dataset: season when the analysis was performed, age, childish disease, accident, surgical intervention, high fever in the last year, frequency of alcohol consumption, smoking habit, and sitting hours per day. There are 82 observations belonging to the class "normal" and 12 to the "altered" class. Figure 6.1 briefs the basic statistical analysis of the dataset chosen for the present study.

6.3 Feedforward Neural Net Theory


We have designed a feedforward neural net as a classifier using the MXNet package in R. This package offers an interface to build feedforward NNs, recurrent NNs and convolutional neural networks, and it gives greater control for configuring the neural network manually. The function mx.model.FeedForward.create() in MXNet is called for building the feedforward network to classify the data into fertile or non-fertile.
The classifier function accepts the following parameters:
● Training data and target label.
● Number of hidden nodes in each hidden layer.
● Number of nodes in the output layer.
● Type of the activation function.
● Type of the output loss.
● The device to train (GPU or CPU).

Fig. 6.1: Basic Statistical Analysis of Fertility Dataset

6.4 Fertility Data Analysis with MXNet


This section elaborates the step-by-step procedure of building a feedforward neural net with MXNet for the analysis of fertility data. The fertility data analytics is carried out in the R environment.
1. Installing and loading MXNet package:
install.packages("mxnet")   # the mxnet package may need to be installed from the MXNet repository rather than CRAN
library(mxnet)
2. Loading fertility dataset in R environment:
ds <- read.csv('...../Fertility.csv')
3. Checking for missing values:
table(is.na(ds))
4. Basic statistical analysis:
summary(ds)
The corresponding output is shown in figure 6.1.
5. The target variable contains the categorical values "N" and "O". The class label is converted into numeric form since MXNet accepts the target variable as numeric:
ds[,10] = as.numeric(ds[,10])
6. Determine sample size, partitioning dataset into training set and testing set:
ind <- sample(2, nrow(ds), replace=TRUE, prob=c(0.67, 0.33))
train.x = data.matrix(ds[ind==1, 1:9])
train.y = ds[ind==1, 10]
test.x = data.matrix(ds[ind==2, 1:9])
test.y = ds[ind==2, 10]

Both the training and testing datasets are converted into matrix format, as required by the MXNet package.
7. Manually creating the feedforward neural net model:
data <- mx.symbol.Variable("data")
fc1 <- mx.symbol.FullyConnected(data, num_hidden=6)
lrm <- mx.symbol.SoftmaxOutput(fc1)
The network consists of 6 neurons in the hidden layer and a softmax output function for binary classification; the network is optimized for classification accuracy. Additional hidden layers can also be added when configuring a more complex network (a hedged sketch of such a deeper configuration is given at the end of this section).
8. Control random process in mxnet:
mx.set.seed(1)
9. Training feedforward neural net:
nnmodel <- mx.model.FeedForward.create(symbol = lrm,
                                       X = train.x,
                                       y = train.y,
                                       ctx = mx.cpu(),
                                       num.round = 20,
                                       eval.metric = mx.metric.accuracy,
                                       array.batch.size = 10,
                                       learning.rate = 0.01)
The accuracy in each round is shown in figure 6.2.
Fig. 6.2: Model Accuracy During Training
10. Predictions and evaluation:

preds = predict(nnmodel, test.x)


pred.label = max.col(t(preds))
table(pred.label, test.y)
The testing dataset comprises 32 observations belonging to the "normal" class and 1 belonging to the "altered" class. The analysis concludes that the derived feedforward neural net accurately classifies the testing dataset.
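A hedged sketch of a deeper configuration, as mentioned in step 7, is given below. This is not the model used in this study; the layer sizes are illustrative.
# two hidden layers with relu activations and a two-unit softmax output (one unit per class)
data <- mx.symbol.Variable("data")
fc1  <- mx.symbol.FullyConnected(data, num_hidden = 6)
act1 <- mx.symbol.Activation(fc1, act_type = "relu")
fc2  <- mx.symbol.FullyConnected(act1, num_hidden = 4)
act2 <- mx.symbol.Activation(fc2, act_type = "relu")
out  <- mx.symbol.FullyConnected(act2, num_hidden = 2)
deeper <- mx.symbol.SoftmaxOutput(out)
# "deeper" can then be passed as the symbol argument to mx.model.FeedForward.create()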

6.5 Summary
This case study has explored the design of a feedforward neural net for the classification of fertility data using the MXNet package in R. The optimum neural net architecture is derived by tuning the network parameters. The performance accuracy reveals that the FFNN model is appropriate for classifying fertility data.

6.6 R Program for Fertility Data Analysis with MXNet using Feedforward Neural Net
#Loading required libraries
library(mlbench)
library(mxnet)
#Loading fertility data into R environment
ds <- read.csv('D:/AI HA/Dataset/Fertility.csv')
#Data preprocessing
table(is.na(ds))
summary(ds)
table(ds$Diagnosis)
pie(table(ds$Diagnosis))
ds[,10] = as.numeric(ds[,10])

# Determine sample size


ind <- sample(2, nrow(ds), replace=TRUE, prob=c(0.67, 0.33))
train.x = data.matrix(ds[ind==1, 1:9])
train.y = ds[ind==1, 10]
test.x = data.matrix(ds[ind==2, 1:9])
test.y = ds[ind==2, 10]
mx.set.seed(1)
#create feedforward neural net model
data <- mx.symbol.Variable("data")
fc1 <- mx.symbol.FullyConnected(data, num_hidden=6) #6 neuron in one layer
lrm <- mx.symbol.SoftmaxOutput(fc1)
nnmodel <- mx.model.FeedForward.create(symbol = lrm,
X = train.x,
y = train.y,
ctx = mx.cpu(),
num.round = 20,
eval.metric = mx.metric.accuracy,
array.batch.size = 10,
learning.rate = 0.01)
graph.viz(nnmodel$symbol)
preds = predict(nnmodel, test.x)
#Prediction and evaluation
pred.label = max.col(t(preds))
table(pred.label, test.y)
summary(nnmodel)

6.7 References
1. David Gil, Jose Luis Girela, Joaquin De Juan, M. Jose Gomez-Torres, and Magnus Johnsson, "Predicting seminal quality with artificial intelligence methods", Expert Systems with Applications, 39(16):12564-12573, 2012.
2. R.S. Kamath, R.K. Kamat, “Prediction of Seismic Tremor Magnitude for Andaman Nicobar
Islands using Artificial Neural Network”, Disaster Advances, Vol 11, Issue 3, Page 15-21, 2018
3. Alirezaee, M. (2012), “Ensemble of Neural Networks to Solve Class Imbalance Problem of
Protein Secondary Structure Prediction”. International Journal Of Artificial Intelligence &
Applications,3(6), 9-20. http://dx.doi.org/10.5121/ijaia.2012.3602.
4. Dongale, T., & Kamat, R. (2013), "Modelling of NTC thermistor using an artificial neural network
for non-linearity compensation”. Inf. Eng. Int. J, 1, 15-20.
5. R.S. Kamath, T.D. Dongale, Pankaj Pawar, R.K. Kamat, “Modeling Mice Down Syndrome
Through Protein Expression: An Artificial Neural Network Based Approach”, Journal of
Pharmacy Research, Vol 11, Issue 11 2017, Page 1300-1305.


Case Study 7
Analysis of Breast Tissue using KNN Classification: A Lazy Learning Approach

7.1 Introduction
Lazy learning is a machine learning technique that does not learn a discriminative function from the training data but memorizes the training dataset instead. There is no training time, but the prediction step is relatively expensive. This case study explores K-Nearest Neighbours (KNN), a lazy learning approach, for the classification of breast tissues based on electrical impedance measurements. The classifier with the value K = 7 predicts the test data with about 74% accuracy. The value K stands for the number of nearest neighbours considered while predicting the class of a test record.

7.2 Data Set Explanation and Exploration


The dataset for the present study is taken from the UCI machine learning repository [1]. The dataset is based on electrical impedance measurements in samples of freshly excised tissue from the breast. It contains 106 instances of electrical impedance measurements classified into six classes based on nine features. Figure 7.1 shows the basic statistical analysis of the electrical impedance measurements. Table 7.1 lists the set of breast tissue classes and the corresponding number of observations in the dataset.

Fig. 7.1: Exploratory Analysis of Electrical Impedance Measurements


Table 7.1: Electrical Impedance Measurements Dataset

Breast tissue Class Categorical Value No. of Observations


Carcinoma 1 21
Fibro-adenoma 2 15
Mastopathy 3 18
Glandular 4 16
Connective 5 14
Adipose 6 22

7.3 Lazy Learning Theory


This case study explores the KNN classifier for breast tissue classification using the "caret" package in R. KNN does not have a training phase, so each time a prediction is to be made, KNN searches for the nearest neighbours in the training set. The idea behind KNN classification is to find a predefined number K of training samples that are nearest to a new record and to predict a class for the new record from them. It uses the Euclidean distance to measure closeness. The class labels mentioned in table 7.1 are the values to be predicted by our KNN model.
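A minimal hedged sketch of this lazy learning idea using the "class" package is shown below; the full caret based workflow actually used in this study is given in section 7.6. The file path and the assumption that all columns other than "Class" are the nine numeric impedance features are illustrative.
library(class)
bt <- read.csv("BreastTissue.csv")                 # hypothetical path
X  <- scale(bt[, setdiff(names(bt), "Class")])     # standardize the impedance features
y  <- factor(bt$Class)
set.seed(1)
idx   <- sample(nrow(bt), round(0.7 * nrow(bt)))   # 70:30 hold-out split
preds <- knn(train = X[idx, ], test = X[-idx, ], cl = y[idx], k = 7)  # Euclidean distance
mean(preds == y[-idx])                             # hold-out accuracy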

7.4 KNN Classification for Breast Tissue Data


The analysis of breast tissue using KNN classification is carried out in the R environment; the R programme for the same is given at the end of this case study. For the present study, we used the breast tissue dataset from the UCI repository, which consists of 106 observations and 10 fields. Since the values vary in scale, standardization is required. The dataset is sliced into a training set and a test set in a 70:30 ratio.
The function trainControl() of the caret package is used for controlling the computational nuances of the train() function. Choosing a suitable value for the number of nearest neighbours is the prime issue of the KNN algorithm; we used cross-validation to select the best value. train() accepts the classification formula, the training data, the method "knn", the trainControl() object, the preprocessing parameters and the tuning parameter. The result of the trained KNN model is shown in figure 7.2.

Fig. 7.2: Trained KNN Model for Breast Tissue Classification



Figure 7.2 shows the Accuracy and Kappa metrics for different values of K. Accuracy is used to select the optimal model, i.e., the one with the largest value. The variation in accuracy for different K values can be shown by plotting the graph; the model automatically selects the best K value. Figure 7.3 shows that the final K value for the model is 7. Thus our KNN model for breast tissue classification based on electrical impedance measurements is trained with K = 7.

Fig. 7.3: Accuracy versus K-value Plot

The function predict() of the caret package is used for predicting the class labels of the test data. Table 7.2 gives a comparison of the number of observations actually in each class versus the number of observations predicted by the KNN model. The confusion matrix is used to print the statistics of the prediction result, shown in figure 7.4; it shows that the KNN model accuracy on the test data is 73%.
Table 7.2: Actual and Predicted values of Test Set
Number of observations
Class Actual Predicted
Carcinoma 9 8
Fibro-adenoma 7 7
Mastopathy 6 7
Glandular 7 9
Connective 5 3
Adipose 7 7

Fig. 7.4: Results of Prediction on Test Data



7.5 Summary
This case study has portrayed the KNN a lazy learning approach for the classification of breast
tissues based on electrical impedance measurements. The classifier with the value K = 7 predicts the
test data 74% accurately. Each time to make a prediction, KNN model is searching for the nearest
neighbour in the entire training set. So it is named as lazy learning approach.

7.6 R Program for the Analysis of Breast Tissue Based on Electrical Impedance Measurements using KNN Classification
#loading required library
library(caret)

#loading data into R environment


bt <- read.csv("D:/AI HA/Dataset/BreastTissue.csv")
summary(bt)
table(bt$Class)

## Convert the dependent var to factor. Normalize the numeric variables


bt$Class <- factor(bt$Class)
num.vars <- sapply(bt, is.numeric)
bt[num.vars] <- lapply(bt[num.vars], scale)
summary(bt)

# Determine sample size


ind <- sample(2, nrow(bt), replace=TRUE, prob=c(0.67, 0.33))
train = bt[ind==1, 1:10]
test = bt[ind==2, 1:10]
table(test$Class)

#trainControl() method to control the computational nuances of train() method


trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
set.seed(3333)

#training knn model


knn_model <- train(Class ~., data = train, method = "knn",
trControl=trctrl,
preProcess = c("center", "scale"),
tuneLength = 10)
#print and plot knn model

knn_model
plot(knn_model)

#prediction on test data


test_preds <- predict(knn_model, newdata = test)
table(test_preds)

#prediction result
confusionMatrix(test_preds, test$Class)

7.7 References
1. Dua, D. and Karra Taniskidou, E. (2017), “UCI Machine Learning Repository”
[http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and
Computer Science.
2. Jiawei Han, Micheline Kamber and Jian Pei, "Data Mining Concepts and Techniques", Third Edition, 2012, Elsevier Inc.
3. R.S. Kamath, R.K. Kamat, “Modelling Fetal Morphologic Patterns Through Cardiotocography
Data: Decision Tree Based Approach”, Journal of Pharmacy Research, Vol 12, Issue 1, 2017, Page
9-12.
4. Graham, W. “Data Mining with Rattle and R: The Art of Excavating Data for Knowledge
Discovery”, Springer, DOI 10.1007/978-1-4419-9890-3.
5. Kamath, R., & Kamat, R. (2016), “Educational Data Mining with R and Rattle”, River Publishers
Series in Information Science and Technology, River Publishers, Netherland, ISBN Print: 978-87-
93379-31-2 E-book: 978-87-93379-30-5.
6. R.S. Kamath, T.D. Dongale, Pankaj Pawar, R.K. Kamat, “Modeling Mice Down Syndrome
Through Protein Expression: An Artificial Neural Network Based Approach”, Journal of
Pharmacy Research, Vol 11, Issue 11, 2017, Page 1300-1305.


Case Study 8
Frequent Item Set Generation and Correlational Analysis for Supermarket Transactional Data using Associative Rule Mining with Equivalence Class Transformation

8.1 Problem Description and Objectives


With the increasing use of digital tools for point of sale in supermarkets, finding regularities in transactional data can help reveal which products are often purchased together and what the subsequent purchases are after buying a particular product. This provides decision support for the store manager in store shelf arrangement, and aids in the formation of marketing strategies and the building of recommendation systems.

Objectives
● To find frequent item sets in the transactional data of a grocery supermarket
● To find association rules for correlated items

8.2 Data Description


The Groceries dataset contains 1 month (30 days) of real-world point-of-sale transaction data from a typical local grocery outlet. The dataset contains 9835 transactions, and the items are aggregated to 169 categories [2]. The Groceries dataset is provided by Michael Hahsler, Kurt Hornik and Thomas Reutterer to R as built-in data for the "arules" package.

8.3 Theory on Frequent Item Set Mining and Association Mining


A frequent itemset is a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set, as proposed by Rakesh Agrawal and Ramakrishnan Srikant [1].
Association mining is based on the theory of market basket analysis, with the aim of finding associations and correlations between the different items that customers place in their shopping baskets.
Measures of Association Rules
The following measures are used to evaluate the strength of association. Suppose you are interested in the association between two item sets X and Y:
● Support = Number of records having both X and Y / Total number of records in the dataset
● Confidence = Number of records having both X and Y / Number of records with X
● Expected Confidence = Number of records with Y / Total number of records
● Lift = Confidence / Expected Confidence

Lift is the factor by which the observed co-occurrence of X and Y exceeds the co-occurrence expected when there is no relation between X and Y. In other words, the higher the lift (> 1), the higher the chance of co-occurrence of Y with X.
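A tiny worked example of these measures is given below; the counts are illustrative and are not computed from the Groceries data.
n_total <- 9835                       # transactions in the dataset
n_X     <- 2500                       # transactions containing X (illustrative)
n_Y     <- 1900                       # transactions containing Y (illustrative)
n_XY    <- 740                        # transactions containing both X and Y (illustrative)
support    <- n_XY / n_total          # about 0.075
confidence <- n_XY / n_X              # about 0.30: of the baskets with X, 30% also contain Y
expected   <- n_Y / n_total           # about 0.19: baseline probability of Y
lift       <- confidence / expected   # about 1.5 > 1, so X and Y co-occur more often than by chance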

8.4 Correlational Analysis with ECLAT and Apriori Method


Apriori is the most influential AR mining algorithm. It consists of two steps:
1. Initially, all frequently occurring itemsets whose support ≥ minsup are generated. Here, equivalence class transformation, eclat(), is applied to find frequent itemsets with a minimum support constraint of 7% (supp = 0.07) and maxlen = 15. The method calculates the support and frequency of the items; Figure 8.1 shows the top 10 frequent items.

Fig. 8.1: Count of Top 10 Frequent Groceries Items

2. The frequent itemsets are then used to generate association rules using apriori(); a low support together with a high confidence helps to extract strong relationships even when the overall number of co-occurrences in the data is small. A minimum support of 0.005 and a confidence of 0.8 are set to extract strong relationships. A total of 120 rules are extracted, and the top 10 rules in decreasing order of confidence are shown in Figure 8.2.

Fig. 8.2: Top 10 Rules using Apriori



8.5 Visualization of Frequent Item Set and Association Rules


The top 10 rules extracted are plotted in a grouped matrix, shown in Figure 8.3. The size of the circle plotted against a rule depicts the frequency (support) of the rule, and the interestingness of the rule is shown by a darker circle colour.
Fig. 8.3: Grouped Matrix of Top 10 Rules


Figure 8.4 describes the top 10 rules with their antecedent and consequent parts, showing the items that are frequently bought together. It is observed from Figure 8.4 that the following three rules describe the most prominent association relations in the study dataset: whole milk, butter, yogurt and tropical fruits are frequently purchased together.
Rule 1: Whole milk ← tropical fruits and yogurt
Rule 2: Whole milk ← whipped cream and butter
Rule 3: Whole milk ← butter and yogurt

Fig. 8.4: Graph Based Plot of Association Rules

8.6 Summary
The Groceries dataset is utilized in the current case study to find frequent items set in the
transactional data of grocery supermarket using the eclat() R method with minimum support of 70%.
Associations rules are obtained using apriori() with support =.005 and confidence = 0.8 as
interestingness measures to extract strong relationship among the correlated items.
In the current case study Grouped Matrix, Graph based plot visualization approach is
demonstrated for better interpretability of association rules.

8.7 R Program
library(arules)
library(arulesViz)
library(datasets)
Groceries

frequentItems <- eclat (Groceries, parameter = list(supp = 0.07, maxlen = 15))


# calculates support for frequent items
itemFrequencyPlot (Groceries,topN=10,type="absolute") # plot frequent items
rules <- apriori (Groceries, parameter = list(supp = 0.005, conf = 0.5)) # minimum support 0.005, confidence 0.5
rules
quality(rules) # show the support, lift and confidence for all rules
# Show the top 10 rules, but only 2 digits
options (digits=2)
inspect (rules[1:10])
rules <- sort (rules, by="confidence", decreasing=TRUE)
# 'high-confidence' rules.
inspect(rules[1:10])
rules <- apriori (Groceries, parameter = list (supp = 0.001, conf = 0.5, maxlen=3)) # maxlen = 3 limits the elements in a rule to 3
redundant <- which (colSums (is.subset (rules, rules)) > 1) # get redundant rules in vector
rules <- rules[-redundant] # remove redundant rules
plot (rules[1:10],method="grouped matrix",shading="confidence", engine="interactive") # feel free to
expand and move around the objects in this plot
plot (rules[1:10],method="graph",shading="confidence", engine="interactive") # feel free to expand
and move around the objects in this plot
plot (rules, measure=c("support", "lift"), shading="confidence")

8.8 References
1. Rakesh Agrawal and Ramakrishnan Srikant, "Fast algorithms for mining association rules", Proceedings of the 20th International Conference on Very Large Data Bases, VLDB, pages 487-499, Santiago, Chile, September 1994.
2. Michael Hahsler, Kurt Hornik, and Thomas Reutterer (2006), "Implications of probabilistic data modeling for mining association rules", Data and Information Analysis to Knowledge Engineering, Studies in Classification, Data Analysis, and Knowledge Organization, pages 598-605, Springer-Verlag.
3. S.S. Jamsandekar, R.R. Mudholkar, "Fuzzy Inference Rule Generation Using Genetic Algorithm Variant", IOSR-JCE, International Organization of Scientific Research Journal of Computer Engineering, Vol. 17, Issue 4, pp 09-16, July-Aug 2015.
4. S.S. Jamsandekar, K.S. Mahajan, "Portfolio Investment Model Using Neuro Fuzzy System", IJCSIT - International Journal of Computer Science and Information Technologies, Vol. 6 (2), March-April 2015.

Case Study 9
Forecasting Infant Mortality Rate in India: A Time Series Modelling Approach

9.1 Problem Description


Time series modelling is a method of prediction or forecasting involving time-based data to retrieve hidden insights for decision making. This case study portrays time series modelling for forecasting the infant mortality rate in India. We have retrieved the required dataset from NITI Aayog.

9.2 The Available Data


The dataset comprising the year-wise infant mortality rate is retrieved from NITI Aayog. The dataset is shown in Table 9.1; the values are per 1000 live births.
Table 9.1: Infant Mortality Rate in India

Year Rate Year Rate


2000 68 2009 50
2001 66 2010 47
2002 63 2011 44
2003 60 2012 42
2004 58 2013 40
2005 58 2014 39
2006 57 2015 37
2007 55 2016 34
2008 53

9.3 Time Series Modelling


We have designed an Auto Regressive Moving Average (ARMA) model for forecasting using the "tseries" package in R. ARMA models are commonly used in time series modelling. The framework of the proposed time series modelling for forecasting the infant mortality rate consists of the following modules:
● Visualizing the time series
● Plotting the Auto Correlation Function (ACF) and Partial Auto Correlation Function (PACF) charts
● Building the ARMA model
● Predicting future values

9.4 Forecasting Infant Mortality Rate


This section elaborates the step-by-step procedure of building a time series model with the tseries package for forecasting the infant mortality rate in India. The ARMA time series modelling is carried out in the R environment.
1. Installing and loading tseries package:
install.packages('tseries', dependencies=TRUE)
library(tseries)
2. Loading mortality dataset in R environment:
x <- read.csv("D:/AI HA/dataset/mortality.csv")
3. Converting dataframe into time series object
x1 <- ts(x, frequency=1)
4. It is necessary to analyze the trend before building any time series model. Visualizing the time series and fitting the plot with a line:
plot(x1)
abline(reg=lm(x1~time(x1)))

Fig. 9.1: Time Series Plot for Infant Mortality Rate
Fig. 9.2: Fitting Time Series Plot into Line
The plots show that there is a decreasing trend component: the infant mortality rate falls year by year.
5. Plotting ACF and PACF charts
acf(log(x1))
pacf(log(x1))

Fig. 9.3: ACF Plot
Fig. 9.4: PACF Plot



6. Building the ARMA model for forecasting the infant mortality rate in India:


fit <- arima(log(x1), c(0, 1, 1),seasonal = list(order = c(0, 1, 1), period = 12))
The output is shown in figure 9.5.

Fig. 9.5: ARMA Prediction Model Output in R

7. Predicting future values:
pred <- predict(fit, n.ahead = 10)
ts.plot(x1, 2.718^pred$pred, log = "y", lty = c(1, 3))
The derived time series model predicts the infant mortality rate in India for the year 2017 as 32 (per 1000 live births).

Fig. 9.6: Time Series Model Predicting Future 10 Values

9.5 Summary
This case study has explored design of time series model for the forecasting of infant mortality
rate in India. Performance accuracy reveals that ARMA model is appropriate for time series
forecasting.

9.6 R Program for Forecasting Infant Mortality Rate in India Using Time Series Modelling
# Installing and loading required package
install.packages('tseries', dependencies=TRUE)
library(tseries)

# Loading data into R environment


x <- read.csv("D:/AI HA/dataset/mortality.csv")
x1 <- ts(x, frequency=1)
plot(x1)

# Fitting data into a line


abline(reg=lm(x1~time(x1)))
#ACF and PACF plots

adf.test(diff(log(x1)), alternative="stationary", k=0)


acf(log(x1))
pacf(log(x1))

#Building ARMA model


(fit <- arima(log(x1), c(0, 1, 1),seasonal = list(order = c(0, 1, 1), period = 12)))
#Predicting values
pred <- predict(fit, n.ahead = 10)
ts.plot(x1,2.718^pred$pred, log = "y", lty = c(1,3))
pred
round(exp(3.555543),2)

9.7 References
1. niti.gov.in, retrieved on 10th July 2019
2. Kamath R.S. and Kamat R.K., “Time-series Analysis and Forecasting of Rainfall at Idukki
district”, Kerala: Machine Learning Approach, Disaster Advances, Vol. 11 (11) November (2018).
3. Dalinina R., “Introduction to Forecasting with ARIMA in R”, https://www.datascience.com/blog/
introduction-to- forecastingwith-arima-in-r-learn-data-science-tutorials, August 27 (2018)
4. Gorakala S.K., “Time Series Analysis using R – forecast package”, https://www.r-bloggers.com/
time-series-analysis-using-r-forecastpackage/ retrieved on 25th August (2018)
5. Srivastava T., “A Complete Tutorial on Time Series Modelling in R”, https://www.analyticsvidhya.
com/ blog/ 2015/12/ completetutorial-time-series-modeling/ retrieved 25th August (2018).


Case Study 10
Hierarchical Cluster Analysis for Immunotherapy Data: An Unsupervised Approach

10.1 Introduction
This case study portrays hierarchical cluster analysis for Immunotherapy data. This dataset contains information about the wart treatment results of 90 patients treated using immunotherapy. In the present study, we have used agglomerative clustering with "Ward" as the agglomeration method. The agglomerative coefficient for this method is found to be 0.8657, which shows the strongest clustering structure of the four methods assessed. The experiment has derived four clusters of sizes 6, 27, 31 and 6 on the Immunotherapy data.

10.2 Materials and Methods


The dataset for the hierarchical cluster analysis is retrieved from the UCI machine learning repository [1]. The dataset contains information about the wart treatment results of 90 patients treated using immunotherapy. The result of the treatment is 1 for 72 of the 90 patients, and these 72 observations are considered for the hierarchical clustering analysis. The clusters are designed based on seven attributes: sex, age, time, number of warts, type, area and induration diameter. Figure 10.1 shows the basic statistical computation of the dataset for the present study.

Fig. 10.1: Basic Statistical Computations on Immunotherapy Data


Hierarchical clustering, an unsupervised learning approach, does not require pre-specifying the number of clusters to be generated. It results in an attractive tree-based representation of the observations, called a dendrogram. There are two types of hierarchical clustering algorithms:
1. Agglomerative clustering, which follows a bottom-up approach
2. Divisive hierarchical clustering, which follows a top-down approach
In the present study, we have used agglomerative clustering for the analysis of the Immunotherapy dataset.

10.3 Computational Details, Results and Discussion


The present analysis is carried out in the R environment, and the corresponding R program is given at the end of this case study. We used the "agnes" function of the "cluster" package. The experiment is carried out by applying various agglomeration methods. Table 10.1 gives the agglomerative coefficient values of these methods when analysing hierarchical clustering on the Immunotherapy data.
Table 10.1: Agglomeration Methods and their Coefficients

Agglomeration (linkage) methods Agglomerative coefficient value


Complete linkage clustering 0.8201679
Mean linkage clustering 0.7999330
Single linkage clustering 0.7585475
Ward linkage clustering 0.8657153
The quality of the clustering can be measured by the agglomerative coefficient, which measures the amount of clustering structure found; a coefficient value closer to 1 suggests a strong clustering structure. The experiment reveals that Ward's method identifies the strongest clustering structure of the four methods assessed.
Figure 10.2 shows the dendrogram plot for hierarchical clustering on the Immunotherapy data. Each leaf corresponds to one observation. Observations that are similar to each other are fused to form a branch, and the same process continues at greater heights. The values on the vertical axis represent the dissimilarity between two observations; for example, the value 2 indicates the dissimilarity between observations 29 and 32.

Fig. 10.2: Dendrogram Plot for Hierarchical Clustering on Immunotherapy Data



It is possible to cut the dendrogram to control the number of clusters obtained; this works in the same way as specifying k in k-means clustering. The function "cutree" is used for identifying the clusters. Figure 10.3 shows the four clusters thus derived from the dendrogram of the Immunotherapy data, with a border drawn around each of the four clusters. Table 10.2 specifies the number of observations in each cluster.

Fig. 10.3: Cluster Dendrogram on Immunotherapy Data – Four Clusters

Table 10.2: Details of Clusters

Cluster ID No. of observations


1 6
2 27
3 31
4 6
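As a hedged addition (not part of the original program), the cluster membership can also be used to profile the clusters, assuming the objects lt (scaled data) and subgrp (cluster labels) from the program in section 10.5:
table(subgrp)   # cluster sizes, as reported in Table 10.2
aggregate(as.data.frame(lt), by = list(cluster = subgrp), FUN = mean)   # mean scaled attribute values per cluster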
These clusters can be visualized by drawing a scatter plot, as shown in figure 10.4. The "fviz_cluster" function from the "factoextra" package is applied here to do so.

Fig. 10.4: Scatter Plot for Cluster Visualization

10.4 Summary
In the present study, we have carried out hierarchical cluster analysis for Immunotherapy data
using Agglomerative clustering with “ Ward ” as agglomerative method. This dataset contains
information about wart treatment results of 90 patients using Immunotherapy. The agglomerative
coefficient for this method is found to be 0.8657 which shows the strongest clustering structure of the
four methods assessed. The experiment has derived four clusters of size 6, 27, 31 and 6 on
Immunotherapy data.

10.5 R Program for the Design of Hierarchical Clusters on Immunotherapy Data
#Loading required libraries
library(cluster)
library(factoextra)
# Loading Immunotherapy data into R environment
lt<- read.csv("D:/AI HA/Dataset/wartTreatment.csv")
# Data preprocessing
lt <- na.omit(lt)
lt <- scale(lt)
# Linkage methods to assess
method<- c( "average", "single", "complete", "ward")
names(method) <- c( "average", "single", "complete", "ward")

# Function to compute agglomerative coefficient for linkage methods


library(purrr)   # provides map_dbl()
agcf <- function(x) {
  agnes(lt, method = x)$ac   # "ac" is the agglomerative coefficient of an agnes object
}
map_dbl(method, agcf)

#Hierarchical clustering using ward method


hc <- agnes(lt, method = "ward")
pltree(hc, cex = 0.6, hang = -1, main = "Dendrogram of agnes")

#Cut the tree into four clusters


d <- dist(lt, method = "euclidean")   # dissimilarity matrix for hclust
hc1 <- hclust(d, method = "ward.D2")
subgrp <- cutree(hc1, k = 4)
plot(hc1, cex = 0.6)
rect.hclust(hc1, k = 4, border = 2:5)

# Number of members in each group


table(subgrp)
#Scatter plot for cluster visualization
fviz_cluster(list(data = lt, cluster = subgrp))

10.6 References
1. F. Khozeimeh, R. Alizadehsani, M. Roshanzamir, A. Khosravi, P. Layegh, and S. Nahavandi, “An
expert system for selecting wart treatment method”, Computers in Biology and Medicine, vol. 81,
pp. 167-175, 2/1/ 2017.
2. F. Khozeimeh, F. Jabbari Azad, Y. Mahboubi Oskouei, M. Jafari, S. Tehranian, R. Alizadehsani,
et al., “Intralesional immunotherapy compared to cryotherapy in the treatment of warts”,
International Journal of Dermatology, 2017, DOI: 10.1111/ijd.13535
3. Graham, W. “Data Mining with Rattle and R: The Art of Excavating Data for Knowledge
Discovery”, Springer, DOI 10.1007/978-1-4419-9890-3.
4. Kamath, R., & Kamat, R. (2016). “Educational Data Mining with R and Rattle”, River Publishers
Series in Information Science and Technology, River Publishers, Netherland, ISBN Print: 978-87-
93379-31-2 E-book: 978-87-93379-30-5
5. R.S. Kamath, R.K. Kamat, “Visualization of University Clusters based on NIRF and NAAC
Scores: K-means Algorithm Approach”, University News, A Weekly Journal of Higher Education,
Vol. 57, No. 03, Jan 21-27, 2019


Case Study 11
Predictive Model for Diabetic Retinopathy Debrecen Data: A Deep Learning Approach

11.1 Introduction
This case study portrays a predictive model for the diabetic retinopathy dataset using a deep learning (DL) approach. This dataset contains features extracted from the Messidor image set to predict whether an image contains signs of diabetic retinopathy (DR) or not [1]. Deep neural network architecture, inspired by the structure and function of the brain, accounts for the use of many hidden neurons and layers as an architectural advantage combined with new training paradigms [2]. This case study provides an optimum DL architecture achieved by tuning the number of layers, the number of hidden neurons of the neural network model and other optimization parameters. The result concludes that a model of four hidden layers with 15, 12, 9 and 6 hidden nodes outperforms other combinations significantly. Moreover, the performance of the model is evaluated with reference to accuracy and loss.

11.2 Material and Methods


The dataset for this predictive analytics study is retrieved from the UCI data repository [1]. This dataset contains features extracted from the Messidor image set to predict whether an image contains signs of diabetic retinopathy or not. It contains 1151 records classified into two classes based on 19 features. The class label 1 represents signs of DR and 0 represents no signs of DR. Table 11.1 lists the diabetic retinopathy classes and the corresponding number of observations in the dataset.
Table 11.1: Diabetic Retinopathy Debrecen Dataset

Class Class Value No. of Observations


Signs of DR 1 611
No signs of DR 0 540
This case study elaborates deep learning modelling in R using Keras for the presence and absence of diabetic retinopathy signs in Messidor images. The Keras package provides a high-level neural networks API and is capable of running on top of TensorFlow [3]. Deep learning, with its foundation in ANN, accounts for the use of many hidden neurons and layers as an architectural advantage combined with new training paradigms [2]. A schematic diagram of the proposed deep neural network architecture is presented in figure 11.1. Here we built a Multi-Layer Perceptron for binary classification. The proposed deep learning architecture consists of four hidden layers with 15, 12, 9 and 6 hidden nodes respectively. The relu (rectifier) activation function is used in the hidden layers. The softmax activation function is used in the output layer, which creates 2 output values, one for each class.

Fig. 11.1: Deep Neural Network Architecture

11.3 Computational Details: Deep Neural Network Architecture


The deep neural network modelling for the diabetic retinopathy Debrecen data is carried out in the R environment, where we used R 3.4.3, Rtools34 and Python 3.6.3 [5]. The R program for the same is given at the end of this case study. The model is conceived as a Multi-Input-Multi-Output configuration. The dataset used in the present study includes 1151 observations classified into two classes based on 19 features.
The construction of the deep learning model comprises the following steps:
1. The workspace is set up by installing and loading the required libraries into RStudio.
2. The data frame is converted into a matrix using as.matrix() to make use of the keras package, since the multidimensional objects used in neural network construction expect a single data type.
3. While building the neural network model, the target attribute is transformed from a vector into a matrix with a boolean indicator for each class value, using to_categorical().
4. A Multilayer Perceptron model is constructed for the diabetic retinopathy dataset. The hidden layers use "relu", a rectifier activation function.
5. There are two output values in the output layer, one for each class. The output layer uses the "softmax" activation function.
The experiment is carried out by tuning the number of hidden layers, the number of hidden neurons of the neural network model and other optimization parameters, keeping the number of epochs constant at 200. In order to compare our result to the state of the art, we estimate accuracy and loss as performance metrics. Table 11.2 lists the best performing deep learning model thus derived. The summary of the proposed deep learning model for the diabetic retinopathy dataset using keras in R is shown in Figure 11.2.
Table 11.2: Deep Neural Network properties of present study

Network Properties DL model for diabetic retinopathy data


Network Type Deep Neural Network
Model type Multilayer perceptron for binary classification
No. of hidden layers Four


No. of hidden neurons 15, 12, 9 and 6
Input shape 20
Activation function at hidden layer relu, rectifier activation function
Activation function at output layer Softmax
Performance metrics Accuracy and Loss

Fig. 11.2: Summary of DL Model for the Present Study

11.4 Results and Discussion


Fig. 11.3: Performance of the DL Model for Four Hidden Layers with 15, 12, 9 and 6 Hidden Neurons
Fig. 11.4: DL Model Accuracy for Training and Testing Data

The keras package of R is used to analyze the model structure, the number of hidden layers, the number of hidden neurons and the activation functions [4]. The results across the experiments are illustrated in this section. Figure 11.3 explains the performance of the DL model with four hidden layers of 15, 12, 9 and 6 hidden neurons. Figure 11.4 and figure 11.5 present the accuracy and loss of the DL model for the training and validation data respectively.

Fig. 11.5: DL Model Loss for Training and Testing Data

The derived deep learning model is able to classify the diabetic retinopathy data with very little error. Figure 11.6 shows the confusion matrix obtained by applying the test dataset to the derived deep learning model. The loss and accuracy for the test data are 0.5416758 and 0.7320099 respectively. The result concludes that the DL model for diabetic retinopathy data is a suitable approach since the resulting analysis is accurate and precise.

Fig. 11.6: Confusion Matrix for the Test Data
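As a hedged aside, the reported test accuracy can also be read off the confusion matrix produced by the program in section 11.6, assuming the objects classes and testingTarget from that program:
cm <- table(testingTarget, classes)   # confusion matrix on the test data
sum(diag(cm)) / sum(cm)               # overall accuracy, close to the evaluate() result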

11.5 Summary
Present case study portrayed the construction of deep learning model for diabetic retinopathy data.
The corresponding dataset is taken from UCI data repository. Number of experiments conducted and
performance of the models is evaluated. Result concluded that model of four layers with number of
hidden nodes 15, 12, 9 and 6 outperform other combinations significantly. The resulted DL model uses
relu and softmax activation functions for hidden layer and output layer respectively.

11.6 R Program for the Construction of DL Model


install.packages("devtools")
install.packages("tensorflow")
devtools::install_github("rstudio/keras")
library(keras)
library(tensorflow)
dr <- read.csv("D:/AI HA/dataset/diabetic.csv")
dr <- na.omit(dr)

dr[,20] <- as.numeric(dr[,20])


dr <- as.matrix(dr)
dimnames(dr) <- NULL

# Determine sample size


ind <- sample(2, nrow(dr), replace=TRUE, prob=c(0.67, 0.33))

# Split the `dr` data


training <- dr[ind==1, 1:19]
testing <- dr[ind==2, 1:19]

# Split the class attribute


trainingTarget <- dr[ind==1, 20]
testingTarget <- dr[ind==2, 20]

trainLabels <- to_categorical(trainingTarget)


testLabels <- to_categorical(testingTarget)

# Initialize a sequential model


model <- keras_model_sequential()
# Add layers to the model
model %>%
layer_dense(units = 15, activation = 'relu', input_shape = 19) %>%
layer_dense(units = 12, activation = 'relu') %>%
layer_dense(units = 9, activation = 'relu') %>%
layer_dense(units = 6, activation = 'relu') %>%
layer_dense(units = 2, activation = 'softmax')
summary(model)
model %>% compile(
loss = 'categorical_crossentropy', #sparse_categorical_crossentropy'
optimizer = 'adam',
metrics = 'accuracy'
)

# Fit the model


history <- model %>% fit(
training,
trainLabels,
epochs = 200,
batch_size = 100,
validation_split = 0.2
)

plot(history)

# Plot the model accuracy of the training data


plot(history$metrics$acc, main="Model Accuracy", xlab = "epoch", ylab="accuracy", col="blue",
type="l")

# Plot the model accuracy of the validation data


lines(history$metrics$val_acc, col="green")

#add legend
legend("bottomright", c("train","test"), col=c("blue","green"), lty=c(1,1))

# Plot the loss of the training data


plot(history$metrics$loss, main="Model Loss", xlab = "epoch", ylab="loss", col="blue", type="l")

# Plot the model accuracy of the validation data


lines(history$metrics$val_loss, col="green")

#add legend
legend("bottomright", c("train","test"), col=c("blue","green"), lty=c(1,1))

# Predict the classes for the test data


classes <- model %>% predict_classes(testing, batch_size = 50)

# Confusion matrix
table(testingTarget, classes)
# Evaluate on test data and labels
score <- model %>% evaluate(testing, testLabels, batch_size = 50)

# Print the score


print(score)

11.7 References
1. Balint Antal, Andras Hajdu, "An ensemble-based system for automatic screening of diabetic retinopathy", Knowledge-Based Systems 60 (April 2014), 20-27.
2. "Deep Learning Tutorial, Release 0.1", LISA lab, University of Montreal, http://deeplearning.net/tutorial/deeplearning.pdf, retrieved on 2nd July, 2018.
3. "Keras Documentation", https://keras.io, retrieved on 2nd July, 2018.
4. "Deep learning with Keras detailed tutorial", https://www.datacamp.com/community/tutorials/keras-r-deep-learning, retrieved on 2nd July, 2018.
5. Kamath, R., and Kamat, R. (2016). “Educational Data Mining with R and Rattle”, River
Publishers Series in Information Science and Technology, River Publishers, Netherland, ISBN
Print: 978-87-93379-31-2 E-book: 978-87-93379-30-5.
6. R.S. Kamath, R.K. Kamat, “Modeling Human Activity Recognition using Kears and Tensorflow:
Deep Learning Approach”, International Journal of Modern Electronics and Communication
Engineering (IJMECE), Volume No.-6, Issue No.-5, September 2018, Page 9 – 14.


ISBN: 978-93-5367-814-2
