
Final DS Assignment

Vaibhav Goel BT17GCS118

11/25/2020

Contents
A. REGRESSION 1

B. REGRESSION 2

C. TIME SERIES

D. CLASSIFICATION

A. REGRESSION 1

library(MASS)

## Warning: package ’MASS’ was built under R version 3.6.3

str(Boston)

## ’data.frame’: 506 obs. of 14 variables:


## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

library(lmtest)

## Warning: package ’lmtest’ was built under R version 3.6.3

## Loading required package: zoo

## Warning: package ’zoo’ was built under R version 3.6.3

##
## Attaching package: ’zoo’

## The following objects are masked from ’package:base’:


##
## as.Date, as.Date.numeric

library(auditor)

## Warning: package ’auditor’ was built under R version 3.6.3

library(outliers)
library(hnp)

## Warning: package ’hnp’ was built under R version 3.6.3

#data-split
target <- Boston$medv
x <- as.matrix(Boston[-ncol(Boston)])

bias <- rep(1,length(target))


xfinal <- cbind(bias,x)

n <- nrow(Boston)
p <- ncol(xfinal)

beta <- (solve(t(xfinal) %*% xfinal) %*% t(xfinal)) %*% target


yhat <- xfinal %*% beta

sigmasquare <- sum((target - yhat)^2)/(n-p-1)


ybar <- mean(target)

rsquare <- 1-(sum((target -yhat)^2)/sum((target - ybar)^2))

adjustedrsquare <- rsquare - (p/(n-p-1))*(1 - rsquare)

varianceBeta <- solve(t(xfinal) %*% xfinal)*sigmasquare

seBeta <- (diag(varianceBeta)^0.5)

#t-test for Hypothesis Testing

t <- beta/seBeta
pValue <- 2*(1-pt(abs(t),df=n-p-1))
pValue <- round(pValue,4)
#each beta(i) with p-value < the significance level is considered significant, since we reject the null hypothesis that beta(i) = 0
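
As an illustrative check (a minimal sketch using only the objects defined above), the coefficients that come out significant at the 5% level can be listed by name:

#coefficients whose two-sided p-value falls below 0.05 are treated as significant
rownames(beta)[pValue < 0.05]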

#f-test for Hypothesis Testing f=MSM/MSE , MSM=SSm/DFm and MSE=SSE/DFe


msm <- sum((yhat - ybar)^2)/(p-1)
mse <- sum((target - yhat)^2)/(n-p)
f <- msm/mse
fValue <- qf(c(0.05,0.95),df1=p-1,df2=n-p)
if(f>fValue[1] && f<fValue[2])
{
print("Null Hypo")
} else
{
print("Alternate Hypo");
}

## [1] "Alternate Hypo"

#confidence-interval at 95% confidence


confidenceIntervals <- cbind(CIlower = beta - 1.96 * seBeta, CIupper = beta + 1.96 * seBeta)

residual <- target-yhat


plot(yhat,residual)

[Figure: scatter plot of residual vs. yhat]

qqplot(yhat,residual,df=n-p-1)

## Warning in plot.window(...): "df" is not a graphical parameter

## Warning in plot.xy(xy, type, ...): "df" is not a graphical parameter

## Warning in axis(side = side, at = at, labels = labels, ...): "df" is not a


## graphical parameter

## Warning in axis(side = side, at = at, labels = labels, ...): "df" is not a


## graphical parameter

## Warning in box(...): "df" is not a graphical parameter

## Warning in title(...): "df" is not a graphical parameter

[Figure: qqplot of residual vs. yhat]

mod1 <- lm(target~xfinal)

#if the p-value is greater than 0.05, the residuals can be treated as normally distributed
shapiro.test(residual)

##
## Shapiro-Wilk normality test
##
## data: residual
## W = 0.90138, p-value < 2.2e-16

#Durbin-Watson test
dwtest(target ~ xfinal)

##
## Durbin-Watson test
##
## data: target ~ xfinal
## DW = 1.0784, p-value = 0.7609
## alternative hypothesis: true autocorrelation is greater than 0

sumnumRes <- 0
sumRes <- sum(residual^2)

#manual Durbin-Watson statistic to test autocorrelation; a value between 0 and 2 suggests positive autocorrelation
for(i in 2:length(residual)){
sumnumRes <- sumnumRes + (residual[i]-residual[i-1])^2
}
dw <- sumnumRes/sumRes

#leverage verification
hotmatrixDiagonal <- diag(xfinal %*% solve(t(xfinal) %*% xfinal) %*% t(xfinal))
hotmatrixSumDiagonal <- sum(diag(xfinal %*% solve(t(xfinal) %*% xfinal) %*% t(xfinal)))

#alternative: half-normal plot for leverages


hnp(hotmatrixDiagonal)

## Half-normal plot with simulated envelope generated assuming the residuals are
## normally distributed under the null hypothesis.
[Figure: half-normal plot of the leverage values (Residuals vs. Theoretical quantiles)]

#q-q plot for standardized residuals


standardizedResidual <- residual/((sigmasquare^0.5)*((1-hotmatrixDiagonal)^0.5))
qqplot(yhat,standardizedResidual,df=n-p-1)

[Figure: scatter plot of standardizedResidual vs. yhat]

#p-value of a 2 tailed test where P(t>=test-statistic value)


outlierVal <- abs(standardizedResidual)
#p-val will be min
pval= 2*(1- pt(max(outlierVal),n-p-1))
#Bonferroni correction p<alpha/n indicates presence of an outlier
min(pval)< 0.05/nrow(xfinal)

## [1] TRUE

sum(pval < 0.05/nrow(xfinal))

## [1] 1

pvalCorrected <- 2*(1- pt((outlierVal),n-p-1))


#the p-value is the same as the corrected p-value

#half-normal plot for cooks distance (doubt)


cookDistance <- ((residual^2)*hotmatrixDiagonal)/(p*(1-hotmatrixDiagonal))
hnp(cookDistance)

## Half-normal plot with simulated envelope generated assuming the residuals are
## normally distributed under the null hypothesis.

[Figure: half-normal plot of cookDistance (Residuals vs. Theoretical quantiles)]

numberRow <- nrow(xfinal)


#partial regression plot / added variable
for(i in ncol(xfinal)){
plot(xfinal[1:numberRow,i],residual)
break
}

[Figure: scatter plot of residual vs. xfinal[1:numberRow, i]]

#partial-residual plot
plot(yhat,residual,main=’Partial-Residual’)

[Figure: Partial-Residual plot of residual vs. yhat]

#creating a dataframe to plot the regression line


data2 <- data.frame("length" = c(1,2,3,4,5,6,7,8,9,10), "square" = c(1,4,9,16,25,36,49,64,81,100))
x2 <- data2[["length"]]
y2 <- data2[["square"]]
Sxx <- sum((mean(x2) - x2)^2)
Sxy <- sum((mean(x2)-x2)*(mean(y2)-y2))
beta1 <- Sxy/Sxx
beta0 <- mean(y2)-beta1*(mean(x2))
samples <- runif(100, min = -10, max = 10)
yR <- beta0 + beta1*samples
plot(x2,y2)
lines(samples,yR)

[Figure: scatter plot of y2 vs. x2 with the fitted regression line]

B. REGRESSION 2
Question 1:

library(lmtest)
data <- read.table(’RegD1.txt’)
ind <- seq(10,100,by=10)
train <- data[-ind,]
test <- data[ind,]

#X1 and X2 are the features; we plot them against each other to check whether they are correlated or can be treated as independent predictors
plot(train$X1,train$X2, xlab = "X1", ylab = "X2")

[Figure: scatter plot of X2 vs. X1 (training data)]

#Model building from training Data


model <- lm(Y1~X1+X2,data = train)
summary(model)

##
## Call:
## lm(formula = Y1 ~ X1 + X2, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1226 -1.3189 -0.0519 1.1825 3.3815
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.77301 0.33680 8.233 1.66e-12 ***
## X1 0.19480 0.03465 5.623 2.24e-07 ***
## X2 -0.15753 0.61320 -0.257 0.798
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 1.711 on 87 degrees of freedom
## Multiple R-squared: 0.2666, Adjusted R-squared: 0.2498
## F-statistic: 15.81 on 2 and 87 DF, p-value: 1.387e-06

Diagnostics: 1. Checking constant variance by plotting the residuals against the fitted values: the residuals appear randomly scattered, which satisfies the assumption. 2. Checking normality through (a) a Q-Q plot of the residuals and (b) the Shapiro test: since the p-value > 0.05, the residuals are normally distributed. 3. Checking for uncorrelated errors using the Durbin-Watson test: the errors are uncorrelated.

#Diagnostics 1
plot(model,pch=15,which=1)

[Figure: Residuals vs Fitted, lm(Y1 ~ X1 + X2)]

#Diagnostics 2
plot(model, pch=15, which=2)

[Figure: Normal Q-Q, lm(Y1 ~ X1 + X2)]

shapiro.test(model$residuals)

##
## Shapiro-Wilk normality test
##
## data: model$residuals
## W = 0.97386, p-value = 0.06631

#Diagnostics 3
dwtest(model)

##
## Durbin-Watson test
##
## data: model
## DW = 2.1766, p-value = 0.7891
## alternative hypothesis: true autocorrelation is greater than 0

b) RegD2 Dataset

data <- read.table(’RegD2.txt’)


ind <- seq(10,100,by=10)
train <- data[-ind,]
test <- data[ind,]

#X1 and X2 are the features; we plot each pair of features against the others to check whether they are correlated or independent
plot(train$X1,train$X2,xlab = "X1", ylab = "X2")
[Figure: scatter plot of X2 vs. X1 (training data)]

plot(train$X1,train$X3,xlab = "X1", ylab = "X3")

[Figure: scatter plot of X3 vs. X1 (training data)]

plot(train$X2,train$X3,xlab = "X2", ylab = "X3")

[Figure: scatter plot of X3 vs. X2 (training data)]

#Model building from training Data


model <- lm(Y1~X1+X2+X3,data = train)
summary(model)

##
## Call:
## lm(formula = Y1 ~ X1 + X2 + X3, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.21901 -0.06800 -0.01703 0.07815 0.28281
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.037081 0.027754 -1.336 0.185
## X1 0.201507 0.002523 79.857 <2e-16 ***
## X2 -0.454286 0.038373 -11.839 <2e-16 ***
## X3 0.705114 0.003951 178.466 <2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 0.1068 on 86 degrees of freedom
## Multiple R-squared: 0.9977, Adjusted R-squared: 0.9976
## F-statistic: 1.219e+04 on 3 and 86 DF, p-value: < 2.2e-16

Diagnostics: 1. Checking constant variance by plotting the residuals against the fitted values: the residuals appear randomly scattered, which satisfies the assumption. 2. Checking normality through (a) a Q-Q plot of the residuals and (b) the Shapiro test: since the p-value > 0.05, the residuals are normally distributed. 3. Checking for uncorrelated errors using the Durbin-Watson test: the errors are uncorrelated.

#Diagnostics 1
plot(model,pch=15,which=1)

[Figure: Residuals vs Fitted, lm(Y1 ~ X1 + X2 + X3)]

#Diagnostics 2
plot(model, pch=15, which=2)

[Figure: Normal Q-Q, lm(Y1 ~ X1 + X2 + X3)]

shapiro.test(model$residuals)

##
## Shapiro-Wilk normality test
##
## data: model$residuals
## W = 0.98497, p-value = 0.3886

#Diagnostics 3
dwtest(model)

##
## Durbin-Watson test
##
## data: model
## DW = 2.2712, p-value = 0.9044
## alternative hypothesis: true autocorrelation is greater than 0

c) RegD3 Dataset

data <- read.table(’RegD3.txt’)


ind <- seq(10,100,by=10)
train <- data[-ind,]
test <- data[ind,]

#X1 and X2 are the features; we plot each pair of features against the others to check whether they are correlated or independent
plot(train$X1,train$X2,xlab = "X1", ylab = "X2")
[Figure: scatter plot of X2 vs. X1 (training data)]

plot(train$X1,train$X4,xlab = "X1", ylab = "X4")

[Figure: scatter plot of X4 vs. X1 (training data)]

plot(train$X2,train$X4,xlab = "X2", ylab = "X4")

[Figure: scatter plot of X4 vs. X2 (training data)]

#Model building from training Data


model <- lm(Y1~X1+X2+X4,data = train)
summary(model)

##
## Call:
## lm(formula = Y1 ~ X1 + X2 + X4, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.210075 -0.064858 -0.001555 0.069632 0.256274
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.066625 0.027747 2.401 0.0185 *
## X1 0.199642 0.002292 87.102 <2e-16 ***
## X2 -1.051784 0.037636 -27.946 <2e-16 ***
## X4 1.160520 0.005730 202.526 <2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 0.09636 on 86 degrees of freedom
## Multiple R-squared: 0.9982, Adjusted R-squared: 0.9982
## F-statistic: 1.618e+04 on 3 and 86 DF, p-value: < 2.2e-16

Diagnostics: 1. Checking constant variance by plotting the residuals against the fitted values: the residuals show a non-linear pattern. 2. Checking normality through (a) a Q-Q plot of the residuals and (b) the Shapiro test: since the p-value > 0.05, the residuals are normally distributed. 3. Checking for uncorrelated errors using the Durbin-Watson test: the errors are correlated.

#Diagnostics 1
plot(model,pch=15,which=1)

[Figure: Residuals vs Fitted, lm(Y1 ~ X1 + X2 + X4)]

#Diagnostics 2
plot(model, pch=15, which=2)

[Figure: Normal Q-Q, lm(Y1 ~ X1 + X2 + X4)]

shapiro.test(model$residuals)

##
## Shapiro-Wilk normality test
##
## data: model$residuals
## W = 0.9904, p-value = 0.7596

#Diagnostics 3
dwtest(model)

##
## Durbin-Watson test
##
## data: model
## DW = 1.5493, p-value = 0.01472
## alternative hypothesis: true autocorrelation is greater than 0

Question 2: Diagnostics: 1. Checking constant variance by plotting the residuals against the fitted values: the residuals appear randomly scattered, which satisfies the assumption (after choosing only X5 and X6 as the important features from summary(model)). 2. Checking normality through (a) a Q-Q plot of the residuals and (b) the Shapiro test: since the p-value < 0.05, the residuals are not normally distributed.

data <- read.table(’RegD14.txt’,header = TRUE)

#model <- lm(Y~.,data=data)


#summary(model)
#Diagnostics 1
modelA <- lm(Y~X5+X6,data=data)
summary(modelA)

##
## Call:
## lm(formula = Y ~ X5 + X6, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.862e-13 -1.882e-13 -7.120e-14 -2.840e-14 4.200e-12
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.145e-13 1.173e-12 -4.390e-01 0.663
## X5 1.000e+00 1.057e-14 9.462e+13 <2e-16 ***
## X6 1.000e+00 9.247e-15 1.081e+14 <2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 6.3e-13 on 47 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 3.456e+29 on 2 and 47 DF, p-value: < 2.2e-16

plot(modelA,pch=15,which=1)

[Figure: Residuals vs Fitted, lm(Y ~ X5 + X6)]

shapiro.test(modelA$residuals)

##
## Shapiro-Wilk normality test
##
## data: modelA$residuals
## W = 0.26744, p-value = 2.2e-14

plot(modelA, pch=15, which=2)

[Figure: Normal Q-Q, lm(Y ~ X5 + X6)]

library(car)

## Loading required package: carData

plot(modelA,which=5)

[Figure: Residuals vs Leverage, lm(Y ~ X5 + X6)]

#Cooks Distance vs Observations


plot(modelA,which=4)

[Figure: Cook's distance vs. observation number, lm(Y ~ X5 + X6)]

outlierTest(modelA) # Bonferroni p-value for most extreme obs

## rstudent unadjusted p-value Bonferroni p


## 1 301.7519 1.749e-77 8.745e-76

qqPlot(modelA, main="QQ Plot") #qq plot for studentized resid

[Figure: QQ Plot of studentized residuals, modelA]

## [1] 1 44

leveragePlots(modelA)

[Figure: Leverage Plots, Y | others vs. X5 | others and X6 | others]

There are 2 high-leverage points (observations 11 and 44) in the residuals vs. fitted plot, since they lie away from the red line at 0; 1 outlier (an extreme fitted value); and 1 influential point (observation 1), since it lies within the Cook's distance band in the residuals vs. leverage plot.
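
These points can also be confirmed numerically (a minimal sketch using base R's influence measures on modelA; the observations flagged in the plots above should appear at the top of both lists):

#largest leverage (hat) values and largest Cook's distances for modelA
sort(hatvalues(modelA), decreasing = TRUE)[1:3]
sort(cooks.distance(modelA), decreasing = TRUE)[1:3]
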
-Check and comment on the structure of the relationship between the predictors and the response.

summary(modelA)

##
## Call:
## lm(formula = Y ~ X5 + X6, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.862e-13 -1.882e-13 -7.120e-14 -2.840e-14 4.200e-12
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.145e-13 1.173e-12 -4.390e-01 0.663
## X5 1.000e+00 1.057e-14 9.462e+13 <2e-16 ***
## X6 1.000e+00 9.247e-15 1.081e+14 <2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 6.3e-13 on 47 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 3.456e+29 on 2 and 47 DF, p-value: < 2.2e-16

The p-value is < 0.05 along with a high F-statistic value; rejecting the null hypothesis leads us to conclude a potential relationship between the predictors and the response. The low residual standard error (an estimate of the s.d. of the irreducible error) means a low deviation of our model from the true regression line. An R-squared of 1 implies that 100% of the variance is being explained, which indicates a good fit.
-Compute and comment on the condition numbers.

x=model.matrix(modelA)
e=eigen (t(x) %*% x)
print(e$val)

## [1] 2.352951e+07 2.073634e+03 2.883929e-01

sqrt(e$val[1]/e$val)

## [1] 1.0000 106.5223 9032.6299

There is a wide range in the eigenvalues, and several condition numbers are large. This suggests that problems are being caused by more than just one linear combination of the predictors.

• Compute and comment on the correlations between the predictors.

round(cor(data[,c(-7)]),3)

## X1 X2 X3 X4 X5 X6
## X1 1.000 -0.371 0.870 0.593 -0.410 -0.349
## X2 -0.371 1.000 -0.001 -0.213 0.064 0.095
## X3 0.870 -0.001 1.000 0.617 -0.477 -0.401
## X4 0.593 -0.213 0.617 1.000 -0.893 -0.869
## X5 -0.410 0.064 -0.477 -0.893 1.000 0.970
## X6 -0.349 0.095 -0.401 -0.869 0.970 1.000

There are very large correlations among the predictors.


-Compute and comment on the VIF.

vif(modelA)

## X5 X6
## 17.06391 17.06391

The VIF test is used to detect multicollinearity; VIF_i > 10 indicates serious multicollinearity for a predictor, which is the case here for both X5 and X6 (VIF of about 17).
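
As a cross-check (a minimal sketch; the helper fit aux is introduced only for illustration), the VIF for X5 can be reproduced by hand from its auxiliary regression on the other predictor:

#VIF = 1 / (1 - R^2) of the regression of X5 on X6
aux <- lm(X5 ~ X6, data = data)
1/(1 - summary(aux)$r.squared)
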
Question 3. Use the data “RegD4.txt” to fit a model using the following methods. (a) Least squares.

dat <- read.table("RegD4.txt", header = TRUE)


modela <- lm(Y1~.,dat)
summary(modela)

##
## Call:
## lm(formula = Y1 ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.2377 -1.7117 -0.4551 2.3614 5.6978
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -39.9197 11.8960 -3.356 0.00375 **
## X1 0.7156 0.1349 5.307 5.8e-05 ***
## X2 1.2953 0.3680 3.520 0.00263 **
## X3 -0.1521 0.1563 -0.973 0.34405
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 3.243 on 17 degrees of freedom
## Multiple R-squared: 0.9136, Adjusted R-squared: 0.8983
## F-statistic: 59.9 on 3 and 17 DF, p-value: 3.016e-09

(b)Least absolute deviations

library(quantreg)

## Loading required package: SparseM

##
## Attaching package: ’SparseM’

## The following object is masked from ’package:base’:


##
## backsolve

modelb <- rq(Y1~., data=dat)


summary(modelb)

##
## Call: rq(formula = Y1 ~ ., data = dat)
##
## tau: [1] 0.5
##
## Coefficients:
## coefficients lower bd upper bd
## (Intercept) -39.68986 -41.61973 -29.67754
## X1 0.83188 0.51278 1.14117
## X2 0.57391 0.32182 1.41090
## X3 -0.06087 -0.21348 -0.02891

(c) Huber method

library(MASS)
modelc <- rlm(Y1~., data = dat)
summary(modelc)

##
## Call: rlm(formula = Y1 ~ ., data = dat)
## Residuals:
## Min 1Q Median 3Q Max
## -8.91753 -1.73127 0.06187 1.54306 6.50163
##
## Coefficients:
## Value Std. Error t value
## (Intercept) -41.0265 9.8073 -4.1832
## X1 0.8294 0.1112 7.4597
## X2 0.9261 0.3034 3.0524
## X3 -0.1278 0.1289 -0.9922
##
## Residual standard error: 2.441 on 17 degrees of freedom

(d) Least trimmed squares

library(rrcov)

## Loading required package: robustbase

## Scalable Robust Estimators with High Breakdown Point (version 1.5-5)

modeld <- ltsReg(Y1~., data = dat)


summary(modeld)

##
## Call:
## ltsReg.formula(formula = Y1 ~ ., data = dat)
##
## Residuals (from reweighted LS):
## Min 1Q Median 3Q Max
## -2.5065 -0.4236 0.0000 0.5764 1.9342
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## Intercept -37.65246 4.73205 -7.957 2.37e-06 ***
## X1 0.79769 0.06744 11.828 2.48e-08 ***
## X2 0.57734 0.16597 3.479 0.00408 **
## X3 -0.06706 0.06160 -1.089 0.29611
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 1.253 on 13 degrees of freedom
## Multiple R-Squared: 0.975, Adjusted R-squared: 0.9692
## F-statistic: 169 on 3 and 13 DF, p-value: 1.159e-10

Question 4. Use the data to fit a model with Y as the response and only X3, X4, and X5 as predictors. Use
the Box-Cox method to determine the best transformation on the response.

library(faraway)

## Registered S3 methods overwritten by ’lme4’:


## method from
## cooks.distance.influence.merMod car
## influence.merMod car
## dfbeta.influence.merMod car
## dfbetas.influence.merMod car

##
## Attaching package: ’faraway’

## The following object is masked from ’package:robustbase’:


##
## epilepsy

## The following objects are masked from ’package:car’:


##
## logit, vif

library(MASS)
data <- read.table(’RegD7.txt’)
model <- lm(Y~X3+X4+X5,data=data)
plot(model)

[Figures: diagnostic plots for lm(Y ~ X3 + X4 + X5): Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage]

b <- boxcox(model)

[Figure: Box-Cox log-likelihood profile with 95% interval for lambda]

lam = b$x[which(b$y==max(b$y))]
model.inv <- lm((Y)^-1 ~ X3+X4+X5,data=data)
plot(model.inv)

[Figures: diagnostic plots for lm((Y)^-1 ~ X3 + X4 + X5): Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage]
The maximum likelihood occurs at a lambda value of 0.2626263. The original model shows a non-linear trend; model.inv transforms the response towards a linear model, with a certain increase in the number of outliers after the transformation.
Question 5. Use transformations to find a good model with Y as the response and X1 and X2 as predictors
using the data .

data <- read.table(’RegD7a.txt’,header = TRUE)


model <- lm(Y~X1+X2,data=data)
summary(model)

##
## Call:
## lm(formula = Y ~ X1 + X2, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.4065 -2.6493 -0.2876 2.2003 8.4847
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
## X1 4.7082 0.2643 17.816 < 2e-16 ***
## X2 0.3393 0.1302 2.607 0.0145 *
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##

## Residual standard error: 3.882 on 28 degrees of freedom
## Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
## F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16

#Log-transformation
lm_log <- lm(log1p(Y) ~ log1p(X1) + log1p(X2), data = data)
summary(lm_log)

##
## Call:
## lm(formula = log1p(Y) ~ log1p(X1) + log1p(X2), data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.16219 -0.04721 0.00006 0.06261 0.12252
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.7737 0.7798 -8.686 1.96e-09 ***
## log1p(X1) 2.0562 0.0781 26.329 < 2e-16 ***
## log1p(X2) 1.0766 0.2001 5.379 9.85e-06 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 0.07868 on 28 degrees of freedom
## Multiple R-squared: 0.9774, Adjusted R-squared: 0.9758
## F-statistic: 605.5 on 2 and 28 DF, p-value: < 2.2e-16

plot(lm_log)

[Figures: diagnostic plots for lm(log1p(Y) ~ log1p(X1) + log1p(X2)): Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage]
The maximum likelihood occurs at a lambda value of about 0.303. Applying the log transformation turns the data into a roughly linear trend; the original data were skewed, so the log transformation was the natural choice.
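
The quoted lambda presumably comes from a Box-Cox run on the untransformed fit; a minimal sketch of that step (output not shown here; the object bc is introduced only for illustration) would be:

#Box-Cox profile for the untransformed model; a lambda near 0 (about 0.3) supports a log-type transformation
bc <- boxcox(model)
bc$x[which.max(bc$y)]
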
Question 6. Use the data to fit a linear model. Implement the following variable selection methods to
determine the “best” model. (a) Backward Elimination. (b) AIC, AICC, BIC. (c) R2 (d) Mallows Cp.

library(olsrr)

##
## Attaching package: ’olsrr’

## The following object is masked from ’package:faraway’:


##
## hsb

## The following object is masked from ’package:MASS’:


##
## cement

## The following object is masked from ’package:datasets’:


##
## rivers

data <- read.table(’RegD8.txt’)


model <- lm(Y ~ ., data = data)
# stepwise backward regression
back <- ols_step_backward_p(model,details = TRUE)

## Backward Elimination Method
## ---------------------------
##
## Candidate Terms:
##
## 1 . X1
## 2 . X2
## 3 . X3
## 4 . X4
## 5 . X5
## 6 . X6
## 7 . X7
## 8 . X8
##
## We are eliminating variables based on p value...
##
## - X7
##
## Backward Elimination: Step 1
##
## Variable X7 Removed
##
## Model Summary
## --------------------------------------------------------------
## R 0.809 RMSE 0.705
## R-Squared 0.654 Coef. Var 28.436
## Adj. R-Squared 0.627 MSE 0.497
## Pred R-Squared 0.584 MAE 0.521
## --------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## -------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## -------------------------------------------------------------------
## Regression 83.713 7 11.959 24.078 0.0000
## Residual 44.204 89 0.497
## Total 127.918 96
## -------------------------------------------------------------------
##
## Parameter Estimates
## ---------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ---------------------------------------------------------------------------------------
## (Intercept) 0.954 0.829 1.150 0.253 -0.694 2.602
## X1 0.592 0.086 0.604 6.879 0.000 0.421 0.762
## X2 0.448 0.168 0.193 2.672 0.009 0.115 0.782
## X3 -0.019 0.011 -0.125 -1.747 0.084 -0.041 0.003
## X4 0.108 0.058 0.135 1.853 0.067 -0.008 0.223
## X5 0.758 0.241 0.272 3.140 0.002 0.278 1.237
## X6 -0.104 0.090 -0.127 -1.155 0.251 -0.284 0.075

## X8 0.005 0.003 0.130 1.549 0.125 -0.002 0.012
## ---------------------------------------------------------------------------------------
##
##
##
## No more variables satisfy the condition of p value = 0.3
##
##
## Variables Removed:
##
## - X7
##
##
## Final Model Output
## ------------------
##
## Model Summary
## --------------------------------------------------------------
## R 0.809 RMSE 0.705
## R-Squared 0.654 Coef. Var 28.436
## Adj. R-Squared 0.627 MSE 0.497
## Pred R-Squared 0.584 MAE 0.521
## --------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## -------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## -------------------------------------------------------------------
## Regression 83.713 7 11.959 24.078 0.0000
## Residual 44.204 89 0.497
## Total 127.918 96
## -------------------------------------------------------------------
##
## Parameter Estimates
## ---------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ---------------------------------------------------------------------------------------
## (Intercept) 0.954 0.829 1.150 0.253 -0.694 2.602
## X1 0.592 0.086 0.604 6.879 0.000 0.421 0.762
## X2 0.448 0.168 0.193 2.672 0.009 0.115 0.782
## X3 -0.019 0.011 -0.125 -1.747 0.084 -0.041 0.003
## X4 0.108 0.058 0.135 1.853 0.067 -0.008 0.223
## X5 0.758 0.241 0.272 3.140 0.002 0.278 1.237
## X6 -0.104 0.090 -0.127 -1.155 0.251 -0.284 0.075
## X8 0.005 0.003 0.130 1.549 0.125 -0.002 0.012
## ---------------------------------------------------------------------------------------

plot(back)

## geom_path: Each group consists of only one observation. Do you need to adjust

## the group aesthetic?

## geom_path: Each group consists of only one observation. Do you need to adjust
## the group aesthetic?
## geom_path: Each group consists of only one observation. Do you need to adjust
## the group aesthetic?
## geom_path: Each group consists of only one observation. Do you need to adjust
## the group aesthetic?
## geom_path: Each group consists of only one observation. Do you need to adjust
## the group aesthetic?
## geom_path: Each group consists of only one observation. Do you need to adjust
## the group aesthetic?

[Figures (2 pages): backward elimination criteria vs. step: R-Square, C(p), Adj. R-Square, AIC, SBIC, SBC]

Backward elimination builds the regression model from a set of candidate predictor variables by removing predictors based on their p-values, in a stepwise manner, until no variable is left that satisfies the removal criterion.

#Rest of the methods


k <- ols_step_best_subset(model)
plot(k)

[Figures (2 pages): best-subset criteria vs. number of predictors: R-Square, C(p), Adj. R-Square, AIC, SBIC, SBC]

Best-subset selection chooses the subset of predictors that does best at meeting some well-defined objective criterion, such as having the largest R2 value or the smallest MSE, Mallows' Cp, or AIC.
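
An alternative sketch of the same idea (this assumes the leaps package is installed; the objects subs and smry are introduced only for illustration):

#exhaustive best-subset search, reporting adjusted R^2, Mallows' Cp and BIC by subset size
library(leaps)
subs <- regsubsets(Y ~ ., data = data, nvmax = 8)
smry <- summary(subs)
data.frame(size = 1:8, adjr2 = smry$adjr2, cp = smry$cp, bic = smry$bic)
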
Question 7. Consider the data . Remove every tenth observation from the data for use as a test sample.
Use the remaining data as a training sample building the following models. (a) Linear regression with all
predictors.

data <- read.table(’RegD9.txt’,header = TRUE)


ind <- seq(10,100,by=10)
train <- data[-ind,]
test <- data[ind,]
model <- lm(Y~.,data=train)
summary(model)

##
## Call:
## lm(formula = Y ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.4456 -0.3758 -0.1407 0.2172 15.0780
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.487e+02 1.108e+01 40.500 <2e-16 ***
## X1 -4.105e+02 8.517e+00 -48.196 <2e-16 ***
## X2 1.347e-02 1.003e-02 1.344 0.180

## X3 9.256e-03 1.653e-02 0.560 0.576
## X4 -5.402e-03 2.925e-02 -0.185 0.854
## X5 -3.045e-02 7.215e-02 -0.422 0.673
## X6 2.687e-02 3.027e-02 0.888 0.376
## X7 1.893e-02 3.309e-02 0.572 0.568
## X8 2.415e-02 4.493e-02 0.537 0.591
## X9 -1.658e-02 4.466e-02 -0.371 0.711
## X10 -5.181e-03 7.474e-02 -0.069 0.945
## X11 -8.786e-02 6.738e-02 -1.304 0.194
## X12 -5.766e-02 5.314e-02 -1.085 0.279
## X13 3.737e-02 6.132e-02 0.609 0.543
## X14 6.042e-03 1.674e-01 0.036 0.971
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 1.299 on 227 degrees of freedom
## Multiple R-squared: 0.9773, Adjusted R-squared: 0.9759
## F-statistic: 698.8 on 14 and 227 DF, p-value: < 2.2e-16

(b) Linear regression with variables selected using AIC.

step(model)

## Start: AIC=141.14
## Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10 + X11 +
## X12 + X13 + X14
##
## Df Sum of Sq RSS AIC
## - X14 1 0.0 383.1 139.14
## - X10 1 0.0 383.1 139.14
## - X4 1 0.1 383.1 139.17
## - X9 1 0.2 383.3 139.28
## - X5 1 0.3 383.4 139.33
## - X8 1 0.5 383.5 139.44
## - X3 1 0.5 383.6 139.47
## - X7 1 0.6 383.6 139.49
## - X13 1 0.6 383.7 139.53
## - X6 1 1.3 384.4 139.98
## - X12 1 2.0 385.0 140.39
## - X11 1 2.9 385.9 140.94
## - X2 1 3.0 386.1 141.05
## <none> 383.1 141.14
## - X1 1 3919.7 4302.8 724.49
##
## Step: AIC=139.14
## Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10 + X11 +
## X12 + X13
##
## Df Sum of Sq RSS AIC
## - X10 1 0.0 383.1 137.14
## - X4 1 0.1 383.1 137.17
## - X9 1 0.2 383.3 137.29
## - X5 1 0.3 383.4 137.33

## - X8 1 0.5 383.5 137.45
## - X3 1 0.5 383.6 137.48
## - X7 1 0.6 383.6 137.49
## - X13 1 0.7 383.7 137.56
## - X6 1 1.3 384.4 137.98
## - X12 1 2.0 385.0 138.39
## - X11 1 3.0 386.1 139.03
## <none> 383.1 139.14
## - X2 1 3.5 386.6 139.35
## - X1 1 4080.0 4463.1 731.35
##
## Step: AIC=137.14
## Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X11 + X12 +
## X13
##
## Df Sum of Sq RSS AIC
## - X4 1 0.1 383.1 135.18
## - X9 1 0.3 383.4 135.32
## - X5 1 0.3 383.4 135.33
## - X8 1 0.5 383.5 135.45
## - X3 1 0.6 383.6 135.49
## - X7 1 0.6 383.6 135.50
## - X13 1 0.7 383.7 135.56
## - X6 1 1.4 384.4 136.00
## - X12 1 2.0 385.0 136.39
## - X11 1 3.2 386.2 137.14
## <none> 383.1 137.14
## - X2 1 3.8 386.8 137.52
## - X1 1 4080.9 4464.0 729.39
##
## Step: AIC=135.18
## Y ~ X1 + X2 + X3 + X5 + X6 + X7 + X8 + X9 + X11 + X12 + X13
##
## Df Sum of Sq RSS AIC
## - X9 1 0.2 383.4 133.34
## - X5 1 0.3 383.4 133.37
## - X3 1 0.5 383.6 133.51
## - X7 1 0.6 383.7 133.55
## - X8 1 0.6 383.7 133.57
## - X13 1 0.7 383.8 133.60
## - X6 1 1.6 384.7 134.17
## - X12 1 2.0 385.1 134.41
## - X11 1 3.2 386.3 135.17
## <none> 383.1 135.18
## - X2 1 3.9 387.0 135.62
## - X1 1 4097.2 4480.3 728.28
##
## Step: AIC=133.34
## Y ~ X1 + X2 + X3 + X5 + X6 + X7 + X8 + X11 + X12 + X13
##
## Df Sum of Sq RSS AIC
## - X5 1 0.4 383.7 131.56
## - X8 1 0.4 383.8 131.59
## - X3 1 0.5 383.8 131.63

## - X7 1 0.6 383.9 131.69
## - X13 1 0.7 384.0 131.75
## - X6 1 1.8 385.2 132.47
## - X12 1 2.5 385.9 132.90
## <none> 383.4 133.34
## - X11 1 3.2 386.6 133.35
## - X2 1 5.3 388.7 134.66
## - X1 1 4172.0 4555.4 730.30
##
## Step: AIC=131.56
## Y ~ X1 + X2 + X3 + X6 + X7 + X8 + X11 + X12 + X13
##
## Df Sum of Sq RSS AIC
## - X3 1 0.3 384.0 129.73
## - X7 1 0.5 384.2 129.85
## - X13 1 0.5 384.2 129.88
## - X8 1 0.6 384.3 129.92
## - X6 1 1.8 385.5 130.67
## - X12 1 2.8 386.5 131.32
## - X11 1 3.1 386.8 131.51
## <none> 383.7 131.56
## - X2 1 5.0 388.7 132.68
## - X1 1 4328.3 4712.1 736.48
##
## Step: AIC=129.73
## Y ~ X1 + X2 + X6 + X7 + X8 + X11 + X12 + X13
##
## Df Sum of Sq RSS AIC
## - X13 1 0.7 384.7 128.14
## - X7 1 0.8 384.8 128.21
## - X8 1 1.7 385.7 128.78
## - X12 1 2.5 386.5 129.32
## - X6 1 2.6 386.6 129.38
## - X11 1 2.8 386.8 129.51
## <none> 384.0 129.73
## - X2 1 4.7 388.7 130.70
## - X1 1 4751.9 5135.9 755.33
##
## Step: AIC=128.14
## Y ~ X1 + X2 + X6 + X7 + X8 + X11 + X12
##
## Df Sum of Sq RSS AIC
## - X7 1 0.7 385.3 126.57
## - X8 1 1.6 386.3 127.16
## - X12 1 1.9 386.6 127.34
## - X11 1 2.6 387.3 127.78
## <none> 384.7 128.14
## - X6 1 3.3 387.9 128.19
## - X2 1 4.4 389.1 128.90
## - X1 1 4756.7 5141.4 753.59
##
## Step: AIC=126.57
## Y ~ X1 + X2 + X6 + X8 + X11 + X12
##

## Df Sum of Sq RSS AIC
## - X12 1 2.1 387.4 125.89
## - X11 1 2.8 388.1 126.33
## <none> 385.3 126.57
## - X8 1 5.3 390.6 127.87
## - X2 1 6.4 391.7 128.56
## - X6 1 8.3 393.6 129.74
## - X1 1 7275.0 7660.3 848.08
##
## Step: AIC=125.89
## Y ~ X1 + X2 + X6 + X8 + X11
##
## Df Sum of Sq RSS AIC
## <none> 387.4 125.89
## - X11 1 3.4 390.9 126.02
## - X8 1 4.0 391.4 126.37
## - X6 1 6.5 394.0 127.94
## - X2 1 7.3 394.8 128.42
## - X1 1 7280.1 7667.5 846.31

##
## Call:
## lm(formula = Y ~ X1 + X2 + X6 + X8 + X11, data = train)
##
## Coefficients:
## (Intercept) X1 X2 X6 X8 X11
## 448.50931 -412.80035 0.01566 0.03971 0.03653 -0.08502

modelAIC <- lm(Y~ X1+X2+X6+X8+X11,data=train)


summary(modelAIC)

##
## Call:
## lm(formula = Y ~ X1 + X2 + X6 + X8 + X11, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3588 -0.3912 -0.1131 0.2695 15.3199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.485e+02 7.713e+00 58.146 <2e-16 ***
## X1 -4.128e+02 6.199e+00 -66.592 <2e-16 ***
## X2 1.566e-02 7.418e-03 2.111 0.0358 *
## X6 3.971e-02 1.990e-02 1.996 0.0471 *
## X8 3.653e-02 2.341e-02 1.561 0.1200
## X11 -8.502e-02 5.889e-02 -1.444 0.1501
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 1.281 on 236 degrees of freedom
## Multiple R-squared: 0.9771, Adjusted R-squared: 0.9766
## F-statistic: 2011 on 5 and 236 DF, p-value: < 2.2e-16

Y ~ X1 + X2 + X6 + X8 + X11 gives the lowest AIC value. Of all the best-subset plots shown above, we look at the AIC plot and select the set of features with the lowest value.
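
A direct comparison of the full training model and the step-selected model (a minimal sketch) makes the difference explicit:

#AIC of the full model versus the model chosen by step()
AIC(model, modelAIC)
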
(c)Principal Component Analysis

library(pls)

##
## Attaching package: ’pls’

## The following object is masked from ’package:outliers’:


##
## scores

## The following object is masked from ’package:stats’:


##
## loadings

pcr_model <- pcr(Y~., data = train , scale = TRUE, validation = "CV")


summary(pcr_model)

## Data: X dimension: 242 14


## Y dimension: 242 1
## Fit method: svdpc
## Number of components considered: 14
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.39 6.299 4.469 4.527 4.402 4.195 2.210
## adjCV 8.39 6.290 4.456 4.502 4.372 4.164 2.176
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 2.165 2.183 1.606 1.515 1.458 1.477 1.416
## adjCV 2.153 2.180 1.558 1.501 1.449 1.468 1.407
## 14 comps
## CV 1.413
## adjCV 1.405
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 60.15 71.25 78.84 83.73 88.20 91.26 93.46 95.35
## Y 44.86 73.24 76.61 78.00 81.51 93.65 93.80 93.80
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 96.78 98.07 98.99 99.54 99.83 100.00
## Y 96.88 97.31 97.53 97.54 97.73 97.73

plot(pcr_model)

[Figure: PCR predicted vs. measured Y, 14 comps, validation]

predy <- predict(pcr_model,test)


actuals_and_preds <- data.frame(cbind(actuals = test$Y, predicteds = predy))
min_max_accuracy <- mean(apply(actuals_and_preds, 1, min) / apply(actuals_and_preds, 1, max))
print(min_max_accuracy)

## [1] 1

(d)Partial Least Square

library(caret)

## Loading required package: lattice

##
## Attaching package: ’lattice’

## The following object is masked from ’package:faraway’:


##
## melanoma

## Loading required package: ggplot2

##
## Attaching package: ’caret’

## The following object is masked from ’package:pls’:
##
## R2

library(pls)
modelpls <- train(
Y~., data = train, method = "pls",
scale = TRUE,
trControl = trainControl("cv", number = 10),
tuneLength = 10
)
predy <- predict(modelpls,test)
actuals_and_preds <- data.frame(cbind(actuals = test$Y, predicteds = predy))
min_max_accuracy <- mean(apply(actuals_and_preds, 1, min) / apply(actuals_and_preds, 1, max))
print(min_max_accuracy)

## [1] 1

(e)Ridge Regression

library(glmnet)

## Loading required package: Matrix

## Loaded glmnet 4.0-2

x_var <- data.matrix(train[, c("X1", "X2", "X3","X4", "X5", "X6","X7", "X8", "X9","X10", "X11", "X12","X13", "X14")])
y_var <- train[,’Y’]
# Setting the range of lambda values
lambda_seq <- 10^seq(2, -2, by = -.1)
# Using glmnet function to build the ridge regression in r
fit <- glmnet(x_var, y_var, alpha = 0, lambda = lambda_seq)
# Checking the model
summary(fit)

## Length Class Mode


## a0 41 -none- numeric
## beta 574 dgCMatrix S4
## df 41 -none- numeric
## dim 2 -none- numeric
## lambda 41 -none- numeric
## dev.ratio 41 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 5 -none- call
## nobs 1 -none- numeric

# Using cross validation glmnet
ridge_cv <- cv.glmnet(x_var, y_var, alpha = 0, lambda = lambda_seq)
# Best lambda value
best_lambda <- ridge_cv$lambda.min
best_fit <- ridge_cv$glmnet.fit

best_ridge <- glmnet(x_var, y_var, alpha = 0, lambda = 0.03162278)


coef(best_ridge)

## 15 x 1 sparse Matrix of class "dgCMatrix"


## s0
## (Intercept) 4.421301e+02
## X1 -4.046327e+02
## X2 1.414329e-02
## X3 7.964367e-03
## X4 -6.317373e-03
## X5 -3.513366e-02
## X6 2.626170e-02
## X7 3.223843e-02
## X8 1.998798e-02
## X9 -1.255532e-02
## X10 -4.885175e-03
## X11 -8.334956e-02
## X12 -5.411621e-02
## X13 4.217803e-02
## X14 -1.987143e-02

pred <- predict(best_ridge, s = best_lambda, newx = data.matrix(test[, c("X1", "X2", "X3","X4", "X5", "X6","X7", "X8", "X9","X10", "X11", "X12","X13", "X14")]))
actual_preds <- data.frame(cbind(actuals = test$Y, predicted = pred))
correlation_accuracy <- cor(actual_preds)
correlation_accuracy

## actuals X1
## actuals 1.000000 0.998892
## X1 0.998892 1.000000

min_max_accuracy <- mean(apply(actual_preds, 1, min) / apply(actual_preds, 1, max))


print(min_max_accuracy)

## [1] 0.9757787

The min-max accuracy of ridge regression on the test data set is about 97.6%.

C. TIME SERIES

library("tseries")

## Registered S3 method overwritten by ’quantmod’:
## method from
## as.zoo.data.frame zoo

library("forecast")
library("TSA")

## Registered S3 methods overwritten by ’TSA’:


## method from
## fitted.Arima forecast
## plot.Arima forecast

##
## Attaching package: ’TSA’

## The following objects are masked from ’package:stats’:


##
## acf, arima

## The following object is masked from ’package:utils’:


##
## tar

library(ggplot2)

ts1 <- read.table("TSD1.txt")


ts1 <- as.matrix(ts1)
ts.plot(ts1)

[Figure: time series plot of ts1]

adf.test(ts1)

##
## Augmented Dickey-Fuller Test
##
## data: ts1
## Dickey-Fuller = -5.0454, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary

acf(ts1)

[Figure: ACF of ts1]

pacf(ts1)

[Figure: PACF of ts1]

eacf(ts1)

## AR/MA
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 x x x x x x x x x x o o o o
## 1 o o o o o o o o o o o o o o
## 2 x o o o o o o o o o o o o o
## 3 x x o o o o o o o o o o o o
## 4 o x o o o o o o o o o o o o
## 5 o x o o o o o o o o o o o o
## 6 o x o o o o o o o o o o o o
## 7 o x x x o o o o o o o o o o

The ACF plot tails off and the PACF plot cuts off after lag 1. In the EACF plot, the first zero at the top left determines the values of p and q in the ARMA(p,q) model; here p = 1 and q = 0, which indicates an AR(1) model. This can also be verified using auto.arima.
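
A minimal sketch of that auto.arima check (its output is not shown in the original; fit_ts1 is introduced only for illustration):

#automatic ARIMA order selection by AICc; expected to agree with the AR(1) reading
fit_ts1 <- auto.arima(ts1)
fit_ts1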

ts2 <- read.table("TSD2.txt")


ts2 <- as.matrix(ts2)
ts.plot(ts2)

[Figure: time series plot of ts2]

#Test for stationarity


adf.test(ts2)

##
## Augmented Dickey-Fuller Test
##
## data: ts2
## Dickey-Fuller = -3.992, Lag order = 4, p-value = 0.01242
## alternative hypothesis: stationary

acf(ts2)

[Figure: ACF of ts2]

pacf(ts2)

[Figure: PACF of ts2]

eacf(ts2)

## AR/MA
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 x x x o o o o o o o o o o o
## 1 x x x o x o o o o o o o o o
## 2 x o o o o o o o o o o o o o
## 3 x x o o o o o o o o o o o o
## 4 x x o o o o o o o o o o o o
## 5 x x o x o o o o o o o o o o
## 6 o o o o o o o o o o o o o o
## 7 x o o o o o o o o o o o o o

model <- auto.arima(ts2)


model

## Series: ts2
## ARIMA(2,0,0) with zero mean
##
## Coefficients:
## ar1 ar2
## 1.1051 -0.3747
## s.e. 0.0932 0.0939
##
## sigma^2 estimated as 1.619: log likelihood=-165.64
## AIC=337.28 AICc=337.53 BIC=345.09

Since the p-value in the Dickey-Fuller test is < 0.05, we can say the series follows an ARMA(p,q) model.
The ACF plot tails off, so it is an AR(p) process; in the PACF plot, the cut-off takes place after a lag of 2, which gives p.
The EACF plot shows the top-left 0 at (2,1); the order is maintained, and (2,0) can also be the choice of p and q since it neighbours (2,1) in the EACF table. Thus it is an ARMA(2,0) model.

ts3 <- read.table("TSD3.txt")


ts3 <- as.matrix(ts3)
ts.plot(ts3)
[Figure: time series plot of ts3]

acf(ts3)

[Figure: ACF of ts3]

pacf(ts3)

[Figure: PACF of ts3]

eacf(ts3)

## AR/MA
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 x o o o x o o o o o o o o o
## 1 x o o o x o o o o o o o o o
## 2 x o o o x o o o o o o o o o
## 3 x o o o x o o o o o o o o o
## 4 o o o x o o o o o o o o o o
## 5 x o o x o x o o o o o o o o
## 6 o o o o o x o o o o o o o o
## 7 x x o o o x x o o o o o o o

Based on the Dickey-Fuller test, the p-value is < 0.05, which indicates that the data are stationary and thus the model is ARMA(p,q).
The PACF plot tails off, with a slight excursion at lag 4, so the order p can be taken as 0 based on the EACF plot, where the top-left 0 is located at (0,1); this gives an order q of 1, which is also apparent in the ACF plot, which cuts off after a lag of 1. Therefore, it is an ARMA(0,1) model.
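
As a sketch (fit_ts3 is introduced only for illustration), the identified ARMA(0,1) model can be fitted directly with forecast::Arima:

#fit an MA(1), i.e. ARMA(0,1), model to ts3
fit_ts3 <- Arima(ts3[, 1], order = c(0, 0, 1))
fit_ts3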

ts4 <- read.table("TSD4.txt")


ts4 <- as.matrix(ts4)
ts.plot(ts4)

[Figure: time series plot of ts4]

adf.test(ts4)

##
## Augmented Dickey-Fuller Test
##
## data: ts4
## Dickey-Fuller = -4.8092, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary

acf(ts4)

[Figure: ACF of ts4]

pacf(ts4)

[Figure: PACF of ts4]

eacf(ts4)

## AR/MA
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 o x o o o o o o o o o o o o
## 1 x x o o x o o o o o o o o o
## 2 x x o x o o o o o o o o o o
## 3 x x x x o o o o o o o o o o
## 4 x o x x o o o o o o o o o o
## 5 x o o o x o o o o o o o o o
## 6 x o o o x o o o o o o o o o
## 7 x x o o o o x o o o o o o o

Based on the Dickey-Fuller test, the p-value is < 0.05, which indicates that the data are stationary and thus the model is ARMA(p,q). In the EACF plot, the 0 nearest to the top-left corner suggests an ARMA(1,2) series, which is around (0,2) (verified by auto.arima). From the ACF plot, the significant values drop off after lag 2, which puts the value of q at 2.
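
A minimal sketch of the auto.arima verification mentioned above (output not shown in the original; fit_ts4 is introduced only for illustration):

#automatic order selection for ts4
fit_ts4 <- auto.arima(ts4)
fit_ts4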

ts5 <- read.table("TSD5.txt")


ts5 <- as.matrix(ts5)
ts.plot(ts5)

[Figure: time series plot of ts5]

adf.test(ts5)

##
## Augmented Dickey-Fuller Test
##
## data: ts5
## Dickey-Fuller = -1.893, Lag order = 4, p-value = 0.621
## alternative hypothesis: stationary

ts5a <- diff(ts5)


adf.test(ts5a)

##
## Augmented Dickey-Fuller Test
##
## data: ts5a
## Dickey-Fuller = -5.1214, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary

acf(ts5a)

[Figure: ACF of the differenced series ts5a]

pacf(ts5a)

[Figure: PACF of the differenced series ts5a]

eacf(ts5a)

## AR/MA
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 x x o o x x x x x o o o x x
## 1 x x o o x o o o o o o o o o
## 2 o o x o o o o o o o o o o o
## 3 o x x o o o o o o o o o o o
## 4 o x x o o o o o o o o o o o
## 5 o x x x o o o o o o o o o o
## 6 x x x o o o o o o o o o o o
## 7 x o x o o o o o o o o o o o

From the Dickey-Fuller test, we can see that the p-value is > 0.05, which indicates that the data are non-stationary. After one differencing, the p-value is < 0.05, so the value of d in ARIMA(p,d,q) is 1.
From the PACF plot we can take the value of p to be 2, and from the ACF plot we can argue that the value of q is 2; thus it is an ARIMA(2,1,2) process.
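
A sketch of fitting the identified model to the undifferenced series (the d = 1 term handles the differencing internally; fit_ts5 is introduced only for illustration):

#ARIMA(2,1,2) fitted to the original, undifferenced series ts5
fit_ts5 <- Arima(ts5[, 1], order = c(2, 1, 2))
fit_ts5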

ts6 <- read.table("TSD6.txt")


ts6 <- as.matrix(ts6)
ts.plot(ts6)

[Figure: time series plot of ts6]

adf.test(ts6)

##
## Augmented Dickey-Fuller Test
##
## data: ts6
## Dickey-Fuller = -1.8286, Lag order = 4, p-value = 0.6477
## alternative hypothesis: stationary

ts6a <- diff(ts6)


adf.test(ts6a)

##
## Augmented Dickey-Fuller Test
##
## data: ts6a
## Dickey-Fuller = -2.9085, Lag order = 4, p-value = 0.2003
## alternative hypothesis: stationary

ts6b <- diff(ts6a)


adf.test(ts6b)

##
## Augmented Dickey-Fuller Test

##
## data: ts6b
## Dickey-Fuller = -5.9482, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary

acf(ts6b)

[Figure: ACF of the twice-differenced series ts6b]

pacf(ts6b)

[Figure: PACF of the twice-differenced series ts6b]

eacf(ts6b)

## AR/MA
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 x o o o o o o o o o o o o o
## 1 x o o o o o o o o o o o o o
## 2 x x x o o o o o o o o o o o
## 3 x x x x o o o o o o o o o o
## 4 o x x x o o o o o o o o o o
## 5 x o o o o o o o o o o o o o
## 6 x o o o o o o o o o o o o o
## 7 x o o o o o o o o o o o o o

model <- auto.arima(ts6b)


model

## Series: ts6b
## ARIMA(0,0,2) with zero mean
##
## Coefficients:
## ma1 ma2
## -0.4995 -0.2224
## s.e. 0.1002 0.1063
##
## sigma^2 estimated as 2.25: log likelihood=-179.92
## AIC=365.84 AICc=366.1 BIC=373.63

D. CLASSIFICATION

library("tseries")
library("forecast")
library("TSA")
library(ggplot2)
library(ISLR)
library(Hmisc)

## Loading required package: survival

##
## Attaching package: ’survival’

## The following object is masked from ’package:caret’:


##
## cluster

## The following objects are masked from ’package:faraway’:


##
## rats, solder

## The following object is masked from ’package:robustbase’:


##
## heart

## The following object is masked from ’package:quantreg’:


##
## untangle.specials

## Loading required package: Formula

##
## Attaching package: ’Hmisc’

## The following object is masked from ’package:quantreg’:


##
## latex

## The following objects are masked from ’package:base’:


##
## format.pval, units

library(CatEncoders)

##
## Attaching package: ’CatEncoders’

## The following object is masked from ’package:base’:


##
## transform

library(caret)
library(MASS)
library(ggthemes)
library(gbm)

## Loaded gbm 2.1.8

library(e1071)

##
## Attaching package: ’e1071’

## The following object is masked from ’package:Hmisc’:


##
## impute

## The following objects are masked from ’package:TSA’:


##
## kurtosis, skewness

library(randomForest)

## randomForest 4.6-14

## Type rfNews() to see new features/changes/bug fixes.

##
## Attaching package: ’randomForest’

## The following object is masked from ’package:ggplot2’:


##
## margin

## The following object is masked from ’package:outliers’:


##
## outlier

library(dplyr)

##
## Attaching package: ’dplyr’

## The following object is masked from ’package:randomForest’:


##
## combine

## The following objects are masked from ’package:Hmisc’:


##
## src, summarize

## The following object is masked from ’package:car’:
##
## recode

## The following object is masked from ’package:MASS’:


##
## select

## The following objects are masked from ’package:stats’:


##
## filter, lag

## The following objects are masked from ’package:base’:


##
## intersect, setdiff, setequal, union

library(lmtest)
library(quantreg)
library(robustbase)
library(faraway)
library(astsa)

##
## Attaching package: ’astsa’

## The following object is masked from ’package:forecast’:


##
## gas

## The following object is masked from ’package:faraway’:


##
## star

Question 1. Consider the Weekly data set, which is part of the ISLR package. It contains the weekly stock market returns for 21 years. (a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?

library(ISLR)
df <- Weekly
head(df)

## Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today Direction


## 1 1990 0.816 1.572 -3.936 -0.229 -3.484 0.1549760 -0.270 Down
## 2 1990 -0.270 0.816 1.572 -3.936 -0.229 0.1485740 -2.576 Down
## 3 1990 -2.576 -0.270 0.816 1.572 -3.936 0.1598375 3.514 Up
## 4 1990 3.514 -2.576 -0.270 0.816 1.572 0.1616300 0.712 Up
## 5 1990 0.712 3.514 -2.576 -0.270 0.816 0.1537280 1.178 Up
## 6 1990 1.178 0.712 3.514 -2.576 -0.270 0.1544440 -1.372 Down

summary(df)

## Year Lag1 Lag2 Lag3


## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume Today
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
## Direction
## Down:484
## Up :605
##
##
##
##

plot(df$Year,df$Lag1)
[Figure: df$Lag1 vs. df$Year]

plot(df$Year,df$Lag2)

[Figure: df$Lag2 vs. df$Year]

plot(df$Year,df$Volume)

[Figure: df$Volume vs. df$Year]

The Year vs. Volume plot shows a steady increase in trading volume almost every year.

(b) Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?

dfb <- df[,c(2,3,4,5,6,7,9)]


train <- sample(1:1089,800)
dfb_train <- dfb[train,]
dfb_test <- dfb[-train,]
logReg <- glm(Direction ~.,family=binomial(),data=dfb_train)
summary(logReg)

##
## Call:
## glm(formula = Direction ~ ., family = binomial(), data = dfb_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.481 -1.260 1.036 1.093 1.349
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.236643 0.099794 2.371 0.0177 *
## Lag1 -0.010568 0.030986 -0.341 0.7330
## Lag2 0.043209 0.030952 1.396 0.1627

## Lag3 0.009892 0.032363 0.306 0.7599
## Lag4 -0.007061 0.032243 -0.219 0.8266
## Lag5 -0.010622 0.030171 -0.352 0.7248
## Volume -0.025549 0.043535 -0.587 0.5573
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1101.0 on 799 degrees of freedom
## Residual deviance: 1098.3 on 793 degrees of freedom
## AIC: 1112.3
##
## Number of Fisher Scoring iterations: 4

As the p-values of all the predictors are greater than 0.05, none of them appears to be statistically significant in explaining the response variable, Direction.

(c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion
matrix is telling you about the types of mistakes made by logistic regression.

pred <- predict(logReg,newdata=dfb_test,type=’response’)


pred <- ifelse(pred > 0.5,2,1)
dfb_test$Direction<-as.numeric(dfb_test$Direction)
table(factor(pred, levels=min(dfb_test$Direction):max(dfb_test$Direction)),
factor(dfb_test$Direction, levels=min(dfb_test$Direction):max(dfb_test$Direction)))

##
## 1 2
## 1 8 8
## 2 116 157

error <- mean(pred != dfb_test$Direction)


print(1-error)

## [1] 0.5709343

Confusion matrix and accuracy: in the table above, rows are the predicted classes and columns the actual classes, and the printed value is the overall fraction of correct predictions. Most of the model's mistakes are "Down" weeks that it predicts as "Up".
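
For a fuller breakdown of the error types (a sketch using caret, which is already loaded above; the encoding 1 = Down, 2 = Up follows the table, and cm is introduced only for illustration):

#confusion matrix with per-class sensitivity/specificity; the 116 cell corresponds to Down weeks predicted as Up
cm <- confusionMatrix(factor(pred, levels = 1:2),
factor(dfb_test$Direction, levels = 1:2),
positive = "2")
cm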

(d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the
only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the
held out data (that is, the data from 2009 and 2010.)

dfd <- df[,c(3,9)]


train <- sample(1:1089,800)
dfd_train <- dfd[train,]
dfd_test <- dfd[-train,]
logReg2 <- glm(Direction ~.,family=binomial(),data=dfd_train)
summary(logReg2)

##
## Call:
## glm(formula = Direction ~ ., family = binomial(), data = dfd_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.417 -1.248 1.013 1.103 1.417
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.17286 0.07121 2.428 0.0152 *
## Lag2 0.06521 0.02973 2.193 0.0283 *
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1102.9 on 799 degrees of freedom
## Residual deviance: 1098.0 on 798 degrees of freedom
## AIC: 1102
##
## Number of Fisher Scoring iterations: 4

pred <- predict(logReg2, newdata=dfd_test, type=’response’)


pred <- ifelse(pred > 0.5,2,1)
fit1=LabelEncoder.fit(dfd_test$Direction)
dfd_test$Direction<-transform(fit1,dfd_test$Direction)
table(factor(pred, levels=min(dfd_test$Direction):max(dfd_test$Direction)),
factor(dfd_test$Direction, levels=min(dfd_test$Direction):max(dfd_test$Direction)))

##
## 1 2
## 1 7 8
## 2 112 162

error <- mean(pred != dfd_test$Direction)


print(1-error)

## [1] 0.5847751
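
Note that the split above is a random 800/289 sample rather than the chronological split the question asks for. A sketch of training on 1990-2008 and testing on 2009-2010, assuming df still holds the full Weekly data with its Year column (results not shown):

train_idx   <- df$Year <= 2008
chron_train <- df[train_idx, c("Lag2", "Direction")]
chron_test  <- df[!train_idx, c("Lag2", "Direction")]

logReg_chron <- glm(Direction ~ Lag2, family = binomial(), data = chron_train)
prob  <- predict(logReg_chron, newdata = chron_test, type = "response")
predc <- ifelse(prob > 0.5, "Up", "Down")

table(predc, chron_test$Direction)    # confusion matrix for 2009-2010
mean(predc == chron_test$Direction)   # overall fraction correct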

(e) Repeat (d) using linear discriminant analysis (LDA).

LDAmodel <- lda(Direction~., data = dfd_train)


summary(LDAmodel)

## Length Class Mode


## prior 2 -none- numeric
## counts 2 -none- numeric
## means 2 -none- numeric
## scaling 1 -none- numeric
## lev 2 -none- character

## svd 1 -none- numeric
## N 1 -none- numeric
## call 3 -none- call
## terms 3 terms call
## xlevels 0 -none- list


pred <-predict(LDAmodel,newdata=dfd_test,type=’response’)
fit=LabelEncoder.fit(pred$class)
pred$class<-transform(fit,pred$class)
table(factor(pred$class, levels=min(dfd_test$Direction):max(dfd_test$Direction)),
factor(dfd_test$Direction, levels=min(dfd_test$Direction):max(dfd_test$Direction)))

##
## 1 2
## 1 6 8
## 2 113 162

error <- mean(pred$class != dfd_test$Direction)


print(1-error)

## [1] 0.5813149

(f) Repeat (d) using quadratic discriminant analysis (QDA).

QDAmodel <- qda(Direction~., data=dfd_train)


summary(QDAmodel)

## Length Class Mode


## prior 2 -none- numeric
## counts 2 -none- numeric
## means 2 -none- numeric
## scaling 2 -none- numeric
## ldet 2 -none- numeric
## lev 2 -none- character
## N 1 -none- numeric
## call 3 -none- call
## terms 3 terms call
## xlevels 0 -none- list
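
Since the question asks to repeat (d), the confusion matrix and accuracy for the QDA fit can be obtained by mirroring the LDA steps above; a sketch (it assumes dfd_test$Direction still holds the encoded 1/2 labels from part (d)):

qda_pred  <- predict(QDAmodel, newdata = dfd_test)
fitq      <- LabelEncoder.fit(qda_pred$class)   # same encoding step as used for LDA
qda_class <- transform(fitq, qda_pred$class)

table(qda_class, dfd_test$Direction)
1 - mean(qda_class != dfd_test$Direction)       # overall accuracy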

(g) Repeat (d) using K-nearest neighbors (KNN) with K = 1.

fit2=LabelEncoder.fit(dfd_train$Direction)
dfd_train$Direction<-transform(fit2,dfd_train$Direction)
library(class)
classifier_knn <- knn(train = dfd_train,
test = dfd_test,
cl = dfd_train$Direction,
k = 1)
summary(classifier_knn)

## 1 2
## 118 171

error <- mean(classifier_knn != dfd_test$Direction)


print(error)

## [1] 0.003460208
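
This near-zero error is suspicious: dfd_train and dfd_test are passed to knn() with the Direction column still included, so the response leaks into the feature space. A sketch that uses only Lag2 as the predictor and avoids the leakage (result not shown):

knn_pred <- knn(train = data.frame(Lag2 = dfd_train$Lag2),
                test  = data.frame(Lag2 = dfd_test$Lag2),
                cl = dfd_train$Direction, k = 1)
mean(knn_pred != dfd_test$Direction)   # test error without the leaked response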

(h) Which of these methods appears to provide the best results on this data?

KNN with K = 1 appears to provide the best results on this data, with an accuracy of more than 99%; however, as noted above, this figure is inflated because the Direction column was included among the KNN features. Among the remaining methods, logistic regression on Lag2 and LDA both achieve roughly 58% accuracy on the held-out data.

2. This problem involves predicting Salary from the Hitters data set which is part of the ISLR package.

(a) Remove the observations for whom the salary information is unknown, and log-transform the salaries.

df=Hitters
df = na.omit(df)
df$Salary = log(df$Salary)
head(df)

## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun
## -Alan Ashby 315 81 7 24 38 39 14 3449 835 69
## -Alvin Davis 479 130 18 66 72 76 3 1624 457 63
## -Andre Dawson 496 141 20 65 78 37 11 5628 1575 225
## -Andres Galarraga 321 87 10 39 42 30 2 396 101 12
## -Alfredo Griffin 594 169 4 74 51 35 11 4408 1133 19
## -Al Newman 185 37 1 23 8 21 2 214 42 1
## CRuns CRBI CWalks League Division PutOuts Assists Errors
## -Alan Ashby 321 414 375 N W 632 43 10
## -Alvin Davis 224 266 263 A W 880 82 14
## -Andre Dawson 828 838 354 N E 200 11 3
## -Andres Galarraga 48 46 33 N E 805 40 4
## -Alfredo Griffin 501 336 194 A W 282 421 25
## -Al Newman 30 9 24 N E 76 127 7
## Salary NewLeague
## -Alan Ashby 6.163315 N
## -Alvin Davis 6.173786 A
## -Andre Dawson 6.214608 N
## -Andres Galarraga 4.516339 N
## -Alfredo Griffin 6.620073 A
## -Al Newman 4.248495 A
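
A quick sanity check that the missing salaries were dropped and the log transform applied (a sketch; output not shown):

sum(is.na(df$Salary))   # should be 0 after na.omit
dim(df)                 # 263 observations should remain
range(df$Salary)        # salaries are now on the log scale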

(b) Create a training set consisting of the first 200 observations, and a test set consisting of the remaining
observations.

df_train <- df[1:200,]


df_test <- df[-(1:200),]

(c) Perform boosting on the training set with 1000 trees for a range of values of the shrinkage parameter
lambda. Produce a plot with different shrinkage values on the x-axis and the corresponding training
set MSE on the y-axis.

set.seed(1)
val = seq(-10, -0.2, by = 0.1)
lambda = 10^val
df_train_error = double(length = length(lambda))
for (i in 1:length(lambda))
{
df_boost = gbm(Salary ~ ., data = df_train, distribution = "gaussian",
n.trees = 1000, shrinkage = lambda[i])
df_train_pred = predict(df_boost, df_train, n.trees = 1000)
df_train_error[i] = mean((df_train_pred - df_train$Salary)^2)
}
plot(lambda, df_train_error, type = "b", xlab = "Shrinkage", ylab = "Training MeanSquareError")
[Figure: training set MSE plotted against shrinkage]

(d) Produce a plot with different shrinkage values on the x-axis and the corresponding test set MSE on the
y-axis.

df_test_error <- double(length = length(lambda))


for (i in 1:length(lambda))
{
df_boost = gbm(Salary ~ ., data = df_train, distribution = "gaussian",
n.trees = 1000, shrinkage = lambda[i])
yhat = predict(df_boost, df_test, n.trees = 1000)
df_test_error[i] = mean((yhat - df_test$Salary)^2)
}
plot(lambda, df_test_error, type = "b", xlab = "Shrinkage", ylab = "Testing Set MeanSquareError")

[Figure: test set MSE plotted against shrinkage]
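
The shrinkage value that minimises the test MSE, which is reused in part (e) below, can be extracted directly; a short sketch:

lambda[which.min(df_test_error)]   # shrinkage with the lowest test MSE
min(df_test_error)                 # the corresponding test MSE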

(e) Which variable appear to be the most important predictors in the boosted model?

df_boost <- gbm(Salary ~ ., data = df_train, distribution = "gaussian",
                n.trees = 1000, shrinkage = lambda[which.min(df_test_error)])
summary(df_boost)

[Figure: relative influence of predictors in the boosted model]

## var rel.inf
## CAtBat CAtBat 25.2453320
## PutOuts PutOuts 8.9039446
## Walks Walks 7.6702068
## RBI RBI 7.5020740
## Assists Assists 7.5016586
## AtBat AtBat 5.6224284
## CHmRun CHmRun 5.2839753
## Runs Runs 4.5930488
## HmRun HmRun 4.4011313
## CRBI CRBI 3.8898374
## CRuns CRuns 3.7701424
## Errors Errors 3.5450901
## Hits Hits 3.4071681
## Years Years 3.4031648
## CWalks CWalks 2.2178861
## CHits CHits 1.5736620
## Division Division 0.9736067
## NewLeague NewLeague 0.3533409
## League League 0.1423017

‘CAtBat’ is the most important predictor with ‘PutOuts’ as the second most important.

(f) Apply bagging to the training set. What is the test set MSE for this approach.

set.seed(1)
df_bag <- randomForest(Salary ~ ., data = df_train, mtry = 19, ntree = 500)
pred_bag <- predict(df_bag, newdata = df_test)
mean((pred_bag - df_test$Salary)^2)

## [1] 0.2299324

The test set mean squared error for bagging is approximately 0.23.

(g) Apply random forests to the training set. What is the test set MSE for this approach.

set.seed(1)
df_rf <- randomForest(Salary ~ ., data = df_train, mtry = sqrt(13), ntree = 500)
pred_rf <- predict(df_rf, newdata = df_train)
mean((pred_rf - df_test$Salary)^2)

## [1] 1.393555
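
Note that pred_rf is computed on the training set while the squared errors are taken against the test responses, so the value above is not a test set MSE. A sketch of the intended calculation with the same fitted forest (result not shown):

pred_rf_test <- predict(df_rf, newdata = df_test)
mean((pred_rf_test - df_test$Salary)^2)   # test set MSE for the random forest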

Question 3. Generate a simulated two-class data set with 100 observations and two features in which there is
a visible but non-linear separation between the classes. Show that in this setting, a support vector machine
with a polynomial kernel (with degree greater than 1) or a radial kernel will outperform a support vector
classifier on the training data. Which technique performs best on the test data? Make plots and report
training and test error rates in order to back up your assertions.

set.seed(1)
p <- 3
X <- matrix(rnorm(100 * 2), ncol = 2)
X[1:30, ] <- X[1:30, ] + p
X[31:60, ] <- X[31:60, ] - p
Y <- c(rep(0, 60), rep(1, 40))
df <- data.frame(x = X, y = as.factor(Y))
plot(X, col = Y+10)

[Figure: simulated observations, X[,1] vs X[,2], coloured by class]

train <- sample(1:100,80)


df.train <- df[train,]
df.test <- df[-train,]
summary(df)

## x.1 x.2 y
## Min. :-4.3771 Min. :-4.91436 0:60
## 1st Qu.:-2.1470 1st Qu.:-2.48883 1:40
## Median : 0.1138 Median : 0.03554
## Mean : 0.1089 Mean :-0.03781
## 3rd Qu.: 2.3748 3rd Qu.: 2.36811
## Max. : 4.5953 Max. : 4.76729

SVM using a linear kernel

linSVM <- svm(y ~ ., data = df.train, kernel = ’linear’, scale = FALSE)


plot(linSVM, data = df.train, col=c(11,12))

[Figure: SVM classification plot, linear kernel]

tab <- table(predict=linSVM$fitted, truth=df.train$y)


error_rate <- (tab[1,2]+tab[2,1])/80
print(error_rate)

## [1] 0.3875

SVM using a polynomial kernel, since the linear kernel fits the training data poorly

polySVM <- svm(y ~ ., data = df.train, kernel = ’polynomial’, scale = FALSE)


plot(polySVM,data = df.train, col=c(11,12))

[Figure: SVM classification plot, polynomial kernel]

tab <- table(predict=polySVM$fitted, truth=df.train$y)


error_rate <- (tab[1,2]+tab[2,1])/80
print(error_rate)

## [1] 0.2

SVM using a radial kernel, since the polynomial kernel improves on the linear one but still misclassifies a fair number of training points

radSVM <- svm(y ~ ., data = df.train, kernel = ’radial’, scale = FALSE)


plot(radSVM,data = df.train, col=c(11,12))

[Figure: SVM classification plot, radial kernel]

tab <- table(predict=radSVM$fitted, truth=df.train$y)


error_rate <- (tab[1,2]+tab[2,1])/80
print(error_rate)

## [1] 0

Now let's compare the test error rates of the three SVM kernels (linear, polynomial, and radial).

predLin <- predict(linSVM, df.test)


tab <- table(predLin, actual = df.test$y)
error_rate_lin <- (tab[1,2]+tab[2,1])/80
print(error_rate_lin)

## [1] 0.1125

polyPred <- predict(polySVM, df.test)


tab <- table(polyPred, actual = df.test$y)
error_rate_poly <- (tab[1,2]+tab[2,1])/80
print(error_rate_poly)

## [1] 0.125

radPred <- predict(radSVM, df.test)
tab <- table(radPred, actual = df.test$y)
error_rate_rad <- (tab[1,2]+tab[2,1])/80
print(error_rate_rad)

## [1] 0.025
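
Two remarks on these test error rates. First, each one is divided by 80 even though df.test contains only 20 observations, so the absolute values understate the true error rates; the ranking of the kernels is unaffected because the same denominator is used in all three cases. A sketch of the same comparison that uses the test set size implicitly:

mean(predict(linSVM, df.test) != df.test$y)    # linear kernel test error
mean(predict(polySVM, df.test) != df.test$y)   # polynomial kernel test error
mean(predict(radSVM, df.test) != df.test$y)    # radial kernel test error

Second, by either calculation the radial kernel misclassifies the fewest test points (2 of 20, versus 9 for the linear and 10 for the polynomial kernel), so the radial-kernel SVM performs best on both the training and the test data.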
