Regression and Classification Analysis
11/25/2020
Contents
A. REGRESSION 1
B. REGRESSION 2
C. TIME SERIES
D. CLASSIFICATION
A. REGRESSION 1
library(MASS)
str(Boston)
library(lmtest)
##
## Attaching package: ’zoo’
library(auditor)
library(outliers)
library(hnp)
#data-split
target <- Boston$medv
x <- as.matrix(Boston[-ncol(Boston)])
n <- nrow(Boston)
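# NOTE: the objects xfinal, beta, seBeta, yhat and residual used below are built in
# chunks that are not echoed in this report. A minimal sketch of one consistent way
# to obtain them (an assumption, not the original code):
xfinal   <- x                                   # predictor matrix (intercept added by lm)
fit      <- lm(target ~ xfinal)                 # OLS fit of medv on the 13 predictors
beta     <- coef(fit)                           # estimated coefficients (incl. intercept)
seBeta   <- coef(summary(fit))[, "Std. Error"]  # their standard errors
yhat     <- fitted(fit)                         # fitted values
residual <- resid(fit)                          # residuals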
p <- ncol(xfinal)
t <- beta/seBeta
pValue <- 2*(1-pt(abs(t),df=n-p-1))
pValue <- round(pValue,4)
#all the beta(i) with p-values < the significance level are considered significant, since for them we reject the null hypothesis that beta(i) = 0
[Plot: residual vs. yhat]
qqplot(yhat,residual,df=n-p-1)
[Plot: Q-Q plot of residual against yhat]
#if the p-value is greater than 0.05, we fail to reject normality of the residuals
shapiro.test(residual)
##
## Shapiro-Wilk normality test
##
## data: residual
## W = 0.90138, p-value < 2.2e-16
#Durbin-Watson test
dwtest(target ~ xfinal)
##
## Durbin-Watson test
##
## data: target ~ xfinal
## DW = 1.0784, p-value = 0.7609
## alternative hypothesis: true autocorrelation is greater than 0
sumnumRes <- 0
sumRes <- sum(residual^2)
#manual Durbin-Watson statistic, used to test autocorrelation; values between 0 and 2 indicate positive autocorrelation
for(i in 2:length(residual)){
sumnumRes <- sumnumRes + (residual[i]-residual[i-1])^2
}
dw <- sumnumRes/sumRes
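# Equivalent vectorized computation of the Durbin-Watson statistic (a sketch added
# here; it reproduces the loop above without explicit indexing):
dwVectorized <- sum(diff(residual)^2) / sum(residual^2)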
#leverage verification
hotmatrixDiagonal <- diag(xfinal %*% solve(t(xfinal) %*% xfinal) %*% t(xfinal))
hotmatrixSumDiagonal <- sum(diag(xfinal %*% solve(t(xfinal) %*% xfinal) %*% t(xfinal)))
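# Rule-of-thumb check (a sketch, not part of the original report): an observation is
# usually flagged as high-leverage when its hat value exceeds twice the average hat
# value, i.e. 2 * sum(h_ii) / n.
leverageCutoff <- 2 * hotmatrixSumDiagonal / n
which(hotmatrixDiagonal > leverageCutoff)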
## Half-normal plot with simulated envelope generated assuming the residuals are
## normally distributed under the null hypothesis.
[Plot: half-normal plot of residuals with simulated envelope]
[Plot: standardizedResidual vs. yhat]
## [1] TRUE
## [1] 1
## Half-normal plot with simulated envelope generated assuming the residuals are
## normally distributed under the null hypothesis.
[Plot: half-normal plot of residuals with simulated envelope]
[Plot: residual vs. xfinal[1:numberRow, i]]
#partial-residual plot
plot(yhat,residual,main='Partial-Residual')
[Plot: Partial-Residual, residual vs. yhat]
[Plot: y2 vs. x2]
B. REGRESSION 2
Question 1:
library(lmtest)
data <- read.table('RegD1.txt')
ind <- seq(10,100,by=10)
train <- data[-ind,]
test <- data[ind,]
#X1 and X2 are the features; plot them against each other to check whether they are related or independent of one another
plot(train$X1,train$X2, xlab = "X1", ylab = "X2")
[Plot: X2 vs. X1 (training data)]
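The summary printed below comes from a chunk that is not echoed in the report; judging from the Call line, the fit was produced by something like the following sketch:
model <- lm(Y1 ~ X1 + X2, data = train)
summary(model)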
##
## Call:
## lm(formula = Y1 ~ X1 + X2, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1226 -1.3189 -0.0519 1.1825 3.3815
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.77301 0.33680 8.233 1.66e-12 ***
## X1 0.19480 0.03465 5.623 2.24e-07 ***
## X2 -0.15753 0.61320 -0.257 0.798
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 1.711 on 87 degrees of freedom
## Multiple R-squared: 0.2666, Adjusted R-squared: 0.2498
## F-statistic: 15.81 on 2 and 87 DF, p-value: 1.387e-06
Diagnostics: 1. Checking constant variance by plotting the residuals against the fitted values: the residuals appear randomly scattered, which satisfies the assumption. 2. Checking normality through (a) a Q-Q plot of the residuals and (b) the Shapiro-Wilk test: since the p-value > 0.05, the residuals can be treated as normally distributed. 3. Checking for uncorrelated errors using the Durbin-Watson test: the errors are uncorrelated.
#Diagnostics 1
plot(model,pch=15,which=1)
[Plot: Residuals vs Fitted, lm(Y1 ~ X1 + X2)]
#Diagnostics 2
plot(model, pch=15, which=2)
[Plot: Normal Q-Q, lm(Y1 ~ X1 + X2)]
shapiro.test(model$residuals)
##
## Shapiro-Wilk normality test
##
## data: model$residuals
## W = 0.97386, p-value = 0.06631
#Diagnostics 3
dwtest(model)
##
## Durbin-Watson test
##
## data: model
## DW = 2.1766, p-value = 0.7891
## alternative hypothesis: true autocorrelation is greater than 0
b) RegD2 Dataset
#plot the features against each other to check whether they are related or independent of one another
plot(train$X1,train$X2,xlab = "X1", ylab = "X2")
[Plot: X2 vs. X1 (training data)]
[Plots: X3 vs. X1 (training data)]
##
## Call:
## lm(formula = Y1 ~ X1 + X2 + X3, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.21901 -0.06800 -0.01703 0.07815 0.28281
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.037081 0.027754 -1.336 0.185
## X1 0.201507 0.002523 79.857 <2e-16 ***
## X2 -0.454286 0.038373 -11.839 <2e-16 ***
## X3 0.705114 0.003951 178.466 <2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 0.1068 on 86 degrees of freedom
## Multiple R-squared: 0.9977, Adjusted R-squared: 0.9976
## F-statistic: 1.219e+04 on 3 and 86 DF, p-value: < 2.2e-16
Diagnostics: 1. Checking constant variance by plotting the residuals against the fitted values: the residuals appear randomly scattered, which satisfies the assumption. 2. Checking normality through (a) a Q-Q plot of the residuals and (b) the Shapiro-Wilk test: since the p-value > 0.05, the residuals can be treated as normally distributed. 3. Checking for uncorrelated errors using the Durbin-Watson test: the errors are uncorrelated.
#Diagnostics 1
plot(model,pch=15,which=1)
[Plot: Residuals vs Fitted, lm(Y1 ~ X1 + X2 + X3)]
#Diagnostics 2
plot(model, pch=15, which=2)
[Plot: Normal Q-Q, lm(Y1 ~ X1 + X2 + X3)]
shapiro.test(model$residuals)
##
## Shapiro-Wilk normality test
##
## data: model$residuals
## W = 0.98497, p-value = 0.3886
#Diagnostics 3
dwtest(model)
##
## Durbin-Watson test
##
## data: model
## DW = 2.2712, p-value = 0.9044
## alternative hypothesis: true autocorrelation is greater than 0
c) RegD3 Dataset
#plot the features against each other to check whether they are related or independent of one another
plot(train$X1,train$X2,xlab = "X1", ylab = "X2")
[Plot: X2 vs. X1 (training data)]
[Plots: X4 vs. X1 (training data)]
##
## Call:
## lm(formula = Y1 ~ X1 + X2 + X4, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.210075 -0.064858 -0.001555 0.069632 0.256274
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.066625 0.027747 2.401 0.0185 *
## X1 0.199642 0.002292 87.102 <2e-16 ***
## X2 -1.051784 0.037636 -27.946 <2e-16 ***
## X4 1.160520 0.005730 202.526 <2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 0.09636 on 86 degrees of freedom
## Multiple R-squared: 0.9982, Adjusted R-squared: 0.9982
## F-statistic: 1.618e+04 on 3 and 86 DF, p-value: < 2.2e-16
Diagnostics: 1. Checking constant variance by plotting the residuals against the fitted values: the residual pattern appears non-linear. 2. Checking normality through (a) a Q-Q plot of the residuals and (b) the Shapiro-Wilk test: since the p-value > 0.05, the residuals can be treated as normally distributed. 3. Checking for uncorrelated errors using the Durbin-Watson test: the errors are correlated.
#Diagnostics 1
plot(model,pch=15,which=1)
[Plot: Residuals vs Fitted, lm(Y1 ~ X1 + X2 + X4)]
#Diagnostics 2
plot(model, pch=15, which=2)
[Plot: Normal Q-Q, lm(Y1 ~ X1 + X2 + X4)]
shapiro.test(model$residuals)
##
## Shapiro-Wilk normality test
##
## data: model$residuals
## W = 0.9904, p-value = 0.7596
#Diagnostics 3
dwtest(model)
##
## Durbin-Watson test
##
## data: model
## DW = 1.5493, p-value = 0.01472
## alternative hypothesis: true autocorrelation is greater than 0
Question 2: Diagnostics (after choosing only X5 and X6 as the important features from summary(model)): 1. Checking constant variance by plotting the residuals against the fitted values: the residuals appear randomly scattered, which satisfies the assumption. 2. Checking normality through (a) a Q-Q plot of the residuals and (b) the Shapiro-Wilk test: since the p-value < 0.05, the residuals are not normally distributed.
data <- read.table('RegD14.txt',header = TRUE)
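modelA, which is summarized and diagnosed below, is defined in a chunk that is not echoed; judging from the Call line in the output, it was fit as in this sketch:
modelA <- lm(Y ~ X5 + X6, data = data)
summary(modelA)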
##
## Call:
## lm(formula = Y ~ X5 + X6, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.862e-13 -1.882e-13 -7.120e-14 -2.840e-14 4.200e-12
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.145e-13 1.173e-12 -4.390e-01 0.663
## X5 1.000e+00 1.057e-14 9.462e+13 <2e-16 ***
## X6 1.000e+00 9.247e-15 1.081e+14 <2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 6.3e-13 on 47 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 3.456e+29 on 2 and 47 DF, p-value: < 2.2e-16
plot(modelA,pch=15,which=1)
[Plot: Residuals vs Fitted, lm(Y ~ X5 + X6)]
shapiro.test(modelA$residuals)
##
## Shapiro-Wilk normality test
##
## data: modelA$residuals
## W = 0.26744, p-value = 2.2e-14
[Plot: Normal Q-Q, lm(Y ~ X5 + X6)]
library(car)
plot(modelA,which=5)
[Plot: Residuals vs Leverage, lm(Y ~ X5 + X6)]
[Plot: Cook's distance vs. observation number, lm(Y ~ X5 + X6)]
[Plot: QQ plot of studentized residuals of modelA against t quantiles]
## [1] 1 44
leveragePlots(modelA)
[Plots: leverage plots of Y | others against X5 | others and X6 | others]
There are 2 high-leverage points (observations 11 and 44), which stand apart in the residuals-vs-leverage and leverage plots; 1 outlier (observation 1, an extreme value of y hat); and 1 influential point (observation 1), since it lies beyond the Cook's distance contour in the residuals-vs-leverage plot.
-Check and comment on the structure of the relationship between the predictors and the response.
summary(modelA)
##
## Call:
## lm(formula = Y ~ X5 + X6, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.862e-13 -1.882e-13 -7.120e-14 -2.840e-14 4.200e-12
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.145e-13 1.173e-12 -4.390e-01 0.663
## X5 1.000e+00 1.057e-14 9.462e+13 <2e-16 ***
## X6 1.000e+00 9.247e-15 1.081e+14 <2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 6.3e-13 on 47 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 3.456e+29 on 2 and 47 DF, p-value: < 2.2e-16
The p-value is < 0.05 along with a high F-statistic value, so rejecting the null hypothesis leads us to conclude that there is a potential relationship between the predictors and the response. The low residual standard error (an estimate of the standard deviation of the irreducible error) means a low deviation of our model from the true regression line. An R-squared of 1 implies that 100% of the variance in the response is explained, which indicates a very good fit.
-Compute and comment on the condition numbers.
x = model.matrix(modelA)
e = eigen(t(x) %*% x)   # eigenvalues of the cross-product matrix X'X
print(e$val)
sqrt(e$val[1]/e$val)    # condition numbers: sqrt(largest eigenvalue / each eigenvalue)
There is a wide range in the eigenvalues and several condition numbers are large. This means that problems are being caused by more than just one linear combination of the predictors.
round(cor(data[,c(-7)]),3)
## X1 X2 X3 X4 X5 X6
## X1 1.000 -0.371 0.870 0.593 -0.410 -0.349
## X2 -0.371 1.000 -0.001 -0.213 0.064 0.095
## X3 0.870 -0.001 1.000 0.617 -0.477 -0.401
## X4 0.593 -0.213 0.617 1.000 -0.893 -0.869
## X5 -0.410 0.064 -0.477 -0.893 1.000 0.970
## X6 -0.349 0.095 -0.401 -0.869 0.970 1.000
vif(modelA)
## X5 X6
## 17.06391 17.06391
The VIF test is used to detect (and then remove) multicollinearity: VIF_i > 10 indicates serious multicollinearity for that predictor.
Question 3. Use the data “RegD4.txt” to fit a model using the following methods. (a) Least squares.
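The data frame dat and the least-squares fit summarized below are created in chunks that are not echoed; a minimal sketch (the object name modela and the header option are assumptions):
dat <- read.table('RegD4.txt', header = TRUE)
modela <- lm(Y1 ~ ., data = dat)   # (a) ordinary least squares
summary(modela)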
##
## Call:
## lm(formula = Y1 ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.2377 -1.7117 -0.4551 2.3614 5.6978
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -39.9197 11.8960 -3.356 0.00375 **
## X1 0.7156 0.1349 5.307 5.8e-05 ***
## X2 1.2953 0.3680 3.520 0.00263 **
## X3 -0.1521 0.1563 -0.973 0.34405
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 3.243 on 17 degrees of freedom
## Multiple R-squared: 0.9136, Adjusted R-squared: 0.8983
## F-statistic: 59.9 on 3 and 17 DF, p-value: 3.016e-09
library(quantreg)
##
## Attaching package: ’SparseM’
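The quantile-regression output below (median regression, tau = 0.5) comes from an un-echoed chunk; a sketch matching its Call line (the name modelb is an assumption):
modelb <- rq(Y1 ~ ., data = dat)
summary(modelb)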
##
## Call: rq(formula = Y1 ~ ., data = dat)
##
## tau: [1] 0.5
##
## Coefficients:
## coefficients lower bd upper bd
## (Intercept) -39.68986 -41.61973 -29.67754
## X1 0.83188 0.51278 1.14117
## X2 0.57391 0.32182 1.41090
## X3 -0.06087 -0.21348 -0.02891
library(MASS)
modelc <- rlm(Y1~., data = dat)
summary(modelc)
##
## Call: rlm(formula = Y1 ~ ., data = dat)
## Residuals:
## Min 1Q Median 3Q Max
## -8.91753 -1.73127 0.06187 1.54306 6.50163
##
## Coefficients:
## Value Std. Error t value
## (Intercept) -41.0265 9.8073 -4.1832
## X1 0.8294 0.1112 7.4597
## X2 0.9261 0.3034 3.0524
## X3 -0.1278 0.1289 -0.9922
##
## Residual standard error: 2.441 on 17 degrees of freedom
library(rrcov)
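The least-trimmed-squares output below also comes from an un-echoed chunk; a sketch matching its Call line (ltsReg is supplied by robustbase, which rrcov attaches; the name modeld is an assumption):
modeld <- ltsReg(Y1 ~ ., data = dat)
summary(modeld)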
##
## Call:
## ltsReg.formula(formula = Y1 ~ ., data = dat)
##
## Residuals (from reweighted LS):
## Min 1Q Median 3Q Max
## -2.5065 -0.4236 0.0000 0.5764 1.9342
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## Intercept -37.65246 4.73205 -7.957 2.37e-06 ***
## X1 0.79769 0.06744 11.828 2.48e-08 ***
## X2 0.57734 0.16597 3.479 0.00408 **
## X3 -0.06706 0.06160 -1.089 0.29611
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 1.253 on 13 degrees of freedom
## Multiple R-Squared: 0.975, Adjusted R-squared: 0.9692
## F-statistic: 169 on 3 and 13 DF, p-value: 1.159e-10
Question 4. Use the data to fit a model with Y as the response and only X3, X4, and X5 as predictors. Use
the Box-Cox method to determine the best transformation on the response.
library(faraway)
##
## Attaching package: ’faraway’
library(MASS)
data <- read.table('RegD7.txt')
model <- lm(Y~X3+X4+X5,data=data)
plot(model)
[Plot: Residuals vs Fitted, lm(Y ~ X3 + X4 + X5)]
[Plot: Normal Q-Q, lm(Y ~ X3 + X4 + X5)]
[Plot: Scale-Location, lm(Y ~ X3 + X4 + X5)]
[Plot: Residuals vs Leverage, lm(Y ~ X3 + X4 + X5)]
b <- boxcox(model)
[Plot: Box-Cox profile log-likelihood with 95% interval for lambda]
lam = b$x[which(b$y==max(b$y))]
model.inv <- lm((Y)^-1 ~ X3+X4+X5,data=data)
plot(model.inv)
[Plot: Residuals vs Fitted, lm((Y)^-1 ~ X3 + X4 + X5)]
[Plot: Normal Q-Q, lm((Y)^-1 ~ X3 + X4 + X5)]
[Plot: Scale-Location, lm((Y)^-1 ~ X3 + X4 + X5)]
[Plot: Residuals vs Leverage, lm((Y)^-1 ~ X3 + X4 + X5)]
The maximum of the profile log-likelihood occurs at lambda = 0.2626263. The original model shows a non-linear trend; model.inv transforms the response and gives a more linear fit, with a certain increase in the number of outliers after the transformation.
Question 5. Use transformations to find a good model with Y as the response and X1 and X2 as predictors
using the data .
##
## Call:
## lm(formula = Y ~ X1 + X2, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.4065 -2.6493 -0.2876 2.2003 8.4847
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
## X1 4.7082 0.2643 17.816 < 2e-16 ***
## X2 0.3393 0.1302 2.607 0.0145 *
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 3.882 on 28 degrees of freedom
## Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
## F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
#Log-transformation
lm_log <- lm(log1p(Y) ~ log1p(X1) + log1p(X2), data = data)
summary(lm_log)
##
## Call:
## lm(formula = log1p(Y) ~ log1p(X1) + log1p(X2), data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.16219 -0.04721 0.00006 0.06261 0.12252
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.7737 0.7798 -8.686 1.96e-09 ***
## log1p(X1) 2.0562 0.0781 26.329 < 2e-16 ***
## log1p(X2) 1.0766 0.2001 5.379 9.85e-06 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 0.07868 on 28 degrees of freedom
## Multiple R-squared: 0.9774, Adjusted R-squared: 0.9758
## F-statistic: 605.5 on 2 and 28 DF, p-value: < 2.2e-16
plot(lm_log)
[Plot: Residuals vs Fitted, lm(log1p(Y) ~ log1p(X1) + log1p(X2))]
[Plot: Normal Q-Q, lm(log1p(Y) ~ log1p(X1) + log1p(X2))]
[Plot: Scale-Location, lm(log1p(Y) ~ log1p(X1) + log1p(X2))]
[Plot: Residuals vs Leverage, lm(log1p(Y) ~ log1p(X1) + log1p(X2))]
The maximum likelihood occurs at lambda = 0.303. Applying the log transformation turns the trend into a linear one; the data were skewed, which is why the log transformation was chosen.
Question 6. Use the data to fit a linear model. Implement the following variable selection methods to
determine the “best” model. (a) Backward Elimination. (b) AIC, AICC, BIC. (c) R2 (d) Mallows Cp.
library(olsrr)
##
## Attaching package: ’olsrr’
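The backward-elimination results below are produced by an un-echoed chunk; a sketch of the likely call, assuming the Question 6 data were read into a data frame named data with response Y and a full model fit first (the 0.3 removal threshold matches the message in the output):
model <- lm(Y ~ ., data = data)    # full model with candidate predictors X1-X8
back <- ols_step_backward_p(model, prem = 0.3, details = TRUE)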
## Backward Elimination Method
## ---------------------------
##
## Candidate Terms:
##
## 1 . X1
## 2 . X2
## 3 . X3
## 4 . X4
## 5 . X5
## 6 . X6
## 7 . X7
## 8 . X8
##
## We are eliminating variables based on p value...
##
## - X7
##
## Backward Elimination: Step 1
##
## Variable X7 Removed
##
## Model Summary
## --------------------------------------------------------------
## R 0.809 RMSE 0.705
## R-Squared 0.654 Coef. Var 28.436
## Adj. R-Squared 0.627 MSE 0.497
## Pred R-Squared 0.584 MAE 0.521
## --------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## -------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## -------------------------------------------------------------------
## Regression 83.713 7 11.959 24.078 0.0000
## Residual 44.204 89 0.497
## Total 127.918 96
## -------------------------------------------------------------------
##
## Parameter Estimates
## ---------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ---------------------------------------------------------------------------------------
## (Intercept) 0.954 0.829 1.150 0.253 -0.694 2.602
## X1 0.592 0.086 0.604 6.879 0.000 0.421 0.762
## X2 0.448 0.168 0.193 2.672 0.009 0.115 0.782
## X3 -0.019 0.011 -0.125 -1.747 0.084 -0.041 0.003
## X4 0.108 0.058 0.135 1.853 0.067 -0.008 0.223
## X5 0.758 0.241 0.272 3.140 0.002 0.278 1.237
## X6 -0.104 0.090 -0.127 -1.155 0.251 -0.284 0.075
## X8 0.005 0.003 0.130 1.549 0.125 -0.002 0.012
## ---------------------------------------------------------------------------------------
##
##
##
## No more variables satisfy the condition of p value = 0.3
##
##
## Variables Removed:
##
## - X7
##
##
## Final Model Output
## ------------------
##
## Model Summary
## --------------------------------------------------------------
## R 0.809 RMSE 0.705
## R-Squared 0.654 Coef. Var 28.436
## Adj. R-Squared 0.627 MSE 0.497
## Pred R-Squared 0.584 MAE 0.521
## --------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## -------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## -------------------------------------------------------------------
## Regression 83.713 7 11.959 24.078 0.0000
## Residual 44.204 89 0.497
## Total 127.918 96
## -------------------------------------------------------------------
##
## Parameter Estimates
## ---------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ---------------------------------------------------------------------------------------
## (Intercept) 0.954 0.829 1.150 0.253 -0.694 2.602
## X1 0.592 0.086 0.604 6.879 0.000 0.421 0.762
## X2 0.448 0.168 0.193 2.672 0.009 0.115 0.782
## X3 -0.019 0.011 -0.125 -1.747 0.084 -0.041 0.003
## X4 0.108 0.058 0.135 1.853 0.067 -0.008 0.223
## X5 0.758 0.241 0.272 3.140 0.002 0.278 1.237
## X6 -0.104 0.090 -0.127 -1.155 0.251 -0.284 0.075
## X8 0.005 0.003 0.130 1.549 0.125 -0.002 0.012
## ---------------------------------------------------------------------------------------
plot(back)
## geom_path: Each group consists of only one observation. Do you need to adjust
## the group aesthetic?
## geom_path: Each group consists of only one observation. Do you need to adjust
## the group aesthetic?
## geom_path: Each group consists of only one observation. Do you need to adjust
## the group aesthetic?
## geom_path: Each group consists of only one observation. Do you need to adjust
## the group aesthetic?
## geom_path: Each group consists of only one observation. Do you need to adjust
## the group aesthetic?
## geom_path: Each group consists of only one observation. Do you need to adjust
## the group aesthetic?
[Plots: R-Square, C(p), SBIC, and SBC for the final backward-elimination model]
Backward elimination builds a regression model from a set of candidate predictor variables by removing predictors based on p-values, in a stepwise manner, until no variable is left that can be removed.
[Plots: R-Square, C(p), SBIC, and SBC against the number of predictors for the best-subset models]
Best-subset selection finds the subset of predictors that does best at meeting some well-defined objective criterion, such as having the largest R2 value or the smallest MSE, Mallows' Cp, or AIC.
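The best-subset plots above come from another un-echoed chunk; a sketch of how such plots are typically produced with olsrr (an assumption, not the original code):
best <- ols_step_best_subset(model)
best          # R-Square, adjusted R-Square, C(p), AIC, SBIC and SBC for each subset size
plot(best)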
Question 7. Consider the data . Remove every tenth observation from the data for use as a test sample.
Use the remaining data as a training sample building the following models. (a) Linear regression with all
predictors.
##
## Call:
## lm(formula = Y ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.4456 -0.3758 -0.1407 0.2172 15.0780
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.487e+02 1.108e+01 40.500 <2e-16 ***
## X1 -4.105e+02 8.517e+00 -48.196 <2e-16 ***
## X2 1.347e-02 1.003e-02 1.344 0.180
## X3 9.256e-03 1.653e-02 0.560 0.576
## X4 -5.402e-03 2.925e-02 -0.185 0.854
## X5 -3.045e-02 7.215e-02 -0.422 0.673
## X6 2.687e-02 3.027e-02 0.888 0.376
## X7 1.893e-02 3.309e-02 0.572 0.568
## X8 2.415e-02 4.493e-02 0.537 0.591
## X9 -1.658e-02 4.466e-02 -0.371 0.711
## X10 -5.181e-03 7.474e-02 -0.069 0.945
## X11 -8.786e-02 6.738e-02 -1.304 0.194
## X12 -5.766e-02 5.314e-02 -1.085 0.279
## X13 3.737e-02 6.132e-02 0.609 0.543
## X14 6.042e-03 1.674e-01 0.036 0.971
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 1.299 on 227 degrees of freedom
## Multiple R-squared: 0.9773, Adjusted R-squared: 0.9759
## F-statistic: 698.8 on 14 and 227 DF, p-value: < 2.2e-16
step(model)
## Start: AIC=141.14
## Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10 + X11 +
## X12 + X13 + X14
##
## Df Sum of Sq RSS AIC
## - X14 1 0.0 383.1 139.14
## - X10 1 0.0 383.1 139.14
## - X4 1 0.1 383.1 139.17
## - X9 1 0.2 383.3 139.28
## - X5 1 0.3 383.4 139.33
## - X8 1 0.5 383.5 139.44
## - X3 1 0.5 383.6 139.47
## - X7 1 0.6 383.6 139.49
## - X13 1 0.6 383.7 139.53
## - X6 1 1.3 384.4 139.98
## - X12 1 2.0 385.0 140.39
## - X11 1 2.9 385.9 140.94
## - X2 1 3.0 386.1 141.05
## <none> 383.1 141.14
## - X1 1 3919.7 4302.8 724.49
##
## Step: AIC=139.14
## Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10 + X11 +
## X12 + X13
##
## Df Sum of Sq RSS AIC
## - X10 1 0.0 383.1 137.14
## - X4 1 0.1 383.1 137.17
## - X9 1 0.2 383.3 137.29
## - X5 1 0.3 383.4 137.33
## - X8 1 0.5 383.5 137.45
## - X3 1 0.5 383.6 137.48
## - X7 1 0.6 383.6 137.49
## - X13 1 0.7 383.7 137.56
## - X6 1 1.3 384.4 137.98
## - X12 1 2.0 385.0 138.39
## - X11 1 3.0 386.1 139.03
## <none> 383.1 139.14
## - X2 1 3.5 386.6 139.35
## - X1 1 4080.0 4463.1 731.35
##
## Step: AIC=137.14
## Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X11 + X12 +
## X13
##
## Df Sum of Sq RSS AIC
## - X4 1 0.1 383.1 135.18
## - X9 1 0.3 383.4 135.32
## - X5 1 0.3 383.4 135.33
## - X8 1 0.5 383.5 135.45
## - X3 1 0.6 383.6 135.49
## - X7 1 0.6 383.6 135.50
## - X13 1 0.7 383.7 135.56
## - X6 1 1.4 384.4 136.00
## - X12 1 2.0 385.0 136.39
## - X11 1 3.2 386.2 137.14
## <none> 383.1 137.14
## - X2 1 3.8 386.8 137.52
## - X1 1 4080.9 4464.0 729.39
##
## Step: AIC=135.18
## Y ~ X1 + X2 + X3 + X5 + X6 + X7 + X8 + X9 + X11 + X12 + X13
##
## Df Sum of Sq RSS AIC
## - X9 1 0.2 383.4 133.34
## - X5 1 0.3 383.4 133.37
## - X3 1 0.5 383.6 133.51
## - X7 1 0.6 383.7 133.55
## - X8 1 0.6 383.7 133.57
## - X13 1 0.7 383.8 133.60
## - X6 1 1.6 384.7 134.17
## - X12 1 2.0 385.1 134.41
## - X11 1 3.2 386.3 135.17
## <none> 383.1 135.18
## - X2 1 3.9 387.0 135.62
## - X1 1 4097.2 4480.3 728.28
##
## Step: AIC=133.34
## Y ~ X1 + X2 + X3 + X5 + X6 + X7 + X8 + X11 + X12 + X13
##
## Df Sum of Sq RSS AIC
## - X5 1 0.4 383.7 131.56
## - X8 1 0.4 383.8 131.59
## - X3 1 0.5 383.8 131.63
## - X7 1 0.6 383.9 131.69
## - X13 1 0.7 384.0 131.75
## - X6 1 1.8 385.2 132.47
## - X12 1 2.5 385.9 132.90
## <none> 383.4 133.34
## - X11 1 3.2 386.6 133.35
## - X2 1 5.3 388.7 134.66
## - X1 1 4172.0 4555.4 730.30
##
## Step: AIC=131.56
## Y ~ X1 + X2 + X3 + X6 + X7 + X8 + X11 + X12 + X13
##
## Df Sum of Sq RSS AIC
## - X3 1 0.3 384.0 129.73
## - X7 1 0.5 384.2 129.85
## - X13 1 0.5 384.2 129.88
## - X8 1 0.6 384.3 129.92
## - X6 1 1.8 385.5 130.67
## - X12 1 2.8 386.5 131.32
## - X11 1 3.1 386.8 131.51
## <none> 383.7 131.56
## - X2 1 5.0 388.7 132.68
## - X1 1 4328.3 4712.1 736.48
##
## Step: AIC=129.73
## Y ~ X1 + X2 + X6 + X7 + X8 + X11 + X12 + X13
##
## Df Sum of Sq RSS AIC
## - X13 1 0.7 384.7 128.14
## - X7 1 0.8 384.8 128.21
## - X8 1 1.7 385.7 128.78
## - X12 1 2.5 386.5 129.32
## - X6 1 2.6 386.6 129.38
## - X11 1 2.8 386.8 129.51
## <none> 384.0 129.73
## - X2 1 4.7 388.7 130.70
## - X1 1 4751.9 5135.9 755.33
##
## Step: AIC=128.14
## Y ~ X1 + X2 + X6 + X7 + X8 + X11 + X12
##
## Df Sum of Sq RSS AIC
## - X7 1 0.7 385.3 126.57
## - X8 1 1.6 386.3 127.16
## - X12 1 1.9 386.6 127.34
## - X11 1 2.6 387.3 127.78
## <none> 384.7 128.14
## - X6 1 3.3 387.9 128.19
## - X2 1 4.4 389.1 128.90
## - X1 1 4756.7 5141.4 753.59
##
## Step: AIC=126.57
## Y ~ X1 + X2 + X6 + X8 + X11 + X12
##
## Df Sum of Sq RSS AIC
## - X12 1 2.1 387.4 125.89
## - X11 1 2.8 388.1 126.33
## <none> 385.3 126.57
## - X8 1 5.3 390.6 127.87
## - X2 1 6.4 391.7 128.56
## - X6 1 8.3 393.6 129.74
## - X1 1 7275.0 7660.3 848.08
##
## Step: AIC=125.89
## Y ~ X1 + X2 + X6 + X8 + X11
##
## Df Sum of Sq RSS AIC
## <none> 387.4 125.89
## - X11 1 3.4 390.9 126.02
## - X8 1 4.0 391.4 126.37
## - X6 1 6.5 394.0 127.94
## - X2 1 7.3 394.8 128.42
## - X1 1 7280.1 7667.5 846.31
##
## Call:
## lm(formula = Y ~ X1 + X2 + X6 + X8 + X11, data = train)
##
## Coefficients:
## (Intercept) X1 X2 X6 X8 X11
## 448.50931 -412.80035 0.01566 0.03971 0.03653 -0.08502
##
## Call:
## lm(formula = Y ~ X1 + X2 + X6 + X8 + X11, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3588 -0.3912 -0.1131 0.2695 15.3199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.485e+02 7.713e+00 58.146 <2e-16 ***
## X1 -4.128e+02 6.199e+00 -66.592 <2e-16 ***
## X2 1.566e-02 7.418e-03 2.111 0.0358 *
## X6 3.971e-02 1.990e-02 1.996 0.0471 *
## X8 3.653e-02 2.341e-02 1.561 0.1200
## X11 -8.502e-02 5.889e-02 -1.444 0.1501
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 1.281 on 236 degrees of freedom
## Multiple R-squared: 0.9771, Adjusted R-squared: 0.9766
## F-statistic: 2011 on 5 and 236 DF, p-value: < 2.2e-16
Y ~ X1 + X2 + X6 + X8 + X11 gives us the lowest AIC value. Out of all the best-subset results shown above, we look at the information-criterion plots and select the set of features with the lowest value.
(c) Principal Component Regression (PCR)
library(pls)
##
## Attaching package: ’pls’
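pcr_model is fit in an un-echoed chunk; a minimal sketch consistent with the validation plot below (the cross-validation option is an assumption suggested by the word "validation" in the plot title):
pcr_model <- pcr(Y ~ ., data = train, scale = TRUE, validation = "CV")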
plot(pcr_model)
[Plot: Y, 14 comps, validation; predicted vs. measured]
## [1] 1
library(caret)
##
## Attaching package: ’lattice’
##
## Attaching package: ’caret’
## The following object is masked from ’package:pls’:
##
## R2
library(pls)
modelpls <- train(
Y~., data = train, method = "pls",
scale = TRUE,
trControl = trainControl("cv", number = 10),
tuneLength = 10
)
predy <- predict(modelpls,test)
actuals_and_preds <- data.frame(cbind(actuals = test$Y, predicteds = predy))
min_max_accuracy <- mean(apply(actuals_and_preds, 1, min) / apply(actuals_and_preds, 1, max))
print(min_max_accuracy)
## [1] 1
(e)Ridge Regression
library(glmnet)
x_var <- data.matrix(train[, c("X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8", "X9", "X10", "X11", "X12", "X13", "X14")])
y_var <- train[,'Y']
# Setting the range of lambda values
lambda_seq <- 10^seq(2, -2, by = -.1)
# Using glmnet function to build the ridge regression in r
fit <- glmnet(x_var, y_var, alpha = 0, lambda = lambda_seq)
# Checking the model
summary(fit)
# Using cross validation glmnet
ridge_cv <- cv.glmnet(x_var, y_var, alpha = 0, lambda = lambda_seq)
# Best lambda value
best_lambda <- ridge_cv$lambda.min
best_fit <- ridge_cv$glmnet.fit
pred <- predict(best_fit, s = best_lambda, newx = data.matrix(test[, c("X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8", "X9", "X10", "X11", "X12", "X13", "X14")]))
actual_preds <- data.frame(cbind(actuals = test$Y, predicted = pred))
correlation_accuracy <- cor(actual_preds)
correlation_accuracy
## actuals X1
## actuals 1.000000 0.998892
## X1 0.998892 1.000000
## [1] 0.9757787
The accuracy of Ridge Regression is 97% when run on the test data set.
C. TIME SERIES
library("tseries")
## Registered S3 method overwritten by ’quantmod’:
## method from
## as.zoo.data.frame zoo
library("forecast")
library("TSA")
##
## Attaching package: ’TSA’
library(ggplot2)
[Plot: time series ts1]
adf.test(ts1)
##
## Augmented Dickey-Fuller Test
##
## data: ts1
## Dickey-Fuller = -5.0454, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary
acf(ts1)
[Plot: ACF of ts1]
pacf(ts1)
[Plot: PACF of ts1]
eacf(ts1)
## AR/MA
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 x x x x x x x x x x o o o o
## 1 o o o o o o o o o o o o o o
## 2 x o o o o o o o o o o o o o
## 3 x x o o o o o o o o o o o o
## 4 o x o o o o o o o o o o o o
## 5 o x o o o o o o o o o o o o
## 6 o x o o o o o o o o o o o o
## 7 o x x x o o o o o o o o o o
The ACF plot tails off and the PACF plot cuts off after lag 1. In the EACF plot, the first zero at the top left determines the values of p and q in the ARMA(p,q) model; here p = 1 and q = 0, which indicates an AR(1) model. This can also be verified using auto.arima.
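A quick check with forecast::auto.arima (not echoed in the report) would look like the sketch below; given the reading of the ACF, PACF and EACF above, it should select an AR(1)-type model.
auto.arima(ts1)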
[Plot: time series ts2]
##
## Augmented Dickey-Fuller Test
##
## data: ts2
## Dickey-Fuller = -3.992, Lag order = 4, p-value = 0.01242
## alternative hypothesis: stationary
acf(ts2)
[Plot: ACF of ts2]
pacf(ts2)
[Plot: PACF of ts2]
eacf(ts2)
## AR/MA
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 x x x o o o o o o o o o o o
## 1 x x x o x o o o o o o o o o
## 2 x o o o o o o o o o o o o o
## 3 x x o o o o o o o o o o o o
## 4 x x o o o o o o o o o o o o
## 5 x x o x o o o o o o o o o o
## 6 o o o o o o o o o o o o o o
## 7 x o o o o o o o o o o o o o
## Series: ts2
## ARIMA(2,0,0) with zero mean
##
## Coefficients:
## ar1 ar2
## 1.1051 -0.3747
## s.e. 0.0932 0.0939
##
## sigma^2 estimated as 1.619: log likelihood=-165.64
## AIC=337.28 AICc=337.53 BIC=345.09
Since the p-value in the Dickey-Fuller test is < 0.05, the series is stationary and can be modelled as an ARMA(p,q) process.
The ACF plot tails off, so it is an AR(p) process, and in the PACF plot the cut-off takes place after a lag of 2, which gives p.
The EACF plot shows the top-left 0 at (2,1), so the order is maintained, and (2,0) can also be the choice of p and q since it neighbours (2,1) in the EACF table. Thus it is an ARMA(2,0) model.
[Plot: time series ts3]
acf(ts3)
[Plot: ACF of ts3]
pacf(ts3)
[Plot: PACF of ts3]
eacf(ts3)
## AR/MA
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 x o o o x o o o o o o o o o
## 1 x o o o x o o o o o o o o o
## 2 x o o o x o o o o o o o o o
## 3 x o o o x o o o o o o o o o
## 4 o o o x o o o o o o o o o o
## 5 x o o x o x o o o o o o o o
## 6 o o o o o x o o o o o o o o
## 7 x x o o o x x o o o o o o o
Based on the Dickey-Fuller test, the p-value is < 0.05, which indicates that the data are stationary and the model is ARMA(p,q).
The PACF plot tails off, with a slight excursion at lag 4, so the order of p can be taken as 0 based on the EACF plot, where the top-left 0 is located at (0,1); this gives an order of q = 1, which is also apparent in the ACF plot, which cuts off after a lag of 1. Therefore, it is an ARMA(0,1) model.
[Plot: time series ts4]
adf.test(ts4)
##
## Augmented Dickey-Fuller Test
##
## data: ts4
## Dickey-Fuller = -4.8092, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary
acf(ts4)
[Plot: ACF of ts4]
pacf(ts4)
[Plot: PACF of ts4]
eacf(ts4)
## AR/MA
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 o x o o o o o o o o o o o o
## 1 x x o o x o o o o o o o o o
## 2 x x o x o o o o o o o o o o
## 3 x x x x o o o o o o o o o o
## 4 x o x x o o o o o o o o o o
## 5 x o o o x o o o o o o o o o
## 6 x o o o x o o o o o o o o o
## 7 x x o o o o x o o o o o o o
Based on the Dickey-Fuller test, the p-value is < 0.05, which indicates that the data are stationary and the model is ARMA(p,q). In the EACF plot, the 0 nearest to the top-left corner suggests an ARMA(1,2) series, which is adjacent to (0,2) (verified by auto.arima). From the ACF plot, the significant values drop off after lag 2, which gives q = 2.
[Plot: time series ts5]
adf.test(ts5)
##
## Augmented Dickey-Fuller Test
##
## data: ts5
## Dickey-Fuller = -1.893, Lag order = 4, p-value = 0.621
## alternative hypothesis: stationary
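ts5a, tested below, is presumably the first difference of ts5 (the differencing step is not echoed); a sketch:
ts5a <- diff(ts5)
adf.test(ts5a)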
##
## Augmented Dickey-Fuller Test
##
## data: ts5a
## Dickey-Fuller = -5.1214, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary
acf(ts5a)
[Plot: ACF of ts5a]
pacf(ts5a)
[Plot: PACF of ts5a]
eacf(ts5a)
## AR/MA
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 x x o o x x x x x o o o x x
## 1 x x o o x o o o o o o o o o
## 2 o o x o o o o o o o o o o o
## 3 o x x o o o o o o o o o o o
## 4 o x x o o o o o o o o o o o
## 5 o x x x o o o o o o o o o o
## 6 x x x o o o o o o o o o o o
## 7 x o x o o o o o o o o o o o
From the Dickey-Fuller test we can see that the p-value is > 0.05, which indicates that the data are non-stationary. After one differencing, the p-value is < 0.05, which means the value of d in ARIMA(p,d,q) is 1.
From the PACF plot the value of p appears to be 2, and from the ACF plot the value of q appears to be 2. Thus it is an ARIMA(2,1,2) process.
[Plot: time series ts6]
adf.test(ts6)
##
## Augmented Dickey-Fuller Test
##
## data: ts6
## Dickey-Fuller = -1.8286, Lag order = 4, p-value = 0.6477
## alternative hypothesis: stationary
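ts6a and ts6b, tested below, are presumably the first and second differences of ts6 (the differencing steps are not echoed); a sketch:
ts6a <- diff(ts6)
adf.test(ts6a)
ts6b <- diff(ts6a)
adf.test(ts6b)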
##
## Augmented Dickey-Fuller Test
##
## data: ts6a
## Dickey-Fuller = -2.9085, Lag order = 4, p-value = 0.2003
## alternative hypothesis: stationary
##
## Augmented Dickey-Fuller Test
##
## data: ts6b
## Dickey-Fuller = -5.9482, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary
acf(ts6b)
[Plot: ACF of ts6b]
pacf(ts6b)
[Plot: PACF of ts6b]
eacf(ts6b)
## AR/MA
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 x o o o o o o o o o o o o o
## 1 x o o o o o o o o o o o o o
## 2 x x x o o o o o o o o o o o
## 3 x x x x o o o o o o o o o o
## 4 o x x x o o o o o o o o o o
## 5 x o o o o o o o o o o o o o
## 6 x o o o o o o o o o o o o o
## 7 x o o o o o o o o o o o o o
## Series: ts6b
## ARIMA(0,0,2) with zero mean
##
## Coefficients:
## ma1 ma2
## -0.4995 -0.2224
## s.e. 0.1002 0.1063
##
## sigma^2 estimated as 2.25: log likelihood=-179.92
## AIC=365.84 AICc=366.1 BIC=373.63
The series needs two differences to become stationary (the Dickey-Fuller p-value drops below 0.05 only for ts6b), and auto.arima selects an MA(2) with zero mean for ts6b, so ts6 can be modelled as an ARIMA(0,2,2) process.
D. CLASSIFICATION
library("tseries")
library("forecast")
library("TSA")
library(ggplot2)
library(ISLR)
library(Hmisc)
##
## Attaching package: ’survival’
##
## Attaching package: ’Hmisc’
library(CatEncoders)
##
## Attaching package: ’CatEncoders’
library(caret)
library(MASS)
library(ggthemes)
library(gbm)
library(e1071)
##
## Attaching package: ’e1071’
library(randomForest)
## randomForest 4.6-14
##
## Attaching package: ’randomForest’
library(dplyr)
##
## Attaching package: ’dplyr’
## The following object is masked from ’package:car’:
##
## recode
library(lmtest)
library(quantreg)
library(robustbase)
library(faraway)
library(astsa)
##
## Attaching package: ’astsa’
Question 1. Consider the Weekly data set, which is part of the ISLR package. It contains the weekly stock market returns for 21 years. (a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?
library(ISLR)
df <- Weekly
head(df)
summary(df)
plot(df$Year,df$Lag1)
[Plot: df$Lag1 vs. df$Year]
plot(df$Year,df$Lag2)
[Plot: df$Lag2 vs. df$Year]
plot(df$Year,df$Volume)
[Plot: df$Volume vs. df$Year]
The Year vs. Volume plot shows a steady increase in trading volume almost every year.
(b) Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?
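The logistic-regression summary below is produced by an un-echoed chunk; judging from the Call line, the fit looks like the following sketch (dfb_train is assumed to contain Direction, the five lag variables and Volume):
logitModel <- glm(Direction ~ ., family = binomial(), data = dfb_train)
summary(logitModel)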
##
## Call:
## glm(formula = Direction ~ ., family = binomial(), data = dfb_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.481 -1.260 1.036 1.093 1.349
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.236643 0.099794 2.371 0.0177 *
## Lag1 -0.010568 0.030986 -0.341 0.7330
## Lag2 0.043209 0.030952 1.396 0.1627
## Lag3 0.009892 0.032363 0.306 0.7599
## Lag4 -0.007061 0.032243 -0.219 0.8266
## Lag5 -0.010622 0.030171 -0.352 0.7248
## Volume -0.025549 0.043535 -0.587 0.5573
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1101.0 on 799 degrees of freedom
## Residual deviance: 1098.3 on 793 degrees of freedom
## AIC: 1112.3
##
## Number of Fisher Scoring iterations: 4
As the p-values of all the lag variables and Volume are greater than 0.05, none of these predictors appears to be statistically significant; only the intercept is significant at the 5% level.
(c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion
matrix is telling you about the types of mistakes made by logistic regression.
##
## 1 2
## 1 8 8
## 2 116 157
## [1] 0.5709343
(d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the
only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the
held out data (that is, the data from 2009 and 2010.)
##
## Call:
## glm(formula = Direction ~ ., family = binomial(), data = dfd_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.417 -1.248 1.013 1.103 1.417
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.17286 0.07121 2.428 0.0152 *
## Lag2 0.06521 0.02973 2.193 0.0283 *
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1102.9 on 799 degrees of freedom
## Residual deviance: 1098.0 on 798 degrees of freedom
## AIC: 1102
##
## Number of Fisher Scoring iterations: 4
##
## 1 2
## 1 7 8
## 2 112 162
## [1] 0.5847751
## svd 1 -none- numeric
## N 1 -none- numeric
## call 3 -none- call
## terms 3 terms call
## xlevels 0 -none- list
## [1] 2
pred <- predict(LDAmodel, newdata = dfd_test, type = 'response')
fit=LabelEncoder.fit(pred$class)
pred$class<-transform(fit,pred$class)
table(factor(pred$class, levels=min(dfd_test$Direction):max(dfd_test$Direction)),
factor(dfd_test$Direction, levels=min(dfd_test$Direction):max(dfd_test$Direction)))
##
## 1 2
## 1 6 8
## 2 113 162
## [1] 0.5813149
fit2=LabelEncoder.fit(dfd_train$Direction)
dfd_train$Direction<-transform(fit2,dfd_train$Direction)
library(class)
classifier_knn <- knn(train = dfd_train,
test = dfd_test,
cl = dfd_train$Direction,
k = 1)
summary(classifier_knn)
## 1 2
## 118 171
## [1] 0.003460208
(h) Which of these methods appears to provide the best results on this data? KNN appears to provide the best results, with an accuracy of more than 99%; note, however, that the full training frame (which includes Direction itself) was passed to knn() as the feature matrix, so this accuracy is likely optimistic.
2. This problem involves predicting Salary from the Hitters data set which is part of the ISLR package.
(a) Remove the observations for whom the salary information is unknown, and log-transform the salaries.
df=Hitters
df = na.omit(df)
df$Salary = log(df$Salary)
head(df)
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun
## -Alan Ashby 315 81 7 24 38 39 14 3449 835 69
## -Alvin Davis 479 130 18 66 72 76 3 1624 457 63
## -Andre Dawson 496 141 20 65 78 37 11 5628 1575 225
## -Andres Galarraga 321 87 10 39 42 30 2 396 101 12
## -Alfredo Griffin 594 169 4 74 51 35 11 4408 1133 19
## -Al Newman 185 37 1 23 8 21 2 214 42 1
## CRuns CRBI CWalks League Division PutOuts Assists Errors
## -Alan Ashby 321 414 375 N W 632 43 10
## -Alvin Davis 224 266 263 A W 880 82 14
## -Andre Dawson 828 838 354 N E 200 11 3
## -Andres Galarraga 48 46 33 N E 805 40 4
## -Alfredo Griffin 501 336 194 A W 282 421 25
## -Al Newman 30 9 24 N E 76 127 7
## Salary NewLeague
## -Alan Ashby 6.163315 N
## -Alvin Davis 6.173786 A
## -Andre Dawson 6.214608 N
## -Andres Galarraga 4.516339 N
## -Alfredo Griffin 6.620073 A
## -Al Newman 4.248495 A
(b) Create a training set consisting of the first 200 observations, and a test set consisting of the remaining
observations.
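The training/test split used in the chunks below is not echoed; for part (b) it was presumably created as in this sketch:
df_train <- df[1:200, ]
df_test <- df[-(1:200), ]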
(c) Perform boosting on the training set with 1000 trees for a range of values of the shrinkage parameter
lambda. Produce a plot with different shrinkage values on the x-axis and the corresponding training
set MSE on the y-axis.
set.seed(1)
val = seq(-10, -0.2, by = 0.1)
lambda = 10^val
df_train_error = double(length = length(lambda))
for (i in 1:length(lambda))
{
df_boost = gbm(Salary ~ ., data = df_train, distribution = "gaussian",
n.trees = 1000, shrinkage = lambda[i])
df_train_pred = predict(df_boost, df_train, n.trees = 1000)
df_train_error[i] = mean((df_train_pred - df_train$Salary)^2)
}
plot(lambda, df_train_error, type = "b", xlab = "Shrinkage", ylab = "Training MeanSquareError")
[Plot: training MSE vs. shrinkage]
(d) Produce a plot with different shrinkage values on the x-axis and the corresponding test set MSE on the
y-axis.
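The test-set curve below was produced by an un-echoed chunk; a sketch mirroring the training loop in part (c):
df_test_error <- double(length = length(lambda))
for (i in 1:length(lambda)) {
  df_boost <- gbm(Salary ~ ., data = df_train, distribution = "gaussian",
                  n.trees = 1000, shrinkage = lambda[i])
  df_test_pred <- predict(df_boost, df_test, n.trees = 1000)
  df_test_error[i] <- mean((df_test_pred - df_test$Salary)^2)
}
plot(lambda, df_test_error, type = "b", xlab = "Shrinkage", ylab = "Testing Set MeanSquareError")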
[Plot: test set MSE vs. shrinkage]
(e) Which variables appear to be the most important predictors in the boosted model?
[Plot: relative influence of the predictors in the boosted model]
## var rel.inf
## CAtBat CAtBat 25.2453320
## PutOuts PutOuts 8.9039446
## Walks Walks 7.6702068
## RBI RBI 7.5020740
## Assists Assists 7.5016586
## AtBat AtBat 5.6224284
## CHmRun CHmRun 5.2839753
## Runs Runs 4.5930488
## HmRun HmRun 4.4011313
## CRBI CRBI 3.8898374
## CRuns CRuns 3.7701424
## Errors Errors 3.5450901
## Hits Hits 3.4071681
## Years Years 3.4031648
## CWalks CWalks 2.2178861
## CHits CHits 1.5736620
## Division Division 0.9736067
## NewLeague NewLeague 0.3533409
## League League 0.1423017
‘CAtBat’ is the most important predictor with ‘PutOuts’ as the second most important.
(f) Apply bagging to the training set. What is the test set MSE for this approach.
set.seed(1)
df_bag <- randomForest(Salary ~ ., data = df_train, mtry = 19, ntree = 500)
pred_bag <- predict(df_bag, newdata = df_test)
mean((pred_bag - df_test$Salary)^2)
## [1] 0.2299324
(g) Apply random forests to the training set. What is the test set MSE for this approach.
set.seed(1)
df_rf <- randomForest(Salary ~ ., data = df_train, mtry = sqrt(13), ntree = 500)
pred_rf <- predict(df_rf, newdata = df_test)
mean((pred_rf - df_test$Salary)^2)
## [1] 1.393555
Question 3. Generate a simulated two-class data set with 100 observations and two features in which there is a visible but non-linear separation between the classes. Show that in this setting, a support vector machine with a polynomial kernel (with degree greater than 1) or a radial kernel will outperform a support vector classifier on the training data. Which technique performs best on the test data? Make plots and report training and test error rates in order to back up your assertions.
set.seed(1)
p <- 3
X <- matrix(rnorm(100 * 2), ncol = 2)
X[1:30, ] <- X[1:30, ] + p
X[31:60, ] <- X[31:60, ] - p
Y <- c(rep(0, 60), rep(1, 40))
df <- data.frame(x = X, y = as.factor(Y))
plot(X, col = Y+10)
[Plot: simulated two-class data, X[,2] vs. X[,1]]
## x.1 x.2 y
## Min. :-4.3771 Min. :-4.91436 0:60
## 1st Qu.:-2.1470 1st Qu.:-2.48883 1:40
## Median : 0.1138 Median : 0.03554
## Mean : 0.1089 Mean :-0.03781
## 3rd Qu.: 2.3748 3rd Qu.: 2.36811
## Max. : 4.5953 Max. : 4.76729
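The SVM fits behind the three classification plots and error rates below are created in un-echoed chunks; a sketch of how they are typically built with e1071 (the object names, the df.train/df.test split, and the cost, degree and gamma values are all assumptions; only radSVM and df.test appear in the echoed code at the end):
linSVM <- svm(y ~ ., data = df.train, kernel = "linear", cost = 10)
polySVM <- svm(y ~ ., data = df.train, kernel = "polynomial", degree = 2, cost = 10)
radSVM <- svm(y ~ ., data = df.train, kernel = "radial", gamma = 1, cost = 10)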
[Plot: SVM classification plot, x.1 vs. x.2 (training data)]
## [1] 0.3875
[Plot: SVM classification plot, x.1 vs. x.2 (training data)]
## [1] 0.2
Next we fit an SVM using a radial kernel, since the previous kernel improves on the support vector classifier but is still not the best.
[Plot: SVM classification plot, x.1 vs. x.2 (training data, radial kernel)]
## [1] 0
Now let's compare the test error rates of all three SVM kernels (linear, polynomial and radial).
## [1] 0.1125
## [1] 0.125
radPred <- predict(radSVM, df.test)
tab <- table(radPred, actual = df.test$y)
error_rate_rad <- (tab[1,2]+tab[2,1])/80
print(error_rate_rad)
## [1] 0.025
The radial kernel gives the lowest test error rate (0.025), so it performs best on the test data as well.