1
The Analytics Edge                                                                  FALL 2020
      Test your knowledge of Linear Regression and PCA in R
                                                                             Exercise: Week 2
1. This question involves the use of simple linear regression on the Auto dataset. This dataset
  was taken from the StatLib library which is maintained at Carnegie Mellon University. The
  dataset has the following fields:
     • mpg: miles per gallon
     • cylinders: number of cylinders
     • displacement: engine displacement (cu. inches)
     • horsepower: engine horsepower
     • acceleration: time to accelerate from 0 to 60 mph (sec.)
     • year: model year (modulo 100)
     • origin: origin of car (1. American, 2. European, 3. Japanese)
     • name: vehicle name
   (a) Perform a simple linear regression with mpg as the response and horsepower as the pre-
       dictor. Comment on why you need to change the horsepower variable before performing
       the regression.
   (b) Comment on the output by answering the following questions:
         • Is there a strong relationship between the predictor and the response?
         • Is the relationship between the predictor and the response positive or negative?
   (c) What is the predicted mpg associated with a horsepower of 98? What is the associated
       99% confidence interval?
       Hint: You can check the predict.lm function on how the confidence interval can be
       computed for predictions with R.
   (d) Compute the correlation between the response and the predictor variable. How does this
       compare with the R2 value?
   (e) Plot the response and the predictor. Also plot the least squares regression line.
    (f) First install the package ggfortify which aids plotting linear models with ggplot2. Use
       the following two commands in R to produce diagnostic plots of the linear regression fit:
       > library(ggfortify)
       > autoplot(your model name)
       Comment on the Residuals versus Fitted plot and the Normal Q-Q plot and on any
       problems you might see with the fit.
                                                                                                 2
2. This question involves the use of multiple linear regression on the Auto dataset building on
  question 1.
   (a) Produce a scatterplot matrix which includes all the variables in the dataset.
   (b) Compute a matrix of correlations between the variables using the function cor(). You
       need to exclude the name variable which is qualitative.
   (c) Perform a multiple linear regression with mpg as the response and all other variables except
       name as the predictors. Comment on the output by answering the following questions:
         • Is there a strong relationship between the predictors and the response?
         • Which predictors appear to have a statistically significant relationship to the re-
            sponse?
         • What does the coefficient for the year variable suggest?
3. This problem focusses on the multicollinearity problem with simulated data.
   (a) Perform the following commands in R:
       > set.seed(1)
       > x1 <− runif(100)
       > x2 <− 0.5*x1 + rnorm(100)/10
       > y <− 2 + 2*x1 + 0.3*x2 + rnorm(100)
       The last line corresponds to creating a linear model in which y is a function of x1 and
       x2. Write out the form of the linear model. What are the regression coefficients?
   (b) What is the correlation between x1 and x2? Create a scatterplot displaying the relation-
       ship between the variables.
   (c) Using the data, fit a least square regression to predict y using x1 and x2.
         • What are the estimated parameters of β̂0 , β̂1 and β̂2 ? How do these relate to the
            true β0 , β1 and β2 ?
         • Can you reject the null hypothesis H0 : β1 = 0?
         • How about the null hypothesis H0 : β2 = 0?
   (d) Now fit a least squares regression to predict y using only x1.
         • How does the estimated β̂1 relate to the true β1 ?
         • Can you reject the null hypothesis H0 : β1 = 0?
   (e) Now fit a least squares regression to predict y using only x2.
         • How does the estimated β̂2 relate to the true β2 ?
         • Can you reject the null hypothesis H0 : β2 = 0?
    (f) Provide an explanation on the results in parts (c)-(e).
4. This problem involves the Boston dataset. This data was part of an important paper in 1978
  by Harrison and Rubinfeld titled “Hedonic housing prices and the demand for clean
  air” published in the Journal of Environmental Economics and Management 5(1): 81-102.
  The dataset has the following fields:
                                                                                             3
  • crim: per capita crime rate by town
  • zn: proportion of residential land zoned for lots over 25,000 sq.ft
  • indus: proportion of non-retail business acres per town
  • chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  • nox: nitrogen oxides concentration (parts per 10 million)
  • rm: average number of rooms per dwelling
  • age: proportion of owner-occupied units built prior to 1940
  • dis: weighted mean of distances to five Boston employment centres
  • rad: index of accessibility to radial highways
  • tax: full-value property-tax rate per $10,000
  • ptratio: pupil-teacher ratio by town
  • black: 1000(Bk − 0.63)2 where Bk is the proportion of black residents by town
  • lstat: lower status of the population (percent)
  • medv: median value of owner-occupied homes in $1000s
We will try to predict the median house value using thirteen predictors.
(a) For each predictor, fit a simple linear regression model using a single variable to predict
     the response. In which of these models is there a statistically significant relationship
     between the predictor and the response? Plot the figure of relationship between medv and
     lstat as an example to validate your finding.
(b) Fit a multiple linear regression models to predict your response using all the predictors.
     Compare the adjusted R2 from this model with the simple regression model. For which
     predictors, can we reject the null hypothesis H0 : βj = 0?
 (c) Create a plot displaying the univariate regression coefficients from (a) on the X-axis and
     the multiple regression coefficients from (b) on the Y-axis. That is each predictor is
     displayed as a single point in the plot. Comment on this plot.
(d) In this question, we will check if there is evidence of non-linear association between the
     lstat predictor variable and the response? To answer the question, fit a model of the
     form
                              medv = β0 + β1 lstat + β2 lstat2 + .
     You can make use of the poly() function in R. Does this help improve the fit¿ Add higher
     degree polynomial fits. What is the degree of the polynomial fit beyond which the terms
     no longer remain significant?
                                                                                              4
5. Orley Ashenfelter in his paper “Predicting the Quality and Price of Bordeaux Wines”
  published in The Economic Journal showed that the variability in the prices of Bordeaux wines
  is predicted well by the weather that created the grapes. In this question, you will validate
  how these results translate to a dataset for wines produced in Australia. The data is provided
  in the file winedata.csv. The dataset contains the following variables:
     • vintage: year the wine was made
     • price91: 1991 auction prices for the wine in dollars
     • price92: 1992 auction prices for the wine in dollars
     • temp: average temperature during the growing season in degree Celsius
     • hrain: total harvest rain in mm
     • wrain: total winter rain in mm
     • tempdiff: sum of the difference between the maximum and minimum temperatures dur-
       ing the growing season in degree Celsius
  (a) Define two new variables age91 and age92 that captures the age of the wine (in years) at
      the time of the auctions. For example, a 1961 wine would have an age of 30 at the auction
      in 1991. What is the average price of wines that were 15 years or older at the time of the
      1991 auction?
  (b) What is the average price of the wines in the 1991 auction that were produced in years
      when both the harvest rain was below average and the temperature difference was below
      average?
  (c) In this question, you will develop a simple linear regression model to fit the log of the
      price at which the wine was auctioned in 1991 with the age of the wine. To fit the model,
      use a training set with data for the wines up to (and including) the year 1981. What is
      the R-squared for this model?
  (d) Find the 99% confidence interval for the estimated coefficients from the regression.
  (e) Use the model to predict the log of prices for wines made from 1982 onwards and auctioned
      in 1991. What is the test R-squared?
  (f) Which among the following options describes best the quality of fit of the model for
      this dataset in comparison with the Bordeaux wine dataset that was analyzed by Orley
      Ashenfelter?
        • The result indicates that the variation of the prices of the wines in this dataset is
          explained much less by the age of the wine in comparison to Bordeaux wines.
        • The result indicates that the variation of the prices of the wines in this dataset is
          explained much more by the age of the wine in comparison to Bordeaux wines.
        • The age of the wine has no predictive power on the wine prices in both the datasets.
                                                                                                 5
(g) Construct a multiple regression model to fit the log of the price at which the wine was auc-
     tioned in 1991 with all the possible predictors (age91, temp, hrain, wrain, tempdiff)
     in the training dataset. To fit your model, use the data for wines made up to (and includ-
     ing) the year 1981. What is the R-squared for the model?
(h) Is this model preferred to the model with only the age variable as a predictor (use the
     adjusted R-squared for the model to decide on this)?
 (i) Which among the following best describes the output from the fitted model?
       • The result indicates that less the temperature, the better is the price and quality of
         the wine
       • The result indicates that greater the temperature difference, the better is the price
         and quality of wine.
       • The result indicates that lesser the harvest rain, the better is the price and quality of
         the wine.
       • The result indicates that winter rain is a very important variable in the fit of the data.
 (j) Of the five variables (age91, temp, hrain, wrain, tempdiff), drop the two variables
     that are the least significant from the results in (g). Rerun the linear regression and write
     down your fitted model.
(k) Is this model preferred to the model with all variables as predictors (use the adjusted
     R-squared in the training set to decide on this)?
 (l) Using the variables identified in (j), construct a multiple regression model to fit the log
     of the price at which the wine was auctioned in 1992 (remember to use age92 instead of
     age91). To fit your model, use the data for wines made up to (and including) the year
     1981. What is the R-squared for the model?
(m) Suppose in this application, we assume that a variable is statistically significant at the 0.2
     level. Would you reject the hypothesis that the coefficient for the variable hrain is zero?
(n) By separately estimating the equations for the wine prices for each auction, we can better
     establish the credibility of the explanatory variables because:
       • We have more data to fit our models with.
       • The effect of the weather variables and age of the wine (sign of the estimated coeffi-
         cients) can be checked for consistency across years.
       • 1991 and 1992 are the markets when the Australian wines were traded heavily.
     Select the best option.
(o) The current fit of the linear regression using the weather variables drops all observations
     where any of the entries are missing. Provide a short explanation on when this might not
     be a reasonable approach to use.
                                                                                                6
6. This question involves the use of principal component analysis on the well-known iris dataset.
  The dataset is available in R.
   (a) How many observations are there in the dataset? What are the different fields/attributes
        in the data set?
   (b) Create a new dataset iris.data by removing the Species column and store its content
        as iris.sp.
    (c) Compare the various pair of features using a pairwise scatterplot and find correlation
        coefficients between the features. Which features seem to be highly correlated?
   (d) Conduct a principal component analysis on iris.data without standardizing the data.
        You may use prcomp(..., scale=F).
         (i) How many principal components are required to explain at least 90 % of the vari-
             ability in the data? Plot the cumulative percentage of variance explained by the
             principal components to answer this question.
         (ii) Plot the data along the first two principal components and color the different
             species separately. Does the first principal component create enough separation
             among the different species? To plot, you may use the function fviz pca ind or
             fviz pca biplot in library(factoextra). Alternatively, you may use biplot or
             construct a plot using ggplot2 as well.
    (e) Do the same exercise as in (d) above, now after standardizing the dataset. Comment on
        any differences you observe.
7. This problem involves the dataset wine italy.csv which was obtained from the University
  of Irvine Machine Learning Repository. These data are the results of a chemical analysis of
  wines grown in the same region in Italy but derived from three different cultivars. The analysis
  determined the quantities of 13 constituents found in each of the three types of wines. The
  first column identifies the cultivars and the next thirteen are the attributes given by:
     • alcohol: Alcohol
     • malic: Malic acid
     • ash: Ash
     • alkalin: Alkalinity of ash
     • mag: Magnesium
     • phenols: Total phenols
     • flavanoids: Flavanoids
     • nonflavanoids: Nonflavanoid phenols
     • proanth: Proanthocyanins
     • color: Color Intensity
     • hue: Hue
                                                                                            7
 • od280: OD280/ OD315 of diluted wines
 • proline: Proline
(a) Check the relationship between the variables by creating a pair-wise scatterplot of the
    thirteen attributes.
(b) Conduct a principal component analysis on the standardized data. What proportion of
    the total variance is explained by the first two components?
(c) Plot the data along the first two principal components and color the different cultivars
    separately. Also plot the loadings of the different components to show the importance of
    the different attributes on the first two principal components?
      (i) Which two key attributes differentiate Cultivar 2 from the other two cultivars?
     (ii) Which two key attributes differentiate Cultivar 3 from the other two cultivars?
(d) Use an appropriate plot to find the number of attributes required to explain at least 80%
    of the total variation in the data.