
Statistics for Data Science -

CIMDS 51103
 Instructor
◼ Lemma Ebssa, Ph.D.
 Class email: Statu.Jimma22@gmail.com
 Personal email: lemma.ebssa@gmail.com
 Claims:
◼ Several pictures in this lecture are adapted from
the World Wide Web.
◼ Data are obtained from R software, Kaggle,
DataCamp, and the reference book (Bruce & Bruce).
◼ R software is used to analyze the data.
Module
Correlation and Regression
Correlation – a bivariate analysis
 a process for establishing the relationship between two
variables.
 Scatter plots are the best first step to get a general idea of
whether two variables are related.
 A single numerical value (called correlation coefficient, r) also
shows if the relationship is positive, negative, weak, or strong.
 r ranges between -1 and +1
◼ r quite close to 0, but either positive or negative, implies little or no
relationship between the variables
◼ r close to +1 → strong positive relationship → increases in one of the variables
is associated with increases in the other variable.

◼ r close to -1 → strong negative relationship → increases in one of the variables


is associated with decreases in the other variable.

◼ Strength of the relationship can be statistically tested (Pearson’s r, Spearman’s rho)


Type of Correlation
x y
5 4
7 3
9 5

Pearson correlation coefficient:
r = (nΣxy − ΣxΣy) / sqrt[(nΣx^2 − (Σx)^2)(nΣy^2 − (Σy)^2)]

nΣxy = 3(20 + 21 + 45) = 258
ΣxΣy = (5 + 7 + 9)(4 + 3 + 5) = 252
nΣx^2 = 3(25 + 49 + 81) = 465
nΣy^2 = 3(16 + 9 + 25) = 150
(Σx)^2 = (5 + 7 + 9)^2 = 441
(Σy)^2 = (4 + 3 + 5)^2 = 144

r = (258 − 252) / sqrt[(465 − 441)(150 − 144)] = 6/12
r = 0.5

Change y to 3, 4, 6 and recalculate.
Change y to 5, 3, 4 and recalculate.
Use the other formula and recalculate.
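A quick check of the hand calculation above, as a minimal base-R sketch (the three paired values are taken straight from the table):

x <- c(5, 7, 9)
y <- c(4, 3, 5)
n <- length(x)
# Pearson r from the computational formula used above
r_manual <- (n * sum(x * y) - sum(x) * sum(y)) /
  sqrt((n * sum(x^2) - sum(x)^2) * (n * sum(y^2) - sum(y)^2))
r_manual    # 0.5
cor(x, y)   # same value from R's built-in Pearson correlation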
In Correlation:
 Relationship of only two variables at a
time
 Association does not mean causation
 With multiple variables, we generate a
correlation matrix and commonly plot it to
visualize (a short R sketch follows)
 Visualizing the data with a correlation
matrix is mainly a type of exploratory
data analysis: “look at the data first”
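As a sketch of this exploratory step in base R (mtcars is used here only as a stand-in multi-variable data frame):

data(mtcars)
round(cor(mtcars), 2)   # correlation matrix (Pearson by default)
pairs(mtcars[, 1:5])    # scatter-plot matrix: "look at the data first"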
Variables with High Correlation (r)

Predictive Power Score (PPS) (0 = no PP, 1= highest PP).


- How useful a variable would be in predicting the values of another variable
Regression – bivariate, multivariate
 Regression: statistical procedure to model the
relationship between a dependent variable (response,
outcome, target) and one or more independent variables
(predictors, features) so that given one or more of the
independent variables, the dependent variable can be
predicted.
◼ It is one of the most used techniques in statistics.
◼ The outcome to be predicted could even be detecting outliers
(anomalies, unusual observations), e.g., in crime detection.
 Correlation measures the strength of a bivariate relationship, while
regression quantifies the nature of the relationship, whether
bivariate or multivariate (more than one predictor).
Type of Regression
 Simple linear: y = b0 + b1x1;
 Generalized linear (transformed/linearized
models): e.g., logistic regression (categorical),
Poisson regression (count data)
 Multiple linear: y = b0 + b1x1 + b2x2 + …
 Non-linear (cannot be linearized):
y = (b0 + b1x1 + b2x1^2 + b3x1^3) / (1 + b4x1 + b5x1^2 + b6x1^3)

log-link model (Poisson distribution)
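A hedged sketch of how each regression type is typically fit in R; df is a hypothetical data frame with a response y and predictors x1 and x2, and the non-linear formula and starting values are illustrative only:

fit_simple   <- lm(y ~ x1, data = df)                     # simple linear
fit_multiple <- lm(y ~ x1 + x2, data = df)                 # multiple linear
fit_logit    <- glm(y ~ x1, family = binomial, data = df)  # logistic (binary y)
fit_pois     <- glm(y ~ x1, family = poisson, data = df)   # Poisson (count y)
fit_nls      <- nls(y ~ (b0 + b1 * x1) / (1 + b2 * x1),    # non-linear (rational form)
                    data = df, start = list(b0 = 1, b1 = 1, b2 = 1))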


Terminologies Related to
Regression Analysis
 1. Outliers: an unusual observation
◼ observations having a very high or very low
(extreme) value as compared to the other
observations in the data.
◼ may hamper/change the results we get;
thus, problems of outliers should be
addressed before a final model selection.
Outliers: Pictorial diagnosis
Terminologies Related to
Regression Analysis
 2. Multicollinearity: When independent
variables are highly correlated with each other.
◼ Regression techniques assume multicollinearity
should not be present in the dataset. It causes
problems in ranking variables based on their
importance.
◼ It makes it difficult to select the most
important predictors.
Multicollinearity: Correlation
coefficient and variance inflation

Rule of thumb:
VIF > 10 shows Multicollinearity.
Solution: exclude the predictor
with VIF > 10.
Terminologies related to
regression analysis
 3. Heteroscedasticity: When dependent variable's
variability is not equal across values of an
independent variable.
◼ Example -As one's income increases, the variability of
food consumption will increase. A poorer person will
spend a rather constant amount by always eating
inexpensive food; a wealthier person may occasionally
buy inexpensive food and at other times eat expensive
meals. Those with higher incomes display a greater
variability of food consumption.
Heteroscedasticity
Solution: Choose non-
parametric models that are not
affected by this assumption, or
transform the variables
(response, predictors, or both).
Terminologies related to
regression analysis
 4. Underfitting and Overfitting: When the model is made too
complex (unnecessary explanatory variables) or too simple.
◼ Overfitting means that our algorithm works well on the
training set but is unable to perform well on the test
set. It is also known as the problem of high variance.
◼ When the algorithm works so poorly that it is unable to fit
even the training set well, it is said to underfit the data.
It is also known as the problem of high bias.
Underfitting and Overfitting
Terminologies related to
regression analysis
 5. Leverage: extreme values with the ability to change the
slope of the regression line. It is a measure of how much
each data point influences the regression. It is tested by
deleting one observation at a time, re-fitting the model to
the remaining data, and assessing whether the estimates
changed significantly. The test produces Cook’s
distance, which reflects how much the fitted values would
change if a point were deleted.
 6. Influence: The combined impact of strong leverage and
outlier status.
Simple Linear Regression

 Simple because we have only one predictor.


 If we collect x and y data, in simple regression, we try to
find the best line of fit with an intercept of b0 and a
slope of b1 called parameter estimates.
 b0 is the value of Y when the predictor is zero (the
intercept).
 b1 is the change in Y for every unit change in x.
 In the simple regression, usually both Y and X are
continuous variables. If Y is binary data (yes/no), the
Binomial or Logistic regression is used. If Y is count data
(e.g., number of eggs per hen), a Poisson Regression is
used. If X is categorical, we usually use ANOVA instead
of Regression.
Simple Linear Regression:
residuals & predictions
 The main purpose of regression is not just to find the best line
of fit, but to use the equation of the best line of fit to
predict values of y that we did not observe, given
that we have collected x. However, good prediction is only
possible if we have good estimates. The whole effort in
regression analysis is how to determine the best
estimates from the given data.
◼ Out of 50 students in a Stat 101 class, can we predict how much a student who missed
the mid-term exam would have scored on it, given that we know their final exam
score?

 Of course, we wouldn’t expect the actual value of the


estimated Y (Y hat = Ŷ) to be exactly what we would have
gotten had we measured it, but close enough with some
errors. The error between what we estimate and what we
could have observed is called Residuals (or errors).
Least Square method
The best line of fit is a model with the least
difference between observed and predicted values
of response.
Intuitively: (1) fit the data to different models,
(2) use your models to predict response, (3)
subtract the prediction from the original response,
(4) select the model with the smallest difference.

Given the data (x and y) the


mathematically find two
numbers (b0 and b1) that
result in the smallest RSS
(RSE, ) in the equation.
This process is called
Ordinary Least Square
(OLS).
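A minimal sketch of OLS for one predictor: the closed-form estimates compared against lm(), reusing the small x/y example from the correlation slide:

x <- c(5, 7, 9); y <- c(4, 3, 5)
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
c(b0 = b0, b1 = b1)
coef(lm(y ~ x))              # same b0 and b1 from lm()
res <- y - (b0 + b1 * x)     # residuals
sum(res^2)                   # RSS, the quantity OLS minimizes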
Multiple Linear Regression

 Multiple because there is more than one predictor. It is an
extension of the simple linear regression.
 Every predictor has its own slope. The slope of a
predictor (x1) is interpreted as the change in y for
every unit change in x1, holding the other predictors
constant.
 A final regression equation keeps only significant
predictors.
◼ In the mid-term example, using gender and score in Math 101 as
additional predictors in the model may improve prediction of the mid-
term exam score.
 There is variable selection (or model selection)
procedure in the multiple linear regression analysis.
Multiple Linear Regression
Variable Selection (called model selection)
 Forward, backward, or stepwise variable selection is
used to determine which predictors to fit first during
model build-up. Which predictor to fit first is important
because the effect of one predictor is affected by the presence
of other factors (the p-value of the slope of a given
predictor varies depending on which other predictors are included
in the model).
 For variable selection, neither R2 nor RMSE is useful,
since they directly correlate with the number of
predictors.
 AIC, BIC, and Cp are used as model selection criteria: the model
with the smallest AIC, BIC, and/or Cp is the best, as that
indicates a smaller amount of unexplained error (see the step() sketch below).
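A sketch of stepwise selection with base R's step(), which searches for the model with the smallest AIC; it assumes the marketing data used in the examples later in this lecture is already loaded:

full_model <- lm(sales ~ youtube + facebook + newspaper, data = marketing)
step_model <- step(full_model, direction = "both")  # forward + backward steps by AIC
summary(step_model)                                  # the AIC-best model that remains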
How Good are the Estimates?
 Given data, it is always guaranteed that b0 and b1 are
obtainable.
 If these estimates are truly representative (not biased),
◼ future unknown responses can be predicted using the equation.
◼ Even with little knowledge of everything else, it is possible to predict
the future response once we know this good predictor x1.
 Thus, make sure model assumptions are fulfilled
 Determine goodness and the significance of the estimates
(statistically).
◼ H0: b1 = 0; HA: b1 ≠ 0 using t-test and F-test for overall
model significance
◼ Test of p-value < 0.05 to declare importance of each predictor
◼ Estimate Confidence interval of the estimates
Model Assumptions for OLS Regression
 OLS is the most common estimation method for linear models. Estimates are the best only if
the model satisfies the OLS assumptions for linear regression. However, if the OLS
assumptions are not met, the results are in doubt.
1. The regression model is linear in the coefficients and the error term. Test: plot
predictors or residuals versus fitted value
2. The error term (unpredicted random errors) has a population mean of zero. Otherwise,
the estimates are biased. Test: scatter plot of residuals. Solution: transform, or use
different models.
3. All independent variables are uncorrelated with the error term. If violated, estimates
are biased. Test: Scatter plot of residuals against each predictor. Solution: create a new
variable that predicts the error using the predictors and then use this new variable as
an additional predictor for the response variable.
4. Observations of the error term are uncorrelated with each other. This is called
autocorrelation and is most common in time series models. Test: scatter plot of residuals
against the order of data collection (e.g., time). Solution: use an appropriate model that accounts
for autocorrelation.
5. The error term has a constant variance (Homoscedasticity of residuals, no
heteroscedasticity). Test: plot residuals versus fitted value (is it a cone-shape
distribution).
6. No independent variable is a perfect linear function of other explanatory variables.
Test: correlation matrix for the predictors. Solution: drop a predictor with VIF >10.
7. The error term is normally distributed. Note: this assumption is not about the model
estimates themselves (i.e., violating it does not bias the estimates). This assumption
allows us to perform statistical hypothesis testing and to generate reliable CIs of the
predicted responses. Test: normal probability (or QQ) plot of the residuals.
If Predictors Are Not Linear with
Response, Transform Them
Non-linear pattern
of residuals,
requiring some
transformation

[Figure panels: Normality test (QQ plot), Variance homogeneity, Influence]

Autocorrelation
- measures the relationship
between a variable’s current
value and its past values.
➔ In the ACF plot, the correlation (Y-axis)
drops to a near-zero value below the
dashed blue line (significance
level) from the first lag onwards, and all
the remaining lags fall between the
significance lines. So, there is no
significant autocorrelation.
Model Accuracy - goodness-of-fit
 Residual standard error (RSE)
◼ the average difference between values predicted by a model & the actual values.
◼ The smaller the RSE, the better.
◼ error rate of the model: RSE / Mean of Response,
where Mean of Response (average y) = b0 + meanX1*b1 + meanX2*b2
◼ Rule of thumb: an error rate < 0.4 means the model predicts the data relatively accurately (if > 0.4,
consider using a model other than linear regression)
➔ RSE = sqrt(RSS / df), the same as the square root of the
Residual Mean Square in the ANOVA table,
where df = sample size minus the number of parameters (degrees of freedom of the error).
➔ Interpretation of RSE: the model predicts y with about RSE error on average.

 Intuitively, it’s like Root Mean Square Error (RMSE).


◼ how far the data points are from the regression line
◼ = average distance from this line
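A short sketch of RSE and the error-rate rule of thumb, assuming model is an lm() fit such as the marketing example later in this lecture:

rse  <- sigma(model)                               # residual standard error
ybar <- mean(model.response(model.frame(model)))   # mean of the observed response
rse / ybar                                         # error rate; < 0.4 is the rule of thumb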
Model Accuracy - goodness-of-fit
 R-squared and Adjusted R-squared (adjusted for # of predictors):
◼ The larger, the better, but R-squared is inflated by increasing the number of predictors
◼ Adj. R-squared is penalized for the number of parameters.
◼ Rule of thumb: R^2 > 0.70
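For intuition, a sketch of R-squared computed from its definition (1 − RSS/TSS) for an lm() fit named model:

y   <- model.response(model.frame(model))
rss <- sum(resid(model)^2)
tss <- sum((y - mean(y))^2)
1 - rss / tss                   # matches summary(model)$r.squared
summary(model)$adj.r.squared    # penalized for the number of predictors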
Model Accuracy - goodness-of-fit
 F-Statistic:
◼ The larger F-statistic the more significant the model is.
◼ Rule of thumb: P-value < 0.05
Regression Example (data = marketing)

 install.packages("datarium") # the marketing data lives in the datarium package
 marketing <- datarium::marketing
 Load required R packages: > library(tidyverse)  > library(caret)
 Simple linear regression:
◼ Model_simple <- lm(sales ~ youtube, data = marketing)
◼ summary(Model_simple)$coef OR
◼ summary(Model_simple)
◼ confint(Model_simple) # to obtain CI of the estimates
◼ anova(Model_simple) # significance of each predictor in the model
Prediction
◼> newdata <- data.frame(youtube = c(0, 1000))
◼> predict(Model_simple, newdata)
◼> predictions <- predict(Model_simple, marketing)
◼> RMSE(predictions, marketing$sales)
◼> R2(predictions, marketing$sales)
◼> cbind(marketing, predictions) or
◼> mrkt_prdct <- marketing %>% mutate(sale_hat = predict(Model_simple, marketing))
Regression Example (data = marketing)

Multiple linear regression:

◼ model_multiple <- lm(sales ~ youtube + facebook + newspaper, data = marketing) OR
◼ model_multiple <- lm(sales ~ ., data = marketing)
◼ summary(model_multiple)$coef
◼ anova(model_multiple) # significance of each predictor in the model
◼ # New budgets: newdata2 <- data.frame(youtube = 2000, facebook = 1000,
newspaper = 1000)
◼ # Predict y: predict(model_multiple, newdata2)
◼ predict(model_multiple, newdata2, interval = "confidence")
Model assumption test (Diagnostics)
◼> pairs(marketing) # linearity test
◼> acf(model_multiple$residuals) # autocorrelation test
◼> cor.test(marketing$facebook, model_multiple$residuals)
# uncorrelated predictors with the error term
◼> plot(model) or > res <- resid(model); plot(density(res)) # test of unpredicted
random errors / normality / influence
◼> car::vif(model) or > cor(DATA) # test of multicollinearity
◼> influence.measures(model) # gives a list of influential observations (*)
Test Multicollinearity Assumption
> cor(marketing)
              youtube   facebook  newspaper     sales
youtube    1.00000000 0.05480866 0.05664787 0.7822244
facebook   0.05480866 1.00000000 0.35410375 0.5762226
newspaper  0.05664787 0.35410375 1.00000000 0.2282990
sales      0.78222442 0.57622257 0.22829903 1.0000000

> car::vif(model)
  youtube  facebook newspaper
 1.001788  1.149706  1.149192

Interpret: the correlation among predictors, and the VIF values.
Outputs of Regression Analysis
> summary(model)

Call:
lm(formula = sales ~ youtube + facebook + newspaper, data = train.data)

Residuals:
     Min       1Q   Median       3Q      Max
-10.4122  -1.1101   0.3475   1.4218   3.4987

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.391883   0.440622   7.698 1.41e-12 ***
youtube     0.045574   0.001592  28.630  < 2e-16 ***
facebook    0.186941   0.009888  18.905  < 2e-16 ***
newspaper   0.001786   0.006773   0.264    0.792
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.119 on 158 degrees of freedom
Multiple R-squared: 0.8902, Adjusted R-squared: 0.8881
F-statistic: 426.9 on 3 and 158 DF, p-value: < 2.2e-16

> model2 <- lm(sales ~ youtube + facebook, data = train.data)
> summary(model2)

Call:
lm(formula = sales ~ youtube + facebook, data = train.data)

Residuals:
     Min       1Q   Median       3Q      Max
-10.4807  -1.1044   0.3485   1.4232   3.4862

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.434458   0.408770   8.402 2.32e-14 ***
youtube     0.045582   0.001587  28.725  < 2e-16 ***
facebook    0.187877   0.009202  20.418  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.112 on 159 degrees of freedom
Multiple R-squared: 0.8901, Adjusted R-squared: 0.8887
F-statistic: 644 on 2 and 159 DF, p-value: < 2.2e-16

Write down the final regression equation and assess model accuracy.
OLS Diagnostics: Assumption test
using > plot(model2)

The red line across the
center of the plot is roughly
horizontal, so we can
assume that the residuals
follow a linear pattern and
the mean of the error term is 0
(except a few outliers: IDs
131 and 6).
OLS Diagnostics: Assumption test

The points in this plot
fall roughly along a
straight diagonal line,
so we can assume
the residuals are
normally distributed.
Observation 6 is an
outlier.
OLS Diagnostics: Assumption test
The red line is roughly
horizontal across the plot;
thus, the assumption of
equal variance is likely
met.
OLS Diagnostics: Assumption test
Observation #131
lies closest to the
border of Cook’s
distance, but it
doesn’t fall outside of
the dashed line. This
means there aren’t
any overly
influential points in
our dataset.
Multiple Linear Regression:
Practice Example
 Import house_sales data into R. [Read B&B page 248; use 7-zip to
extract the .gz file.] Find parameter estimates that predict AdjSalePrice
(y). You may exclude ID, DocumentDate, PropertyID.
 1. Find the correlation matrix. [The matrix is only for numerical vars.]
 2. Choose those predictors highly correlated with y
(|r| > 0.49) (x1, x2, …) and run the regression as:
◼ Model_r <- lm(y ~ x1 + x2 + … + PropertyType + NewConstruction)
 3. Use the stepwise procedure to select the best predictors (Page #
254-255)
 4. Compare models (step 2 & 3), the meaning of b1 of categorical
variables, use of p-values to assess significance of predictors vs.
correlation coefficients, provide the final equation.
 5. If interested, do some diagnostics and redo the model after trying
to solve some of the data issues….
 6. Estimate y if the final model is AdjSalePrice ~ SqFtTotLiving + Bathrooms
+ Bedrooms + BldgGrade + PropertyType + YrBuilt when:
◼ SqFtTotLiving =1910, Bathrooms=2.5, Bedrooms =3, BldgGrade =7, PropertyType= "Multiplex ", YrBuilt
=1977
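A hedged sketch of step 6, assuming the final model has already been fitted and stored as Model_final (a hypothetical name) on the house_sales data with the B&B column names:

new_house <- data.frame(SqFtTotLiving = 1910, Bathrooms = 2.5, Bedrooms = 3,
                        BldgGrade = 7, PropertyType = "Multiplex", YrBuilt = 1977)
predict(Model_final, new_house)                           # point estimate of AdjSalePrice
predict(Model_final, new_house, interval = "prediction")  # with a prediction interval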
Logistic Regression:
Regression of Binary Response
 Analogous to multiple linear regression, except
that the outcome is binary (yes/no, 0/1).
 Logistic regression produces a sigmoidal graph.
 It models the probability to get one of the
responses.
Sigmoidal model
Logistic Regression & odds –
Example. More info on OR: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2938757/

 In a sack of ‘sergegna tef’, there are more white ‘tef’ seeds
than red tef seeds. In 100 seeds randomly selected from a sack of
‘sergegna tef’ from Bishoftu, there were 75 whites, whereas in the
sack from Jimma, there were 60 white seeds. What are the odds of
picking a white tef seed from the Bishoftu tef compared to the tef from
Jimma?
◼ Odds of white for the Bishoftu sack = p of white/(1 − p of white) = 0.75/0.25 = 3.
◼ Odds of white for the Jimma sack = p of white/(1 − p of white) = 0.6/0.4 = 1.5.
◼ The odds ratio (OR) (ratio of the two odds) =
odds of Bishoftu/odds of Jimma = 3/1.5 = 2
◼ Interpretation of the OR: The odds of picking a white seed from the Bishoftu sack
compared to picking a white seed from the Jimma sack is two-fold. In other
words, opening the sack from Bishoftu increases the odds of finding a white
seed two-fold compared to searching for it in the sack from Jimma.
◼ A logistic regression of picking a white seed using location as a predictor:
odds of white = e^(b0 + b1*location), where OR = e^b1
b1 = log(OR) = log(2) = 0.69
log(odds of white) = b0 + 0.69*location.
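The same odds/odds-ratio arithmetic as a minimal R sketch (no model fitting; the numbers come straight from the example):

p_bishoftu <- 0.75
p_jimma    <- 0.60
odds_bishoftu <- p_bishoftu / (1 - p_bishoftu)   # 3
odds_jimma    <- p_jimma / (1 - p_jimma)         # 1.5
OR <- odds_bishoftu / odds_jimma                 # 2
log(OR)   # 0.69 = the slope b1 a logistic regression would estimate for location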
Logistic Regression for DS
 In R, use the loan data and run
logistic regression analysis (B&B, p.
332).
 Logistic_model <- glm(outcome ~ payment_inc_ratio +
purpose_ + home_ + emp_len_ + borrower_score, family =
"binomial", data = loan_data)
 Use summary(Logistic_model ) to find estimates and
significance of the parameters.
 Write the final logistic equation and interpret the slopes.
Logistic Regression:
Regression of Categorical Responses
Poisson Regression:
Regression of Count Responses
 The assumption of normal distribution of residuals is
highly violated:
◼ the response is a count (0, 1, …) and not normally distributed
◼ the expected mean of the response equals its expected
variance {E(Y) = Var(Y)}
 Often, the occurrence is relatively rare, e.g., the number of
earthquakes
 Transformation of the response variable does not
improve normality assumption violation
 Follows the Poisson distribution
 Data are fit to a generalized linear model with family =
poisson
Poisson Regression: Example
 Get data: for_poisson.csv
 R codes for Poisson regression
◼ model_pois <- glm(num_awards ~ prog + math,
family="poisson", data=pois)
Real world problems examples:
correlation or regression
 Time Spent Running vs. Body Fat
 Time Spent Watching TV vs. Exam Scores
 Height vs. Weight
 Temperature vs. Ice Cream Sales
 Coffee Consumption vs. Intelligence
 Shoe Size vs. Movies Watched
 The relationship between advertising spending and revenue.
 The relationship between drug dosage and blood pressure of patients.
 The effect of fertilizer (different amounts) and water (amount of irrigation) on crop yields.
