
Statistics for Data Science -

CIMDS 51103
 Instructor
◼ Lemma Ebssa, Ph.D.
 Class email: Statu.Jimma22@gmail.com
 Personal email: lemma.ebssa@gmail.com
 Claims:
◼ Several pictures in this lecture are adapted from
the World Wide Web.
◼ Data are obtained from R software, Kaggle,
DataCamp, and the reference book (Bruce & Bruce).
◼ R software is used to analyze the data.
Module
Correlation and Regression
Correlation – a bivariate analysis
 a process for establishing the relationship between two
variables.
 Scatter plots are the best first step to get a general idea of
whether two variables are related.
 A single numerical value (called correlation coefficient, r) also
shows if the relationship is positive, negative, weak, or strong.
 r ranges between -1 and +1
◼ r quite close to 0, but either positive or negative, implies little or no
relationship between the variables
◼ r close to +1 → strong positive relationship → increases in one of the variables
is associated with increases in the other variable.

◼ r close to -1 → strong negative relationship → increases in one of the variables


is associated with decreases in the other variable.

◼ Strength of the relationship can be statistically tested (Pearson’s r, Spearman’s rho)


Type of Correlation
x y
5 4
7 3
9 5

Pearson correlation coefficient:
r = (nΣxy − ΣxΣy) / sqrt[(nΣx^2 − (Σx)^2)(nΣy^2 − (Σy)^2)]

nΣxy = 3(20 + 21 + 45) = 258
ΣxΣy = (5 + 7 + 9)(4 + 3 + 5) = 252
nΣx^2 = 3(25 + 49 + 81) = 465
nΣy^2 = 3(16 + 9 + 25) = 150
(Σx)^2 = (5 + 7 + 9)^2 = 441
(Σy)^2 = (4 + 3 + 5)^2 = 144

r = (258 − 252) / sqrt[(465 − 441)(150 − 144)] = 6/12
r = 0.5

Change y to 3, 4, 6 and recalculate.
Change y to 5, 3, 4 and recalculate.
Use the other formula and recalculate.
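A quick check of the hand calculation above, as a minimal base-R sketch (the three paired values are taken straight from the table):

x <- c(5, 7, 9)
y <- c(4, 3, 5)
n <- length(x)
# Pearson r from the computational formula used above
r_manual <- (n * sum(x * y) - sum(x) * sum(y)) /
  sqrt((n * sum(x^2) - sum(x)^2) * (n * sum(y^2) - sum(y)^2))
r_manual    # 0.5
cor(x, y)   # same value from R's built-in Pearson correlation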
In Correlation:
 Relationship of only two variables at a
time
 Association does not mean causation
 With multiple variables, we generate a
correlation matrix and commonly plot it to
visualize (a short R sketch follows)
 Visualizing the data with a correlation
matrix is mainly a type of exploratory
data analysis: “look at the data first”
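As a sketch of this exploratory step in base R (mtcars is used here only as a stand-in multi-variable data frame):

data(mtcars)
round(cor(mtcars), 2)   # correlation matrix (Pearson by default)
pairs(mtcars[, 1:5])    # scatter-plot matrix: "look at the data first"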
Variables with High Correlation (r)

Predictive Power Score (PPS) (0 = no PP, 1= highest PP).


- How useful a variable would be in predicting the values of another variable
Regression – bivariate, multivariate
 Regression: statistical procedure to model the
relationship between a dependent variable (response,
outcome, target) and one or more independent variables
(predictors, features) so that given one or more of the
independent variables, the dependent variable can be
predicted.
◼ It is one of the most used techniques in statistics.
◼ The outcome to be predicted could even be detecting outliers
(anomalies, unusual observations), e.g., in crime detection.
 Correlation measures the strength of a bivariate relationship, while
regression quantifies the nature of the relationship, whether
bivariate or multivariate (more than one predictor).
Type of Regression
 Simple linear: y = b0 + b1x1;
 Generalized linear (transformed/linearized
models): e.g., logistic regression (categorical),
Poisson regression (count data)
 Multiple linear: y = b0 + b1x1 + b2x2 + …
 Non-linear (cannot be linearized):
y = (b0 + b1x1 + b2x1^2 + b3x1^3) / (1 + b4x1 + b5x1^2 + b6x1^3)

log-link model (Poisson distribution)
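A hedged sketch of how each regression type is typically fit in R; df is a hypothetical data frame with a response y and predictors x1 and x2, and the non-linear formula and starting values are illustrative only:

fit_simple   <- lm(y ~ x1, data = df)                     # simple linear
fit_multiple <- lm(y ~ x1 + x2, data = df)                 # multiple linear
fit_logit    <- glm(y ~ x1, family = binomial, data = df)  # logistic (binary y)
fit_pois     <- glm(y ~ x1, family = poisson, data = df)   # Poisson (count y)
fit_nls      <- nls(y ~ (b0 + b1 * x1) / (1 + b2 * x1),    # non-linear (rational form)
                    data = df, start = list(b0 = 1, b1 = 1, b2 = 1))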


Terminologies Related to
Regression Analysis
 1. Outliers: an unusual observation
◼ observations having a very high or very low
(extreme) value as compared to the other
observations in the data.
◼ may hamper/change the results we get;
thus, problems of outliers should be
addressed before a final model selection.
Outliers: Pictorial diagnosis
Terminologies Related to
Regression Analysis
 2. Multicollinearity: When independent
variables are highly correlated with each other.
◼ Regression techniques assume multicollinearity
should not be present in the dataset. It causes
problems in ranking variables based on their
importance.
◼ It makes it difficult to select the most
important predictors.
Multicollinearity: Correlation
coefficient and variance inflation

Rule of thumb:
VIF > 10 shows Multicollinearity.
Solution: exclude the predictor
with VIF > 10.
Terminologies related to
regression analysis
 3. Heteroscedasticity: When dependent variable's
variability is not equal across values of an
independent variable.
◼ Example -As one's income increases, the variability of
food consumption will increase. A poorer person will
spend a rather constant amount by always eating
inexpensive food; a wealthier person may occasionally
buy inexpensive food and at other times eat expensive
meals. Those with higher incomes display a greater
variability of food consumption.
Heteroscedasticity
Solution: Choose non-
parametric models that are not
affected by this assumption, or
transform the variables
(response, predictors, or both).
Terminologies related to
regression analysis
 4. Underfitting and Overfitting: When the model is made too
complex (unnecessary explanatory variables) or too simple.
◼ Overfitting means that our algorithm works well on the
training set but is unable to perform well on the test
set. It is also known as the problem of high variance.
◼ When the algorithm works so poorly that it is unable to fit
even the training set well, it is said to underfit the data.
It is also known as the problem of high bias.
Underfitting and Overfitting
Terminologies related to
regression analysis
 5. Leverage: extreme values with the ability to change the
slope of the regression line. It is a measure of how much
each data point influences the regression. It is tested by
deleting one observation at a time, re-fitting the model to
the remaining data, and assessing whether the estimates
changed significantly. The test produces Cook’s
distance, which reflects how much the fitted values would
change if a point were deleted.
 6. Influence: The combined impact of strong leverage and
outlier status.
Simple Linear Regression

 Simple because we have only one predictor.


 If we collect x and y data, in simple regression, we try to
find the best line of fit with an intercept of b0 and a
slope of b1 called parameter estimates.
 b0 is the value of Y when the predictor is zero (the
intercept).
 b1 is the change in Y for every unit change in x.
 In the simple regression, usually both Y and X are
continuous variables. If Y is binary data (yes/no), the
Binomial or Logistic regression is used. If Y is count data
(e.g., number of eggs per hen), a Poisson Regression is
used. If X is categorical, we usually use ANOVA instead
of Regression.
Simple Linear Regression:
residuals & predictions
 The main purpose of regression is not just to find the best line
of fit, but to use the equation of the best line of fit to
predict values of y that we did not observe, given
that we have collected x. However, good prediction is only
possible if we have good estimates. The whole effort in
regression analysis is how to determine the best
estimates from the given data.
◼ Out of 50 students in a Stat 101 class, can we predict how much a student who missed
the mid-term exam would have scored on it, given that we know their final exam
score?

 Of course, we wouldn’t expect the actual value of the


estimated Y (Y hat = Ŷ) to be exactly what we would have
gotten had we measured it, but close enough with some
errors. The error between what we estimate and what we
could have observed is called Residuals (or errors).
Least Square method
The best line of fit is a model with the least
difference between observed and predicted values
of response.
Intuitively: (1) fit the data to different models,
(2) use your models to predict response, (3)
subtract the prediction from the original response,
(4) select the model with the smallest difference.

Given the data (x and y) the


mathematically find two
numbers (b0 and b1) that
result in the smallest RSS
(RSE, ) in the equation.
This process is called
Ordinary Least Square
(OLS).
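A minimal sketch of OLS for one predictor: the closed-form estimates compared against lm(), reusing the small x/y example from the correlation slide:

x <- c(5, 7, 9); y <- c(4, 3, 5)
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
c(b0 = b0, b1 = b1)
coef(lm(y ~ x))              # same b0 and b1 from lm()
res <- y - (b0 + b1 * x)     # residuals
sum(res^2)                   # RSS, the quantity OLS minimizes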
Multiple Linear Regression

 Multiple because there is more than one predictor. It is an
extension of the simple linear regression.
 Every predictor has its own slope. The slope of a
predictor (x1) is interpreted as the change in y for
every unit change in x1, holding the other predictors
constant.
 A final regression equation keeps only significant
predictors.
◼ In the mid-term example, using gender and score in Math 101 as
additional predictors in the model may improve prediction of the mid-
term exam score.
 There is variable selection (or model selection)
procedure in the multiple linear regression analysis.
Multiple Linear Regression
Variable Selection (called model selection)
 Forward, backward, or stepwise variable selection is
used to determine which predictors to fit first during
model build-up. Which predictor to fit first is important
because the effect of one predictor is affected by the presence
of other factors (the p-value of the slope of a given
predictor varies depending on which other predictors are included
in the model).
 For variable selection, neither R2 nor RMSE is useful,
since they directly correlate with the number of
predictors.
 AIC, BIC, and Cp are used as model selection criteria: the model
with the smallest AIC, BIC, and/or Cp is the best, as that
indicates a smaller amount of unexplained error (see the step() sketch below).
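A sketch of stepwise selection with base R's step(), which searches for the model with the smallest AIC; it assumes the marketing data used in the examples later in this lecture is already loaded:

full_model <- lm(sales ~ youtube + facebook + newspaper, data = marketing)
step_model <- step(full_model, direction = "both")  # forward + backward steps by AIC
summary(step_model)                                  # the AIC-best model that remains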
How Good are the Estimates?
 Given data, it is always guaranteed that b0 and b1 are
obtainable.
 If these estimates are truly representative (not biased),
◼ future unknown responses can be predicted using the equation.
◼ Even with little knowledge of everything else, it is possible to predict
the future response once we know this good predictor x1.
 Thus, make sure model assumptions are fulfilled
 Determine goodness and the significance of the estimates
(statistically).
◼ H0: b1 = 0; HA: b1 ≠ 0 using t-test and F-test for overall
model significance
◼ Test of p-value < 0.05 to declare importance of each predictor
◼ Estimate Confidence interval of the estimates
Model Assumptions for OLS Regression
 OLS is the most common estimation method for linear models. Estimates are the best only if
the model satisfies the OLS assumptions for linear regression. However, if the OLS
assumptions are not met, the results are in doubt.
1. The regression model is linear in the coefficients and the error term. Test: plot
predictors or residuals versus fitted value
2. The error term (unpredicted random errors) has a population mean of zero. Otherwise,
the estimates are biased. Test: scatter plot of residuals. Solution: transform, or use
different models.
3. All independent variables are uncorrelated with the error term. If violated, estimates
are biased. Test: Scatter plot of residuals against each predictor. Solution: create a new
variable that predicts the error using the predictors and then use this new variable as
an additional predictor for the response variable.
4. Observations of the error term are uncorrelated with each other. This is called
autocorrelation and is most common in time series models. Test: scatter plot of residuals
against the order of data collection (e.g., time). Solution: use an appropriate model that accounts
for autocorrelation.
5. The error term has a constant variance (Homoscedasticity of residuals, no
heteroscedasticity). Test: plot residuals versus fitted value (is it a cone-shape
distribution).
6. No independent variable is a perfect linear function of other explanatory variables.
Test: correlation matrix for the predictors. Solution: drop a predictor with VIF >10.
7. The error term is normally distributed. Note: this assumption is not about the model
estimates themselves (i.e., violating it does not bias the estimates). This assumption
allows us to perform statistical hypothesis testing and to generate reliable CIs of the
predicted responses. Test: normal probability (or QQ) plot of the residuals.
If Predictors Are Not Linear with
Response, Transform Them
Non-linear pattern
of residuals,
requiring some
transformation

[Figure panels: Normality test (QQ plot), Variance homogeneity, Influence]

Autocorrelation
- measures the relationship
between a variable’s current
value and its past values.
➔ In the ACF plot, the correlation (Y-axis)
drops to a near-zero value below the
dashed blue line (significance
level) from the first lag onwards, and all
the remaining lags fall between the
significance lines. So, there is no
significant autocorrelation.
Model Accuracy - goodness-of-fit
 Residual standard error (RSE)
◼ the average difference between values predicted by a model & the actual values.
◼ The smaller the RSE, the better.
◼ error rate of the model: RSE / Mean of Response,
where Mean of Response (average y) = b0 + meanX1*b1 + meanX2*b2
◼ Rule of thumb: an error rate < 0.4 means the model predicts the data relatively accurately (if > 0.4,
consider using a model other than linear regression)
➔ RSE = sqrt(RSS / df), the same as the square root of the
Residual Mean Square in the ANOVA table,
where df = sample size minus the number of parameters (degrees of freedom of the error).
➔ Interpretation of RSE: the model predicts y with about RSE error on average.

 Intuitively, it’s like Root Mean Square Error (RMSE).


◼ how far the data points are from the regression line
◼ = average distance from this line
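A short sketch of RSE and the error-rate rule of thumb, assuming model is an lm() fit such as the marketing example later in this lecture:

rse  <- sigma(model)                               # residual standard error
ybar <- mean(model.response(model.frame(model)))   # mean of the observed response
rse / ybar                                         # error rate; < 0.4 is the rule of thumb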
Model Accuracy - goodness-of-fit
 R-squared and Adjusted R-squared (adjusted for # of predictors):
◼ The larger, the better, but R-squared is inflated by increasing the number of predictors
◼ Adj. R-squared is penalized for the number of parameters.
◼ Rule of thumb: R^2 > 0.70
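For intuition, a sketch of R-squared computed from its definition (1 − RSS/TSS) for an lm() fit named model:

y   <- model.response(model.frame(model))
rss <- sum(resid(model)^2)
tss <- sum((y - mean(y))^2)
1 - rss / tss                   # matches summary(model)$r.squared
summary(model)$adj.r.squared    # penalized for the number of predictors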
Model Accuracy - goodness-of-fit
 F-Statistic:
◼ The larger F-statistic the more significant the model is.
◼ Rule of thumb: P-value < 0.05
Regression Example (data = marketing)

 install.packages("datarium") # the marketing data lives in the datarium package
 marketing <- datarium::marketing
 Load required R packages: > library(tidyverse)  > library(caret)
 Simple linear regression:
◼ Model_simple <- lm(sales ~ youtube, data = marketing)
◼ summary(Model_simple)$coef OR
◼ summary(Model_simple)
◼ confint(Model_simple) # to obtain CI of the estimates
◼ anova(Model_simple) # significance of each predictor in the model
Prediction
◼> newdata <- data.frame(youtube = c(0, 1000))
◼> predict(Model_simple, newdata)
◼> predictions <- predict(Model_simple, marketing)
◼> RMSE(predictions, marketing$sales)
◼> R2(predictions, marketing$sales)
◼> cbind(marketing, predictions) or
◼> mrkt_prdct <- marketing %>% mutate(sale_hat = predict(Model_simple, marketing))
Regression Example (data = marketing)

Multiple linear regression:

◼ model_multiple <- lm(sales ~ youtube + facebook + newspaper, data = marketing) OR
◼ model_multiple <- lm(sales ~ ., data = marketing)
◼ summary(model_multiple)$coef
◼ anova(model_multiple) # significance of each predictor in the model
◼ # New budgets: newdata2 <- data.frame(youtube = 2000, facebook = 1000,
newspaper = 1000)
◼ # Predict y: predict(model_multiple, newdata2)
◼ predict(model_multiple, newdata2, interval = "confidence")
Model assumption test (Diagnostics)
◼> pairs(marketing) # linearity test
◼> acf(model_multiple$residuals) # autocorrelation test
◼> cor.test(marketing$facebook, model_multiple$residuals)
# uncorrelated predictors with the error term
◼> plot(model) or > res <- resid(model); plot(density(res)) # test of unpredicted
random errors / normality / influence
◼> car::vif(model) or > cor(DATA) # test of multicollinearity
◼> influence.measures(model) # gives a list of influential observations (*)
Test Multicollinearity Assumption
> cor(marketing)
              youtube   facebook  newspaper     sales
youtube    1.00000000 0.05480866 0.05664787 0.7822244
facebook   0.05480866 1.00000000 0.35410375 0.5762226
newspaper  0.05664787 0.35410375 1.00000000 0.2282990
sales      0.78222442 0.57622257 0.22829903 1.0000000

> car::vif(model)
  youtube  facebook newspaper
 1.001788  1.149706  1.149192

Interpret: the correlation among predictors, and the VIF values.
Outputs of Regression Analysis
> summary(model)

Call:
lm(formula = sales ~ youtube + facebook + newspaper, data = train.data)

Residuals:
     Min       1Q   Median       3Q      Max
-10.4122  -1.1101   0.3475   1.4218   3.4987

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.391883   0.440622   7.698 1.41e-12 ***
youtube     0.045574   0.001592  28.630  < 2e-16 ***
facebook    0.186941   0.009888  18.905  < 2e-16 ***
newspaper   0.001786   0.006773   0.264    0.792
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.119 on 158 degrees of freedom
Multiple R-squared: 0.8902, Adjusted R-squared: 0.8881
F-statistic: 426.9 on 3 and 158 DF, p-value: < 2.2e-16

> model2 <- lm(sales ~ youtube + facebook, data = train.data)
> summary(model2)

Call:
lm(formula = sales ~ youtube + facebook, data = train.data)

Residuals:
     Min       1Q   Median       3Q      Max
-10.4807  -1.1044   0.3485   1.4232   3.4862

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.434458   0.408770   8.402 2.32e-14 ***
youtube     0.045582   0.001587  28.725  < 2e-16 ***
facebook    0.187877   0.009202  20.418  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.112 on 159 degrees of freedom
Multiple R-squared: 0.8901, Adjusted R-squared: 0.8887
F-statistic: 644 on 2 and 159 DF, p-value: < 2.2e-16

Write down the final regression equation and assess model accuracy.
OLS Diagnostics: Assumption test
using > plot(model2)

The red line across the
center of the plot is roughly
horizontal, so we can
assume that the residuals
follow a linear pattern and
the mean of the error term is 0
(except a few outliers: IDs
131 and 6).
OLS Diagnostics: Assumption test

The points in this plot
fall roughly along a
straight diagonal line,
so we can assume
the residuals are
normally distributed.
Observation 6 is an
outlier.
OLS Diagnostics: Assumption test
The red line is roughly
horizontal across the plot;
thus, the assumption of
equal variance is likely
met.
OLS Diagnostics: Assumption test
Observation #131
lies closest to the
border of Cook’s
distance, but it
doesn’t fall outside of
the dashed line. This
means there aren’t
any overly
influential points in
our dataset.
Multiple Linear Regression:
Practice Example
 Import house_sales data into R. [Read B&B page 248; use 7-zip to
extract the .gz file.] Find parameter estimates that predict AdjSalePrice
(y). You may exclude ID, DocumentDate, PropertyID.
 1. Find the correlation matrix. [The matrix is only for numerical vars.]
 2. Choose those predictors highly correlated with y
(|r| > 0.49) (x1, x2, …) and run the regression as:
◼ Model_r <- lm(y ~ x1 + x2 + … + PropertyType + NewConstruction)
 3. Use the stepwise procedure to select the best predictors (Page #
254-255)
 4. Compare models (step 2 & 3), the meaning of b1 of categorical
variables, use of p-values to assess significance of predictors vs.
correlation coefficients, provide the final equation.
 5. If interested, do some diagnostics and redo the model after trying
to solve some of the data issues….
 6. Estimate y if the final model is AdjSalePrice ~ SqFtTotLiving + Bathrooms
+ Bedrooms + BldgGrade + PropertyType + YrBuilt when:
◼ SqFtTotLiving =1910, Bathrooms=2.5, Bedrooms =3, BldgGrade =7, PropertyType= "Multiplex ", YrBuilt
=1977
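A hedged sketch of step 6, assuming the final model has already been fitted and stored as Model_final (a hypothetical name) on the house_sales data with the B&B column names:

new_house <- data.frame(SqFtTotLiving = 1910, Bathrooms = 2.5, Bedrooms = 3,
                        BldgGrade = 7, PropertyType = "Multiplex", YrBuilt = 1977)
predict(Model_final, new_house)                           # point estimate of AdjSalePrice
predict(Model_final, new_house, interval = "prediction")  # with a prediction interval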
Logistic Regression:
Regression of Binary Response
 Analogous to multiple linear regression, except
that the outcome is binary (yes/no, 0/1).
 Logistic regression produces a sigmoidal graph.
 It models the probability to get one of the
responses.
Sigmoidal model
Logistic Regression & odds –
Example. More info on OR: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2938757/

 In a sack of ‘sergegna tef’, there are more white ‘tef’ seeds
than red tef seeds. In 100 seeds randomly selected from a sack of
‘sergegna tef’ from Bishoftu, there were 75 whites, whereas in the
sack from Jimma, there were 60 white seeds. What are the odds of
picking a white tef seed from the Bishoftu tef compared to the tef from
Jimma?
◼ Odds of white for the Bishoftu sack = p of white/(1 − p of white) = 0.75/0.25 = 3.
◼ Odds of white for the Jimma sack = p of white/(1 − p of white) = 0.6/0.4 = 1.5.
◼ The odds ratio (OR) (ratio of the two odds) =
odds of Bishoftu/odds of Jimma = 3/1.5 = 2
◼ Interpretation of the OR: The odds of picking a white seed from the Bishoftu sack
compared to picking a white seed from the Jimma sack is two-fold. In other
words, opening the sack from Bishoftu increases the odds of finding a white
seed two-fold compared to searching for it in the sack from Jimma.
◼ A logistic regression of picking a white seed using location as a predictor:
odds of white = e^(b0 + b1*location), where OR = e^b1
b1 = log(OR) = log(2) = 0.69
log(odds of white) = b0 + 0.69*location.
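The same odds/odds-ratio arithmetic as a minimal R sketch (no model fitting; the numbers come straight from the example):

p_bishoftu <- 0.75
p_jimma    <- 0.60
odds_bishoftu <- p_bishoftu / (1 - p_bishoftu)   # 3
odds_jimma    <- p_jimma / (1 - p_jimma)         # 1.5
OR <- odds_bishoftu / odds_jimma                 # 2
log(OR)   # 0.69 = the slope b1 a logistic regression would estimate for location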
Logistic Regression for DS
 In R, use the loan data and run
logistic regression analysis (B&B, p.
332).
 Logistic_model <- glm(outcome ~ payment_inc_ratio +
purpose_ + home_ + emp_len_ + borrower_score, family =
"binomial", data = loan_data)
 Use summary(Logistic_model ) to find estimates and
significance of the parameters.
 Write the final logistic equation and interpret the slopes.
Logistic Regression:
Regression of Categorical Responses
Poisson Regression:
Regression of Count Responses
 The assumption of normal distribution of residuals is
highly violated:
◼ the response is a count (0, 1, …) and not normally distributed
◼ the expected mean of the response equals its expected
variance {E(Y) = Var(Y)}
 Often, the occurrence is relatively rare, e.g., the number of
earthquakes
 Transformation of the response variable does not
improve normality assumption violation
 Follows the Poisson distribution
 Data are fit to a generalized linear model with family =
poisson
Poisson Regression: Example
 Get data: for_poisson.csv
 R codes for Poisson regression
◼ model_pois <- glm(num_awards ~ prog + math,
family="poisson", data=pois)
Real world problems examples:
correlation or regression
 Time Spent Running vs. Body Fat
 Time Spent Watching TV vs. Exam Scores
 Height vs. Weight
 Temperature vs. Ice Cream Sales
 Coffee Consumption vs. Intelligence
 Shoe Size vs. Movies Watched
 The relationship between advertising spending and revenue.
 The relationship between drug dosage and blood pressure of patients.
 The effect of fertilizer (different amounts) and water (amount of irrigation) on crop yields.
