# PANGANTIHON, Norman O.
# STT061-M14
# Activity on Regression Analysis
# Your task is to consider the Anscombe data set again
# These are your new problems
# Problem 1: Encode the Anscombe data set using Excel (Save the file)
# Problem 2: Export the xls file into a csv file (find the Export command
# in the File menu of Excel)
# CSV means comma-separated values
# Problem 3: Use an R command to load the CSV file into R
homedir <- "~/NORMAN/STT061/"
setwd(homedir)
Anscombe <- read.csv("Anscombe.csv")
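# A quick structural check (standard R calls) confirms that the
# eight columns X1..X4, Y1..Y4 used below loaded as expected:
str(Anscombe)
head(Anscombe)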
# Problem 4: Use an R command to compute the mean of X1, X2, X3, X4
mean(Anscombe$X1)
mean(Anscombe$X2)
mean(Anscombe$X3)
mean(Anscombe$X4)
# Problem 5: Use an R command to compute the mean of Y1, Y2, Y3, Y4
mean(Anscombe$Y1)
mean(Anscombe$Y2)
mean(Anscombe$Y3)
mean(Anscombe$Y4)
# Problem 6: Use an R command to compute the variance of Y1, Y2, Y3, Y4
var(Anscombe$Y1)
var(Anscombe$Y2)
var(Anscombe$Y3)
var(Anscombe$Y4)
# Problem 7: Use an R command to compute the variance of X1, X2, X3, X4
var(Anscombe$X1)
var(Anscombe$X2)
var(Anscombe$X3)
var(Anscombe$X4)
# Problem 8: Use an R command to compute the standard deviation (sd) of Y1, Y2, Y3, Y4
sd(Anscombe$Y1)
sd(Anscombe$Y2)
sd(Anscombe$Y3)
sd(Anscombe$Y4)
# Problem 9: Use an R command to compute the standard deviation (sd) of X1, X2, X3, X4
sd(Anscombe$X1)
sd(Anscombe$X2)
sd(Anscombe$X3)
sd(Anscombe$X4)
# Problem 10: Use an R command to compute the correlation for (X1, Y1),
# and also for (X2, Y2), (X3, Y3), (X4, Y4).
cor(Anscombe$X1,Anscombe$Y1)
cor(Anscombe$X2,Anscombe$Y2)
cor(Anscombe$X3,Anscombe$Y3)
cor(Anscombe$X4,Anscombe$Y4)
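# As a compact alternative (a sketch, assuming the column names
# X1..X4 and Y1..Y4 used above), all four correlations can be
# computed in one call:
sapply(1:4, function(i)
  cor(Anscombe[[paste0("X", i)]], Anscombe[[paste0("Y", i)]]))
# all four values are nearly identical (about 0.816)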
# Establish a causal relationship by validating the four assumptions
# Problem 11: Build the Simple Linear Regression Model for (X1, Y1).
# Verify the assumptions of this model.
# Step 1: Load the data set
Anscombe <- read.csv("Anscombe.csv")
# Step 2: Build the Simple Linear Regression Model
model <- lm(Y1 ~ X1, data = Anscombe) # create the linear regression model
plot(Anscombe$X1,Anscombe$Y1) # scatter plot
abline(model,col = "red",lwd = 3) # plot the regression line
model # display regression coefficients
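# summary() (a standard lm accessor) adds standard errors, t-tests,
# and R-squared to the coefficient estimates shown above:
summary(model)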
cor(Anscombe$X1,Anscombe$Y1) # get correlation coefficient
# correlation = 0.8164205 - this value is high
# I can say that there is a strong linear relationship between X1 and Y1
# Findings for Assumption1:
# The linearity of data is satisfied
# because of the high value of correlation
# and the scatter plot shows that
# as X1 increases, Y1 also increases.
# Step 3: Verify Assumption 2: Independence of Error terms
plot(model, 1) # residuals vs fitted, with a reference smoother
plot(model$fitted.values, model$residuals) # the same plot, drawn manually
# Conclusion: Independence of error terms is satisfied.
# In the residuals versus fits plot,
# the points seem randomly scattered,
# and it does not appear that there is a pattern.
# Also, there is a red line which is
# approximately horizontal at 0.
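# A formal check of independence (a sketch, assuming the lmtest
# package is installed): the Durbin-Watson test, where a statistic
# near 2 indicates no autocorrelation in the residuals.
library(lmtest)
dwtest(model)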
# Step 4: Verify Assumption 3: Normality of Error terms.
# For simple linear regression
# we can check the normality of the response variable.
# We can use hist() to check whether the dependent variable
# follows a normal distribution, along with a formal test for
# normality, the QQ plot, and the Empirical Normality Rule.
hist(Anscombe$Y1,probability=T, main="Histogram",xlab="Raw data")
lines(density(Anscombe$Y1),col=2,lwd = 3)
# Findings: it appears that the distribution
# is approximately normal
# We can confirm this by computing the skewness and kurtosis coefficients
library(datawizard)
skewness(Anscombe$Y1)
kurtosis(Anscombe$Y1)
# skew = -0.065 > -0.5 indicating nearly symmetrical data
# Kurtosis = -0.535 < 0 indicating that the curve is flatter than normal
# Findings: The distribution of the response variable is nearly symmetric
# Conclusion: The distribution of the response variable is approximately normal
# Perform Shapiro-Wilk test for Normality
shapiro.test(Anscombe$Y1)
# W = 0.97693, p-value = 0.9467
# The p-value of the test turns out to be 0.9467.
# Since this value is greater than .05, we fail to reject the
# null hypothesis of normality, so the sample data are
# consistent with a normally distributed population.
# Perform the Empirical Normality rule: (This requires that the curve is symmetric)
# Percentage of area covered under normality rule: 68/95/99.7
# Area covered within 1 SD is 68%
# Area covered within 2 SD is 95%
# Area covered within 3 SD is 99.7%
# Compute the actual percentage
data1 <- Anscombe$Y1
# percentage of observations within k standard deviations of the mean
within1sd <- round(sum(abs(data1 - mean(data1)) < 1*sd(data1))*(100/length(data1)),2)
within2sd <- round(sum(abs(data1 - mean(data1)) < 2*sd(data1))*(100/length(data1)),2)
within3sd <- round(sum(abs(data1 - mean(data1)) < 3*sd(data1))*(100/length(data1)),2)
percentage <- paste0(within1sd,'/',within2sd,'/',within3sd)
(normsfreq <- paste0("Actual Percentage: ", percentage))
# "Actual Percentage: 81.82/100/100 which satisfies the empirical rule
# and there is a validation that the curve is symmetric
# using the histogram skewness and kurtosis.
# Therefore the result about the actual empirical coverage
# should be considered
# because it has already been established that the curve is nearly symmetric.
# Findings: the Empirical Normality rule is satisfied
# Decision: The result of Empirical rule is valid
# since the conditions are satisfied and
# the shape of the distribution is nearly symmetric.
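# The same coverage computation is repeated for Y2, Y3, Y4 below;
# a small helper function (a sketch, not part of the original
# assignment) avoids the copy-paste:
empirical_coverage <- function(v) {
  pct <- sapply(1:3, function(k)
    round(100 * mean(abs(v - mean(v)) < k * sd(v)), 2))
  paste0("Actual Percentage: ", paste(pct, collapse = "/"))
}
empirical_coverage(Anscombe$Y1) # same result as above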
# Here is another way to establish normality:
# create a Quantile-Quantile plot of the residuals (the points
# should fall along a straight line)
qqnorm(model$residuals)
qqline(model$residuals)
# Findings: the residuals seem to fall along the straight line.
# This means the error terms are normally distributed.
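# Since the assumption concerns the error terms rather than Y itself,
# we can also run the Shapiro-Wilk test directly on the residuals:
shapiro.test(model$residuals)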
# Final Conclusion about normality of error terms:
# Assumption is satisfied.
# Step 5: Verify Assumption 4: Constant Variance of Y for each x
# The assumption can be checked by examining the scale-location
# plot, also known as the spread-location plot
plot(model,3)
# Findings: There is constant variance of Y for each X value.
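# A formal check of constant variance (a sketch, assuming the lmtest
# package is installed): the Breusch-Pagan test, where a large
# p-value is consistent with constant variance.
library(lmtest)
bptest(model)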
# Another view on how to verify constancy of variance
# Use augment function to automatically insert fitted values
# and residuals
library(broom)
augment_model <- augment(model)
library(ggplot2)
ggplot(augment_model, aes(X1, Y1)) +
  geom_point() +
  stat_smooth(method = "lm", se = TRUE) +
  geom_segment(aes(xend = X1, yend = .fitted), color = "red", size = 0.3)
# The shaded band represents the confidence band around the
# fitted regression line. Its width stays roughly the same from
# the smallest to the largest values of x.
# This suggests that the variance of Y is constant along the values
# of x.
# Conclusion: Assumption 4 is satisfied
# ---------------------------------------------------------
# Final Conclusion: All 4 assumptions are satisfied.
# Therefore the simple linear regression model is
# appropriate for the data. This means we do not have
# to look for other methods to improve the model.
# Validation of the assumptions showed that the Simple linear regression
# model is valid to represent the causal relationship
# between the X1 variable and the Y1 variable.
# ---------------------------------------------------------
# Instead of evaluating the response variable, we can also
# investigate the residuals (actual value - fitted value)
# directly:
hist(model$residuals,probability=T, main="Histogram",xlab="Raw data")
lines(density(model$residuals),col=2,lwd = 3)
# ---------------------------------------------------------
# Problem 12: Build the Simple Linear Regression Model for (X2, Y2).
# Verify the assumptions of this model.
# Step 1: Load the data set
Anscombe <- read.csv("Anscombe.csv")
# Step 2: Build the Simple Linear Regression Model
model <- lm(Y2 ~ X2, data = Anscombe) # create the linear regression model
plot(Anscombe$X2,Anscombe$Y2) # scatter plot
abline(model,col = "red",lwd = 3) # plot the regression line
model # display regression coefficients
cor(Anscombe$X2,Anscombe$Y2) # get correlation coefficient
# correlation = 0.8162365 - this value is high,
# but the scatter plot shows a clearly curved pattern
# Findings for Assumption1:
# Linearity is NOT satisfied: despite the high correlation,
# Y2 rises and then falls as X2 increases, so the relationship
# is curved rather than linear.
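# A quadratic term captures the curvature (a sketch shown only to
# illustrate why a straight line is the wrong model here; model2q
# is a name introduced for this illustration):
model2q <- lm(Y2 ~ poly(X2, 2), data = Anscombe)
summary(model2q) # the quadratic fit is almost perfect for this data set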
# Step 3: Verify Assumption 2: Independence of Error terms
plot(model, 1) # residuals vs fitted, with a reference smoother
plot(model$fitted.values, model$residuals) # the same plot, drawn manually
# Conclusion: Independence of error terms is NOT satisfied.
# The residuals-versus-fits plot shows a clear parabolic pattern
# rather than random scatter.
# Step 4: Verify Assumption 3: Normality of Error terms.
# For simple linear regression
# we can check the normality of the response variable.
# We can use hist() to check whether the dependent variable
# follows a normal distribution, along with a formal test for
# normality, the QQ plot, and the Empirical Normality Rule.
hist(Anscombe$Y2,probability=T, main="Histogram",xlab="Raw data")
lines(density(Anscombe$Y2),col=2,lwd = 3)
# Findings: the distribution appears skewed to the left
# We can confirm this by computing the skewness and kurtosis coefficients
library(datawizard)
skewness(Anscombe$Y2) # skew = -1.316 < -0.5 indicating strong skewness
# on the left side
kurtosis(Anscombe$Y2) # kurtosis = 0.846 > 0 indicating that the curve
# is higher than normal
# Findings: The distribution of the response variable is strongly
# left-skewed and more peaked than normal
# Conclusion: The distribution of the response variable is not normal
# Perform Shapiro-Wilk test for Normality
shapiro.test(Anscombe$Y2)
# W = 0.82837, p-value = 0.02222
# The p-value of the test turns out to be 0.02222.
# Since this value is less than .05, we reject the null hypothesis
# of normality: the data do not appear normally distributed.
# Perform the Empirical Normality rule: (This requires that the curve is symmetric)
# Percentage of area covered under normality rule: 68/95/99.7
# Area covered within 1 SD is 68%
# Area covered within 2 SD is 95%
# Area covered within 3 SD is 99.7%
# Compute the actual percentage
data1 <- Anscombe$Y2
within1sd <- round(sum(abs(data1 - mean(data1)) < 1*sd(data1))*(100/length(data1)),2)
within2sd <- round(sum(abs(data1 - mean(data1)) < 2*sd(data1))*(100/length(data1)),2)
within3sd <- round(sum(abs(data1 - mean(data1)) < 3*sd(data1))*(100/length(data1)),2)
percentage <- paste0(within1sd,'/',within2sd,'/',within3sd)
(normsfreq <- paste0("Actual Percentage: ", percentage))
# "Actual Percentage: 100/100/100" which satisfies the empirical rule
# Findings:
# Decision:
# Here is another way to establish normality:
# create a Quantile-Quantile plot of the residuals (the points
# should fall along a straight line)
qqnorm(model$residuals)
qqline(model$residuals)
# Findings: the residuals deviate from the straight line,
# especially in the tails
# Decision: normality of the error terms is questionable
# Final Conclusion about normality of error terms:
# Assumption is NOT satisfied (Shapiro-Wilk p < .05, skewed data).
# Step 5: Verify Assumption 4: Constant Variance of Y for each x
# The assumption can be checked by examining the scale-location
# plot, also known as the spread-location plot
plot(model,3)
# Findings: the variance of Y is not constant for each
# x value.
# Another view on how to verify constancy of variance
# Use augment function to automatically insert fitted values
# and residuals
library(broom)
augment_model <- augment(model)
library(ggplot2)
ggplot(augment_model, aes(X2, Y2)) +
  geom_point() +
  stat_smooth(method = "lm", se = TRUE) +
  geom_segment(aes(xend = X2, yend = .fitted), color = "red", size = 0.3)
# The shaded band represents the confidence band around the
# fitted regression line.
# Conclusion: Assumption 4 is not satisfied
# ---------------------------------------------------------
# Final Conclusion: Not all four assumptions are satisfied.
# The relationship is curved, so the simple linear regression
# model is NOT appropriate for (X2, Y2).
# ---------------------------------------------------------
# Instead of evaluating the response variable, we can also
# investigate the residuals (actual value - fitted value)
# directly:
hist(model$residuals,probability=T, main="Histogram",xlab="Raw data")
lines(density(model$residuals),col=2,lwd = 3)
# ---------------------------------------------------------
# Problem 13: Build the Simple Linear Regression Model for (X3, Y3).
# Verify the assumptions of this model.
# Step 1: Load the data set
Anscombe <- read.csv("Anscombe.csv")
# Step 2: Build the Simple Linear Regression Model
model <- lm(Y3 ~ X3, data = Anscombe) # create the linear regression model
plot(Anscombe$X3,Anscombe$Y3) # scatter plot
abline(model,col = "red",lwd = 3) # plot the regression line
model # display regression coefficients
cor(Anscombe$X3,Anscombe$Y3) # get correlation coefficient
# correlation = 0.8162867 - this value is high
# Findings for Assumption1:
# The scatter plot shows an almost perfectly linear trend, except
# for one clear outlier that pulls the fitted line away from the
# other points, so linearity holds for all but that single point.
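# Cook's distance (a standard lm diagnostic, which = 4) highlights
# the single observation that dominates the fit:
plot(model, 4)
which.max(cooks.distance(model)) # index of the most influential point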
# Step 3: Verify Assumption 2: Independence of Error terms
plot(model, 1) # residuals vs fitted, with a reference smoother
plot(model$fitted.values, model$residuals) # the same plot, drawn manually
# Conclusion: apart from one large residual (the outlier),
# the residuals show no systematic pattern.
# Step 4: Verify Assumption 3: Normality of Error terms.
# For simple linear regression
# we can check the normality of the response variable.
# We can use hist() to check whether the dependent variable
# follows a normal distribution, along with a formal test for
# normality, the QQ plot, and the Empirical Normality Rule.
hist(Anscombe$Y3,probability=T, main="Histogram",xlab="Raw data")
lines(density(Anscombe$Y3),col=2,lwd = 3)
# Findings: the distribution appears skewed to the right,
# with one extreme value
# We can confirm this by computing the skewness and kurtosis coefficients
library(datawizard)
skewness(Anscombe$Y3) # skew = 1.855 > 1 indicating strong skewness
# on the right side
kurtosis(Anscombe$Y3) # kurtosis = 4.384 > 0 indicating that the curve
# is higher than normal
# Findings: The distribution of the response variable is strongly
# right-skewed and heavy-tailed
# Conclusion: The distribution of the response variable is not normal
# Perform Shapiro-Wilk test for Normality
shapiro.test(Anscombe$Y3)
# W = 0.83361, p-value = 0.02604
# The p-value of the test turns out to be 0.02604.
# Since this value is less than .05, we reject the null hypothesis
# of normality: the data do not appear normally distributed.
# Perform the Empirical Normality rule: (This requires that the curve is symmetric)
# Percentage of area covered under normality rule: 68/95/99.7
# Area covered within 1 SD is 68%
# Area covered within 2 SD is 95%
# Area covered within 3 SD is 99.7%
# Compute the actual percentage
data1 <- Anscombe$Y3
within1sd <- round(sum(abs(data1 - mean(data1)) < 1*sd(data1))*(100/length(data1)),2)
within2sd <- round(sum(abs(data1 - mean(data1)) < 2*sd(data1))*(100/length(data1)),2)
within3sd <- round(sum(abs(data1 - mean(data1)) < 3*sd(data1))*(100/length(data1)),2)
percentage <- paste0(within1sd,'/',within2sd,'/',within3sd)
(normsfreq <- paste0("Actual Percentage: ", percentage))
# "Actual Percentage: 90.91/90.91/100" which satisfies the empirical rule
# Findings:
# Decision:
# Here is another way to establish normality:
# create a Quantile-Quantile plot of the residuals (the points
# should fall along a straight line)
qqnorm(model$residuals)
qqline(model$residuals)
# Findings: most residuals fall near the line, but the outlier
# deviates strongly from it
# Decision: normality of the error terms is questionable
# Final Conclusion about normality of error terms:
# Assumption is NOT satisfied (Shapiro-Wilk p < .05, skewed data).
# Step 5: Verify Assumption 4: Constant Variance of Y for each x
# The assumption can be checked by examining the scale-location
# plot, also known as the spread-location plot
plot(model,3)
# Findings: the variance of Y is not constant for each
# x value.
# Another view on how to verify constancy of variance
# Use augment function to automatically insert fitted values
# and residuals
library(broom)
augment_model <- augment(model)
library(ggplot2)
ggplot(augment_model, aes(X3, Y3)) +
  geom_point() +
  stat_smooth(method = "lm", se = TRUE) +
  geom_segment(aes(xend = X3, yend = .fitted), color = "red", size = 0.3)
# The shaded band represents the confidence band around the
# fitted regression line.
# Conclusion: Assumption 4 is not satisfied
# ---------------------------------------------------------
# Final Conclusion: Not all four assumptions are satisfied.
# A single outlier distorts the fit, so the simple linear
# regression model is NOT appropriate for (X3, Y3) as it stands.
# ---------------------------------------------------------
# Instead of evaluating the response variable, we can also
# investigate the residuals (actual value - fitted value)
# directly:
hist(model$residuals,probability=T, main="Histogram",xlab="Raw data")
lines(density(model$residuals),col=2,lwd = 3)
# ---------------------------------------------------------
# Problem 14: Build the Simple Linear Regression Model for (X4, Y4).
# Verify the assumptions of this model.
# Step 1: Load the data set
Anscombe <- read.csv("Anscombe.csv")
# Step 2: Build the Simple Linear Regression Model
model <- lm(Y4 ~ X4, data = Anscombe) # create the linear regression model
plot(Anscombe$X4,Anscombe$Y4) # scatter plot
abline(model,col = "red",lwd = 3) # plot the regression line
model # display regression coefficients
cor(Anscombe$X4,Anscombe$Y4) # get correlation coefficient
# correlation = 0.8165214 - this value is high
# Findings for Assumption1:
# Linearity is NOT meaningfully established: all X4 values except
# one are identical, so the apparent linear trend is produced
# entirely by a single influential point.
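# Tabulating X4 (a standard call) makes the degenerate design
# visible: all but one observation share the same x value.
table(Anscombe$X4)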
# Step 3: Verify Assumption 2: Independence of Error terms
plot(model, 1) # residuals vs fitted, with a reference smoother
plot(model$fitted.values, model$residuals) # the same plot, drawn manually
# Conclusion: with only two distinct fitted values, the residual
# plot is uninformative; independence cannot be meaningfully assessed.
# Step 4: Verify Assumption 3: Normality of Error terms.
# For simple linear regression
# we can check the normality of the response variable.
# We can use hist() to check whether the dependent variable
# follows a normal distribution, along with a formal test for
# normality, the QQ plot, and the Empirical Normality Rule.
hist(Anscombe$Y4,probability=T, main="Histogram",xlab="Raw data")
lines(density(Anscombe$Y4),col=2,lwd = 3)
# Findings: the distribution appears skewed to the right,
# with one extreme value
# We can confirm this by computing the skewness and kurtosis coefficients
library(datawizard)
skewness(Anscombe$Y4) # skew = 1.507 > 1 indicating strong skewness
# on the right side
kurtosis(Anscombe$Y4) # kurtosis = 3.151 > 0 indicating that the curve
# is higher than normal
# Findings: The distribution of the response variable is strongly
# right-skewed and more peaked than normal
# Conclusion: the shape suggests non-normality, although the
# formal test below does not reject normality
# Perform Shapiro-Wilk test for Normality
shapiro.test(Anscombe$Y4)
# W = 0.87536, p-value = 0.09081
# The p-value of the test turns out to be 0.09081.
# Since this value is greater than .05, we fail to reject the null
# hypothesis of normality, although the skewness suggests caution.
# Perform the Empirical Normality rule: (This requires that the curve is symmetric)
# Percentage of area covered under normality rule: 68/95/99.7
# Area covered within 1 SD is 68%
# Area covered within 2 SD is 95%
# Area covered within 3 SD is 99.7%
# Compute the actual percentage
data1 <- Anscombe$Y4
within1sd <- round(sum(abs(data1 - mean(data1)) < 1*sd(data1))*(100/length(data1)),2)
within2sd <- round(sum(abs(data1 - mean(data1)) < 2*sd(data1))*(100/length(data1)),2)
within3sd <- round(sum(abs(data1 - mean(data1)) < 3*sd(data1))*(100/length(data1)),2)
percentage <- paste0(within1sd,'/',within2sd,'/',within3sd)
(normsfreq <- paste0("Actual Percentage: ", percentage))
# "Actual Percentage: 90.91/90.91/100" which satisfies the empirical rule
# Findings:
# Decision:
# Here is another way to establish normality:
# create a Quantile-Quantile plot of the residuals (the points
# should fall along a straight line)
qqnorm(model$residuals)
qqline(model$residuals)
# Findings: the residuals deviate from the line in the tails
# Decision: normality of the error terms is questionable
# Final Conclusion about normality of error terms:
# the evidence is mixed (Shapiro-Wilk does not reject normality,
# but the data are skewed), so the assumption is doubtful.
# Step 5: Verify Assumption 4: Constant Variance of Y for each x
# The assumption can be checked by examining the scale-location
# plot, also known as the spread-location plot
plot(model,3)
# Findings: with a single distinct x value for all but one
# observation, constancy of variance cannot be meaningfully assessed.
# Another view on how to verify constancy of variance
# Use augment function to automatically insert fitted values
# and residuals
library(broom)
augment_model <- augment(model)
library(ggplot2)
ggplot(augment_model, aes(X4, Y4)) +
  geom_point() +
  stat_smooth(method = "lm", se = TRUE) +
  geom_segment(aes(xend = X4, yend = .fitted), color = "red", size = 0.3)
# The shaded band represents the confidence band around the
# fitted regression line.
# Conclusion: Assumption 4 is not satisfied
# ---------------------------------------------------------
# Final Conclusion: Not all four assumptions are satisfied.
# The fit depends entirely on one influential point, so the simple
# linear regression model is NOT appropriate for (X4, Y4).
# ---------------------------------------------------------
# Instead of evaluating the response variable, we can also
# investigate the residuals (actual value - fitted value)
# directly:
hist(model$residuals,probability=T, main="Histogram",xlab="Raw data")
lines(density(model$residuals),col=2,lwd = 3)
# ---------------------------------------------------------
# Finally, which pair is suitable for Simple Linear Regression Analysis?
# Short Answer: only (X1, Y1). It is the only pair for which all
# four assumptions are satisfied, even though all four pairs share
# nearly identical means, variances, and correlations - which is
# the whole point of the Anscombe quartet.
# ---------------------------------------------------------