Question 1: Produce descriptive statistics to summarize the data.
You are expected to generate as
many relevant descriptive statistics as possible using ALL the relevant tools introduced in the
labs of this course. Remember to provide appropriate interpretations for the descriptive statistics.
Try not to include unnecessary or irrelevant descriptive statistics.
First, import dataset23.csv into R as a data frame and assign it to case4.
getwd()
setwd("")
case4<- read.table("dataset23.csv", header = TRUE , sep = ",", quote ="/", stringsAsFactors =
FALSE )
1. Some first rows of the data
head(case4)
2. Display the structure of case4 data frame
str(case4)
3. Convert character variables into factors
case4$X.province<-factor(case4$X.province, levels = c("Hanoi","Haiphong","TP HCM"))
case4$own<-factor(case4$own, levels = c("One-owner","Multi-owner"))
str(case4)
4. Summary data
summary(case4)
table(case4$X.province,case4$own)
5. Summary data by groups
by(case4$X.quantityproduct,list(case4$X.province,case4$own),summary)
by(case4$X.quantitysold,list(case4$X.province,case4$own),summary)
by(case4$totalass,list(case4$X.province,case4$own),summary)
# Descriptive statistics by group
install.packages("psych")
library("psych")
describeBy(case4["X.quantityproduct"],list(case4$X.province,case4$own))
describeBy(case4["X.quantitysold"],list(case4$X.province,case4$own))
describeBy(case4["totalass"],list(case4$X.province,case4$own))
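If the psych package is not available, a similar group-wise table can be built in base R. The following is only a sketch: the data frame toy and its columns province, own and assets are made-up stand-ins, not the contents of dataset23.csv.

```r
# Hypothetical data: 2 provinces x 2 ownership types, 3 firms per cell.
toy <- data.frame(
  province = rep(c("P1", "P2"), each = 6),
  own      = rep(rep(c("One-owner", "Multi-owner"), each = 3), times = 2),
  assets   = c(10, 12, 14, 20, 25, 30, 5, 6, 7, 40, 50, 60)
)

# n, mean and SD of assets for every province x ownership cell.
cell_stats <- aggregate(
  assets ~ province + own, data = toy,
  FUN = function(x) c(n = length(x), mean = mean(x), sd = sd(x))
)

# aggregate() stores the three statistics in one matrix column;
# flatten it into ordinary columns assets.n, assets.mean, assets.sd.
cell_stats <- do.call(data.frame, cell_stats)
print(cell_stats)
```

Each row of cell_stats describes one province and ownership combination, which is the same grouping used in the by() and describeBy() calls above.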
# Descriptive data in graphs
boxplot(X.quantityproduct ~ X.province + own, data = case4,
        xlab = "Specify address of firm and Ownership status",
        ylab = "Quantity produced for the most important product",
        col = c("red", "blue", "yellow", "pink", "grey", "green"))
boxplot(X.quantitysold ~ X.province + own, data = case4,
        xlab = "Specify address of firm and Ownership status",
        ylab = "Quantity sold based on quantity produced for the most important product",
        col = c("red", "blue", "yellow", "pink", "grey", "green"))
boxplot(totalass ~ X.province + own, data = case4,
        xlab = "Specify address of firm and Ownership status",
        ylab = "Total assets in 2014",
        col = c("red", "blue", "yellow", "pink", "grey", "green"))
install.packages("gplots")
library("gplots")
plotmeans(X.quantityproduct ~ interaction(X.province, own), data = case4,
          xlab = "Specify address of firm and Ownership status",
          ylab = "Quantity produced for the most important product",
          main = "Mean Plot with 95% CI")
plotmeans(X.quantitysold ~ interaction(X.province, own), data = case4,
          xlab = "Specify address of firm and Ownership status",
          ylab = "Quantity sold based on quantity produced for the most important product",
          main = "Mean Plot with 95% CI")
plotmeans(totalass ~ interaction(X.province, own), data = case4,
          xlab = "Specify address of firm and Ownership status",
          ylab = "Total assets in 2014",
          main = "Mean Plot with 95% CI")
Question 2: Use analysis of variance to test for any significant differences due to province. Use
a .05 level of significance, and for now, ignore the effect of types of ownership, quantity
produced and quantity sold. Check all the assumptions of the inference technique you use. Are
the assumptions satisfied? Explain.
Check assumptions:
1. All populations are normally distributed (Q-Q plot)
install.packages("car")
library(car)
qqPlot(lm(totalass ~ X.province, data = case4), simulate = TRUE,
       main = "Q-Q Plot", labels = FALSE)
2. Samples were selected by simple random sampling and are independent. Check whether the
sample sizes are equal:
table(case4$X.province)
3. All population variances are equal (rule of thumb: s_largest < 2 × s_smallest)
by(case4$totalass,case4$X.province,sd)
s_largest / s_smallest = 110745.9 / 11162.1 = 9.921601 > 2
The ratio of the largest SD to the smallest SD is about 9.92, which is greater than 2, so it is
not clear that the variances can be pooled. We therefore check again using Levene's test.
(Limitation: the ratio is far above the rule-of-thumb threshold.)
Hypothesis:
H0: All population variances are equal
Ha: At least two population variances are different
R code:
library(car)
leveneTest(case4$totalass, case4$X.province, center=median)
p-value = 0.077
Decision rule: Reject H0 if p-value < α
We have: p-value = 0.077 > 0.05
Do not reject H0
There is not enough evidence to conclude that the population variances differ, so the
equal-variance assumption is retained.
Assumption 3 is satisfied
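The two checks of the equal-variance assumption can be sketched in base R as follows. This is an illustration under assumptions, not the analysis itself: the data frame toy and its columns group and value are simulated, and the Levene step is written out by hand as a one-way ANOVA on absolute deviations from the group medians, which is what a median-centred Levene test amounts to.

```r
# Hypothetical data: three groups with deliberately different spreads.
set.seed(1)
toy <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 20)),
  value = c(rnorm(20, 100, 5), rnorm(20, 100, 10), rnorm(20, 100, 40))
)

# Rule of thumb: ratio of the largest to the smallest group SD.
group_sd <- tapply(toy$value, toy$group, sd)
sd_ratio <- max(group_sd) / min(group_sd)

# Median-centred Levene test: ANOVA on |value - group median|.
group_median <- ave(toy$value, toy$group, FUN = median)
toy$abs_dev  <- abs(toy$value - group_median)
levene_table <- anova(lm(abs_dev ~ group, data = toy))
levene_p     <- levene_table[["Pr(>F)"]][1]

# Decision rule: reject H0 (equal variances) if p-value < alpha.
alpha    <- 0.05
decision <- if (levene_p < alpha) "reject H0" else "do not reject H0"

print(sd_ratio)
print(levene_table)
print(decision)
```

The rule of thumb and the test can disagree, as they do in this question; when that happens it is worth reporting both results and noting the disagreement as a limitation.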
Use one-way ANOVA to test for any significant differences due to province
# One-way ANOVA
aovcase4.1 <- aov(totalass ~ X.province, data = case4)
summary(aovcase4.1)
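To turn the ANOVA table into a decision at the .05 level, the p-value can be read from the summary object. The sketch below uses simulated data (toy, province and assets are hypothetical names). It also runs Welch's one-way test, which does not assume equal variances; that test is an addition for comparison and is not part of the analysis above.

```r
# Hypothetical data: three provinces, 15 firms each, unequal spreads.
set.seed(2)
toy <- data.frame(
  province = factor(rep(c("P1", "P2", "P3"), each = 15)),
  assets   = c(rnorm(15, 50, 5), rnorm(15, 55, 5), rnorm(15, 60, 20))
)

# Classical one-way ANOVA (assumes equal population variances).
fit     <- aov(assets ~ province, data = toy)
anova_p <- summary(fit)[[1]][["Pr(>F)"]][1]

# Welch's one-way test (does not assume equal variances).
welch   <- oneway.test(assets ~ province, data = toy, var.equal = FALSE)
welch_p <- welch$p.value

# Decision rule: reject H0 (all means equal) if p-value < alpha.
alpha <- 0.05
cat("ANOVA p-value:", anova_p, "-> reject H0:", anova_p < alpha, "\n")
cat("Welch p-value:", welch_p, "-> reject H0:", welch_p < alpha, "\n")
```

If the two tests lead to the same decision, the conclusion is less sensitive to the doubtful equal-variance assumption.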
Question 3: At the .05 level of significance test for any significant differences due to
X.province, types of ownership, and interaction (ignore the effect of quantity produced and
quantity sold). Check all the assumptions of the inference technique you use. Are the assumptions
satisfied? Explain. Draw an interaction plot and interpret the plot. Is the plot consistent with the
conclusions?
I. Assumptions:
1) All populations are normally distributed
2) Samples were selected by using simple random sampling
3) Samples are independent
4) All population standard deviations are equal (rule of thumb: s_largest < 2 × s_smallest)
Assumption 1: All populations are normally distributed
In order to check the normal distribution of the populations, we use QQ plot with R command:
install.packages("car")
library(car)
qqPlot(lm(totalass ~ X.province + own + own*X.province, data = case4),
       simulate = TRUE, main = "Q-Q Plot", labels = FALSE)
With only a few outliers, the populations are still treated as normally distributed; the
outliers are noted under limitations.
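A formal test can complement the visual check. The sketch below applies the Shapiro-Wilk test to the residuals of a two-factor model using base R only. The data frame toy is simulated and its column names are hypothetical; Shapiro-Wilk is an addition here and is not a method used in the analysis above.

```r
# Hypothetical balanced design: 3 provinces x 2 ownership types, 12 firms per cell.
set.seed(3)
toy <- expand.grid(
  rep      = 1:12,
  province = c("P1", "P2", "P3"),
  own      = c("One-owner", "Multi-owner")
)
toy$assets <- rnorm(nrow(toy), mean = 100, sd = 15)

# Fit the two-factor model with interaction and take its residuals.
fit <- lm(assets ~ province * own, data = toy)
res <- residuals(fit)

# Base R Q-Q plot of the residuals with a reference line.
qqnorm(res, main = "Q-Q Plot of residuals")
qqline(res)

# Shapiro-Wilk test: H0 is that the residuals come from a normal distribution.
sw   <- shapiro.test(res)
sw_p <- sw$p.value
normal_ok <- sw_p >= 0.05   # TRUE means H0 is not rejected at the .05 level
print(sw)
```

With large samples the test can flag departures that are too small to matter for ANOVA, so it is best read together with the Q-Q plot.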
Assumption 2 & 3: Samples were selected by using simple random sampling, independent
table(case4$own, case4$X.province)
Output:
Assumption 4: All population standard deviations are equal
by(case4$totalass, list(case4$X.province, case4$own), sd)
s_largest / s_smallest = 148425.6 / 7562.748 = 19.62588 > 2
The sample standard deviations are not all equal: the ratio is far above 2.
We continue to use ANOVA and record this as a limitation.
We also run Levene's test, even though the ratio exceeds the rule-of-thumb threshold many times over.
R code:
install.packages("car")
library(car)
leveneTest(case4$totalass, case4$own, center = median)
Output:
1. Hypothesis
H0: The population variances are equal
Ha: The population variances are not all equal
2. P-value = 0.2179
3. Rejection rule: Reject H0 if p-value < α
We have: 0.2179 > 0.05
Do not reject H0
4. Conclusion
Assumption 4 is satisfied
II. Hypothesis
H0: µ1 = µ2 (mean total assets are the same for both ownership types)
Ha: µ1 ≠ µ2 (the two population means differ)
aov2 <- aov(totalass ~ own,data= case4)
summary(aov2)
Output:
III. Rejection Rules: Reject H0 if p-value < α
We have: 0.152 > 0.05
Do not reject H0
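Conclusion: There is not enough evidence at the .05 level to conclude that mean total assets
differ between ownership types.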
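Question 3 also asks about province, the interaction term, and an interaction plot. A minimal sketch of that two-way analysis might look like the following. It runs on simulated data, not dataset23.csv: the data frame toy, its column names and its values are assumptions made for illustration only.

```r
# Hypothetical balanced design: 3 provinces x 2 ownership types, 10 firms per cell.
set.seed(4)
toy <- expand.grid(
  rep      = 1:10,
  province = c("P1", "P2", "P3"),
  own      = c("One-owner", "Multi-owner")
)
toy$assets <- rnorm(nrow(toy), mean = 100, sd = 15)

# Two-way ANOVA: main effects of province and own, plus their interaction.
fit2 <- aov(assets ~ province * own, data = toy)
tab  <- summary(fit2)[[1]]
print(tab)

# p-values for the three tests, named by model term.
p_values <- setNames(tab[["Pr(>F)"]][1:3], trimws(rownames(tab)[1:3]))
alpha    <- 0.05
print(p_values < alpha)   # TRUE means H0 is rejected for that term

# Interaction plot: mean assets by province, one line per ownership type.
with(toy, interaction.plot(province, own, assets,
                           xlab = "Province", ylab = "Mean total assets",
                           trace.label = "Ownership"))
```

Roughly parallel lines in the plot are consistent with a non-significant interaction term, while clearly crossing or diverging lines suggest an interaction.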
R INPUT
Q1
# Import the .csv file "dataset23.csv"
getwd()
setwd("")
case4<- read.table("dataset23.csv", header = TRUE , sep = ",", quote ="/", stringsAsFactors =
FALSE )
# Some first rows of the data
head(case4)
# Display the structure of case4 data frame
str(case4)
# Convert character variables into factors
case4$X.province<-factor(case4$X.province, levels = c("Hanoi","Haiphong","TP HCM"))
case4$own<-factor(case4$own, levels = c("One-owner","Multi-owner"))
str(case4)
# Summary data
summary(case4)
table(case4$X.province,case4$own)
# Summary data by groups
by(case4$X.quantityproduct,list(case4$X.province,case4$own),summary)
by(case4$X.quantitysold,list(case4$X.province,case4$own),summary)
by(case4$totalass,list(case4$X.province,case4$own),summary)
# Descriptive statistics by group
install.packages("psych")
library("psych")
describeBy(case4["X.quantityproduct"],list(case4$X.province,case4$own))
describeBy(case4["X.quantitysold"],list(case4$X.province,case4$own))
describeBy(case4["totalass"],list(case4$X.province,case4$own))
# Descriptive data in graphs
boxplot(X.quantityproduct ~ X.province + own, data = case4,
        xlab = "Specify address of firm and Ownership status",
        ylab = "Quantity produced for the most important product",
        col = c("red", "blue", "yellow", "pink", "grey", "green"))
boxplot(X.quantitysold ~ X.province + own, data = case4,
        xlab = "Specify address of firm and Ownership status",
        ylab = "Quantity sold based on quantity produced for the most important product",
        col = c("red", "blue", "yellow", "pink", "grey", "green"))
boxplot(totalass ~ X.province + own, data = case4,
        xlab = "Specify address of firm and Ownership status",
        ylab = "Total assets in 2014",
        col = c("red", "blue", "yellow", "pink", "grey", "green"))
install.packages("gplots")
library("gplots")
plotmeans(X.quantityproduct ~ interaction(X.province, own), data = case4,
          xlab = "Specify address of firm and Ownership status",
          ylab = "Quantity produced for the most important product",
          main = "Mean Plot with 95% CI")
plotmeans(X.quantitysold ~ interaction(X.province, own), data = case4,
          xlab = "Specify address of firm and Ownership status",
          ylab = "Quantity sold based on quantity produced for the most important product",
          main = "Mean Plot with 95% CI")
plotmeans(totalass ~ interaction(X.province, own), data = case4,
          xlab = "Specify address of firm and Ownership status",
          ylab = "Total assets in 2014",
          main = "Mean Plot with 95% CI")
Q2
#Check assumptions
#Check independence and simple random sample and sample sizes are equal
table(case4$X.province)
#Check population are normally distributed
install.packages("car")
library(car)
qqPlot(lm(case4$totalass ~ case4$X.province,data = case4), simulate=T, main="Q-Q Plot",
labels=F)
#Check all population variances are equal
by(case4$totalass,case4$X.province,sd)
# Ratio of largest to smallest SD, as reported in Question 2
110745.9/11162.1
#levene test
library(car)
leveneTest(case4$totalass, case4$X.province, center=median)
# One-way ANOVA
aovcase4.1 <- aov(totalass ~ X.province, data = case4)
summary(aovcase4.1)
Q3
#Check assumptions
#Check independent and simple random sample and sample sizes are equal
table(case4$own,case4$X.province)
str(case4)
#Check population are normally distributed
library(car)
qqPlot(lm(case4$totalass ~ X.province + own + own*X.province, data = case4), simulate = T,
labels=F)
#Check all population variances are equal
by(case4$totalass, list(case4$X.province,case4$own),sd)
148425.6/7562.748
#levene test
library(car)
leveneTest(case4$totalass, case4$own, center = median)
# ANOVA on ownership type only (the model reported in Question 3)
aovcase4.2 <- aov(totalass ~ own, data = case4)
summary(aovcase4.2)
Box plot:
QQ plot: