Exam Statistics
Contents
Poor ways to sample
Confidence interval
Choosing sample size
Significance tests
Errors
Group comparisons
Comparing dependent samples:
Independence and association:
Formatting excel sheet:
Shade the labels of the variables in the first row
Borders for tables
Center heading and numbers in columns
Freeze the top row (View tab > Freeze Panes)
Poor ways to sample
Convenience Sample: a type of survey sample that is easy to obtain
o Unlikely to be representative of the population
o Often severe biases result from such a sample
o Results apply ONLY to the observed subjects
Volunteer Sample: most common form of convenience sample
o Subjects volunteer for the sample
o Volunteers do not tend to be representative of the entire population
Bias: Tendency to systematically favor certain parts of the population over others:
o Sampling bias: bias resulting from the sampling method such as using
nonrandom samples or having undercoverage
o Nonresponse bias: occurs when some sampled subjects cannot be reached or
refuse to participate or fail to answer some questions
o Response bias: occurs when the subject gives an incorrect response or the
question is misleading
Notice: a large sample does not guarantee an unbiased sample!
Confidence interval
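1. CI for population proportion:
For the 95% confidence interval for a proportion p to be valid, you should have at
least 15 successes and 15 failures:
$\hat{p} \pm 1.96(se)$, where $se = \sqrt{\hat{p}(1 - \hat{p})/n}$
2. CI for population mean:
95% CI: $\bar{x} \pm t_{.025}(se)$, where $se = s/\sqrt{n}$ and df = n - 1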
3. The margin of error measures how accurate the point estimate is likely to be in estimating
a parameter. It is a multiple of the standard error of the sampling distribution of the
estimate, such as 1.96 (standard error) when the sampling distribution is a normal
distribution.
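As a rough illustration of the proportion case, the sketch below computes the point estimate, the margin of error and the 95% interval. The function name `proportion_ci` and the counts in the example are made up for illustration, and the standard error formula sqrt(p_hat(1 - p_hat)/n) is the usual large-sample one.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Return (p_hat, margin_of_error, lower, upper) for a population proportion."""
    failures = n - successes
    # Validity check from the notes: at least 15 successes and 15 failures.
    if successes < 15 or failures < 15:
        raise ValueError("need at least 15 successes and 15 failures")
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error of p_hat
    margin = z * se                           # margin of error = z * (standard error)
    return p_hat, margin, p_hat - margin, p_hat + margin

# Hypothetical data: 120 successes in a sample of 400.
p_hat, margin, lower, upper = proportion_ci(successes=120, n=400)
print(f"p_hat={p_hat:.3f}, margin={margin:.3f}, CI=({lower:.3f}, {upper:.3f})")
```

A larger n shrinks the standard error and therefore the margin of error, which is why large samples give narrow intervals.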
Interpretation of 95% confidence interval
A 95% confidence interval for the population mean of [variable] is (L, U) [units]. We can be 95%
confident that the population mean of [variable] is between L and U [units].
Large sample (explanation narrow CI)
Because the sample size was relatively large, the estimation is precise and the confidence
interval is quite narrow. The larger the sample size, the smaller the standard error and the
subsequent margin of error.
Central Limit Theorem
For random sampling with a large sample size n, the sampling distribution of the sample
mean $\bar{x}$ is approximately a normal distribution.
Interpreting a Confidence Interval for a Difference of Proportions
Check whether 0 falls in the confidence interval. If so, it is plausible (but not necessary)
that the population proportions are equal.
If all values in the confidence interval for $(p_1 - p_2)$ are positive, you can infer that
$(p_1 - p_2) > 0$, or $p_1 > p_2$. The interval shows just how much larger $p_1$ might be. If all values
in the confidence interval are negative, you can infer that $(p_1 - p_2) < 0$, or $p_1 < p_2$.
The magnitude of values in the confidence interval tells you how large any true
difference is. If all values in the confidence interval are near 0, the true difference may
be relatively small in practical terms.
In addition, if the confidence interval contains 0, then it is plausible that (p1 - p2) = 0,
that is, p1 = p2. The population proportions might be equal. In such a case insufficient
evidence exists to infer which of p1 or p2 is larger.
Interpret a Confidence Interval for a Difference of Means
Check whether or not 0 falls in the interval. When it does, 0 is a plausible value for
$(\mu_1 - \mu_2)$, meaning that possibly $\mu_1 = \mu_2$.
A confidence interval for $(\mu_1 - \mu_2)$ that contains only positive numbers suggests that
$(\mu_1 - \mu_2)$ is positive. We then infer that $\mu_1$ is larger than $\mu_2$.
A confidence interval for $(\mu_1 - \mu_2)$ that contains only negative numbers suggests that
$(\mu_1 - \mu_2)$ is negative. We then infer that $\mu_1$ is smaller than $\mu_2$.
Confidence interval for a mean in excel:
Data Analysis > Descriptive Statistics: set the input range, output range, labels, summary statistics and confidence level
CONFIDENCE.T function (returns the margin of error)
Choosing sample size
1. Choosing/ estimating sample size for proportions:
2. Choosing/ estimating sample size for means:
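A common way to choose the sample size is to solve the margin-of-error formula for n. The sketch below assumes the standard large-sample formulas n = p(1 - p)z^2/m^2 for a proportion and n = sigma^2 z^2/m^2 for a mean, where m is the desired margin of error. The function names and the guessed values of p and sigma are illustrative assumptions.

```python
import math

def sample_size_proportion(margin, p_guess=0.5, z=1.96):
    """Smallest n giving the desired margin of error for a proportion.

    p_guess=0.5 is the conservative choice: it maximizes p(1 - p).
    """
    return math.ceil(p_guess * (1 - p_guess) * z ** 2 / margin ** 2)

def sample_size_mean(margin, sigma_guess, z=1.96):
    """Smallest n giving the desired margin of error for a mean.

    sigma_guess is an educated guess of the population standard deviation.
    """
    return math.ceil(sigma_guess ** 2 * z ** 2 / margin ** 2)

n_prop = sample_size_proportion(margin=0.03)             # +/- 3 percentage points
n_mean = sample_size_mean(margin=2, sigma_guess=10)      # +/- 2 units
print(n_prop, n_mean)
```

Rounding up with `math.ceil` keeps the margin of error at or below the target.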
Significance tests
A significance test is conducted and the probability value (p-value) reflects the strength of
the evidence against the null hypothesis. If the probability is below 0.01, the data provide
strong evidence that the null hypothesis is false. If the probability value is below 0.05 but
larger than 0.01, then the null hypothesis is typically rejected, but not with as much
confidence as it would be if the probability value were below 0.01. Probability values
between 0.05 and 0.10 provide weak evidence against the null hypothesis and, by
convention, are not considered low enough to justify rejecting it. Higher probabilities
provide less evidence that the null hypothesis is false.
When a probability value is below the significance level $\alpha$, the effect is statistically significant and the
null hypothesis is rejected. If the null hypothesis is rejected, then the alternative to the
null hypothesis (called the alternative hypothesis) is accepted.
When a significance test results in a high probability value, it means that the data provide
little or no evidence that the null hypothesis is false. However, the high probability value
is not evidence that the null hypothesis is true.
The test statistic measures how far the sample proportion falls from the null hypothesis
value, $p_0$, relative to what we'd expect if H0 were true
1. Significance test proportions:
For a proportion, the p-value is calculated by applying NORM.DIST to the z-score (mean 0, standard deviation 1)
2. Significance test means:
For a mean, the p-value is calculated with T.DIST.RT (right-tail probability), 1 - T.DIST.RT (left-tail
probability) and T.DIST.2T (two-tailed probability)
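The proportion test can be sketched outside Excel as well. The example below assumes the usual test statistic z = (p_hat - p0)/se0 with se0 = sqrt(p0(1 - p0)/n), and uses Python's `statistics.NormalDist` in place of NORM.DIST. The function name and the data are hypothetical.

```python
import math
from statistics import NormalDist

def proportion_z_test(successes, n, p0, alternative="two-sided"):
    """Return (z, p_value) for H0: p = p0."""
    p_hat = successes / n
    se0 = math.sqrt(p0 * (1 - p0) / n)   # standard error computed under H0
    z = (p_hat - p0) / se0
    cdf = NormalDist().cdf(z)            # left-tail probability of z
    if alternative == "greater":
        p_value = 1 - cdf                # right tail
    elif alternative == "less":
        p_value = cdf                    # left tail
    else:
        p_value = 2 * min(cdf, 1 - cdf)  # both tails
    return z, p_value

# Hypothetical data: 60 successes in 100 trials, H0: p = 0.5.
z, p_value = proportion_z_test(successes=60, n=100, p0=0.5)
print(f"z={z:.2f}, p-value={p_value:.4f}")
```

With these numbers the two-sided p-value falls just below 0.05, so H0 would be rejected at the 0.05 level but not at the 0.01 level.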
Errors
Type I and type II errors
o When H0 is true, a type I error occurs when H0 is rejected. We can control the
probability of a Type I error by our choice of the significance level $\alpha$. The more serious the
consequences of a Type I error, the smaller $\alpha$ should be
o When H0 is false, a type II error occurs when H0 is not rejected. We denote P(Type II
error) = $\beta$. We do not choose $\beta$ directly as we choose $\alpha$.
The power of a test is given by Power = 1 - $\beta$
Group comparisons
1. Confidence Interval for comparing two groups of proportions:
2. Confidence Interval for comparing two groups of means:
Requirements:
o Independent random samples from two groups, either from random sampling or a
randomized experiment
o An approximately normal population distribution for each group. This is mainly
important for small sample sizes, and even then the method is robust to
violations of this assumption
Comparing means for two groups in excel:
o Data analysis: t-Test: Two-Sample Assuming Equal Variances (individual
observations)
o T.TEST function
o For paired groups: Data analysis: t-Test: Paired Two Sample for Means
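For independent samples, the intervals can be sketched as follows. The sketch assumes the usual unpooled standard errors, sqrt(p1(1 - p1)/n1 + p2(1 - p2)/n2) for proportions and sqrt(s1^2/n1 + s2^2/n2) for means. The function names and data are hypothetical, and the critical value `crit` for means should come from a t table or Excel (1.96 is only a large-sample stand-in).

```python
import math

def diff_proportions_ci(x1, n1, x2, n2, z=1.96):
    """95% CI for (p1 - p2) from two independent samples."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z * se, diff + z * se

def diff_means_ci(mean1, s1, n1, mean2, s2, n2, crit=1.96):
    """CI for (mu1 - mu2); crit is the t (or z) critical value supplied by the user."""
    se = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    diff = mean1 - mean2
    return diff - crit * se, diff + crit * se

# Hypothetical data.
lo_p, hi_p = diff_proportions_ci(x1=60, n1=100, x2=45, n2=100)
lo_m, hi_m = diff_means_ci(mean1=10, s1=2, n1=50, mean2=9, s2=2, n2=50)
print(f"proportions: ({lo_p:.3f}, {hi_p:.3f})")
print(f"means:       ({lo_m:.3f}, {hi_m:.3f})")
```

Both intervals here contain only positive values, so in each case the first group's parameter would be inferred to be larger.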
Comparing dependent samples:
o To compare proportions with dependent samples, construct confidence intervals and
significance tests using the single sample of the difference scores: $p_i = x_{i1} - x_{i2}$
o The 95% confidence interval $p_d \pm 1.96\,se_{p_d}$ and the test statistic $z = (p_d - 0)/se_{p_d}$
are the same as for a single sample. The assumptions are also the same: a random sample
or a randomized experiment and at least 15 successes and 15 failures in the sample of
difference scores
o To compare means with dependent samples, construct confidence intervals and
significance tests using the single sample of the difference scores: $d_i = x_{i1} - x_{i2}$
o The 95% confidence interval $\bar{x}_d \pm t_{.025}(se)$ and the test statistic $t = (\bar{x}_d - 0)/se$,
with $se = s_d/\sqrt{n}$, are the same as for a single sample. The assumptions are also the same:
a random sample or a randomized experiment and a normal population distribution of
the difference scores
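The paired-means procedure reduces to a one-sample analysis of the difference scores, which can be sketched as below. The function name and the before/after data are made up, and the p-value would still come from a t table or T.DIST.2T with the returned degrees of freedom.

```python
import math
from statistics import mean, stdev

def paired_summary(first, second):
    """Return (d_bar, se, t, df) for the difference scores first - second."""
    diffs = [a - b for a, b in zip(first, second)]  # d_i = x_i1 - x_i2
    n = len(diffs)
    d_bar = mean(diffs)              # sample mean of the differences
    s_d = stdev(diffs)               # sample standard deviation of the differences
    se = s_d / math.sqrt(n)          # standard error of d_bar
    t = (d_bar - 0) / se             # test statistic for H0: mean difference = 0
    return d_bar, se, t, n - 1

# Hypothetical paired measurements on six subjects.
before = [12, 15, 11, 14, 13, 16]
after = [10, 14, 10, 11, 12, 13]
d_bar, se, t, df = paired_summary(before, after)
print(f"d_bar={d_bar:.3f}, se={se:.3f}, t={t:.3f}, df={df}")
```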
Independence and association:
Two categorical variables are independent if the population conditional distribution for
one of them is identical at each category of the other; the variables are dependent (or
associated) if the conditional distributions are not identical. Even if the variables are
independent, we would not expect the sample conditional distributions to be identical.
Because of sampling variability, each sample percentage typically differs somewhat from
the true population percentage
Construct a contingency table (which displays two categorical variables: the rows list the
categories of one variable and the columns list the categories of the other variable). Entries
in the table are frequencies; the percentages in a particular row of a table are called
conditional percentages and they form the conditional distribution
To test for independence conduct a CHI SQUARED TEST
The CHI-SQUARED TEST:
1. Assumptions:
Two categorical variables
Randomization (such as a randomized experiment or a random sample)
Expected cell count >= 5 in each cell; otherwise use Fisher's exact test
2. Hypotheses:
H0: The two variables are independent
Ha: The two variables are dependent
What to expect under H0: The count in any particular cell is a random variable;
the mean of its distribution is called an expected cell count
3. The chi-squared statistic: summarizes how far the observed cell counts in a contingency
table fall from the expected cell counts for a null hypothesis. The formula is
$X^2 = \sum \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}$
For each cell, square the difference between the observed count and the expected
count and then divide that square by the expected count. After calculating this
term for every cell, sum the terms to find $X^2$
4. P-value:
To convert the chi-squared test statistic to a p-value, use the sampling distribution of
the chi-squared statistic (for large sample sizes, this sampling distribution is well
approximated by the chi-squared distribution)
Main properties of chi-squared distribution:
o It falls on the positive part of the real number line
o The precise shape of the distribution depends on the degrees of freedom
o The mean of the distribution equals the df value
o It is skewed to the right
o The larger the $X^2$ value, the greater the evidence against H0
The p-value is the right-tail probability for the observed $X^2$ value for the
chi-squared distribution with df = (r-1)(c-1), r = number of rows, c = number of columns
in the contingency table
5. Conclusion: report and interpret the p-value in context; reject the null hypothesis if p-value <=
significance level; if the null hypothesis is rejected, there is evidence that the two variables
are associated
6. In excel:
CHIDIST (gives the p-value directly from the $X^2$ value and df)
CHITEST (gives the p-value from the actual and expected ranges)
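The steps of the test can be sketched in a few lines. The function `chi_squared` below computes the expected counts (row total * column total / grand total), the X^2 statistic and df for any table. The helper `p_value_df1` is a shortcut that is only valid for df = 1 (a 2 x 2 table), where the right-tail chi-squared probability equals a two-sided normal probability; for other df, CHIDIST or a chi-squared table is needed. The names and the counts are illustrative.

```python
import math

def chi_squared(observed):
    """Return (x2, df, expected) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    # Expected count under independence: row total * column total / grand total.
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    x2 = sum((o - e) ** 2 / e
             for obs_row, exp_row in zip(observed, expected)
             for o, e in zip(obs_row, exp_row))
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return x2, df, expected

def p_value_df1(x2):
    """Right-tail chi-squared probability; valid ONLY when df = 1."""
    return math.erfc(math.sqrt(x2 / 2))

# Hypothetical 2 x 2 table of observed counts.
observed = [[30, 20],
            [20, 30]]
x2, df, expected = chi_squared(observed)
p_value = p_value_df1(x2)
print(f"X2={x2:.2f}, df={df}, p-value={p_value:.4f}")
```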
Measure of strength:
A large chi-squared value provides strong evidence that the variables are associated, but
it doesn't indicate how strong the association is; it merely indicates, through its p-value,
how certain we can be that the variables are associated
The strength of an association can be measured by difference in proportions p1-p2 or by
the ratio p1/p2
Pattern of association:
A standardized residual for each cell is like a z-score; values below -3 or above 3 are
unlikely under independence and hence indicate dependence
Fisher's exact test:
Used when the expected frequencies are small, any of them being < 5 (since the chi-squared
test of independence is a large-sample test, small-sample tests are more appropriate)
Fisher's exact test is a small-sample test of independence
Complex calculations
The smaller the p-value, the stronger the evidence that the variables are associated
For two such variables X and Y, with m and n observed states, form an m x n matrix in
which the entries $a_{ij}$ represent the number of observations in which x = i and y = j. Calculate
the row and column sums $R_i$ and $C_j$ and the total sum
Significance
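For the simplest case, a 2 x 2 table, the calculation can be sketched directly. With the row and column sums held fixed, the count in the top-left cell follows a hypergeometric distribution. The sketch below uses one common two-sided convention (sum the probabilities of all tables that are no more likely than the observed one), which is an assumption here rather than something stated in the notes. The function name and the counts are hypothetical.

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    r1, r2 = a + b, c + d          # row sums
    c1 = a + c                     # first column sum
    n = r1 + r2                    # total sum

    def prob(k):
        # Hypergeometric probability of k in the top-left cell, margins fixed.
        return comb(r1, k) * comb(r2, c1 - k) / comb(n, c1)

    p_obs = prob(a)
    k_min = max(0, c1 - r2)
    k_max = min(r1, c1)
    # Sum the probabilities of all tables at most as likely as the observed one
    # (small tolerance guards against floating-point ties).
    return sum(prob(k) for k in range(k_min, k_max + 1)
               if prob(k) <= p_obs * (1 + 1e-9))

# Hypothetical small-sample table.
p_value = fisher_exact_2x2(3, 1, 1, 3)
print(f"p-value={p_value:.4f}")
```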
Correlation analysis excel:
CORREL function (two variables only)
Data Analysis > Correlation (several variables, placed in adjoining columns; the input range
contains the entire range of columns; doesn't return significance levels, so to obtain
these use linear regression analysis)
Regression analysis with 2 variables:
1. Using trendlines: comparing linear and non-linear functions
Trendline = regression function (line) which is inserted into a scatter plot
Lacks significance levels of coefficients and residuals
When constructing a scatter plot, the explanatory variable must be placed on the
x-axis, so it is placed in a column on the left-hand side of the other variables
Add trendline: right-click any data marker in the scatter plot > Add Trendline
(linear type), display equation on chart, display R-squared value on chart
Exponential, logarithmic and power trendlines can't be created with negative
values (or 0)
2. Using data analysis regression on a linear relationship:
1. Data Analysis > Regression: Input Y (response variable), Input X (explanatory
variable), labels, confidence level, output range, all kinds of residuals and line fit
plots
2. Two graphs for the output: residual plot and line fit plot (the latter is a scatter plot, so there is no
need to make one beforehand)
Conducting a significance test:
If the p-value is less than 0.10, we can reject the null hypothesis at the 0.10 significance level (90% confidence level)
If the p-value is less than 0.05, we can reject the null hypothesis at the 0.05 significance level (95% confidence level)
If the p-value is less than 0.01, we can reject the null hypothesis at the 0.01 significance level (99% confidence level)
Confidence interval and significance test for dependent sample means:
Construct confidence intervals and significance tests using the single sample of the
difference scores: add a new column with the difference between the first two columns (A - B)
and calculate x_bar, n, st.dev, se, me and the other statistics for that column. The
assumptions are the same as for a confidence interval for one mean.
For the significance test, calculate the difference between columns A and B (A - B), then
calculate the mean, st.dev, se and the test statistic for the difference. The assumptions are
the same as for a significance test for a mean. Hypotheses: H0: $\mu_1 = \mu_2$ and
Ha: $\mu_1 > \mu_2$, $\mu_1 \neq \mu_2$ or $\mu_1 < \mu_2$. Calculate the test statistic and p-value, then state the
conclusion.
Testing categorical variables for independence (chi-squared test)
Create a contingency table of the actual frequencies
Create a contingency table of the expected frequencies (for a cell, the expected value is
the column total * the row total / the grand total); use $ to lock the row, column or cell
Conduct the chi-squared test with the function CHISQ.TEST(actual range without totals,
expected range), which gives the p-value directly; the p-value then needs to be interpreted
Modelling the relationship between 2 quantitative variables (association between
quantitative variables is done using regression analysis):
Aim: to use sample data to make inference about population relationships
First step in a regression analysis: identify the response variable (outcome on which
comparisons are made, denoted by y) and the explanatory variable (defines the groups to
be compared, denoted by x)
Construct a scatterplot and check for a linear trend; a straight line (the regression line) can be
fitted through the data points to describe that trend
The regression line predicts the value of the response variable y as a straight-line function
of the value x of the explanatory variable
$\hat{y}$ is the predicted value of y
Equation for the regression line: $\hat{y} = a + bx$, where a is the y-intercept and b is the slope
Check for outliers by plotting the data
The regression line can be pulled toward an outlier and away from the general trend of
points. An observation can be influential in affecting the regression line when two things
happen:
o Its x value is low or high compared to the rest of the data
o It does not fall in the straight-line pattern that the rest of the data have
Residuals measure the size of the prediction errors, the vertical distance between the
point and the regression line
o Each observation has a residual
o Calculation for each residual: $y - \hat{y}$
o A large residual indicates an unusual observation
How to find the regression line? Least squares method: this method gives the line that has
the smallest value for the residual sum of squares in using $\hat{y} = a + bx$ to predict y.
Characteristics of the least squares line: it has some positive residuals and some
negative residuals, but the sum of the residuals equals 0. The line passes through $(\bar{x}, \bar{y})$
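The least squares line and its two characteristics can be checked numerically. The sketch below uses the standard closed-form solution b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2) and a = y_bar - b * x_bar; the function name and the data are made up for illustration.

```python
from statistics import mean

def least_squares(xs, ys):
    """Return (a, b) for the least squares line y_hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx              # slope
    a = y_bar - b * x_bar      # intercept: forces the line through (x_bar, y_bar)
    return a, b

# Hypothetical data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = least_squares(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]  # y - y_hat for each point
print(f"a={a:.2f}, b={b:.2f}, sum of residuals={sum(residuals):.10f}")
```

The residuals are a mix of positive and negative values that cancel out, and plugging x_bar into the fitted line returns y_bar.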
Regression analysis in Excel: Analysis ToolPak: [Data] - [Data Analysis] - [Regression]
+Graphically we can also add a linear trend line to a scatterplot as we have seen
The population regression equation describes the relationship in the population between x
and the mean of y. The equation is $\mu_y = \alpha + \beta x$
In practice we estimate the population regression equation using the prediction equation
for the sample data. The population regression equation merely approximates the actual
relationship between x and the population mean of y
It is a model. A model is a simple approximation for how variables relate in the
population
Regression analysis finds out whether there is any association between the two quantitative variables
Correlation finds out how strong the connection between the variables is
The correlation coefficient:
The correlation summarizes the direction of the association between 2 quantitative
variables and the strength of its straight-line trend. Denoted by r.
Interpret it: Takes values between -1 and 1. A positive value indicates a positive
association, a negative value indicates a negative association. The closer r is to +1 or -1, the
closer the data points fall to a straight line and the stronger the linear association is. The
closer r is to 0, the weaker the linear association is.
In Excel, the correlation coefficient is calculated with the CORREL function or the Analysis
ToolPak: Data > Data Analysis > Correlation
The correlation describes the strength of the linear association between 2 variables,
doesn't change when the units of measurement change and doesn't depend upon which
variable is the response and which is the explanatory
The slope is a numerical value that depends on the units used to measure the variables;
it doesn't tell us whether the association is strong or weak, and the two variables must be
identified as response and explanatory variables
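The contrast between the correlation and the slope can be illustrated with a small sketch. It computes r from the standard formula sum((x - x_bar)(y - y_bar)) / sqrt(sum((x - x_bar)^2) * sum((y - y_bar)^2)) and then rescales x, as if the units had changed. The function name and the data are hypothetical.

```python
import math
from statistics import mean

def correlation(xs, ys):
    """Correlation coefficient r between two quantitative variables."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = correlation(xs, ys)
r_rescaled = correlation([100 * x for x in xs], ys)   # change the units of x
r_swapped = correlation(ys, xs)                        # swap response and explanatory
print(f"r={r:.4f}, rescaled={r_rescaled:.4f}, swapped={r_swapped:.4f}")
```

All three values agree, whereas a slope fitted to the rescaled x values would shrink by a factor of 100.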
Correlation squared ($R^2$):
Another way to describe the strength of association; refers to how close the predictions for y
tend to be to the observed y values
The variables are strongly associated if you can predict y much better by substituting x
values into the prediction equation than by merely using the sample mean $\bar{y}$ and ignoring
x
The prediction error = difference between the observed and the predicted values of y; each
error is $(y - \hat{y})$
When a strong linear association exists, the regression equation predictions tend to be
much better than the predictions using $\bar{y}$
Measure the proportional reduction in error and call it $R^2$
Properties of $R^2$:
o It summarizes the reduction in sum of squared errors in predicting y using the
regression line instead of using the mean of y.
o Falls between 0 and 1
o $R^2 = 1$ when RSS = $\sum (y - \hat{y})^2 = 0$: this happens only when all the data points fall exactly on the
regression line
o $R^2 = 0$ when $\sum (y - \hat{y})^2 = \sum (y - \bar{y})^2$: this happens when the
slope = 0, in which case $\hat{y} = \bar{y}$
o The closer $R^2$ is to 1, the stronger the linear association: the more effective the
regression equation is compared to $\bar{y}$ in predicting y
$R^2$ vs $r_{xy}$:
$r_{xy}$ falls between -1 and 1; it represents the slope of the regression line when x and y have
been standardized
$R^2$ falls between 0 and 1; it summarizes the reduction in sum of squared errors in
predicting y using the regression line instead of using $\bar{y}$
Both are descriptive parts of a regression analysis
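The proportional reduction in error can be computed directly. The sketch below compares the sum of squared errors when predicting with the mean of y against the sum of squared errors when predicting with the least squares line. The function name and the data are hypothetical.

```python
from statistics import mean

def r_squared(xs, ys):
    """Proportional reduction in squared prediction error from using the regression line."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    a = y_bar - b * x_bar
    error_mean = sum((y - y_bar) ** 2 for y in ys)                      # predicting with y_bar
    error_line = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))    # predicting with y_hat (RSS)
    return (error_mean - error_line) / error_mean

# Hypothetical data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r2 = r_squared(xs, ys)
print(f"R^2={r2:.3f}")
```

For this data set the squared errors drop from 6 to 2.4, a 60% reduction, and the result equals the square of the correlation between x and y.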
The inferential parts of regression use the tools of confidence intervals and significance tests to
provide inference about the regression equation, the correlation and R2 in the population of
interest.
Assumptions for regression line for description:
Population means of y at different values of x have a straight line relationship with x; this
assumption states that a straight line regression model is valid and can be verified with a
scatterplot
Assumptions for using regression to make statistical inference:
Data were gathered using randomization
The population values of y at each value of x follow a normal distribution, with the same
standard deviation at each x value
Testing independence between quantitative variables:
Suppose the slope beta of the regression line = 0; then the mean of y is identical at each x
value and the two variables, x and y, are statistically independent: the outcome for y does not
depend on the value of x, so it doesn't help us to know the value of x if we want to predict
the value of y
Conducting a significance test about a population slope beta:
1. Assumptions:
a) Population satisfies regression line: $\mu_y = \alpha + \beta x$
b) Data gathered using randomization
c) Population y values at each x value have normal distribution with same
standard deviation at each x value
2. Hypotheses:
a) H0: $\beta = 0$
b) Ha: $\beta \neq 0$
3. Test statistic: $t = (b - 0)/se$, where b is the sample slope and se is its standard error
4. P-value: Two-tail probability of t test statistic values more extreme than
observed, using the t distribution with df = n-2
5. Conclusion: Interpret the p-value in context. Reject H0 if p-value <= significance
level
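The test statistic for the slope can be sketched as follows, assuming the usual formulas: residual standard deviation s = sqrt(RSS/(n - 2)) and standard error of the slope se = s / sqrt(sum((x - x_bar)^2)). The function name and the data are hypothetical. The p-value is left to a t table or to T.DIST.2T with the returned df.

```python
import math
from statistics import mean

def slope_t_test(xs, ys):
    """Return (b, se_b, t, df) for testing H0: beta = 0."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    a = y_bar - b * x_bar
    rss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(rss / (n - 2))      # residual standard deviation
    se_b = s / math.sqrt(sxx)         # standard error of the sample slope
    t = (b - 0) / se_b                # test statistic
    return b, se_b, t, n - 2

# Hypothetical data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b, se_b, t, df = slope_t_test(xs, ys)
print(f"b={b:.2f}, se={se_b:.4f}, t={t:.4f}, df={df}")
```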
Confidence interval for beta:
Small p value in significance test of beta=0 suggests that the population regression line
has a nonzero slope
To learn how far the slope beta falls from 0, we construct a confidence interval
Variation around the regression line:
A residual is a prediction error; magnitude of these residuals depends on units of
measurement for y
A standardized version doesn't depend on the units
The se formula is complex; software will calculate it
A standardized residual indicates how many standard errors a residual falls from 0
If the relationship is truly linear and the standardized residuals have approximately a bell-shaped
distribution, observations with standardized residuals greater than 3 in absolute value often represent
outliers.
A histogram of the residuals or standardized residuals is a good way of detecting unusual
observations. It is also good for checking the assumption that the conditional distribution of y at
each x value is normal (look for a bell-shaped histogram)
If the histogram is not bell-shaped, the distribution of the residuals is not normal. However, two-sided
inferences about the slope parameter still work quite well, because the t-inferences are robust
For statistical inference, the regression model assumes that the conditional distribution of y at a
fixed value of x is normal, with the same standard deviation at each x
This standard deviation, denoted by $\sigma$, refers to the variability of y values for all subjects with
the same x value
The estimate of $\sigma$, obtained from the data, is the residual standard deviation: $s = \sqrt{\sum (y - \hat{y})^2 / (n - 2)}$
Confidence interval for $\mu_y$: $\hat{y} \pm t_{.025}(se \text{ of } \hat{y})$, df = n-2
Prediction interval for y: the estimate $\hat{y} = a + bx$ for the mean of y at a fixed value of x is also a
prediction for an individual outcome y at the fixed value of x
Confidence interval for $\mu_y$ vs prediction interval for y:
The confidence interval for $\mu_y$ is an inference about where a population mean falls; use a
confidence interval if you want to estimate the mean of y for all individuals having a particular x
value. It is approximately equal to $\hat{y} \pm 2(s/\sqrt{n})$
The prediction interval for y is an inference about where individual observations fall; use it if you
want to predict where a single observation on y will fall for a particular x
value. It is approximately equal to $\hat{y} \pm 2s$, where s is the residual standard
deviation
Exponential regression models:
If the scatterplot indicates substantial curvature in a relationship, then equations that provide
curvature are needed
Occasionally, a scatterplot has a parabolic appearance: as x increases, y increases then it goes
back down
More often, y tends to continually increase or continually decrease, but the trend shows
curvature
An exponential regression model has the formula: $\mu_y = \alpha \beta^x$
In the exponential regression equation, the explanatory variable x appears as the exponent of a
parameter
The mean $\mu_y$ and the parameter $\beta$ can take only positive values
As x increases, the mean $\mu_y$ continually increases when $\beta > 1$; it continually decreases when $0 < \beta < 1$
For exponential regression, the logarithm of the mean is a linear function of x
When the exponential regression model holds, a plot of the log of the y values versus x should
show an approximate straight-line relation with x
Taking logs: $\log(\mu_y) = \log(\alpha) + x \log(\beta)$
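Because the log of the mean is linear in x, one simple way to fit the model is to run an ordinary least squares regression of log(y) on x and transform the coefficients back. The sketch below assumes the model has the form mean of y = alpha * beta^x, which matches the description above (x appears as the exponent of a parameter). The function name and the data are made up, and the data were chosen to follow y = 3 * 2^x exactly.

```python
import math
from statistics import mean

def fit_exponential(xs, ys):
    """Fit y = alpha * beta**x by least squares on log(y). Requires all y > 0."""
    logs = [math.log(y) for y in ys]
    x_bar, l_bar = mean(xs), mean(logs)
    slope = (sum((x - x_bar) * (l - l_bar) for x, l in zip(xs, logs))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = l_bar - slope * x_bar
    # log(y) = log(alpha) + x*log(beta), so exponentiate both coefficients.
    return math.exp(intercept), math.exp(slope)

# Hypothetical data following y = 3 * 2**x.
xs = [0, 1, 2, 3, 4]
ys = [3, 6, 12, 24, 48]
alpha, beta = fit_exponential(xs, ys)
print(f"alpha={alpha:.3f}, beta={beta:.3f}")
```

Here beta > 1, so the fitted mean continually increases with x: each one-unit increase in x multiplies the predicted value by beta.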
The regression equation is often called a prediction equation, because substituting
a particular value of x into the equation provides a prediction for y at that value of x.
The output tells us that $\hat{y}$ = predicted maximum bench press (BP)
relates to x = number of 60-pound bench presses by $\hat{y} = 63.5 + 1.49x$.
The slope of 1.49 tells us that the predicted maximum bench press $\hat{y}$
increases by an average of about 1 1/2 pounds for every additional 60-pound
bench press an athlete can do. The impact on $\hat{y}$ of a 33-unit change in
x, from the sample minimum of x = 2 to the maximum of x = 35, is
33(1.49) = 49.2 pounds. An athlete who can do thirty-five 60-pound
bench presses has a predicted maximum bench press nearly 50 pounds
higher than an athlete who can do only two 60-pound bench presses.
Those predicted values are $\hat{y}$ = 63.5 + 1.49(2) = 66.5 pounds at
x = 2 and $\hat{y}$ = 63.5 + 1.49(35) = 115.7 pounds at x = 35.
The slope of 1.49 is positive: as x increases, the predicted value $\hat{y}$ increases.
The association is positive. When the association is negative, the predicted
value $\hat{y}$ decreases as x increases. When the slope = 0, the regression line is
horizontal.
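The arithmetic in the bench press example can be reproduced with a few lines. The intercept 63.5 and slope 1.49 are taken from the example; the function name is made up.

```python
def predict_max_bench_press(x):
    """Predicted maximum bench press (pounds) for x 60-pound bench presses."""
    return 63.5 + 1.49 * x

low = predict_max_bench_press(2)     # prediction at the sample minimum of x
high = predict_max_bench_press(35)   # prediction at the sample maximum of x
change = high - low                  # equals slope * (35 - 2)
print(f"x=2: {low:.2f}, x=35: {high:.2f}, change: {change:.2f}")
```

The change between the two predictions depends only on the slope and the 33-unit change in x, not on the intercept.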
Performing a polynomial regression analysis.
1. Identify the response (y) and the explanatory (x) variables; place x in the left column
2. Create a scatterplot and add a polynomial trendline with its equation
3. Create a column with the squared explanatory variable, using the function POWER(cell,2)
4. Data Analysis > Regression: Input Y (the y variable), Input X (the x and $x^2$ columns), labels, and select
the residuals needed.
5. The polynomial equation = intercept + coefficient*x + coefficient*$x^2$
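The same quadratic fit can be sketched without Excel. The example below builds the columns 1, x and x^2 (mirroring the POWER(cell,2) column) and solves the least squares normal equations with a small Gauss-Jordan elimination. This is only one possible way to do the fit, written with the standard library; the function name and the data are hypothetical, and the data follow y = 1 + 2x + 3x^2 exactly.

```python
def fit_quadratic(xs, ys):
    """Least squares fit of y = c0 + c1*x + c2*x**2; returns [c0, c1, c2]."""
    # Design columns: 1, x, x^2.
    rows = [[1.0, x, x ** 2] for x in xs]
    # Augmented 3x4 system of normal equations (X'X | X'y).
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(3)]
           + [sum(r[i] * y for r, y in zip(rows, ys))]
           for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(3):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][3] / aug[i][i] for i in range(3)]

# Hypothetical data following y = 1 + 2x + 3x^2.
xs = [-2, -1, 0, 1, 2]
ys = [9, 2, 1, 6, 17]
intercept, coef_x, coef_x2 = fit_quadratic(xs, ys)
print(f"y_hat = {intercept:.3f} + {coef_x:.3f}*x + {coef_x2:.3f}*x^2")
```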
Is there interaction? Categorical predictors
For two explanatory variables, interaction exists between them in their effects on the response
variable when the slope of the relationship between y and one of them changes as the value
of the other changes
To allow for interaction with two explanatory variables, one quantitative and one categorical,
you can fit a separate regression line, with a different slope, between the two quantitative
variables for each category of the categorical variable.
Or you can generate an interaction term by multiplying the two variables and include it in the
regression
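Both approaches can be illustrated on a small made-up data set with one quantitative variable x and one 0/1 categorical variable `group`. The sketch fits a separate slope for each category and also builds the interaction column x * group that would be added as an extra predictor in the regression. All names and numbers are hypothetical.

```python
from statistics import mean

def slope(xs, ys):
    """Least squares slope of y on x."""
    x_bar, y_bar = mean(xs), mean(ys)
    return (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
            / sum((x - x_bar) ** 2 for x in xs))

# Hypothetical data: group is a 0/1 indicator for the categorical variable.
x =     [1, 2, 3, 4, 1, 2, 3, 4]
group = [0, 0, 0, 0, 1, 1, 1, 1]
y =     [2, 3, 4, 5, 2, 5, 8, 11]

# Approach 1: a separate regression slope for each category.
slopes = {}
for g in sorted(set(group)):
    xs_g = [xi for xi, gi in zip(x, group) if gi == g]
    ys_g = [yi for yi, gi in zip(y, group) if gi == g]
    slopes[g] = slope(xs_g, ys_g)

# Approach 2: an interaction column to include alongside x and group.
interaction = [xi * gi for xi, gi in zip(x, group)]

print(slopes)
print(interaction)
```

The slopes differ between the two categories, which is what interaction means here; with no interaction the two lines would be parallel.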