Statistics for Data Analysts

Summarized Cheat Sheet - Hypothesis Testing

HYPOTHESIS TESTING

• x̄ = (1/n) Σᵢ₌₁ⁿ xᵢ

• s₁² = [1/(n−1)] Σᵢ₌₁ⁿ (xᵢ − x̄)²

• s² = (1/n) Σᵢ₌₁ⁿ (xᵢ − x̄)²
• If E(statistic) = parameter

• then the statistic is said to be an Unbiased Estimate of the parameter.

• Sample mean is an unbiased estimate of the population mean.

• This means that the average of all sample means equals the population mean.

• E(x̄) = μ

• Also, E(s₁²) = σ² and E(s²) ≠ σ²

• Unknown parameters are estimated using sample observations.

• Parameter values are fixed.

• Values of statistics vary from sample to sample.

• Each sample has some probability of being chosen.

• Each value of a statistic is associated with probability.

• Thus, Statistic is a random variable.

• Distribution of a statistic is called a sampling distribution.

• Distribution of a statistic may not be the same as the distribution of the population.

• We saw in the previous example that E(x̄) = μ and Var(x̄) = σ²/n.

• This is always true and can be proved as below:

• E(x̄) = E[(1/n) Σᵢ₌₁ⁿ xᵢ] = (1/n) Σᵢ₌₁ⁿ E(xᵢ) = (1/n) Σᵢ₌₁ⁿ μ = μ

• Var(x̄) = Var[(1/n) Σᵢ₌₁ⁿ xᵢ] = (1/n²) Σᵢ₌₁ⁿ Var(xᵢ) = (1/n²) Σᵢ₌₁ⁿ σ² = (1/n²)·nσ² = σ²/n
• The square root of the variance is generally called the standard deviation.

• Here we shall call it Standard Error.

• Different samples of the same size from the same population yield different sample means.

• Standard Error of x is a measure of the variability in different values of sample mean.
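The result Standard Error(x̄) = σ/√n can be checked with a quick simulation (a sketch using only the Python standard library; the population N(0, 1) and the sample sizes are illustrative choices, not from the text):

```python
import random
import statistics

random.seed(42)

# Draw many samples of size n from a N(0, 1) population and compare the
# spread of the sample means with the theoretical standard error sigma/sqrt(n).
n = 25
num_samples = 20000
population_sigma = 1.0

means = [statistics.fmean(random.gauss(0, population_sigma) for _ in range(n))
         for _ in range(num_samples)]

observed_se = statistics.pstdev(means)           # spread of the sample means
theoretical_se = population_sigma / n ** 0.5     # sigma/sqrt(n) = 0.2
print(round(observed_se, 2), theoretical_se)     # both ≈ 0.2
```

The observed spread of the 20,000 sample means closely matches σ/√n, illustrating that the sample mean varies less as n grows.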

Central Limit Theorem


• When the population distribution is N(μ, σ),

• then x̄ ~ N(μ, σ/√n).

• When the population distribution is not normal,

• then also x̄ ~ N(μ, σ/√n), provided n → ∞.

• Practically, this result is true for n ≥ 30.

• The result may also be written as

• (x̄ − μ)/(σ/√n) ~ N(0, 1)
• Clearly, this result is valid when

• Sample comes out of a normal population, or

• Sample size is large (n ≥ 30).

• Suppose a population has mean μ = 8 and standard deviation σ = 3.


• Suppose a random sample of size n = 36 is selected.
• What is the probability that the sample mean is between 7.75 and 8.25?
• P ( 7.75< x <8.25 ) ?
• Since x̄ ~ N(μ, σ/√n) and σ/√n = 3/√36 = 0.5, we have x̄ ~ N(8, 0.5).
• Using Excel,
• P(7.75 < x̄ < 8.25) = NORM.DIST(8.25,8,0.5,1) - NORM.DIST(7.75,8,0.5,1)
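The same probability can be reproduced in Python with the standard library's `statistics.NormalDist`, which plays the role of Excel's cumulative NORM.DIST here:

```python
from statistics import NormalDist

mu, sigma, n = 8, 3, 36
se = sigma / n ** 0.5            # standard error = 3/6 = 0.5

# Sampling distribution of the mean: x̄ ~ N(8, 0.5)
xbar = NormalDist(mu=mu, sigma=se)
p = xbar.cdf(8.25) - xbar.cdf(7.75)
print(round(p, 4))               # 0.3829
```

So roughly a 38% chance that the sample mean lands within ±0.25 of μ.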

POPULATION & SAMPLE PROPORTIONS

• X and π are population parameters.


• x and p are sample statistics.
• p provides an estimate of π .
• Note that x ~ B(n, π)
• E(x) = nπ,
• Var(x) = nπ(1 − π).
• This implies that
• E(p) = E(x/n) = π,
• Var(p) = Var(x/n) = nπ(1 − π)/n² = π(1 − π)/n.
• Standard Error(p) = √[Var(p)] = √[π(1 − π)/n]

• When the sample size n is large enough, the binomial distribution approaches the normal distribution.
• So, for large n,

• (p − π)/√[π(1 − π)/n] ~ N(0, 1).

• This is a particular case of the central limit theorem.
• Practically, this result is true for n ≥ 30,
• or when nπ ≥ 5 as well as n(1 − π) ≥ 5.

• We have seen the following 2 results:


• (x̄ − μ)/(σ/√n) ~ N(0, 1)
• This result is valid:
• When sample size is 30 or more, or
• When the parent population has a normal distribution

• (p − π)/√[π(1 − π)/n] ~ N(0, 1)
• This result is valid:
• When sample size is 30 or more, or
• When nπ ≥ 5 as well as n(1 − π) ≥ 5

• Two types of error:


• Type I Error: Reject H0, when it is true
• Size of Type I Error = P(Type I Error)
• =P(Reject H0, when it is true)
• =α (Also called Producer’s risk)
• Type II Error: Accept H0, when it is wrong
• Size of Type II Error = P(Type II Error)
• =P(Accept H0, when it is wrong)
• =β (Also called Consumer’s risk)
• Size of Type I Error (α) is called the Level of Significance.
• α is set by the researcher in advance.

• The critical value divides the whole area under the probability curve into two regions:
• Critical (Rejection) region
• When the statistical outcome falls into this region, H0 is rejected.
• Size of this region is α.
• Acceptance Region
• When the statistical outcome falls into this region, H0 is accepted.
• Size of this region is (1 − α).
Testing of Statistical Hypothesis
(One-Sample Tests)

Testing of Hypothesis for µ (z-test)


• Conditions/ Assumptions:
• Population is normal or n ≥ 30
• σ is known or n ≥ 30
• Test Statistic: Zc = (x̄ − μ)/(σ/√n)
1. Obtain the Critical Values using Excel or the Statistical Table
• Excel Formula
• For TTT (two-tailed test): NORM.S.INV(α/2) and NORM.S.INV(1-α/2)
• For RTT (right-tailed test): NORM.S.INV(1-α)
• For LTT (left-tailed test): NORM.S.INV(α)
2. p-value Approach
• Let Zc be the computed value of the test statistic and Z ~ N(0, 1)
• Then the p-value is given by the following probability
• For two-tailed tests: 2P(Z > |Zc|)
• Excel Formula: 2*(1-NORM.S.DIST(ABS(Zc),1))
• For right-tailed tests: P(Z > Zc)
• Excel Formula: 1-NORM.S.DIST(Zc,1)
• For left-tailed tests: P(Z < Zc)
• Excel Formula: NORM.S.DIST(Zc,1)
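The three z-test p-value formulas can be sketched as one Python helper (the function name `z_test_p_value` is ours, not from the source; `NormalDist().cdf` stands in for Excel's NORM.S.DIST):

```python
from statistics import NormalDist

def z_test_p_value(zc: float, tail: str) -> float:
    """p-value for a z test, mirroring the Excel formulas above."""
    z = NormalDist()         # standard normal N(0, 1)
    if tail == "two":        # 2P(Z > |Zc|)
        return 2 * (1 - z.cdf(abs(zc)))
    if tail == "right":      # P(Z > Zc)
        return 1 - z.cdf(zc)
    if tail == "left":       # P(Z < Zc)
        return z.cdf(zc)
    raise ValueError("tail must be 'two', 'right' or 'left'")

print(round(z_test_p_value(1.96, "two"), 2))   # ≈ 0.05
```

The t-test analogues (T.DIST) need a t distribution, which the standard library lacks, so only the z case is shown.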

Testing of Hypothesis for µ (t-test)

• Conditions/ Assumptions:
• n < 30; Population is normal; σ is unknown
• Test Statistic: Tc = (x̄ − μ)/(s/√n)
1. Obtain the Critical Values using the t distribution with (n−1) degrees of freedom, t(n−1).
• Excel Formula
• For TTT: T.INV(α/2,n-1) and T.INV(1-α/2,n-1)
• For RTT: T.INV(1-α,n-1)
• For LTT: T.INV(α,n-1)
2. p-value Approach in t-test
• Let Tc be the computed value of the test statistic and T ~ t(n−1)
• Then the p-value is given by the following probability
• For two-tailed tests: 2P(T > |Tc|)
• Excel Formula: 2*(1-T.DIST(ABS(Tc),n-1,1))
• For right-tailed tests: P(T > Tc)
• Excel Formula: 1-T.DIST(Tc,n-1,1)
• For left-tailed tests: P(T < Tc)
• Excel Formula: T.DIST(Tc,n-1,1)
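A worked sketch of the one-sample t-test in Python (the data and H0: μ = 50 are made-up for illustration; 2.262 is the two-tailed critical value t(9, 0.975) from a standard t table):

```python
import statistics

# Hypothetical sample; test H0: mu = 50 vs H1: mu != 50 at alpha = 0.05
sample = [51.2, 49.8, 50.5, 52.1, 48.9, 50.7, 51.5, 49.4, 50.9, 51.8]
n = len(sample)
xbar = statistics.fmean(sample)
s = statistics.stdev(sample)           # sample sd, divisor (n - 1)

t_c = (xbar - 50) / (s / n ** 0.5)     # test statistic with (n-1) = 9 d.f.

# Two-tailed decision: reject H0 if |Tc| exceeds t(9, 0.975) = 2.262
reject = abs(t_c) > 2.262
print(round(t_c, 3), reject)           # 2.053 False
```

Here |Tc| ≈ 2.05 < 2.262, so H0 is not rejected at the 5% level.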

Testing of Statistical Hypothesis


(Two Samples Tests)
• Z test for two independent samples (σ₁, σ₂ known):
  Zc = (x̄1 − x̄2) / √(σ1²/n1 + σ2²/n2) ~ N(0, 1)

• Z test for two independent samples (large samples, using sample variances):
  Zc = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2) ~ N(0, 1)
• t test for two independent samples assuming equal variances:
  Tc = (x̄1 − x̄2) / [S √(1/n1 + 1/n2)] ~ t(n1 + n2 − 2),
  where S² = [(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2)
• Use the t(n1 + n2 − 2) distribution for the critical value/ p-value.

• t test for two independent samples assuming unequal variances:
  Tc = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2) ~ t(f),
  where f = (s1²/n1 + s2²/n2)² / [ (s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1) ]

• Paired t test: Tc = d̄ / (sd/√n) ~ t(n − 1)
• Testing the Hypothesis for Difference of Proportions:
  Zc = (p1 − p2) / √[π̂(1 − π̂)(1/n1 + 1/n2)] ~ N(0, 1),
  where π̂ = (n1 p1 + n2 p2)/(n1 + n2)

• Thus, (p1 − p2) / √[π(1 − π)(1/n1 + 1/n2)] ~ N(0, 1).
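The difference-of-proportions test can be sketched directly from these formulas (the success counts below are hypothetical):

```python
# Two-sample z test for the difference of proportions.
# Hypothetical data: 40/200 successes in sample 1, 25/250 in sample 2.
x1, n1 = 40, 200
x2, n2 = 25, 250
p1, p2 = x1 / n1, x2 / n2                 # 0.20 and 0.10

# Pooled proportion: (n1*p1 + n2*p2)/(n1 + n2) = (x1 + x2)/(n1 + n2)
pi_hat = (x1 + x2) / (n1 + n2)
se = (pi_hat * (1 - pi_hat) * (1 / n1 + 1 / n2)) ** 0.5
z_c = (p1 - p2) / se
print(round(z_c, 2))                      # ≈ 3.0
```

Since |Zc| ≈ 3.0 > 1.96, the two proportions differ significantly at α = 0.05 (two-tailed).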
• In the given example, we have three populations.
• We wish to test
• H0: π1 = π2 = π3 (All the proportions are the same)
• H1: Not all π1, π2, π3 are equal
• The table of data shown in the example is called the Contingency Table.
• Contingency Tables are used to classify sample observations according to two or more characteristics.
• A Contingency Table is useful in situations involving multiple population proportions.
• Let a contingency table have r rows and c columns.
• Then it will have r × c cells.
Chi square tests are always right tailed.

• We always have Σ(observed frequencies) = Σ(expected frequencies).

• If we approximate some expected frequency, we must make sure that the above condition is satisfied.
• In these problems, data is of discrete type
• Chi – Square distribution is a continuous distribution.
• It loses its validity if any expected frequency is less than FIVE.
• In such a case, the expected frequency is pooled with the preceding or succeeding frequency.
• The degrees of freedom are reduced by one for each such pooling.
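A sketch of the chi-square test for equal proportions on a hypothetical 2 × 3 contingency table (all expected frequencies come out ≥ 5 here, so no pooling is needed; the critical value 5.991 is χ²(2, 0.95) from a standard table):

```python
# Test H0: pi1 = pi2 = pi3 using a 2 x 3 contingency table.
observed = [[30, 25, 20],   # successes in groups 1..3 (hypothetical)
            [70, 75, 80]]   # failures

row_totals = [sum(row) for row in observed]          # [75, 225]
col_totals = [sum(col) for col in zip(*observed)]    # [100, 100, 100]
grand = sum(row_totals)                              # 300

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand    # expected frequency
        chi2 += (o - e) ** 2 / e

# Right-tailed test with d.f. = (r-1)(c-1) = 2; chi2(2, 0.95) = 5.991
print(round(chi2, 3), chi2 > 5.991)                  # 2.667 False
```

Here χ² ≈ 2.67 < 5.991, so H0 is not rejected: the data are consistent with equal proportions.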
• We do not make any assumption about the distribution of the parent population.
• The difference between two means can be examined using a t-test or Z-test.
• If we have more than 2 samples,
• we wish to test the hypothesis that
• all the samples are drawn from populations having the same mean,
• or all population means are the same.
• We use ANOVA.

• ANOVA is essentially a procedure for testing the difference among various groups of data for homogeneity.
• At its simplest, ANOVA tests the following hypotheses:
 H0: The means of all the groups are equal.
 H1: Not all the means are equal.
• ANOVA doesn't say how or which ones differ.
• Can follow up with "multiple comparisons".

ANOVA IS ALWAYS RIGHT TAILED TOO
• If the observations are large, you can shift their origin and scale.
• This will not change the result.
• Shifting origin means adding or subtracting some constant.
• Shifting of scale means multiplying or dividing by some constant.
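A one-way ANOVA F statistic can be sketched from first principles, which also verifies the shift-of-origin-and-scale claim above (the three groups of data are hypothetical; `f_statistic` is our own helper name):

```python
from statistics import fmean

def f_statistic(groups):
    """One-way ANOVA: F = (between-group MS) / (within-group MS)."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = fmean([x for g in groups for x in g])
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - fmean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

groups = [[4, 5, 6], [6, 7, 8], [8, 9, 10]]
f1 = f_statistic(groups)

# Shift origin (subtract 5) and scale (divide by 2): F is unchanged
shifted = [[(x - 5) / 2 for x in g] for g in groups]
f2 = f_statistic(shifted)
print(round(f1, 3), round(f2, 3))   # 12.0 12.0
```

Both calls return the same F, confirming that linear changes of origin and scale do not affect the ANOVA result.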
• Two-way analysis of variance is an extension of one-way analysis of variance.
• The variation is controlled by two factors.
• The values of the random variable X are affected by different levels of two factors.
• Assumptions
 The populations are normally distributed.
 The samples are independent.
 The variances of the populations are equal.

• HA0: All levels of Factor A have the same effect
• HA1: Not all levels of Factor A have the same effect
• HB0: All levels of Factor B have the same effect
• HB1: Not all levels of Factor B have the same effect
• HAB0: There is no interaction effect
• HAB1: There is an interaction effect
