Statistical
Inference
Dr. Basheer Ahmad Samim
1 8:16 PM
Course Outline
1. Review of Descriptive Statistics and SPSS
2. Random Variable and Mathematical Expectation
3. Discrete Probability Distributions (Binomial, Poisson)
4. Continuous Probability Distribution (Normal)
5. Sampling Theory
6. Confidence Intervals
7. Hypothesis Testing
8. Goodness of Fit
9. Regression and Correlation with ANOVA
10. Multiple Regression
11. All the topics will be SPSS oriented
Recommended Readings (Books)
Introduction to Statistics,
Walpole, R. E., 3rd Edition
(2000)
Statistical Methods for Practice
and Research by Ajai S. Gaur
and Sanjaya S. Gaur
Attendance Policy
16-Weeks Teaching
16-Lectures (32-Attendance)
Twice Roll Call, Once before the break
and once after the break
At least 80% (24) attendance is
compulsory to be eligible for the Final
Examination
No Roll Call after First Ten(5) minutes
Mode of Teaching
Lecture
SPSS Workshop
Discussion Session
Mode of Assessment
Quizzes (15%)
Assignments (15%)
Class Performance (5%)
Mid Term Test (25%)
Final Examination (40%)
Questionnaire
Variable
A characteristic or
property that varies
from individual to
individual.
Constant
A characteristic or
property that does not
change from individual
to individual.
Types of Variables
Qualitative
Quantitative
    Discrete
    Continuous
Nominal Scale
Variable categories are mutually
exclusive and exhaustive.
Variable categories have no
logical order.
Eye Color, Hair Color, Gender.
Ordinal Scale
Data categories are mutually
exclusive and exhaustive.
Data classifications are ranked or
ordered according to the
particular trait they possess.
Level of Knowledge about SPSS
Interval Scale
Data categories are mutually exclusive
and exhaustive.
Data classifications are ranked or ordered
according to the particular trait they
possess.
Equal differences in the characteristic are
represented by equal differences in the
measurements, but the zero point is arbitrary.
Temperature, Shoe Size and IQ scores
Ratio Scale
Data categories are mutually exclusive and
exhaustive.
Data classifications are ranked or ordered
according to the particular trait they possess.
Equal differences in the characteristic are
represented by equal differences in the
measurements.
The zero point is the absence of the
characteristic.
Height, Weight, Distance.
Measurement Scales
Scale      Property                                         Examples
Nominal    Data may only be classified                      Eye color, hair color, gender
Ordinal    Data are ranked                                  Level of knowledge about SPSS
Interval   True zero point does not exist                   Temperature, shoe size, IQ scores
Ratio      Meaningful zero point and ratio between values   Height, weight, distance
Data
The information collected
for any kind of investigation.
Usually Numerical but can
be Qualitative.
Primary Data
The initial material collected
during the research process.
The information collected
directly from the respondent.
Personal Investigation, Through Investigator, Through Questionnaire,
Through Local Sources, Through Telephone.
Secondary Data
The information
collected and processed
by the people other than
the researcher
Government Organizations, Semi-Government
Organizations,
Data Collection
Any of the following methods may be
adopted:
(a) Personal interview
(b) Direct observation
(c) Mail interview (internet interview)
(d) Telephone interview
What are the pros and cons of each?
Data management
Office Editing,
Post Coding,
Data entry and Verification.
Data organization and Analysis
Preparing data for analysis,
Extracting descriptive measures
from the data,
Using advanced statistical
techniques to analyze the data
and draw inferences from it.
Measures of Central Tendency
Arithmetic Mean
Quantiles
(Median, Quartiles, Deciles, Percentiles)
Mode
Arithmetic Mean
A value obtained by dividing the sum of all the observations by
their number.
If $X_1, X_2, \ldots, X_n$ are n observations of a variable X, then
\[ \bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n} = \frac{\sum_{i=1}^{n} X_i}{n} \]
\[ \text{Arithmetic Mean} = \frac{\text{Sum of all the observations}}{\text{Number of the observations}} \]
Arithmetic Mean
The marks obtained by 8 students are:
67 72 68 70 65 68 75 63
\[ \bar{X} = \frac{67 + 72 + \cdots + 63}{8} = \frac{548}{8} = 68.5 \text{ marks} \]
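The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch: the function name `arithmetic_mean` is an illustrative choice, and the data are the eight marks from the example.

```python
def arithmetic_mean(values):
    """Sum of all the observations divided by their number."""
    return sum(values) / len(values)

# Marks obtained by the 8 students in the example
marks = [67, 72, 68, 70, 65, 68, 75, 63]

mean_marks = arithmetic_mean(marks)
print(sum(marks), mean_marks)  # 548 68.5
```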
Quantiles
For individual observations/discrete frequency
distribution, the ith quartile, jth decile and kth
percentile are located in the array/discrete frequency
distribution by the following relations
\[ Q_i = \frac{i(n+1)}{4}\text{th observation in the distribution}, \quad i = 1, 2, 3 \]
\[ D_j = \frac{j(n+1)}{10}\text{th observation in the distribution}, \quad j = 1, 2, \ldots, 9 \]
\[ P_k = \frac{k(n+1)}{100}\text{th observation in the distribution}, \quad k = 1, 2, \ldots, 99 \]
Quartiles
The weekly TV watching times (hours):
25 41 27 32 43 66 35 31 15 5
34 26 32 38 16 30 38 30 20 21
The array of the above data is given below:
5 15 16 20 21 25 26 27 30 30
31 32 32 34 35 37 38 41 43 66
Quartiles
\[
\begin{aligned}
Q_1 &= \frac{1(20+1)}{4}\text{th observation in the distribution} \\
    &= 5.25\text{th observation in the distribution} \\
    &= 5\text{th obs.} + 0.25\{6\text{th obs.} - 5\text{th obs.}\} \\
    &= 21 + 0.25\{25 - 21\} = 22.0 \text{ hours}
\end{aligned}
\]
Quartiles
\[
\begin{aligned}
Q_2 &= \frac{2(20+1)}{4}\text{th observation in the distribution} \\
    &= 10.50\text{th observation in the distribution} \\
    &= 10\text{th obs.} + 0.50\{11\text{th obs.} - 10\text{th obs.}\} \\
    &= 30 + 0.50\{31 - 30\} = 30.5 \text{ hours}
\end{aligned}
\]
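The location rule used in these slides, position = i(n+1)/4 with linear interpolation between the two neighbouring observations, can be sketched in Python as follows. The helper name `quantile_value` and its `parts` argument (4 for quartiles, 10 for deciles, 100 for percentiles) are illustrative assumptions; the data are the sorted TV watching hours from the array in the slides.

```python
def quantile_value(sorted_data, i, parts):
    """Value at the i(n+1)/parts-th position of an ordered array (1-based).

    A fractional position is resolved by linear interpolation between
    the two neighbouring observations.
    """
    n = len(sorted_data)
    position = i * (n + 1) / parts
    lower = int(position)           # rank of the lower neighbouring observation
    fraction = position - lower     # how far to move towards the next observation
    if lower < 1:                   # position falls before the first observation
        return sorted_data[0]
    if lower >= n:                  # position falls at or after the last observation
        return sorted_data[-1]
    low_value = sorted_data[lower - 1]
    high_value = sorted_data[lower]
    return low_value + fraction * (high_value - low_value)

# Ordered array of weekly TV watching times (hours)
hours = [5, 15, 16, 20, 21, 25, 26, 27, 30, 30,
         31, 32, 32, 34, 35, 37, 38, 41, 43, 66]

q1 = quantile_value(hours, 1, 4)
q2 = quantile_value(hours, 2, 4)
q3 = quantile_value(hours, 3, 4)
print(q1, q2, q3)  # 22.0 30.5 36.5
```

Statistical packages offer several interpolation rules for quantiles, so software output may differ slightly from the (n+1) rule unless that rule is selected.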
Mode
The mode is the value that occurs
most frequently (the maximum number
of times) in a set of observations.
Mode
The total automobile sales (in millions) in
the United States for the last 14 years:
9.0 8.2 8.0 9.1 10.3 11.0 11.5
10.3 10.5 9.8 9.3 8.2 8.2 8.5
Mode = 8.2 million
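Counting frequencies by hand gets error-prone for longer series, so a short Python sketch may help. The function name `modes` is an illustrative choice; it returns a list because a data set can have more than one most frequent value.

```python
from collections import Counter

def modes(values):
    """Return every value that attains the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]

# Automobile sales (in millions) for the 14 years in the example
sales = [9.0, 8.2, 8.0, 9.1, 10.3, 11.0, 11.5,
         10.3, 10.5, 9.8, 9.3, 8.2, 8.2, 8.5]

sales_modes = modes(sales)
print(sales_modes)  # [8.2]
```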
Measures of variation quantify how
much the values of a data set differ
from one another, that is, how
spread out the values are.
Absolute Measures of
Dispersion
Range
Quartile Deviation
Mean (Average) Deviation
Variance and Standard Deviation
Relative Measures of
Dispersion
Coefficient of Range
Coefficient of Quartile Deviation
Coefficient of Mean Deviation
Coefficient of Variation (CV)
Range
Difference between the largest
and the smallest observations
\[ \text{Range} = X_{\text{Largest}} - X_{\text{Smallest}} \]
Disadvantages of the Range
Ignores the way in which data are distributed:
[Figure: two dot plots on an axis from 7 to 12; Range = 12 - 7 = 5 for both]
Sensitive to outliers:
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5      Range = 5 - 1 = 4
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120    Range = 120 - 1 = 119
Inter-quartile Range (IQR)
Inter-quartile range = 3rd quartile - 1st quartile = Q3 - Q1
IQR is not affected by outliers
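The contrast between the range and the inter-quartile range can be illustrated with the two 25-value lists from the range example. The sketch below is hedged: the helper names are illustrative, quartiles are located with the i(n+1)/4 rule used elsewhere in these slides, and the code assumes at least three observations.

```python
def value_at(sorted_data, position):
    """Value at a 1-based, possibly fractional position (linear interpolation)."""
    lower = int(position)
    fraction = position - lower
    if fraction == 0:
        return sorted_data[lower - 1]
    low_value = sorted_data[lower - 1]
    return low_value + fraction * (sorted_data[lower] - low_value)

def data_range(data):
    """Largest observation minus smallest observation."""
    return max(data) - min(data)

def iqr(data):
    """Q3 - Q1, with quartiles located at i(n+1)/4 (assumes n >= 3)."""
    ordered = sorted(data)
    n = len(ordered)
    q1 = value_at(ordered, 1 * (n + 1) / 4)
    q3 = value_at(ordered, 3 * (n + 1) / 4)
    return q3 - q1

# The two data sets differ only in their last value (5 versus 120)
common = [1] * 11 + [2] * 8 + [3] * 4 + [4]
without_outlier = common + [5]
with_outlier = common + [120]

print(data_range(without_outlier), data_range(with_outlier))  # 4 119
print(iqr(without_outlier), iqr(with_outlier))                # 1.5 1.5
```

Replacing 5 by 120 changes the range from 4 to 119, while the IQR stays at 1.5 because the middle half of the data is unchanged.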
Inter-quartile Range
X minimum = 12, Q1 = 30, Median (Q2) = 45, Q3 = 57, X maximum = 70
Each of the four intervals between these values contains 25% of the observations.
Inter-quartile Range (IQR) = 57 - 30 = 27
The Mean (absolute) Deviation
Mean Deviation is the average of the absolute
deviations taken from the mean value.
         X     X - \bar{X}    |X - \bar{X}|
         8     3              3
         5     0              0
         2     -3             3
Total          0              6
\[ MD = \frac{\sum |x - \bar{x}|}{n} = \frac{6}{3} = 2 \]
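As a quick check of the mean deviation example, here is a minimal Python sketch. The function name `mean_deviation` is illustrative; it divides by n, as on the slide.

```python
def mean_deviation(values):
    """Average of the absolute deviations taken from the arithmetic mean."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n

x = [8, 5, 2]
md = mean_deviation(x)
print(md)  # 2.0
```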
Variance
Variance is the average
of the squared
deviations taken from
the mean value.
         X (cm)   (X - Mean)^2   X^2
         4        36             16
         6        16             36
         9        1              81
         12       4              144
         13       9              169
         16       36             256
Total    60       102            702
(i)  \[ S^2 = \frac{\sum (x - \bar{x})^2}{n} = \frac{102}{6} = 17 \text{ cm}^2 \]
(ii) \[ S^2 = \frac{\sum X^2}{n} - \left(\frac{\sum X}{n}\right)^2 = \frac{702}{6} - \left(\frac{60}{6}\right)^2 = 17 \text{ cm}^2 \]
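Both variance formulas give the same result, which the following Python sketch confirms for the six measurements in the table. The function names are illustrative assumptions, and both divide by n (not n - 1), matching the slide.

```python
def variance_definitional(values):
    """(i) Average of the squared deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n

def variance_shortcut(values):
    """(ii) Mean of the squares minus the square of the mean."""
    n = len(values)
    return sum(x ** 2 for x in values) / n - (sum(values) / n) ** 2

x_cm = [4, 6, 9, 12, 13, 16]

s2_i = variance_definitional(x_cm)
s2_ii = variance_shortcut(x_cm)
print(s2_i, s2_ii)  # 17.0 17.0 (in square centimetres)
```

The standard deviation is the square root of either result, so it is back in the original unit (cm).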
Comparing Standard Deviations
[Figure: dot plots of Data A, Data B and Data C on a common axis from 11 to 21]
Data A: Mean = 15.5, S = 3.338
Data B: Mean = 15.5, S = 0.926
Data C: Mean = 15.5, S = 4.567
The smaller the standard deviation, the more tightly
clustered the scores are around the mean.
The larger the standard deviation, the more spread out
the scores are from the mean.
Relative Measures of Variation
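\[ \text{Coefficient of Range} = \frac{X_{\text{Largest}} - X_{\text{Smallest}}}{X_{\text{Largest}} + X_{\text{Smallest}}} \]
\[ \text{Coefficient of Quartile Deviation} = \frac{Q_3 - Q_1}{Q_3 + Q_1} \]
\[ \text{Coefficient of Mean Deviation} = \frac{MD}{\text{Mean}} \]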
Coefficient of Variation (CV)
Can be used to compare two or more
sets of data measured in different
units or same units but different
average size.
\[ CV = \left(\frac{S}{\bar{X}}\right) \cdot 100\% \]
Use of Coefficient of Variation
Stock A:
Average price last year = $50
Standard deviation = $5
\[ CV_A = \left(\frac{S}{\bar{X}}\right) \cdot 100\% = \frac{\$5}{\$50} \cdot 100\% = 10\% \]
Stock B:
Average price last year = $100
Standard deviation = $5
\[ CV_B = \left(\frac{S}{\bar{X}}\right) \cdot 100\% = \frac{\$5}{\$100} \cdot 100\% = 5\% \]
Both stocks have the same standard deviation,
but stock B is less variable relative to its price.
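The stock comparison reduces to one line of arithmetic per stock. The Python sketch below reproduces it; the function name `coefficient_of_variation` is an illustrative choice.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return 100 * sd / mean

cv_a = coefficient_of_variation(sd=5, mean=50)    # Stock A
cv_b = coefficient_of_variation(sd=5, mean=100)   # Stock B
print(cv_a, cv_b)  # 10.0 5.0
```

Because the CV is unit-less, the same function can compare series measured in different units.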
Appropriate Choice of Measure
of Variability
If data are symmetric, with no serious
outliers, use range and standard
deviation.
If data are skewed, and/or have serious
outliers, use IQR.
If comparing variation across two data
sets, use coefficient of variation (C.V)
Five Number Summary
The five number summary of a data set consists of the
minimum value, the first quartile, the second quartile, the
third quartile and the maximum value written in that order:
Min, Q1, Q2, Q3, Max.
From the three quartiles we can obtain a measure of central
tendency (the median, Q2) and measures of variation of the
two middle quarters of the distribution, Q2 - Q1 for the
second quarter and Q3 - Q2 for the third quarter.
Five Number Summary
The weekly TV viewing times (in hours):
25 41 27 32 43 66 35 31 15 5
34 26 32 38 16 30 38 30 20 21
The array of the above data is given below:
5 15 16 20 21 25 26 27 30 30
31 32 32 34 35 37 38 41 43 66
Five Number Summary
Minimum value = 5.0, Maximum value = 66.0
LOCATION of Q1: \[ \frac{1(20+1)}{4}\text{th obs. in the data} = 5.25\text{th obs.} \]
VALUE of Q1: 5th obs. + 0.25{6th obs. - 5th obs.} = 21 + 0.25{25 - 21} = 22.0 Hrs
LOCATION of Q2: \[ \frac{2(20+1)}{4}\text{th obs. in the data} = 10.50\text{th obs.} \]
VALUE of Q2: 10th obs. + 0.50{11th obs. - 10th obs.} = 30 + 0.50{31 - 30} = 30.5 Hrs
LOCATION of Q3: \[ \frac{3(20+1)}{4}\text{th obs. in the data} = 15.75\text{th obs.} \]
VALUE of Q3: 15th obs. + 0.75{16th obs. - 15th obs.} = 35 + 0.75{37 - 35} = 36.5 Hrs
Box and Whisker Diagram
A box and whisker diagram, or box-plot, is a
graphical means of displaying the five number
summary of a set of data. In a box-plot the first
quartile is placed at the lower hinge and the
third quartile is placed at the upper hinge. The
median is placed in between these two hinges.
The two lines emanating from the box are
called whiskers. The box and whisker diagram
was introduced by Professor John W. Tukey.
Construction of Box-Plot
1. Start the box at Q1 and end it at Q3
2. Within the box, draw a line to represent Q2
3. Draw the lower whisker from the Min. Value up to Q1
4. Draw the upper whisker from Q3 up to the Max. Value
[Figure: box plot with Min Value, Q1, Q2, Q3 and Max Value marked]
Construction of Box-Plot
1. Q1 = 22.0, Q3 = 36.5
2. Q2 = 30.5
3. Minimum Value = 5.0
4. Maximum Value = 66.0
[Figure: box plot of these values on a vertical axis from 0 to 70]
Interpretation of Box-Plot
[Figure: box plot on a vertical axis from 0 to 70]
A box-whisker plot is useful to identify:
Maximum and minimum values in the data
Median of the data
IQR = Q3 - Q1 (a lengthy box indicates more variability in the data)
Shape of the data, from the position of the line within the box:
    Line at the center of the box: symmetrical
    Line above the center of the box: negatively skewed
    Line below the center of the box: positively skewed
Outliers in the data
Outliers
An outlier is a value that falls well outside the overall
pattern of the data. It might be
the result of a measurement or recording error,
a member of a different population,
or simply an unusual extreme value.
An extreme value need not be an outlier; it might,
instead, be an indication of skewness.
Inner and Outer Fences
If Q1 = 22.0, Q2 = 30.5, Q3 = 36.5, then IQR = Q3 - Q1 = 14.5.
Inner Fences:
Lower Inner Fence = Q1 - 1.5(IQR) = 0.25
Upper Inner Fence = Q3 + 1.5(IQR) = 58.25
Outer Fences:
Lower Outer Fence = Q1 - 3(IQR) = -21.5
Upper Outer Fence = Q3 + 3(IQR) = 80.0
Identification of the Outliers
1. The values that lie within the inner
fences are normal values.
2. The values that lie outside the inner
fences but inside the outer fences
are possible/suspected/mild
outliers.
3. The values that lie outside the outer
fences are sure outliers.
Plot each suspected outlier with an asterisk
and each sure outlier with a hollow dot.
[Figure: box plot on a vertical axis from 0 to 80, with 66 marked by an asterisk]
Only 66 is a mild outlier.
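The fence rules can be applied mechanically, as in this hedged Python sketch. The function names `fences` and `classify` are illustrative assumptions; the quartiles Q1 = 22.0 and Q3 = 36.5 and the ordered TV viewing hours come from the earlier slides.

```python
def fences(q1, q3):
    """Return (inner, outer) fences as (lower, upper) pairs."""
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3 * iqr, q3 + 3 * iqr)
    return inner, outer

def classify(value, inner, outer):
    """Label a value using the three rules from the slide."""
    if inner[0] <= value <= inner[1]:
        return "normal"
    if outer[0] <= value <= outer[1]:
        return "mild outlier"
    return "sure outlier"

hours = [5, 15, 16, 20, 21, 25, 26, 27, 30, 30,
         31, 32, 32, 34, 35, 37, 38, 41, 43, 66]

inner, outer = fences(q1=22.0, q3=36.5)
mild = [h for h in hours if classify(h, inner, outer) == "mild outlier"]
sure = [h for h in hours if classify(h, inner, outer) == "sure outlier"]

print(inner, outer)  # (0.25, 58.25) (-21.5, 80.0)
print(mild, sure)    # [66] []
```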
Uses of Box and Whisker Diagram
Box plots are
especially suitable for
comparing two or more
data sets. In such a
situation the box plots
are constructed on the
same scale.
[Figure: box plots for Male and Female on the same scale]
Standardized Variable
A variable that has mean 0 and variance 1 is
called a standardized variable.
Values of a standardized variable are called
standard scores.
Standard scores are unit-less.
Construction
\[ Z = \frac{\text{Variable} - \text{Mean of Variable}}{\text{Standard Deviation of Variable}} \]
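Standardized Variable
         X     (X - \bar{X})^2   Z         (Z - \bar{Z})^2
         3     25                -1.3624   1.8561
         6     4                 -0.5450   0.2970
         11    9                 0.8174    0.6682
         12    16                1.0899    1.1879
Total    32    54                0         4.009
\[ \bar{X} = \frac{\sum X}{n} = \frac{32}{4} = 8, \quad S_x^2 = \frac{\sum (X - \bar{X})^2}{n} = \frac{54}{4} = 13.5, \quad S_x = 3.67 \]
\[ Z = \frac{X - \bar{X}}{S_x} = \frac{X - 8}{3.67} \]
\[ \bar{Z} = \frac{\sum Z}{n} = \frac{0}{4} = 0, \quad S_z^2 = \frac{\sum (Z - \bar{Z})^2}{n} = \frac{4.009}{4} \approx 1 \]
Variable Z has mean 0 and
variance 1, so Z is a standard variable.
Standard score at X = 11:
\[ Z = \frac{X - \bar{X}}{S_x} = \frac{11 - 8}{3.67} = 0.8174 \]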
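The standardization in this example can be reproduced in Python. This is a minimal sketch with an illustrative function name, `standardize`, using the divide-by-n standard deviation as in the slides. It keeps S_x at full precision, so the standard scores differ from the table in the third decimal place (about 0.8165 rather than 0.8174 at X = 11), because the table rounds S_x to 3.67.

```python
from math import sqrt

def standardize(values):
    """Convert observations to standard scores Z = (X - mean) / SD."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((x - mean) ** 2 for x in values) / n)  # divide-by-n SD
    return [(x - mean) / sd for x in values]

x = [3, 6, 11, 12]
z = standardize(x)

# The standardized variable should have mean 0 and variance 1
z_mean = sum(z) / len(z)
z_var = sum((v - z_mean) ** 2 for v in z) / len(z)

print([round(v, 4) for v in z])
print(round(z_mean, 10), round(z_var, 10))
```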
Performance evaluation by z-scores
The industry in which sales rep Mr. Atif works has mean
annual sales = $2,500
standard deviation = $500.
The industry in which sales rep Mr. Asad works has mean
annual sales = $4,800
standard deviation = $600.
Last year Mr. Atif's sales were $4,000 and
Mr. Asad's sales were $6,000.
Which of the representatives would you hire
if you have one sales position to fill?
Performance evaluation by z-scores
Sales rep. Atif: \bar{X}_B = $2,500, S_B = $500, X_B = $4,000
\[ Z_B = \frac{X_B - \bar{X}_B}{S_B} = \frac{4{,}000 - 2{,}500}{500} = 3 \]
Sales rep. Asad: \bar{X}_P = $4,800, S_P = $600, X_P = $6,000
\[ Z_P = \frac{X_P - \bar{X}_P}{S_P} = \frac{6{,}000 - 4{,}800}{600} = 2 \]
Mr. Atif is the better choice, since his z-score is higher.
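The same comparison can be written as a small Python sketch. The function name `z_score` is an illustrative choice; the figures are those given for the two sales representatives.

```python
def z_score(value, mean, sd):
    """Number of standard deviations by which a value exceeds its mean."""
    return (value - mean) / sd

z_atif = z_score(value=4000, mean=2500, sd=500)
z_asad = z_score(value=6000, mean=4800, sd=600)
print(z_atif, z_asad)  # 3.0 2.0
```

Asad sold more in absolute terms, but relative to his own industry Atif is further above average.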
The Empirical Rule
$\bar{X} \pm 1S$ contains about 68% of values
$\bar{X} \pm 2S$ contains about 95% of values
$\bar{X} \pm 3S$ contains about 99.7% of values
Measures of Skewness
A distribution in which the values equidistant from
the centre have equal frequencies is defined to be
symmetrical, and any departure from symmetry is
called skewness.
For a symmetrical distribution:
1. Length of Right Tail = Length of Left Tail
2. Mean = Median = Mode
3. Sk = 0, where
a) Sk = (Mean - Mode)/SD
b) Sk = (Q3 - 2Q2 + Q1)/(Q3 - Q1)
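The two skewness coefficients, Sk = (Mean - Mode)/SD and Sk = (Q3 - 2Q2 + Q1)/(Q3 - Q1), can be sketched in Python as follows. The function names are illustrative (these measures are commonly attributed to Pearson and Bowley respectively), the quartiles 22.0, 30.5 and 36.5 are those of the TV viewing data, and the other inputs are hypothetical values chosen only to show the symmetric case.

```python
def pearson_skewness(mean, mode, sd):
    """a) Sk = (Mean - Mode) / SD."""
    return (mean - mode) / sd

def quartile_skewness(q1, q2, q3):
    """b) Sk = (Q3 - 2*Q2 + Q1) / (Q3 - Q1)."""
    return (q3 - 2 * q2 + q1) / (q3 - q1)

# Quartiles of the TV viewing data from the earlier slides
sk_tv = quartile_skewness(q1=22.0, q2=30.5, q3=36.5)
print(round(sk_tv, 4))  # -0.1724

# Hypothetical symmetric cases: both coefficients are zero
print(pearson_skewness(mean=50, mode=50, sd=10))   # 0.0
print(quartile_skewness(q1=10, q2=20, q3=30))      # 0.0
```

The quartile coefficient looks only at the middle half of the data, so a single extreme value such as 66 does not affect it.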
Measures of Skewness
A distribution is positively skewed if the observations
tend to concentrate more at the lower end of the possible
values of the variable than at the upper end. A positively
skewed frequency curve has a longer tail on the right
side.
1. Length of Right Tail > Length of Left Tail
2. Mean > Median > Mode
3. Sk > 0
Measures of Skewness
A distribution is negatively skewed if the
observations tend to concentrate more at the upper
end of the possible values of the variable than at the
lower end. A negatively skewed frequency curve has a
longer tail on the left side.
1. Length of Right Tail < Length of Left Tail
2. Mean < Median < Mode
3. Sk < 0
Measures of Kurtosis
Kurtosis is the degree of peakedness or flatness of a
unimodal (single-humped) distribution.
When the values of a variable are highly concentrated around
the mode, the peak of the curve becomes relatively high; the
curve is Leptokurtic.
When the values of a variable have low concentration around
the mode, the peak of the curve becomes relatively flat; the
curve is Platykurtic.
A curve that is neither very peaked nor very flat-topped, and
is taken as the basis for comparison, is called
Mesokurtic/Normal.
Measures of Kurtosis
\[ \text{Coefficient of Kurtosis} = \frac{n \sum (X - \bar{X})^4}{\left[\sum (X - \bar{X})^2\right]^2} \]
1. If Coefficient of Kurtosis > 3: Leptokurtic.
2. If Coefficient of Kurtosis = 3: Mesokurtic.
3. If Coefficient of Kurtosis < 3: Platykurtic.
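The coefficient of kurtosis, n times the sum of fourth-power deviations divided by the squared sum of squared deviations, can be computed directly. The sketch below is illustrative only: the function names are assumptions, and the data are the six measurements from the variance example, reused here because the slides give no kurtosis example of their own.

```python
def coefficient_of_kurtosis(values):
    """n * sum((X - mean)^4) / (sum((X - mean)^2))^2."""
    n = len(values)
    mean = sum(values) / n
    sum_sq = sum((x - mean) ** 2 for x in values)
    sum_fourth = sum((x - mean) ** 4 for x in values)
    return n * sum_fourth / sum_sq ** 2

def kurtosis_type(coefficient):
    """Classify a coefficient against the benchmark value 3."""
    if coefficient > 3:
        return "Leptokurtic"
    if coefficient < 3:
        return "Platykurtic"
    return "Mesokurtic"

x_cm = [4, 6, 9, 12, 13, 16]   # measurements from the variance example
k = coefficient_of_kurtosis(x_cm)
print(round(k, 3), kurtosis_type(k))  # 1.699 Platykurtic
```

With real data the coefficient is rarely exactly 3, so in practice values close to 3 are read as approximately mesokurtic.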