Exploratory Data Analysis
“Statistical Techniques/Methods”
Formulate problem → Get some data → Visualize the data → Do some statistical calculations → Interpret results
DATA AND SUMMARIZATION
Primary Uses of Statistics
• Descriptive statistics – the collection, organization,
presentation and summary of data.
• Inferential statistics – generalizing from a sample to a
population, estimating unknown parameters, drawing
conclusions, making decisions.
Basic Vocabulary of Statistics
POPULATION
A population consists of all the items or individuals about which
you want to draw a conclusion.
SAMPLE
A sample is the portion of a population selected for analysis.
PARAMETER
A parameter is a numerical measure that describes a
characteristic of a population.
STATISTIC
A statistic is a numerical measure that describes a characteristic
of a sample.
• Quantitative
  • Discrete (no. of customers, no. of claims)
  • Continuous (salary, price)
• Qualitative (Categorical)
  • Ordinal (customer satisfaction, efficiency of workers, bond rating)
  • Nominal (sex, nationality, eye color)
Cross-Sectional and Time Series Data
• Cross-sectional data: Data collected at the same or approximately the
same point in time
• Time series data: Data collected over different time periods
Data Visualization
Preliminary Analysis: Page Views
Treatment     Average
Control       420
Treatment 1   501
Treatment 2   483
Preliminary Analysis: Calls
Treatment     Average
Control       34
Treatment 1   37
Treatment 2   42
Preliminary Analysis: Reservations
Treatment     Average
Control       34
Treatment 1   34
Treatment 2   42
Preliminary Analysis: Page Views
Treatment     Restaurant Type   Average
Control       Chain             600
Control       Independent       300
Treatment 1   Chain             690
Treatment 1   Independent       375
Treatment 2   Chain             691
Treatment 2   Independent       345
Preliminary Analysis: Calls
Treatment     Restaurant Type   Average
Control       Chain             40
Control       Independent       30
Treatment 1   Chain             44
Treatment 1   Independent       33
Treatment 2   Chain             48
Treatment 2   Independent       37
Preliminary Analysis: Reservations
Treatment     Restaurant Type   Average
Control       Chain             40
Control       Independent       30
Treatment 1   Chain             40
Treatment 1   Independent       30
Treatment 2   Chain             48
Treatment 2   Independent       37
Frequency distribution table of Reservations data
Class interval   Frequency   Density
15-20            90          90/(30000 × 5) = 0.0006
20-25            1387
25-30            5776
30-35            7551
35-40            6809
40-45            4620
45-50            2082
50-55            922
55-60            500
60-65
65-70
70-75
75-80
Total            30000       1
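The density column can be filled in mechanically. The sketch below is one way to do it, assuming the usual histogram convention that relative frequency is count / total and density is relative frequency / class width (class width 5 here). The function names are illustrative, and only the classes whose frequency is given in the table are included.

```python
# Frequencies copied from the table above (classes with a stated count only).
classes = ["15-20", "20-25", "25-30", "30-35", "35-40",
           "40-45", "45-50", "50-55", "55-60"]
freqs = [90, 1387, 5776, 7551, 6809, 4620, 2082, 922, 500]

TOTAL = 30000     # total number of observations, from the table
CLASS_WIDTH = 5   # every class interval spans 5 units


def relative_frequency(freq, total=TOTAL):
    """Share of all observations that fall in the class."""
    return freq / total


def density(freq, total=TOTAL, width=CLASS_WIDTH):
    """Relative frequency per unit of the variable (histogram bar height)."""
    return freq / (total * width)


for label, f in zip(classes, freqs):
    print(f"{label:>6} {f:>6} {relative_frequency(f):.4f} {density(f):.5f}")
```

With this convention, relative frequencies sum to 1 across all classes, and densities multiplied by the class width sum to 1.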
Skewness
• Skewed to left
• Symmetric
• Skewed to right
What is the point? Why collect this data?
Data and randomness
• Three questions that good business managers ask themselves when
they look at “the numbers”:
• What is a typical or central value?
• How much variability is present in the data set?
• Are there unusual shocks/events/cases (shape of the curve)?
Dispersion
• Describes how similar the observations in a data set are to each other,
or equivalently the degree of deviation (spread) of the data from their
central value
• In general, the more spread out a distribution is, the larger the measure of
dispersion will be
Measures of Dispersion
• Which of the distributions of demand has the larger dispersion?
• The upper distribution has more dispersion because the scores are more spread out
[Figure: two bar charts of demand; horizontal axis 1 to 10, vertical axis 0 to 125]
Measures of Dispersion
• There are four main measures of dispersion:
• Range
• Variance
• Standard Deviation
• Inter-quartile range (IQR)
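The four measures listed above can be computed with Python's standard library. This is a minimal sketch: the `demand` values are hypothetical, the function name is illustrative, and the quartiles use the "inclusive" interpolation method (other software may use a slightly different quartile convention).

```python
import statistics


def dispersion_summary(data):
    """Return range, sample variance, sample SD and IQR for a list of numbers."""
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "range": max(data) - min(data),
        "variance": statistics.variance(data),  # sample variance (divisor n - 1)
        "sd": statistics.stdev(data),           # square root of the sample variance
        "iqr": q3 - q1,
    }


demand = [2, 4, 6, 8, 10]  # hypothetical demand figures
for name, value in dispersion_summary(demand).items():
    print(f"{name}: {value}")
```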
Interpretation
• The larger the SD/variance is, the more the observations deviate, on
average, away from the mean
• The smaller the SD/variance is, the less the observations deviate, on
average, from the mean
Coefficient of Variation (CV)
• Relative measure (unit free) used for the purpose
of comparison of variability.
• Relative measure = (absolute measure / average) × 100
$$ CV = \frac{s}{\bar{x}} \times 100 $$
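As an illustration, the CV can be computed for the three variables whose mean and SD appear in the page views example of this deck. The function name is an assumption made for this sketch.

```python
def coefficient_of_variation(sd, mean):
    """Return the CV in percent: (s / x-bar) * 100."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return sd / mean * 100


# (SD, mean) pairs taken from the summary table in the page views example.
summary = {
    "pageviews": (168.16, 468.1),
    "calls": (7.97, 37.71),
    "reservations": (7.99, 36.55),
}

cv = {name: coefficient_of_variation(sd, mean)
      for name, (sd, mean) in summary.items()}

for name, value in cv.items():
    print(f"{name}: CV = {value:.1f}%")
```

Because the CV is unit free, it allows a direct comparison: page views (CV of roughly 36%) vary considerably more relative to their mean than calls or reservations (roughly 21% to 22%).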
Percentiles, Quartiles and IQR
• Percentiles divide the ordered data into 100 equal
groups (99 percentiles).
• For example, you score in the 83rd percentile on a
standardized test. That means that 83% of the
test-takers scored below you.
• Deciles divide the ordered data into 10 equal groups
(9 deciles).
• Quartiles divide the ordered data into 4 equal groups
(3 quartiles).
Percentiles, quartiles, and the IQR
The 10th percentile (denoted by P10) is the number
such that 10% of the values are less than it and 90%
are bigger.
The median is the 50th percentile.
The 1st quartile (denoted by Q1) is the number such
that 25% of the values are less than it and 75% are
bigger.
The 3rd quartile (denoted by Q3) is the number such
that 75% of the values are less than it and 25% are
bigger.
Interquartile range (IQR) = Q3 - Q1
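A short sketch of these definitions using Python's `statistics.quantiles`. The test scores are hypothetical, and the "inclusive" interpolation method is an assumption: software packages differ slightly in how they interpolate between data points, so quartiles from other tools may not match exactly.

```python
import statistics

# Hypothetical test scores, used only to illustrate the calculations.
scores = [55, 60, 62, 68, 70, 71, 75, 80, 88, 93, 97]

# n=4 returns the three quartile cut points Q1, median, Q3.
q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1

# n=100 returns the 99 percentile cut points P1 ... P99.
percentiles = statistics.quantiles(scores, n=100, method="inclusive")
p10 = percentiles[9]  # list index 9 holds the 10th percentile

print(f"Q1 = {q1}, median = {median}, Q3 = {q3}, IQR = {iqr}, P10 = {p10}")
```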
Box Plot
Describes the overall distribution of a set of
numbers but is simpler than a histogram.
Useful when comparing several samples because
too many histograms on one graph would be
both crowded and confusing.
Also produces a useful display with small data
sets.
Useful for detecting outliers / extreme values
EXAMPLE: Page views data
Measures   pageviews   calls   reservations
Min.       145.0       17.00   15.00
1st Qu.    328.0       32.00   31.00
Median     391.0       37.00   36.00
Mean       468.1       37.71   36.55
3rd Qu.    636.0       42.00   41.00
Max.       929.0       77.00   79.00
SD         168.16      7.97    7.99
MAD        149.74      7.41    7.41
restaurant_type   Count
chain             12000
independent       18000
Box Plot
S=smallest, L=Largest, M=median
Q1=lower quartile, Q3=upper quartile
Detection of Outliers (Box Plot)
• Calculate Q1-1.5*IQR and Q3+1.5*IQR
• Any data lying outside this region is an outlier
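The rule above can be applied directly to the quartiles reported in the page views example. The sketch below uses hypothetical function names and checks whether the minimum and maximum of each variable fall outside the fences.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def is_outlier(value, q1, q3, k=1.5):
    """True if the value lies strictly outside the fences."""
    lower, upper = outlier_fences(q1, q3, k)
    return value < lower or value > upper


# Min, Q1, Q3 and max copied from the summary table in the page views example.
summary = {
    "pageviews": {"min": 145.0, "q1": 328.0, "q3": 636.0, "max": 929.0},
    "calls": {"min": 17.0, "q1": 32.0, "q3": 42.0, "max": 77.0},
    "reservations": {"min": 15.0, "q1": 31.0, "q3": 41.0, "max": 79.0},
}

for name, s in summary.items():
    lower, upper = outlier_fences(s["q1"], s["q3"])
    low_flag = is_outlier(s["min"], s["q1"], s["q3"])
    high_flag = is_outlier(s["max"], s["q1"], s["q3"])
    print(f"{name}: fences [{lower}, {upper}], "
          f"min is outlier: {low_flag}, max is outlier: {high_flag}")
```

For page views the fences are wide enough to contain every observation, while calls and reservations each have a maximum beyond the upper fence, so their box plots would show outliers on the high side.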
Box Plot
[Figure: box plot for IBM; axis from 83 to 91]
[Figure: box plot for EDS; axis from 18.5 to 22.5]
A large number of fast-food restaurants have drive-through
windows, offering drivers and their passengers the
advantage of quick service. To measure how good the
service is, an organization called QSR planned a study
in which the time taken by a sample of drive-through
customers at each of five restaurants was recorded.
Compare the five sets of data using a box plot
and interpret the results.
Standardising Data
• Purpose: To compare each data point to the natural
range and variation of the dataset.
• Method: For each data value, subtract the sample
mean and divide by the sample standard deviation.
The resulting numbers are called z-values or z-scores. They
• measure how many standard deviations above or
below the mean a data point is
• are “unit free”
• have mean zero and SD 1
Standardising Data
How: To compare each data point to the
natural range and variation of the dataset, compute
$$ z = \frac{x - \bar{x}}{s} $$
A z-score can be either positive or negative.
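A minimal sketch of standardisation, assuming the sample standard deviation (divisor n - 1) is used for s. The salary figures and the function name are hypothetical.

```python
import statistics


def z_scores(data):
    """Standardise data: subtract the sample mean, divide by the sample SD."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)  # sample SD, divisor n - 1
    if sd == 0:
        raise ValueError("z-scores are undefined when all values are equal")
    return [(x - mean) / sd for x in data]


salaries = [30, 35, 40, 45, 100]  # hypothetical salaries, in thousands
z = z_scores(salaries)

for x, score in zip(salaries, z):
    print(f"{x:>4} -> z = {score:+.2f}")
```

Values below the mean get negative z-scores and values above it get positive ones; the standardised values themselves have mean 0 and SD 1.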
Capturing variation
Chebyshev’s Theorem
Applies to any distribution, regardless of shape
Empirical Rule
Applies only to roughly mound-shaped and symmetric
distributions
Chebyshev’s Theorem
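At least $1 - \frac{1}{k^2}$ of the elements of any
distribution lie within k standard deviations of the
mean (for k > 1).
k = 2: at least 1 - 1/2² = 1 - 1/4 = 3/4 = 75% lie within 2 standard deviations of the mean
k = 3: at least 1 - 1/3² = 1 - 1/9 = 8/9 ≈ 89% lie within 3 standard deviations of the mean
k = 4: at least 1 - 1/4² = 1 - 1/16 = 15/16 ≈ 94% lie within 4 standard deviations of the mean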
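The bound is easy to tabulate for any k. The following sketch (the function name is an assumption) reproduces the three cases worked out above.

```python
def chebyshev_lower_bound(k):
    """Minimum share of any distribution within k SDs of the mean: 1 - 1/k^2."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k**2


for k in (2, 3, 4):
    print(f"k = {k}: at least {chebyshev_lower_bound(k):.1%} within {k} SDs")
```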
Empirical Rule
For roughly mound-shaped and symmetric
distributions, approximately:
• 68% lie within 1 standard deviation of the mean
• 95% lie within 2 standard deviations of the mean
• Almost all (about 99.7%) lie within 3 standard deviations of the mean
Empirical Rule
[Figure: bell curve of x centered at μ; 68.26% of the area lies within μ ± 1σ, 95.44% within μ ± 2σ, and 99.72% within μ ± 3σ]
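For an exactly normal distribution, the share of values within k standard deviations of the mean equals erf(k / √2). The sketch below uses that identity to check the empirical-rule percentages; the function name is illustrative. The results agree with the figures 68.26%, 95.44% and 99.72% quoted above to within the rounding of printed normal tables.

```python
import math


def normal_coverage(k):
    """Probability that a normal variable lies within k SDs of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} SD: {normal_coverage(k):.2%}")
```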
Scatter Plots and Correlation
• A scatter plot (or scatter diagram) is used to show
the relationship between two variables
• Correlation analysis is used to measure the strength of
the linear association between two variables
• Only concerned with strength of the relationship
• No causal effect is implied
Scatter Plot Examples
[Figure: example scatter plots of y against x showing linear relationships, curvilinear relationships, strong relationships, weak relationships, and no relationship]
Correlation Coefficient
• The correlation coefficient (r) is used to measure
the strength of the linear relationship in the sample
observations
Calculating sample Correlation Coefficient
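$$ r_{xy} = \frac{\operatorname{cov}(x, y)}{s_x s_y} $$
$$ \operatorname{cov}(x, y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) $$
$$ s_x = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad s_y = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2} $$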
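The calculation can be sketched in a few lines, following the definitions above with divisor n throughout. The function name and the small data sets are hypothetical. Using n - 1 instead of n in all three quantities would give the same r, because the divisor cancels in the ratio.

```python
import math


def correlation(x, y):
    """Sample correlation r_xy = cov(x, y) / (s_x * s_y), with divisor n."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n
    s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / n)
    s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / n)
    if s_x == 0 or s_y == 0:
        raise ValueError("r is undefined when a variable has no variation")
    return cov / (s_x * s_y)


# Hypothetical data: y rises exactly in step with x, then falls exactly in step.
print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))
print(correlation([1, 2, 3, 4], [8, 6, 4, 2]))
```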
Features of correlation coefficient
• Unit free
• Ranges between -1.00 and 1.00
• -1≤r<0 implies that as X ↑ (↓), Y ↓ (↑ )
• 0< r≤1 implies that as X ↑ (↓), Y ↑ (↓)
• The closer to -1.00, the stronger the negative linear relationship
• The closer to 1.00, the stronger the positive linear relationship
• The closer to 0.00, the weaker the linear relationship
• r=0 implies that X and Y are not linearly associated
Examples of Approximate r Values
[Figure: five scatter plots of y against x with r = -1.00, r = -0.60, r = 0.00, r = 0.20, and r = 1.00]