Data Analytics Theory

Exploratory data analysis involves formulating a problem, collecting relevant data, visualizing the data, performing statistical calculations, and interpreting the results. The document discusses exploratory data analysis techniques including descriptive statistics, inferential statistics, and data visualization using tables, graphs, and charts. It covers concepts such as data types, measures of central tendency, dispersion, outliers, and the use of standardization to compare data.

Uploaded by

Chandra Mohan

Exploratory Data Analysis

“Statistical Techniques/Methods”

Formulate problem -> Get some data -> Visualize the data -> Do some statistical calculations -> Interpret results
DATA AND SUMMARIZATION
Primary Uses of Statistics

• Descriptive statistics – the collection, organization, presentation and summary of data.

• Inferential statistics – generalizing from a sample to a population, estimating unknown parameters, drawing conclusions, making decisions.
Basic Vocabulary of Statistics

POPULATION
A population consists of all the items or individuals about which
you want to draw a conclusion.

SAMPLE
A sample is the portion of a population selected for analysis.

PARAMETER
A parameter is a numerical measure that describes a
characteristic of a population.

STATISTIC
A statistic is a numerical measure that describes a characteristic
of a sample.
• Quantitative
   • Discrete (no. of customers, no. of claims)
   • Continuous (salary, price)
• Qualitative (Categorical)
   • Ordinal (customer satisfaction, efficiency of workers, bond rating)
   • Nominal (sex, nationality, eye color)
Cross-Sectional Data

• Cross-sectional data: data collected at the same or approximately the same point in time
• Time series data: data collected over different time periods
Data Visualization
Preliminary Analysis: Page Views
Treatment      Average
Control        420
Treatment 1    501
Treatment 2    483

Preliminary Analysis: Calls
Treatment      Average
Control        34
Treatment 1    37
Treatment 2    42

Preliminary Analysis: Reservations
Treatment      Average
Control        34
Treatment 1    34
Treatment 2    42
Preliminary Analysis: Page Views
Treatment      Res Type       Average
Control        Chain          600
               Independent    300
Treatment 1    Chain          690
               Independent    375
Treatment 2    Chain          691
               Independent    345

Preliminary Analysis: Calls
Treatment      Res Type       Average
Control        Chain          40
               Independent    30
Treatment 1    Chain          44
               Independent    33
Treatment 2    Chain          48
               Independent    37

Preliminary Analysis: Reservations
Treatment      Res Type       Average
Control        Chain          40
               Independent    30
Treatment 1    Chain          40
               Independent    30
Treatment 2    Chain          48
               Independent    37
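The per-type averages can be combined back into one overall average per treatment as a weighted mean. The sketch below is only an illustration: it takes the 12,000 chain and 18,000 independent counts from the summary table of this example, and it assumes (this is not stated in the slides) that the chain/independent mix is the same in every treatment group. The names `page_views`, `weights` and `weighted_average` are hypothetical.

```python
# Per-type page-view averages, copied from the Page Views table by Res Type.
page_views = {
    "Control": {"chain": 600, "independent": 300},
    "Treatment 1": {"chain": 690, "independent": 375},
    "Treatment 2": {"chain": 691, "independent": 345},
}

# Assumption: 12,000 chain and 18,000 independent restaurants overall,
# with the same mix in every treatment group.
weights = {"chain": 12000 / 30000, "independent": 18000 / 30000}


def weighted_average(by_type, weights):
    """Combine subgroup averages into one overall average."""
    return sum(by_type[t] * weights[t] for t in by_type)


overall = {arm: weighted_average(avgs, weights) for arm, avgs in page_views.items()}
for arm, value in overall.items():
    print(f"{arm}: {value:.1f}")
```

Under that assumption the weighted means come out at about 420, 501 and 483, in line with the overall page-view averages reported before the breakdown by restaurant type.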
Frequency distribution table of Reservations data
Class interval    Frequency    Density (frequency / total)
15-20             90           90/30000 = 0.003
20-25             1387
25-30             5776
30-35             7551
35-40             6809
40-45             4620
45-50             2082
50-55             922
55-60             500
60-65
65-70
70-75
75-80
Total             30000        1
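The remaining cells of the table can be filled in mechanically. The sketch below computes, for the classes whose counts are given, the relative frequency (count divided by the total of 30000) and, as one possible reading of "density", the relative frequency per unit of class width (width 5). Which of the two the slide intends is an assumption; the 60-80 classes are left out because their counts are not given.

```python
# Counts for the classes that are filled in; the 60-80 classes are omitted
# because the table does not give their frequencies.
frequencies = {
    "15-20": 90,
    "20-25": 1387,
    "25-30": 5776,
    "30-35": 7551,
    "35-40": 6809,
    "40-45": 4620,
    "45-50": 2082,
    "50-55": 922,
    "55-60": 500,
}
total = 30000      # total from the last row of the table
class_width = 5    # every class interval spans 5 units

# Relative frequency: share of all observations falling in the class.
relative = {c: f / total for c, f in frequencies.items()}

# Density per unit (assumed reading): relative frequency / class width,
# so that the histogram bars have total area 1.
density = {c: r / class_width for c, r in relative.items()}

for c in frequencies:
    print(f"{c}: relative frequency {relative[c]:.4f}, density {density[c]:.5f}")
```

For the first class this gives a relative frequency of 0.003 and a per-unit density of 0.0006.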
Skewness
• Skewed to left
• Symmetric
• Skewed to right
What is the point? Why collect this data?
Data and randomness

• Three questions that good business managers ask themselves when they look at “the numbers”:

   • What is a typical or central value?
   • How much variability is present in the data set?
   • Are there unusual shocks/events/cases (shape of the curve)?

Dispersion
• Describes how similar a set of observations are to each other, or the degree of deviation (spread) of a set of data from their central value

• In general, the more spread out a distribution is, the larger the measure of dispersion will be
Measures of Dispersion

• Which of the distributions of demand has the larger dispersion?

[Figure: two bar charts of demand, one above the other; vertical axis from 0 to 125, horizontal axis from 1 to 10]

The upper distribution has more dispersion because the scores are more spread out
Measures of Dispersion

• There are four main measures of dispersion:

• Range
• Variance
• Standard Deviation
• Inter-quartile range (IQR)
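As a minimal sketch, the four measures can be computed with Python's standard `statistics` module. The `demand` values are invented for illustration and do not come from the slides. Note that `statistics.variance` and `statistics.stdev` use the sample (n - 1) divisor, and that quartile conventions differ between tools, so the IQR may not match other software exactly.

```python
import statistics

# Hypothetical demand figures, invented for this illustration.
demand = [50, 60, 70, 75, 80, 90, 100, 125]

# Range: distance between the largest and smallest observation.
data_range = max(demand) - min(demand)

# Sample variance and standard deviation (divisor n - 1).
variance = statistics.variance(demand)
std_dev = statistics.stdev(demand)

# Quartiles (default "exclusive" method) and inter-quartile range.
q1, q2, q3 = statistics.quantiles(demand, n=4)
iqr = q3 - q1

print(f"Range: {data_range}")
print(f"Variance: {variance:.2f}")
print(f"Standard deviation: {std_dev:.2f}")
print(f"IQR: {iqr:.2f}")
```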
Interpretation

• The larger the SD/variance is, the more the observations deviate, on
average, away from the mean
• The smaller the SD/variance is, the less the observations deviate, on
average, from the mean
Coefficient of Variation (CV)

• Relative measure (unit free) used for the purpose of comparison of variability.

• Relative measure = (absolute measure / average) × 100

$$CV = \frac{s}{\bar{x}} \times 100$$
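A short sketch of the CV calculation, using the means and standard deviations reported in the summary table of the page views example in this document. The helper name `coefficient_of_variation` is hypothetical.

```python
# Mean and SD for each measure, copied from the summary table
# of the page views example.
summary = {
    "pageviews": {"mean": 468.1, "sd": 168.16},
    "calls": {"mean": 37.71, "sd": 7.97},
    "reservations": {"mean": 36.55, "sd": 7.99},
}


def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return sd / mean * 100


cv = {name: coefficient_of_variation(v["sd"], v["mean"]) for name, v in summary.items()}
for name, value in cv.items():
    print(f"{name}: CV = {value:.1f}%")
```

Page views have a CV of roughly 36%, against roughly 21% to 22% for calls and reservations, so page views are the most variable of the three relative to their mean even though the raw standard deviations are on different scales.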
Percentiles, Quartiles and IQR

• Percentiles divide ordered data into 100 equal groups (99 percentiles).
• For example, you score in the 83rd percentile on a standardized test. That means that 83% of the test-takers scored below you.
• Deciles divide ordered data into 10 equal groups (9 deciles).
• Quartiles divide ordered data into 4 equal groups (3 quartiles).
Percentiles, quartiles, and the IQR

The 10th percentile (denoted by P10) is the number such that 10% of the values are less than it and 90% are bigger.

The median is the 50th percentile.

The 1st quartile (denoted by Q1) is the number such that 25% of the values are less than it and 75% are bigger. The 3rd quartile (Q3) is the number such that 75% of the values are less than it.

Interquartile range (IQR) = Q3 - Q1
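The definitions above can be sketched with `statistics.quantiles` from the Python standard library. The `scores` list is hypothetical, and the "inclusive" interpolation method is one of several conventions, so other tools may report slightly different cut points.

```python
import statistics

# Hypothetical test scores, invented for this illustration.
scores = [52, 55, 58, 60, 61, 63, 65, 66, 68, 70,
          71, 73, 75, 77, 78, 80, 83, 85, 90, 95]

# 99 cut points that split the ordered data into 100 groups.
percentiles = statistics.quantiles(scores, n=100, method="inclusive")
p10 = percentiles[9]      # 10th percentile
p50 = percentiles[49]     # 50th percentile, i.e. the median

# 3 cut points that split the ordered data into 4 groups.
q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1

print(f"P10 = {p10:.2f}, median = {p50:.2f}")
print(f"Q1 = {q1:.2f}, Q3 = {q3:.2f}, IQR = {iqr:.2f}")
```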

Box Plot

Describes the overall distribution of a set of numbers but is simpler than a histogram.
Useful when comparing several samples, because too many histograms on one graph would be both crowded and confusing.
Also produces a useful display with small data sets.

Useful for detecting outliers / extreme values

EXAMPLE: Page views data

Measure     pageviews    calls    reservations
Min.        145.0        17.00    15.00
1st Qu.     328.0        32.00    31.00
Median      391.0        37.00    36.00
Mean        468.1        37.71    36.55
3rd Qu.     636.0        42.00    41.00
Max.        929.0        77.00    79.00
SD          168.16       7.97     7.99
MAD         149.74       7.41     7.41

restaurant_type
chain          12000
independent    18000
Box Plot

S=smallest, L=Largest, M=median

Q1=lower quartile, Q3=upper quartile
Detection of Outliers (Box Plot)

• Calculate the lower fence Q1 - 1.5 × IQR and the upper fence Q3 + 1.5 × IQR

• Any data point lying outside this interval is an outlier
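As a worked sketch, the rule can be applied to the quartiles, minimum and maximum reported in the summary table of the page views example. Only the summary values are available here, so the code can tell whether at least one outlier exists on each side, not how many there are. The names `fences` and `results` are hypothetical.

```python
# Minimum, quartiles and maximum from the summary table of the example.
summary = {
    "pageviews": {"min": 145.0, "q1": 328.0, "q3": 636.0, "max": 929.0},
    "calls": {"min": 17.0, "q1": 32.0, "q3": 42.0, "max": 77.0},
    "reservations": {"min": 15.0, "q1": 31.0, "q3": 41.0, "max": 79.0},
}


def fences(q1, q3, k=1.5):
    """Return (lower fence, upper fence) for the box plot outlier rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


results = {}
for name, s in summary.items():
    low, high = fences(s["q1"], s["q3"])
    results[name] = {
        "low": low,
        "high": high,
        "outlier_below": s["min"] < low,   # smallest value is beyond the lower fence
        "outlier_above": s["max"] > high,  # largest value is beyond the upper fence
    }
    print(name, results[name])
```

With these values the page views fences are -134 and 1098, so neither extreme is an outlier. For calls the fences are 17 and 57, so the maximum of 77 is an outlier. For reservations the fences are 16 and 56, so both the minimum of 15 and the maximum of 79 are outliers.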
[Figure: box plot of IBM data; axis from 83 to 91]

[Figure: box plot of EDS data; axis from 18.5 to 22.5]
A large number of fast-food restaurants have drive-through windows, offering drivers and their passengers the advantage of quick service. To measure how good the service is, an organization called QSR planned a study in which the time taken by a sample of drive-through customers at each of five restaurants was recorded. Compare the five sets of data using a box plot and interpret the results.
Standardising Data
• Purpose: To compare each data point to the natural range and variation of the dataset.
• Method: For each data value, subtract the sample mean and divide by the sample standard deviation.
The resulting numbers are called z-values or z-scores. They
• measure how many standard deviations above or below the mean a data point is
• are “unit free”
• have mean zero and SD 1
Standardising Data

How: To compare each data point to the natural range and variation of the dataset, compute

$$z = \frac{x - \bar{x}}{s}$$

A z-score can be either positive or negative.
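A minimal sketch of standardization with the Python standard library. The `values` list is hypothetical; the check at the end illustrates that z-scores have mean zero and SD 1.

```python
import statistics

# Hypothetical observations, invented for this illustration.
values = [34, 37, 42, 30, 48, 40, 33]

mean = statistics.mean(values)
sd = statistics.stdev(values)   # sample standard deviation

# z-score: how many standard deviations each value lies from the mean.
z_scores = [(x - mean) / sd for x in values]

for x, z in zip(values, z_scores):
    print(f"x = {x}, z = {z:+.2f}")

# Standardized data have mean 0 and standard deviation 1 (up to rounding).
print(f"mean of z = {statistics.mean(z_scores):.6f}")
print(f"SD of z = {statistics.stdev(z_scores):.6f}")
```

Values above the mean get positive z-scores and values below it get negative ones.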

Capturing variation

• Chebyshev’s Theorem: applies to any distribution, regardless of shape

• Empirical Rule: applies only to roughly mound-shaped and symmetric distributions
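Chebyshev’s Theorem
At least $1 - \frac{1}{k^2}$ of the elements of any distribution lie within k standard deviations of the mean (for k > 1).

• k = 2: at least $1 - \frac{1}{2^2} = 1 - \frac{1}{4} = \frac{3}{4} = 75\%$ lie within 2 standard deviations of the mean
• k = 3: at least $1 - \frac{1}{3^2} = 1 - \frac{1}{9} = \frac{8}{9} \approx 89\%$ lie within 3 standard deviations of the mean
• k = 4: at least $1 - \frac{1}{4^2} = 1 - \frac{1}{16} = \frac{15}{16} \approx 94\%$ lie within 4 standard deviations of the mean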
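The bound can be sketched in a few lines and checked against a deliberately skewed data set. The `skewed` values and both function names are hypothetical; the population standard deviation (`statistics.pstdev`) is used for the check.

```python
import statistics


def chebyshev_bound(k):
    """Minimum share of any distribution within k standard deviations."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k**2


def fraction_within(data, k):
    """Observed share of the data within k standard deviations of the mean."""
    mean = statistics.fmean(data)
    sd = statistics.pstdev(data)
    return sum(abs(x - mean) <= k * sd for x in data) / len(data)


# Hypothetical, strongly right-skewed data.
skewed = [1, 1, 1, 2, 2, 3, 4, 5, 8, 20]

for k in (2, 3, 4):
    print(f"k = {k}: bound {chebyshev_bound(k):.2%}, "
          f"observed {fraction_within(skewed, k):.2%}")
```

The observed share is never below the bound, whatever the shape of the data; the bound is a guaranteed minimum, not an estimate.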
Empirical Rule
• For roughly mound-shaped and symmetric distributions, approximately:

   • 68% lie within 1 standard deviation of the mean
   • 95% lie within 2 standard deviations of the mean
   • Almost all lie within 3 standard deviations of the mean
Empirical Rule
[Figure: bell curve over x centered at m, with 68.26% of values within m ± 1s, 95.44% within m ± 2s, and 99.72% within m ± 3s]
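For an exactly normal distribution, the share of values within k standard deviations of the mean is erf(k / √2). The sketch below evaluates this with `math.erf` from the standard library; the function name `normal_share_within` is hypothetical. The results agree with the percentages above to within about 0.01 percentage points, a difference that is presumably due to rounding in printed tables.

```python
import math


def normal_share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


shares = {k: normal_share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} SD: {share:.2%}")
```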
Scatter Plots and Correlation

• A scatter plot (or scatter diagram) is used to show the relationship between two variables
• Correlation analysis is used to measure the strength of the linear association between two variables
   • Only concerned with the strength of the relationship
   • No causal effect is implied
Scatter Plot Examples
[Figure: example scatter plots of y against x showing linear relationships, curvilinear relationships, strong relationships, weak relationships, and no relationship]
Correlation Coefficient

• The correlation coefficient (r) is used to measure the strength of the linear relationship in the sample observations
Calculating sample Correlation Coefficient

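$$r_{xy} = \frac{\mathrm{cov}(x, y)}{s_x s_y}$$

$$\mathrm{cov}(x, y) = \frac{1}{n} \sum (x_i - \bar{x})(y_i - \bar{y})$$

$$s_x = \sqrt{\frac{1}{n} \sum (x_i - \bar{x})^2}, \qquad s_y = \sqrt{\frac{1}{n} \sum (y_i - \bar{y})^2}$$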
Features of correlation coefficient
• Unit free
• Range between -1.00 and 1.00
• -1≤r<0 implies that as X ↑ (↓), Y ↓ (↑ )
• 0< r≤1 implies that as X ↑ (↓), Y ↑ (↓)
• The closer to -1.00, the stronger the negative linear relationship
• The closer to 1.00, the stronger the positive linear relationship
• The closer to 0.00, the weaker the linear relationship
• r=0 implies that X and Y are not linearly associated
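The sample correlation formula and the features listed above can be checked with a small sketch that uses only the standard library. It divides by n in both the covariance and the standard deviations, so the divisor cancels in r. The data lists and the function name `correlation` are hypothetical.

```python
import math


def correlation(x, y):
    """Sample correlation r = cov(x, y) / (s_x * s_y), using divisor n."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n
    s_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / n)
    s_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / n)
    return cov / (s_x * s_y)


# Hypothetical data, invented for this illustration.
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]      # rises exactly in line with x
y_down = [10, 8, 6, 4, 2]    # falls exactly in line with x
y_noisy = [2, 1, 4, 3, 5]    # rises with x, but not perfectly

r_up = correlation(x, y_up)
r_down = correlation(x, y_down)
r_noisy = correlation(x, y_noisy)
print(f"r_up = {r_up:.2f}, r_down = {r_down:.2f}, r_noisy = {r_noisy:.2f}")
```

The perfectly linear series give r of 1 and -1, while the noisy series gives r = 0.8, a strong but imperfect positive linear relationship.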
Examples of Approximate r Values

[Figure: example scatter plots of y against x with r = -1.00, r = -0.60, r = 0.00, r = 0.20, and r = 1.00]
