Descriptive Statistics and Exploratory Data Analysis

This document discusses descriptive statistics and exploratory data analysis. It covers basic numerical and graphical summaries of data, including measures of central tendency, variation, histograms, box plots, and how to visualize univariate, bivariate, and multivariate data. It also provides an overview of how to calculate descriptive statistics and make graphs in R.

Uploaded by

Emmanuel Adjei Odame

Lecture 2: Descriptive Statistics and Exploratory Data Analysis

Further Thoughts on Experimental Design

16 Individuals (8 each from two populations) with replicates

Pop 1 and Pop 2:

- Randomly sample 4 individuals from each pop
- Tissue culture and RNA extraction
- Labeling and array hybridization
- Slide scanning and data acquisition
- Repeat 2 times, processing 16 samples in total
- Repeat entire process, producing 2 technical replicates for all 16 samples

Other Business
Course web-site:
http://www.gs.washington.edu/academics/courses/akey/56008/index.htm

Homework due on Thursday not Tuesday

Make sure you look at HW1 soon and see either Shameek or myself with questions

Today
- What is descriptive statistics and exploratory data analysis?
- Basic numerical summaries of data
- Basic graphical summaries of data
- How to use R for calculating descriptive statistics and making graphs

Central Dogma of Statistics

[Figure: diagram relating Population and Sample, labeled with Probability, Descriptive Statistics, and Inferential Statistics]

EDA
Before making inferences from data it is essential to examine all your variables.

Why? To listen to the data:
- to catch mistakes
- to see patterns in the data
- to find violations of statistical assumptions
- to generate hypotheses
- and because if you don't, you will have trouble later

Types of Data
Categorical
- binary: 2 categories
- nominal: more categories
- ordinal: order matters

Quantitative
- discrete: numerical
- continuous: uninterrupted

Dimensionality of Data Sets

- Univariate: Measurement made on one variable per subject
- Bivariate: Measurement made on two variables per subject
- Multivariate: Measurement made on many variables per subject

Numerical Summaries of Data

- Central Tendency measures. They are computed to give a "center" around which the measurements in the data are distributed.
- Variation or Variability measures. They describe "data spread", or how far away the measurements are from the center.
- Relative Standing measures. They describe the relative position of specific measurements in the data.

Location: Mean
1. The Mean

To calculate the average $\bar{x}$ of a set of observations, add their values and divide by the number of observations:

$$\bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
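As a quick check of the formula, here is a minimal Python sketch (the lecture itself uses R; Python is an assumption made for these examples). It reuses the ages from the median example in this lecture and compares the by-hand calculation with the standard library's `statistics.mean`.

```python
import statistics

# Ages borrowed from the median example in this lecture.
ages = [17, 19, 21, 22, 23, 23, 23, 38]

# Add the values and divide by the number of observations.
mean_by_hand = sum(ages) / len(ages)

# The standard library should agree with the by-hand calculation.
mean_builtin = statistics.mean(ages)

print(mean_by_hand, mean_builtin)
```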

Other Types of Means

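Weighted means:

$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$

Trimmed: the mean computed after dropping a fixed fraction of the smallest and largest observations.

Geometric:

$$\bar{x} = \left( \prod_{i=1}^{n} x_i \right)^{1/n}$$

Harmonic:

$$\bar{x} = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$$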
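The four alternative means can be sketched in a few lines of Python. The data and weights below are hypothetical, and the trimmed mean here drops the same proportion from each tail, which is one common convention rather than something the slide specifies.

```python
import math
import statistics

def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the observations from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number dropped per tail
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

def geometric_mean(values):
    """n-th root of the product of n positive values."""
    return math.prod(values) ** (1 / len(values))

def harmonic_mean(values):
    """n divided by the sum of the reciprocals."""
    return len(values) / sum(1 / x for x in values)

data = [1, 2, 4, 8]      # hypothetical positive observations
weights = [1, 1, 1, 5]   # hypothetical weights

print(weighted_mean(data, weights))
print(trimmed_mean(data, 0.25))
print(geometric_mean(data), statistics.geometric_mean(data))
print(harmonic_mean(data), statistics.harmonic_mean(data))
```

The geometric and harmonic means are only meaningful for positive values, which is why the sample data avoids zeros and negatives.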

Location: Median
Median: the exact middle value

Calculation:
- If there is an odd number of observations, find the middle value
- If there is an even number of observations, find the middle two values and average them

Example
Some data:
Age of participants: 17 19 21 22 23 23 23 38
Median = (22+23)/2 = 22.5
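The odd/even rule can be written out directly. This is a minimal Python sketch (the function name `median` is just illustrative), checked against the ages in the example.

```python
import statistics

ages = [17, 19, 21, 22, 23, 23, 23, 38]

def median(values):
    """Middle value for odd n; average of the middle two for even n."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median(ages))             # 22.5, as in the example
print(statistics.median(ages))  # the standard library agrees
```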

Which Location Measure Is Best?

- Mean is best for symmetric distributions without outliers
- Median is useful for skewed distributions or data with outliers

[Figure: two plots on a 0 to 10 axis. First: Mean = 3, Median = 3. Second: Mean = 4, Median = 3.]
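A small sketch of why the median resists outliers. The two data sets below are hypothetical, chosen only so that they reproduce the summaries quoted for the two plots (mean 3, median 3 versus mean 4, median 3); the plotted data themselves are not recoverable here.

```python
import statistics

symmetric = [1, 2, 3, 4, 5]      # hypothetical, no outlier
with_outlier = [1, 2, 3, 4, 10]  # same data with the largest value pushed out

for name, values in (("symmetric", symmetric), ("with outlier", with_outlier)):
    # The mean follows the extreme value; the median stays put.
    print(name, statistics.mean(values), statistics.median(values))
```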

Scale: Variance
Average of squared deviations of values from the mean

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Why Squared Deviations?

- Adding deviations will yield a sum of ?
- Absolute values do not have nice mathematical properties
- Squares eliminate the negatives

Result:
Increasing contribution to the variance as you go farther from the mean.
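To see these points numerically, the following Python sketch reuses the ages from the median example. It sums the raw deviations, then compares how much of the total the most extreme value (38) contributes under absolute versus squared deviations.

```python
ages = [17, 19, 21, 22, 23, 23, 23, 38]
mean = sum(ages) / len(ages)

# Raw deviations: positives and negatives cancel each other out.
deviations = [x - mean for x in ages]
print(sum(deviations))

# Share of the total contributed by the value farthest from the mean.
absolute = [abs(d) for d in deviations]
squared = [d ** 2 for d in deviations]
abs_share = max(absolute) / sum(absolute)
sq_share = max(squared) / sum(squared)
print(abs_share, sq_share)
```

Squaring gives the farthest point a larger share of the total than taking absolute values does, which is the "increasing contribution" noted above.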

Scale: Standard Deviation

- Variance is somewhat arbitrary
- What does it mean to have a variance of 10.8? Or 2.2? Or 1459.092? Or 0.000001?
- Nothing. But if you could "standardize" that value, you could talk about any variance (i.e. deviation) in equivalent terms
- Standard deviations are simply the square root of the variance

Scale: Standard Deviation

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

1. Score (in the units that are meaningful)
2. Mean $\bar{x}$
3. Each score's deviation from the mean
4. Square that deviation
5. Sum all the squared deviations (Sum of Squares)
6. Divide by n-1
7. Take the square root: now the value is in the units we started with!
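The seven steps translate line by line into code. A minimal Python sketch, using the ages from the median example as the scores and checking the result against `statistics.stdev`, which also divides by n-1:

```python
import math
import statistics

scores = [17, 19, 21, 22, 23, 23, 23, 38]        # 1. scores, in years

mean = sum(scores) / len(scores)                 # 2. mean
deviations = [x - mean for x in scores]          # 3. deviation of each score
squared = [d ** 2 for d in deviations]           # 4. square each deviation
sum_of_squares = sum(squared)                    # 5. sum of squares
variance = sum_of_squares / (len(scores) - 1)    # 6. divide by n-1
sd = math.sqrt(variance)                         # 7. square root, back in years

print(variance, sd)
print(statistics.stdev(scores))  # should match sd
```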

Interesting Theoretical Result

Regardless of how the data are distributed, a certain percentage of values must fall within k standard deviations from the mean:

At least $1 - 1/k^2$ of the values lie within $\mu \pm k\sigma$:

- k = 1: at least $1 - 1/1^2$ = 0% within $\mu \pm 1\sigma$
- k = 2: at least $1 - 1/2^2$ = 75% within $\mu \pm 2\sigma$
- k = 3: at least $1 - 1/3^2$ = 89% within $\mu \pm 3\sigma$

Note use of $\sigma$ (sigma) to represent standard deviation.
Note use of $\mu$ (mu) to represent mean.
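The bound can be checked empirically on data that are far from bell-shaped. This Python sketch draws a hypothetical skewed sample (exponential, with an arbitrary seed) and compares the observed share within k standard deviations with the guaranteed minimum. It uses the population standard deviation (`statistics.pstdev`), for which the bound holds exactly on any finite data set.

```python
import random
import statistics

def chebyshev_bound(k):
    """Minimum share of values within k standard deviations of the mean."""
    return 1 - 1 / k ** 2

def fraction_within(values, k):
    """Observed share of values within k standard deviations of the mean."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    inside = sum(1 for x in values if abs(x - mu) <= k * sigma)
    return inside / len(values)

rng = random.Random(0)  # arbitrary seed, for reproducibility
skewed = [rng.expovariate(1.0) for _ in range(1000)]

for k in (1, 2, 3):
    print(k, round(chebyshev_bound(k), 2), round(fraction_within(skewed, k), 3))
```

The observed shares are typically well above the bound: the result is a worst-case guarantee, not an estimate.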

Often We Can Do Better

For many lists of observations, especially if their histogram is bell-shaped:

1. Roughly 68% of the observations in the list lie within 1 standard deviation of the average
2. 95% of the observations lie within 2 standard deviations of the average

[Figure: bell-shaped curve with Ave-2s.d., Ave-s.d., Average, Ave+s.d., and Ave+2s.d. marked, showing the 68% and 95% regions]
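A simulation makes the 68% and 95% figures concrete. This is a sketch under stated assumptions: the sample is 10,000 hypothetical draws from a standard normal distribution with an arbitrary seed, so the observed shares will be close to, but not exactly, the quoted values.

```python
import random
import statistics

rng = random.Random(1)  # arbitrary seed, for reproducibility
sample = [rng.gauss(0, 1) for _ in range(10_000)]

avg = statistics.fmean(sample)
sd = statistics.stdev(sample)

def share_within(values, center, spread, k):
    """Share of values within k * spread of center."""
    return sum(1 for x in values if abs(x - center) <= k * spread) / len(values)

within_1 = share_within(sample, avg, sd, 1)
within_2 = share_within(sample, avg, sd, 2)
print(within_1, within_2)
```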

Scale: Quartiles and IQR

[Figure: distribution split into four parts of 25% each, with the IQR marked]

- The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger
- Q2 is the same as the median (50% are smaller, 50% are larger)
- Only 25% of the observations are greater than the third quartile

Percentiles (aka Quantiles)

In general the nth percentile is a value such that n% of the observations fall at or below it

Q1 = 25th percentile
Median = 50th percentile
Q3 = 75th percentile
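Quartiles and the IQR can be computed with Python's `statistics.quantiles`, shown here on the ages from the median example. Software packages use different interpolation conventions for percentiles, so Q1 and Q3 from this sketch may differ slightly from what R or a hand calculation gives; the median is unaffected.

```python
import statistics

ages = [17, 19, 21, 22, 23, 23, 23, 38]

# n=4 requests the three cut points that split the data into quarters.
q1, q2, q3 = statistics.quantiles(ages, n=4)
iqr = q3 - q1  # interquartile range: spread of the middle 50%

print(q1, q2, q3, iqr)
```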

Graphical Summaries of Data

A (Good) Picture Is Worth A 1,000 Words

Univariate Data: Histograms and Bar Plots

What's the difference between a histogram and bar plot?

Bar plot
- Used for categorical variables to show frequency or proportion in each category
- Translates the data from frequency tables into a pictorial representation

Histogram
- Used to visualize distribution (shape, center, range, variation) of continuous variables
- "Bin size" is important

Effect of Bin Size on Histogram

[Figure: histograms of simulated data, 1000 N(0,1) and 500 N(1,1), drawn with different bin sizes; y-axis: Frequency]
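The effect of bin size can be explored without plotting by counting how the same data fall into bins of different widths. This Python sketch simulates the mixture described above (1000 draws from N(0,1) and 500 from N(1,1)); the seed and the three bin widths are arbitrary choices for illustration.

```python
import math
import random
from collections import Counter

rng = random.Random(42)  # arbitrary seed, for reproducibility
data = ([rng.gauss(0, 1) for _ in range(1000)]
        + [rng.gauss(1, 1) for _ in range(500)])

def bin_counts(values, width):
    """Count values per bin; bin b covers [b * width, (b + 1) * width)."""
    return Counter(math.floor(x / width) for x in values)

for width in (0.25, 1.0, 3.0):
    counts = bin_counts(data, width)
    print(f"bin width {width}: {len(counts)} non-empty bins, "
          f"tallest bin holds {max(counts.values())} values")
```

Narrow bins give many sparse, noisy bars; very wide bins lump everything into a few bars and hide the shape of the distribution.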

More on Histograms
What's the difference between a frequency histogram and a density histogram?

[Figure: Frequency Histogram and Density Histogram shown side by side]
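In a frequency histogram the bar heights are counts; in a density histogram each height is the count divided by (n × bin width), so the bar areas sum to 1. A minimal Python sketch, using the ages from the median example and a hypothetical bin width of 5 years:

```python
import math
from collections import Counter

ages = [17, 19, 21, 22, 23, 23, 23, 38]
width = 5  # hypothetical bin width, in years

# Frequency histogram: count of observations per bin, keyed by left edge.
frequency = Counter(math.floor(x / width) * width for x in ages)

# Density histogram: rescale each count by n * width.
n = len(ages)
density = {left: count / (n * width) for left, count in frequency.items()}

# Total area of the density bars (height * width) should be 1.
total_area = sum(height * width for height in density.values())

print(dict(frequency))
print(density)
print(total_area)
```

The two histograms have the same shape when all bins share one width; only the y-axis scale changes.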

Box Plots
[Figure: box plot of the variable AGE (y-axis: Years, 0 to 100), with maximum, Q3, median, Q1, and minimum labeled and the IQR marked between Q1 and Q3]

Bivariate Data
Variable 1     Variable 2     Display
Categorical    Categorical    Crosstabs, Stacked Box Plot
Categorical    Continuous     Boxplot
Continuous     Continuous     Scatterplot, Stacked Box Plot
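For two categorical variables, a crosstab is just a count of each pair of categories. The sketch below builds one in Python with `collections.Counter`; the records (population and an expression level) are hypothetical and exist only to illustrate the layout.

```python
from collections import Counter

# Hypothetical records: (population, expression level) for eight subjects.
records = [
    ("Pop 1", "high"), ("Pop 1", "high"), ("Pop 1", "low"), ("Pop 1", "high"),
    ("Pop 2", "low"), ("Pop 2", "low"), ("Pop 2", "high"), ("Pop 2", "low"),
]

table = Counter(records)  # counts per (row category, column category) pair
rows = sorted({row for row, _ in records})
cols = sorted({col for _, col in records})

# Print the crosstab: one line per row category, one count per column category.
print("", *cols, sep="\t")
for row in rows:
    print(row, *(table[(row, col)] for col in cols), sep="\t")
```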

Multivariate Data
Clustering
- Organize units into clusters
- Descriptive, not inferential
- Many approaches
- "Clusters" always produced

Data Reduction Approaches (PCA)
- Reduce n-dimensional dataset into much smaller number
- Finds a new (smaller) set of variables that retains most of the information in the total sample
- Effective way to visualize multivariate data

How to Make a Bad Graph

The aim of good data graphics:
Display data accurately and clearly

Some rules for displaying data badly:

- Display as little information as possible
- Obscure what you do show (with chart junk)
- Use pseudo-3d and color gratuitously
- Make a pie chart (preferably in color and 3d)
- Use a poorly chosen scale
From Karl Broman: http://www.biostat.wisc.edu/~kbroman/

[Figures: Example 1 through Example 5]

R Tutorial
- Calculating descriptive statistics in R
- Useful R commands for working with multivariate data (apply and its derivatives)
- Creating graphs for different types of data (histograms, boxplots, scatterplots)
- Basic clustering and PCA analysis
