INTRODUCTION TO
DATA & STATISTICS
WITH R
2
HELLO!
I am Elijah Appiah from
Ghana.
I am an Economist by
profession.
I love everything about R!
The secret behind the smile!
You can reach me: eappiah.uew@gmail.com
3
Lecture Series
Introduction to Data and Statistics
Foundations of Probability
Inferential Statistics
Modeling and Regression Analysis
4
Lesson Goal
Introduce statistics as a science of
understanding and analyzing data
and making data-based decisions.
5
Statistics
Practice and study of:
collecting data
analyzing data
6
Statistics
Two main branches of statistics:
Descriptive – describes and summarises data
Inferential – uses sample data to make inferences about a larger population
7
Statistics - Data
Two main types of data
Numeric (Quantitative)
Categorical (Qualitative)
8
Statistics - Data
Data
Numeric (Quantitative): Continuous, Discrete
Categorical (Qualitative): Nominal, Ordinal
9
Data
Numeric
Discrete – counts
e.g. number of cylinders of a vehicle
Continuous – measured; can take any value within an interval
e.g. height, weight
Categorical
Nominal – names, labels, categories with no natural order
e.g. gender, countries
Ordinal – categories with an order
e.g. Likert scales
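The four data types above map onto R vector types roughly as follows. This is a minimal sketch; all values are made up purely for illustration.

```r
# Hypothetical values, made up purely for illustration.
cylinders <- c(4L, 6L, 8L, 4L)                # discrete: whole-number counts
height_cm <- c(172.4, 160.1, 181.7, 168.0)    # continuous: any value in an interval
country   <- factor(c("Ghana", "Kenya", "Ghana", "Togo"))  # nominal: no natural order
rating    <- factor(c("Agree", "Neutral", "Disagree", "Agree"),
                    levels = c("Disagree", "Neutral", "Agree"),
                    ordered = TRUE)           # ordinal: categories with an order

class(cylinders)     # "integer"
class(height_cm)     # "numeric"
is.ordered(country)  # FALSE
is.ordered(rating)   # TRUE
rating < "Agree"     # comparisons only make sense for ordered factors
```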
10
EXPLORATION & SUMMARIES
12
Visualizing Numeric Data
TWO VARIABLES
Correlation does not imply causation
13
Visualizing Numeric Data
TWO VARIABLES
Both Continuous
Scatter plot – geom_point()
Data Source: gapminder.com
Variables (2012): Country {country}, Income per person in US$ {income}, Life Expectancy in years {lifeExp}
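A scatter plot of two continuous variables could be drawn along these lines. This is a sketch only: the data frame below is simulated placeholder data that borrows the column names {country, income, lifeExp}, not the actual Gapminder figures, and it assumes the ggplot2 package is installed.

```r
library(ggplot2)

# Simulated placeholder data, NOT the real Gapminder values.
set.seed(1)
n <- 50
countries <- data.frame(
  country = paste("Country", seq_len(n)),
  income  = round(exp(runif(n, log(500), log(50000))))
)
countries$lifeExp <- 30 + 5 * log(countries$income) + rnorm(n, sd = 3)

# Scatter plot: one point per country
p <- ggplot(countries, aes(x = income, y = lifeExp)) +
  geom_point() +
  labs(x = "Income per person (US$)", y = "Life expectancy (years)")
print(p)

# Correlation summarises the linear association; it does not establish causation
cor(countries$income, countries$lifeExp)
```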
14
Now, let’s practice
15
Visualizing Numeric Data
ONE VARIABLE
Discrete
Bar Plot – geom_bar()
Continuous
Histogram – geom_histogram()
Density Plot – geom_density()
Dot Plot – geom_dotplot()
Box Plot – geom_boxplot()
Data Source: gapminder.com
Variables (2012): Country {country}, Income per person in US$ {income}, Life Expectancy in years {lifeExp}
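Each of the geoms listed above takes a single variable. The sketch below uses simulated placeholder data rather than the Gapminder file; the `children` column is a hypothetical discrete variable added only so that the bar plot has something to count. It assumes ggplot2 is installed.

```r
library(ggplot2)

# Simulated placeholder data, NOT the real Gapminder values.
set.seed(2)
dat <- data.frame(
  lifeExp  = rnorm(100, mean = 70, sd = 8),  # continuous
  children = rpois(100, lambda = 3)          # discrete count (hypothetical column)
)

p_bar  <- ggplot(dat, aes(x = children)) + geom_bar()              # discrete
p_hist <- ggplot(dat, aes(x = lifeExp)) + geom_histogram(bins = 20)
p_dens <- ggplot(dat, aes(x = lifeExp)) + geom_density()
p_dot  <- ggplot(dat, aes(x = lifeExp)) + geom_dotplot(binwidth = 1)
p_box  <- ggplot(dat, aes(x = lifeExp)) + geom_boxplot()

print(p_hist)  # print any of the objects above to draw it
```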
16
Visualizing Numeric Data
[Figure: histograms of left-skewed, symmetric, and right-skewed distributions]
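One way to see skewness numerically is to compare the mean with the median: the mean is pulled toward the long tail. The sketch below uses simulated data; the choice of distributions is an assumption made only for illustration.

```r
set.seed(3)
n <- 10000
right_skewed <- rexp(n)    # long tail to the right
left_skewed  <- -rexp(n)   # mirror image: long tail to the left
symmetric    <- rnorm(n)   # bell-shaped, no skew

# Mean and median for each shape
sapply(
  list(left = left_skewed, symmetric = symmetric, right = right_skewed),
  function(x) c(mean = mean(x), median = median(x))
)
```

In the right-skewed sample the mean exceeds the median, in the left-skewed sample it falls below it, and in the symmetric sample the two are close.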
17
Now, let’s practice
18
Visualizing Numeric Data
[Figure: the same histogram drawn with large, moderate, and narrow bin widths]
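The effect of bin width can be explored by redrawing one histogram with different `binwidth` values. This is a sketch on simulated data; `hist_with()` is a hypothetical helper defined here, and the three widths are arbitrary choices. It assumes ggplot2 is installed.

```r
library(ggplot2)

# Simulated placeholder data for illustration only.
set.seed(4)
dat <- data.frame(lifeExp = rnorm(200, mean = 70, sd = 8))

# Hypothetical helper: draw the same histogram with a given bin width
hist_with <- function(bw) {
  ggplot(dat, aes(x = lifeExp)) +
    geom_histogram(binwidth = bw) +
    ggtitle(paste("binwidth =", bw))
}

p_large    <- hist_with(10)   # few wide bars: detail is hidden
p_moderate <- hist_with(3)    # overall shape is visible
p_narrow   <- hist_with(0.5)  # many thin bars: noisy

print(p_moderate)
```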
19
Visualizing Numeric Data
Dot Plot
20
Now, let’s practice
21
Visualizing Numeric Data
Box Plot
[Figure: box plots of left-skewed, normal, and right-skewed distributions]
22
Now, let’s practice
23
Population vs Sample
Population – the entire group you want to draw conclusions about
e.g. income of all countries in the world
Sample – a specific group from the population used for inference
e.g. income of countries in Africa
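The relationship between a population and a sample can be sketched by simulation. Both the "population" of incomes and the sample size below are assumptions made for illustration, and the sample here is drawn at random, which is one common way of aiming for a representative sample.

```r
set.seed(5)

# Simulated "population" of incomes (placeholder values, not real data)
population <- rlnorm(5000, meanlog = 9, sdlog = 1)

# A random sample of 100 units drawn from that population
smp <- sample(population, size = 100)

mean(population)  # population parameter (usually unknown in practice)
mean(smp)         # sample statistic used to estimate it
```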
24
Measures of Central Tendency
Sometimes it is not practical to collect every observation in the population
Estimates from a sample may not be perfect
A good sample (representative of the population) makes estimates good guesses.
25
Measures of Central Tendency
Key characteristics of a distribution
Mean
Mode
Median
Population parameters vs Sample
Statistics
26
Measures of Central Tendency
Mean: mean()
Median: median()
Mode: table() {base} or count() {dplyr}
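The three measures can be computed as in the sketch below. The vector `x` is made up for illustration. Base R has no built-in function for the statistical mode, so it is read off the frequency table here; `dplyr::count()` would give the same frequencies for a data frame column.

```r
x <- c(2, 3, 3, 5, 7, 3, 8, 5)  # hypothetical values

mean(x)    # 4.5
median(x)  # 4

# Mode: the most frequent value(s), taken from the frequency table
freq <- table(x)
mode_value <- as.numeric(names(freq)[freq == max(freq)])
mode_value  # 3
```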
27
Now, let’s practice
28
Measures of Spread
Data Variability
[Figure: two distributions with the same centre, one with Mean = 0, SD = 1 and one with Mean = 0, SD = 2]
29
Measures of Spread
Range: (maximum – minimum)
Variance: (average squared deviation from the mean)
Standard Deviation: (square root of the variance; the typical deviation from the mean)
Interquartile Range: (range of the middle 50% of the data; third quartile minus first quartile)
30
Measures of Spread
Range: max(x) - min(x); diff(range(x))
Note: range(x) returns the minimum and the maximum, not their difference
Variance: var(x)
Standard Deviation: sd(x); sqrt(var(x))
Interquartile Range: IQR(x); quantile(); boxplot
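The functions above can be tried on a small vector. The values are made up for illustration. Note that `var()` and `sd()` use the sample formula, dividing by n - 1 rather than n.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # hypothetical values

max(x) - min(x)   # 7
range(x)          # 2 9 (minimum and maximum, not their difference)
diff(range(x))    # 7

var(x)            # sample variance: sum of squared deviations / (n - 1)
sd(x)             # square root of the variance

quantile(x, probs = c(0.25, 0.75))  # first and third quartiles
IQR(x)                              # third quartile minus first quartile
```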
31
Robust Statistics
A robust statistic is a measure that is little affected by extreme values
32
Robust Statistics
Robust measures of Center & Spread
Example:
Data              Mean     Median
1,2,3,4,5,6       3.5      3.5
1,2,3,4,5,1000    169.17   3.5
Note: The mean depends on every observation, whereas the median depends only on the middle of the ordered data, so the values at the end points are irrelevant to its calculation.
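The example can be checked directly in R, using the two data sets from the table:

```r
a <- c(1, 2, 3, 4, 5, 6)
b <- c(1, 2, 3, 4, 5, 1000)  # same data with one extreme value

mean(a)    # 3.5
median(a)  # 3.5

mean(b)    # 1015 / 6, about 169.17
median(b)  # 3.5, unchanged by the extreme value
```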
33
Robust Statistics
The median is a more robust statistic of center than the mean.
Likewise, the IQR (which, like the median, is based on quantiles) is more robust than the standard deviation (which is calculated using the mean).
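The same comparison can be sketched for spread, reusing two small data sets that differ only in one extreme value. `IQR()` uses R's default quantile method here.

```r
a <- c(1, 2, 3, 4, 5, 6)
b <- c(1, 2, 3, 4, 5, 1000)  # same data with one extreme value

sd(a)   # about 1.87
sd(b)   # several hundred: inflated by the single extreme value

IQR(a)  # 2.5
IQR(b)  # 2.5, unaffected by the extreme value
```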
34
Robust Statistics
Robust statistics like the median and
IQR are most useful for describing
skewed distributions.
Non-robust statistics like the mean
and standard deviation are useful for
describing symmetric data.
35
Data Transformation
Rescaling data
Logarithmic Transformation
Square Root Transformation
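These transformations could look as follows in R. The income values are hypothetical, and standardising to z-scores with `scale()` is only one possible reading of "rescaling"; min-max scaling would be another.

```r
income <- c(500, 1200, 3000, 8000, 20000, 55000)  # hypothetical, right-skewed

# Rescaling (one option): z-scores with mean 0 and standard deviation 1
z_income <- as.numeric(scale(income))

# Logarithmic transformation: compresses large values strongly
log_income <- log(income)

# Square root transformation: a milder compression than the log
sqrt_income <- sqrt(income)

round(cbind(income, z_income, log_income, sqrt_income), 2)
```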
36
Now, let’s practice
37
EXPLORATION & SUMMARIES
38
Exploring Categories
Data
titanic {ggmosaic}
Passengers and crew on board the Titanic
Description
A dataset containing some demographics and survival of people
on board the Titanic
Variables: Class (1st, 2nd, 3rd, crew); Sex (Male, Female);
Age (Child, Adult); Survived (Yes, No)
39
Exploring Categories
One Categorical Variable
Frequency Table
Bar Plots
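A frequency table and bar plot for one categorical variable can be sketched as below. To keep the example free of extra packages it uses base R's built-in `Titanic` table as a stand-in for `titanic` from ggmosaic; that substitution is an assumption of this sketch, and the two data sets are stored differently.

```r
# Base R's Titanic is a 4-way table; convert it to a data frame of counts
people <- as.data.frame(Titanic)  # columns: Class, Sex, Age, Survived, Freq

# Frequency table for one categorical variable
class_counts <- xtabs(Freq ~ Class, data = people)
class_counts

# Relative frequencies (proportions)
round(prop.table(class_counts), 3)

# Bar plot of the counts
barplot(class_counts, xlab = "Class", ylab = "Number of people")
```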
40
Now, let’s practice
41
Exploring Categories
Two Categorical Variables
Contingency Table
Stacked Bar Plots
Clustered Bar Plots
Mosaic Plots
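The four displays for two categorical variables can be sketched with base R graphics. As an assumption of this sketch, base R's built-in `Titanic` table stands in for `titanic` from ggmosaic.

```r
people <- as.data.frame(Titanic)  # columns: Class, Sex, Age, Survived, Freq

# Contingency table: Class by Survived
tab <- xtabs(Freq ~ Class + Survived, data = people)
tab

# Row proportions: survival rate within each class
round(prop.table(tab, margin = 1), 3)

# Stacked bars: one bar per class, split by survival
barplot(t(tab), legend.text = TRUE, xlab = "Class")

# Clustered bars: survival categories side by side within each class
barplot(t(tab), beside = TRUE, legend.text = TRUE, xlab = "Class")

# Mosaic plot: tile areas are proportional to the cell counts
mosaicplot(tab, main = "Class by survival")
```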
42
Now, let’s practice
43
Exploring Categories
One Numerical and One Categorical
Box Plot
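A box plot of a numerical variable split by a categorical one can be sketched as below. The built-in `mtcars` data (miles per gallon by number of cylinders) is used here as an assumption; the lecture's own practice data may differ.

```r
# Treat the number of cylinders as a categorical variable
cars_df <- transform(mtcars, cyl = factor(cyl))

# One box per category
boxplot(mpg ~ cyl, data = cars_df,
        xlab = "Number of cylinders", ylab = "Miles per gallon")

# Numerical summary behind the boxes: median mpg per category
med <- tapply(cars_df$mpg, cars_df$cyl, median)
med
```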
44
Now, let’s practice
45
Any questions?
Reach me anytime!
Email
eappiah.uew@gmail.com
LinkedIn
https://www.linkedin.com/in/appiah-elijah-383231123/