QUANTITATIVE DATA ANALYSIS
Fundamentals of Data Analysis
Dr. Olunga Okomo, PhD
Broad Objective
• By the end of the session learners shall have
comprehended the concepts of statistical
significance, correlation and regression, and be
able to skilfully apply them in their research
work
Specific Objectives
• To define basic terms in statistics
• To discuss correlation and regression analysis
• To describe the linear correlation coefficient
• To describe confidence intervals
• To discuss the binomial distribution
• To discuss the normal distribution
Definition of terms
• Statistical Data Analysis: This is the process of
collecting and analyzing large volumes of data in
order to identify trends and develop valuable
insights.
• In the professional world, statistical analysts take
raw data and find correlations between variables
to reveal patterns & trends to stakeholders
• Inferential Statistics are methods used to make
inferences about the relationship between the
dependent and independent variables in a
population, based on a sample of observations
Variables and Observation units
• Observation unit: The entity in research about
which data are collected.
Example:
– Individual (students, health workers, farmers ….)
– Groups (e.g. family, household, couples)
– Institution, organization or community (e.g. school,
enterprise, municipality)
• Variable: A characteristic measured on each unit,
e.g. time, age, sex, scores in an exam
Note: the mean of a sample is a statistic, while the mean
of the whole population is a parameter
Definition cont…
• Correlation Analysis: A process used to establish
relationship patterns within datasets of variables. A
positive correlation means that both variables
increase (or decrease) together, while a negative
correlation means that as one variable increases, the
other decreases.
• Linear correlation: The ratio of variation between the
two given variables is the same/constant,
for example, every time income increases by
20% there is a rise in expenditure of 5%.
• Non-linear correlation: A situation where the ratio of
variation between the two given variables changes (is not
constant).
Stages of statistical analysis
Note: With statistical data analysis programs you
can easily do several steps in one operation.
1. Clean your data
• Make very sure that your data are correct
(e.g. check data transcription)
• Make sure that missing values (e.g. not
answered questions in a survey) are clearly
identified as missing data
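The cleaning step above can be sketched in Python with pandas. The mini-survey below and its -1 missing-value code are invented for illustration:

```python
# Hypothetical survey data: -1 marks "not answered" (an assumed convention).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":   [23, 27, -1, 30],
    "score": [68, -1, 75, 80],
})

# Recode the missing-value code to NaN so it is clearly identified as
# missing and excluded from later statistics.
df = df.replace(-1, np.nan)

print(df.isna().sum())  # count of missing values per variable
```

Recoding to NaN (rather than leaving -1) prevents the sentinel value from silently distorting means and other statistics.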
Stages of statistical analysis
2. Gain knowledge about your data
• Make lists of data (for small data sets only!)
• Produce descriptive statistics, e.g. means,
standard-deviations, min, max for each variable
• Produce graphics, e.g. histograms or box plot
that show the distribution
3. Produce composite scales
• E.g. create a single variable from a set of
questions
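Steps 2 and 3 can be sketched together with pandas; the item names q1–q3 and the response values are made up:

```python
import pandas as pd

df = pd.DataFrame({
    "q1": [4, 5, 3, 2, 4],
    "q2": [3, 5, 4, 2, 5],
    "q3": [4, 4, 3, 1, 5],
})

# Step 2: descriptive statistics (mean, std, min, max) for each variable.
print(df.describe())

# Step 3: a composite scale created by summing a set of related items.
df["attitude"] = df[["q1", "q2", "q3"]].sum(axis=1)
print(df["attitude"])
```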
Stages of statistical analysis
4. Make graphics or tables that show
relationships
• E.g. Scatter plots for interval data (as in our
previous examples) or cross-tabulations
5. Calculate coefficients that measure the
strength and the structure of a relation
• Strength examples: Cramer’s V for cross-
tabulations, or Pearson’s R for interval data
• Structure examples: regression coefficient,
tables of means in analysis of variance
Stages of statistical analysis
6. Calculate coefficients that describe the
percentage of variance explained
• E.g. R2 in a regression analysis
7. Compute significance level, i.e. find out
if you can interpret the relation
• E.g. Chi2 for cross-tabs, Fisher’s F in
regression analysis
Steps in conducting Chi square in SPSS
• Go to data entry
• Click analyze Menu then select:
Descriptive statistics then
Cross-tabulation
• Click Rows provision (insert dependent variables)
• Click Columns provision (insert indep. variables of your
choice)
• Click on Exact: highlight Monte Carlo and configure the Confidence
interval (C.I.) of your choice, e.g. 95%
• Indicate the total number of samples in the study
• Click on Statistics: highlight Chi-square
• Click on Cells: highlight column and row, then click OK.
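For readers not using SPSS, a rough Python equivalent of this procedure uses SciPy. The 2×2 counts reuse the abuse/insurance cross-tabulation shown later in these notes:

```python
# Chi-square test on a 2x2 table (rows: outcome, columns: exposure).
from scipy.stats import chi2_contingency, fisher_exact

table = [[1011, 954],
         [104, 294]]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p:.3g}")

# For small samples (expected cell counts < 5), Fisher's exact test
# is preferred; on a 2x2 table it also returns the odds ratio:
odds_ratio, p_exact = fisher_exact(table)
print(f"odds ratio = {odds_ratio:.2f}, exact p = {p_exact:.3g}")
```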
Difference between Chi-square and Fisher’s exact
• The chi-squared test applies an approximation
assuming the sample is large, while
Fisher's exact test runs an exact procedure,
which is preferred for small samples (e.g. when any
expected cell count is < 5)
Significance Level
For a study finding to be significant:
1. P value < 0.05 (at 95% C.I.)
2. Chi-square value > the critical value (3.84 at
1 df, α = 0.05)
3. Odds ratio whose C.I. excludes 1 (an OR > 2
additionally suggests a strong association)
4. Relative risk whose C.I. excludes 1 (an RR > 2
additionally suggests a strong association)
5. C.I.: both the lower and upper limits must lie
on the same side of the null value (1 for ratios,
0 for differences), and the gap between the
lower and the upper limits should be quite narrow
for the subject under study
Measures of Association
(Relations)
Data preparation and composite scale
making
Data preparation
• Enter the data Assign a number to each
response item (planned when you design
the questionnaire)
• Enter a clear code for missing values (no
response), e.g. -1
• Make sure that your data set is complete
and free of errors
Data preparation and composite scale
making
Data preparation
• Some simple descriptive statistics (minimum,
maximum, missing values, etc.) can help
• Learn how to document the data in your statistics
program
• Enter labels for variables, labels for responses
items, display instructions (e.g. decimal points to
show)
• Define data-types (interval, ordinal or nominal)
Composite scales (indicators)
Basics:
• Most scales are made by simply adding
the values from different items
(sometimes called "Likert scales")
• Eliminate items that have a high
number of non responses
Composite scales (indicators)
Basics cont.:
• Make sure to take into account missing values
(non responses) when you add up the
responses from the different items. A real
statistics program (SPSS) does that for you
• Make sure when you create your
questionnaire that all items use the same
range of response item, else you will need to
standardize !!
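A minimal pandas sketch of the two points above, with hypothetical items and responses:

```python
# Composite-scale building that respects missing values.
import numpy as np
import pandas as pd

items = pd.DataFrame({
    "q1": [4, 5, np.nan, 2],
    "q2": [3, 5, 4, 2],
    "q3": [4, np.nan, 3, 1],
})

df = items.copy()
# Averaging the answered items (pandas skips NaN by default) avoids
# understating respondents who skipped an item, unlike a raw sum.
df["scale"] = items.mean(axis=1)

# If items used different response ranges, standardize first (z-scores):
z = (items - items.mean()) / items.std()
df["scale_z"] = z.mean(axis=1)
print(df)
```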
Composite scales (indicators)
Quality of a scale:
Again: use a published set of items to measure a
variable (if available); if you do, you can avoid
making long justifications!
Sensitivity: questionnaire scores discriminate,
e.g. if exploratory research has shown a higher degree
of presence in one kind of learning environment
than in another, the results of a presence
questionnaire should demonstrate this.
Composite scales (indicators)
Quality of a scale:
• Reliability: internal consistency is high
Inter-correlation between items (alpha) is
high
• Validity: results obtained with the
questionnaire can be tied to other measures
• e.g. were similar to results obtained by
other tools (e.g. in depth interviews),
• e.g. results are correlated with similar
variables.
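The inter-item consistency coefficient mentioned above (Cronbach's alpha) can be computed directly from an items matrix; the responses below are made up:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, k_items) matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()   # sum of item variances
    total_var = items.sum(axis=1).var(ddof=1)      # variance of total scores
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Hypothetical responses of five people to three items of one scale:
data = np.array([
    [4, 3, 4],
    [5, 5, 4],
    [3, 4, 3],
    [2, 2, 1],
    [4, 5, 5],
])
print(round(cronbach_alpha(data), 2))
```

Values above roughly 0.7 are conventionally taken as acceptable internal consistency.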
Stage One: Define the Research Problem
In this stage, the following issues are
addressed:
Relationship to be analyzed
Specifying the dependent and
independent variables
Method for including independent
variables
Define Relationship to be analyzed
The goal of this analysis is to examine the
relationship between …………….
Specifying the dependent and independent variables
The dependent variables are/is …………………...
The independent variables:
• AGE 'Age of respondent’
• EDUC ‘Highest year of school completed’
• DEGREE ‘Respondent’s Highest Degree’
• SEX ‘Respondent’s Sex’
Note: Identify the method for including variables
Stage 2: Develop the Analysis Plan: Sample Size Issues
In this stage, the following issues are addressed:
• Missing data analysis
• Minimum sample size requirement: cases per
independent variable
Missing data analysis
Check the magnitude of missing cases. If the number of cases
with missing data is small, it is unlikely to produce a missing-data
process that disrupts the analysis.
Stage 2: Develop the Analysis Plan: Sample Size
Issues
Minimum sample size requirement: cases per
independent variable
Check the sample size and verify that you have
enough cases to give a stable result
Depends on the statistical test, but usually 15-20
cases per independent variable is sufficient.
Stage 2: Develop the Analysis Plan:
Measurement Issues:
• Examine the data structure:
• How to incorporate non-metric data with
dummy variables
• How to deal with curvilinear effects with
polynomials
• How to identify and describe Interaction or
Moderator Effects
Stage 3: Evaluate Underlying Assumptions
Evaluate the assumptions for the intended
statistics in terms of:
• Non-metric dependent variable with two or
more groups
• Metric or non-metric independent
variables
Stage 4: Run your statistical estimates and
Assess Overall Fit: Model Estimation
• Compute all your statistical estimates and
assess the model fit.
• Interpret the results
• Report your findings.
Overview of statistical methods
Descriptive statistics
• Descriptive statistics are not very interesting
in most cases (unless they are used to
compare different cases in comparative
systems designs)
• Therefore, do not fill up pages of your report
with tons of Excel diagrams !!
Which data analysis for which data types?
Popular multi-variate analysis
                          Dependent variable Y
Independent (explaining)  Quantitative (interval)   Qualitative (nominal or ordinal)
variable X
Quantitative              Factor analysis,          Transform Y into a
                          multiple regression,      quantitative variable
                          SEM, cluster analysis
Qualitative               ANOVA                     Multidimensional scaling etc.
Which data analysis for which data types?
Popular bi-variate analysis

                          Dependent variable Y
Independent (explaining)  Quantitative (interval)   Qualitative (nominal or ordinal)
variable X
Quantitative              Correlation and           Transform Y into a
                          regression                quantitative variable
Qualitative               Analysis of variance      Cross-tabulations
Types of statistical coefficients
• First of all the coefficient must be more or less
appropriate for the data
The big four:
1. Strength of a relation
• Coefficients usually range from -1 (total
negative relationship) to +1 (total positive
relationship). 0 means no relationship.
2. Structure (tendency) of a relation
3. Percentage of variance explained
Types of statistical coefficients
4. Significance level of a model
• The probability of observing a result as
extreme or more extreme than the one
actually observed from chance alone
• Typically in the social sciences a sig. level
of 5% (0.05) is acceptable
• These four are mathematically connected:
e.g. the significance level depends not just on
the size of the sample, but also on the
strength of the relation.
Cross-tabulation
• Cross-tabulation is a popular technique to study
relationships between nominal (categorical) or
ordinal variables.
Computing the percentages (probabilities)
1. For each value of the independent variable,
compute the percentages
• Usually the X variable is put on top (i.e. its values show
in columns). If you are unsure whether to compute percentages
across rows or columns, remember this:
• You want to know the probability (percentage) that a
value of X leads to a value of Y
• Compare (interpret) percentages across the dependent
variable.
Cross-tabulation

Ever been emotionally or    Social Economic Status as risk factor
physically abused?          Has health insurance   Has no health insurance
No                          1,011                  954
Yes                         104                    294
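Column percentages (percent within each category of the independent variable) for this table can be computed with pandas:

```python
import pandas as pd

# The abuse/insurance cross-tabulation above.
table = pd.DataFrame(
    {"has insurance": [1011, 104], "no insurance": [954, 294]},
    index=["not abused", "abused"],
)

# Percent within each column of the independent variable.
col_pct = table / table.sum(axis=0) * 100
print(col_pct.round(1))
```

Comparing the "abused" row across the two columns shows how the percentage differs between the insurance groups.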
Steps in Statistical Testing
• Null hypothesis
Ho: there is no difference between the groups
• Alternative hypothesis
H1: there is a difference between the groups
• Collect data
• Perform test statistic e.g. T test, Chi square or ANOVA
• Interpret P value and confidence intervals
P value < 0.05: Reject Ho
P value > 0.05: Fail to reject Ho
• Draw conclusions
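The testing steps above can be sketched with a two-sample t test in SciPy; the two groups are invented example data:

```python
# Ho: no difference between the groups; H1: the groups differ.
from scipy.stats import ttest_ind

group_a = [68, 72, 75, 70, 74, 69]
group_b = [60, 63, 65, 62, 64, 61]

t_stat, p_value = ttest_ind(group_a, group_b)
if p_value < 0.05:
    print(f"p = {p_value:.3g}: reject Ho (the groups differ)")
else:
    print(f"p = {p_value:.3g}: fail to reject Ho")
```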
CORRELATION
• Correlation is a powerful statistical tool
used to examine the relationship
between two or more variables.
• We can use the correlation coefficient to
determine the presence, direction and
magnitude of the relation between variables.
• We can determine if the differences among
subjects for one variable can be accounted
for, or explained by, another variable.
Methods of studying Correlation
• Methods include:
– Scatter diagram
– Pearson’s correlation coefficient
– Spearman’s Rank correlation coefficient
– Method of least squares
• The first method is based on the knowledge
of graphs whereas the others are
mathematical methods
Types of Correlation
• Correlation may be classified as:
1. Positive and negative
2. Linear and non-linear
3. Simple and multiple
• Give examples of correlation
How To Calculate correlation
• Step 1: Find the mean of x, and the mean of y.
• Step 2: Subtract the mean of x from every x
value (call them "a"), and subtract the mean
of y from every y value (call them "b")
• Step 3: Calculate ab, a² and b² for every pair of values.
• Step 4: Sum up ab, sum up a² and sum up b².
• Step 5: r = Σab ÷ √(Σa² × Σb²)
Key points
• The Pearson correlation coefficient, also known as
Pearson’s r, is a statistic that estimates
the strength of the relationship between two
variables. Whenever a statistical test is
performed between two variables, it is a good
idea to estimate the correlation coefficient
to gauge how strong the relationship between
them is.
Key points
• A correlation coefficient of -1 means a
perfect negative relationship between the
variables. If the correlation coefficient is 0,
there is no linear relationship. If the
correlation coefficient is +1, it means a
perfect positive relationship between the
variables.
Key points
• The Pearson correlation coefficient shows the
relationship between two variables measured
on the same interval or ratio scale.
It estimates the strength of the relationship
between the two continuous variables.
Example of Correlation calculation
• Pearson’s correlation coefficients:
Is there a relationship between the age of
husbands and the wives at the time of marriage?
Husbands: 23 27 28 28 28 30 30 33 35 38
Wives: 18 20 22 27 21 29 27 29 28 29
Correlation
x̄ = 30, ȳ = 25

X        dx     dx²     Y        dy     dy²     dx·dy
23                      18
27                      20
28                      22
28                      27
28                      21
30                      29
30                      27
33                      29
35                      28
38                      29
ΣX=300   Σdx=   Σdx²=   ΣY=250   Σdy=   Σdy²=   Σdx·dy=
Correlation
x̄ = 30, ȳ = 25

X        dx      dx²       Y        dy      dy²       dx·dy
23       -7      49        18       -7      49        49
27       -3      9         20       -5      25        15
28       -2      4         22       -3      9         6
28       -2      4         27       2       4         -4
28       -2      4         21       -4      16        8
30       0       0         29       4       16        0
30       0       0         27       2       4         0
33       3       9         29       4       16        12
35       5       25        28       3       9         15
38       8       64        29       4       16        32
ΣX=300   Σdx=0   Σdx²=168  ΣY=250   Σdy=0   Σdy²=164  Σdx·dy=133
Correlation
r = the sum of the products of the deviations divided by
the square root of the product of the sums of squares:

r = Σdx·dy ÷ √(Σdx² × Σdy²)
  = 133 ÷ √(168 × 164)
  = 133 ÷ √27552
  = 133 ÷ 165.99
  = 0.8013
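The hand calculation can be cross-checked with NumPy:

```python
# Pearson's r for the husbands/wives ages used in the worked example.
import numpy as np

husbands = [23, 27, 28, 28, 28, 30, 30, 33, 35, 38]
wives    = [18, 20, 22, 27, 21, 29, 27, 29, 28, 29]

r = np.corrcoef(husbands, wives)[0, 1]
print(round(r, 4))  # matches the hand-computed 0.8013
```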
Properties of the Coefficient of Correlation
• The coefficient of correlation lies between -1 and +1
(-1 ≤ r ≤ +1)
• The nearer the coefficient is to ±1, the stronger the
relationship
• The coefficient of correlation is symmetrical with
respect to x and y
• The coefficient of correlation is independent of change
of origin and scale of X and Y; by change of
scale we mean dividing or multiplying every value of X
and Y by some constant
• If X and Y are independent variables then the coefficient of
correlation is zero; however, the converse is not true
Regression
• Equation of a straight line.
• Given two points (x1, y1) and (x2, y2), one can
draw a line
• The equation of a line is given by y = ax + b.
– Where b = y intercept (where the line crosses the
y-axis, i.e. x = 0)
– a = gradient/slope.
Regression
[Figure: a straight line plotted on X and Y axes]
Regression
Gradient/slope (a) = Δy/Δx = (y2 − y1)/(x2 − x1)

Find the gradient and equation of a line
passing through the two points (1,1) and (2,3):
Gradient = Δy/Δx = (3 − 1)/(2 − 1) = 2
Regression
• Equation of a straight line, using the point (2, 3):
y − 3 = 2(x − 2)
y = 2x − 4 + 3
y = 2x − 1
Regression
• The equation of a straight line is given by
y = b + ax
• In regression analysis, we call the x variable
the independent variable because we want to
use this variable to predict or estimate scores
on the dependent variable, the y variable.
• We estimate the intercept (b) and the
slope (a), since all points cannot lie on one
straight line. This straight line, also known as
the line of best fit, is known as the regression
of y on x.
Regression
It can be shown, by using the least-squares
method of estimation, that:

a = Σ(xi − x̄)(yi − ȳ) ÷ Σ(xi − x̄)²

and

b = ȳ − a·x̄

giving the line y = b + ax
Regression
• A regression line is a straight line that best
represents the linear relationship between
two variables x and y.
• The regression line is considered to be the
‘line of best fit’ when the sum of the squared
distances between each actual score y and
predicted score ŷ is minimal, i.e. Σ(y − ŷ)² is
minimal.
Regression

No     xi    yi    xi − x̄   yi − ȳ   (xi − x̄)(yi − ȳ)   (xi − x̄)²
1      2     1
2      3     3
3      5     7
4      7     11
5      9     15
6      10    17
Sum    36    54
Mean   6     9
Regression

i      xi    yi    xi − x̄   yi − ȳ   (xi − x̄)(yi − ȳ)   (xi − x̄)²
1      2     1     -4       -8       32                 16
2      3     3     -3       -6       18                 9
3      5     7     -1       -2       2                  1
4      7     11    1        2        2                  1
5      9     15    3        6        18                 9
6      10    17    4        8        32                 16
Σ      36    54    0        0        104                52
Mean   6     9
Regression
Thus x̄ = 6 and ȳ = 9
a = 104 ÷ 52 = 2
b = 9 − (2 × 6) = −3
Thus the regression line is
y = 2x − 3
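The fitted line can be cross-checked with NumPy's least-squares fit:

```python
# Least-squares slope and intercept for the worked regression example.
import numpy as np

x = np.array([2, 3, 5, 7, 9, 10])
y = np.array([1, 3, 7, 11, 15, 17])

slope, intercept = np.polyfit(x, y, 1)
print(round(slope, 6), round(intercept, 6))  # matches y = 2x - 3
```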
Linearity
• Example: Most popular statistical methods
for interval data assume linear
relationships.
• In the following example the relationship is
non-linear: students who show weak daily
computer use have bad grades, but so do
the ones who show very strong use.
• Popular measures like Pearson’s r will
"not work", i.e. you will get a very weak
correlation and therefore miss this
non-linear relationship
Linearity
[Figure: scatter plot of grades versus daily computer use showing a non-linear relationship]
Normal Distribution
• Most methods for interval data also require "normal
distribution“
• If you have data with "extreme cases" and/or data that is
skewed, some individuals will have much more "weight"
than the others.
• Hypothetical example: The "red" student who uses the
computer for very long hours will determine a positive
correlation and positive regression rate, whereas the
"black" ones suggest a non-existent correlation.
• Mean use of computers does not represent "typical" usage.
• The "green" student, however, will not have a major impact
on the result, since the other data are well distributed along
the two axes. In this second case the "mean" represents a
"typical" student.
Normal distribution
Standard Normal Distribution
Mean +/- 1 SD encompasses 68% of observations
Mean +/- 2 SD encompasses 95% of observations
Mean +/- 3SD encompasses 99.7% of observations
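The 68/95/99.7 rule above can be verified from the standard normal CDF with SciPy:

```python
# Coverage of mean +/- k SD under a standard normal distribution.
from scipy.stats import norm

for k in (1, 2, 3):
    coverage = norm.cdf(k) - norm.cdf(-k)
    print(f"mean +/- {k} SD: {coverage:.1%}")
```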
Principle of statistical analysis
The goal of statistical analysis is quite simple: find
structure in the data
DATA = STRUCTURE + NON-STRUCTURE
DATA = EXPLAINED VARIANCE + NOT EXPLAINED VARIANCE
Example: Simple regression analysis
• DATA = predicted regression line + residuals
• in other words: regression analysis tries to find a
line that will maximize prediction and minimize
residuals
Confidence Intervals
• Confidence intervals express the range in which the
true value of a population parameter (as estimated
by the sample statistic) falls, with a high degree
of confidence (usually 95% or 99%).
• Example: Take mean = 205.15;
95% CI = (204.70, 205.60);
99% CI = (204.56, 205.75).
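A 95% confidence interval for a sample mean can be sketched with SciPy's t distribution; the sample values below are invented, chosen to give the example's mean of 205.15:

```python
# 95% CI for a sample mean using the t distribution.
import numpy as np
from scipy import stats

sample = np.array([204.9, 205.3, 205.0, 205.6, 204.8, 205.2, 205.4, 205.0])
mean = sample.mean()
sem = stats.sem(sample)  # standard error of the mean
low, high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)
print(f"mean = {mean:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```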
END