CHAPTER EIGHT
DATA ANALYSIS
8.1. Data for Analysis
Data Analysis is the process of systematically applying statistical and/or
logical techniques to describe and illustrate, condense and recap, and
evaluate data.
Before analyzing the data for your research, it is important to know the type of data you have
at hand, because the type of data determines which technique you can use.
The following figure summarizes the types of data used in research.
1 01-09-2024
8.1.1. Quantitative data can be divided into
two distinct groups:
A. Categorical and
B. Numerical
A. Categorical data
These are data that cannot be measured numerically as
quantities.
Categorical data can be further sub-divided into
1. Nominal data - whose values cannot be measured numerically
or ranked. Instead, these data are summarized by counting the
number of occurrences in each category of a variable.
Examples of nominal variables:
Where a person lives (AA, Adama, B/Dar, etc.)
Gender (male, female)
Nationality (American, Ethiopian, Chinese)
Ethnicity (Oromo, Amhara, Tigre, Gurage…)
2. Ranked/Ordinal data - whose values can be ranked in order
Examples of ordinal data
Education (Elementary school, High school, College Diploma, College
degree, Masters)
Agreement (strongly disagree, disagree, neutral, agree, strongly agree)
Rating (poor, fair, good, excellent)
Frequency (never, sometimes, often, always)
Any other scale ("On a scale of 1 to 5...")
Descriptive (nominal) data with only two categories are known as
dichotomous data.
E.g. gender can be divided into female and male,
or questions with a 'Yes' or 'No' response.
B. Numerical Data
Numerical data, sometimes termed 'quantifiable', are those
whose values are measured or counted numerically as
quantities.
Numerical data can be analysed using a far wider
range of statistics than categorical data.
Coding the Data
Coding – the process of translating information gathered from
questionnaires or other sources into something that can be
analyzed.
It involves assigning a value to the information given; often the value is
also given a label.
Coding can make data more consistent
Example: Question = Sex
Answers = Male, Female, M, or F
Coding will avoid such inconsistencies
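The Sex example above can be sketched in Python as follows. This is only an illustration: the codes 1 = Male and 2 = Female, and the names SEX_LABELS, RAW_TO_CODE and code_sex, are assumptions made for the example, not a prescribed scheme.

```python
# Hypothetical coding scheme (value -> label); any clear scheme would do.
SEX_LABELS = {1: "Male", 2: "Female"}

# Every spelling found in the raw answers is mapped to a single code.
RAW_TO_CODE = {"male": 1, "m": 1, "female": 2, "f": 2}

def code_sex(answer):
    """Return the numeric code for a raw answer, or None if unrecognized."""
    return RAW_TO_CODE.get(answer.strip().lower())

raw_answers = ["Male", "F", "female", "M", " m "]
coded = [code_sex(a) for a in raw_answers]      # consistent numeric codes
labels = [SEX_LABELS[c] for c in coded]         # labels recovered from codes
print(coded)
print(labels)
```

After coding, "Male", "M" and " m " all share one value, so they are counted as the same category.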
Coding Systems
Common coding systems (code and label) for dichotomous variables:
0=No 1=Yes
(1 = value assigned, Yes = label of value)
OR: 1=No 2=Yes
When you assign a value, you must also make it clear what that value
means
In the first example above, 1=Yes, but in the second example, 1=No
As long as it is clear how the data are coded, either is fine
Coding: Ordinal Variables
The coding process is similar to that for other categorical variables
Example: variable EDUCATION, possible coding:
0 = Did not graduate from high school
1 = High school graduate
2 = Some college or post-high school education
3 = College graduate
It could also be coded in reverse order (0=college graduate, 3=did
not graduate high school), as long as the ranking is preserved
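A minimal sketch of the EDUCATION coding above, including the reverse coding. The dictionary names are hypothetical; the codes and labels are those listed in the example.

```python
# Codes from the EDUCATION example: a higher code means more education.
EDUCATION_CODES = {
    "Did not graduate from high school": 0,
    "High school graduate": 1,
    "Some college or post-high school education": 2,
    "College graduate": 3,
}

# Reverse coding: subtract each code from the highest code.
max_code = max(EDUCATION_CODES.values())
REVERSED_CODES = {label: max_code - code for label, code in EDUCATION_CODES.items()}

# The ranking is kept in both schemes; only its direction changes.
ascending = sorted(EDUCATION_CODES, key=EDUCATION_CODES.get)
descending = sorted(REVERSED_CODES, key=REVERSED_CODES.get)
print(ascending == descending[::-1])
```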
Coding: Nominal Variables
For coding nominal variables, order makes no difference
Example: variable RESIDENCE
1 = Northeast
2 = South
3 = Northwest
4 = Midwest
5 = Southwest
Order does not matter, no ordered value associated with each
response
Coding: Continuous Variables
Creating categories from a continuous variable (e.g. age) is
common
You may break down a continuous variable into chosen categories,
creating an ordinal categorical variable
Example: variable = AGE
1 = 0–9 years old
2 = 10–19 years old
3 = 20–39 years old
4 = 40–59 years old
5 = 60 years or older
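The AGE categories above can be assigned with a short function. This is a sketch only; AGE_CUTOFFS and code_age are hypothetical names, and the cut-off values are taken from the example.

```python
from bisect import bisect_right

# Lower bounds of categories 2 to 5 in the AGE example.
AGE_CUTOFFS = [10, 20, 40, 60]

def code_age(age):
    """Return the category code (1 to 5) for an age in years."""
    if age < 0:
        raise ValueError("age cannot be negative")
    # bisect_right counts how many cut-offs are <= age.
    return bisect_right(AGE_CUTOFFS, age) + 1

ages = [9, 10, 19, 20, 39, 40, 59, 60, 85]
age_codes = [code_age(a) for a in ages]
print(age_codes)
```

The resulting codes are ordinal: a higher code always means an older age group.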
8.2. Types of Data Analysis
Data analysis is the process of inspecting, cleaning, transforming, and modelling data
with the goal of discovering useful information, suggesting conclusions, and
supporting decision making.
Data analysis can be made using:
(i) Descriptive Statistics
(ii) Inferential Statistics
Descriptive statistics are used to describe, summarize, or
explain a given set of data.
Inferential statistics are used to infer characteristics of a
population from a sample.
8.2.1. Univariate Analysis
Univariate analysis describes a single
variable in terms of the applicable unit of analysis.
Measures of central tendency and measures of dispersion are
the typical categories of univariate analysis.
A. Measures of Central Tendency
The three most frequently used measures of central
tendency are
• Mode
• Median and
• Mean
1. Mode
Mode can be defined as the most frequently occurring value in a
group of observations.
If the scores for a given sample distribution are:
32, 32, 35, 36, 37, 38, 38, 39, 39, 39, 40, 40, 42, 45
Then the mode would be 39 because a score of 39 occurs three
times, more than any other score.
The mode is a very good measure of the location of a
distribution in the case of nominal data.
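The mode of the sample scores can be checked with Python's standard library. A minimal sketch, using the 14 scores from the example:

```python
from collections import Counter
from statistics import multimode

scores = [32, 32, 35, 36, 37, 38, 38, 39, 39, 39, 40, 40, 42, 45]

counts = Counter(scores)      # frequency of each score
modes = multimode(scores)     # list of the most frequent value(s)
print(modes, counts[39])      # → [39] 3
```

multimode returns a list because a distribution may have more than one mode.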
2. Median
Median is defined as the middle value in an ordered arrangement
of observations.
The median is often used to summarize the location of a distribution.
Further, the median can be used with ordinal, interval, or ratio
measurements.
If the scores for a given sample distribution are:
32, 32, 35, 36, 37, 38, 38, 39, 39, 39, 40, 40, 42, 45
There are 14 scores, so the median is the average of the 7th and 8th values:
(38 + 39) / 2 = 38.5
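The same median can be computed in Python. The sketch below does it by hand and with the standard library, using the scores from the example:

```python
from statistics import median

scores = [32, 32, 35, 36, 37, 38, 38, 39, 39, 39, 40, 40, 42, 45]

ordered = sorted(scores)
n = len(ordered)
# With an even number of scores, average the two middle values.
manual_median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2
library_median = median(scores)
print(manual_median, library_median)   # → 38.5 38.5
```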
3. Mean
The arithmetic mean is the most commonly used and accepted
measure of central tendency.
This should be used in the case of interval or ratio data.
If the scores for a given sample distribution are:
32, 32, 35, 36, 37, 38, 38, 39, 39, 39, 40, 40, 42, 45
The mean of the distribution will be:
(32+32+35+36+37+38+38+39+39+39+40+40+42+45)/14 = 532/14 = 38
Other types of means include the mid-mean, geometric mean, and mid-range (p. 139 of
QRM)
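The arithmetic mean of the sample scores is the sum of the scores divided by their number. A minimal Python sketch:

```python
scores = [32, 32, 35, 36, 37, 38, 38, 39, 39, 39, 40, 40, 42, 45]

total = sum(scores)                  # sum of all scores
mean_value = total / len(scores)     # divide by the number of scores
print(total, mean_value)             # → 532 38.0
```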
Bivariate Analysis/Relationships between Variables
Bivariate analysis helps researchers understand the nature, direction, and significance
of the relationship between two variables in the study.
Often in practical situations, researchers are interested in
describing associations between variables.
They try to ascertain how two variables are related to each
other, that is, whether a change in one affects the other.
The measures of association depend on the nature of the data,
and the association could be positive, negative or neutral.
8.2.1.1. Relation between two nominal variables - Chi-square (χ²) Test
This analysis technique is used to determine whether there is a relationship between
two nominal variables.
E.g. Is viewing a television advertisement of a product (Yes/No)
related to buying that particular product (Buy/Not buy)?
Another example: an international business researcher wants to establish whether the
performance (categorized as loss, breakeven and profit) of a
firm depends on the country (categorized as low, middle
and high income) in which it is located.
There are three different types of chi-square analysis
1. Chi-square test for goodness of fit
2. Chi-square test for homogeneity
3. Chi-square test of independence
The first is used to see whether a sample has been drawn from
a population with a specified distribution, and the second whether populations are
homogeneous with respect to a given characteristic.
These two are less common, so we will focus on the third
type of test
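As an illustration of the chi-square test of independence, the sketch below computes the test statistic for a 2 x 2 table. The counts are hypothetical (invented for this example, not survey results), and chi_square_independence is an illustrative name. No continuity correction is applied.

```python
def chi_square_independence(observed):
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (obs - expected) ** 2 / expected
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df

# Hypothetical counts: rows = viewed the advertisement (Yes, No),
# columns = purchase decision (Buy, Not buy).
observed = [[30, 20],
            [15, 35]]
chi2, df = chi_square_independence(observed)

# 3.841 is the critical chi-square value for df = 1 at the 0.05 level.
related = chi2 > 3.841
print(round(chi2, 3), df, related)
```

When the statistic exceeds the critical value, the hypothesis that the two variables are independent is rejected at that significance level.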
8.2.1.2. Correlation Analysis
Correlation is a measure of the relationship between two variables. It has wide
application in business and statistics.
The correlation coefficient describes the direction of the correlation, that is,
whether it is
• Positive or
• Negative,
And the strength of the correlation, that is, whether an existing correlation is:
• Strong or
• Weak.
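One common correlation coefficient is Pearson's r, which ranges from -1 to +1: the sign gives the direction and the absolute value gives the strength. The sketch below assumes Pearson's r is the coefficient meant here and uses hypothetical data; pearson_r is an illustrative name.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data for illustration only.
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 5, 4, 5]      # tends to rise with x
y_down = [5, 4, 5, 4, 2]    # tends to fall as x rises

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
print(round(r_up, 4), round(r_down, 4))
```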
8.2.1.3. Bivariate regression analysis
Regression is one of the most frequently used techniques in business and
social research.
Regression analysis is used to predict the value of one variable (the
dependent variable) on the basis of another variable (the independent
variable).
The most common form of regression is linear regression,
where the dependent variable is related to the independent variable in a
linear way.
The linear regression equation takes the
following form:
Y = β0 + β1X + ε
Variables:
X = Independent Variable (we provide this)
Y = Dependent Variable (we observe this)
Parameters:
β0 = Y-Intercept
β1 = Slope
ε = error term
Note: β1 indicates the change in the dependent variable for
every unit change in the independent variable
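For a line of the form Y = β0 + β1X + ε, the intercept and slope can be estimated by ordinary least squares. The sketch below assumes that estimation method and uses hypothetical data; fit_line is an illustrative name.

```python
def fit_line(x, y):
    """Least-squares estimates (b0, b1) for the line y = b0 + b1 * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    b1 = sxy / sxx                # slope: change in Y per unit change in X
    b0 = mean_y - b1 * mean_x     # intercept: predicted Y when X = 0
    return b0, b1

# Hypothetical data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = fit_line(x, y)
predicted_at_6 = b0 + b1 * 6
print(round(b0, 2), round(b1, 2), round(predicted_at_6, 2))
```

For these data the fitted slope is 0.6, so each one-unit increase in X is associated with a 0.6 increase in the predicted Y.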
Regression coefficient
The regression coefficient measures how strongly the predictor (independent variable, IV)
predicts the dependent variable (DV)
There are two types of regression coefficients
1. Unstandardized coefficients
2. Standardized coefficients (Beta Values)
The unstandardized coefficients can be used in the regression equation, as
the coefficients of the different independent variables along with the
constant term, to predict the value of the dependent variable.
An unstandardized coefficient is the difference in “Y” per unit change in “X”.
The standardized coefficient (Beta) is measured in
standard deviations, i.e. the difference in “Y” in standard
deviations per standard deviation difference in “X”.
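In a regression with a single predictor, the standardized coefficient can be obtained from the unstandardized one by rescaling with the two standard deviations: Beta = b1 × (SD of X / SD of Y). A minimal sketch with hypothetical data:

```python
from statistics import stdev

# Hypothetical data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Unstandardized slope: difference in Y per unit change in X.
b1 = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
      / sum((a - mean_x) ** 2 for a in x))

# Standardized slope: difference in Y (in SDs) per one-SD difference in X.
beta = b1 * stdev(x) / stdev(y)
print(round(b1, 4), round(beta, 4))
```

With only one predictor, Beta equals the correlation coefficient between X and Y.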
R values
R represents the correlation between the observed values and the
predicted values (based on the regression equation obtained) of the
dependent variable.
It is used to measure the fit of the model used for the
research.
R square is the square of R and gives the proportion of variance in the
dependent variable accounted for by the set of independent variables
chosen for the model.
The R-square value tends to be influenced by the number of independent
variables and by the number of cases.
Therefore, the adjusted R square, which takes these into account,
provides more accurate information about the fit of the model.
While it is not uncommon to get an R square value as high as 0.99 in
natural science, a much lower value (0.10 – 0.20) of R square
is acceptable in social science research.
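R square and adjusted R square can be computed from observed and predicted values. The sketch below uses the common formula adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1), where n is the number of cases and k the number of independent variables. The data and the fitted line are hypothetical.

```python
def r_squared(y, y_hat):
    """Proportion of the variance in y accounted for by the predictions y_hat."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))   # unexplained variation
    ss_tot = sum((a - mean_y) ** 2 for a in y)             # total variation
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n, k):
    """Adjust R square for n cases and k independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical data and a hypothetical fitted line y_hat = 2.2 + 0.6 * x.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
y_hat = [2.2 + 0.6 * a for a in x]

r2 = r_squared(y, y_hat)
adj_r2 = adjusted_r_squared(r2, n=len(y), k=1)
print(round(r2, 4), round(adj_r2, 4))
```

The adjusted value is lower than R square here because the sample is very small.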
2. Multicollinearity
Multicollinearity is a situation in which two or more IVs are highly
correlated with each other.
If variables are highly correlated with each other, it is
difficult to come up with reliable estimates of their
individual regression coefficients.
In other words, when two variables are highly correlated,
they both convey essentially the same information.
How to detect the presence of multicollinearity?
1. The Variance Inflation Factor (VIF) is greater than 5, which means the Tolerance is less than 0.2,
as tolerance is the inverse of VIF.
2. Any two IVs have a variance proportion in excess of 0.9 (column value)
corresponding to any row in which the condition index is in excess of 30.
If there is a serious multicollinearity problem, try solutions such as:
Removing highly correlated predictors
Linearly combining predictors, such as adding them together
Running entirely different analyses, such as principal components analysis (to
know similarities and differences)
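The VIF rule can be sketched as follows. The code uses the standard definition Tolerance = 1 − R²j and VIF = 1 / Tolerance, where R²j is the R square obtained by regressing predictor j on the other predictors. The R²j values below are hypothetical, and vif_and_tolerance is an illustrative name.

```python
def vif_and_tolerance(r2_j):
    """Return (VIF, tolerance) for a predictor, given its R square on the other predictors."""
    tolerance = 1 - r2_j
    vif = 1 / tolerance
    return vif, tolerance

# Hypothetical R square values for two predictors.
vif_a, tol_a = vif_and_tolerance(0.85)   # strongly explained by other predictors
vif_b, tol_b = vif_and_tolerance(0.50)   # moderately explained

flag_a = vif_a > 5    # rule of thumb from the text: VIF > 5 (tolerance < 0.2)
flag_b = vif_b > 5
print(round(vif_a, 2), round(tol_a, 2), flag_a)
print(round(vif_b, 2), round(tol_b, 2), flag_b)
```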
8.2.2. Multivariate Analysis
In many real-life situations, it becomes necessary to analyse the
relationships among three or more variables, which has led to the
popularity of multivariate statistics.
Multivariate statistics techniques look at the pattern of
relationships between several variables simultaneously.
The following section deals with categories of multivariate
analysis techniques.
8.2.2.1. Multiple Linear Regression
In simple regression, there is one dependent variable and one
independent variable, whereas in multiple regression, there is one
dependent variable and many independent variables.
Multiple regression examines the relationship between a single metric dependent
variable and two or more metric independent variables.
The multiple regression equation takes the following form:
y = a + b1x1 + b2x2 + … + bkxk
Where: y is the dependent variable, x1, x2, … xk are the independent variables, a is
the Y-intercept, and b1, b2, … bk are the regression coefficients.
Assumptions of normality and linearity should be checked before using multiple
regression.
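A multiple regression of the form y = a + b1x1 + … + bkxk can be fitted by least squares. The sketch below solves the normal equations in plain Python; it assumes the independent variables are not perfectly collinear, and the data are hypothetical (constructed so that y = 1 + 2x1 + 3x2 exactly). All function names are illustrative.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):                     # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_multiple_regression(rows, y):
    """rows: list of [x1, ..., xk] per case. Returns [a, b1, ..., bk]."""
    design = [[1.0] + list(r) for r in rows]           # leading 1 for the intercept
    k1 = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k1)] for i in range(k1)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k1)]
    return solve_linear_system(xtx, xty)

# Hypothetical data: y = 1 + 2*x1 + 3*x2 with no error term.
rows = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
y = [9, 8, 19, 18, 26]
a, b1, b2 = fit_multiple_regression(rows, y)
print(round(a, 4), round(b1, 4), round(b2, 4))
```

In practice a statistical package would be used; the sketch only shows what the coefficients a, b1 and b2 represent.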
Note: All the conditions and tests described above also apply in the case of
multivariate analysis.