DATA ANALYSIS,
INTERPRETATION AND
RESEARCH REPORT WRITING
DR. CHARLES OMANE-ADJEKUM
DEPARTMENT OF ACCOUNTING
PREPARATION AND
ORGANIZATION OF DATA
Research Data Analysis
Data analysis is a process used to inspect, clean, transform and remodel data
with a view to reaching a conclusion for a given situation. Data
analysis is typically of two kinds: qualitative or quantitative. The type of data
dictates the method of analysis.
In qualitative research, non-numerical data such as text or individual words
are analysed. Quantitative analysis, on the other hand, focuses on the
measurement of data and can use statistics to help reveal results and draw
conclusions.
Benefits Of Data Analysis
Among the many benefits of data analysis, the more important ones are:
Data analysis helps in structuring the findings from different sources of data.
Data analysis is very helpful in breaking a macro problem into micro parts.
Data analysis acts like a filter when it comes to acquiring meaningful
insights out of a huge data set.
Data analysis helps in keeping human bias away from the research
conclusion with the help of proper statistical treatment.
Data Preparation
1. Editing
2. Scoring
3. Coding
4. Data Cleaning
STATISTICAL DATA ANALYSIS –
DESCRIPTIVE STATISTICS
Descriptive and Inferential Statistics
Statistics is a set of procedures for gathering, measuring, classifying,
computing, describing, synthesizing, analyzing, and interpreting
systematically acquired quantitative data.
Statistics has two major components: Descriptive Statistics and
Inferential Statistics.
Descriptive Statistics give numerical and graphic procedures to
summarize a collection of data in a clear and understandable way
whereas Inferential Statistics provide procedures to draw inferences
about a population from a sample.
Variable
There are three major characteristics of a single variable that we tend to
look at:
Distribution
Central Tendency
Dispersion
Distribution
The distribution is a summary of the frequency of individual values
or ranges of values for a variable. The simplest distribution would list
every value of a variable and the number of times each value occurs.
One of the most common ways to describe a single variable is with a
frequency distribution.
Frequency distributions can be depicted in two ways, as a table or as
a graph. Distributions may also be displayed using percentages.
A frequency distribution organizes the raw data or observations that have
been collected into either ungrouped data or grouped data.
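As an illustration, a simple ungrouped frequency distribution can be built by counting how often each value occurs. The sketch below uses Python's standard library and made-up test scores (the data are hypothetical):

```python
from collections import Counter

# Hypothetical test scores (illustrative data only)
scores = [70, 85, 70, 60, 85, 85, 90, 60, 70, 85]

# Ungrouped frequency distribution: each value and the number of times it occurs
freq = Counter(scores)

total = len(scores)
for value in sorted(freq):
    count = freq[value]
    pct = 100 * count / total  # the same distribution displayed as percentages
    print(f"{value}: frequency={count}, percent={pct:.0f}%")
```

The same counts can be displayed as a table or plotted as a bar chart, which are the two depictions mentioned above.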
Shape of the Distribution
Simple descriptive statistics can provide some information relevant to
this issue. For example, if the skewness (which measures the deviation
of the distribution from symmetry) is clearly different from 0, then
that distribution is asymmetrical, while normal distributions are
perfectly symmetrical.
If the kurtosis (which measures the peakedness of the distribution) is
clearly different from 0, then the distribution is either flatter or more
peaked than normal; the excess kurtosis of the normal distribution is 0
(its raw kurtosis is 3).
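These two shape measures can be computed by hand from their definitions. The sketch below is a minimal, standard-library implementation using the population formulas; the data sets are hypothetical:

```python
import statistics

def skewness(xs):
    # Population skewness: mean cubed deviation divided by sd cubed
    m = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    n = len(xs)
    return sum((x - m) ** 3 for x in xs) / (n * sd ** 3)

def excess_kurtosis(xs):
    # Population kurtosis minus 3, so a normal distribution scores about 0
    m = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    n = len(xs)
    return sum((x - m) ** 4 for x in xs) / (n * sd ** 4) - 3

symmetric = [1, 2, 3, 4, 5]       # perfectly symmetric: skewness = 0
right_skewed = [1, 1, 2, 2, 10]   # long right tail: skewness > 0

print(skewness(symmetric))        # → 0.0
print(skewness(right_skewed) > 0) # → True
print(excess_kurtosis(symmetric)) # negative: flatter than normal
```

A positive skewness signals a tail to the right, a negative value a tail to the left, matching the interpretation given above.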
Central Tendency
These measures are also called Averages. They provide single
values which are used to summarise a set of
observations/data. The central tendency of a distribution is an
estimate of the "centre" of a distribution of values. They are
measures or statistics that describe the location of the centre
of a distribution.
A distribution as we have described earlier, consists of scores
and other numerical values such as number of years of
teaching, age, income, score in a test and the frequency of
their occurrence.
The Mode
The Mode is the most frequently occurring value in a set of
scores.
Thus, the mode is the most frequent score in a distribution;
that is, the score obtained by more students than any other
score.
Features of the mode
The main advantage is that it is the only measure that is useful for nominal scale data.
It is used when there is the need for a rough estimate of the measure of location.
It is used when there is the need to know the most frequently occurring value e.g., dress
styles.
It is not useful for further statistical work because the distribution can be bi-modal or
trimodal or have no mode at all.
The Median
It is a score such that approximately one-half (50%) of the
scores are above it and one-half (50%) are below it when
the scores are arranged sequentially – in short, the
midpoint. The Median is the score found at the exact
middle of the set of values
Features of the median
It is not influenced by extreme scores. For example, the median for the following numbers,
2, 3, 4, 5, 6 is 4. If 6 changes to 23 as an extreme score, the median remains 4.
It does not use all the scores in a distribution; it depends only on the middle value(s).
It has limited use for further statistical work.
It can be used when there is incomplete data at the beginning or end of the distribution.
It is mostly appropriate for data from interval and ratio scales.
Where there are very few observations, the median is not representative of the data.
Where the data set is large, it is tedious to arrange the data in an array for ungrouped data.
Uses of the median
1. It is used as the most appropriate measure of location when there is reason to believe that the
distribution is skewed.
2. It is used as the most appropriate measure of location when there are extreme scores to affect
the mean. E.g., Typical income in a company of senior and junior staff.
3. It is useful when the exact midpoint of the distribution is wanted.
4. It provides a standard of performance for comparison with individual scores when the score
distribution is skewed. For example, if the median score is 60 and an individual student
obtains 55, performance can be said to be below average/median. Also performance can be
described as just above average or far below average or just below average.
5. It can be compared with the mean to determine the direction of student performance.
The Mean
It is the sum of a set of observations divided by the total
number of observations.
The Mean or average is probably the most commonly used
method of describing central tendency.
To compute the mean, all the values are added up and divided
by the number of values. If the distribution is truly normal
(i.e., bell-shaped), the mean, median and mode are all equal to
each other
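The three measures of central tendency can be computed directly with Python's standard library. This sketch uses a hypothetical set of class scores:

```python
import statistics

# Hypothetical class test scores (illustrative data only)
scores = [55, 60, 60, 65, 70]

mean = statistics.fmean(scores)     # sum of the scores / number of scores
median = statistics.median(scores)  # middle value of the sorted scores
mode = statistics.mode(scores)      # most frequently occurring score

print(mean, median, mode)  # → 62.0 60 60
```

Here the mean (62.0) is slightly above the median (60), which, as discussed under uses of the mean, hints at a distribution skewed to the right.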
Uses of the mean
1. It is useful when the actual magnitude of the scores is needed to get
an average. E.g., total sales for a new product, selecting a student to
represent a whole class in a competition.
2. It is useful for further statistical work (e.g., standard deviation).
3. It is useful when the scores are symmetrically distributed (i.e.,
normal).
Uses of the mean (CON’T)
4. It provides a direction of performance, compared with other measures of location,
especially the median. Where Mean > Median, the distribution is skewed to the
right (positive skewness) showing that performance tends to be low and where
Mean < Median, the distribution is skewed to the left (negative skewness) showing
that performance tends to be high.
5. It serves as a standard of performance with which individual scores are
compared. For example, for normally distributed scores, where the mean is 56, an
individual score of 80 can be said to be far above average. Also, performance can
be described as just above average or far below average or just below average.
Dispersion
These are also called measures of variation, dispersion or scatter. While the measures of central
tendency are useful statistics for summarizing the scores in a distribution, they are not sufficient.
Averages are representative of a frequency distribution but they fail to give a complete picture of
the distribution. They do not tell us anything about the scatter of observations within the
distribution.
The main measures used are:
1. The Range
2. The Variance
3. The Standard Deviation
4. The Quartile Deviation (Semi-interquartile range)
Variance & Standard Deviation
The variance is always considered together with the standard deviation. It is the square of
the standard deviation. Both variance and standard deviation are computed for both
ungrouped and grouped data. Microsoft Excel is also useful in obtaining the variance and
standard deviations.
The Standard Deviation (SD) is a more accurate and detailed estimate of dispersion
because an outlier can greatly exaggerate the range. The Standard Deviation shows the
relation that a set of scores has to the mean of the sample. The standard deviation is the
square root of the sum of the squared deviations from the mean divided by the number of
scores minus one.
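The definition above (sum of squared deviations divided by n − 1, then the square root) is exactly what `statistics.stdev` and `statistics.variance` compute. A minimal sketch with hypothetical data:

```python
import statistics

# Hypothetical data set (illustrative only)
scores = [2, 4, 4, 4, 5, 5, 7, 9]

# Sample variance and SD divide the squared deviations by (n - 1)
var = statistics.variance(scores)
sd = statistics.stdev(scores)

# The range is crude by comparison: a single outlier can exaggerate it
range_ = max(scores) - min(scores)

# The variance is the square of the standard deviation
print(var, sd, range_)
```

Microsoft Excel's VAR.S and STDEV.S functions use the same n − 1 formulas.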
Uses of the Standard Deviation
1. It is used as the most appropriate measure of variation/dispersion when there is
reason to believe that the distribution is normal.
2. It helps to find out the variation in achievement among a group of students (i.e.,
it determines if a group is homogeneous or heterogeneous).
Measure of Relative Dispersion
Suppose that the two distributions to be compared are expressed in the same units
and their means are equal or nearly equal, then their variability can be compared
directly by using their standard deviations.
However, if their means are widely different or if they are expressed in different
units of measurement, we cannot use the standard deviations as such for
comparing their variability. We have to use relative measures of dispersion in
such situations. There are relative dispersions in relation to the range, the quartile
deviation, the mean deviation, and the standard deviation. Of these, the coefficient
of variation (CV), which expresses the standard deviation as a percentage of the
mean, is the most important.
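Because the coefficient of variation is unit-free, it lets us compare the variability of distributions measured in different units. A minimal sketch with hypothetical measurements:

```python
import statistics

def cv(xs):
    # Coefficient of variation: the SD expressed as a percentage of the mean
    return 100 * statistics.stdev(xs) / statistics.fmean(xs)

# Hypothetical measurements in different units (illustrative data only)
heights_cm = [160, 165, 170, 175, 180]
weights_kg = [55, 60, 70, 80, 95]

print(f"CV of heights: {cv(heights_cm):.1f}%")
print(f"CV of weights: {cv(weights_kg):.1f}%")
```

Even though centimetres and kilograms cannot be compared directly, the CVs can: the weights here are relatively more variable than the heights.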
Measurement Scales
Nominal variables
Ordinal variables
Interval variables
Ratio variables
Quartile Deviation (QD)
It is also called the semi-interquartile range and it depends on quartiles.
Quartiles divide distributions into four equal parts. Practically, there are three
quartiles.
The QD is half the distance between the first quartile (Q1) and the third quartile
(Q3): QD = (Q3 − Q1) / 2.
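The quartiles and the QD can be obtained with `statistics.quantiles`. Note that different textbooks (and the method parameter of `quantiles`) interpolate quartiles slightly differently; this sketch uses the function's default and a hypothetical data set:

```python
import statistics

# Hypothetical data set (illustrative only)
scores = [2, 4, 6, 8, 10, 12, 14, 16]

# quantiles(..., n=4) returns the three quartiles Q1, Q2 (median), Q3
q1, q2, q3 = statistics.quantiles(scores, n=4)

# Quartile deviation: half the distance between Q1 and Q3
qd = (q3 - q1) / 2
print(q1, q3, qd)
```
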
Features of the Quartile Deviation
For skewed distributions, where the median is used as a measure of location the
quartile deviation is a better measure of variability.
The quartile deviation is a measure of individual differences. It helps to find out
the variation in achievement among a group of students. (i.e., it determines if a
group is homogeneous or heterogeneous).
STATISTICAL DATA ANALYSIS:
HYPOTHESIS TESTING
Inferential Statistics
Inferential statistics are divided into two types: parametric and
non-parametric.
‘Parametric’ is derived from the word parameter. A parameter
describes some aspect of a set of scores for a population. For
example, the mean of a set of scores for a population would be a
parameter, whereas the mean of a set of scores for a sample would
be a statistic. Parametric statistics are, therefore, statistical tests
based on the premise that the population from which samples are
obtained follows a normal distribution and that the parameters of
interest to the researcher are the population mean (μ) and standard
deviation (σ).
Inferential Statistics
Parametric statistics have certain assumptions about the observations/scores.
These assumptions are:
a. The variables are measured on interval scales
b. Scores from any two individuals in a study are independent of each other
c. The variables are normally distributed within each population
d. The variables are similarly distributed (e.g., have equal variances) across
the populations, in the case of two or more groups
Parametric statistics
Examples of parametric statistics include the t-test and
Pearson Correlation Coefficient.
There are other, more advanced parametric procedures
such as analysis of variance (ANOVA) and analysis of covariance
(ANCOVA).
Nonparametric tests
Non-parametric statistics, on the other hand, are statistical tests that
only make the assumption of independent observations of scores for
each individual in the study.
Nonparametric tests are also called distribution-free tests because they
don't assume that your data follow a specific distribution. You should
use nonparametric tests when your data don't meet the assumptions of
the parametric test, especially the assumption about normally
distributed data.
Example Nonparametric tests
For example, Pearson's correlation coefficient is parametric and
Spearman's rho is non-parametric.
In nonparametric tests, the data are typically measured in categorical
scores on either the independent or dependent variable.
The most frequently used non-parametric test in research is chi-square
(χ²). It is used to test hypotheses about the frequencies of categories
within or between groups.
In the next section of this session, we explain the procedure for
testing a hypothesis using chi-square.
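To make the chi-square idea concrete, here is a minimal goodness-of-fit sketch computed by hand from the usual formula, Σ(O − E)²/E, using hypothetical survey counts:

```python
# Hypothetical survey: 90 "yes" vs 60 "no" responses.
# Null hypothesis: responses fit an expected 50/50 split.
observed = [90, 60]
total = sum(observed)
expected = [total / 2, total / 2]  # 75 and 75 under the null hypothesis

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Critical value for alpha = .05 with df = 1 (categories - 1) is 3.841
critical = 3.841
print(chi_square)             # → 6.0
print(chi_square > critical)  # → True: reject the null hypothesis
```

Since 6.0 exceeds the critical value of 3.841, the null hypothesis of an even split would be rejected at the .05 level for these illustrative data.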
Conducting Hypothesis Testing
Generally, there are six steps in hypothesis testing (Creswell, 2002):
a. Establish the null and alternative hypotheses
b. Set the level of significance or alpha level for rejecting the null hypothesis
c. Collect data
d. Compute the sample statistic (usually using a computer programme)
e. Make a decision about rejecting or failing to reject the null hypothesis
f. Determine the degree of difference if a statistically significant difference exists
Types of Statistical Tests and their Uses
TYPE -- USE
Paired t-test -- Tests for the difference between two related variables
Independent t-test -- Tests for the difference between two independent groups
ANOVA -- Tests for the difference between three or more independent groups
Steps of Hypothesis Testing
Step 1: State the Null Hypothesis.
Step 2: State the Alternative Hypothesis.
Step 3: Set α (the significance level).
Step 4: Collect Data.
Step 5: Calculate a test statistic.
Step 6: Construct Acceptance / Rejection regions.
Step 7: Based on steps 5 and 6, draw a conclusion about H0.
Comparing Two Groups
Two independent samples t-test. An independent samples t-test is used when you want to
compare the means of a normally distributed interval dependent variable for two independent
groups.
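The pooled-variance (equal-variances) t statistic can be computed by hand with the standard library. The sketch below uses hypothetical scores for two independent groups:

```python
import math
import statistics

# Hypothetical scores for two independent groups (illustrative data only)
group_a = [80, 85, 90, 95, 100]
group_b = [70, 75, 80, 85, 90]

na, nb = len(group_a), len(group_b)
ma, mb = statistics.fmean(group_a), statistics.fmean(group_b)
va, vb = statistics.variance(group_a), statistics.variance(group_b)

# Pooled variance (assumes the two groups have roughly equal variances)
pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
t = (ma - mb) / math.sqrt(pooled * (1 / na + 1 / nb))
df = na + nb - 2

# Two-tailed critical value for alpha = .05 with df = 8 is about 2.306,
# so the t of 2.0 here would not be statistically significant.
print(t, df)
```
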
What it Means if a t-test is Statistically Significant
In principle, a statistically significant result (usually a difference) is a result that's not
attributed to chance. More technically, it means that if the Null Hypothesis is true (which
means there really is no difference), there's a low probability of getting a result that large or
larger.
Comparing Two Groups
If the p-Value is Greater than .05
In the majority of analyses, an alpha of .05 is used as the cutoff
for significance. If the p-value is less than .05, we reject the
null hypothesis that there's no difference between the means
and conclude that a significant difference does exist. If the
p-value is greater than .05, we fail to reject the null hypothesis
and cannot conclude that a significant difference exists.
Definition of p-value
In statistical science, the p-value is the probability of obtaining a result at
least as extreme as the one actually observed in an experiment or
study, given that the null hypothesis is true.
This probability represents the likelihood of obtaining a sample mean that
is at least as extreme as our sample mean in both tails of the distribution.
That's our p-value! When a p-value is less than or equal to the significance
level, you reject the null hypothesis
COMMON MYTHS IN DATA
ANALYSIS
Complex analysis and big words impress people.
Most people appreciate practical and understandable analyses.
Analysis comes at the end after all the data are collected.
Think about analysis upfront so that you can collect all the data you
need to analyze.
Quantitative analysis is the most accurate type of data analysis.
Some think numbers are more accurate than words but it is the quality
of the analysis process that matters.
COMMON MYTHS IN DATA
ANALYSIS (CON’T)
Data have their own meaning.
Data must be interpreted. Numbers do not speak for themselves.
Stating limitations to the analysis weakens the evaluation.
All analyses have weaknesses; it is more honest and responsible to
acknowledge them.
Computer analysis is always easier and better.
It depends upon the size of the data set and personal competencies.
For small sets of information, hand tabulation may be more efficient.