Data analysis

Presented by

Miss Nazziwa Aisha


0705687875
aishanazziwa@yahoo.ca
Important definitions
◦ Variable: A variable contains data about anything we measure, for example the age
or gender of the participants or their score on a test. Variables can contain two
different types of data: categorical data and continuous data.
◦ Data are collections of observations (such as measurements, genders, survey
responses).
◦ Population is the complete collection of all individuals (scores, people,
measurements) to be studied. The collection is complete in the sense that it
includes all of the individuals to be studied.
◦ A census is the collection of data from every member of the population.
◦ A sample is a subcollection of members selected from a population.
Quantitative vs qualitative data
Qualitative data is non-statistical and is typically unstructured or semi-structured.
This data isn't necessarily measured using hard numbers of the kind used to develop
graphs and charts. Instead, it is categorized based on properties, attributes, labels, and
other identifiers. Qualitative data can be used to ask the question "why":
data generated from qualitative research is used for theorizing, interpretation,
developing hypotheses, and initial understanding.
Quantitative data is statistical and is typically structured in nature. This data type is
measured using numbers and values, making it a more suitable candidate for data
analysis.
Quantitative data is closed-ended and can be used to ask the questions "how much" or
"how many."
TYPES OF DATA
a) Categorical data
Categorical data is a measure of type rather than amount and can be broken down
into nominal data and ordinal data.
Nominal data is data that does not have any natural order at all. It is assigned to
categories or labelled, e.g. male/female, blood group, race. The categories or
labels cannot be ordered or ranked and are not related to each other.
Ordinal data is data that can naturally be ranked or ordered but does not
have a continuous measurable value: numbers have been used to put
objects in an order. It is categorised in a fixed, specific order, and we can only say
one thing is more or less than something else on the scale. Although the data can
be ranked in order, this does not imply that there are equal intervals between
categories.
For example, satisfaction ratings (very dissatisfied to very satisfied) on a Likert scale.
b) Continuous data
Continuous data is data that can be infinitely broken down into smaller parts, or
data that continuously fluctuates. Continuous data is further broken down into:
Interval: the difference between each point on the scale is the same, but
0 is a marker on the scale rather than a true zero, e.g. a temperature of 0 degrees Celsius
does not imply there is no temperature at all (absolute zero).
Ratio: the data has a true 0, e.g. for number of children, 0 = no children. Other
examples of ratio data include time, height, blood pressure.
HOW TO ANALYSE DATA?
Categories of data analysis
The type of data analysis to perform on a given dataset largely depends on the
level of measurement (the type of data being collected).
There are two main types of analysis:
1. Descriptive Statistics
In essence, descriptive statistics are used to report or describe the features or
characteristics of data. They summarize a particular numerical data set, or multiple
sets, and deliver quantitative insights about that data through numerical or graphical
representation.
2. Inferential Statistics
Inferential statistics takes data from a sample and makes inferences/conclusions
about the larger population from which the sample was drawn.
DESCRIPTIVE STATISTICS
Descriptive data analysis
Descriptive statistics allow the researcher to:
• Get an overview of the data
• Check for any missing data
• Look at the range of answers or data they have collected
• Identify unusual cases and extreme values
• Identify obvious/invisible errors in the data

Hint on handling an incomplete dataset:
If data is missing:
 First check whether the data was missed when entering, and re-enter it.
 You can exclude the case from any further analysis, either just for the
variables where the data is missing, or by deleting the case completely.
 You can put in the mean value for that case.
If there is an outlier, it might help to delete it, since outliers affect the dataset.
1. Measures of central tendency
Measures of central tendency such as the mean, median and mode describe the pattern of your data
and help you understand what your data means.
Mean
The mean is the average value of the data in a variable, i.e. the sum of all the cases divided
by the total number of cases. This is the best measure to use if your scores have a normal
distribution.
Median
This is the middle score (50th percentile). If the data were split in half in numeric order, this
would be the midpoint score. This is a useful measure when the data is skewed, i.e. there
is a high number of low scores.
Mode
The group or number that occurs most frequently.
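As a quick illustration (not from the slides), the three measures can be computed with Python's standard library; the scores below are hypothetical:

```python
# Hypothetical scores; note the extreme value 55 pulling the mean upward.
from statistics import mean, median, mode

scores = [4, 7, 7, 8, 9, 10, 55]

print(mean(scores))    # sum of all cases / number of cases ≈ 14.29
print(median(scores))  # middle score (50th percentile) = 8
print(mode(scores))    # most frequent value = 7
```

Because of the extreme score, the mean (≈14.3) sits far from most of the data while the median (8) does not, which is why the median is the more useful measure for skewed data.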
Analysis for the different data types

Nominal data: frequency tables, bar charts, pie charts, percentages, mode
Ordinal data: frequency tables, percentages, bar charts, mode, median
Interval or ratio data: histograms, mean, median, mode
2. Measures of dispersion
Measures of dispersion are statistics designed to describe the spread
of scores around a measure of central tendency.
Minimum and maximum: the lowest (smallest) and the highest (largest)
values.
Range: the difference between the maximum and the minimum values.
Variance: an overall measure of how clustered the data values are around the mean.
It is the mean squared difference between each score and the mean of
those scores.
Standard deviation: the square root of the variance.
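A minimal sketch of these measures with Python's standard library (the data is hypothetical; `pvariance`/`pstdev` treat the data as the whole population, while `variance`/`stdev` would treat it as a sample):

```python
from statistics import pvariance, pstdev

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical scores

rng = max(data) - min(data)  # range = maximum - minimum = 7
var = pvariance(data)        # mean squared difference from the mean = 4
sd = pstdev(data)            # square root of the variance = 2

print(rng, var, sd)
```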
Other Measures: Skewness and Kurtosis
Skewness measures the degree to which cases are clustered towards one end of the
distribution, i.e. whether the distribution is symmetrical.
Kurtosis measures the peakedness of a distribution: whether the data are heavy-tailed
or light-tailed relative to a normal distribution. Data sets with high kurtosis tend to
have heavy tails, or outliers; data sets with low kurtosis tend to have light tails, or a
lack of outliers.
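Both measures can be computed directly from their moment definitions; the sketch below (hypothetical data) uses the population-moment formulas, so statistical packages that apply small-sample corrections may report slightly different values:

```python
from statistics import mean

def moments_skew_kurtosis(data):
    # Moment-based (population) skewness and excess kurtosis.
    m = mean(data)
    n = len(data)
    m2 = sum((x - m) ** 2 for x in data) / n
    m3 = sum((x - m) ** 3 for x in data) / n
    m4 = sum((x - m) ** 4 for x in data) / n
    skew = m3 / m2 ** 1.5           # 0 for a symmetric distribution
    excess_kurt = m4 / m2 ** 2 - 3  # 0 for a normal distribution
    return skew, excess_kurt

print(moments_skew_kurtosis([1, 2, 3, 4, 5]))   # symmetric -> skewness 0
print(moments_skew_kurtosis([1, 1, 1, 2, 10]))  # right-skewed -> skewness > 0
```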
Distributions
Checking for normality
One of the criteria for using some statistical tests (parametric tests, such as t-tests)
is that the data in your variables is normally distributed.
To check whether your data is normally distributed you can:
Look at the skewness and kurtosis values
(https://www.investopedia.com/terms/s/skewness.)
Calculate the z-scores for the data: the z-score is a standardised score which
indicates how many standard deviations above or below the sample mean a given
score is.
Look at the histograms for each variable.
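A minimal sketch of the z-score calculation with Python's standard library (the scores are hypothetical); as a common rule of thumb, scores with |z| greater than about 3 are candidates for extreme values:

```python
from statistics import mean, stdev

def z_scores(data):
    # z = (score - sample mean) / sample standard deviation:
    # how many standard deviations each score lies above or below the mean.
    m, s = mean(data), stdev(data)
    return [(x - m) / s for x in data]

scores = [50, 60, 70, 80, 90]  # hypothetical test scores
print(z_scores(scores))        # the mean score (70) has z = 0
```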
Cont…
If your data is only slightly skewed away from a normal distribution you can:
use transformation procedures, such as taking the square root of each value, or
use statistical tests (non-parametric tests) that do not require a normal distribution.
INFERENTIAL STATISTICS
2. Inferential Statistics
Inferential statistics look at whether there are any relationships or differences
between variables that are genuine and not due to chance.

Role of inferential statistics:
 Making estimates about populations
 Testing hypotheses to draw conclusions about the population
 While descriptive statistics can only summarize a sample's characteristics,
inferential statistics use your sample to make reasonable guesses about the larger
population.
 With inferential statistics, it's important to use random and unbiased sampling
methods. If your sample isn't representative of your population, then you can't
make valid statistical inferences.
Categories of Inferential
statistical tests
Statistical tests come in several forms, namely:
 tests of comparison
 correlation
 regression
Tests of comparison
Comparison tests assess whether there are differences in the means, medians or
rankings of scores of two or more groups.
Analysing combinations of categorical and continuous data:
There are two types of statistical tests that can be used to analyse data:
 Parametric tests are performed on data that is normally distributed.
 Non-parametric tests are performed on data that do not conform to a normal
distribution.
Cont…
Parametric tests are considered more statistically powerful because they are more likely to
detect an effect if one exists.
To decide which test suits your aim, consider whether your data meets the conditions
necessary for parametric tests, the number of samples, and the type of data.
Parametric tests make assumptions that include the following:
 the population that the sample comes from follows a normal distribution of scores
 the sample size is large enough to represent the population
 the variances (a measure of spread) of each group being compared are similar
When your data violates any of these assumptions, non-parametric tests are more suitable.
Non-parametric tests are called "distribution-free tests" because they don't assume
anything about the distribution of the population data.
Summary
Correlation tests
Correlation tests determine the extent to which two variables are associated.
Correlation means there is a statistical association between variables, while
causation means that a change in one variable causes a change in another variable.
A correlation doesn't imply causation, but causation always implies correlation.
Although Pearson's r is the most statistically powerful test, Spearman's r is
appropriate for interval and ratio variables when the data doesn't follow a normal
distribution.
The chi-square test of independence is the only test that can be used with nominal
variables.
Cont…
◦ A correlation coefficient is a number between -1 and 1 that tells you the strength
and direction of a relationship between variables.
◦ In other words, it reflects how similar the measurements of two or more variables
are across a dataset.
◦ The most commonly used correlation coefficient is Pearson's r (Pearson product-
moment correlation) because it allows for strong inferences. It's parametric,
sensitive to outliers, and measures linear relationships. If your data do not meet all
the assumptions for this test, you'll need to use a non-parametric test instead.
◦ Non-parametric rank correlation coefficients summarize non-linear
relationships between variables. Spearman's rho (Spearman's rank correlation
coefficient) and Kendall's tau have the same conditions for use, but Kendall's tau is
generally preferred for smaller samples, whereas Spearman's rho is more widely
used.
Coefficient of determination
When you square the correlation coefficient, you end up with the coefficient of
determination (r²). This is the proportion of common variance between the variables.
The coefficient of determination is always between 0 and 1, and it's often expressed
as a percentage.
The coefficient of determination measures how much of the variance of one
variable is explained by the variance of the other variable.
A high r² means that a large amount of the variability in one variable is determined by its
relationship to the other variable. A low r² means that only a small portion of the
variability of one variable is explained by its relationship to the other variable.
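As an illustrative sketch (hypothetical data), Pearson's r and r² can be computed directly from the definition; Python 3.10+ also provides `statistics.correlation` for the same calculation:

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    # Pearson product-moment correlation:
    # r = sum of cross-deviations / product of deviation magnitudes.
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

hours = [1, 2, 3, 4, 5]       # hypothetical study hours
marks = [52, 55, 61, 64, 68]  # hypothetical exam marks

r = pearson_r(hours, marks)
r_squared = r ** 2            # coefficient of determination
print(r, r_squared)           # r close to 1: strong positive linear relationship
```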
Summary of correlation
3. Regression tests
Regression tests demonstrate whether changes in predictor variables cause
changes in an outcome variable.
You can decide which regression test to use based on the number and types of
variables you have as predictors and outcomes.
Regression models describe the relationship between variables by fitting a line to
the observed data. Linear regression models use a straight line, while logistic and
nonlinear regression models use a curved line. Regression allows you to estimate
how a dependent variable changes as the independent variable(s) change.
Assumptions of simple linear regression
 Normally distributed data.
 The relationship between the independent and dependent variable is linear: the
line of best fit through the data points is a straight line (rather than a curve or some
sort of grouping factor).
If your data do not meet the assumption of normality, you may be able to use
a non-parametric test such as the Spearman rank test.
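A minimal sketch of fitting a simple linear regression line by ordinary least squares (the data is hypothetical; Python 3.10+ offers `statistics.linear_regression` as a ready-made equivalent):

```python
from statistics import mean

def fit_line(x, y):
    # Ordinary least-squares line of best fit: y = intercept + slope * x.
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    return intercept, slope

x = [1, 2, 3, 4, 5]             # hypothetical predictor
y = [2.1, 3.9, 6.2, 7.8, 10.1]  # hypothetical outcome

intercept, slope = fit_line(x, y)
print(intercept, slope)  # slope ≈ 2: y rises about 2 units per unit of x
```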
Multiple linear regression
◦ Multiple linear regression is used to estimate the relationship between two or more
independent variables and one dependent variable. You can use multiple linear
regression when you want to know:
◦ How strong the relationship is between two or more independent variables and one
dependent variable (e.g. how rainfall, temperature, and amount of fertilizer added affect
crop growth).
For example:
A public health researcher is interested in social factors that influence heart disease.
Suppose you survey 500 towns and gather data on the percentage of people in each town
who smoke, the percentage who bike to work, and the percentage who have heart disease.
Because you have two independent variables and one dependent variable, and all your
variables are quantitative, you can use multiple linear regression to analyze the relationship
between them.
Summary of regression tests
If you have one interval predictor variable and one interval outcome variable =
Simple regression
If you have more than one interval/ordinal/nominal predictor variable and one
interval-level outcome variable = Multiple regression
If you have more than one interval, ordinal or nominal predictor variable and one
nominal or ordinal-level outcome variable = Logistic regression
Estimating population parameters from sample statistics
There are two important types of estimates you can make about the population: point
estimates and interval estimates.
A point estimate is a single-value estimate of a parameter. For instance, a sample
mean is a point estimate of a population mean.
An interval estimate gives you a range of values where the parameter is
expected to lie. A confidence interval is the most common type of interval
estimate.
Cont…
A confidence interval uses the variability around a statistic to come up with an
interval estimate for a parameter. Confidence intervals are useful for estimating
parameters because they take sampling error into account.
While a point estimate gives you a precise value for the parameter you are interested
in, a confidence interval tells you the uncertainty of the point estimate. They are best
used in combination with each other.
Each confidence interval is associated with a confidence level. A confidence level
tells you the percentage of such intervals (e.g. 95%) that would contain the
parameter if you repeated the study many times.
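As a sketch, a 95% confidence interval for a mean can be formed as point estimate ± 1.96 × standard error (the data below is hypothetical; 1.96 is the normal approximation, and for small samples a t critical value is more appropriate):

```python
from math import sqrt
from statistics import mean, stdev

def confidence_interval_95(sample):
    # Interval estimate = point estimate ± margin of error.
    m = mean(sample)                        # point estimate of the mean
    se = stdev(sample) / sqrt(len(sample))  # standard error of the mean
    margin = 1.96 * se                      # normal-approximation margin
    return m - margin, m + margin

sample = [12, 15, 14, 10, 13, 14, 16, 12, 11, 13]  # hypothetical data
print(confidence_interval_95(sample))  # interval centred on the sample mean, 13
```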
Hypothesis testing
Hypothesis testing is a formal process of statistical analysis using inferential
statistics. The goal of hypothesis testing is to compare populations or assess
relationships between variables using sample data.
It involves setting up a null hypothesis and an alternative hypothesis, followed by
conducting a statistical test of significance. A conclusion is drawn based on the value
of the test statistic (Z-test, t-test or F-test), the critical value and the confidence
intervals.
A hypothesis test can be left-tailed, right-tailed or two-tailed. Given below are
certain important hypothesis tests that are used in inferential statistics.
t-test
A t-test is used to compare the means of two groups. It is often used in hypothesis testing to
determine whether a process or treatment/intervention actually has an effect on the population of
interest, or whether two groups are different from one another.
Note:
When choosing a t-test, you will need to consider two things: whether the groups being compared
come from a single population or from two different populations, and whether you want to test the
difference in a specific direction.
One-sample, two-sample, or paired t-test?
◦ If the groups come from a single population (e.g. measuring before and after an experimental
treatment), perform a paired t-test.
◦ If the groups come from two different populations (e.g. two different species, or people from two
separate cities), perform a two-sample t-test / independent t-test.
◦ If there is one group being compared against a standard value (e.g. comparing the acidity of a
liquid to a neutral pH of 7), perform a one-sample t-test.
One-tailed or two-tailed t-test?
◦ If you only care whether the two populations are different from one another,
perform a two-tailed t-test.
◦ If you want to know whether one population mean is greater than or less than the
other, perform a one-tailed t-test.
In a test of whether petal length differs by species:
◦ Your observations come from two separate populations (separate species), so you
perform a two-sample t-test.
◦ You don't care about the direction of the difference, only whether there is a
difference, so you choose a two-tailed t-test.
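A minimal sketch of the two-sample t statistic (Welch's version, which does not assume equal variances; the petal-length values are hypothetical). In practice the p-value is then read off the t distribution, usually by software:

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group_a, group_b):
    # t = difference in means / standard error of that difference.
    na, nb = len(group_a), len(group_b)
    se = sqrt(variance(group_a) / na + variance(group_b) / nb)
    return (mean(group_a) - mean(group_b)) / se

species_1 = [4.7, 4.6, 5.0, 4.4, 4.9]  # hypothetical petal lengths
species_2 = [5.5, 5.8, 5.6, 5.9, 5.4]
print(welch_t(species_1, species_2))  # large |t| suggests a real difference
```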
ANOVA
ANOVA, which stands for Analysis of Variance, is a statistical test used to analyze the
difference between the means of more than two groups.
A one-way ANOVA uses one independent variable, while a two-way ANOVA uses two
independent variables.
One-way ANOVA example:
As a crop researcher, you want to test the effect of three different fertilizer mixtures
on crop yield. You can use a one-way ANOVA to find out if there is a difference in crop
yields between the three groups.
A two-way ANOVA is used to estimate how the mean of a quantitative variable
changes according to the levels of two categorical variables. Use a two-way ANOVA
when you want to know how two independent variables, in combination, affect a
dependent variable.
Cont…
Example of two-way ANOVA:
◦ Comparing the blood pressure reduction effects of three drugs across two patient
groups (e.g. men and women), with drug and patient group as the two independent
variables.
◦ ANOVA uses the F-test for statistical significance. This allows for comparison of
multiple means at once, because the error is calculated for the whole set of
comparisons rather than for each individual two-way comparison (which would
happen with multiple t-tests).
◦ The F-test compares the variance in each group mean with the overall group
variance. If the variance within groups is smaller than the variance between groups,
the F-test will find a higher F-value, and therefore a higher likelihood that the
difference observed is real and not due to chance.
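As a sketch, the one-way F-value is the between-groups mean square divided by the within-groups mean square (the fertilizer yields below are hypothetical):

```python
from statistics import mean

def one_way_f(*groups):
    # F = between-groups variance estimate / within-groups variance estimate.
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum(sum((x - mean(g)) ** 2 for x in g) for g in groups)
    ms_between = ss_between / (k - 1)  # between-groups mean square
    ms_within = ss_within / (n - k)    # within-groups mean square
    return ms_between / ms_within

# Hypothetical crop yields under three fertilizer mixtures
f = one_way_f([20, 21, 23], [25, 27, 26], [30, 31, 29])
print(f)  # a high F-value suggests the group means really differ
```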
Pearson's chi-square test
◦ A Pearson's chi-square test is a statistical test for categorical data. It is used to
determine whether your data are significantly different from what you expected.
There are two types of Pearson's chi-square tests:
◦ The chi-square goodness-of-fit test is used to test whether the frequency
distribution of a categorical variable is different from your expectations (one
categorical variable).
◦ The chi-square test of independence is used to test whether two categorical
variables are related to each other.
When to use a chi-square test
When testing a hypothesis about one or more categorical variables. If one or
more of your variables is quantitative, you should use a different statistical test.
The sample was randomly selected.
There is a minimum of five observations expected in each group or
combination of groups.
When the data is not normally distributed.
Note: Parametric tests can't test hypotheses about the distribution of a categorical
variable, but they can involve a categorical variable as an independent variable,
as in ANOVA.
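A minimal sketch of the chi-square statistic, which sums (observed − expected)² / expected across categories (the counts are hypothetical); the result is then compared with a critical value for the appropriate degrees of freedom:

```python
def chi_square_statistic(observed, expected):
    # chi-square = sum over categories of (O - E)^2 / E.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical goodness-of-fit check: 120 patients across 4 blood groups,
# with equal counts expected in each group (all expected counts >= 5).
observed = [35, 25, 40, 20]
expected = [30, 30, 30, 30]
print(chi_square_statistic(observed, expected))  # ≈ 8.33, df = 4 - 1 = 3
```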
Significance
The majority of statistical tests will provide a p-value; this can be used to indicate the
statistical significance of the results and the degree to which we can be confident that the
results are not due to chance.
There are different levels of significance that indicate different levels of confidence:
P<0.05 indicates that we can be 95% confident that our findings are not due to chance;
P<0.01 indicates that we can be 99% confident that our findings are not due to chance.
For example:
If a p-value = 0.04, this value is < 0.05 but not < 0.01, and therefore we would
state that this finding is significant at the P<0.05 level.
Alternatively, if the p-value = 0.17, this is not less than the suggested significance levels and
therefore the findings would not be significant.
◦ If a result is statistically significant, that means it's unlikely to be explained
solely by chance or random factors. In other words, a statistically significant result
has a very low chance of occurring if there were no true effect in the research study.
Errors
A Type I error means rejecting the null hypothesis when it's actually true. It means
concluding that results are statistically significant when, in reality, they came
about purely by chance or because of unrelated factors.
A Type II error means not rejecting the null hypothesis when it's actually false. This is
not quite the same as "accepting" the null hypothesis, because hypothesis testing
can only tell you whether to reject the null hypothesis.

Accepting / rejecting the null hypothesis – possible outcomes:

                          True population situation
Decision of researcher    H0 is true        H1 is true
Accept H0                 OK                Type II error
Reject H0                 Type I error      OK
Software
◦ MS Excel
◦ Stata
◦ SPSS
◦ R
References
◦ Pritha Bhandari (2020). Inferential Statistics | An Easy Introduction & Examples.
https://www.scribbr.com/statistics/
◦ Inferential Statistics. https://www.cuemath.com/data/inferential-statistics/
◦ Wassertheil-Smoller, Sylvia (2003). Biostatistics and Epidemiology: A Primer for
Health and Biomedical Professionals, 3rd ed.
◦ Triola, Mario F. (2010). Elementary Statistics: Technology Update, 11th ed.
