STATISTICS
THEORY: 20 marks
Definition of stats, data, types of data
Sample, types of samples
Sampling, Sampling Techniques
Tests and their applications - chi-square test, t-test, ANOVA, z-test
Application of regression and correlation
Difference between regression and correlation
How do regression and correlation help managers in decision-making?
NUMERICAL: 20 marks
Chi-square - 6 marks
Z-test- 6 marks
t-test- 6 marks
ANOVA outcome – 2 marks
STATISTICS: The collection, analysis, interpretation, and presentation of
data.
o Descriptive Statistics: The summarization and description of data.
Sample data is summarized using charts, tables, and graphs.
Quantitative analysis is difficult when the population is large, so a
small sample of the data is interpreted instead.
o Inferential Statistics: A branch of statistics that makes use of
various analytical tools to draw inferences about the population
from sample data.
Data: Data are individual pieces of factual information recorded
and used for analysis. It is the raw information from which
statistics are created.
o Types of data: There are 2 main categories of data:
Qualitative Data: This is also known as categorical data. The
characteristics or behavioral attributes of a person or object
are included under categorical data, e.g., marital status, eye
color.
Nominal Data: Nominal data is one of the types of
qualitative information which helps to label the
variables without providing the numerical value. It is
also known as nominal scale data.
Ordinal Data: Ordinal data is a type of data that
follows a natural order. The significant feature of
ordinal data is that while the values can be ranked,
the differences between them cannot be measured. It is
also known as the ordinal scale level.
Quantitative Data: It is also known as numerical data which
represents the numerical value (i.e., how much, how often,
how many). Numerical data gives information about the
quantities of a specific thing.
Discrete Data: Discrete data can take only discrete
values; it contains only a finite (countable)
number of possible values, and those values cannot
be subdivided meaningfully, e.g., 1, 2, 3, 4, …
Continuous Data: Continuous data is data that can be
measured on a continuous scale. It has an infinite
number of possible values within a given
range, e.g., 10-20, 30-40, …
Population: It consists of all the items or individuals about which you
want to draw a conclusion.
Sample: It has the same characteristics as the population it is
representing. The sample is part of the population.
o Random Sampling: Each member of the population has an equal
chance of being selected. True random sampling is done with
replacement. That is, once a member is picked, that member goes
back into the population and thus may be chosen more than once.
Simple Random Sampling: Each sample of the same size
has an equal chance of being selected.
Convenience Sampling: Using results that are readily
available. (Strictly speaking this is a non-random
method, so it can introduce bias.)
Stratified Sampling: Divide the population into groups
(Strata) and then take a proportionate number from each
stratum.
Cluster Sampling: Divide the population into clusters
(groups) and then randomly select them.
Systematic Sampling: Randomly select a starting point
and then take every nth piece of data from the population.
o Level of Measurement: The way a set of data is measured. Data
can be classified into 4 levels of measurement:
Nominal Scale Level (Mentioned Above)
Ordinal Scale Level (Mentioned Above)
Interval Scale Level: Similar to ordinal level data because it
has a definite ordering, but the differences between the
data values can be measured. However, interval scale data
does not have a true zero point.
Ratio Scale Level: Ratio scale data is like interval scale data,
but it has a true zero point, so ratios can be calculated.
Tests and their applications - chi-square test, t-test, ANOVA, z-test
A chi-square (χ²) statistic is a test that measures how a model compares to
actual observed data. The data used in calculating a chi-square statistic must be
random, raw, mutually exclusive, drawn from independent variables, and
drawn from a large enough sample. For example, the results of tossing a fair
coin meet these criteria.
Chi-Square Test (Comparison)
The Chi-Square test is used when we perform hypothesis
testing on two categorical variables from a single
population, i.e., to compare categorical variables
from a single population. It tells us whether there is
a significant association between the two categorical
variables.
The hypotheses being tested for chi-square are
Null: Variable A and Variable B are independent.
Alternate: Variable A and Variable B are not independent.
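As a rough sketch (with made-up survey counts), the chi-square statistic for a 2×2 contingency table can be computed by hand from observed and expected counts:

```python
# Chi-square test of independence on a 2x2 table (hypothetical survey data).
# Rows: two groups of respondents; columns: preference for product A vs B.
observed = [[20, 30],
            [30, 20]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: (row total * column total) / grand total
chi_sq = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi_sq += (obs - expected) ** 2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)
print(f"chi-square = {chi_sq:.2f}, df = {df}")
```

Here each expected count is 25, so χ² = 4.0 with df = 1; since 4.0 exceeds the 5% critical value of 3.841, the null hypothesis of independence would be rejected for this made-up data.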
T-Test
The t-test is an inferential statistic used to determine
whether there is a significant difference between the means
of two groups of samples that may be related by certain
features. It is performed on continuous variables.
There are three different versions of t-tests:
→ One-sample t-test, which tells whether the means of the
sample and the population are different. Used when
n < 30 and the population SD is not given.
→ Two-sample t-test, also known as the independent t-test:
it compares the means of two independent groups
and determines whether there is statistical
evidence that the associated population means are
significantly different.
→ Paired t-test, used when you want to compare means of
different samples from the same group, or which
compares means from the same group at different
times.
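A minimal sketch of the two-sample (independent) t-test with equal-variance pooling, using hypothetical exam scores:

```python
import math
import statistics

# Independent two-sample t-test (pooled variance), hypothetical exam scores.
group1 = [72, 75, 78, 80, 69, 74]
group2 = [65, 70, 68, 72, 66, 71]

n1, n2 = len(group1), len(group2)
mean1, mean2 = statistics.mean(group1), statistics.mean(group2)
var1, var2 = statistics.variance(group1), statistics.variance(group2)  # sample variances

# Pooled standard deviation assumes equal population variances
sp = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
t = (mean1 - mean2) / (sp * math.sqrt(1 / n1 + 1 / n2))
df = n1 + n2 - 2
print(f"t = {t:.3f}, df = {df}")
```

The resulting t is compared against the t-distribution with n1 + n2 − 2 degrees of freedom; here df = 10, where the 5% two-tailed critical value is about 2.228.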
ANOVA Test
ANOVA stands for analysis of variance and is used to
compare multiple (three or more) samples with a
single test. It is used when the categorical feature has
more than two categories.
The hypotheses being tested in ANOVA are
Null: All sample means are equal.
Alternate: At least one sample mean is significantly
different.
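A minimal one-way ANOVA sketch on three hypothetical groups, computing the F statistic as the ratio of between-group to within-group mean squares:

```python
import statistics

# One-way ANOVA: compare the means of three (hypothetical) groups with one test.
groups = [
    [4, 5, 6, 5],
    [7, 8, 6, 7],
    [10, 9, 11, 10],
]

k = len(groups)                      # number of groups
n = sum(len(g) for g in groups)      # total number of observations
grand_mean = statistics.mean(x for g in groups for x in g)

# Between-group sum of squares: variation of group means around the grand mean
ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: variation of observations around their group mean
ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)

f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
print(f"F = {f_stat:.2f}, df = ({k - 1}, {n - k})")
```

A large F means the variation between group means dwarfs the variation within groups, so at least one mean likely differs.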
Z – TEST
A z-test is a statistical test to determine whether two population
means are different when the variances are known and the sample
size is large.
A z-test is a hypothesis test in which the z-statistic follows a normal
distribution.
A z-statistic, or z-score, is a number representing the result from the
z-test.
Z-tests are closely related to t-tests, but t-tests are best performed
when an experiment has a small sample size.
Z-tests assume the standard deviation is known, while t-tests
assume it is unknown.
Understanding Z-Tests
The z-test is also a hypothesis test in which the z-statistic follows a normal
distribution. The z-test is best used for samples larger than 30 because,
under the central limit theorem, as the number of samples gets larger, the
sample means are considered to be approximately normally distributed.
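A one-sample z-test with a known population standard deviation can be sketched as follows (all numbers hypothetical):

```python
import math

# One-sample z-test: does a sample mean differ from the population mean?
mu, sigma = 100, 15          # population parameters (assumed known)
sample_mean, n = 105, 36     # observed sample statistics (hypothetical)

# z = (sample mean - population mean) / standard error of the mean
z = (sample_mean - mu) / (sigma / math.sqrt(n))
print(f"z = {z:.2f}")        # compare |z| to 1.96 at the 5% level (two-tailed)
```

Here z = 2.0 exceeds 1.96, so at the 5% significance level the null hypothesis that the sample comes from a population with mean 100 would be rejected.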
What's the Difference Between a T-Test and Z-Test?
Z-tests are closely related to t-tests, but t-tests are best performed when the data consists of a
small sample size, i.e., less than 30. Also, t-tests assume the standard deviation is unknown,
while z-tests assume it is known.
When Should You Use a Z-Test?
Use a z-test when the population standard deviation is known and the sample size is greater
than or equal to 30. If the population standard deviation is unknown but the sample size is at
least 30, the sample variance can be assumed to equal the population variance and a z-test
may still be applied. If the sample is small and the population standard deviation is unknown,
a t-test should be used instead.
What Is a Z-Score?
A z-score, or z-statistic, is a number representing how many standard deviations above or
below the population mean the score derived from a z-test is. Essentially, it is a numerical
measurement that describes a value's relationship to the mean of a group of values. If a z-
score is 0, it indicates that the data point's score is identical to the mean score. A z-score of
1.0 would indicate a value that is one standard deviation from the mean. Z-scores may be
positive or negative, with a positive value indicating the score is above the mean and a
negative score indicating it is below the mean.
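The z-score computation itself is just the distance from the mean expressed in standard-deviation units, e.g. (made-up numbers):

```python
# Z-score of a single value: how many standard deviations from the mean.
mean, sd = 70, 10   # hypothetical population mean and standard deviation
score = 85

z_score = (score - mean) / sd
print(z_score)      # positive: above the mean; negative: below the mean
```

Here the score of 85 lies 1.5 standard deviations above the mean of 70.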
What Is Central Limit Theorem (CLT)?
In the study of probability theory, the central limit theorem (CLT) states that the distribution
of sample means approximates a normal distribution (also known as a “bell curve”) as the sample
size becomes larger, assuming that all samples are identical in size, and regardless of the
population distribution shape. Sample sizes equal to or greater than 30 are considered
sufficient for the CLT to predict the characteristics of a population accurately. The z-test's
fidelity relies on the CLT holding.
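A quick simulation illustrates the CLT: sample means drawn from a decidedly non-normal (uniform) population still cluster tightly around the population mean (seeded for reproducibility):

```python
import random
import statistics

random.seed(42)  # fixed seed so the simulation is reproducible

# Population: Uniform(0, 1), whose mean is 0.5. Draw 1000 samples of size 30
# and record each sample's mean.
sample_means = [
    statistics.mean(random.random() for _ in range(30))
    for _ in range(1000)
]

# The sample means concentrate around the population mean of 0.5,
# and their histogram would look approximately bell-shaped.
print(round(statistics.mean(sample_means), 3))
```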
The Bottom Line
A z-test is used in hypothesis testing to evaluate whether a finding or association is
statistically significant or not. In particular, it tests whether two means are the same (the null
hypothesis). A z-test can only be used if the population standard deviation is known and the
sample size is 30 data points or larger. Otherwise, a t-test should be employed.
What is the application of correlation?
Correlation is a statistical method used to assess a possible linear association
between two continuous variables. It is simple both to calculate and to interpret.
However, misuse of correlation is so common among researchers that some
statisticians have wished that the method had never been devised at all.
What are the applications of regression?
Regression analysis is used to estimate the relationship between a dependent
variable and one or more independent variables. This technique is widely applied
to predict outputs, forecast data, analyze time series, and find causal
dependencies between variables.
Difference between Correlation and Regression:
o ‘Correlation’, as the name says, determines the interconnection or
co-relationship between the variables; ‘Regression’ explains how an
independent variable is numerically associated with the dependent variable.
o In correlation, the independent and dependent variables are not
distinguished; in regression, the dependent and independent variables are
distinct.
o The primary objective of correlation is to find a quantitative/numerical
value expressing the association between the values; the primary intent of
regression is to estimate the values of a random variable based on the
values of the fixed variable.
o Correlation stipulates the degree to which both variables can move
together; regression specifies the effect of a unit change in the known
variable (p) on the estimated variable (q).
o Correlation helps to establish the connection between the two variables;
regression helps in estimating a variable’s value based on another given
value.
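Both quantities come from the same sums of squares; a sketch with made-up advertising-spend vs. sales figures (the variable names and data are purely illustrative):

```python
import math
import statistics

# Pearson correlation and simple linear regression on hypothetical data:
# x = advertising spend, y = sales (units are arbitrary).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

mean_x, mean_y = statistics.mean(x), statistics.mean(y)

# Sums of squares and cross-products about the means
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)

r = sxy / math.sqrt(sxx * syy)          # correlation: strength of association
slope = sxy / sxx                       # regression: change in y per unit of x
intercept = mean_y - slope * mean_x     # fitted line: y = intercept + slope * x

print(f"r = {r:.3f}, fitted line: y = {intercept:.2f} + {slope:.2f}x")
```

Correlation (r) only quantifies how strongly x and y move together; the regression line additionally lets a manager predict y for a new value of x.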
How do regression and correlation help managers in decision-making?
Correlation is used to determine the relationship between data sets in business
and is widely used in financial analysis and to support decision-making.
Regression analysis not only refers to the relationship between data sets but also
that if one data set changes, it will cause a corresponding change in the other
data set. Regression analysis is often used in sales forecasting, product, and
service development, predicting future market trends, and other use cases.
Correlation and regression analysis aid business leaders in making more
impactful predictions based on patterns in data. These techniques can help guide
business processes, direction, and performance accordingly, resulting in
improved management, better customer experience strategies, and optimized
operations.
Together, correlation and regression analysis pave the way for modern
approaches to business success by increasing profitability, reducing the
complexity and uncertainty of decision-making, and increasing business
flexibility in ever-changing and evolving business environments.