Statistical Tools Complete Notes
Chapter 1
Introduction
Statistics : Statistics is the science of conducting studies to collect, organize, summarize,
analyze, and draw conclusions from data.
Data : Data are the values (measurements or observations) that the variables can assume.
A collection of data values forms a data set. Each value in the data set is called a data value
or a datum.
Data can be used in different ways. The body of knowledge called statistics is sometimes
divided into two main areas, depending on how data are used. The two areas are
1. Descriptive statistics
2. Inferential statistics
Descriptive Statistics
Descriptive statistics are a part of statistics that can be used to describe data.
It is used to summarize the attributes of a sample in such a way that a pattern can be
drawn from the group. It enables researchers to present data in a more meaningful
way such that easy interpretations can be made. Descriptive statistics uses two tools
to organize and describe data. These are given as follows:
Formulas:
Example:
Inferential Statistics
Inferential statistics is a branch of statistics that is used to make inferences
about the population by analyzing a sample. When the population data is very large
it becomes difficult to use it. In such cases, certain samples are taken that are
representative of the entire population. Inferential statistics draws conclusions
regarding the population using these samples. Sampling strategies such as simple
random sampling, cluster sampling, stratified sampling, and systematic sampling
need to be used in order to choose correct samples from the population. Some
methodologies used in inferential statistics are as follows:
Formulas:
Example:
● A researcher wants to know the average height of all college students in the country.
Instead of measuring everyone, they collect a random sample of 200 students.
● Based on the sample, the researcher estimates the average height of all college
students is 170 cm ± 2 cm (with 95% confidence).
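The estimate above can be reproduced with a short sketch. The heights below are randomly generated stand-ins for real measurements (the generator parameters are assumptions for illustration); the 200-student sample mirrors the example:

```python
import math
import random
from statistics import NormalDist, mean, stdev

# Hypothetical sample: 200 randomly generated student heights (cm).
random.seed(0)
sample = [random.gauss(170, 8) for _ in range(200)]

n = len(sample)
x_bar = mean(sample)
s = stdev(sample)

# 95% confidence interval for the population mean: x_bar +/- z * s / sqrt(n)
z = NormalDist().inv_cdf(0.975)        # two-tailed 95% critical value, ~1.96
margin = z * s / math.sqrt(n)
ci = (x_bar - margin, x_bar + margin)
```

The interval `ci` is the "170 cm ± 2 cm" style statement from the example: a range that, with 95% confidence, contains the population mean.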
Variables and Types of Data
Variables can be classified as qualitative or quantitative.
Qualitative Variables
Qualitative variables are variables that can be placed into distinct categories,
according to some characteristic or attribute. For example, if subjects are classified
according to gender (male or female), then the variable gender is qualitative. Other
examples of qualitative variables are religious preference and geographic locations.
Quantitative Variables
Quantitative variables, often referred to as "numeric" variables, are
variables that represent a measurable quantity and can be ordered or ranked. For
example, the variable age is numerical, and people can be ranked in order according
to the value of their ages. Other examples of quantitative variables are heights,
weights, and body temperatures.
Quantitative variables can be further classified into two groups:
1. Discrete
2. Continuous
1. Discrete variables:
Discrete variables assume values that can be counted.
Discrete variables can be assigned values such as 0, 1, 2, 3 and are said
to be countable.
Examples of discrete variables are the number of children in a family, the
number of students in a classroom, and the number of calls received by a
switchboard operator each day for a month.
2. Continuous Variables:
Continuous variables can assume an infinite number of values between
any two specific values. They are obtained by measuring. They often
include fractions and decimals.
For example, Temperature is a continuous variable, since the variable can
assume an infinite number of values between any two given temperatures.
Data Collection
1. Survey Methods:
Surveys are the most common method used in data collection. Surveys
include:
2. Other Methods:
Sampling Techniques
1. Random Sampling
● Process:
○ Create a complete list of the population.
○ Use a random number generator, lottery system, or random selection
software to choose participants.
● Example: Imagine a school wants to survey students about cafeteria food.
The administration lists all students and randomly selects individuals for the
study using a computer-generated list.
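A minimal sketch of this process, using a made-up roster of 1,000 student IDs in place of a real population list:

```python
import random

# Hypothetical roster: every member of the population must be listed.
population = [f"student_{i}" for i in range(1000)]

# random.sample draws without replacement; every student is equally likely.
random.seed(42)
sample = random.sample(population, k=50)
```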
2. Systematic Sampling
● Process:
○ Decide on a fixed interval (k).
○ Select a random starting point from the population list.
○ Choose every kth individual thereafter.
● Example: A company wants to measure employee satisfaction and selects
every 5th employee from a staff roster starting from a randomly chosen
employee.
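The three steps can be sketched as follows; the 100-employee roster and the interval k = 5 are illustrative assumptions:

```python
import random

population = [f"employee_{i}" for i in range(100)]  # hypothetical roster
k = 5                                               # fixed interval

random.seed(1)
start = random.randrange(k)     # random starting point in the first interval
sample = population[start::k]   # every k-th individual thereafter
```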
3. Stratified Sampling
● Process:
○ Divide the population into relevant subgroups (e.g., age groups,
income levels).
○ Randomly select participants from each subgroup.
● Example: If a university is conducting a survey on study habits, researchers
may divide students by academic year (freshman, sophomore, junior, senior)
and then randomly select a proportionate number from each group.
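A sketch of proportionate stratified sampling; the group sizes and the 10% sampling rate are made up for illustration:

```python
import random

# Hypothetical student body grouped by academic year (the strata).
strata = {
    "freshman":  [f"fr_{i}" for i in range(400)],
    "sophomore": [f"so_{i}" for i in range(300)],
    "junior":    [f"ju_{i}" for i in range(200)],
    "senior":    [f"se_{i}" for i in range(100)],
}

random.seed(7)
rate = 0.10  # sample 10% of each stratum, keeping proportions intact
sample = {year: random.sample(group, k=round(len(group) * rate))
          for year, group in strata.items()}
```

Because the same rate is applied to each stratum, the sample preserves the population's year-by-year proportions.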
4. Cluster Sampling
● Process:
○ Divide the population into clusters based on a characteristic (e.g.,
location, department, school).
○ Randomly select clusters and survey everyone within the chosen
clusters.
● Example: A health organization wants to study diabetes prevalence. Instead
of surveying every individual in a city, they randomly pick several
neighborhoods and survey all residents in those neighborhoods.
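A sketch with hypothetical clusters; the neighborhood and resident names are placeholders:

```python
import random

# Hypothetical city: 10 neighborhoods, each a cluster of 50 residents.
clusters = {f"neighborhood_{c}": [f"resident_{c}_{i}" for i in range(50)]
            for c in range(10)}

random.seed(3)
chosen = random.sample(list(clusters), k=3)   # randomly pick 3 whole clusters
surveyed = [person for c in chosen for person in clusters[c]]  # everyone inside
```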
Convenience Sampling
Unlike the four methods above, convenience sampling prioritizes ease of data
collection rather than representation. Researchers select participants who are easy
to access, but this method risks introducing bias.
● Process:
○ Choose participants based on their availability.
○ Conduct the study without randomization or systematic selection.
● Example: A researcher conducting a quick study on smartphone usage may
interview people sitting at a café instead of selecting a diverse and
representative sample.
Uses of Statistics
Statistics is a powerful tool used in various fields for data analysis, decision-
making, and prediction. It helps researchers and businesses gain insights from data
and make informed choices. Some common applications include:
Misuses of Statistics
1. Suspect Samples
● Example: An advertisement claims, “3 out of 4 doctors recommend this
product,” but if only 4 doctors were surveyed, the result lacks credibility.
2. Ambiguous Averages
3. Detached Statistics
4. Implied Connections
5. Misleading Graphs
6. Faulty Survey Questions
● Example: “Do you support better education policies?” vs. “Do you support
raising taxes for education?” The first phrasing may receive more positive
responses, while the second may be less favorable.
Unit - 2
○ Formula: Midpoint = (lower limit + upper limit) / 2
● Example:
Midpoint of 30 - 40 = (30+40) / 2
= 35
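The same midpoint arithmetic, applied to a few made-up class intervals:

```python
# Class midpoint for each interval: (lower + upper) / 2.
classes = [(10, 20), (20, 30), (30, 40)]          # hypothetical class limits
midpoints = [(low + high) / 2 for low, high in classes]
```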
Histograms
A histogram is a type of bar graph that visually represents data grouped into
classes. It helps show the distribution of values, making it easier to identify patterns
like peaks, gaps, or skewness in the dataset.
Key Features:
Frequency Polygons
Key Features:
Example:
Ogives (Cumulative Frequency Graphs)
Key Features:
Example:
Distribution Shapes
Other Types of Graphs
1. Bar Graph
Example:
2. Pareto Chart
Example:
3. Time Series Graph
Example:
4. Pie Graph
Example:
5. Stem-and-Leaf Plot
Example:
Measures of Central Tendency
Measures of central tendency summarize a dataset by identifying a central or
typical value. The four primary measures are mean, median, mode, and midrange.
Each has distinct properties and is useful in different situations.
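All four measures can be computed directly; the dataset below is made up for illustration, and `multimode` is used so that ties for the mode are handled gracefully:

```python
from statistics import mean, median, multimode

data = [2, 3, 5, 5, 7, 9, 12]   # hypothetical dataset

m_mean = mean(data)                        # sum of values / number of values
m_median = median(data)                    # middle value of the sorted data
m_mode = multimode(data)                   # most frequent value(s)
m_midrange = (min(data) + max(data)) / 2   # average of smallest and largest
```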
1. Mean
The mean is the sum of all values in a dataset divided by the total number of
values. It is a widely used measure because it considers every data point.
Formula: x̄ = Σx / n
where Σx is the sum of the data values and n is the number of values.
Example 1:
Example 2:
Advantages:
Disadvantages:
2. Median (Middle Value)
The median is the value that falls exactly in the middle when data is arranged
in ascending order. If there is an even number of values, the median is the average
of the two middle values.
Advantages:
Disadvantages:
Example 1:
Example 2:
3. Mode (Most Frequent Value)
The mode is the value that appears most frequently in a dataset. A dataset can be:
Example 1:
Example 2:
Disadvantages:
Example 3: Finding Mean, Median and Mode
4. Midrange
The midrange is the average of the smallest and largest values in a dataset.
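A one-line computation of the midrange, using a made-up dataset:

```python
# Midrange = (smallest value + largest value) / 2.
data = [4, 9, 1, 16, 7]               # hypothetical dataset
midrange = (min(data) + max(data)) / 2
```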
Example 1:
Example 2:
Advantages of Midrange:
Disadvantages:
5. Weighted Mean
The weighted mean is a type of average that accounts for the importance (or
frequency) of each value in a dataset. It is useful when different values carry different
levels of significance.
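A sketch of the weighted mean, using hypothetical course grades weighted by credit hours:

```python
# Weighted mean = sum(w_i * x_i) / sum(w_i).
grades  = [90, 80, 70]   # scores (hypothetical)
credits = [4, 3, 2]      # weights: credit hours per course

weighted_mean = sum(w * x for w, x in zip(credits, grades)) / sum(credits)
```

Here the 4-credit course pulls the average upward more than the 2-credit course, which is exactly the point of weighting.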
Example:
Unit – 3
Classical Statistical Tests
Classical statistical tests are fundamental tools used in hypothesis testing and
data analysis. These tests help determine relationships between variables, identify
differences between groups, and assess statistical significance.
Z – Test
A Z-test is a type of hypothesis test that compares the sample's average to the
population's average. It calculates a Z-score, which tells us how far the sample
average is from the population average, measured in units of how much the data
normally varies. It is particularly useful when the sample size is large (n > 30).
The Z-score, also known as the Z-statistic, is given by:
Z-Score = (x̄ − μ) / σ
where,
x̄ : mean of the sample.
μ : mean of the population.
σ : standard deviation of the population.
Example:
The average family annual income in India is 200k with a standard deviation of 5k and
the average family annual income in Delhi is 300k.
Z-Score = (300 − 200) / 5
= 20
Steps to perform Z-test
First step is to identify the null and alternate hypotheses.
Determine the level of significance (α).
Find the critical value of z in the z-test.
Calculate the z-test statistic and compare it with the critical value to decide
whether to reject the null hypothesis.
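The steps above can be sketched end to end. The hypothesized mean, standard deviation, and sample values below are invented for illustration:

```python
import math
from statistics import NormalDist

# Hypothetical two-tailed z-test: H0: mu = 75, H1: mu != 75.
mu = 75        # population mean under the null hypothesis
sigma = 10     # known population standard deviation
n = 50         # sample size (> 30, so a z-test is reasonable)
x_bar = 78     # observed sample mean
alpha = 0.05   # level of significance

z = (x_bar - mu) / (sigma / math.sqrt(n))      # step 4: test statistic
z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # step 3: critical value, ~1.96
p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-tailed p-value
reject_h0 = abs(z) > z_crit                    # decision rule
```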
Types of Z-Tests:
1. One-Sample Z-Test
Used when comparing the mean of a single sample to a known population mean.
Example: A professor wants to test if the average exam score in their class is
significantly different from the national average of 75.
Z = (x̄ − μ) / (σ / √n)
Types of T-Tests
2. Paired T-Test
Used when comparing means within the same group before and after a
treatment or intervention.
Also called the matched-pairs t-test.
Example: Measuring weight before and after a fitness program.
3. One-Sample T-Test
T-Test Formula
The formula for a t-test depends on the type used, but the general one-sample
form is:
t = (x̄ − μ) / (s / √n)
Where:
x̄ = sample mean
μ = hypothesized population mean
s = sample standard deviation
n = sample size
For paired t-tests, the difference between paired values is used instead of two
separate sample means.
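A paired t-statistic computed from its definition, with made-up before/after weights from a hypothetical fitness program:

```python
import math
from statistics import mean, stdev

# Hypothetical weights (kg) of 5 people before and after a fitness program.
before = [82, 90, 77, 85, 95]
after  = [80, 86, 76, 82, 90]

# Paired t-test works on the per-person differences, not the two group means.
diffs = [b - a for b, a in zip(before, after)]
n = len(diffs)
t = mean(diffs) / (stdev(diffs) / math.sqrt(n))  # t = d_bar / (s_d / sqrt(n))
df = n - 1                                       # degrees of freedom
```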
Formula:
2. F-Test in ANOVA
Goodness of Fit-test
A goodness of fit test is used to determine whether sample data fits a
specific distribution or model. Commonly used goodness of fit tests include
Chi-square, Kolmogorov-Smirnov, Anderson-Darling, and Shapiro-Wilk. These
tests help measure how well observed data correspond to the expected
values from a model.
1. Anderson-Darling Test
The Anderson-Darling test (AD-Test) is used to test if a sample of data came
from a population with a specific distribution. It is a modification of the Kolmogorov-
Smirnov (K-S) test and gives more weight to the tails than does the K-S test.
The (AD-Test) is a measure of how well your data fits a specified distribution.
It’s commonly used as a test for normality.
Formula:
Where:
i = the ith sample, calculated when the data is sorted in ascending order.
2. Chi-Square Test
A chi-square (Χ²) goodness of fit test is a goodness of fit test for a categorical
variable. Goodness of fit is a measure of how well a statistical model fits a set of
observations.
When goodness of fit is high, the values expected based on the model are close
to the observed values.
When goodness of fit is low, the values expected based on the model are far
from the observed values.
The statistical models that are analyzed by chi-square goodness of fit tests are
distributions. They can be any distribution, from as simple as equal probability for all
groups, to as complex as a probability distribution with many parameters.
The chi-square goodness of fit test is a hypothesis test. It allows you to draw
conclusions about the distribution of a population based on a sample.
Formula:
Χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ
Where,
Oᵢ = observed frequency for category i
Eᵢ = expected frequency for category i under the model
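A worked sketch of the statistic, testing a hypothetical die for fairness (the observed counts are invented):

```python
# Chi-square goodness-of-fit: H0 says all six faces are equally likely.
observed = [8, 12, 9, 11, 10, 10]       # 60 hypothetical rolls
expected = [sum(observed) / 6] * 6      # 10 per face under H0

# Sum of (observed - expected)^2 / expected over all categories.
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1                  # categories minus one
```

A small Χ² (here 1.0 with 5 degrees of freedom) means the observed counts are close to what the fair-die model expects.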
3. Kolmogorov–Smirnov Test
The Kolmogorov–Smirnov test is an efficient way to determine whether two
samples differ significantly from each other. It is commonly used to check the
uniformity of random numbers: uniformity is one of the most important properties of
any random number generator, and the Kolmogorov–Smirnov test can be used to
verify it.
The Kolmogorov–Smirnov test is versatile and can be employed to evaluate
whether two underlying one-dimensional probability distributions vary. It serves as an
effective tool to determine the statistical significance of differences between two sets of
data.
Formula:
Where,
n is the sample size.
x is the normalized Kolmogorov-Smirnov statistic.
k is the index of summation in the series
4. Shapiro-Wilk Test
Shapiro-Wilk test is a hypothesis test that evaluates whether a data set is
normally distributed. It evaluates data from a sample with the null hypothesis that the
data set is normally distributed. A large p-value indicates the data set is normally
distributed, a low p-value indicates that it isn’t normally distributed.
The test statistic W compares the ordered sample values with the values
that would be expected if the sample came from a normal distribution. W lies
between 0 and 1: values close to 1 support normality, while small values of W
(equivalently, small p-values) lead to rejecting the null hypothesis of
normality. For larger sample sizes, the distribution of the test statistic is
generally obtained numerically or via Monte Carlo simulation.
5. Lilliefors Test
With the K-S test it is assumed that the distribution parameters are known, which
is often not the case. Following Lilliefors, the test procedure is as follows: Given a
sample of n observations, one determines D, where:
D = sup |F(x) − G(x)|
where sup means supremum, or largest value of a set, with G(x) being the
sample cumulative distribution function and F(x) the cumulative Normal distribution
function with mean μ = the sample mean and variance σ² = the sample variance,
defined with denominator n − 1.
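A sketch of computing D with parameters estimated from the sample, as the Lilliefors procedure describes; the data values are made up:

```python
from statistics import NormalDist, mean, stdev

# Compare the sample's empirical CDF with a Normal CDF whose mean and
# standard deviation are estimated from the same sample (stdev uses n-1).
data = sorted([4.1, 4.9, 5.2, 5.6, 6.0, 6.3, 7.1, 7.8])
n = len(data)
fitted = NormalDist(mean(data), stdev(data))

# D = largest gap between the empirical step function and the fitted CDF,
# checked on both sides of each jump.
D = max(max(abs((i + 1) / n - fitted.cdf(x)),   # step just after x
            abs(i / n - fitted.cdf(x)))         # step just before x
        for i, x in enumerate(data))
```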
Unit – 4
Chapter 10
2. Regression
• Regression is used to predict one variable (dependent) based on another
(independent).
• Simple linear regression follows the formula:
Y = bX + a
o Y is the dependent variable (predicted value).
o X is the independent variable (input).
o b is the slope (rate of change).
o a is the intercept (starting point).
• It helps in forecasting future trends and making informed decisions.
Scatterplots
A scatter plot is a type of graph used in statistics to visually represent the relationship
between two numerical variables. It helps identify patterns, trends, and possible correlations.
Each point on the scatter plot corresponds to a pair of values (x, y) and shows
how one variable changes in relation to another.
It is generally used to plot the relationship between one independent variable and one
dependent variable, where an independent variable is plotted on the x-axis and a dependent
variable is plotted on the y-axis so that you can visualize the effect of the independent
variable on the dependent variable. These plots are known as Scatter Plot Graph or Scatter
Diagram.
3. No Relationship
• No clear pattern exists between variables (e.g., shoe size vs. IQ).
4. Curvilinear Relationship
The tightness of the points indicates how strong the relationship is:
Example 1:
Example 2:
Correlation Coefficient
Correlation measures how strong a relationship is between two variables.
Formula: r = (nΣxy − Σx Σy) / √[(nΣx² − (Σx)²)(nΣy² − (Σy)²)]
where:
• n = number of data points
• x, y = individual data values
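The computational formula for r can be evaluated directly; the paired values below are invented for illustration:

```python
import math

# r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2) * (n*Syy - Sy^2))
x = [1, 2, 3, 4, 5]          # hypothetical paired observations
y = [2, 4, 5, 4, 5]

n = len(x)
sx, sy = sum(x), sum(y)
sxy = sum(a * b for a, b in zip(x, y))   # sum of products
sxx = sum(a * a for a in x)              # sum of squared x values
syy = sum(b * b for b in y)              # sum of squared y values
r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
```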
Example:
Hypothesis Testing for Correlation
To determine if a correlation is significant or just due to chance, we use hypothesis testing:
• Null Hypothesis (H0): There is no correlation (ρ=0).
• Alternative Hypothesis (H1): There is a significant correlation (ρ≠0).
Test Statistic Formula: t = r√(n − 2) / √(1 − r²)
where n−2 is the degrees of freedom. If the test statistic exceeds the critical value, we reject
H0, meaning the correlation is statistically significant.
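A sketch of this test statistic; the values of r and n are illustrative, not from a real dataset:

```python
import math

# t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom.
r = 0.7746   # hypothetical correlation coefficient
n = 5        # hypothetical number of data pairs

t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
df = n - 2
```

The computed t is then compared with the critical t-value for df degrees of freedom at the chosen significance level.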
Regression
Regression is a statistical method used to study the relationship between two or more
variables, helping researchers predict outcomes based on observed data. The goal of
regression analysis is to identify trends and make data-driven forecasts.
o If the correlation is significant, the next step is to find the equation of the
regression line.
o The regression line minimizes prediction errors and helps make forecasts
based on observed trends.
o The formula for simple linear regression: Y = bX + a
where b is the slope and a is the intercept of the line.
Example: Find the linear regression equation for the given data:
x y
3 8
9 6
5 4
3 2
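The example above can be worked out with a short least-squares computation; the slope and intercept formulas are the standard ones (b from the deviation cross-products, a = ȳ − b·x̄):

```python
# Least-squares fit Y = bX + a for the example data above.
x = [3, 9, 5, 3]
y = [8, 6, 4, 2]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
    / sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar
```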
Analysis of Variance (ANOVA)
ANOVA is a statistical test used to examine differences among the means of three or
more groups.
Unlike a t-test, which only compares two groups, ANOVA can handle multiple
groups in a single analysis, making it an essential tool for experiments with more than two
categories.
ANOVA helps analyze variation within groups and between groups to understand if
observed differences are due to random chance or an actual effect.
• Degrees of Freedom (df): The number of values that are free to vary when calculating
statistics.
• F-Ratio: The ratio of MSB to MSW, used to test the null hypothesis.
• P-Value: This probability value helps determine if the F-ratio is significant. A small p-
value (e.g., <0.05) suggests significant differences between groups.
Types of ANOVA:
1. One-Way ANOVA:
a. Examines the effect of a single independent variable on the dependent
variable.
b. Example: Comparing test scores of students across three teaching methods.
2. Two-Way ANOVA:
a. Analyzes the impact of two independent variables simultaneously and their
interaction.
b. Example: Studying the combined effects of diet and exercise on weight loss.
3. Repeated Measures ANOVA:
a. Used when the same subjects are measured multiple times under different
conditions.
b. Example: Tracking blood pressure levels of patients before, during, and after
medication.
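A one-way ANOVA F-ratio computed from scratch for three hypothetical groups (e.g., test scores under three teaching methods; all numbers are invented):

```python
from statistics import mean

groups = [[85, 90, 88], [78, 82, 80], [92, 95, 91]]  # hypothetical scores

k = len(groups)                    # number of groups
n = sum(len(g) for g in groups)    # total observations
grand = mean([x for g in groups for x in g])

# Between-groups and within-groups sums of squares.
ssb = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
ssw = sum((x - mean(g)) ** 2 for g in groups for x in g)

msb = ssb / (k - 1)   # mean square between (df = k - 1)
msw = ssw / (n - k)   # mean square within  (df = n - k)
F = msb / msw         # F-ratio: large F suggests the group means differ
```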
ANOVA Table
An ANOVA (Analysis of Variance) test table is used to summarize the results of an
ANOVA test, which is used to determine if there are any statistically significant differences
between the means of three or more independent groups. Here’s a general structure of an
ANOVA table:
Key Concepts
1. Independent Variable (Factor) – The categorical variable that divides the data into
groups (e.g., teaching methods).
2. Dependent Variable – The numerical variable being measured (e.g., student test
scores).
Example:
Two-way Analysis of Variance
Two-way ANOVA is a statistical technique used to examine the effects of two
independent variables (factors) on a dependent variable simultaneously. It helps determine
if there are significant differences between group means based on two categorical factors
and whether these factors interact.
Key Concepts
Example:
Unit – 5
Statistical Packages
A statistical package is a software application designed for performing statistical analysis,
data visualization, and interpretation. These packages provide a range of tools for conducting
descriptive statistics, inferential statistics, regression analysis, hypothesis testing, and
data modeling.
SPSS
SPSS (Statistical Package for the Social Sciences) is a software application widely
used for statistical analysis, data management, and graphical representation of data. It
was originally designed for social sciences research but has now become a popular tool in
business, healthcare, education, and many other fields.
• Data Management: Users can import, clean, and manipulate data efficiently.
• Descriptive Statistics: Calculation of mean, median, mode, standard deviation, and
frequency distributions.
• Inferential Statistics: Hypothesis testing, confidence intervals, and ANOVA.
• Regression Analysis: Used to predict outcomes and identify relationships between
variables.
• Data Visualization: Generates histograms, scatter plots, box plots, and pie charts.
• Factor Analysis & Clustering: Helps in segmentation and identifying patterns in
large datasets.
Components of SPSS
SPSS is divided into different modules, each designed for specific statistical tasks:
• SPSS Statistics Base: The core module for basic data analysis.
• SPSS Modeler: Used for predictive modeling and data mining.
• SPSS Text Analytics: Helps analyze text-based data like surveys and social media
comments.
• SPSS Amos: Used for structural equation modeling and advanced multivariate
analysis.
• SPSS Custom Tables: Creates customized reports and tables for data interpretation.
Working of SPSS
SPSS operates through two primary views:
• Data View: Displays raw data in spreadsheet format (rows = cases, columns =
variables).
• Variable View: Allows users to define data attributes such as names, labels, types,
and measurement scales.
Users can perform analyses using menus and built-in functions, or write scripts in syntax
for complex tasks.
Applications of SPSS
SPSS is used across industries for various purposes:
MS-Excel
Microsoft Excel is a powerful spreadsheet application used for data management,
analysis, and visualization. While primarily known for business applications, Excel
provides a robust set of statistical functions and tools for performing various types of
calculations and statistical tests.
Key Features for Statistical Analysis in Excel
• Data Entry & Management: Stores large datasets, enables sorting and filtering.
• Descriptive Statistics: Calculates mean, median, mode, variance, and standard
deviation.
• Inferential Statistics: Performs hypothesis testing, confidence intervals, and
ANOVA.
• Regression & Correlation Analysis: Analyzes relationships between variables.
• Graphical Visualization: Generates scatter plots, histograms, and trend lines.
• Data Analysis ToolPak: Includes specialized statistical functions like t-tests and
regression modeling.
Basic Statistics
• Mean: =AVERAGE(range)
• Median: =MEDIAN(range)
• Simple Linear Regression: Can be done using trendline functions in scatter plots.
SAS
SAS (Statistical Analysis System) is a powerful data analytics software used for statistical
analysis, data management, and predictive modeling. It is widely used in business,
healthcare, government, and scientific research due to its robust data handling capabilities
and ability to process large datasets efficiently.
Key Features of SAS
• Inferential Statistics: Hypothesis testing, confidence intervals, and ANOVA.
• Regression Analysis: Simple, multiple, and logistic regression models.
• Time-Series Analysis: Helps forecast trends in business and financial data.
• Machine Learning & AI Integration: SAS provides advanced analytics for predictive modeling.
• Data Visualization: Generates histograms, scatter plots, box plots, and trend analysis graphs.
Working of SAS
SAS operates through two main interfaces:
Users load datasets, perform statistical operations, and generate reports and visualizations.
Users write SAS scripts to perform these procedures or use Enterprise Guide for
automated workflows.
Applications of SAS
• Healthcare & Medicine – Clinical trial analysis and disease prediction.
• Finance & Banking – Risk management and fraud detection.
• Marketing & Retail – Customer segmentation and sales forecasting.
• Education & Social Sciences – Analyzing survey data and academic performance
trends.
• Manufacturing & Supply Chain – Predicting production efficiency and optimizing
logistics.
R Programming
R is an open-source programming language specifically designed for statistical
computing and data analysis. It is widely used in academic research, business analytics,
machine learning, and scientific computing due to its flexibility, vast libraries, and
powerful visualization capabilities.
Basic Statistics
• Mean: mean(data)
• Median: median(data)
• Standard Deviation: sd(data)
• Variance: var(data)
• Correlation: cor(x, y)
• Linear Regression: lm(y ~ x, data = dataset)
• Multiple Regression: lm(y ~ x1 + x2, data = dataset)
Data Visualization
• Base R: plot(x, y)
• ggplot2:
o library(ggplot2)
o ggplot(data, aes(x, y)) + geom_point()
Minitab
Basic Statistics
• Mean, Median, Standard Deviation: Found under Stat > Basic Statistics.
• Summary statistics for datasets can be obtained through Stat > Descriptive Statistics.
• Simple Linear Regression: Stat > Regression > Fit Regression Model.
• Correlation Analysis: Stat > Basic Statistics > Correlation.
Data Visualization
• Scatter plots, histograms, and box plots can be accessed through Graph > Scatterplot.