Statistical Tools Complete Notes

The document provides an introduction to statistics, defining key concepts such as variables, data, and the two main branches of statistics: descriptive and inferential statistics. It explains the tools and methodologies used in both branches, including measures of central tendency and dispersion for descriptive statistics, and hypothesis testing and regression analysis for inferential statistics. Additionally, it covers data collection methods, sampling techniques, and the uses and misuses of statistics, along with organizing data through frequency distributions and understanding distribution shapes.

Unit - 1

Chapter - 1

Introduction
Statistics : Statistics is the science of conducting studies to collect, organize, summarize,
analyze, and draw conclusions from data.

Variable : A variable is a characteristic or attribute that can assume different values.


Variables whose values are determined by chance are called random variables.

Data : Data are the values (measurements or observations) that the variables can assume.
A collection of data values forms a data set. Each value in the data set is called a data value
or a datum.

Data can be used in different ways. The body of knowledge called statistics is sometimes
divided into two main areas, depending on how data are used. The two areas are
1. Descriptive statistics
2. Inferential statistics

Descriptive and Inferential Statistics


The purpose of descriptive and inferential statistics is to analyze different
types of data using different tools. Descriptive statistics helps to describe and
organize known data using charts, bar graphs, etc., while inferential statistics aims at
making inferences and generalizations about the population data.

Descriptive Statistics
Descriptive statistics is the part of statistics that is used to describe data.
It summarizes the attributes of a sample in such a way that a pattern can be
drawn from the group. It enables researchers to present data in a more meaningful
way such that easy interpretations can be made. Descriptive statistics uses two tools
to organize and describe data. These are given as follows:

● Measures of Central Tendency - These help to describe the central position of the data by using measures such as mean, median, and mode.
● Measures of Dispersion - These measures help to see how spread out the data is in a distribution with respect to a central point. Range, standard deviation, variance, quartiles, and absolute deviation are the measures of dispersion.

Formulas:
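The standard definitions behind these measures, for data values X₁, X₂, …, Xₙ:

Mean: x̄ = ∑X / n
Range: R = highest value − lowest value
Sample variance: s² = ∑(X − x̄)² / (n − 1)
Standard deviation: s = √s²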

Example:

A teacher records the final exam scores of all 30 students in a class.


● The average (mean) score is 78.
● The highest score is 95.
● The lowest score is 60.
● The standard deviation is 8.2.

Inferential Statistics
Inferential statistics is a branch of statistics that is used to make inferences
about the population by analyzing a sample. When the population data is very large
it becomes difficult to use it. In such cases, certain samples are taken that are
representative of the entire population. Inferential statistics draws conclusions
regarding the population using these samples. Sampling strategies such as simple
random sampling, cluster sampling, stratified sampling, and systematic sampling
need to be used in order to choose correct samples from the population. Some
methodologies used in inferential statistics are as follows:

● Hypothesis Testing - This technique involves the use of hypothesis tests such as the z test, f test, t test, etc. to make inferences about the population data. It requires setting up the null hypothesis, the alternative hypothesis, and testing the decision criteria.
● Regression Analysis - Such a technique is used to check the relationship between dependent and independent variables. The most commonly used type of regression is linear regression.

Formulas:
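Two standard formulas behind the estimate in the example below:

Test statistic for a mean: z = (x̄ − μ) / (σ / √n)
Confidence interval for the mean: x̄ ± z · (σ / √n)   (z ≈ 1.96 for 95% confidence)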

Example:

● A researcher wants to know the average height of all college students in the country.
Instead of measuring everyone, they collect a random sample of 200 students.
● Based on the sample, the researcher estimates the average height of all college
students is 170 cm ± 2 cm (with 95% confidence).

Variables and Types of Data
Variables can be classified as qualitative or quantitative.

Qualitative Variables
Qualitative variables are variables that can be placed into distinct categories,
according to some characteristic or attribute. For example, if subjects are classified
according to gender (male or female), then the variable gender is qualitative. Other
examples of qualitative variables are religious preference and geographic locations.

Quantitative Variables
Quantitative variables, also referred to as “numeric” variables, are variables
that represent a measurable quantity and can be ordered or ranked. For
example, the variable age is numerical, and people can be ranked in order according
to the value of their ages. Other examples of quantitative variables are heights,
weights, and body temperatures.
Quantitative variables can be further classified into two groups:
1. Discrete
2. Continuous
1. Discrete variables:
 Discrete variables assume values that can be counted.
 Discrete variables can be assigned values such as 0, 1, 2, 3 and are said
to be countable.
 Examples of discrete variables are the number of children in a family, the
number of students in a classroom, and the number of calls received by a
switchboard operator each day for a month.

2. Continuous Variables:
 Continuous variables can assume an infinite number of values in an interval
between any two specific values. They are obtained by measuring. They often
include fractions and decimals.
 For example, temperature is a continuous variable, since the variable can
assume an infinite number of values between any two given temperatures.

Data Collection and Sampling Techniques

Data Collection

Data collection is the process of gathering information for analysis, decision-making, or research. Various methods exist, each with its advantages and drawbacks:

1. Survey Methods:

Surveys are the most common method used in data collection. Surveys
include:

1. Telephone Survey – Researchers call participants to collect responses. This method is relatively cost-effective and efficient but excludes individuals without phones or those unwilling to participate.
2. Mailed Questionnaire – A set of questions sent by mail for respondents to answer and return. This approach allows for a wider reach and ensures respondent anonymity but suffers from low response rates.
3. Personal Interview – Face-to-face interaction with respondents to gather detailed and in-depth responses. While this method offers rich insights, it is costly and may introduce bias due to interviewer influence.

2. Other Methods:

● Direct Observation – Researchers physically observe behaviors, events, or actions without interaction. This method is useful for studying natural settings but can be subjective.
● Surveying Records – Collecting pre-existing data from sources such as government reports, company records, or databases. This technique provides historical insights but may be limited by data accuracy and relevance.

Sampling Techniques

Sampling ensures that data collection is efficient, cost-effective, and representative of the entire population being studied. Choosing the right method prevents biases and improves the accuracy of conclusions.

1. Random Sampling

Random sampling is considered the most unbiased method, as every individual in the population has an equal chance of being selected. This ensures that the sample accurately reflects the population without systematic bias.

● Process:
○ Create a complete list of the population.
○ Use a random number generator, lottery system, or random selection
software to choose participants.
● Example: Imagine a school wants to survey students about cafeteria food.
The administration lists all students and randomly selects individuals for the
study using a computer-generated list.

2. Systematic Sampling

In systematic sampling, researchers select every kth individual in the population after choosing a random starting point. It is simpler than random sampling while still ensuring a structured selection process.

● Process:
○ Decide on a fixed interval (k).
○ Select a random starting point from the population list.
○ Choose every kth individual thereafter.
● Example: A company wants to measure employee satisfaction and selects
every 5th employee from a staff roster starting from a randomly chosen
employee.

3. Stratified Sampling

Stratified sampling is used when a population has distinct subgroups (strata) with unique characteristics. Each group is sampled separately to ensure proportional representation.

● Process:
○ Divide the population into relevant subgroups (e.g., age groups,
income levels).
○ Randomly select participants from each subgroup.
● Example: If a university is conducting a survey on study habits, researchers
may divide students by academic year (freshman, sophomore, junior, senior)
and then randomly select a proportionate number from each group.

4. Cluster Sampling

Cluster sampling is useful when studying large populations spread across different locations. Researchers divide the population into clusters and randomly select some clusters for study, surveying all individuals within those clusters.

● Process:
○ Divide the population into clusters based on a characteristic (e.g.,
location, department, school).
○ Randomly select clusters and survey everyone within the chosen
clusters.
● Example: A health organization wants to study diabetes prevalence. Instead of surveying every individual in a city, they randomly pick several neighborhoods and survey all residents in those neighborhoods.
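The four procedures above can be expressed directly in code. Below is a minimal Python sketch; the roster of 500 IDs, the strata, and the sample sizes are made-up for illustration:

import random

population = list(range(1, 501))                  # hypothetical roster of 500 student IDs

# Simple random sampling: every member has an equal chance of selection
srs = random.sample(population, 25)

# Systematic sampling: random start, then every kth member
k = 20
start = random.randrange(k)
systematic = population[start::k]

# Stratified sampling: sample proportionally within each subgroup
strata = {"junior": population[:250], "senior": population[250:]}
stratified = [unit for group in strata.values()
              for unit in random.sample(group, len(group) // 20)]

# Cluster sampling: randomly pick whole clusters and keep everyone in them
clusters = [population[i:i + 50] for i in range(0, len(population), 50)]
cluster_sample = [unit for c in random.sample(clusters, 2) for unit in c]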

Other Sampling Method

Convenience Sampling

Unlike the four methods above, convenience sampling prioritizes ease of data
collection rather than representation. Researchers select participants who are easy
to access, but this method risks introducing bias.

● Process:
○ Choose participants based on their availability.
○ Conduct the study without randomization or systematic selection.
● Example: A researcher conducting a quick study on smartphone usage may
interview people sitting at a café instead of selecting a diverse and
representative sample.

Uses and Misuses of Statistics

Uses of Statistics

Statistics is a powerful tool used in various fields for data analysis, decision-making, and prediction. It helps researchers and businesses gain insights from data and make informed choices. Some common applications include:

1. Describing Data – Statistics allows researchers to summarize and present data in a meaningful way using measures like mean, median, mode, and graphical representations.
2. Comparing Data Sets – Different groups or populations can be compared using statistical techniques to determine patterns, trends, and correlations.
3. Testing Hypotheses – Researchers use statistical tests to determine if a claim or assumption about a population is true based on sample data.
4. Making Estimates About Populations – Based on sample data, statisticians use inference techniques to predict characteristics of a larger population.

Misuses of Statistics

Statistics can be manipulated intentionally or unintentionally, leading to misleading conclusions. Here are common ways statistics can be misused:

1. Suspect Samples

● A study’s validity depends on how the sample is chosen.


● Problems arise when the sample size is too small or when the sample is
not representative of the entire population.

● Example: An advertisement claims, “3 out of 4 doctors recommend this
product,” but if only 4 doctors were surveyed, the result lacks credibility.

2. Ambiguous Averages

● People often present mean, median, mode, or midrange as “the average” without specifying which one.
● Example: If incomes in a city vary significantly, reporting the mean income
might make earnings seem higher than they actually are, while the median
may provide a more realistic figure.

3. Changing the Subject

● The same data can be framed differently to create a different perception.


● Example: A politician claims “Government spending increased by only 3%,”
while an opponent states, “Spending rose by $6 million.” Both statements are
correct but framed differently.

4. Detached Statistics

● A detached statistic lacks a comparison, making it difficult to evaluate its significance.


● Example: A cereal brand claims, “Our product has 50% more fiber!” but
doesn’t mention compared to what.

5. Implied Connections

● Some statements suggest relationships between variables without evidence.
● Example: A health product claims, “Eating fish may help lower cholesterol,”
but the statement uses uncertain words like ‘may’ or ‘some people’
without definitive proof.

6. Misleading Graphs

● Graphs can visually exaggerate trends or manipulate perception.


● Example: A graph that starts at $99,000 instead of $0 can make a small
increase seem drastic.

7. Faulty Survey Questions

● Question wording can bias responses and influence survey results.

● Example: “Do you support better education policies?” vs. “Do you support
raising taxes for education?” The first phrasing may receive more positive
responses, while the second may be less favorable.

Unit - 2

Organizing Data with Frequency Distributions


When dealing with large amounts of data, organizing them into a frequency
distribution helps make sense of the information. A frequency distribution is a
table that groups data into categories, showing how often each category occurs.

Types of Frequency Distributions

1. Categorical Frequency Distribution


○ Used for qualitative data (non-numerical categories).
○ Example: A table showing the number of students who prefer different
sports (Football, Basketball, Tennis, etc.).
2. Grouped Frequency Distribution
○ Used for numerical data when the range is large.
○ Example: Grouping test scores into intervals like 50-59, 60-69, 70-79,
instead of listing individual scores.
3. Ungrouped Frequency Distribution
○ Used when the data range is small and values can be listed
separately.
○ Example: Listing the exact marks obtained by students in a small class.

Components of a Frequency Distribution

1. Class Limits – Define the range of values in each category.


○ Example: 30-40 (Lower limit = 30, Upper limit = 40).
2. Class Boundaries – Adjusted limits to avoid gaps between intervals.
○ Example: If the class limits are 30-40, the boundaries will be 29.5-40.5.
3. Class Width – Difference between the lower limits of consecutive classes.
   ○ Formula: Class width = Lower limit of next class − Lower limit of current class
   ○ Example: If the limits are 30-40, 41-51, the width is 41 − 30 = 11.
4. Class Midpoint – The middle value of a class.

   ○ Formula: Midpoint = ( Lower limit + Upper limit ) / 2
   ○ Example: Midpoint of 30-40 = (30 + 40) / 2 = 35

Histograms

A histogram is a type of bar graph that visually represents data grouped into
classes. It helps show the distribution of values, making it easier to identify patterns
like peaks, gaps, or skewness in the dataset.

Key Features:

● Uses vertical bars to represent frequencies.


● The bars are contiguous (touch each other) to indicate continuous data.
● The x-axis represents class boundaries or intervals.
● The y-axis represents frequencies (how often values appear).

Steps to Construct a Histogram:

1. Find the class boundaries and frequencies for each class.
2. Draw and label the x-axis (class boundaries) and y-axis (frequencies).
3. Draw a contiguous vertical bar for each class, with height equal to its frequency.
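The bar heights are simply class frequencies, which can be computed directly. A minimal Python sketch, with made-up exam scores and class boundaries:

import numpy as np

scores = [62, 75, 88, 91, 54, 70, 77, 83, 66, 95, 72, 59]   # made-up exam scores
boundaries = [49.5, 59.5, 69.5, 79.5, 89.5, 99.5]           # class boundaries
freq, _ = np.histogram(scores, bins=boundaries)             # histogram bar heights
for lo, hi, f in zip(boundaries[:-1], boundaries[1:], freq):
    print(f"{lo}-{hi}: frequency = {f}, midpoint = {(lo + hi) / 2}")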
Frequency Polygons

A frequency polygon is a line graph used to represent the same data as a


histogram but with points instead of bars. It is useful for showing trends and
comparing multiple distributions.

Key Features:

● Uses midpoints of class intervals instead of boundaries.


● Each class frequency is plotted as a point.
● The points are connected with line segments.
● A line extends to the x-axis at both ends for completion.

Steps to Construct a Frequency Polygon:

1. Calculate midpoints for each class:
   Midpoint = (Lower boundary + Upper boundary) / 2
2. Draw and label the x-axis (midpoints) and y-axis (frequencies).
3. Plot points for each class frequency.
4. Connect the points with a continuous line.
5. Extend the line back to the x-axis at the beginning and end.

Example:

Ogives (Cumulative Frequency Graphs)

An ogive (pronounced o-jive) is a graph that represents cumulative frequencies, showing the number of data values below a specific boundary.

Key Features:

● Plots cumulative frequency rather than individual frequency.


● The points are connected with a line, forming a rising curve.
● The x-axis represents class boundaries.

● The y-axis represents cumulative frequency.

Steps to Construct an Ogive:

1. Find cumulative frequencies for each class by adding up previous frequencies.
2. Draw and label the x-axis (upper class boundaries) and y-axis (cumulative
frequency).
3. Plot cumulative frequency points at each upper class boundary.
4. Connect the points with a smooth increasing curve.
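Step 1 is just a running total. A minimal Python sketch with made-up class frequencies:

from itertools import accumulate

freq = [3, 7, 12, 6, 2]               # made-up class frequencies
cum = list(accumulate(freq))          # cumulative frequencies: [3, 10, 22, 28, 30]
# each cumulative value is plotted at the upper boundary of its class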

Example:

Distribution Shapes

When analyzing data, understanding the shape of its distribution is crucial. It helps in selecting the appropriate statistical methods for analysis. Here are some common types of distribution shapes:

1. Bell-Shaped (Mound-Shaped) Distribution


o A single peak in the center that tapers off symmetrically on both ends.
o Common in natural phenomena, such as heights and test scores.
2. Uniform Distribution
o The data values are spread evenly across the range, forming a flat
shape.
o Often seen when outcomes have equal likelihoods.
3. J-Shaped Distribution
o Few values on one end with a gradual increase towards the other end.
o The reverse J-Shaped distribution is its opposite.
4. Skewed Distributions
o Right-Skewed (Positively Skewed): The peak is on the left, and
values taper off to the right.
o Left-Skewed (Negatively Skewed): The peak is on the right, and
values taper off to the left.
5. Bimodal Distribution
o Contains two peaks, indicating two dominant values in the dataset.
o Suggests a mixture of two different groups within the data.
6. U-Shaped Distribution
o Data values are concentrated at both extremes with a dip in the middle.
o Can indicate polarization in certain studies.

Other Types of Graphs
1. Bar Graph

 Uses vertical or horizontal bars to represent data.


 Ideal for categorical or qualitative variables.
 The length or height of bars corresponds to frequency or value.

Example:

2. Pareto Chart

 Similar to a bar graph but arranged from highest to lowest frequency.


 Helps identify which categories are most significant in a dataset.
 Used in quality control and business analysis.

Example:

3. Time Series Graph

 Displays data points over time to reveal trends or patterns.


 Typically shown as a line chart.
 Used for analysing temperature changes, stock prices, or economic growth.

Example:

4. Pie Graph

 A circular chart divided into sections representing proportions of a whole.


 Ideal for showing percentages or parts of a dataset.
 Often used in business, economics, and survey results.

Example:

5. Stem-and-Leaf Plot

 A way to organize numerical data while retaining actual values.


 Helps in sorting and analyzing distributions.
 Provides a clear representation of individual data points.

Example:

Measures of Central Tendency
Measures of central tendency summarize a dataset by identifying a central or
typical value. The four primary measures are mean, median, mode, and midrange.
Each has distinct properties and is useful in different situations.

1. Mean (Arithmetic Average)

The mean is the sum of all values in a dataset divided by the total number of
values. It is a widely used measure because it considers every data point.

Formula:

Mean: x̄ = ∑X / n

where:

 ∑X represents the sum of all data values
 n is the total number of data points

Example 1:

Example 2:

Advantages:

 Uses all values in the dataset, providing a comprehensive summary.


 Ideal for mathematical operations such as variance and standard deviation
calculations.

Disadvantages:

 Sensitive to outliers (extreme values can distort the mean).


 Cannot be used for categorical data (e.g., colors, names).

2. Median (Middle Value)

The median is the value that falls exactly in the middle when data is arranged
in ascending order. If there is an even number of values, the median is the average
of the two middle values.

Steps to Find the Median:

1. Arrange the data in ascending order.


2. Identify the middle value.
a. If odd number of values → pick the middle value.
b. If even number of values → find the average of the two middle values.

Advantages:

 Less affected by outliers than the mean.


 Useful in skewed distributions (such as income levels).

Disadvantages:

 Does not consider all values in the dataset.

Example 1:

Example 2:

3. Mode (Most Frequent Value)

The mode is the value that appears most frequently in a dataset. A dataset can be:

 Unimodal: One mode (single most frequent value).


 Bimodal: Two modes (two values appear most frequently).
 Multimodal: More than two modes.
 No mode: If all values occur equally, the dataset has no mode.

Example 1:

Example 2:

Advantages of the Mode:

 The only measure of central tendency that can be used for categorical data (e.g., the most common car brand in a survey).
 Identifies popular trends in datasets.

Disadvantages:

 Can have multiple modes, making interpretation complex.


 Not always representative of the dataset.

Example 3: Finding Mean, Median and Mode

4. Midrange (Simplest Measure)

The midrange is the average of the smallest and largest values in a dataset.

Example 1:

Example 2:

Advantages of Midrange:

 Quick and easy to calculate.


 Sometimes provides a useful approximation of the central value.

Disadvantages:

 Extremely sensitive to outliers, which can distort the midrange value.


 Often not used in statistical analysis due to inaccuracy.

5. Weighted Mean

The weighted mean is a type of average that accounts for the importance (or
frequency) of each value in a dataset. It is useful when different values carry different
levels of significance.

Example:
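A minimal Python sketch of all five measures, using the standard library (the data values, grades, and credit weights are made-up):

import statistics

data = [4, 7, 7, 9, 12]                              # made-up data set
print(statistics.mean(data))                         # mean = 39 / 5 = 7.8
print(statistics.median(data))                       # middle value = 7
print(statistics.mode(data))                         # most frequent value = 7
print((min(data) + max(data)) / 2)                   # midrange = (4 + 12) / 2 = 8.0

grades, credits = [80, 90, 70], [2, 3, 5]            # weighted mean: grades weighted by credits
wmean = sum(g * c for g, c in zip(grades, credits)) / sum(credits)
print(wmean)                                         # (160 + 270 + 350) / 10 = 78.0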

Unit – 3
Classical Statistical Tests
Classical statistical tests are fundamental tools used in hypothesis testing and
data analysis. These tests help determine relationships between variables, identify
differences between groups, and assess statistical significance.

Z – Test
A Z-test is a type of hypothesis test that compares the sample’s average to the
population’s average. The resulting Z-score tells us how far the sample average is from
the population average, measured against how much the data normally varies. It is
particularly useful when the sample size is large (n > 30). The Z-score, also known as
the Z-statistic, is computed as:

Z-Score = (x̄ − μ) / σ

where,
 x̄ : mean of the sample.
 μ : mean of the population.
 σ : Standard deviation of the population.

Example:
The average family annual income in India is 200k with a standard deviation of 5k and
the average family annual income in Delhi is 300k.

Z-Score = (300 − 200) / 5 = 20
Steps to perform Z-test
 First step is to identify the null and alternate hypotheses.
 Determine the level of significance (α).
 Find the critical value of z in the z-test.
 Calculate the z-test statistic using the formula above, and compare it with the critical value.

Types of Z-Tests:

1. One-Sample Z-Test

 Used when comparing the mean of a single sample to a known population mean.
 Example: A professor wants to test if the average exam score in their class is
significantly different from the national average of 75.

Z = (x̄ − μ) / (σ / √n)

2. Two-Sample Z-Test (Independent Z-Test)

 Compares the means of two independent samples to check if there is a


significant difference.
 Example: Comparing the average income of employees in two different
companies to see if one company pays more.
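A minimal Python sketch of the one-sample z-test (the numbers are made-up; scipy supplies the normal distribution):

from math import sqrt
from scipy.stats import norm

xbar, mu, sigma, n = 78.0, 75.0, 8.0, 64    # made-up sample mean, population mean/sd, sample size
z = (xbar - mu) / (sigma / sqrt(n))         # z = 3.0 / 1.0 = 3.0
p = 2 * norm.sf(abs(z))                     # two-tailed p-value ≈ 0.0027
# p < 0.05, so H0 (sample mean equals population mean) would be rejected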
T-Test (Student’s T-Test)
A t-test is a statistical test used to compare the means of two groups and
determine whether their differences are statistically significant. It is commonly used
when the sample size is small (n < 30) and when data follows a normal
distribution.

Types of T-Tests

1. Independent Samples T-Test

 Used when comparing the means of two different groups.


 Assumes that the two groups are independent and not related.
 Example: Comparing test scores between male and female students.

2. Paired Samples T-Test (Dependent T-Test)

 Used when comparing means within the same group before and after a
treatment or intervention.
 Also called the matched-pairs t-test.
 Example: Measuring weight before and after a fitness program.

3. One-Sample T-Test

 Compares the mean of a single sample to a known population mean.


 Example: Testing whether the average exam score of a class matches a
national standard.

T-Test Formula

The formula for a t-test depends on the type used, but the general (two-sample) form is:

t = (x̄₁ − x̄₂) / √( s₁²/n₁ + s₂²/n₂ )

Where:

 x̄₁, x̄₂ = Sample means
 s = Standard deviation
 n = Sample size

For paired t-tests, the difference between paired values is used instead of two
separate sample means.
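A minimal Python sketch of the three variants using scipy (the scores are made-up):

from scipy import stats

group1 = [23, 25, 28, 30, 22, 27]                        # made-up scores
group2 = [31, 29, 35, 32, 30, 34]
t, p = stats.ttest_ind(group1, group2)                   # independent samples t-test
t1, p1 = stats.ttest_1samp(group1, popmean=25)           # one-sample t-test
before, after = [70, 72, 68, 75], [66, 70, 65, 71]
t2, p2 = stats.ttest_rel(before, after)                  # paired samples t-test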

F-Test (Variance Ratio Test)


The F-test is a statistical test used to compare variances between groups and
determine whether they are significantly different. It is commonly used in analysis of
variance (ANOVA) and regression analysis.

Formula: F = s₁² / s₂²   (the larger sample variance is placed in the numerator)

Purpose of the F-Test

1. Comparing two population variances: To check if one group has more variation than another.
2. ANOVA (Analysis of Variance): To compare multiple group means.
3. Regression Analysis: To test the significance of overall model fit.
Types of F-Tests

1. F-Test for Equality of Variances

 Compares the variances of two samples.


 Used before performing a t-test to verify equal variances.
 Example: Testing whether the variability in test scores differs between two
schools.

2. F-Test in ANOVA

 Tests if the means of multiple groups are significantly different.


 Uses the ratio of between-group variance to within-group variance.
 Example: Comparing the effectiveness of three teaching methods based on
student scores.

3. F-Test in Regression Analysis

 Determines if an entire regression model is statistically significant.


 Tests whether predictors contribute to explaining variation in the dependent
variable.
 Example: Evaluating whether advertising budget significantly impacts sales.
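A minimal sketch of the equality-of-variances F-test in Python (made-up samples; scipy's F distribution gives the p-value):

import numpy as np
from scipy import stats

a = [12, 15, 11, 14, 13, 16]                 # made-up samples
b = [10, 22, 9, 18, 25, 14]
F = np.var(b, ddof=1) / np.var(a, ddof=1)    # larger sample variance in the numerator
dfn, dfd = len(b) - 1, len(a) - 1
p = 2 * stats.f.sf(F, dfn, dfd)              # two-tailed p-value for H0: equal variances
# for ANOVA-style comparisons of several means, see scipy.stats.f_oneway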

Goodness of Fit-test
A goodness of fit test is used to determine whether sample data fits a
specific distribution or model. Commonly used goodness of fit tests include
Chi-square, Kolmogorov-Smirnov, Anderson-Darling, and Shapiro-Wilk. These
tests help measure how well observed data correspond to the expected
values from a model.
1. Anderson-Darling Test
The Anderson-Darling test (AD-Test) is used to test if a sample of data came
from a population with a specific distribution. It is a modification of the Kolmogorov-
Smirnov (K-S) test and gives more weight to the tails than does the K-S test.

The AD-test is a measure of how well your data fits a specified distribution.
It’s commonly used as a test for normality.

The hypotheses for the AD-test are:

 H0: The data comes from a specified distribution.


 H1: The data does not come from a specified distribution.

Formula:

A² = −n − (1/n) Σ (2i − 1) [ ln F(Xᵢ) + ln(1 − F(X₍n+1−i₎)) ]   (sum over i = 1, …, n)

Where:

n = the sample size,

F(x) = CDF for the specified distribution,

i = the ith sample, calculated when the data is sorted in ascending order.

2. Chi-Square Test
A chi-square (Χ2) goodness of fit test is a goodness of fit test for a categorical
variable. Goodness of fit is a measure of how well a statistical model fits a set of
observations.

 When goodness of fit is high, the values expected based on the model are close
to the observed values.
 When goodness of fit is low, the values expected based on the model are far
from the observed values.
The statistical models that are analyzed by chi-square goodness of fit tests are
distributions. They can be any distribution, from as simple as equal probability for all
groups, to as complex as a probability distribution with many parameters.

The chi-square goodness of fit test is a hypothesis test. It allows you to draw
conclusions about the distribution of a population based on a sample.

The hypotheses for the chi-square goodness of fit test are:

 (H0): The population follows the specified distribution.


 (Ha): The population does not follow the specified distribution.

Formula:

Χ² = Σ (O − E)² / E

Where,

 O = the observed frequency in each category
 E = the expected frequency under the specified distribution
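A minimal Python sketch testing a fair-die model with made-up roll counts:

from scipy.stats import chisquare

observed = [18, 22, 20, 25, 15, 20]     # made-up counts from 120 die rolls
expected = [20] * 6                     # fair-die model: equal probability for each face
chi2, p = chisquare(f_obs=observed, f_exp=expected)
# a large p-value means the observed counts are consistent with the fair-die model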

3. Kolmogorov–Smirnov Test
The Kolmogorov–Smirnov test is an efficient way to determine whether two
samples differ significantly from each other. It is commonly used to check the
uniformity of random numbers. Uniformity is one of the most important properties of
any random number generator, and the Kolmogorov–Smirnov test can be used to
verify it.
The Kolmogorov–Smirnov test is versatile and can be employed to evaluate
whether two underlying one-dimensional probability distributions differ. It serves as an
effective tool to determine the statistical significance of differences between two sets of
data.

Formula:

P(√n · D ≤ x) ≈ 1 − 2 Σ (−1)^(k−1) e^(−2k²x²)   (sum over k = 1, 2, 3, …)

Where,
 n is the sample size.
 x is the normalized Kolmogorov-Smirnov statistic.
 k is the index of summation in the series.

Uses of Kolmogorov-Smirnov Test:


1. Comparison of Probability Distributions: The test is used to evaluate whether
two samples exhibit the same probability distribution.
2. Compare the shape of the distributions: If we assume that the shapes or
probability distributions of the two samples are similar, the test assesses the
maximum absolute difference between the cumulative probability distributions of
the two functions.
3. Check Distributional Differences: The test quantifies the maximum difference
between the cumulative probability distributions, and a higher value indicates
greater dissimilarity in the shape of the distributions.
4. Hypothesis Testing Types: The assessment of the shape of sample data is
typically done through hypothesis testing, which can be categorized into two types:
1. Parametric Test
2. Non-Parametric Test
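A minimal Python sketch of both forms of the test (simulated data; scipy computes D and its p-value):

import numpy as np
from scipy.stats import kstest, ks_2samp

rng = np.random.default_rng(0)
sample = rng.normal(loc=0, scale=1, size=200)             # simulated data
D, p = kstest(sample, "norm")                             # one-sample test vs. standard normal CDF
D2, p2 = ks_2samp(sample, rng.uniform(-3, 3, size=200))   # two-sample test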
4. Ryan-Joiner Test
The Ryan-Joiner (RJ) test for Normality is very similar to the Shapiro-Wilk test,
but the authors claim it is simpler to implement in software and to explain to users, since
it is simply a version of the correlation between the sample data, yi, and bi, the ith
percentage point of the Normal distribution.

Since the mean of the b values is 0, we can simplify this expression (ignoring the shift of
the y-values by their mean) to:

R = Σ yi bi / ( s √(n − 1) √(Σ bi²) )

where s is the sample standard deviation.

5. Shapiro-Wilk Test
Shapiro-Wilk test is a hypothesis test that evaluates whether a data set is
normally distributed. It evaluates data from a sample with the null hypothesis that the
data set is normally distributed. A large p-value indicates the data are consistent with
a normal distribution; a low p-value indicates that they are not.

W = ( ∑ wᵢ X′ᵢ )² / ∑ (Xᵢ − X̄)²   (sums over i = 1, …, n; X′ᵢ are the ordered values, wᵢ the Shapiro-Wilk weights)
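A minimal Python sketch (simulated measurements; scipy also exposes the Anderson-Darling test from section 1):

import numpy as np
from scipy.stats import shapiro, anderson

x = np.random.default_rng(1).normal(50, 5, size=100)   # simulated measurements
W, p = shapiro(x)                  # H0: the data are normally distributed
result = anderson(x, dist="norm")  # Anderson-Darling statistic with critical values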


6. Jarque-Bera Test
The Jarque-Bera test [JAR1] is a two-sided goodness-of-fit test for Normality
suitable when a fully-specified null distribution is unknown and its parameters must be
estimated. It is based on the sample skewness and kurtosis and was developed for use in
connection with regression analysis. The test statistic is:

JB = (n / 6) · [ s² + (k − 3)² / 4 ]

where n is the sample size, s is the sample skewness, and k is the sample
kurtosis. For (very) large sample sizes, the test statistic has a chi-square distribution
with two degrees of freedom, but more generally its distribution is obtained via Monte
Carlo simulation.

7. Lilliefors Test

Lilliefors is essentially a variant of the Kolmogorov-Smirnov (K-S) test. It tests to see if a sample comes from a distribution in the Normal family with unknown population mean and variance (these are estimated from the sample), against the alternative that it does not come from a Normal distribution.

With the K-S test it is assumed that the distribution parameters are known, which
is often not the case. Following Lilliefors, the test procedure is as follows: Given a
sample of n observations, one determines D, where:

D = sup |F(x) − G(x)|

where sup means supremum, or largest value of a set, with G(x) being the
sample cumulative distribution function and F(x) is the cumulative Normal distribution
function with mean μ=the sample mean, and variance σ2=the sample variance, defined
with denominator n-1.
Unit – 4
Chapter 10

Correlation and Regression


1. Correlation
• Correlation measures the strength and direction of the relationship between two
variables.
• It ranges from -1 to 1:
o Positive correlation (+1) means both variables move in the same direction
(e.g., as temperature increases, ice cream sales increase).
o Negative correlation (-1) means they move in opposite directions (e.g., as
speed increases, travel time decreases).
o Zero correlation (0) means there is no relationship between the variables.
• The Pearson correlation coefficient is commonly used to quantify the relationship.
• Correlation does not imply causation—it only indicates association.

2. Regression
• Regression is used to predict one variable (dependent) based on another
(independent).
• Simple linear regression follows the formula:
Y = bX + a
o Y is the dependent variable (predicted value).
o X is the independent variable (input).
o b is the slope (rate of change).
o a is the intercept (starting point).
• It helps in forecasting future trends and making informed decisions.

Assumptions in Correlation & Regression


o Linearity: The relationship between the variables must be linear.
o Normality: The data should be normally distributed.
o Independence: Observations should be independent of each other.
o Homoscedasticity: The variance of residuals (errors) should remain constant.

Scatterplots
A scatter plot is a type of graph used in statistics to visually represent the relationship
between two numerical variables. It helps identify patterns, trends, and possible correlations.

A scatter plot consists of a set of points plotted on a Cartesian coordinate system,


where:

• The x-axis represents the independent variable.


• The y-axis represents the dependent variable.

Each point on the scatter plot corresponds to a pair of values (x, y) and shows
how one variable changes in relation to another.

It is generally used to plot the relationship between one independent variable and one
dependent variable, where an independent variable is plotted on the x-axis and a dependent
variable is plotted on the y-axis so that you can visualize the effect of the independent
variable on the dependent variable. These plots are known as Scatter Plot Graph or Scatter
Diagram.

Steps to Construct a Scatter Plot

1. Collect Data: Gather numerical values for two variables.


2. Set up Axes: Define the x-axis (independent variable) and y-axis (dependent
variable).
3. Scale Axes: Adjust values to fit the range of data.
4. Plot Points: Each pair (x, y) is marked on the graph.
5. Analyze Pattern: Look for trends or relationships.

Types of Relationships in Scatter Plots


Scatter plots show different types of relationships:
1. Positive Correlation

• As x increases, y also increases (e.g., studying more leads to higher grades).


• Appears as a rising trend on the plot.
2. Negative Correlation

• As x increases, y decreases (e.g., more absences lead to lower test scores).


• Appears as a falling trend.
3. No Correlation

• No clear pattern exists between variables (e.g., shoe size vs. IQ).
4. Curvilinear Relationship

• Data points form a curve rather than a straight line.


• Example: Population growth over time in a specific region.

Identifying Strength of Correlation

The tightness of the points indicates how strong the relationship is:

• Strong correlation: Points closely follow a line.


• Weak correlation: Points are loosely scattered.
• No correlation: Points appear random.

A correlation coefficient (r) can quantify this strength:

• r = +1 → Perfect positive correlation


• r = -1 → Perfect negative correlation
• r ≈ 0 → No correlation

Example 1:
Example 2:
Correlation Coefficient
Correlation measures how strong a relationship is between two variables.

• The correlation coefficient (r) quantifies this relationship:


o +1 = Perfect positive correlation
o -1 = Perfect negative correlation
o 0 = No correlation

Formula:

r = [ n∑xy − (∑x)(∑y) ] / √( [ n∑x² − (∑x)² ] [ n∑y² − (∑y)² ] )

where:
• n = number of data points
• x, y = individual data values
Example:
Hypothesis Testing for Correlation
To determine if a correlation is significant or just due to chance, we use hypothesis testing:
• Null Hypothesis (H0): There is no correlation (ρ=0).
• Alternative Hypothesis (H1): There is a significant correlation (ρ≠0).
Test Statistic Formula:

t = r √(n − 2) / √(1 − r²)

where n−2 is the degrees of freedom. If the test statistic exceeds the critical value, we reject
H0, meaning the correlation is statistically significant.
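A minimal Python sketch computing r and its significance (made-up paired data; pearsonr's p-value corresponds to the t statistic above):

from math import sqrt
from scipy.stats import pearsonr

x = [1, 2, 3, 4, 5, 6]                       # made-up paired data
y = [2, 4, 5, 4, 5, 7]
r, p = pearsonr(x, y)                        # r and two-tailed p-value
n = len(x)
t = r * sqrt(n - 2) / sqrt(1 - r**2)         # the test statistic defined above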

Regression
Regression is a statistical method used to study the relationship between two or more
variables, helping researchers predict outcomes based on observed data. The goal of
regression analysis is to identify trends and make data-driven forecasts.

Key Concepts in Regression Analysis


1. Scatter Plot & Relationship Identification

o Before performing regression analysis, data is collected and plotted on a scatter plot to determine the type of relationship between variables.
o Possible relationships:
▪ Positive linear (both variables increase together)
▪ Negative linear (one variable increases while the other decreases)
▪ Curvilinear (relationship follows a curve)
▪ No relationship (random pattern)

2. Correlation Coefficient Calculation

o The correlation coefficient (r) measures the strength of the relationship between two variables.
o A high correlation indicates a strong relationship suitable for regression
modeling.
3. Regression Line (Line of Best Fit)

o If the correlation is significant, the next step is to find the equation of the
regression line.
o The regression line minimizes prediction errors and helps make forecasts
based on observed trends.
o The formula for simple linear regression:

Y = bX + a

where:

• Y = dependent variable (predicted value)


• X = independent variable (input value)
• b = slope (rate of change)
• a = intercept (starting value)

Example: Find the linear regression equation for the given data:

x y
3 8
9 6
5 4
3 2
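A worked solution using the standard least-squares formulas, b = [ n∑xy − (∑x)(∑y) ] / [ n∑x² − (∑x)² ] and a = (∑y − b∑x) / n:

n = 4, ∑x = 20, ∑y = 20, ∑xy = 104, ∑x² = 124
b = (4·104 − 20·20) / (4·124 − 20²) = 16 / 96 ≈ 0.167
a = (20 − 0.167·20) / 4 ≈ 4.167
Regression equation: Y ≈ 0.167X + 4.167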
Analysis of Variance (ANOVA)
ANOVA is a statistical test used to examine differences among the means of three or
more groups.
Unlike a t-test, which only compares two groups, ANOVA can handle multiple
groups in a single analysis, making it an essential tool for experiments with more than two
categories.

ANOVA helps analyze variation within groups and between groups to understand if
observed differences are due to random chance or an actual effect.

Key Concepts and Formulas in ANOVA


• Sum of Squares (SS): This measures the overall variability in the dataset.
• SS Total: Total variability across all observations.
• SS Between (SSB): Variability due to the differences between group means.
• SS Within (SSW): Variability within each group, showing how scores differ within
individual groups.
• Mean Square (MS): The average of squared deviations, calculated for both between-group and within-group variability.

• Degrees of Freedom (df): The number of values that are free to vary when calculating
statistics.

• F-Ratio: The ratio of MSB to MSW, used to test the null hypothesis.

• P-Value: This probability value helps determine if the F-ratio is significant. A small p-value (e.g., <0.05) suggests significant differences between groups.

Types of ANOVA:

1. One-Way ANOVA:
a. Examines the effect of a single independent variable on the dependent
variable.
b. Example: Comparing test scores of students across three teaching methods.
2. Two-Way ANOVA:
a. Analyzes the impact of two independent variables simultaneously and their
interaction.
b. Example: Studying the combined effects of diet and exercise on weight loss.
3. Repeated Measures ANOVA:
a. Used when the same subjects are measured multiple times under different
conditions.
b. Example: Tracking blood pressure levels of patients before, during, and after
medication.

ANOVA Table
An ANOVA (Analysis of Variance) test table is used to summarize the results of an
ANOVA test, which is used to determine if there are any statistically significant differences
between the means of three or more independent groups. Here’s a general structure of an
ANOVA table:
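A typical one-way layout (k = number of groups, N = total number of observations):

Source of Variation    SS     df      MS                    F
Between groups         SSB    k − 1   MSB = SSB / (k − 1)   F = MSB / MSW
Within groups (error)  SSW    N − k   MSW = SSW / (N − k)
Total                  SST    N − 1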

One-way Analysis of Variance

One-Way ANOVA is a statistical method used to compare the means of three or more independent groups to determine if there is a significant difference among them. It helps evaluate whether variations between group means occur due to real differences or just random chance.

Key Concepts
1. Independent Variable (Factor) – The categorical variable that divides the data into
groups (e.g., teaching methods).

2. Dependent Variable – The numerical variable being measured (e.g., student test
scores).

3. Hypothesis Testing – Determines whether the group means significantly differ:

a. Null Hypothesis (H0): All group means are equal (μ₁ = μ₂ = μ₃).

b. Alternative Hypothesis (H1): At least one group mean differs.

Example:
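A minimal Python sketch of a one-way ANOVA for three made-up teaching methods:

from scipy.stats import f_oneway

method_a = [85, 86, 88, 75, 78]       # made-up scores under each teaching method
method_b = [80, 82, 84, 79, 88]
method_c = [60, 65, 70, 72, 68]
F, p = f_oneway(method_a, method_b, method_c)
# a small p-value (< 0.05) means at least one group mean differs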
Two-way Analysis of Variance
Two-way ANOVA is a statistical technique used to examine the effects of two
independent variables (factors) on a dependent variable simultaneously. It helps determine
if there are significant differences between group means based on two categorical factors
and whether these factors interact.

Key Concepts

1. Independent Variables (Factors) – The two categorical variables influencing the dependent variable.
a. Example: Diet type and exercise level affect weight loss.
2. Dependent Variable – The numerical outcome being measured.
a. Example: Weight loss (kg).
3. Main Effects – The impact of each independent variable on the dependent variable.
a. Example: Does diet alone affect weight loss? Does exercise alone affect
weight loss?
4. Interaction Effect – If both independent variables together have a combined effect
beyond their individual impacts.
a. Example: Does the combination of diet and exercise lead to different weight
loss results compared to each factor alone?

Example:
Unit – 5
Statistical Packages
A statistical package is a software application designed for performing statistical analysis,
data visualization, and interpretation. These packages provide a range of tools for conducting
descriptive statistics, inferential statistics, regression analysis, hypothesis testing, and
data modeling.

Key Features of Statistical Packages

1. Data Management – Import, clean, and manipulate datasets.


2. Descriptive Statistics – Calculate measures like mean, median, mode, standard
deviation, and variance.
3. Inferential Statistics – Perform hypothesis tests, confidence intervals, and ANOVA.
4. Regression & Correlation Analysis – Understand relationships between variables.
5. Graphical Visualization – Generate histograms, scatter plots, box plots, and more.
6. Machine Learning & Advanced Analytics – Some packages offer predictive
modeling.

SPSS
SPSS (Statistical Package for the Social Sciences) is a software application widely
used for statistical analysis, data management, and graphical representation of data. It
was originally designed for social sciences research but has now become a popular tool in
business, healthcare, education, and many other fields.

Key Features of SPSS

SPSS offers various tools for data processing, including:

• Data Management: Users can import, clean, and manipulate data efficiently.
• Descriptive Statistics: Calculation of mean, median, mode, standard deviation, and
frequency distributions.
• Inferential Statistics: Hypothesis testing, confidence intervals, and ANOVA.
• Regression Analysis: Used to predict outcomes and identify relationships between
variables.
• Data Visualization: Generates histograms, scatter plots, box plots, and pie charts.
• Factor Analysis & Clustering: Helps in segmentation and identifying patterns in
large datasets.
Components of SPSS
SPSS is divided into different modules, each designed for specific statistical tasks:

• SPSS Statistics Base: The core module for basic data analysis.
• SPSS Modeler: Used for predictive modeling and data mining.
• SPSS Text Analytics: Helps analyze text-based data like surveys and social media
comments.
• SPSS Amos: Used for structural equation modeling and advanced multivariate
analysis.
• SPSS Custom Tables: Creates customized reports and tables for data interpretation.

Working of SPSS
SPSS operates through two primary views:

• Data View: Displays raw data in spreadsheet format (rows = cases, columns =
variables).
• Variable View: Allows users to define data attributes such as names, labels, types,
and measurement scales.

Users can perform analyses using menus and built-in functions, or write scripts in syntax
for complex tasks.

Applications of SPSS
SPSS is used across industries for various purposes:

• Social Science Research: Analyzing survey responses and behavioral studies.


• Healthcare: Studying treatment effectiveness and disease patterns.
• Education: Assessing student performance and institutional statistics.
• Business & Marketing: Consumer behavior analysis, market segmentation, and sales
forecasting.

MS-Excel
Microsoft Excel is a powerful spreadsheet application used for data management,
analysis, and visualization. While primarily known for business applications, Excel
provides a robust set of statistical functions and tools for performing various types of
calculations and statistical tests.
Key Features for Statistical Analysis in Excel
• Data Entry & Management: Stores large datasets, enables sorting and filtering.
• Descriptive Statistics: Calculates mean, median, mode, variance, and standard
deviation.
• Inferential Statistics: Performs hypothesis testing, confidence intervals, and
ANOVA.
• Regression & Correlation Analysis: Analyzes relationships between variables.
• Graphical Visualization: Generates scatter plots, histograms, and trend lines.
• Data Analysis ToolPak: Includes specialized statistical functions like t-tests and
regression modeling.

How Excel Performs Statistical Analysis


Excel provides built-in formulas and tools for statistical calculations, such as:

Basic Statistics

• Mean: =AVERAGE(range)

• Median: =MEDIAN(range)

• Standard Deviation: =STDEV.P(range) or =STDEV.S(range)

• Variance: =VAR.P(range) or =VAR.S(range)

Correlation and Regression

• Correlation: =CORREL(range1, range2)

• Simple Linear Regression: Can be done using trendline functions in scatter plots.

Hypothesis Testing & ANOVA

• t-Tests: =T.TEST(range1, range2, tails, type)

• ANOVA: Available in the Data Analysis ToolPak

Graphing Data in Excel


• Scatter Plots: Used to analyze relationships between variables.
• Histograms: Helps visualize frequency distributions.
• Trend Lines: Used in regression analysis.
Applications of Excel for Statistical Analysis
• Business & Finance – Analyzing revenue trends and sales forecasts.
• Healthcare & Medicine – Studying the effects of treatments based on patient data.
• Education – Examining student performance statistics.
• Manufacturing & Quality Control – Identifying patterns in production efficiency.

SAS
SAS (Statistical Analysis System) is a powerful data analytics software used for statistical
analysis, data management, and predictive modeling. It is widely used in business,
healthcare, government, and scientific research due to its robust data handling capabilities
and ability to process large datasets efficiently.

Key Features of SAS for Statistical Analysis


• Data Management: SAS can handle, clean, and manipulate large datasets with ease.
• Descriptive Statistics: Computes measures like mean, median, mode, standard deviation, and variance.
• Inferential Statistics: Performs hypothesis testing, confidence intervals, and ANOVA.
• Regression Analysis: Simple, multiple, and logistic regression models.
• Time-Series Analysis: Helps forecast trends in business and financial data.
• Machine Learning & AI Integration: SAS provides advanced analytics for predictive modeling.
• Data Visualization: Generates histograms, scatter plots, box plots, and trend analysis graphs.

Working of SAS
SAS operates through two main interfaces:

• SAS Programming (SAS Syntax): Users write code to execute analyses.


• SAS Enterprise Guide (GUI-Based Interface): Provides a user-friendly drag-and-drop interface for non-programmers.

Users load datasets, perform statistical operations, and generate reports and visualizations.

Statistical Procedures in SAS


SAS includes specialized PROC (Procedure) Statements for statistical analysis:

• PROC MEANS: Computes descriptive statistics (mean, variance, standard


deviation).
• PROC CORR: Performs correlation analysis between variables.
• PROC REG: Conducts regression modeling for predictions.
• PROC ANOVA: Performs analysis of variance (One-Way and Two-Way ANOVA).
• PROC TTEST: Executes hypothesis testing for mean differences.
• PROC GLM: Used for general linear models, including multiple regression and
ANOVA.

Users write SAS scripts to perform these procedures or use Enterprise Guide for
automated workflows.

Applications of SAS
• Healthcare & Medicine – Clinical trial analysis and disease prediction.
• Finance & Banking – Risk management and fraud detection.
• Marketing & Retail – Customer segmentation and sales forecasting.
• Education & Social Sciences – Analyzing survey data and academic performance
trends.
• Manufacturing & Supply Chain – Predicting production efficiency and optimizing
logistics.

R Programming
R is an open-source programming language specifically designed for statistical
computing and data analysis. It is widely used in academic research, business analytics,
machine learning, and scientific computing due to its flexibility, vast libraries, and
powerful visualization capabilities.

Key Features for Statistical Analysis in R


• Data Handling: Efficiently manages large datasets using built-in functions and
external libraries.
• Descriptive Statistics: Calculates mean, median, standard deviation, variance, and
percentiles.
• Inferential Statistics: Performs hypothesis tests, ANOVA, chi-square tests, and
more.
• Regression & Correlation Analysis: Supports simple and multiple regression
models.
• Data Visualization: Creates histograms, box plots, scatter plots, and complex graphs
using ggplot2.
• Machine Learning & Predictive Analytics: Includes models for classification,
clustering, and forecasting.
How R Performs Statistical Analysis
R uses built-in functions and external libraries to perform statistical computations:

Basic Statistics

• Mean: mean(data)
• Median: median(data)
• Standard Deviation: sd(data)
• Variance: var(data)

Correlation & Regression

• Correlation: cor(x, y)
• Linear Regression: lm(y ~ x, data = dataset)
• Multiple Regression: lm(y ~ x1 + x2, data = dataset)

Hypothesis Testing & ANOVA

• t-Tests: t.test(group1, group2)


• Chi-Square Test: chisq.test(x, y)
• ANOVA: aov(y ~ x, data = dataset)

Data Visualization

• Base R: plot(x, y)
• ggplot2:
o library(ggplot2)
o ggplot(data, aes(x, y)) + geom_point()

Applications of R for Statistical Analysis


• Scientific Research – Data analysis in biological and social sciences.
• Finance & Economics – Forecasting stock trends, economic modeling.
• Healthcare – Clinical trials, drug analysis.
• Business & Marketing – Customer segmentation, predictive analytics.
Minitab
Minitab is a statistical software widely used for data analysis, quality improvement,
and predictive analytics. It is designed for ease of use, making it popular among students,
researchers, and professionals working in manufacturing, healthcare, business, and
engineering.

Key Features of Minitab for Statistical Analysis


• Data Management: Easily imports and organizes datasets.
• Descriptive Statistics: Computes mean, median, standard deviation, variance, and
frequency distributions.
• Inferential Statistics: Performs hypothesis testing, ANOVA, chi-square tests, and
regression.
• Regression Analysis: Offers simple, multiple, and logistic regression modeling.
• Quality Control Tools: Provides Six Sigma and process improvement analyses
(control charts, capability analysis).
• Graphical Visualization: Generates histograms, box plots, scatter plots, and
probability plots.
• Multivariate Analysis: Supports principal component analysis (PCA) and factor
analysis.

How Minitab Performs Statistical Analysis


Minitab offers menu-driven operations, making it easier to use than code-based tools
like R or Python. Users can perform analysis through the following steps:

Basic Statistics

• Mean, Median, Standard Deviation: Found under Stat > Basic Statistics.
• Summary statistics for datasets can be obtained through Stat > Descriptive Statistics.

Regression & Correlation Analysis

• Simple Linear Regression: Stat > Regression > Fit Regression Model.
• Correlation Analysis: Stat > Basic Statistics > Correlation.

Hypothesis Testing & ANOVA

• t-Tests: Stat > Basic Statistics > t-test.


• Chi-Square Test: Stat > Tables > Chi-Square Test.
• ANOVA: Stat > ANOVA > One-Way or Two-Way ANOVA.

Data Visualization

• Scatter plots, histograms, and box plots can be accessed through Graph > Scatterplot.

Quality Control & Process Improvement

• Control Charts: Stat > Control Charts.


• Capability Analysis: Stat > Quality Tools > Capability Analysis.

Applications of Minitab for Statistical Analysis


• Manufacturing & Six Sigma – Process optimization and defect analysis.
• Healthcare & Medicine – Clinical trials and patient data analysis.
• Finance & Economics – Predictive modeling and risk assessment.
• Engineering & Research – Reliability testing and experimental design.
