Statistical Analysis - Discrete and Interval

This document provides a comprehensive analysis of data distribution, focusing on central tendency, dispersion, and shape through descriptive statistics. It explains the importance of understanding data types, constructing frequency distribution tables, and calculating measures of central tendency such as mean, median, and mode. The report emphasizes the significance of these concepts for transforming raw data into actionable insights for informed decision-making.

Uploaded by

Rahat Khan

Comprehensive Analysis of Data Distribution: Central Tendency, Dispersion, and Shape
Data analysis begins with understanding the fundamental characteristics of a dataset.
Descriptive statistics provide the tools to summarize and describe data, revealing its central
location, spread, and overall shape. This report offers an integrated exploration of key
descriptive measures—mean, median, mode, standard deviation, skewness, and
kurtosis—demonstrating their calculation and application through frequency distribution tables
for both discrete and interval data. A thorough understanding of these concepts is paramount for
transforming raw data into meaningful insights, which is essential for informed decision-making
across various disciplines.

1. Foundations: Data Types and Frequency Distributions
Effective statistical analysis hinges on a clear understanding of the data's nature and how it is
organized. This section establishes the groundwork by defining the relevant data types and
detailing the construction of frequency distributions, which are crucial for structuring raw data for
subsequent statistical computations.

1.1. Understanding Data Types: Discrete vs. Interval Data


Data can be broadly categorized based on the values it can assume and the scale of
measurement. Two primary types are discrete and interval data, each requiring specific
approaches for analysis.
Discrete Data refers to quantitative information that can only take on specific, distinct values.
These values are typically countable and often represent counts or categories. For instance, the
number of books read by individuals in a year or survey responses categorized as
"conservative," "moderate," or "liberal" are examples of discrete data. A discrete frequency
distribution is a tabular representation that systematically displays each individual outcome or
category along with the count of its occurrences within a given list. This method is also known
as an ungrouped frequency distribution.
Interval Data, often derived from continuous variables, represents measurements where the
difference between values is meaningful, but there is no true zero point (e.g., temperature in
Celsius). While continuous data can theoretically take any value within a range, for practical
analysis, especially with large datasets, it is frequently organized into defined class intervals.
The methods for analyzing continuous variables, particularly when grouped, differ significantly
from those applied to discrete data. For example, for continuous variables or ratio levels of
measurement, the mode may not be a particularly helpful measure of central tendency.
Examples of interval data include reaction times measured in milliseconds, heights, weights, or
student exam scores.
The nature of the data profoundly influences the selection of appropriate statistical measures
and calculation methodologies. Applying methods suitable for one data type to another can lead
to inaccurate statistical conclusions. Therefore, the initial step in any quantitative analysis
involves a careful assessment of the data's scale of measurement.

1.2. Discrete Frequency Distribution Tables


Discrete frequency distribution tables are fundamental for organizing and summarizing discrete
datasets, making raw data comprehensible and ready for statistical computation.
A discrete frequency distribution serves to organize and summarize categorical data into
meaningful categories and their corresponding frequencies, providing a clear representation of
how frequently each category occurs in a dataset.
The construction of a discrete frequency distribution typically follows a systematic process, often
employing a tally chart method:
1.​ Table Setup: A table is drawn with at least three columns: one for the unique data items,
one for tally marks, and one for the frequency.
2.​ Unique Item Listing: Each distinct item or value from the dataset is listed in the first
column, ensuring no repetitions.
3.​ Tallying Occurrences: The raw data is reviewed item by item. For each occurrence of an
individual item, a standing line is placed in the second column. After four standing lines,
the fifth is drawn diagonally across them to form a group of five, enhancing readability and
simplifying counting. This simple technique is an early form of grouping that improves
readability and reduces counting errors for larger datasets, underpinning more complex
data summarization methods used in extensive data processing.
4.​ Frequency Count: The tally marks for each item are counted, and the total count,
representing the frequency of occurrence, is recorded in the third column.
5.​ Cumulative Frequency (Optional): If desired, a cumulative frequency column can be
added. Cumulative frequency represents the total frequency up to a particular category,
including all preceding categories. This is particularly useful for understanding the
distribution of data and identifying cumulative patterns.
Illustrative Example: Number of Goals Scored by a Football Team in 15 Matches
Consider the following discrete data representing the number of goals scored by a football team
in 15 matches: 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 4, 0
Table 1.2.1: Discrete Frequency Distribution Table for Goals Scored

Goals Scored (x_i) | Frequency (f_i) | Cumulative Frequency (cf_i)
0                  | 8               | 8
1                  | 3               | 11
2                  | 2               | 13
3                  | 1               | 14
4                  | 1               | 15
Total              | N = 15          |
This table transforms raw, disorganized data into a structured format, enabling immediate visual
inspection of patterns (e.g., zero goals were the most frequent outcome) and serving as the
direct input for calculating central tendency and dispersion measures for discrete data.
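The tally procedure above maps directly onto a few lines of Python. This is a minimal sketch (the helper name `discrete_frequency_table` is ours), rebuilding Table 1.2.1 from the raw goals data:

```python
from collections import Counter

def discrete_frequency_table(data):
    """Ungrouped (discrete) frequency distribution with cumulative
    frequencies, mirroring the tally-chart procedure."""
    counts = Counter(data)           # steps 3-4: tally and count occurrences
    table, cumulative = [], 0
    for value in sorted(counts):     # step 2: list each unique item once
        cumulative += counts[value]  # step 5: running (cumulative) total
        table.append((value, counts[value], cumulative))
    return table

# Goals scored in 15 matches, as in the example above
goals = [0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 4, 0]
for value, freq, cf in discrete_frequency_table(goals):
    print(f"goals={value}  f={freq}  cf={cf}")
```

The sorted unique values, frequencies, and running totals correspond to the three numeric columns of the table.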

1.3. Grouped/Interval Frequency Distribution Tables


When dealing with continuous data or a very wide range of discrete values, organizing data into
class intervals to form a grouped frequency distribution becomes necessary. This approach
simplifies the dataset for easier analysis and visualization.
The construction of a grouped frequency distribution involves several steps:
1.​ Determine the Range: The range is calculated as the difference between the maximum
and minimum values in the dataset.
2.​ Determine the Number of Class Intervals (k): While there are formal methods like Sturges' rule (k ≈ 1 + 3.322 log₁₀ n), a common heuristic suggests using between 5 and 15 classes.
3.​ Calculate Class Interval Size (ℎ): The class width is determined by dividing the range by
the chosen number of class intervals. It is generally advisable to maintain consistent class
widths where possible.
4.​ Establish Class Boundaries/Limits: Define the lower and upper limits for each interval.
These intervals must be mutually exclusive (no overlap) and exhaustive (cover all data
points). The lower limit is the smallest value in the data class, and the upper limit is the
greatest data value within that class.
5.​ Tally Frequencies: Each observation from the raw data is assigned to its appropriate
class interval, and tally marks are made.
6.​ Record Frequencies: The tallies for each class are summed to obtain the frequency for
that interval.
7.​ Calculate Class Midpoints (𝑥𝑖): For each class interval, the midpoint (or class mark) is
calculated by adding the lower and upper limits of the class and dividing by 2. This
midpoint serves as the representative value for all observations within that class for
subsequent calculations.
8.​ Calculate Cumulative Frequency (𝑐𝑓𝑖): The cumulative frequency for a class is found by
adding its frequency to the cumulative frequency of the preceding class.
This process involves a fundamental compromise: while grouping makes large datasets
manageable and reveals overall patterns, it sacrifices some precision. The original raw data
points within an interval are no longer individually accessible; they are represented by a single
midpoint. Consequently, calculations derived from grouped data, such as the mean, standard
deviation, skewness, and kurtosis, are estimates rather than exact values from the original raw
data. This trade-off is a crucial consideration for the accuracy and generalizability of findings,
particularly with small datasets or wide class intervals.
Illustrative Example: Student Exam Scores (out of 100)
Consider the following raw scores of 50 students on a statistics exam: 65, 78, 92, 55, 81, 70, 63,
88, 75, 60, 95, 68, 72, 85, 50, 77, 80, 62, 71, 90, 58, 73, 86, 66, 79, 83, 52, 74, 89, 61, 91, 67,
76, 82, 59, 70, 84, 64, 77, 87, 56, 71, 93, 69, 75, 80, 57, 72, 85, 60
Steps to Construct Grouped Frequency Table:
●​ Minimum Score = 50, Maximum Score = 95.
●​ Range = 95 - 50 = 45.
●​ Let's choose 6 class intervals.
●​ Class Width = Range / Number of Classes = 45 / 6 = 7.5. We round up to 8 for
convenience and to ensure all data is covered.
●​ Start the first class at 50.
Table 1.3.1: Grouped Frequency Distribution Table for Student Exam Scores

Class Interval | Midpoint (x_i) | Frequency (f_i) | Cumulative Frequency (cf_i)
50-57          | 53.5           | 5               | 5
58-65          | 61.5           | 9               | 14
66-73          | 69.5           | 11              | 25
74-81          | 77.5           | 11              | 36
82-89          | 85.5           | 9               | 45
90-97          | 93.5           | 5               | 50
Total          |                | N = 50          |

Note: The class width of 8 makes the final interval (90-97) extend slightly beyond the maximum score of 95; the intervals remain mutually exclusive and exhaustive.
This table facilitates the analysis of large, continuous datasets by reducing complexity, allowing
for the estimation of population parameters that would be unwieldy to calculate from raw data.
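The construction steps above can be scripted as well. This sketch (the helper name `grouped_frequency_table` is ours, and it assumes inclusive integer class limits such as 50-57) tallies the 50 exam scores into the six classes chosen above:

```python
def grouped_frequency_table(data, start, width, n_classes):
    """Rows of ((lower, upper), midpoint, frequency, cumulative frequency)
    for inclusive integer class limits such as 50-57, 58-65, ..."""
    rows, cumulative = [], 0
    for k in range(n_classes):
        lower = start + k * width
        upper = lower + width - 1                      # inclusive upper limit
        freq = sum(1 for x in data if lower <= x <= upper)
        cumulative += freq
        rows.append(((lower, upper), (lower + upper) / 2, freq, cumulative))
    return rows

# The 50 exam scores from the example above
scores = [65, 78, 92, 55, 81, 70, 63, 88, 75, 60, 95, 68, 72, 85, 50,
          77, 80, 62, 71, 90, 58, 73, 86, 66, 79, 83, 52, 74, 89, 61,
          91, 67, 76, 82, 59, 70, 84, 64, 77, 87, 56, 71, 93, 69, 75,
          80, 57, 72, 85, 60]
rows = grouped_frequency_table(scores, start=50, width=8, n_classes=6)
```

Each row carries the class limits, the class midpoint used as the representative value, and the frequency and cumulative frequency columns of the table.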

2. Measures of Central Tendency


Measures of central tendency provide a single value that attempts to describe the "average" or
"typical" value of a dataset. The three most common measures are the mean, median, and
mode.

2.1. Mean
The mean, or arithmetic average, is the most commonly used measure of central tendency
because it incorporates all values in its calculation. It is calculated by summing all values in a
dataset and dividing by the total number of values. The mean represents the "balance point" of
a distribution.
Formulas:
●​ Raw Data (Sample Mean): For a sample dataset, the mean (x̄) is calculated as x̄ = Σx / n, where Σx is the sum of all individual values and n is the total number of values in the sample.
●​ Raw Data (Population Mean): For a population dataset, the mean (µ) is calculated as µ = Σx / N, where Σx is the sum of all values in the population and N is the total number of values in the population.
●​ Discrete and Grouped Frequency Distribution: When data is presented in a frequency distribution table, the mean is calculated as Mean = Σfx / Σf. Here, f represents the frequency of each value (for discrete data) or of each class interval (for grouped data), and x represents the value or class midpoint, respectively. The sum of frequencies (Σf) is equivalent to the total number of observations (N). For grouped data, this yields an estimated mean.
Application and Interpretation: The mean is highly sensitive to extreme values (outliers). A single unusually large or small value can significantly shift the mean. This sensitivity arises because the formula for frequency distributions, Σfx / Σf, effectively treats the mean as a weighted average, where each value (or midpoint) is weighted by its frequency. Values that appear more frequently will exert a greater pull on the mean, even if they are not numerically extreme. For a perfectly normal distribution, the mean, median, and mode are identical.
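As a quick check of the frequency-distribution formula Mean = Σfx / Σf, the computation can be sketched as follows (the helper name `mean_from_frequency` is ours):

```python
def mean_from_frequency(values, freqs):
    """Weighted mean Σfx / Σf; `values` holds the distinct outcomes
    (or class midpoints for grouped data), `freqs` their frequencies."""
    return sum(f * x for x, f in zip(values, freqs)) / sum(freqs)

# Goals-scored distribution (x = 0..4 with f = 8, 3, 2, 1, 1): mean = 14/15
m = mean_from_frequency([0, 1, 2, 3, 4], [8, 3, 2, 1, 1])
```

Passing class midpoints instead of discrete values gives the estimated mean for grouped data.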

2.2. Median
The median is the middle number in a dataset that has been ordered from lowest to highest.
Unlike the mean, the median is less affected by extreme values, making it a robust measure of
central tendency, particularly useful for skewed distributions or when outliers are present.
Formulas:
●​ Raw Data (Odd Number of Observations): If the total number of observations (n) is odd, the median is the value at the ((n + 1) / 2)th position.
●​ Raw Data (Even Number of Observations): If the total number of observations (n) is even, the median is the average of the two middle values: Median = [(n/2)th observation + (n/2 + 1)th observation] / 2.
●​ Discrete Frequency Distribution:
1.​ Calculate the cumulative frequency (cf) for each value.
2.​ Find N/2, where N is the total sum of frequencies.
3.​ The median is the value of the variable (x_i) whose cumulative frequency is the first to be greater than or equal to N/2.
●​ Grouped Frequency Distribution (Continuous Data): For grouped data, the median is calculated using the following formula after identifying the median class:
Median = l + ((N/2 − c) / f) × h
Where:
○​ l = lower boundary of the median class.
○​ N = total number of observations (Σf).
○​ c = cumulative frequency of the class before the median class.
○​ f = frequency of the median class.
○​ h = size of the class (class width). The median class is the first class interval whose cumulative frequency is greater than or equal to N/2.
Application and Interpretation: The median precisely divides the data into two halves, with
50% of the values falling below it and 50% above it. Its positional definition means its value is
determined by its rank rather than its magnitude, making it robust to extreme values. This
robustness makes the median a preferred measure of central tendency for highly skewed
datasets (e.g., income distribution, property values) where the mean might be misleadingly
inflated or deflated by a few extreme observations. It often provides a more accurate
representation of the "typical" experience for the majority of the data points in such cases.
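The grouped-data formula can be sketched in code (the helper name `grouped_median` and the conversion of inclusive limits such as 58-65 into continuous boundaries 57.5-65.5 are our assumptions):

```python
def grouped_median(class_bounds, freqs):
    """Median = l + ((N/2 - c) / f) * h, where the median class is the
    first whose cumulative frequency reaches N/2."""
    n = sum(freqs)
    cumulative = 0
    for (lower, upper), f in zip(class_bounds, freqs):
        if cumulative + f >= n / 2:        # found the median class
            return lower + ((n / 2 - cumulative) / f) * (upper - lower)
        cumulative += f
    raise ValueError("empty distribution")

# Exam-score example: continuous boundaries 49.5-57.5, ..., 89.5-97.5
bounds = [(49.5 + 8 * k, 57.5 + 8 * k) for k in range(6)]
med = grouped_median(bounds, [5, 9, 11, 11, 9, 5])
```

Here `upper - lower` plays the role of the class width h, and the running total tracks c, the cumulative frequency of the preceding class.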

2.3. Mode
The mode is defined as the most frequently occurring value in a dataset. A dataset can have no
mode (if all values occur with the same frequency), one mode (unimodal), or more than one
mode (multimodal).
Finding the Mode:
●​ Raw Data: To find the mode, the dataset is typically sorted, and the value that appears
most often is identified.
●​ Discrete Frequency Distribution: For a discrete frequency distribution, the mode is
simply the value of the variable that has the highest frequency.
●​ Grouped Frequency Distribution (Modal Class): For grouped data, it is generally more appropriate to identify the modal class, which is the class interval possessing the maximum frequency. The mode for grouped data can be estimated using the following formula:
Mode = l + ((f_m − f_1) / (2f_m − f_1 − f_2)) × h
Where:
○​ l = lower boundary of the modal class.
○​ f_m = frequency of the modal class.
○​ f_1 = frequency of the class before the modal class.
○​ f_2 = frequency of the class after the modal class.
○​ h = width of the class.
Application and Interpretation: The mode is particularly valuable for nominal (categorical)
data, where it indicates the most popular or common category. For example, in a survey asking
about political identification, the mode would reveal the most frequently chosen category (e.g.,
"liberal" if it has the highest frequency). The mode's unique applicability to categorical data,
unlike the mean or median which require numerical data, makes it a bridge between qualitative
and quantitative data analysis. In contrast, for continuous variables, a specific value may rarely
repeat, rendering the mode less helpful as a measure of central tendency. In such cases, the
modal class provides more meaningful information.
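The grouped-mode estimate is a one-line computation; this sketch (the helper name `grouped_mode` and the example numbers are ours) applies the formula to a hypothetical modal class:

```python
def grouped_mode(l, f_m, f_1, f_2, h):
    """Mode estimate l + ((f_m - f_1) / (2*f_m - f_1 - f_2)) * h for a
    modal class with lower boundary l and width h."""
    return l + ((f_m - f_1) / (2 * f_m - f_1 - f_2)) * h

# Hypothetical modal class 10-15 (f_m = 12), flanked by frequencies 7 and 5
mode_estimate = grouped_mode(l=10.0, f_m=12, f_1=7, f_2=5, h=5.0)
```

The estimate shifts the mode away from the lower boundary toward whichever neighbouring class is denser.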

2.4. Interplay of Central Tendency Measures


The relationship between the mean, median, and mode offers valuable insights into the shape of
a data distribution.
●​ Normal Distribution: In a perfectly normal (symmetrical) distribution, the mean, mode,
and median are exactly the same. The data is symmetrically distributed with no skew, and
most values cluster around a central region, tapering off as they move away from the
center.
●​ Skewed Distributions: In distributions that are not symmetrical, the mean, median, and
mode will differ from each other, with more values falling on one side of the center than
the other. This provides a quick, intuitive check of data distribution.
○​ Positively Skewed (Right-skewed): In a positively skewed distribution, the mode
is less than the median, which is less than the mean (Mode < Median < Mean). The
mean is pulled towards the longer right tail by higher values. For instance, if the
mean is significantly higher than the median, it immediately signals a positive skew,
prompting further investigation into potential outliers or inherent characteristics of
the data (e.g., income data is typically positively skewed).
○​ Negatively Skewed (Left-skewed): In a negatively skewed distribution, the mean
is less than the median, which is less than the mode (Mean < Median < Mode). The
mean is pulled towards the longer left tail by lower values.
●​ Empirical Relationship: For moderately skewed distributions, an empirical relationship often holds: 2 × Mean + Mode = 3 × Median (equivalently, Mode ≈ 3 × Median − 2 × Mean). This relationship allows for the estimation of one measure if the other two are known.
This interplay provides a powerful diagnostic tool for understanding distribution shape. Before
even calculating skewness coefficients, one can infer the general shape of the distribution by
comparing these three values. This initial diagnostic guides the choice of further analytical
methods and interpretation.
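The empirical relationship can be used, for example, to back out an approximate mode when only the mean and median are reported; a minimal sketch (the helper name `estimate_mode` is ours):

```python
def estimate_mode(mean, median):
    """Empirical relationship 2*Mean + Mode ≈ 3*Median, rearranged to
    estimate the mode from the other two measures (moderate skew only)."""
    return 3 * median - 2 * mean

# E.g. a mean of 50 with a median of 48 suggests a mode near 44
```

Because the relationship is only approximate, the result should be read as a rough indicator of where the distribution peaks.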
3. Measures of Dispersion: Standard Deviation
While central tendency measures describe the "average" value, measures of dispersion quantify
the spread or variability of data points around that average. The standard deviation is the most
widely used measure of dispersion.

3.1. Standard Deviation


The standard deviation (σ) quantifies the average amount of variability or dispersion of
data points around the mean. A smaller standard deviation indicates that data points are
clustered closely around the mean, suggesting less variability. Conversely, a larger standard
deviation indicates that data points are more spread out from the mean, implying greater
variability.
Formulas for Frequency Distribution Tables: Two common formulations exist for calculating standard deviation from frequency tables:
●​ Formulation 1 (Conceptual): This formula emphasizes the deviation of each score from the mean:
σ = √( Σ f_i (x_i − µ)² / Σ f_i )
Where:
○​ x_i = individual score (or midpoint for grouped data).
○​ µ = mean of the distribution.
○​ f_i = frequency of the score/midpoint.
○​ Σ f_i = total sum of frequencies (N).
●​ Formulation 2 (Computational, often easier): This formula is generally more efficient for calculations from frequency tables, particularly when dealing with large datasets:
σ = √( Σ f_i x_i² / Σ f_i − µ² )
Where:
○​ x_i, µ, f_i, and Σ f_i are as above.
○​ Σ f_i x_i² = sum of the product of each frequency and its corresponding squared score.
Step-by-step Calculation (using Formulation 2):
1.​ Calculate the Mean (µ): First, compute the mean of the distribution using µ = Σ f_i x_i / Σ f_i. This step requires columns for x_i, f_i, and f_i x_i.
2.​ Calculate x_i²: Square each individual score or midpoint value.
3.​ Calculate f_i x_i²: Multiply each frequency (f_i) by its corresponding squared score (x_i²).
4.​ Sum f_i x_i²: Add all the values obtained in the f_i x_i² column.
5.​ Sum f_i: Add all the frequencies to get the total number of observations (N).
6.​ Apply the Formula: Substitute the calculated sums (Σ f_i x_i² and Σ f_i) and the mean (µ) into Formulation 2 to compute the standard deviation.


Application and Interpretation: Standard deviation quantifies the typical distance of data
points from the mean. It is crucial for understanding the reliability of the mean as a measure of
central tendency and for comparing the variability of different datasets. For example, two
datasets might have the same mean, but vastly different standard deviations, indicating different
levels of consistency or risk.
Beyond mere description, standard deviation is a foundational parameter for inferential
statistics. It allows for the construction of confidence intervals, hypothesis testing, and the
assessment of statistical significance. Its relationship with the normal distribution makes it a
powerful tool for identifying unusual observations or "outliers" (values far from the mean in terms
of standard deviations) and for understanding the probability of observing certain data points.
For instance, in a normal distribution, approximately 68% of data falls within ±1 standard deviation of the mean, 95% within ±2, and 99.7% within ±3. This concept is critical for
defining "normal" ranges around the mean and for assessing how unusual a particular data
point might be.
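Both formulations above compute the same quantity, which makes a useful cross-check; this sketch (the helper names are ours) applies each to the goals-scored distribution from Table 1.2.1:

```python
from math import sqrt

def sd_conceptual(values, freqs):
    """Formulation 1: sqrt(Σf(x - µ)² / Σf)."""
    n = sum(freqs)
    mu = sum(f * x for x, f in zip(values, freqs)) / n
    return sqrt(sum(f * (x - mu) ** 2 for x, f in zip(values, freqs)) / n)

def sd_computational(values, freqs):
    """Formulation 2: sqrt(Σf x² / Σf - µ²) — algebraically identical."""
    n = sum(freqs)
    mu = sum(f * x for x, f in zip(values, freqs)) / n
    return sqrt(sum(f * x * x for x, f in zip(values, freqs)) / n - mu ** 2)

# Goals-scored distribution: both routes must agree
x_vals, f_vals = [0, 1, 2, 3, 4], [8, 3, 2, 1, 1]
```

Formulation 2 needs only the running sums Σf x and Σf x², which is why it is preferred for hand computation from a frequency table.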

4. Measures of Shape: Skewness


Beyond central tendency and dispersion, measures of shape describe the form of a distribution.
Skewness is one such measure that quantifies the asymmetry of a data distribution, providing
insights into the elongation of its tails.

4.1. Skewness
Skewness is a measure of symmetry, or more precisely, the lack of symmetry, in a distribution. A
distribution is considered symmetric if it appears the same to the left and right of its center point.
Skewness provides an idea regarding the overall shape of the frequency distribution.
Interpretation:
●​ Skewness = 0: This indicates a perfectly symmetric distribution, such as a normal
distribution. In such cases, the data is evenly distributed around the mean.
●​ Positive Skewness (> 0): A positively skewed distribution has a longer tail extending to
the right side. This means that more values are concentrated on the left side of the peak,
and the mean is typically greater than the median, which is greater than the mode (Mean
> Median > Mode). This often occurs when there are a few unusually high values pulling
the mean to the right.
●​ Negative Skewness (< 0): A negatively skewed distribution has a longer tail extending to
the left side. In this case, more values are concentrated on the right side of the peak, and
the mean is typically less than the median, which is less than the mode (Mean < Median <
Mode). This can happen when there are a few unusually low values pulling the mean to
the left.
Formulas: Skewness is commonly calculated using moments, which are specific mathematical expectations of a random variable. The Fisher-Pearson coefficient of skewness is widely used.
●​ Fisher-Pearson Coefficient of Skewness (Sample): The adjusted formula for sample skewness is:
g₁ = [n / ((n − 1)(n − 2))] × Σ ((x_i − x̄) / s)³
For frequency distributions, this involves the third central moment (m₃) and the second central moment (m₂, which is the variance). Central moments for grouped data are defined as m_r = (1/N) Σ f_i (x_i − x̄)^r. Thus, the skewness can be approximated as Skewness ≈ m₃ / m₂^(3/2).
●​ Karl Pearson's Coefficient of Skewness: This simpler measure relates the mean, mode, and standard deviation: Sk = (Mean − Mode) / Standard Deviation. If the mode is ill-defined (e.g., in a multimodal distribution or for continuous data where a specific mode is rare), an alternative formula based on the median is used: Sk = 3(Mean − Median) / Standard Deviation.
●​ Bowley's Coefficient of Skewness (Quartile Skewness): This measure uses quartiles:
Sk = (Q₃ + Q₁ − 2Q₂) / (Q₃ − Q₁)
Where Q₁, Q₂ (median), and Q₃ are the first, second, and third quartiles, respectively.
Calculation Methods for Discrete and Grouped Data (using moments): To calculate skewness using the moment-based formula, the following steps are typically performed after calculating the mean and standard deviation:
1.​ Calculate Deviations from the Mean: For each value (x_i) or midpoint, compute the deviation (x_i − x̄).
2.​ Calculate Cubed Deviations: Cube each deviation: (x_i − x̄)³.
3.​ Multiply by Frequency: Multiply each cubed deviation by its corresponding frequency f_i: f_i (x_i − x̄)³.
4.​ Sum these Products: Sum all the values from the f_i (x_i − x̄)³ column. This sum represents N × m₃.
5.​ Apply the Skewness Formula: Use the calculated sums and the standard deviation to compute the skewness coefficient.
Application and Insights: Skewness helps identify non-normal distributions, which is critically
important because many classical statistical tests and intervals rely on assumptions of normality.
If data is significantly skewed, applying parametric tests that assume normality may yield invalid
or misleading results. Understanding skewness moves beyond mere description to a crucial
step in inferential statistics, informing the choice between parametric and non-parametric
statistical tests. This ensures that the chosen analytical method is appropriate for the data's
underlying distribution, thereby safeguarding the validity and reliability of research findings.
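The moment-based steps above can be sketched as a single function (the name `moment_skewness` is ours; it uses population-style, N-denominator moments), applied here to the goals-scored distribution from Table 1.2.1:

```python
def moment_skewness(values, freqs):
    """Moment-based skewness m3 / m2**1.5 (population, N-denominator
    moments) from a frequency distribution."""
    n = sum(freqs)
    mu = sum(f * x for x, f in zip(values, freqs)) / n
    m2 = sum(f * (x - mu) ** 2 for x, f in zip(values, freqs)) / n  # variance
    m3 = sum(f * (x - mu) ** 3 for x, f in zip(values, freqs)) / n
    return m3 / m2 ** 1.5

skew = moment_skewness([0, 1, 2, 3, 4], [8, 3, 2, 1, 1])
```

A positive result here is consistent with the long right tail of the goals data (many 0s, a single 4).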

5. Measures of Shape: Kurtosis


Kurtosis is another measure of a distribution's shape, complementing skewness by describing
its peakedness and the "heaviness" or "lightness" of its tails relative to a normal distribution.

5.1. Kurtosis
Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal
distribution. It also quantifies the degree of peakedness of a frequency distribution.
Interpretation: The normal distribution serves as a reference point and is considered
"mesokurtic".
●​ Leptokurtic (Kurtosis > 3 or Excess Kurtosis > 0): A leptokurtic distribution has a
sharper peak and "heavier tails" than a normal distribution. This indicates that a greater
proportion of data is concentrated around the mean (a high peak) and/or there are more
extreme values (outliers) in the tails.
●​ Mesokurtic (Kurtosis ≈ 3 or Excess Kurtosis ≈ 0): This describes a distribution with similar peakedness and tail behavior to a normal distribution.
●​ Platykurtic (Kurtosis < 3 or Excess Kurtosis < 0): A platykurtic distribution is flatter and
has "lighter tails" than a normal distribution. This suggests that data points are more
uniformly distributed across the range, with fewer extreme values.
Formulas: Kurtosis is calculated using the fourth central moment.
●​ Pearson's Coefficient of Kurtosis: This is the most common formula for kurtosis:
Ku = m₄ / m₂²
Where:
○​ m₄ = fourth central moment, calculated as m₄ = (1/N) Σ f_i (x_i − x̄)⁴.
○​ m₂ = second central moment (variance), calculated as m₂ = (1/N) Σ f_i (x_i − x̄)².
●​ Excess Kurtosis: To make the normal distribution's kurtosis value zero, excess kurtosis is often calculated: Excess Kurtosis = Pearson's Kurtosis − 3. This definition simplifies interpretation, as a positive value indicates heavier tails than normal, and a negative value indicates lighter tails.
Calculation Methods for Discrete and Grouped Data (using moments): To calculate kurtosis using the moment-based formula, the process builds upon the calculations for skewness:
1.​ Calculate Deviations from the Mean: For each value (x_i) or midpoint, compute the deviation (x_i − x̄).
2.​ Calculate Fourth Power of Deviations: Raise each deviation to the fourth power: (x_i − x̄)⁴.
3.​ Multiply by Frequency: Multiply each fourth-powered deviation by its corresponding frequency f_i: f_i (x_i − x̄)⁴.
4.​ Sum these Products: Sum all the values from the f_i (x_i − x̄)⁴ column. This sum represents N × m₄.
5.​ Calculate the Second Central Moment (m₂): This is the variance, calculated as m₂ = Σ f_i (x_i − x̄)² / Σ f_i.
6.​ Apply the Kurtosis Formula: Substitute the calculated values into the formula for Pearson's Coefficient of Kurtosis.
Application and Insights: Kurtosis is instrumental in identifying distributions with extreme
values or outliers ("heavy tails") or those that are unusually flat ("light tails"). In fields such as
finance or risk management, high kurtosis (leptokurtic distributions) signifies a higher probability
of extreme events (e.g., market crashes, large gains/losses) compared to what a normal
distribution would predict. This understanding is critical for risk assessment, portfolio
optimization, and comprehending the potential for rare, impactful events. Like skewness,
kurtosis is crucial for assessing normality assumptions for statistical tests.
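The kurtosis calculation mirrors the skewness one, swapping the third power for the fourth; a minimal sketch (the name `moment_kurtosis` is ours; population, N-denominator moments), again applied to the goals-scored distribution:

```python
def moment_kurtosis(values, freqs):
    """Pearson's kurtosis m4 / m2**2 (population moments) from a
    frequency distribution; subtract 3 for excess kurtosis."""
    n = sum(freqs)
    mu = sum(f * x for x, f in zip(values, freqs)) / n
    m2 = sum(f * (x - mu) ** 2 for x, f in zip(values, freqs)) / n  # variance
    m4 = sum(f * (x - mu) ** 4 for x, f in zip(values, freqs)) / n
    return m4 / m2 ** 2

ku = moment_kurtosis([0, 1, 2, 3, 4], [8, 3, 2, 1, 1])
excess = ku - 3
```

A value above 3 (positive excess) flags tails heavier than a normal distribution's, driven here by the rare 3- and 4-goal matches.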

6. Integrated Examples and Comprehensive Application
This section provides two comprehensive, step-by-step examples—one for discrete data and
one for interval data—demonstrating the calculation of all discussed measures (mean, median,
mode, standard deviation, skewness, and kurtosis) from their respective frequency distribution
tables. Each example includes a detailed interpretation of the results, showcasing how these
measures collectively characterize a dataset.

6.1. Discrete Data Example


Scenario: A small business recorded the number of customer complaints received per day over
a period of 30 days.
Raw Data (30 days): 0, 1, 1, 0, 2, 1, 0, 3, 1, 1, 0, 2, 0, 1, 0, 0, 1, 2, 1, 0, 0, 1, 1, 0, 2, 1, 0, 1, 0, 0
Step-by-step Construction of Discrete Frequency Table: The unique values (number of
complaints) are 0, 1, 2, 3. We tally their occurrences and compute frequencies and cumulative
frequencies.
Table 6.1.1: Comprehensive Discrete Data Analysis Table (Customer Complaints)
(deviations are taken from the mean x̄ = 23/30 ≈ 0.7667)

x_i   | f_i    | cf_i | f_i x_i | x_i² | f_i x_i² | x_i − x̄ | f_i(x_i − x̄)² | (x_i − x̄)³ | f_i(x_i − x̄)³ | (x_i − x̄)⁴ | f_i(x_i − x̄)⁴
0     | 13     | 13   | 0       | 0    | 0        | −0.7667  | 7.6411         | −0.4506     | −5.8582        | 0.3455      | 4.4913
1     | 12     | 25   | 12      | 1    | 12       | 0.2333   | 0.6533         | 0.0127      | 0.1524         | 0.0030      | 0.0356
2     | 4      | 29   | 8       | 4    | 16       | 1.2333   | 6.0844         | 1.8760      | 7.5041         | 2.3138      | 9.2551
3     | 1      | 30   | 3       | 9    | 9        | 2.2333   | 4.9878         | 11.1394     | 11.1394        | 24.8779     | 24.8779
Total | N = 30 |      | 23      |      | 37       |          | 19.3667        |             | 12.9378        |             | 38.6599
Detailed Calculation of Measures:
∑𝑓𝑖𝑥𝑖
23
●​ Mean (𝑥): 𝑥 = = 30
≈ 0. 767 The average number of complaints per day is
∑𝑓𝑖

approximately 0.77.
𝑁 𝑡ℎ
●​ Median: Total observations (𝑁) = 30. Median position = 2
= 30/2 = 15 observation.
Looking at the cumulative frequency column, the 15th observation falls within the "1
complaint" category (cf=25, which is just greater than 15). Therefore, the Median = 1
complaint.
●​ Mode: From the frequency column, the value with the highest frequency is 0 (with a
frequency of 13). Therefore, the Mode = 0 complaints.

2
∑𝑓𝑖 𝑥𝑖
●​ Standard Deviation (\sigma): Using Formulation 2: σ = ,
2
∑𝑓𝑖−µ

37
σ = 2 = 1. 2333 − 0. 5883 = 0. 645 ≈ 0. 803 The standard deviation is
30 − 0.767
approximately 0.803 complaints. This indicates that the typical deviation of daily
complaints from the average is about 0.8 complaints.
● Skewness (g₁): We need the second and third central moments:
m₂ = (1/N) Σf_i(x_i − x̄)² = 19.367/30 ≈ 0.6456 and m₃ = (1/N) Σf_i(x_i − x̄)³ = 12.938/30 ≈ 0.4313.
The standard deviation s = √m₂ ≈ 0.803 (using N in the denominator, consistent with the moment calculations and with the value obtained above).
Skewness g₁ = m₃ / m₂^(3/2) = 0.4313 / 0.6456^1.5 = 0.4313 / 0.5187 ≈ 0.831.
The skewness is approximately 0.831.
● Kurtosis (K_u): We need the fourth central moment:
m₄ = (1/N) Σf_i(x_i − x̄)⁴ = 38.660/30 ≈ 1.2887.
Kurtosis K_u = m₄ / m₂² = 1.2887 / 0.6456² = 1.2887 / 0.4168 ≈ 3.09.
The kurtosis is approximately 3.09.
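The moment-based shape measures lend themselves to a quick programmatic check. A sketch in plain Python, computing the central moments from the same frequency table (the helper `central_moment` is illustrative, not from the source):

```python
values = [0, 1, 2, 3]
freqs = [13, 12, 4, 1]
n = sum(freqs)

mean = sum(f * x for f, x in zip(freqs, values)) / n

def central_moment(k):
    """k-th central moment of the frequency distribution (N in the denominator)."""
    return sum(f * (x - mean) ** k for f, x in zip(freqs, values)) / n

m2 = central_moment(2)            # variance
m3 = central_moment(3)
m4 = central_moment(4)

std_dev = m2 ** 0.5               # standard deviation
skewness = m3 / m2 ** 1.5         # positive value: right-skewed
kurtosis = m4 / m2 ** 2           # compare against 3 (normal distribution)
```

Because the same `mean` is used throughout, √m₂ automatically agrees with the standard deviation — the consistency check noted in the text.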


Interpretation:
● Central Tendency: The mean (0.77), median (1), and mode (0) lie close together but are not identical. The mode of 0 indicates that zero complaints is the single most common daily outcome, a significant operational detail, while the occasional multi-complaint day pulls the mean above the mode. (Note that the textbook ordering Mode < Median < Mean for right-skewed data is only a rule of thumb; with a discrete variable concentrated on a few values, the median can sit above the mean, as it does here.)
● Dispersion: The standard deviation of 0.803 indicates that daily complaint numbers typically vary by about 0.8 complaints from the average. This relatively low spread suggests consistency in complaint volumes, though the mode highlights a strong concentration at zero.
● Shape:
○ Skewness (0.831): The clearly positive skewness confirms that the distribution is skewed to the right. Most days have very few complaints (most often zero), but a few days with two or three complaints form a longer right tail that pulls the mean upward.
○ Kurtosis (3.09): The kurtosis of about 3.09 (excess kurtosis ≈ +0.09) is very close to 3, indicating an approximately mesokurtic distribution: its tail weight is comparable to that of a normal distribution. In this context, the occasional high-complaint days are no more extreme than a normal model would predict.
Collectively, these measures indicate that the business generally experiences very few customer complaints, with zero complaints being the most common outcome. Occasional days with two or three complaints create a clear right skew and pull the average up to about 0.77, but the near-normal kurtosis shows that these busier days are not unusually extreme relative to the overall pattern.

6.2. Interval Data Example


Scenario: The scores of 50 students on a statistics exam (out of 100).
Raw Data (50 scores): 65, 78, 92, 55, 81, 70, 63, 88, 75, 60, 95, 68, 72, 85, 50, 77, 80, 62, 71,
90, 58, 73, 86, 66, 79, 83, 52, 74, 89, 61, 91, 67, 76, 82, 59, 70, 84, 64, 77, 87, 56, 71, 93, 69,
75, 80, 57, 72, 85, 60
Step-by-step Construction of Grouped Frequency Table: As determined in Section 1.3, we
use class intervals of width 8, starting from 50.
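Binning the raw scores can be scripted. A sketch in plain Python (the class layout — width 8 starting at 50 — follows the text; variable names are illustrative):

```python
# Raw exam scores from Section 6.2
scores = [65, 78, 92, 55, 81, 70, 63, 88, 75, 60, 95, 68, 72, 85, 50,
          77, 80, 62, 71, 90, 58, 73, 86, 66, 79, 83, 52, 74, 89, 61,
          91, 67, 76, 82, 59, 70, 84, 64, 77, 87, 56, 71, 93, 69, 75,
          80, 57, 72, 85, 60]

width, start = 8, 50
# Six classes: (50, 57), (58, 65), ..., (90, 97)
classes = [(start + i * width, start + i * width + width - 1) for i in range(6)]
freqs = [sum(lo <= s <= hi for s in scores) for lo, hi in classes]
cum = [sum(freqs[: i + 1]) for i in range(len(freqs))]

for (lo, hi), f, cf in zip(classes, freqs, cum):
    print(f"{lo}-{hi}", f, cf)
```

Tallying this way gives frequencies 5, 9, 11, 11, 9, 5 across the six classes, summing to N = 50.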
Table 6.2.1: Comprehensive Interval Data Analysis Table (Student Exam Scores)

| Class Interval | Midpoint (x_i) | Frequency (f_i) | Cumulative Frequency (cf_i) | f_i·x_i | f_i·x_i² | (x_i − x̄) | f_i(x_i − x̄)² | f_i(x_i − x̄)³ | f_i(x_i − x̄)⁴ |
|---|---|---|---|---|---|---|---|---|---|
| 50-57 | 53.5 | 5 | 5 | 267.5 | 14311.25 | −20 | 2000 | −40000 | 800000 |
| 58-65 | 61.5 | 9 | 14 | 553.5 | 34040.25 | −12 | 1296 | −15552 | 186624 |
| 66-73 | 69.5 | 11 | 25 | 764.5 | 53132.75 | −4 | 176 | −704 | 2816 |
| 74-81 | 77.5 | 11 | 36 | 852.5 | 66068.75 | 4 | 176 | 704 | 2816 |
| 82-89 | 85.5 | 9 | 45 | 769.5 | 65792.25 | 12 | 1296 | 15552 | 186624 |
| 90-97 | 93.5 | 5 | 50 | 467.5 | 43711.25 | 20 | 2000 | 40000 | 800000 |
| Total | — | N = 50 | — | 3675 | 277056.5 | — | 6944 | 0 | 1978880 |

Note: Tallying the raw scores gives the frequencies 5, 9, 11, 11, 9, 5; deviations are taken from the mean x̄ = 73.5.
Detailed Calculation of Measures:
● Mean (x̄): x̄ = Σf_i·x_i / Σf_i = 3675/50 = 73.5. The estimated average exam score is 73.5.
● Median: Total observations N = 50, so the median position is the (N/2)th = 25th observation. The median class is the 66-73 interval, as its cumulative frequency (25) is the first to be greater than or equal to 25.
○ l (lower limit of median class) = 66
○ n (total observations) = 50
○ c (cumulative frequency of class before median class) = 14
○ f (frequency of median class) = 11
○ h (class width) = 8
Median = l + ((n/2 − c)/f) × h = 66 + ((25 − 14)/11) × 8 = 66 + 8 = 74. The estimated median score is 74. (Using the continuous class boundary 65.5 as l would give 73.5; for gapped integer classes such as 66-73, the boundary-based figure is the slightly better estimate.)
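The interpolation formula generalizes to any grouped table. A sketch in plain Python (the helper name `grouped_median` is illustrative; the lower limits here are the stated class limits, not continuous boundaries):

```python
from itertools import accumulate

def grouped_median(lower_limits, freqs, width):
    """Median of grouped data via linear interpolation within the median class."""
    n = sum(freqs)
    cum = list(accumulate(freqs))
    # locate the first class whose cumulative frequency reaches n/2
    i = next(j for j, cf in enumerate(cum) if cf >= n / 2)
    c = cum[i - 1] if i > 0 else 0   # cumulative frequency before the median class
    return lower_limits[i] + (n / 2 - c) / freqs[i] * width

lower_limits = [50, 58, 66, 74, 82, 90]
freqs = [5, 9, 11, 11, 9, 5]
print(grouped_median(lower_limits, freqs, 8))  # 74.0
```

Passing boundaries (49.5, 57.5, …) instead of the stated limits shifts the result down by half a unit, which is why the choice of l matters.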
● Mode: The modal class is 66-73 (frequency 11) or 74-81 (frequency 11). Since two classes share the maximum frequency, this distribution is bimodal in terms of modal classes. Calculating the mode for the first modal class (66-73):
○ l (lower limit of modal class) = 66
○ f_m (frequency of modal class) = 11
○ f₁ (frequency of class before modal class) = 9
○ f₂ (frequency of class after modal class) = 11
○ h (class width) = 8
Mode = l + ((f_m − f₁)/(2f_m − f₁ − f₂)) × h = 66 + ((11 − 9)/(22 − 9 − 11)) × 8 = 66 + (2/2) × 8 = 74. For the second modal class (74-81), the same formula gives 74 + ((11 − 11)/(22 − 11 − 9)) × 8 = 74. Here the two calculations happen to agree at the shared class boundary, but in general the grouped mode depends on which modal class is chosen, which illustrates why the mode can be less helpful for continuous data.
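The grouped-mode formula can likewise be wrapped in a small helper (an illustrative sketch in plain Python; the name `grouped_mode` is not from the source):

```python
def grouped_mode(lower_limit, f_modal, f_before, f_after, width):
    """Mode of grouped data: l + (fm - f1) / (2*fm - f1 - f2) * h."""
    return lower_limit + (f_modal - f_before) / (2 * f_modal - f_before - f_after) * width

# First modal class 66-73: neighbouring frequencies are 9 and 11
print(grouped_mode(66, 11, 9, 11, 8))   # 74.0
# Second modal class 74-81: neighbouring frequencies are 11 and 9
print(grouped_mode(74, 11, 11, 9, 8))   # 74.0
```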
● Standard Deviation (σ): Using Formulation 2, σ = √(Σf_i·x_i² / Σf_i − x̄²):
σ = √(277056.5/50 − 73.5²) = √(5541.13 − 5402.25) = √138.88 ≈ 11.78.
The estimated standard deviation is approximately 11.78.


● Skewness (g₁): We need
m₂ = (1/N) Σf_i(x_i − x̄)² = 6944/50 = 138.88 and m₃ = (1/N) Σf_i(x_i − x̄)³ = 0/50 = 0.
The standard deviation s = √m₂ ≈ 11.78, matching the value obtained above.
Skewness g₁ = m₃ / m₂^(3/2) = 0 / 138.88^1.5 = 0.
The skewness is 0: the grouped frequencies are perfectly symmetric about the mean.
● Kurtosis (K_u): We need
m₄ = (1/N) Σf_i(x_i − x̄)⁴ = 1978880/50 = 39577.6.
Kurtosis K_u = m₄ / m₂² = 39577.6 / 138.88² = 39577.6 / 19287.65 ≈ 2.05.
The kurtosis is approximately 2.05.
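As in the discrete example, the grouped moments can be recomputed from midpoints and frequencies. A sketch in plain Python (`central_moment` is an illustrative helper; values follow Table 6.2.1):

```python
midpoints = [53.5, 61.5, 69.5, 77.5, 85.5, 93.5]
freqs = [5, 9, 11, 11, 9, 5]
n = sum(freqs)

mean = sum(f * x for f, x in zip(freqs, midpoints)) / n   # 73.5

def central_moment(k):
    """k-th central moment, treating every score as its class midpoint."""
    return sum(f * (x - mean) ** k for f, x in zip(freqs, midpoints)) / n

m2, m3, m4 = central_moment(2), central_moment(3), central_moment(4)

std_dev = m2 ** 0.5          # standard deviation of the grouped data
skewness = m3 / m2 ** 1.5    # 0.0 — the grouped distribution is symmetric
kurtosis = m4 / m2 ** 2      # below 3 — platykurtic (flatter than normal)
```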


Interpretation:
● Central Tendency: The estimated mean score is 73.5, the estimated median is 74, and both modal-class calculations yield a mode of 74. The three measures essentially coincide, the classic signature of a symmetric distribution; the small gap between the mean and the median is an artifact of using class limits rather than class boundaries in the interpolation formula. The average score is about 73.5, with half the students scoring above roughly 74.
● Dispersion: The estimated standard deviation of 11.78 indicates that student scores typically deviate by about 12 points from the average, a moderate spread for scores ranging from 50 to 95.
● Shape:
○ Skewness (0): The grouped skewness of exactly 0 indicates a symmetric distribution. Low and high scores balance each other around the center, with no long tail on either side.
○ Kurtosis (2.05): The kurtosis of 2.05 (excess kurtosis ≈ −0.95) is noticeably below 3, indicating a platykurtic distribution. The scores are spread relatively evenly across the grade range rather than clustering tightly around the mean, and extreme scores are rarer than a normal distribution would predict. In practical terms, the class contains a broad, even mix of performance levels rather than one dominant cluster around the average with heavy outlying tails.
The analysis of grouped data provides valuable insights into overall trends and distributions,
which is particularly useful for large datasets. However, it is important to remember that the
results derived are estimates, a crucial consideration for data interpretation. The calculations for
grouped data inherently involve an approximation, as individual data points are replaced by
class midpoints. This is a fundamental compromise in data analysis: the trade-off between the
practicality of handling large datasets and the precision of calculations.
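This trade-off can be made concrete by computing the same statistics twice — once from the raw scores and once with every score replaced by its class midpoint. A sketch in plain Python (the `midpoint` helper is illustrative):

```python
# Raw exam scores from Section 6.2
scores = [65, 78, 92, 55, 81, 70, 63, 88, 75, 60, 95, 68, 72, 85, 50,
          77, 80, 62, 71, 90, 58, 73, 86, 66, 79, 83, 52, 74, 89, 61,
          91, 67, 76, 82, 59, 70, 84, 64, 77, 87, 56, 71, 93, 69, 75,
          80, 57, 72, 85, 60]
n = len(scores)

# Exact statistics from the ungrouped data
exact_mean = sum(scores) / n
exact_var = sum((x - exact_mean) ** 2 for x in scores) / n

def midpoint(x):
    """Midpoint of the width-8 class (50-57, 58-65, ...) containing score x."""
    return 50 + 8 * ((x - 50) // 8) + 3.5

# Grouped approximation: every score stands in for its class midpoint
grouped = [midpoint(x) for x in scores]
grouped_mean = sum(grouped) / n
grouped_var = sum((x - grouped_mean) ** 2 for x in grouped) / n

print(exact_mean, grouped_mean)              # 73.46 vs 73.5
print(exact_var ** 0.5, grouped_var ** 0.5)  # exact vs grouped standard deviation
```

For this dataset the grouped mean differs from the exact mean by only 0.04 points, showing how small the midpoint approximation error can be when classes are reasonably narrow.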

7. Conclusion
The comprehensive analysis of central tendency, dispersion, and shape measures—mean,
median, mode, standard deviation, skewness, and kurtosis—provides a robust and holistic
understanding of any dataset. These descriptive statistics, when used collectively and
interpreted in context, transform raw data into a coherent narrative.
Measures of central tendency (mean, median, mode) pinpoint the typical value, while measures
of dispersion (standard deviation) quantify the variability around that center. The interplay
between these measures, particularly the relative positions of the mean, median, and mode,
serves as a powerful diagnostic indicator of a distribution's asymmetry. This initial assessment
of distribution shape is crucial, as it informs the selection of appropriate statistical tests,
ensuring the validity and reliability of subsequent analyses.
Furthermore, skewness and kurtosis offer deeper insights into the distribution's form. Skewness
reveals the direction and extent of asymmetry, which is critical for identifying non-normal
distributions. Kurtosis, by describing peakedness and tail heaviness, provides valuable
information about the presence and impact of extreme values or outliers. In fields like risk
management, a high kurtosis value can signal a greater probability of extreme events than a
normal distribution would suggest, which is vital for informed decision-making.
Ultimately, mastering these fundamental concepts is paramount for anyone engaging in
data-driven decision-making. They form the bedrock for more advanced statistical inference and
modeling, enabling analysts to succinctly convey complex data characteristics to diverse
audiences. This comprehensive understanding of data's core features, variability, and
underlying distribution is vital for effective data storytelling and collaborative decision-making in
any domain.

