Statistical Analysis - Discrete and Interval
2.1. Mean
The mean, or arithmetic average, is the most commonly used measure of central tendency
because it incorporates all values in its calculation. It is calculated by summing all values in a
dataset and dividing by the total number of values. The mean represents the "balance point" of
a distribution.
Formulas:
● Raw Data (Sample Mean): For a sample dataset, the mean ($\bar{x}$) is calculated as: $\bar{x} = \frac{\sum x}{n}$
Where $\sum x$ is the sum of all individual values, and $n$ is the total number of values in the
sample.
● Raw Data (Population Mean): For a population dataset, the mean ($\mu$) is calculated as:
$\mu = \frac{\sum x}{N}$
Where $\sum x$ is the sum of all values in the population, and $N$ is the total number of
values in the population.
● Frequency Distribution (Discrete or Grouped Data): $\mu = \frac{\sum f x}{\sum f}$
Where $f$ is the frequency of each value (for discrete data) or of each class interval (for
grouped data), and $x$ represents the value or midpoint, respectively. The sum of
frequencies (∑ 𝑓) is equivalent to the total number of observations (𝑁). For grouped data,
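As a minimal sketch of the mean formulas above, the raw-data mean and the frequency-weighted mean (∑fx / ∑f) might be computed as follows. The function name `frequency_mean` and both datasets are illustrative assumptions, not taken from the source.

```python
from statistics import fmean


def frequency_mean(values, freqs):
    """Mean of a frequency distribution: sum(f * x) / sum(f).

    `values` holds the distinct values (or class midpoints for grouped data).
    """
    total = sum(freqs)
    if total == 0:
        raise ValueError("frequencies must sum to a positive number")
    return sum(f * x for x, f in zip(values, freqs)) / total


# Hypothetical raw sample.
raw = [2, 4, 4, 5, 10]
raw_mean = fmean(raw)  # (2 + 4 + 4 + 5 + 10) / 5

# Hypothetical discrete frequency table: value -> frequency.
values = [0, 1, 2, 3]
freqs = [13, 12, 4, 1]
table_mean = frequency_mean(values, freqs)  # 23 / 30

print(raw_mean, round(table_mean, 3))
```

For grouped data the same function applies once each class interval is replaced by its midpoint.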
2.2. Median
The median is the middle number in a dataset that has been ordered from lowest to highest.
Unlike the mean, the median is less affected by extreme values, making it a robust measure of
central tendency, particularly useful for skewed distributions or when outliers are present.
Formulas:
● Raw Data (Odd Number of Observations): If the total number of observations ($n$) is odd,
the median is the value at the position: Median position = $\left(\frac{n+1}{2}\right)^{\text{th}}$ observation.
● Raw Data (Even Number of Observations): If the total number of observations ($n$) is
even, the median is the average of the two middle values:
$\text{Median} = \frac{\left(\frac{n}{2}\right)^{\text{th}} \text{ observation} + \left(\frac{n}{2} + 1\right)^{\text{th}} \text{ observation}}{2}$
● Discrete Frequency Distribution:
1. Calculate the cumulative frequency ($cf$) for each value.
2. Find $\frac{N}{2}$, where $N$ is the total sum of frequencies.
3. The median is the value of the variable ($x_i$) corresponding to the first cumulative
frequency that is greater than or equal to $\frac{N}{2}$.
● Grouped Frequency Distribution (Continuous Data): For grouped data, the median is
calculated using the following formula after identifying the median class:
$\text{Median} = l + \left(\frac{\frac{n}{2} - c}{f}\right) \times h$
Where:
○ $l$ = lower limit of the median class.
○ $n$ = total number of observations ($\sum f$).
○ $c$ = cumulative frequency of the class before the median class.
○ $f$ = frequency of the median class.
○ $h$ = size of the class (class width).
The median class is the first class interval whose cumulative frequency is greater than or
equal to $\frac{n}{2}$.
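The three median procedures above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names and all three datasets are hypothetical, and the grouped version assumes equal class widths and classes listed in ascending order.

```python
from itertools import accumulate


def median_raw(data):
    """Median of raw data: middle value, or mean of the two middle values."""
    xs = sorted(data)
    n = len(xs)
    mid = n // 2
    if n % 2 == 1:
        return xs[mid]                   # the ((n + 1) / 2)-th observation
    return (xs[mid - 1] + xs[mid]) / 2   # mean of the (n/2)-th and (n/2 + 1)-th


def median_discrete(values, freqs):
    """First value whose cumulative frequency reaches N / 2."""
    half = sum(freqs) / 2
    for x, cf in zip(values, accumulate(freqs)):
        if cf >= half:
            return x
    raise ValueError("empty frequency table")


def median_grouped(lower_limits, freqs, width):
    """Median = l + ((n/2 - c) / f) * h for the median class."""
    half = sum(freqs) / 2
    c = 0  # cumulative frequency of the classes before the current one
    for l, f in zip(lower_limits, freqs):
        if c + f >= half:
            return l + (half - c) / f * width
        c += f
    raise ValueError("empty frequency table")


odd_median = median_raw([3, 1, 4, 1, 5])
even_median = median_raw([3, 1, 4, 1])
discrete_median = median_discrete([0, 1, 2, 3], [13, 12, 4, 1])
# Hypothetical classes 50-57, 58-65, 66-73, 74-81, 82-89 (width 8).
grouped_median = median_grouped([50, 58, 66, 74, 82], [7, 9, 13, 13, 8], 8)
print(odd_median, even_median, discrete_median, round(grouped_median, 2))
```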
Application and Interpretation: The median precisely divides the data into two halves, with
50% of the values falling below it and 50% above it. Its positional definition means its value is
determined by its rank rather than its magnitude, making it robust to extreme values. This
robustness makes the median a preferred measure of central tendency for highly skewed
datasets (e.g., income distribution, property values) where the mean might be misleadingly
inflated or deflated by a few extreme observations. It often provides a more accurate
representation of the "typical" experience for the majority of the data points in such cases.
2.3. Mode
The mode is defined as the most frequently occurring value in a dataset. A dataset can have no
mode (if all values occur with the same frequency), one mode (unimodal), or more than one
mode (multimodal).
Finding the Mode:
● Raw Data: To find the mode, the dataset is typically sorted, and the value that appears
most often is identified.
● Discrete Frequency Distribution: For a discrete frequency distribution, the mode is
simply the value of the variable that has the highest frequency.
● Grouped Frequency Distribution (Modal Class): For grouped data, it is generally more
appropriate to identify the modal class, which is the class interval possessing the
maximum frequency. The mode for grouped data can be estimated using the following
formula: $\text{Mode} = l + \left(\frac{f_m - f_1}{2f_m - f_1 - f_2}\right) \times h$
Where:
○ $l$ = lower limit of the modal class.
○ 𝑓𝑚= frequency of the modal class.
○ 𝑓1 = frequency of the class before the modal class.
○ 𝑓2 = frequency of the class after the modal class.
○ ℎ = width of the class.
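A minimal sketch of the grouped-mode estimate follows. The function name, the class table, and the choice to treat a missing neighbouring class as having frequency 0 are assumptions made for illustration; when several classes tie for the maximum frequency, this sketch simply uses the first one.

```python
def mode_grouped(lower_limits, freqs, width):
    """Estimate the mode as l + ((fm - f1) / (2*fm - f1 - f2)) * h."""
    # Index of the first class with the maximum frequency (the modal class).
    m = max(range(len(freqs)), key=freqs.__getitem__)
    fm = freqs[m]
    f1 = freqs[m - 1] if m > 0 else 0               # class before (assumed 0 if absent)
    f2 = freqs[m + 1] if m + 1 < len(freqs) else 0  # class after (assumed 0 if absent)
    denom = 2 * fm - f1 - f2
    if denom == 0:
        raise ValueError("formula undefined: modal class equals both neighbours")
    return lower_limits[m] + (fm - f1) / denom * width


# Hypothetical classes 50-57, 58-65, 66-73, 74-81, 82-89 (width 8).
estimated_mode = mode_grouped([50, 58, 66, 74, 82], [7, 9, 13, 13, 8], 8)
print(estimated_mode)
```

With these frequencies the modal class is 66-73, with $f_m = 13$, $f_1 = 9$, and $f_2 = 13$, so the estimate is 66 + (4 / 4) × 8 = 74.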
Application and Interpretation: The mode is particularly valuable for nominal (categorical)
data, where it indicates the most popular or common category. For example, in a survey asking
about political identification, the mode would reveal the most frequently chosen category (e.g.,
"liberal" if it has the highest frequency). The mode's unique applicability to categorical data,
unlike the mean or median which require numerical data, makes it a bridge between qualitative
and quantitative data analysis. In contrast, for continuous variables, a specific value may rarely
repeat, rendering the mode less helpful as a measure of central tendency. In such cases, the
modal class provides more meaningful information.
the mean: $\sigma = \sqrt{\frac{\sum f_i (x_i - \mu)^2}{\sum f_i}}$
Where:
○ 𝑥𝑖 = individual score (or midpoint for grouped data).
○ µ = mean of the distribution.
○ 𝑓𝑖 = frequency of the score/midpoint.
$\sigma = \sqrt{\frac{\sum f_i x_i^2}{\sum f_i} - \mu^2}$
Where:
○ 𝑥𝑖 = individual score (or midpoint for grouped data).
○ µ = mean of the distribution.
○ 𝑓𝑖 = frequency of the score/midpoint.
○ $\sum f_i x_i^2$ = sum of the products of each frequency and its corresponding squared
score.
Step-by-step Calculation (using Formulation 2):
1. Calculate the Mean ($\mu$): First, compute the mean of the distribution using the formula
$\mu = \frac{\sum f_i x_i}{\sum f_i}$. This step requires columns for $x_i$, $f_i$, and $f_i x_i$.
2. Calculate $x_i^2$: Square each individual score or midpoint value ($x_i$).
3. Calculate $f_i x_i^2$: Multiply each frequency ($f_i$) by its corresponding squared score ($x_i^2$).
4. Sum $f_i x_i^2$: Add all the values obtained in the $f_i x_i^2$ column.
5. Sum $f_i$: Add all the frequencies to get the total number of observations ($N$).
6. Apply the Formula: Substitute the calculated sums ($\sum f_i x_i^2$ and $\sum f_i$) and the mean ($\mu$)
into Formulation 2.
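The steps above can be sketched as follows, alongside the deviation-based formulation so the two can be compared. The function names and the frequency table are illustrative assumptions; both functions return the population standard deviation (dividing by ∑f).

```python
from math import sqrt


def sd_deviation_form(values, freqs):
    """sigma = sqrt(sum(f * (x - mu)^2) / sum(f))."""
    n = sum(freqs)
    mu = sum(f * x for x, f in zip(values, freqs)) / n
    return sqrt(sum(f * (x - mu) ** 2 for x, f in zip(values, freqs)) / n)


def sd_shortcut_form(values, freqs):
    """sigma = sqrt(sum(f * x^2) / sum(f) - mu^2), following steps 1 to 6."""
    n = sum(freqs)                                        # step 5
    mu = sum(f * x for x, f in zip(values, freqs)) / n    # step 1
    sum_fx2 = sum(f * x * x for x, f in zip(values, freqs))  # steps 2 to 4
    variance = sum_fx2 / n - mu ** 2                      # step 6
    return sqrt(max(variance, 0.0))  # guard against tiny negative rounding error


# Hypothetical discrete frequency table.
values = [0, 1, 2, 3]
freqs = [13, 12, 4, 1]
sd_a = sd_deviation_form(values, freqs)
sd_b = sd_shortcut_form(values, freqs)
print(round(sd_a, 3), round(sd_b, 3))
```

The two formulations are algebraically equivalent, so any difference between the results is floating-point rounding. The shortcut form avoids a column of deviations, which is why it is convenient for hand calculation.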
4.1. Skewness
Skewness is a measure of symmetry, or more precisely, the lack of symmetry, in a distribution. A
distribution is considered symmetric if it appears the same to the left and right of its center point.
Skewness provides an idea regarding the overall shape of the frequency distribution.
Interpretation:
● Skewness = 0: A perfectly symmetric distribution, such as a normal distribution, has a
skewness of zero. In such cases, the data are evenly distributed around the mean.
● Positive Skewness (> 0): A positively skewed distribution has a longer tail extending to
the right side. This means that more values are concentrated on the left side of the peak,
and the mean is typically greater than the median, which is greater than the mode (Mean
> Median > Mode). This often occurs when there are a few unusually high values pulling
the mean to the right.
● Negative Skewness (< 0): A negatively skewed distribution has a longer tail extending to
the left side. In this case, more values are concentrated on the right side of the peak, and
the mean is typically less than the median, which is less than the mode (Mean < Median <
Mode). This can happen when there are a few unusually low values pulling the mean to
the left.
Formulas: Skewness is commonly calculated using moments, which are specific mathematical
expectations of a random variable. The Fisher-Pearson coefficient of skewness is widely used.
● Fisher-Pearson Coefficient of Skewness (Sample): The adjusted formula for sample
skewness is: $g_1 = \frac{n}{(n-1)(n-2)} \sum \left(\frac{x_i - \bar{x}}{s}\right)^3$
For frequency distributions, this involves the third central moment ($m_3$) and the second
central moment ($m_2$, which is the variance). Central moments for grouped data are defined
as: $m_r = \frac{1}{N} \sum f_i (x_i - \bar{x})^r$. Thus, the skewness can be approximated as:
$\text{Skewness} \approx \frac{m_3}{m_2^{3/2}}$ (derived from moment definitions).
● Karl Pearson's Coefficient of Skewness: This simpler measure relates the mean,
mode, and standard deviation: $Sk = \frac{\text{Mean} - \text{Mode}}{\text{Standard Deviation}}$
If the mode is ill-defined (e.g., in a multimodal distribution or for continuous data where a
specific mode is rare), an alternative formula based on the median is used:
$Sk = \frac{3(\text{Mean} - \text{Median})}{\text{Standard Deviation}}$.
● Bowley's Coefficient of Skewness (Quartile Skewness): This measure uses quartiles:
$Sk = \frac{Q_3 + Q_1 - 2Q_2}{Q_3 - Q_1}$
Where $Q_1$, $Q_2$ (median), and $Q_3$ are the first, second, and third quartiles,
respectively.
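The Karl Pearson and Bowley coefficients need only summary statistics, so they can be sketched as small helper functions. The function names and the numbers below are hypothetical, chosen only to illustrate the formulas.

```python
def pearson_mode_skewness(mean, mode, sd):
    """Karl Pearson's coefficient: (mean - mode) / standard deviation."""
    return (mean - mode) / sd


def pearson_median_skewness(mean, median, sd):
    """Median-based alternative: 3 * (mean - median) / standard deviation."""
    return 3 * (mean - median) / sd


def bowley_skewness(q1, q2, q3):
    """Quartile skewness: (Q3 + Q1 - 2*Q2) / (Q3 - Q1)."""
    return (q3 + q1 - 2 * q2) / (q3 - q1)


# Hypothetical summary statistics for a right-skewed distribution.
sk_mode = pearson_mode_skewness(mean=52, mode=46, sd=10)
sk_median = pearson_median_skewness(mean=52, median=50, sd=10)
sk_bowley = bowley_skewness(q1=40, q2=50, q3=64)
print(sk_mode, sk_median, round(sk_bowley, 4))
```

All three results are positive here, matching the Mean > Median > Mode ordering described for positively skewed distributions. The coefficients are on different scales, so their magnitudes are not directly comparable.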
Calculation Methods for Discrete and Grouped Data (using moments): To calculate
skewness using the moment-based formula, the following steps are typically performed after
calculating the mean and standard deviation:
1. Calculate Deviations from the Mean: For each value ($x_i$) or midpoint, compute the
deviation ($x_i - \bar{x}$).
2. Calculate Cubed Deviations: Cube each deviation: $(x_i - \bar{x})^3$.
3. Multiply by Frequency: Multiply each cubed deviation by its corresponding frequency
$f_i$: $f_i (x_i - \bar{x})^3$.
4. Sum these Products: Sum all the values from the $f_i (x_i - \bar{x})^3$ column. This sum
represents $N \times m_3$.
5. Apply the Skewness Formula: Use the calculated sums and the standard deviation to
compute the skewness coefficient.
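The steps above can be sketched as a short routine for the moment-based coefficient $m_3 / m_2^{3/2}$. The helper names and the frequency table are assumptions made for illustration; the table was picked so the moments come out as round numbers.

```python
def central_moment(values, freqs, r):
    """r-th central moment: (1 / N) * sum(f * (x - mean)^r)."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / n
    return sum(f * (x - mean) ** r for x, f in zip(values, freqs)) / n


def moment_skewness(values, freqs):
    """Skewness approximated as m3 / m2^(3/2)."""
    m2 = central_moment(values, freqs, 2)
    m3 = central_moment(values, freqs, 3)
    return m3 / m2 ** 1.5


# Hypothetical table: mean = 2, m2 = 1.0, m3 = 0.6.
values = [1, 2, 3, 4]
freqs = [4, 3, 2, 1]
skew = moment_skewness(values, freqs)

# A symmetric hypothetical table has a third central moment of zero.
symmetric_skew = moment_skewness([1, 2, 3], [1, 2, 1])
print(skew, symmetric_skew)
```

The first table has most of its frequency on the low values with a tail toward 4, so the coefficient is positive.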
Application and Insights: Skewness helps identify non-normal distributions, which is critically
important because many classical statistical tests and intervals rely on assumptions of normality.
If data is significantly skewed, applying parametric tests that assume normality may yield invalid
or misleading results. Understanding skewness moves beyond mere description to a crucial
step in inferential statistics, informing the choice between parametric and non-parametric
statistical tests. This ensures that the chosen analytical method is appropriate for the data's
underlying distribution, thereby safeguarding the validity and reliability of research findings.
5.1. Kurtosis
Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal
distribution. It also quantifies the degree of peakedness of a frequency distribution.
Interpretation: The normal distribution serves as a reference point and is considered
"mesokurtic".
● Leptokurtic (Kurtosis > 3 or Excess Kurtosis > 0): A leptokurtic distribution has a
sharper peak and "heavier tails" than a normal distribution. This indicates that a greater
proportion of data is concentrated around the mean (a high peak) and/or there are more
extreme values (outliers) in the tails.
● Mesokurtic (Kurtosis ≈ 3 or Excess Kurtosis ≈ 0): This describes a
distribution with similar peakedness and tail behavior to a normal distribution.
● Platykurtic (Kurtosis < 3 or Excess Kurtosis < 0): A platykurtic distribution is flatter and
has "lighter tails" than a normal distribution. This suggests that data points are more
uniformly distributed across the range, with fewer extreme values.
Formulas: Kurtosis is calculated using the fourth central moment.
● Pearson's Coefficient of Kurtosis: This is the most common formula for kurtosis:
$K_u = \frac{m_4}{m_2^2}$
Where:
○ $m_4$ = fourth central moment, calculated as $m_4 = \frac{1}{N} \sum f_i (x_i - \bar{x})^4$.
○ $m_2$ = second central moment (variance), calculated as $m_2 = \frac{1}{N} \sum f_i (x_i - \bar{x})^2$.
● Excess Kurtosis: To make the normal distribution's kurtosis value zero, excess kurtosis
is often calculated: $\text{Excess Kurtosis} = \text{Pearson's Kurtosis} - 3$
This definition simplifies interpretation, as a positive value indicates heavier tails than
normal, and a negative value indicates lighter tails.
Calculation Methods for Discrete and Grouped Data (using moments): To calculate kurtosis
using the moment-based formula, the process builds upon the calculations for skewness:
1. Calculate Deviations from the Mean: For each value ($x_i$) or midpoint, compute the
deviation ($x_i - \bar{x}$).
2. Calculate Fourth Power of Deviations: Raise each deviation to the fourth power:
$(x_i - \bar{x})^4$.
3. Multiply by Frequency: Multiply each fourth-powered deviation by its corresponding
frequency $f_i$: $f_i (x_i - \bar{x})^4$.
4. Sum these Products: Sum all the values from the $f_i (x_i - \bar{x})^4$ column. This sum
represents $N \times m_4$.
5. Calculate the Second Central Moment ($m_2$): This is the variance, calculated as
$m_2 = \frac{\sum f_i (x_i - \bar{x})^2}{\sum f_i}$.
6. Apply the Kurtosis Formula: Substitute the calculated sums into the formula for
Pearson's Coefficient of Kurtosis.
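As a rough sketch of the procedure above, Pearson's kurtosis and excess kurtosis might be computed from a frequency table as follows. The function names, the table, and the tolerance used to label a result "mesokurtic" are all assumptions made for illustration.

```python
def central_moment(values, freqs, r):
    """r-th central moment: (1 / N) * sum(f * (x - mean)^r)."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / n
    return sum(f * (x - mean) ** r for x, f in zip(values, freqs)) / n


def pearson_kurtosis(values, freqs):
    """K_u = m4 / m2^2."""
    m2 = central_moment(values, freqs, 2)
    m4 = central_moment(values, freqs, 4)
    return m4 / m2 ** 2


def classify_kurtosis(ku, tol=0.05):
    """Label a kurtosis value; `tol` is an arbitrary illustrative threshold."""
    excess = ku - 3
    if excess > tol:
        return "leptokurtic"
    if excess < -tol:
        return "platykurtic"
    return "mesokurtic"


# Hypothetical table: mean = 2, m2 = 1.0, m4 = 2.2.
values = [1, 2, 3, 4]
freqs = [4, 3, 2, 1]
ku = pearson_kurtosis(values, freqs)
excess = ku - 3
label = classify_kurtosis(ku)
print(round(ku, 3), round(excess, 3), label)
```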
Application and Insights: Kurtosis is instrumental in identifying distributions with extreme
values or outliers ("heavy tails") or those that are unusually flat ("light tails"). In fields such as
finance or risk management, high kurtosis (leptokurtic distributions) signifies a higher probability
of extreme events (e.g., market crashes, large gains/losses) compared to what a normal
distribution would predict. This understanding is critical for risk assessment, portfolio
optimization, and comprehending the potential for rare, impactful events. Like skewness,
kurtosis is crucial for assessing normality assumptions for statistical tests.
approximately 0.77.
● Median: Total observations ($N$) = 30. Median position = $\frac{N}{2} = \frac{30}{2}$ = 15th observation.
Looking at the cumulative frequency column, the 15th observation falls within the "1
complaint" category (cf = 25, which is the first cumulative frequency greater than or equal to
15). Therefore, the Median = 1 complaint.
● Mode: From the frequency column, the value with the highest frequency is 0 (with a
frequency of 13). Therefore, the Mode = 0 complaints.
● Standard Deviation ($\sigma$): Using Formulation 2: $\sigma = \sqrt{\frac{\sum f_i x_i^2}{\sum f_i} - \mu^2}$,
$\sigma = \sqrt{\frac{37}{30} - 0.767^2} = \sqrt{1.2333 - 0.5883} = \sqrt{0.645} \approx 0.803$
The standard deviation is approximately 0.803 complaints. This indicates that the typical
deviation of daily complaints from the average is about 0.8 complaints.
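● Skewness ($g_1$): We need
$m_3 = \frac{1}{N} \sum f_i (x_i - \bar{x})^3 = \frac{5.12}{30} \approx 0.1707$ and
$m_2 = \frac{1}{N} \sum f_i (x_i - \bar{x})^2 = \frac{19.9}{30} \approx 0.6633$.
Then $\text{Skewness} \approx \frac{m_3}{m_2^{3/2}} = \frac{0.1707}{0.6633^{1.5}} \approx 0.316$.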
● Kurtosis ($K_u$): We need
$m_4 = \frac{1}{N} \sum f_i (x_i - \bar{x})^4 = \frac{33.835}{30} \approx 1.1278$.
Kurtosis $K_u = \frac{m_4}{m_2^2} = \frac{1.1278}{0.6633^2} = \frac{1.1278}{0.4399} \approx 2.564$.
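The arithmetic in this example can be re-checked from the sums quoted above. The sketch below takes those sums as given rather than recomputing them from the frequency table, and the variable names are illustrative.

```python
from math import sqrt

n = 30            # total observations
sum_fx2 = 37      # quoted sum of f * x^2
mean = 0.767      # quoted mean

# Formulation 2: sigma = sqrt(sum(f * x^2) / sum(f) - mean^2).
sigma = sqrt(sum_fx2 / n - mean ** 2)

# Central moments from the quoted sums of powered deviations.
m2 = 19.9 / n
m3 = 5.12 / n
m4 = 33.835 / n

skewness = m3 / m2 ** 1.5   # about 0.316
kurtosis = m4 / m2 ** 2     # about 2.56

print(round(sigma, 3), round(skewness, 3), round(kurtosis, 2))
```

Carrying full precision through the division gives a kurtosis a shade below the hand-rounded figure, because the worked calculation rounds $m_2^2$ to 0.4399 before dividing.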
● Median: Total observations ($N$) = 50. Median position = $\frac{N}{2} = \frac{50}{2}$ = 25th observation.
The median class is the 66-73 interval, as its cumulative frequency (29) is the first to be
greater than or equal to 25.
○ $l$ (lower limit of median class) = 66
○ $n$ (total observations) = 50
○ $c$ (cumulative frequency of class before median class) = 16
○ $f$ (frequency of median class) = 13
○ $h$ (class width) = 8
$\text{Median} = 66 + \frac{\frac{50}{2} - 16}{13} \times 8 = 66 + \frac{25 - 16}{13} \times 8 = 66 + \frac{9}{13} \times 8 = 66 + 0.6923 \times 8 = 66 + 5.5384 \approx 71.54$.
The estimated median score is 71.54.
● Mode: Two classes share the maximum frequency: 66-73 (frequency 13) and 74-81
(frequency 13). The distribution is therefore bimodal in terms of modal classes. Let's
calculate the mode for the first modal class (66-73) for demonstration.
○ $l$ (lower limit of modal class) = 66
○ $f_m$ (frequency of modal class) = 13
○ $f_1$ (frequency of class before modal class) = 9
○ $f_2$ (frequency of class after modal class) = 13
○ $h$ (class width) = 8
$\text{Mode} = 66 + \frac{13 - 9}{2(13) - 9 - 13} \times 8 = 66 + \frac{4}{26 - 22} \times 8 = 66 + \frac{4}{4} \times 8 = 66 + 1 \times 8 = 74$.
The estimated mode for the 66-73 class is 74, which falls on the boundary with the next
class because $f_2 = f_m$. If we were to calculate for the 74-81 class, it would be different.
This illustrates why the mode can be less helpful for continuous data.
● Standard Deviation ($\sigma$): Using Formulation 2:
$\sigma = \sqrt{\frac{\sum f_i x_i^2}{\sum f_i} - \mu^2}$,
$\sigma = \sqrt{\frac{253432}{50} - 70.46^2} = \sqrt{5068.64 - 4964.6116} = \sqrt{104.0284} \approx 10.20$.
0.234.
● Kurtosis ($K_u$): We need
$m_4 = \frac{1}{N} \sum f_i (x_i - \bar{x})^4 = \frac{4130090.61}{50} \approx 82601.81$.
Kurtosis $K_u = \frac{m_4}{m_2^2} = \frac{82601.81}{117.58^2} = \frac{82601.81}{13824.95} \approx 5.975$.
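The grouped-data arithmetic in this example can be re-checked from the figures quoted above. The sketch below takes those figures as given (it does not rebuild the frequency table), and the variable names are illustrative.

```python
from math import sqrt

# Median of grouped data: l + ((n/2 - c) / f) * h.
l, n, c, f, h = 66, 50, 16, 13, 8
median = l + (n / 2 - c) / f * h

# Mode for the 66-73 class: l + ((fm - f1) / (2*fm - f1 - f2)) * h.
fm, f1, f2 = 13, 9, 13
mode = l + (fm - f1) / (2 * fm - f1 - f2) * h

# Standard deviation via Formulation 2 with the quoted sum and mean.
sigma = sqrt(253432 / n - 70.46 ** 2)

# Pearson's kurtosis m4 / m2^2 with the quoted moments.
m4 = 4130090.61 / n
m2 = 117.58
kurtosis = m4 / m2 ** 2

print(round(median, 2), mode, round(sigma, 2), round(kurtosis, 2))
```

The exponent on $m_2$ is 2, which gives a kurtosis of roughly 5.97 and places this distribution in the leptokurtic range (above 3).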
7. Conclusion
The comprehensive analysis of central tendency, dispersion, and shape measures—mean,
median, mode, standard deviation, skewness, and kurtosis—provides a robust and holistic
understanding of any dataset. These descriptive statistics, when used collectively and
interpreted in context, transform raw data into a coherent narrative.
Measures of central tendency (mean, median, mode) pinpoint the typical value, while measures
of dispersion (standard deviation) quantify the variability around that center. The interplay
between these measures, particularly the relative positions of the mean, median, and mode,
serves as a powerful diagnostic indicator of a distribution's asymmetry. This initial assessment
of distribution shape is crucial, as it informs the selection of appropriate statistical tests,
ensuring the validity and reliability of subsequent analyses.
Furthermore, skewness and kurtosis offer deeper insights into the distribution's form. Skewness
reveals the direction and extent of asymmetry, which is critical for identifying non-normal
distributions. Kurtosis, by describing peakedness and tail heaviness, provides valuable
information about the presence and impact of extreme values or outliers. In fields like risk
management, a high kurtosis value can signal a greater probability of extreme events than a
normal distribution would suggest, which is vital for informed decision-making.
Ultimately, mastering these fundamental concepts is paramount for anyone engaging in
data-driven decision-making. They form the bedrock for more advanced statistical inference and
modeling, enabling analysts to succinctly convey complex data characteristics to diverse
audiences. This comprehensive understanding of data's core features, variability, and
underlying distribution is vital for effective data storytelling and collaborative decision-making in
any domain.
Works cited