UNIT-3
DESCRIPTIVE STATISTICS
Introduction to Descriptive Statistics:
Statistics is the foundation of data science. Descriptive statistics are
simple tools that help us understand and summarize data. They show
the basic features of a dataset, like the average, highest and lowest
values and how spread out the numbers are. It's the first step in
making sense of information.
Types of Descriptive Statistics:
Descriptive statistics methods fall into three standard categories, each serving a different purpose in summarizing and describing data. They help us understand:
1. Where the data centers (Measures of Central Tendency)
2. How much data is spread out from the central tendency
(Measure of Variability)
3. How the data is distributed (Measures of Frequency Distribution)
1. Measures of Central Tendency:
Statistical values that describe the central position within a dataset.
There are three main measures of central tendency:
Mean: The mean is the sum of the observations divided by the total number of observations, i.e., the average.
x̄ = Σx / n
where,
x = Observations
n = number of terms
Example:
import numpy as np
# Sample Data
arr = [5, 6, 11]
# Mean
mean = np.mean(arr)
print ("Mean = ", mean)
Output
Mean = 7.333333333333333
Mode: The most frequently occurring value in the dataset. It’s useful
for categorical data and in cases where knowing the most common
choice is crucial.
Example:
import statistics
# sample Data
arr = [1, 2, 2, 3]
# Mode
mode = statistics.mode(arr)
print("Mode = ", mode)
Output:
Mode = 2
Median: The median is the middle value in a sorted dataset. If the
number of values is odd, it is the middle value; if even, it is the average
of the two middle values.
Example:
import numpy as np
# sample Data
arr = [1, 2, 3, 4]
# Median
median = np.median(arr)
print ("Median = ", median)
Output:
Median = 2.5
2. Measure of Variability:
Measures of variability, also called measures of dispersion, describe the spread or distribution of observations in a dataset. They help in identifying outliers, assessing model assumptions, and understanding data variability in relation to the mean. The key measures of variability include:
1. Range: describes the difference between the largest and smallest data points in our dataset. The bigger the range, the more spread out the data, and vice versa. While easy to compute, the range is sensitive to outliers. It can provide a quick sense of the data spread but should be complemented with other statistics.
Range = Largest data value - smallest data value
Example:
import numpy as np
# Sample Data
arr = [1, 2, 3, 4, 5]
# Finding Max
Maximum = max(arr)
# Finding Min
Minimum = min(arr)
# Difference Of Max and Min
Range = Maximum-Minimum
print("Maximum = {}, Minimum = {} and Range = {}".format(
Maximum, Minimum, Range))
Output
Maximum = 5, Minimum = 1 and Range = 4
2. Variance: is defined as the average squared deviation from the mean. It is calculated by finding the difference between every data point and the mean, squaring these differences, adding them all, and then dividing by the number of data points in the dataset.
σ² = Σ (xᵢ - μ)² / N
where,
xᵢ = Observation under consideration
N = number of terms
μ = Mean
Example:
import statistics
# sample data
arr = [1, 2, 3, 4, 5]
# variance
Print ("Var = ", (statistics.variance(arr)))
Output
Var = 2.5
3. Standard deviation: Standard deviation is widely used to measure
the extent of variation or dispersion in data. It's especially important
when assessing model performance (e.g., residuals) or comparing
datasets with different means.
It is defined as the square root of the variance. It is calculated by subtracting the mean from each observation, squaring the results, adding them all, dividing by the number of terms, and finally taking the square root.
σ = √[ Σ (xᵢ - μ)² / N ]
where,
xᵢ = Observation under consideration
N = number of terms
μ = Mean
Example:
import statistics
# Sample data
arr = [1, 2, 3, 4, 5]
# Standard deviation
print("Std = ", statistics.stdev(arr))
Output
Std = 1.5811388300841898
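Note: The formulas above divide by N (population variance and standard deviation), while Python's statistics.variance() and statistics.stdev() divide by n - 1 (sample estimates), which is why the outputs are 2.5 and 1.58 rather than 2.0 and 1.41. A minimal sketch contrasting the two on the same data:
import statistics
arr = [1, 2, 3, 4, 5]
# Sample estimates: divide by (n - 1)
print("Sample variance =", statistics.variance(arr))       # 2.5
print("Sample std dev  =", statistics.stdev(arr))          # 1.5811...
# Population measures: divide by N, matching the formulas above
print("Population variance =", statistics.pvariance(arr))  # 2.0
print("Population std dev  =", statistics.pstdev(arr))     # 1.4142...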
Variability measures are important in residual analysis to check how
well a model fits the data.
3. Measures of Frequency Distribution
Measures of frequency distribution describe how often values occur in
a dataset. They help us understand the spread of data across
categories or intervals and summarize large datasets into simpler
frequency tables, graphs, or numerical measures.
Here are the main measures of frequency distribution:
1. Frequency (f)
The number of times a particular value (or class) appears in the
dataset.
Example: In exam scores, if 5 students scored 80 marks, then the
frequency of 80 is 5.
2. Relative Frequency
The proportion of observations that fall into a class, compared to
the total number of observations.
Formula:
Relative Frequency = Frequency of a class / Total number of observations
Values are usually expressed as fractions, decimals, or percentages.
Example: If 5 out of 50 students scored 80 marks,
Relative Frequency = 5/50 = 0.10 (10%)
3. Cumulative Frequency
Running total of frequencies up to a certain class or value.
Two types:
o Less than cumulative frequency → counts all observations
less than or equal to a certain value.
o More than cumulative frequency → counts all
observations greater than or equal to a certain value.
Useful for constructing ogives (cumulative frequency curves).
4. Percentage Frequency
Frequency expressed as a percentage of the total.
Formula:
Percentage Frequency = (f / N) × 100
where f = class frequency, N = total number of observations.
5. Class Intervals & Class Frequencies (for grouped data)
When data is large, values are grouped into intervals (bins), and
the frequency of each interval is recorded.
Example:
Marks Frequency
0–10 3
10–20 7
20–30 12
Here, "10–20" is a class interval, and "7" is the class frequency.
6. Graphical Representations of Frequency Distribution
Histogram – shows frequencies with bars for intervals.
Frequency Polygon – line graph connecting midpoints of
intervals.
Ogive (Cumulative Frequency Curve) – shows cumulative
frequencies.
Bar Graph / Pie Chart – used for categorical frequency
distribution.
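The frequency measures above can be computed directly in Python. A minimal sketch, using hypothetical exam scores and only the standard library (collections.Counter):
from collections import Counter
# Hypothetical exam scores
scores = [80, 75, 80, 90, 75, 80, 60, 75, 80, 80]
N = len(scores)
freq = Counter(scores)                 # frequency of each score
cumulative = 0
for value in sorted(freq):
    f = freq[value]                    # frequency
    rel = f / N                        # relative frequency
    pct = rel * 100                    # percentage frequency
    cumulative += f                    # less-than cumulative frequency
    print(value, f, round(rel, 2), f"{pct:.0f}%", cumulative)
A histogram of the same data could then be drawn with a plotting library such as matplotlib.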
Data Preparation in Descriptive Statistics:
Before we can apply descriptive statistics (like mean, median, mode,
variance, charts, etc.), we must prepare the data properly. Data
preparation ensures accuracy, consistency, and reliability in statistical
analysis.
Steps in Data Preparation
1. Data Collection
o Gather raw data from surveys, experiments, databases, or
observations.
o Example: Collecting student marks in Mathematics out of
100.
2. Data Cleaning
o Handling Missing Values:
Remove records with too many missing values.
Replace missing values with mean, median, or mode
(imputation).
o Removing Duplicates: Avoid repeated entries.
o Correcting Errors: Fix typos, wrong units, or out-of-range
values.
Example: A student’s marks recorded as 120 (invalid,
since out of 100).
3. Data Transformation
o Standardization / Normalization: Converting data into a
uniform scale.
o Categorical Encoding: Changing qualitative data into
numerical form.
Example: Male → 1, Female → 0.
4. Data Reduction (if necessary)
o Summarize or group data into categories.
o Example: Group ages into intervals (10–20, 21–30, etc.).
5. Data Organization
o Arrange data systematically in tables, spreadsheets, or
databases.
o Example: Frequency distribution table of marks.
6. Checking Data Consistency
o Ensure data follows logical rules.
o Example: In student data, "age = 8" and "grade = college" is
inconsistent.
Why Data Preparation is Important:
Removes errors.
Makes analysis more meaningful.
Helps in applying statistical tools effectively.
Ensures valid conclusions from descriptive statistics.
Example:
Suppose we collect student marks:
[45, 50, 55, 60, 95, 100, 120, -, 75, 80]
After Data Preparation:
Remove the invalid value 120 (marks cannot exceed 100).
Handle the missing value "-" by removing it or replacing it with the mean/median.
If both entries are removed, the final dataset is: [45, 50, 55, 60, 75, 80, 95, 100]
Now we can calculate mean, median, mode, range, standard
deviation, etc.
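As an illustration, the cleaning steps above can be sketched with pandas (assuming it is installed); the imputation choice and data are only for illustration:
import pandas as pd
# Marks from the example above; None marks the missing entry
marks = pd.Series([45, 50, 55, 60, 95, 100, 120, None, 75, 80])
# Correcting errors: flag out-of-range values (marks are out of 100) as missing
marks = marks.where(marks <= 100)
# Handling missing values: drop them here
# (marks.fillna(marks.median()) would impute with the median instead)
clean = marks.dropna()
print(sorted(clean.tolist()))
# [45.0, 50.0, 55.0, 60.0, 75.0, 80.0, 95.0, 100.0]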
Exploratory Data Analysis:
Exploratory Data Analysis (EDA) is the process of analyzing a dataset in order to summarize it and describe its main characteristics. EDA is a fundamental process in many Data Science and analysis tasks.
Different types of Exploratory Data Analysis
There are broadly two categories of EDA
1. Univariate Exploratory Data Analysis
2. Multivariate Exploratory Data Analysis
Univariate Exploratory Data Analysis - In Univariate Data Analysis
we use one variable or feature to determine the characteristics
of the dataset. We derive the relationships and distribution of
data concerning only one feature or variable. In this category, we
have the liberty to use either the raw data or follow a graphical
approach.
o In the Univariate raw data approach or Non-Graphical, we
determine the distribution of data based on one variable
and study a sample from the population. Also, we may
include outlier removal which is a part of this process.
Let's look into some of the non-graphical approaches.
o Measures of Central Tendency: Central tendency tries to summarize a whole population or dataset with the help of a single value that represents the central value.
The three measures are the mean, the median, and the mode.
o Mean: It is the average of all the observations. i.e., the sum
of all observations divided by the number of observations.
o Median: It is the middle value of the observations or
distribution after arranging them in ascending or
descending order.
o Mode: It is the most frequently occurring observation.
o Variance: It indicates the spread of the data about the middle or mean value. It helps us gather information about the observations in relation to central tendencies like the mean. It is calculated as the mean of the squared deviations of the observations from the mean.
o In the Univariate graphical approach, we may use any
graphing library to generate graphs like histograms,
boxplots, quantile-quantile plots, violin plots, etc. for
visualization. Data Scientists often use visualization to
discover anomalies and patterns. The graphical method is a
more subjective approach to EDA. These are some of the
graphical tools to perform univariate analysis.
o Histograms: They represent the count of values falling within particular ranges. The frequency of data is shown as rectangles (bars), which can be drawn vertically or horizontally.
o Box plots: Also known as box-and-whisker plots. They use lines and boxes to show the distribution of data from one or more groups. A central line indicates the median value, and the extended whiskers capture the rest of the data. They are useful for comparing groups of data and assessing symmetry.
o Q-Q plots: A Q-Q plot is used to determine whether two datasets come from the same or different distributions.
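A minimal sketch of these univariate plots, assuming matplotlib and SciPy are available, on a hypothetical normally distributed sample:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# Hypothetical sample
rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=200)
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes[0].hist(data, bins=20, edgecolor="black")   # histogram
axes[0].set_title("Histogram")
axes[1].boxplot(data)                            # box plot
axes[1].set_title("Box plot")
stats.probplot(data, dist="norm", plot=axes[2])  # Q-Q plot against a normal distribution
axes[2].set_title("Q-Q plot")
plt.tight_layout()
plt.show()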
Multivariate Exploratory Data Analysis :
In Multivariate analysis we use more than one variable to show
the relationships and visualizations. It is used to show the
interaction between different fields.
o Multivariate Non-Graphical (raw data): Techniques like
tabulation of more than two variables.
o Multivariate Graphical: In visualization analysis for
multivariate statistics, the below plots can be used.
Scatterplot: It is used to display the relationship
between two variables by plotting the data as dots.
Additionally, color coding can be intelligently used to
show groups within the two features based on a third
feature.
Heatmap: In this visualization technique the values are represented with colors, with a legend mapping colors to different levels of the value. It is a 2D graph.
Bubble plot: In this graph, circles are used to show different values. The radius of each circle is proportional to the value of the data point.
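A minimal sketch of a multivariate scatterplot with color coding by a third feature, assuming matplotlib is available and using made-up data:
import numpy as np
import matplotlib.pyplot as plt
# Hypothetical data: two numeric features and a group label
rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 2 * x + rng.normal(scale=0.5, size=100)
group = rng.integers(0, 3, size=100)   # third feature used for color coding
plt.scatter(x, y, c=group)             # scatterplot colored by group
plt.xlabel("x")
plt.ylabel("y")
plt.title("Scatterplot colored by a third feature")
plt.show()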
Estimation in Statistics:
Estimation is a technique for inferring information about a larger group (the population) from a smaller sample; it is a crucial part of analyzing data in statistics.
For instance, the average age of a city's population may be obtained
by taking the age of a sample of 1,000 residents. While estimates
aren't perfect, they are typically trustworthy enough to be of value.
Estimation in statistics involves using sample data to make educated
guesses about a population's characteristics, such as mean, variance,
or proportion. The population refers to the entire interest group, like
all people in a country or all products made by a company.
Since it's often impractical to measure every member of a population,
statisticians rely on samples to make inferences about the entire
population.
Estimation helps to conclude population parameters based on sample
data.
Bias and Variance: Estimators can have bias, consistently overestimating or underestimating the true parameter. Variance measures the spread of estimator values around their expected value. Both bias and variance impact the accuracy of estimators.
Mean and Variance of Estimators: Estimators are themselves random variables, so they have a mean and a variance. Ideally, the mean of an estimator should equal the parameter it is estimating (unbiasedness). The variance of an estimator indicates its precision or variability.
Purpose of Estimation in Statistics
Statistical estimation is essential for finding unknown population
parameters using sample data, like the mean and variance, without
individual measurements.
This evaluation is vital for decision-making in business and
healthcare, informing strategies and treatment options.
It is closely linked to hypothesis testing, contributing to scientific
development, political decisions, public health, and economic
choices.
Risk assessment benefits from evaluation in managing
probabilities and risk in finance and insurance.
Quality control also relies on evaluation to ensure products and
services meet standards by identifying and correcting deviations.
Types of Estimation
Estimation is of two types that include:
Point Estimation
Interval Estimation
Point Estimation:
A single value (called a point estimate) is used to approximate an unknown population parameter. This single number, produced by a point estimator, gives a rough idea of the group's characteristics.
The population mean is estimated using the sample mean. Similar
techniques can be applied to estimate other attributes, like
percentages of specific characteristics in a population. While not
always precise, these estimates offer a good understanding of the
group's traits.
For instance, measuring the heights of a few random people can be used to estimate the average height of the entire group. If the individuals measured were 5 feet, 6 feet, and 5 feet tall, we could estimate the average height to be about 5.3 feet.
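A minimal sketch of point estimation, using a made-up population of heights and the sample mean as the point estimate:
import random
import statistics
# Hypothetical population: heights (in feet) of 10,000 people
random.seed(0)
population = [random.gauss(5.5, 0.3) for _ in range(10_000)]
# Draw a random sample and use the sample mean as a point estimate of the population mean
sample = random.sample(population, 50)
print("Point estimate (sample mean):", round(statistics.mean(sample), 2))
print("True population mean:        ", round(statistics.mean(population), 2))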
Interval Estimation:
Interval estimates give a range likely to contain the true parameter.
This method recognizes data variability and estimation uncertainty.
When estimating the number of jelly beans in a jar, it is better to
provide a range, known as a confidence interval, rather than a single
guess. This range, such as 80 to 120 jelly beans, allows for uncertainty
in the estimate and acknowledges the margin of error.
Confidence intervals give us a sense of freedom in our estimations,
while point estimates only provide a single number without
considering this uncertainty.
Confidence Interval in Interval Estimation:
A confidence interval is the range of values, derived from a sample,
that is likely to contain the true value of an unknown population
parameter.
A (1 − α) × 100% confidence interval means: "If we repeat the experiment many times, then (1 − α) × 100% of the constructed intervals will contain the true parameter."
Example: A 95% confidence level means: "If we took 100 random
samples and built 100 confidence intervals, we expect about 95 of
them to contain the true population parameter."
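This repeated-sampling interpretation can be illustrated with a small simulation, assuming NumPy and SciPy are available; the population parameters here are made up:
import numpy as np
from scipy import stats
rng = np.random.default_rng(42)
true_mean = 100
covered = 0
# Build 100 confidence intervals from 100 random samples
for _ in range(100):
    sample = rng.normal(loc=true_mean, scale=15, size=30)
    low, high = stats.t.interval(0.95, df=len(sample) - 1,
                                 loc=sample.mean(), scale=stats.sem(sample))
    if low <= true_mean <= high:
        covered += 1
print("Intervals containing the true mean:", covered, "out of 100")  # typically around 95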
Factors Affecting Estimation
Various factors affecting estimation are:
1. Sample Size: Larger sample sizes lead to more precise estimates,
increasing the likelihood of accurately representing the population
parameter.
Estimating the average height of students in a school is more
accurate with a larger sample size.
Measuring just five students may not be reliable, but measuring
50 or even 500 students can provide a better idea of the true
average height.
A larger sample size leads to a more accurate estimate of the
entire population's characteristics.
In short, studying more individuals results in a more precise
estimate of the entire population.
2. Sampling Method: The sampling method affects estimate accuracy.
A random sample with every member having an equal chance ensures
an unbiased estimate, improving accuracy.
The sampling method is crucial for accurate estimations.
Random sampling selects individuals purely by chance, giving
each an equal chance of being chosen.
This ensures a fair representation of the entire group, making it
useful for determining distributions like colored candies in a jar
or favorite ice cream flavors in a town without bias.
Random sampling helps reflect the opinions of the whole group,
not just a subset, leading to fair and unbiased findings for
drawing accurate conclusions about a population or problem.
Estimation Methods
Several statistical techniques are used to estimate unknown
parameters from data:
1. Method of Moments
This method equates the moments (such as the mean and variance) computed from the sample data to the corresponding population moments expressed in terms of the unknown parameters.
The population parameters are then estimated by solving the resulting equations.
2. Maximum Likelihood Estimation (MLE)
Maximum likelihood estimation (MLE) aims to find parameter
values that give the highest chance of observing the given
sample.
It involves identifying values that maximize the likelihood of the
observed data.
3. Least Squares Estimation
LSE minimizes the sum of squared differences between observed
values and predicted values.
It’s widely used to find the best fit for a model, especially in
regression, by reducing prediction errors.
4. Bayesian Estimation
Bayesian Estimation updates prior belief about unknown
parameters using observed data through Bayes' theorem.
In Bayesian learning, parameters are random variables with
probabilities assigned to them.
5. Interval Estimation
This approach constructs a range of plausible values (e.g.,
confidence intervals) for an unknown parameter, reflecting
uncertainty with a specified level of confidence.
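As a small illustration of Maximum Likelihood Estimation (method 2 above), consider a normal model, where the MLEs happen to have closed forms; the data here are simulated:
import numpy as np
# Hypothetical sample assumed to come from a normal distribution
rng = np.random.default_rng(7)
data = rng.normal(loc=10, scale=2, size=500)
# For a normal model the maximum likelihood estimates are:
mu_mle = data.mean()                                 # MLE of the mean
sigma_mle = np.sqrt(((data - mu_mle) ** 2).mean())   # MLE of the std dev (divides by N)
print("MLE of mean   :", round(mu_mle, 3))
print("MLE of std dev:", round(sigma_mle, 3))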
Applications of Estimations in Computer Science
Machine Learning and AI
Estimating parameters of models from data (e.g., weights in
neural networks, coefficients in regression).
Example: In linear regression, we use least squares estimation to
find the line of best fit. In Bayesian networks, Bayesian
estimation is used to compute posterior probabilities.
Data Compression
Estimating the probability distribution of symbols for efficient
encoding.
Example: In Huffman coding, character frequencies are
estimated to build optimal prefix codes. In arithmetic coding,
probabilities are estimated in real time to achieve better
compression ratios.
Cryptography
Estimating the entropy of keys or messages to assess strength.
Example: Estimating the key recovery success rate based on
observed ciphertexts.
Computer Graphics and Vision
Estimating depth, object boundaries, or motion in visual scenes.
Example: In SLAM (Simultaneous Localization and Mapping),
estimation is used to track a robot’s position in space.
Statistical inference:
Statistical inference is the process of using data analysis to infer properties of an underlying distribution of a population. It is the branch of statistics that deals with drawing conclusions or making predictions about a population based on data collected from a sample of that population.
Statistical inference is grounded in probability theory and probability distributions. It involves making assumptions about the population and the sample, and using statistical methods and models to analyze the sample data and make inferences about parameters or characteristics of the entire population from which the sample was drawn.
Consider a scenario where you are presented with a bag filled with beans of different shapes and colors, a bag too big for you to count each bean individually. The task is to determine the proportion of red-coloured beans without spending much effort and time. This is how statistical inference works in this context.
You simply pick a small random sample, a handful of beans, and calculate the proportion of red beans in it. In this case, you have used a small subset (your handful of beans) to make an inference about a much larger population: the entire bag of beans.
Branches of Statistical Inference
There are two main branches of statistical inference:
Parameter Estimation
Hypothesis Testing
Parameter Estimation
Parameter estimation is another primary goal of statistical inference.
Parameters are capable of being deduced; they are quantified traits or
properties related to the population you are studying. Some instances
comprise the population mean, population variance, and so on-the-
list. Imagine measuring each person in a town to realize the mean.
This is a daunting if not an impossible task. Thus, most of the time, we
use estimates.
There are two broad methods of parameter estimation:
Point Estimation
Interval Estimation
Hypothesis Testing:
Hypothesis testing is used to make decisions or draw conclusions
about a population based on sample data. It involves formulating a
hypothesis about the population parameter, collecting sample data,
and then using statistical methods to determine whether the data
provide enough evidence to reject or fail to reject the hypothesis.
Statistical Inference Methods
There are various methods of statistical inference, some of these
methods are:
Parametric Methods
Non-parametric Methods
Bayesian Methods
Let's discuss these methods in detail as follows:
Parametric Methods:
Parametric statistical methods assume that the data are drawn from a population characterized by a known probability distribution. Most often it is assumed that the data follow a normal distribution, which allows one to make inferences about the population in question. For example, t-tests and ANOVA are parametric tests that give accurate results under the assumption that the data are approximately normally distributed.
Example: A psychologist may ask himself if there is a measurable
difference, on average, between the IQ scores of women and
men. To test his theory, he draws samples from each group and
assumes they are both normally distributed. He can opt for a
parametric test such as t-test and assess if the mean disparity is
statistically significant.
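A minimal sketch of such a parametric test, assuming SciPy is available; the IQ scores below are simulated purely for illustration:
import numpy as np
from scipy import stats
# Hypothetical IQ scores for two groups, assumed normally distributed
rng = np.random.default_rng(3)
women = rng.normal(loc=100, scale=15, size=40)
men = rng.normal(loc=103, scale=15, size=40)
# Independent two-sample t-test (parametric)
t_stat, p_value = stats.ttest_ind(women, men)
print("t =", round(t_stat, 3), " p =", round(p_value, 3))
# A small p-value (e.g., < 0.05) would suggest a statistically significant mean difference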
Non-Parametric Methods:
These are less assumptive and more flexible analysis methods for data that do not follow a normal distribution. They are also used when one is uncertain about meeting the assumptions of parametric methods, or when the data are scarce or inadequate. Some of the non-parametric tests include the Wilcoxon signed-rank test and the Kruskal-Wallis test, among others.
Example: A biologist has collected data on plant health in an
ordinal variable but since it is only a small sample and normal
assumption is not met, the biologist can use Kruskal-Wallis
testing.
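A minimal sketch of the Kruskal-Wallis test, assuming SciPy is available; the ordinal plant-health scores below are made up:
from scipy import stats
# Hypothetical ordinal plant-health scores (1-5) for three small groups
group_1 = [3, 4, 2, 5, 3]
group_2 = [1, 2, 2, 3, 1]
group_3 = [4, 5, 5, 4, 3]
h_stat, p_value = stats.kruskal(group_1, group_2, group_3)
print("H =", round(h_stat, 3), " p =", round(p_value, 3))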
Bayesian Methods
Bayesian statistics is distinct from conventional methods in that it incorporates prior knowledge and beliefs. It evaluates the probability of a hypothesis being true in light of both prior and current knowledge. Thus, it allows the likelihood of beliefs to be updated with new data.
Example: Consider a situation where a doctor is investigating a new treatment and has a prior belief about its success rate. Upon conducting a new clinical trial, the doctor uses a Bayesian method to update this "prior belief" with the data from the new trial to estimate the true success rate of the treatment.
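A minimal sketch of this Bayesian updating, using a Beta-Binomial model with a made-up prior and trial data:
# Prior belief about the treatment's success rate, encoded as a Beta(a, b) distribution
prior_a, prior_b = 8, 2          # corresponds to roughly an 80% prior success rate
# Hypothetical new clinical trial: 12 successes out of 20 patients
successes, failures = 12, 8
# Bayes' theorem for the Beta-Binomial model reduces to adding the counts
post_a = prior_a + successes
post_b = prior_b + failures
posterior_mean = post_a / (post_a + post_b)
print("Posterior mean success rate:", round(posterior_mean, 3))   # 0.667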
Statistical Inference Techniques
Some of the common techniques for statistical inference are:
Hypothesis Testing
Confidence Intervals
Regression Analysis
Hypothesis Testing:
One of the central parts of statistical analysis is hypothesis testing, which draws inferences or conclusions about a population from sample data. Hypothesis testing may be defined as a structured technique that involves formulating two opposing hypotheses, choosing a significance level (alpha), computing a test statistic, and making a decision based on the obtained outcome. Two types of hypotheses can be distinguished: a null hypothesis (H0), which signifies no significant difference, and an alternative hypothesis (H1 or Ha), which expresses a significant effect or difference.
Example: Suppose a car manufacturing company claims that their new car model gives a mileage of not less than 25 miles/gallon. An independent agency collects data for a sample of these cars and performs a hypothesis test. The null hypothesis would be that the car does give a mileage of at least 25 miles/gallon, tested against the alternative hypothesis that it does not. The sample data would then be used to either reject or fail to reject the null hypothesis.
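A minimal sketch of this test, assuming a recent SciPy version and made-up mileage measurements:
from scipy import stats
# Hypothetical mileage measurements (miles/gallon) for a sample of the new cars
mileage = [24.1, 25.3, 23.8, 24.9, 24.5, 23.6, 25.1, 24.2, 24.7, 23.9]
# H0: mean mileage >= 25; test against the one-sided alternative that it is lower
t_stat, p_value = stats.ttest_1samp(mileage, popmean=25, alternative="less")
print("t =", round(t_stat, 3), " p =", round(p_value, 4))
# A small p-value would lead us to reject the manufacturer's claim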
Confidence Intervals (CI)
A confidence interval specifies a range of values within which the population parameter is likely to lie, together with a stated confidence level, usually 95%. In simpler terms, CIs provide an estimate of the population value along with the level of uncertainty that comes with it.
Example: A study on health records could show that the 95% CI for average blood pressure is 120 to 130. In other words, we are 95% confident that the average blood pressure of the whole population lies between 120 and 130.
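A minimal sketch of computing such an interval with the standard library; the blood-pressure readings and the tabulated t value (2.262 for 9 degrees of freedom at 95%) are illustrative:
import statistics
from math import sqrt
# Hypothetical systolic blood pressure readings from a sample of health records
bp = [118, 124, 130, 121, 127, 133, 119, 125, 128, 122]
n = len(bp)
mean = statistics.mean(bp)
sem = statistics.stdev(bp) / sqrt(n)   # standard error of the mean
t_crit = 2.262                         # t critical value for a 95% CI with df = 9
lower, upper = mean - t_crit * sem, mean + t_crit * sem
print(f"95% CI: ({lower:.1f}, {upper:.1f})")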
Regression Analysis
Linear regression, at its most basic level, examines how a dependent variable Y varies with an independent variable X. The regression equation, Y = a + bX + e, which describes the best-fit line through the data points, quantifies this variation. Multiple regression extends this to relationships involving more than two variables.
Example: Consider a company that is curious about the effect of its advertising on sales. Regression analysis allows this effect to be quantified: Y is the predicted outcome (e.g., sales), while X1, X2, and X3 are the observed variables (for example, different channels of advertising spend) used to predict it.
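A minimal sketch of fitting the line Y = a + bX by least squares with NumPy; the advertising and sales figures are made up:
import numpy as np
# Hypothetical data: advertising spend (X) and sales (Y)
X = np.array([1, 2, 3, 4, 5, 6, 7, 8])
Y = np.array([3.1, 4.9, 7.2, 9.1, 10.8, 13.2, 15.1, 16.9])
# np.polyfit returns coefficients from highest degree down: [slope b, intercept a]
b, a = np.polyfit(X, Y, deg=1)
print("Intercept a =", round(a, 2), " Slope b =", round(b, 2))
# Predicted sales for a new advertising spend of X = 10
print("Predicted Y at X = 10:", round(a + b * 10, 2))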
Applications of Statistical Inference:
Statistical inference has a wide range of applications across various
fields. Here are some common applications:
Clinical Trials: In medical research, statistical inference is used to
analyze clinical trial data to determine the effectiveness of new
treatments or interventions. Researchers use statistical methods
to compare treatment groups, assess the significance of results,
and make inferences about the broader population of patients.
Quality Control: In manufacturing and industrial settings,
statistical inference is used to monitor and improve product
quality. Techniques such as hypothesis testing and control charts
are employed to make inferences about the consistency and
reliability of production processes based on sample data.
Market Research: In business and marketing, statistical inference
is used to analyze consumer behavior, conduct surveys, and
make predictions about market trends. Businesses use
techniques such as regression analysis and hypothesis testing to
draw conclusions about customer preferences, demand for
products, and effectiveness of marketing strategies.
Economics and Finance: In economics and finance, statistical
inference is used to analyze economic data, forecast trends, and
make decisions about investments and financial markets.
Techniques such as time series analysis, regression modeling,
and Monte Carlo simulations are commonly used to make
inferences about economic indicators, asset prices, and risk
management.