1.1
Data is a collection of information: facts, numbers, or details
recorded about something.
Data
├── Categorical (Qualitative)
│   ├── Nominal → Names only (e.g., Eye Colour)
│   └── Ordinal → Ordered (e.g., Rank in class)
└── Numerical (Quantitative)
    ├── Discrete → The numbers are whole, with no fractions or
    │   decimals. (e.g., Number of students)
    └── Continuous → The numbers can be fractions or decimals.
        (e.g., Height, Weight)
1.2
A population is the complete collection of all items, people, or
observations that you want to study.
A sample is a smaller group selected from the population.
A good sample represents the population, like a mini version of it.
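The idea can be sketched in a few lines of Python. The population here (1,000 simulated heights) and the sample size of 50 are made-up assumptions for illustration only; the point is that the sample mean usually lands close to the population mean.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch gives the same numbers every run

# Hypothetical population: heights (cm) of 1,000 students (simulated values)
population = [random.gauss(165, 10) for _ in range(1000)]

# Simple random sample of 50 students, drawn without replacement
sample = random.sample(population, k=50)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"Population mean: {population_mean:.1f} cm")
print(f"Sample mean:     {sample_mean:.1f} cm")
```

A different seed or sample size gives a slightly different sample mean, which is exactly the uncertainty that inferential statistics deals with.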
1.3
Types of Statistics:
Statistics is the science of collecting, organizing, analysing, and
interpreting data.
Descriptive Statistics:
Descriptive statistics summarise and describe the main features of a
dataset.
They help us understand what the data shows through numbers and
visuals.
Examples: Mean, Median, Mode, Range, Standard Deviation, Bar Graphs,
Pie Charts.
Inferential Statistics:
Inferential statistics use data from a sample to make conclusions or
predictions about a population.
They help us infer or estimate things we can’t measure directly.
Examples: Estimation, Hypothesis Testing, Regression Analysis.
1.4
Exploratory Data Analysis (EDA): Exploring and understanding your
data before formal analysis.
It means looking closely at the data, using graphs, charts, and simple
calculations, to understand it better before doing any advanced
analysis.
1.5
The Objectives of EDA
1. Understand the database structure;
(Learn what’s inside your data — what columns (variables) you have, what
type of data (numbers, text, dates), and if any values are missing.
🧩 Example: Checking a table of patient records to see how many columns
(like age, weight, blood pressure) you have and what type of data each is.)
2. Visualise potential relationships (direction and magnitude) between
exposure and outcome variables;
(See how two or more things are connected — for example, does one go
up when the other goes up (positive relationship), or does it go down
(negative relationship)?
📊 Example: Making a scatter plot to see if hours studied are related to
exam marks — more study hours = higher marks.)
3. Detect outliers and anomalies (values that are significantly different
from the other observations);
(Find data points that are very different from the rest — they might be
mistakes or special cases.
⚠️ Example: In a list of student ages (15, 16, 17, 45, 16), the “45” is an
outlier — maybe entered by mistake.)
4. Develop parsimonious models (a predictive or explanatory model that
performs with as few exposure variables as possible);
(Build simple models that work well using only the most important
variables, not too many.
⚙️ Example: If you can predict a person’s blood pressure using just age
and weight, there’s no need to add ten extra variables.)
5. Extract and create clinically relevant variables.
(Create new meaningful data columns that can help in analysis or
predictions.
🧠 Example: Instead of using height and weight separately, you can create
a new variable called BMI (Body Mass Index) — which is more useful in
health studies.)
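As a rough sketch of objective 5, the BMI variable can be derived from height and weight using the standard formula BMI = weight (kg) ÷ height (m)². The patient records and field names below are hypothetical, chosen only for illustration.

```python
# Hypothetical patient records; the field names and values are illustrative
patients = [
    {"id": 1, "height_m": 1.60, "weight_kg": 55.0},
    {"id": 2, "height_m": 1.75, "weight_kg": 70.0},
    {"id": 3, "height_m": 1.82, "weight_kg": 95.0},
]


def bmi(weight_kg, height_m):
    """Body Mass Index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2


# Add the derived variable as a new "column" on every record
for patient in patients:
    patient["bmi"] = round(bmi(patient["weight_kg"], patient["height_m"]), 1)

for patient in patients:
    print(patient["id"], patient["bmi"])
```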
1.6
Classification of EDA Methods:
Univariate Graphical EDA (for Discrete Data): Here, we look at one variable
at a time.
Bar Plot: Shows how many times each category appears.
📊 Example: Number of students in each class (Class A, B, C).
Pie Chart: Shows how each category contributes to the total.
🥧 Example: Percentage of students by favourite fruit (apple, banana,
mango).
Univariate Graphical EDA (for Continuous Data): Still one variable — but
with continuous (measurable) data (like height, weight, time).
Line Plot: Shows data changing over time.
⏱ Example: Temperature during the day.
Histogram: Shows how data values are spread or distributed.
📊 Example: Distribution of students’ test scores.
Stem and Leaf Plot: Shows raw data and distribution together.
🌿 Example: Listing exam marks in a compact, visual way.
Boxplot: Shows spread, median, and outliers (using boxes and lines).
📦 Example: Comparing exam scores to find unusual results.
Bivariate Graphical EDA (Two Variables): We study the relationship
between two variables.
2D Line Plot: Shows how one variable changes with another.
📈 Example: Time vs. Sales — see if sales rise over time.
Curve Fitting: Fits a smooth curve to the data points to show the trend.
📉 Example: Relationship between study hours and marks.
Trivariate and Multivariate Graphical EDA (Three or More Variables): Used
to explore relationships among 3 or more variables.
3D Scatter Plot: Shows how three variables relate in 3D space.
🧊 Example: Height, Weight, and Age plotted together.
Side-by-Side Boxplots: Compare multiple groups at once.
📦📦 Example: Exam scores of boys vs. girls across subjects.
Heat Maps: Use colour to show relationships between many variables.
🌡 Example: Showing correlation between different features in a dataset.
3D Surface Plots: Show how two inputs affect one output.
🗻 Example: How temperature and humidity affect crop yield.
PCA Plot (Principal Component Analysis): Used to reduce many
variables into 2–3 key ones and visualise them.
💫 Example: Summarising 10 features of patients into 2 main axes to find
patterns.
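Most of the plots above need a plotting library, but the stem and leaf plot is simple enough to sketch in plain Python. The version below is a minimal sketch that assumes non-negative whole numbers and uses the tens digit as the stem; the marks are illustrative values.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group non-negative integers into {stem: [leaves]} using tens as stems."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # e.g. 57 -> stem 5, leaf 7
        plot[stem].append(leaf)
    return dict(plot)


marks = [13, 35, 54, 54, 55, 56, 57, 67, 85, 89, 96]  # illustrative exam marks
plot = stem_and_leaf(marks)

# Print every stem from lowest to highest, including stems with no leaves
for stem in range(min(plot), max(plot) + 1):
    leaves = " ".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem} | {leaves}")
```

Each row keeps the raw values (stem plus leaf), while the row lengths show the shape of the distribution, much like a histogram turned on its side.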
1.7
Univariate Non-graphical EDA:
For categorical data, it means making a simple frequency table that
shows how many items are in each group or category.
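A frequency table like this can be sketched with Python's built-in `collections.Counter`. The list of favourite fruits is made-up data for illustration.

```python
from collections import Counter

# Hypothetical survey answers (favourite fruit of eight students)
fruits = ["apple", "banana", "mango", "apple",
          "mango", "apple", "banana", "apple"]

counts = Counter(fruits)      # how many times each category appears
total = sum(counts.values())  # total number of observations

# Print category, count, and share of the total, most frequent first
for category, count in counts.most_common():
    print(f"{category:<8} {count:>3} {count / total:>6.1%}")
```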
1.7.1
Characteristics of Quantitative Data:
1️⃣ Central Tendency (Middle of Data)
Shows the “center” of data.
Mean = average
Median = middle value
Mode = most common value
2️⃣ Spread / Dispersion (How Data is Spread Out)
Shows how close or far values are.
Range = highest − lowest
Variance = average squared difference from mean
SD = square root of variance
IQR = Q3 − Q1 (spread of the middle 50% of values)
3️⃣ Shape of Data
Shows the pattern of data.
Skewness = leans left or right
Kurtosis = how heavy the tails are (often described as how flat or
peaked the data looks)
Measures of Central Tendency
A measure of central tendency is a single value that represents the
“centre” of a dataset.
It tells us where most of the data is located.
Main types:
1️⃣ Mean (Average)
Add all values, then divide by the number of values.
Formula: Mean = (Sum of all values) ÷ (Number of values)
Sensitive to outliers (very high or low numbers can skew it).
📘 Example:
Dataset: {13, 35, 54, 54, 55, 56, 57, 67, 85, 89, 96}
Mean = 60.09
2️⃣ Median (Middle Value)
Arrange data in order, pick the middle value.
Better than mean if there are outliers or skewed data.
Odd number of values: middle value
Even number of values: average of two middle values
📘 Example:
Dataset: {13, 35, 54, 54, 55, 56, 57, 67, 85, 89, 96}
Median = 56
3️⃣ Mode (Most Frequent Value)
The value that occurs most often.
Can be thought of as the most popular value.
📘 Example:
Dataset: {13, 35, 54, 54, 55, 56, 57, 67, 85, 89, 96}
Mode = 54
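The three measures can be checked with Python's standard `statistics` module, using the same dataset as the examples above. The second part, where 96 is swapped for 960, is an added illustration of how an outlier moves the mean but not the median.

```python
import statistics

data = [13, 35, 54, 54, 55, 56, 57, 67, 85, 89, 96]

mean = statistics.mean(data)      # sum of values / number of values
median = statistics.median(data)  # middle value of the sorted data
mode = statistics.mode(data)      # most frequent value

print(round(mean, 2), median, mode)  # 60.09 56 54

# Replace the largest value with an extreme outlier (illustrative change)
with_outlier = data[:-1] + [960]
print(round(statistics.mean(with_outlier), 2))  # the mean jumps upward
print(statistics.median(with_outlier))          # the median is still 56
```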
Skewness (Measure of Asymmetry)
Skewness shows if the data is lopsided or concentrated on one side.
1️⃣ Positive Skew (Right Skew)
Mean > Median > Mode (typical pattern, not guaranteed)
Tail is long on the right
Few very large values pull the mean up
Example: Most people earn <$2,000, few earn >$14,000 → mean >
median → right skew
2️⃣ Negative Skew (Left Skew)
Mean < Median < Mode (typical pattern, not guaranteed)
Tail is long on the left
Few very small values pull the mean down
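Skewness can also be put into a number. The sketch below uses one common definition, the moment coefficient of skewness (third central moment divided by the 1.5 power of the second), which is an assumption since the notes do not name a formula. The incomes and scores are made-up values.

```python
import statistics


def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical monthly incomes: most are small, one is very large (right skew)
incomes = [1200, 1500, 1500, 1800, 2000, 2500, 15000]
# Hypothetical test scores: most are high, one is very low (left skew)
scores = [20, 85, 88, 90, 90, 92, 95]

for name, values in (("incomes", incomes), ("scores", scores)):
    print(name,
          "mean:", round(statistics.mean(values), 1),
          "median:", statistics.median(values),
          "skewness:", round(skewness(values), 2))
```

A positive result indicates a right skew and a negative result a left skew; perfectly symmetric data gives zero.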
Measures of Variability (Dispersion)
Dispersion tells us how spread out the data is. While the mean gives the
center, dispersion shows how far values are from the center.
1️⃣ Range
Difference between largest and smallest value.
Only considers extremes, ignores middle values.
Example: {13, 33, 45, 67, 70} → Range = 70 − 13 = 57
2️⃣ Variance
Measures how far each value is from the mean.
Average of squared differences from the mean.
Units are squared, so we use standard deviation for easier
interpretation.
Population variance: divide by N
Sample variance: divide by (n − 1) → Bessel’s correction reduces
bias
3️⃣ Standard Deviation (SD)
Square root of variance → brings back original units.
Shows how concentrated or spread out the data is around the
mean.
Population SD (σ) = √(population variance)
Sample SD (s) = √(sample variance)
📘 Example: {3, 5, 6, 9, 10} → Mean = 6.6, population variance = 6.64,
population SD ≈ 2.58 (sample variance = 8.3, sample SD ≈ 2.88)
4️⃣ Coefficient of Variation (CV)
Measures relative variability: CV = (SD ÷ Mean) × 100%
Useful to compare spread between two datasets with different
units or means.
Example:
Item A → Mean = 50, SD = 5 → CV = 10%
Item B → Mean = 100, SD = 5 → CV = 5%
CV helps compare variability relative to the mean, not just absolute
numbers.
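The four measures above can be sketched with the standard `statistics` module, using the small dataset {3, 5, 6, 9, 10} from the SD example. The helper `coefficient_of_variation` is a hypothetical name introduced here, not a library function.

```python
import statistics

data = [3, 5, 6, 9, 10]

value_range = max(data) - min(data)     # largest value minus smallest value

pop_var = statistics.pvariance(data)    # population variance: divide by N
sample_var = statistics.variance(data)  # sample variance: divide by (n - 1)

pop_sd = statistics.pstdev(data)        # square root of population variance
sample_sd = statistics.stdev(data)      # square root of sample variance


def coefficient_of_variation(mean, sd):
    """Relative variability as a percentage: (SD / mean) * 100."""
    return sd / mean * 100


print("Range:", value_range)
print("Population variance:", round(pop_var, 2), "SD:", round(pop_sd, 2))
print("Sample variance:", round(sample_var, 2), "SD:", round(sample_sd, 2))
print("CV of item A:", coefficient_of_variation(50, 5))
print("CV of item B:", coefficient_of_variation(100, 5))
```

The sample figures are slightly larger than the population figures because dividing by (n − 1) instead of N compensates for a sample tending to underestimate the true spread.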
Quartiles
Quartiles split ordered data into 4 equal parts. They help us see how
data is spread.
1️⃣ Q1 (First Quartile)
25% of data values are smaller, 75% are larger
Think of it as the value that cuts off the lower quarter of data
2️⃣ Q2 (Second Quartile)
Same as the median
50% of data are smaller, 50% are larger
Middle value of the dataset
3️⃣ Q3 (Third Quartile)
75% of data are smaller, 25% are larger
Think of it as the value that cuts off the upper quarter of data
✅ In short:
Q1 = 25% mark, Q2 = 50% (median), Q3 = 75% mark
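Quartiles can be computed with `statistics.quantiles` (Python 3.8 or later). This sketch reuses the 11-value dataset from the central tendency examples and relies on the function's default "exclusive" method; other methods and other software can give slightly different Q1 and Q3 values for the same data.

```python
import statistics

data = [13, 35, 54, 54, 55, 56, 57, 67, 85, 89, 96]

# n=4 asks for the three cut points that split the data into four parts
q1, q2, q3 = statistics.quantiles(data, n=4)

iqr = q3 - q1  # interquartile range: spread of the middle 50% of values

print("Q1:", q1)
print("Q2 (median):", q2)
print("Q3:", q3)
print("IQR:", iqr)
```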
1.8
Chebyshev’s Theorem:
For any dataset (any distribution), at least 1 − 1/k² of the values
lie within k standard deviations of the mean, for any k > 1.
Example: k = 2 → at least 75%; k = 3 → at least about 88.9%.
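The theorem can be checked numerically. The sketch below assumes the usual form of the bound, 1 − 1/k² for k > 1, and compares it with the actual share of values within k population standard deviations of the mean. The function names and the dataset are illustrative choices.

```python
import statistics


def chebyshev_bound(k):
    """Minimum fraction of values within k SDs of the mean (needs k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's bound is only informative for k > 1")
    return 1 - 1 / k ** 2


def fraction_within(values, k):
    """Actual fraction of values within k population SDs of the mean."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    inside = [x for x in values if abs(x - mean) <= k * sd]
    return len(inside) / len(values)


data = [13, 35, 54, 54, 55, 56, 57, 67, 85, 89, 96]  # illustrative values

for k in (1.5, 2, 3):
    print(f"k={k}: bound {chebyshev_bound(k):.1%}, "
          f"actual {fraction_within(data, k):.1%}")
```

The actual share is often well above the bound, because the bound has to hold for every possible distribution.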
1.8.1
Empirical Rule: