
Notes

The document provides an overview of data types, populations, samples, and the basics of statistics, including descriptive and inferential statistics. It also covers exploratory data analysis (EDA) objectives, methods, and key concepts such as measures of central tendency, variability, and quartiles. Additionally, it introduces Chebyshev's Theorem and the Empirical Rule for understanding data distributions.

Uploaded by

The Masker

1.

Data is a package of information: a collection of facts, numbers, or details about something.

Data
├── Categorical (Qualitative)
│   ├── Nominal → Names only (e.g., Eye Colour)
│   └── Ordinal → Ordered (e.g., Rank in class)
└── Numerical (Quantitative)
    ├── Discrete → Whole numbers, with no fractions or decimals (e.g., Number of students)
    └── Continuous → Values that can be fractions or decimals (e.g., Height, Weight)

1.2

A population is the complete collection of all items, people, or observations that you want to study.

A sample is a smaller group selected from the population. A good sample represents the population, like a mini version of it.

1.3

Types of Statistics:

Statistics is the science of collecting, organizing, analysing, and interpreting data.

Descriptive Statistics:

Descriptive statistics summarise and describe the main features of a dataset. They help us understand what the data shows through numbers and visuals.

Examples: Mean, Median, Mode, Range, Standard Deviation, Bar Graphs, Pie Charts.

Inferential Statistics:

Inferential statistics use data from a sample to draw conclusions or make predictions about a population. They help us infer or estimate things we can’t measure directly.

Examples: Estimation, Hypothesis Testing, Regression Analysis.

1.4

Exploratory Data Analysis (EDA): Exploring and understanding your data before formal analysis. It means looking closely at the data, using graphs, charts, and simple calculations, to understand it better before doing any advanced analysis.

1.5

The Objectives of EDA

1. Understand the structure of the dataset.

(Learn what’s inside your data: which columns (variables) you have, what type of data each holds (numbers, text, dates), and whether any values are missing.

🧩 Example: Checking a table of patient records to see how many columns (like age, weight, blood pressure) you have and what type of data each is.)

2. Visualise potential relationships (direction and magnitude) between exposure and outcome variables.

(See how two or more things are connected. For example, does one go up when the other goes up (positive relationship), or does it go down (negative relationship)?

📊 Example: Making a scatter plot to see if hours studied are related to exam marks, for instance whether more study hours go with higher marks.)

3. Detect outliers and anomalies (values that are significantly different from the other observations).

(Find data points that are very different from the rest. They might be mistakes or special cases.

⚠️ Example: In a list of student ages (15, 16, 17, 45, 16), the “45” is an outlier, maybe entered by mistake.)

4. Develop parsimonious models (a predictive or explanatory model that performs well with as few exposure variables as possible).

(Build simple models that work well using only the most important variables, not too many.

⚙️ Example: If you can predict a person’s blood pressure using just age and weight, there’s no need to add ten extra variables.)

5. Extract and create clinically relevant variables.

(Create new, meaningful data columns that can help in analysis or predictions.

🧠 Example: Instead of using height and weight separately, you can create a new variable called BMI (Body Mass Index), which is more useful in health studies.)
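
Objectives 1, 3, and 5 can be sketched in a few lines of Python. This is only an illustration: the patient records, the column names, and the 1.5 × IQR cut-off for outliers are assumptions that do not come from the notes, and the quartile convention (`method="inclusive"`) is one of several in use.

```python
import statistics

# Hypothetical patient records; None marks a missing value.
patients = [
    {"age": 34, "height_m": 1.70, "weight_kg": 65.0},
    {"age": 51, "height_m": 1.82, "weight_kg": None},
    {"age": 29, "height_m": 1.60, "weight_kg": 72.0},
]

# Objective 1: list the columns and count missing values in each.
columns = list(patients[0].keys())
missing = {col: sum(1 for row in patients if row[col] is None) for col in columns}

# Objective 3: flag outliers with the common 1.5 * IQR rule (an assumed convention).
ages = [15, 16, 17, 45, 16]
q1, _, q3 = statistics.quantiles(ages, n=4, method="inclusive")
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [a for a in ages if a < low_fence or a > high_fence]

# Objective 5: derive BMI = weight (kg) / height (m) squared, skipping missing data.
for row in patients:
    weight, height = row["weight_kg"], row["height_m"]
    row["bmi"] = None if weight is None or height is None else weight / height ** 2

print(missing)
print(outliers)
print([row["bmi"] for row in patients])
```

With only five ages the fences are very tight, so the rule should be read as a screening aid rather than proof that a value is wrong.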

1.6

Classification of EDA Methods:

Univariate Graphical EDA (for Discrete Data): Here, we look at one variable at a time.

• Bar Plot: Shows how many times each category appears.
📊 Example: Number of students in each class (Class A, B, C).

• Pie Chart: Shows how each category contributes to the total.
🥧 Example: Percentage of students by favourite fruit (apple, banana, mango).

Univariate Graphical EDA (for Continuous Data): Still one variable, but with continuous (measurable) data such as height, weight, or time.

• Line Plot: Shows data changing over time.
⏱ Example: Temperature during the day.

• Histogram: Shows how data values are spread or distributed.
📊 Example: Distribution of students’ test scores.

• Stem and Leaf Plot: Shows raw data and distribution together.
🌿 Example: Listing exam marks in a compact, visual way.

• Boxplot: Shows spread, median, and outliers (using boxes and lines).
📦 Example: Comparing exam scores to find unusual results.

Bivariate Graphical EDA (Two Variables): We study the relationship between two variables.

• 2D Line Plot: Shows how one variable changes with another.
📈 Example: Time vs. Sales, to see if sales rise over time.

• Curve Fitting: Draws a smooth line through data points to show trends.
📉 Example: Relationship between study hours and marks.

Trivariate and Multivariate Graphical EDA (Three or More Variables): Used to explore relationships among three or more variables.

• 3D Scatter Plot: Shows how three variables relate in 3D space.
🧊 Example: Height, Weight, and Age plotted together.

• Side-by-Side Boxplots: Compare multiple groups at once.
📦📦 Example: Exam scores of boys vs. girls across subjects.

• Heat Maps: Use colour to show relationships between many variables.
🌡 Example: Showing correlation between different features in a dataset.

• 3D Surface Plots: Show how two inputs affect one output.
🗻 Example: How temperature and humidity affect crop yield.

• PCA Plot (Principal Component Analysis): Used to reduce many variables into 2–3 key ones and visualise them.
💫 Example: Summarising 10 features of patients into 2 main axes to find patterns.

1.7

Univariate Non-graphical EDA:

For categorical data, it means making a simple table (a frequency table) that shows how many items fall in each group or category.
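
As a minimal sketch, such a table can be built by counting how often each category occurs. The list of favourite fruits below is made-up data used only for illustration.

```python
from collections import Counter

# Hypothetical survey answers: each student's favourite fruit.
fruits = ["apple", "banana", "mango", "apple", "mango", "apple"]

# Counter maps each category to the number of times it appears.
freq = Counter(fruits)

# Print the table, most frequent category first, with its share of the total.
total = sum(freq.values())
for category, count in freq.most_common():
    print(f"{category:<8}{count:>3}{count / total:>8.1%}")
```

The counts answer "how many in each group", and the percentage column is the same information a pie chart would show.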

1.7.1

Characteristics of Quantitative Data:

1️⃣ Central Tendency (Middle of Data)

• Shows the “center” of data.
• Mean = average
• Median = middle value
• Mode = most common value

2️⃣ Spread / Dispersion (How Data is Spread Out)

• Shows how close together or far apart values are.
• Range = highest − lowest
• Variance = average squared difference from the mean; SD = square root of the variance
• IQR = Q3 − Q1, the range covered by the middle 50% of values

3️⃣ Shape of Data

• Shows the pattern of data.
• Skewness = leans left or right
• Kurtosis = how heavy the tails are (often described as how peaked or flat the data looks)

Measures of Central Tendency

A measure of central tendency is a single value that represents the “centre” of a dataset. It tells us where most of the data is located.

Main types:

1️⃣ Mean (Average)

• Add all values, then divide by the number of values.
• Formula: Mean = (Sum of all values) ÷ (Number of values)
• Sensitive to outliers (very high or low numbers can skew it).

📘 Example:
Dataset: {13, 35, 54, 54, 55, 56, 57, 67, 85, 89, 96}
• Mean = 661 ÷ 11 ≈ 60.09

2️⃣ Median (Middle Value)

• Arrange data in order, then pick the middle value.
• More robust than the mean when there are outliers or skewed data.
• Odd number of values: middle value
• Even number of values: average of the two middle values

📘 Example:
Dataset: {13, 35, 54, 54, 55, 56, 57, 67, 85, 89, 96}
• Median = 56 (the 6th of the 11 ordered values)

3️⃣ Mode (Most Frequent Value)

• The value that occurs most often.
• Can be thought of as the most popular value.

📘 Example:
Dataset: {13, 35, 54, 54, 55, 56, 57, 67, 85, 89, 96}
• Mode = 54 (it appears twice; every other value appears once)
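
The three examples above can be checked with Python's standard `statistics` module. This is just a quick sketch; the variable names are chosen for illustration.

```python
import statistics

# The dataset used in the mean, median, and mode examples.
scores = [13, 35, 54, 54, 55, 56, 57, 67, 85, 89, 96]

mean_value = statistics.mean(scores)      # sum of values / number of values
median_value = statistics.median(scores)  # middle value of the ordered data
mode_value = statistics.mode(scores)      # most frequent value

print(round(mean_value, 2), median_value, mode_value)
```

If a dataset has several equally common values, `statistics.multimode` returns all of them instead of just one.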

Skewness (Measure of Asymmetry)

Skewness shows whether the data is lopsided, that is, stretched out further on one side than the other.

1️⃣ Positive Skew (Right Skew)

• Typically Mean > Median > Mode
• Tail is long on the right
• A few very large values pull the mean up
• Example: Most people earn less than $2,000 and a few earn more than $14,000 → mean > median → right skew

2️⃣ Negative Skew (Left Skew)

• Typically Mean < Median < Mode
• Tail is long on the left
• A few very small values pull the mean down
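
A small sketch of the right-skew case, using made-up incomes in the spirit of the example above. The skewness formula used here (third central moment divided by the 1.5 power of the second) is one common definition and is an assumption; the notes do not give a formula.

```python
import statistics

# Hypothetical incomes: most are under 2,000 and one is very large.
incomes = [1200, 1500, 1800, 1900, 2000, 14000]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

# Moment-based skewness: m3 / m2 ** 1.5, using population (divide-by-n) moments.
n = len(incomes)
m2 = sum((x - mean_income) ** 2 for x in incomes) / n
m3 = sum((x - mean_income) ** 3 for x in incomes) / n
skewness = m3 / m2 ** 1.5

# A positive value indicates a long right tail; the mean also sits above the median.
print(mean_income > median_income, skewness > 0)
```

Reversing the situation (one very small value among larger ones) would give a negative skewness and a mean below the median.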

Measures of Variability (Dispersion)


Dispersion tells us how spread out the data is. While the mean gives the center, dispersion shows how far values lie from that center.

1️⃣ Range

• Difference between the largest and smallest value.
• Only considers the extremes and ignores the middle values.
• Example: {13, 33, 45, 67, 70} → Range = 70 − 13 = 57

2️⃣ Variance

• Measures how far values lie from the mean, on average.
• Average of the squared differences from the mean.
• Units are squared, so we use the standard deviation for easier interpretation.
• Population variance: divide by N
• Sample variance: divide by (n − 1) → Bessel’s correction, which reduces bias

3️⃣ Standard Deviation (SD)

• Square root of the variance → brings back the original units.
• Shows how concentrated or spread out the data is around the mean.
• Population SD (σ) = √(population variance)
• Sample SD (s) = √(sample variance)

📘 Example: {3, 5, 6, 9, 10} → mean = 6.6; the squared differences from the mean sum to 33.2, so population variance = 33.2 ÷ 5 = 6.64 and σ ≈ 2.58, while sample variance = 33.2 ÷ 4 = 8.3 and s ≈ 2.88

4️⃣ Coefficient of Variation (CV)

• Measures relative variability: CV = (SD ÷ Mean) × 100%
• Useful for comparing spread between two datasets with different units or means.

Example:
• Item A → Mean = 50, SD = 5 → CV = 10%
• Item B → Mean = 100, SD = 5 → CV = 5%

CV helps compare variability relative to the mean, not just in absolute numbers.
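
The measures in this section can be computed with the standard `statistics` module. The sketch below reuses the datasets {13, 33, 45, 67, 70} and {3, 5, 6, 9, 10} and the Item A / Item B figures from the text; `coefficient_of_variation` is a hypothetical helper name, not a library function.

```python
import statistics

# Range, using the dataset from the Range example.
range_data = [13, 33, 45, 67, 70]
data_range = max(range_data) - min(range_data)

# Variance and SD, using the dataset from the SD example.
values = [3, 5, 6, 9, 10]
pop_var = statistics.pvariance(values)     # divides by N
sample_var = statistics.variance(values)   # divides by n - 1 (Bessel's correction)
pop_sd = statistics.pstdev(values)         # square root of population variance
sample_sd = statistics.stdev(values)       # square root of sample variance


def coefficient_of_variation(mean, sd):
    """Return the CV as a percentage (hypothetical helper)."""
    return 100 * sd / mean


cv_a = coefficient_of_variation(50, 5)    # Item A
cv_b = coefficient_of_variation(100, 5)   # Item B

print(data_range, pop_var, sample_var)
print(round(pop_sd, 2), round(sample_sd, 2), cv_a, cv_b)
```

Both items have the same SD, yet Item A varies twice as much relative to its mean, which is exactly what the CV is meant to reveal.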

Quartiles

Quartiles split ordered data into 4 equal parts. They help us see how the data is spread.

1️⃣ Q1 (First Quartile)

• 25% of data values are smaller, 75% are larger
• Think of it as the upper boundary of the lowest quarter of the data

2️⃣ Q2 (Second Quartile)

• Same as the median
• 50% of data are smaller, 50% are larger
• Middle value of the dataset

3️⃣ Q3 (Third Quartile)

• 75% of data are smaller, 25% are larger
• Think of it as the lower boundary of the highest quarter of the data

✅ In short:

Q1 = 25% mark, Q2 = 50% (median), Q3 = 75% mark
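
As a rough sketch, the quartiles of the dataset used earlier in these notes can be computed with `statistics.quantiles`. Textbooks and software use slightly different quartile conventions, so other tools may give somewhat different Q1 and Q3 values; the default ("exclusive") method is assumed here.

```python
import statistics

# The dataset from the central tendency examples.
scores = [13, 35, 54, 54, 55, 56, 57, 67, 85, 89, 96]

# n=4 asks for the three cut points that split the data into four parts.
q1, q2, q3 = statistics.quantiles(scores, n=4)

# The interquartile range covers the middle 50% of the values.
iqr = q3 - q1

print(q1, q2, q3, iqr)
```

Note that Q2 matches the median found earlier (56), as the definition says it should.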

1.8

Chebyshev’s Theorem:

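For any dataset (any distribution), at least 1 − 1/k² of the values lie within k standard deviations of the mean, for any k > 1. For example, at least 75% of values lie within 2 standard deviations (k = 2), and at least about 88.9% lie within 3 standard deviations (k = 3).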
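
The usual form of the bound is 1 − 1/k² for k > 1. The sketch below computes that bound and compares it with the fraction actually observed in the dataset used earlier in these notes; the function names are hypothetical.

```python
import statistics


def chebyshev_bound(k):
    """Minimum fraction of values within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2


def fraction_within(data, k):
    """Observed fraction of values within k population SDs of the mean."""
    mu = statistics.mean(data)
    sigma = statistics.pstdev(data)
    inside = sum(1 for x in data if abs(x - mu) <= k * sigma)
    return inside / len(data)


scores = [13, 35, 54, 54, 55, 56, 57, 67, 85, 89, 96]

for k in (2, 3):
    # The observed fraction can never fall below the guaranteed bound.
    print(k, chebyshev_bound(k), fraction_within(scores, k))
```

The bound is deliberately conservative: real datasets usually have a larger share of values inside the interval than the guaranteed minimum.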
1.8.1

Empirical Rule:
