Statistics - Material
The Central Limit Theorem (CLT)
Definition: The Central Limit Theorem states that the distribution of the sample mean
approaches a normal distribution as the sample size increases, even if the population
distribution is not normal. This holds for sufficiently large sample sizes (typically n ≥ 30).
Why It Matters:
o The CLT is the foundation for much of inferential statistics because it allows
us to make statistical inferences (such as hypothesis testing and confidence
intervals) based on the assumption of normality, even when the underlying
population is not normally distributed.
o This means that we can use methods such as z-tests and t-tests, which assume
normality, even for non-normally distributed populations, as long as we have a
sufficiently large sample.
Application of CLT:
The CLT allows us to estimate population parameters (such as the population mean)
based on the sample mean, and to calculate the standard error of the mean.
The sampling distribution of the sample mean is normal (or nearly normal)
regardless of the shape of the population distribution, as long as the sample size is
large enough.
Example:
Simulate sampling from a skewed dataset (e.g., income distribution) and demonstrate
how the sampling distribution of the mean approaches a normal distribution as the
sample size increases.
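A minimal sketch of this simulation in Python, assuming NumPy is available; the lognormal "income" population and its parameters are illustrative choices, not part of the material:

```python
import numpy as np

rng = np.random.default_rng(42)

# Heavily right-skewed "income" population (lognormal is a common stand-in)
population = rng.lognormal(mean=10, sigma=1.0, size=1_000_000)

for n in (5, 30, 200):
    # Draw 5,000 samples of size n and record each sample mean
    idx = rng.integers(0, population.size, size=(5_000, n))
    means = population[idx].mean(axis=1)
    # Sample skewness of the means: shrinks toward 0 as n grows, per the CLT
    skew = ((means - means.mean()) ** 3).mean() / means.std() ** 3
    print(f"n={n:>3}: skewness of sample means = {skew:+.2f}")
```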
6. Hypothesis Testing
Null Hypothesis (H₀): The default assumption that there is no effect, no difference,
or no relationship in the population (e.g., the mean income is $50,000).
Alternative Hypothesis (H₁): The hypothesis we seek evidence for, asserting that
there is an effect or a difference (e.g., the mean income is not $50,000).
Steps in Hypothesis Testing:
1. Formulate the Hypotheses: Define H₀ and H₁.
2. Select the Significance Level (α): This is the probability threshold for
rejecting the null hypothesis (commonly 0.05 or 0.01).
3. Collect Data: Gather a sample to test the hypothesis.
4. Perform the Test: Use a test statistic (e.g., t-test or z-test) to analyze the data.
5. Make a Decision: If the p-value ≤ α, reject H₀. If p-value > α, fail to reject
H₀.
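A sketch of these five steps in code, assuming SciPy is available; the $50,000 figure comes from the example above, while the simulated sample is hypothetical:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Step 3: a hypothetical sample of 40 incomes
sample = rng.normal(loc=52_000, scale=8_000, size=40)

alpha = 0.05                                     # Step 2: significance level
# Steps 1 & 4: H0: mean = 50,000 vs H1: mean != 50,000, via one-sample t-test
t_stat, p_value = stats.ttest_1samp(sample, popmean=50_000)

print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
print("Reject H0" if p_value <= alpha else "Fail to reject H0")   # Step 5
```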
P-value:
The probability of obtaining a result at least as extreme as the observed one, assuming
the null hypothesis is true.
Interpretation: A small p-value (typically ≤ 0.05) suggests that the observed result is
unlikely to have occurred under the null hypothesis, so we reject H₀. If the p-value is
large, we fail to reject H₀, indicating insufficient evidence to support H₁.
Types of Errors:
Type I Error (False Positive): Rejecting a true null hypothesis (e.g., concluding
there is a difference when there is none).
Type II Error (False Negative): Failing to reject a false null hypothesis (e.g.,
concluding there is no difference when there actually is).
T-test:
o One-sample T-test: Tests if the sample mean differs from a known value.
o Two-sample T-test: Compares means from two independent groups (e.g.,
testing if male and female salaries differ).
o Paired T-test: Compares means of two related groups (e.g., pre-test vs post-
test scores).
Practical Example: Perform a two-sample t-test to compare the average scores of
two groups.
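One possible implementation of this practical example with SciPy; the group scores are simulated for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(loc=75, scale=10, size=35)   # hypothetical scores, group A
group_b = rng.normal(loc=70, scale=10, size=35)   # hypothetical scores, group B

# Welch's two-sample t-test (does not assume equal variances)
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```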
7. Confidence Intervals
A confidence interval is a range of values that is likely to contain the true population
parameter with a certain level of confidence (e.g., 95% confidence interval).
For a population mean, a common form is CI = x̄ ± z* · (s / √n)
Where:
o x̄ is the sample mean,
o z* is the critical value for the chosen confidence level (1.96 for 95%),
o s is the sample standard deviation,
o n is the sample size.
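A short sketch of computing such an interval with SciPy; it uses the t distribution, which is appropriate when the population standard deviation is estimated from the sample, and the income data are simulated:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
sample = rng.normal(loc=50_000, scale=8_000, size=40)   # hypothetical incomes

mean = sample.mean()
sem = stats.sem(sample)                                  # standard error of the mean
low, high = stats.t.interval(0.95, df=sample.size - 1, loc=mean, scale=sem)
print(f"95% CI for the mean: ({low:,.0f}, {high:,.0f})")
```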
A/B Testing:
A/B testing uses inferential statistics to determine whether the observed difference between
two groups is statistically significant. This allows businesses to infer whether one
version of a product is superior to another.
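One common way to run such a test is a two-proportion z-test; a sketch assuming statsmodels is installed, with made-up conversion counts:

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

conversions = np.array([120, 150])    # hypothetical conversions for versions A, B
visitors = np.array([2_400, 2_500])   # hypothetical visitors shown each version

z_stat, p_value = proportions_ztest(count=conversions, nobs=visitors)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
# p <= 0.05 would suggest the two conversion rates genuinely differ
```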
Inferential Statistics vs. Descriptive Statistics
Objective: Understand the key distinctions between Inferential Statistics and Descriptive
Statistics, and how each is used in data analysis.
Descriptive Statistics:
Definition: Descriptive statistics refers to methods that summarize and describe the main
features of a dataset. It is focused on organizing, presenting, and analyzing data in a way
that is easy to understand, without making predictions or inferences about a larger population.
Purpose:
o The main goal is to describe the data that you have collected in a meaningful
way. This includes measures like averages, percentages, and frequencies that
help give a clear picture of the dataset.
Key Operations:
o Central Tendency: Measures such as mean, median, and mode, which tell us
about the "center" of the data.
o Dispersion: Measures like range, variance, and standard deviation that
describe how spread out the data is.
o Shape of the Distribution: Measures such as skewness and kurtosis that
describe the shape of the data distribution.
o Visualizations: Histograms, bar charts, pie charts, and box plots that
graphically represent the data.
Limitations:
o No Generalizations: Descriptive statistics do not allow us to make
generalizations or predictions about a population. They only describe the
sample data at hand.
o No Hypothesis Testing: It does not test hypotheses or predict future
outcomes.
Example:
o Suppose you have a dataset of the heights of 100 students. Descriptive
statistics would allow you to calculate the average height, the range, and the
standard deviation of heights, but it wouldn't tell you about the population of
all students outside this sample.
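A brief sketch of these calculations for a small made-up sample of heights, assuming NumPy and SciPy:

```python
import numpy as np
from scipy import stats

heights = np.array([160, 165, 170, 170, 172, 175, 180, 185, 190, 210])  # cm, illustrative

print("mean:", heights.mean())
print("median:", np.median(heights))
print("mode:", stats.mode(heights, keepdims=False).mode)
print("std dev:", heights.std(ddof=1))        # sample standard deviation
print("range:", heights.max() - heights.min())
print("skewness:", stats.skew(heights))
print("kurtosis:", stats.kurtosis(heights))   # excess kurtosis
```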
Inferential Statistics:
Purpose:
o The main goal is to make inferences or draw conclusions about a population
based on a sample. It uses probability theory to make predictions or decisions
about a population's characteristics.
Key Operations:
o Estimation: Estimating population parameters (e.g., estimating the mean or
variance of a population from sample data).
o Hypothesis Testing: Testing assumptions or claims about a population (e.g.,
testing whether the average income of a group is different from a known
value).
o Confidence Intervals: Estimating a range of values for a population
parameter with a certain level of confidence (e.g., estimating the population
mean with a 95% confidence interval).
o Predictive Modeling: Using statistical models to predict future outcomes
(e.g., predicting future sales based on historical data).
Limitations:
o Sample-Dependent: The conclusions drawn are based on the sample, and
there is always a level of uncertainty. The accuracy of inferences depends on
the sample size and representativeness.
o Requires Assumptions: Many inferential techniques (e.g., t-tests, regression
models) require assumptions, such as normality or independence of
observations.
Example:
o Suppose you take a sample of 100 students' heights from a university and find
an average height of 5.5 feet with a standard deviation of 0.2 feet. Inferential
statistics could help you estimate the average height of all students at that
university (population) and test whether the sample mean significantly differs
from the average height in other populations.
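A sketch of that inference using only the summary statistics above; the comparison mean of 5.4 feet is a hypothetical value chosen for illustration:

```python
import math
from scipy import stats

xbar, s, n = 5.5, 0.2, 100     # sample mean, std dev, size (from the example)
mu0 = 5.4                      # hypothetical population mean to compare against

sem = s / math.sqrt(n)
t_stat = (xbar - mu0) / sem
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)          # two-sided p-value
low, high = stats.t.interval(0.95, df=n - 1, loc=xbar, scale=sem)

print(f"t = {t_stat:.2f}, p = {p_value:.4g}")
print(f"95% CI for the mean height: ({low:.3f}, {high:.3f}) feet")
```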
Key Differences:
o Goal: Descriptive statistics summarize the data at hand; inferential statistics draw
conclusions about a larger population.
o Scope: Descriptive methods apply only to the collected sample; inferential methods
generalize beyond it.
o Techniques: Descriptive statistics use measures of central tendency and dispersion
plus charts; inferential statistics use estimation, hypothesis tests, confidence
intervals, and predictive models.
o Uncertainty: Descriptive summaries involve no sampling uncertainty; inferential
conclusions always carry uncertainty and rest on assumptions.
Conclusion:
Recap:
o We covered data types, descriptive statistics, probability, the Central Limit
Theorem, hypothesis testing, confidence intervals, and A/B testing.
o We saw how these concepts help in making decisions based on data in machine
learning and data science.
Q&A: Open the floor for final questions and clarifications.
Statistics in Exploratory Data Analysis (EDA)
Statistics plays a crucial role in Exploratory Data Analysis (EDA) by helping data
scientists and analysts summarize, visualize, and understand the underlying structure of data
before moving to more complex modeling tasks. Here’s how statistics contributes to the EDA
process:
1. Understanding Data Distributions
Descriptive Statistics:
o Descriptive statistics (mean, median, mode, variance, standard deviation)
provide an initial understanding of the central tendency and spread of the
data. This helps in identifying the key characteristics of each variable.
o For example, knowing the mean and standard deviation of a numeric
column gives you a sense of the average value and how spread out the values
are.
Visualizing Distribution:
o Histograms and density plots are used to visualize how data is distributed
(e.g., normal, skewed, uniform).
o Understanding the distribution helps you know which statistical techniques to
apply and whether certain transformations (e.g., log transformation) might be
needed to normalize the data for modeling.
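For instance, a quick sketch with Matplotlib; the skewed data are simulated:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
income = rng.lognormal(mean=10, sigma=1.0, size=10_000)   # simulated skewed data

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(income, bins=50)
axes[0].set_title("Raw data (right-skewed)")
axes[1].hist(np.log(income), bins=50)                     # log transformation
axes[1].set_title("Log-transformed (near normal)")
plt.show()
```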
2. Identifying Outliers
Box Plots:
o Box plots provide a clear visualization of the spread and help identify
outliers (data points that fall outside the whiskers). Outliers can significantly
affect model performance and are thus crucial to identify during EDA.
Z-scores:
o Z-scores can also help detect outliers. Data points with a Z-score greater than
3 or less than -3 are typically considered outliers. This step is useful in
detecting extreme values that might need further investigation or removal from
the dataset.
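A minimal sketch of the Z-score rule with NumPy; the data and the two injected outliers are synthetic:

```python
import numpy as np

rng = np.random.default_rng(3)
data = np.append(rng.normal(50, 5, size=200), [95.0, 12.0])  # two injected outliers

z_scores = (data - data.mean()) / data.std(ddof=1)
outliers = data[np.abs(z_scores) > 3]                        # |z| > 3 rule
print("flagged outliers:", outliers)
```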
Statistical Tests:
o During EDA, statistical tests (e.g., t-tests, ANOVA) help assess the
significance of variables. These tests provide a way to check the reliability and
importance of features in explaining the target variable.
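As an illustration, a one-way ANOVA across three simulated groups with SciPy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
g1, g2, g3 = (rng.normal(mu, 5, size=30) for mu in (50, 52, 58))  # synthetic groups

f_stat, p_value = stats.f_oneway(g1, g2, g3)   # H0: all group means are equal
print(f"F = {f_stat:.2f}, p = {p_value:.4g}")
```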
Normalization and Standardization:
o Standardizing or normalizing features (scaling them to a standard range or
distribution) is essential for some machine learning algorithms. This can be
done using statistical techniques, ensuring that no feature dominates due to
differences in scale.
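A sketch using scikit-learn's scalers on a tiny made-up feature matrix:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[50_000, 1.2],
              [80_000, 3.4],
              [120_000, 2.1]], dtype=float)   # hypothetical features on very different scales

X_std = StandardScaler().fit_transform(X)     # standardize: zero mean, unit variance
X_norm = MinMaxScaler().fit_transform(X)      # normalize: rescale each column to [0, 1]
print(X_std.round(2))
print(X_norm.round(2))
```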
Identifying Data Types:
o During EDA, statistical summaries allow you to classify data types (nominal,
ordinal, interval, ratio) and choose the appropriate analysis method. This is
especially important when deciding between techniques like chi-square tests
for categorical data or regression for continuous data.
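For the categorical case, a chi-square test of independence on a hypothetical contingency table (SciPy):

```python
import numpy as np
from scipy import stats

# Hypothetical counts: rows = two customer segments, columns = three product choices
table = np.array([[30, 20, 10],
                  [25, 25, 15]])

chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}, dof = {dof}")
```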
7. Hypothesis Generation
Statistical Reasoning:
o During EDA, you often generate hypotheses about the relationships in the data
or patterns that might exist. Statistical reasoning helps you form questions
like:
Is the mean of one group different from another group?
Is there a significant relationship between two variables?
o This step is essential for guiding further analyses and model-building steps.
Conclusion:
Statistics is at the heart of Exploratory Data Analysis (EDA) because it helps uncover the
distribution, relationships, and patterns in data. By applying descriptive and inferential
statistical techniques, data scientists can better understand the data, identify problems such as
outliers or missing values, and prepare the data for subsequent modeling. This statistical
foundation is crucial for making informed decisions, refining models, and ensuring that
insights drawn from the data are valid and reliable.