Statistics - Material

1. Introduction to Statistics and Understanding Data (20 minutes)

 Why Statistics Matters:


o Statistics plays a critical role in data analysis, helping to uncover patterns,
make decisions, and validate models. In machine learning, statistics is used to
preprocess data, evaluate models, and make predictions.
o It helps provide insights into the structure and characteristics of datasets,
allowing data scientists to design better models.
 Why Understanding Data Types is Crucial:
o Data type determines which statistical methods are appropriate for analysis.
o Different types of data require different preprocessing techniques and impact
the choice of machine learning algorithms.
 Understanding Data Types:
o Nominal Data: Categories without a meaningful order (e.g., color, gender).
 Example: Gender (Male/Female), Color (Red/Blue/Green).
 Statistical Operations: Mode, Frequency distribution.
o Ordinal Data: Data with a meaningful order or ranking, but without
consistent intervals (e.g., rating scales).
 Example: Educational level (High School, Undergraduate, Graduate).
 Statistical Operations: Mode, Median, Rank correlation.
o Interval Data: Data with consistent intervals, but no true zero (e.g.,
temperature in Celsius).
 Example: Temperature (30°C, 40°C).
 Statistical Operations: Mean, Median, Standard Deviation.
o Categorical Data (Nominal and Ordinal): Data divided into categories.
 Statistical Operations: Mode, Frequency tables, Cross-tabulation.
o Continuous Data: Data that can take any value within a range (e.g., height,
weight).
 Statistical Operations: Mean, Median, Standard Deviation, Variance.
o Discrete Data: Data that can only take specific values (e.g., number of
children, goals scored).
 Statistical Operations: Mode, Frequency distribution.
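To make the mapping from data type to statistical operation concrete, here is a minimal Python sketch (pandas assumed; the tiny DataFrame and its values are invented for illustration):

```python
# Hypothetical data illustrating the data types above (pandas assumed).
import pandas as pd

df = pd.DataFrame({
    "color": ["Red", "Blue", "Red", "Green"],            # nominal
    "education": pd.Categorical(
        ["High School", "Undergraduate", "Graduate", "Graduate"],
        categories=["High School", "Undergraduate", "Graduate"],
        ordered=True),                                    # ordinal
    "temp_c": [30.0, 40.0, 35.5, 28.2],                  # interval (continuous)
    "children": [0, 2, 1, 3],                             # discrete
})

print(df["color"].mode()[0])                   # nominal: mode
print(df["color"].value_counts())              # nominal: frequency distribution
print(df["education"].cat.codes.median())      # ordinal: median of the rank codes
print(df["temp_c"].mean(), df["temp_c"].std()) # interval: mean, standard deviation
print(df["children"].value_counts())           # discrete: frequency distribution
```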

2. Descriptive Statistics

 What is Descriptive Statistics?


o Descriptive Statistics refers to the statistical methods used to summarize and
describe the main features of a dataset. Unlike inferential statistics (which
make predictions about a population from a sample), descriptive statistics
focus solely on the characteristics of the data at hand.
o It includes measures of central tendency, dispersion, and the shape of the data
distribution, allowing for a clear and concise understanding of the data.
 Measures of Central Tendency:
o Mean (Arithmetic Average): The sum of all data points divided by the
number of points.
 Formula: \mu = \frac{\sum_{i=1}^{n} x_i}{n}, where x_i are the data points and n is the number of data points.
 Use: Most common for continuous, normally distributed data.
o Median: The middle value when the data is sorted in ascending order.
 Use: When data is skewed or contains outliers, the median is a more
accurate measure of central tendency.
o Mode: The most frequent value in the dataset.
 Use: For categorical or nominal data where mean or median cannot be
used.
 Example: In a dataset of colors, the mode could be "Red" if it appears
most frequently.
o When to Use:
 Mean: For continuous, normally distributed data.
 Median: For skewed distributions or ordinal data.
 Mode: For categorical data.
 Measures of Dispersion (Spread of Data):
o Variance: A measure of the average squared deviation from the mean.
 Formula: \sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2
 Use: Provides insight into how spread out the data is. A higher
variance indicates a wider spread.
o Standard Deviation: The square root of variance, providing a measure of how
much each data point deviates from the mean.
 Formula: \sigma = \sqrt{\sigma^2}
 Use: More interpretable than variance, as it is in the same unit as the
data.
o Range: The difference between the highest and lowest values in the dataset.
 Formula: \text{Range} = \max(x) - \min(x)
 Use: Provides a simple way to understand the spread of data, but can
be affected by outliers.
 Shape of the Distribution:
o Skewness: Measures the asymmetry of the distribution.
 Positive Skew: The right tail is longer than the left (e.g., income data
with a few very high earners).
 Negative Skew: The left tail is longer than the right.
o Kurtosis: Measures the "tailedness" of the distribution and the concentration
of data in the tails.
 High Kurtosis: More data points in the tails (e.g., more extreme
outliers).
 Low Kurtosis: Less data in the tails (e.g., a more uniform
distribution).
 Visualizing Descriptive Statistics:
o Histograms: Show the frequency of data points within certain ranges.
o Box Plots: Display the distribution and identify outliers.
o Bar Charts: Useful for visualizing categorical data (e.g., frequency of each
category).
 Practical Example:
o Use a dataset of exam scores to calculate the mean, median, mode, variance,
standard deviation, and range. Visualize the distribution with a histogram and
box plot.
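A minimal Python sketch of this exercise, assuming NumPy, SciPy (≥ 1.9 for the keepdims argument), and Matplotlib are available; the exam scores are made up:

```python
# Descriptive statistics and plots for a hypothetical set of exam scores.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

scores = np.array([55, 62, 68, 70, 71, 73, 75, 75, 78, 82, 85, 91])

mean = scores.mean()
median = np.median(scores)
mode = stats.mode(scores, keepdims=False).mode   # SciPy >= 1.9
variance = scores.var(ddof=0)     # population variance, matching the 1/n formula above
std_dev = scores.std(ddof=0)
data_range = scores.max() - scores.min()

print(f"mean={mean:.2f} median={median} mode={mode} "
      f"variance={variance:.2f} std={std_dev:.2f} range={data_range}")

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(scores, bins=6)              # histogram: frequency per score range
ax1.set_title("Histogram of exam scores")
ax2.boxplot(scores, vert=False)       # box plot: spread and potential outliers
ax2.set_title("Box plot of exam scores")
plt.show()
```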

3. Probability

 Basic Probability Concepts:


o Sample Space (S): The set of all possible outcomes (e.g., for a dice roll: {1, 2,
3, 4, 5, 6}).
o Event: A subset of the sample space (e.g., rolling an even number).
o Probability of an Event (A): P(A) = \frac{\text{Number of favorable outcomes}}{\text{Total number of outcomes}}
 Conditional Probability:
o Formula: P(A|B) = \frac{P(A \cap B)}{P(B)}
o Bayes' Theorem: P(A|B) = \frac{P(B|A) P(A)}{P(B)}
o Example: Calculate conditional probabilities and apply Bayes' Theorem to
predict outcomes.
 Z-Score:
o A Z-score measures how many standard deviations a data point is from the mean: Z = \frac{X - \mu}{\sigma}
o Interpretation: A Z-score of 2 means the data point is 2 standard deviations
above the mean.
 Practical Example: Calculate the Z-scores for a dataset of exam scores and interpret the results (the sketch below works through both the Bayes' Theorem and Z-score examples).
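A minimal Python sketch covering both examples above: Bayes' Theorem with hypothetical probabilities for a diagnostic-test scenario, and Z-scores for a made-up set of exam scores:

```python
# Bayes' Theorem and Z-scores on invented numbers.
import numpy as np

# Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B)
# Hypothetical scenario: A = "has condition", B = "test is positive".
p_a = 0.01                     # prior P(A)
p_b_given_a = 0.95             # sensitivity P(B|A)
p_b_given_not_a = 0.05         # false-positive rate P(B|not A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)   # law of total probability
p_a_given_b = p_b_given_a * p_a / p_b
print(f"P(A|B) = {p_a_given_b:.3f}")   # ~0.161: a positive test is still far from certain

# Z-scores: Z = (X - mu) / sigma
scores = np.array([55, 62, 70, 75, 78, 85, 91])
z = (scores - scores.mean()) / scores.std()
print(np.round(z, 2))          # a Z of 2 means 2 standard deviations above the mean
```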

4. Inferential Statistics (45 minutes)

Objective: Understand the concepts of inferential statistics, which enable us to make predictions or inferences about a population from sample data.

 What is Inferential Statistics?


Inferential statistics is a branch of statistics that allows us to make conclusions or
inferences about a population based on sample data. Unlike descriptive statistics,
which only summarize the data at hand, inferential statistics uses data from a sample
to estimate, predict, or test hypotheses about a population.

Inferential statistics is powerful because collecting data from an entire population is often impractical or impossible. Instead, we collect data from a sample, and using inferential methods, we make predictions or generalizations about the larger population from which the sample is drawn.

 Key Concepts in Inferential Statistics:


o Population vs. Sample:
 Population: The entire group being studied (e.g., all employees in a
company, all customers in a region).
 Sample: A subset of the population selected for analysis. Samples are
used because it's often impractical to collect data from the entire
population.
o Parameter vs. Statistic:
 Parameter: A value that describes a characteristic of the entire
population (e.g., population mean, population variance).
 Statistic: A value that describes a characteristic of a sample (e.g.,
sample mean, sample variance).
o Point Estimation: Estimating population parameters (such as the population mean) using sample statistics. Point estimates are used when we want to estimate a specific value of a population characteristic (a sketch follows this list).
o Confidence Intervals: Provides a range of values that is likely to contain the
population parameter with a certain level of confidence (e.g., 95% confidence
interval).
o Hypothesis Testing: Using sample data to test assumptions or claims
(hypotheses) about a population. Hypothesis testing helps us to validate or
refute assumptions about the population using statistical evidence.
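A minimal Python sketch of the parameter/statistic distinction and point estimation, using a simulated population (all numbers are invented):

```python
# Parameter vs. statistic: the sample mean as a point estimate of the population mean.
import numpy as np

rng = np.random.default_rng(42)
population = rng.normal(loc=50_000, scale=8_000, size=1_000_000)  # simulated incomes

parameter = population.mean()                   # population parameter (usually unknown)
sample = rng.choice(population, size=200, replace=False)
statistic = sample.mean()                       # sample statistic = point estimate

print(f"population mean (parameter): {parameter:.0f}")
print(f"sample mean (point estimate): {statistic:.0f}")
```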

5. Central Limit Theorem (CLT)

Definition: The Central Limit Theorem states that the distribution of the sample mean approaches a normal distribution as the sample size increases, even if the population distribution is not normal. This holds for sufficiently large sample sizes (typically n \geq 30).

Key Importance: The CLT allows the normal distribution to be used for hypothesis testing and estimation, even when the underlying data are not normally distributed.
 Why It Matters:
o The CLT is the foundation for much of inferential statistics because it allows
us to make statistical inferences (such as hypothesis testing and confidence
intervals) based on the assumption of normality, even when the underlying
population is not normally distributed.
o This means that we can use methods such as z-tests and t-tests, which assume
normality, even for non-normally distributed populations, as long as we have a
sufficiently large sample.

 Application of CLT:

 The CLT allows us to estimate population parameters (such as the population mean)
based on the sample mean, and to calculate the standard error of the mean.
 The sampling distribution of the sample mean is normal (or nearly normal)
regardless of the shape of the population distribution, as long as the sample size is
large enough.
 Practical Example: Simulate sampling from a skewed dataset (e.g., an income distribution) and demonstrate how the sampling distribution of the mean approaches normality as the sample size increases, as in the sketch below.
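A minimal Python sketch of this simulation, assuming NumPy and Matplotlib; the exponential "income" population is a stand-in for any skewed distribution:

```python
# CLT demo: sampling distribution of the mean from a skewed population.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
population = rng.exponential(scale=40_000, size=1_000_000)  # heavily right-skewed

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, n in zip(axes, [2, 10, 50]):
    # Draw 5,000 samples of size n and record each sample's mean.
    sample_means = [rng.choice(population, size=n).mean() for _ in range(5_000)]
    ax.hist(sample_means, bins=50)
    ax.set_title(f"Sample means, n={n}")
plt.show()   # the n=50 panel should look close to a bell curve
```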

6. Hypothesis Testing

 Hypothesis Testing Basics:

 Null Hypothesis (H₀): The default assumption that there is no effect, no difference,
or no relationship in the population (e.g., the mean income is $50,000).
 Alternative Hypothesis (H₁): The hypothesis that we are trying to prove, asserting
that there is an effect or a difference (e.g., the mean income is not $50,000).
 Steps in Hypothesis Testing:
1. Formulate the Hypotheses: Define H₀ and H₁.
2. Select the Significance Level (α): This is the probability threshold for
rejecting the null hypothesis (commonly 0.05 or 0.01).
3. Collect Data: Gather a sample to test the hypothesis.
4. Perform the Test: Use a test statistic (e.g., t-test or z-test) to analyze the data.
5. Make a Decision: If the p-value ≤ α, reject H₀. If p-value > α, fail to reject
H₀.

 P-value:

 The probability of obtaining a result at least as extreme as the observed one, assuming
the null hypothesis is true.
 Interpretation: A small p-value (typically ≤ 0.05) suggests that the observed result is
unlikely to have occurred under the null hypothesis, so we reject H₀. If the p-value is
large, we fail to reject H₀, indicating insufficient evidence to support H₁.

 Types of Errors:

 Type I Error (False Positive): Rejecting a true null hypothesis (e.g., concluding
there is a difference when there is none).
 Type II Error (False Negative): Failing to reject a false null hypothesis (e.g.,
concluding there is no difference when there actually is).


 T-test:
o One-sample T-test: Tests if the sample mean differs from a known value.
o Two-sample T-test: Compares means from two independent groups (e.g.,
testing if male and female salaries differ).
o Paired T-test: Compares means of two related groups (e.g., pre-test vs post-
test scores).
 Practical Example: Perform a two-sample t-test to compare the average scores of
two groups.
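A minimal Python sketch of the two-sample t-test using scipy.stats; the group scores are made up:

```python
# Two-sample t-test comparing the mean scores of two independent groups.
import numpy as np
from scipy import stats

group_a = np.array([72, 75, 78, 80, 69, 74, 77, 81, 73, 76])
group_b = np.array([65, 70, 68, 72, 66, 71, 69, 74, 67, 70])

# Welch's t-test (equal_var=False) is a safer default when group variances may differ.
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

alpha = 0.05
if p_value <= alpha:
    print("Reject H0: the group means differ significantly.")
else:
    print("Fail to reject H0: insufficient evidence of a difference.")
```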

7. Confidence Intervals

A confidence interval is a range of values that is likely to contain the true population
parameter with a certain level of confidence (e.g., 95% confidence interval).

 Formula for Confidence Interval:

\text{CI} = \bar{X} \pm Z \times \frac{\sigma}{\sqrt{n}}

Where:

o \bar{X} is the sample mean,
o Z is the Z-score for the desired confidence level (e.g., 1.96 for 95% confidence),
o \sigma is the population standard deviation (when only the sample standard deviation is available, a t-score is used in place of Z, especially for small samples),
o n is the sample size.
 Interpretation:
o A 95% confidence interval means that if we repeated the sampling procedure many times, about 95% of the resulting intervals would contain the true population parameter.
o It gives us an estimate of the precision of our sample statistic and is used to
infer the true value of a population parameter.
 Inferential Statistics Link: Confidence intervals help us make inferences about a
population parameter by using a sample, and they give us a range that is likely to
contain the true population value.
 Practical Example:
o Calculate a 95% confidence interval for the average height based on a sample
dataset.
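A minimal Python sketch of this calculation with scipy.stats; since the population standard deviation is unknown and the sample is small, the t-distribution is used instead of the Z-score (the heights are made up):

```python
# 95% confidence interval for the mean height of a hypothetical sample.
import numpy as np
from scipy import stats

heights = np.array([5.4, 5.6, 5.5, 5.7, 5.3, 5.6, 5.5, 5.8, 5.4, 5.6])  # feet
n = len(heights)
mean = heights.mean()
sem = stats.sem(heights)                 # standard error of the mean = s / sqrt(n)

low, high = stats.t.interval(0.95, n - 1, loc=mean, scale=sem)
print(f"95% CI for the mean height: ({low:.3f}, {high:.3f}) feet")
```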

8. A/B Testing (30 minutes)

Objective: Understand and implement A/B testing in data science.

 A/B Testing Overview:


 Definition: A/B testing is a randomized controlled experiment comparing two
versions of a product or feature to determine which one performs better (e.g., website
layout, email subject lines).

 Steps in A/B Testing:

1. Formulate Hypotheses: Null hypothesis (no difference) and alternative hypothesis (there is a difference).
2. Random Assignment: Randomly assign users to two groups (Group A and Group B).
3. Measure and Compare: Collect performance metrics (e.g., conversion rates, click-
through rates) from both groups.
4. Statistical Testing: Use t-tests or z-tests to determine if the difference between
groups is statistically significant.

 Inferential Statistics in A/B Testing:

 A/B testing uses inferential statistics to determine if the observed difference between
two groups is statistically significant. This allows businesses to infer whether one
version of a product is superior to another.


 Practical Example: Implement an A/B test comparing two versions of a website (e.g., different button colors) to see which has a higher conversion rate, as in the sketch below.
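A minimal Python sketch of the statistical-testing step, using a two-proportion z-test on invented visitor and conversion counts:

```python
# Two-proportion z-test for a hypothetical A/B test of button colors.
import numpy as np
from scipy import stats

conversions = np.array([120, 150])   # conversions in groups A and B (made up)
visitors = np.array([2400, 2500])    # visitors assigned to groups A and B

p_pool = conversions.sum() / visitors.sum()             # pooled conversion rate
se = np.sqrt(p_pool * (1 - p_pool) * (1 / visitors[0] + 1 / visitors[1]))
z = (conversions[1] / visitors[1] - conversions[0] / visitors[0]) / se
p_value = 2 * stats.norm.sf(abs(z))                     # two-sided p-value

print(f"z = {z:.2f}, p = {p_value:.4f}")
print("Significant at alpha=0.05" if p_value <= 0.05 else "Not significant")
```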


Difference Between Inferential Statistics and Descriptive Statistics

Objective: Understand the key distinctions between Inferential Statistics and Descriptive
Statistics, and how each is used in data analysis.

Descriptive Statistics:
Definition: Descriptive statistics refers to methods that summarize and describe the main
features of a dataset. It is focused on organizing, presenting, and analyzing data in a way
that is easy to understand, without making predictions or inferences about a larger population.

 Purpose:
o The main goal is to describe the data that you have collected in a meaningful
way. This includes measures like averages, percentages, and frequencies that
help give a clear picture of the dataset.
 Key Operations:
o Central Tendency: Measures such as mean, median, and mode, which tell us
about the "center" of the data.
o Dispersion: Measures like range, variance, and standard deviation that
describe how spread out the data is.
o Shape of the Distribution: Measures such as skewness and kurtosis that
describe the shape of the data distribution.
o Visualizations: Histograms, bar charts, pie charts, and box plots that
graphically represent the data.
 Limitations:
o No Generalizations: Descriptive statistics do not allow us to make
generalizations or predictions about a population. They only describe the
sample data at hand.
o No Hypothesis Testing: It does not test hypotheses or predict future
outcomes.
 Example:
o Suppose you have a dataset of the heights of 100 students. Descriptive
statistics would allow you to calculate the average height, the range, and the
standard deviation of heights, but it wouldn't tell you about the population of
all students outside this sample.

Inferential Statistics:

Definition: Inferential statistics involves using sample data to make predictions, generalizations, or inferences about a population from which the sample is drawn. It allows us to go beyond the immediate data and make conclusions or test hypotheses about a larger group.

 Purpose:
o The main goal is to make inferences or draw conclusions about a population
based on a sample. It uses probability theory to make predictions or decisions
about a population's characteristics.
 Key Operations:
o Estimation: Estimating population parameters (e.g., estimating the mean or
variance of a population from sample data).
o Hypothesis Testing: Testing assumptions or claims about a population (e.g.,
testing whether the average income of a group is different from a known
value).
o Confidence Intervals: Estimating a range of values for a population
parameter with a certain level of confidence (e.g., estimating the population
mean with a 95% confidence interval).
o Predictive Modeling: Using statistical models to predict future outcomes
(e.g., predicting future sales based on historical data).
 Limitations:
o Sample-Dependent: The conclusions drawn are based on the sample, and
there is always a level of uncertainty. The accuracy of inferences depends on
the sample size and representativeness.
o Requires Assumptions: Many inferential techniques (e.g., t-tests, regression
models) require assumptions, such as normality or independence of
observations.
 Example:
o Suppose you take a sample of 100 students' heights from a university and find
an average height of 5.5 feet with a standard deviation of 0.2 feet. Inferential
statistics could help you estimate the average height of all students at that
university (population) and test whether the sample mean significantly differs
from the average height in other populations.

Key Differences:

Feature | Descriptive Statistics | Inferential Statistics
Purpose | Summarizes and describes the data at hand | Makes predictions or inferences about a population from sample data
Data Focus | Deals with the data you currently have | Uses sample data to draw conclusions about a larger population
Generalizations | No generalizations beyond the dataset | Makes generalizations and predictions beyond the sample
Key Techniques | Mean, median, mode, range, variance, standard deviation | Estimation, hypothesis testing, confidence intervals, regression
Assumptions | None needed for summarizing data | Requires assumptions (e.g., sample size, normality)
Visualizations | Bar charts, histograms, box plots, etc. | More abstract visualizations, such as confidence intervals and hypothesis-testing plots
Example | Calculating the mean height of a group of students | Estimating the average height of all students in a university based on a sample

Conclusion:

 Descriptive statistics help us summarize and describe the dataset we have in a meaningful way, providing a snapshot of the data.
 Inferential statistics allows us to make conclusions, predictions, or generalizations about a larger population, using the data we have from a sample. It involves statistical modeling and hypothesis testing to make informed decisions based on sample data.

Both branches of statistics are essential in data science. Descriptive statistics provide the foundation for data exploration, while inferential statistics enable us to make evidence-based decisions and predictions for the broader population.

9. Conclusion and Q&A (10 minutes)

Objective: Recap the session and address any questions.

 Recap:
o We covered data types, descriptive statistics, probability, the Central Limit Theorem, hypothesis testing, confidence intervals, and A/B testing.
o How these concepts help in making decisions based on data in machine
learning and data science.
 Q&A: Open the floor for final questions and clarifications.

Statistics plays a crucial role in Exploratory Data Analysis (EDA) by helping data
scientists and analysts summarize, visualize, and understand the underlying structure of data
before moving to more complex modeling tasks. Here’s how statistics contributes to the EDA
process:

1. Understanding the Data Distribution

 Descriptive Statistics:
o Descriptive statistics (mean, median, mode, variance, standard deviation)
provide an initial understanding of the central tendency and spread of the
data. This helps in identifying the key characteristics of each variable.
o For example, knowing the mean and standard deviation of a numeric
column gives you a sense of the average value and how spread out the values
are.
 Visualizing Distribution:
o Histograms and density plots are used to visualize how data is distributed
(e.g., normal, skewed, uniform).
o Understanding the distribution helps you know which statistical techniques to
apply and whether certain transformations (e.g., log transformation) might be
needed to normalize the data for modeling.

2. Identifying Outliers

 Box Plots:
o Box plots provide a clear visualization of the spread and help identify
outliers (data points that fall outside the whiskers). Outliers can significantly
affect model performance and are thus crucial to identify during EDA.
 Z-scores:
o Z-scores can also help detect outliers. Data points with a Z-score greater than
3 or less than -3 are typically considered outliers. This step is useful in
detecting extreme values that might need further investigation or removal from
the dataset.
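A minimal Python sketch of both outlier checks on a made-up sample with one injected extreme value:

```python
# Z-score and box-plot outlier checks on hypothetical data.
import numpy as np
import matplotlib.pyplot as plt

data = np.array([50, 52, 49, 51, 53, 48, 54, 50, 52, 51, 49, 53, 50, 52, 150])

z = (data - data.mean()) / data.std()
print("Z-score outliers:", data[np.abs(z) > 3])   # flags the 150

plt.boxplot(data, vert=False)   # points beyond the whiskers appear as outliers
plt.title("Box plot with an injected outlier")
plt.show()
```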

3. Understanding Relationships Between Variables

 Correlation and Covariance:


o Correlation analysis (e.g., Pearson’s correlation coefficient) helps identify
relationships between two or more numeric variables. Positive or negative
correlations can indicate how one variable changes with respect to another.
o Covariance measures the direction of the joint variation between two variables; correlation is covariance rescaled by the variables' standard deviations, which bounds it between -1 and 1 and makes it unit-free (both are demonstrated in spirit in the sketch after this list).
 Scatter Plots:
o Scatter plots are an essential tool for visualizing the relationship between two
continuous variables. They help uncover linear or non-linear relationships,
trends, or patterns that can guide feature engineering or model selection.
 Cross-tabulation and Chi-Square Tests:
o For categorical data, cross-tabulation and Chi-square tests help analyze the
relationship between two categorical variables by checking for independence
or association.
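A minimal Python sketch of these checks: Pearson correlation with a scatter plot for two simulated numeric variables, and a chi-square test of independence on an invented 2x2 contingency table:

```python
# Correlation, scatter plot, and chi-square test on made-up data.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(1)
hours_studied = rng.uniform(0, 10, size=50)
exam_score = 50 + 4 * hours_studied + rng.normal(0, 5, size=50)

r, p = stats.pearsonr(hours_studied, exam_score)
print(f"Pearson r = {r:.2f}, p = {p:.4f}")

plt.scatter(hours_studied, exam_score)
plt.xlabel("Hours studied")
plt.ylabel("Exam score")
plt.show()

# Chi-square test of independence on a hypothetical 2x2 contingency table.
table = np.array([[30, 10],
                  [20, 25]])
chi2, p_chi, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p_chi:.4f}")
```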

4. Identifying Data Patterns and Structure

 Skewness and Kurtosis:


o Skewness indicates the asymmetry of the data distribution (whether the data is
skewed left or right).
o Kurtosis tells you whether the data has heavy or light tails. High kurtosis
indicates a high concentration of data points in the tails, which could be a sign
of outliers.
 Box-Cox Transformation:
o For data that is not normally distributed, transformations like Box-Cox can be
applied to normalize the distribution. This step improves the accuracy of
models that assume normality (e.g., linear regression, logistic regression).
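A minimal Python sketch measuring skewness and kurtosis on simulated right-skewed data, then normalizing it with a Box-Cox transformation via scipy.stats:

```python
# Skewness, kurtosis, and Box-Cox normalization of a right-skewed sample.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
incomes = rng.lognormal(mean=10, sigma=0.8, size=1_000)    # right-skewed data

print(f"skewness = {stats.skew(incomes):.2f}")             # > 0: right-skewed
print(f"excess kurtosis = {stats.kurtosis(incomes):.2f}")  # heavy tails

# Box-Cox requires strictly positive data; lambda is estimated automatically.
transformed, lam = stats.boxcox(incomes)
print(f"lambda = {lam:.3f}, skewness after = {stats.skew(transformed):.2f}")
```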

5. Feature Selection and Engineering

 Identifying Important Features:


o Statistical methods such as correlation analysis, ANOVA, or mutual information help select the most relevant features for modeling during EDA. Features that correlate strongly with the target variable are good candidates to keep, while features with near-zero variance carry little information and are candidates for removal.
 Handling Missing Data:
o Statistics is vital in dealing with missing data. Techniques like mean
imputation, median imputation, or multiple imputation are used to handle
missing values based on statistical reasoning, which helps in preventing biases
or errors in subsequent modeling.
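A minimal Python sketch of mean and median imputation on an invented DataFrame with missing values:

```python
# Mean and median imputation of missing values (pandas assumed).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 29, np.nan],
    "income": [48_000, 54_000, 61_000, np.nan, 52_000, 58_000],
})

df["age_mean_imputed"] = df["age"].fillna(df["age"].mean())
df["income_median_imputed"] = df["income"].fillna(df["income"].median())
print(df)

# Median imputation is more robust when a column is skewed or has outliers;
# multiple imputation (not shown) also propagates the uncertainty of the fill.
```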

6. Preparing Data for Modeling

 Statistical Tests:
o During EDA, statistical tests (e.g., t-tests, ANOVA) help assess the
significance of variables. These tests provide a way to check the reliability and
importance of features in explaining the target variable.
 Normalization and Standardization:
o Standardizing or normalizing features (scaling them to a standard range or distribution) is essential for some machine learning algorithms. This can be done using statistical techniques, ensuring that no feature dominates due to differences in scale (a sketch follows this list).
 Identifying Data Types:
o During EDA, statistical summaries allow you to classify data types (nominal,
ordinal, interval, ratio) and choose the appropriate analysis method. This is
especially important when deciding between techniques like chi-square tests
for categorical data or regression for continuous data.
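A minimal Python sketch of standardization and min-max normalization on two invented feature columns of very different scales:

```python
# Standardization (zero mean, unit variance) and min-max normalization.
import pandas as pd

df = pd.DataFrame({
    "height_cm": [160.0, 172.0, 181.0, 168.0, 175.0],
    "income":    [48_000.0, 54_000.0, 61_000.0, 45_000.0, 52_000.0],
})

standardized = (df - df.mean()) / df.std(ddof=0)   # z-score per column
print(standardized.round(2))

# Min-max normalization is an alternative when a bounded [0, 1] range is wanted:
normalized = (df - df.min()) / (df.max() - df.min())
print(normalized.round(2))
```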

7. Hypothesis Generation

 Statistical Reasoning:
o During EDA, you often generate hypotheses about the relationships in the data
or patterns that might exist. Statistical reasoning helps you form questions
like:
 Is the mean of one group different from another group?
 Is there a significant relationship between two variables?
o This step is essential for guiding further analyses and model-building steps.

Conclusion:

Statistics is at the heart of Exploratory Data Analysis (EDA) because it helps uncover the
distribution, relationships, and patterns in data. By applying descriptive and inferential
statistical techniques, data scientists can better understand the data, identify problems such as
outliers or missing values, and prepare the data for subsequent modeling. This statistical
foundation is crucial for making informed decisions, refining models, and ensuring that
insights drawn from the data are valid and reliable.
