Module 2 - Part 1

University of Mumbai

Program – Bachelor of Engineering in Computer Science and Engineering
(Artificial Intelligence and Machine Learning)

Class – T.E.
Course Code – CSDLO5011
Course Name – Statistics for Artificial Intelligence Data Science

By
Prof. A. V. Phanse
Sampling –
 In statistics, sampling is a method of selecting a subset of the population to
make statistical inferences.

 From the sample, the characteristics of the whole population can be estimated.

 Sampling can be classified into two different types, namely probability
sampling and non-probability sampling.
What is Probability Sampling?

 The probability sampling method utilizes some form of random selection. In
this method, every eligible individual in the whole sample space has a chance
of being selected in the sample.
 This method is more time-consuming and expensive than the non-probability
sampling method.
 The benefit of using probability sampling is that it helps ensure the sample is
representative of the population.

What is Non-Probability Sampling?

 The non-probability sampling method is a technique in which the researcher
selects the sample based on subjective judgment rather than random
selection.
 In this method, not all members of the population have a chance to
participate in the study.
Simple Random Sampling

 In the simple random sampling technique, every item in the population has an
equal and likely chance of being selected in the sample.
 Since the item selection depends entirely on chance, this method is known as
the "Method of Chance Selection".
 When the sample size is large and items are chosen randomly, it is known as
"Representative Sampling".

Example:
Suppose we want to select a simple random sample of 200 students from a school.
Here, we can assign a number to every student in the school database from 1 to
500 and use a random number generator to select a sample of 200 numbers.
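The example above can be sketched in a few lines of Python; the roster of 500 student IDs is hypothetical.

```python
import random

# Hypothetical roster: student IDs numbered 1 to 500, as in the example above
student_ids = list(range(1, 501))

random.seed(42)  # fixed seed so the sketch is reproducible
sample = random.sample(student_ids, k=200)  # simple random sample, no repeats

print(len(sample))       # 200
print(len(set(sample)))  # 200 -- every selected student is distinct
```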
Systematic Sampling

 In the systematic sampling method, items are selected from the target
population by choosing a random starting point and then selecting every item
after a fixed sampling interval.
 The interval is calculated by dividing the total population size by the desired
sample size.

Example:
 Suppose the names of 300 students of a school are sorted in reverse
alphabetical order.
 To select a sample of 20 students, the sampling interval is 300 / 20 = 15. We
randomly select a starting number, say 5. From number 5 onwards, we select
every 15th person from the sorted list. Finally, we end up with a sample of 20
students.
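The selection can be sketched as a list slice: with 300 names and a desired sample of 20, the interval is 300 / 20 = 15, and a start of 5 picks every 15th name thereafter (the student names below are hypothetical).

```python
# Systematic sampling sketch: population of 300 sorted names (hypothetical),
# desired sample size 20, so the interval k = 300 // 20 = 15
population = [f"student_{i}" for i in range(1, 301)]

k = len(population) // 20   # sampling interval: 15
start = 5                   # random starting point between 1 and k, say 5
sample = population[start - 1::k]  # positions 5, 20, 35, ...

print(len(sample))  # 20
```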
Stratified Sampling

 In a stratified sampling method, the total population is divided into smaller
groups (strata) to complete the sampling process.
 Each group is formed based on shared characteristics within the population.
 After separating the population into strata, statisticians randomly select a
sample from each group.

Example :
There are three bags (A, B and C), each with different balls. Bag A has 50 balls, bag
B has 100 balls, and bag C has 200 balls. We have to choose a sample of balls from
each bag proportionally. Suppose 5 balls from bag A, 10 balls from bag B and 20
balls from bag C.
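The bag example can be sketched with proportional allocation, drawing 10% from each stratum (the ball labels A0, B0, ... are hypothetical):

```python
import random

random.seed(0)
# Hypothetical strata mirroring the bag example: 50, 100 and 200 balls
strata = {
    "A": [f"A{i}" for i in range(50)],
    "B": [f"B{i}" for i in range(100)],
    "C": [f"C{i}" for i in range(200)],
}

fraction = 0.10  # proportional allocation: sample 10% of each stratum
sample = {name: random.sample(balls, k=int(len(balls) * fraction))
          for name, balls in strata.items()}

print({name: len(chosen) for name, chosen in sample.items()})
# {'A': 5, 'B': 10, 'C': 20}
```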
Clustered Sampling

 In the clustered sampling method, clusters or groups of people are formed
from the population set.
 Each group has similar characteristics, and every cluster has an equal chance
of being part of the sample.
 This method uses simple random sampling on the clusters of the population.

Example:
An educational institution has ten branches across the country with almost the
same number of students. If we want to collect some data regarding facilities and
other things, we can’t travel to every unit to collect the required data. Hence, we
can use random sampling to select three or four branches as clusters.
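A minimal sketch of the branch example, assuming ten hypothetical branch names; whole clusters are drawn at random, and every unit inside a chosen cluster would then be surveyed:

```python
import random

random.seed(1)
# Hypothetical institution with ten branches, each acting as a cluster
branches = [f"branch_{i}" for i in range(1, 11)]

# Select three entire clusters by simple random sampling; all students
# in the chosen branches would then be surveyed
chosen = random.sample(branches, k=3)
print(chosen)
```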
Data and Sampling Distributions
 The left hand side figure represents
a population that is assumed to
follow an underlying but unknown
distribution.
 The right hand side figure is the
sample data and its empirical
distribution.
 To get from the left hand side to the
right hand side, a sampling
procedure is used (represented by
an arrow).

 Traditional statistics focused very much on the left-hand side, using theory
based on strong assumptions about the population.
 Modern statistics has moved to the right hand side, where such assumptions are
not needed.
Random Sampling and Sample Bias

 A sample is a subset of data from a larger data set i.e. population.

 A population is a large, defined set of data.

 Random sampling is a process in which each available member of the


population being sampled has an equal chance of being chosen for the sample
at each draw. The sample that results is called a simple random sample.

 Simple Random Sampling with Replacement (SRSWR) –
When simple random samples are selected such that each unit, after being
selected as a sample unit, is remixed or replaced into the population before the
selection of the next unit, the method is known as simple random sampling
with replacement.

 Simple Random Sampling without Replacement (SRSWOR) –
When simple random samples are selected such that a unit, once selected as a
sample unit, is not mixed or replaced into the population before the selection
of the next unit, the method is known as simple random sampling without
replacement; i.e., once a unit is selected in the sample, it will never be selected
again.
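The two schemes map directly onto Python's standard library: `random.sample` draws without replacement (SRSWOR) and `random.choices` draws with replacement (SRSWR). A small sketch with a hypothetical population of ten units:

```python
import random

random.seed(7)
population = list(range(1, 11))  # hypothetical population of 10 units

# SRSWOR: a unit, once drawn, is not returned, so no repeats are possible
srswor = random.sample(population, k=5)

# SRSWR: each unit is replaced before the next draw, so repeats may occur
srswr = random.choices(population, k=5)

print(len(set(srswor)))  # 5 -- all distinct, guaranteed
print(srswr)             # may contain the same unit more than once
```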
Sampling bias occurs when some members of a population are systematically more
likely to be selected in a sample than others.

Types of sample bias -

Self-Selection Bias
 For example, suppose you're conducting a survey about local water quality.
 People already interested in this topic are more likely to respond and, thus, be
overrepresented in the results.
 This group likely has opinions that differ from the general population.

Nonresponse Bias
 For example, subjects with health issues might not be able to complete a study
for a physical fitness program.
 Consequently, the program appears more effective in the sample than in the
population.
Survivorship Bias
 Studies that assess a sample of existing companies are a classic example of
this bias.
 By focusing on the financial status of active companies, these studies don't
include those that have gone out of business.

Pre-screening or Advertising Bias
 For example, a study that advertises a fitness improvement program is more
likely to find subjects who are already motivated to get fit.
 Hence, the program might be more effective in this sample than in the general
population.
Undercoverage Bias
 For example, homeless people are unlikely to appear on various lists and won't
have an address or phone number.
 Consequently, samples are unlikely to include them.

Healthy User Bias
 A study is conducted to assess the effectiveness of a new exercise program on
reducing cardiovascular disease risk.
 The participants are selected from a group of people who use a fitness app.
 The people who use a fitness app are likely more health-conscious and
physically active than the average person, making them less representative of
the general population.
How to Avoid Sampling Bias

1. Use random or stratified sampling - Stratified random sampling will help ensure
you get a representative research sample and reduce the interference of
irrelevant variables in your systematic investigation.
2. Avoid convenience sampling - Rather than collecting data from only easily
accessible or available participants, you should gather data from the different
subgroups that make up your population of interest.
3. Clearly define a target population and a sampling frame - Matching the
sampling frame to the target population as much as possible will reduce the risk
of sampling bias.
4. Follow up on non-responders - When people drop out or fail to respond to your
survey, do not ignore them, but rather follow up to determine why they are
unresponsive and see if you can garner a response.
5. Oversampling - Oversampling can be used to avoid sampling bias in cases where
members of the defined population are underrepresented.
6. Aim for a large research sample - The larger your sample population, the more
likely you are to represent all subgroups from your population of interest.
7. Set up quotas for each identified demographic - If you think participant gender,
age, or some other demographic characteristic is a potential source of bias
within your study, quotas will allow you to evenly sample people from different
demographic groups within the study.
Bias

 Statistical bias refers to a systematic error that causes an estimator or a


measurement process to consistently differ from the true value or the expected
result.

 In other words, bias refers to a flaw in the experiment design or data collection
process, which generates results that don’t accurately represent the population.

 Bias can lead to inaccuracies in data analysis, making the results less reliable.

 There are various types of statistical bias, each arising from different sources,
such as the design of a study, data collection methods, or data analysis
techniques.
Data Size Versus Data Quality
 Data quality is often more important than data quantity, and random sampling
can reduce bias and facilitate quality improvement that would otherwise be
prohibitively expensive.

 In the era of big data, it is sometimes surprising that smaller is better.

 Time and effort spent on random sampling not only reduces bias but also
allows greater attention to data exploration and data quality.

 For example, missing data and outliers may contain useful information. It might
be difficult to track down missing values or evaluate outliers in millions of
records, but doing so in a sample of several thousand records may be feasible.
Data Size

Pros:

 Increased Accuracy: Larger datasets can provide more information and


reduce the margin of error in statistical analyses.
 Better Representation: A bigger sample size is more likely to represent the
population accurately, reducing sampling bias.
 Powerful Models: More data can lead to more powerful machine learning
models that can capture complex patterns.

Cons:

 Storage and Processing: Large datasets require more storage space and
computational power, which can be costly and time-consuming.
 Noise and Redundancy: With more data, there's a higher chance of including
irrelevant or duplicate information, which can obscure important patterns.
Data Quality

Pros:

 Accuracy and Reliability: High-quality data ensures that the information is


accurate, complete, and reliable, leading to better decision-making.
 Efficiency: Good quality data requires less cleaning and preprocessing, saving
time and resources.
 Trust: Reliable data builds trust among stakeholders and users.

Cons:

 Cost and Effort: Ensuring high data quality can be resource-intensive, requiring
thorough validation, cleaning, and maintenance.
 Limited Scope: High-quality data might be more challenging to obtain in large
volumes, limiting the scope of analysis.
Balancing Data Size and Data Quality

Trade-offs:
Sometimes, there is a trade-off between data size and quality. It's crucial to find
a balance that suits the specific use case. For example, in some cases, a smaller
dataset of high-quality data may be more valuable than a larger, low-quality
dataset.

Context and Purpose:


The importance of data size versus quality depends on the context. For instance,
in medical research, high data quality is crucial, while in marketing, larger
datasets may be more beneficial for capturing trends and patterns.

Data Governance:
Implementing strong data governance practices can help ensure both data
quality and manageability, regardless of size.
In practice, prioritizing both large amounts of high-quality data is ideal, but the
focus may shift depending on the specific needs and constraints of a project.
Sample Mean Versus Population Mean
Sample Mean
 Definition: The sample mean is the average of a set of observations taken
from a larger population.
 It is used to estimate the population mean when it is impractical or impossible
to measure the entire population.
 The value of the sample mean can vary from one sample to another.

Population Mean
 Definition: The population mean is the average of all possible observations in
the entire population.
 It represents the true average of the entire population and is a parameter of
the population.
 It is a fixed value; it does not change unless the population itself changes.

In summary, the sample mean is a practical tool for estimating the population
mean, when the population is too large or impractical to measure its entirety.
Selection Bias

 Selection bias refers to the practice of selectively choosing data consciously or


unconsciously in a way that leads to a conclusion that is misleading.

 If you specify a hypothesis and conduct a well-designed experiment to test it,
you can have high confidence in the conclusion. But this is frequently not what
happens in practice.

 Often, one looks at available data and tries to understand or derive patterns.
These patterns are many a time the result of data snooping.

 Data snooping is the process of extensive hunting through the data until
something interesting emerges.
Regression to the Mean
 Regression to the mean is a statistical phenomenon that occurs when extreme
values in a data set tend to be closer to the average on subsequent
measurements. The phenomenon was first identified by Francis Galton in 1886.

 This effect is particularly noticeable in cases where there is some degree of


random variability or error in the measurements.

 When a variable is extreme on its first measurement, it will likely be closer to


the mean on its next measurement.

 This happens due to random variation or measurement error, not because of


any actual change in the underlying variable.

Example:
Imagine measuring students' scores on two tests.
Students who score extremely high or low on the first test are likely to score closer
to the average on the second test, simply due to random fluctuations in test
performance.
Sampling Distribution of a Statistic

 The term sampling distribution of a statistic refers to the distribution of some


sample statistic over many samples drawn from the same population.
 It is constructed by repeatedly drawing samples of the same size from a
population, calculating the statistic for each sample, and then plotting the
distribution of these statistics.

 Typically, a sample is drawn with the goal of measuring something or modeling


something.
 Since our estimate or model is based on a sample, it might contain error.
 We are therefore interested in sampling variability. If the data is huge, we could
draw additional samples and observe the distribution of a sample statistic
directly
 It is important to distinguish between the distribution of the individual data
points, known as the data distribution, and the distribution of a sample
statistic, known as the sampling distribution.
 The distribution of a sample statistic such as the mean is likely to be more
regular and bell-shaped than the distribution of the data itself.
[Figure: Difference between the population distribution and the sampling
distribution, illustrated for a population of 10,000 people with an average
height of 5'4"]
Central Limit Theorem

 Central limit theorem says that the means drawn from multiple samples will
resemble the familiar bell-shaped normal curve even if the source population
is not normally distributed, provided that the sample size is large enough and
the departure of the data from normality is not too great.

 As the sample size increases, the distribution of the sample means becomes
increasingly normal. This is particularly useful because the normal distribution
has well-known properties, making it easier to make inferences about the
population.

 The theorem applies when the sample size is "sufficiently large," often
considered as n ≥ 30. However, if the original population distribution is normal,
the CLT holds even for small sample sizes.
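The theorem can be illustrated with a short simulation; the exponential source population below is a hypothetical choice, picked only because it is clearly non-normal:

```python
import random
import statistics

random.seed(0)

# A skewed, non-normal source population (hypothetical: exponential, mean ~1)
population = [random.expovariate(1.0) for _ in range(100_000)]

# Draw many samples of size n = 30 and record each sample mean
sample_means = [statistics.mean(random.sample(population, 30))
                for _ in range(2_000)]

# The sample means cluster symmetrically around the population mean (~1.0)
# and vary far less than the raw data, as the CLT predicts
print(round(statistics.mean(sample_means), 2))
print(statistics.stdev(sample_means) < statistics.stdev(population))  # True
```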
Standard Error
 The Standard Error (SE) is a statistical measure that quantifies the variability of a
sample mean as an estimate of a population parameter.
 In simpler terms, it indicates how much the sample mean is expected to
fluctuate from the true population mean if you were to repeatedly draw
samples.

SE = σ / √n (or SE ≈ s / √n when σ is unknown)

where:
σ is the standard deviation of the population
s is the standard deviation of the sample
n is the sample size.
 A smaller standard error suggests that the sample mean is a more precise
estimate of the population mean.
 Conversely, a larger standard error indicates more variability in the sample
means and less precision.
 Standard error is often used in hypothesis testing, confidence intervals, and
regression analysis to understand the precision of an estimate.
 While the standard deviation measures the variability within a single sample,
the standard error measures the variability of the sample mean from one
sample to another.
Following steps are used for measuring standard error:

1. Collect a number of brand-new samples from the population.


2. For each new sample, calculate the statistic (e.g. mean).
3. Calculate the standard deviation of the means computed in step 2 and use this
as your estimate of the standard error.
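The steps above can be sketched directly; the normal population below is hypothetical, and the empirical estimate is compared against the formula σ/√n:

```python
import math
import random
import statistics

random.seed(0)
# Hypothetical population: 50,000 values with mean 50 and std dev 10
population = [random.gauss(50, 10) for _ in range(50_000)]

# Steps 1-2: collect many fresh samples of size n and compute each mean
n = 40
means = [statistics.mean(random.sample(population, n)) for _ in range(1_000)]

# Step 3: the standard deviation of those means estimates the standard error
se_empirical = statistics.stdev(means)
se_formula = statistics.pstdev(population) / math.sqrt(n)  # sigma / sqrt(n)

print(round(se_empirical, 2), round(se_formula, 2))  # the two agree closely
```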

Numerical for Practice

In a certain property investment company with an international presence, workers


have a mean hourly wage of $12 with a population standard deviation of $3.
Given a sample size of 30, estimate and interpret the SE of the sample mean
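A quick check of this exercise, using SE = σ/√n with σ = 3 and n = 30:

```python
import math

sigma, n = 3, 30          # population std dev and sample size from the problem
se = sigma / math.sqrt(n)
print(round(se, 3))       # 0.548
```

So the means of samples of 30 workers would typically fluctuate about $0.55 around the true mean hourly wage of $12.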

 In practice, this approach of collecting new samples to estimate the standard


error is typically not feasible (and time consuming).

 In modern statistics, the bootstrap has become the standard way to estimate
standard error.
The Bootstrap

 Bootstrap is a powerful statistical method used for estimating the distribution of


a statistic (like mean, median etc.) by resampling with replacement from the
original data.
 In this method, additional samples are drawn with replacement from the sample
itself and the statistic for each resample is recalculated.

 The bootstrap process involves repeatedly drawing samples from the original
dataset, where each sample is of the same size as the original dataset but is
drawn with replacement (or remixing).
 This means that some data points may appear multiple times in a resampled
dataset, while others may not appear at all.
 Each resampled dataset is called a bootstrap sample. From each bootstrap
sample, the statistic of interest (e.g., mean, median) is calculated.
 By repeating the resampling process many times (typically thousands), you
create a distribution of the statistic of interest.
 This distribution is called the bootstrap distribution and can be used to estimate
standard errors, confidence intervals, and more.
Process:

1. Original Dataset: Suppose you have a dataset with n observations.


2. Resampling: Draw a bootstrap sample by randomly selecting n observations
from the original dataset with replacement.
3. Statistic Calculation: Calculate the statistic of interest (e.g., mean) from the
bootstrap sample.
4. Repeat: Repeat steps 2 and 3 many times (e.g., 1000 times) to create the
bootstrap distribution.
5. Analysis: Use the bootstrap distribution to estimate the standard error,
construct confidence intervals, or perform hypothesis testing.
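The five steps can be sketched in a few lines; the ten data values below are hypothetical:

```python
import random
import statistics

random.seed(0)
data = [150, 203, 176, 190, 168, 193, 189, 178, 197, 172]  # hypothetical sample

# Steps 2-4: resample with replacement, same size as the original, many times
boot_means = sorted(statistics.mean(random.choices(data, k=len(data)))
                    for _ in range(5_000))

# Step 5: the spread of the bootstrap distribution estimates the standard
# error, and its percentiles give a simple confidence interval
se = statistics.stdev(boot_means)
ci_95 = (boot_means[int(0.025 * 5_000)], boot_means[int(0.975 * 5_000)])
print(round(se, 1), ci_95)
```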
Resampling Versus Bootstrapping

Resampling

 Resampling is a broad statistical technique that involves repeatedly drawing


samples from a dataset and assessing the variation in a statistic of interest.
 It is a general term that includes various methods like permutation tests, cross-
validation, and bootstrapping.
 Resampling methods are used to test hypotheses, validate models, or estimate
the variability of a statistic without relying on traditional parametric
assumptions.

Bootstrapping

 Bootstrapping is a specific type of resampling method. It involves repeatedly


drawing samples from the original data with replacement to estimate the
sampling distribution of a statistic.
 Bootstrapping is particularly useful when the underlying distribution of the data
is unknown or when the sample size is small, making traditional assumptions
unreliable.
Key Differences:

Scope:

 Resampling is an umbrella term that includes multiple methods.


 Bootstrapping is one specific method within the broader category of
resampling.

Resampling Method:

 Resampling can involve either sampling with or without replacement,


depending on the specific method being used (e.g., cross-validation
involves sampling without replacement).
 Bootstrapping always involves sampling with replacement.
Confidence Intervals
 A Confidence Interval (CI) is a range of values, derived from a sample, that is
likely to contain the true population parameter (such as the mean, proportion,
or difference between means) with a certain level of confidence.
 It's a fundamental concept in statistics used to express the uncertainty around
an estimate.
 A 95% confidence level means that if you were to take 100 different samples
and compute a confidence interval for each, about 95 of those intervals would
be expected to contain the true population parameter.
 If you calculate a 95% confidence interval for a mean as [10, 15], it means you
can be 95% confident that the true population mean lies between 10 and 15.

Factors Affecting Confidence Intervals:

Sample Size (n):


Larger sample sizes lead to narrower confidence intervals, as the estimate is more
precise.
Variability (σ or s):
Higher variability in the data leads to wider confidence intervals, indicating less
precision.
Confidence Level:
Higher confidence levels lead to wider intervals, as they provide more "certainty"
about containing the true parameter.
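A minimal sketch of a normal-approximation confidence interval, with hypothetical hourly-wage data; z = 1.96 corresponds to the 95% confidence level (a t critical value would give a slightly wider interval for small n):

```python
import math
import statistics

# Hypothetical sample of hourly wages
data = [12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9, 12.4, 12.0]

mean = statistics.mean(data)
se = statistics.stdev(data) / math.sqrt(len(data))

# 95% CI: point estimate +/- critical value * standard error
lower, upper = mean - 1.96 * se, mean + 1.96 * se
print(round(lower, 2), round(upper, 2))
```

Note how the interval narrows as n grows (smaller SE) and widens if a higher confidence level (larger critical value) is chosen, matching the factors listed above.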
Numerical for Practice
1. Find the standard error of the estimate of the mean weight of high school
football players using the data given of weights of high school football players from
your school. Then find a 95% confidence interval for the data
Player Number   Weight in Pounds
1               150
2               203
3               176
4               190
5               168
6               193
7               189
8               178
9               197
10              172
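One way to check your work on problem 1 (a sketch; it uses the normal critical value 1.96, whereas the t critical value 2.262 for df = 9 would widen the interval slightly):

```python
import math
import statistics

weights = [150, 203, 176, 190, 168, 193, 189, 178, 197, 172]

mean = statistics.mean(weights)                          # 181.6
se = statistics.stdev(weights) / math.sqrt(len(weights))

# Approximate 95% confidence interval using z = 1.96
ci = (mean - 1.96 * se, mean + 1.96 * se)
print(round(se, 2))                    # 5.02
print(tuple(round(x, 1) for x in ci))  # (171.8, 191.4)
```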
2. Find the standard error of the estimate for the average number of children in a
household in your city by using the data collected from a sample of households in
your city. Then find a 95% confidence interval for the data
Thank You…
