Global Academy of Technology, Bengaluru
Artificial Intelligence and Data Science
                       Course: Statistical Machine Learning-1
                            Module 2: Inferential Statistics
In this chapter, we define and discuss several special families of distributions that are widely
used in applications of probability and statistics. The distributions presented here include
discrete and continuous distributions of univariate, bivariate, and multivariate types.
2.1 Overview of Probability Distributions (Bernoulli, Binomial,
Poisson, Chi-square, t-tail)
The Bernoulli and Binomial Distributions: The simplest type of experiment has only two possible
outcomes, call them 0 and 1. If X equals the outcome from such an experiment, then X has the
simplest type of nondegenerate distribution, which is a member of the family of Bernoulli
distributions. If n independent random variables X1,...,Xn all have the same Bernoulli distribution,
then their sum is equal to the number of the Xi's that equal 1, and the distribution of the sum is a
member of the binomial family.
Bernoulli Distribution: A random variable X has the Bernoulli distribution with parameter p (0 ≤ p
≤ 1) if X can take only the values 0 and 1 and the probabilities are
          $\Pr(X = 1) = p$ and $\Pr(X = 0) = 1 - p$.
Bernoulli Trials/Process. If the random variables in a finite or infinite sequence X1, X2,... are i.i.d., and if
each random variable Xi has the Bernoulli distribution with parameter p, then it is said that
X1,X2,...are Bernoulli trials with parameter p. An infinite sequence of Bernoulli trials is also called
a Bernoulli process.
Binomial Distribution: A random variable X has the binomial distribution with parameters n and p
if X has a discrete distribution for which the p.f. is as follows:
          $f(x \mid n, p) = \binom{n}{x} p^x (1 - p)^{n - x}$ for x = 0, 1, ..., n, and $f(x \mid n, p) = 0$ otherwise.
In this distribution, n must be a positive integer, and p must lie in the interval 0 ≤ p ≤ 1.
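The binomial p.f. can be evaluated directly with the standard library. The sketch below is illustrative only: the function name binomial_pmf and the example values n = 5, p = 0.5 are assumptions chosen for the demonstration, not part of the notes. It also checks the standard fact that the mean of the distribution is np.

```python
from math import comb


def binomial_pmf(x: int, n: int, p: float) -> float:
    """Pr(X = x) for X ~ Binomial(n, p); zero outside 0..n."""
    if x < 0 or x > n:
        return 0.0
    return comb(n, x) * p**x * (1 - p) ** (n - x)


# Hypothetical example: number of heads in 5 tosses of a fair coin.
n, p = 5, 0.5
probs = [binomial_pmf(x, n, p) for x in range(n + 1)]

total = sum(probs)                              # a p.f. must sum to 1
mean = sum(x * q for x, q in enumerate(probs))  # should equal n * p

print(probs)
print(total, mean)
```

The same function can be reused for any question of the form "exactly x successes in n independent trials".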
Poisson Distribution: Let λ > 0. A random variable X has the Poisson distribution with mean λ if the
p.f. of X is as follows:
          $f(x \mid \lambda) = \frac{e^{-\lambda} \lambda^x}{x!}$ for x = 0, 1, 2, ..., and $f(x \mid \lambda) = 0$ otherwise.
Theorem: Mean. The mean of the distribution with this p.f. is λ.
          Variance. The variance of the Poisson distribution with mean λ is also λ.
          Moment Generating Function. The m.g.f. of the Poisson distribution with mean λ is
$\psi(t) = e^{\lambda(e^t - 1)}$ for all real t.
          If the random variables X1,...,Xk are independent and if Xi has the Poisson distribution
with mean λi (i = 1,...,k), then the sum X1+...+Xk has the Poisson distribution with mean λ1+...+λk.
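The properties in the theorem above can be checked numerically. This is a minimal sketch using only the standard library; the names poisson_pmf and sum_pmf, the rates used, and the truncation of the infinite support at 60 terms are assumptions made for the illustration.

```python
from math import exp, factorial


def poisson_pmf(x: int, lam: float) -> float:
    """Pr(X = x) for X ~ Poisson(lam); zero for negative x."""
    if x < 0:
        return 0.0
    return exp(-lam) * lam**x / factorial(x)


lam = 2.0
support = range(60)  # truncated support; the omitted tail is negligible for lam = 2

mean = sum(x * poisson_pmf(x, lam) for x in support)
var = sum((x - mean) ** 2 * poisson_pmf(x, lam) for x in support)


def sum_pmf(s: int, lam1: float, lam2: float) -> float:
    """Pr(X1 + X2 = s) for independent Poisson X1, X2, by direct convolution."""
    return sum(poisson_pmf(j, lam1) * poisson_pmf(s - j, lam2) for j in range(s + 1))


print(mean, var)                                  # both close to lam
print(sum_pmf(3, 1.0, 1.5), poisson_pmf(3, 2.5))  # the two values agree
```

The last line compares the convolution of two Poisson p.f.s with rates 1.0 and 1.5 against a single Poisson p.f. with rate 2.5, which is what the sum property predicts.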
Poisson Process: A Poisson process with rate λ per unit time is a process that satisfies the following
two properties:
   i.      The number of arrivals in every fixed interval of time of length t has the Poisson
           distribution with mean λt.
   ii.     The numbers of arrivals in every collection of disjoint time intervals are independent.
Chi-Square Distributions
χ2 Distributions. For each positive number m, the gamma distribution with parameters α = m/2 and β
= 1/2 is called the χ2 distribution with m degrees of freedom.
Properties of the Distributions
Mean and Variance: If a random variable X has the χ2 distribution with m degrees of freedom, then
E(X)=m and Var(X)=2m.
If the random variables X1,...,Xk are independent and if Xi has the χ2 distribution with mi degrees of
freedom (i = 1,...,k), then the sum X1+...+Xk has the χ2 distribution with m1 + ...+ mk degrees of
freedom.
Let X have the standard normal distribution. Then the random variable $Y = X^2$ has the χ2 distribution
with one degree of freedom.
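Taken together, the last two properties imply that a sum of m squared independent standard normal variables has the χ2 distribution with m degrees of freedom, so its mean should be close to m and its variance close to 2m. The simulation below is a rough sketch of that check; the seed, the choice m = 4, the number of draws and the helper name chi_square_draw are all assumptions for the illustration.

```python
import random
from statistics import fmean, variance


def chi_square_draw(m: int, rng: random.Random) -> float:
    """One draw built as the sum of m squared standard normal variables."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(m))


rng = random.Random(0)  # fixed seed so the run is repeatable
m = 4
draws = [chi_square_draw(m, rng) for _ in range(20000)]

sample_mean = fmean(draws)    # theory: E(X) = m
sample_var = variance(draws)  # theory: Var(X) = 2m

print(sample_mean, sample_var)
```

The simulated values only approximate m and 2m; the agreement improves as the number of draws grows.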
t Distributions
t Distributions: Consider two independent random variables Y and Z, such that Y has the χ2
distribution with m degrees of freedom and Z has the standard normal distribution. Suppose that a
random variable X is defined by the equation
          $X = \frac{Z}{(Y/m)^{1/2}}$
Then the distribution of X is called the t distribution with m degrees of freedom.
Properties
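Probability Density Function. The p.d.f. of the t distribution with m degrees of freedom is
          $\frac{\Gamma\left(\frac{m+1}{2}\right)}{(m\pi)^{1/2}\,\Gamma\left(\frac{m}{2}\right)} \left(1 + \frac{x^2}{m}\right)^{-(m+1)/2}$ for −∞ < x < ∞.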
2.2 Joint distribution of the Sample Mean and Sample Variance
Orthogonal Matrices
Orthogonal Matrix. It is said that an n × n matrix A is orthogonal if $A^T A = I$, where $A^T$ is the transpose
of A.
Properties of Orthogonal Matrices
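   1. Determinant is 1 in absolute value. If A is orthogonal, then |det A| = 1.
   2. Squared Length Is Preserved. Consider two n-dimensional random vectors X and Y = AX, where A is
      an orthogonal matrix. Then $\sum_{i=1}^{n} Y_i^2 = \sum_{i=1}^{n} X_i^2$.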
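The definition and the squared-length property can be checked for a concrete matrix. The sketch below uses plain Python lists; the helper names, the rotation angle 0.3, the shear matrix and the vector (3, 4) are assumptions chosen for illustration. A rotation matrix is orthogonal, while a shear matrix is not.

```python
import math


def transpose(a):
    """Transpose of a matrix stored as a list of rows."""
    return [list(row) for row in zip(*a)]


def matmul(a, b):
    """Matrix product of two matrices stored as lists of rows."""
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def is_orthogonal(a, tol=1e-9):
    """True if A^T A equals the identity matrix up to the tolerance."""
    n = len(a)
    product = matmul(transpose(a), a)
    return all(
        abs(product[i][j] - (1.0 if i == j else 0.0)) <= tol
        for i in range(n)
        for j in range(n)
    )


theta = 0.3  # hypothetical rotation angle in radians
rotation = [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta), math.cos(theta)]]
shear = [[1.0, 1.0], [0.0, 1.0]]

x = [3.0, 4.0]
y = [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in rotation]  # y = A x
squared_length_x = sum(v * v for v in x)
squared_length_y = sum(v * v for v in y)

print(is_orthogonal(rotation), is_orthogonal(shear))
print(squared_length_x, squared_length_y)
```

The same is_orthogonal check can be applied to any square matrix written as a list of rows.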
4.3 Confidence Intervals
A confidence interval is used to describe the uncertainty associated with a sampling method. A
confidence interval gives a range of values within which the true value of the parameter is expected
to lie, with a stated level of confidence (for example, 95%).
What is the confidence interval estimate of the population mean?
The general format of a confidence interval estimate of a population mean is given by:
CI= Sample mean ± Multiplier × Standard error of Mean
For variable Xj, a confidence interval estimate of its population mean µj is given by
                                CI = $\bar{X}_j \pm z \times S_j / \sqrt{n}$
where
   ● $\bar{X}_j$ is the sample mean,
   ● $S_j$ is the sample standard deviation,
   ● n is the sample size,
   ● z is the z-value from the z-table that corresponds to the chosen confidence level.
Hence, the confidence interval estimate of the population mean is $\bar{X}_j \pm z \times S_j / \sqrt{n}$.
What assumptions does a confidence interval calculation use?
There are six assumptions to check when you calculate a confidence interval:
   ● Does the data use random sampling?
   ● Is each observation within the data set independent?
   ● Is the sample size large enough to make conclusions?
   ● Is the sample size less than or equal to 10% of the population size?
   ● Does the sample have sufficient successes and failures to use a normal distribution?
   ● Do multi-sample data sets have the same variance?
What are the limits of confidence intervals?
Confidence intervals are limited by two factors.
1. The first is whether a sample is truly random rather than artificially chosen.
2. The second is whether the data conforms to a normal distribution since the calculation assumes it
does.
Problems based on Confidence Interval
Q1. Given a sample of 10 test scores, 80, 95, 90, 90, 95, 75, 75, 85, 90 and 80, find:
1. Sample Mean
2. Standard Deviation
3. Standard Error
4. Margin of Error
5. Confidence interval for 95% confidence
6. Confidence Interval for 99% confidence
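One way to work through Q1 is sketched below. The multipliers 1.96 (95%) and 2.576 (99%) are the usual z-table values and are assumptions in the sense that the notes do not list them; with only 10 observations a t multiplier could also be argued for, but the z form is used here to match the formula in these notes.

```python
from math import sqrt
from statistics import mean, stdev

scores = [80, 95, 90, 90, 95, 75, 75, 85, 90, 80]

n = len(scores)
x_bar = mean(scores)         # 1. sample mean
s = stdev(scores)            # 2. sample standard deviation (n - 1 denominator)
std_error = s / sqrt(n)      # 3. standard error of the mean


def z_interval(z: float):
    """Return (margin of error, lower limit, upper limit) for multiplier z."""
    margin = z * std_error   # 4. margin of error
    return margin, x_bar - margin, x_bar + margin


margin_95, low_95, high_95 = z_interval(1.96)    # 5. 95% confidence
margin_99, low_99, high_99 = z_interval(2.576)   # 6. 99% confidence

print(x_bar, round(s, 4), round(std_error, 4))
print(round(low_95, 2), round(high_95, 2))
print(round(low_99, 2), round(high_99, 2))
```

The 99% interval is wider than the 95% interval because a larger multiplier is needed for a higher confidence level.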
Q2. A retail company wants to rate the quality of a product by using results from a customer
satisfaction survey. In the survey, the company asks respondents to rate the quality on a scale of one
to five, with one being the poorest quality and five being the highest quality.
The sample size is 25, the sample mean is 4.5 and the standard deviation is 2.5. Calculate the
confidence interval assuming a 97% confidence level. (For a 97% confidence level, Z ≈ 2.17.)
Q3. Green Life, a new natural wellness brand, wants to measure its most popular product according
to customer ratings. The business uses a 10-point scale and asks a sample of 64 customers to rate their
favorite products, with one representing the least popular product and 10 representing the most
popular product. If Green Life wants to find out the popularity of a particular item in the customer
survey, sales teams can find the confidence interval. With a 98% confidence level, a sample mean of
8.5 and a standard deviation of 4.75, calculate the confidence interval. (For a 98% confidence level,
Z ≈ 2.33.)
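Q2 and Q3 give summary statistics rather than raw data, so the interval follows directly from the formula CI = sample mean ± z × S/√n. The helper name z_confidence_interval below is an assumption made for this sketch; the numbers are the ones stated in the two questions.

```python
from math import sqrt


def z_confidence_interval(sample_mean: float, sample_sd: float, n: int, z: float):
    """Return (lower, upper) limits of sample_mean ± z * sample_sd / sqrt(n)."""
    margin = z * sample_sd / sqrt(n)
    return sample_mean - margin, sample_mean + margin


# Q2: n = 25, mean = 4.5, standard deviation = 2.5, Z ≈ 2.17 (97% level)
q2_low, q2_high = z_confidence_interval(4.5, 2.5, 25, 2.17)

# Q3: n = 64, mean = 8.5, standard deviation = 4.75, Z ≈ 2.33 (98% level)
q3_low, q3_high = z_confidence_interval(8.5, 4.75, 64, 2.33)

print(round(q2_low, 3), round(q2_high, 3))
print(round(q3_low, 3), round(q3_high, 3))
```

In Q2 the upper limit comes out above 5, the top of the rating scale. This is a side effect of the normal approximation with a large standard deviation, not an arithmetic error.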
2.3 Bayesian Analysis of samples from Normal Distribution
Bayes Theorem
Bayes theorem is a theorem in probability and statistics, named after the Reverend Thomas Bayes,
that helps in determining the probability of an event that is based on some event that has already
occurred.
Bayes theorem has many applications, such as Bayesian inference and, in the healthcare sector,
determining the chances of developing health problems with increasing age.
What is Bayes Theorem?
Bayes theorem, in simple words, determines the conditional probability of an event A given that event
B has already occurred.
Bayes theorem is also known as the Bayes Rule or Bayes Law. It is a method to determine the
probability of an event based on the occurrences of prior events. It is used to calculate conditional
probability.
 Bayes theorem calculates the probability based on the hypothesis.
Bayes theorem states that the conditional probability of an event A, given the occurrence of another
event B, is equal to the product of the likelihood of B given A and the probability of A, divided by
the probability of B. It is given as:
          $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$
Here, P(A) = how likely A is to happen (Prior knowledge): the probability that a hypothesis is true before
any evidence is present.
P(B) = how likely B is to happen (Marginalization): the probability of observing the evidence.
P(A|B) = how likely A is to happen given that B has happened (Posterior): the probability that a hypothesis
is true given the evidence.
P(B|A) = how likely B is to happen given that A has happened (Likelihood): the probability of seeing the
evidence if the hypothesis is true.
Terms Related to Bayes Theorem
As we have studied about Bayes theorem in detail, let us understand the meanings of a few terms
related to the concept which have been used in the Bayes theorem formula and derivation:
     ● Conditional Probability - Conditional Probability is the probability of an event A based on
         the occurrence of another event B. It is denoted by P(A|B) and represents the probability of A
         given that event B has already happened.
     ● Joint Probability - Joint probability measures the probability of two or more events occurring
         together and at the same time. For two events A and B, it is denoted by P(A∩B).
     ● Random Variables - Random variable is a real-valued variable whose possible values are
         determined by a random experiment. The probability of such variables is also called
         the experimental probability.
     ● Posterior Probability - Posterior probability is the probability of an event that is calculated
         after all the information related to the event has been accounted for. It is also known as
         conditional probability.
     ● Prior Probability - Prior probability is the probability of an event that is calculated before
         considering the new information obtained. It is the probability of an outcome that is
         determined based on current knowledge before the experiment is performed.
Important Notes on Bayes Theorem
     ● Bayes theorem is used to determine conditional probability.
     ● When two events A and B are independent, P(A|B) = P(A) and P(B|A) = P(B)
     ● Conditional probability can be calculated using the Bayes theorem for continuous random
         variables.
Problems based on Bayes Theorem
Example 1: Amy has two bags. Bag I has 7 red and 4 blue balls and bag II has 5 red and 9 blue balls.
Amy draws a ball at random and it turns out to be red. Determine the probability that the ball was
from the bag I using the Bayes theorem.
Solution: Let X and Y be the events that the ball is from bag I and bag II, respectively. Assume A
to be the event of drawing a red ball. We know that the probability of choosing a bag for drawing a
ball is 1/2, that is, P(X) = P(Y) = 1/2
Since there are 7 red balls out of a total of 11 balls in bag I, P(drawing a red ball from
bag I) = P(A|X) = 7/11
Similarly, P(drawing a red ball from bag II) = P(A|Y) = 5/14
We need to determine the value of P(the ball drawn is from bag I given that it is a red ball), that
is, P(X|A). Using Bayes theorem, we have the following:
P(X|A) = P(A|X)P(X) / [P(A|X)P(X) + P(A|Y)P(Y)]
= [(7/11)(1/2)] / [(7/11)(1/2) + (5/14)(1/2)]
= 98/153 ≈ 0.64
Answer: Hence, the probability that the ball was drawn from bag I is 0.64
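The calculation in Example 1 can be reproduced exactly with fractions. The function name posterior is an assumption made for this sketch; it simply normalises prior × likelihood over the candidate hypotheses (here, the two bags).

```python
from fractions import Fraction


def posterior(priors, likelihoods):
    """Posterior probability of each hypothesis: prior * likelihood, normalised."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(joint)  # P(A), the total probability of the evidence
    return [j / evidence for j in joint]


priors = [Fraction(1, 2), Fraction(1, 2)]         # P(X), P(Y): bag I, bag II
likelihoods = [Fraction(7, 11), Fraction(5, 14)]  # P(A|X), P(A|Y): red from each bag

post = posterior(priors, likelihoods)
print(post[0], float(post[0]))  # posterior probability of bag I
```

Because the two priors are equal, they cancel, and the posterior depends only on the ratio of the two likelihoods.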
Example 2: Assume that the chances of a person having a skin disease are 40%. Assume that using skin
creams and drinking enough water reduces the risk of skin disease by 30%, and that prescription of a
certain drug reduces it by 20%. At a time, a patient can choose any one of the two options
with equal probabilities. It is given that after picking one of the options, the patient selected at random
has the skin disease. Find the probability that the patient picked the option of skin creams and
drinking enough water using the Bayes theorem.
Solution: Assume E1: The patient uses skin creams and drinks enough water; E2: The patient uses
the drug; A: The selected patient has the skin disease
P(E1) = P(E2) = 1/2
Using the probabilities known to us, we have
P(A|E1) = 0.4 × (1-0.3) = 0.28
P(A|E2) = 0.4 × (1-0.2) = 0.32
Using Bayes Theorem, the probability that the selected patient uses skin creams and drinks enough
water is given by
P(E1|A) = P(A|E1)P(E1) / [P(A|E1)P(E1) + P(A|E2)P(E2)]
= (0.28 × 0.5)/(0.28 × 0.5 + 0.32 × 0.5)
= 0.14/(0.14 + 0.16)
≈ 0.47
Answer: The probability that the patient picked the first option is 0.47
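Example 3: A man is known to speak the truth 3 out of 4 times. He draws a card and reports that it is a king. Find
the probability that it is actually a king.
Solution: Assume a standard 52-card deck with 4 kings, and assume that whenever the man lies about a
card that is not a king, he reports a king. Let K be the event that the drawn card is a king and R the
event that the man reports a king. Then P(K) = 4/52 = 1/13 and P(not K) = 12/13, with
P(R|K) = 3/4 and P(R|not K) = 1/4. Using Bayes theorem,
P(K|R) = [(1/13)(3/4)] / [(1/13)(3/4) + (12/13)(1/4)]
= 3/15
= 0.2
Answer: Thus the probability that the drawn card is actually a king = 0.2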
2.4 Fisher Estimator
What is Fisher Information?
Fisher information tells us how much information about an unknown parameter we can get from
a sample. In other words, it tells us how well we can measure a parameter, given a certain amount of
data. More formally, it measures the expected amount of information given by a random variable (X)
for a parameter of interest. The concept is related to the law of entropy, as both are ways to measure
disorder in a system (Friedan, 1998).
Applications:
   ● Describing the asymptotic behavior of maximum likelihood estimates.
   ● Calculating the variance of an estimator.
   ● Finding priors in Bayesian inference.
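As a small worked illustration, consider a single Bernoulli observation with parameter p. Under the standard textbook definition (not stated in these notes), the Fisher information is the expected squared derivative of the log p.f. with respect to the parameter, which for the Bernoulli case works out to 1/(p(1 − p)). The sketch below checks this numerically; the function names and the values p = 0.3 and n = 50 are assumptions for the example.

```python
def bernoulli_score(x: int, p: float) -> float:
    """Derivative with respect to p of log f(x|p), where f(x|p) = p^x (1-p)^(1-x)."""
    return x / p - (1 - x) / (1 - p)


def bernoulli_fisher_information(p: float) -> float:
    """Expected squared score, averaging over the two outcomes x = 1 and x = 0."""
    return p * bernoulli_score(1, p) ** 2 + (1 - p) * bernoulli_score(0, p) ** 2


p = 0.3
info_one = bernoulli_fisher_information(p)  # information in a single observation
closed_form = 1 / (p * (1 - p))             # known closed form for the Bernoulli case

n = 50
info_n = n * info_one                       # information adds over n i.i.d. observations

print(info_one, closed_form, info_n)
```

The information is smallest at p = 0.5 and grows as p approaches 0 or 1, and it scales linearly with the sample size, which matches the idea that more data allow a parameter to be measured more precisely.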
2.4 Central Limit Theorem
The Central Limit Theorem states that the sampling distribution of the sample mean approaches
a normal distribution as the sample size gets larger, no matter what the shape of
the population distribution (provided its variance is finite). As a rule of thumb, the approximation
is usually considered adequate for sample sizes over 30.
All this is saying is that as you take more samples, especially large ones, your graph of the sample
means will look more like a normal distribution.
Here’s what the Central Limit Theorem is saying, graphically. The picture below shows one of the
simplest types of test: rolling a fair die. The more rolls that go into each average, the more closely
the distribution of the means resembles a normal distribution graph.
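The die-rolling illustration can be simulated. For a fair die the mean of one roll is 3.5 and the variance is 35/12, so the mean of n rolls should have standard deviation √(35/12)/√n. The sketch below is a rough check under assumptions made for the demo: a fixed seed, 10,000 repetitions, and sample sizes 1, 5 and 30.

```python
import random
from statistics import fmean, pstdev


def sample_mean_of_rolls(n: int, rng: random.Random) -> float:
    """Mean of n rolls of a fair six-sided die."""
    return fmean([rng.randint(1, 6) for _ in range(n)])


rng = random.Random(42)  # fixed seed so the run is repeatable
results = {}
for n in (1, 5, 30):
    means = [sample_mean_of_rolls(n, rng) for _ in range(10000)]
    # Store the centre and spread of the simulated sampling distribution.
    results[n] = (fmean(means), pstdev(means))

for n, (centre, spread) in results.items():
    theory = (35 / 12) ** 0.5 / n**0.5
    print(n, round(centre, 3), round(spread, 3), round(theory, 3))
```

The centre stays near 3.5 for every n, while the spread shrinks as n grows. Plotting a histogram of the means for n = 30 would show the familiar bell shape.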
2.5 What is Hypothesis Testing in Statistics?
Hypothesis testing uses sample data from the population to draw useful conclusions regarding the
population probability distribution. It tests an assumption made about the data using different types
of hypothesis testing methodologies. The hypothesis testing results in either rejecting or not rejecting
the null hypothesis.
Hypothesis testing can be defined as a statistical tool that is used to identify if the results of an
experiment are meaningful or not. It involves setting up a null hypothesis and an alternative
hypothesis. These two hypotheses will always be mutually exclusive. This means that if the null
hypothesis is true then the alternative hypothesis is false and vice versa. An example of hypothesis
testing is setting up a test to check if a new medicine works on a disease in a more efficient manner.
Null Hypothesis
The null hypothesis is a concise mathematical statement that is used to indicate that there is no
difference between two possibilities. In other words, there is no difference between certain
characteristics of data. This hypothesis assumes that the outcomes of an experiment are based on
chance alone. It is denoted as H0. Hypothesis testing is used to conclude if the null hypothesis can be
rejected or not. Suppose an experiment is conducted to check if girls are shorter than boys at the age
of 5. The null hypothesis will say that they are the same height.
Alternative Hypothesis
The alternative hypothesis is an alternative to the null hypothesis. It is used to show that the
observations of an experiment are due to some real effect. It indicates that there is a statistically
significant difference between two possible outcomes and can be denoted as H1 or Ha. For the
above-mentioned example, the alternative hypothesis would be that girls are shorter than boys at the
age of 5.
Hypothesis Testing P Value
In hypothesis testing, the p value is used to indicate whether the results obtained after conducting a
test are statistically significant or not. It is the probability, computed assuming the null hypothesis
is true, of obtaining a result at least as extreme as the one observed. This value is always a number
between 0 and 1. The p value is compared to an alpha level, α, also called the significance level. The
alpha level can be defined as the acceptable risk of incorrectly rejecting the null hypothesis. The
alpha level is usually chosen between 1% and 5%. If the p value is less than or equal to α, the null
hypothesis is rejected.
Hypothesis Testing Formula
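Depending upon the type of data available and the size, different types of hypothesis testing are used
to determine whether the null hypothesis can be rejected or not. The hypothesis testing formulas for
some important test statistics are given below:
   ● $z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$, where $\bar{x}$ is the sample mean, μ is the population mean, σ is the
     population standard deviation and n is the sample size.
   ● $t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$, where s is the sample standard deviation.
   ● $\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$, where $O_i$ is an observed value and $E_i$ is the corresponding expected value.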
Types of Hypothesis Testing
Selecting the correct test for performing hypothesis testing can be confusing. These tests are used to
determine a test statistic based on which the null hypothesis can either be rejected or not rejected.
Some of the important tests used for hypothesis testing are given below.
   1. Hypothesis Testing Z Test
A z test is a way of hypothesis testing that is used for a large sample size (n ≥ 30). It is used to
determine whether there is a difference between the population mean and the sample mean when the
population standard deviation is known. It can also be used to compare the means of two samples. The
z test statistic is computed as follows:
   ● One sample: $z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
   ● Two samples: $z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\sigma_1^2 / n_1 + \sigma_2^2 / n_2}}$
   2. Hypothesis Testing t Test
The t test is another method of hypothesis testing that is used for a small sample size (n < 30). It is
also used to compare the sample mean and population mean. However, the population standard
deviation is not known. Instead, the sample standard deviation is known. The mean of two samples
can also be compared using the t test.
   3. Hypothesis Testing Chi Square
The Chi square test is a hypothesis testing method that is used to check whether the variables in a
population are independent or not. It is used when the test statistic is chi-squared distributed.
One Tailed Hypothesis Testing
One tailed hypothesis testing is done when the rejection region is only in one direction. It can also be
known as directional hypothesis testing because the effects can be tested in one direction only. This
type of testing is further classified into the right tailed test and left tailed test.
     1. Right Tailed Hypothesis Testing
The right tail test is also known as the upper tail test. This test is used to check whether the population
parameter is greater than some value. The null and alternative hypotheses for this test are given as
follows:
H0: The population parameter is ≤ some value
H1: The population parameter is > some value.
If the test statistic has a greater value than the critical value then the null hypothesis is rejected.
    2. Left Tailed Hypothesis Testing
The left tail test is also known as the lower tail test. It is used to check whether the population
parameter is less than some value. The hypotheses for this hypothesis testing can be written as
follows:
H0: The population parameter is ≥ some value
H1: The population parameter is < some value.
The null hypothesis is rejected if the test statistic has a value less than the critical value.
Two Tailed Hypothesis Testing
In this hypothesis testing method, the critical region lies on both sides of the sampling distribution. It
is also known as a non-directional hypothesis testing method. The two-tailed test is used when it
needs to be determined whether the population parameter is different from some value. The
hypotheses can be set up as follows:
H0: the population parameter = some value
H1: the population parameter ≠ some value
The null hypothesis is rejected if the absolute value of the test statistic is greater than the critical
value, that is, if the test statistic falls in either tail of the distribution.
Hypothesis Testing Steps
Hypothesis testing can be easily performed in five simple steps. The most important step is to
correctly set up the hypotheses and identify the right method for hypothesis testing. The basic steps
to perform hypothesis testing are as follows:
Step 1: Set up the null hypothesis by correctly identifying whether it is the left-tailed, right-tailed, or
two-tailed hypothesis testing.
Step 2: Set up the alternative hypothesis.
Step 3: Choose the correct significance level, α, and find the critical value.
Step 4: Calculate the correct test statistic (z, t or χ²) and p-value.
Step 5: Compare the test statistic with the critical value or compare the p-value with α to arrive at a
conclusion. In other words, decide if the null hypothesis is to be rejected or not.
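The five steps can be traced in a short script for a right-tailed one-sample z test. All the numbers below (hypothesised mean 100, known population standard deviation 15, sample of 36 with mean 106, α = 0.05) are hypothetical values invented for the illustration.

```python
from math import sqrt
from statistics import NormalDist

# Steps 1 and 2: H0: mu <= 100 versus H1: mu > 100 (right-tailed test).
mu_0 = 100.0
sigma = 15.0   # population standard deviation, assumed known
n = 36
x_bar = 106.0  # hypothetical sample mean

# Step 3: significance level and critical value from the standard normal.
alpha = 0.05
std_normal = NormalDist()
critical_value = std_normal.inv_cdf(1 - alpha)

# Step 4: test statistic and p-value (upper-tail area beyond z).
z = (x_bar - mu_0) / (sigma / sqrt(n))
p_value = 1 - std_normal.cdf(z)

# Step 5: both comparisons lead to the same decision.
reject_h0 = z > critical_value
print(round(z, 3), round(critical_value, 3), round(p_value, 4), reject_h0)
```

For a left-tailed test the critical value would be `std_normal.inv_cdf(alpha)` and the p-value `std_normal.cdf(z)`; for a two-tailed test α is split between the two tails.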
What is F Test in Statistics?
An F test in statistics is a test whose test statistic follows an F distribution under the null
hypothesis. A two-tailed f test is used to check whether the variances of the two given samples (or
populations) are equal or not. However, if an f test checks whether one population variance is either
greater than or less than the other, it becomes a one-tailed hypothesis f test.
F Test Definition
F test can be defined as a test that uses the f test statistic to check whether the variances of two samples
(or populations) are equal. To conduct an f test, the populations should be normally distributed and
the samples must be independent. On conducting the hypothesis test, if the
results of the f test are statistically significant then the null hypothesis can be rejected; otherwise it
cannot be rejected.
Define the concept of F-test statistic, with respect to large and small size samples. Also,
list the F-test expression for left tailed, right tailed and two tailed tests.
F Statistic
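The f test statistic or simply the f statistic is a value that is compared with the critical value to
check if the null hypothesis should be rejected or not. The f test statistic formula is given below:
   ● Large samples: $F = \frac{\sigma_1^2}{\sigma_2^2}$, where $\sigma_1^2$ and $\sigma_2^2$ are the variances of the first and second populations.
   ● Small samples: $F = \frac{s_1^2}{s_2^2}$, where $s_1^2$ and $s_2^2$ are the variances of the first and second samples.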
F Test Formula
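The f test is used to check the equality of variances using hypothesis testing. The f test formula for
different hypothesis tests is given as follows:
   ● Left-tailed test: H0: $\sigma_1^2 = \sigma_2^2$, H1: $\sigma_1^2 < \sigma_2^2$. Reject H0 if F < F critical value.
   ● Right-tailed test: H0: $\sigma_1^2 = \sigma_2^2$, H1: $\sigma_1^2 > \sigma_2^2$. Reject H0 if F > F critical value.
   ● Two-tailed test: H0: $\sigma_1^2 = \sigma_2^2$, H1: $\sigma_1^2 \neq \sigma_2^2$. With the larger sample variance in the
     numerator, reject H0 if F > F critical value at α/2.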
2.6 F-Distribution
The F-distribution, also known as the Fisher-Snedecor distribution, is extensively used to test for
equality of variances from two normal populations.
The F-distribution is the distribution of the ratio of two independent chi-square random variables
X1 and X2, with degrees of freedom ν1 and ν2 respectively, where each chi-square random variable
has been divided by its degrees of freedom:
          $F = \frac{X_1 / \nu_1}{X_2 / \nu_2}$
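The F-test is used to determine whether two independent estimates of population variance differ
significantly. In this case, the F-ratio is:
          $F = \frac{\sigma_1^2}{\sigma_2^2}$, where $\sigma_1^2$ and $\sigma_2^2$ are the two variance estimates
or
to find out whether two samples drawn from normal populations have the same variance.
In this case, the F-ratio is:
          $F = \frac{S_1^2}{S_2^2}$, where $S_1^2$ and $S_2^2$ are the two sample variances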
F Test Critical Value
A critical value is a point that a test statistic is compared to in order to decide whether to reject
or not to reject the null hypothesis. Graphically, the critical value divides a distribution into the
acceptance and rejection regions. If the test statistic falls in the rejection region then the null
hypothesis can be rejected otherwise it cannot be rejected. The steps to find the f test critical
value at a specific alpha level (or significance level), α, are as follows:
        •   Find the degrees of freedom of the first sample. This is done by subtracting 1 from
            the first sample size. Thus, x = n1−1.
        •   Determine the degrees of freedom of the second sample by subtracting 1 from the
            second sample size. This gives y = n2−1.
        •   If it is a right-tailed test then α is the significance level. For a left-tailed test 1 – α
            is the alpha level. However, if it is a two-tailed test then the significance level is
            given by α/ 2.
        •   The F table is used to find the critical value at the required alpha level.
        •   The intersection of the x column and the y row in the f table will give the f test
            critical value.
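The quantities needed before consulting the F table can be computed as below. The two samples are hypothetical data made up for the illustration, and the sketch stops at the F statistic and the degrees of freedom because the standard library has no F-distribution table; the critical value is then read from the F table as described in the steps above.

```python
from statistics import variance

# Hypothetical samples, for illustration only.
sample_1 = [21, 25, 30, 18, 27, 35]
sample_2 = [22, 24, 23, 25, 21, 26, 27]

var_1 = variance(sample_1)  # sample variance with n1 - 1 in the denominator
var_2 = variance(sample_2)

# Convention assumed here: the larger sample variance goes in the numerator.
if var_1 >= var_2:
    f_stat = var_1 / var_2
    df_num, df_den = len(sample_1) - 1, len(sample_2) - 1
else:
    f_stat = var_2 / var_1
    df_num, df_den = len(sample_2) - 1, len(sample_1) - 1

print(round(f_stat, 4), df_num, df_den)
```

If SciPy happens to be installed, `scipy.stats.f.ppf(1 - alpha, df_num, df_den)` is one way to obtain the right-tail critical value without a printed table; that dependency is optional and not assumed by the sketch.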
2.7 ANOVA F Test
The one-way ANOVA is an example of an f test. ANOVA stands for analysis of variance. It is
used to check the variability of group means and the associated variability in observations
within that group. The F test statistic is used to conduct the ANOVA test. The hypothesis is
given as follows:
H0: The means of all groups are equal.
H1: The means of all groups are not equal.
Test Statistic: F = explained variance / unexplained variance
Decision rule: If F > F critical value then reject the null hypothesis.
To determine the critical value of an ANOVA f test the degrees of freedom are given by df1 = K
- 1 and df2 = N - K, where N is the overall sample size and K is the number of groups.
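The one-way ANOVA F statistic can be computed by hand from the between-group and within-group sums of squares. The three groups below are hypothetical data chosen so the arithmetic is easy to follow; "explained variance" is taken to be the between-group mean square and "unexplained variance" the within-group mean square.

```python
from statistics import fmean

# Hypothetical observations for K = 3 groups.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]

k = len(groups)
n_total = sum(len(g) for g in groups)
grand_mean = fmean([x for g in groups for x in g])

# Between-group (explained) and within-group (unexplained) sums of squares.
ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
ss_within = sum((x - fmean(g)) ** 2 for g in groups for x in g)

df_between = k - 1        # numerator degrees of freedom
df_within = n_total - k   # denominator degrees of freedom

f_stat = (ss_between / df_between) / (ss_within / df_within)
print(round(ss_between, 6), round(ss_within, 6), df_between, df_within, round(f_stat, 6))
```

The resulting F value is then compared with the F table entry for (K − 1, N − K) degrees of freedom at the chosen α, following the decision rule above.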
2.8 Bayes Test Procedures
Suppose that we need to decide between two hypotheses H0 and H1. In the Bayesian setting,
we assume that we know prior probabilities of H0 and H1. That is, we know P(H0) =
p0 and P(H1) = p1, where p0+p1=1. We observe the random variable (or the random vector) Y.
We know the distribution of Y under the two hypotheses, i.e., we know $f_Y(y \mid H_0)$ and $f_Y(y \mid H_1)$.
One way to decide between H0 and H1 is to compare P(H0|Y=y) and P(H1|Y=y), and accept
the hypothesis with the higher posterior probability. This is the idea behind the maximum a
posteriori (MAP) test. Here, since we are choosing the hypothesis with the highest probability,
it is relatively easy to show that the error probability is minimized.
To be more specific, according to the MAP test, we choose H0 if and only if
          $f_Y(y \mid H_0)\,P(H_0) \ge f_Y(y \mid H_1)\,P(H_1)$
Note that as always, we use the PMF instead of the PDF if Y is a discrete random variable. We
can generalize the MAP test to the case where you have more than two hypotheses. In that case,
again we choose the hypothesis with the highest posterior probability.
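A MAP decision between two hypotheses reduces to comparing prior × likelihood for each one, since the common denominator of the two posteriors cancels. The sketch below assumes, purely for illustration, that Y is normal with mean 0 under H0 and mean 2 under H1, both with standard deviation 1; the function name map_decision is likewise an assumption.

```python
from statistics import NormalDist


def map_decision(y: float, p0: float, dist0: NormalDist, dist1: NormalDist) -> str:
    """Choose the hypothesis with the larger posterior probability given Y = y."""
    p1 = 1 - p0
    # Posteriors share the same denominator, so comparing prior * density is enough.
    return "H0" if dist0.pdf(y) * p0 >= dist1.pdf(y) * p1 else "H1"


h0 = NormalDist(mu=0.0, sigma=1.0)  # assumed distribution of Y under H0
h1 = NormalDist(mu=2.0, sigma=1.0)  # assumed distribution of Y under H1

print(map_decision(0.5, 0.5, h0, h1))  # equal priors, y nearer the H0 mean
print(map_decision(1.5, 0.5, h0, h1))  # equal priors, y nearer the H1 mean
print(map_decision(1.5, 0.9, h0, h1))  # a strong prior for H0 changes the decision
```

With equal priors the rule picks whichever mean is closer to y; raising P(H0) shifts the decision threshold towards the H1 mean, so more evidence is needed before H1 is chosen.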
Problems
   1. An Olympic pistol shooter has a 2/3 chance of hitting the target at each shot. Find the
      probability of:
      (i) Hitting 10 targets in his first 10 shots
      (ii) Hitting exactly 5 targets out of 15
      (iii) Hitting exactly 10 targets out of 15
   2. The Everlasting Lightbulb company produces light bulbs, which are packaged in boxes
      of 20 for shipment. Tests have shown that 4% of their light bulbs are defective.
      (a) What is the probability that a box, ready for shipment, contains exactly 3 defective
          light bulbs?
      (b) What is the probability that the box contains 3 or more defective light bulbs?
      (c) Expected value of defective bulbs
      (d) Standard deviation of defective bulbs
   3. Suppose that on a given weekend the number of accidents at a certain intersection has
      the Poisson distribution with mean 0.7. What is the probability that there will be at least
      three accidents at the intersection during the weekend?
   4. Determine whether each of the five following matrices is orthogonal
   5. Amy has two bags. Bag I has 7 red and 4 blue balls and bag II has 5 red and 9 blue
      balls. Amy draws a ball at random and it turns out to be red. Determine the probability
      that the ball was from the bag I using the Bayes theorem.
   6. A retail company wants to rate the quality of a product by using results from a customer
      satisfaction survey. In the survey, the company asks respondents to rate the quality on
      a scale of one to five, with one being the poorest quality and five being the highest
      quality. The sample size is 25, the sample mean is 4.5 and the standard deviation is 2.5.
      Calculate the confidence interval assuming a 97% confidence level(For a 97%
      confidence level, Z≈2.17)
   7. The probability of triplets in human births is approximately 0.001. What is the
      probability that there will be exactly one set of triplets among 700 births in a large
      hospital?