MSE204 Lecture Questions
Lecture 1
Boltzmann distribution
Maxwell-Boltzmann distribution (molecule in an ideal gas)
Exercise:
1. Starting from the Boltzmann distribution, derive the temperature dependence of susceptibility of
a paramagnet (Curie law).
3. Ideal gas considering gravity: find P (z), i.e., the probability of finding a molecule at height z.
4. Starting from the Boltzmann distribution, derive expression for specific heat of a paramagnet.
Lecture 2
2. Binomial distribution
Bernoulli trial: a trial with only two possible outcomes, “success” (probability p) and “failure” (probability q = 1 − p)
Binomial random variable X: number of successes among n trials
Two parameters of binomial distribution: n and p
PMF and CDF of binomial distribution
Mean and variance of Binomial distribution
3. Python: Use of SciPy and Matplotlib library to solve simple problems, plot PMF and CDF
stats.binom.pmf(x,n,p)
stats.binom.cdf(x,n,p)
Exercise:
1. Various concepts: Let the experiment be single throw of a die and the random variable be the
outcome of a throw.
Find the mean and variance. In this case, mean is just the simple arithmetic average of all
possible outcomes. However, it is NOT TRUE in general.
Convince yourself that the mean is a WEIGHTED AVERAGE of the outcomes. For example, assume an unfair die for which the outcomes 1 and 2 each have probability 1/4 and the remaining outcomes each have probability 1/8. What would be the mean? Try to understand the concept of weighted average in terms of relative frequency.
Another name for the mean is EXPECTATION VALUE. Does it mean that if we do the experiment once, the most likely outcome is equal to the expectation value?
Convince yourself that it is NOT necessary that the mean of a discrete probability distribu-
tion is one of the outcomes. If it happens, then it is just a coincidence.
What is the difference between simple ARITHMETIC AVERAGE and WEIGHTED AVER-
AGE? When are they same?
Give an example when the random variable is same as the sample space of the experiment.
Give an example when the random variable is not same as the sample space of the experiment.
Give an example when the sample space is same but the random variables are different.
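The weighted-average idea in Problem 1 can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper name mean_and_variance is an assumption made for this illustration and is not part of the course material.

```python
from fractions import Fraction

def mean_and_variance(pmf):
    """Return (mean, variance) of a discrete distribution {outcome: probability}."""
    if sum(pmf.values()) != 1:
        raise ValueError("probabilities must sum to 1")
    mean = sum(x * p for x, p in pmf.items())               # weighted average
    var = sum((x - mean) ** 2 * p for x, p in pmf.items())  # weighted squared deviation
    return mean, var

# Fair die: all six outcomes carry the same weight 1/6.
fair = {x: Fraction(1, 6) for x in range(1, 7)}

# Unfair die from the exercise: P(1) = P(2) = 1/4, the rest 1/8 each.
unfair = {x: Fraction(1, 4) if x <= 2 else Fraction(1, 8) for x in range(1, 7)}

fair_mean, fair_var = mean_and_variance(fair)
unfair_mean, unfair_var = mean_and_variance(unfair)
print("fair die:   mean =", fair_mean, " variance =", fair_var)
print("unfair die: mean =", unfair_mean, " variance =", unfair_var)
```

Exact fractions are used so that the weights can be compared with hand calculations without rounding noise.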
3. Let the random variable be the outcome of sum of two dice thrown simultaneously.
4. You have ordered 2000 boxes of certain product, each box containing 5 items. Let the random
variable X be the number of defective items in each box. Given that P (X = 0) = 0.8, P (X =
1) = 0.08, P (X = 2) = 0.05, P (X = 3) = 0.04, P (X = 4) = 0.02, P (X = 5) = 0.01.
Find the mean and standard deviation of defective samples present in a box.
Is the mean same as one of the possible outcomes (X)? Explain the meaning of mean.
What would be the total number of defective items?
6. Let the random variable be the outcome of sum of two dice thrown simultaneously.
7. Let the random variable X be the number of heads when three coins are tossed. Make a table of X and the probability mass function f(x_i) = p_i. Plot the PMF and CDF.
8. Sampling without replacement: Suppose 20 balls are kept in a box and 5 of them are red in color,
rest of them are blue. Two balls are picked randomly without replacement from the box. Let the
random variable X be the number of red balls picked in a trial. Find the CDF of X.
9. Sampling with replacement: 100 samples are kept in a box and 10 of them are defective. We randomly pick a sample from the box. If the sample is not defective, we put it back in the box and randomly pick another sample. What is the probability that we find the first defective sample on the x-th trial? The probability distribution is known as the geometric distribution.
Calculate mean and variance of geometric distribution.
Find the CDF for the geometric distribution.
(py) The PMF of the geometric distribution is f(x) = (1 − p)^(x−1) p, where p is a parameter. Plot the PMF and CDF for different values of p.
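As a numerical cross-check of the geometric distribution, the sketch below evaluates the PMF, a closed-form CDF and the first two moments with the standard library only. The function names and the truncation of the infinite sum at 2000 terms are assumptions of this example; the results can be compared with the known closed forms 1/p and (1 − p)/p² for the mean and variance.

```python
def geom_pmf(x, p):
    """P(first success occurs on trial x), for x = 1, 2, 3, ..."""
    return (1 - p) ** (x - 1) * p

def geom_cdf(x, p):
    """P(X <= x) = 1 - (1 - p)**x, i.e. one minus the chance of x failures in a row."""
    return 1 - (1 - p) ** x

p = 0.1                  # 10 defective samples out of 100
xs = range(1, 2001)      # truncated support; the neglected tail is negligible for p = 0.1

mean = sum(x * geom_pmf(x, p) for x in xs)
var = sum((x - mean) ** 2 * geom_pmf(x, p) for x in xs)
cdf_by_sum = sum(geom_pmf(x, p) for x in range(1, 6))   # P(X <= 5) by direct summation

print("mean =", mean, " variance =", var)
print("P(X <= 5): summed =", cdf_by_sum, " closed form =", geom_cdf(5, p))
```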
Prove that E(aX + b) = aE(X) + b and V(aX + b) = a² V(X), where X is a random variable.
Choosing suitable values of a and b, can you scale the random variable X and define a new random variable having mean 0 and variance 1? Scaling is frequently used in data science.
11. Based on the examples discussed so far, try to grasp the following theorem: Given a distribution that is symmetric about x = c, i.e., f(c − x_i) = f(c + x_i), the mean of the distribution is µ = c.
Verify whether the theorem works for experiments like single throw of a die, throwing two dice
simultaneously etc. Upshot: you can avoid the algebra, provided the probability distribution
is symmetric.
12. Derivation of PMF and CDF: A coin is tossed n = 3 times and the random variable X is the total number of heads, so that X = 0, 1, 2, 3. The probability of getting a head is p = 1/2. Derive the PMF and CDF of the binomial distribution.
13. There are 3 MCQs. Each question has 4 options and only one is correct. The person answering
the questions has no clue about the correct option and taking a guess. Possible outcomes are
RRR, RRW etc., where R stands for right and W stands for wrong.
Find the probability of three, two, one and zero right answers.
Verify that P (X = 3), P (X = 2), P (X = 1) and P (X = 0) follow binomial probability
distribution.
Compare with the previous problem and comment.
For a given n, what value of p would give you the maximum variance?
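For the MCQ problem above, the binomial formula can be checked against a brute-force enumeration of all R/W sequences. This is a minimal sketch with exact fractions; binom_pmf is a helper written for this example (SciPy users could compare against stats.binom.pmf instead).

```python
from fractions import Fraction
from itertools import product
from math import comb

def binom_pmf(x, n, p):
    """Binomial PMF: C(n, x) p^x (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n = 3                   # three questions
p = Fraction(1, 4)      # one correct option out of four

# Brute force: add up the probability of every sequence such as RRW, grouped by number of R's.
brute = [Fraction(0)] * (n + 1)
for outcome in product("RW", repeat=n):
    k = outcome.count("R")
    brute[k] += p**k * (1 - p) ** (n - k)

formula = [binom_pmf(k, n, p) for k in range(n + 1)]

for k in range(n + 1):
    print(f"P(X = {k}): enumeration = {brute[k]}, formula = {formula[k]}")
```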
15. (py) Symmetry of binomial distribution: Write a code to make bar plots of binomial probability
mass functions:
n = 20, p = 0.1
n = 20, p = 0.3
n = 20, p = 0.5
n = 20, p = 0.7
n = 20, p = 0.9
16. Using the theorem on symmetric probability distribution (discussed previously), prove that bino-
mial distribution is not symmetric in general. Find out, under what condition binomial distribu-
tion becomes symmetric.
Write a code to plot the PMF and CDF of a binomial distribution for fixed p = 0.5 and
three different values of n = 10, 100, 1000.
Do you see that the PMF becomes very sharply peaked as n increases? What happens to
the corresponding CDF?
18. The ratio σ/µ is a good measure of relative width of a probability distribution.
Prove that the relative width of a binomial distribution is $\frac{1}{\sqrt{n}}\sqrt{\frac{q}{p}}$.
What happens when n becomes very large? Does it explain what you have observed in the
previous problem?
19. Let X be a binomial random variable. Define a new random variable Y , which is equal to the
difference between “successful” and “unsuccessful” trials.
This problem can be connected to 1D random walk and paramagnet, where the difference is equal
to the net displacement and net magnetic moment.
Considering no external field, the probability of ↑ spin is p = 0.5. Plot the probability distribution function f(m) assuming n = 20. From the plot, find the ratio f(4)/f(0), which is significant. That means there is a considerable probability of getting a net magnetic moment in the absence of an external magnetic field in a paramagnet. But this is not observed experimentally.
Plot the probability distribution function f (m) assuming n = 200. From the plot, find the
ratio of f (40)/f (0), which can be directly compared with the value obtained in the previous
problem for n = 20. What do you observe?
Imagine what would happen in the case of a solid where n = 10²³.
Do you see why properties of nano-materials generally have larger error bars compared to bulk materials?
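The two ratios asked for above can also be computed directly rather than read off a plot. The sketch below assumes Y = 2X − n (number of ↑ minus number of ↓) with X binomial; the function name f_m is chosen here for illustration.

```python
from math import comb

def f_m(m, n, p=0.5):
    """PMF of the net moment m = (#up) - (#down) = 2X - n, with X ~ Binomial(n, p)."""
    if abs(m) > n or (m + n) % 2 != 0:
        return 0.0                      # m must have the same parity as n
    x = (m + n) // 2                    # number of up spins that gives net moment m
    return comb(n, x) * p**x * (1 - p) ** (n - x)

ratio_20 = f_m(4, 20) / f_m(0, 20)       # same relative moment m/n = 0.2 in both cases
ratio_200 = f_m(40, 200) / f_m(0, 200)

print("n = 20 :  f(4)/f(0)  =", ratio_20)
print("n = 200:  f(40)/f(0) =", ratio_200)
```

Comparing the two printed ratios shows how quickly a fixed relative moment becomes improbable as n grows.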
21. (py) 30 samples are prepared every day in a workshop. The probability that a sample is defective
is 0.1. Clearly, number of defective samples X = 0, 1, 2, ....., 30 follows a binomial distribution.
Using a python code, find the following.
Lecture 3-4
2. Normal distribution
3. Python:
Exercise:
1. Let a continuous random variable X denote the temperature. A vaccine is supposed to be stored at 0 °C. If the temperature goes beyond 3 °C, it cannot be used. Historical data show that the temperature fluctuation can be modeled by a PDF f(x) = e^{−x} for x ≥ 0 and f(x) = 0 for x < 0. Estimate the fraction of vaccine doses wasted.
The vaccine cannot be used if the temperature goes beyond 1 °C. Estimate the fraction of vaccine doses wasted.
The vaccine cannot be used if the temperature goes beyond 2 °C. Estimate the fraction of vaccine doses wasted.
The vaccine cannot be used if the temperature goes beyond 3 °C. Estimate the fraction of vaccine doses wasted.
Convince yourself that if the tolerance limit is too close to the mean, the wastage is very high. It is natural to estimate the distance between the tolerance limit and the mean in terms of the standard deviation. For example, the tolerance limit of 3 °C is 2 standard deviations away from the mean.
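The wasted fraction is the tail area P(X > limit) of the PDF f(x) = e^{−x}. The sketch below evaluates it in closed form and cross-checks it with a simple midpoint-rule integration; both function names are assumptions of this example, and the mean and standard deviation of this PDF are both equal to 1.

```python
from math import exp

def wasted_fraction(limit):
    """Closed form: P(X > limit) = integral of exp(-x) from limit to infinity = exp(-limit)."""
    return exp(-limit)

def wasted_fraction_numeric(limit, n=10000):
    """Same quantity as 1 - integral_0^limit exp(-x) dx, using the midpoint rule."""
    h = limit / n
    area = sum(exp(-(i + 0.5) * h) for i in range(n)) * h
    return 1.0 - area

mean, std = 1.0, 1.0     # mean and standard deviation of f(x) = exp(-x)
for limit in (1, 2, 3):
    n_sigma = (limit - mean) / std
    print(f"limit = {limit} °C ({n_sigma:.0f} std from mean): "
          f"wasted = {wasted_fraction(limit):.4f} "
          f"(numeric {wasted_fraction_numeric(limit):.4f})")
```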
5. Continuous uniform distribution: A continuous random variable X has a PDF f (x) = 1/(b − a)
for a ≤ x ≤ b.
Prove that
$\mu = \dfrac{a + b}{2}, \qquad \sigma^2 = \dfrac{(b - a)^2}{12}.$
Verify that uniform distribution is symmetric about (a + b)/2.
Find the cumulative distribution function F (x) and plot it.
Starting from F (x), try to get f (x).
7. (py) Generating random data: in the last problem, you plotted probability distribution func-
tions of various continuous distributions. Often it is useful to generate a data set randomly,
which mimics certain probability distribution. Generate random data for normal, log-normal and
exponential distribution.
8. Mean, median and mode of symmetric, left-skewed and right-skewed distributions: it is known that the mean of the beta distribution is $\mu = \frac{\alpha}{\alpha + \beta}$ and the mode of the beta distribution is $\frac{\alpha - 1}{\alpha + \beta - 2}$ (for α, β > 1). You may use this information to verify the answers of the following problems.
To solve the following problems, you have to generate a very large data set randomly for beta
distribution and plot, as shown in the previous problem.
(py) Verify that the beta distribution is symmetric about x = 0.5 if α = β. For a symmetric beta distribution, the mean, mode and median are equal to each other [2]. You have to verify this from the plots and the formula given above. Mode and median can be obtained from the probability distribution and cumulative distribution, respectively.
(py) Verify that the beta distribution is right-skewed if α < β. For a right-skewed beta distribution, mode < median < mean [3]. You have to verify this from the plots and the formula given above. Mode and median can be obtained from the probability distribution and cumulative distribution, respectively.
(py) Verify that the beta distribution is left-skewed if α > β. For a left-skewed beta distribution, mode > median > mean [4]. You have to verify this from the plots and the formula given above. Mode and median can be obtained from the probability distribution and cumulative distribution, respectively.
9. PDF of normal distribution: $f(x) = \dfrac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/2\sigma^2}$
Verify normalization.
Verify that the mean is µ.
Verify that the variance is σ².
[2] This is true for any symmetric unimodal distribution, for example the normal distribution.
[3] This ordering typically holds for right-skewed unimodal distributions.
[4] This ordering typically holds for left-skewed unimodal distributions.
10. What is the CDF of normal distribution?
11. What are the values of P (µ − σ < X < µ + σ), P (µ − 2σ < X < µ + 2σ) and P (µ − 3σ < X <
µ + 3σ)?
12. What is a standard normal variable and what are the corresponding PDF and CDF?
13. Understand the standard normal cumulative probability table and find the values of:
P (Z ≤ −1.54)
P (Z > −1.54)
P (Z ≤ 1.02)
P (Z > 1.02)
P (Z ≤ 0.28)
P (Z > 0.28)
P (0.28 < Z < 1.02)
Find z such that P (Z ≤ z) = 0.770350
Find z such that P (Z > z) = 0.0968
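The table look-ups in this problem can be cross-checked with statistics.NormalDist from the Python standard library (Python 3.8 or later), whose cdf and inv_cdf methods play the role of Φ(z) and its inverse. This is a sketch for checking answers, not a replacement for learning to read the table; the variable names are illustrative.

```python
from statistics import NormalDist

Z = NormalDist()                 # standard normal: mean 0, standard deviation 1

p_le = Z.cdf(-1.54)              # P(Z <= -1.54)
p_gt = 1 - Z.cdf(-1.54)          # P(Z > -1.54), by the complement rule
p_between = Z.cdf(1.02) - Z.cdf(0.28)    # P(0.28 < Z < 1.02)

z_a = Z.inv_cdf(0.770350)        # z such that P(Z <= z) = 0.770350
z_b = Z.inv_cdf(1 - 0.0968)      # z such that P(Z > z) = 0.0968

print(f"P(Z <= -1.54) = {p_le:.5f},  P(Z > -1.54) = {p_gt:.5f}")
print(f"P(0.28 < Z < 1.02) = {p_between:.5f}")
print(f"z_a = {z_a:.3f},  z_b = {z_b:.3f}")
```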
15. One more definition: D(z) = Φ(z) − Φ(−z). Answer the following.
17. Standardization: how do we apply the above method for any µ and σ?
18. A coating is applied on a glass plate. Coating thickness follows a normal distribution with a mean of 10 mm and variance of 4 mm².
20. Say some detector is detecting some signal. Background noise follows a normal distribution with
mean 0 and standard deviation 2. The detector records a signal if the value is ≥ 4.
A false signal is recorded when the noise level is 4 or higher. What is the probability of
detecting a false signal?
Find the symmetric bound about the mean that include 95% of all noise readings.
Find the symmetric bound about the mean that include 97% of all noise readings.
Find the symmetric bound about the mean that include 99% of all noise readings.
21. A coating is applied on a glass plate. Coating thickness follows a normal distribution with a mean of 10 mm and variance of 4 mm². The accepted dimension is 9 ± 2 mm. Find the fraction of coated glass plates wasted. What would you do to minimize the wastage?
22. Normal approximation to binomial distribution: An exam has 50 MCQs. Each question has 4 choices and only one of them is correct. Assume that a candidate is guessing all the questions.
Let X be the random variable representing the number of correct answers. Using binomial
distribution,
Find P (X = 2), which is equal to the probability that exactly 2 answers are correct.
Find P (X ≤ 2), which is equal to the probability that at most 2 answers are correct.
Find P (1 ≤ X ≤ 3), which is equal to the probability that 1-3 answers are correct.
A standard normal variable can be defined using the mean and variance of the binomial distribution:
$Z = \dfrac{X - np}{\sqrt{np(1 - p)}}.$
Since we are converting from a discrete to a continuous random variable, we have to apply a continuity correction,
$P(X = x) = P(x - 0.5 \le X \le x + 0.5) = P\!\left(\dfrac{x - 0.5 - np}{\sqrt{np(1 - p)}} \le \underbrace{\dfrac{X - np}{\sqrt{np(1 - p)}}}_{Z} \le \dfrac{x + 0.5 - np}{\sqrt{np(1 - p)}}\right).$
Similarly,
$P(a \le X \le b) = P\!\left(\dfrac{a - 0.5 - np}{\sqrt{np(1 - p)}} \le Z \le \dfrac{b + 0.5 - np}{\sqrt{np(1 - p)}}\right).$
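The continuity-corrected formula can be compared with the exact binomial sum for the exam problem (n = 50, p = 0.25). The sketch below is one possible check using the standard library; the function names and the extra central range 10 ≤ X ≤ 15 are assumptions added here for illustration.

```python
from math import comb, sqrt
from statistics import NormalDist

def binom_prob(a, b, n, p):
    """Exact P(a <= X <= b) for X ~ Binomial(n, p)."""
    return sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(a, b + 1))

def normal_approx(a, b, n, p):
    """Normal approximation to P(a <= X <= b) with continuity correction."""
    mu = n * p
    sigma = sqrt(n * p * (1 - p))
    Z = NormalDist()
    return Z.cdf((b + 0.5 - mu) / sigma) - Z.cdf((a - 0.5 - mu) / sigma)

n, p = 50, 0.25

exact_center = binom_prob(10, 15, n, p)      # range around the mean np = 12.5
approx_center = normal_approx(10, 15, n, p)
exact_tail = binom_prob(1, 3, n, p)          # the range asked for in the problem
approx_tail = normal_approx(1, 3, n, p)

print(f"P(10 <= X <= 15): exact = {exact_center:.5f}, normal = {approx_center:.5f}")
print(f"P(1 <= X <= 3):   exact = {exact_tail:.6f}, normal = {approx_tail:.6f}")
```

Printing both ranges makes it easy to see where the approximation is reliable and where, far out in the tail, its relative error grows.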
23. This problem is designed to show the limits of normal approximation for a given binomial dis-
tribution. This will help you to understand when normal approximation of binomial distribution
may not work very well.
(py) Create bar-plots of binomial distribution with different parameters like (a) n = 10, p =
0.5, (b) n = 10, p = 0.1, (c) n = 10, p = 0.9. Plot probability density function of normal
approximation for each of the cases.
(py) Create bar-plots of binomial distribution with different parameters like (a) n = 50, p =
0.5, (b) n = 50, p = 0.1, (c) n = 50, p = 0.9. Plot probability density function of normal
approximation for each of the cases.
Comment: Normal approximation of binomial distribution is good for np > 5 and n(1 − p) > 5.
24. Assume that 100000 pages are printed per day in a press. The probability that a page is rejected due to some error can be modeled by a binomial distribution and such a probability is p = 10⁻³.
Lecture 5
2. Python:
Exercise:
2. (py) Using the data file given, get the data summary.
3. (py) Using the data file given, plot a histogram. What are the advantages of histogram plot?
4. (py) Using the data file given, plot a box and whisker diagram. What are the advantages of box
and whisker plot?
7. (py) Using the data file given, draw a normal probability plot. What is the purpose of using
normal probability plot?
8. (py) Compare normal probability plots of normal distribution, heavy-tailed (compared to normal)
distribution and skewed distribution.
Lecture 6-7
3. Python:
CI using Z-distribution:
stats.norm.interval(alpha, loc=X_mean, scale=X_std/np.sqrt(X_count))
CI using T-distribution:
stats.t.interval(alpha, X_count - 1, loc=X_mean, scale=X_std/np.sqrt(X_count))
Meaning of different terms: alpha = confidence level (0.9, 0.95, 0.99 etc.), passed as the first positional argument (recent SciPy releases name this parameter confidence); X_count = sample size; X_mean = sample mean; X_std = population standard deviation σ for the Z-distribution and X_std = sample standard deviation for the T-distribution
Exercise:
1. What are the point estimates of mean and variance of normal distribution?
2. (py) Verify that point estimates of parameters of a normal distribution are random variables and
they are not same as population mean and population variance.
3. (py) Central limit theorem: Let us do random sampling from a normal distribution with mean µ and variance σ². Using a code, verify that the sample average X̄ follows a normal distribution with mean µ and variance σ²/n. We have to define $Z = \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}}$ and confirm that Z has mean 0 and variance 1.
4. Central limit theorem holds good, even when we do random sampling from a non-normal dis-
tribution! Let us test this for a random variable X, having a continuous uniform probability
distribution, f (x) = 1/10 for 0 ≤ x ≤ 10 and f (x) = 0 otherwise.
If we take a random sample of size n = 30, what would be the probability distribution of the sample mean X̄, according to the central limit theorem?
Draw (by hand) the probability distribution of X and X̄.
(py) Using python, draw the histogram and normal probability plot of X̄.
(py) Using python, standardize X̄ and draw the histogram and normal probability plot.
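Before plotting, the claim of the central limit theorem for the uniform distribution on [0, 10] can be tested numerically. The sketch below draws many samples of size n = 30 with the standard random module and checks that the standardized sample mean has mean close to 0 and variance close to 1; the seed and the number of repetitions (5000) are arbitrary choices for this illustration.

```python
import random
from math import sqrt

random.seed(0)                    # fixed seed so that the run is reproducible

a, b = 0.0, 10.0                  # uniform distribution on [a, b]
n = 30                            # size of each random sample
n_samples = 5000                  # number of repeated samples (arbitrary choice)

mu = (a + b) / 2                  # population mean
sigma = (b - a) / sqrt(12)        # population standard deviation
se = sigma / sqrt(n)              # standard deviation of the sample mean

sample_means = [sum(random.uniform(a, b) for _ in range(n)) / n
                for _ in range(n_samples)]

# Standardize: Z = (X_bar - mu) / (sigma / sqrt(n))
z = [(m - mu) / se for m in sample_means]
z_mean = sum(z) / n_samples
z_var = sum((zi - z_mean) ** 2 for zi in z) / (n_samples - 1)

print(f"mean of Z = {z_mean:.3f}, variance of Z = {z_var:.3f}")
```

The list z can be passed directly to a histogram or normal probability plot routine.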
6. Validity of point estimation: Machine parts are manufactured having a mean length of 100 mm
and standard deviation of 10 mm.
What is the probability that a random sample of size 25 has a sample average greater than 105 mm?
What is the probability that a random sample of size 10 has a sample average greater than 105 mm?
What do you conclude?
7. Strength of an alloy is a normal random variable with σ = 2 MPa. Ten measurements are as follows: 11.39, 7.46, 13.78, 11.98, 8.06, 11.69, 9.86, 12.02, 9.12, 11.73.
Find a 95% CI on µ.
Find a 99% CI on µ.
What is the precision level (interval width) and absolute error in estimating µ?
What would be the interval width for 95% and 99% confidence level?
What would be the sample size if we want to reduce absolute error to 0.715 for 95% confidence
level?
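Since σ is known in the alloy problem, the CI is based on the Z-distribution. The sketch below reproduces the hand calculation with the standard library; z_interval is a helper written for this example, and the required sample size is rounded up to the next integer so that the error bound is actually met.

```python
from math import ceil, sqrt
from statistics import NormalDist, mean

data = [11.39, 7.46, 13.78, 11.98, 8.06, 11.69, 9.86, 12.02, 9.12, 11.73]
sigma = 2.0                       # known population standard deviation (MPa)
n = len(data)
xbar = mean(data)

def z_interval(confidence):
    """Two-sided CI on mu with known sigma: xbar -/+ z * sigma / sqrt(n)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # e.g. about 1.96 for 95%
    half_width = z * sigma / sqrt(n)                 # absolute error E
    return xbar - half_width, xbar + half_width

ci95 = z_interval(0.95)
ci99 = z_interval(0.99)

# Sample size needed for absolute error E = 0.715 at 95% confidence: n >= (z * sigma / E)^2
z95 = NormalDist().inv_cdf(0.975)
n_required = ceil((z95 * sigma / 0.715) ** 2)

print(f"sample mean = {xbar:.3f}")
print(f"95% CI = [{ci95[0]:.3f}, {ci95[1]:.3f}], width = {ci95[1] - ci95[0]:.3f}")
print(f"99% CI = [{ci99[0]:.3f}, {ci99[1]:.3f}], width = {ci99[1] - ci99[0]:.3f}")
print("required sample size =", n_required)
```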
10. The length of a rod is a normal random variable distributed with σ = 1 cm. Ten measurements
are as follows: 50.7, 50.5, 50.6, 50.5, 50.3, 50.6, 50.8, 50.2, 50.3, 50.1.
Find a 95% CI on µ, the mean length.
Repeat the same problem, this time for 99% CI.
From the above two problems, verify that the precision level for 99% CI is lower than that
of 95% CI.
What would you do to improve the precision level of 99% CI to the level of 95% CI?
11. (py) Verify that CI is a random variable: Assuming a normal population with µ = 10, σ = 2,
generate several random samples of size 10.
Plot E/σ (absolute error measured in the unit of σ) as a function of sample size n for 95%
and 99% CI.
(py) Plot E_s/E_σ as a function of sample size n for 95% and 99% CI, assuming equal values of s and σ.
Based on the plots, what would you conclude?
13. Number of days after which a component needs to be replaced is as follows: 7.5, 12.7, 16.7, 11.9,
15.4, 11.9, 15.8, 11.4, 14.9, 7.9, 17.6, 13.6, 10.1, 18.5, 14.1, 8.8, 19.8, 15.4, 11.4, 19.5, 15.4. Find
the sample average and sample variance using python and then solve the rest of the problem by
hand.
Find a 90% CI on µ.
Find a 95% CI on µ.
Find a 98% CI on µ.
Lecture 8-10
2. Python
One-sample T-test on the mean:
stats.ttest_1samp(df, popmean, alternative='greater')
where df is a data frame containing the sample values; popmean is the value of the null hypothesis (assumed value of the population mean); alternative='greater' (right-sided or upper-tailed test) or alternative='less' (left-sided or lower-tailed test) or alternative='two-sided' (two-sided test).
Two-sample T-test on the mean:
stats.ttest_ind(df['Population 1'], df['Population 2'], alternative='greater')
where df is a data frame containing two data columns, Population 1 and Population 2; alternative='greater' (right-sided or upper-tailed test) or alternative='less' (left-sided or lower-tailed test) or alternative='two-sided' (two-sided test).
Paired-sample T-test on the mean:
stats.ttest_rel(df['After'], df['Before'], alternative='greater')
where df is a data frame containing two data columns, After and Before; alternative='greater' (right-sided or upper-tailed test) or alternative='less' (left-sided or lower-tailed test) or alternative='two-sided' (two-sided test).
Exercise:
1. General ideas about the significance level and type II error.
Consider a normal distribution with σ² = 9 and the null hypothesis H0 : µ = 90. In a two-sided test, the acceptance region is 90 ± 1.5, such that a type I error occurs if X̄ > 91.5 or X̄ < 88.5. Find the significance level, taking n = 9.
If we widen the acceptance region to 90 ± 2.5, what would be the significance level?
If we shorten the acceptance region to 90 ± 0.5, what would be the significance level?
Effect of sample size on the significance level: Repeat the previous problem, taking n = 64.
Given the null hypothesis H0 : µ = 90 and alternative hypothesis H1 : µ = 94, find the type II error, given that the acceptance range is 90 ± 1.5, σ² = 9, n = 9.
Type II error increases as the null and alternative hypotheses get closer. Verify this for the null hypothesis H0 : µ = 90 and alternative hypothesis H1 : µ = 92, with the other parameters the same as in the previous problem.
Effect of sample size on type II error: show that the type II error decreases if we take n = 64. Other parameters are the same as in the previous problem.
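The significance level and the type II error in this problem are both normal probabilities for the sample average X̄, which has standard deviation σ/√n. The sketch below is a minimal way to check hand calculations with the standard library; the two function names are assumptions of this example.

```python
from math import sqrt
from statistics import NormalDist

def significance_level(mu0, half_width, sigma, n):
    """alpha = P(X_bar falls outside mu0 +/- half_width) when H0: mu = mu0 is true."""
    xbar = NormalDist(mu0, sigma / sqrt(n))
    return xbar.cdf(mu0 - half_width) + (1 - xbar.cdf(mu0 + half_width))

def type2_error(mu0, mu1, half_width, sigma, n):
    """beta = P(X_bar falls inside mu0 +/- half_width) when the true mean is mu1."""
    xbar = NormalDist(mu1, sigma / sqrt(n))
    return xbar.cdf(mu0 + half_width) - xbar.cdf(mu0 - half_width)

sigma = 3.0                                   # sigma^2 = 9

alpha_9 = significance_level(90, 1.5, sigma, 9)
alpha_64 = significance_level(90, 1.5, sigma, 64)
beta_94 = type2_error(90, 94, 1.5, sigma, 9)
beta_92 = type2_error(90, 92, 1.5, sigma, 9)
beta_92_n64 = type2_error(90, 92, 1.5, sigma, 64)

print(f"alpha (n = 9)  = {alpha_9:.4f}")
print(f"alpha (n = 64) = {alpha_64:.6f}")
print(f"beta (mu1 = 94, n = 9)  = {beta_94:.4f}")
print(f"beta (mu1 = 92, n = 9)  = {beta_92:.4f}")
print(f"beta (mu1 = 92, n = 64) = {beta_92_n64:.4f}")
```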
2. Fixed significance level hypothesis test:
(py) Right-sided test: Assume X to be a normal random variable with variance σ² = 20. Using a sample size of 10, test the hypothesis H0 : µ = 100 against the alternative H1 : µ > 100 and plot the power function. Assume a fixed significance level α = 0.05.
(py) Left-sided test: Assume X to be a normal random variable with variance σ² = 20. Using a sample size of 10, test the hypothesis H0 : µ = 100 against the alternative H1 : µ < 100 and plot the power function. Assume a fixed significance level α = 0.05.
(py) Two-sided test: Assume X to be a normal random variable with variance σ² = 20. Using a sample size of 10, test the hypothesis H0 : µ = 100 against the alternative H1 : µ ≠ 100 and plot the power function. Assume a fixed significance level α = 0.05.
3. P-value hypothesis test: Say a random normal sample is
{103.5, 104, 102, 103, 101, 99.5, 100.5, 103.5, 102.5},
and the population variance is σ² = 9. Perform a two-sided test of the null hypothesis H0 : µ = 100 against the alternative H1 : µ ≠ 100.
4. Say we are testing a null hypothesis H0 : µ = 100, against an alternate hypothesis H1 : µ > 100.
Do the following exercise and verify that your decision depends on the sample size! How do you
explain this?
Given sample size n = 9, sample average x̄ = 101.4 and σ² = 9, find the P-value.
Given sample size n = 16, sample average x̄ = 101.4 and σ² = 9, find the P-value.
Given sample size n = 25, sample average x̄ = 101.4 and σ² = 9, find the P-value.
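The dependence of the decision on sample size can be seen by computing the right-sided P-value for each n. This is a minimal sketch using statistics.NormalDist; the function name p_value_right is chosen here for illustration.

```python
from math import sqrt
from statistics import NormalDist

def p_value_right(xbar, mu0, sigma, n):
    """P-value of a right-sided Z-test: P(Z >= z_obs) with z_obs = (xbar - mu0)/(sigma/sqrt(n))."""
    z_obs = (xbar - mu0) / (sigma / sqrt(n))
    return 1 - NormalDist().cdf(z_obs)

sigma = 3.0                                    # sigma^2 = 9
p_values = {n: p_value_right(101.4, 100, sigma, n) for n in (9, 16, 25)}

for n, p in p_values.items():
    decision = "reject H0" if p < 0.05 else "fail to reject H0"
    print(f"n = {n:2d}: P-value = {p:.4f} -> {decision} at alpha = 0.05")
```

The same sample average gives a smaller standard error, and hence a larger Z-statistic, as n increases.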
5. (py) Sensitivity of P-value hypothesis test: Show that, if the null hypothesis is significantly wrong compared to the true population mean, the P-value should be able to detect that. Do the test with sample size 10 and sample size 20 or even higher. What would you conclude?
Say the sample mean is x̄ = 0.83724 and the sample standard deviation is s = 0.024557 for a sample size n = 15. Test the null hypothesis H0 : µ = 0.82 against the alternative hypothesis H1 : µ > 0.82.
(py) Do the above exercise using python.
Z-test: A manufacturer claims to have improved the battery life. From the new population, a random sample of size n = 10 has a sample average of x̄1 = 121 days of battery life. On the other hand, from the old population, a random sample of size n = 10 has a sample average of x̄2 = 112 days of battery life. For both populations, the standard deviation is σ = 8 days. What conclusion can we draw about the claim of improvement of the battery life?
T-test: A company is asking a higher price for one material (population 1) than for another (population 2), because the former has higher thermal conductivity. Two random samples are as follows:
Population 1   Population 2
118            108
127            123
117            119
117            119
126            124
126            116
123            114
130            124
120            115
113            114
Indeed, population 1 has a higher sample average (121.7 W/mK) than population 2 (117.6 W/mK). Determine whether the difference is statistically significant.
(py) Repeat the above exercise using python.
Values are given (in GPa) for a high strength steel after and before the heat treatment. We
want to test whether the process of heat treatment has really led to strength enhancement.
The sample average before heat treatment is 1.3 GPa and after heat treatment is 1.34 GPa.
Looking at the sample average, it looks like the heat treatment process has marginally
increased the strength. But is this statistically significant?
(py) Repeat the above exercise using python.
The manager of a workshop claims that, out of the total products manufactured in a day, 35% fall in the excellent category, 40% in the good category, 20% in the acceptable category and 5% in the rejected category. On a particular day, 500 products are randomly sampled and it is found that 190 of them belong to the excellent, 185 to the good, 90 to the acceptable and 35 to the rejected category. Based on this, verify the claim of the workshop manager.
(py) Repeat the above exercise using python.
Lecture 11-14
3. Python
x_i    y_i     (x_i − x̄)(y_i − ȳ)   (x_i − x̄)²   ŷ_i   (y_i − ȳ)²   (ŷ_i − ȳ)²   (ŷ_i − y_i)²
10.2   89.05
12.9   93.74
13.6   94.45
14.6   96.73
14.0   93.65
11.5   92.52
10.1   89.45
9.5    87.33
x̄ = ?  ȳ = ?  S_xy = ?             S_xx = ?            SS_T = ?     SS_R = ?     SS_E = ?
MS_R = SS_R = ?    MS_E = σ̂² = SS_E/(n − 2) = ?
F = MS_R/MS_E = ?
Hypothesis test on coefficients and 95% CI:
β̂0 = ?   se(β̂0) = ?   t = ?   P-value = ?   95% CI = [?, ?]
β̂1 = ?   se(β̂1) = ?   t = ?   P-value = ?   95% CI = [?, ?]
Table 1: Find the linear regression coefficients, β̂1 = S_xy/S_xx, β̂0 = ȳ − β̂1 x̄, and complete the table. Verify that (a) SS_T = SS_R + SS_E, (b) SS_E = SS_T − β̂1 S_xy, (c) SS_R = β̂1 S_xy. Compare the numbers obtained so far with the output of the command summary2().
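The quantities in Table 1 can be computed column by column in a few lines. The sketch below is one way to do it with plain Python lists, using the data from the table; the variable names mirror the table headings and are otherwise an assumption of this example.

```python
from math import sqrt

x = [10.2, 12.9, 13.6, 14.6, 14.0, 11.5, 10.1, 9.5]
y = [89.05, 93.74, 94.45, 96.73, 93.65, 92.52, 89.45, 87.33]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
S_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
S_xx = sum((xi - x_bar) ** 2 for xi in x)

beta1 = S_xy / S_xx                  # slope
beta0 = y_bar - beta1 * x_bar        # intercept
y_hat = [beta0 + beta1 * xi for xi in x]

SS_T = sum((yi - y_bar) ** 2 for yi in y)                 # total sum of squares
SS_R = sum((yh - y_bar) ** 2 for yh in y_hat)             # regression sum of squares
SS_E = sum((yh - yi) ** 2 for yh, yi in zip(y_hat, y))    # error sum of squares

MS_R = SS_R / 1                      # one regressor
MS_E = SS_E / (n - 2)                # estimate of sigma^2
F = MS_R / MS_E

se_beta1 = sqrt(MS_E / S_xx)         # estimated standard error of the slope
t_beta1 = beta1 / se_beta1           # t-statistic for H0: beta1 = 0

print(f"beta0 = {beta0:.4f}, beta1 = {beta1:.4f}")
print(f"SS_T = {SS_T:.4f}, SS_R = {SS_R:.4f}, SS_E = {SS_E:.4f}")
print(f"F = {F:.3f}, t(beta1) = {t_beta1:.3f}")
```

For simple linear regression the printed F should equal the square of the slope's t-statistic, which gives one more consistency check.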
Exercise:
Simple linear regression
1. Using the method of least squares, derive β̂0 = ȳ − β̂1 x̄ and β̂1 = S_xy/S_xx.
2. Residuals are defined as ε_i = y_i − ŷ_i. What are the assumptions of the linear regression model about the residuals?
5. Define MS_E, MS_R, MS_T.
6. Write the null and alternate hypothesis for testing the regression coefficients.
10. Find a 95% CI on the slope of the regression line using the following: β̂1 = 14.86, σ̂² = 0.96, S_xx = 0.92, n = 30.
11. (py) Complete Table 1 using some spreadsheet software. Solve it separately using python and
compare the results.
12. (py) Using the data given in this link, build a linear regression model and perform some model
diagnosis like
ANOVA, R²
Residual analysis to verify residuals are normal random variables with constant variance
Influence plot
Hypothesis test and confidence intervals on regression coefficients
4. Given β̂1 = −0.24 and se(β̂1) = 0.09. Using a hypothesis test, verify the significance of regression. Find the P-value, the reference value of z_α for α = 0.05 and the 95% CI.
CI cheat sheet: We want to construct a CI on some population parameter. Define a new random variable (RV)
$\mathrm{RV} = \dfrac{\hat{\theta} - \theta}{\mathrm{se}(\hat{\theta})},$
where θ is a population parameter, θ̂ is the corresponding point estimate and se(θ̂) is the standard error/estimated standard error. For example, if we want a CI on the population mean µ, the point estimate is the sample average X̄. The following is the list of all cases covered in this course.
CI on µ, σ known:
  −Z_α ≤ (X̄ − µ)/se(X̄) ≤ Z_α; Z_α = 1.96 for 95% CI
  X̄ − Z_α · se(X̄) ≤ µ ≤ X̄ + Z_α · se(X̄)
  Sample average X̄, size n, population standard deviation σ, standard error se(X̄) = σ/√n
CI on µ, σ unknown:
  −T_α ≤ (X̄ − µ)/se(X̄) ≤ T_α; T_α = T_{0.025,n−1} for 95% CI
  X̄ − T_α · se(X̄) ≤ µ ≤ X̄ + T_α · se(X̄)
  Sample average X̄, size n, sample standard deviation S, estimated standard error se(X̄) = S/√n
CI on β1 (linear):
  −T_α ≤ (β̂1 − β1)/se(β̂1) ≤ T_α; T_α = T_{0.025,n−2} for 95% CI
  β̂1 − T_α · se(β̂1) ≤ β1 ≤ β̂1 + T_α · se(β̂1)
  Linear regression slope β̂1, size n, estimated standard error se(β̂1) = σ̂/√S_xx
CI on β1 (logistic):
  −Z_α ≤ (β̂1 − β1)/se(β̂1) ≤ Z_α; Z_α = 1.96 for 95% CI
  β̂1 − Z_α · se(β̂1) ≤ β1 ≤ β̂1 + Z_α · se(β̂1)
  Logistic regression slope β̂1, estimated standard error se(β̂1)
Example: In a linear regression problem, you found the slope to be β̂1 = 10, SS_E = 7, S_xx = 1, n = 9. Find the 95% CI on the slope.
Answer: Mean square error MS_E = σ̂² = SS_E/(n − 2) = 1 and standard error se(β̂1) = σ̂/√S_xx = 1. For 95% CI, T_{0.025,7} = 2.365. Thus, the 95% CI on the slope is [7.635, 12.365]. Since the CI on the slope does not contain 0, there is strong evidence that the slope is not zero. If the interval contained 0, we could not conclude that the slope differs from zero, i.e., the regression would not be significant.
Example: In a logistic regression problem, you find β̂1 = −0.24 and se(β̂1) = 0.09. Find the 95% CI on the slope.
Answer: The reference value of Z for 95% CI is z_{0.025} = 1.96. Thus, the 95% CI on the slope is [−0.4164, −0.0636]. Since the CI on the slope does not contain 0, there is strong evidence that the slope is not zero. If the interval contained 0, we could not conclude that the slope differs from zero, i.e., the regression would not be significant.
Hypothesis test cheat sheet: We want to test the null hypothesis H0 : θ = θ0. The test statistic is
$\text{Statistic} = \dfrac{\hat{\theta} - \theta_0}{\mathrm{se}(\hat{\theta})},$
where θ0 is the null hypothesis (guess value of the population parameter), θ̂ is the corresponding point estimate and se(θ̂) is the standard error/estimated standard error. For example, if we want to test the population mean µ, the point estimate is the sample average X̄. The following is the list of all cases covered in this course.
H0 : µ = µ0; σ known:
  Z = (X̄ − µ0)/se(X̄); N(0, 1) distribution
  Sample average X̄, sample size n, population standard deviation σ, standard error se(X̄) = σ/√n
H0 : µ = µ0; σ unknown:
  T = (X̄ − µ0)/se(X̄); T distribution with n − 1 degrees of freedom (reference value T_{α,n−1})
  Sample average X̄, sample size n, sample standard deviation S, estimated standard error se(X̄) = S/√n
H0 : µ1 − µ2 = ∆; σ1, σ2 known (two-sample):
  Z = (X̄1 − X̄2 − ∆)/se(X̄1 − X̄2); N(0, 1) distribution
  Sample averages X̄1, X̄2, sample sizes n1, n2, standard error se(X̄1 − X̄2) = √(σ1²/n1 + σ2²/n2)
H0 : µ_D = ∆ (paired-sample):
  T = (D̄ − ∆)/se(D̄); T distribution with n − 1 degrees of freedom (reference value T_{α,n−1})
  Paired-sample average D̄, sample size n, estimated standard error se(D̄) = S_D/√n
H0 : β1 = 0 (linear):
  T = (β̂1 − 0)/se(β̂1); T distribution with n − 2 degrees of freedom (reference value T_{α,n−2})
  Linear regression slope β̂1, sample size n, estimated standard error se(β̂1) = σ̂/√S_xx
H0 : β1 = 0 (logistic):
  Z = (β̂1 − 0)/se(β̂1); N(0, 1) distribution
  Logistic regression slope β̂1, estimated standard error se(β̂1)
After calculating the statistic (Z-statistic/T-statistic), we either do a fixed significance level test or a P-value test. The generally accepted value of the significance level is α = 0.05.
Reference Z-values for one-sided test (α = 0.05): Z_{0.95} = 1.65, Z_{0.05} = −1.65
Reference Z-values for two-sided test (α = 0.05): Z_{0.975} = 1.96, Z_{0.025} = −1.96
The same problem can be solved using the P-value test. The test statistic is $Z = \dfrac{8 - 10}{3/\sqrt{9}} = -2$. Thus, P-value = Φ(−2) ≈ 0.02. Since P-value < 0.05, we reject the null hypothesis. Note that tests of hypothesis using the Z-score and the P-value are equivalent.
Lecture 15-18
2. Classification problem
3. Python
Exercise:
Multiple linear regression
Obs. no.   y_i   x_i1   x_i2   ŷ_i   ε_i² = (y_i − ŷ_i)²   (ŷ_i − ȳ)²
1          24    8      110
2          32    11     120
3          35    10     550
4          25    8      295
5          45    15     250
6          24    9      100
7          27    8      300
8          37    11     400
9          42    12     500
10         35    10     540
ȳ = ?    SS_E = εᵀε = ?    SS_R = (ŷ − ȳ)ᵀ(ŷ − ȳ) = ?
MS_E = SS_E/(n − k − 1) = ?    MS_R = SS_R/k = ?
β̂0 = ?   β̂1 = ?   β̂2 = ?   F = MS_R/MS_E = ?
Table 2: Data for the multiple linear regression problems below.
2. (py) Use the data given in Table 2 and fit the regression model using sklearn and statsmodels. Compare the regression coefficients.
3. Given SS_E = 9.02, SS_R = 501.38, total number of data points n = 10 and total number of features k = 2. Find the F-statistic and coefficient of determination R².
4. In a multiple linear regression problem, there are k = 2 feature variables and n = 10 observa-
tions. Estimated regression coefficients and standard errors are β̂1 = 2.9, β̂2 = 0.02, se(β̂1 ) =
0.17, se(β̂2 ) = 0.002. Verify the significance of regression.
Build a linear regression model after properly selecting the important features using a cor-
relation or heat map
ANOVA, R²
Residual analysis to verify residuals are normal random variables with constant variance
Influence plot
Hypothesis test and confidence intervals on regression coefficients
Classification
1. (py) Using the data given in this link, do the following
2. From the confusion matrix obtained above, calculate precision, recall, F1-score and accuracy.
What is your overall conclusion about the regression model?
Lecture 19-20 Two lectures are kept for case studies and any other general discussion.
Lecture 21-25
1. Taylor series expansion
Bisection method
Relaxation method
Newton-Raphson method
Order of an iterative method
4. Python
Exercise:
4. (py) Write a python code to find all the roots of x³ − 15x − 4 = 0, using bisection method.
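One possible approach to finding all the roots is to scan an interval for sign changes and apply bisection inside each bracketing sub-interval. The sketch below assumes the search interval [−5, 5] and 100 sub-intervals; these choices and the function names are illustrative, not prescribed by the exercise.

```python
def f(x):
    return x**3 - 15 * x - 4

def bisect(func, a, b, tol=1e-10, max_iter=200):
    """Bisection on [a, b]; requires func(a) and func(b) to have opposite signs."""
    fa, fb = func(a), func(b)
    if fa * fb > 0:
        raise ValueError("root is not bracketed")
    for _ in range(max_iter):
        mid = 0.5 * (a + b)
        fm = func(mid)
        if fm == 0 or 0.5 * (b - a) < tol:
            return mid
        if fa * fm < 0:          # root lies in the left half
            b, fb = mid, fm
        else:                    # root lies in the right half
            a, fa = mid, fm
    return 0.5 * (a + b)

def find_roots(func, lo, hi, n_sub=100):
    """Scan [lo, hi] in n_sub pieces and bisect wherever the sign changes."""
    roots = []
    h = (hi - lo) / n_sub
    for i in range(n_sub):
        a, b = lo + i * h, lo + (i + 1) * h
        fa, fb = func(a), func(b)
        if fa == 0:              # grid point happens to be an exact root
            roots.append(a)
        elif fa * fb < 0:
            roots.append(bisect(func, a, b))
    if func(hi) == 0:
        roots.append(hi)
    return roots

roots = find_roots(f, -5.0, 5.0)
print("roots:", roots)
```

A scan like this can miss roots of even multiplicity or roots closer together than the sub-interval width, so the grid should be chosen with the function in mind.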
5. (py) Write a python code to find the root of x = e^{−x²}, using relaxation method. Verify first few steps by hand (using a scientific calculator).
7. (py) Write a code to find the roots of f(x) = x² − 3x + 1 = 0 using the relaxation method.
8. Convince yourself that f(x) = x³ + x − 1 = 0 has a root between 0 and 1. Try to find the root by relaxation method using x = 1/(1 + x²) and using x = 1 − x³. Which one works and why?
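The two rearrangements can be tried side by side with a small relaxation (fixed-point) routine. This is a minimal sketch; the starting guess x0 = 0.5, the tolerance and the function names are assumptions of this example. The iteration x = g(x) converges near a root only if |g'(x)| < 1 there, which is the property to examine when explaining the outcome.

```python
def relax(g, x0, tol=1e-12, max_iter=200):
    """Iterate x <- g(x) until successive values differ by less than tol.

    Returns (last value, converged flag, number of iterations used).
    """
    x = x0
    for i in range(1, max_iter + 1):
        x_new = g(x)
        if abs(x_new - x) < tol:
            return x_new, True, i
        x = x_new
    return x, False, max_iter

def g1(x):
    return 1 / (1 + x**2)     # from x (1 + x^2) = 1

def g2(x):
    return 1 - x**3           # from x = 1 - x^3

root1, converged1, steps1 = relax(g1, 0.5)
root2, converged2, steps2 = relax(g2, 0.5)

print(f"x = 1/(1 + x^2): converged = {converged1}, x = {root1:.10f}, steps = {steps1}")
print(f"x = 1 - x^3    : converged = {converged2}, last x = {root2}")
```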
4x_0 + 2x_1 + 3x_2 = 8, 3x_0 − 5x_1 + 2x_2 = −14, −2x_0 + 3x_1 + 8x_2 = 27.
17. x_{m+1} = P x_m + q. Derive the form of P and q for the Jacobi and Gauss-Seidel methods in terms of D, U and L.
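The Jacobi and Gauss-Seidel iterations can be written componentwise without forming P and q explicitly. The sketch below applies both to the 3 × 3 system given above, starting from the zero vector; the tolerance and function names are assumptions of this example. The coefficient matrix is not strictly diagonally dominant, so convergence is not guaranteed by that criterion, although both iterations do converge for this particular system.

```python
A = [[4.0, 2.0, 3.0],
     [3.0, -5.0, 2.0],
     [-2.0, 3.0, 8.0]]
b = [8.0, -14.0, 27.0]

def jacobi(A, b, tol=1e-10, max_iter=1000):
    """Jacobi: every component is updated from the previous iterate only."""
    n = len(b)
    x = [0.0] * n
    for it in range(1, max_iter + 1):
        x_new = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
                 for i in range(n)]
        if max(abs(x_new[i] - x[i]) for i in range(n)) < tol:
            return x_new, it
        x = x_new
    raise RuntimeError("Jacobi iteration did not converge")

def gauss_seidel(A, b, tol=1e-10, max_iter=1000):
    """Gauss-Seidel: updated components are used as soon as they are available."""
    n = len(b)
    x = [0.0] * n
    for it in range(1, max_iter + 1):
        max_change = 0.0
        for i in range(n):
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            new = (b[i] - s) / A[i][i]
            max_change = max(max_change, abs(new - x[i]))
            x[i] = new
        if max_change < tol:
            return x, it
    raise RuntimeError("Gauss-Seidel iteration did not converge")

x_jacobi, it_jacobi = jacobi(A, b)
x_gs, it_gs = gauss_seidel(A, b)
print("Jacobi      :", x_jacobi, "in", it_jacobi, "iterations")
print("Gauss-Seidel:", x_gs, "in", it_gs, "iterations")
```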
Lecture 26-30
1. Numerical integration
Rectangle method
Trapezoidal method
Gauss integration method
Error analysis
2. Numerical differentiation
3. Python
Exercise:
8. Using the Taylor series expansion, derive the forward, backward and central difference formula to
calculate the first derivative. Identify the first and second order method(s).
11. (py) Write codes to calculate the first derivative of x³/3 using central difference and forward difference in the range [−1, 1]. Plot the quadratic and linear dependence of error on h in central and forward difference.
12. Calculate the numerical derivative of x³/3 at x = 1, using h = 0.1. Calculate the error and compare with what you get from the code.
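The hand calculation at x = 1 with h = 0.1 can be checked against a few lines of code. The sketch below implements the forward and central difference formulas for f(x) = x³/3, whose exact derivative is x²; the function names are chosen here for illustration.

```python
def f(x):
    return x**3 / 3

def forward_diff(func, x, h):
    """First-order forward difference: (f(x + h) - f(x)) / h."""
    return (func(x + h) - func(x)) / h

def central_diff(func, x, h):
    """Second-order central difference: (f(x + h) - f(x - h)) / (2 h)."""
    return (func(x + h) - func(x - h)) / (2 * h)

x0 = 1.0
exact = x0**2                    # d/dx (x^3 / 3) = x^2

for h in (0.1, 0.05):
    err_fwd = abs(forward_diff(f, x0, h) - exact)
    err_cen = abs(central_diff(f, x0, h) - exact)
    print(f"h = {h}: forward error = {err_fwd:.6f}, central error = {err_cen:.6f}")

err_forward = abs(forward_diff(f, x0, 0.1) - exact)
err_central = abs(central_diff(f, x0, 0.1) - exact)
```

Halving h should roughly halve the forward-difference error and quarter the central-difference error, which is the linear versus quadratic dependence asked about in the previous problem.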
13. (py) Write a code to calculate the second derivative of x³/3 in the range [−1, 1]. Plot the dependence of error on h.
Lecture 31-33
Explicit method
Implicit method
2. Python
Exercise:
1. Solve ẋ = ax by Euler forward and Euler backward method. Do stability analysis and comment.
2. (py) Write codes to solve ẋ = ax by Euler forward and Euler backward method. Take a = −10.
3. Solve ẋ = x² − 100x using x(0) = 10 and ∆t = 0.001 using forward Euler method.
5. Solve ẋ = x² − 100x using x(0) = 10 and ∆t = 0.02 using backward Euler method.
7. What is the advantage of Euler backward method over Euler forward method?
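For the linear test equation ẋ = ax, both Euler schemes reduce to multiplying by a fixed factor each step, which makes the stability behaviour easy to see numerically. The sketch below uses a = −10 with two step sizes (0.25 and 0.01) chosen for illustration; for this a, forward Euler is stable only when |1 + a∆t| < 1, i.e. ∆t < 0.2.

```python
from math import exp

def euler_forward(a, x0, dt, n_steps):
    """Explicit Euler: x_{k+1} = x_k + dt * a * x_k = (1 + a dt) x_k."""
    xs = [x0]
    for _ in range(n_steps):
        xs.append(xs[-1] + dt * a * xs[-1])
    return xs

def euler_backward(a, x0, dt, n_steps):
    """Implicit Euler: x_{k+1} = x_k + dt * a * x_{k+1}, so x_{k+1} = x_k / (1 - a dt)."""
    xs = [x0]
    for _ in range(n_steps):
        xs.append(xs[-1] / (1 - a * dt))
    return xs

a, x0 = -10.0, 1.0

# Large step: dt = 0.25 is above the forward-Euler stability limit of 0.2.
fwd_big = euler_forward(a, x0, 0.25, 20)
bwd_big = euler_backward(a, x0, 0.25, 20)

# Small step: dt = 0.01, integrate up to t = 1 and compare with exp(a t).
fwd_small = euler_forward(a, x0, 0.01, 100)
bwd_small = euler_backward(a, x0, 0.01, 100)
exact = x0 * exp(a * 1.0)

print("dt = 0.25: forward ->", fwd_big[-1], " backward ->", bwd_big[-1])
print("dt = 0.01: forward ->", fwd_small[-1], " backward ->", bwd_small[-1],
      " exact ->", exact)
```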
Lecture 34-40
2. Python
Exercise:
1. Using separation of variables, solve diffusion or heat equation with initial condition f (x, t = 0) =
50 and boundary conditions f (x = 0, t) = f (x = l, t) = 0.
2. (py) Using a python code, plot the solution obtained in the previous problem for different values
of t.
3. Consider a metal bar of length 1 and α² = 1 in the heat equation. Both ends of the bar are kept at temperature 0 °C. At time t = 0, the temperature distribution in the bar is f(x, t = 0) = sin πx. Applying the FTCS method with h = 0.2 and r = 1/2, find the temperature f(x, t) in the bar when t > 0.
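The FTCS update for this problem can be sketched in a few lines. The code below assumes the usual definition r = α²∆t/h² (the problem statement gives only the value of r), so that h = 0.2 and r = 1/2 imply ∆t = 0.02; the function name ftcs_step and the number of steps are illustrative.

```python
from math import exp, pi, sin

h = 0.2                        # grid spacing
r = 0.5                        # assumed to mean r = alpha^2 * dt / h^2
alpha2 = 1.0
dt = r * h * h / alpha2        # time step implied by r and h
nx = round(1 / h) + 1          # number of grid points on [0, 1]

x = [i * h for i in range(nx)]
u = [sin(pi * xi) for xi in x]
u[0] = u[-1] = 0.0             # both ends held at 0 °C

def ftcs_step(u, r):
    """One forward-time, centered-space step; boundary values stay fixed."""
    new = u[:]
    for i in range(1, len(u) - 1):
        new[i] = u[i] + r * (u[i + 1] - 2 * u[i] + u[i - 1])
    return new

u1 = ftcs_step(u, r)           # temperature after one step, t = dt

# March a few more steps and compare with the analytic solution
# f(x, t) = exp(-pi^2 t) sin(pi x) at the same time.
u_t, t = u, 0.0
for _ in range(5):
    u_t = ftcs_step(u_t, r)
    t += dt
for xi, ui in zip(x, u_t):
    print(f"x = {xi:.1f}: FTCS = {ui:.5f}, analytic = {exp(-pi**2 * t) * sin(pi * xi):.5f}")
```

For this sine-shaped initial profile each FTCS step simply rescales the interior values by a constant factor, which is a convenient check when doing the first steps by hand.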
4. Consider a metal bar of length 1 and α² = 1 in the heat equation. Both ends of the bar are kept at temperature 0 °C. At time t = 0, the temperature distribution in the bar is f(x, t = 0) = sin πx. Applying the CN method with h = 0.2 and r = 1, find the temperature f(x, t) in the bar when t > 0.