
Course: Math 535 - Mathematical Methods in Data Science (MMiDS)

Author: Sebastien Roch, Department of Mathematics, University of Wisconsin-Madison


Updated: February 14, 2022
Copyright: © 2022 Sebastien Roch

Chapter 1 - Introduction and review


1.5 Exercises
1.1 (Adapted from [VMLS]) Recall that for two vectors $x$ and $y$ of the same dimension, their inner product is $x^T y$. A vector is said to be nonnegative if all of its entries are $\geq 0$.

(a) Show that the inner product of two nonnegative vectors is necessarily $\geq 0$.

(b) Suppose the inner product of two nonnegative vectors is zero. What can you say about their sparsity patterns, i.e., which of their entries are zero or nonzero?

1.2 (Adapted from [VMLS]) A Boolean $n$-vector is one for which all entries are either $0$ or $1$. Such vectors are used to encode whether each of $n$ conditions holds, with $a_i = 1$ meaning that condition $i$ holds.

(a) Another common encoding of the same information uses the two values $-1$ and $+1$ for the entries. For example, the Boolean vector $(0, 1, 1, 0)^T$ would be written using this alternative encoding as $(-1, +1, +1, -1)^T$. Suppose that $x$ is a Boolean vector and $y$ is a vector encoding the same information using the values $-1$ and $+1$. Express $y$ in terms of $x$ using vector notation. Also, express $x$ in terms of $y$ using vector notation.

(b) Suppose that $x$ and $y$ are Boolean $n$-vectors. Give a simple word description of their squared Euclidean distance $\|x - y\|_2^2$ and justify it mathematically.

1.3 (Adapted from [VMLS]) If $a$ is a vector, then $a_{r:s}$ is the vector of size $s - r + 1$ with entries $a_r, \ldots, a_s$, i.e., $a_{r:s} = (a_r, \ldots, a_s)^T$. The vector $a_{r:s}$ is called a slice. As a more concrete example, if $z$ is the $4$-vector $(1, -1, 2, 0)^T$, the slice $z_{2:3} = (-1, 2)^T$. Suppose the $T$-vector $x$ represents a time series or signal. The quantity

$$D(x) = (x_1 - x_2)^2 + (x_2 - x_3)^2 + \cdots + (x_{T-1} - x_T)^2,$$

the sum of the squares of the differences of adjacent values of the signal, is called the Dirichlet energy of the signal. The Dirichlet energy is a measure of the roughness or wiggliness of the time series.

(a) Express $D(x)$ in vector notation using slicing. [Hint: Note the similarity between $D(x)$ and the squared Euclidean distance between two vectors.]

(b) How small can $D(x)$ be? What signals $x$ have this minimum value of the Dirichlet energy?

(c) Find a signal $x$ with entries no more than one in absolute value that has the largest possible value of $D(x)$. Give the value of the Dirichlet energy achieved.
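A quick numerical check of the slicing form in (a), assuming the identity $D(x) = \|x_{1:T-1} - x_{2:T}\|_2^2$ (which is what the hint points toward, not a given):

In [ ]:
import numpy as np

x = np.array([1., -1., 2., 0.])              # the example 4-vector z above
D_def = np.sum((x[:-1] - x[1:])**2)          # D(x) from the definition
D_slice = np.linalg.norm(x[:-1] - x[1:])**2  # slicing form, x_{1:T-1} - x_{2:T}
print(D_def, D_slice)                        # both 17.0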

1.4 (Adapted from [VMLS]) A vector of length $n$ can represent the number of times each word in a dictionary of $n$ words appears in a document. For example, $(25, 2, 0)^T$ means that the first dictionary word appears $25$ times, the second one twice, and the third one not at all. Suppose the $n$-vector $w$ is the word count vector associated with a document and a dictionary of $n$ words. For simplicity we will assume that all words in the document appear in the dictionary.

(a) What is $\mathbf{1}^T w$? Here $\mathbf{1}$ is an all-one vector of the appropriate size.

(b) What does $w_{282} = 0$ mean?

(c) Let $h$ be the $n$-vector that gives the histogram of the word counts, i.e., $h_i$ is the fraction of the words in the document that are word $i$. Use vector notation to express $h$ in terms of $w$. (You can assume that the document contains at least one word.)
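A small sketch illustrating (a) and (c) on the example vector; the formulas used here (total count $\mathbf{1}^T w$ and histogram $w$ divided by that total) are stated as assumptions to test, not as the official solution:

In [ ]:
import numpy as np

w = np.array([25., 2., 0.])    # example word count vector
total = np.ones(len(w)) @ w    # 1^T w: total number of words (27 here)
h = w / total                  # candidate histogram; entries sum to 1
print(total, h)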


1.5 (Adapted from [VMLS]) Suppose $a$ and $b$ are vectors of the same size. The triangle inequality states that $\|a + b\|_2 \leq \|a\|_2 + \|b\|_2$. Show that we also have $\|a + b\|_2 \geq \|a\|_2 - \|b\|_2$. [Hint: Apply the triangle inequality to $(a + b) + (-b)$.] ⊲

1.6 (Adapted from [VMLS]) Verify that the following identities hold for any two vectors $a$ and $b$ of the same size.

(a) $(a + b)^T (a - b) = \|a\|_2^2 - \|b\|_2^2$.

(b) $\|a + b\|_2^2 + \|a - b\|_2^2 = 2(\|a\|_2^2 + \|b\|_2^2)$. This is called the Parallelogram Law.
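Before proving the identities, a quick numerical check on random vectors can catch a misremembered sign; this is only a sanity check, not the verification the exercise asks for:

In [ ]:
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.standard_normal(5), rng.standard_normal(5)
# (a): difference of squared norms
print(np.isclose((a + b) @ (a - b), a @ a - b @ b))                  # True
# (b): Parallelogram Law
print(np.isclose(np.linalg.norm(a + b)**2 + np.linalg.norm(a - b)**2,
                 2*(np.linalg.norm(a)**2 + np.linalg.norm(b)**2)))   # True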

1.7 (Adapted from [VMLS]) The root-mean-square (RMS) value of an $n$-vector $x$ is defined as

$$\mathrm{rms}(x) = \sqrt{\frac{x_1^2 + \cdots + x_n^2}{n}} = \frac{\|x\|_2}{\sqrt{n}}.$$

(a) Show that at least one entry of a vector has absolute value at least as large as the RMS value of the vector.

(b) For an $n$-vector $x$, let $\mathrm{avg}(x) = \mathbf{1}^T x / n$ and

$$\mathrm{std}(x) = \frac{\|x - \mathrm{avg}(x)\mathbf{1}\|_2}{\sqrt{n}}.$$

Establish the identity

$$\mathrm{rms}(x)^2 = \mathrm{avg}(x)^2 + \mathrm{std}(x)^2.$$

[Hint: Expand the expression for $\mathrm{std}(x)^2$.]

(c) Use (b) to show that $|\mathrm{avg}(x)| \leq \mathrm{rms}(x)$.

(d) Use (b) to show that $\mathrm{std}(x) \leq \mathrm{rms}(x)$.
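A numerical check of the identity in (b); passing on random inputs does not prove it, but it is a useful guard before attempting the algebra:

In [ ]:
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(7)
n = len(x)
rms = np.linalg.norm(x) / np.sqrt(n)
avg = np.ones(n) @ x / n
std = np.linalg.norm(x - avg * np.ones(n)) / np.sqrt(n)
print(np.isclose(rms**2, avg**2 + std**2))   # True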

1.8 (Adapted from [VMLS]) Suppose that the vectors $x_1, \ldots, x_N$ are clustered using $k$-means, with group representatives $z_1, \ldots, z_k$.

(a) Suppose the original vectors $x_i$ are nonnegative, i.e., their entries are nonnegative. Explain why the representatives $z_j$ are also nonnegative.

(b) Suppose the original vectors $x_i$ represent proportions, i.e., their entries are nonnegative and sum to one. Explain why the representatives $z_j$ also represent proportions, i.e., their entries are nonnegative and sum to one.

(c) Suppose the original vectors $x_i$ are Boolean, i.e., their entries are either $0$ or $1$. Give an interpretation of $(z_j)_i$, the $i$-th entry of the $j$-th group representative.

1.9 (Adapted from [VMLS]) Clustering a collection of vectors into $k = 2$ groups is called $2$-way partitioning, since we are partitioning the vectors into $2$ groups, with index sets $G_1$ and $G_2$. Suppose we run $k$-means with $k = 2$ on the $n$-vectors $x_1, \ldots, x_N$. Show that there is a nonzero vector $w$ and a scalar $v$ that satisfy

$$w^T x_i + v \geq 0, \ \forall i \in G_1, \qquad w^T x_i + v \leq 0, \ \forall i \in G_2.$$

In other words, the affine function $f(x) = w^T x + v$ is greater than or equal to zero on the first group, and less than or equal to zero on the second group. This is called linear separation of the two groups. [Hint: Consider the function $\|x - z_1\|_2^2 - \|x - z_2\|_2^2$, where $z_1$ and $z_2$ are the group representatives.]

1.10 (Adapted from [ASV]) Let $0 < p \leq 1$. A random variable $X$ has the geometric distribution with success parameter $p$ if the possible values of $X$ are $\{1, 2, 3, \ldots\}$ and $X$ satisfies $P(X = k) = (1 - p)^{k-1} p$ for positive integers $k$. Its mean is $1/p$ and its variance is $(1 - p)/p^2$.

Let $X$ be a geometric random variable with parameter $p = 1/6$.

(a) Use Markov's inequality to find an upper bound for $P(X \geq 16)$.

(b) Use Chebyshev's inequality to find an upper bound for $P(X \geq 16)$.

(c) Use the sum of a geometric series to explicitly compute the probability $P(X \geq 16)$ and compare with the upper bounds you derived.
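A numerical companion for comparing the three quantities; the particular manipulations coded below (Markov as $E[X]/16$, Chebyshev via $P(|X - E[X]| \geq 16 - E[X])$) are one natural route, and deriving them is still the point of the exercise:

In [ ]:
import numpy as np

p = 1/6
mean, var = 1/p, (1 - p)/p**2       # E[X] = 6, Var(X) = 30
markov = mean / 16                  # one Markov-style bound
chebyshev = var / (16 - mean)**2    # one Chebyshev-style bound
exact = (1 - p)**15                 # P(X >= 16) = (1-p)^15
print(markov, chebyshev, exact)     # 0.375, 0.3, ~0.065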


1.11 (Adapted from [ASV]) Let $0 < \lambda < +\infty$. A random variable $X$ has the exponential distribution with parameter $\lambda$ if $X$ has density function $f(x) = \lambda e^{-\lambda x}$ for $x \geq 0$, and $0$ otherwise. Its mean is $1/\lambda$ and its variance is $1/\lambda^2$.

Let $X$ be an exponential random variable with parameter $\lambda = 1/2$.

(a) Use Markov's inequality to find an upper bound for $P(X > 6)$.

(b) Use Chebyshev's inequality to find an upper bound for $P(X > 6)$.

(c) Use an integral to explicitly compute the probability $P(X > 6)$ and compare with the upper bounds you derived.
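A simulation-based sanity check for (c): the closed form $P(X > 6) = e^{-6\lambda}$ used below is what the integral should give, stated here as an assumption to verify:

In [ ]:
import numpy as np

lam = 1/2
exact = np.exp(-lam * 6)            # candidate value of the integral
rng = np.random.default_rng(2)
sim = (rng.exponential(scale=1/lam, size=10**6) > 6).mean()
print(exact, sim)                   # both ~0.0498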


1.12 (Adapted from [ASV]) Suppose that $X$ is a nonnegative random variable with $E[X] = 10$.

(a) Use Markov to give an upper bound on the probability that $X$ is larger than $15$.

(b) Suppose that we also know that $\mathrm{Var}(X) = 3$. Use Chebyshev to give a better upper bound on $P(X > 15)$ than in part (a).

(c) Suppose that $Y_1, Y_2, \ldots, Y_{300}$ are i.i.d. random variables with the same distribution as $X$, so that, in particular, $E(Y_i) = 10$ and $\mathrm{Var}(Y_i) = 3$. Use Chebyshev to give an upper bound on the probability that $\sum_{i=1}^{300} Y_i$ is larger than $3060$.

1.13 (Adapted from [ASV]) Suppose that we have i.i.d. random variables $X_1, X_2, X_3, \ldots$ with finite mean $E[X_1] = \mu$ and variance $\mathrm{Var}(X_1) = \sigma^2$. Let $S_n = X_1 + \cdots + X_n$. Prove that for any fixed $\varepsilon > 0$ and $1/2 < \alpha < 1$ we have

$$\lim_{n \to +\infty} P\left[\left|\frac{S_n - n\mu}{n^\alpha}\right| < \varepsilon\right] = 1.$$

1.14 (Adapted from [ASV]) By mimicking the proof of the Weak Law of Large Numbers, prove the following variant. Suppose that we have random variables $X_1, X_2, \ldots$ each with finite mean $E[X_i] = \mu$ and variance $\mathrm{Var}(X_i) = \sigma^2$. Suppose further that $\mathrm{Cov}(X_i, X_j) = 0$ whenever $|i - j| \geq 2$ and that there is a constant $c > 0$ so that $|\mathrm{Cov}(X_i, X_{i+1})| < c$ for all $i$. Let $S_n = X_1 + \cdots + X_n$. Then for any fixed $\varepsilon > 0$ we have

$$\lim_{n \to +\infty} P\left[\left|\frac{S_n}{n} - \mu\right| < \varepsilon\right] = 1.$$

1.15 (Adapted from [ASV]) By mimicking the proof of Chebyshev, prove the following variant. Let $X$ be a random variable with a finite mean $\mu$ and for which $E[\exp(s(X - \mu))] < +\infty$ for some $s > 0$. Then we have, for $c > 0$,

$$P(X \geq \mu + c) \leq \frac{E[\exp(s(X - \mu))]}{e^{sc}}.$$

Justify your answer carefully. ⊲

1.16 (Adapted from [ASV]) Recall that the cumulative distribution function (CDF) of a real-valued random variable $Z$ is the function $F_Z(z) = P[Z \leq z]$ for all $z \in \mathbb{R}$. Let $X_1, X_2, \ldots, X_n$ be independent random variables with the same cumulative distribution function $F$. Denote the minimum and the maximum by

$$Z = \min(X_1, X_2, \ldots, X_n) \quad \text{and} \quad W = \max(X_1, X_2, \ldots, X_n).$$

Find the cumulative distribution functions $F_Z$ and $F_W$ of $Z$ and $W$ respectively. ⊲

1.17 (Adapted from [ASV]) Suppose that $X_1, X_2, \ldots$ are i.i.d. random variables with Exponential distribution with parameter $\lambda = 1$ (see Exercise 1.11 above), and let $M_n = \max(X_1, \ldots, X_n)$. Show that for any $x \in \mathbb{R}$ we have

$$\lim_{n \to +\infty} P(M_n - \ln n \leq x) = \exp(-e^{-x}).$$

The right-hand side is the cumulative distribution function of the Gumbel distribution, an example of an extreme value distribution. [Hint: Use Exercise 1.16 to compute $P(M_n \leq \ln n + x)$ explicitly, and then evaluate the limit as $n \to +\infty$.] ⊲
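A simulation illustrating the limit; with $n$ moderately large the empirical frequency should already sit close to the Gumbel CDF value:

In [ ]:
import numpy as np

rng = np.random.default_rng(3)
n, reps, x = 1000, 5000, 0.5
Mn = rng.exponential(size=(reps, n)).max(axis=1)   # reps samples of M_n
empirical = (Mn - np.log(n) <= x).mean()
print(empirical, np.exp(-np.exp(-x)))              # both ~0.545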

1.18 (Adapted from [ASV]) Let $A$ and $B$ be two disjoint events. Under what condition are they independent? ⊲

1.19 (Adapted from [ASV]) We choose a number from the set $\{10, 11, 12, \ldots, 99\}$ uniformly at random.

(a) Let $X$ be the first digit and $Y$ the second digit of the chosen number. Show that $X$ and $Y$ are independent random variables.

(b) Let $X$ be the first digit of the chosen number and $Z$ the sum of the two digits. Show that $X$ and $Z$ are not independent.

1.20 (Adapted from [ASV]) Suppose that $X$ and $Y$ have joint probability density function $f(x, y) = 2e^{-(x+2y)}$ for $x > 0, y > 0$, and $0$ elsewhere. Show that $X$ and $Y$ are independent random variables and provide their marginal distributions. [Hint: Recall the exponential distribution from Exercise 1.11.] ⊲

1.21 (Adapted from [ASV]) Suppose that $X, Y$ have joint probability density function

$$f(x, y) = c \exp\left[-\frac{x^2}{2} - \frac{(x - y)^2}{2}\right],$$

for $(x, y) \in \mathbb{R}^2$, for some constant $c > 0$.

(a) Find the value of the constant $c$. [Hint: The order of integration matters. You can do this without doing complicated integrals. Specifically, recall that for any $\mu \in \mathbb{R}$ and $\sigma > 0$, it holds that

$$\int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(z - \mu)^2}{2\sigma^2}\right) dz = 1. \tag{1}$$

]

(b) Find the marginal density functions of $X$ and $Y$. [Hint: You can do this without doing complicated integrals. Complete the square (see Section 1.3.1.1).]

(c) Determine whether $X$ and $Y$ are independent. Justify your answer.

1.22 (Adapted from [ASV]) Let $p(x, y)$ be the joint probability mass function of $(X, Y)$. Assume that there are two functions $a$ and $b$ such that $p(x, y) = a(x)b(y)$ for all possible values $x$ of $X$ and $y$ of $Y$. Show that $X$ and $Y$ are independent. Do not assume that $a$ and $b$ are probability mass functions. ⊲

1.23 (Adapted from [BHK]) (a) Let $X$ and $Y$ be independent random variables with uniform distribution in $[-1/2, 1/2]$. Compute $E[X]$, $E[X^2]$, $\mathrm{Var}[X]$, $E[X - Y]$, $E[XY]$, and $E[(X - Y)^2]$.

(b) What is the expected squared Euclidean distance between two points generated at random inside the unit $d$-dimensional cube $C = [-1/2, 1/2]^d$?
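A Monte Carlo check for (b); the reference value $d \cdot E[(X - Y)^2] = d/6$ is my own computation from part (a), included as an assumption to test rather than as the answer key:

In [ ]:
import numpy as np

rng = np.random.default_rng(4)
d, N = 5, 10**5
U = rng.uniform(-0.5, 0.5, size=(N, d))
V = rng.uniform(-0.5, 0.5, size=(N, d))
print(np.mean(np.sum((U - V)**2, axis=1)), d/6)   # both ~0.833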

1.24 (Adapted from [BHK]) (a) For an arbitrary $a > 0$ give a probability distribution for a nonnegative random variable $X$ such that

$$P[X \geq a] = \frac{E[X]}{a}.$$

(b) Show that for any $c > 0$ there exists a distribution for which Chebyshev's inequality is tight, that is,

$$P[|X - E[X]| \geq c] = \frac{\mathrm{Var}[X]}{c^2}.$$

[Hint: Choose a distribution symmetric about $0$.]

1.25 (Adapted from [BHK]) Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables with mean $\mu$ and variance $\sigma^2$. Let

$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i,$$

be the sample mean. Suppose one estimates the variance using the sample mean as follows

$$S_n^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X}_n)^2.$$

Calculate $E(S_n^2)$. [Hint: Replace $X_i - \bar{X}_n$ with $(X_i - \mu) - (\bar{X}_n - \mu)$.]

1.26 Compute the inner product of $u = (1, 2, 3, 4)$ and $v = (5, 4, 3, 2)$ without using the function np.dot. Hint: The product of two real numbers $a$ and $b$ is a * b.

In [ ]:
import numpy as np

u = np.array([1., 2., 3., 4.])

# EDIT THIS LINE: define v

# EDIT THIS LINE: compute the inner product between u and v
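One possible completion, as a sketch; summing the element-wise products is just one way to avoid np.dot:

In [ ]:
import numpy as np

u = np.array([1., 2., 3., 4.])
v = np.array([5., 4., 3., 2.])

# inner product as a sum of element-wise products, avoiding np.dot
inner = np.sum(u * v)
print(inner)   # 30.0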

1.27 Let $f$ and $g$ have derivatives at $x$ and let $\alpha$ and $\beta$ be constants. Prove from the definition of the derivative that

$$[\alpha f(x) + \beta g(x)]' = \alpha f'(x) + \beta g'(x).$$

1.28 Recall that the cumulative distribution function (CDF) of a random variable $X$ is defined as

$$F_X(z) = P[X \leq z], \quad \forall z \in \mathbb{R}.$$

(a) Let $\mathcal{Z}$ be the interval where $F_X(z) \in (0, 1)$ and assume that $F_X$ is strictly increasing on $\mathcal{Z}$. Let $U \sim U[0, 1]$. Show that

$$P[F_X^{-1}(U) \leq z] = F_X(z).$$

(b) Generate a sample from $U[a, b]$ for arbitrary $a, b$, using numpy.random and the observation in (a). This is called the inverse transform sampling method.


In [ ]:
import numpy as np
from numpy.random import default_rng

rng = default_rng(535)

a = -1
b = 1

X = rng.random()

# EDIT THIS LINE: transform X to obtain a random variable Y ~ U[a,b]
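One possible completion, as a sketch: for $U[a, b]$ the inverse CDF is $F^{-1}(u) = a + (b - a)u$, so the transform is a single affine map:

In [ ]:
import numpy as np
from numpy.random import default_rng

rng = default_rng(535)
a, b = -1, 1
X = rng.random()          # X ~ U[0,1]
Y = a + (b - a) * X       # inverse CDF of U[a,b]; Y ~ U[a,b]
print(Y)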

1.29 Let $Z_1 \sim N(\mu_1, \sigma_1^2)$ and $Z_2 \sim N(\mu_2, \sigma_2^2)$ be independent Normal variables with means $\mu_1$, $\mu_2$ and variances $\sigma_1^2$, $\sigma_2^2$ respectively. Recall that $Z_1 + Z_2$ is still Normal.

(a) What are the mean and variance of $Z_1 + Z_2$?

(b) Compute $E[\|X_1 - X_2\|^2]$ and $E[\|Y_1 - Y_2\|^2]$ explicitly in the claim of Section 1.4.1.3.


1.30 Suppose we are given $n$ vectors $x_1, \ldots, x_n$ in $\mathbb{R}^d$ and a partition $C_1, \ldots, C_k \subseteq [n]$. Let $n_i = |C_i|$ be the size of cluster $C_i$ and let

$$\mu_i^* = \frac{1}{n_i} \sum_{j \in C_i} x_j$$

be the centroid of $C_i$, for $i = 1, \ldots, k$.

(a) Show that

$$\sum_{j \in C_i} \|x_j - \mu_i^*\|^2 = \left(\sum_{j \in C_i} \|x_j\|^2\right) - n_i \|\mu_i^*\|^2.$$

(b) Show that

$$\|\mu_i^*\|^2 = \frac{1}{n_i^2} \left(\sum_{j \in C_i} \|x_j\|^2 + \sum_{\substack{j, \ell \in C_i \\ j \neq \ell}} x_j^T x_\ell\right).$$

(c) Show that

$$\sum_{\substack{j, \ell \in C_i \\ j \neq \ell}} \|x_j - x_\ell\|^2 = 2(n_i - 1) \sum_{j \in C_i} \|x_j\|^2 - 2 \sum_{\substack{j, \ell \in C_i \\ j \neq \ell}} x_j^T x_\ell.$$

(d) Combine (a), (b), (c) to prove that minimizing the $k$-means objective $G(C_1, \ldots, C_k)$ over all partitions $C_1, \ldots, C_k$ of $[n]$ is equivalent to minimizing instead

$$\sum_{i=1}^k \frac{1}{2|C_i|} \sum_{j, \ell \in C_i} \|x_j - x_\ell\|^2.$$
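A numerical check of the equivalence in (d) for a single cluster; equality of the two quantities on random data is reassuring before working through (a)-(c):

In [ ]:
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((6, 3))        # one cluster of 6 points in R^3
mu = X.mean(axis=0)
lhs = np.sum((X - mu)**2)              # within-cluster sum of squares
D2 = np.sum((X[:, None, :] - X[None, :, :])**2, axis=2)   # pairwise squared distances
rhs = D2.sum() / (2 * len(X))          # pairwise form from (d)
print(np.isclose(lhs, rhs))            # True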

1.31 Modify mmids_kmeans to take a tolerance tol as input and stop when the improvement
in objective value G falls below the tolerance. ⊲
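Since mmids_kmeans itself is not reproduced in this excerpt, the following is only a generic Lloyd-style sketch showing where a tolerance-based stopping rule would go; the function name and structure are illustrative, not the course's actual implementation:

In [ ]:
import numpy as np

def kmeans_with_tol(X, k, maxiter=100, tol=1e-6, seed=535):
    rng = np.random.default_rng(seed)
    reps = X[rng.choice(len(X), k, replace=False)]     # initial representatives
    G_old = np.inf
    for _ in range(maxiter):
        # assignment step: nearest representative for each point
        dists = np.sum((X[:, None, :] - reps[None, :, :])**2, axis=2)
        labels = np.argmin(dists, axis=1)
        G = dists[np.arange(len(X)), labels].sum()     # objective value
        if G_old - G < tol:                            # stop: improvement below tol
            break
        G_old = G
        # update step: recompute centroids of nonempty clusters
        for i in range(k):
            if np.any(labels == i):
                reps[i] = X[labels == i].mean(axis=0)
    return labels, reps, G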

1.32 Suppose the rows of $A \in \mathbb{R}^{n \times m}$ are given by the transposes of $r_1, \ldots, r_n \in \mathbb{R}^m$ and the columns of $A$ are given by $c_1, \ldots, c_m \in \mathbb{R}^n$. Give expressions for the elements of $A^T A$ and $A A^T$ in terms of these vectors. ⊲

References for Chapter 1


Chapter 1 is based on the following references.
[Sol] Solomon, Numerical Algorithms: Methods for Computer Vision, Machine Learning, and
Graphics, CRC, 2015
[Bis] Bishop, Pattern Recognition and Machine Learning, Springer, 2006
[Carp] B. Carpenter, Typical Sets and the Curse of Dimensionality
[BHK] A. Blum, J. Hopcroft, R. Kannan, Foundations of Data Science, Cambridge University
Press, 2020
[VMLS] S. Boyd and L. Vandenberghe. Introduction to Applied Linear Algebra: Vectors, Matrices,
and Least Squares. Cambridge University Press, 2018
[SSBD] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014
[ASV] D. Anderson, T. Seppalainen and B. Valko. Introduction to Probability (Cambridge
Mathematical Textbooks). Cambridge University Press, 2017
