Exercises
(a) Show that the inner product of two nonnegative vectors is necessarily $\geq 0$.

(b) Suppose the inner product of two nonnegative vectors is zero. What can you say about their sparsity patterns, i.e., which of their entries are zero or nonzero?
⊲
1.2 (Adapted from [VLMS]) A Boolean $n$-vector is one for which all entries are either $0$ or $1$. Such vectors are used to encode whether each of $n$ conditions holds, with $a_i = 1$ meaning that condition $i$ holds.

(a) Another common encoding of the same information uses the two values $-1$ and $+1$ for the entries. For example, the Boolean vector $(0, 1, 1, 0)^T$ would be written using this alternative encoding as $(-1, +1, +1, -1)^T$. Suppose that $x$ is a Boolean vector, and $y$ is a vector encoding the same information using the values $-1$ and $+1$. Express $y$ in terms of $x$ using vector notation.

(b) Suppose that $x$ and $y$ are Boolean $n$-vectors. Give a simple word description of their inner product $x^T y$.
1.3 (Adapted from [VLMS]) For a vector $a$, we write $a_{r:s}$ for the subvector of entries $r$ through $s$, i.e., $a_{r:s} = (a_r, \ldots, a_s)^T$. The vector $a_{r:s}$ is called a slice. As a more concrete application, suppose the $n$-vector $x$ represents a time series or signal. The quantity

$$D(x) = (x_2 - x_1)^2 + \cdots + (x_n - x_{n-1})^2,$$

the sum of the squares of the differences of adjacent values of the signal, is called the Dirichlet energy of the signal. The Dirichlet energy is a measure of the roughness or wiggliness of the time series.

(a) Express $D(x)$ in vector notation using slicing. [Hint: Note the similarity between $D(x)$ and the squared distance between two vectors.]

(c) Find a signal $x$ with entries no more than one in absolute value that has the largest possible Dirichlet energy.
1.4 (Adapted from [VLMS]) A vector of length $n$ can represent the number of times each word in a dictionary of $n$ words appears in a document. For example, the word count vector $(25, 2, 0)^T$ means that the first dictionary word appears $25$ times, the second one twice, and the third one not at all. Suppose the $n$-vector $w$ is the word count vector associated with a document and a dictionary of $n$ words. For simplicity we will assume that all words in the document appear in the dictionary.

(a) What is $\mathbf{1}^T w$? Here $\mathbf{1}$ is an all-one vector of the appropriate size.

(c) Let $h$ be the $n$-vector that gives the histogram of the word counts, i.e., $h_i$ is the fraction of the words in the document that are word $i$. Use vector notation to express $h$ in terms of $w$. (You can assume that the document contains at least one word.)
1.5 (Adapted from [VLMS]) Suppose $a$ and $b$ are vectors of the same size. The triangle inequality states that $\|a + b\| \leq \|a\| + \|b\|$. Show that we also have $\|a + b\| \geq \|a\| - \|b\|$.
1.6 (Adapted from [VLMS]) Verify that the following identities hold for any two vectors $a$ and $b$ of the same size.

(a) $(a + b)^T (a - b) = \|a\|^2 - \|b\|^2$.

(b) $\|a + b\|^2 + \|a - b\|^2 = 2(\|a\|^2 + \|b\|^2)$. This is called the Parallelogram Law.
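As a quick numerical sanity check (a sketch on random vectors, which is evidence but of course not a proof), both identities can be tested with numpy:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.standard_normal(5), rng.standard_normal(5)

# (a) (a + b)^T (a - b) = ||a||^2 - ||b||^2
print(np.isclose((a + b) @ (a - b), a @ a - b @ b))  # True

# (b) Parallelogram Law: ||a + b||^2 + ||a - b||^2 = 2(||a||^2 + ||b||^2)
lhs = np.sum((a + b) ** 2) + np.sum((a - b) ** 2)
print(np.isclose(lhs, 2 * (a @ a + b @ b)))          # True
```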
1.7 (Adapted from [VLMS]) The root-mean-square (RMS) value of an $n$-vector $x$ is defined as

$$\mathrm{rms}(x) = \sqrt{\frac{x_1^2 + \cdots + x_n^2}{n}} = \frac{\|x\|_2}{\sqrt{n}}.$$

(a) Show that at least one entry of a vector has absolute value at least as large as the RMS value of the vector.

(b) For an $n$-vector $x$, let

$$\mathrm{avg}(x) = \mathbf{1}^T x / n \quad \text{and} \quad \mathrm{std}(x) = \frac{\|x - \mathrm{avg}(x)\mathbf{1}\|_2}{\sqrt{n}}.$$
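For concreteness, here is a minimal numpy sketch of these definitions; note that np.std uses the same $n$-denominator convention as $\mathrm{std}(x)$ above:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(10)
n = len(x)

rms = np.sqrt(np.sum(x ** 2) / n)                        # sqrt((x_1^2+...+x_n^2)/n)
print(np.isclose(rms, np.linalg.norm(x) / np.sqrt(n)))   # equals ||x||_2 / sqrt(n)

avg = np.ones(n) @ x / n                                 # 1^T x / n
std = np.linalg.norm(x - avg * np.ones(n)) / np.sqrt(n)
print(np.isclose(avg, np.mean(x)), np.isclose(std, np.std(x)))  # True True
```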
1.8 (Adapted from [VLMS]) Suppose that the vectors $x_1, \ldots, x_N$ are clustered using k-means, with group representatives $z_1, \ldots, z_k$.

(a) Suppose the original vectors $x_i$ are nonnegative, i.e., their entries are nonnegative. Explain why the representatives $z_j$ are also nonnegative.

(b) Suppose the original vectors $x_i$ represent proportions, i.e., their entries are nonnegative and sum to one. Explain why the representatives $z_j$ also represent proportions, i.e., their entries are nonnegative and sum to one.
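For intuition, recall that each k-means representative is the average of the vectors in its group. A minimal sketch with made-up data, illustrating that averaging rows of proportions yields a vector of proportions:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.random((6, 4))
X = X / X.sum(axis=1, keepdims=True)   # each row: nonnegative, sums to one

z = X.mean(axis=0)                     # representative of this "cluster"
print(np.all(z >= 0), np.isclose(z.sum(), 1.0))  # True True
```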
1.9 (Adapted from [VLMS]) Clustering a collection of vectors into $k = 2$ groups is called $2$-way partitioning, since we are partitioning the vectors into $2$ groups, with index sets $G_1$ and $G_2$. Suppose the partition is obtained using k-means. Show that there exist a vector $w$ and a scalar $v$ such that

$$w^T x_i + v \geq 0 \ \text{ for all } i \in G_1, \qquad w^T x_i + v \leq 0 \ \text{ for all } i \in G_2.$$

In other words, the affine function $f(x) = w^T x + v$ is greater than or equal to zero on the first group, and less than or equal to zero on the second group. This is called linear separation of the two groups. [Hint: Consider the function $\|x - z_1\|^2 - \|x - z_2\|^2$, where $z_1$ and $z_2$ are the group representatives.]
1.10 (Adapted from [ASV]) Let $0 < p \leq 1$. A random variable $X$ has the geometric distribution with success parameter $p$ if $P(X = k) = (1 - p)^{k-1} p$ for positive integers $k$. Its mean is $1/p$ and its variance is $(1 - p)/p^2$. Let $X$ be a geometric random variable with parameter $p = 1/6$.

(a) Use Markov's inequality to find an upper bound for $P(X \geq 16)$.

(b) Use Chebyshev's inequality to find an upper bound for $P(X \geq 16)$.

(c) Use the sum of a geometric series to explicitly compute the probability $P(X \geq 16)$ and compare with the upper bounds from parts (a) and (b).
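A numerical cross-check of the three quantities (the exercise asks you to derive them by hand; this sketch only confirms the arithmetic):

```python
p = 1 / 6
mean, var = 1 / p, (1 - p) / p ** 2    # E[X] = 6, Var(X) = 30

markov = mean / 16                     # (a) P(X >= 16) <= E[X]/16
chebyshev = var / (16 - mean) ** 2     # (b) P(|X - 6| >= 10) <= Var(X)/10^2
exact = (1 - p) ** 15                  # (c) sum_{k>=16} (1-p)^(k-1) p

print(markov, chebyshev, exact)        # 0.375  0.3  ~0.0649
```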
1.11 (Adapted from [ASV]) Let $0 < \lambda < +\infty$. A random variable $X$ has the exponential distribution with parameter $\lambda$ if it has density $f(x) = \lambda e^{-\lambda x}$ for $x \geq 0$, and $0$ otherwise.

(a) Use Markov's inequality to find an upper bound for $P(X > 6)$.

(b) Use Chebyshev's inequality to find an upper bound for $P(X > 6)$.

(c) Use an integral to explicitly compute the probability $P(X > 6)$ and compare with the upper bounds.
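The analogous cross-check; since no specific $\lambda$ survives in the statement above, lam = 1 below is an arbitrary assumption for illustration:

```python
import math

lam = 1.0                              # assumed value, not from the exercise
mean, var = 1 / lam, 1 / lam ** 2

markov = mean / 6                      # (a) P(X > 6) <= E[X]/6
chebyshev = var / (6 - mean) ** 2      # (b) P(|X - 1/lam| >= 6 - 1/lam) <= Var/(6 - 1/lam)^2
exact = math.exp(-6 * lam)             # (c) integral of lam*e^(-lam*x) over [6, inf)

print(markov, chebyshev, exact)        # ~0.1667  0.04  ~0.0025
```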
1.12 (Adapted from [ASV]) Suppose that $X$ is a nonnegative random variable with $E[X] = 10$.

(a) Use Markov to give an upper bound on the probability that $X$ is larger than $15$.

(b) Suppose that we also know that $\mathrm{Var}(X) = 3$. Use Chebyshev to give a better upper bound on $P(X > 15)$ than in part (a).

(c) Suppose that $Y_1, Y_2, \ldots, Y_{300}$ are i.i.d. random variables with the same distribution as $X$, so that $E[Y_i] = 10$ and $\mathrm{Var}(Y_i) = 3$. Give an upper bound on the probability that $\sum_{i=1}^{300} Y_i$ is larger than $3060$.
1.13 (Adapted from [ASV]) Suppose that we have i.i.d. random variables $X_1, X_2, X_3, \ldots$ with finite mean $E[X_1] = \mu$ and variance $\mathrm{Var}(X_1) = \sigma^2$. Let $S_n = X_1 + \cdots + X_n$. Prove that for any fixed $\varepsilon > 0$ and $1/2 < \alpha < 1$ we have

$$\lim_{n \to +\infty} P\left[\left|\frac{S_n - n\mu}{n^\alpha}\right| < \varepsilon\right] = 1.$$
1.14 (Adapted from [ASV]) By mimicking the proof of the Weak Law of Large Numbers, prove the following variant. Suppose that we have random variables $X_1, X_2, \ldots$, each with finite mean $\mu$, and let $S_n = X_1 + \cdots + X_n$. Then for any fixed $\varepsilon > 0$,

$$\lim_{n \to +\infty} P\left[\left|\frac{S_n}{n} - \mu\right| < \varepsilon\right] = 1.$$
1.15 (Adapted from [ASV]) By mimicking the proof of Chebyshev, prove the following variant. Let $X$ be a random variable with a finite mean $\mu$ and for which $E[\exp(s(X - \mu))] < +\infty$ for some $s > 0$. Then we have, for $c > 0$,

$$P(X \geq \mu + c) \leq \frac{E[\exp(s(X - \mu))]}{e^{sc}}.$$
1.16 (Adapted from [ASV]) Recall that the cumulative distribution function (CDF) of a real-valued random variable $Z$ is the function $F_Z(z) = P[Z \leq z]$ for all $z \in \mathbb{R}$. Let $X_1, X_2, \ldots, X_n$ be independent random variables with the same cumulative distribution function $F$. Denote the maximum by $M_n = \max\{X_1, \ldots, X_n\}$. Express the CDF of $M_n$ in terms of $F$.
1.17 (Adapted from [ASV]) Suppose that $X_1, X_2, \ldots$ are i.i.d. random variables with Exponential distribution with parameter $\lambda = 1$, and let $M_n = \max\{X_1, \ldots, X_n\}$. Show that

$$\lim_{n \to +\infty} P(M_n - \ln n \leq x) = \exp(-e^{-x}).$$

The right-hand side is the cumulative distribution function of the Gumbel distribution, an example of an extreme value distribution. [Hint: Use Exercise 1.16 to compute $P(M_n \leq \ln n + x)$ explicitly, and then evaluate the limit as $n \to +\infty$.] ⊲
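A simulation sketch of this limit (assuming rate-$1$ exponentials, as above): the empirical CDF of $M_n - \ln n$ at a test point should be close to the Gumbel CDF for large $n$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials, x = 1000, 10_000, 0.5

# maximum of n Exponential(1) samples, repeated over many trials
M = rng.exponential(scale=1.0, size=(trials, n)).max(axis=1)
empirical = np.mean(M - np.log(n) <= x)
gumbel = np.exp(-np.exp(-x))

print(empirical, gumbel)  # both ~0.545
```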
1.18 (Adapted from [ASV]) Let $A$ and $B$ be two disjoint events. Under what condition are they independent? ⊲
1.19 (Adapted from [ASV]) We choose a number from the set $\{10, 11, 12, \ldots, 99\}$ uniformly at random.

(a) Let $X$ be the first digit and $Y$ the second digit of the chosen number. Show that $X$ and $Y$ are independent random variables.
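Since the sample space has only 90 equally likely outcomes, the claim in (a) can also be checked by exact enumeration; a minimal sketch:

```python
# Compare P(X = x, Y = y) with P(X = x) * P(Y = y) for every digit pair.
numbers = list(range(10, 100))
N = len(numbers)

def P(pred):
    return sum(1 for m in numbers if pred(m)) / N

ok = all(
    abs(P(lambda m: m // 10 == x and m % 10 == y)
        - P(lambda m: m // 10 == x) * P(lambda m: m % 10 == y)) < 1e-12
    for x in range(1, 10) for y in range(10)
)
print(ok)  # True
```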
1.20 (Adapted from [ASV]) Suppose that $X$ and $Y$ have joint probability density function $f(x, y) = 2e^{-(x+2y)}$ for $x > 0, y > 0$, and $0$ elsewhere. Show that $X$ and $Y$ are independent
random variables and provide their marginal distributions. [Hint: Recall the exponential
distribution from Exercise 1.11.] ⊲
1.21 (Adapted from [ASV]) Suppose that $X, Y$ have joint probability density function

$$f(x, y) = c \exp\left[-\frac{x^2}{2} - \frac{(x - y)^2}{2}\right],$$

for $(x, y) \in \mathbb{R}^2$, for some constant $c > 0$.

(a) Find the value of the constant $c$. [Hint: The order of integration matters. You can do this without doing complicated integrals. Specifically, recall that for any $\mu \in \mathbb{R}$ and $\sigma > 0$, it holds that

$$\int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(z - \mu)^2}{2\sigma^2}\right) dz = 1. \tag{1}$$

]

(b) Find the marginal density functions of $X$ and $Y$. [Hint: You can do this without doing complicated integrals.]
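Identity (1) itself can be spot-checked numerically with scipy (this checks the hint, not the exercise):

```python
import numpy as np
from scipy.integrate import quad

# Integrate the normal density over the real line for a few (mu, sigma) pairs.
for mu, sigma in [(0.0, 1.0), (-2.0, 0.5), (3.0, 2.0)]:
    val, _ = quad(
        lambda z: np.exp(-(z - mu) ** 2 / (2 * sigma ** 2))
                  / np.sqrt(2 * np.pi * sigma ** 2),
        -np.inf, np.inf,
    )
    print(mu, sigma, val)  # val ~ 1.0 in each case
```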
1.22 (Adapted from [ASV]) Let $p(x, y)$ be the joint probability mass function of $(X, Y)$. Assume that it has the form $p(x, y) = a(x)\,b(y)$ for some function $a$ of $x$ and some function $b$ of $y$. Show that $X$ and $Y$ are independent. Do not assume that $a$ and $b$ are probability
mass functions. ⊲
1.23 (Adapted from [BHK]) (a) Let $X$ and $Y$ be independent random variables with uniform distribution in $[-1/2, 1/2]$. Compute $E[X]$, $E[X^2]$, $\mathrm{Var}[X]$, $E[X - Y]$, $E[XY]$, and $E[(X - Y)^2]$.

(b) What is the expected squared Euclidean distance between two points generated at random inside the unit $d$-dimensional cube $C = [-1/2, 1/2]^d$?
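A Monte Carlo sketch for part (b); compare the printed estimate with your closed-form answer (d = 3 here is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(4)
d, trials = 3, 200_000

# two independent uniform points in the cube [-1/2, 1/2]^d
X = rng.random((trials, d)) - 0.5
Y = rng.random((trials, d)) - 0.5
print(np.mean(np.sum((X - Y) ** 2, axis=1)))  # ~ d * E[(X_1 - Y_1)^2] by linearity
```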
1.24 (Adapted from [BHK]) (a) For an arbitrary $a > 0$, give a probability distribution for a nonnegative random variable $X$ such that

$$P[X \geq a] = \frac{E[X]}{a}.$$

(b) Show that for any $c > 0$ there exists a distribution for which Chebyshev's inequality is tight, that is,

$$P[|X - E[X]| \geq c] = \frac{\mathrm{Var}[X]}{c^2}.$$
1.25 Suppose $X_1, \ldots, X_n$ are i.i.d. random variables with finite mean $\mu$ and variance $\sigma^2$. Let

$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$$

be the sample mean. Suppose one estimates the variance using the sample mean as follows:

$$S_n^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2.$$

Compute $E[S_n^2]$.
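As a sanity check on conventions: numpy implements both denominators, and $S_n^2$ as defined above matches the $1/n$ version (ddof=0), not the $1/(n-1)$ version:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.standard_normal(20)
n = len(x)

S2 = np.sum((x - x.mean()) ** 2) / n
print(np.isclose(S2, np.var(x, ddof=0)))  # True: same 1/n convention
print(np.isclose(S2, np.var(x, ddof=1)))  # False: 1/(n-1) differs
```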
1.27 Let $f$ and $g$ have derivatives at $x$ and let $\alpha$ and $\beta$ be constants. Prove from the definition of the derivative that $\alpha f + \beta g$ is differentiable at $x$ with $(\alpha f + \beta g)'(x) = \alpha f'(x) + \beta g'(x)$.
1.28 Recall that the cumulative distribution function (CDF) of a random variable $X$ is defined as $F_X(z) = P[X \leq z]$. Suppose that $F_X$ is invertible, and let $U$ be a uniform random variable on $[0, 1]$.

(a) Show that

$$P[F_X^{-1}(U) \leq z] = F_X(z).$$

(b) Generate a sample from $U[a, b]$ for arbitrary $a$, $b$ using numpy.random and the observation that X = rng.random() returns a sample from $U[a, b]$ with $a = 0$ and $b = 1$.
⊲
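As an illustration of part (a) (not of part (b)'s solution): inverse transform sampling for the exponential distribution of Exercise 1.11, whose CDF $F(x) = 1 - e^{-\lambda x}$ has the explicit inverse $F^{-1}(u) = -\ln(1 - u)/\lambda$:

```python
import numpy as np

rng = np.random.default_rng(6)
lam, trials = 2.0, 100_000

U = rng.random(trials)
X = -np.log(1 - U) / lam       # F_X^{-1}(U) should be Exponential(lam)
print(X.mean(), 1 / lam)       # sample mean ~ true mean 1/lam
```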
1.30 Suppose we are given $n$ vectors $x_1, \ldots, x_n$ in $\mathbb{R}^d$ and a partition $C_1, \ldots, C_k \subseteq [n]$. Let $n_i = |C_i|$ be the size of cluster $C_i$ and let

$$\mu_i^* = \frac{1}{n_i} \sum_{j \in C_i} x_j.$$

Show that

$$\|\mu_i^*\|^2 = \frac{1}{n_i^2} \left( \sum_{j \in C_i} \|x_j\|^2 + \sum_{\substack{j, \ell \in C_i \\ j \neq \ell}} x_j^T x_\ell \right).$$
(d) Combine (a), (b), (c) to prove that minimizing the $k$-means objective $G(C_1, \ldots, C_k)$ over all partitions $C_1, \ldots, C_k$ of $[n]$ is equivalent to minimizing instead

$$\sum_{i=1}^{k} \frac{1}{2|C_i|} \sum_{j, \ell \in C_i} \|x_j - x_\ell\|^2.$$
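Assuming $G$ is the usual k-means objective (the within-cluster sum of squared distances to the cluster means $\mu_i^*$), the identity behind (d), namely that $G$ equals the pairwise expression for any fixed partition, can be spot-checked numerically:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.standard_normal((10, 2))
clusters = [np.array([0, 1, 2, 3]), np.array([4, 5, 6]), np.array([7, 8, 9])]

# G: sum over clusters of squared distances to the cluster mean
G = sum(np.sum((X[C] - X[C].mean(axis=0)) ** 2) for C in clusters)

# pairwise form: (1 / (2|C_i|)) * sum over all pairs j, l in C_i
pairwise = sum(
    np.sum((X[C][:, None, :] - X[C][None, :, :]) ** 2) / (2 * len(C))
    for C in clusters
)
print(np.isclose(G, pairwise))  # True
```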
1.31 Modify mmids_kmeans to take a tolerance tol as input and stop when the improvement
in objective value G falls below the tolerance. ⊲
Express $AA^T$ in terms of these vectors. ⊲