Tensor Density Estimator by Convolution-Deconvolution
Department of Applied and Computational Mathematics and Statistics, University of
Notre Dame
Abstract
We propose a linear algebraic framework for performing density estimation. It consists of
three simple steps: convolving the empirical distribution with certain smoothing kernels to
remove the exponentially large variance; compressing the empirical distribution after convolution
as a tensor train, using efficient tensor decomposition algorithms; and finally, applying a
deconvolution step to recover the estimated density from such a tensor-train representation.
Numerical results demonstrate the high accuracy and efficiency of the proposed methods.
1 Introduction
Density estimation is a fundamental problem in data science with wide-ranging applications in
fields such as science (Cranmer (2001); Chen (2017)), engineering (Chaudhuri et al. (2014)), and
economics (Zambom and Dias (2013)). More precisely, given N data points {x^{(i)}}_{i=1}^N ∼ p^*, where
each x^{(i)} is an independent and identically distributed (i.i.d.) sample from an unknown ground
truth distribution p^* : R^d → R, the goal is to estimate p^* based on the empirical distribution
\hat{p}(x) = \frac{1}{N}\sum_{i=1}^{N}\delta_{x^{(i)}}(x),   (1)
where δ_{x^{(i)}}(x) is the Dirac delta function centered at x^{(i)}. In general, one cannot use the
empirical distribution as an estimator of the density p^*, due to the lack of absolute continuity or
the exponentially large variance.
large variance in p̂ is suppressed. Additionally, we leverage tensor-train (TT) algebra to perform
this projection with linear computational complexity in terms of both dimensionality and sample
size. This approach also yields a final density estimator in the form of a tensor train. Furthermore,
we theoretically show that, in addition to achieving linear scaling in computational complexity,
the estimation error also enjoys linear scaling under certain assumptions. Numerical experiments
demonstrate that our estimator outperforms deep learning methods on several tasks, including
Boltzmann sampling and generative modeling for real-world datasets. Here we list two remarkable
results in Figure 1. More details on the numerical experiments can be found in Section 9.
Figure 1: Left: density estimation for data sampled from a Boltzmann distribution. The figure
visualizes the second-moment error for different dimensionalities d. We compare the result of one of
our proposed tensor-train-based methods, TT-SVD-kn (see Section 4), with that of a neural
network approach (Masked Autoregressive Flow); Right: images generated by another proposed
algorithm, TT-rSVD-t (see Section 6), on the MNIST dataset.
where µ(x) = µ_1(x_1) ⋯ µ_d(x_d) is a mean-field density, i.e., a separable density, {φ^j_l}_{l=1}^n are orthogonal
basis functions with respect to µ_j with φ^j_1 ≡ 1, i.e.,
\int \phi^j_{l_1}(x_j)\,\phi^j_{l_2}(x_j)\,\mu_j(dx_j) = \delta_{l_1,l_2},   (3)
and b^{j_1}_{l_1}, b^{j_1 j_2}_{l_1 l_2} are the corresponding coefficients. The indices (j_1, j_2) ∈ \binom{[d]}{2} denote all unordered pairs
density µ, where each term in the perturbation series involves only a few variables (up to two
variables in (2)). Although it is straightforward to generalize (2) to an expansion that includes
higher-order interactions involving more than two variables with decaying coefficients, for clarity of
presentation, we use (2) as our default model. Such an expansion is common in perturbation and
cluster expansion theory in statistical mechanics.
Based on this setting, we discuss several estimators with increasing levels of sophistication. The
first approach to estimate p^* is based on a simple and natural subspace projection:
\tilde{p} = \mu \odot \big(\Phi\,\mathrm{diag}(w_K)\,\Phi^T \hat{p}\big).   (4)
For notational simplicity, we treat all functions and operators above as discrete matrices and vectors:
we regard the concatenation of basis functions
Φ := [Φl (x)]x,l
as a matrix with rows indexed by x and columns indexed by l. The operation ΦT p̂ corresponds to
convolution between basis functions and p̂, i.e., summing or integrating out the variable x. The
vector wK , indexed by l, contains the weights associated with the basis functions and depends on a
chosen positive integer K. diag(·) is the operator producing diagonal matrix with the elements of
the input vector on the main diagonal. The mean-field µ is also treated as a vector, indexed by x,
and ⊙ denotes the Hadamard (entrywise) product. Intuitively, the operator µ ⊙ (Φ diag(w_K)Φ^T ·)
resembles a projector that projects p̂ onto a function space that only involves a small number of
variables determined by the chosen integer K. More precisely, let the complete (up
to a certain numerical precision) set of k-variable basis functions be defined as:
B_{k,d} = \big\{\, \phi^{j_1}_{l_1}(x_{j_1}) \cdots \phi^{j_k}_{l_k}(x_{j_k}) \;\big|\; 1 \le j_1 < \cdots < j_k \le d \text{ and } 2 \le l_1, \cdots, l_k \le n \,\big\},   (5)
which we refer to as the k-cluster basis throughout the paper. In particular, B_{0,d} = {1} only
contains the constant basis. Furthermore, we say a function is k-cluster if it can be expanded by
up to k-cluster basis. For example, the perturbed part of p∗ in (2) is 2-cluster. Based on these
definitions, Φ can be formulated as the concatenation of the k-cluster basis across different k:
which is an indicator vector that selects the 0- to K-cluster basis. With these definitions, one can
verify that if p^* takes the form of (2), and if the univariate basis functions {φ^j_l}_{l=1}^n satisfy the
orthogonality condition (3), then we can recover
p^* = \mu \odot \big(\Phi\,\mathrm{diag}(w_K)\,\Phi^T p^*\big)   (8)
as long as K ⩾ 2. This fact motivates the proposed density estimator in (4). Moreover, unlike p̂,
whose variance scales as O(n^d), Lemma 1 shows that the variance of p̃ in (4) is significantly reduced,
scaling only polynomially, as O(d^K), due to the projection onto a lower-dimensional
function space. This substantial reduction in variance is a key advantage of our method.
While the estimator in (4) is intuitive, it is possible to further reduce the variance by impos-
ing additional assumptions on p∗ . Specifically, we assume that the low-order cluster coefficients
diag(wK )ΦT p∗ can be represented as a low-rank tensor train. The details of such a representation
are provided in Section 1.5. Under this assumption, our estimation procedure can be modified as:
where we apply a tensor-train compression algorithm, denoted TT-compress, to the tensor
diag(w_K)Φ^T p̂. The singular value thresholding involved in the compression procedure (TT-compress)
can further suppress noise in diag(w_K)Φ^T p̂, leading to a more accurate density estimator.
While the estimator in (9) is expected to achieve a lower variance, a major drawback is that
the projection diag(w_K)Φ^T p̂ onto the cluster basis up to order K has a computational complexity
of O(d^K n^K), corresponding to the number of cluster basis functions. This scaling quickly becomes
prohibitive as d and K increase. To address this computational bottleneck, we propose a final
estimator, which can be viewed as a more efficient variant of (9):
challenging areas of research. Furthermore, for some of these models (e.g., energy-based models),
generating i.i.d. samples can still be expensive. Methods like GANs can generate samples efficiently,
but they do not provide a way to evaluate the value of the density at a given point.
In short, it is difficult for a single method to enjoy all the desirable properties: being easy to train,
generating samples easily, supporting evaluation of density values, and admitting estimation without
the curse of dimensionality.
Tensor train from partial sensing of a function. The tensor train, also known as the matrix
product state (MPS) (White (1993); Perez-Garcia et al. (2006)) in the physics literature, has
recently been used to recover a function when only a limited number of observations of this
function are given. Methods for this include TT-cross (Oseledets and Tyrtyshnikov (2010)), DMRG-
cross (Savostyanov and Oseledets (2011)), TT completion (Liu et al. (2012); Song et al. (2019)),
and others (Sun et al. (2020); Kressner et al. (2023); Shi et al. (2023)). It has broad applications in
high-dimensional scientific computing problems (Kressner and Uschmajew (2016); Bachmayr et al.
(2016)) and machine learning (Stoudenmire and Schwab (2016); Huggins et al. (2019); Sengupta
et al. (2022)).
Tensor train for density estimation. In a density estimation setting, we are not given
observed entries of a density function, but only an empirical distribution of it. The key challenge
lies in the randomness of the observed object, i.e., the empirical histogram, whose variance is
exponentially large in the dimensionality. Therefore, reducing variance in a high-dimensional
setting is crucial for designing effective algorithms. Recently, Hur et al. (2023) and
subsequent studies (Tang et al. (2022); Peng et al. (2023); Chen and Khoo (2023); Tang and Ying
(2023); Khoo et al. (2024)) employ sketching techniques, which originate from the randomized
linear algebra literature, in the tensor decomposition step of density estimation. However, most
existing algorithms in these works do not have a general way to ensure a computational complexity
that is linear in the dimensionality when sketching. Another approach to density estimation
with tensors is through optimization-based methods. Bradley et al. (2020); Han et al.
(2018); Novikov et al. (2021) implement gradient descent methods and optimize tensor-network
representations according to some loss functions. However, the performance strongly depends on the
initialization and often suffers from local minima due to the highly non-convex nature
of the optimization landscape. Furthermore, these iterative procedures require multiple passes
over the N data points, resulting in higher computational complexity compared to the aforementioned
tensor decomposition algorithms.
1.4 Organization
Here we summarize the organization of the paper. In Section 1.5, we introduce the tensor
notation and diagram conventions used throughout. In Section 2, we present the core ideas behind
our proposed “convolution–deconvolution” framework, highlighting its three key steps. The central
component is the TT-compress algorithm discussed previously, which compresses cluster coefficients
into a tensor train. Sections 3–6 introduce several such TT-compress algorithms: Section
3 presents a baseline algorithm based on sequential truncated singular value decomposition (SVD),
along with its fast implementation for density estimation problems. Section 4 presents an algorithm
that utilizes the Nyström approximation to further accelerate the algorithm of Section 3. In Section
5, we study an algorithm that performs the SVD efficiently by truncating the elements in the tensor,
and then introduce a further acceleration by employing a hierarchical matrix factorization. Section
6 proposes an algorithm utilizing random sketching matrices with low-rank tensor structures. Section
7 presents a pre-processing procedure based on principal component analysis (PCA), enabling
our compression algorithms to handle very high-dimensional problems. Theoretical analysis for the
algorithms is provided in Section 8. In Section 9, we showcase comprehensive numerical results
to demonstrate the high accuracy and efficiency of our proposed algorithms. Finally, in Section
10, we summarize our findings.
Based on the definition of a tensor, we list several tensor operations that will be used frequently
in the paper:
1. “◦” contraction. We use ◦ to represent the contraction of tensor cores, (A ◦ B)(i_1, i_2, j_2) =
\sum_k A(i_1, i_2, k)\,B(k, j_2). This operation is denoted by the diagram in Figure 2(b), which combines
the last leg of A and the first leg of B. The resulting C = A ◦ B is a three-dimensional tensor.
where G_1 ∈ R^{n_1×r_1}, G_2 ∈ R^{r_1×n_2×r_2}, ⋯, G_{d−1} ∈ R^{r_{d−2}×n_{d−1}×r_{d−1}}, G_d ∈ R^{r_{d−1}×n_d} are called tensor
cores and {r_j}_{j=1}^{d−1} are the ranks. The TT representation can be visualized as the left panel of Figure
3. Throughout the paper, we denote the largest rank by
r = \max_j r_j.
Then the memory cost for storing A in TT form is O(dnr^2), which has no curse of dimensionality
if r is small. In addition, a TT can also be used to represent a continuous function f(x_1, ⋯, x_d)
formulated as a tensor product basis function expansion:
f(x_1, x_2, \cdots, x_d) = \sum_{l_1, l_2, \cdots, l_d = 1}^{n} A(l_1, l_2, \cdots, l_d)\,\phi_{l_1}(x_1)\,\phi_{l_2}(x_2) \cdots \phi_{l_d}(x_d),
where the coefficient A takes the TT format as in (12). Such a functional tensor-train representation
is illustrated in the right panel of Figure 3.
Figure 3: Diagram for tensor-train representation. Left: discrete case. Right: continuous case.
2 Main approach
In this section, we detail the framework of our main approach. In order to realize (10), we perform
the following three steps:
1. Convolution. Project the empirical distribution p̂ to “coefficients” ĉ under a certain basis
expansion with O(dN) computational cost:
\hat{c} = \mathrm{diag}(\tilde{w}_\alpha)\,\Phi^T \hat{p} = \mathrm{diag}(\tilde{w}_\alpha) \int \Phi(x)^T \hat{p}(x)\,dx.   (13)
7
3. Deconvolution. Get the desired density estimator based on the TT coefficient c̃ with O(dN)
computational cost:
\tilde{p} = \mu \odot \big(\Phi\,\mathrm{diag}(\tilde{w}_\alpha)^{-1}\tilde{c}\big).   (14)
Here p̃ is also a TT, if Φ diag(w̃_α)^{-1} has an appropriate low-rank tensor structure.
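To make the deconvolution step (14) concrete, the following minimal sketch evaluates a fitted tensor-train estimator at a single point, assuming the cores of c̃ are stored as three-way arrays with dummy boundary ranks of size one and that callables for the univariate basis functions and the mean-field marginals are supplied; all function and argument names are illustrative, not part of the paper.

```python
import numpy as np

def evaluate_tt_density(x, cores, basis_fns, mu_fns, alpha):
    """Evaluate the deconvolved estimator (14) at a single point x:
    p(x) = mu(x) * sum_l c_tilde(l) w_alpha(l)^{-1} phi_{l_1}(x_1) ... phi_{l_d}(x_d).
    `cores[j]` has shape (r_{j-1}, n, r_j) with r_0 = r_d = 1; `basis_fns[j](x_j)` returns
    the length-n vector of basis values and `mu_fns[j](x_j)` the j-th mean-field marginal."""
    d = len(cores)
    n = cores[0].shape[1]
    # rank-1 deconvolution weights diag(1, 1/alpha, ..., 1/alpha) applied coordinate-wise
    w_inv = np.concatenate(([1.0], np.full(n - 1, 1.0 / alpha)))
    left = np.ones(1)          # running contraction over the first j coordinates
    mu_val = 1.0
    for j in range(d):
        phi = basis_fns[j](x[j]) * w_inv
        left = left @ np.einsum('anb,n->ab', cores[j], phi)
        mu_val *= mu_fns[j](x[j])
    return mu_val * left.item()
```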
Overall, with proper choices of TT-compress, our approach achieves a computational complexity
that scales linearly with both the dimensionality d and the sample size N, demonstrating its high
efficiency. In the rest of this section, we further elaborate on the convolution (Step 1) and
deconvolution (Step 3). We defer the details of tensor compression (Step 2) to Sections 3–6.
where ‖·‖_0 is the zero norm counting the number of non-zero entries of a vector, and 1 is the all-one
d-dimensional vector. ‖l − 1‖_0 is then the number of non-constant univariate basis functions among
φ^{j_1}_{l_1}(x_{j_1}) ⋯ φ^{j_k}_{l_k}(x_{j_k}). Under such a choice, diag(w̃_α) can be formulated as a rank-1 tensor operator,
allowing fast convolution (and also fast deconvolution later) with only O(dN) computational
cost. In fact, given samples {x^{(i)}}_{i=1}^N, ĉ is formulated as
\hat{c} = \frac{1}{N}\sum_{i=1}^{N} \tilde{\Phi}_1^{(i)} \otimes \tilde{\Phi}_2^{(i)} \otimes \cdots \otimes \tilde{\Phi}_d^{(i)}.   (16)
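As an illustration of (16), the sketch below forms the per-sample factor matrices Φ̃_j^{(i)} for the special case µ = Unif([0,1]^d) with orthonormal shifted Legendre polynomials (the setting of Lemma 1); the soft-thresholding weights enter as the rank-one factor diag(1, α, ⋯, α) of Figure 4. The helper names and the choice of basis are assumptions made only for this example.

```python
import numpy as np

def legendre_basis(x, n):
    """Orthonormal shifted Legendre polynomials on [0, 1] evaluated at the points in x."""
    t = 2.0 * np.asarray(x, dtype=float) - 1.0
    P = np.zeros((t.size, n))
    P[:, 0] = 1.0
    if n > 1:
        P[:, 1] = t
    for l in range(1, n - 1):                      # three-term recurrence
        P[:, l + 1] = ((2 * l + 1) * t * P[:, l] - l * P[:, l - 1]) / (l + 1)
    return P * np.sqrt(2 * np.arange(n) + 1.0)     # normalize w.r.t. Unif([0, 1])

def sample_factors(X, n, alpha):
    """Per-sample factors of eq. (16): the i-th row of the j-th factor matrix is
    diag(1, alpha, ..., alpha) applied to [phi_1(x_j^{(i)}), ..., phi_n(x_j^{(i)})]."""
    N, d = X.shape
    w = np.concatenate(([1.0], alpha * np.ones(n - 1)))    # soft-thresholding weights
    return [legendre_basis(X[:, j], n) * w for j in range(d)]   # list of d arrays, each N x n
```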
Figure 4: Tensor diagram of ĉ = diag(w̃α )ΦT p̂. “α” denotes diagonal matrix diag(1, α, · · · , α) ∈
Rn×n ; “ϕj ” denotes “matrix” [ϕjl (x)]x,l .
Lemma 1. Suppose {φ_l}_{l=1}^n are Legendre polynomials and the ground truth density p^* satisfies (8)
with the mean-field component µ ∼ Unif([0,1]^d). Then the following variance estimate holds:
\mathbb{E}\,\|\Phi\,\mathrm{diag}(w_K)\Phi^T \hat{p} - \Phi\,\mathrm{diag}(w_K)\Phi^T p^*\|_{L^2}^2 = O\!\left(\frac{d^K n^{2K}}{N}\right).   (18)
Proof. Note that the error ‖diag(w_K)Φ^T p̂ − diag(w_K)Φ^T p^*‖_F^2 has at most \binom{d}{K} n^K terms, where
each term has the form
\left(\int \Big(\prod_{k=1}^{K} \phi^{j_k}_{l_k}(x_{j_k})\Big)\big(\hat{p}(x) - p^*(x)\big)\,dx\right)^2   (19)
for indices (j_1, ⋯, j_K) ∈ \binom{[d]}{K} and l_1, ⋯, l_K ∈ {2, ⋯, n}. Since the Legendre polynomials satisfy
‖φ^{j_k}_{l_k}‖_∞ ⩽ √n, we have
\mathbb{E}\left(\int \Big(\prod_{k=1}^{K} \phi^{j_k}_{l_k}(x_{j_k})\Big)\big(\hat{p}(x) - p^*(x)\big)\,dx\right)^2 = \frac{1}{N}\int \Big(\prod_{k=1}^{K} \phi^{j_k}_{l_k}(x_{j_k})\Big)^2 p^*(x)\,dx \le \frac{n^K}{N}\int p^*(x)\,dx = \frac{n^K}{N}.   (20)
Therefore, the total variance is bounded by
\mathbb{E}\,\|\Phi\,\mathrm{diag}(w_K)\Phi^T \hat{p} - \Phi\,\mathrm{diag}(w_K)\Phi^T p^*\|_{L^2}^2 = \mathbb{E}\,\|\mathrm{diag}(w_K)\Phi^T \hat{p} - \mathrm{diag}(w_K)\Phi^T p^*\|_F^2 \le \binom{d}{K} n^K \cdot \frac{n^K}{N} = O\!\left(\frac{d^K n^{2K}}{N}\right).   (21)
On the other hand, the variance of the soft-thresholding approach under the model assumption
in Lemma 1 is estimated by
\mathbb{E}\,\|\Phi\,\mathrm{diag}(\tilde{w}_\alpha)\Phi^T \hat{p} - \Phi\,\mathrm{diag}(\tilde{w}_\alpha)\Phi^T p^*\|_{L^2}^2
Figure 5: Tensor diagram for the final density estimator p̃. “α−1 ” denotes diagonal matrix
diag(1, 1/α, · · · , 1/α) ∈ Rn×n ; “ϕj ” denotes “matrix” [ϕjl (x)]x,l .
A = C U^{+} R,
where C ∈ R^{m×r} and R ∈ R^{r×n} respectively consist of r selected columns and rows, U is the
intersection (cross) submatrix formed by the selected rows and columns, and U^{+} is the Moore–
Penrose generalized inverse of U. To ensure a good approximation, the columns and rows are
typically chosen to maximize the determinant of U. Repeating the procedure for all dimensions,
we eventually obtain a tensor-train decomposition. Further details will be discussed in Section 4.
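As a plain-matrix illustration of the cross (CUR) approximation just described, the following sketch builds A ≈ C U⁺ R from user-supplied row and column index sets; how those indices are chosen (e.g., to approximately maximize |det U|) is left out, and the function name is ours.

```python
import numpy as np

def cur_approximation(A, row_idx, col_idx):
    """Cross (CUR) approximation A ~ C U^+ R from chosen row/column index sets."""
    C = A[:, col_idx]                    # selected columns
    R = A[row_idx, :]                    # selected rows
    U = A[np.ix_(row_idx, col_idx)]      # intersection (cross) submatrix
    return C @ np.linalg.pinv(U) @ R
```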
Based on the notations above, we first unfold the weighted projection tensor diag(w̃_α)Φ^T p̂ as
\big(\mathrm{diag}(\tilde{w}_\alpha)\Phi^T \hat{p}\big)^{(1)} = \begin{bmatrix} 1 & \alpha\,[\hat{q}^{(0)}_{j}]_{j=2}^{d} & \alpha^2\,[\hat{q}^{(0)}_{j,j'}]_{j,j'=2}^{d} & \cdots \\ \alpha\,\hat{q}^{(1)}_{1} & \alpha^2\,[\hat{q}^{(1)}_{1,j}]_{j=2}^{d} & \alpha^3\,[\hat{q}^{(1)}_{1,j,j'}]_{j,j'=2}^{d} & \cdots \end{bmatrix} =: B_1   (27)
in terms of the order of α, where we use [·] to denote the concatenation of the enclosed terms
along the columns. Since the ground truth density p^* is mean-field (rank-1), we choose only r = 1
column and row for the CUR approximation of (diag(w̃_α)Φ^T p̂)^{(1)}. With α chosen to be sufficiently
small, the (1, 1)-entry is the largest in B_1. Therefore, the first row and first column are chosen for the
CUR approximation:
B_1 \approx \begin{bmatrix} 1 \\ \alpha\,\hat{q}_1 \end{bmatrix} \underbrace{\begin{bmatrix} 1 & \alpha\,[\hat{q}^{(0)}_{j}]_{j=2}^{d} & \alpha^2\,[\hat{q}^{(0)}_{j,j'}]_{j,j'=2}^{d} & \cdots \end{bmatrix}}_{=:B_2}.   (28)
The left vector \begin{bmatrix} 1 \\ \alpha\,\hat{q}_1 \end{bmatrix} is the resulting first TT core (before normalization). Next, we continue to
implement the cross approximation on the unfolding of the remaining tensor B_2. The first row and first
column are again chosen following the same logic:
B_2 = \begin{bmatrix} 1 & \alpha\,[\hat{q}^{(0)}_{j}]_{j=3}^{d} & \alpha^2\,[\hat{q}^{(0)}_{j,j'}]_{j,j'=3}^{d} & \cdots \\ \alpha\,\hat{q}^{(1)}_{2} & \alpha^2\,[\hat{q}^{(1)}_{2,j}]_{j=3}^{d} & \alpha^3\,[\hat{q}^{(1)}_{2,j,j'}]_{j,j'=3}^{d} & \cdots \end{bmatrix} \approx \begin{bmatrix} 1 \\ \alpha\,\hat{q}_2 \end{bmatrix} \underbrace{\begin{bmatrix} 1 & \alpha\,[\hat{q}^{(0)}_{j}]_{j=3}^{d} & \alpha^2\,[\hat{q}^{(0)}_{j,j'}]_{j,j'=3}^{d} & \cdots \end{bmatrix}}_{=:B_3}.   (29)
We repeat this procedure until decomposing the tensor into the product of d rank-1 cores:
\mathrm{diag}(\tilde{w}_\alpha)\Phi^T \hat{p} \approx \begin{bmatrix} 1 \\ \alpha\,\hat{q}_1 \end{bmatrix} \otimes \begin{bmatrix} 1 \\ \alpha\,\hat{q}_2 \end{bmatrix} \otimes \cdots \otimes \begin{bmatrix} 1 \\ \alpha\,\hat{q}_d \end{bmatrix} =: \text{TT-compress}\big(\mathrm{diag}(\tilde{w}_\alpha)\Phi^T \hat{p}\big).   (30)
Furthermore, we estimate the difference between the representation in (30) and the hard-truncation
representation TT-compress(diag(w_K)Φ^T p̂) in (9); the difference is the summation of all k-cluster
terms with k > K. Since the ground truth density p^* is mean-field, the expectation of each empirical
one-marginal satisfies E[q̂_j] = 0. Moreover, Var[q̂_j] = O(\frac{n-1}{N}) for each j, due to the orthogonality
of {φ^j_l}_{l=1}^n with respect to µ. Therefore, each k-cluster term admits the upper bound
α^k ‖q̂_{j_1} ⊗ ⋯ ⊗ q̂_{j_k}‖_F ⩽ Cα^k (n/N)^{k/2} for some constant C > 0. Then, compared to the
hard-thresholding K-cluster estimator in (9), the difference of the soft-thresholding estimator (10)
is upper bounded by
\big\| \text{TT-compress}\big(\mathrm{diag}(w_K)\Phi^T \hat{p}\big) - \text{TT-compress}\big(\mathrm{diag}(\tilde{w}_\alpha)\Phi^T \hat{p}\big) \big\|_F \le C \sum_{k=K+1}^{d} \binom{d}{k}\,\alpha^k \left(\frac{n}{N}\right)^{k/2} \le C \sum_{k=K+1}^{d} \left(\frac{\alpha d e}{K}\sqrt{\frac{n}{N}}\right)^{k} \le C d \left(\frac{\alpha d e}{K}\sqrt{\frac{n}{N}}\right)^{K+1}   (31)
when \frac{\alpha d e}{K}\sqrt{\frac{n}{N}} < 1 for sufficiently large N. The difference between the two estimators is thus
polynomially small with respect to the sample size N. This provides a clear example demonstrating
the reliability of the soft-thresholding approach (10).
find the approximation c̃ in TT format of ĉ depicted in Figure 4, which can be obtained following
the procedures in Algorithm 1.
Algorithm 1 TT-SVD
INPUT: Tensor ĉ ∈ R^{n×n×⋯×n}. Target ranks {r_1, ⋯, r_{d−1}}.
1: for j ∈ {1, ⋯, d−1} do
2: Perform SVD on B_j ∈ R^{nr_{j−1}×n^{d−j}}, which is formed by the following tensor contraction:
[Tensor diagram: B_j contracts the previously obtained cores Ĝ_1, ⋯, Ĝ_{j−1} with ĉ over the legs l_1, ⋯, l_{j−1}, leaving row legs (γ_{j−1}, l_j) and column legs l_{j+1:d}.]
In particular, B_1 = ĉ^{(1)}, the first unfolding matrix of ĉ defined in (11). With the SVD
B_j = U_j Σ_j V_j^T, we obtain the j-th core Ĝ_j ∈ R^{r_{j−1}×n×r_j}, which is reshaped from the first r_j
principal left singular vectors in U_j.
3: end for
4: Obtain the last core Ĝ_d = B_d ∈ R^{r_{d−1}×n}.
OUTPUT: Tensor train c̃ = Ĝ_1 ◦ Ĝ_2 ◦ ⋯ ◦ Ĝ_d ∈ R^{n×n×⋯×n}.
Below, we briefly outline the idea behind this algorithm. When j = 1, we obtain the
first core Ĝ_1 by performing SVD on ĉ^{(1)} (which is B_1 in Algorithm 1); we let Ĝ_1 ∈ R^{n×r_1} be
the first r_1 principal left singular vectors. The rationale of this first step is to approximate ĉ^{(1)} as
ĉ^{(1)} ≈ Ĝ_1(Ĝ_1^T ĉ^{(1)}). In the j = 2 step, we obtain Ĝ_2, which gives a further approximation
to Ĝ_1(Ĝ_1^T ĉ^{(1)}). We do this by factorizing Ĝ_1^T ĉ^{(1)} ∈ R^{r_1×n^{d−1}}, applying SVD to B_2,
which is the reshaping of Ĝ_1^T ĉ^{(1)}. This completes the j = 2 step. We then carry out the above
procedure recursively until j = d − 1. After implementing the sequential SVD d − 1 times, the
remaining B_d ∈ R^{r_{d−1}×n} is a matrix, and we take it as the final core in the tensor-train
representation since no further decomposition is required.
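For reference, here is a minimal numpy sketch of Algorithm 1 applied to an explicitly stored coefficient tensor; it is meant only to mirror the sequential-SVD idea above, whereas the fast implementations in the following sections never form ĉ explicitly.

```python
import numpy as np

def tt_svd(c_hat, ranks):
    """Algorithm 1 (TT-SVD) on an explicit tensor c_hat of shape (n, n, ..., n).
    `ranks` = [r_1, ..., r_{d-1}]; the first core carries a dummy leading mode of size 1."""
    n = c_hat.shape[0]
    cores = []
    B = c_hat.reshape(n, -1)                 # B_1 = first unfolding c_hat^{(1)}
    r_prev = 1
    for r in ranks:
        B = B.reshape(r_prev * n, -1)        # B_j in R^{(n r_{j-1}) x n^{d-j}}
        U, _, _ = np.linalg.svd(B, full_matrices=False)
        U = U[:, :r]                         # first r_j principal left singular vectors
        cores.append(U.reshape(r_prev, n, r))
        B = U.T @ B                          # remainder passed to the next step
        r_prev = r
    cores.append(B.reshape(r_prev, n))       # last core G_d = B_d
    return cores
```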
Naively, the computational cost of the SVDs in Algorithm 1 scales exponentially in the dimensionality,
as O(dn^{d+1}). However, for the density estimation problem, we can take advantage of
the structure of ĉ as depicted in Figure 4 to achieve a fast implementation of Algorithm 1. We
elaborate on this fast implementation in the next subsection.
nodes in Figure 6. Fortunately, by using the structure of ĉ as shown in Figure 4, one can form
Bj BjT much more efficiently. This is shown in Line 2 of Algorithm 2, which summarizes the TT-SVD
procedure that is implemented in a computationally feasible manner.
[Figure 6: tensor diagram expressing B_j B_j^T as a contraction of ĉ with itself and the cores Ĝ_1, ⋯, Ĝ_{j−1}.]
In Algorithm 2, the main computation lies in forming D_j and E_j in Line 2. A naive way
is to compute them independently, leading to O(dN^2) complexity. However, D_j and E_j can be
computed recursively and efficiently, reusing earlier computations. More specifically, one
can precompute E_j by
E_j = E_{j+1} \odot (\tilde{\Phi}_j \tilde{\Phi}_j^T),   (32)
starting from E_d = I_{N×N}, the N × N identity matrix. D_j is computed recursively based
on the core Ĝ_{j−1}, starting from D_1 = Φ̃_1^T, according to Figure 7. The last core Ĝ_d is assigned as the
matrix B_d in Algorithm 1, which equals the average of D_d^{(i)} over the sample index i; we therefore
compute Ĝ_d as the average of the D_d^{(i)} in Algorithm 2. In this way, the computational cost of E_j, D_j
and B_j B_j^T is O(N^2 n r), which has no dependency on the dimensionality. Afterwards, one can solve for the
core by an eigen-decomposition of B_j B_j^T ∈ R^{nr_{j−1}×nr_{j−1}} with a minor cost of O((nr)^3). Overall, the
computational cost of this fast implementation is O(dN^2 n r).
While this fast implementation significantly reduces the computational complexity from exponential
to linear in d, we point out that it still requires O(N^2) computations to
form each E_j, which can be challenging for a large number of samples even in low-dimensional
cases. In Section 4, we will further improve Algorithm 2 using a kernel Nyström approximation,
which reduces the computational cost to O(dN).
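The recursion above can be summarized in a short numpy sketch. It consumes the per-sample factor matrices of (16) and forms A_j = B_jB_j^T = (1/N²) D_j E_j D_j^T, with the right Gram matrices E_j built directly from their definition in Figure 9 (entrywise products over the variables j+1, ⋯, d); this is an illustrative re-implementation under those assumptions, not the authors' reference code.

```python
import numpy as np

def fast_tt_svd(factors, ranks):
    """Fast TT-SVD from samples (in the spirit of Algorithm 2).
    `factors[j]` is the N x n matrix whose i-th row is Phi_tilde_{j+1}^{(i)} of eq. (16);
    `ranks` = [r_1, ..., r_{d-1}]."""
    d = len(factors)
    N, n = factors[0].shape
    # right Gram matrices E_j = (Phi_{j+1} Phi_{j+1}^T) * ... * (Phi_d Phi_d^T), as in Figure 9
    E = {d: np.ones((N, N))}
    for j in range(d - 1, 0, -1):
        Phi = factors[j]                              # Phi_tilde_{j+1}
        E[j] = E[j + 1] * (Phi @ Phi.T)

    cores = []
    D = factors[0].T                                  # D_1 = Phi_tilde_1^T, shape n x N
    r_prev = 1
    for j in range(1, d):
        r = ranks[j - 1]
        A = (D @ E[j] @ D.T) / N**2                   # A_j = (1/N^2) D_j E_j D_j^T
        _, eigvec = np.linalg.eigh(A)
        U = eigvec[:, ::-1][:, :r]                    # top-r eigenvectors give core G_j
        cores.append(U.reshape(r_prev, n, r))
        # recursion of Figure 7: combine G_j^T D_j with the next factor Phi_tilde_{j+1}
        A_next = U.T @ D                              # shape r x N
        Phi_next = factors[j]
        D = (A_next[:, None, :] * Phi_next.T[None, :, :]).reshape(r * n, N)
        r_prev = r
    cores.append(D.mean(axis=1).reshape(r_prev, n))   # G_d = average of D_d^{(i)} over i
    return cores
```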
Figure 7: Tensor diagram illustrating the recursive calculation of D_j^{(i)}, which is the i-th column of
the matrix D_j in Algorithm 2.
14
Algorithm 2 Fast implementation of TT-SVD
INPUT: Samples {x^{(i)}}_{i=1}^N. Target ranks {r_1, ⋯, r_{d−1}}.
1: for j ∈ {1, ⋯, d−1} do
2: Perform an eigen-decomposition of A_j := B_j B_j^T ∈ R^{nr_{j−1}×nr_{j−1}}, which is computed as
A_j = \frac{1}{N^2} D_j E_j D_j^T, where D_j ∈ R^{nr_{j−1}×N} and E_j ∈ R^{N×N} are the matrices obtained from
the contractions marked by the boxes in the accompanying tensor diagram:
[Tensor diagram: A_j = B_j B_j^T = \frac{1}{N^2}\sum_{i,i'=1}^{N} (contraction of Ĝ_1, ⋯, Ĝ_{j−1} and Φ̃_1^{(i)}, ⋯, Φ̃_d^{(i)} with Φ̃_1^{(i')}, ⋯, Φ̃_d^{(i')} and Ĝ_{j−1}, ⋯, Ĝ_1), with boxes marking D_j and E_j.]
A \approx M W^{+} M^T,   (33)
where M ∈ R^{m×r̃} consists of the r̃ chosen columns, W ∈ R^{r̃×r̃} is the intersection of the chosen r̃
columns and the corresponding r̃ rows, and W^{+} is the Moore–Penrose generalized inverse of W.
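A minimal sketch of the kernel Nyström approximation (33) for a symmetric positive semidefinite matrix, with the columns sampled uniformly at random; function and argument names are ours.

```python
import numpy as np

def nystrom(A, r_tilde, rng=None):
    """Approximate a symmetric PSD matrix A by M W^+ M^T using r_tilde sampled columns."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.choice(A.shape[0], size=r_tilde, replace=False)
    M = A[:, idx]                    # sampled columns
    W = A[np.ix_(idx, idx)]          # intersection block
    return M @ np.linalg.pinv(W) @ M.T
```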
Figure 8: Cross approximation of a symmetric matrix.
which is depicted by the first equation in Figure 9. To apply the kernel Nyström method, we randomly
sample r̃ rows of E_j with index set I ⊂ {1, 2, ⋯, N}, and we define Φ̃_m^{(I)} ∈ R^{r̃×n}, denoted by
the blocks in each Φ̃_m in Figure 9, as the submatrix formed by the sampled rows.
[Figure 9: matrix diagrams for E_j = (Φ̃_d Φ̃_d^T) ⊙ (Φ̃_{d−1}Φ̃_{d−1}^T) ⊙ ⋯ ⊙ (Φ̃_{j+1}Φ̃_{j+1}^T), for the column submatrix M_j obtained by restricting one side of each factor to the sampled rows Φ̃_m^{(I)}, and for the intersection W_j obtained by restricting both sides.]
E_j \approx M_j W_j^{+} M_j^T,
where M_j ∈ R^{N×r̃} is the submatrix formed by the chosen columns of E_j and W_j ∈ R^{r̃×r̃} is the
corresponding intersection. Using the expression (34), M_j and W_j are respectively formulated as
the second and third equations in Figure 9. The structures of M_j and W_j allow us to compute them
recursively as in Line 2 of Algorithm 3: once Φ̃_{j+2:d} and Φ̃_{j+2:d}^{(I)} have been obtained, M_j and W_j can be
evaluated with complexity O(Nn) (whereas direct computation from their definitions in Figure 9
costs O(dNn)). Consequently, B_j B_j^T in the previous Algorithm 2 can now be approximated by
B_j B_j^T = \frac{1}{N^2} D_j E_j D_j^T \approx \frac{1}{N^2} D_j M_j W_j^{+} M_j^T D_j^T
as in Line 4 of Algorithm 3. The computational cost for the above matrix multiplications is
O(N n r \max(nr, r̃)), which is significantly smaller than the original O(N^2 n r) complexity of
Algorithm 2. Accounting for the calculations of all d tensor cores, the total computational cost is
O(d N n r \max(nr, r̃)). We remark that we employ the same index set I for computing all
E_j; this practice is the key to reusing calculations in the recursive implementation, so that we
achieve a complexity that is only linear in the sample size.
Algorithm 3 TT-SVD-kn
INPUT: Samples {x^{(i)}}_{i=1}^N. Target ranks {r_1, ⋯, r_{d−1}}. Number of sampled columns r̃.
1: Sample r̃ integers from {1, ⋯, N} and let I be the set of sampled integers.
2: Pre-compute {W_j}_{j=1}^{d−1} and {M_j}_{j=1}^{d−1} based on the sampled indices I, recursively as in Figure 9.
3: for j ∈ {1, ⋯, d−1} do
4: Perform an eigen-decomposition of B_j B_j^T ∈ R^{nr_{j−1}×nr_{j−1}}, which is approximated by
B_j B_j^T \approx A_j := \frac{1}{N^2} D_j M_j W_j^{+} M_j^T D_j^T,
where D_j is computed recursively as in Figure 7. With the eigen-decomposition A_j = U_j Λ_j U_j^T, we
obtain the j-th core Ĝ_j ∈ R^{r_{j−1}×n×r_j}, which is reshaped from the first r_j principal eigenvectors
in U_j.
5: end for
6: Obtain the last core:
\hat{G}_d = \frac{1}{N}\sum_{i=1}^{N} D_d^{(i)} \in R^{r_{d-1} \times n},
where D_d^{(i)} is the i-th column of D_d.
OUTPUT: Tensor train c̃ = Ĝ_1 ◦ Ĝ_2 ◦ ⋯ ◦ Ĝ_d ∈ R^{n×n×⋯×n}.
instead compute the left singular vectors from a sketched B_j:
B_j' = B_j S_j,   (35)
where S_j is an appropriately chosen sketching matrix of size n^{d−j} × r̃_j, with sketching size
r̃_j much smaller than n^{d−j} (Line 2 of Algorithm 4). In this way, we reduce the exponential
complexity of the full SVD of B_j to the O(\max(n r r̃_j^2, n^2 r^2 r̃_j)) complexity of the SVD of B_j'.
Algorithm 4 TT-SVD-c
INPUT: Samples {x^{(i)}}_{i=1}^N. Target ranks {r_1, ⋯, r_{d−1}}. Cluster order K.
1: for j ∈ {1, ⋯, d−1} do
2: Perform SVD on B_j' ∈ R^{nr_{j−1}×r̃_j}, which is formed by the tensor contraction B_j' = B_j S_j,
where S_j ∈ R^{n^{d−j}×r̃_j} is the sketching matrix depending on the cluster order K. The contraction
of the right-hand side is done efficiently as shown in Figure 10. With the SVD B_j' = U_j Σ_j V_j^T, we
obtain the j-th core Ĝ_j ∈ R^{r_{j−1}×n×r_j}, which is reshaped from the first r_j principal left singular
vectors in U_j.
3: end for
4: Obtain the last core Ĝ_d = B_d' ∈ R^{r_{d−1}×n}.
OUTPUT: Tensor train c̃ = Ĝ_1 ◦ Ĝ_2 ◦ ⋯ ◦ Ĝ_d ∈ R^{n×n×⋯×n}.
The key to this approach is to choose a good sketching matrix S_j such that B_j' retains the same
range, and thus the same left singular space, as B_j. Here, we follow a specific construction of the
sketching matrix S_j (Chen and Khoo (2023); Peng et al. (2023)), in which we only include the low-
order cluster basis. We first introduce the set of k-cluster indices
\mathcal{J}_{k,d} = \big\{ (l_1, \cdots, l_d) \;\big|\; l_1, \cdots, l_d \in \{1, \cdots, n\} \text{ and } \|l - 1\|_0 = k \big\},
which are the basis indices of the cluster bases in B_{k,d} defined in (5), since φ^j_1 ≡ 1.
Our choice of sketching matrix is based on the low-order cluster indices. To sketch B_j, we
construct S_j ∈ R^{n^{d−j}×r̃_j} as follows: each column of S_j is defined, based on a multivariate index
l ∈ \bigcup_{k=0}^{K} \mathcal{J}_{k,d-j} where K is the chosen cluster order, as
\mathbf{1}_{l_1} \otimes \mathbf{1}_{l_2} \otimes \cdots \otimes \mathbf{1}_{l_{d-j}},
where 1_l is the n × 1 column vector equal to 1 in the l-th component and 0 elsewhere. Under
this construction, S_j is a sparse matrix with column size
\tilde{r}_j = \sum_{k=0}^{K} |\mathcal{J}_{k,d-j}| = O(d^K n^K).
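For concreteness, here is a small sketch that enumerates the low-order cluster indices ⋃_{k≤K} J_{k,m} from which the sparse columns of S_j are built; the function name is ours and the enumeration order is arbitrary.

```python
from itertools import combinations, product

def cluster_indices(m, n, K):
    """Multi-indices l in {1,...,n}^m with at most K entries different from 1
    (those entries ranging over {2,...,n}); each index yields one sparse column
    1_{l_1} x ... x 1_{l_m} of the sketching matrix."""
    out = [(1,) * m]                                   # the 0-cluster index
    for k in range(1, K + 1):
        for positions in combinations(range(m), k):    # non-constant coordinates
            for labels in product(range(2, n + 1), repeat=k):
                l = [1] * m
                for pos, lab in zip(positions, labels):
                    l[pos] = lab
                out.append(tuple(l))
    return out
```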
At this point, the only concern is that forming B_j' = B_j S_j by direct matrix multiplication has
exponential complexity. Similar to Algorithm 2, we utilize the structure of ĉ in (16):
B_j' = B_j S_j = \frac{1}{N} D_j E_j^T = \frac{1}{N}\sum_{i=1}^{N} D_j^{(i)} \big(E_j^{(i)}\big)^T \in R^{n r_{j-1} \times \tilde{r}_j},   (36)
where D_j, E_j are defined in Figure 10(c) and D_j^{(i)}, E_j^{(i)} represent the i-th columns of the matrices
D_j, E_j, respectively. Owing to the structure of D_j and E_j, their computation can be done in a recursive
way following the diagrams in Figure 10(a) and (b).
Figure 10: Tensor diagrams illustrating the recursive computation of B_j' in Line 2 of Algorithm 4. (a)
recursive calculation of D_j^{(i)}, which is the i-th column of the matrix D_j; (b) recursive calculation
of E_j^{(i)}, which is the i-th column of the matrix E_j; (c) calculation of B_j' based on D_j^{(i)} and E_j^{(i)}.
In terms of the computational complexity of Algorithm 4, the cost for {E_j}_{j=1}^{d−1} is O(d^{K+1} N n^K)
density p^* has high correlations between nearby variables (such as x_1 and x_2) and low correlations
between distant variables (such as x_1 and x_d), which holds in many real cases. In Algorithm 4,
the sketching matrix S_j treats all correlations equally, without taking distance into account;
it therefore includes redundant correlations with distant variables, which leads to a large column size of
S_j. In fact, even for cluster order K = 1, the column size is O(d), which results in an overall
computational cost of O(d^2). This motivates us to further sketch S_j, which is done by a hierarchical
matrix decomposition (Hackbusch (1999); Hackbusch and Khoromskij (2000); Bebendorf (2008))
of the covariance matrix M, whose (j_1, j_2)-block is formulated as
for all 1 ⩽ j_1, j_2 ⩽ d.
Figure 11: Hierarchical matrix decomposition of the covariance matrix for d = 8. The figure depicts the
three block matrices M_{1:4,5:8}, M_{1:2,3:4}, M_{1,2}, whose top r̃ right singular vectors are computed
(visualized as the dark gray matrix in each block matrix).
In terms of computational complexity, there are two major sources of computation. The first
is the SVD of the off-diagonal block matrices (M_{1:4,5:8}, M_{1:2,3:4}, M_{1,2} for d = 8). For each block, we first
apply the kernel Nyström approximation (33) with r̃ sampled rows and columns, followed by an SVD. The
cost of this part is O(d log(d) N n r̃). The second major computation is forming
B_j' = B_j S_j Q_j. The number of columns in the weight matrix Q_j is at most log_2(d) r̃. Since S_j is a
sparse matrix, computing B_j' = B_j S_j Q_j for all j requires O(d log(d) N n r̃) computational cost
upon leveraging a strategy similar to that for computing (36). The computational costs discussed above
are the dominant costs in Algorithm 4. Replacing S_j with S_j Q_j in Algorithm 4 and following the
operations in Figure 10, the method achieves a reduced computational complexity of O(d log(d) N n),
compared to the original O(d^2 N n^2) complexity of TT-SVD-c with cluster order K = 1.
Algorithm 5 TT-rSVD-t
INPUT: Samples {x^{(i)}}_{i=1}^N. Target ranks {r_1, ⋯, r_{d−1}}. Sketching size r̃.
1: Generate a random tensor train with cores {H_j}_{j=1}^d, where H_1 ∈ R^{n×r̃}, H_2, ⋯, H_{d−1} ∈
R^{r̃×n×r̃}, H_d ∈ R^{r̃×n}.
2: for j ∈ {1, ⋯, d−1} do
3: Perform SVD on B_j' ∈ R^{nr_{j−1}×r̃_j}, which is formed by the tensor contraction B_j' = B_j S_j.
The contraction of the right-hand side is done efficiently as shown in Figure 12. With the SVD
B_j' = U_j Σ_j V_j^T, we obtain the j-th core Ĝ_j ∈ R^{r_{j−1}×n×r_j}, which is reshaped from the first r_j
principal left singular vectors in U_j.
4: end for
5: Obtain the last core Ĝ_d = B_d' ∈ R^{r_{d−1}×n}.
OUTPUT: Tensor train c̃ = Ĝ_1 ◦ Ĝ_2 ◦ ⋯ ◦ Ĝ_d ∈ R^{n×n×⋯×n}.
To implement Algorithm 5, we first generate a d-dimensional random tensor train with fixed
rank r̃ whose cores are H_1, H_2, ⋯, H_d (Line 1). In practice, we can set each core to be an i.i.d. random
Gaussian or random uniform matrix/tensor. When sketching B_j, we construct the sketching matrix
S_j ∈ R^{n^{d−j}×r̃} as the reshaping of H_{j+1} ◦ H_{j+2} ◦ ⋯ ◦ H_d (Line 3). Afterwards, we
contract B_j with S_j and perform an SVD to solve for the j-th core Ĝ_j.
Similar to (36) for TT-SVD-c, B_j' is computed recursively as illustrated in Figure 12(c). The
key difference lies in the new formulation of E_j, which arises from the new choice of sketching
matrix S_j. Leveraging the tensor-train structure, E_j can also be computed recursively, as shown
in Figure 12(b). The total computational cost for {E_j}_{j=1}^{d−1} is O(dN n r̃^2). The computation of
{D_j}_{j=1}^{d−1}, depicted in Figure 12(a), follows the same approach used in Algorithm 2 and incurs a
computational cost of O(dN n r^2). Overall, the SVD step required to compute all tensor cores has
a total complexity of O(dN n r̃^2), which is linear with respect to both the dimensionality and the sample
size.
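A brief sketch of the random tensor-train sketching just described: it draws Gaussian cores H_2, ⋯, H_d and contracts them against each sample's factor matrices to obtain the per-sample right sketches E_j^{(i)} of Figure 12(b), from which B_j' = (1/N) D_j E_j^T as in (36); shapes and names are assumptions of this illustration.

```python
import numpy as np

def random_tt_sketches(factors, r_tilde, rng=None):
    """Right sketches for TT-rSVD-t: entry j-1 of the returned list is the N x r_tilde
    matrix whose i-th row is E_j^{(i)}, the contraction of the random cores
    H_{j+1} o ... o H_d with Phi_tilde_{j+1}^{(i)}, ..., Phi_tilde_d^{(i)}.
    `factors[j]` is the N x n matrix of Phi_tilde_{j+1}^{(i)} values from eq. (16)."""
    rng = np.random.default_rng() if rng is None else rng
    d = len(factors)
    N, n = factors[0].shape
    H = {j: rng.standard_normal((r_tilde, n, r_tilde)) for j in range(2, d)}
    H[d] = rng.standard_normal((r_tilde, n))          # last random core
    E = [None] * d
    E[d - 1] = factors[d - 1] @ H[d].T                # E_{d-1}^{(i)} from the last core
    for j in range(d - 2, 0, -1):                     # recursion of Figure 12(b)
        E[j] = np.einsum('alb,il,ib->ia', H[j + 1], factors[j], E[j + 1])
    return [E[j] for j in range(1, d)]
```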
Figure 12: Tensor diagrams illustrating the recursive computation of B_j' in Line 3 of Algorithm 5. (a)
recursive calculation of D_j^{(i)}, which is the i-th column of the matrix D_j; (b) recursive calculation
of E_j^{(i)}, which is the i-th column of the matrix E_j; (c) calculation of B_j' based on D_j^{(i)} and E_j^{(i)}.
Algorithm 6 Tensor density estimator after variable re-ordering.
INPUT: Samples {x^{(i)}}_{i=1}^N. Embedding dimension d', d' ⩽ d. Target ranks {r_1, ⋯, r_{d−1}}.
1: Apply PCA to the samples {x^{(i)}}_{i=1}^N. Denote the top d' orthonormal principal components
as Q ∈ R^{d×d'}. Define the new variables as z^{(i)} = Q^T x^{(i)}, i = 1, ⋯, N, with the associated
empirical distribution denoted by p̂_z.
2: Estimate the 1-marginals µ'_1(z_1), ⋯, µ'_{d'}(z_{d'}) using one-dimensional kernel density estimation.
The product of these 1-marginals forms the mean-field µ' = \prod_{j=1}^{d'} µ'_j in the reduced coordinate
system.
3: Compute the orthogonal basis {φ^j_l(z_j)}_{l=1}^n for each j = 1, ⋯, d', where the basis functions are
orthogonalized with respect to µ'_j(z_j).
4: Obtain the d'-dimensional tensor density estimator in the reduced coordinate system:
When dealing with a general distribution, the optimal ordering of variables is generally unknown.
A poor variable ordering may lead to a high-rank TT representation, significantly increasing the
computational complexity. To address this, Step 1 of Algorithm 6 applies PCA to simultaneously
decorrelate the variables and reduce the dimensionality. The resulting reduced-coordinate variables
{z^{(i)}}_{i=1}^N often exhibit simpler dependence structures, allowing a more compact TT representation.
The eigen-decomposition in PCA can be performed through a “streaming approach,” as in Henriksen
and Ward (2019); Hoskins et al. (2024); Khoo et al. (2024), with a complexity of O(N d d').
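A non-streaming sketch of the PCA projection in Step 1 of Algorithm 6 (the streaming variants cited above would replace the SVD below); the names and the centering convention are our own choices.

```python
import numpy as np

def pca_reduce(X, d_prime):
    """Project samples X (N x d) onto the top d' principal components and return the
    reduced coordinates z^{(i)} = Q^T x^{(i)} together with Q."""
    Xc = X - X.mean(axis=0)                          # center only for computing the directions
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Q = Vt[:d_prime].T                               # d x d' orthonormal principal components
    return X @ Q, Q
```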
In Step 2 of Algorithm 6, we aim to perform the projection (10) on the empirical distribution p̂_z.
However, to do so, we require knowledge of the mean-field µ'(z) = \prod_{j=1}^{d'} µ'_j(z_j). To approximate
it, we apply a kernel density estimator to each of the one-dimensional marginals of p̂_z. This step
ensures that µ' captures the product structure of the new coordinate system.
In Step 3, we construct orthonormal basis functions with respect to each one-dimensional
marginal. This can be achieved using a Gram-Schmidt procedure or a stable three-term recur-
rence method (see Gautschi (2004)) for polynomial basis functions.
Finally, in Step 4, we estimate the density p̂_z in the reduced coordinate system using the
tensor-train density estimator. This involves applying any of the tensor-train compression algorithms
(TT-compress) detailed in Sections 3–6 to diag(w̃_α)Φ^T p̂_z. The resulting density estimator p̃_z is then
mapped back to the original space via the transformation p̃(x) = p̃_z(Q^T x), followed by normalization.
8 Theoretical results
In this section, we present an error analysis of our proposed methods. For simplicity, we focus
primarily on the hard-thresholding approach described in (9). Based on Section 2.3, we know that
the estimator from the hard-thresholding approach in (9) is close to the soft-thresholding approach
in (10) that we primarily discuss. A detailed analysis of the soft-thresholding approach in (10) is
deferred to future work.
We again use the notations ĉ = diag(w_K)Φ^T p̂ and c^* = diag(w_K)Φ^T p^* to denote the coefficient
tensors. For simplicity, we assume the mean-field component µ ∼ Unif([0,1]^d). In this way, the
univariate basis functions {φ_l}_{l=1}^n are orthonormal, and we take Legendre polynomials as an example
in our analysis. In addition, we impose the following assumption on the ground truth density:
Assumption 1. Suppose the ground truth density function p^* : [0,1]^d → R_+ is an additive
model (1-cluster):
p^*(x_1, \cdots, x_d) = 1 + q_1^*(x_1) + \cdots + q_d^*(x_d),   (38)
where q_j^* ∈ H^{β_0}([0,1]) for some integer β_0 ⩾ 1 satisfies the regularity
for two O(1) positive constants b^* and B^*. Under this assumption, we immediately obtain
that ‖p^*‖_∞ ⩽ 1 + B^* d by the Sobolev embedding theorem.
Assumption 1 considers only the 1-cluster model and includes a regularity constraint on each
one-dimensional component. The motivation stems from the fact that the left/right subspaces are then
computationally straightforward, allowing an efficient tensor-train representation of p^*. This
specific structure, in turn, yields a favorable estimation error rate with respect to the dimensionality.
We state our main theorem under these assumptions:
Theorem 1. Suppose the ground truth p^* satisfies Assumption 1, the empirical coefficient tensor is
ĉ = diag(w_K)Φ^T p̂, and
N \ge C_0\, d^{2K-1} n^{4K-2} \log(dn),   (39)
where C_0 is a sufficiently large constant independent of N, n and d. When we use TT-SVD
(Algorithm 1) as TT-compress, with c̃ being the output with target rank r = 2, the following bound
holds with high probability:
\frac{\|\tilde{c} - c^*\|_F}{\|c^*\|_F} = O\!\left( \sqrt{\frac{d n B^* \log(dn)}{N} \sum_{j=1}^{d-1} \frac{1}{\sigma_2^2(c^{*(j)})}} \right)   (40)
Corollary 1. Suppose all the conditions in Theorem 1 hold. Then the following bound for the density
estimator p̃ in (24), obtained by TT-SVD (Algorithm 1) with target rank r = 2, holds with high probability:
\frac{\|\tilde{p} - p^*\|_{L^2([0,1]^d)}}{\|p^*\|_{L^2([0,1]^d)}} = O\!\left( \sqrt{\,n^{-\beta_0} + \frac{d n B^* \log(dn)}{N} \sum_{j=1}^{d-1} \frac{1}{\sigma_2^2(c^{*(j)})}} \right),   (41)
9 Numerical results
In this section, we carry out numerical experiments to examine the performance of our proposed
density estimation algorithms. Specifically, we use TT-SVD-kn (Algorithm 3), TT-SVD-c (Algorithm
4) with hierarchical matrix decomposition, and TT-rSVD-t (Algorithm 5). Unless otherwise
noted, we use the following parameter settings throughout: the number of sampled columns in the
Nyström approximation is r̃ = 100 in TT-SVD-kn, the sketching size is r̃ = 10 in TT-SVD-c with
hierarchical matrix decomposition, the sketching size is r̃ = 30 in TT-rSVD-t, and the kernel
parameter is α = 0.01. These parameter choices are based on empirical observations.
For simplicity, we choose the mean-field density µ as a uniform distribution, so that {φ^j_l}_{l=1}^n
can be chosen as any orthonormal basis functions. Here we use the Fourier basis
\phi_l(x) = \begin{cases} \frac{1}{\sqrt{2L}}, & l = 1, \\ \frac{1}{\sqrt{L}}\cos\!\big(\frac{l\pi x}{2L}\big), & l > 1 \text{ even}, \\ \frac{1}{\sqrt{L}}\sin\!\big(\frac{(l+1)\pi x}{2L}\big), & l > 1 \text{ odd}. \end{cases}   (42)
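A direct transcription of the basis (42) into code, assuming the variables live on the interval [−L, L]; this is only an illustration of the formula, not part of the authors' implementation.

```python
import numpy as np

def fourier_basis(x, n, L):
    """Evaluate the n basis functions of eq. (42) at the points in x."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    out = np.empty((x.size, n))
    out[:, 0] = 1.0 / np.sqrt(2 * L)                                   # l = 1
    for l in range(2, n + 1):                                          # 1-based index l
        if l % 2 == 0:                                                 # even l > 1
            out[:, l - 1] = np.cos(l * np.pi * x / (2 * L)) / np.sqrt(L)
        else:                                                          # odd l > 1
            out[:, l - 1] = np.sin((l + 1) * np.pi * x / (2 * L)) / np.sqrt(L)
    return out
```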
Figure 13: Gaussian mixture model: relative L^2-error for the proposed algorithms. Left: d = 10 and
sample number N = 2^7, 2^8, ⋯, 2^{15}. Right: N = 10^6 and dimension d = 3, 6, 9, ⋯, 30.
TT without approximation error, and the numerical L^2-norm based on quadrature of TTs can be
computed easily with O(d) complexity (Oseledets and Tyrtyshnikov (2010)).
Next, we fix the sample size N = 10^6 and test our algorithms with various dimensions d.
The relative L^2-error is plotted in the right panel of Figure 13; according to the curves, it exhibits
sublinear growth with increasing dimensionality. In particular, with N = 10^6 samples, our
algorithms achieve two-digit relative L^2-error up to d = 12. For higher dimensionality,
we can expect that the error can also be controlled below 0.1 for a sufficiently large sample
size, as suggested by the Monte Carlo convergence observed in the previous experiment.
Finally, we verify the time complexity of the algorithms. In Figure 14, we plot the wall clock
time of the three algorithms for fixed d = 5 with various N (left panel) and fixed N = 10000 with
various d (right panel). The rates of the wall clock time verify the theoretical computational complexities
as listed in Table 1.
Figure 14: Gaussian mixture model: average wall clock time (seconds) for the proposed algorithms.
Left: d = 5 and sample number N = 2^7, 2^8, ⋯, 2^{15}. Right: N = 10000 and dimension d =
3, 6, 9, ⋯, 30.
where x_0 = x_{d+1} = 0, h = 1/(1 + d), λ = 0.03 and β = 1/8. We consider the hypercube domain
[−L, L]^d with L = 2.5 and mesh size 0.05 in each direction. For the ground truth p^*, we use the DMRG-
cross algorithm (see Savostyanov and Oseledets (2011)) to obtain p^* in TT form, which is known to
provide highly accurate approximations with arbitrarily prescribed precision. The samples of p^*
are generated by running Langevin dynamics for a sufficiently long time.
In Table 2, we list the relative L^2-errors of the approximated densities estimated from N = 10^6
samples with different numbers of univariate basis functions. For this example, we set a fixed target
rank r = 4. One can observe that the approximation improves as more basis functions are used.
This improvement is further visualized in Figure 15, where we plot a marginal distribution
of the approximated solutions obtained by the three algorithms for d = 20. All of them accurately
match the sample histograms when a sufficient number of basis functions is used.
Figure 15: 1D G-L model: x_{10}-marginal of the approximated solutions for d = 20 and N = 10^6 under
different numbers of univariate basis functions.
            d = 5                                   d = 10
  n    TT-SVD-c  TT-SVD-kn  TT-rSVD-t     TT-SVD-c  TT-SVD-kn  TT-rSVD-t
  7     0.2810    0.2809     0.2810        0.4304    0.4304     0.4305
 11     0.0782    0.0781     0.0781        0.1365    0.1365     0.1378
 15     0.0305    0.0304     0.0301        0.0488    0.0483     0.0533
 19     0.0275    0.0273     0.0289        0.0449    0.0404     0.0526

            d = 15                                  d = 20
  n    TT-SVD-c  TT-SVD-kn  TT-rSVD-t     TT-SVD-c  TT-SVD-kn  TT-rSVD-t
  7     0.6342    0.6342     0.6343        0.7897    0.7897     0.7898
 11     0.2169    0.2167     0.2186        0.3017    0.3016     0.3037
 15     0.0733    0.0740     0.0836        0.1126    0.1125     0.1158
 19     0.0613    0.0602     0.0716        0.0884    0.0874     0.1099

Table 2: 1D G-L model: comparison of relative L^2 errors with N = 10^6 and d = 5, 10, 15, 20 under
different numbers of univariate basis functions.
where we choose the parameters h = 1/(1 + m), λ = 0.1 and inverse temperature β = 1. We set L = 2
with mesh size 0.05 for the domain. We consider a test example with m = 16, so that the total
dimensionality is m^2 = 256. For this 2D model, we apply the PCA pre-processing steps
described in Section 7 to order the dimensions and also reduce the dimensionality, choosing
embedding dimension d' = 100. For the algorithms, we set target rank r = 10, α = 0.001,
and we use N = 10^4 samples and n = 21 Fourier bases.
We evaluate the performance of our fitted tensor-train density estimators as generative models.
Specifically, we first fit tensor-train density estimators p̃ from the given N = 10^4 samples {x_i}_{i=1}^N
using the proposed algorithms, and then generate new samples {x̃_i}_{i=1}^N from the learned tensor trains.
Sampling each x̃_i ∼ p̃ can be done via tensor-train conditional sampling, where each
\tilde{p}(x_i \,|\, x_{i-1}, \cdots, x_1) = \frac{\tilde{p}(x_1, \cdots, x_i)}{\tilde{p}(x_1, \cdots, x_{i-1})}
is a conditional distribution of p̃, and p̃(x_1), ⋯, p̃(x_1, ⋯, x_{d−1}) are the marginal distributions. We
point out that each conditional sampling step can be done with O(d) complexity using the TT structure
of p̃ (Dolgov et al. (2020)).
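To illustrate the conditional-sampling idea on a discretized density, the following sketch draws one multi-index from a nonnegative tensor train by sampling each coordinate from the sequential conditionals above, using precomputed right partial sums; continuous sampling as in Dolgov et al. (2020) additionally integrates the basis functions in each coordinate, which is omitted here.

```python
import numpy as np

def tt_conditional_sample(cores, rng=None):
    """Draw one multi-index from a (nonnegative) discrete distribution stored as a tensor
    train; cores[j] has shape (r_{j-1}, n_j, r_j) with r_0 = r_d = 1."""
    rng = np.random.default_rng() if rng is None else rng
    d = len(cores)
    R = [None] * (d + 1)                   # right partial sums over the remaining coordinates
    R[d] = np.ones(1)
    for j in range(d - 1, -1, -1):
        R[j] = cores[j].sum(axis=1) @ R[j + 1]
    left = np.ones(1)                      # running contraction over the chosen indices
    sample = []
    for j in range(d):
        weights = np.einsum('a,anb,b->n', left, cores[j], R[j + 1])
        weights = np.maximum(weights, 0.0)         # guard against small negative entries
        probs = weights / weights.sum()
        i_j = rng.choice(len(probs), p=probs)      # sample x_j given x_1, ..., x_{j-1}
        sample.append(i_j)
        left = left @ cores[j][:, i_j, :]
    return sample
```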
Figure 16: 2D G-L model on a 16 × 16 lattice: visualization of (x3,3 , x14,14 )-marginal (first row)
and (x6,12 , x8,12 )-marginal (second row). Left to right: reference sample histograms, samples from
tensor trains fitted by proposed algorithms.
Figure 16 illustrates the (x_{3,3}, x_{14,14})-marginal and (x_{6,12}, x_{8,12})-marginal distributions of the
generated samples. For comparison, we also show results for N^* = 10^5 reference samples. The
visualizations reveal that samples generated from tensor trains learned by all three proposed algorithms
capture the local minima accurately. To quantify the accuracy, we compute the second-moment
error
\|\tilde{X}^T \tilde{X} - X^{*T} X^*\|_F \,/\, \|X^{*T} X^*\|_F,   (45)
where X̃ ∈ R^{N×d} is the matrix formed by the generated samples, and X^* ∈ R^{N^*×d} contains the reference
samples. The second-moment errors for the TT-SVD-kn, TT-SVD-c and TT-rSVD-t algorithms are
respectively 0.0278, 0.0252, 0.0264.
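A small helper for the error metric (45); since the generated and reference sample sets may differ in size, the sketch divides each Gram matrix by its own sample count, a normalization convention we assume here rather than one spelled out in the text.

```python
import numpy as np

def second_moment_error(X_gen, X_ref):
    """Relative second-moment error between generated samples X_gen (N x d) and
    reference samples X_ref (N* x d), with per-sample-count normalization assumed."""
    M_gen = X_gen.T @ X_gen / X_gen.shape[0]
    M_ref = X_ref.T @ X_ref / X_ref.shape[0]
    return np.linalg.norm(M_gen - M_ref) / np.linalg.norm(M_ref)
```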
9.3.1 1D Ginzburg-Landau model
We first consider the Boltzmann distribution of the 1D Ginzburg-Landau potential defined in (9.2.1),
where we choose model parameters λ = 0.005 and β = 1/20. We use N = 10^4 samples as training
data to fit tensor trains, and then generate N = 10^4 new samples from the learned tensor trains
by conditional sampling (44).
[Figure 17: second-moment error versus dimension d (10 to 100) for TT-SVD-c, TT-SVD-kn, TT-rSVD-t, and the neural network (NN).]
In Figure 17, we plot the second-moment errors of the samples generated by the tensor-train
algorithms and a neural network, compared against a reference set of N^* = 10^6 samples across
various dimensions d = 10, 20, ⋯, 100. Here we choose target rank r = 4 and use n = 17 Fourier
bases for all simulations. The three proposed tensor-train algorithms exhibit similar performance
and consistently achieve two-digit accuracy across all dimensions. Moreover, they are considerably
more accurate than the neural network. To visualize this accuracy gap, Figure 18 further shows
histograms of two selected marginal distributions of the generated samples for d = 100. It can
be observed that samples generated by the tensor-train estimators capture the local minima more
accurately than those produced by the neural network.
Figure 18: 1D G-L model for d = 100: visualization of (x99 , x100 )-marginal (first row) and (x90 , x95 )-
marginal (second row) of samples from generative models. Left to right: reference samples, samples
from tensor trains under three proposed algorithms, and samples from neural network.
per digit from the original MNIST dataset. The generated images from the tensor train exhibit
well-preserved digit structures and visually plausible variations across classes.
10 Conclusion
In this paper, we propose a natural approach to high-dimensional density estimation entirely based
on linear algebra, which allows the construction of a density estimator as a low-rank functional
tensor train. To this end, we provide four tensor decomposition algorithms. The first one is based
on a direct TT-SVD algorithm with a fast implementation in the density estimation setting. The second
algorithm follows a Nyström approximation. The third approach modifies the first by
thresholding certain elements in the tensor and employing a hierarchical matrix factorization to
improve computational efficiency. The last algorithm employs a tensorized sketching technique.
Several of these proposed algorithms can achieve linear computational complexity both in dimension
d and in sample size N . Theoretical analysis demonstrates that tensor-train estimators can also
achieve linear scaling of the estimation error with respect to the dimensionality d under some
specific conditions. Comprehensive numerical experiments validate the accuracy of all proposed
algorithms, in comparison to other density estimators based on neural networks.
11 Declaration
Acknowledgment. Y. Khoo acknowledges partial support from DMS-2339439, DE-SC0022232
and the Sloan Foundation. Y. Peng acknowledges partial support from DE-SC0022232.
(a) Real images (b) Generated images
Figure 19: Generated and real images from MNIST. For each digit class, five images are randomly
selected for illustration.
References
Markus Bachmayr, Reinhold Schneider, and André Uschmajew. Tensor networks and hierarchi-
cal tensors for the solution of high-dimensional partial differential equations. Foundations of
Computational Mathematics, 16:1423–1472, 2016.
Tai-Danae Bradley, E Miles Stoudenmire, and John Terilla. Modeling sequences with quantum
states: a look under the hood. Machine Learning: Science and Technology, 1(3):035008, 2020.
T Tony Cai and Anru Zhang. Rate-optimal perturbation bounds for singular subspaces with
applications to high-dimensional statistics. 2018.
Kamalika Chaudhuri, Sanjoy Dasgupta, Samory Kpotufe, and Ulrike Von Luxburg. Consistent
procedures for cluster tree estimation and pruning. IEEE Transactions on Information Theory,
60(12):7900–7912, 2014.
Yen-Chi Chen. A tutorial on kernel density estimation and recent advances. Biostatistics & Epi-
demiology, 1(1):161–187, 2017.
Yian Chen and Yuehaw Khoo. Combining particle and tensor-network methods for partial differ-
ential equations via sketching. arXiv preprint arXiv:2305.17884, 2023.
32
Kyle Cranmer. Kernel estimation in high-energy physics. Computer Physics Communications, 136
(3):198–207, 2001.
Richard A Davis, Keh-Shin Lii, and Dimitris N Politis. Remarks on some nonparametric estimates
of a density function. Selected Works of Murray Rosenblatt, pages 95–100, 2011.
Ronald A DeVore and George G Lorentz. Constructive approximation, volume 303. Springer Science
& Business Media, 1993.
Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: Non-linear independent components
estimation. arXiv preprint arXiv:1410.8516, 2014.
Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp. arXiv
preprint arXiv:1605.08803, 2016.
Sergey Dolgov, Karim Anaya-Izquierdo, Colin Fox, and Robert Scheichl. Approximation and sam-
pling of multivariate probability distributions in the tensor train decomposition. Statistics and
Computing, 30:603–625, 2020.
Yilun Du and Igor Mordatch. Implicit generation and modeling with energy based models. Advances
in Neural Information Processing Systems, 32, 2019.
Walter Gautschi. Orthogonal polynomials: computation and approximation. OUP Oxford, 2004.
Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. Made: Masked autoencoder
for distribution estimation. In International conference on machine learning, pages 881–889.
PMLR, 2015.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair,
Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the
ACM, 63(11):139–144, 2020.
Wolfgang Hackbusch. A sparse matrix arithmetic based on H-matrices. Part I: Introduction to
H-matrices. Computing, 62(2):89–108, 1999.
Wolfgang Hackbusch and Boris N Khoromskij. A sparse H-matrix arithmetic. Part II: Application
to multi-dimensional problems. Computing, 64:21–47, 2000.
Nathan Halko, Per-Gunnar Martinsson, and Joel A Tropp. Finding structure with randomness:
Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review, 53
(2):217–288, 2011.
Zhao-Yu Han, Jun Wang, Heng Fan, Lei Wang, and Pan Zhang. Unsupervised generative modeling
using matrix product states. Physical Review X, 8(3):031012, 2018.
Amelia Henriksen and Rachel Ward. Adaoja: Adaptive learning rates for streaming pca. arXiv
preprint arXiv:1905.12115, 2019.
Karl-Heinz Hoffmann and Qi Tang. Ginzburg-Landau Phase Transition Theory and Superconduc-
tivity. Birkhäuser Basel, 2012.
33
Jeremy Hoskins, Yuehaw Khoo, Oscar Mickelin, Amit Singer, and Yuguan Wang. Subspace
method of moments for ab initio 3-d single-particle cryo-em reconstruction. arXiv preprint
arXiv:2410.06889, 2024.
Benjamin Huber, Reinhold Schneider, and Sebastian Wolf. A randomized tensor train singular value
decomposition. In Compressed Sensing and its Applications: Second International MATHEON
Conference 2015, pages 261–290. Springer, 2017.
William Huggins, Piyush Patil, Bradley Mitchell, K Birgitta Whaley, and E Miles Stoudenmire.
Towards quantum machine learning with tensor networks. Quantum Science and technology, 4
(2):024001, 2019.
Yoonhaeng Hur, Jeremy G Hoskins, Michael Lindsey, E Miles Stoudenmire, and Yuehaw Khoo.
Generative modeling via tensor train sketching. Applied and Computational Harmonic Analysis,
67:101575, 2023.
Yuehaw Khoo, Yifan Peng, and Daren Wang. Nonparametric estimation via variance-reduced
sketching. arXiv preprint arXiv:2401.11646, 2024.
Taesup Kim and Yoshua Bengio. Deep directed generative models with energy-based probability
estimation. arXiv preprint arXiv:1606.03439, 2016.
Daniel Kressner, Bart Vandereycken, and Rik Voorhaar. Streaming tensor train approximation.
SIAM Journal on Scientific Computing, 45(5):A2610–A2631, 2023.
Ji Liu, Przemyslaw Musialski, Peter Wonka, and Jieping Ye. Tensor completion for estimating
missing values in visual data. IEEE transactions on pattern analysis and machine intelligence,
35(1):208–220, 2012.
Qiao Liu, Jiaze Xu, Rui Jiang, and Wing Hung Wong. Density estimation using deep generative
neural networks. Proceedings of the National Academy of Sciences, 118(15):e2101344118, 2021.
YP Mack and Murray Rosenblatt. Multivariate k-nearest neighbor density estimates. Journal of
Multivariate Analysis, 9(1):1–15, 1979.
Michael W Mahoney and Petros Drineas. Cur matrix decompositions for improved data analysis.
Proceedings of the National Academy of Sciences, 106(3):697–702, 2009.
Georgii S Novikov, Maxim E Panov, and Ivan V Oseledets. Tensor-train density estimation. In
Uncertainty in artificial intelligence, pages 1321–1331. PMLR, 2021.
Ivan Oseledets and Eugene Tyrtyshnikov. Tt-cross approximation for multidimensional arrays.
Linear Algebra and its Applications, 432(1):70–88, 2010.
34
George Papamakarios, Theo Pavlakou, and Iain Murray. Masked autoregressive flow for density
estimation. Advances in neural information processing systems, 30, 2017.
Emanuel Parzen. On estimation of a probability density function and mode. The annals of math-
ematical statistics, 33(3):1065–1076, 1962.
Yifan Peng, Yian Chen, E Miles Stoudenmire, and Yuehaw Khoo. Generative modeling via hierar-
chical tensor sketching. arXiv preprint arXiv:2304.05305, 2023.
David Perez-Garcia, Frank Verstraete, Michael M Wolf, and J Ignacio Cirac. Matrix product state
representations. arXiv preprint quant-ph/0608197, 2006.
Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International
conference on machine learning, pages 1530–1538. PMLR, 2015.
Dmitry Savostyanov and Ivan Oseledets. Fast adaptive interpolation of multi-dimensional arrays
in tensor train format. In The 2011 International Workshop on Multidimensional (nD) Systems,
pages 1–8. IEEE, 2011.
David W Scott. Averaged shifted histograms: effective nonparametric density estimators in several
dimensions. The Annals of Statistics, pages 1024–1040, 1985.
Richik Sengupta, Soumik Adhikary, Ivan Oseledets, and Jacob Biamonte. Tensor networks in
machine learning. European Mathematical Society Magazine, (126):4–12, 2022.
Tianyi Shi, Maximilian Ruth, and Alex Townsend. Parallel algorithms for computing the tensor-
train decomposition. SIAM Journal on Scientific Computing, 45(3):C101–C130, 2023.
Qingquan Song, Hancheng Ge, James Caverlee, and Xia Hu. Tensor completion algorithms in big
data analytics. ACM Transactions on Knowledge Discovery from Data (TKDD), 13(1):1–48,
2019.
Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribu-
tion. Advances in neural information processing systems, 32, 2019.
Edwin Stoudenmire and David J Schwab. Supervised learning with tensor networks. Advances in
neural information processing systems, 29, 2016.
Yiming Sun, Yang Guo, Charlene Luo, Joel Tropp, and Madeleine Udell. Low-rank tucker approx-
imation of a tensor from streaming data. SIAM Journal on Mathematics of Data Science, 2(4):
1123–1150, 2020.
Xun Tang and Lexing Ying. Solving high-dimensional fokker-planck equation with functional hier-
archical tensor. arXiv preprint arXiv:2312.07455, 2023.
Xun Tang, Yoonhaeng Hur, Yuehaw Khoo, and Lexing Ying. Generative modeling via tree tensor
network states. arXiv preprint arXiv:2209.01341, 2022.
35
Benigno Uria, Marc-Alexandre Côté, Karol Gregor, Iain Murray, and Hugo Larochelle. Neural
autoregressive distribution estimation. The Journal of Machine Learning Research, 17(1):7184–
7220, 2016.
Haiyong Wang. How much faster does the best polynomial approximation converge than Legendre
projection? Numerische Mathematik, 147(2):481–503, 2021.
Steven R White. Density-matrix algorithms for quantum renormalization groups. Physical Review
B, 48(14):10345, 1993.
Christopher Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines.
Advances in Neural Information Processing Systems, 13, 2000.
Adriano Z Zambom and Ronaldo Dias. A review of kernel density estimation with applications to
econometrics. International Econometric Review, 5(1):20–42, 2013.
In the remainder of this appendix, we omit the superscripts on the basis functions unless they are needed to indicate the corresponding coordinate. Based on the additive formulation of the ground truth $p^*$ in (38) and the orthogonality of $\{\phi_l^j\}_{l=1}^n$, $c^*$ can be expressed as
We can further simplify the expression of $c_j^*$ as follows. By the normalization condition of $p^*$, the $q_j^*$ in the expression of $p^*$ satisfy $\sum_{j=1}^d \int_0^1 q_j^*(x_j)\,dx_j = 0$. Without loss of generality, we assume that
$$\int_0^1 q_j^*(x_j)\,dx_j = 0. \qquad (48)$$
Definition 1. Let $\mathcal{B}_{s,d}$ be the $s$-cluster basis functions defined in (5). Define the $s$-cluster estimator
$$\hat p_s(x) := 1 + \sum_{\varphi_1\in\mathcal{B}_{1,d}} \langle \hat p, \varphi_1\rangle\varphi_1(x) + \cdots + \sum_{\varphi_s\in\mathcal{B}_{s,d}} \langle \hat p, \varphi_s\rangle\varphi_s(x). \qquad (50)$$
The corresponding coefficient tensor $\hat c[s] \in \mathbb{R}^{n^d}$ of $\hat p_s$ is
Definition 2. Given a matrix $A \in \mathbb{R}^{m\times n}$ and a positive integer $r \le \min\{m,n\}$, we define the function $\mathrm{SVD}: \mathbb{R}^{m\times n}\times\mathbb{Z}_+ \to \mathcal{O}^{m\times r}$ satisfying
$$\mathrm{SVD}(A, r) = U, \qquad (52)$$
where $U$ consists of the leading $r$ left singular vectors of $A$.
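For concreteness, a minimal NumPy sketch of the map in Definition 2 (the matrix of the leading $r$ left singular vectors) might look as follows; the function name svd_left is our own and not part of the paper.

import numpy as np

def svd_left(A, r):
    # Return the leading r left singular vectors of A, as in Definition 2: SVD(A, r) = U.
    U, _, _ = np.linalg.svd(A, full_matrices=False)  # columns of U are ordered by singular value
    return U[:, :r]

# quick check: the returned matrix has orthonormal columns
A = np.random.randn(20, 15)
U = svd_left(A, 2)
assert np.allclose(U.T @ U, np.eye(2))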
Proof. Let $\hat G_{1:d}$ be the TT cores of $\tilde c$. To bound the error $\|c^* - \hat G_{1:d}\|_F$, we first use Lemma 4 to write it as
Later we will bound the above two terms in Step 1 and Step 2, respectively. Before elaborating on these two steps, let us introduce two error estimates that will be used frequently in the proof of this theorem. With high probability, we have
$$\max_{k\in\{1,\cdots,d\}} \big\|(U_{k-1}^{*T}\otimes I_n)(c^{*(k)} - \hat c^{(k)})V_k^*\big\|_F \le C_1\sqrt{\frac{n\log(dn)\|p^*\|_\infty}{N}}, \qquad (55)$$
$$\|c^* - \hat c\|_F \le C_2\sqrt{\frac{d^K n^{2K}}{N}}. \qquad (56)$$
Their proofs are provided by Lemma 11 and Lemma 12, respectively. In the above inequalities, $C_1, C_2$ are two sufficiently large universal constants, and $U_k^*$ and $V_k^*$ are orthogonal matrices such that $U_k^*U_k^{*T} \in \mathbb{R}^{n^k\times n^k}$ is the projection matrix onto the space $\mathrm{Col}(c^{*(k)})$ and $V_k^*V_k^{*T} \in \mathbb{R}^{n^{d-k}\times n^{d-k}}$ is the projection matrix onto the space $\mathrm{Row}(c^{*(k)})$. Due to the additive formulation (46) of $c^*$, $U_k^*$ and $V_k^*$ have analytical expressions, shown in Lemma 8.
(56) provides an upper bound on the gap between the empirical and ground-truth coefficient tensors, which scales as $O(d^{K/2})$ when the $K$-cluster basis is used for the density estimator. Notably, after projection by $U_k^*$ and $V_k^*$, this gap grows only as $O(\sqrt{d\log(dn)})$ under Assumption 1 on $\|p^*\|_\infty$, as stated in (55). This is due to the 1-cluster structure of $c^*$, which is the key to making our error analysis sharp.
Step 1. To bound (53), we show by induction that for $k \in \{1,\cdots,d-1\}$,
$$\|\hat G_{1:k}\hat G_{1:k}^T c^{*(k)} - c^{*(k)}\|_F^2 \le C_3\,\frac{n\log(dn)\|p^*\|_\infty\|c^*\|_F^2}{N}\sum_{j=1}^{k}\frac{1}{\sigma_2^2(c^{*(j)})} \qquad (57)$$
for a sufficiently large universal constant $C_3$ with high probability, where $\sigma_2(c^{*(j)})$ is the second-largest singular value of the unfolding matrix $c^{*(j)}$.
We prove $k = 1$ first:
$$\|\hat G_1\hat G_1^T c^{*(1)} - c^{*(1)}\|_F^2 = \|\hat G_1\hat G_1^T c^{*(1)} - U_1^*U_1^{*T}c^{*(1)}\|_F^2 \le \|\hat G_1\hat G_1^T - U_1^*U_1^{*T}\|_2^2\,\|c^*\|_F^2. \qquad (58)$$
We use the matrix perturbation theory in Corollary 3 to bound $\|\hat G_1\hat G_1^T - U_1^*U_1^{*T}\|_2$. The spectrum gap assumption in Corollary 3 holds since $\|\hat c^{(1)} - c^{*(1)}\|_F \le O\big(\sqrt{d^K n^{2K}/N}\big)$ and $O\big(\sqrt{d^K n^{2K}/N}\big) \le \sigma_2(c^{*(1)})$ by (56), using a sufficiently large $N$ such that (39) holds. Therefore, we have
$$\|\hat G_1\hat G_1^T - U_1^*U_1^{*T}\|_2 \le C_4\left(\frac{\|(\hat c^{(1)} - c^{*(1)})V_1^{*T}\|_2}{\sigma_2(c^{*(1)})} + \frac{\|\hat c^{(1)} - c^{*(1)}\|_2^2}{\sigma_2^2(c^{*(1)})}\right) \qquad (59)$$
for some universal constant $C_4$. According to (55), the first term
$$\frac{\|(\hat c^{(1)} - c^{*(1)})V_1^{*T}\|_2}{\sigma_2(c^{*(1)})} \le C_1\sqrt{\frac{n\log(dn)\|p^*\|_\infty}{N\,\sigma_2^2(c^{*(1)})}}, \qquad (60)$$
and the second term can be bounded by the first term upon using a sufficiently large $N$ such that (39) holds. Therefore,
$$\|\hat G_1\hat G_1^T c^{*(1)} - c^{*(1)}\|_F^2 \le C_3\,\frac{n\log(dn)\|p^*\|_\infty\|c^*\|_F^2}{N\,\sigma_2^2(c^{*(1)})} \qquad (61)$$
for some universal constant $C_3$.
Now suppose (57) holds for $k$. Then for $k+1$,
$$\begin{aligned}
\|\hat G_{1:k+1}\hat G_{1:k+1}^T c^{*(k+1)} - c^{*(k+1)}\|_F^2
&= \|\hat G_{1:k+1}\hat G_{1:k+1}^T c^{*(k+1)} - (\hat G_{1:k}\otimes I_n)(\hat G_{1:k}\otimes I_n)^T c^{*(k+1)}\|_F^2 + \|\hat G_{1:k}\hat G_{1:k}^T c^{*(k)} - c^{*(k)}\|_F^2 \\
&\le \|\hat G_{1:k+1}\hat G_{1:k+1}^T c^{*(k+1)} - (\hat G_{1:k}\otimes I_n)(\hat G_{1:k}\otimes I_n)^T c^{*(k+1)}\|_F^2 \qquad (62)\\
&\quad + C_3\,\frac{n\log(dn)\|p^*\|_\infty\|c^*\|_F^2}{N}\sum_{j=1}^{k}\frac{1}{\sigma_2^2(c^{*(j)})}, \qquad (63)
\end{aligned}$$
where the equality follows from Lemma 3, and the inequality follows from the induction hypothesis (57). It suffices to bound (62). Note that
$$\begin{aligned}
(62) &= \|(\hat G_{1:k}\otimes I_n)\hat G_{k+1}\hat G_{k+1}^T(\hat G_{1:k}\otimes I_n)^T c^{*(k+1)} - (\hat G_{1:k}\otimes I_n)(\hat G_{1:k}\otimes I_n)^T c^{*(k+1)}\|_F^2 \\
&= \big\|(\hat G_{1:k}\otimes I_n)\big(\hat G_{k+1}\hat G_{k+1}^T(\hat G_{1:k}\otimes I_n)^T c^{*(k+1)} - (\hat G_{1:k}\otimes I_n)^T c^{*(k+1)}\big)\big\|_F^2 \\
&= \big\|\hat G_{k+1}\hat G_{k+1}^T(\hat G_{1:k}\otimes I_n)^T c^{*(k+1)} - (\hat G_{1:k}\otimes I_n)^T c^{*(k+1)}\big\|_F^2, \qquad (64)
\end{aligned}$$
where the last equality uses that $\hat G_{1:k}\otimes I_n$ has orthonormal columns (Lemma 20).
Step 1A. Define the matrix formed by the principal left singular vectors,
$$\tilde G_{k+1} = \mathrm{SVD}\big((\hat G_{1:k}\otimes I_n)^T c^{*(k+1)},\,2\big) \in \mathcal{O}^{2n\times 2}, \qquad (65)$$
where the SVD function is defined in (52). Since $\mathrm{Rank}\big((\hat G_{1:k}\otimes I_n)^T c^{*(k+1)}\big) = 2$, it follows that
$$(\hat G_{1:k}\otimes I_n)^T c^{*(k+1)} = \tilde G_{k+1}\tilde G_{k+1}^T(\hat G_{1:k}\otimes I_n)^T c^{*(k+1)}. \qquad (66)$$
Therefore, by (64) and (66),
$$\begin{aligned}
(62) &= \|\hat G_{k+1}\hat G_{k+1}^T(\hat G_{1:k}\otimes I_n)^T c^{*(k+1)} - \tilde G_{k+1}\tilde G_{k+1}^T(\hat G_{1:k}\otimes I_n)^T c^{*(k+1)}\|_F^2 \\
&\le \|\hat G_{k+1}\hat G_{k+1}^T - \tilde G_{k+1}\tilde G_{k+1}^T\|_2^2\,\|\hat G_{1:k}\otimes I_n\|_2^2\,\|c^{*(k+1)}\|_F^2 \\
&= \|\hat G_{k+1}\hat G_{k+1}^T - \tilde G_{k+1}\tilde G_{k+1}^T\|_2^2\,\|c^*\|_F^2. \qquad (67)
\end{aligned}$$
The perturbation of projection matrices $\|\hat G_{k+1}\hat G_{k+1}^T - \tilde G_{k+1}\tilde G_{k+1}^T\|_2$ will again be estimated using Corollary 3. More specifically, in Corollary 3 we take $\hat A = (\hat G_{1:k}\otimes I_n)^T\hat c^{(k+1)}$ and $A = (\hat G_{1:k}\otimes I_n)^T c^{*(k+1)}$, so that $\hat G_{k+1}\hat G_{k+1}^T$ and $\tilde G_{k+1}\tilde G_{k+1}^T$ are respectively the projection matrices associated with $\hat A$ and $A$. In the subsequent Step 1B, we verify the conditions required by Corollary 3.
where $C_4$ is a universal constant and $B^*$ is given in Assumption 1. The second inequality above follows from the upper bound for $\|c^*\|_F^2$ in Corollary 2 and the lower bound for the singular value in Lemma 10, and the third inequality follows from the fact that $\sum_{j=1}^d \frac{1}{j(d-j)} = O\big(\frac{\log d}{d}\big)$. Consequently,
$$\|(\hat G_{1:k}\otimes I_n)(\hat G_{1:k}\otimes I_n)^T c^{*(k+1)} - c^{*(k+1)}\|_F^2 = \|\hat G_{1:k}\hat G_{1:k}^T c^{*(k)} - c^{*(k)}\|_F^2 \le \frac{\sigma_2^2(c^{*(k+1)})}{4}, \qquad (69)$$
where the inequality follows from the lower bound for the singular value of $c^{*(k)}$ given in Lemma 10 and the assumption (39) on the sample size $N$. Therefore, by Lemma 18,
• $\mathrm{Rank}\big((\hat G_{1:k}\otimes I_n)^T c^{*(k+1)}\big) = 2$; (70)
• $V_{k+1}^*V_{k+1}^{*T}$ is the projection matrix onto $\mathrm{Row}\big((\hat G_{1:k}\otimes I_n)^T c^{*(k+1)}\big)$; (71)
• $\sigma_2\big((\hat G_{1:k}\otimes I_n)^T c^{*(k+1)}\big) \ge \dfrac{\sigma_2(c^{*(k+1)})}{2}$. (72)
We now apply Corollary 3 to $\hat G_{k+1}$ and $\tilde G_{k+1}$. Observe that
$$\|(\hat G_{1:k}\otimes I_n)^T c^{*(k+1)} - (\hat G_{1:k}\otimes I_n)^T\hat c^{(k+1)}\|_F \le \|\hat G_{1:k}\otimes I_n\|_2\,\|c^{*(k+1)} - \hat c^{(k+1)}\|_F = \|\hat c - c^*\|_F \le C_2\sqrt{\frac{d^K n^{2K}}{N}} \le \frac{\sigma_2(c^{*(k+1)})}{8} \le \frac{\sigma_2\big((\hat G_{1:k}\otimes I_n)^T c^{*(k+1)}\big)}{4}, \qquad (73)$$
where the equality follows from the fact that $\hat G_{1:k}\otimes I_n \in \mathcal{O}^{n^{k+1}\times 2n}$, the second inequality follows from (56), the third inequality follows from (39), and the last inequality follows from (72). Since (73) and (71) hold, by Corollary 3,
$$\begin{aligned}
\|\tilde G_{k+1}\tilde G_{k+1}^T - \hat G_{k+1}\hat G_{k+1}^T\|_2
&\le \frac{\|(\hat G_{1:k}\otimes I_n)^T(c^{*(k+1)} - \hat c^{(k+1)})V_{k+1}^*\|_2}{\sigma_2\big((\hat G_{1:k}\otimes I_n)^T c^{*(k+1)}\big)} \qquad (74)\\
&\quad + \frac{\|(\hat G_{1:k}\otimes I_n)^T(c^{*(k+1)} - \hat c^{(k+1)})\|_2^2}{\sigma_2^2\big((\hat G_{1:k}\otimes I_n)^T c^{*(k+1)}\big)}. \qquad (75)
\end{aligned}$$
We will bound (74) in Step 1D and (75) in Step 1E.
Step 1E. Observe that
$$(75) \le \frac{\|c^{*(k+1)} - \hat c^{(k+1)}\|_2^2}{\sigma_2^2\big((\hat G_{1:k}\otimes I_n)^T c^{*(k+1)}\big)} \le \frac{4C_2^2\,d^K n^{2K}}{N\,\sigma_2^2(c^{*(k+1)})} \le \frac{\sqrt{C_3}\,\sqrt{n\log(dn)\|p^*\|_\infty}}{2\sqrt{N}\,\sigma_2(c^{*(k+1)})}, \qquad (82)$$
where the second inequality follows from (56) and (72), and the last inequality holds for sufficiently large $N$ as in (39). Consequently, (74), (75), (81), and (82) together lead to
$$\|\tilde G_{k+1}\tilde G_{k+1}^T - \hat G_{k+1}\hat G_{k+1}^T\|_2 \le \sqrt{C_3}\,\frac{\sqrt{n\log(dn)\|p^*\|_\infty}}{\sqrt{N}\,\sigma_2(c^{*(k+1)})}. \qquad (83)$$
Step 2. We have
Note that
$$\begin{aligned}
(85) &\le 2\|\hat G_{1:d-1}\hat G_{1:d-1}^T - U_{d-1}^*U_{d-1}^{*T}\|_2^2\,\|\hat c^{(d-1)} - c^{*(d-1)}\|_F^2 \\
&= 2\|\hat G_{1:d-1}\hat G_{1:d-1}^T - U_{d-1}^*U_{d-1}^{*T}\|_2^2\,\|\hat c - c^*\|_F^2 \le C_5\,\frac{n\|p^*\|_\infty d\log(dn)}{\sigma_2^2(c^{*(d-1)})\,N}\,\|\hat c - c^*\|_F^2 \\
&\le C_5\,\frac{n\|p^*\|_\infty d\log(dn)}{\sigma_2^2(c^{*(d-1)})\,N}\,C_2^2\,\frac{d^K n^{2K}}{N} \le C_6\,\frac{n\log(dn)\|p^*\|_\infty}{N}, \qquad (87)
\end{aligned}$$
where $C_5, C_6$ are universal constants. The second inequality follows from Lemma 5 given (84), the third inequality follows from (56), and the fourth inequality follows from (39). In addition,
$$(86) = 2\|(U_{d-1}^*U_{d-1}^{*T}\otimes I_n)(\hat c^{(d)} - c^{*(d)})\|_F^2 \le 2C_1^2\,\frac{n\log(dn)\|p^*\|_\infty}{N}, \qquad (88)$$
where the inequality follows from (55).
Therefore,
$$(54) \le C_7\,\frac{n\log(dn)\|p^*\|_\infty}{N} \le 2C_7\,\frac{n\log(dn)\|p^*\|_\infty\|c^*\|_F^2}{\sigma_2^2(c^{*(d-1)})\,N}, \qquad (89)$$
for some universal constant $C_7$, where the first inequality follows from (87) and (88), and the second inequality follows from the fact that $\sigma_2(c^{*(d-1)}) \le \|c^{*(d-1)}\|_F = \|c^*\|_F$. At this point, we combine (84) and (89) and obtain the final estimate for the relative error:
$$\frac{\|c^* - \hat G_{1:d}\|_F}{\|c^*\|_F} = O_P\left(\sqrt{\frac{n\log(dn)\|p^*\|_\infty}{N}\sum_{j=1}^{d-1}\frac{1}{\sigma_2^2(c^{*(j)})}}\right).$$
By further applying $\|p^*\|_\infty \le 1 + B^*d$ from Assumption 1, we arrive at the result in Theorem 1.
Lemma 3. Let Ĝ1:d be the output tensor cores from Algorithm 1 with target rank r. For k ∈
{1, · · · , d − 1}, it holds that
Therefore $(90) = \|\hat G_{1:k-1}\hat G_{1:k-1}^T c^{*(k-1)} - c^{*(k-1)}\|_F^2$.
Step 2. By Lemma 2, $\hat G_{1:k-1} \in \mathcal{O}^{n^{k-1}\times r}$. Since the Kronecker product of two orthogonal matrices is still orthogonal, $\hat G_{1:k-1}\otimes I_n \in \mathcal{O}^{n^k\times rn}$; the details are in Lemma 19. It follows that
Therefore
$$(92) = \big\langle (\hat G_{1:k-1}\otimes I_n)(\hat G_{1:k-1}\otimes I_n)^T c^{*(k)},\ (\hat G_{1:k-1}\otimes I_n)_\perp(\hat G_{1:k-1}\otimes I_n)_\perp^T c^{*(k)}\big\rangle = 0.$$
Lemma 4. Let $\hat G_{1:d}$ be the output tensor cores from Algorithm 1 with target rank $r$. Then
$$\|c^* - \hat G_{1:d}\|_F^2 = \|c^{*(d-1)} - \hat G_{1:d-1}\hat G_{1:d-1}^T c^{*(d-1)}\|_F^2 + \|\hat G_{1:d-1}\hat G_{1:d-1}^T c^{*(d-1)} - \hat G_{1:d-1}\hat G_{1:d-1}^T\hat c^{(d-1)}\|_F^2.$$
$$\big\langle \hat G_{1:d-1}\hat G_{1:d-1}^T c^{*(d-1)} - c^{*(d-1)},\ \hat G_{1:d-1}\hat G_{1:d-1}^T\hat c^{(d-1)} - \hat G_{1:d-1}\hat G_{1:d-1}^T c^{*(d-1)}\big\rangle = 0. \qquad (94)$$
By Lemma 2, $\hat G_{1:d-1} \in \mathcal{O}^{n^{d-1}\times r}$. So
Lemma 5. Suppose all the conditions in Theorem 1 hold. Let $\hat G_{1:d}$ be the output tensor cores from Algorithm 1 with target rank $r = 2$. Suppose that
$$\|\hat G_{1:k}\hat G_{1:k}^T c^{*(k)} - c^{*(k)}\|_F^2 \le C\,\frac{n\log(dn)\|p^*\|_\infty\|c^*\|_F^2}{N}\sum_{j=1}^{k}\frac{1}{\sigma_2^2(c^{*(j)})}, \qquad (95)$$
where $C$ is a universal constant. Let $U_k^*$ be defined as in Lemma 8. Note that by Lemma 8, $U_k^*U_k^{*T} \in \mathbb{R}^{n^k\times n^k}$ is the projection matrix onto the space $\mathrm{Col}(c^{*(k)})$. Then
$$\|\hat G_{1:k}\hat G_{1:k}^T - U_k^*U_k^{*T}\|_2 \le \frac{C'}{\sigma_2(c^{*(k)})}\sqrt{\frac{n\|p^*\|_\infty d\log(dn)}{N}}, \qquad (96)$$
where we have used the fact that $\sum_{j=1}^d \frac{1}{j(d-j)} = O\big(\frac{\log d}{d}\big)$ and $C''$ is some universal constant.
By Lemma 18, $\mathrm{Rank}(\hat G_{1:k}\hat G_{1:k}^T c^{*(k)}) = 2$, and $\hat G_{1:k}\hat G_{1:k}^T$ is the projection matrix onto $\mathrm{Col}(\hat G_{1:k}\hat G_{1:k}^T c^{*(k)})$. Consequently,
where the first inequality follows from Theorem 3 and $\mathrm{Rank}(c^{*(k)}) = 2$, the second inequality follows from the assumption on a sufficiently large sample size $N$ as stated in (39), and the last inequality follows from (97). Furthermore, by the expressions of $C_c$ in Corollary 2 and $c_\sigma$ in Lemma 10, the ratio $\frac{\sqrt{C_c}}{c_\sigma}$ can be bounded by a universal constant, leading to the final estimate (96).
where $c^* \in \mathbb{R}^{n^d}$ is the coefficient tensor formulated as in (46). There exists an absolute constant $C$ such that
$$\|p^* - \tilde p\|_{L_2([0,1]^d)} \le Cn^{-\beta_0}\sum_{j=1}^d \|q_j^*\|_{H^{\beta_0}([0,1])}.$$
where $c_j^* \in \mathbb{R}^n$ is defined in (49). Therefore
$$\|p^* - \tilde p\|_{L_2([0,1]^d)} \le \sum_{j=1}^d \Big\|\sum_{l_j=1}^n c_j^*(l_j)\,\phi_{l_j} - q_j^*\Big\|_{L_2([0,1])}.$$
Denote by $S_n$ the class of polynomials of degree up to $n$ on $[0,1]$, and by $P_n$ the projection operator from $L_2([0,1])$ onto $S_n$ with respect to the $L_2$ norm. Since the Legendre polynomials are orthogonal, (49) and (48) imply
$$\sum_{l_j=1}^n c_j^*(l_j)\,\phi_{l_j} = P_n(q_j^*). \qquad (99)$$
By the Legendre polynomial approximation theory shown in the appendix of Khoo et al. (2024), it follows that
$$\Big\|\sum_{l_j=1}^n c_j^*(l_j)\,\phi_{l_j} - q_j^*\Big\|_{L_2([0,1])} \le C\|q_j^*\|_{H^{\beta_0}([0,1])}\,n^{-\beta_0} \qquad (100)$$
for some absolute constant $C$ which does not depend on $c_j^*$ or $q_j^*$. The desired result follows immediately.
Corollary 2. Under Assumption 1, for sufficiently large $n$, there exists a constant $C_c$ such that $\|c^*\|_F^2 \le C_c\,d$, where $C_c = B^* + 2$ and $c^* \in \mathbb{R}^{n^d}$ is the coefficient tensor formulated as in (46).
Proof. Observe that
$$\begin{aligned}
\|c^*\|_F^2 &= \|1^d + c_1^*\otimes 1^{d-1} + 1\otimes c_2^*\otimes 1^{d-2} + \cdots + 1^{d-1}\otimes c_d^*\|_F^2 \\
&= \|1^d\|_F^2 + \|c_1^*\otimes 1^{d-1}\|_F^2 + \cdots + \|1^{d-1}\otimes c_d^*\|_F^2 = 1 + \sum_{j=1}^d \|c_j^*\|^2,
\end{aligned}$$
where the first equality follows from (46), and the second equality follows from orthogonality. In addition, note that
$$\|c_j^*\| = \Big\|\sum_{l_j=1}^n c_j^*(l_j)\,\phi_{l_j}\Big\|_{L_2([0,1])} \le \|q_j^*\|_{L_2([0,1])} + C_1\|q_j^*\|_{H^{\beta_0}([0,1])}\,n^{-\beta_0} \le B^* + 1,$$
where the first inequality follows from (100), and the last inequality follows from Assumption 1 and $n$ being sufficiently large. The desired result follows immediately.
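The orthogonality used in the first display is easy to confirm numerically on a small example. In the sketch below we read the symbol $1$ in (46) as the coefficient vector $e_1 \in \mathbb{R}^n$ of the constant basis function (an assumption on the notation) and draw each $c_j^*$ orthogonal to $e_1$, in line with (48); the identity $\|c^*\|_F^2 = 1 + \sum_j \|c_j^*\|^2$ then holds to machine precision.

import numpy as np
from functools import reduce

rng = np.random.default_rng(0)
n, d = 4, 5

e1 = np.zeros(n); e1[0] = 1.0                      # coefficient vector of the constant basis function
cj = rng.standard_normal((d, n)); cj[:, 0] = 0.0   # c_j^* has no component along e1, cf. (48)

def kron_all(vecs):
    return reduce(np.kron, vecs)

# c^* = e1^{(x)d} + sum_j e1^{(x)(j-1)} (x) c_j (x) e1^{(x)(d-j)}
c = kron_all([e1] * d)
for j in range(d):
    c = c + kron_all([e1] * j + [cj[j]] + [e1] * (d - j - 1))

lhs = np.sum(c ** 2)                               # ||c^*||_F^2
rhs = 1.0 + np.sum(cj ** 2)                        # 1 + sum_j ||c_j^*||^2
print(lhs, rhs)                                    # the two values agree up to floating-point error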
Proof. It is straightforward to show that
Therefore (101) holds. The justification of (102) is similar and omitted for brevity.
Lemma 8. Let $c^*$ be the coefficient tensor as in (46) and $c_j^*$ be as in (49). Denote $e_1 = [1,0]^T$, $e_2 = [0,1]^T$, and let the matrices $U_k^* \in \mathbb{R}^{n^k\times 2}$ and $V_k^* \in \mathbb{R}^{n^{d-k}\times 2}$ be such that
Then $U_k^*U_k^{*T} \in \mathbb{R}^{n^k\times n^k}$ is the projection matrix onto the space $\mathrm{Col}(c^{*(k)})$, and $V_k^*V_k^{*T} \in \mathbb{R}^{n^{d-k}\times n^{d-k}}$ is the projection matrix onto the space $\mathrm{Row}(c^{*(k)})$.
Proof. Note that the set of vectors $\{1^k,\ c_1^*\otimes 1^{k-1},\ 1\otimes c_2^*\otimes 1^{k-2},\ \cdots,\ 1^{k-1}\otimes c_k^*\}$ is orthogonal. Direct calculation shows that
$$U_k^{*T}U_k^* = e_1\otimes e_1 + e_2\otimes e_2 = I_2 \in \mathbb{R}^{2\times 2}.$$
Therefore $U_k^* \in \mathcal{O}^{n^k\times 2}$. Since $\mathrm{Col}(U_k^*) = \mathrm{Col}(c^{*(k)})$, $U_k^*U_k^{*T} \in \mathbb{R}^{n^k\times n^k}$ is the projection matrix onto the space $\mathrm{Col}(c^{*(k)})$. The justification for $V_k^*V_k^{*T}$ is similar and omitted for brevity.
Lemma 9. Let $u_k^*$ and $v_k^*$ be as in (106) and (107), respectively. Denote
$$\Phi_k^*(x_1,\cdots,x_k) = \frac{\sum_{l_1=1}^n c_1^*(l_1)\phi_{l_1}(x_1) + \cdots + \sum_{l_k=1}^n c_k^*(l_k)\phi_{l_k}(x_k)}{\sqrt{\|c_1^*\|^2 + \cdots + \|c_k^*\|^2}}, \qquad (108)$$
$$\Psi_k^*(x_{k+1},\cdots,x_d) = \frac{\sum_{l_{k+1}=1}^n c_{k+1}^*(l_{k+1})\phi_{l_{k+1}}(x_{k+1}) + \cdots + \sum_{l_d=1}^n c_d^*(l_d)\phi_{l_d}(x_d)}{\sqrt{\|c_{k+1}^*\|^2 + \cdots + \|c_d^*\|^2}}. \qquad (109)$$
Then $u_k^*$ is the coefficient tensor of $\Phi_k^*$, and $v_k^*$ is the coefficient tensor of $\Psi_k^*$.
In addition, suppose Assumption 1 holds. Then, for sufficiently large $n$,
$$\|\Phi_k^*\|_\infty \le C'\sqrt{k} \quad\text{and}\quad \|\Psi_k^*\|_\infty \le C'\sqrt{d-k}, \quad\text{where } C' = \frac{4B^*}{b^*}.$$
Proof. To see that $u_k^*$ is the coefficient tensor of $\Phi_k^*(x_1,\cdots,x_k)$, it suffices to observe that the $(l_1,\cdots,l_k)$-coordinate of $u_k^*$ is $\langle \Phi_k^*, \phi_{l_1}^1\cdots\phi_{l_k}^k\rangle$. Since taking the coefficient tensor is distance preserving, it holds that
$$\|\Phi_k^*\|_{L_2([0,1]^k)} = \|u_k^*\|_F = 1.$$
Next, we bound $\|\Phi_k^*\|_\infty$. By (99) and Theorem 5, there exists a constant $C_1$ such that
$$\Big\|\sum_{l_j=1}^n c_j^*(l_j)\,\phi_{l_j}\Big\|_\infty = \|P_n(q_j^*)\|_\infty \le \|q_j^*\|_\infty + C_1 n^{-\beta_0}.$$
for some constant $C_2$, where the second equality follows from (99), the first inequality follows from Corollary 4, the second inequality follows from the Assumption 1 condition $\min_{j\in\{1,\cdots,d\}}\|q_j^*\|_{L_2([0,1])} \ge b^*$, and the last inequality holds for sufficiently large $n$. Therefore, the denominator satisfies
$$\sqrt{\|c_1^*\|^2 + \cdots + \|c_k^*\|^2} \ge \frac{b^*\sqrt{k}}{2}. \qquad (111)$$
Note that (110) and (111) immediately imply $\|\Phi_k^*\|_\infty \le \frac{4B^*}{b^*}\sqrt{k}$. The analysis for $\|\Psi_k^*\|_\infty$ is the same and therefore omitted.
Then the $k$-th unfolding of the coefficient tensor $c^*$ has its smallest singular value formulated as
for $k = 1,\cdots,d$. Suppose in addition that Assumption 1 holds. Then, for sufficiently large $n$, $\sigma_2(c^{*(k)})$ is lower bounded by
$$\sigma_2(c^{*(k)}) \ge c_\sigma\sqrt{\frac{k(d-k)}{d}}, \quad\text{where } c_\sigma = \sqrt{\frac{b^{*4}}{16(1+4B^{*2})}}. \qquad (112)$$
$$f_k^* = c_1^*\otimes 1^{k-1} + 1\otimes c_2^*\otimes 1^{k-2} + \cdots + 1^{k-1}\otimes c_k^*, \qquad g_k^* = c_{k+1}^*\otimes 1^{d-k-1} + \cdots + 1^{d-k-1}\otimes c_d^*.$$
Then one can easily check that $a_k^* = \|f_k^*\|_F^2$ and $b_k^* = \|g_k^*\|_F^2$ due to orthogonality, and we can write
where terms with $(\,\cdot\,)$ are viewed as column vectors in $\mathbb{R}^{n^k}$, and terms with $\{\,\cdot\,\}$ are viewed as row vectors in $\mathbb{R}^{n^{d-k}}$. Therefore
$$c^{*(k)} = \begin{pmatrix} 1^k & f_k^* \end{pmatrix}\begin{pmatrix} \{1^{d-k} + g_k^*\} \\ \{1^{d-k}\} \end{pmatrix}.$$
Observe that
$$c^{*(k)}(c^{*(k)})^T = \begin{pmatrix} 1^k & f_k^* \end{pmatrix}\begin{pmatrix} 1+\|g_k^*\|_F^2 & 1 \\ 1 & 1 \end{pmatrix}\begin{pmatrix} 1^k & f_k^* \end{pmatrix}^T = \begin{pmatrix} 1^k & f_k^*/\|f_k^*\|_F \end{pmatrix}\begin{pmatrix} 1+\|g_k^*\|_F^2 & \|f_k^*\|_F \\ \|f_k^*\|_F & \|f_k^*\|_F^2 \end{pmatrix}\begin{pmatrix} 1^k & f_k^*/\|f_k^*\|_F \end{pmatrix}^T.$$
Note that $\begin{pmatrix} 1^k & f_k^*/\|f_k^*\|_F \end{pmatrix}$ is an orthogonal matrix in $\mathcal{O}^{n^k\times 2}$. Therefore the nonzero eigenvalues of $c^{*(k)}(c^{*(k)})^T$ are the eigenvalues of the $2\times 2$ matrix in the middle.
Since $\big\|\sum_{l_j=1}^n c_j^*(l_j)\phi_{l_j}\big\|_{L_2([0,1])} = \|c_j^*\|$, it follows that
$$\frac{b^{*2}k}{4} \le a_k^* \le 4B^{*2}k \quad\text{and}\quad \frac{b^{*2}(d-k)}{4} \le b_k^* \le 4B^{*2}(d-k).$$
Therefore
$$\begin{aligned}
\sigma_2^2(c^{*(k)}) &= \frac{1 + a_k^* + b_k^* - \sqrt{(1 + a_k^* + b_k^*)^2 - 4a_k^*b_k^*}}{2} \\
&= \frac{\big(1 + a_k^* + b_k^* - \sqrt{(1 + a_k^* + b_k^*)^2 - 4a_k^*b_k^*}\big)\big(1 + a_k^* + b_k^* + \sqrt{(1 + a_k^* + b_k^*)^2 - 4a_k^*b_k^*}\big)}{2\big(1 + a_k^* + b_k^* + \sqrt{(1 + a_k^* + b_k^*)^2 - 4a_k^*b_k^*}\big)}
\end{aligned}$$
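As a quick numerical sanity check on this closed form, one can build a synthetic unfolding with the same rank-2 structure and compare its second singular value with the formula. In the sketch below, the unit vectors u1 and v1 stand in for $1^k$ and $1^{d-k}$, and f, g are random stand-ins for $f_k^*$ and $g_k^*$ orthogonal to them (all choices are ours, for illustration only).

import numpy as np

rng = np.random.default_rng(1)
m1, m2 = 64, 81                                 # stand-ins for n^k and n^{d-k}

u1 = np.zeros(m1); u1[0] = 1.0                  # plays the role of 1^k (unit norm)
v1 = np.zeros(m2); v1[0] = 1.0                  # plays the role of 1^{d-k}
f = rng.standard_normal(m1); f[0] = 0.0         # f orthogonal to u1
g = rng.standard_normal(m2); g[0] = 0.0         # g orthogonal to v1

C = np.outer(u1, v1 + g) + np.outer(f, v1)      # C = u1 (v1+g)^T + f v1^T, the factorization of c^{*(k)}
a, b = f @ f, g @ g                             # a_k^* = ||f||^2, b_k^* = ||g||^2

sigma2_sq = (1 + a + b - np.sqrt((1 + a + b) ** 2 - 4 * a * b)) / 2
print(np.linalg.svd(C, compute_uv=False)[1] ** 2, sigma2_sq)   # second singular value squared matches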
Proof. We suppose $s \ge 3$ in the proof; the case $s = 2$ is a simplified version of the same calculations. Given any positive number $t$, we have
$$\begin{aligned}
&\mathbb{P}\Big(\max_{k\in\{1,\cdots,d\}}\|(U_{k-1}^{*T}\otimes I_n)(c^{*(k)} - \hat c^{(k)}[s])V_k^*\|_F^2 \ge 4nt^2\Big) \\
&= \mathbb{P}\Big(\|(U_{k-1}^{*T}\otimes I_n)(c^{*(k)} - \hat c^{(k)}[s])V_k^*\|_F^2 \ge 4nt^2 \text{ for at least one } k\in\{1,\cdots,d\}\Big) \\
&\le \sum_{k=1}^d \mathbb{P}\Big(\|(U_{k-1}^{*T}\otimes I_n)(c^{*(k)} - \hat c^{(k)}[s])V_k^*\|_F^2 \ge 4nt^2\Big) \quad \text{(by the union bound)} \\
&= \sum_{k=1}^d \mathbb{P}\big((118) \ge 4nt^2\big) \\
&\le \sum_{k=1}^d\sum_{l=1}^n \Big[\mathbb{P}\big(|\langle p^*-\hat p, \Phi_{k-1}^*\phi_l^k\Psi_k^*\rangle| \ge t\big) + \mathbb{P}\big(|\langle p^*-\hat p, \phi_l^k\rangle| \ge t\big) \\
&\qquad\qquad + \mathbb{P}\big(|\langle p^*-\hat p, \Phi_{k-1}^*\phi_l^k\rangle| \ge t\big) + \mathbb{P}\big(|\langle p^*-\hat p, \phi_l^k\Psi_k^*\rangle| \ge t\big)\Big].
\end{aligned} \qquad (115)$$
We now bound the four terms in the double sum above by Lemma 16, which is based on the Bernstein inequality.
We first bound
$$\mathbb{P}\big(|\langle p^*-\hat p, \Phi_{k-1}^*\phi_l^k\Psi_k^*\rangle| \ge t\big).$$
Note that
$$\|\Phi_{k-1}^*\phi_l^k\Psi_k^*\|_{L_2([0,1]^d)} = \|\Phi_{k-1}^*\|_{L_2([0,1]^{k-1})}\,\|\phi_l\|_{L_2([0,1])}\,\|\Psi_k^*\|_{L_2([0,1]^{d-k})} = 1,$$
and that
$$\|\Phi_{k-1}^*\phi_l^k\Psi_k^*\|_\infty = \|\Phi_{k-1}^*\|_\infty\,\|\phi_l\|_\infty\,\|\Psi_k^*\|_\infty \le C'^2\sqrt{n(k-1)(d-k)},$$
where $C' = \frac{4B^*}{b^*}$.
Therefore, by Lemma 16,
$$\begin{aligned}
\mathbb{P}\big(|\langle p^*-\hat p, \Phi_{k-1}^*\phi_l^k\Psi_k^*\rangle| \ge t\big)
&\le 2\exp\left(\frac{-Nt^2}{\|p^*\|_\infty\|\Phi_{k-1}^*\phi_l^k\Psi_k^*\|_{L_2([0,1]^d)}^2 + \|\Phi_{k-1}^*\phi_l^k\Psi_k^*\|_\infty t}\right) \\
&\le 2\exp\left(\frac{-Nt^2}{\|p^*\|_\infty + C'^2\sqrt{n(k-1)(d-k)}\,t}\right) =: \frac{\epsilon}{4dn},
\end{aligned} \qquad (116)$$
where we set the right-hand side to be $\frac{\epsilon}{4dn}$ for some sufficiently small constant $\epsilon$. Solving the last line above yields the quadratic equation
$$Nt^2 - C'^2\sqrt{n(k-1)(d-k)}\,\log\!\Big(\frac{8dn}{\epsilon}\Big)\,t - \log\!\Big(\frac{8dn}{\epsilon}\Big)\|p^*\|_\infty = 0,$$
whose larger root can be bounded by
$$t_+ = \frac{C'^2\sqrt{n(k-1)(d-k)}\,\log\!\big(\frac{8dn}{\epsilon}\big) + \sqrt{C'^4 n(k-1)(d-k)\big[\log\!\big(\frac{8dn}{\epsilon}\big)\big]^2 + 4N\log\!\big(\frac{8dn}{\epsilon}\big)\|p^*\|_\infty}}{2N} \le C_\epsilon\sqrt{\frac{\log(dn)\|p^*\|_\infty}{N}} =: t_\epsilon$$
upon choosing a sufficiently large $N$. Here $C_\epsilon \sim \log(1/\epsilon)$ is some sufficiently large positive constant which depends only on $\epsilon$. At this point, we have
$$\mathbb{P}\big(|\langle p^*-\hat p, \Phi_{k-1}^*\phi_l^k\Psi_k^*\rangle| \ge t_\epsilon\big) \le \frac{\epsilon}{4dn}.$$
For the other three terms in the double sum, by Lemma 9 we have
$$\|\phi_l^k\|_\infty \le \sqrt{n}, \qquad \|\Phi_{k-1}^*\phi_l^k\|_\infty \le C'\sqrt{n(k-1)}, \qquad \|\phi_l^k\Psi_k^*\|_\infty \le C'\sqrt{n(d-k)}.$$
Following the same calculations, we can show that
$$\mathbb{P}\big(|\langle p^*-\hat p, \phi_l^k\rangle| \ge t_\epsilon\big) \le \frac{\epsilon}{4dn}, \qquad
\mathbb{P}\big(|\langle p^*-\hat p, \Phi_{k-1}^*\phi_l^k\rangle| \ge t_\epsilon\big) \le \frac{\epsilon}{4dn}, \qquad
\mathbb{P}\big(|\langle p^*-\hat p, \phi_l^k\Psi_k^*\rangle| \ge t_\epsilon\big) \le \frac{\epsilon}{4dn}$$
with an appropriate choice of $N$. Inserting the estimates above into (115), we conclude that
$$\mathbb{P}\Big(\max_{k\in\{1,\cdots,d\}}\|(U_{k-1}^{*T}\otimes I_n)(c^{*(k)} - \hat c^{(k)}[s])V_k^*\|_F^2 \ge 4nt_\epsilon^2\Big) \le \epsilon$$
for any given positive $\epsilon$. This indicates that, with high probability,
$$\max_{k\in\{1,\cdots,d\}}\|(U_{k-1}^{*T}\otimes I_n)(c^{*(k)} - \hat c^{(k)}[s])V_k^*\|_F \le 2\sqrt{n}\,t_\epsilon = 2C_\epsilon\sqrt{\frac{n\log(dn)\|p^*\|_\infty}{N}}.$$
Lemma 12. Suppose Assumption 1 holds. Let $c^*$ be the coefficient tensor of $p^*$, and let $\hat c$ be the $s$-cluster coefficient tensor estimator defined in Definition 1 with $s \ge 2$. Then
$$\|c^* - \hat c\|_F = O_P\left(\sqrt{\frac{d^s n^{2s}}{N}}\right). \qquad (117)$$
Proof. It suffices to show that
$$\mathbb{E}\big(\|c^* - \hat c\|_F^2\big) = O\!\left(\frac{d^s n^{2s}}{N}\right).$$
To this end, let $\mathcal{B}_{j,d}$ be the $j$-cluster basis functions defined in (5). Then
$$\mathbb{E}\big(\|c^* - \hat c\|_F^2\big) = \sum_{j=0}^{s}\sum_{\varphi_j\in\mathcal{B}_{j,d}}\mathbb{E}\big(\langle\hat p - p^*, \varphi_j\rangle^2\big).$$
Let $g_s(x) = \prod_{k=1}^s \phi_{i_{j_k}}(x_{j_k})$ be a generic element of $\mathcal{B}_{s,d}$. Since $\{\phi_l\}_{l=1}^\infty$ are Legendre polynomials, it follows that $\|\phi_l\|_\infty \le \sqrt{l}$ and therefore $\|g_s\|_\infty \le n^{s/2}$. Consequently,
$$\mathbb{E}\big(\langle\hat p - p^*, g_s\rangle^2\big) = \mathrm{Var}\Big(\frac{1}{N}\sum_{i=1}^N g_s(x^{(i)})\Big) = \frac{\mathrm{Var}(g_s(x^{(1)}))}{N} \le \frac{\int g_s^2(x)p^*(x)\,dx}{N} \le \frac{\|g_s\|_\infty^2\int p^*(x)\,dx}{N} \le \frac{n^s}{N},$$
where the last inequality follows from $\|g_s\|_\infty^2 \le n^s$ and the fact that $p^*$ is a density, so that $\int p^*(x)\,dx = 1$. Since the cardinality of $\mathcal{B}_{s,d}$ is at most $n^s d^s$, it follows that $\sum_{\varphi_s\in\mathcal{B}_{s,d}}\mathbb{E}(\langle\hat p - p^*, \varphi_s\rangle^2) \le \frac{d^s n^{2s}}{N}$. Since $s$ is a bounded integer, it follows that
$$\mathbb{E}\big(\|c^* - \hat c\|_F^2\big) \le \sum_{j=1}^s \frac{d^j n^{2j}}{N} = O\!\left(\frac{d^s n^{2s}}{N}\right).$$
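A small Monte-Carlo illustration of the variance calculation above: we take $p^*$ to be the uniform density on $[0,1]^2$ (an assumption for this sketch) and $g_s$ a product of two normalized shifted Legendre polynomials, so $s = 2$; the empirical mean square of $\langle\hat p - p^*, g_s\rangle$ then stays below the bound $n^s/N$.

import numpy as np

rng = np.random.default_rng(2)
N, trials = 500, 2000

# normalized shifted Legendre polynomials on [0,1]
phi2 = lambda x: np.sqrt(3.0) * (2 * x - 1)
phi3 = lambda x: np.sqrt(5.0) * (6 * x ** 2 - 6 * x + 1)
g = lambda x1, x2: phi2(x1) * phi3(x2)           # a 2-cluster basis function (s = 2)

vals = []
for _ in range(trials):
    x = rng.random((N, 2))                       # N i.i.d. samples from the uniform density p^*
    vals.append(np.mean(g(x[:, 0], x[:, 1])))    # <p_hat - p^*, g>, since the exact mean is 0
vals = np.array(vals)

n = 3                                            # per-coordinate basis functions involved, up to index 3
print(np.mean(vals ** 2), n ** 2 / N)            # empirical E<.,.>^2 versus the bound n^s / N with s = 2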
Lemma 13. Let $c^*$ be the coefficient tensor of $p^*$, and let $\hat c[s]$ be the $s$-cluster coefficient tensor estimator defined in Definition 1 with $s \ge 2$. Let $U_k^*$ and $V_k^*$ be defined as in (104) and (105), respectively, and let $\Phi_k^*$ and $\Psi_k^*$ be defined as in (108) and (109), respectively. The following equalities hold:
• If $s \ge 3$,
$$\|(U_{k-1}^{*T}\otimes I_n)(c^{*(k)} - \hat c[s]^{(k)})V_k^*\|_F^2 = \sum_{l=1}^n\Big(\langle p^*-\hat p, \phi_l^k\rangle^2 + \langle p^*-\hat p, \Phi_{k-1}^*\phi_l^k\rangle^2 + \langle p^*-\hat p, \phi_l^k\Psi_k^*\rangle^2 + \langle p^*-\hat p, \Phi_{k-1}^*\phi_l^k\Psi_k^*\rangle^2\Big); \qquad (118)$$
• If $s = 2$,
$$\|(U_{k-1}^{*T}\otimes I_n)(c^{*(k)} - \hat c[s]^{(k)})V_k^*\|_F^2 = \sum_{l=1}^n\Big(\langle p^*-\hat p, \phi_l^k\rangle^2 + \langle p^*-\hat p, \Phi_{k-1}^*\phi_l^k\rangle^2 + \langle p^*-\hat p, \phi_l^k\Psi_k^*\rangle^2\Big). \qquad (119)$$
Proof. For $l \in \{1,\cdots,n\}$, let $\chi_l \in \mathbb{R}^n$ denote the $l$-th canonical basis vector. That is, for $k \in \{1,\cdots,n\}$,
$$\chi_l(k) = \begin{cases} 0, & \text{if } l \ne k;\\ 1, & \text{if } l = k. \end{cases}$$
Denote by $\mathcal{T}$ the reshaped $\mathbb{R}^{2\times n\times 2}$ tensor of the matrix $(U_{k-1}^{*T}\otimes I_n)(c^{*(k)} - \hat c[s]^{(k)})V_k^*$. It follows from (104), (105), (106) and (107) that, for any $l \in \{1,\cdots,n\}$, the $(a,l,b)$-coordinate of $\mathcal{T}$ satisfies
$$\mathcal{T}_{a,l,b} = \begin{cases}
\langle c^* - \hat c[s],\ 1^{k-1}\otimes\chi_l\otimes 1^{d-k}\rangle, & \text{if } a=1,\ b=1;\\
\langle c^* - \hat c[s],\ u_{k-1}^*\otimes\chi_l\otimes 1^{d-k}\rangle, & \text{if } a=2,\ b=1;\\
\langle c^* - \hat c[s],\ 1^{k-1}\otimes\chi_l\otimes v_k^*\rangle, & \text{if } a=1,\ b=2;\\
\langle c^* - \hat c[s],\ u_{k-1}^*\otimes\chi_l\otimes v_k^*\rangle, & \text{if } a=2,\ b=2.
\end{cases}$$
• $u_{k-1}^*\otimes\chi_l\otimes 1^{d-k}$ is the coefficient tensor of $\Phi_{k-1}^*(x_1,\cdots,x_{k-1})\phi_l(x_k) \in \mathrm{Span}\{\mathcal{B}_{2,d}\}$;
• $u_{k-1}^*\otimes\chi_l\otimes v_k^*$ is the coefficient tensor of $\Phi_{k-1}^*(x_1,\cdots,x_{k-1})\phi_l(x_k)\Psi_k^*(x_{k+1},\cdots,x_d) \in \mathrm{Span}\{\mathcal{B}_{3,d}\}$.
Step 2. Suppose $s = 2$. Note that $\Phi_{k-1}^*\phi_l^k\Psi_k^* \in \mathrm{Span}\{\mathcal{B}_{3,d}\}$. By Lemma 14, the corresponding coordinate $\langle c^* - \hat c[s],\ u_{k-1}^*\otimes\chi_l\otimes v_k^*\rangle = 0$. The rest of the calculations are the same as in Step 1.
Lemma 14. Suppose $p^*$ satisfies (38), and $\hat c[s]$ is the $s$-cluster coefficient tensor estimator of $p^*$ in Definition 1, where $2 \le s \le d$. Suppose $t \in \{1,\cdots,d\}$ and $F: \mathbb{R}^d \to \mathbb{R}$ is any function such that $F \in \mathrm{Span}\{\mathcal{B}_{t,d}\}$. Let $f \in \mathbb{R}^{n^d}$ be the coefficient tensor of $F$. Then
$$\langle c^* - \hat c[s], f\rangle = \begin{cases} 0, & \text{if } t > s;\\ \langle p^* - \hat p, F\rangle, & \text{if } t \le s. \end{cases}$$
Proof. Observe that, by the orthogonality of $\{\mathcal{B}_{j,d}\}_{j=1}^d$,
$$\begin{aligned}
\langle c^* - \hat c[s], f\rangle &= \sum_{l_1,\cdots,l_d=1}^n (c^* - \hat c[s])(l_1,\cdots,l_d)\cdot f(l_1,\cdots,l_d) \\
&= \sum_{l_1,\cdots,l_d=1}^n \langle p^* - \hat p_s, \phi_{l_1}^1\cdots\phi_{l_d}^d\rangle\langle F, \phi_{l_1}^1\cdots\phi_{l_d}^d\rangle \\
&= \sum_{\varphi_0\in\mathcal{B}_{0,d}}\langle p^* - \hat p_s, \varphi_0\rangle\langle F, \varphi_0\rangle + \cdots + \sum_{\varphi_d\in\mathcal{B}_{d,d}}\langle p^* - \hat p_s, \varphi_d\rangle\langle F, \varphi_d\rangle \\
&= \sum_{\varphi_0\in\mathcal{B}_{0,d}}\langle p^* - \hat p_s, \varphi_0\rangle\langle F, \varphi_0\rangle + \cdots + \sum_{\varphi_s\in\mathcal{B}_{s,d}}\langle p^* - \hat p_s, \varphi_s\rangle\langle F, \varphi_s\rangle,
\end{aligned} \qquad (120)$$
where, in the second equality, $\hat p_s$ is defined in (50), and the last equality follows from the observation that $p^* \in \mathrm{Span}\{\mathcal{B}_{j,d}\}_{j=0}^1$ and $\hat p_s \in \mathrm{Span}\{\mathcal{B}_{j,d}\}_{j=0}^s$ with $s \ge 2$. Therefore,
• If $t > s$, then $\langle F, \varphi_j\rangle = 0$ for every $\varphi_j \in \mathcal{B}_{j,d}$ with $j \le s$, so in this case $(120) = 0$;
• If $t \le s$, then $(120) = \langle p^* - \hat p, F\rangle$ by Lemma 15.
Lemma 15. Let $\Omega$ be a measurable set in arbitrary dimension. Let $\{\psi_k\}_{k=1}^\infty$ be an orthonormal basis of $L_2(\Omega)$ and let $\mathcal{M}$ be an $m$-dimensional linear subspace of $L_2(\Omega)$ such that $\mathcal{M} = \mathrm{Span}\{\psi_k\}_{k=1}^m$. Suppose $F, G \in \mathcal{M}$ and denote $\langle F, G\rangle = \int_\Omega F(x)G(x)\,dx$. Then
$$\langle F, G\rangle = \sum_{k=1}^m \langle F, \psi_k\rangle\langle G, \psi_k\rangle.$$
Proof. It suffices to observe that
$$\langle F, G\rangle = \sum_{k=1}^\infty \langle F, \psi_k\rangle\langle G, \psi_k\rangle = \sum_{k=1}^m \langle F, \psi_k\rangle\langle G, \psi_k\rangle,$$
where the first equality follows from the assumption that $\{\psi_k\}_{k=1}^\infty$ is a complete basis, and the second equality follows from the fact that $F \in \mathcal{M}$, so that $\langle F, \psi_k\rangle = 0$ if $k > m$.
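As a small numerical illustration of Lemma 15, take $\Omega = [0,1]$ with the first few normalized shifted Legendre polynomials as $\{\psi_k\}$ and approximate the integrals by Gauss-Legendre quadrature; the specific coefficients and quadrature size below are our own choices.

import numpy as np
from numpy.polynomial import legendre as L

t, w = L.leggauss(60)
x, w = 0.5 * (t + 1.0), 0.5 * w                            # Gauss-Legendre quadrature on [0,1]

def psi(k, x):
    c = np.zeros(k + 1); c[k] = 1.0
    return np.sqrt(2 * k + 1) * L.legval(2 * x - 1, c)     # orthonormal shifted Legendre on [0,1]

m = 4
coef_F = np.array([1.0, -0.5, 2.0, 0.3])                   # F, G chosen inside M = span{psi_0,...,psi_3}
coef_G = np.array([0.7, 1.2, -0.4, 0.9])
F = sum(a * psi(k, x) for k, a in enumerate(coef_F))
G = sum(b * psi(k, x) for k, b in enumerate(coef_G))

inner_FG = np.sum(w * F * G)                               # <F, G> on Omega = [0,1]
parseval = sum(np.sum(w * F * psi(k, x)) * np.sum(w * G * psi(k, x)) for k in range(m))
print(inner_FG, parseval, coef_F @ coef_G)                 # all three values agree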
Lemma 16. Suppose that $\{x^{(i)}\}_{i=1}^N$ are independently sampled from the density $p^*: \mathbb{R}^d \to \mathbb{R}_+$. Denote the empirical distribution $\hat p(x) = \frac{1}{N}\sum_{i=1}^N \delta_{x^{(i)}}(x)$. Let $\varphi: \mathbb{R}^d \to \mathbb{R}$ be any non-random function. Then
$$\mathbb{P}\big(|\langle\hat p - p^*, \varphi\rangle| \ge t\big) \le 2\exp\left(\frac{-Nt^2}{\|p^*\|_\infty\|\varphi\|_{L_2([0,1]^d)}^2 + \|\varphi\|_\infty t}\right).$$
Proof. It suffices to apply Bernstein's inequality to the random variable $\langle\hat p - p^*, \varphi\rangle$. Observe that $\langle\hat p - p^*, \varphi\rangle = \frac{1}{N}\sum_{i=1}^N\big(\varphi(x^{(i)}) - \mathbb{E}[\varphi(x^{(i)})]\big)$. Since
$$\mathbb{E}\Big(\big(\varphi(x^{(i)}) - \mathbb{E}[\varphi(x^{(i)})]\big)^2\Big) = \mathrm{Var}[\varphi(x^{(i)})] \le \mathbb{E}\big(\varphi^2(x^{(i)})\big) = \int\varphi^2(x)p^*(x)\,dx \le \|\varphi\|_{L_2([0,1]^d)}^2\|p^*\|_\infty$$
and $\|\varphi(x^{(i)}) - \mathbb{E}[\varphi(x^{(i)})]\|_\infty \le 2\|\varphi\|_\infty$, the desired result follows immediately from Bernstein's inequality.
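A quick Monte-Carlo sanity check of the tail bound in Lemma 16: take $p^*$ to be the uniform density on $[0,1]$ (so $\|p^*\|_\infty = 1$) and $\varphi(x) = \sqrt{3}(2x-1)$, for which $\|\varphi\|_{L_2([0,1])} = 1$ and $\|\varphi\|_\infty = \sqrt{3}$; the chosen $N$, threshold $t$ and number of trials are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(3)
N, trials, t = 200, 20000, 0.15

phi = lambda x: np.sqrt(3.0) * (2 * x - 1)       # normalized degree-1 shifted Legendre, mean 0 under p^*
x = rng.random((trials, N))                      # i.i.d. uniform samples; p^* = 1 on [0,1]
dev = np.abs(phi(x).mean(axis=1))                # |<p_hat - p^*, phi>| for each trial

empirical = np.mean(dev >= t)
bound = 2 * np.exp(-N * t ** 2 / (1.0 * 1.0 + np.sqrt(3.0) * t))   # ||p^*||_inf ||phi||_2^2 + ||phi||_inf t
print(empirical, bound)                          # the empirical tail probability lies below the bound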
C.6 Orthonormal matrices
Lemma 17. Suppose $M \in \mathbb{R}^{n\times m}$ and $V \in \mathcal{O}^{n\times a}$ in the sense that $V^TV = I_a$. Suppose in addition that $V^TM = U\Sigma W^T$ is the SVD of $V^TM$, where $U \in \mathcal{O}^{a\times r}$, $\Sigma \in \mathbb{R}^{r\times r}$ and $W \in \mathcal{O}^{m\times r}$. Then
$$VV^TM = (VU)\Sigma W^T. \qquad (121)$$
Lemma 18. Suppose $M \in \mathbb{R}^{n\times m}$ is such that $\mathrm{Rank}(M) = r$. Suppose in addition that $U \in \mathcal{O}^{n\times r}$ and that $\sigma_r(M) > 2\|UU^TM - M\|_F$. Then
• $\mathrm{Rank}(UU^TM) = \mathrm{Rank}(U^TM) = r$;
• $\mathrm{Row}(M) = \mathrm{Row}(UU^TM)$;
• $\sigma_r(U^TM) = \sigma_r(UU^TM) \ge \dfrac{\sigma_r(M)}{2}$;
• $UU^T$ is the projection matrix onto $\mathrm{Col}(UU^TM)$. In other words, the column vectors of $U$ form a basis for $\mathrm{Col}(UU^TM)$.
$$\sigma_r(UU^TM) \ge \sigma_r(M) - \|UU^TM - M\|_F > 0.$$
$$\sigma_r(U^TM) = \sigma_r(UU^TM) \ge \sigma_r(M) - \|UU^TM - M\|_F \ge \frac{\sigma_r(M)}{2},$$
where the equality follows from Lemma 17, and the first inequality follows from Theorem 2.
Lemma 19. Let $m, n, r, s \in \mathbb{Z}_+$ be positive integers. Suppose $U \in \mathcal{O}^{m\times r}$ and $W \in \mathcal{O}^{n\times s}$. Then $U\otimes W \in \mathcal{O}^{mn\times rs}$.
Proof. It suffices to observe that
$$(U\otimes W)^T(U\otimes W) = (U^T\otimes W^T)(U\otimes W) = U^TU\otimes W^TW = I_r\otimes I_s = I_{rs}.$$
Lemma 20. Let $m, n, r, s \in \mathbb{Z}_+$ be positive integers. Suppose $U \in \mathcal{O}^{m\times r}$ and $V \in \mathcal{O}^{rn\times s}$. Then $(U\otimes I_n)V \in \mathcal{O}^{mn\times s}$.
Proof. It suffices to observe that
$$[(U\otimes I_n)V]^T(U\otimes I_n)V = V^T(U^T\otimes I_n)(U\otimes I_n)V = V^T(U^TU\otimes I_n)V = V^T(I_r\otimes I_n)V = V^TI_{rn}V = I_s.$$
Lemma 21. Let $l, m, r \in \mathbb{Z}_+$ be positive integers. Let $U \in \mathcal{O}^{m\times r}$ and let $U_\perp \in \mathcal{O}^{m\times(m-r)}$ be the orthogonal complement matrix of $U$. Then
$$(U\otimes I_l)_\perp = U_\perp\otimes I_l.$$
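The Kronecker facts in Lemmas 19 and 20 are easy to confirm numerically; the sketch below builds matrices with orthonormal columns via QR and checks both identities (dimensions chosen arbitrarily for illustration).

import numpy as np

rng = np.random.default_rng(4)
m, n, r, s = 6, 5, 3, 2

U, _ = np.linalg.qr(rng.standard_normal((m, r)))        # U in O^{m x r}
W, _ = np.linalg.qr(rng.standard_normal((n, s)))        # W in O^{n x s}
V, _ = np.linalg.qr(rng.standard_normal((r * n, s)))    # V in O^{rn x s}

K = np.kron(U, W)                                       # Lemma 19: U (x) W in O^{mn x rs}
print(np.allclose(K.T @ K, np.eye(r * s)))

M = np.kron(U, np.eye(n)) @ V                           # Lemma 20: (U (x) I_n) V in O^{mn x s}
print(np.allclose(M.T @ M, np.eye(s)))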
Theorem 3. Let $A$, $\hat A$ and $E$ be three matrices in $\mathbb{R}^{n\times m}$ such that $\hat A = A + E$. Suppose the SVDs of $A$ and $\hat A$ satisfy
$$A = \begin{pmatrix} U & U_\perp \end{pmatrix}\begin{pmatrix} \Sigma_1 & 0\\ 0 & \Sigma_2 \end{pmatrix}\begin{pmatrix} V^T\\ V_\perp^T \end{pmatrix}, \qquad
\hat A = \begin{pmatrix} \hat U & \hat U_\perp \end{pmatrix}\begin{pmatrix} \hat\Sigma_1 & 0\\ 0 & \hat\Sigma_2 \end{pmatrix}\begin{pmatrix} \hat V^T\\ \hat V_\perp^T \end{pmatrix}.$$
Here $r \le \min\{n,m\}$, and $\Sigma_1, \hat\Sigma_1 \in \mathbb{R}^{r\times r}$ contain the leading $r$ singular values of $A$ and $\hat A$, respectively. If $\sigma_r(A) - \sigma_{r+1}(A) - \|E\|_2 > 0$, then
$$\|UU^T - \hat U\hat U^T\|_2 \le \frac{2\|E\|_2}{\sigma_r(A) - \sigma_{r+1}(A) - \|E\|_2}.$$
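A numerical spot check of the perturbation bound in Theorem 3 on a random example; here proj denotes the orthogonal projector onto the leading $r$ left singular vectors, and the spectrum of $A$ and the noise level are our own illustrative choices.

import numpy as np

rng = np.random.default_rng(5)
n, m, r = 30, 25, 2

# a matrix with a clear gap between sigma_r and sigma_{r+1}
U0, _ = np.linalg.qr(rng.standard_normal((n, m)))
V0, _ = np.linalg.qr(rng.standard_normal((m, m)))
A = U0 @ np.diag([10.0, 8.0] + [0.5] * (m - 2)) @ V0.T

E = 0.3 * rng.standard_normal((n, m))
Ahat = A + E

def proj(M, r):
    U = np.linalg.svd(M)[0][:, :r]
    return U @ U.T

s = np.linalg.svd(A, compute_uv=False)
normE = np.linalg.norm(E, 2)
lhs = np.linalg.norm(proj(A, r) - proj(Ahat, r), 2)
rhs = 2 * normE / (s[r - 1] - s[r] - normE)             # 2||E||_2 / (sigma_r - sigma_{r+1} - ||E||_2)
print(lhs, rhs)                                          # lhs <= rhs when the gap condition holds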
Lemma 22. Let $A$, $\hat A$ and $E$ be three matrices in $\mathbb{R}^{n\times m}$ such that $\hat A = A + E$. Suppose the SVDs of $A$ and $\hat A$ satisfy
$$A = \begin{pmatrix} U & U_\perp \end{pmatrix}\begin{pmatrix} \Sigma_1 & 0\\ 0 & \Sigma_2 \end{pmatrix}\begin{pmatrix} V^T\\ V_\perp^T \end{pmatrix}, \qquad
\hat A = \begin{pmatrix} \hat U & \hat U_\perp \end{pmatrix}\begin{pmatrix} \hat\Sigma_1 & 0\\ 0 & \hat\Sigma_2 \end{pmatrix}\begin{pmatrix} \hat V^T\\ \hat V_\perp^T \end{pmatrix}.$$
Here $r \le \min\{n,m\}$, and $\Sigma_1, \hat\Sigma_1 \in \mathbb{R}^{r\times r}$ contain the leading $r$ singular values of $A$ and $\hat A$, respectively. Denote
If $\alpha^2 \ge \beta^2 + \min\{z_{21}^2, z_{12}^2\}$, then
$$\|UU^T - \hat U\hat U^T\|_2 \le \frac{\alpha z_{21} + \beta z_{12}}{\alpha^2 - \beta^2 - \min\{z_{21}^2, z_{12}^2\}}.$$
Corollary 3. Let $A$, $\hat A$ and $E$ be three matrices in $\mathbb{R}^{n\times m}$ such that $\hat A = A + E$. Suppose $\mathrm{Rank}(A) = r$ and $\sigma_r(A) \ge 4\|E\|_2$. Write the SVDs of $A$ and $\hat A$ as
$$A = \begin{pmatrix} U & U_\perp \end{pmatrix}\begin{pmatrix} \Sigma_1 & 0\\ 0 & 0 \end{pmatrix}\begin{pmatrix} V^T\\ V_\perp^T \end{pmatrix}, \qquad
\hat A = \begin{pmatrix} \hat U & \hat U_\perp \end{pmatrix}\begin{pmatrix} \hat\Sigma_1 & 0\\ 0 & \hat\Sigma_2 \end{pmatrix}\begin{pmatrix} \hat V^T\\ \hat V_\perp^T \end{pmatrix}, \qquad (122)$$
where $\Sigma_1, \hat\Sigma_1 \in \mathbb{R}^{r\times r}$ contain the leading $r$ singular values of $A$ and $\hat A$, respectively. Then for any $\tilde U \in \mathcal{O}^{n\times r}$, $\tilde V \in \mathcal{O}^{m\times r}$ such that
$$\tilde U\tilde U^T = UU^T \quad\text{and}\quad \tilde V\tilde V^T = VV^T,$$
where the second equality follows from the fact that $U^T\hat AV \in \mathbb{R}^{r\times r}$, and the last equality follows from Lemma 17. So by Theorem 2,
$$|\alpha - \sigma_r(A)| \le \|UU^TEVV^T\|_2 = \|E\tilde V\tilde V^T\|_2 = \|E\tilde V\|_2, \qquad (124)$$
where the first and the last equality both follow from Lemma 17. Consequently,
$$\alpha^2 - \beta^2 - \min\{z_{21}^2, z_{12}^2\} \ge \frac{1}{2}\sigma_r^2(A) - 2\|E\tilde V\|_2^2 - \|E\|_2^2 - \|E\|_2^2 \ge \frac{1}{2}\sigma_r^2(A) - 4\|E\|_2^2 \ge \frac{1}{4}\sigma_r^2(A),$$
where the first inequality follows from (124), the second inequality follows from $\|E\tilde V\|_2 \le \|E\|_2$, and the last inequality follows from $\sigma_r(A) \ge 4\|E\|_2$. Therefore, by Lemma 22,
$$\|\tilde U\tilde U^T - \hat U\hat U^T\|_2 = \|UU^T - \hat U\hat U^T\|_2 \le \frac{\alpha z_{21} + \beta z_{12}}{\alpha^2 - \beta^2 - \min\{z_{21}^2, z_{12}^2\}} \le \frac{(\sigma_r(A) + \|E\tilde V\|_2)\|E\tilde V\|_2 + \|E\|_2^2}{\frac{1}{4}\sigma_r^2(A)} \le \frac{4\|E\tilde V\|_2}{\sigma_r(A)} + \frac{8\|E\|_2^2}{\sigma_r^2(A)}.$$
Theorem 4. For any $f \in C^{\beta_0}([0,1])$, there exists a polynomial $p_f$ of degree at most $n$ such that
Proof. This is the well-known Jackson theorem for polynomial approximation; see Section 7.2 of DeVore and Lorentz (1993) for a proof.
Corollary 4. Let $f \in C^{\beta_0}([0,1])$. Denote by $S_n$ the class of polynomials of degree up to $n$ on $[0,1]$, and by $P_n$ the projection operator from $L_2([0,1])$ onto $S_n$ with respect to the $L_2$ norm. Then there exists a constant $C$ which does not depend on $f$ such that
Theorem 5. Let $f \in C^{\beta_0}([0,1])$. Denote by $S_n$ the class of polynomials of degree up to $n$ on $[0,1]$, and by $P_n$ the projection operator from $L_2([0,1])$ onto $S_n$ with respect to the $L_2$ norm. Then there exists a constant $C$ depending only on $f$ such that
$$\|f - P_n(f)\|_\infty \le Cn^{-\beta_0}.$$
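As an illustration of Theorems 4 and 5, the sketch below computes the $L_2([0,1])$ projection of a smooth test function onto shifted Legendre polynomials via Gauss-Legendre quadrature and reports the sup-norm error, which decays rapidly with the degree; the test function and quadrature size are our own choices.

import numpy as np
from numpy.polynomial import legendre as L

def phi(k, x):
    c = np.zeros(k + 1); c[k] = 1.0
    return np.sqrt(2 * k + 1) * L.legval(2 * x - 1, c)   # orthonormal shifted Legendre on [0,1]

def project(f, n, nquad=200):
    t, w = L.leggauss(nquad)
    x, w = 0.5 * (t + 1.0), 0.5 * w                      # Gauss-Legendre quadrature on [0,1]
    coeffs = [np.sum(w * f(x) * phi(k, x)) for k in range(n + 1)]
    return lambda y: sum(c * phi(k, y) for k, c in enumerate(coeffs))

f = lambda x: np.exp(np.sin(3.0 * x))                    # a smooth test function
grid = np.linspace(0.0, 1.0, 2001)
for n in (4, 8, 16, 32):
    err = np.max(np.abs(f(grid) - project(f, n)(grid)))  # sup-norm error ||f - P_n f||_inf
    print(n, err)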