Tensor Density Estimator by Convolution-Deconvolution

Yifan Peng¹, Siyao Yang*², Yuehaw Khoo³, and Daren Wang⁴

¹,² Committee on Computational and Applied Mathematics, University of Chicago
³ Committee on Computational and Applied Mathematics and Department of Statistics, University of Chicago
⁴ Department of Applied and Computational Mathematics and Statistics, University of Notre Dame

*Corresponding author

arXiv:2412.18964v3 [math.NA] 11 Jul 2025

Abstract
We propose a linear algebraic framework for performing density estimation. It consists of
three simple steps: convolving the empirical distribution with certain smoothing kernels to re-
move the exponentially large variance; compressing the empirical distribution after convolution
as a tensor train, with efficient tensor decomposition algorithms; and finally, applying a decon-
volution step to recover the estimated density from such tensor-train representation. Numerical
results demonstrate the high accuracy and efficiency of the proposed methods.

1 Introduction
Density estimation is a fundamental problem in data science with wide-ranging applications in
fields such as science (Cranmer (2001); Chen (2017)), engineering (Chaudhuri et al. (2014)), and
economics (Zambom and Dias (2013)). More precisely, given $N$ data points $\{x^{(i)}\}_{i=1}^{N} \sim p^*$, where each $x^{(i)}$ is an independent and identically distributed (i.i.d.) sample from an unknown ground truth distribution $p^* : \mathbb{R}^d \to \mathbb{R}$, the goal is to estimate $p^*$ based on the empirical distribution
$$\hat p(x) = \frac{1}{N}\sum_{i=1}^{N}\delta_{x^{(i)}}(x) \qquad (1)$$

where $\delta_{x^{(i)}}(x)$ is the Dirac delta function centered at $x^{(i)}$. In general, one cannot use the empirical distribution as an estimator of the density $p^*$, due to its lack of absolute continuity and its exponentially large variance.

1.1 Our contributions:


In this paper, we propose a natural approach to density estimation, entirely based on linear al-
gebra. Our goal is to achieve computational complexity that is linear in dimensionality. This is
accomplished by viewing the density estimation process as applying a low-dimensional orthogonal
projector in function space onto the noisy p̂. Due to the dimension reduction, the exponentially

large variance in p̂ is suppressed. Additionally, we leverage tensor-train (TT) algebra to perform
this projection with linear computational complexity in terms of both dimensionality and sample
size. This approach also yields a final density estimator in the form of a tensor train. Furthermore,
we theoretically show that, in addition to achieving linear scaling in computational complexity,
the estimation error also enjoys linear scaling under certain assumptions. Numerical experiments
demonstrate that our estimator outperforms deep learning methods on several tasks, including
Boltzmann sampling and generative modeling for real-world datasets. Here we highlight two representative results in Figure 1. More details on the numerical experiments can be found in Section 9.

[Figure 1, left panel: second-moment error (y-axis, 0.05 to 0.2) versus dimensionality d (x-axis, 10 to 100), comparing "one of our methods" with "neural network"; see the caption below.]

Figure 1: Left: density estimation for data sampled from a Boltzmann distribution. The figure visualizes the second-moment error for different dimensionalities $d$. We compare the result of one of our proposed tensor-train-based methods, TT-SVD-kn (see Section 4), with that of a neural network approach (Masked Autoregressive Flow). Right: images generated by another proposed algorithm, TT-rSVD-t (see Section 6), on the MNIST dataset.

1.2 Main idea:


In this subsection, we outline the main idea of the proposed method. Our method is based on a
crucial assumption about the true density:
 
$$p^*(x) = \mu(x)\left(1 + \sum_{j_1=1}^{d}\sum_{l_1=2}^{n} b^{j_1}_{l_1}\,\phi^{j_1}_{l_1}(x_{j_1}) + \sum_{(j_1,j_2)\in\binom{d}{2}}\;\sum_{l_1,l_2=2}^{n} b^{j_1 j_2}_{l_1 l_2}\,\phi^{j_1}_{l_1}(x_{j_1})\,\phi^{j_2}_{l_2}(x_{j_2})\right), \qquad (2)$$

where $\mu(x) = \mu_1(x_1)\cdots\mu_d(x_d)$ is a mean-field density, i.e., a separable density. $\{\phi^j_l\}_{l=1}^{n}$ are orthogonal basis functions with respect to $\mu_j$ and $\phi^j_1 \equiv 1$, i.e.,
$$\int \phi^j_{l_1}(x_j)\,\phi^j_{l_2}(x_j)\,\mu_j(dx_j) = \delta_{l_1,l_2}, \qquad (3)$$
and $b^{j_1}_{l_1}$, $b^{j_1 j_2}_{l_1 l_2}$ are the corresponding coefficients. The indices $(j_1,j_2)\in\binom{d}{2}$ denote all unordered pairs of distinct dimensions. In other words, we view $p^*$ as a perturbative expansion around a mean-field

density µ, where each term in the perturbation series involves only a few variables (up to two
variables in (2)). Although it is straightforward to generalize (2) to an expansion that includes
higher-order interactions involving more than two variables with decaying coefficients, for clarity of
presentation, we use (2) as our default model. Such an expansion is common in perturbation and
cluster expansion theory in statistical mechanics.
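To make the orthogonality condition (3) concrete, the following sketch builds shifted Legendre polynomials and numerically verifies that their Gram matrix is the identity; the uniform mean field $\mu_j \sim \mathrm{Unif}([0,1])$ (used later in Lemma 1) is assumed here purely for illustration.

```python
import numpy as np
from numpy.polynomial import legendre

def phi(l, x):
    # Shifted Legendre basis on [0, 1], orthonormal w.r.t. mu_j = Unif([0, 1]); phi_1 is the constant basis.
    coeffs = np.zeros(l)
    coeffs[l - 1] = 1.0
    return np.sqrt(2 * l - 1) * legendre.legval(2 * x - 1, coeffs)

# Gauss-Legendre quadrature on [-1, 1], mapped to [0, 1]
nodes, weights = legendre.leggauss(50)
x, w = (nodes + 1) / 2, weights / 2

n = 5
gram = np.array([[np.sum(w * phi(l1, x) * phi(l2, x)) for l2 in range(1, n + 1)]
                 for l1 in range(1, n + 1)])
print(np.allclose(gram, np.eye(n), atol=1e-10))   # True: condition (3) holds
```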
Based on this setting, we discuss several estimators with increasing levels of sophistication. The
first approach to estimate p∗ is based on a simple and natural subspace projection:

$$\tilde p = \mu \odot \big(\Phi\,\mathrm{diag}(w_K)\,\Phi^T \hat p\big). \qquad (4)$$

For notational simplicity, we treat all functions and operators above as discrete matrices and vectors: we regard the concatenation of basis functions
$$\Phi := [\Phi_l(x)]_{x,l}$$
as a matrix with rows indexed by $x$ and columns indexed by $l$. The operation $\Phi^T\hat p$ corresponds to convolution between the basis functions and $\hat p$, i.e., summing or integrating out the variable $x$. The vector $w_K$, indexed by $l$, contains the weights associated with the basis functions and depends on a chosen positive integer $K$. $\mathrm{diag}(\cdot)$ is the operator producing a diagonal matrix with the elements of the input vector on the main diagonal. The mean-field $\mu$ is also treated as a vector, indexed by $x$, and $\odot$ denotes the Hadamard (entrywise) product. Intuitively, the operator $\mu \odot (\Phi\,\mathrm{diag}(w_K)\Phi^T\,\cdot)$ resembles a projector that projects $\hat p$ onto a function space that only involves a small number of variables determined by the chosen integer $K$. More precisely, let the complete (up to a certain numerical precision) set of $k$-variable basis functions be defined as:
$$\mathcal{B}_{k,d} = \left\{\phi^{j_1}_{l_1}(x_{j_1})\cdots\phi^{j_k}_{l_k}(x_{j_k}) \;\middle|\; 1 \leq j_1 < \cdots < j_k \leq d \text{ and } 2 \leq l_1,\cdots,l_k \leq n\right\}, \qquad (5)$$
which we refer to as the $k$-cluster basis throughout the paper. In particular, $\mathcal{B}_{0,d} = \{1\}$ only contains the constant basis. Furthermore, we say a function is $k$-cluster if it can be expanded in the cluster bases of order up to $k$. For example, the perturbed part of $p^*$ in (2) is 2-cluster. Based on these definitions, $\Phi$ can be formulated as the concatenation of the $k$-cluster bases across different $k$:
$$\Phi(x) := [\mathcal{B}_{k,d}(x)]_{k=0}^{d}. \qquad (6)$$

Furthermore, we define the weight vector wK as:


(
d

1 j∈ K ,
wK (j) := (7)
0 otherwise,

which is an indicator vector that selects the 0- to K-cluster basis. With these definitions, one can
verify that if p∗ takes the form of (2), and if the univariate basis functions {ϕjl }nl=1 satisfy the
orthogonality condition (3), then we can recover

p∗ = µ ⊙ Φdiag(wK )ΦT p∗

(8)

as long as $K \geq 2$. This fact motivates the proposed density estimator in (4). Moreover, unlike $\hat p$, whose variance scales as $O(n^d)$, Lemma 1 shows that the variance of $\tilde p$ in (4) is significantly reduced, scaling only polynomially as $O(d^K)$, due to the projection onto a lower-dimensional function space. This substantial reduction in variance is a key advantage of our method.
While the estimator in (4) is intuitive, it is possible to further reduce the variance by impos-
ing additional assumptions on p∗ . Specifically, we assume that the low-order cluster coefficients
diag(wK )ΦT p∗ can be represented as a low-rank tensor train. The details of such a representation
are provided in Section 1.5. Under this assumption, our estimation procedure can be modified as:

p̃ = µ ⊙ Φ TT-compress diag(wK )ΦT p̂



, (9)

where we apply a tensor-train compression algorithm, denoted TT-compress, to the tensor $\mathrm{diag}(w_K)\Phi^T\hat p$. The singular value thresholding involved in the compression procedure can further suppress noise in $\mathrm{diag}(w_K)\Phi^T\hat p$, leading to a more accurate density estimator.
While the estimator in (9) is expected to achieve a lower variance, a major drawback is that the projection $\mathrm{diag}(w_K)\Phi^T\hat p$ onto the cluster bases up to order $K$ has a computational complexity of $O(d^K n^K)$, corresponding to the number of cluster basis functions. This scaling quickly becomes prohibitive as $d$ and $K$ increase. To address this computational bottleneck, we propose a final estimator, which can be viewed as a more efficient variant of (9):

p̃ = µ ⊙ diag(w̃α )−1 Φ TT-compress diag(w̃α )ΦT p̂



, (10)
d
where w̃α ∈ Rn represents a soft down-weighting of the cluster basis. This approach differs from
using wK , which applies a hard threshold that entirely discards cluster bases involving more than
K variables. A key advantage of using such w̃α is that it can be designed as a rank-1 tensor,
allowing the sequence of operations in (10) to be executed with a computational complexity of only
$O(dn)$. This drastically reduces the cost compared to the $O(d^K n^K)$ complexity in (9). Furthermore,
the final density estimator p̃ naturally takes the form of a tensor train, allowing essential tasks
such as moment computation, sample generation, and pointwise evaluation of the density to scale
linearly in d. In Section 2.3, we present a simple example demonstrating that the estimator in (10)
yields accurate results comparable to those of (9), thereby illustrating their similar compression
capabilities. The study of (10) will be the focus of this paper.

1.3 Related work


In this subsection, we survey several lines of works that are most closely related to the proposed
method.
Density estimation literature. Classical density estimation methods such as histograms
(Scott (1979, 1985)), kernel density estimators (Parzen (1962); Davis et al. (2011)) and nearest
neighbor density estimates (Mack and Rosenblatt (1979)) are known for their statistical stability
and numerical robustness in low-dimensional cases. But they often struggle with the curse of di-
mensionality, even in moderately high-dimensional spaces. More recently, deep generative modeling based on neural network architectures has emerged as a powerful tool for generating new samples, particularly in the context of images. This approach includes energy-based models (Kim and Bengio (2016); Du and Mordatch (2019); Song and Ermon (2019)), generative adversarial networks
(GANs) (Goodfellow et al. (2020); Liu et al. (2021)), normalizing flows (Dinh et al. (2014); Rezende
and Mohamed (2015); Dinh et al. (2016)), and autoregressive models (Germain et al. (2015); Uria
et al. (2016); Papamakarios et al. (2017)). While the numerical performance of these modern methods is impressive, statistical guarantees and convergence analysis of the underlying non-convex optimization remain challenging areas of research. Furthermore, for some of these models (e.g., energy-based models), generating i.i.d. samples can still be expensive. Methods like GANs can generate samples efficiently but do not provide a way to evaluate the density at a given point. In short, it is difficult for a single method to enjoy all the desirable properties: being easy to train, generating samples easily, supporting evaluation of density values, and admitting estimation without the curse of dimensionality.
Tensor train from partial sensing of a function. Tensor train, also known as the matrix
product state (MPS) (White (1993); Perez-Garcia et al. (2006)) in the physics literature, has
recently been used to recover a function when only a limited number of observations of this
function are given. Methods for this include TT-cross (Oseledets and Tyrtyshnikov (2010)), DMRG-
cross (Savostyanov and Oseledets (2011)), TT completion (Liu et al. (2012); Song et al. (2019)),
and others (Sun et al. (2020); Kressner et al. (2023); Shi et al. (2023)). It has broad applications in
high-dimensional scientific computing problems (Kressner and Uschmajew (2016); Bachmayr et al. (2016)) and machine learning (Stoudenmire and Schwab (2016); Huggins et al. (2019); Sengupta
et al. (2022)).
Tensor train for density estimation. In a density estimation setting, we are not given
observed entries of a density function, but just an empirical distribution of it. The key challenge
lies in the randomness of the observed object, i.e. the empirical histogram, which contains an
exponentially large variance in terms of the dimensionality. Therefore, reducing variance in a high-
dimensional setting is crucial for designing effective algorithms. Recently, the work of Hur et al. (2023) and subsequent studies (Tang et al. (2022); Peng et al. (2023); Chen and Khoo (2023); Tang and Ying (2023); Khoo et al. (2024)) employ sketching techniques, which originate from the randomized linear algebra literature, in the tensor decomposition step of density estimation problems. However, most existing algorithms in these works do not have a general way to ensure linear computational complexity of the sketching in terms of dimensionality. Another approach to density estimation with tensors is through optimization-based methods. Bradley et al. (2020); Han et al. (2018); Novikov et al. (2021) implement gradient descent methods and optimize tensor-network representations according to some loss functions. However, the performance strongly depends on the
initialization and often suffers from the problem of local minima due to the highly non-convex nature
of the optimization landscape. Furthermore, these iterative procedures require multiple iterations
over N data points, resulting in higher computational complexity compared to the previous tensor
decomposition algorithms.

1.4 Organization
Here we summarize the organization of the paper. In Section 1.5, we introduce the tensor notation and diagram conventions used throughout. In Section 2, we present the core ideas behind our proposed "convolution–deconvolution" framework, highlighting its three key steps. The central component is the TT-compress algorithm discussed previously, which compresses cluster coefficients into a tensor train. Sections 3–6 introduce several such TT-compress algorithms: Section 3 presents a baseline algorithm based on sequential truncated singular value decomposition (SVD), along with its fast implementation for density estimation problems. Section 4 describes an algorithm that utilizes the Nyström approximation to further accelerate the algorithm in Section 3. In Section 5, we study an algorithm that performs the SVD efficiently by sketching with a truncated cluster basis, and then introduce a further acceleration by employing a hierarchical matrix factorization. Section 6 proposes an algorithm utilizing random sketching matrices with low-rank tensor structures. Section 7 presents a pre-processing procedure based on principal component analysis (PCA), enabling our compression algorithms to handle very high-dimensional problems. Theoretical analysis for the algorithms is provided in Section 8. In Section 9, we showcase comprehensive numerical results to demonstrate the high accuracy and efficiency of our proposed algorithms. Finally, in Section 10, we summarize our findings.

1.5 Tensor notations and diagrams


We begin with some definitions regarding tensors. A $d$-dimensional tensor $A \in \mathbb{R}^{n_1\times n_2\times\cdots\times n_d}$ is a collection of entries denoted by $A(i_1,i_2,\cdots,i_d)$ with $1 \leq i_j \leq n_j$ for $1 \leq j \leq d$. Diagrammatically, a tensor can be denoted as a node whose dimensionality is indicated by the number of "legs" of the node. In Figure 2(a), the diagrams for a three-dimensional tensor $A$ and a two-dimensional matrix $B$ are plotted as an example. We use $i_1, i_2, i_3$ to denote the three legs of tensor $A$. We may also use a rectangle to denote a tensor when it has many legs in this paper.

Figure 2: Basic tensor diagrams and operations. (a) Tensor diagrams of 3-dimensional A and 2-dimensional B. (b) Tensor diagram of tensor contraction.

Based on the definition of a tensor, we list several tensor operations that will be used frequently in the paper:

1. "$\circ$" contraction. We use $\circ$ to represent the contraction of tensor cores, $(A \circ B)(i_1, i_2, j_2) = \sum_k A(i_1, i_2, k)B(k, j_2)$. Such an operation is denoted by the diagram in Figure 2(b), which combines the last leg of $A$ and the first leg of $B$. The resulting $C = A \circ B$ is a three-dimensional tensor.

2. Unfolding matrix. Given a $d$-dimensional tensor with entries $A(i_1, i_2, \cdots, i_d)$, if we group the first $j$ indices into a multi-index $i_{1:j}$ and group the rest into another multi-index $i_{j+1:d}$, we obtain the $j$-th unfolding matrix $A^{(j)} \in \mathbb{R}^{(n_1\cdots n_j)\times(n_{j+1}\cdots n_d)}$ (also called the $j$-th unfolding of $A$) satisfying
$$A^{(j)}(i_{1:j}, i_{j+1:d}) = A(i_1, i_2, \cdots, i_d). \qquad (11)$$
In particular, $A^{(0)} \in \mathbb{R}^{1\times n_1\cdots n_d}$ and $A^{(d)} \in \mathbb{R}^{n_1\cdots n_d\times 1}$ are respectively the row and column vectors reshaped from the tensor.
In general, the memory cost of a high-dimensional tensor suffers from the curse of dimensionality. The tensor-train representation is a way to decompose the whole exponentially large $d$-dimensional tensor with an underlying low-rank structure into $d$ components:
$$A = G_1 \circ G_2 \circ \cdots \circ G_d \in \mathbb{R}^{n_1\times\cdots\times n_d} \qquad (12)$$
where $G_1 \in \mathbb{R}^{n_1\times r_1}$, $G_2 \in \mathbb{R}^{r_1\times n_2\times r_2}$, $\cdots$, $G_{d-1} \in \mathbb{R}^{r_{d-2}\times n_{d-1}\times r_{d-1}}$, $G_d \in \mathbb{R}^{r_{d-1}\times n_d}$ are called tensor cores and $\{r_j\}_{j=1}^{d-1}$ are the ranks. The TT representation can be visualized as in the left panel of Figure 3. Throughout the paper, we denote the largest rank by
$$r = \max_j r_j.$$
Then the memory cost for storing $A$ in TT form is $O(dnr^2)$, which has no curse of dimensionality if $r$ is small. In addition, a TT can also be used to represent a continuous function $f(x_1,\cdots,x_d)$ formulated as a tensor-product basis function expansion:
$$f(x_1, x_2, \cdots, x_d) = \sum_{l_1,l_2,\cdots,l_d=1}^{n} A(l_1, l_2, \cdots, l_d)\,\phi_{l_1}(x_1)\,\phi_{l_2}(x_2)\cdots\phi_{l_d}(x_d)$$
where the coefficient $A$ takes the TT format (12). Such a functional tensor-train representation is illustrated in the right panel of Figure 3.


Figure 3: Diagram for tensor-train representation. Left: discrete case. Right: continuous case.

2 Main approach
In this section, we detail the framework of our main approach. In order to realize (10), we carry out the following three steps:
1. Convolution. Project the empirical distribution $\hat p$ to "coefficients" $\hat c$ under a certain basis expansion with $O(dN)$ computational cost:
$$\hat c = \mathrm{diag}(\tilde w_\alpha)\,\Phi^T \hat p = \mathrm{diag}(\tilde w_\alpha)\int \Phi(x)^T\hat p(x)\,dx. \qquad (13)$$

2. Tensor compression. Approximate $\hat c$ as a TT $\tilde c = \text{TT-compress}(\hat c)$. This step can be done by any of the proposed algorithms listed in Table 1. Notably, several of these algorithms achieve $O(dN)$ complexity. These complexity estimates will be further validated through numerical experiments in Section 9.

Algorithm                  Corresponding section   Computational complexity
TT-SVD                     §3                      $O(\min(dn^{d+1},\, dN^2 n))$
TT-SVD-kn                  §4                      $O(dNn)$
TT-SVD-c ($K$-cluster)     §5                      $O(d^{K+1} N n^{K+1})$
TT-SVD-c (hierarchical)    §5.1                    $O(d\log(d)Nn)$
TT-rSVD-t                  §6                      $O(dNn)$

Table 1: Computational complexity of the proposed TT-compress algorithms.

3. Deconvolution. Obtain the desired density estimator from the TT coefficient $\tilde c$ with $O(dN)$ computational cost:
$$\tilde p = \mu \odot \big(\Phi\,\mathrm{diag}(\tilde w_\alpha)^{-1}\,\tilde c\big). \qquad (14)$$
Here $\tilde p$ is also a TT, if $\Phi\,\mathrm{diag}(\tilde w_\alpha)^{-1}$ has an appropriate low-rank tensor structure.

Overall, with proper choices of TT-compress, our approach achieves a computational complexity that scales linearly with both the dimensionality $d$ and the sample size $N$, demonstrating its high efficiency. In the rest of this section, we further elaborate on the convolution (Step 1) and deconvolution (Step 3) steps. We defer the details of tensor compression (Step 2) to Sections 3–6.

2.1 Convolution step


The key part of this step is the choice of $\tilde w_\alpha \in \mathbb{R}^{n^d}$, the soft down-weighting of the cluster basis (6). In this work, we choose the weight of any $k$-cluster basis $\phi^{j_1}_{l_1}(x_{j_1})\cdots\phi^{j_k}_{l_k}(x_{j_k})$ to be
$$\alpha^k, \quad \text{where the parameter } \alpha \in (0,1]. \qquad (15)$$

In this way, the coefficient tensor $\hat c = \mathrm{diag}(\tilde w_\alpha)\Phi^T\hat p$ has entries
$$\hat c(l_1, l_2, \cdots, l_d) = \frac{1}{N}\sum_{i=1}^{N}\alpha^{\|l-1\|_0}\prod_{j=1}^{d}\phi^j_{l_j}\big(x^{(i)}_j\big),$$
where $\|\cdot\|_0$ is the zero norm counting the number of non-zero entries of a vector, and $\mathbf{1}$ is the all-one $d$-dimensional vector. $\|l-1\|_0$ is then the number of non-constant univariate basis functions among $\phi^{j_1}_{l_1}(x_{j_1})\cdots\phi^{j_k}_{l_k}(x_{j_k})$. Under such a choice, $\mathrm{diag}(\tilde w_\alpha)$ can be formulated as a rank-1 tensor operator, allowing fast convolution (and also fast deconvolution later) with only $O(dN)$ computational cost. In fact, given samples $\{x^{(i)}\}_{i=1}^{N}$, $\hat c$ is formulated as
$$\hat c = \frac{1}{N}\sum_{i=1}^{N}\tilde\Phi^{(i)}_1\otimes\tilde\Phi^{(i)}_2\otimes\cdots\otimes\tilde\Phi^{(i)}_d \qquad (16)$$

where $\tilde\Phi^{(i)}_j = [1, \alpha\phi^j_2(x^{(i)}_j), \cdots, \alpha\phi^j_n(x^{(i)}_j)] \in \mathbb{R}^n$. This expression is plotted as a tensor diagram in Figure 4. For future use, we introduce
$$\tilde\Phi_j \in \mathbb{R}^{N\times n} \qquad (17)$$
to denote the matrix whose rows are $\tilde\Phi^{(i)}_j$.
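As a concrete illustration of (16)–(17), the following sketch assembles the per-dimension matrices $\tilde\Phi_j$ from a sample array; the univariate basis `phi` (for instance the Legendre sketch earlier) and the weight `alpha` are assumptions supplied by the user.

```python
import numpy as np

def feature_matrices(X, phi, n, alpha):
    """Build the matrices Φ̃_j of (17): row i is [1, αφ^j_2(x^(i)_j), ..., αφ^j_n(x^(i)_j)].
    The coefficient tensor (16) is the average over i of the rank-1 products of these rows."""
    N, d = X.shape
    Phis = []
    for j in range(d):
        cols = [np.ones(N)] + [alpha * phi(l, X[:, j]) for l in range(2, n + 1)]
        Phis.append(np.stack(cols, axis=1))   # shape (N, n)
    return Phis
```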
The idea of this choice is to assign smaller weights to higher-order cluster bases so as to suppress the variance. The parameter $\alpha$ describes the denoising strength: a smaller $\alpha$ produces stronger denoising and thus smaller variance. In particular, when $\alpha = 1$, the soft-thresholding method (10) reduces to the hard-thresholding method (9) using all cluster bases. We point out that by choosing a sufficiently small $\alpha$, the soft-thresholding strategy has smaller variance compared to the hard-thresholding approach (9). The following lemma considers a toy model to show the variance of the hard-thresholding approach:


Figure 4: Tensor diagram of $\hat c = \mathrm{diag}(\tilde w_\alpha)\Phi^T\hat p$. "$\alpha$" denotes the diagonal matrix $\mathrm{diag}(1,\alpha,\cdots,\alpha)\in\mathbb{R}^{n\times n}$; "$\phi_j$" denotes the "matrix" $[\phi^j_l(x)]_{x,l}$.

Lemma 1. Suppose $\{\phi_l\}_{l=1}^{n}$ are Legendre polynomials and the ground truth density $p^*$ satisfies (8) with the mean-field component $\mu \sim \mathrm{Unif}([0,1]^d)$. Then the following variance estimate holds:
$$\mathbb{E}\,\|\Phi\,\mathrm{diag}(w_K)\Phi^T\hat p - \Phi\,\mathrm{diag}(w_K)\Phi^T p^*\|^2_{L^2} = O\Big(\frac{d^K n^{2K}}{N}\Big). \qquad (18)$$
Proof. Note that the error $\|\mathrm{diag}(w_K)\Phi^T\hat p - \mathrm{diag}(w_K)\Phi^T p^*\|^2_F$ has at most $\binom{d}{K}n^K$ terms, where each term is of the form
$$\left(\int\Big(\prod_{k=1}^{K}\phi^{j_k}_{l_k}(x_{j_k})\Big)\big(\hat p(x)-p^*(x)\big)\,dx\right)^2 \qquad (19)$$
for indices $(j_1,\cdots,j_K)\in\binom{d}{K}$ and $l_1,\cdots,l_K\in\{2,\cdots,n\}$. Since the Legendre polynomials satisfy $\|\phi^{j_k}_{l_k}\|_\infty \leq n$, we have
$$\mathbb{E}\left(\int\Big(\prod_{k=1}^{K}\phi^{j_k}_{l_k}(x_{j_k})\Big)\big(\hat p(x)-p^*(x)\big)\,dx\right)^2 = \frac{1}{N}\int\Big(\prod_{k=1}^{K}\phi^{j_k}_{l_k}(x_{j_k})\Big)^2 p^*(x)\,dx \leq \frac{n^K}{N}\int p^*(x)\,dx = \frac{n^K}{N}. \qquad (20)$$
Therefore, the total variance is bounded by
$$\mathbb{E}\,\|\Phi\,\mathrm{diag}(w_K)\Phi^T\hat p-\Phi\,\mathrm{diag}(w_K)\Phi^T p^*\|^2_{L^2} = \mathbb{E}\,\|\mathrm{diag}(w_K)\Phi^T\hat p-\mathrm{diag}(w_K)\Phi^T p^*\|^2_F \leq \binom{d}{K}n^K\cdot\frac{n^K}{N} = O\Big(\frac{d^K n^{2K}}{N}\Big). \qquad (21)$$

On the other hand, the variance of the soft-thresholding approach under the model assumption in Lemma 1 is estimated by
$$
\begin{aligned}
\mathbb{E}\big[\|\Phi\,\mathrm{diag}(\tilde w_\alpha)\Phi^T\hat p - \Phi\,\mathrm{diag}(\tilde w_\alpha)\Phi^T p^*\|^2_{L^2}\big]
&= \mathbb{E}\big[\|\mathrm{diag}(\tilde w_\alpha)\Phi^T\hat p - \mathrm{diag}(\tilde w_\alpha)\Phi^T p^*\|^2_F\big]\\
&= \sum_{l_1,\cdots,l_d=1}^{n}\alpha^{2\|l-1\|_0}\,\mathbb{E}\left[\left(\int\big(\hat p(x)-p^*(x)\big)\prod_{j=1}^{d}\phi^j_{l_j}(x_j)\,dx\right)^2\right]\\
&= \sum_{l_1,\cdots,l_d=1}^{n}\alpha^{2\|l-1\|_0}\,\frac{1}{N}\int\prod_{j=1}^{d}\big(\phi^j_{l_j}(x_j)\big)^2 p^*(x)\,dx\\
&\leq \frac{1}{N}\sum_{l_1,\cdots,l_d=1}^{n}\alpha^{2\|l-1\|_0}\, n^{\|l-1\|_0}\int p^*(x)\,dx\\
&= \frac{1}{N}\sum_{l_1,\cdots,l_d=1}^{n}\big(\alpha^2 n\big)^{\|l-1\|_0} = \frac{1}{N}\big(1+(n-1)n\alpha^2\big)^d \qquad (22)
\end{aligned}
$$

where the inequality follows from $\|\phi^j_{l_j}\|_\infty \leq n$ for $l_j \geq 2$, and we use the binomial theorem for the last equality. With this estimate, we conclude that if we set $\alpha = 1$, the variance scales at least as $O(n^d)$, indicating the curse of dimensionality. To avoid this, we choose
$$\alpha = \sqrt{\frac{C}{(n-1)nd}}$$
for some constant $C$ without $d$-dependency. Under this choice, (22) can be bounded by
$$\mathbb{E}\big[\|\Phi\,\mathrm{diag}(\tilde w_\alpha)\Phi^T\hat p - \Phi\,\mathrm{diag}(\tilde w_\alpha)\Phi^T p^*\|^2_{L^2}\big] \leq \frac{1}{N}\Big(1+\frac{C}{d}\Big)^d \leq \frac{e^C}{N}. \qquad (23)$$
The resulting error has no curse of dimensionality. However, we also point out that $\alpha$ should not be chosen too small, which might lead to instability in our later operations. This will be further elaborated in Section 2.2.
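A quick numerical check of the bound (23); the values of $n$, $d$, $C$ below are arbitrary choices for illustration only.

```python
import numpy as np

n, d, C = 10, 100, 1.0
alpha = np.sqrt(C / ((n - 1) * n * d))                 # the choice of α above
variance_factor = (1 + (n - 1) * n * alpha**2) ** d    # equals (1 + C/d)^d
print(alpha, variance_factor, np.exp(C))               # ≈ 0.0105, 2.7048 ≤ e ≈ 2.7183
```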

2.2 Deconvolution step


After we have obtained the approximate coefficient $\tilde c$ in TT format,
$$\tilde c = \hat G_1 \circ \hat G_2 \circ \cdots \circ \hat G_d \in \mathbb{R}^{n\times n\times\cdots\times n},$$
from the tensor compression step, the last step obtains the final density estimator in TT format by implementing the deconvolution
$$\tilde p = \mu \odot \big(\Phi\,\mathrm{diag}(\tilde w_\alpha)^{-1}\,\tilde c\big). \qquad (24)$$
The structure of $\Phi\,\mathrm{diag}(\tilde w_\alpha)^{-1}\tilde c$ is illustrated in Figure 5. Since $\tilde c$ is a low-rank tensor train, the complexity of evaluation at a single point $x$ is only $O(dn)$.
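A minimal sketch of this evaluation, assuming all cores are stored with shape $(r_{j-1}, n, r_j)$ (with $r_0 = r_d = 1$) and assuming user-supplied functions `phi` and `mu1d` for the univariate basis and the mean-field marginals:

```python
import numpy as np

def evaluate_density(x, cores, phi, mu1d, alpha):
    """Evaluate p̃(x) = μ(x) · (Φ diag(w̃_α)^{-1} c̃)(x) by sweeping once over the TT cores
    (cf. Figure 5); the cost is linear in d for a fixed TT rank."""
    msg = np.ones(1)                                   # running vector of size r_j
    for j, core in enumerate(cores):                   # core has shape (r_{j-1}, n, r_j)
        n = core.shape[1]
        v = np.array([phi(l, x[j]) for l in range(1, n + 1)], dtype=float)
        v[1:] /= alpha                                 # deconvolution weights [1, 1/α, ..., 1/α]
        msg = np.einsum('a,abc,b->c', msg, core, v)
    mu_val = np.prod([mu1d(x[j]) for j in range(len(cores))])   # separable mean field μ(x)
    return mu_val * msg.item()
```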
Remark 1. In practice, as the dimensionality $d$ grows, some elements in $\mathrm{diag}(\tilde w_\alpha)$ become exponentially small, which can potentially lead to instability when computing the inverse. One solution is to implement the inverse with a thresholding constant: $\tilde p = \mu \odot \Phi(\mathrm{diag}(\tilde w_\alpha) + \lambda I)^{-1}\tilde c$, where $\lambda$ is a small constant introduced to avoid instability issues and $I$ is the identity matrix. In practice, we find that even without this post-processing, the tensor-train estimator does not suffer from such an instability.


Figure 5: Tensor diagram for the final density estimator $\tilde p$. "$\alpha^{-1}$" denotes the diagonal matrix $\mathrm{diag}(1, 1/\alpha, \cdots, 1/\alpha)\in\mathbb{R}^{n\times n}$; "$\phi_j$" denotes the "matrix" $[\phi^j_l(x)]_{x,l}$.

2.3 Motivation: mean-field example


In this subsection, we walk through the main steps of the proposed approach using a mean-field
example of the ground truth density, i.e., p∗ = µ in (2). This toy example is used to illustrate
why the soft-thresholding strategy in (10) effectively projects an empirical density onto the space
spanned by low-order cluster bases, providing a similar effect as (9). The analysis can be extended
to general examples.
For convenience, we first introduce some notation for moments: we define the first-order moment $\hat q_j \in \mathbb{R}^{(n-1)\times 1}$ as
$$\hat q_j(l_j) = \int \phi^j_{l_j+1}(x_j)\,\hat p(x)\,dx, \qquad l_j = 1,\cdots,n-1, \qquad (25)$$
and the second-order moment $\hat q_{j,j'} \in \mathbb{R}^{(n-1)\times(n-1)}$ as
$$\hat q_{j,j'}(l_j, l_{j'}) = \int \phi^j_{l_j+1}(x_j)\,\phi^{j'}_{l_{j'}+1}(x_{j'})\,\hat p(x)\,dx, \qquad l_j, l_{j'} = 1,\cdots,n-1. \qquad (26)$$
Higher-order moments can be defined following the same logic.


We consider compressing the tensor $\mathrm{diag}(\tilde w_\alpha)\Phi^T\hat p$ into a tensor train using one specific approach called CUR approximation (Mahoney and Drineas (2009)). Specifically, a given matrix $A \in \mathbb{R}^{m\times n}$ can be approximated by the following rank-$r$ factorization
$$A \approx C\,U^{+}R,$$
where $C \in \mathbb{R}^{m\times r}$ and $R \in \mathbb{R}^{r\times n}$ respectively consist of $r$ selected columns and rows, $U$ is the intersection (cross) submatrix formed by the selected rows and columns, and $U^{+}$ is the Moore-Penrose generalized inverse of $U$. To ensure a good approximation, the columns and rows are typically chosen to maximize the determinant of $U$. Repeating this procedure for all dimensions, we eventually obtain a tensor-train decomposition. Further details will be discussed in Section 4.
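A minimal sketch of the CUR factorization for a generic matrix, with the row/column indices assumed to be given; in the toy example below they are simply the first row and first column, matching the choice made in (28).

```python
import numpy as np

def cur_approx(A, rows, cols):
    """Cross approximation A ≈ C U^+ R from chosen row/column index lists."""
    C = A[:, cols]                        # selected columns
    R = A[rows, :]                        # selected rows
    U = A[np.ix_(rows, cols)]             # intersection (cross) submatrix
    return C @ np.linalg.pinv(U) @ R      # Moore-Penrose pseudo-inverse of U

# rank-1 cross approximation using the first row and column
A = np.outer([1.0, 0.1, 0.05], [1.0, 0.2, 0.02])            # a rank-1 example matrix
print(np.allclose(cur_approx(A, rows=[0], cols=[0]), A))    # True
```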
Based on the notation above, we first unfold the weighted projection tensor $\mathrm{diag}(\tilde w_\alpha)\Phi^T\hat p$ as
$$\big(\mathrm{diag}(\tilde w_\alpha)\Phi^T\hat p\big)^{(1)} = \begin{bmatrix} 1 & \alpha\,[\hat q_j^{(0)}]_{j=2}^{d} & \alpha^2\,[\hat q_{j,j'}^{(0)}]_{j,j'=2}^{d} & \cdots \\ \alpha\,\hat q_1 & \alpha^2\,[\hat q_{1,j}^{(1)}]_{j=2}^{d} & \alpha^3\,[\hat q_{1,j,j'}^{(1)}]_{j,j'=2}^{d} & \cdots \end{bmatrix} =: B_1 \qquad (27)$$

in terms of the order of $\alpha$, where we use $[\cdot]$ to represent the column-wise concatenation of the enclosed terms. Since the ground truth density $p^*$ is mean-field (rank-1), we choose only $r = 1$ column and row for the CUR approximation of $(\mathrm{diag}(\tilde w_\alpha)\Phi^T\hat p)^{(1)}$. With $\alpha$ chosen sufficiently small, the $(1,1)$-entry is the largest in $B_1$. Therefore, the first row and first column are chosen for the CUR approximation:
$$B_1 \approx \begin{bmatrix} 1 \\ \alpha\hat q_1\end{bmatrix}\underbrace{\begin{bmatrix} 1 & \alpha[\hat q_j^{(0)}]_{j=2}^{d} & \alpha^2[\hat q_{j,j'}^{(0)}]_{j,j'=2}^{d} & \cdots\end{bmatrix}}_{=:B_2} \qquad (28)$$
 
The left vector $\begin{bmatrix}1 \\ \alpha\hat q_1\end{bmatrix}$ is the resulting first TT core (before normalization). Next, we continue to implement the cross approximation on the unfolding of the remaining tensor $B_2$. The first row and first column are again chosen following the same logic:
$$B_2 = \begin{bmatrix} 1 & \alpha[\hat q_j^{(0)}]_{j=3}^{d} & \alpha^2[\hat q_{j,j'}^{(0)}]_{j,j'=3}^{d} & \cdots \\ \alpha\hat q_2 & \alpha^2[\hat q_{2,j}^{(1)}]_{j=3}^{d} & \alpha^3[\hat q_{2,j,j'}^{(1)}]_{j,j'=3}^{d} & \cdots\end{bmatrix} \approx \begin{bmatrix}1 \\ \alpha\hat q_2\end{bmatrix}\underbrace{\begin{bmatrix}1 & \alpha[\hat q_j^{(0)}]_{j=3}^{d} & \alpha^2[\hat q_{j,j'}^{(0)}]_{j,j'=3}^{d} & \cdots\end{bmatrix}}_{=:B_3}. \qquad (29)$$

We repeat this procedure until the tensor is decomposed into the product of $d$ rank-1 cores:
$$\mathrm{diag}(\tilde w_\alpha)\Phi^T\hat p \approx \begin{bmatrix}1\\ \alpha\hat q_1\end{bmatrix}\otimes\begin{bmatrix}1\\ \alpha\hat q_2\end{bmatrix}\otimes\cdots\otimes\begin{bmatrix}1\\ \alpha\hat q_d\end{bmatrix} =: \text{TT-compress}\big(\mathrm{diag}(\tilde w_\alpha)\Phi^T\hat p\big). \qquad (30)$$

Furthermore, we estimate the difference between the representation in (30) and the hard-truncation $\text{TT-compress}\big(\mathrm{diag}(w_K)\Phi^T\hat p\big)$ in (9), which amounts to the sum of all $k$-cluster terms with $k > K$. Since the ground truth density $p^*$ is mean-field, the expectation of the empirical one-marginal is $\mathbb{E}[\hat q_j] = 0$. Besides, the variance satisfies $\mathrm{Var}[\hat q_j] = O(\frac{n-1}{N})$ for each $j$ due to the orthogonality of $\{\phi^j_l\}_{l=1}^{n}$ with respect to $\mu$. Therefore, we have an upper bound for each $k$-cluster term, $\alpha^k\|\hat q_{j_1}\otimes\cdots\otimes\hat q_{j_k}\|_F \leq C\alpha^k\big(\frac{n}{N}\big)^{k/2}$ for some constant $C > 0$. Then, compared to the hard-thresholding $K$-cluster estimator in (9), the difference of the soft-thresholding estimator (10) is upper bounded by
$$\big\|\text{TT-compress}\big(\mathrm{diag}(w_K)\Phi^T\hat p\big) - \text{TT-compress}\big(\mathrm{diag}(\tilde w_\alpha)\Phi^T\hat p\big)\big\|_F \leq C\sum_{k=K+1}^{d}\binom{d}{k}\alpha^k\Big(\frac{n}{N}\Big)^{k/2} \leq C\sum_{k=K+1}^{d}\Big(\frac{\alpha de}{K}\sqrt{\frac{n}{N}}\Big)^{k} \leq Cd\Big(\frac{\alpha de}{K}\sqrt{\frac{n}{N}}\Big)^{K+1} \qquad (31)$$
when $\frac{\alpha de}{K}\sqrt{\frac{n}{N}} < 1$ for sufficiently large $N$. The difference between the two estimators is thus polynomially small with respect to the sample size $N$. This provides a clear example demonstrating the reliability of the soft-thresholding approach (10).

3 Tensor-train SVD (TT-SVD)


Starting from this section, we focus on Step 2 of the main approach: tensor compression. To this end, we first propose to use the tensor-train SVD (TT-SVD) algorithm. The TT-SVD algorithm was originally proposed in Oseledets (2011); it compresses a high-dimensional tensor into a low-rank tensor train by performing truncated SVDs sequentially. For our problem, we use TT-SVD to find the approximation $\tilde c$ in TT format of $\hat c$ depicted in Figure 4, which can be obtained following the procedure in Algorithm 1.

Algorithm 1 TT-SVD
INPUT: Tensor $\hat c \in \mathbb{R}^{n\times n\times\cdots\times n}$. Target ranks $\{r_1,\cdots,r_{d-1}\}$.
1: for $j \in \{1,\cdots,d-1\}$ do
2:   Perform SVD on $B_j \in \mathbb{R}^{nr_{j-1}\times n^{d-j}}$, which is formed by contracting the previously obtained cores $\hat G_1,\cdots,\hat G_{j-1}$ with $\hat c$ over the indices $l_1,\cdots,l_{j-1}$. In particular, $B_1 = \hat c^{(1)}$, the first unfolding matrix of $\hat c$ defined in (11). With the SVD $B_j = U_j\Sigma_j V_j^T$, we obtain the $j$-th core $\hat G_j \in \mathbb{R}^{r_{j-1}\times n\times r_j}$, which is reshaped from the first $r_j$ principal left singular vectors in $U_j$.
3: end for
4: Obtain the last core $\hat G_d = B_d \in \mathbb{R}^{r_{d-1}\times n}$.
OUTPUT: Tensor train $\tilde c = \hat G_1 \circ \hat G_2 \circ \cdots \circ \hat G_d \in \mathbb{R}^{n\times n\times\cdots\times n}$.

Below, we briefly outline the idea behind this algorithm. When $j = 1$, we obtain the first core $\hat G_1$ by performing SVD on $\hat c^{(1)}$ (which is $B_1$ in Algorithm 1), and we let $\hat G_1 \in \mathbb{R}^{n\times r_1}$ be the first $r_1$ principal left singular vectors. The rationale of the first step is to approximate $\hat c^{(1)}$ as $\hat c^{(1)} \approx \hat G_1(\hat G_1^T\hat c^{(1)})$. In the $j = 2$ step, we obtain $\hat G_2$, which gives a further approximation to $\hat G_1(\hat G_1^T\hat c^{(1)})$. We do this by factorizing $\hat G_1^T\hat c^{(1)} \in \mathbb{R}^{r_1\times n^{d-1}}$, applying SVD to $B_2$, which is the reshaping of $\hat G_1^T\hat c^{(1)}$. This completes the $j = 2$ step. We then repeat the above procedure recursively until $j = d-1$. After implementing the sequential SVD $d-1$ times, the remaining $B_d \in \mathbb{R}^{r_{d-1}\times n}$ is a matrix, and we take it as the final core in the tensor-train representation since no more decomposition is required.
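The following sketch mirrors this sequential procedure on a fully stored tensor; it is only feasible for small $d$ and $n$ and is meant purely to illustrate Algorithm 1, not the fast implementations used later.

```python
import numpy as np

def tt_svd(c_hat, ranks):
    """Sequential truncated SVD compressing a full tensor into TT cores (a sketch of Algorithm 1).
    c_hat: array of shape (n,)*d; ranks: list [r_1, ..., r_{d-1}]."""
    d, n = c_hat.ndim, c_hat.shape[0]
    cores, r_prev = [], 1
    B = c_hat.reshape(n, -1)                      # B_1 = first unfolding ĉ^(1)
    for j in range(d - 1):
        U, _, _ = np.linalg.svd(B, full_matrices=False)
        r = ranks[j]
        G = U[:, :r]                              # first r principal left singular vectors
        cores.append(G.reshape(r_prev, n, r))     # j-th TT core Ĝ_j
        B = (G.T @ B).reshape(r * n, -1)          # carry Ĝ_j^T B_j to the next step
        r_prev = r
    cores.append(B.reshape(r_prev, n, 1))         # last core Ĝ_d, reshaped from B_d
    return cores
```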
Naively, the computational cost of the SVDs in Algorithm 1 scales exponentially in the dimensionality, as $O(dn^{d+1})$. However, for the density estimation problem, we can take advantage of the structure of $\hat c$ depicted in Figure 4 to achieve a fast implementation of Algorithm 1. We elaborate on this fast implementation in the next subsection.

3.1 Fast implementation of TT-SVD


In this subsection, we present a fast implementation of Algorithm 1 with a complexity of $O(dN^2nr)$, which scales linearly with $d$ and is therefore well-suited for high-dimensional cases. Notably, this efficient implementation introduces no additional approximations and produces the same output as the vanilla implementation in Algorithm 1.
The key to the fast implementation lies in fast calculation of the SVD of $B_j$ in Algorithm 1. Instead of performing SVD on $B_j$ directly to get its left singular vectors as in Line 2 of Algorithm 1, we consider computing the eigenvectors of $B_jB_j^T$, which is a relatively small matrix of size $nr_{j-1}\times nr_{j-1}$. However, one concern is that the cost of forming $B_jB_j^T$ can in principle scale exponentially with dimensionality, since the column size of $B_j$ is $n^{d-j}$, as indicated by the multiple legs between the two nodes in Figure 6. Fortunately, by using the structure of $\hat c$ shown in Figure 4, one can form $B_jB_j^T$ much more efficiently. This is shown in Line 2 of Algorithm 2, which summarizes the TT-SVD procedure implemented in a computationally feasible manner.
Figure 6: Tensor components for forming $B_jB_j^T$.

In Algorithm 2, the main computation lies in forming $D_j$ and $E_j$ in Line 2. A naive way is to compute them independently, leading to $O(dN^2)$ complexity. However, $D_j$ and $E_j$ can be computed efficiently in a recursive way, where computations are reused. More specifically, one can precompute $E_j$ by
$$E_j = E_{j+1} \odot \big(\tilde\Phi_{j+1}\tilde\Phi_{j+1}^T\big), \qquad (32)$$
starting from $E_d = I_{N\times N}$, the $N\times N$ identity matrix. $D_j$ is computed recursively based on the core $\hat G_{j-1}$, starting from $D_1 = \tilde\Phi_1^T$, according to Figure 7. The last core $\hat G_d$ is assigned the matrix $B_d$ in Algorithm 1, which equals the average of $D_d^{(i)}$ over the sample index $i$. Therefore, we compute $\hat G_d$ as the average of $D_d^{(i)}$ in Algorithm 2. In this way, the computational cost of $E_j$, $D_j$ and $B_jB_j^T$ is $O(N^2nr)$, which has no dependency on dimensionality. Afterwards, one can solve for the core by an eigen-decomposition of $B_jB_j^T \in \mathbb{R}^{nr_{j-1}\times nr_{j-1}}$ with a minor cost of $O((nr)^3)$. Overall, the computational cost of this fast implementation is $O(dN^2nr)$.
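A small sketch of the backward recursion for $E_j$, written in the form consistent with (34) (each step multiplies in the Gram matrix of $\tilde\Phi_{j+1}$), so that each $E_j$ costs one $N\times N$ Hadamard product once $E_{j+1}$ is available:

```python
import numpy as np

def build_E(Phi_tilde):
    """Phi_tilde: dict {j: array of shape (N, n)} for j = 1..d.
    Returns E[j] = (Φ̃_{j+1} Φ̃_{j+1}^T) ⊙ ... ⊙ (Φ̃_d Φ̃_d^T), computed backwards to reuse work."""
    d = len(Phi_tilde)
    N = Phi_tilde[1].shape[0]
    E = {d: np.eye(N)}                                   # E_d = I_{N×N}
    for j in range(d - 1, 0, -1):                        # one Hadamard/Gram product per step
        E[j] = E[j + 1] * (Phi_tilde[j + 1] @ Phi_tilde[j + 1].T)
    return E
```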
While this fast implementation significantly reduces the computational complexity from exponential to linear in $d$, we point out that it still requires $O(N^2)$ computations to form each $E_j$, which can make it challenging for a large number of samples even in low-dimensional cases. In Section 4, we will further improve Algorithm 2 using a kernel Nyström approximation, which brings the computational cost down to only $O(dN)$.

Figure 7: Tensor diagram illustrating the recursive calculation of $D_j^{(i)}$, which is the $i$-th column of the matrix $D_j$ in Algorithm 2.
Algorithm 2 Fast implementation of TT-SVD
INPUT: Samples $\{x^{(i)}\}_{i=1}^{N}$. Target ranks $\{r_1,\cdots,r_{d-1}\}$.
1: for $j \in \{1,\cdots,d-1\}$ do
2:   Perform an eigen-decomposition of $A_j := B_jB_j^T \in \mathbb{R}^{nr_{j-1}\times nr_{j-1}}$, which is computed as $A_j = \frac{1}{N^2}D_jE_jD_j^T$, where $D_j \in \mathbb{R}^{nr_{j-1}\times N}$ and $E_j \in \mathbb{R}^{N\times N}$ are the matrices obtained by the tensor contractions of Figure 6 (the first $j-1$ cores and per-sample factors $\tilde\Phi^{(i)}_m$ contracted with the two copies of $\hat c$). In particular, $D_1 = \tilde\Phi_1^T$ as defined in (17). Here $D_j$ and $E_j$ should be computed recursively to reuse calculations. With the eigen-decomposition $A_j = U_j\Lambda_jU_j^T$, we obtain the $j$-th core $\hat G_j \in \mathbb{R}^{r_{j-1}\times n\times r_j}$, which is reshaped from the first $r_j$ principal eigenvectors in $U_j$.
3: end for
4: Obtain the last core:
$$\hat G_d = \frac{1}{N}\sum_{i=1}^{N}D_d^{(i)} \in \mathbb{R}^{r_{d-1}\times n},$$
where $D_d^{(i)}$ is the reshaping of the $i$-th column of $D_d$.
OUTPUT: Tensor train $\tilde c = \hat G_1 \circ \hat G_2 \circ \cdots \circ \hat G_d \in \mathbb{R}^{n\times n\times\cdots\times n}$.

4 TT-SVD with kernel Nyström approximation (TT-SVD-kn)


In this section, we propose a further modification of the fast implementation of TT-SVD developed in Section 3.1. This modification accelerates Algorithm 2 using a kernel Nyström approximation to reduce the complexity from $O(dN^2)$ to $O(dN)$. As we pointed out at the end of Section 3.1, the $O(N^2)$ complexity of Algorithm 2 is the consequence of the exact evaluation of the $N\times N$ matrix $E_j$ in the eigen-decomposition step for $B_jB_j^T$. The modification proposed in this section utilizes the kernel Nyström approximation (Williams and Seeger (2000)) to form $E_j$ efficiently.
To begin with, we briefly introduce the idea of the kernel Nyström method for approximating a symmetric matrix, as shown in Figure 8. Given a symmetric matrix $A \in \mathbb{R}^{m\times m}$, the kernel Nyström method chooses $\tilde r$ columns of $A$ uniformly at random without replacement and approximates $A$ by the following low-rank cross approximation
$$A \approx MW^{+}M^T, \qquad (33)$$
where $M \in \mathbb{R}^{m\times\tilde r}$ consists of the $\tilde r$ chosen columns, $W \in \mathbb{R}^{\tilde r\times\tilde r}$ is the intersection of the chosen $\tilde r$ columns and the corresponding $\tilde r$ rows, and $W^{+}$ is the Moore-Penrose generalized inverse of $W$.
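A minimal sketch of this cross approximation for a generic symmetric matrix:

```python
import numpy as np

def nystrom(A, r_tilde, seed=0):
    """Kernel Nyström approximation A ≈ M W^+ M^T with r̃ uniformly sampled columns."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(A.shape[0], size=r_tilde, replace=False)   # columns, without replacement
    M = A[:, idx]                                               # chosen columns
    W = A[np.ix_(idx, idx)]                                     # intersection block
    return M @ np.linalg.pinv(W) @ M.T                          # Moore-Penrose pseudo-inverse of W
```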
Figure 8: Cross approximation of a symmetric matrix.

Such an approximation can be applied to the symmetric matrix $E_j$ in Algorithm 2, resulting in the more efficient TT-SVD-kn algorithm summarized in Algorithm 3. By the recurrence relation (32), $E_j$ is formulated as
$$E_j = \big(\tilde\Phi_d\tilde\Phi_d^T\big)\odot\big(\tilde\Phi_{d-1}\tilde\Phi_{d-1}^T\big)\odot\cdots\odot\big(\tilde\Phi_{j+1}\tilde\Phi_{j+1}^T\big), \qquad (34)$$
which is depicted by the first equation in Figure 9. To apply the kernel Nyström method, we randomly sample $\tilde r$ rows of $E_j$ with index set $I \subset \{1,2,\cdots,N\}$, and we define $\tilde\Phi_m^{(I)} \in \mathbb{R}^{\tilde r\times n}$, denoted by the blocks in each $\tilde\Phi_m$ in Figure 9, as the submatrices formed by the sampled rows.
Figure 9: Cross approximation of $E_j$, the most expensive component in Algorithm 2.

At this point, we perform the cross approximation for $E_j$:
$$E_j \approx M_jW_j^{+}M_j^T,$$
where $M_j \in \mathbb{R}^{N\times\tilde r}$ is the submatrix formed by the chosen columns of $E_j$ and $W_j \in \mathbb{R}^{\tilde r\times\tilde r}$ is the corresponding intersection. Using the expression (34), $M_j$ and $W_j$ are respectively formulated as the second and third equations in Figure 9. The structures of $M_j$ and $W_j$ allow us to compute them in a recursive manner as in Line 2 of Algorithm 3: supposing we have already obtained $\tilde\Phi_{j+2:d}$ and $\tilde\Phi_{j+2:d}^{(I)}$, $M_j$ and $W_j$ can be evaluated with complexity $O(Nn)$ (direct computation based on their definitions in Figure 9 would cost $O(dNn)$). Consequently, $B_jB_j^T$ in the previous Algorithm 2 can now be approximated by
$$B_jB_j^T = \frac{1}{N^2}D_jE_jD_j^T \approx \frac{1}{N^2}D_jM_jW_j^{+}M_j^TD_j^T,$$
as in Line 4 of Algorithm 3. The computational cost of the above matrix multiplications is $O(Nnr\max(nr,\tilde r))$, which is significantly smaller than the original $O(N^2nr)$ complexity in Algorithm 2. Considering the calculations for all $d$ tensor cores, the total computational cost is $O(dNnr\max(nr,\tilde r))$. Here we remark that we employ the same index set $I$ for computing all $E_j$. This practice is the key to reusing calculations in the recursive implementation so that we can achieve a complexity that is only linear in the sample size.

Algorithm 3 TT-SVD-kn
INPUT: Samples $\{x^{(i)}\}_{i=1}^{N}$. Target ranks $\{r_1,\cdots,r_{d-1}\}$. Number of sampled columns $\tilde r$.
1: Sample $\tilde r$ integers from $\{1,\cdots,N\}$ and let $I$ be the set of sampled integers.
2: Pre-compute $\{W_j\}_{j=1}^{d-1}$ and $\{M_j\}_{j=1}^{d-1}$ based on the sampled indices $I$, recursively, following Figure 9.
3: for $j \in \{1,\cdots,d-1\}$ do
4:   Perform an eigen-decomposition of $B_jB_j^T \in \mathbb{R}^{nr_{j-1}\times nr_{j-1}}$, which is approximated by
$$B_jB_j^T \approx A_j := \frac{1}{N^2}D_jM_jW_j^{+}M_j^TD_j^T,$$
where $D_j$ is computed recursively as in Figure 7. With the eigen-decomposition $A_j = U_j\Lambda_jU_j^T$, we obtain the $j$-th core $\hat G_j \in \mathbb{R}^{r_{j-1}\times n\times r_j}$, which is reshaped from the first $r_j$ principal eigenvectors in $U_j$.
5: end for
6: Obtain the last core:
$$\hat G_d = \frac{1}{N}\sum_{i=1}^{N}D_d^{(i)} \in \mathbb{R}^{r_{d-1}\times n},$$
where $D_d^{(i)}$ is the $i$-th column of $D_d$ (reshaped).
OUTPUT: Tensor train $\tilde c = \hat G_1 \circ \hat G_2 \circ \cdots \circ \hat G_d \in \mathbb{R}^{n\times n\times\cdots\times n}$.

5 TT-SVD with truncated cluster basis (TT-SVD-c)


In this section, we introduce the TT-SVD-c algorithm (Algorithm 4), a modified TT-SVD algorithm based on truncated cluster basis sketching. The central idea is to utilize a sketching technique to approximate the full SVDs of $B_j \in \mathbb{R}^{nr_{j-1}\times n^{d-j}}$ in Line 2 of Algorithm 1. More specifically, at each step, instead of computing the left singular vectors $U_j$ from $B_j$ as in Algorithm 1, we compute the left singular vectors from a sketched $B_j$:
$$B_j' = B_jS_j, \qquad (35)$$
where $S_j$ is an appropriately chosen sketching matrix of size $n^{d-j}\times\tilde r_j$, with sketching size $\tilde r_j$ much smaller than $n^{d-j}$ (Line 2 of Algorithm 4). In this way, we reduce the exponential complexity of the full SVD of $B_j$ to the $O(\max(nr\tilde r_j^2, n^2r^2\tilde r_j))$ complexity of the SVD of $B_j'$.

Algorithm 4 TT-SVD-c
INPUT: Samples $\{x^{(i)}\}_{i=1}^{N}$. Target ranks $\{r_1,\cdots,r_{d-1}\}$. Cluster order $K$.
1: for $j \in \{1,\cdots,d-1\}$ do
2:   Perform SVD on $B_j' = B_jS_j \in \mathbb{R}^{nr_{j-1}\times\tilde r_j}$, where $S_j \in \mathbb{R}^{n^{d-j}\times\tilde r_j}$ is the sketching matrix depending on the cluster order $K$. The contraction of the right-hand side is done efficiently as shown in Figure 10. With the SVD $B_j' = U_j\Sigma_jV_j^T$, we obtain the $j$-th core $\hat G_j \in \mathbb{R}^{r_{j-1}\times n\times r_j}$, which is reshaped from the first $r_j$ principal left singular vectors in $U_j$.
3: end for
4: Obtain the last core $\hat G_d = B_d' \in \mathbb{R}^{r_{d-1}\times n}$.
OUTPUT: Tensor train $\tilde c = \hat G_1 \circ \hat G_2 \circ \cdots \circ \hat G_d \in \mathbb{R}^{n\times n\times\cdots\times n}$.

The key to this approach is to choose a good sketching matrix $S_j$ such that $B_j'$ retains the same range, and thus the same left singular space, as $B_j$. Here, we follow a specific way of constructing the sketching matrix $S_j$ (Chen and Khoo (2023); Peng et al. (2023)), where we only include the low-order cluster bases. We first introduce the set of $k$-cluster indices:
$$\mathcal{J}_{k,d} = \big\{(l_1,\cdots,l_d) \;\big|\; l_1,\cdots,l_d\in\{1,\cdots,n\} \text{ and } \|l-1\|_0 = k\big\},$$
which are the basis indices of the cluster bases in $\mathcal{B}_{k,d}$ defined in (5), since $\phi^j_1 \equiv 1$.
Our choice of sketching matrix is based on the low-order cluster indices. To sketch $B_j$, we construct $S_j \in \mathbb{R}^{n^{d-j}\times\tilde r_j}$ as follows: each column of $S_j$ is defined based on a multivariate index $l \in \bigcup_{k=0}^{K}\mathcal{J}_{k,d-j}$, where $K$ is the chosen cluster order, as
$$\mathbf{1}_{l_1}\otimes\mathbf{1}_{l_2}\otimes\cdots\otimes\mathbf{1}_{l_{d-j}},$$
where $\mathbf{1}_l$ is an $n\times 1$ column vector with value 1 in the $l$-th component and 0 elsewhere. Under this construction, $S_j$ is a sparse matrix with column size
$$\tilde r_j = \sum_{k=0}^{K}|\mathcal{J}_{k,d-j}| = O(d^Kn^K).$$
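For reference, the column size $\tilde r_j$ can be evaluated explicitly, since $|\mathcal{J}_{k,m}| = \binom{m}{k}(n-1)^k$ ($k$ of the $m$ remaining variables carry a non-constant index in $\{2,\cdots,n\}$); the numerical values below are arbitrary illustrative choices.

```python
from math import comb

def sketch_size(d, j, n, K):
    """r̃_j = sum_{k=0}^K |J_{k,d-j}| with |J_{k,m}| = C(m,k) (n-1)^k."""
    m = d - j
    return sum(comb(m, k) * (n - 1) ** k for k in range(K + 1))

print(sketch_size(d=50, j=1, n=5, K=2))   # 19013 columns, i.e. O(d^K n^K)
```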

At this point, the only concern is that forming $B_j' = B_jS_j$ by direct matrix multiplication has exponential complexity. Similar to Algorithm 2, we utilize the structure of $\hat c$ in (16):
$$B_j' = B_jS_j = \frac{1}{N}D_j(E_j)^T = \frac{1}{N}\sum_{i=1}^{N}D_j^{(i)}\big(E_j^{(i)}\big)^T \in \mathbb{R}^{nr_{j-1}\times\tilde r_j}, \qquad (36)$$
where $D_j$, $E_j$ are defined in Figure 10(c) and $D_j^{(i)}$, $E_j^{(i)}$ represent the $i$-th columns of the matrices $D_j$, $E_j$, respectively. According to the structure of $D_j$, $E_j$, their computation can be done in a recursive way following the diagrams in Figure 10(a) and (b).

Figure 10: Tensor diagrams illustrating the recursive computation of $B_j'$ in Line 2 of Algorithm 4. (a) Recursive calculation of $D_j^{(i)}$, which is the $i$-th column of the matrix $D_j$; (b) recursive calculation of $E_j^{(i)}$, which is the $i$-th column of the matrix $E_j$; (c) calculation of $B_j'$ based on $D_j^{(i)}$ and $E_j^{(i)}$.

In terms of the computational complexity of Algorithm 4, the cost for $\{E_j\}_{j=1}^{d-1}$ is $O(d^{K+1}Nn^K)$ for the $K$-cluster truncation. Besides, the cost for $\{D_j\}_{j=1}^{d-1}$ follows the analysis in Figure 7 and Algorithm 2, which is $O(dNnr^2)$. Furthermore, the matrix multiplication in $B_j'$ and the subsequent SVD step require $O(d^{K+1}Nn^{K+1})$. To summarize, the dominant term in the computational complexity is $O(d^{K+1}Nn^{K+1})$. The reduction from the standard TT-SVD algorithm's $O(\min(dn^{d+1}, dN^2n))$ to $O(d^{K+1}Nn^{K+1})$ is significant, especially for a large sample size $N$ and low order $K$. In the following subsection, we further propose a modification of TT-SVD-c which constructs the sketching matrix $S_j$ based on a hierarchical matrix decomposition. This acceleration achieves a lower computational complexity of $O(d\log(d))$.

5.1 Acceleration of TT-SVD-c via hierarchical matrix decomposition


In this subsection, we propose a modification that further accelerates the TT-SVD-c method with cluster order $K = 1$. The key insight behind this acceleration lies in the assumption that the ground truth density $p^*$ has high correlations between nearby variables (such as $x_1$ and $x_2$) and low correlations between distant variables (such as $x_1$ and $x_d$), which is satisfied in many real cases. In Algorithm 4, the sketching matrix $S_j$ considers all correlations equally without taking distance into account, which includes redundant correlations with distant variables and leads to a large column size of $S_j$. In fact, even for cluster order $K = 1$, the column size is $O(d)$, which results in an overall computational cost of $O(d^2)$. This motivates us to further sketch $S_j$, which is done via a hierarchical matrix decomposition (Hackbusch (1999); Hackbusch and Khoromskij (2000); Bebendorf (2008)) of the covariance matrix $M$, whose $(j_1,j_2)$-block is formulated as
of covariance matrix M , whose (j1 , j2 )-block is formulated as

Mj1 ,j2 = Ep̂ [ϕjl11 (xj1 )ϕjl22 (xj2 )] l1 ,l2 ∈ Rn×n


 
(37)

for all 1 ⩽ j1 , j2 ⩽ d.

Figure 11: Hierarchical matrix decomposition of the covariance matrix for $d = 8$. The figure depicts the three block matrices $M_{1:4,5:8}$, $M_{1:2,3:4}$, $M_{1,2}$, whose top $\tilde r$ right singular vectors are computed (visualized as a dark gray matrix within each block).

To illustrate the idea, we consider an example with $d = 8$. The covariance matrix $M$ is depicted in Figure 11, where we use $M_{j_1:j_2,\,j_3:j_4}$ to represent the submatrix formed by the blocks with row indices from $j_1$ to $j_2$ and column indices from $j_3$ to $j_4$. To implement the hierarchical matrix decomposition, we perform SVD on the upper-triangular off-diagonal blocks $M_{1:4,5:8}$, $M_{1:2,3:4}$, $M_{1,2}$ and take the top $\tilde r$ right singular vectors $V_1'$, $V_2'$, $V_3'$, respectively. Here the sketching size $\tilde r$ (different from that in Algorithm 4) is chosen as a constant. $V_1'$, $V_2'$, $V_3'$ respectively capture correlations with the variables $x_{5:8}$, $x_{3:4}$ and $x_2$.
Once the singular vectors $V_1'$, $V_2'$, $V_3'$ are obtained, we can use them to construct sketching matrices with smaller column sizes. When computing the first core $\hat G_1$, we form $Q_1 = \mathrm{diag}[V_1', V_2', V_3'] \in \mathbb{R}^{7n\times 3\tilde r}$, which captures the weighted correlation between $x_1$ and $x_{2:8}$, and we use $S_1$ and $Q_1$ to sketch $B_1$ as $B_1' = B_1S_1Q_1 \in \mathbb{R}^{n\times 3\tilde r}$. Once this is done, we perform SVD on $B_1'$ to obtain $\hat G_1$. For the second core, we form $Q_2 = \mathrm{diag}[V_1', V_2'] \in \mathbb{R}^{6n\times 2\tilde r}$ to capture the weighted correlation between the variable $x_2$ and the variables $x_{3:8}$. We then sketch $B_2' = B_2S_2Q_2$ and perform SVD on $B_2'$ to obtain $\hat G_2$. Note that we can reuse $V_1'$, $V_2'$ in the second step since both were already computed in the first step. By repeating this procedure and generating new sketches $B_j' = B_jS_jQ_j$ for the remaining $j$, this hierarchical approach enables the efficient computation of all cores $\{\hat G_j\}_{j=1}^{d}$.
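A sketch of the two ingredients used above, the empirical covariance blocks (37) and the top right singular vectors of an off-diagonal block such as $M_{1:4,5:8}$, assuming per-dimension feature matrices of shape $(N, n)$ evaluated at the samples:

```python
import numpy as np

def covariance_block(Phi_j1, Phi_j2):
    """Empirical block M_{j1,j2}(l1,l2) = E_p̂[φ^{j1}_{l1}(x_{j1}) φ^{j2}_{l2}(x_{j2})]."""
    return Phi_j1.T @ Phi_j2 / Phi_j1.shape[0]          # (n, n)

def block_right_singvecs(Phi_row_dims, Phi_col_dims, r_tilde):
    """Top r̃ right singular vectors of an off-diagonal block such as M_{1:4,5:8}."""
    blocks = [[covariance_block(Pa, Pb) for Pb in Phi_col_dims] for Pa in Phi_row_dims]
    M_block = np.block(blocks)
    _, _, Vt = np.linalg.svd(M_block, full_matrices=False)
    return Vt[:r_tilde].T                                # columns give V'_1, V'_2, ...
```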

In terms of computational complexity, there are two major sources of computation. The first is the SVD of the off-diagonal block matrices ($M_{1:4,5:8}$, $M_{1:2,3:4}$, $M_{1,2}$ for $d = 8$). For each block, we first apply the kernel Nyström approximation (33) with $\tilde r$ sampled rows and columns, followed by an SVD. The cost of this part is $O(d\log(d)Nn\tilde r)$. The second major computation is forming $B_j' = B_jS_jQ_j$. The number of columns in the weight matrix $Q_j$ is at most $\log_2(d)\tilde r$. Since $S_j$ is a sparse matrix, the computation of $B_j' = B_jS_jQ_j$ for all $j$ needs $O(d\log(d)Nn\tilde r)$ computational cost upon leveraging a strategy similar to that used for computing (36). The computational costs discussed above are the dominant costs in Algorithm 4. By replacing $S_j$ with $S_jQ_j$ in Algorithm 4 and following the operations in Figure 10, the method achieves a reduced computational complexity of $O(d\log(d)Nn)$ compared to the original complexity $O(d^2Nn^2)$ of TT-SVD-c with cluster order $K = 1$.

6 TT-SVD with random tensorized sketching (TT-rSVD-t)


In this section, we propose another modification of the TT-SVD algorithm via sketching. This approach follows the randomized sketching technique (Huber et al. (2017)), which originated from the randomized linear algebra literature (Halko et al. (2011)), to accelerate the SVD. The key idea is to choose the sketching matrix $S_j \in \mathbb{R}^{n^{d-j}\times\tilde r}$ as a random matrix, where $\tilde r$ is a given sketching size, usually chosen to be reasonably larger than the target rank $r$ of the output tensor-train estimator. Afterwards, similarly to TT-SVD-c, we perform SVD on $B_j' = B_jS_j$. In general, the matrix multiplication $B_jS_j$ has exponentially large computational cost. Our solution to this curse of dimensionality is to form $S_j$ from a random tensor train, leading to the TT-rSVD-t algorithm summarized in Algorithm 5.

Algorithm 5 TT-rSVD-t
INPUT: Samples $\{x^{(i)}\}_{i=1}^{N}$. Target ranks $\{r_1,\cdots,r_{d-1}\}$. Sketching size $\tilde r$.
1: Generate a random tensor train with cores $\{H_j\}_{j=1}^{d}$, where $H_1 \in \mathbb{R}^{n\times\tilde r}$, $H_2,\cdots,H_{d-1} \in \mathbb{R}^{\tilde r\times n\times\tilde r}$, $H_d \in \mathbb{R}^{\tilde r\times n}$.
2: for $j \in \{1,\cdots,d-1\}$ do
3:   Perform SVD on $B_j' = B_jS_j \in \mathbb{R}^{nr_{j-1}\times\tilde r}$, where $S_j$ is the reshaping of $H_{j+1}\circ H_{j+2}\circ\cdots\circ H_d$. The contraction of the right-hand side is done efficiently as shown in Figure 12. With the SVD $B_j' = U_j\Sigma_jV_j^T$, we obtain the $j$-th core $\hat G_j \in \mathbb{R}^{r_{j-1}\times n\times r_j}$, which is reshaped from the first $r_j$ principal left singular vectors in $U_j$.
4: end for
5: Obtain the last core $\hat G_d = B_d' \in \mathbb{R}^{r_{d-1}\times n}$.
OUTPUT: Tensor train $\tilde c = \hat G_1 \circ \hat G_2 \circ \cdots \circ \hat G_d \in \mathbb{R}^{n\times n\times\cdots\times n}$.

To implement Algorithm 5, we first generate a $d$-dimensional random tensor train with fixed rank $\tilde r$ whose cores are $H_1, H_2,\cdots,H_d$ (Line 1). In practice, we can set each core as an i.i.d. random Gaussian matrix/tensor or a random uniform matrix/tensor. When sketching $B_j$, we construct the sketching matrix $S_j \in \mathbb{R}^{n^{d-j}\times\tilde r}$ as the reshaping of $H_{j+1}\circ H_{j+2}\circ\cdots\circ H_d$ (Line 3). Afterwards, we contract $B_j$ with $S_j$ and perform SVD to solve for the $j$-th core $\hat G_j$.
Similar to (36) for TT-SVD-c, $B_j'$ is computed recursively as illustrated in Figure 12(c). The key difference lies in the new formulation of $E_j$, which arises from the new choice of sketching matrix $S_j$. Leveraging the tensor-train structure, $E_j$ can also be computed recursively as shown in Figure 12(b). The total computational cost for $\{E_j\}_{j=1}^{d-1}$ is $O(dNn\tilde r^2)$. The computation of $\{D_j\}_{j=1}^{d-1}$, depicted in Figure 12(a), follows the same approach used in Algorithm 2 and incurs a computational cost of $O(dNnr^2)$. Overall, the SVD step required to compute all tensor cores has a total complexity of $O(dNn\tilde r^2)$, which is linear with respect to both the dimensionality and the sample size.
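A sketch of Line 1 of Algorithm 5 and of the per-sample backward recursion for $E_j^{(i)}$ in Figure 12(b), assuming the sample's rank-1 factors $\tilde\Phi^{(i)}_j$ are given as length-$n$ vectors (0-based list `phi_i`):

```python
import numpy as np

def random_tt_cores(d, n, r_tilde, seed=0):
    """Random TT cores H_1 (n×r̃), H_2..H_{d-1} (r̃×n×r̃), H_d (r̃×n) with i.i.d. Gaussian entries."""
    rng = np.random.default_rng(seed)
    cores = [rng.standard_normal((n, r_tilde))]
    cores += [rng.standard_normal((r_tilde, n, r_tilde)) for _ in range(d - 2)]
    cores.append(rng.standard_normal((r_tilde, n)))
    return cores

def sketch_vectors(H, phi_i):
    """E_j^(i) ∈ R^r̃ for j = 1,...,d-1: contract Φ̃^(i)_{j+1},...,Φ̃^(i)_d with H_{j+1},...,H_d
    backwards, so the exponentially large S_j is never formed explicitly."""
    d = len(H)
    E = {d - 1: np.einsum('rn,n->r', H[d - 1], phi_i[d - 1])}     # contract with H_d
    for j in range(d - 2, 0, -1):                                  # middle cores H_{j+1}
        E[j] = np.einsum('rns,n,s->r', H[j], phi_i[j], E[j + 1])
    return E
```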

Figure 12: Tensor diagrams illustrating the recursive computation of $B_j'$ in Line 3 of Algorithm 5. (a) Recursive calculation of $D_j^{(i)}$, which is the $i$-th column of the matrix $D_j$; (b) recursive calculation of $E_j^{(i)}$, which is the $i$-th column of the matrix $E_j$; (c) calculation of $B_j'$ based on $D_j^{(i)}$ and $E_j^{(i)}$.

7 Pre-processing steps for general distributions


In this section, we propose several pre-processing steps to extend the discussed framework to general
distributions. The pre-processing procedure consists of a dimension-reduction step using principal
component analysis (PCA), followed by a step to determine the mean-field and the univariate bases.
The full algorithm is summarized in Algorithm 6.

Algorithm 6 Tensor density estimator after variable re-ordering.
INPUT: Samples $\{x^{(i)}\}_{i=1}^{N}$. Embedding dimension $d'$, $d' \leq d$. Target ranks $\{r_1,\cdots,r_{d-1}\}$.
1: Apply PCA to the samples $\{x^{(i)}\}_{i=1}^{N}$. Denote the top $d'$ orthonormal principal components by $Q \in \mathbb{R}^{d\times d'}$. Define the new variables as $z^{(i)} = Q^Tx^{(i)}$, $i = 1,\cdots,N$, with the associated empirical distribution denoted by $\hat p_z$.
2: Estimate the 1-marginals $\mu_1'(z_1),\cdots,\mu_{d'}'(z_{d'})$ using one-dimensional kernel density estimation. The product of these 1-marginals forms the mean-field $\mu' = \prod_{j=1}^{d'}\mu_j'$ in the reduced coordinate system.
3: Compute the orthogonal basis $\{\phi^j_l(z_j)\}_{l=1}^{n}$ for each $j = 1,\cdots,d'$, where the basis functions are orthogonalized with respect to $\mu_j'(z_j)$.
4: Obtain the $d'$-dimensional tensor density estimator in the reduced coordinate system:
$$\tilde p_z = \mu' \odot \Phi\,\mathrm{diag}(\tilde w_\alpha)^{-1}\,\text{TT-compress}\big(\mathrm{diag}(\tilde w_\alpha)\Phi^T\hat p_z\big),$$
where TT-compress is any tensor-train compression algorithm detailed in Sections 3–6. The final density estimator in the original space is obtained as
$$\tilde p(x) = \frac{1}{Z}\,\tilde p_z(Q^Tx),$$
where $Z$ is the normalization factor.
OUTPUT: Tensor density estimator $\tilde p$.

When dealing with a general distribution, the optimal ordering of variables is generally un-
known. A poor variable ordering may lead to a high-rank TT representation, significantly increasing
computational complexity. To address this, Step 1 of Algorithm 6 applies PCA to simultaneously
decorrelate the variables and reduce the dimensionality. The resulting reduced-coordinate variables
$\{z^{(i)}\}_{i=1}^{N}$ often exhibit simpler dependence structures, allowing a more compact TT representation. The eigen-decomposition in PCA can be performed through a "streaming approach", as in Henriksen and Ward (2019); Hoskins et al. (2024); Khoo et al. (2024), with a complexity of $O(Ndd')$.
In Step 2 of Algorithm 6, we aim to perform the projection (10) on the empirical distribution $\hat p_z$. However, to do so, we require knowledge of the mean-field $\mu'(z) = \prod_{j=1}^{d'}\mu_j'(z_j)$. To approximate this, we apply a kernel density estimator to each of the one-dimensional marginals of $\hat p_z$. This step ensures that $\mu'$ captures the product structure of the new coordinate system.
In Step 3, we construct orthonormal basis functions with respect to each one-dimensional marginal. This can be achieved using a Gram-Schmidt procedure or a stable three-term recurrence method (see Gautschi (2004)) for polynomial basis functions.
Finally, in Step 4, we estimate the density $\hat p_z$ in the reduced coordinate system using the tensor-train density estimator. This involves applying any of the tensor-train compression algorithms (TT-compress) detailed in Sections 3–6 to $\mathrm{diag}(\tilde w_\alpha)\Phi^T\hat p_z$. The resulting density estimator $\tilde p_z$ is then mapped back to the original space via the transformation $\tilde p(x) = \tilde p_z(Q^Tx)$, followed by normalization.
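A minimal sketch of Steps 1–2, using an SVD of the centered data for the PCA (rather than the streaming approach cited above) and scipy's Gaussian KDE as one possible one-dimensional kernel density estimator; both are stand-ins chosen for brevity.

```python
import numpy as np
from scipy.stats import gaussian_kde

def pca_and_marginals(X, d_prime):
    """Steps 1-2 of Algorithm 6 (sketch): PCA to d' coordinates, then 1-D KDE marginals.
    X: (N, d) samples.  Returns Q (d, d'), Z (N, d'), and a list of per-coordinate KDEs."""
    Xc = X - X.mean(axis=0, keepdims=True)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Q = Vt[:d_prime].T                       # top d' principal directions
    Z = X @ Q                                # reduced coordinates z^(i) = Q^T x^(i)
    kdes = [gaussian_kde(Z[:, j]) for j in range(d_prime)]   # mean-field factors μ'_j
    return Q, Z, kdes
```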

8 Theoretical results
In this section, we present an error analysis of our proposed methods. For simplicity, we focus primarily on the hard-thresholding approach described in (9). Based on Section 2.3, we know that the estimator from the hard-thresholding approach in (9) is close to the primarily discussed soft-thresholding approach in (10). A detailed analysis of the soft-thresholding approach in (10) is deferred to future work.
We again use the notations $\hat c = \mathrm{diag}(w_K)\Phi^T\hat p$ and $c^* = \mathrm{diag}(w_K)\Phi^Tp^*$ to denote the coefficient tensors. For simplicity, we assume the mean-field component $\mu \sim \mathrm{Unif}([0,1]^d)$. In this way, the univariate basis functions $\{\phi_l\}_{l=1}^{n}$ are orthonormal, and we take Legendre polynomials as an example in our analysis. In addition, we propose the following assumption on the ground truth density:
Assumption 1. Suppose the ground truth density function p^∗ : [0, 1]^d → R_+ is an additive
model (1-cluster):
    p^∗(x_1, · · · , x_d) = 1 + q_1^∗(x_1) + · · · + q_d^∗(x_d),        (38)
where q_j^∗ ∈ H^{β_0}([0, 1]) for some integer β_0 ⩾ 1 satisfies the regularity conditions

    min_j ∥q_j^∗∥_{L^2([0,1])} ⩾ b^∗   and   max_j ∥q_j^∗∥_{H^{β_0}([0,1])} ⩽ B^∗

for two O(1) positive constants b^∗ and B^∗. Under this assumption, we immediately obtain
that ∥p^∗∥_∞ ⩽ 1 + B^∗ d by the Sobolev embedding theorem.
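For concreteness, a simple instance of (38) in d = 2 (our own added example) is p^∗(x_1, x_2) = 1 + ½ cos(2πx_1) + ½ sin(2πx_2): it is nonnegative, integrates to one, each q_j^∗ has zero mean, ∥q_j^∗∥_{L^2([0,1])} = 1/(2√2), and ∥q_j^∗∥_{H^{β_0}([0,1])} is finite for every integer β_0 ⩾ 1, so the regularity conditions hold with b^∗ = 1/(2√2) and any sufficiently large B^∗.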
Assumption 1 considers only the 1-cluster model and includes a regularity constraint on each
one-dimensional component. The motivation is that for such densities the left/right subspaces are
computationally straightforward, allowing for an efficient tensor-train representation of p^∗. This
specific structure, in turn, yields a favorable estimation error rate with respect to the dimensionality.
We state our main theorem under these assumptions:
Theorem 1. Suppose the ground truth p∗ satisfies Assumption 1, the empirical coefficient tensor
ĉ = diag(wK )ΦT p̂, and
N ⩾ C0 d2K−1 n4K−2 log(dn), (39)
where C0 is some sufficiently large constant independent of N , n and d. When we use TT-SVD
(Algorithm 1) as TT-compress with c̃ being the output with target rank r = 2, the following bound
holds with high probability:
    ∥c̃ − c^∗∥_F / ∥c^∗∥_F = O( √( (d n B^∗ log(dn) / N) · Σ_{j=1}^{d−1} 1/σ_2^2(c^{∗(j)}) ) ),        (40)

where σ_2(·) denotes the second largest singular value.


The theorem states that the relative error of the coefficient tensor decreases at the Monte Carlo rate
O(1/√N) with respect to the sample size N. Moreover, under the assumption that p^∗ is 1-cluster,
the relative error grows only linearly with the dimensionality d, up to a logarithmic factor, if the
second largest (also the smallest since r = 2) singular values of all the unfolding matrices of c∗ are
uniformly lower bounded.
For the error with respect to density, the following corollary presents the error bound, including
additional approximation error. Both proofs are shown in the appendix.

Corollary 1. Suppose all the conditions in Theorem 1 hold. Then the following bound for the density
estimator p̃ in (24), obtained by TT-SVD (Algorithm 1) with target rank r = 2, holds with high probability:

    ∥p̃ − p^∗∥_{L^2([0,1]^d)} / ∥p^∗∥_{L^2([0,1]^d)} = O( √( n^{−β_0} + (d n B^∗ log(dn) / N) · Σ_{j=1}^{d−1} 1/σ_2^2(c^{∗(j)}) ) ),        (41)

where β_0 is the regularity parameter specified in Assumption 1.

9 Numerical results
In this section, we carry out numerical experiments to examine the performance of our proposed
density estimation algorithms. Specifically, we use TT-SVD-kn (Algorithm 3), TT-SVD-c (Algo-
rithm 4) with hierarchical matrix decomposition, and TT-rSVD-t (Algorithm 5). Unless otherwise
noted, we use the following parameter settings throughout: we set the number of sampled columns
in Nyström approximation r̃ = 100 in TT-SVD-kn, sketching size r̃ = 10 in TT-SVD-c with hi-
erarchical matrix decomposition, sketching size r̃ = 30 in TT-rSVD-t and the kernel parameter
α = 0.01. These parameter choices are based on empirical observations.
For simplicity, we choose the mean-field density µ as a uniform distribution, so that {ϕ_l^j}_{l=1}^n
can be chosen as any orthonormal basis functions. Here we use the Fourier basis

    ϕ_l(x) = 1/√(2L)                      for l = 1,
             (1/√L) cos( lπx/(2L) )        for even l > 1,
             (1/√L) sin( (l+1)πx/(2L) )    for odd l > 1,        (42)

as the univariate basis function for all j = 1, · · · , d.
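A direct transcription of (42) into code (our own sketch in NumPy; the function name and array layout are assumptions) is:

```python
import numpy as np

def fourier_basis(x, n, L):
    """Evaluate the n univariate Fourier basis functions of (42) at points x in [-L, L].

    Returns an array of shape (len(x), n) whose column l-1 holds phi_l(x)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    Phi = np.empty((x.size, n))
    for l in range(1, n + 1):                    # 1-based index, as in (42)
        if l == 1:
            Phi[:, l - 1] = 1.0 / np.sqrt(2 * L)
        elif l % 2 == 0:                         # even l > 1
            Phi[:, l - 1] = np.cos(l * np.pi * x / (2 * L)) / np.sqrt(L)
        else:                                    # odd l > 1
            Phi[:, l - 1] = np.sin((l + 1) * np.pi * x / (2 * L)) / np.sqrt(L)
    return Phi
```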

9.1 Gaussian mixture model


We first consider the d-dimensional Gaussian mixture model defined as

    (1/6) M(0.18) + (1/3) M(0.2) + (1/2) M(0.22),

where

    M(σ) = (2/3) N(−0.5, σ^2 I_{d×d}) + (1/3) N(0.5, σ^2 I_{d×d}).
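The test samples can be drawn as in the following sketch (our own helper, interpreting ±0.5 as constant mean vectors; not code from the paper):

```python
import numpy as np

def sample_gaussian_mixture(N, d, rng=None):
    """Draw N i.i.d. samples from (1/6) M(0.18) + (1/3) M(0.2) + (1/2) M(0.22), where
    M(sigma) = (2/3) N(-0.5, sigma^2 I) + (1/3) N(0.5, sigma^2 I) in d dimensions."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = rng.choice([0.18, 0.20, 0.22], size=N, p=[1 / 6, 1 / 3, 1 / 2])  # pick M(sigma)
    mean = np.where(rng.random(N) < 2 / 3, -0.5, 0.5)                         # pick the mode
    return mean[:, None] + sigma[:, None] * rng.standard_normal((N, d))
```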
We set the domain as the hypercube [−L, L]d where L = 1.5 with mesh size 0.1 in each direction.
We first fix dimension d = 10. We draw N i.i.d samples from the mixture model, and use
proposed three algorithms to fit a TT approximation for the empirical distribution of these samples.
For the algorithms, we set fixed target rank r = 3 and use n = 17 Fourier bases. In the left panel
of Figure 13, we test different sample sizes N and plot the relative L2 -error defined by

∥p̃ − p∗ ∥L2 /∥p∗ ∥L2



in logarithmic scale. One can observe that the relative error decays at the Monte Carlo rate O(1/√N).
Here we remark that the ground truth p∗ for this Gaussian mixture model can be formulated as a

[Figure 13 shows the relative L^2-error curves on a logarithmic scale; fitted slope annotations of 0.53277 and 0.75182 appear in the panels.]
Figure 13: Gaussian mixture model: relative L^2-error for proposed algorithms. Left: d = 10 and
sample number N = 2^7, 2^8, · · · , 2^15. Right: N = 10^6 and dimension d = 3, 6, 9, · · · , 30.

TT without approximation error, and the numerical L^2-norm of TTs, based on quadrature, can be
easily computed with O(d) complexity (Oseledets and Tyrtyshnikov (2010)).
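For reference, the O(d) norm computation can be sketched as below for coefficient tensor trains with cores G_k ∈ R^{r_{k−1}×n×r_k}; since the univariate basis is orthonormal, the functional L^2 norm equals the Frobenius norm of the coefficient TT. This is our own illustrative code, not the implementation used in the experiments.

```python
import numpy as np

def tt_inner(cores_a, cores_b):
    """Inner product of two tensors in TT form (cores of shape (r_prev, n, r_next)), in O(d)."""
    M = np.ones((1, 1))                                    # running Gram matrix over bond indices
    for Ga, Gb in zip(cores_a, cores_b):
        M = np.einsum("ab,aic,bid->cd", M, Ga, Gb)         # contract the shared physical index
    return float(M[0, 0])

def tt_relative_l2_error(cores_est, cores_ref):
    """Relative L2 error ||p_est - p_ref|| / ||p_ref|| for two TTs sharing the same basis."""
    err2 = tt_inner(cores_est, cores_est) - 2 * tt_inner(cores_est, cores_ref) \
           + tt_inner(cores_ref, cores_ref)
    return np.sqrt(max(err2, 0.0)) / np.sqrt(tt_inner(cores_ref, cores_ref))
```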
Next, we fix the sample size N = 10^6 and test our algorithms for various dimensions d.
The relative L^2-error is plotted in the right panel of Figure 13, which exhibits sublinear growth with
increasing dimensionality according to the curves. In particular, with N = 10^6 samples, our
algorithms can achieve two-digit relative L^2-error up to d = 12. For dimensionality higher than
that, we can expect that the error can also be controlled below 0.1 for a sufficiently large sample
size, as suggested by the Monte Carlo convergence observed in the previous experiment.
Finally, we verify the time complexity of the algorithms. In Figure 14, we plot the wall clock
time of the three algorithms for fixed d = 5 with various N (left panel) and fixed N = 10000 with
various d (right panel). The growth rates of the wall clock time verify the theoretical computational
complexities listed in Table 1.

9.2 Ginzburg-Landau model


Now we consider a more practical problem from statistical physics, the Ginzburg-Landau model
(Hoffmann and Tang (2012)). In particular, we consider density estimation for the Boltzmann
distribution

    p^∗(x) = (1/Z) e^{−βV(x)},

where β is the inverse temperature, Z is the normalization factor such that ∫ p^∗(x) dx = 1, and V
is the Ginzburg-Landau potential energy.

9.2.1 1D Ginzburg-Landau model


We first consider a 1D Ginzburg-Landau model where the potential energy is formulated as
    V(x_1, · · · , x_d) = Σ_{i=1}^{d+1} [ (λ/2) ( (x_i − x_{i−1}) / h )^2 + (1/(4λ)) (1 − x_i^2)^2 ]   for x ∈ [−L, L]^d,        (43)

[Figure 14 shows the wall clock time curves on a logarithmic scale; fitted slope annotations of 0.97824 and 1.119 appear in the panels.]
Figure 14: Gaussian mixture model: average wall clock time (seconds) for proposed algorithms.
Left: d = 5 and sample number N = 2^7, 2^8, · · · , 2^15. Right: N = 10000 and dimension d =
3, 6, 9, · · · , 30.

where x_0 = x_{d+1} = 0, h = 1/(1 + d), λ = 0.03 and β = 1/8. We consider the hypercube domain
[−L, L]^d with L = 2.5 and mesh size 0.05 in each direction. For the ground truth p^∗, we use the DMRG-
cross algorithm (see Savostyanov and Oseledets (2011)) to obtain p^∗ in TT form, which is known to
provide highly accurate approximations with arbitrarily prescribed precision. The samples of p^∗
are generated by running Langevin dynamics for a sufficiently long time.
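A minimal unadjusted Langevin sketch for producing such samples is given below (our own illustration; the step size dt, number of steps, initialization, and clipping at the box boundary are ad hoc choices, not the exact settings used in the experiments).

```python
import numpy as np

def grad_V_1d(x, lam, h):
    """Gradient of the 1D Ginzburg-Landau potential (43) with boundary values x_0 = x_{d+1} = 0."""
    xp = np.pad(x, ((0, 0), (1, 1)))                         # append the zero boundary values
    lap = (2 * xp[:, 1:-1] - xp[:, :-2] - xp[:, 2:]) / h**2  # from the quadratic difference terms
    return lam * lap - (1.0 / lam) * x * (1 - x**2)          # plus the double-well term

def langevin_samples(N, d, lam=0.03, beta=1 / 8, L=2.5, dt=1e-4, n_steps=20000, rng=None):
    """Run N independent chains of dX = -grad V(X) dt + sqrt(2/beta) dW and return the final
    states as approximate samples from p* proportional to exp(-beta V)."""
    rng = np.random.default_rng() if rng is None else rng
    h = 1.0 / (1 + d)
    x = rng.uniform(-1, 1, size=(N, d))
    for _ in range(n_steps):
        x = x - dt * grad_V_1d(x, lam, h) \
            + np.sqrt(2 * dt / beta) * rng.standard_normal((N, d))
        x = np.clip(x, -L, L)                                # keep the chains inside [-L, L]^d
    return x
```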
In Table 2, we list the relative L^2-error of the densities estimated from N = 10^6 samples with
different numbers of univariate basis functions. For this example, we set a fixed target rank r = 4.
One can observe that the approximation improves as we use more basis functions. Such improvement
can be further visualized in Figure 15, where we plot the marginal distribution of the approximated
solutions obtained by the three algorithms when d = 20. All of them match the sample histograms
accurately when a sufficient number of basis functions is used.
[Figure 15 panels, left to right: GL-1D, TT-SVD-c; GL-1D, TT-SVD-kn; GL-1D, TT-rSVD-t.]

Figure 15: 1D G-L model: x_{10}-marginal of approximated solutions for d = 20 and N = 10^6 under
different numbers of univariate basis functions used.

d=5 d = 10
n
TT-SVD-c TT-SVD-kn TT-rSVD-t TT-SVD-c TT-SVD-kn TT-rSVD-t
7 0.2810 0.2809 0.2810 0.4304 0.4304 0.4305
11 0.0782 0.0781 0.0781 0.1365 0.1365 0.1378
15 0.0305 0.0304 0.0301 0.0488 0.0483 0.0533
19 0.0275 0.0273 0.0289 0.0449 0.0404 0.0526
d = 15 d = 20
n
TT-SVD-c TT-SVD-kn TT-rSVD-t TT-SVD-c TT-SVD-kn TT-rSVD-t
7 0.6342 0.6342 0.6343 0.7897 0.7897 0.7898
11 0.2169 0.2167 0.2186 0.3017 0.3016 0.3037
15 0.0733 0.0740 0.0836 0.1126 0.1125 0.1158
19 0.0613 0.0602 0.0716 0.0884 0.0874 0.1099

Table 2: 1D G-L model: comparison of relative L^2-errors with N = 10^6 and d = 5, 10, 15, 20 under
different numbers of univariate basis functions.

9.2.2 2D Ginzburg-Landau model


Next, we consider the 2D Ginzburg-Landau model, which is defined on an m × m lattice. The
potential energy is formulated as

    V(x_{1,1}, · · · , x_{m,m}) = (λ/2) Σ_{i=1}^{m+1} Σ_{j=1}^{m+1} [ ( (x_{i,j} − x_{i−1,j}) / h )^2 + ( (x_{i,j} − x_{i,j−1}) / h )^2 ] + (1/(4λ)) Σ_{i=1}^{m} Σ_{j=1}^{m} (1 − x_{i,j}^2)^2

with the boundary condition

x0,j = xm+1,j = 1, xi,0 = xi,m+1 = −1 ∀ i, j = 1, · · · , m

where we choose the parameters h = 1/(1 + m), λ = 0.1 and inverse temperature β = 1. We set L = 2
with mesh size 0.05 for the domain. We consider a test example where m = 16, so that the total
dimensionality is m^2 = 256. For this 2D model, we apply the PCA pre-processing steps
described in Section 7 to order the dimensions and also reduce the dimensionality, where we choose
the embedding dimension d′ = 100. In terms of the algorithms, we set the target rank r = 10, α = 0.001,
and we use N = 10^4 samples and n = 21 Fourier bases.
We evaluate the performance of our fitted tensor-train density estimators as generative models.
Specifically, we first fit tensor-train density estimators p̃ from the given N = 10^4 samples {x_i}_{i=1}^N
using the proposed algorithms, then generate new samples {x̃_i}_{i=1}^N from the learned tensor trains.
Sampling each x̃i ∼ p̃ can be done via tensor-train conditional sampling

p̃(x1 , x2 , · · · , xd ) = p̃(x1 )p̃(x2 |x1 )p̃(x3 |x2 , x1 ) · · · p̃(xd |xd−1 , · · · , x1 ) (44)

where each

    p̃(x_i | x_{i−1}, · · · , x_1) = p̃(x_1, · · · , x_i) / p̃(x_1, · · · , x_{i−1})

is a conditional distribution of p̃, and p̃(x_1), · · · , p̃(x_1, · · · , x_{d−1}) are the marginal distributions. We
point out that each conditional sampling step can be done with O(d) complexity using the TT structure
of p̃ (Dolgov et al. (2020)).
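A sketch of the conditional-sampling recursion (44) is shown below for a TT whose cores hold (nonnegative) values of p̃ on a tensor-product grid, so that marginalization reduces to summing each core over its physical index; in the functional setting those sums are replaced by quadrature against the basis. The function and variable names are our own, not the paper's.

```python
import numpy as np

def tt_conditional_sample(cores, rng=None):
    """Draw one sample (as grid indices) from a nonnegative TT p(i_1,...,i_d) via
    p(i_1) p(i_2 | i_1) ... p(i_d | i_{d-1},...,i_1), at O(d) cost per sample.

    cores[k] has shape (r_{k-1}, n_k, r_k) with r_0 = r_d = 1."""
    rng = np.random.default_rng() if rng is None else rng
    d = len(cores)
    # right environments: trailing cores summed over their physical indices
    R = [None] * (d + 1)
    R[d] = np.ones(cores[-1].shape[-1])
    for k in range(d - 1, -1, -1):
        R[k] = cores[k].sum(axis=1) @ R[k + 1]
    left = np.ones(1)                                      # message carrying the sampled prefix
    sample = []
    for k in range(d):
        # unnormalized marginal of i_k given the already-sampled coordinates
        weights = np.einsum("a,aib,b->i", left, cores[k], R[k + 1])
        weights = np.clip(weights, 0.0, None)              # guard against tiny negative values
        probs = weights / weights.sum()
        ik = rng.choice(len(probs), p=probs)
        sample.append(ik)
        left = left @ cores[k][:, ik, :]                   # condition on the drawn index
    return sample
```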

Figure 16: 2D G-L model on a 16 × 16 lattice: visualization of (x3,3 , x14,14 )-marginal (first row)
and (x6,12 , x8,12 )-marginal (second row). Left to right: reference sample histograms, samples from
tensor trains fitted by proposed algorithms.

Figure 16 illustrates the (x3,3 , x14,14 )-marginal and (x6,12 , x8,12 )-marginal distributions of the
generated samples. For comparison, we also show results for N^∗ = 10^5 reference samples. The visu-
alizations reveal that samples generated from tensor trains learned by all three proposed algorithms
can capture local minima accurately. To quantify the accuracy, we compute the second-moment
error
∥X̃ T X̃ − X ∗ T X ∗ ∥F /∥X ∗ T X ∗ ∥F (45)

where X̃ ∈ RN ×d is the matrix formed by the generated samples, and X ∗ ∈ RN ×d are reference
samples. The second-moment errors for TT-SVD-kn, TT-SVD-c and TT-rSVD-t algorithms are
respectively 0.0278, 0.0252, 0.0264.
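The metric (45) itself is a one-line computation (our own helper):

```python
import numpy as np

def second_moment_error(X_gen, X_ref):
    """Relative second-moment error (45) between generated and reference sample matrices."""
    A, B = X_gen.T @ X_gen, X_ref.T @ X_ref
    return np.linalg.norm(A - B) / np.linalg.norm(B)
```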

9.3 Comparison with neural network approaches


In this section, we compare the performance of generative models fitted by our proposed tensor-
train algorithms with one example of neural network-based generative methods (Papamakarios
et al. (2017)). For neural network details, we use 5 transforms and each transform is a 3-hidden
layer neural network with width 128. We use Adam optimizer to train the neural network. The
experiments are based on selected examples from previous subsections as well as real data from
MNIST.

9.3.1 1D Ginzburg-Landau model
We first consider the Boltzmann distribution of the 1D Ginzburg-Landau potential defined in Section 9.2.1,
where we choose the model parameters λ = 0.005 and β = 1/20. We use N = 10^4 samples as training
data to fit tensor trains, and then generate N = 10^4 new samples from the learned tensor trains
by conditional sampling (44).
[Figure 17 plots the second-moment error against the dimension d = 10, 20, · · · , 100 for TT-SVD-c, TT-SVD-kn, TT-rSVD-t, and NN.]

Figure 17: Comparison of second-moment errors of generated samples.

In Figure 17, we plot the second-moment errors of the samples generated by tensor-train al-
gorithms and a neural network, compared against a reference set of N^∗ = 10^6 samples across
various dimensions d = 10, 20, · · · , 100. Here we choose target rank r = 4 and use n = 17 Fourier
bases for all simulations. The three proposed tensor-train algorithms exhibit similar performance
and consistently achieve two-digit accuracy across all dimensions. Moreover, they are considerably
more accurate than the neural network. To visualize this accuracy gap, Figure 18 further shows
histograms of two selected marginal distributions from the generated samples for d = 100. It can
be observed that samples generated by the tensor-train estimators more accurately capture local
minima compared to those produced by the neural network.

9.3.2 Real data: MNIST


We also evaluate our algorithms on the MNIST dataset, which consists of 70,000 images of hand-
written digits (0 through 9), each labeled with its corresponding digit. Each image is represented
as a 28 × 28 pixel matrix. To learn the generative models, we treat the dataset as a 70000 × 28^2
collection of samples. For each digit class, we apply the TT-rSVD-t algorithm with parameters
α = 0.05, sketching rank r̃ = 40, target rank r = 10 and n = 15 Fourier bases, to learn tensor-train
representations from the corresponding samples of dimension d = 28^2 = 784. Again, the PCA
pre-processing steps are applied to reduce the dimensionality, where the embedding dimension is set to
d′ = 8. After training, we independently sample five images for each digit from the learned TT,
as shown in Figure 19(b). For comparison, Figure 19(a) displays five randomly selected images

Figure 18: 1D G-L model for d = 100: visualization of (x99 , x100 )-marginal (first row) and (x90 , x95 )-
marginal (second row) of samples from generative models. Left to right: reference samples, samples
from tensor trains under three proposed algorithms, and samples from neural network.

per digit from the original MNIST dataset. The generated images from the tensor train exhibit
well-preserved digit structures and visually plausible variations across classes.

10 Conclusion
In this paper, we propose a natural approach to high-dimensional density estimation entirely based
on linear algebra, which allows the construction of a density estimator as a low-rank functional
tensor train. To this end, we provide four tensor decomposition algorithms. The first one is based
on the direct TT-SVD algorithm with a fast implementation in the density estimation setting. The second
algorithm follows a Nyström approximation. The third approach modifies the first by
thresholding certain elements in the tensor and employing a hierarchical matrix factorization to
improve computational efficiency. The last algorithm employs a tensorized sketching technique.
Several of these proposed algorithms can achieve linear computational complexity both in dimension
d and in sample size N . Theoretical analysis demonstrates that tensor-train estimators can also
achieve linear scaling of the estimation error with respect to the dimensionality d under some
specific conditions. Comprehensive numerical experiments validate the accuracy of all proposed
algorithms, in comparison to other density estimators based on neural networks.

11 Declaration
Acknowledgment. Y. Khoo acknowledges partial support from DMS-2339439, DE-SC0022232,
and the Sloan Foundation. Y. Peng acknowledges partial support from DE-SC0022232.

31
(a) Real images (b) Generated images

Figure 19: Generated and real images from MNIST. For each digit class, five images are randomly
selected for illustration.

References
Markus Bachmayr, Reinhold Schneider, and André Uschmajew. Tensor networks and hierarchi-
cal tensors for the solution of high-dimensional partial differential equations. Foundations of
Computational Mathematics, 16:1423–1472, 2016.

Mario Bebendorf. Hierarchical matrices. Springer, 2008.

Tai-Danae Bradley, E Miles Stoudenmire, and John Terilla. Modeling sequences with quantum
states: a look under the hood. Machine Learning: Science and Technology, 1(3):035008, 2020.

T Tony Cai and Anru Zhang. Rate-optimal perturbation bounds for singular subspaces with
applications to high-dimensional statistics. 2018.

Kamalika Chaudhuri, Sanjoy Dasgupta, Samory Kpotufe, and Ulrike Von Luxburg. Consistent
procedures for cluster tree estimation and pruning. IEEE Transactions on Information Theory,
60(12):7900–7912, 2014.

Yen-Chi Chen. A tutorial on kernel density estimation and recent advances. Biostatistics & Epi-
demiology, 1(1):161–187, 2017.

Yian Chen and Yuehaw Khoo. Combining particle and tensor-network methods for partial differ-
ential equations via sketching. arXiv preprint arXiv:2305.17884, 2023.

Kyle Cranmer. Kernel estimation in high-energy physics. Computer Physics Communications, 136
(3):198–207, 2001.

Richard A Davis, Keh-Shin Lii, and Dimitris N Politis. Remarks on some nonparametric estimates
of a density function. Selected Works of Murray Rosenblatt, pages 95–100, 2011.

Ronald A DeVore and George G Lorentz. Constructive approximation, volume 303. Springer Science
& Business Media, 1993.

Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: Non-linear independent components
estimation. arXiv preprint arXiv:1410.8516, 2014.

Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp. arXiv
preprint arXiv:1605.08803, 2016.

Sergey Dolgov, Karim Anaya-Izquierdo, Colin Fox, and Robert Scheichl. Approximation and sam-
pling of multivariate probability distributions in the tensor train decomposition. Statistics and
Computing, 30:603–625, 2020.

Yilun Du and Igor Mordatch. Implicit generation and modeling with energy based models. Advances
in Neural Information Processing Systems, 32, 2019.

Walter Gautschi. Orthogonal polynomials: computation and approximation. OUP Oxford, 2004.

Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. Made: Masked autoencoder
for distribution estimation. In International conference on machine learning, pages 881–889.
PMLR, 2015.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair,
Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the
ACM, 63(11):139–144, 2020.

Wolfgang Hackbusch. A sparse matrix arithmetic based on H-matrices. Part I: Introduction to
H-matrices. Computing, 62(2):89–108, 1999.

Wolfgang Hackbusch and Boris N Khoromskij. A sparse H-matrix arithmetic. Part II: Application
to multi-dimensional problems. Computing, 64:21–47, 2000.

Nathan Halko, Per-Gunnar Martinsson, and Joel A Tropp. Finding structure with randomness:
Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review, 53
(2):217–288, 2011.

Zhao-Yu Han, Jun Wang, Heng Fan, Lei Wang, and Pan Zhang. Unsupervised generative modeling
using matrix product states. Physical Review X, 8(3):031012, 2018.

Amelia Henriksen and Rachel Ward. Adaoja: Adaptive learning rates for streaming pca. arXiv
preprint arXiv:1905.12115, 2019.

Karl-Heinz Hoffmann and Qi Tang. Ginzburg-Landau Phase Transition Theory and Superconduc-
tivity. Birkhäuser Basel, 2012.

Jeremy Hoskins, Yuehaw Khoo, Oscar Mickelin, Amit Singer, and Yuguan Wang. Subspace
method of moments for ab initio 3-d single-particle cryo-em reconstruction. arXiv preprint
arXiv:2410.06889, 2024.

Benjamin Huber, Reinhold Schneider, and Sebastian Wolf. A randomized tensor train singular value
decomposition. In Compressed Sensing and its Applications: Second International MATHEON
Conference 2015, pages 261–290. Springer, 2017.

William Huggins, Piyush Patil, Bradley Mitchell, K Birgitta Whaley, and E Miles Stoudenmire.
Towards quantum machine learning with tensor networks. Quantum Science and technology, 4
(2):024001, 2019.

Yoonhaeng Hur, Jeremy G Hoskins, Michael Lindsey, E Miles Stoudenmire, and Yuehaw Khoo.
Generative modeling via tensor train sketching. Applied and Computational Harmonic Analysis,
67:101575, 2023.

Yuehaw Khoo, Yifan Peng, and Daren Wang. Nonparametric estimation via variance-reduced
sketching. arXiv preprint arXiv:2401.11646, 2024.

Taesup Kim and Yoshua Bengio. Deep directed generative models with energy-based probability
estimation. arXiv preprint arXiv:1606.03439, 2016.

Daniel Kressner and André Uschmajew. On low-rank approximability of solutions to high-
dimensional operator equations and eigenvalue problems. Linear Algebra and its Applications,
493:556–572, 2016.

Daniel Kressner, Bart Vandereycken, and Rik Voorhaar. Streaming tensor train approximation.
SIAM Journal on Scientific Computing, 45(5):A2610–A2631, 2023.

Ji Liu, Przemyslaw Musialski, Peter Wonka, and Jieping Ye. Tensor completion for estimating
missing values in visual data. IEEE transactions on pattern analysis and machine intelligence,
35(1):208–220, 2012.

Qiao Liu, Jiaze Xu, Rui Jiang, and Wing Hung Wong. Density estimation using deep generative
neural networks. Proceedings of the National Academy of Sciences, 118(15):e2101344118, 2021.

YP Mack and Murray Rosenblatt. Multivariate k-nearest neighbor density estimates. Journal of
Multivariate Analysis, 9(1):1–15, 1979.

Michael W Mahoney and Petros Drineas. Cur matrix decompositions for improved data analysis.
Proceedings of the National Academy of Sciences, 106(3):697–702, 2009.

Georgii S Novikov, Maxim E Panov, and Ivan V Oseledets. Tensor-train density estimation. In
Uncertainty in artificial intelligence, pages 1321–1331. PMLR, 2021.

Ivan Oseledets and Eugene Tyrtyshnikov. Tt-cross approximation for multidimensional arrays.
Linear Algebra and its Applications, 432(1):70–88, 2010.

Ivan V Oseledets. Tensor-train decomposition. SIAM Journal on Scientific Computing, 33(5):
2295–2317, 2011.

George Papamakarios, Theo Pavlakou, and Iain Murray. Masked autoregressive flow for density
estimation. Advances in neural information processing systems, 30, 2017.

Emanuel Parzen. On estimation of a probability density function and mode. The annals of math-
ematical statistics, 33(3):1065–1076, 1962.

Yifan Peng, Yian Chen, E Miles Stoudenmire, and Yuehaw Khoo. Generative modeling via hierar-
chical tensor sketching. arXiv preprint arXiv:2304.05305, 2023.

David Perez-Garcia, Frank Verstraete, Michael M Wolf, and J Ignacio Cirac. Matrix product state
representations. arXiv preprint quant-ph/0608197, 2006.

Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International
conference on machine learning, pages 1530–1538. PMLR, 2015.

Dmitry Savostyanov and Ivan Oseledets. Fast adaptive interpolation of multi-dimensional arrays
in tensor train format. In The 2011 International Workshop on Multidimensional (nD) Systems,
pages 1–8. IEEE, 2011.

David W Scott. On optimal and data-based histograms. Biometrika, 66(3):605–610, 1979.

David W Scott. Averaged shifted histograms: effective nonparametric density estimators in several
dimensions. The Annals of Statistics, pages 1024–1040, 1985.

Richik Sengupta, Soumik Adhikary, Ivan Oseledets, and Jacob Biamonte. Tensor networks in
machine learning. European Mathematical Society Magazine, (126):4–12, 2022.

Tianyi Shi, Maximilian Ruth, and Alex Townsend. Parallel algorithms for computing the tensor-
train decomposition. SIAM Journal on Scientific Computing, 45(3):C101–C130, 2023.

Qingquan Song, Hancheng Ge, James Caverlee, and Xia Hu. Tensor completion algorithms in big
data analytics. ACM Transactions on Knowledge Discovery from Data (TKDD), 13(1):1–48,
2019.

Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribu-
tion. Advances in neural information processing systems, 32, 2019.

Edwin Stoudenmire and David J Schwab. Supervised learning with tensor networks. Advances in
neural information processing systems, 29, 2016.

Yiming Sun, Yang Guo, Charlene Luo, Joel Tropp, and Madeleine Udell. Low-rank tucker approx-
imation of a tensor from streaming data. SIAM Journal on Mathematics of Data Science, 2(4):
1123–1150, 2020.

Xun Tang and Lexing Ying. Solving high-dimensional fokker-planck equation with functional hier-
archical tensor. arXiv preprint arXiv:2312.07455, 2023.

Xun Tang, Yoonhaeng Hur, Yuehaw Khoo, and Lexing Ying. Generative modeling via tree tensor
network states. arXiv preprint arXiv:2209.01341, 2022.

Benigno Uria, Marc-Alexandre Côté, Karol Gregor, Iain Murray, and Hugo Larochelle. Neural
autoregressive distribution estimation. The Journal of Machine Learning Research, 17(1):7184–
7220, 2016.

Haiyong Wang. How much faster does the best polynomial approximation converge than legendre
projection? Numerische Mathematik, 147(2):481–503, 2021.

Steven R White. Density-matrix algorithms for quantum renormalization groups. Physical review
b, 48(14):10345, 1993.

Christopher Williams and Matthias Seeger. Using the nyström method to speed up kernel machines.
Advances in neural information processing systems, 13, 2000.

Adriano Z Zambom and Ronaldo Dias. A review of kernel density estimation with applications to
econometrics. International Econometric Review, 5(1):20–42, 2013.

A Definitions and notations for the proof of Theorem 1


Before the proof, we introduce some definitions and notations that will be used in the main proof
in Appendix B.
To begin with, we use c^∗ ∈ R^{n^d} to denote the coefficient tensor of the ground truth under a set of
orthonormal basis functions {ϕ_l^j}_{l=1}^n, whose entries are given by

c∗ (l1 , · · · , ld ) = ⟨p∗ , ϕ1l1 · · · ϕdld ⟩.

In the remainder of this appendix, we omit the superscripts on the basis functions unless they are
needed to indicate the corresponding coordinate of dimensions. Based on the additive formulation
of ground truth p∗ in (38) and orthogonality of {ϕjl }nl=1 , c∗ can be expressed as

c∗ = 1d + c∗1 ⊗ 1d−1 + 1 ⊗ c∗2 ⊗ 1d−2 + · · · + 1d−1 ⊗ c∗d , (46)


where 1^k ∈ R^{n^k} represents the k-fold Kronecker product of 1 = [1, 0, · · · , 0]^T ∈ R^n, and c_j^∗ ∈ R^n is a
vector whose l-th coordinate is given by

    c_j^∗(l) = ∫_0^1 · · · ∫_0^1 p^∗(x_1, · · · , x_d) ϕ_l(x_j) dx_1 · · · dx_d.        (47)

We can further simplify the expression of c_j^∗ as follows. By the normalization condition of p^∗, the q_j^∗
in the expression of p^∗ satisfy Σ_{j=1}^{d} ∫_0^1 q_j^∗(x_j) dx_j = 0. Without loss of generality, we assume that

    ∫_0^1 q_j^∗(x_j) dx_j = 0.        (48)

This further gives us

    c_j^∗(l) = 0                                  if l = 1;
               ∫_0^1 q_j^∗(x_j) ϕ_l(x_j) dx_j      if 2 ⩽ l ⩽ n.        (49)

Definition 1. Let Bs,d be the s-cluster basis functions defined in (5). Define the s-cluster estimator
X X
p̂s (x) := 1 + ⟨p̂, φ1 ⟩φ1 (x) + · · · + ⟨p̂, φs ⟩φs (x) (50)
φ1 ∈B1,d φs ∈Bs,d

The corresponding coefficient tensor ĉ[s] ∈ R^{n^d} of p̂_s is

ĉ[s](l1 , · · · , ld ) = ⟨p̂s , φl1 · · · φld ⟩. (51)

In this case, ĉ[s] will be referred to as the s-cluster coefficient estimator of c∗ .

Definition 2. Given matrix A ∈ Rm×n and positive integer r ⩽ min{m, n}, we define function
SVD : Rm×n × Z+ → Om×r satisfying

SVD(A, r) = U (52)

where A = U ΣV T is the top-r SVD of the matrix A.

B Proof of the main Theorem 1


In this proof, we simply denote ĉ[K] as defined in (51) by ĉ.

Proof. Let Ĝ_{1:d} be the TT cores of c̃. To bound the error ∥c^∗ − Ĝ_{1:d}∥_F, we first use Lemma 4 to
formulate it as

∥c∗ − Ĝ1:d ∥2F


=∥Ĝ1:d−1 ĜT1:d−1 c∗(d−1) − c∗(d−1) ∥2F (53)
+∥Ĝ1:d−1 ĜT1:d−1 ĉ(d−1) − Ĝ1:d−1 ĜT1:d−1 c∗(d−1) ∥2F . (54)

Later we will bound the above two terms by Step 1 and Step 2 respectively. Before elaborating
these two steps, let us introduce two error estimates that will be used frequently in the proof of
this theorem. With high probability, we have
    max_{k∈{1,··· ,d}} ∥(U_{k−1}^{∗T} ⊗ I_n)(c^{∗(k)} − ĉ^{(k)}) V_k^∗∥_F ⩽ C_1 √( n log(dn) ∥p^∗∥_∞ / N ),        (55)

    ∥c^∗ − ĉ∥_F ⩽ C_2 √( d^K n^{2K} / N ).        (56)

Their proofs are provided respectively by Lemma 11 and Lemma 12. In the above inequality,
C1 , C2 are two sufficiently large universal constants. Uk∗ and Vk∗ are orthogonal matrices such that
U_k^∗ U_k^{∗T} ∈ R^{n^k × n^k} is the projection matrix onto the space Col(c^{∗(k)}), and V_k^∗ V_k^{∗T} ∈ R^{n^{d−k} × n^{d−k}} is
the projection matrix onto the space Row(c^{∗(k)}). Due to the additive formulation (46) of c^∗, U_k^∗
and V_k^∗ have analytical expressions, shown in Lemma 8.
(56) provides an upper bound on the gap between the empirical and ground-truth coefficient tensors,
which scales as O(d^{K/2}) upon using the K-cluster basis for the density estimator. Notably, with the
projections U_k^∗ and V_k^∗, this gap grows only at O(√(d log(dn))) under Assumption 1 for ∥p^∗∥_∞,
as stated in (55). This is due to the 1-cluster structure of c^∗, which is the key to making our error
analysis sharp.
Step 1. To bound (53), we show by induction that for k ∈ {1, · · · , d − 1},
    ∥Ĝ_{1:k} Ĝ_{1:k}^T c^{∗(k)} − c^{∗(k)}∥_F^2 ⩽ C_3 ( n log(dn) ∥p^∗∥_∞ ∥c^∗∥_F^2 / N ) Σ_{j=1}^{k} 1/σ_2^2(c^{∗(j)})        (57)

for a universal sufficiently large constant C3 with high probability, where σ2 (c∗(j) ) is the second
largest singular value of the unfolding matrix c∗(j) .
We prove k = 1 first:
∥Ĝ1 ĜT1 c∗(1) − c∗(1) ∥2F = ∥Ĝ1 ĜT1 c∗(1) − U1∗ U1∗ T c∗(1) ∥2F ⩽ ∥Ĝ1 ĜT1 − U1∗ U1∗ T ∥22 ∥c∗ ∥2F (58)

We use the matrix perturbation theory in Corollary 3 to bound ∥Ĝ1 ĜT1 − U1∗ U1∗ T ∥2 . The spectrum

q q
dK n2K dK n2K
gap assumption in Corollary 3 holds since ∥ĉ(1) − c∗(1) ∥F ⩽ O N and O N ⩽
σ2 (c∗(1) ) by (56) and using a sufficiently large N such that (39) holds. Therefore, we have
∥(ĉ(1) − c∗(1) )V1∗ T ∥2 ∥ĉ(1) − c∗(1) ∥22
∥Ĝ1 ĜT1 − U1∗ U1∗ T ∥2 ⩽ C4 ( + ). (59)
σ2 (c∗(1) ) σ22 (c∗(1) )
for some universal constant C4 . According to (55), the first term
s
∥(ĉ(1) − c∗(1) )V1∗ T ∥2 n log(dn)∥p∗ ∥∞
⩽ C 1 (60)
σ2 (c∗(1) ) N σ22 (c∗(1) )
and the second term can be bounded by the first term upon using sufficiently large N such that
(39) holds. Therefore,
n log(dn)∥p∗ ∥∞ ∥c∗ ∥2F
∥Ĝ1 ĜT1 c∗(1) − c∗(1) ∥2F ⩽ C3 (61)
N σ22 (c∗(1) )
for some universal constant C3 .
Now suppose (57) holds for k. Then for k + 1,
∥Ĝ1:k+1 ĜT1:k+1 c∗(k+1) − c∗(k+1) ∥2F
=∥Ĝ1:k+1 ĜT1:k+1 c∗(k+1) − (Ĝ1:k ⊗ In )(Ĝ1:k ⊗ In )T c∗(k+1) ∥2F + ∥Ĝ1:k ĜT1:k c∗(k) − c∗(k) ∥2F
⩽∥Ĝ1:k+1 ĜT1:k+1 c∗(k+1) − (Ĝ1:k ⊗ In )(Ĝ1:k ⊗ In )T c∗(k+1) ∥2F (62)
k
n log(dn)∥p∗ ∥∞ ∥c∗ ∥2F X 1
+C3 . (63)
N σ 2 (c∗(j) )
j=1 2

where the equality follows from Lemma 3, and the inequality follows from induction (57). It suffices
to bound (62). Note that
(62) =∥(Ĝ1:k ⊗ In )Ĝk+1 ĜTk+1 (Ĝ1:k ⊗ In )T c∗(k+1) − (Ĝ1:k ⊗ In )(Ĝ1:k ⊗ In )T c∗(k+1) ∥2F
=∥(Ĝ1:k ⊗ In ) Ĝk+1 ĜTk+1 (Ĝ1:k ⊗ In )T c∗(k+1) − (Ĝ1:k ⊗ In )T c∗(k+1) ∥2F
 

⩽∥Ĝ1:k ⊗ In ∥22 ∥Ĝk+1 ĜTk+1 (Ĝ1:k ⊗ In )T c∗(k+1) − (Ĝ1:k ⊗ In )T c∗(k+1) ∥2F


=∥Ĝk+1 ĜTk+1 (Ĝ1:k ⊗ In )T c∗(k+1) − (Ĝ1:k ⊗ In )T c∗(k+1) ∥2F . (64)

Step 1A. Define the matrix formed by principal left singular vectors:
G̃k+1 = SVD (Ĝ1:k ⊗ In )T c∗(k+1) , 2 ∈ O2n×2 .

(65)

where the SVD function is defined in (52). Since Rank((Ĝ1:k−1 ⊗ In )T c∗(k) ) = 2, it follows that
(Ĝ1:k ⊗ In )T c∗(k+1) = G̃k+1 G̃Tk+1 (Ĝ1:k ⊗ In )T c∗(k+1) . (66)
Therefore by (64) and (66),
(62) =∥Ĝk+1 ĜTk+1 (Ĝ1:k+1 ⊗ In )T c∗(k+1) − G̃k+1 G̃Tk+1 (Ĝ1:k ⊗ In )T c∗(k+1) ∥2F
⩽∥Ĝk+1 ĜTk+1 − G̃k+1 G̃Tk+1 ∥22 ∥Ĝ1:k ⊗ In ∥22 ∥c∗(k+1) ∥2F
=∥Ĝk+1 ĜTk+1 − G̃k+1 G̃Tk+1 ∥22 ∥c∗ ∥2F . (67)

Perturbation of projection matrices ∥Ĝk+1 ĜTk+1 − G̃k+1 G̃Tk+1 ∥2 will again be estimated using Corol-
lary 3. More specifically, in Corollary 3, we consider  = (Ĝ1:k ⊗ In )T ĉ(k+1) and A = Ĝ1:k ⊗
In )T c∗(k+1) , and Ĝk+1 ĜTk+1 and G̃k+1 G̃Tk+1 are respectively the projection matrices of  and A. In
the subsequent Step 1B, we verify the conditions that should be satisfied in Corollary 3.

Step 1B. By the induction assumption (57), we have


k
n log(dn)∥p∗ ∥∞ ∥c∗ ∥2F X 1
∥Ĝ1:k ĜT1:k c∗(k) − c∗(k) ∥2F ⩽ C3 2
N σ (c∗(j) )
j=1 2
(68)
d
C3 n log(dn)(1 + B ∗ d)Cc d X d C4 nd2 log2 (dn)
⩽ ⩽ ,
N c2σ j(d − j) N
j=1

where C4 is a universal constant, B ∗ is given in Assumption 1. The second inequality above follows
from upper bound for ∥c∗ ∥2F in Corollary 2 and the lower bound for singular value in Lemma 10,
and the third inequality follows from the fact that dj=1 j(d−j)
1
= O( log(d)
P
d ). Consequently,

σ22 (c∗(k+1) )
∥(Ĝ1:k ⊗ In )(Ĝ1:k ⊗ In )T c∗(k+1) − c∗(k+1) ∥2F = ∥Ĝ1:k ĜT1:k c∗(k) − c∗(k) ∥2F ⩽ , (69)
4
where the inequality follows from the lower bound for the singular value c∗(k) given in Lemma 10
and the assumption (39) on sample size N . Therefore, by Lemma 18,
• Rank((Ĝ1:k ⊗ In )T c∗(k+1) ) = 2; (70)
∗ ∗T T ∗(k+1)
• Vk+1 Vk+1 is the projection matrix on to Row((Ĝ1:k ⊗ In ) c ); (71)
σ2 (c∗(k+1) )
• σ2 ((Ĝ1:k ⊗ In )T c∗(k+1) ) ⩾ . (72)
2
We start to apply Corollary 3 to Ĝk+1 and G̃k+1 , observe that
∥(Ĝ1:k ⊗ In )T c∗(k+1) − (Ĝ1:k ⊗ In )T ĉ(k+1) ∥F ⩽ ∥Ĝ1:k ⊗ In ∥2 ∥c∗(k+1) − ĉ(k+1) ∥F
r K 2K 
∗ d n
=∥ĉ − c ∥F ⩽ C2 ⩽ 8σ2 (c∗(k+1) ) ⩽ 4σ2 ((Ĝ1:k ⊗ In )T c∗(k) ), (73)
N

k+1
where the equality follows from the fact that Ĝ1:k ⊗ In ∈ On ×2n , the second inequality follows
from (56), the third inequality follows from (39), and the last inequality follows from (72). Since
(73) and (71) hold, by Corollary 3
∥G̃k+1 G̃Tk+1 − Ĝk+1 Ĝk+1 ∥2
∥(Ĝ1:k ⊗ In )T (c∗(k+1) − ĉ(k+1) )Vk+1
∗ ∥
2
⩽ (74)
σ2 ((Ĝ1:k ⊗ In )T c∗(k+1) )
∥(Ĝ1:k ⊗ In )T (c∗(k+1) − ĉ(k+1) )∥22
+ (75)
σ22 ((Ĝ1:k ⊗ In )T c∗(k+1) )2
We will bound (74) in Step 1D and (75) in Step 1E.

Step 1D. For (74), we have


∥(Ĝ1:k ⊗ In )T (c∗(k+1) − ĉ(k+1) )Vk+1

∥2
=∥(Ĝ1:k ⊗ In )(Ĝ1:k ⊗ In )T (c∗(k+1) − ĉ(k+1) )Vk+1

∥2
=∥(Ĝ1:k ĜT1:k ⊗ In )(c∗(k+1) − ĉ(k+1) )Vk+1

∥2
⩽∥({Uk∗ Uk∗T − Ĝ1:k ĜT1:k } ⊗ In )(c∗(k+1) − ĉ(k+1) )Vk+1

∥2 (76)
+∥(Uk∗ Uk∗T ⊗ In )(c ∗(k+1) (k+1)
− ĉ ∗
)Vk+1 ∥2 . (77)
Note that
(77) =∥(Uk∗ ⊗ In )(Uk∗ ⊗ In )T (c∗(k+1) − ĉ(k+1) )Vk+1

∥2
=∥(Uk∗ ⊗ In )T (c∗(k+1) − ĉ(k+1) )Vk+1

∥2
r √ r (78)
n log(dn)∥p∗ ∥∞ C3 n log(dn)∥p∗ ∥∞
⩽C1 ⩽ ,
N 8 N
where the last inequality follows from (55) and we choose C3 in (61) to be at least 64C12 . In addition,
(76) ⩽∥{Uk∗ Uk∗T − Ĝ1:k ĜT1:k } ⊗ In ∥2 ∥c∗(k+1) − ĉ(k+1) ∥2 ∥Vk+1

∥2
r
dK n2K

=∥Uk∗ Uk∗T − Ĝ1:k ĜT1:k ∥2 ∥c∗
− ĉ∥2 ⩽ − ∥Uk∗ Uk∗T Ĝ1:k ĜT1:k ∥2 C2
N
p r r K 2K  √ r
n∥p∗ ∥∞ d log(dn) d n C3 n log(dn)∥p∗ ∥∞
⩽C4 C 2 ⩽ , (79)
σ2 (c∗(k) ) N N 8 N
where the second inequality follows from (56), the third inequality in which C4 is a universal
constant follows from Lemma 5, and the last inequality holds for sufficiently large N in (39).
Therefore (76), (77), (78), and (79) together lead to
√ r
T ∗(k+1) (k+1) ∗ C3 n log(dn)∥p∗ ∥∞
∥(Ĝ1:k ⊗ In ) (c − ĉ )Vk+1 ∥2 ⩽ . (80)
4 N
So by (80) and (72),
√ p
C3 n log(dn)∥p∗ ∥∞
(74) ⩽ √ . (81)
2 N σ2 (c∗(k+1) )

Step 1E. Observe that
√ p
∥c∗(k+1) − ĉ(k+1) ∥22 4C22 dK n2K C3 n log(dn)∥p∗ ∥∞
(75) ⩽ 2 ⩽ ⩽ √ , (82)
σ2 ((Ĝ1:k ⊗ In )T c∗(k+1) )2 N σ22 (c∗(k+1) ) 2 N σ2 (c∗(k+1) )
where second inequality follows from (56) and (72), and the last inequality hold for sufficiently
large N in (39). Consequently, (74), (75), (81), and (82) together lead to
p
T
p n log(dn)∥p∗ ∥∞
∥G̃k+1 G̃k+1 − Ĝk+1 Ĝk+1 ∥2 ⩽ C3 √ . (83)
N σ2 (c∗(k+1) )

Step 1F. We complete the induction in this step. Note that


n log(dn)∥p∗ ∥∞ ∗ 2
(62) ⩽∥Ĝk+1 ĜTk+1 − G̃k+1 G̃Tk+1 ∥22 ∥c∗ ∥2F ⩽ C3 ∥c ∥F ,
N σ22 (c∗(k+1) )
where the first inequality follows from (67), and the last inequality follows from (83). (62) and (63)
together lead to
k+1
n log(dn)∥p∗ ∥∞ ∥c∗ ∥2F X 1
∥Ĝ1:k+1 ĜT1:k+1 c∗(k+1) − c∗(k+1) ∥2F ⩽ C3 2
.
N σ (c∗(j) )
j=1 2

This completes the induction. Note that by induction


d−1
n log(dn)∥p∗ ∥∞ ∥c∗ ∥2F X 1
(53) = ∥Ĝ1:d−1 ĜT1:d−1 c∗(d−1) − c∗(d−1) ∥2F ⩽ C3 2
. (84)
N σ (c∗(j) )
j=1 2

Step 2. We have

(54) =∥Ĝ1:d−1 ĜT1:d−1 (ĉ(d−1) − c∗(d−1) )∥2F


∗ ∗T
⩽2∥(Ĝ1:d−1 ĜT1:d−1 − Ud−1 Ud−1 )(ĉ(d−1) − c∗(d−1) )∥2F (85)
∗ ∗T
+2∥Ud−1 Ud−1 (ĉ(d−1) − c∗(d−1) )∥2F . (86)

Note that
∗ ∗T 2 (d−1)
(85) ⩽2∥Ĝ1:d−1 ĜT1:d−1 − Ud−1 Ud−1 ∥2 ∥ĉ − c∗(d−1) ∥2F
∗ ∗T 2 n∥p∗ ∥∞ d log(dn)
=2∥Ĝ1:d−1 ĜT1:d−1 − Ud−1 Ud−1 ∥2 ∥ĉ − c∗ ∥2F ⩽ C5 2 ∗(d−1) ∥ĉ − c∗ ∥2F
σ2 (c ) N
n∥p∗ ∥∞ d log(dn) 2 dK n2K n log(dn)∥p∗ ∥∞
 
⩽C5 2 ∗(d−1) C2 ⩽ C6 , (87)
σ2 (c ) N N N
where C5 , C6 are universal constants. The second inequality follows from Lemma 5 given (84), the
third inequality follows from (56), and the fourth inequality follows from (39). In addition,

∗ ∗T n log(dn)∥p∗ ∥∞
(86) = 2∥(Ud−1 Ud−1 ⊗ In )(ĉ(d) − c∗(d) )∥2F ⩽ 2C12 . (88)
N

where the inequality follows from (55).

Therefore
n log(dn)∥p∗ ∥∞ n log(dn)∥p∗ ∥∞ ∥c∗ ∥2F
(54) ⩽ C7 ⩽ 2C7 , (89)
N σ22 (c∗(d−1) )N
for some universal constant C7 , where the first inequality follows from (87) and (88), and the second
inequality follows from the fact that σ2 (c∗(d−1 ) ⩽ ∥c∗(d−1) ∥F = ∥c∗ ∥F . At this point, we combine
(84) and (89) and obtain the final estimation for the relative error:
    ∥c^∗ − Ĝ_{1:d}∥_F / ∥c^∗∥_F = O_P( √( (n log(dn) ∥p^∗∥_∞ / N) Σ_{j=1}^{d−1} 1/σ_2^2(c^{∗(j)}) ) ).

By further applying ∥p∗ ∥∞ ⩽ 1+B ∗ d from Assumption 1, we arrive at the results in Theorem 1.

C Some useful results for the proof of Theorem 1


C.1 TT-SVD
Lemma 2. Let Ĝ1:d be the output tensor cores from Algorithm 1 with target rank r. Then Ĝ1:k ∈
k
On ×r for k ∈ {1, · · · , d − 1},
Proof. We proceed by induction. Since Ĝ1 = SVD(ĉ(1) , r) ∈ Rn×r , it follows that Ĝ1 ∈ On×r .
k−1
Assume Ĝ1:k−1 ∈ On ×r , by Lemma 20 we have
k k
Ĝ1:k = Ĝ1:k−1 ⊗ In Ĝk ∈ Rn ×r ∈ On ×r .


Lemma 3. Let Ĝ1:d be the output tensor cores from Algorithm 1 with target rank r. For k ∈
{1, · · · , d − 1}, it holds that

∥Ĝ1:k ĜT1:k c∗(k) − c∗(k) ∥2F


=∥Ĝ1:k ĜT1:k c∗(k) − (Ĝ1:k−1 ⊗ In )(Ĝ1:k−1 ⊗ In )T c∗(k) ∥2F + ∥Ĝ1:k−1 ĜT1:k−1 c∗(k−1) − c∗(k−1) ∥2F .

Proof. Observe that

∥Ĝ1:k ĜT1:k c∗(k) − c∗(k) ∥2F


=∥Ĝ1:k ĜT1:k c∗(k) − (Ĝ1:k−1 ⊗ In )(Ĝ1:k−1 ⊗ In )T c∗(k) ∥2F
+∥(Ĝ1:k−1 ⊗ In )(Ĝ1:k−1 ⊗ In )T c∗(k) − c∗(k) ∥2F (90)
+⟨Ĝ1:k ĜT1:k c∗(k) , (Ĝ1:k−1 ⊗ In )(Ĝ1:k−1 ⊗ In )T c∗(k) − c∗(k) ⟩. (91)
T ∗(k) T ∗(k) ∗(k)
−⟨(Ĝ1:k−1 ⊗ In )(Ĝ1:k−1 ⊗ In ) c , (Ĝ1:k−1 ⊗ In )(Ĝ1:k−1 ⊗ In ) c −c ⟩. (92)

Step 1. Note that

(Ĝ1:k−1 ⊗ In )(Ĝ1:k−1 ⊗ In )T c∗(k) = {(Ĝ1:k−1 ĜT1:k−1 ) ⊗ In }c∗(k) = Ĝ1:k−1 ĜT1:k−1 c∗(k−1) .

Therefore (90) = ∥Ĝ1:k−1 ĜT1:k−1 c∗(k−1) − c∗(k) ∥2F .
k−1
Step 2. By Lemma 2, Ĝ1:k−1 ∈ On ×r . Then the kronecker product of two orthogonal ma-
k
trices is still orthogonal, Ĝ1:k−1 ⊗ In ∈ On ×rn . The details are in Lemma 19. It follows that

(Ĝ1:k−1 ⊗ In )(Ĝ1:k−1 ⊗ In )T c∗(k) − c∗(k) = (Ĝ1:k−1 ⊗ In )(Ĝ1:k−1 ⊗ In )T − Ink c∗(k)



T
=(Ĝ1:k−1 ⊗ In )⊥ (Ĝ1:k−1 ⊗ In )⊥ c∗(k) .

(93)

Therefore
T
(92) = ⟨(Ĝ1:k−1 ⊗ In )(Ĝ1:k−1 ⊗ In )T c∗(k) , (Ĝ1:k−1 ⊗ In )⊥ (Ĝ1:k−1 ⊗ In )⊥ c∗(k) ⟩ = 0.


Step 3. By (93), it follows that


T
(91) =⟨(Ĝ1:k−1 ⊗ In )Ĝk ĜT1:k c∗(k) , (Ĝ1:k−1 ⊗ In )⊥ (Ĝ1:k−1 ⊗ In )⊥ c∗(k) ⟩

T T
=⟨ (Ĝ1:k−1 ⊗ In )⊥ (Ĝ1:k−1 ⊗ In )Ĝk ĜT1:k c∗(k) , (Ĝ1:k−1 ⊗ In )⊥ c∗(k) ⟩ = 0.
 

This immediately leads to the desired result.

Lemma 4. Let Ĝ1:d be the output tensor cores from Algorithm 1 with target rank r. Then

∥c∗ − Ĝ1:d ∥2F = ∥c∗(d−1) − Ĝ1:d−1 ĜT1:d−1 c∗(d−1) ∥2F + ∥Ĝ1:d−1 ĜT1:d−1 c∗(d−1) − Ĝ1:d−1 ĜT1:d−1 ĉ(d−1) ∥2F .

Proof. Since Ĝ1:d−1 ĜT1:d−1 ĉ(d−1) is a reshape matrix of Ĝ1:d , we have

∥c∗ − Ĝ1:d ∥2F = ∥c∗(d−1) − Ĝ1:d−1 ĜT1:d−1 ĉ(d−1) ∥2F .

It suffices to show that

⟨Ĝ1:d−1 ĜT1:d−1 c∗(d−1) − c∗(d−1) , Ĝ1:d−1 ĜT1:d−1 ĉ(d−1) − Ĝ1:d−1 ĜT1:d−1 c∗(d−1) ⟩ = 0. (94)
d−1 ×r
By Lemma 2, Ĝ1:d−1 ∈ On . So

Ĝ1:d−1 ĜT1:d−1 c∗(d−1) − c∗(d−1) = (Ĝ1:d−1 )⊥ (ĜT1:d−1 )⊥ c∗(d−1) .

Therefore (Ĝ1:d−1 )T⊥ Ĝ1:d−1 = 0 and so

(94) =⟨(Ĝ1:d−1 )⊥ (Ĝ1:d−1 )T⊥ c∗(d−1) , Ĝ1:d−1 ĜT1:d−1 (ĉ(d−1) − c∗(d−1) )⟩


=⟨(Ĝ1:d−1 )T⊥ c∗(d−1) , (Ĝ1:d−1 )T⊥ Ĝ1:d−1 ĜT1:d−1 (ĉ(d−1) − c∗(d−1) )⟩ = 0.

Lemma 5. Suppose all the conditions in Theorem 1 holds. Let Ĝ1:d be the output tensor cores
from Algorithm 1 with target rank r = 2. Suppose that
k
n log(dn)∥p∗ ∥∞ ∥c∗ ∥2F X 1
∥Ĝ1:k ĜT1:k c∗(k) − c∗(k) ∥2F ⩽C 2
(95)
N σ (c∗(j) )
j=1 2

where C is a universal constant. Let Uk∗ be defined as in Lemma 8. Note that by Lemma 8,
k k
Uk∗ Uk∗T ∈ Rn ×n is the projection matrix onto the space Col(c∗(k) ). Then
p r
T ∗ ∗T ′ n∥p∗ ∥∞ d log(dn)
∥Ĝ1:k Ĝ1:k − Uk Uk ∥2 ⩽ C , (96)
σ2 (c∗(k) ) N

where C ′ a universal constant.


c2σ k(d−k)
Proof. Note that by Lemma 10, for k ∈ {1, · · · , d − 1}, σ22 (c∗(k) ) ⩾ d . In addition, by
Corollary 2, ∥c∗ ∥2F ⩽ Cc d where Cc = B ∗ + 2. Therefore by (95)
k
n log(dn)∥p∗ ∥∞ ∥c∗ ∥2F d X 1 ∗
′′ Cc n∥p ∥∞ d log(dn)
∥Ĝ1:k ĜT1:k c∗(k) − c∗(k) ∥2F ⩽ C ⩽ C (97)
N c2σ j(d − j) c2σ N
j=1

Pd log(d)
where we have used the fact that 1
j=1 j(d−j) = O( d ) and C ′′ is some universal constant.
By Lemma 18, Rank(Ĝ1:k ĜT1:k c∗(k) ) = 2, and Ĝ1:k ĜT1:k is the projection matrix on Col(Ĝ1:k ĜT1:k c∗(k) ).
Consequently,

2∥Ĝ1:k ĜT1:k c∗(k) − c∗(k) ∥2 2∥Ĝ1:k ĜT1:k c∗(k) − c∗(k) ∥2


∥Ĝ1:k ĜT1:k − Uk∗ Uk∗T ∥2 ⩽ 1 ⩽ ∗(k) )
σ2 (c∗(k) ) − ∥Ĝ1:k ĜT1:k c∗(k) − c∗(k) ∥2 2 σ2 (c
p r
C ′′ Cc n∥p∗ ∥∞ d log(dn)
⩽4 ,
cσ σ2 (c∗(k) ) N
(98)

where the first inequality follows from Theorem 3 and Rank(c∗(k) ) = 2, the second inequality follows
from the assumption on sufficiently large sample size N as stated in (39), and the last inequality
follows from (97). Furthermore, by the expressions of C_c in Corollary 2 and c_σ in Lemma 10, the
ratio C_c/c_σ can be bounded by a universal constant, leading to the final estimate (96).

C.2 Ground truth density function


Lemma 6. Under Assumption 1, let
n
X n
X
p̃(x1 , · · · , xd ) = ··· c∗ (l1 , · · · , ld )ϕl1 (x1 ) · · · ϕld (xd )
l1 =1 ld =1

d
where c∗ ∈ Rn is the coefficient tensor formulated as (46). There exists an absolute constant C
such that
d
X
∥p∗ − p̃∥L2 ([0,1]d ) ⩽ Cn−β0 ∥qj∗ ∥H β0 ([0,1]) .
j=1

Proof. Direct calculations show that


n
X n
X
p̃(x1 , · · · , xd ) = 1 + c∗1 (l1 )ϕl1 (x1 ) + · · · + c∗d (ld )ϕld (xd ).
l1 =1 ld =1

where c∗j ∈ Rn is defined in (49). Therefore
d
X n
X

∥p − p̃∥L2 ([0,1]d ) ⩽ c∗j (lj )ϕlj − qj∗ .
j=1 lj =1 L2 ([0,1])

Denote by S_n the class of polynomials up to degree n on [0, 1], and by P_n the projection operator from
L^2([0, 1]) onto S_n with respect to the L^2 norm. Since Legendre polynomials are orthogonal, by (49) and (48),
n
X
c∗j (lj )ϕlj = Pn (qj∗ ). (99)
lj =1

By the Legendre polynomial approximation theory shown in the appendix of Khoo et al. (2024), it
follows that
Xn
c∗j (lj )ϕlj − qj∗ ⩽ C∥qj∗ ∥H β0 ([0,1]) n−β0 (100)
lj =1 L2 ([0,1])

for some absolute constant C which does not depend on c∗j or qj∗ . The desired result follows
immediately.

Corollary 2. Under Assumption 1, for sufficiently large n, there exists a constant C such that
∥c∗ ∥2F ⩽ Cc d, where Cc = B ∗ + 2
d
where c∗ ∈ Rn is the coefficient tensor formulated as (46).
Proof. Observe that
∥c∗ ∥2F =∥1d + c∗1 ⊗ 1d−1 + 1 ⊗ c∗2 ⊗ 1d−2 + · · · + 1d−1 ⊗ c∗d ∥2F
d
X
=∥1d ∥2F + ∥c∗1 ⊗ 1d−1 ∥2F + · · · + ∥1 d−1
⊗ c∗d ∥2F =1+ ∥c∗j ∥2 .
j=1

where the first equality follows from (46) , and the second equality follows from orthogonality. In
addition, note that
n
X
∥c∗j ∥ = c∗j (lj )ϕlj L2 ([0,1])
⩽ ∥qj∗ ∥L2 ([0,1]) + C1 ∥qj∗ ∥H β0 ([0,1]) n−β0 ⩽ B ∗ + 1,
lj =1

where the first inequality follows from (100), and the last inequality follows from Assumption 1 and
n being sufficiently large. The desired result follows immediately.

C.3 Row and column spaces of the coefficient tensor


k d−k
Lemma 7. Let k ∈ {1, · · · , d − 1}. Let c∗(k) ∈ Rn × Rn be the k-th unfolding of the tensor
coefficient c as in (46) for k = 1, · · · , d − 1. The row and the column space of c∗(k) are given by

Col(c∗(k) ) = Span{1k , c∗1 ⊗ 1k−1 + 1 ⊗ c∗2 ⊗ 1k−2 + · · · + 1k−1 ⊗ c∗k }, (101)


∗(k)
Row(c ) = Span{1 d−k
, c∗k+1 ⊗1 d−k−1
+1⊗ c∗k+2 ⊗1 d−k−2
+ ··· + 1 d−k−1
⊗ c∗d }. (102)

Consequently, Rank(c∗(k) ) ⩽ 2. Furthermore, if qj∗ ̸= 0 for all j ∈ {1, · · · , d}, Rank(c∗(k) ) = 2.

Proof. It is direct to show that

c∗(k) =(1k ) ⊗ {1d−k } + (c∗1 ⊗ 1k−1 ) ⊗ {1d−k } + · · · + (1k−1 ⊗ c∗k ) ⊗ {1d−k }


(103)
+(1k ) ⊗ {c∗k+1 ⊗ 1d−k−1 } + · · · + (1k ) ⊗ {1d−k−1 ⊗ c∗d },
k
where terms with ( ) are viewed as column vectors in Rn , and terms with { } are viewed as row
d−k
vectors in Rn . To show (101) holds for the column space, it suffices to rewrite

c∗(k) =(1k ) ⊗ 1d−k + c∗k+1 ⊗ 1d−k−1 + 1 ⊗ c∗k+2 ⊗ 1d−k−2 + · · · + 1d−k−1 ⊗ c∗d




+ c∗1 ⊗ 1k−1 + 1 ⊗ c∗2 ⊗ 1k−2 + · · · + 1k−1 ⊗ c∗k ⊗ {1d−k }.




Therefore (101) holds. The justification of (102) is similar and omitted for brevity.

Lemma 8. Let c∗ be coefficient tensor as in (46) and c∗j be as in (49). Denote e1 = [1, 0]T ,
k ×2 d−k ×2
e2 = [0, 1]T and let the matrices Uk∗ ∈ Rn and Vk∗ ∈ Rn be such that

c∗1 ⊗ 1k−1 + 1 ⊗ c∗2 ⊗ 1k−2 + · · · + 1k−1 ⊗ c∗k


Uk∗ =1k ⊗ e1 + p ∗ ⊗ e2 , (104)
∥c1 ∥2 + · · · + ∥c∗k ∥2
c∗k+1 ⊗ 1d−k−1 + 1 ⊗ c∗k+2 ⊗ 1d−k−2 + · · · + 1d−k−1 ⊗ c∗d
Vk∗ =1d−k ⊗ e1 + q ⊗ e2 . (105)
∥c∗k+1 ∥2 + · · · + ∥c∗d ∥2

k k d−k ×nd−k
Then Uk∗ Uk∗T ∈ Rn ×n is the projection matrix onto the space Col(c∗(k) ), and Vk∗ Vk∗T ∈ Rn
is the projection matrix onto the space Row(c∗(k) ).

Proof. Note that the set of vectors {1k , c∗1 ⊗ 1k−1 , 1 ⊗ c∗2 ⊗ 1k−2 , 1k−1 ⊗ c∗k } are orthogonal. Direct
calculations show that
Uk∗T Uk∗ = e1 ⊗ e1 + e2 ⊗ e2 = I2 ∈ R2×2 .
k k k
Therefore Uk∗ ∈ On ×2 . Since Col(Uk∗ ) = Col(c∗(k) ), Uk∗ Uk∗T ∈ Rn ×n is the projection matrix onto
the space Col(c∗(k) ). The justification of Vk∗ Vk∗T is similar and omitted for brevity.

For simplicity, in (104) and (105) we introduce the notations

c∗1 ⊗ 1k−1 + 1 ⊗ c∗2 ⊗ 1k−2 + · · · + 1k−1 ⊗ c∗k k


u∗k = p ∗ ∗
∈ Rn , (106)
2
∥c1 ∥ + · · · + ∥ck ∥2

c∗k+1 ⊗ 1d−k−1 + 1 ⊗ c∗k+2 ⊗ 1d−k−2 + · · · + 1d−k−1 ⊗ c∗d d−k


vk∗ = q ∈ Rn . (107)
∥c∗k+1 ∥2 + · · · + ∥c∗d ∥2

Lemma 9. Let u∗k and vk∗ be as in (106) and (107) respectively. Denote
Pn ∗ (l )ϕ (x ) + · · · +
Pn ∗
∗ l1 =1 c1 1 l 1 1 lk =1 ck (lk )ϕlk (xk )
Φk (x1 , · · · , xk ) = p ∗ , (108)
∥c1 ∥2 + · · · + ∥c∗k ∥2
Pn ∗
Pn ∗
∗ lk+1 =1 ck+1 (lk+1 )ϕlk+1 (xk+1 ) + · · · + ld =1 cd (ld )ϕld (xd )
Ψk (xk+1 , · · · , xd ) = q . (109)
∥c∗k+1 ∥2 + · · · + ∥c∗d ∥2

Then u∗k is the coefficient tensor of Φ∗k , and vk∗ is the coefficient tensor of Ψ∗k . Then

∥Φ∗k ∥L2 ([0,1]k ) = ∥u∗k ∥F = 1 and ∥Ψ∗k ∥L2 ([0,1]d−k ) = ∥vk∗ ∥F = 1.

In addition, suppose Assumption 1 holds. Then for sufficient large n such that
√ √ 4B ∗
∥Φ∗k ∥∞ ⩽ C ′ k and ∥Ψ∗k ∥∞ ⩽ C ′ d − k where C′ = .
b∗
Proof. To see that u∗k is the coefficient tensor of Φ∗k (x1 , · · · , xk ), it suffices to observe that the
(l1 , · · · , lk )-coordinate of u∗k is ⟨Φ∗k , ϕ1l1 · · · ϕklk ⟩. Since the coefficient tensor is distance preserving, it
holds that
∥Φ∗k ∥L2 ([0,1]k ) = ∥u∗k ∥F = 1.

Next, we bound ∥Φ∗k ∥∞ . By (99) and Theorem 5, it follows that there exists a constant C1 such
that
n
X
c∗j (l1 )ϕlj ∞ = ∥Pn (qj∗ )∥∞ ⩽ ∥qj∗ ∥∞ + C1 n−β0 .
lj =1

Therefore, the numerator of (108) is bounded by


n
X n
X k
X
c∗1 (l1 )ϕl1 + · · · + c∗k (lk )ϕlk ⩽ ∥qj∗ ∥∞ + C1 kn−β0 ⩽ 2kB ∗ (110)
l1 =1 lk =1 ∞ j=1

upon using Assumption 1 and sufficiently large n. Observe that


n
X
∥c∗j ∥ = c∗j (lj )ϕlj L2 ([0,1])
= ∥Pn (qj∗ )∥L2 ([0,1]) ⩾ ∥qj∗ ∥L2 ([0,1]) − C2 n−β0 ⩾ b∗ − C2 n−β0 ⩾ b∗ /2
lj =1

from some constant C2 , where the second equality follows from (99), the first inequality follows from
Corollary 4, the second inequality follows from the Assumption 1 that minj∈{1,··· ,d} ∥qj∗ ∥L2 ([0,1]) ⩾ b∗ ,
and the last inequality holds for sufficiently large n. Therefore, the denominator
q √
∥c∗1 ∥2 + · · · + ∥c∗k ∥2 ⩾ b∗ k/2 (111)
∗ √
Note that (110) and (111) immediately imply ∥Φ∗k ∥∞ ⩽ 4 Bb∗ k. The analysis for ∥Ψ∗k ∥∞ is the
same and therefore omitted.

C.4 Singular Values of the reshaped coefficient tensor


Lemma 10. Let c∗ be defined as in (46). Denote
k
X d
X
a∗k = ∥c∗i ∥2 and b∗k = ∥c∗i ∥2 .
i=1 i=k+1

Then the k-th unfolding of coefficient tensor c∗ has the smallest singular value formulated as

1 + a∗k + b∗k − (1 + a∗k + b∗k )2 − 4a∗k b∗k


p
2 ∗(k)
σ2 (c ) =
2

for k = 1, · · · , d. Suppose in addition that Assumption 1 holds, for sufficient large n, σ2 (c∗(k) ) is
lower bounded by r s
∗(k) k(d − k) 16B ∗4
σ2 (c ) ⩾ cσ where cσ = ∗2 (112)
d 1 + b4

Proof. Let k ∈ {1, · · · , d}. Denote

fk∗ = c∗1 ⊗ 1k−1 + 1 ⊗ c∗2 ⊗ 1k−2 + · · · + 1k−1 ⊗ c∗k , and gk∗ = c∗k+1 ⊗ 1d−k−1 + · · · + 1d−k ⊗ c∗d .

Then one can easily check that a∗k = ∥f ∗ ∥2F and b∗k = ∥g ∗ ∥2F due to orthogonality, and we can write

c∗ = (1k ) ⊗ 1d−k + gk∗ + fk∗ ⊗ {1d−k }.


 

k
where terms with ( ) are viewed as column vectors in Rn , and terms with { } are viewed as row
d−k
vectors in Rn . Therefore
 1d−k + gk∗
 
∗(k) ∗
 k
c = 1 fk .
1d−k
Observe that
 1 + ∥gk∗ ∥2F  1 + ∥gk∗ ∥2F ∥fk∗ ∥F
  k
1k
   
∗(k) ∗(k) T 1 1
1k fk∗ ∗ ∗
  k
c (c ) = = 1 fk /∥fk ∥F
1 1 fk∗ ∥fk∗ ∥F ∥fk∗ ∥2F fk∗ /∥fk∗ ∥F
k
Note that 1k fk∗ /∥fk∗ ∥F is an orthogonal matrix in On ×2 . Therefore the eigenvalues of c∗(k) (c∗(k) )T
 

1 + ∥gk∗ ∥2F ∥fk∗ ∥F


 
coincide with the eigenvalues of ∈ R2×2 . Direct calculations show that
∥fk∗ ∥F ∥fk∗ ∥2F

1 + a∗k + b∗k − (1 + a∗k + b∗k )2 − 4a∗k b∗k


p
σ22 (c∗(k) ) = λ2 (c∗(k) ∗(k) T
(c ) )= .
2
For (112), note that by (100), there exists a constant C1
n
X
c∗j (lj )ϕlj − qj∗ ⩽ C1 n−β0 (113)
lj =1 L2 ([0,1])

Pn ∗
Since lj =1 cj (lj )ϕlj L2 ([0,1]) = ∥c∗j ∥, it follows that

∥qj∗ ∥L2 ([0,1]) − C1 n−β0 ⩽ ∥c∗j ∥ ⩽ ∥qj∗ ∥L2 ([0,1]) + C1 n−β0

It follows from Assumption 1 and (100) that

b∗ ⩽ ∥qj∗ ∥L2 ([0,1]) ⩽ B ∗ .


b∗
So for sufficiently large n, it follows that 2 ⩽ ∥c∗j ∥ ⩽ 2B ∗ . Consequently,

b∗2 k b∗2 (d − k)
⩽ a∗k ⩽ 4B ∗2 k and ⩽ b∗k ⩽ 4B ∗2 (d − k).
4 4

Therefore
1 + a∗k + b∗k − (1 + a∗k + b∗k )2 − 4a∗k b∗k
p
σ22 (c∗(k) ) =
p 2
(1 + a∗k + b∗k − (1 + a∗k + b∗k )2 − 4a∗k b∗k )(1 + a∗k + b∗k + (1 + a∗k + b∗k )2 − 4a∗k b∗k )
p
=
2(1 + a∗k + b∗k + (1 + a∗k + b∗k )2 − 4a∗k b∗k )
p

4a∗k b∗k 16B ∗4 k(d − k) 16B ∗4 k(d − k)


⩾ ⩾ ⩾
4(1 + a∗k + b∗k ) ∗2
1 + b4 d
∗2
(1 + b4 )d

C.5 Variance of cluster basis estimators


Lemma 11. Suppose Assumption 1 holds. Let c∗ be the coefficient tensor of p∗ , and ĉ be the
s-cluster coefficient tensor estimator defined in Definition 1 with s ⩾ 2. Let Uk∗ and Vk∗ be defined
as in Lemma 8. Then,
r !
n log(dn)∥p∗∥
∗T ∞
max ∥(Uk−1 ⊗ In )(c∗(k) − ĉ(k) [s])Vk∗ ∥F = OP . (114)
k∈{1,··· ,d} N

Proof. We suppose s ⩾ 3 in the proof; the case s = 2 follows from a simplified version of the same
calculations. Given any positive number t, we have
 
∗T ∗(k) (k) ∗ 2 2
P max ∥(Uk−1 ⊗ In )(c − ĉ [s])Vk ∥F ⩾ 4nt
k∈{1,··· ,d}
 
∗T
= P ∥(Uk−1 ⊗ In )(c∗(k) − ĉ(k) [s])Vk∗ ∥2F ⩾ 4nt2 for least one k ∈ {1, · · · , d}
d
X  
∗T
⩽ P ∥(Uk−1 ⊗ In )(c∗(k) − ĉ(k) [s])Vk∗ ∥2F ⩾ 4nt2 (by union bound)
k=1
Xd (115)
2

= P (118) ⩾ 4nt
k=1
d X
X n    
⩽ P |⟨p∗ − p̂, Φ∗k−1 ϕkl Ψ∗k ⟩| ⩾ t + P |⟨p∗ − p̂, ϕkl ⟩| ⩾ t
k=1 l=1
   
+ P |⟨p∗ − p̂, Φ∗k−1 ϕkl ⟩| ⩾ t + P |⟨p∗ − p̂, ϕkl Ψ∗k ⟩| ⩾ t .

We now bound the four terms in the double sums above by Lemma 16, which is based on Bernstein
inequality.
We first bound  
∗ ∗ k ∗
P |⟨p − p̂, Φk−1 ϕl Ψk ⟩| ⩾ t .

Note that by Lemma 9, it holds that

∥Φ∗k−1 ϕkl Ψ∗k ∥L2 ([0,1]d ) = ∥Φ∗k−1 ∥L2 ([0,1]k−1 ) ∥ϕl ∥L2 ([0,1]) ∥Ψ∗k+1 ∥L2 ([0,1]d−k ) = 1,

and that
∥Φ∗k−1 ϕkl Ψ∗k ∥∞ = ∥Φ∗k−1 ∥∞ ∥ϕl ∥∞ ∥Ψ∗k+1 ∥∞ ⩽ C ′2
p
n(k − 1)(d − k).
4B ∗
where C ′ = b∗ .
Therefore, by Lemma 16,

−N t2
   
∗ ∗ k ∗
P |⟨p − p̂, Φk−1 ϕl Ψk ⟩| ⩾ t ⩽ 2 exp
+ ∥Φ∗k−1 ϕkl Ψ∗k ∥∞ t
∥p∗ ∥∞ ∥Φ∗k−1 ϕkl Ψ∗k ∥2L d
2 ([0,1] )
(116)
−N t2
 
ϵ
⩽ 2 exp p =: .

∥p ∥∞ + C ′2 n(k − 1)(d − k)t 4dn
ϵ
where we assume the right-hand side is 4dn for some sufficiently small constant ϵ. Solving the last
line above yields the quadratic equation
8dn 8dn
N t2 − C ′2 n(k − 1)(d − k) log( )∥p∗ ∥∞ = 0,
p
)t − log(
ϵ ϵ
whose larger root can be bounded by
q
C ′2 n(k − 1)(d − k) log( 8dn C ′4 n(k − 1)(d − k)[log( 8dn 8dn
p
2 ∗
ϵ ) + ϵ )] + 4N log( ϵ )∥p ∥∞
t+ =
r 2N
log(dn)∥p∗ ∥∞
⩽ Cϵ =: tϵ
N
upon choosing sufficiently large N . Here Cϵ ∼ log(1/ϵ) is some sufficiently large positive constant
which only depends on ϵ. At this point, we have
  ϵ
P |⟨p∗ − p̂, Φ∗k−1 ϕkl Ψ∗k ⟩| ⩾ tϵ ⩽ .
4dn
For the other three terms in the double sums, by Lemma 9 we have

∥ϕkl ∥∞ ⩽ n, ∥Φ∗k−1 ϕkl ∥∞ ⩽ n(k − 1), ∥ϕkl Ψ∗k ∥∞ ⩽ n(d − k).
p p

Following the same calculations, we can show that with high probability
  ϵ
P ⟨p∗ − p̂, ϕkl ⟩ ⩾ tϵ ⩽
4dn

∗ ∗ k
 ϵ
P ⟨p − p̂, Φk−1 ϕl ⟩ ⩾ tϵ ⩽ ,
4dn
  ϵ
P ⟨p∗ − p̂, ϕkl Ψ∗k ⟩ ⩾ tϵ ⩽
4dn
with appropriate choice of N . Inserting the estimates above into (115), we conclude that
 
∗T ∗(k) (k) ∗ 2 2
P max ∥(Uk−1 ⊗ In )(c − ĉ [s])Vk ∥F ⩾ 4ntϵ ⩽ ϵ
k∈{1,··· ,d}

for any given positive ϵ. This indicates that, with high probability,
r
∗T ∗(k) √ n log(dn)∥p∗ ∥∞
max ∥(Uk−1 ⊗ In )(c (k)
− ĉ [s])Vk∗ ∥F ⩽ 2 ntϵ = 2Cϵ
k∈{1,··· ,d} N

Lemma 12. Suppose Assumption 1 holds. Let c∗ be the coefficient tensor of p∗ , and ĉ be the
s-cluster coefficient tensor estimator defined in Definition 1 with s ⩾ 2. Then
r s 2s 
∗ d n
∥c − ĉ∥F = OP . (117)
N
Proof. It suffices to show that  s 2s 
∗ d n
ĉ∥2F

E ∥c − =O .
N
To this end, let Bj,d be the j-cluster basis functions defined in (5). Then by
Xs  X X 
∗ 2 ∗ 2 ∗ 2
E(∥c − ĉ∥F ) = E(⟨p̂ − p , φ0 ⟩ ) + · · · + E(⟨p̂ − p , φs ⟩ )
j=0 φ0 ∈B1,d φs ∈Bs,d
Qs
Let gs (x) = k=1 ϕijk (xjk ) be a generic element in Bs,d . Then since {ϕl }∞
l=1 are Legendre polyno-
√ s/2
mials, it follows that with ∥ϕl ∥∞ ⩽ l and therefore ∥gs ∥∞ ⩽ n . Consequently,
N
Var(gs (x(1) ))
 
∗ 2 1 X (i)
E(⟨p̂ − p , gs ⟩ ) = Var gs (x ) =
N N
i=1
g (x)p∗ (x)dx ∥gs ∥2∞ p∗ (x)dx
R 2
ns
R
⩽ s ⩽ ⩽ ,
N N N
where the last equality follows from the fact that p∗ is a density and so p∗ (x)dx = 1. Since the
R
s 2s
cardinality of Bs,d is at most ns ds , it follows that φs ∈Bs,d E(⟨p̂ − p∗ , φs ⟩2 ) ⩽ d Nn . Since s is a
P
bounded integers, it follows that
s
dj n2j
 s 2s 
∗ 2
X d n
E(∥c − ĉ∥F ) ⩽ =O .
N N
j=1

Lemma 13. Let c∗ be the coefficient tensor of p∗ , and ĉ[s] be the s-cluster coefficient tensor
estimator defined in Definition 1 with s ⩾ 2. Let Uk∗ and Vk∗ be defined as in (104) and (105)
respectively. Φ∗k and Ψ∗k are defined as in (108) and (109) respectively. The following equalities
hold:
• If s ⩾ 3,
∗T
∥(Uk−1 ⊗ In )(c∗(k) − ĉ[s](k) )Vk∗ ∥2F
n 
(118)
X 
∗ k 2 ∗ ∗ k 2 ∗ k ∗ 2 ∗ ∗ k ∗ 2
= ⟨p − p̂, ϕl ⟩ + ⟨p − p̂, Φk−1 ϕl ⟩ + ⟨p − p̂, ϕl Ψk ⟩ + ⟨p − p̂, Φk−1 ϕl Ψk ⟩ ;
l=1

• If s = 2,
n 
X 
∗T ∗(k)
∥(Uk−1 ⊗ In )(c − ĉ[s] (k)
)Vk∗ ∥2F = ∗ k 2 ∗ ∗ k 2 ∗ k ∗ 2
⟨p − p̂, ϕl ⟩ + ⟨p − p̂, Φk−1 ϕl ⟩ + ⟨p − p̂, ϕl Ψk ⟩ .
l=1
(119)

Proof. For l ∈ {1, · · · , n}, denote χl ∈ Rn be the l-th canonical basis. That is, for k ∈ {1, · · · , n}
(
0, if l ̸= k;
χl (k) =
1, if l = k.

Denote T the reshaped R2×n×2 tensor of the matrix (Uk−1 ∗T ⊗ I )(c∗(k) − ĉ[s](k) )V ∗ . It follows from
n k
(104), (105), (106) and (107) that, for any l ∈ {1, · · · , n} the (a, l, b)-coordinate of T satisfies



⟨c∗ − ĉ[s], 1k−1 ⊗ χl ⊗ 1d−k ⟩, if a = 1, b = 1;
⟨c∗ − ĉ[s], u∗ ⊗ χ ⊗ 1d−k ⟩, if a = 2, b = 1;

k−1 l
Ta,l,b = ∗


⟨c − ĉ[s], 1 k−1 ⊗ χl ⊗ vk∗ ⟩, if a = 1, b = 2;
⟨c∗ − ĉ[s], u∗ ⊗ χ ⊗ v ∗ ⟩,

if a = 2, b = 2.
k−1 l k

By (108) and (109),

• 1k−1 ⊗ χl ⊗ 1d−k is the coefficient tensor of ϕl (xk ) ∈ Span{B1,d };

• u∗k−1 ⊗ χl ⊗ 1d−k is the coefficient tensor of Φ∗k−1 (x1 , · · · , xk−1 )ϕl (xk ) ∈ Span{B2,d };

• 1k−1 ⊗ χl ⊗ vk∗ is the coefficient tensor of ϕl (xk )Ψ∗k (xk+1 , · · · , xd ) ∈ Span{B2,d };

• u∗k−1 ⊗χl ⊗vk∗ is the coefficient tensor of Φ∗k−1 (x1 , · · · , xk−1 )ϕl (xk )Ψ∗k (xk+1 , · · · , xd ) ∈ Span{B3,d }.

Step 1. Suppose s ⩾ 3. By Lemma 14,

• ⟨c∗ − ĉ[s], 1k−1 ⊗ χl ⊗ 1d−k ⟩ = ⟨p∗ − p̂, ϕkl ⟩;

• ⟨c∗ − ĉ[s], u∗k−1 ⊗ χl ⊗ 1d−k ⟩ = ⟨p∗ − p̂, Φ∗k−1 ϕkl ⟩;

• ⟨c∗ − ĉ[s], 1k−1 ⊗ χl ⊗ vk∗ ⟩ = ⟨p∗ − p̂, ϕkl Ψ∗k ⟩;

• ⟨c∗ − ĉ[s], u∗k−1 ⊗ χl ⊗ vk∗ ⟩ = ⟨p∗ − p̂, Φ∗k−1 ϕkl Ψ∗k ⟩.

This immediately leads to (118).

Step 2. Suppose s = 2. Note that Φ∗k−1 ϕkl Ψ∗k ∈ Span{B3,d }. By Lemma 14, ⟨p∗ − p̂, Φ∗k−1 ϕkl Ψ∗k ⟩ = 0.
The rest of the calculations are the same as Step 1.

Lemma 14. Suppose p∗ satisfies (38), and ĉ[s] is the s-cluster coefficient tensor estimator of p∗
in Definition 1, where 2 ⩽ s ⩽ d. Suppose t ∈ {1, · · · , d} and F : Rd → R is any function such that
d
F ∈ Span{Bt,d }. Let f ∈ Rn be the coefficient tensor of F as

f (l1 , · · · , ld ) := ⟨F, ϕ1l1 · · · ϕdld ⟩ for 1 ⩽ l1 , · · · , ld ⩽ n.

Then (
0, if t > s;
⟨c∗ − ĉ[s], f ⟩ =
⟨p∗ − p̂, F ⟩, if t ⩽ s.

Proof. Observe that by orthogonality of {Bj,d }dj=1
n
X
⟨c∗ − ĉ[s], f ⟩ = (c∗ − ĉ[s])(l1 , · · · , ld ) · f (l1 , · · · , ld )
l1 ,··· ,ld =1
X n
= ⟨p∗ − p̂s , ϕ1l1 · · · ϕdld ⟩⟨F, ϕ1l1 · · · ϕdld ⟩
l1 ,··· ,ld =1
X X
= ⟨p∗ − p̂s , φ0 ⟩⟨F, φ0 ⟩ + · · · + ⟨p∗ − p̂s , φd ⟩⟨F, φd ⟩
φ0 ∈B0,d φd ∈Bd,d
X X
= ⟨p∗ − p̂s , φ0 ⟩⟨F, φ0 ⟩ + · · · + ⟨p∗ − p̂s , φs ⟩⟨F, φs ⟩. (120)
φ0 ∈B0,d φs ∈Bs,d

where in the second equality, p̂s is defined in (50), the last equality follows the observation that
p∗ ∈ Span{Bj,d }1j=0 and p̂s ∈ Span{Bj,d }sj=0 with s ⩾ 2. Therefore,
• If t > s, then ⟨F, φs ⟩ = 0 when φs ∈ Bs,d . So in this case (120) = 0;
• If t ⩽ s, then (120) = ⟨p∗ − p̂, F ⟩ by Lemma 15.

Lemma 15. Let $\Omega$ be a measurable set in arbitrary dimension. Let $\{\psi_k\}_{k=1}^{\infty}$ be an orthonormal basis of $L^2(\Omega)$, and let $M$ be an $m$-dimensional linear subspace of $L^2(\Omega)$ such that $M = \mathrm{Span}\{\psi_k\}_{k=1}^{m}$. Suppose $F, G \in M$, and denote $\langle F, G \rangle = \int_{\Omega} F(x) G(x) \, dx$. Then
$$\langle F, G \rangle = \sum_{k=1}^{m} \langle F, \psi_k \rangle \langle G, \psi_k \rangle.$$
Proof. It suffices to observe that
$$\langle F, G \rangle = \sum_{k=1}^{\infty} \langle F, \psi_k \rangle \langle G, \psi_k \rangle = \sum_{k=1}^{m} \langle F, \psi_k \rangle \langle G, \psi_k \rangle,$$
where the first equality follows from the assumption that $\{\psi_k\}_{k=1}^{\infty}$ is a complete orthonormal basis, and the second equality follows from the fact that $F \in M$, and so $\langle F, \psi_k \rangle = 0$ if $k > m$.
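As an illustration only (not part of the formal development), the identity in Lemma 15 can be verified numerically. The sketch below assumes Python with NumPy and uses the columns of a random orthogonal matrix as a finite stand-in for the basis $\{\psi_k\}$:

```python
# Minimal numerical sanity check of Lemma 15 (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
K, m = 20, 5

# Columns of Q form an orthonormal basis of R^K (stand-in for {psi_k}).
Q, _ = np.linalg.qr(rng.standard_normal((K, K)))

# F and G are arbitrary elements of M = Span{psi_1, ..., psi_m}.
F = Q[:, :m] @ rng.standard_normal(m)
G = Q[:, :m] @ rng.standard_normal(m)

lhs = F @ G                                                 # <F, G>
rhs = sum((F @ Q[:, k]) * (G @ Q[:, k]) for k in range(m))  # truncated Parseval sum
print(abs(lhs - rhs))                                       # ~ 1e-15
```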
Lemma 16. Suppose that $\{x^{(i)}\}_{i=1}^{N}$ are independently sampled from the density $p^* : \mathbb{R}^d \to \mathbb{R}_+$. Denote the empirical distribution $\hat{p}(x) = \frac{1}{N} \sum_{i=1}^{N} \delta_{x^{(i)}}(x)$. Let $\varphi : \mathbb{R}^d \to \mathbb{R}$ be any non-random function. Then
$$P\big( |\langle \hat{p} - p^*, \varphi \rangle| \geqslant t \big) \leqslant 2 \exp\left( \frac{-N t^2}{\|p^*\|_{\infty} \|\varphi\|_{L_2([0,1]^d)}^2 + \|\varphi\|_{\infty} t} \right).$$

Proof. It suffices to apply Bernstein's inequality to the random variable $\langle \hat{p} - p^*, \varphi \rangle$. Observe that $\langle \hat{p} - p^*, \varphi \rangle = \frac{1}{N} \sum_{i=1}^{N} \big( \varphi(x^{(i)}) - E[\varphi(x^{(i)})] \big)$. Since
$$E\Big( \big( \varphi(x^{(i)}) - E[\varphi(x^{(i)})] \big)^2 \Big) = \mathrm{Var}[\varphi(x^{(i)})] \leqslant E(\varphi^2(x^{(i)})) = \int \varphi^2(x) p^*(x) \, dx \leqslant \|\varphi\|_{L_2([0,1]^d)}^2 \|p^*\|_{\infty}$$
and $\|\varphi(x^{(i)}) - E[\varphi(x^{(i)})]\|_{\infty} \leqslant 2\|\varphi\|_{\infty}$, the desired result follows immediately from Bernstein's inequality.
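As a quick illustration (not part of the paper), the $1/\sqrt{N}$ concentration implied by Lemma 16 can be observed by Monte Carlo. The sketch assumes Python with NumPy, takes $p^*$ to be the uniform density on $[0,1]$ and $\varphi(x) = \cos(2\pi x)$, so that $\langle p^*, \varphi \rangle = 0$:

```python
# Monte Carlo illustration of the concentration in Lemma 16 (illustrative only).
import numpy as np

rng = np.random.default_rng(1)
phi = lambda x: np.cos(2 * np.pi * x)    # <p*, phi> = 0 when p* = Unif[0, 1]

for N in (10**3, 10**4, 10**5):
    deviations = [abs(phi(rng.uniform(size=N)).mean()) for _ in range(200)]
    # |<p_hat - p*, phi>| shrinks at the 1/sqrt(N) rate suggested by the tail bound.
    print(N, np.mean(deviations), 1 / np.sqrt(N))
```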

C.6 Orthonormal matrices
Lemma 17. Suppose M ∈ Rn×m and V ∈ On×a in the sense that V T V = Ia . Suppose in addition
that V T M = U ΣW T is the SVD of V T M , where U ∈ Oa×r , Σ ∈ Rr×r and W ∈ Om×r . Then

V V T M = (V U )ΣW T (121)

is the SVD of V V T M . In particular, Rank(V T M ) = Rank(V V T M ) and σj (V T M ) = σj (V V T M )


for all 1 ⩽ j ⩽ Rank(V T M ).

Proof. Note that


(V U )T V U = U T V T V U = U T Ia U = U T U = Ir .
Therefore V U ∈ On×r . Consequently, (121) holds by the uniqueness of the SVD. Since the singular
values of both V T M and V V T M are determined by Σ, the desired results follow.
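For the reader's convenience, a small numerical check of Lemma 17 (not part of the proofs) can be carried out in Python with NumPy:

```python
# Numerical check of Lemma 17 (illustrative only): V^T M and V V^T M share
# their nonzero singular values when V has orthonormal columns.
import numpy as np

rng = np.random.default_rng(2)
n, m, a = 8, 6, 4

M = rng.standard_normal((n, m))
V, _ = np.linalg.qr(rng.standard_normal((n, a)))         # V in O^{n x a}

s1 = np.linalg.svd(V.T @ M, compute_uv=False)            # singular values of V^T M
s2 = np.linalg.svd(V @ V.T @ M, compute_uv=False)        # singular values of V V^T M
print(np.max(np.abs(s1 - s2[:s1.size])))                 # ~ 1e-15; trailing values of s2 are ~ 0
```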

Lemma 18. Suppose M ∈ Rn×m such that Rank(M ) = r. Suppose in addition that U ∈ On×r and
that σr (M ) > 2∥U U T M − M ∥F . Then

• Rank(U U T M ) = Rank(U T M ) = r;

• Row(M ) = Row(U U T M );
• $\sigma_r(U^T M) = \sigma_r(U U^T M) \geqslant \frac{\sigma_r(M)}{2}$;

• U U T is the projection matrix onto Col(U U T M ). In other words, the column vectors of U form
a basis for Col(U U T M ).

Proof. Note that Row(U U T M ) ⊂ Row(M ). In addition, by Theorem 2,

σr (U U T M ) ⩾ σr (M ) − ∥U U T M − M ∥F > 0.

So Rank(U U T M ) ⩾ r. Therefore dim(Row(U U T M )) ⩾ r. Since Row(U U T M ) ⊂ Row(M ) and


dim(Row(M )) = Rank(M ) = r, it follows that

Row(U U T M ) = Row(M ) and Rank(U U T M ) = r.

By Lemma 17, Rank(U T M ) = Rank(U U T M ) = r. In addition,

$$\sigma_r(U^T M) = \sigma_r(U U^T M) \geqslant \sigma_r(M) - \|U U^T M - M\|_F \geqslant \frac{\sigma_r(M)}{2},$$
where the equality follows from Lemma 17, and the first inequality follows from Theorem 2.

Denote U = [u1 , · · · , ur ], where {uk }rk=1 are orthonormal. Then

Col(U U T M ) ⊂ Span{uk }rk=1 .

Since dim(Span{uk }rk=1 ) = dim(Col(U U T M )) = r, it follows that Span({uk }rk=1 ) = Col(U U T M ).


Consequently {uk }rk=1 form a basis for Col(U U T M ).
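The conclusions of Lemma 18 can also be checked numerically (illustrative only). The sketch assumes Python with NumPy and constructs a rank-$r$ matrix $M$ together with a $U$ obtained by slightly perturbing its top-$r$ left singular vectors:

```python
# Numerical check of Lemma 18 (illustrative only).
import numpy as np

rng = np.random.default_rng(3)
n, m, r = 10, 8, 3

U0, _ = np.linalg.qr(rng.standard_normal((n, r)))
V0, _ = np.linalg.qr(rng.standard_normal((m, r)))
M = U0 @ np.diag([3.0, 2.0, 1.0]) @ V0.T                 # Rank(M) = r, sigma_r(M) = 1

U, _ = np.linalg.qr(U0 + 0.01 * rng.standard_normal((n, r)))
err = np.linalg.norm(U @ U.T @ M - M)                    # ||U U^T M - M||_F
print("hypothesis sigma_r(M) > 2 err:", 1.0 > 2 * err)

print("Rank(U^T M)   =", np.linalg.matrix_rank(U.T @ M))        # r
print("Rank(U U^T M) =", np.linalg.matrix_rank(U @ U.T @ M))    # r
print("sigma_r(U^T M) >= sigma_r(M)/2:",
      np.linalg.svd(U.T @ M, compute_uv=False)[r - 1] >= 0.5)
```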

Lemma 19. Let m, n, r, s ∈ Z+ be positive integers. Suppose U ∈ Om×r and W ∈ On×s . Then
U ⊗ W ∈ Omn×rs .

Proof. It suffices to observe that

(U ⊗ W )T (U ⊗ W ) = (U T ⊗ W T )(U ⊗ W ) = U T U ⊗ W T W = Ir ⊗ Is = Irs .

Lemma 20. Let m, r, s ∈ Z+ be positive integers. Suppose U ∈ Om×r and V ∈ Orn×s . Then
(U ⊗ In )V ∈ Omn×s .

Proof. It suffices to observe that

[(U ⊗ In )V ]T (U ⊗ In )V = V T (U T ⊗ In )(U ⊗ In )V = V T (U T U ⊗ In )V = V T (Ir ⊗ In )V = V T Irn V = Is .

Lemma 21. Let l, n, r ∈ Z+ be positive integers. Let U ∈ On×r and U⊥ ∈ On×(n−r) be the
orthogonal complement matrix of U . Then

(U ⊗ Il )⊥ = U⊥ ⊗ Il .

Proof. It suffices to note that U⊥ ⊗ Il ∈ Onl×(n−r)l and that ⟨U ⊗ Il , U⊥ ⊗ Il ⟩ = 0.
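The Kronecker-product facts in Lemmas 19–21 admit a direct numerical check (illustrative only, assuming Python with NumPy):

```python
# Numerical checks of Lemmas 19-21 (illustrative only): Kronecker products
# preserve orthonormality of columns, and complements behave as stated.
import numpy as np

rng = np.random.default_rng(4)

def orth(n, r):
    """Random n x r matrix with orthonormal columns."""
    Q, _ = np.linalg.qr(rng.standard_normal((n, r)))
    return Q

U, W, V = orth(6, 3), orth(5, 2), orth(3 * 4, 2)         # V in O^{rn x s} with r = 3, n = 4

# Lemma 19: U kron W lies in O^{30 x 6}.
K1 = np.kron(U, W)
print(np.allclose(K1.T @ K1, np.eye(6)))

# Lemma 20: (U kron I_4) V lies in O^{24 x 2}.
K2 = np.kron(U, np.eye(4)) @ V
print(np.allclose(K2.T @ K2, np.eye(2)))

# Lemma 21: columns of U_perp kron I_2 are orthogonal to those of U kron I_2.
U_perp = np.linalg.svd(U, full_matrices=True)[0][:, 3:]  # orthogonal complement of Col(U)
print(np.allclose(np.kron(U, np.eye(2)).T @ np.kron(U_perp, np.eye(2)), 0.0))
```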

C.7 Matrix perturbation bound


Theorem 2. Let A,  and E be three matrices in Rn×m and that  = A + E. Then

|σr (A) − σr (Â)| ⩽ ∥E∥2 .

Proof. This is the well-known Weyl’s inequality for singular values.
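A one-line numerical illustration of Theorem 2 (not part of the paper), assuming Python with NumPy:

```python
# Check of Weyl's inequality for singular values (illustrative only).
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((7, 5))
E = 0.1 * rng.standard_normal((7, 5))

gap = np.abs(np.linalg.svd(A, compute_uv=False) - np.linalg.svd(A + E, compute_uv=False))
print(gap.max() <= np.linalg.norm(E, 2))                 # True for every singular value
```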

Theorem 3. Let A,  and E be three matrices in Rn×m and that  = A + E. Suppose the SVD
of A, Â satisfies
$$A = \begin{bmatrix} U & U_\perp \end{bmatrix} \begin{bmatrix} \Sigma_1 & 0 \\ 0 & \Sigma_2 \end{bmatrix} \begin{bmatrix} V^T \\ V_\perp^T \end{bmatrix}, \qquad \hat{A} = \begin{bmatrix} \hat{U} & \hat{U}_\perp \end{bmatrix} \begin{bmatrix} \hat{\Sigma}_1 & 0 \\ 0 & \hat{\Sigma}_2 \end{bmatrix} \begin{bmatrix} \hat{V}^T \\ \hat{V}_\perp^T \end{bmatrix}.$$

Here r ⩽ min{n, m}, and Σ1 , Σ̂1 ∈ Rr×r contain the leading r singular values of A and  respec-
tively. If σr (A) − σr+1 (A) − ∥E∥2 > 0, then

$$\|U U^T - \hat{U} \hat{U}^T\|_2 \leqslant \frac{2\|E\|_2}{\sigma_r(A) - \sigma_{r+1}(A) - \|E\|_2}.$$

Proof. This is the well-known Wedin’s theorem.

Lemma 22. Let A,  and E be three matrices in Rn×m and that  = A + E. Suppose the SVD
of A, Â satisfies
$$A = \begin{bmatrix} U & U_\perp \end{bmatrix} \begin{bmatrix} \Sigma_1 & 0 \\ 0 & \Sigma_2 \end{bmatrix} \begin{bmatrix} V^T \\ V_\perp^T \end{bmatrix}, \qquad \hat{A} = \begin{bmatrix} \hat{U} & \hat{U}_\perp \end{bmatrix} \begin{bmatrix} \hat{\Sigma}_1 & 0 \\ 0 & \hat{\Sigma}_2 \end{bmatrix} \begin{bmatrix} \hat{V}^T \\ \hat{V}_\perp^T \end{bmatrix}.$$

Here r ⩽ min{n, m}, and Σ1 , Σ̂1 ∈ Rr×r contain the leading r singular values of A and  respec-
tively. Denote

α = σmin (U T ÂV ), β = ∥U⊥T ÂV⊥ ∥2 , z21 = ∥U⊥T EV ∥2 , and z12 = ∥U T EV⊥ ∥2 ,

If $\alpha^2 \geqslant \beta^2 + \min\{z_{21}^2, z_{12}^2\}$, then
$$\|U U^T - \hat{U} \hat{U}^T\|_2 \leqslant \frac{\alpha z_{21} + \beta z_{12}}{\alpha^2 - \beta^2 - \min\{z_{21}^2, z_{12}^2\}}.$$

Proof. See Theorem 1 of Cai and Zhang (2018).

Corollary 3. Let A,  and E be three matrices in Rn×m and that  = A+E. Suppose Rank(A) = r
and σr (A) ⩾ 4∥E∥2 . Write the SVD of A, Â as
$$A = \begin{bmatrix} U & U_\perp \end{bmatrix} \begin{bmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} V^T \\ V_\perp^T \end{bmatrix}, \qquad \hat{A} = \begin{bmatrix} \hat{U} & \hat{U}_\perp \end{bmatrix} \begin{bmatrix} \hat{\Sigma}_1 & 0 \\ 0 & \hat{\Sigma}_2 \end{bmatrix} \begin{bmatrix} \hat{V}^T \\ \hat{V}_\perp^T \end{bmatrix}, \qquad (122)$$

where Σ1 , Σ̂1 ∈ Rr×r contain the leading r singular values of A and Â respectively. Then for any
Ũ ∈ On×r , Ṽ ∈ Om×r such that

Ũ Ũ T = U U T and Ṽ Ṽ T = V V T ,

there exists a universal constant C such that
$$\|\tilde{U}\tilde{U}^T - \hat{U}\hat{U}^T\|_2 \leqslant C\left( \frac{\|E\tilde{V}\|_2}{\sigma_r(A)} + \frac{\|E\|_2^2}{\sigma_r^2(A)} \right). \qquad (123)$$
Proof. Let α, β, z21 , z12 be defined as in Lemma 22. Observe that

α = σmin (U T ÂV ) = σr (U T ÂV ) = σr (U U T ÂV V T ) = σr (U U T (A + E)V V T )

where the second equality follows from the fact that U T ÂV ∈ Rr×r , and the third equality follows
from Lemma 17. So by Theorem 2,

|α − σr (A)| ⩽ ∥U U T EV V T ∥2 ⩽ ∥E Ṽ Ṽ T ∥2 = ∥E Ṽ ∥2 . (124)

In addition, since Rank(A) = r, it holds that U⊥T AV⊥ = 0. Therefore,

β = ∥U⊥T (A + E)V⊥ ∥2 = ∥U⊥T EV⊥ ∥2 ⩽ ∥E∥2 .

In addition, we have that z12 ⩽ ∥E∥2 and that

z21 ⩽ ∥EV ∥2 = ∥EV V T ∥2 = ∥E Ṽ Ṽ T ∥2 = ∥E Ṽ ∥2 ,

where the first and the last equality both follow from Lemma 17. Consequently
$$\alpha^2 - \beta^2 - \min\{z_{21}^2, z_{12}^2\} \geqslant \frac{1}{2}\sigma_r^2(A) - 2\|E\tilde{V}\|_2^2 - \|E\|_2^2 - \|E\|_2^2 \geqslant \frac{1}{2}\sigma_r^2(A) - 4\|E\|_2^2 \geqslant \frac{1}{4}\sigma_r^2(A),$$

where the first inequality follows from (124), the second inequality follows from ∥E Ṽ ∥2 ⩽ ∥E∥2 ,
and the last inequality follows from σr (A) ⩾ 4∥E∥2 . Therefore by Lemma 22,

$$\|\tilde{U}\tilde{U}^T - \hat{U}\hat{U}^T\|_2 = \|U U^T - \hat{U}\hat{U}^T\|_2 \leqslant \frac{\alpha z_{21} + \beta z_{12}}{\alpha^2 - \beta^2 - \min\{z_{21}^2, z_{12}^2\}} \leqslant \frac{(\sigma_r(A) + \|E\tilde{V}\|_2)\|E\tilde{V}\|_2 + \|E\|_2^2}{\frac{1}{4}\sigma_r^2(A)} \leqslant \frac{4\|E\tilde{V}\|_2}{\sigma_r(A)} + \frac{8\|E\|_2^2}{\sigma_r^2(A)},$$

where ∥E Ṽ ∥2 ⩽ ∥E∥2 is used in the last inequality.
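A numerical sanity check of Corollary 3 (illustrative only, assuming Python with NumPy). The corollary only asserts the existence of a universal constant $C$; the choice $C = 8$ below is consistent with the bound derived in the proof above:

```python
# Numerical sanity check of Corollary 3 (illustrative only).
import numpy as np

rng = np.random.default_rng(6)
n, m, r = 12, 10, 3

U0, _ = np.linalg.qr(rng.standard_normal((n, r)))
V0, _ = np.linalg.qr(rng.standard_normal((m, r)))
A = U0 @ np.diag([5.0, 4.0, 3.0]) @ V0.T                 # Rank(A) = r, sigma_r(A) = 3
E = 0.05 * rng.standard_normal((n, m))
print("sigma_r(A) >= 4 ||E||_2:", 3.0 >= 4 * np.linalg.norm(E, 2))

U_hat = np.linalg.svd(A + E)[0][:, :r]                   # leading r left singular vectors of A + E
lhs = np.linalg.norm(U0 @ U0.T - U_hat @ U_hat.T, 2)
rhs = 8 * (np.linalg.norm(E @ V0, 2) / 3.0 + np.linalg.norm(E, 2) ** 2 / 3.0 ** 2)
print(lhs, "<=", rhs, ":", lhs <= rhs)
```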

C.8 Polynomial Approximation Theory


Let C β0 ([0, 1]) denote the class of functions with continuous derivatives up to order β0 ⩾ 1.

Theorem 4. For any f ∈ C β0 ([0, 1]), there exists a polynomial pf of degree at most n such that

∥f − pf ∥∞ ⩽ C∥f (β0 ) ∥∞ n−β0 , (125)

where C is an absolute constant which does not depend on f .

Proof. This is the well-known Jackson’s Theorem for polynomial approximation. See Section 7.2
of DeVore and Lorentz (1993) for a proof.

Corollary 4. Let f ∈ C β0 ([0, 1]). Denote by Sn the class of polynomials of degree at most n on [0, 1], and
by Pn the projection operator from L2 ([0, 1]) onto Sn with respect to the L2 norm. Then there exists a
constant C which does not depend on f such that

∥f − Pn (f )∥L2 ([0,1]) ⩽ C∥f (β0 ) ∥∞ n−β0 .

Proof. Let pf be the polynomial satisfying (125). Then

∥f − Pn (f )∥L2 ([0,1]) ⩽ ∥f − pf ∥L2 ([0,1]) ⩽ ∥f − pf ∥∞ ⩽ C∥f (β0 ) ∥∞ n−β0 .
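The decay predicted by Corollary 4 is easy to observe numerically (illustrative only). The sketch assumes Python with NumPy and computes the $L^2([0,1])$ projection onto polynomials of degree at most $n$ by weighted least squares on a Gauss–Legendre grid; for the smooth test function below the error decays faster than any fixed rate $n^{-\beta_0}$:

```python
# Illustration of Corollary 4: L2 projection error onto polynomials of degree <= n.
import numpy as np
from numpy.polynomial import legendre

f = lambda x: np.exp(np.sin(2 * np.pi * x))              # smooth test function on [0, 1]

t, w = legendre.leggauss(200)                            # Gauss-Legendre nodes/weights on [-1, 1]
x, w = (t + 1) / 2, w / 2                                # mapped to [0, 1]
sqw = np.sqrt(w)

for n in (2, 4, 8, 16):
    V = legendre.legvander(2 * x - 1, n)                 # polynomial basis of degree <= n
    coef, *_ = np.linalg.lstsq(sqw[:, None] * V, sqw * f(x), rcond=None)
    err = np.sqrt(np.sum(w * (f(x) - V @ coef) ** 2))    # approximate L2([0,1]) projection error
    print(n, err)
```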

Theorem 5. Let f ∈ C β0 ([0, 1]). Denote by Sn the class of polynomials of degree at most n on [0, 1], and
by Pn the projection operator from L2 ([0, 1]) onto Sn with respect to the L2 norm. Then there exists a
constant C only depending on f such that
∥f − Pn (f )∥∞ ⩽ Cn−β0 .

Proof. See Theorem 4.8 of Wang (2021).

