TENSOR DECOMPOSITION WITH PYTHON
LEARNING STRUCTURES FROM MULTIDIMENSIONAL DATA
ANDRÉ PANISSON
@apanisson
ISI Foundation, Torino & New York City
WHAT IS DATA DECOMPOSITION?
DECOMPOSITION == FACTORIZATION
Representing a dataset as a sum of (interpretable) parts
▸ Represent data as the combination of many components / factors
▸ Dimensionality reduction: each new dimension
represents a latent variable:
▸ text corpus => topics
▸ shopping behaviour => segments (user segmentation)
▸ social network => groups, communities
▸ psychology surveys => personality traits
▸ electronic medical records => health conditions
▸ chemical solutions => chemical ingredients
DATA DECOMPOSITION
▸ Decomposition of data represented in two dimensions:
MATRIX FACTORIZATION
▸ text: documents X terms
▸ surveys: subjects X questions
▸ electronic medical records: patients X diagnosis/drugs
▸ Decomposition of data represented in more dimensions:
TENSOR FACTORIZATION
▸ social networks: user X user (adjacency matrix) X time
▸ text: authors X terms X time
▸ spectroscopy:
solution sample X wavelength (emission) X wavelength (excitation)
5.
WHY TENSOR FACTORIZATION + PYTHON?
▸ Matrix Factorization is already used in many fields
▸ Tensor Factorization is becoming very popular
for multiway data analysis
▸ TF is very useful to explore time-varying network data
▸ Still, the most widely used tool is MATLAB
▸ There’s room for improvement in
the Python libraries for TF
FACTOR ANALYSIS
Spearman ~1900
$X \approx WH$
$X_{\text{tests} \times \text{subjects}} \approx W_{\text{tests} \times \text{intelligences}} \, H_{\text{intelligences} \times \text{subjects}}$
Spearman, 1927: The abilities of man.
[Figure: the tests × subjects matrix X factorized into W (tests × intelligences) and H (intelligences × subjects)]
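As a minimal illustration (not from the original slides), latent factors of this kind can be recovered with scikit-learn's FactorAnalysis; note that scikit-learn expects a samples × features layout, i.e. subjects × tests, so the synthetic data here is transposed with respect to the slide's notation:

import numpy as np
from sklearn.decomposition import FactorAnalysis

X = np.random.rand(100, 20)          # synthetic subjects x tests matrix

fa = FactorAnalysis(n_components=2)  # two latent "intelligences"
S = fa.fit_transform(X)              # subjects x factors (factor scores)
L = fa.components_                   # factors x tests (factor loadings)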
TOPIC MODELING / LATENT SEMANTIC ANALYSIS
Blei, David M. "Probabilistic topic models." Communications of the ACM 55.4 (2012): 77-84.
[Figure from Blei (2012): each topic is a distribution over words, e.g. "gene, dna, genetic", "life, evolve, organism", "brain, neuron, nerve", "data, number, computer", each word with a probability such as 0.04 or 0.02; each document mixes topics in per-document proportions, and every word in it is assigned to one topic.]
TOPIC MODELING / LATENT SEMANTIC ANALYSIS
$X \approx WH$
Non-negative Matrix Factorization (NMF):
(~1970 Lawson, ~1995 Paatero, ~2000 Lee & Seung)
2005 Gaussier et al. "Relation between PLSA and NMF and implications."
$\arg\min_{W,H} \|X - WH\| \quad \text{s.t.} \quad W, H \geq 0$
[Figure: the sparse documents × terms matrix X factorized into W (documents × topics) and H (topics × terms)]
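A minimal sketch of this documents × terms factorization with scikit-learn (the toy corpus and parameter choices are illustrative assumptions, not from the slides):

from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["genes encode proteins in dna",
        "neurons and nerves form the brain",
        "computers crunch numbers and data"]

X = TfidfVectorizer().fit_transform(docs)  # sparse documents x terms matrix

nmf = NMF(n_components=2)
W = nmf.fit_transform(X)                   # documents x topics
H = nmf.components_                        # topics x terms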
NON-NEGATIVE MATRIX FACTORIZATION (NMF)
NMF gives a parts-based representation
(Lee & Seung, Nature 1999)
[Figure: an original face image reconstructed as a combination of learned components; NMF yields localized parts, PCA yields holistic components]
NMF is similar to Spectral Clustering
(Ding et al. - SDM 2005)
$\arg\min_{W,H} \|X - WH\| \quad \text{s.t.} \quad W, H \geq 0$
Multiplicative updates:
$W \leftarrow W \circ \dfrac{XH^T}{WHH^T} \qquad H \leftarrow H \circ \dfrac{W^T X}{W^T W H}$
NMF brings interpretation!
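These updates are a few lines of NumPy; a minimal sketch (the random initialization, fixed iteration count, and epsilon guard against division by zero are assumptions):

import numpy as np

def nmf_multiplicative(X, r, n_iter=200, eps=1e-9):
    n, m = X.shape
    W = np.random.rand(n, r)
    H = np.random.rand(r, m)
    for _ in range(n_iter):
        W *= (X @ H.T) / (W @ H @ H.T + eps)   # W update
        H *= (W.T @ X) / (W.T @ W @ H + eps)   # H update
    return W, H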
import matplotlib.pyplot as plt
from sklearn import datasets, decomposition, utils

# fetch_mldata is removed in recent scikit-learn versions;
# use datasets.fetch_openml('mnist_784') there instead
digits = datasets.fetch_mldata('MNIST original')
A = utils.shuffle(digits.data)

nmf = decomposition.NMF(n_components=20)
W = nmf.fit_transform(A)
H = nmf.components_

plt.rc("image", cmap="binary")
plt.figure(figsize=(8, 6))
for i in range(20):
    plt.subplot(4, 5, i + 1)        # 4x5 grid to fit all 20 components
    plt.imshow(H[i].reshape(28, 28))
    plt.xticks(())
    plt.yticks(())
plt.tight_layout()
BEYOND MATRICES: HIGH-DIMENSIONAL DATASETS
Cichocki et al. Nonnegative Matrix and Tensor Factorizations
Environmental analysis
▸ Measurement as a function of (Location, Time, Variable)
Sensory analysis
▸ Score as a function of (Wine sample, Judge, Attribute)
Process analysis
▸ Measurement as a function of (Batch, Variable, Time)
Spectroscopy
▸ Intensity as a function of (Wavelength, Retention, Sample, Time,
Location, …)
…
MULTIWAY DATA ANALYSIS
RANK-1 TENSOR
The outer product of N vectors results in a rank-1 tensor
import numpy as np

a = np.array([1, 2, 3])
b = np.array([1, 2, 3, 4])
c = np.array([1, 2])

T = np.zeros((a.shape[0], b.shape[0], c.shape[0]))
for i in range(a.shape[0]):
    for j in range(b.shape[0]):
        for k in range(c.shape[0]):
            T[i, j, k] = a[i] * b[j] * c[k]

Output:
array([[[  1.,   2.],
        [  2.,   4.],
        [  3.,   6.],
        [  4.,   8.]],
       [[  2.,   4.],
        [  4.,   8.],
        [  6.,  12.],
        [  8.,  16.]],
       [[  3.,   6.],
        [  6.,  12.],
        [  9.,  18.],
        [ 12.,  24.]]])
$T = a^{(1)} \circ a^{(2)} \circ \cdots \circ a^{(N)}$
$T_{i,j,k} = a^{(1)}_i \, a^{(2)}_j \, a^{(3)}_k$
[Figure: a rank-1 tensor as the outer product of vectors a, b, and c]
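The triple loop above can be replaced by a single vectorized call; an equivalent one-liner with NumPy's einsum:

# outer product of three vectors: T[i, j, k] = a[i] * b[j] * c[k]
T = np.einsum('i,j,k->ijk', a, b, c)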
TENSOR RANK
▸ Every tensor can be written as a sum of rank-1 tensors
[Figure: a tensor written as the sum of rank-1 tensors a_1 ∘ b_1 ∘ c_1 + ... + a_J ∘ b_J ∘ c_J]
▸ Tensor rank: the smallest number of rank-1 tensors
whose sum generates the tensor
$X \approx \sum_{r=1}^{R} a^{(1)}_r \circ a^{(2)}_r \circ \cdots \circ a^{(N)}_r \equiv [\![ A^{(1)}, A^{(2)}, \cdots, A^{(N)} ]\!]$
$T \approx \sum_{r=1}^{R} a_r \circ b_r \circ c_r \equiv [\![ A, B, C ]\!]$
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]]).T
B = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8]]).T
C = np.array([[1, 2],
              [3, 4]]).T

T = np.zeros((A.shape[0], B.shape[0], C.shape[0]))
for i in range(A.shape[0]):
    for j in range(B.shape[0]):
        for k in range(C.shape[0]):
            for r in range(A.shape[1]):
                T[i, j, k] += A[i, r] * B[j, r] * C[k, r]

# equivalent one-liner:
T = np.einsum('ir,jr,kr->ijk', A, B, C)

Output:
array([[[  61.,   82.],
        [  74.,  100.],
        [  87.,  118.],
        [ 100.,  136.]],
       [[  77.,  104.],
        [  94.,  128.],
        [ 111.,  152.],
        [ 128.,  176.]],
       [[  93.,  126.],
        [ 114.,  156.],
        [ 135.,  186.],
        [ 156.,  216.]]])
$[\![ A, B, C ]\!]$ is called a Kruskal tensor.
TENSOR FACTORIZATION
▸ CANDECOMP/PARAFAC factorization (CP)
▸ an extension of SVD / PCA / NMF to tensors
NON-NEGATIVE TENSOR FACTORIZATION
▸ Decompose a non-negative tensor into
a sum of R non-negative rank-1 tensors
$\arg\min_{A,B,C} \|T - [\![ A, B, C ]\!]\|$
with $[\![ A, B, C ]\!] \equiv \sum_{r=1}^{R} a_r \circ b_r \circ c_r$
subject to $A \geq 0, B \geq 0, C \geq 0$
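Several Python libraries implement this decomposition; a minimal sketch with TensorLy's non_negative_parafac (TensorLy is an example library, not the one shown in these slides; the toy tensor and rank are assumptions, and the exact return value varies across TensorLy versions):

import numpy as np
import tensorly as tl
from tensorly.decomposition import non_negative_parafac

T = tl.tensor(np.random.rand(10, 8, 6))             # toy non-negative 3-way tensor
weights, factors = non_negative_parafac(T, rank=3)  # recent-version return format
A, B, C = factors                                   # A: 10x3, B: 8x3, C: 6x3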
TENSOR FACTORIZATION: HOW TO
Alternating Least Squares (ALS):
fix all factor matrices but one, and solve a least-squares problem for that one
$\min_{A \geq 0} \|T_{(1)} - A(C \odot B)^T\|$
$\min_{B \geq 0} \|T_{(2)} - B(C \odot A)^T\|$
$\min_{C \geq 0} \|T_{(3)} - C(B \odot A)^T\|$
$\odot$ denotes the Khatri-Rao product, which is a column-wise Kronecker product, i.e., $C \odot B = [c_1 \otimes b_1, c_2 \otimes b_2, \ldots, c_R \otimes b_R]$
$T_{(k)}$ is the tensor unfolded on the $k$-th mode, so that
$T_{(1)} = \hat{A}(\hat{C} \odot \hat{B})^T \qquad T_{(2)} = \hat{B}(\hat{C} \odot \hat{A})^T \qquad T_{(3)} = \hat{C}(\hat{B} \odot \hat{A})^T$
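A minimal NumPy illustration of one such least-squares step (a sketch assuming the Kolda & Bader unfolding convention; the helper names unfold, khatri_rao, and als_step_A are ours, not from the slides):

import numpy as np

def unfold(T, mode):
    # mode-k unfolding T_(k): bring mode k to the front, then flatten
    # the remaining modes in Fortran order (Kolda & Bader convention)
    return np.reshape(np.moveaxis(T, mode, 0), (T.shape[mode], -1), order='F')

def khatri_rao(C, B):
    # column-wise Kronecker product [c1 (x) b1, ..., cR (x) bR]
    return np.column_stack([np.kron(C[:, r], B[:, r]) for r in range(B.shape[1])])

def als_step_A(T, B, C):
    # solve min_A || T_(1) - A (C ⊙ B)^T || in the unconstrained LS sense
    KR = khatri_rao(C, B)
    At, *_ = np.linalg.lstsq(KR, unfold(T, 0).T, rcond=None)
    return At.T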
import numpy as np

# random (not zero) initialization: multiplicative updates
# cannot move away from an all-zeros starting point
F = [np.random.rand(n, r), np.random.rand(m, r), np.random.rand(o, r)]
FF_init = np.array([f.T.dot(f) for f in F])   # Gram matrix of each factor

def iter_solver(T, F, FF_init):
    # Update each factor
    for k in range(len(F)):
        # Hadamard product of the Gram matrices of all other factors
        FF = np.ones((r, r))
        for i in list(range(k)) + list(range(k + 1, len(F))):
            FF = FF * FF_init[i]
        # unfolded tensor times Khatri-Rao product
        XF = T.uttkrp(F, k)
        # multiplicative update (or the NNLS alternative below)
        F[k] = F[k] * XF / (F[k].dot(FF))
        # F[k] = nnls(FF, XF.T).T
        FF_init[k] = F[k].T.dot(F[k])
    return F, FF_init
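Hypothetical usage, assuming T is a scikit-tensor (sktensor) tensor exposing the uttkrp method; the iteration count is an arbitrary choice for illustration:

for it in range(100):
    F, FF_init = iter_solver(T, F, FF_init)
A, B, C = F        # the three factor matrices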
$\min_{A \geq 0} \|T_{(1)} - A(C \odot B)^T\| \qquad \min_{B \geq 0} \|T_{(2)} - B(C \odot A)^T\| \qquad \min_{C \geq 0} \|T_{(3)} - C(B \odot A)^T\|$
Each subproblem has the same form as the NMF problem $\arg\min_{W,H} \|X - WH\|$, so the NMF multiplicative update $W \leftarrow W \circ \frac{XH^T}{WHH^T}$ carries over, with the numerator $XH^T$ becoming $T_{(1)}(C \odot B)$.
J. Kim and H. Park. Fast Nonnegative Tensor Factorization with an Active-set-like Method. In High-Performance Scientific Computing: Algorithms and Applications, Springer, 2012, pp. 311-326.
HOW TO INTERPRET: USER X TERM X TIME
X is a 3-way tensor in which $x_{nmt}$ is 1 if the term m was used by user n at interval t, and 0 otherwise.
$A_{N \times K}$ is the association of each user n to a factor k
$B_{M \times K}$ is the association of each term m to a factor k
$C_{T \times K}$ shows the time activity of each factor
[Figure: X (N×M×T, users × terms × time) ≈ ⟦A, B, C⟧, with factor matrices A (N×K, users × factors), B (M×K, terms × factors), and C (T×K, time × factors)]
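In practice the factors can be read off by ranking users and terms by their weights and plotting each factor's activity over time; a minimal sketch (it assumes the factor matrices A, B, C from above and a terms vocabulary list are available):

import numpy as np
import matplotlib.pyplot as plt

K = A.shape[1]
for k in range(K):
    top_users = np.argsort(A[:, k])[::-1][:5]   # users most associated with factor k
    top_terms = np.argsort(B[:, k])[::-1][:5]   # terms most associated with factor k
    print(k, top_users, [terms[m] for m in top_terms])
    plt.plot(C[:, k], label='factor %d' % k)    # temporal activity of factor k
plt.legend()
plt.show()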