Tutorial | PKDD 2005
A practical Time-Series
Tutorial with MATLAB
Michalis Vlachos
IBM T.J.
T.J. Watson Research Center
Hawthorne, NY, 10532
Tutorial | Time-Series with Matlab
About this tutorial
The goal of this tutorial is to show you that time-series
research (or research in general) can be made fun, when it
involves visualizing ideas, that can be achieved with concise
programming.
Matlab enables us to do that.
Will I be able
to use this
MATLAB
right away
after the
tutorial?
I am definately
smarter than her,
but I am not a timetimeseries person, perper-se.
I wonder what I gain
from this tutorial
tutorial
Tutorial | Time-Series with Matlab
Disclaimer
I am not affiliated with Mathworks in any way
but I do like using Matlab a lot
since it makes my life easier
Errors and bugs are most likely contained in this tutorial.
I might be responsible for some of them.
Tutorial | Time-Series with Matlab
Timeline of tutorial
Matlab introduction
I will try to convince you that Matlab is cool
Brief introduction to its many features
TimeTime-series with Matlab
Introduction
TimeTime-Series Representations
Distance Measures
Lower Bounding
Clustering/Classification/Visualization
Applications
4
Tutorial | Time-Series with Matlab
What this tutorial is NOT about
Moving averages
Autoregressive models
Forecasting/Prediction
Stationarity
Letsnot
notescape
escapeinto
into
Lets
mathematics.
Lets
mathematics. Lets
Oh...buy
Oh...buy
mybooks
books
my
too!
too!
stickwith
withreality.
reality.
stick
Seasonality
..or about complex mathematical formulas
Michael Crichton
5
Tutorial | Time-Series with Matlab
PART I: Matlab Introduction
Tutorial | Time-Series with Matlab
Why does anyone need Matlab?
Matlab enables the efficient
Exploratory Data Analysis (EDA)
Science progresses through observation
-- Isaac Newton
Isaac Newton
The greatest value of a picture is that is forces us to
notice what we never expected to see
-- John Tukey
John Tukey
7
Tutorial | Time-Series with Matlab
Matlab
Interpreted Language
Easy code maintenance (code is very compact)
Very fast array/vector manipulation
Support for OOP
Easy plotting and visualization
Easy Integration with other Languages/OSs
Interact with C/C++, COM Objects, DLLs
Build in Java support (and compiler)
Ability to make executable files
Multi-Platform Support (Windows, Mac, Linux)
Extensive number of Toolboxes
Image, Statistics, Bioinformatics, etc
8
Tutorial | Time-Series with Matlab
History of Matlab (MATrix LABoratory)
The most important thing in the programming language is the name.
I have recently invented a very good name and now I am looking for a
suitable language. -- R. Knuth
Programmed by Cleve Moler as an interface for
EISPACK & LINPACK
Cleve Moler
1957: Moler goes to Caltech. Studies numerical
Analysis
1961: Goes to Stanford. Works with G. Forsythe on
Laplacian eigenvalues.
1977: First edition of Matlab; 2000 lines of Fortran
80 functions (now more than 8000 functions)
1979: Met with Jack Little in Stanford. Started working
on porting it to C
1984: Mathworks is founded
Video:http://www.mathworks.com/company/aboutus/founders/origins_of_matlab_wm.html
9
Tutorial | Time-Series with Matlab
10
Tutorial | Time-Series with Matlab
Current State of Matlab/Mathworks
Matlab, Simulink, Stateflow
Matlab version 7, service pack 2
Used in variety of industries
Aerospace, defense, computers, communication, biotech
Mathworks still is privately owned
Used in >3,500 Universities, with >500,000 users worldwide
2004 Revenue: 300 M.
2004 Employees: 1,000
Pricing:
~2000$ (Commercial use),
~100$ (Student Edition)
Money is better than poverty, if only for financial reasons. Woody Allen
11
Tutorial | Time-Series with Matlab
Who needs Matlab?
R&D companies for easy application deployment
Professors
Lab assignments
Matlab allows focus on algorithms not on language features
Students
Batch processing of files
No more incomprehensible perl code!
Great environment for testing ideas
Quick coding of ideas, then porting to C/Java etc
Easy visualization
Its cheap! (for students at least)
12
Tutorial | Time-Series with Matlab
Starting up Matlab
Personally I'm always ready to learn, although I do not always like
being taught.
Sir Winston Churchill
Dos/Unix like directory navigation
Commands like:
cd
pwd
mkdir
For navigation it is easier to just
copy/paste the path from explorer
E.g.:
cd c:\documents\
13
Tutorial | Time-Series with Matlab
Matlab Environment
Command Window:
- type commands
- load scripts
Workspace:
Loaded Variables/Types/Size
14
Tutorial | Time-Series with Matlab
Matlab Environment
Command Window:
- type commands
- load scripts
Workspace:
Loaded Variables/Types/Size
Help contains a comprehensive
introduction to all functions
15
Tutorial | Time-Series with Matlab
Matlab Environment
Command Window:
- type commands
- load scripts
Workspace:
Loaded Variables/Types/Size
Excellent demos and
tutorial of the various
features and toolboxes
16
Tutorial | Time-Series with Matlab
Starting with Matlab
Everything is arrays
Manipulation of arrays is faster than regular manipulation
with for-loops
a = [1 2 3 4 5 6 7 9 10] % define an array
17
Tutorial | Time-Series with Matlab
Populating arrays
Plot sinusoid function
a = [0:0.3:2*pi] % generate values from 0 to 2pi (with step of 0.3)
b = cos(a)
cos(a) % access cos at positions contained in array [a]
plot(a,b)
plot(a,b) % plot a (x(x-axis) against b (y(y-axis)
Related:
linspace(-100,100,15); % generate 15 values between -100 and 100
18
Tutorial | Time-Series with Matlab
Array Access
Access array elements
>> a(1)
>> a(1:3)
ans =
0
ans =
0.3000
0.6000
Set array elements
>> a(1) = 100
>> a(1:3) = [100 100 100]
100]
19
Tutorial | Time-Series with Matlab
2D Arrays
Can access whole columns or rows
Lets define a 2D array
>> a = [1 2 3; 4 5 6]
a =
>> a(1,:)
Row-wise access
ans =
1
4
2
5
3
6
>> a(2,2)
>> a(:,1)
ans =
ans =
3
Column-wise access
1
4
A good listener is not only popular everywhere, but after a while he gets to know something. Wilson Mizner 20
10
Tutorial | Time-Series with Matlab
Column-wise computation
For arrays greater than 1D, all computations happen
column-by-column
>> a = [1 2 3; 3 2 1]
a =
>> max(a)
max(a)
ans =
1
3
2
2
3
1
>> mean(a)
mean(a)
>> sort(a)
ans =
ans =
2.0000
2.0000
2.0000
1
3
2
2
1
3
21
Tutorial | Time-Series with Matlab
Concatenating arrays
Column-wise or row-wise
Row next to row
>> a = [1 2 3];
>> b = [4 5 6];
>> c = [a b]
>> a = [1;2];
>> b = [3;4];
>> c = [a b]
c =
Column next to column
c =
1
>>
>>
>>
a
b
c
=
=
=
Row below to row
[1 2 3];
[4 5 6];
[a; b]
c =
1
2
>>
>>
>>
a
b
c
=
=
=
3
4
[1;2];
[3;4];
[a; b]
Column below column
c =
1
4
2
5
3
6
1
2
3
4
22
11
Tutorial | Time-Series with Matlab
Initializing arrays
Create array of ones [ones]
>> a = ones(1,3)
a =
1
>> a = ones(2,2)*5;
a =
1
5
5
5
5
>> a = ones(1,3)*inf
a =
Inf Inf Inf
Create array of zeroes [zeros]
Good for initializing arrays
>> a = zeros(1,4)
a =
0
>> a = zeros(3,1) + [1 2 3]
3]
a =
1
2
3
23
Tutorial | Time-Series with Matlab
Reshaping and Replicating Arrays
Changing the array shape [reshape]
(eg, for easier column-wise computation)
>> a = [1 2 3 4 5 6]
6]; % make it into a column
>> reshape(a,2,3)
ans =
1
2
3
4
reshape(X,[M,N]):
[M,N] matrix of
columnwise version
of X
5
6
Replicating an array [repmat]
>> a = [1 2 3];
>> repmat(a,1,2)
ans =
repmat(X,[M,N]):
make [M,N] tiles of X
>> repmat(a,2,1)
repmat(a,2,1)
ans =
1
2
1
2
3
3
24
12
Tutorial | Time-Series with Matlab
Useful Array functions
Last element of array [end]
>> a = [1 3 2 5];
>> a(end)
>> a = [1 3 2 5];
>> a(enda(end-1)
ans =
ans =
2
Length of array [length]
Length = 4
>> length(a)
a=
ans =
>> [rows, columns] = size(a)
rows = 1
columns = 4
columns = 4
rows = 1
Dimensions of array [size]
25
Tutorial | Time-Series with Matlab
Useful Array functions
Find a specific element [find] **
>> a = [1 3 2 5 10 5 2 3];
>> b = find(a==2)
b =
3
Sorting [sort] ***
>> a = [1 3 2 5];
>> [s,i]=sort(a)
a=
s=
i=
s =
1
i =
1
Indicates the index
where the element
came from
26
13
Tutorial | Time-Series with Matlab
Visualizing Data and Exporting Figures
Use Fishers Iris dataset
>> load fisheriris
4 dimensions, 3 species
Petal length & width, sepal length & width
Iris:
virginica/versicolor/setosa
meas (150x4 array):
Holds 4D measurements
...
'versicolor'
'versicolor'
'versicolor'
'versicolor'
'versicolor'
'virginica'
'virginica'
'virginica'
'virginica
species (150x1 cell array):
Holds name of species for
the specific measurement
...
27
strcmp, scatter, hold on
Tutorial | Time-Series with Matlab
Visualizing Data (2D)
>>
>>
>>
>>
>>
>>
>>
>>
idx_setosa = strcmp(species, setosa
setosa); % rows of setosa data
idx_virginica = strcmp(species, virginica
virginica); % rows of virginica
setosa = meas(idx_setosa,[1:2]);
virgin = meas(idx_virginica,[1:2]);
scatter(setosa(:,1), setosa(:,2)); % plot in blue circles by default
hold on;
scatter(virgin(:,1), virgin(:,2), rs
rs); % red[r
red[r]] squares[s
squares[s]] for these
idx_setosa
...
1
1
1
0
0
0
...
The world is governed more by appearances rather than realities --Daniel Webster
An array of zeros and
ones indicating the
positions where the
keyword setosa was
found
28
14
scatter3
Tutorial | Time-Series with Matlab
Visualizing Data (3D)
>>
>>
>>
idx_setosa = strcmp(species, setosa
setosa); % rows of setosa data
idx_virginica = strcmp(species, virginica
virginica); % rows of virginica
idx_versicolor = strcmp(species, versicolor
versicolor); % rows of versicolor
>>
>>
>>
>>
>>
>>
>>
setosa = meas(idx_setosa,[1:3]);
virgin = meas(idx_virginica,[1:3]);
versi = meas(idx_versicolor,[1:3]);
scatter3(setosa(:,1), setosa(:,2),setosa(:,3)); % plot in blue circles by default
hold on;
scatter3(virgin(:,1), virgin(:,2),virgin(:,3), rs
rs); % red[r
red[r]] squares[s
squares[s]] for these
scatter3(versi(:,1), virgin(:,2),versi(:,3), gx
gx); % green xs
7
6
5
>> grid on; % show grid on axis
>> rotate3D on; % rotate with mouse
4
3
2
1
4.5
4
3.5
3
2.5
2
4.5
5.5
6.5
7.5
29
Tutorial | Time-Series with Matlab
Changing Plots Visually
Zoom out
Zoom in
Create line
Create Arrow
Select Object
Add text
Computers are useless. They can only give you answers. Pablo Picasso
30
15
Tutorial | Time-Series with Matlab
Changing Plots Visually
Add titles
Add labels on axis
Change tick labels
Add grids to axis
Change color of line
Change thickness/
Linestyle
etc
31
Tutorial | Time-Series with Matlab
Changing Plots Visually (Example)
Change color and
width of a line
A
Right click
C
32
16
Tutorial | Time-Series with Matlab
Changing Plots Visually (Example)
The result
Other Styles:
3
2
1
0
-1
-2
-3
0
3
10
20
30
40
50
60
70
80
90
10
20
30
40
50
60
70
80
90
100
2
1
0
-1
-2
-3
0
33
100
Tutorial | Time-Series with Matlab
Changing Figure Properties with Code
GUIs are easy, but sooner or later we realize that
coding is faster
>> a = cumsum(randn(365,1)); % random walk of 365 values
If this represents a years
worth of measurements of an
imaginary quantity, we will
change:
x-axis annotation to months
Axis labels
Put title in the figure
Include some greek letters
in the title just for fun
Real men do it command-line --Anonymous
34
17
Tutorial | Time-Series with Matlab
Changing Figure Properties with Code
Axis annotation to months
>> axis tight; % irrelevant but useful...
>> xx = [15:30:365];
>> set(gca, xtick
xtick,xx)
The result
Real men do it command-line --Anonymous
35
Tutorial | Time-Series with Matlab
Changing Figure Properties with Code
Axis annotation to months
>> set(gca,
set(gca,xticklabel
xticklabel,[
,[Jan
Jan; ...
Feb
Feb;Mar
Mar])
The result
Real men do it command-line --Anonymous
36
18
Tutorial | Time-Series with Matlab
Changing Figure Properties with Code
Other latex examples:
Axis labels and title
\alpha, \beta, e^{-\alpha} etc
>> title(
title(My measurements (\
(\epsilon/\
epsilon/\pi)
pi))
>> ylabel(
ylabel(Imaginary Quantity
Quantity)
>> xlabel(
xlabel(Month of 2005
2005)
Real men do it command-line --Anonymous
37
Tutorial | Time-Series with Matlab
Saving Figures
Matlab allows to save the figures (.fig) for later
processing
.fig can be later
opened through
Matlab
You can always put-off for tomorrow, what you can do today. -Anonymous
38
19
Tutorial | Time-Series with Matlab
Exporting Figures
Export to:
emf, eps, jpg, etc
39
Tutorial | Time-Series with Matlab
Exporting figures (code)
You can also achieve the same result with Matlab code
Matlab code:
% extract to color eps
print -depsc myImage.eps;
myImage.eps; % from commandcommand-line
print(gcf,
print(gcf,-depsc
depsc,myImage
myImage) % using variable as name
40
20
Tutorial | Time-Series with Matlab
Visualizing Data - 2D Bars
1
2
3
4
colormap
bars
time = [100 120 80 70]; % our data
h = bar(time);
bar(time); % get handle
cmap = [1 0 0; 0 1 0; 0 0 1; .5 0 1]; % colors
colormap(cmap);
colormap(cmap); % create colormap
cdata = [1 2 3 4]; % assign colors
set(h,'CDataMapping','direct','CData',cdata);
set(h,'CDataMapping','direct','CData',cdata);
41
Tutorial | Time-Series with Matlab
Visualizing Data - 3D Bars
data
10
10
9
8
6
6
3
8
6
4
2
0
8
6
6
5
3
2
colormap
7
5
4
4
2
1
64
1
2
3
5
6
7
0
0.0198
0.0397
0.0595
0.0794
0.0992
1.0000
1.0000
1.0000
1.0000
0
0.0124
0.0248
0.0372
0.0496
0.0620
...
0.7440
0.7564
0.7688
0.7812
0
0.0079
0.0158
0.0237
0.0316
0.0395
0.4738
0.4817
0.4896
0.4975
3
data = [ 10 8 7; 9 6 5; 8 6 4; 6 5 4; 6 3 2; 3 2 1];
bar3([1 2 3 5 6 7], data);
c = colormap(gray);
colormap(gray); % get colors of colormap
c = c(20:55,:); % get some colors
colormap(c);
colormap(c); % new colormap
42
21
Tutorial | Time-Series with Matlab
Visualizing Data - Surfaces
data
10
9
10
7
6
9 10
5
4
10
3
2
1
10
8
10
6
8
6
2
0
The value at position
x-y of the array
indicates the height of
the surface
data = [1:10];
data = repmat(data,10,1);
repmat(data,10,1); % create data
surface(data,'FaceColor',[1 1 1], 'Edgecolor
', [0 0 1]); % plot data
'Edgecolor',
view(3);
view(3); grid on; % change viewpoint and put axis lines
43
Tutorial | Time-Series with Matlab
Creating .m files
Standard text files
Script: A series of Matlab commands (no input/output arguments)
Functions: Programs that accept input and return output
Right click
44
22
Tutorial | Time-Series with Matlab
Creating .m files
M editor
Double click
45
cumsum, num2str, save
Tutorial | Time-Series with Matlab
Creating .m files
The following script will create:
An array with 10 random walk vectors
Will save them under text files: 1.dat, , 10.dat
Sample Script
myScript.m
a = cumsum(randn(100,10)); % 10 random walk data of length 100
for i=1:size(a,2),
% number of columns
data = a(:,i) ;
fname = [num2str(i) .dat
.dat]; % a string is a vector of characters!
save(fname, data
data,-ASCII
ASCII); % save each column in a text file
end
Write this in the
M editor
A random walk time-series
cumsum(A)
10
15
10
-5
10
20
30
40
50
60
70
80
90 100
and execute by typing the
name on the Matlab
command line
46
23
Tutorial | Time-Series with Matlab
Functions in .m scripts
When we need to:
Organize our code
Frequently change parameters in our scripts
keyword output argument function name
input argument
function dataN = zNorm(data)
% ZNORM zNormalization of vector
% subtract mean and divide by std
Help Text
(help function_name)
if (nargin
<1), % check parameters
(nargin<1),
error(
error(Not enough arguments
arguments);
end
data = data mean(data);
mean(data); % subtract mean
data = data/std(data
); % divide by std
data/std(data);
dataN = data;
Function Body
function [a,b] = myFunc(data, x, y) % pass & return more arguments
See also:varargin, varargout
47
Tutorial | Time-Series with Matlab
Cell Arrays
Cells that hold other Matlab arrays
Lets read the files of a directory
>> f = dir(
dir(*.dat
*.dat) % read file contents
f =
15x1 struct array with fields:
name
date
bytes
isdir
for i=1:length(f),
a{i} = load(f(i).name);
N = length(a{i});
plot3([1:N], a{i}(:,1), a{i}(:,2), ...
r-, Linewidth
Linewidth, 1.5);
grid on;
pause;
600
500
cla;
400
end
Struct Array
1
name
date
bytes
isdir
a
).n
f(1
me
2
3
4
5
300
200
100
0
1000
1500
500
1000
48
500
0
24
Tutorial | Time-Series with Matlab
Reading/Writing Files
Load/Save are faster than C style I/O operations
But fscanf, fprintf can be useful for file formatting
or reading non-Matlab files
fid = fopen('fischer.txt', 'wt');
for i=1:length(species),
fprintf(fid, '%6.4f %6.4f %6.4f %6.4f %s\
%s\n', meas(i,:), species{i});
end
fclose(fid);
Output file:
Elements are accessed column-wise (again)
x = 0:.1:1; y = [x; exp(x)];
fid = fopen('exp.txt','w');
fprintf(fid,'%6.2f %12.8f\
%12.8f\n',y);
fclose(fid);
0
1
0.1
1.1052
0.2
1.2214
0.3
1.3499
0.4
0.4
1.4918
0.5
1.6487
1.6487
0.6
1.8221
0.7
2.0138
49
Tutorial | Time-Series with Matlab
Flow Control/Loops
if (else/elseif) , switch
Check logical conditions
while
Execute statements infinite number of times
for
Execute statements a fixed number of times
break, continue
return
Return execution to the invoking function
Life is pleasant. Death is peaceful. Its the transition thats troublesome. Isaac Asimov
50
25
Tutorial | Time-Series with Matlab
For-Loop or vectorization?
clear all;
tic;
for i=1:50000
a(i)
a(i) = sin(i);
sin(i);
end
toc
clear all;
a = zeros(1,50000);
zeros(1,50000);
tic;
for i=1:50000
a(i)
a(i) = sin(i);
sin(i);
end
toc
clear all;
tic;
i = [1:50000];
a = sin(i);
sin(i);
toc;
toc;
elapsed_time =
5.0070
tic, toc, clear all
Pre-allocate arrays that
store output results
No need for Matlab to
resize everytime
Functions are faster than
scripts
elapsed_time =
0.1400
Compiled into pseudocode
Load/Save faster than
Matlab I/O functions
After v. 6.5 of Matlab there
is for-loop vectorization
(interpreter)
elapsed_time =
0.0200
Time not importantonly life important. The Fifth Element
Vectorizations help, but
not so obvious how to
achieve many times
51
Tutorial | Time-Series with Matlab
Matlab Profiler
Find which portions of code take up
most of the execution time
Identify bottlenecks
Vectorize offending code
Time not importantonly life important. The Fifth Element
52
26
Tutorial | Time-Series with Matlab
Hints &Tips
There is always an easier (and faster) way
Typically there is a specialized function for what you want to
achieve
Learn vectorization techniques, by peaking at the
actual Matlab files:
edit [fname], eg
edit mean
edit princomp
Matlab Help contains many
vectorization examples
53
Tutorial | Time-Series with Matlab
Debugging
Beware of bugs in the above code; I have only proved it correct, not tried it
-- R. Knuth
Not as frequently required as in C/C++
Set breakpoints, step, step in, check variables values
Set breakpoints
54
27
Tutorial | Time-Series with Matlab
Debugging
Full control over variables and execution path
F10: step, F11: step in (visit functions, as well)
A
F10
C
Either this man is dead or my watch has stopped. Groucho Marx
55
Tutorial | Time-Series with Matlab
Advanced Features 3D modeling/Volume Rendering
Very easy volume manipulation and rendering
56
28
Tutorial | Time-Series with Matlab
Advanced Features Making Animations (Example)
Create animation by changing the camera viewpoint
-1
-2
0
0
0
-1
-2
-3
-3
0
-1
-2
50
50
50
-3
-1
1
0
100
100
-1
100
0
-1
azimuth = [50:100 99:99:-1:50]; % azimuth range of values
for k = 1:length(azimuth),
1:length(azimuth),
plot3(1:length(a),
);
plot3(1:length(a), a(:,1), a(:,2), 'r', 'Linewidth',2
'Linewidth',2);
grid on;
view(azimuth(k),30);
view(azimuth(k),30); % change new
M(k)
M(k) = getframe;
getframe; % save the frame
end
movie(M,20);
movie(M,20); % play movie 20 times
See also:movie2avi
57
Tutorial | Time-Series with Matlab
Advanced Features GUIs
Built-in Development Environment
Buttons, figures, Menus, sliders, etc
Several Examples in Help
Directory listing
Address book reader
GUI with multiple axis
58
29
Tutorial | Time-Series with Matlab
Advanced Features Using Java
Matlab is shipped with Java Virtual
Machine (JVM)
Access Java API (eg I/O or networking)
Import Java classes and construct objects
Pass data between Java objects and
Matlab variables
59
Tutorial | Time-Series with Matlab
Advanced Features Using Java (Example)
Stock Quote Query
Connect to Yahoo server
http://www.mathworks.com/matlabcentral/fileexchange/loadFile.do?
objectId=4069&objectType=file
disp('Contacting YAHOO server using ...');
disp(['url = java.net.URL('
java.net.URL(' urlString ')']);
end;
url = java.net.URL(urlString);
java.net.URL(urlString);
try
stream = openStream(url);
openStream(url);
ireader = java.io.InputStreamReader(stream);
java.io.InputStreamReader(stream);
breader = java.io.BufferedReader(ireader);
java.io.BufferedReader(ireader);
connect_query_data=
connect_query_data= 1; %connect made;
catch
connect_query_data=
connect_query_data= -1; %could not connect
case;
disp(['URL:
disp(['URL: ' urlString]);
urlString]);
error(['Could not connect to server. It may
be unavailable. Try again later.']);
stockdata={};
stockdata={};
return;
end
60
30
Tutorial | Time-Series with Matlab
Matlab Toolboxes
You can buy many specialized toolboxes from Mathworks
Image Processing, Statistics, Bio-Informatics, etc
There are many equivalent free toolboxes too:
SVM toolbox
http://theoval.sys.uea.ac.uk/~gcc/svm/toolbox/
Wavelets
http://www.math.rutgers.edu/~ojanen/wavekit/
Speech Processing
http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html
Bayesian Networks
http://www.cs.ubc.ca/~murphyk/Software/BNT/bnt.html
61
Tutorial | Time-Series with Matlab
In case I get stuck
help [command] (on the command line)
eg. help fft
Ivehad
hadaawonderful
wonderful
Ive
evening.
Butthis
this
evening. But
wasnt
it
wasnt it
Menu: help -> matlab help
Excellent introduction on various topics
Matlab webinars
http://www.mathworks.com/company/events/archived_webinars.html?fp
Google groups
comp.soft-sys.matlab
You can find *anything* here
Someone else had the same
problem before you!
62
31
Tutorial | Time-Series with Matlab
PART II: Time Series Analysis
Eightpercent
percentof
of
Eight
success
is
showing
success is showing
up.
up.
63
Tutorial | Time-Series with Matlab
What is a time-series
Definition:
Definition:AAsequence
sequenceof
ofmeasurements
measurementsover
overtime
time
ECG
Medicine
Stock Market
Meteorology
Geology
Astronomy
Chemistry
Biometrics
Robotics
64.0
62.8
62.0
66.0
62.0
32.0
86.4
...
21.6
45.2
43.2
53.0
43.2
42.8
43.2
36.4
16.9
10.0
Sunspot
Earthquake
time
64
32
Tutorial | Time-Series with Matlab
Applications (Image Matching)
Cluster 1
Many types of data can be
converted to time-series
Image
Color Histogram
600
Cluster 2
400
200
0
50
100
150
200
250
50
100
150
200
250
50
100
150
200
250
400
200
800
600
400
200
0
Time-Series
65
Tutorial | Time-Series with Matlab
Applications (Shapes)
Recognize type of leaf based on its shape
Ulmus carpinifolia
Acer platanoides
Salix fragilis
Tilia
Quercus robur
Convert perimeter into a sequence of values
Special thanks to A. Ratanamahatana &
E. Keogh for the leaf video.
66
33
Tutorial | Time-Series with Matlab
Applications (Motion Capture)
Motion-Capture (MOCAP) Data (Movies, Games)
Track position of several joints over time
3*17 joints = 51 parameters per frame
MOCAPdata
data
MOCAP
my
precious
my precious
67
Tutorial | Time-Series with Matlab
Applications (Video)
Video-tracking / Surveillance
Visual tracking of body features (2D time-series)
Sign Language recognition (3D time-series)
Video Tracking of body feature
over time (Athens1, Athens2)
68
34
Tutorial | Time-Series with Matlab
Time-Series and Matlab
Time-series can be represented as vectors or arrays
Fast vector manipulation
Most linear operations (eg euclidean distance, correlation) can
be trivially vectorized
Easy visualization
Many built-in functions
Specialized Toolboxes
69
Tutorial | Time-Series with Matlab
Becoming
Becoming sufficiently
sufficiently
familiar
familiar with
with something
something
is
a
substitute
is a substitute for
for
understanding
understanding it.
it.
PART II: Time Series Matching
Introduction
70
35
Tutorial | Time-Series with Matlab
Basic Data-Mining problem
Todays databases are becoming too large. Search is difficult.
How can we overcome this obstacle?
Basic structure of data-mining solution:
Represent data in a new format
Search few data in the new representation
Examine even fewer original data
Provide guarantees about the search results
Provide some type of data/result visualization
71
Tutorial | Time-Series with Matlab
Basic Time-Series Matching Problem
Distance
query
D = 7.3
Linear Scan:
Objective: Compare the query with
all sequences in DB and return
the k most similar sequences to
the query.
Database
Databasewith
withtime-series:
time-series:
Medical
sequences
Medical sequences
Images,
Images,etc
etc
Sequence
Length:100-1000pts
Sequence Length:100-1000pts
DB
DBSize:
Size:11TByte
TByte
D = 10.2
D = 11.8
D = 17
D = 22
72
36
Tutorial | Time-Series with Matlab
What other problems can we solve?
Clustering: Place time-series into similar groups
Classification: To which group is a time-series most similar to?
query
?
?
?
73
Tutorial | Time-Series with Matlab
Hierarchical Clustering
Very generic & powerful tool
Provides visual data grouping
Pairwise
distances
D1,1
D2,1
DM,N
1. Merge objects with
smallest distance
2. Reevaluate distances
3. Repeat process
Z = linkage(D);
H = dendrogram(Z);
74
37
Tutorial | Time-Series with Matlab
Partitional Clustering
Faster than hierarchical clustering
Typically provides suboptimal solutions (local minima)
Not good performance for high dimensions
K-Means Algorithm:
0.9
0.8
1. Initialize k clusters (k specified
by user) randomly.
0.7
0.6
2. Repeat until convergence
0.5
1. Assign each object to the
nearest cluster center.
0.4
0.3
2. Re-estimate cluster centers.
0.2
0.1
0.2
0.3
0.4
See: kmeans
0.5
0.6
0.7
0.8
0.9
75
Tutorial | Time-Series with Matlab
K-Means Demo
1.4
1.2
1
0.8
0.6
0.4
0.2
0
-0.2
-0.4
-0.5
0.5
1.5
76
38
Tutorial | Time-Series with Matlab
K-Means Clustering for Time-Series
So how is kMeans applied for Time-Series that are high-dimensional?
Perform kMeans on a compressed dimensionality
Original
sequences
Compressed
sequences
Clustering
space
0.4
0.2
-0.2
-0.4
-0.6
-0.6
-0.4
-0.2
0.2
0.4
0.6
0.8
77
Tutorial | Time-Series with Matlab
Classification
Typically classification can be made easier if we have clustered the objects
Class A
0.4
0.2
-0.2
Project query in the
new space and find
its closest cluster
So, query Q is more
similar to class B
-0.4
-0.6
-0.6
-0.4
-0.2
0.2
0.4
0.6
0.8
Class B
78
39
Tutorial | Time-Series with Matlab
Nearest Neighbor Classification
We need not perform clustering before classification. We can classify an
object based on the class majority of its nearest neighbors/matches.
Elfs
Hobbits
10
9
Hair Length
8
7
6
5
4
3
2
1
1
9 10
Height
79
Tutorial | Time-Series with Matlab
Example
What do we need?
1. Define Similarity
2. Search fast
Dimensionality Reduction
(compress data)
80
40
Tutorial | Time-Series with Matlab
All
All models
models are
are wrong,
wrong,
but
some
are
useful
but some are useful
PART II: Time Series Matching
Similarity/Distance functions
81
Tutorial | Time-Series with Matlab
Notion of Similarity I
Solution to any time-series problem, boils down to a proper
definition of *similarity*
Similarity is always subjective.
(i.e. it depends on the application)
82
41
Tutorial | Time-Series with Matlab
Notion of Similarity II
Similarity depends on the features we consider
(i.e. how we will describe or compress the sequences)
83
Tutorial | Time-Series with Matlab
Metric and Non-metric Distance Functions
Distance functions
Metric
Non-Metric
Euclidean Distance
Time Warping
Correlation
LCSS
Properties
Positivity:
Positivity:d(x,y)
d(x,y)0
0and
andd(x,y)=0,
d(x,y)=0,ififx=y
x=y
Symmetry:
Symmetry:d(x,y)
d(x,y)==d(y,x)
d(y,x)
IfIfany
anyof
ofthese
theseis
isnot
not
obeyed
then
the
obeyed then thedistance
distance
is
isaanon-metric
non-metric
Triangle
TriangleInequality:
Inequality:d(x,z)
d(x,z)d(x,y)
d(x,y)++d(y,z)
d(y,z)
84
42
Tutorial | Time-Series with Matlab
Triangle Inequality
Triangle
TriangleInequality:
Inequality:d(x,z)
d(x,z)d(x,y)
d(x,y)++d(y,z)
d(y,z)
z
Metric distance
functions can exploit
the triangle inequality
to speed-up search
Intuitively, if:
- x is similar to y and,
- y is similar to z, then,
- x is similar to z too.
85
Tutorial | Time-Series with Matlab
Triangle Inequality (Importance)
Triangle
TriangleInequality:
Inequality:d(x,z)
d(x,z)d(x,y)
d(x,y)++d(y,z)
d(y,z)
Assume:
d(Q,bestMatch) = 20
and
d(Q,B) =150
Then, since d(A,B)=20
d(Q,A) d(Q,B) d(B,A)
d(Q,A) 150 20 = 130
So we dont have to retrieve A from disk
20
110
20
90
110
90
0
86
43
Tutorial | Time-Series with Matlab
Non-Metric Distance Functions
Man
similar to
bat??
Matching
Matchingflexibility
flexibility
Robustness
Robustnessto
tooutliers
outliers
Bat
similar to
batman
Stretching
Stretchingin
intime/space
time/space
Support
Supportfor
fordifferent
differentsizes/lengths
sizes/lengths
Batman
similar
to man
Speeding-up
Speeding-upsearch
searchcan
canbe
be
difficult
difficult
87
Tutorial | Time-Series with Matlab
Euclidean Distance
Most widely used distance measure
n
Definition: L2 =
(a[i] b[i])
i =1
20
40
60
80
100
L2 = sqrt(sum((asqrt(sum((a-b).^2)); % return Euclidean distance
88
44
Tutorial | Time-Series with Matlab
Euclidean Distance (Vectorization)
Question: If I want to compare many sequences to each other do I have to
use a for-loop?
Answer: No, one can use the following equation to perform matrix
computations only
||A-B|| = sqrt ( ||A||2 + ||B||2 - 2*A.B )
M sequences
B: DxN matrix
Result is MxN matrix
A=
Of length D
A: DxM matrix
result
D1,1
D2,1
DM,N
aa=
.*b); ab=a'*b;
aa=sum(a.*a);
sum(a.*a); bb=sum(b
bb=sum(b.*b);
ab=a'*b;
d = sqrt(repmat(aa',[1 size(bb,2)])
);
size(bb,2)]) + repmat(bb,[size(aa,2)
repmat(bb,[size(aa,2) 1]) - 2*ab
2*ab);
89
Tutorial | Time-Series with Matlab
Data Preprocessing (Baseline Removal)
average value of A
average value of B
a = a mean(a);
mean(a);
90
45
Tutorial | Time-Series with Matlab
Data Preprocessing (Rescaling)
a = a ./ std(a);
std(a);
91
Tutorial | Time-Series with Matlab
Dynamic Time-Warping (Motivation)
Euclidean distance or warping cannot compensate for small distortions in
time axis.
A
B
According to Euclidean distance
B is more similar to A than to C
C
Solution: Allow for compression & decompression in time
92
46
Tutorial | Time-Series with Matlab
Dynamic Time-Warping
First used in speech recognition
for recognizing words spoken at
different speeds
Same idea can work equally well for
generic time-series data
---Maat--llaabb-------------------
----Mat-lab--------------------------
93
Tutorial | Time-Series with Matlab
Dynamic Time-Warping (how does it work?)
The intuition is that we copy an element multiple times so as to achieve a
better matching
Euclidean
Euclideandistance
distance
T1
=
[1,
1,
T1 = [1, 1,2,2,2]
2]
dd==11
T2
T2==[1,
[1,2,2,2,2,2]
2]
One-to-one linear alignment
Warping
Warpingdistance
distance
T1
=
[1,
1,
T1 = [1, 1,2,2,2]
2]
dd==00
T2
T2==[1,
[1,2,2,2,2,2]
2]
One-to-many non-linear alignment
94
47
Tutorial | Time-Series with Matlab
Dynamic Time-Warping (implementation)
It is implemented using dynamic programming. Create an array that stores
all solutions for all possible subsequences.
c(i,j)
)+
c(i,j) ==D(A
D(Ai,B
i,Bj j) +
min{
c(i-1,j-1)
min{ c(i-1,j-1), ,c(i-1,j
c(i-1,j)), ,c(i,j-1)
c(i,j-1)}}
Recursive equation
95
Tutorial | Time-Series with Matlab
Dynamic Time-Warping (Examples)
So does it work better than Euclidean? Well yes! After all it is more costly..
Dynamic Time Warping
18
20
17
13
16
14
12
19
15
11
3
9
8
7
5
6
2
10
4
1
Euclidean Distance
18
16
7
13
14
3
9
6
2
15
11
19
10
20
17
5
12
8
4
1
MIT arrhythmia database
96
48
Tutorial | Time-Series with Matlab
Dynamic Time-Warping (Can we speed it up?)
Complexity is O(n2). We can reduce it to O(n) simply by restricting the
warping path.
A
We now only fill only a small
portion of the array
Minimum
Bounding
Envelope
(MBE)
97
Tutorial | Time-Series with Matlab
Dynamic Time-Warping (restricted warping)
Camera-Mouse dataset
The restriction of the warping path helps:
A. Speed-up execution
B. Avoid extreme (degenerate) matchings
C. Improve clustering/classification
accuracy
Classification Accuracy
Camera Mouse
Australian Sign Language
10% warping is adequate
Warping Length
98
49
Tutorial | Time-Series with Matlab
Longest Common Subsequence (LCSS)
With Time Warping extreme values (outliers) can destroy the distance
estimates. The LCSS model can offer more resilience to noise and impose
spatial constraints too.
ignore majority
of noise
match
match
Matching within time and in space
Everything that is outside the bounding
envelope can never be matched
99
Tutorial | Time-Series with Matlab
Longest Common Subsequence (LCSS)
LCSS is more resilient to noise than DTW.
Disadvantages of DTW:
A. All points are matched
B. Outliers can distort distance
ignore majority
of noise
C. One-to-many mapping
Advantages of LCSS:
A. Outlying values not matched
B. Distance/Similarity distorted less
C. Constraints in time & space
match
match
100
50
Tutorial | Time-Series with Matlab
Longest Common Subsequence (Implementation)
Similar dynamic programming solution as DTW, but now we measure
similarity not distance.
Can also be expressed as distance
101
Tutorial | Time-Series with Matlab
Distance Measure Comparison
Dataset
Camera-Mouse
ASL
ASL+noise
Method
Time (sec)
Accuracy
Euclidean
34
20%
DTW
237
80%
LCSS
210
100%
Euclidean
2.2
33%
DTW
9.1
44%
LCSS
8.2
46%
Euclidean
2.1
11%
DTW
9.3
15%
LCSS
8.3
31%
LCSS offers enhanced robustness under noisy conditions
102
51
Tutorial | Time-Series with Matlab
Distance Measure Comparison (Overview)
Method
Complexity
Elastic Matching
One-to-one Matching
Noise
Robustness
O(n)
DTW
O(n*)
LCSS
O(n*)
Euclidean
103
Tutorial | Time-Series with Matlab
PART II: Time Series Matching
Lower Bounding
104
52
Tutorial | Time-Series with Matlab
Basic Time-Series Problem Revisited
Objective: Instead of comparing the query to the
original sequences (Linear Scan/LS) , lets compare
the query to simplified versions of the DB timeseries.
query
This
ThisDB
DBcan
cantypically
typically
fit
fitin
inmemory
memory
105
Tutorial | Time-Series with Matlab
Compression Dimensionality Reduction
Project all sequences into a new space, and
search this space instead (eg project timeseries from 100-D space to 2-D space)
Feature 1
A
B
C
Feature 2
query
One can also organize the low-dimensional
points into a hierarchical index structure. In
this tutorial we will not go over indexing
techniques.
Question: When searching the original space it is guaranteed that we
will find the best match. Does this hold (or under which circumstances)
in the new compressed space?
106
53
Tutorial | Time-Series with Matlab
Concept of Lower Bounding
You can guarantee similar results to Linear Scan in the original
dimensionality, as long as you provide a Lower Bounding (LB) function
(in low dim) to the original distance (high dim.)
GEMINI, GEneric Multimedia INdexIng
So, for projection from high dim. (N) to low dim. (n): Aa, Bb etc
DDLB (a,b)
<= D (A,B)
LB (a,b) <= Dtrue
true(A,B)
5
Projection onto X-axis
C B
4
0
C
3
EF
4
Projection on some other axis
0
0
D
2
False alarm (not a problem)
B C
EF
False dismissal (bad!)
Find everything within range of 1 from A
107
Tutorial | Time-Series with Matlab
Generic Search using Lower Bounding
simplified
DB
original
DB
Answer
Superset
Final
Answer
set
Verify
against
original
DB
simplified
query
query
108
54
Tutorial | Time-Series with Matlab
Lower Bounding Example
sequences
query
109
Tutorial | Time-Series with Matlab
Lower Bounding Example
sequences
query
110
55
Tutorial | Time-Series with Matlab
Lower Bounding Example
sequences
Lower Bounds
4.6399
37.9032
19.5174
72.1846
67.1436
78.0920
70.9273
63.7253
1.4121
111
Tutorial | Time-Series with Matlab
Lower Bounding Example
sequences
Lower Bounds
True Distance
4.6399
46.7790
37.9032
108.8856
19.5174
113.5873
72.1846
104.5062
67.1436
119.4087
78.0920
120.0066
70.9273
111.6011
63.7253
119.0635
1.4121
17.2540
BestSoFar
112
56
Tutorial | Time-Series with Matlab
Lower Bounding the Euclidean distance
There are many dimensionality reduction (compression ) techniques for time-series
data. The following ones can be used to lower bound the Euclidean distance.
20 40 60 80 100 120
DFT
20 40 60 80 100 120
DWT
20 40 60 80 100 120
SVD
20 40 60 80 100 120
APCA
20 40 60 80 100 120
PAA
20 40 60 80 100 120
PLA
Figure by Eamonn Keogh, Time-Series Tutorial
113
Tutorial | Time-Series with Matlab
Fourier Decomposition
Decompose a time-series into sum of sine waves
DFT:
IDFT:
Everysignal
signalcan
can
Every
be
represented
as
be represented as
a
superposition
of
a superposition of
sinesand
andcosines
cosines
sines
(alas
nobody
(alas nobody
believesme)
me)
believes
114
57
Tutorial | Time-Series with Matlab
Fourier Decomposition
Decompose a time-series into sum of sine waves
DFT:
IDFT:
fa = fft(a);
fft(a); % Fourier decomposition
fa(5:end) = 0; % keep first 5 coefficients (low frequencies)
reconstr = real(ifft(fa));
real(ifft(fa)); % reconstruct signal
X(f)
x(n)
-0.3633
-0.4446
-0.6280 + 0.2709i
-0.9864
-0.4929 + 0.0399i
-0.3254
-1.0143 + 0.9520i
-0.6938
0.7200 - 1.0571i
-0.1086
-0.0411 + 0.1674i
-0.3470
-0.5120 - 0.3572i
0.5849
0.9860 + 0.8043i
1.5927
-0.3680 - 0.1296i
-0.9430
-0.0517 - 0.0830i
-0.3037
-0.9158 + 0.4481i
-0.7805
1.1212 - 0.6795i
-0.1953
0.2667 + 0.1100i
-0.3037
0.2667 - 0.1100i
0.2381
1.1212 + 0.6795i
2.8389
-0.9158 - 0.4481i
-0.7046
-0.0517 + 0.0830i
-0.5529
-0.3680 + 0.1296i
-0.6721
0.9860 - 0.8043i
0.1189
-0.5120 + 0.3572i
0.2706
-0.0411 - 0.1674i
-0.0003
0.7200 + 1.0571i
1.3976
-1.0143 - 0.9520i
-0.4987
-0.4929 - 0.0399i
-0.2387
-0.6280 - 0.2709i
-0.7588
Life is complex, it has both real and imaginary parts.
115
Tutorial | Time-Series with Matlab
Fourier Decomposition
How much space we gain by compressing random walk data?
Reconstruction using 1coefficients
-5
50
100
150
200
250
1 coeff > 60% of energy
10 coeff > 90% of energy
116
58
Tutorial | Time-Series with Matlab
Fourier Decomposition
How much space we gain by compressing random walk data?
Reconstruction using 2coefficients
-5
50
100
150
200
250
1 coeff > 60% of energy
10 coeff > 90% of energy
117
Tutorial | Time-Series with Matlab
Fourier Decomposition
How much space we gain by compressing random walk data?
Reconstruction using 7coefficients
-5
50
100
150
200
250
1 coeff > 60% of energy
10 coeff > 90% of energy
118
59
Tutorial | Time-Series with Matlab
Fourier Decomposition
How much space we gain by compressing random walk data?
Reconstruction using 20coefficients
-5
50
100
150
200
250
1 coeff > 60% of energy
10 coeff > 90% of energy
119
Tutorial | Time-Series with Matlab
Fourier Decomposition
How much space we gain by compressing random walk data?
Energy Percentage
Error
1
1500
0.95
0.9
1000
0.85
0.8
0.75
500
0.7
0.65
0
20
40
60
80
Coefficients
100
120
20
40
60
80
Coefficients
100
120
1 coeff > 60% of energy
10 coeff > 90% of energy
120
60
Tutorial | Time-Series with Matlab
Fourier Decomposition
Which coefficients are important?
We can measure the energy of each coefficient
Energy = Real(X(fk))2 + Imag(X(fk))2
Most of data-mining research
uses first k coefficients:
Good for random walk
signals (eg stock market)
Easy to index
Not good for general signals
fa = fft(a);
fft(a); % Fourier decomposition
N = length(a);
length(a); % how many?
fa = fa(1:ceil(N/2));
fa(1:ceil(N/2)); % keep first half only
mag = 2*abs(fa).^2
; % calculate energy
2*abs(fa).^2;
121
Tutorial | Time-Series with Matlab
Fourier Decomposition
Which coefficients are important?
We can measure the energy of each coefficient
Energy = Real(X(fk))2 + Imag(X(fk))2
Usage of the coefficients with
highest energy:
Good for all types of signals
Believed to be difficult to
index
CAN be indexed using
metric trees
122
61
Tutorial | Time-Series with Matlab
Code for Reconstructed Sequence
a = load('randomWalk.dat');
a = aa-mean(a)/std(a);
X(f)
0
-0.6280 + 0.2709i
keep
% zz-normalization
-0.4929 + 0.0399i
-1.0143 + 0.9520i
0.7200 - 1.0571i
fa = fft(a);
-0.0411 + 0.1674i
maxInd = ceil(length(a)/2);
N = length(a);
-0.5120 - 0.3572i
% until the middle
0.9860 + 0.8043i
-0.3680 - 0.1296i
energy = zeros(maxIndzeros(maxInd-1, 1);
E = sum(a.^2);
-0.0517 - 0.0830i
% energy of a
-0.9158 + 0.4481i
1.1212 - 0.6795i
for ind=2:maxInd,
fa_N = fa;
fa_N(ind+1:Nfa_N(ind+1:N-ind+1) = 0;
r = real(ifft(fa_N));
end
Ignore
% copy fourier
% zero out unused
% reconstruction
0.2667 + 0.1100i
0.2667 - 0.1100i
1.1212 + 0.6795i
-0.9158 - 0.4481i
-0.0517 + 0.0830i
plot(r, 'r','LineWidth',2); hold on;
plot(a,'k');
title(['Reconstruction using ' num2str(indnum2str(ind-1) 'coefficients']);
set(gca,'plotboxaspectratio', [3 1 1]);
axis tight
pause;
% wait for key
cla;
% clear axis
keep
-0.3680 + 0.1296i
0.9860 - 0.8043i
-0.5120 + 0.3572i
-0.0411 - 0.1674i
0.7200 + 1.0571i
-1.0143 - 0.9520i
-0.4929 - 0.0399i
-0.6280 - 0.2709i
123
Tutorial | Time-Series with Matlab
Code for Plotting the Error
a = load('randomWalk.dat');
a = aa-mean(a)/std(a);
fa = fft(a);
maxInd = ceil(length(a)/2);
N = length(a);
energy = zeros(maxIndzeros(maxInd-1, 1);
E = sum(a.^2);
for ind=2:maxInd,
fa_N = fa;
fa_N(ind+1:Nfa_N(ind+1:N-ind+1) = 0;
r = real(ifft(fa_N));
% zz-normalization
This is the same
% until the middle
% energy of a
% copy fourier
% zero out unused
% reconstruction
energy(indenergy(ind-1) = sum(r.^2); % energy of reconstruction
error(inderror(ind-1) = sum(abs(rsum(abs(r-a).^2); % error
end
E = ones(maxIndones(maxInd-1, 1)*E;
error = E - energy;
ratio = energy ./ E;
subplot(1,2,1);
% left plot
plot([1:maxIndplot([1:maxInd-1], error, 'r', 'LineWidth',1.5);
subplot(1,2,2);
% right plot
plot([1:maxIndplot([1:maxInd-1], ratio, 'b', 'LineWidth',1.5);
124
62
Tutorial | Time-Series with Matlab
Lower Bounding using Fourier coefficients
Parsevals Theorem states that energy in the frequency domain equals the
energy in the time domain:
Euclidean distance
or, that
If we just keep some of the coefficients, their sum of squares always
underestimates (ie lower bounds) the Euclidean distance:
125
Tutorial | Time-Series with Matlab
Lower Bounding using Fourier coefficients -Example
x
y
Note the normalization
x = cumsum(randn(100,1));
cumsum(randn(100,1));
y = cumsum(randn(100,1));
cumsum(randn(100,1));
euclid_Time = sqrt(sum((xsqrt(sum((x-y).^2));
120.9051
fx = fft(x)/sqrt(length(x));
fft(x)/sqrt(length(x));
fy = fft(y)/sqrt(length(x));
fft(y)/sqrt(length(x));
euclid_Freq = sqrt(sum(abs(fx - fy).^2));
fy).^2));
Keeping 10 coefficients
the distance is:
115.5556 < 120.9051
120.9051
126
63
Tutorial | Time-Series with Matlab
Fourier Decomposition
O(nlogn)
O(nlogn)complexity
complexity
Tried
Triedand
andtested
tested
Hardware
Hardwareimplementations
implementations
Many
Manyapplications:
applications:
compression
compression
Not
Notgood
goodapproximation
approximationfor
for
bursty
signals
bursty signals
Not
Notgood
goodapproximation
approximationfor
for
signals
with
signals withflat
flatand
andbusy
busy
sections
sections
(requires
(requiresmany
manycoefficients)
coefficients)
smoothing
smoothing
periodicity
periodicitydetection
detection
127
Tutorial | Time-Series with Matlab
Wavelets Why exist?
Similar concept with Fourier decomposition
Fourier coefficients represent global contributions,
wavelets are localized
Fourier is good for smooth, random walk data,
but not for bursty data or flat data
128
64
Tutorial | Time-Series with Matlab
Wavelets (Haar) - Intuition
Wavelet coefficients, still represent an inner product
(projection) of the signal with some basis functions.
These functions have lengths that are powers of two (full
sequence length, half, quarter etc)
An arithmetic example
c-d00
X = [9,7,3,5]
c+d00
D
etc
Haar coefficients: {c, d00, d10, d11,}
Haar = [6,2,1,-1]
c = 6 = (9+7+3+5)/4
c + d00 = 6+2 = 8 = (9+7)/2
c - d00 = 6-2 = 4 = (3+5)/2
etc
See also:wavemenu
129
Tutorial | Time-Series with Matlab
Wavelets in Matlab
Specialized Matlab interface
for wavelets
See also:wavemenu
130
65
Tutorial | Time-Series with Matlab
PAA (Piecewise Aggregate Approximation)
also featured as Piecewise Constant Approximation
Represent time-series as a sequence of segments
Essentially a projection of the Haar coefficients in time
Reconstruction using 1coefficients
2
1
0
-1
-2
50
100
150
200
250
131
Tutorial | Time-Series with Matlab
PAA (Piecewise Aggregate Approximation)
also featured as Piecewise Constant Approximation
Represent time-series as a sequence of segments
Essentially a projection of the Haar coefficients in time
Reconstruction using 2coefficients
2
1
0
-1
-2
50
100
150
200
250
132
66
Tutorial | Time-Series with Matlab
PAA (Piecewise Aggregate Approximation)
also featured as Piecewise Constant Approximation
Represent time-series as a sequence of segments
Essentially a projection of the Haar coefficients in time
Reconstruction using 4coefficients
2
1
0
-1
-2
50
100
150
200
250
133
Tutorial | Time-Series with Matlab
PAA (Piecewise Aggregate Approximation)
also featured as Piecewise Constant Approximation
Represent time-series as a sequence of segments
Essentially a projection of the Haar coefficients in time
Reconstruction using 8coefficients
2
1
0
-1
-2
50
100
150
200
250
134
67
Tutorial | Time-Series with Matlab
PAA (Piecewise Aggregate Approximation)
also featured as Piecewise Constant Approximation
Represent time-series as a sequence of segments
Essentially a projection of the Haar coefficients in time
Reconstruction using 16coefficients
2
1
0
-1
-2
50
100
150
200
250
135
Tutorial | Time-Series with Matlab
PAA (Piecewise Aggregate Approximation)
also featured as Piecewise Constant Approximation
Represent time-series as a sequence of segments
Essentially a projection of the Haar coefficients in time
Reconstruction using 32coefficients
2
1
0
-1
-2
50
100
150
200
250
136
68
Tutorial | Time-Series with Matlab
PAA Matlab Code
function data = paa(s, numCoeff)
% PAA(s, numcoeff)
% s: sequence vector (Nx1 or Nx1)
% numCoeff: number of PAA segments
% data: PAA sequence (Nx1)
N = length(s);
segLen = N/numCoeff;
% length of sequence
% assume it's integer
sN = reshape(s, segLen, numCoeff);
avg = mean(sN);
data = repmat(avg, segLen, 1);
data = data(:);
%
%
%
%
break in segments
average segments
expand segments
make column
numCoeff
137
Tutorial | Time-Series with Matlab
PAA Matlab Code
function data = paa(s, numCoeff)
% PAA(s, numcoeff)
% s: sequence vector (Nx1 or Nx1)
% numCoeff: number of PAA segments
% data: PAA sequence (Nx1)
N = length(s);
segLen = N/numCoeff;
% length of sequence
% assume it's integer
sN = reshape(s, segLen, numCoeff);
avg = mean(sN);
data = repmat(avg, segLen, 1);
data = data(:);
%
%
%
%
N=8
segLen = 2
break in segments
average segments
expand segments
make column
numCoeff
138
69
Tutorial | Time-Series with Matlab
PAA Matlab Code
function data = paa(s, numCoeff)
% PAA(s, numcoeff)
% s: sequence vector (Nx1 or Nx1)
% numCoeff: number of PAA segments
% data: PAA sequence (Nx1)
N = length(s);
segLen = N/numCoeff;
% length of sequence
% assume it's integer
2
sN = reshape(s, segLen, numCoeff);
avg = mean(sN);
data = repmat(avg, segLen, 1);
data = data(:);
s
sN
N=8
segLen = 2
%
%
%
%
break in segments
average segments
expand segments
make column
numCoeff
139
Tutorial | Time-Series with Matlab
PAA Matlab Code
function data = paa(s, numCoeff)
% PAA(s, numcoeff)
% s: sequence vector (Nx1 or Nx1)
% numCoeff: number of PAA segments
% data: PAA sequence (Nx1)
N = length(s);
segLen = N/numCoeff;
% length of sequence
% assume it's integer
sN = reshape(s, segLen, numCoeff);
avg = mean(sN);
data = repmat(avg, segLen, 1);
data = data(:);
sN
1.5
3.5
5.5
7.5
avg
%
%
%
%
N=8
segLen = 2
break in segments
average segments
expand segments
make column
numCoeff
140
70
Tutorial | Time-Series with Matlab
PAA Matlab Code
function data = paa(s, numCoeff)
% PAA(s, numcoeff)
% s: sequence vector (1xN)
% numCoeff: number of PAA segments
% data: PAA sequence (1xN)
N = length(s);
segLen = N/numCoeff;
% length of sequence
% assume it's integer
sN = reshape(s, segLen, numCoeff);
avg = mean(sN);
2
data = repmat(avg, segLen, 1);
data = data(:)
data(:);
sN
1.5
3.5
5.5
7.5
avg
%
%
%
%
N=8
segLen = 2
break in segments
average segments
expand segments
make row
numCoeff
data
1.5
3.5
5.5
7.5
1.5
3.5
5.5
7.5
141
Tutorial | Time-Series with Matlab
PAA Matlab Code
function data = paa(s, numCoeff)
% PAA(s, numcoeff)
% s: sequence vector (1xN)
% numCoeff: number of PAA segments
% data: PAA sequence (1xN)
N = length(s);
segLen = N/numCoeff;
% length of sequence
% assume it's integer
sN = reshape(s, segLen, numCoeff);
avg = mean(sN);
data = repmat(avg, segLen, 1);
data = data(:)
data(:);
sN
1.5
3.5
5.5
7.5
avg
%
%
%
%
break in segments
average segments
expand segments
make row
numCoeff
data
data
N=8
segLen = 2
1.5
3.5
5.5
7.5
1.5
3.5
5.5
7.5
1.5
1.5
3.5
3.5
5.5
5.5
7.5
7.5
142
71
Tutorial | Time-Series with Matlab
APCA (Adaptive Piecewise Constant Approximation)
PAA
Not all haar/PAA coefficients
are equally important
Segments of
equal size
Intuition: Keep ones with the
highest energy
Segments of variable length
APCA
Segments of
variable size
APCA is good for bursty
signals
PAA requires 1 number per
segment, APCA requires
2: [value, length]
E.g. 10 bits for a
sequence of 1024 points
143
Tutorial | Time-Series with Matlab
Wavelet Decomposition
O(n)
O(n)complexity
complexity
Hierarchical
Hierarchicalstructure
structure
Progressive
Progressivetransmission
transmission
Most
Mostdata-mining
data-miningresearch
research
still
utilizes
still utilizesHaar
Haarwavelets
wavelets
because
becauseof
oftheir
theirsimplicity.
simplicity.
Better
Betterlocalization
localization
Good
Goodfor
forbursty
burstysignals
signals
Many
Manyapplications:
applications:
compression
compression
periodicity
periodicitydetection
detection
144
72
Tutorial | Time-Series with Matlab
Piecewise Linear Approximation (PLA)
Approximate a sequence
with multiple linear
segments
First such algorithms
appeared in cartography
for map approximation
Many implementations
Optimal
Greedy Bottom-Up
Greedy Top-down
Genetic, etc
You can find a bottom-up implementation here:
http://www.cs.ucr.edu/~eamonn/TSDMA/time_series_toolbox/
145
Tutorial | Time-Series with Matlab
Piecewise Linear Approximation (PLA)
Approximate a sequence
with multiple linear
segments
First such algorithms
appeared in cartography
for map approximation
146
73
Tutorial | Time-Series with Matlab
Piecewise Linear Approximation (PLA)
Approximate a sequence
with multiple linear
segments
First such algorithms
appeared in cartography
for map approximation
147
Tutorial | Time-Series with Matlab
Piecewise Linear Approximation (PLA)
Approximate a sequence
with multiple linear
segments
First such algorithms
appeared in cartography
for map approximation
148
74
Tutorial | Time-Series with Matlab
Piecewise Linear Approximation (PLA)
Approximate a sequence
with multiple linear
segments
First such algorithms
appeared in cartography
for map approximation
149
Tutorial | Time-Series with Matlab
Piecewise Linear Approximation (PLA)
Approximate a sequence
with multiple linear
segments
First such algorithms
appeared in cartography
for map approximation
150
75
Tutorial | Time-Series with Matlab
Piecewise Linear Approximation (PLA)
O(nlogn)
O(nlogn)complexity
complexityfor
for
bottom
up
bottom upalgorithm
algorithm
Visually
Visuallynot
notvery
verysmooth
smoothor
or
pleasing.
pleasing.
Incremental
Incrementalcomputation
computation
possible
possible
Provable
Provableerror
errorbounds
bounds
Applications
Applicationsfor:
for:
Image
Image//signal
signal
simplification
simplification
Trend
Trenddetection
detection
151
Tutorial | Time-Series with Matlab
Singular Value Decomposition (SVD)
SVD attempts to find the optimal basis for describing a set
of multidimensional points
Objective: Find the axis (directions) that describe better the
data variance
y
We need 2 numbers (x,y)
for every point
y
Now we can describe each
point with 1 number, their
projection on the line
New axis and position of points
(after projection and rotation)
152
76
Tutorial | Time-Series with Matlab
Singular Value Decomposition (SVD)
Each time-series is essentially a multidimensional point
Objective: Find the eigenwaves (basis) whose linear
combination describes best the sequences. Eigenwaves are
data-dependent.
AMxn = UMxr * rxr * VTnxr
eigenwave 0
Factoring of data array into 3
matrices
eigenwave 1
each of length n
eigenwave 4
A linear combination of the
eigenwaves can produce any
sequence in the database
M sequences
eigenwave 3
[U,S,V]
U,S,V] = svd(A)
svd(A)
153
Tutorial | Time-Series with Matlab
Singular Value Decomposition
Optimal
Optimaldimensionality
dimensionality
reduction
reductionin
inEuclidean
Euclidean
distance
sense
distance sense
Cannot
Cannotbe
beapplied
appliedfor
forjust
just
one
sequence.
A
set
one sequence. A setof
of
sequences
sequencesis
isrequired.
required.
SVD
SVDis
isaavery
verypowerful
powerfultool
tool
in
inmany
manydomains:
domains:
Websearch
Websearch(PageRank)
(PageRank)
Addition
Additionof
ofaasequence
sequencein
in
database
databaserequires
requires
recomputation
recomputation
Very
Verycostly
costlyto
tocompute.
compute.
2
2
Time:
min{
O(M
Time: min{ O(M2n),
n),O(Mn
O(Mn2)})}
Space:
Space:O(Mn)
O(Mn)
MMsequences
sequencesof
oflength
lengthnn
154
77
Tutorial | Time-Series with Matlab
Symbolic Approximation
Assign a different symbol based on range of values
Find ranges either from data histogram or uniformly
c
c
c
b
b
a
20
a
40
60
80
100
120
baabccbc
You can find an implementation here:
http://www.ise.gmu.edu/~jessica/sax.htm
155
Tutorial | Time-Series with Matlab
Symbolic Approximations
Linear
Linearcomplexity
complexity
After
Aftersymbolization
symbolizationmany
many
tools
toolsfrom
frombioinformatics
bioinformatics
can
canbe
beused
used
Markov
Markovmodels
models
Number
Numberof
ofregions
regions
(alphabet
(alphabetlength)
length)can
canaffect
affect
the
quality
of
result
the quality of result
Suffix-Trees,
Suffix-Trees,etc
etc
156
78
Tutorial | Time-Series with Matlab
Multidimensional Time-Series
Ari,are
areyou
yousure
surethe
the
Ari,
worldisisnot
not1D?
1D?
world
Catching momentum lately
Applications for mobile trajectories, sensor
networks, epidemiology, etc
Lets see how to approximate 2D
trajectories with
Minimum Bounding Rectangles
Aristotle
157
Tutorial | Time-Series with Matlab
Multidimensional MBRs
Find Bounding rectangles that completely contain a trajectory
given some optimization criteria (eg minimize volume)
On my income tax 1040 it says "Check this box
if you are blind." I wanted to put a check mark
about three inches away.
- Tom Lehrer
158
79
Tutorial | Time-Series with Matlab
Comparison of different Dim. Reduction Techniques
159
Tutorial | Time-Series with Matlab
So which dimensionality reduction is the best?
APCAisis
APCA
better
better
Fourierisis
Fourier
good
good
1993
PAA!
PAA!
thanPAA!
PAA!
than
2000
2001
Chebyshev
Chebyshev
better
isisbetter
thanAPCA!
APCA!
than
2004
The
The
future
future isis
symbolic!
symbolic!
2005
Absence of proof is no proof of absence.
- Michael Crichton
160
80
Tutorial | Time-Series with Matlab
Comparisons
Lets see how tight the lower bounds are for a variety on 65 datasets
Average Lower Bound
A. No approach
is better on all
datasets
Median Lower Bound
B. Best coeff.
techniques
can offer
tighter
bounds
C. Choice of
compression
depends on
application
Note: similar results also reported by Keogh in SIGKDD02
161
Tutorial | Time-Series with Matlab
PART II: Time Series Matching
Lower Bounding the DTW and LCSS
162
81
Tutorial | Time-Series with Matlab
Lower Bounding the Dynamic Time Warping
Recent approaches use the Minimum Bounding Envelope
for bounding the DTW
Create Minimum Bounding Envelope (MBE) of query Q
Calculate distance between MBE of Q and any sequence A
One can show that: D(MBE(Q)
D(MBE(Q),A) < DTW(Q,A)
DTW(Q,A)
LB = sqrt(sum([[A > U].* [A-U]; [A < L].* [L-A]].^2));
One Matlab command!
U
MBE(Q)
A
However, this representation
is uncompressed. Both MBE
and the DB sequence can be
compressed using any of the
previously mentioned
techniques.
163
Tutorial | Time-Series with Matlab
Lower Bounding the Dynamic Time Warping
LB by Keogh
approximate MBE and
sequence using MBRs
LB = 13.84
Q
LB by Zhu and Shasha
approximate MBE and
sequence using PAA
LB = 25.41
Q
A
164
82
Tutorial | Time-Series with Matlab
Lower Bounding the Dynamic Time Warping
An even tighter lower bound can be achieved by warping the MBE
approximation against any other compressed signal.
LB_Warp = 29.05
Lower Bounding approaches for DTW,
will typically yield at least an order of
magnitude speed improvement
compared to the nave approach.
Lets compare the 3 LB approaches:
165
Tutorial | Time-Series with Matlab
Time Comparisons
We will use DTW (and the corresponding LBs) for recognition of hand-written
digits/shapes.
Accuracy: Using DTW we can achieve recognition above 90%.
Running Time: runTime LB_Warp < runTime LB_Zhu < runTime LB-Keogh
Pruning Power: For some queries LB_Warp can examine up to 65 time
fewer sequences
166
83
Tutorial | Time-Series with Matlab
Upper Bounding the LCSS
Since LCSS measures similarity and similarity is the inverse of distance, to
speed up LCSS we need to upper bound it.
LCSS(MBE
LCSS(MBEQQ,A)
,A)>=
>=LCSS(Q,A)
LCSS(Q,A)
Query
Indexed Sequence
Sim.=50/77
= 0.64
44 points
6 points
167
Tutorial | Time-Series with Matlab
LCSS Application Image Handwriting
Library of Congress has 54 million
manuscripts (20TB of text)
Increasing interest for automatic
transcribing
Word annotation:
1.1.Extract
Extractwords
wordsfrom
fromdocument
document
2.2.Extract
image
features
Extract image features
3.3.Annotate
Annotateaasubset
subsetof
ofwords
words
4.4.Classify
Classifyremaining
remainingwords
words
Feature Value
0.8
0.6
0.4
0.2
0
50
100
150
200
Column
250
300
350
400
Features:
George Washington Manuscript
- Black pixels / column
- Ink-paper transitions/ col , etc
168
84
Tutorial | Time-Series with Matlab
LCSS Application Image Handwriting
Utilized 2D time-series (2 features)
Returned 3-Nearest Neighbors of following words
Classification accuracy > 70%
169
Tutorial | Time-Series with Matlab
PART II: Time Series Analysis
Test Case and Structural Similarity Measures
170
85
Tutorial | Time-Series with Matlab
Analyzing Time-Series Weblogs
PKDD 2005
Porto
Weblog of user
requests over
time
Priceline
171
Tutorial | Time-Series with Matlab
Weblog Data Representation
Record
We
can aggregate information, eg, number of requests per day for each
keyword
Query: Spiderman
Requests
May 2002. Spiderman 1
was released in theaters
Jan
Feb Mar
Apr May Jun
Jul
Aug Sep Okt
Capture trends and periodicities
Privacy preserving
Nov Dec
Google Zeitgeist
172
86
Tutorial | Time-Series with Matlab
Finding similar patterns in query logs
We can find useful patterns and correlation in the user demand
patterns which can be useful for:
Search engine optimization
Recommendations
Advertisement pricing (e.g. keyword more expensive at the popular
months)
Requests
Query: xbox
Query: ps2
Jan
Feb Mar
Apr May
Jun
Jul
Aug Sep Okt
Nov Dec
Game consoles are more
popular closer to Christmas
173
Tutorial | Time-Series with Matlab
Matching of Weblog data
Use Euclidean distance to match time-series. But which dimensionality
reduction technique to use?
Lets look at the data:
Query Bach
1 year span
The data is smooth and highly
periodic, so we can use Fourier
decomposition.
Instead of using the first Fourier
coefficients we can use the best ones
instead.
Lets see how the approximation will
look:
Query stock market
174
87
Tutorial | Time-Series with Matlab
First Fourier Coefficients vs Best Fourier Coefficients
Using the best coefficients, provides a
very high quality approximation of the
original time-series
175
Tutorial | Time-Series with Matlab
Matching results I
Query = Lance Armstrong
2000
2001
2002
LeTour
0
2000
2001
2002
Tour De France
0
2000
2001
2002
176
88
Tutorial | Time-Series with Matlab
Matching results II
Query = Christmas
2000
2001
2002
Knn4: Christmas coloring
books
Knn8: Christmas baking
Knn12: Christmas clipart
Knn20: Santa Letters
177
Tutorial | Time-Series with Matlab
Finding Structural Matches
The Euclidean distance cannot distill all the potentially useful
information in the weblog data.
Some data are periodic, while other are bursty. We will attempt to
provide similarity measures that are based on periodicity and
burstiness.
Query cinema. Weakly periodicity.
Peak of period every Friday.
Query Elvis. Burst in demand on
16th August. Death anniversary of
Elvis Presley
178
89
Tutorial | Time-Series with Matlab
Periodic Matching
Frequency
F ( x), F ( y )
Ignore Phase/
Keep important
components
Calculate
Distance
arg max || F ( x) ||, F ( x + )
k
arg max || F ( y ) ||, F ( y + )
k
cinema
D1 =|| F ( x + ) F ( y + ) ||
Periodogram
D2 =|| F ( x + ) F ( y + ) ||
stock
easter
0
10
15
20
25
30
35
40
45
50
10
15
20
25
30
35
40
45
50
christmas
179
Tutorial | Time-Series with Matlab
Matching Results with Periodic Measure
Now we can discover more flexible matches. We observe a clear
separation between seasonal and periodic sequences.
180
90
Tutorial | Time-Series with Matlab
Matching Results with Periodic Measure
Compute pairwise periodic distances and do a mapping of the
sequences on 2D using Multi-dimensional scaling (MDS).
181
Tutorial | Time-Series with Matlab
Matching Based on Bursts
Another method of performing structural matching can be achieved
using burst features of sequences.
Burst feature detection can be useful for:
Identification of important events
Query-by-burst
Harry Potter 2 (November 15 2002)
Harry Potter 1
(Movie)
50
100
Harry Potter 1
(DVD)
150
200
2002: Harry Potter demand
250
300
350
182
91
Tutorial | Time-Series with Matlab
Burst Detection
Burst detection is similar to anomaly detection.
Create distribution of values (eg gaussian model)
Any value that deviates from the observed distribution (eg more than 3
std) can be considered as burst.
Valentines
Day
Mothers
Day
183
Tutorial | Time-Series with Matlab
Query-by-burst
To perform query-by-burst we can perform the following steps:
1.
Find burst regions in given query
2.
Represent query bursts as time segments
3.
Find which sequences in DB have overlapping burst regions.
184
92
Tutorial | Time-Series with Matlab
Query-by-burst Results
Queries
www.nhc.noaa.gov
Pentagon attack
Cheap gifts
Matches
Nostradamus prediction
Tropical Storm
Scarfs
185
Tutorial | Time-Series with Matlab
Structural Similarity Measures
Periodic similarity achieves high clustering/classification accuracy in
ECG data
DTW
34
33
30
35
27
26
36
31
28
32
29
25
24
21
17
13
23
20
22
19
15
18
16
14
11
7
9
6
3
2
10
4
12
8
5
1
Periodic Measure
Incorrect
Grouping
36
35
33
28
27
26
32
34
30
31
29
25
18
23
20
19
17
24
22
16
14
15
21
13
12
8
2
7
11
5
9
3
10
6
4
1
186
93
Tutorial | Time-Series with Matlab
Structural Similarity Measures
Periodic similarity is a very powerful visualization tool.
Random W alk
Random W alk
Sunspots: 1869 to 1990
Sunspots: 1749 to 1869
Great Lakes (Ontario)
Great Lakes (Erie)
Power Demand: April-June (Dutch)
Power Demand: Jan-March (Dutch)
Power Demand: April-June (Italian)
Power Demand: Jan-March (Italian)
Random
Random
Video Surveillance: Eamonn, no gun
Video Surveillance: Eamonn, gun
Video Surveillance: Ann, no gun
Video Surveillance: Ann, gun
Koski ECG: fast 2
Koski ECG: fast 1
Koski ECG: slow 2
Koski ECG: slow 1
MotorCurrent: healthy 2
MotorCurrent: healthy 1
MotorCurrent: broken bars 2
MotorCurrent: broken bars 1
187
Tutorial | Time-Series with Matlab
Structural Similarity Measures
Burst correlation can provide useful insights for understanding which
sequences are related/connected. Applications for:
Gene Expression Data
Stock market data (identification of causal chains of events)
Query: Which stocks exhibited trading bursts during 9/11 attacks?
PRICELINE:
Stock value dropped
NICE SYSTEMS:
Stock value increased
(provider of air traffic
control systems)
188
94
Tutorial | Time-Series with Matlab
Conclusion
The traditional shape matching measures cannot address all timeseries matching problems and applications.
Structural distance measures can provide more flexibility.
There are many other exciting time-series problems that havent been
covered in this tutorial:
Anomaly Detection
dontwant
wantto
to
IIdont
achieve
immortality
achieve immortality
throughmy
myworkI
workI
through
want
to
achieve
want to achieve itit
throughnot
notdying.
dying.
through
Frequent pattern Discovery
Rule Discovery
etc
189
95