MINING SEQUENTIAL PATTERNS
Sequential Pattern Mining
Sequence database consists of sequences of ordered
elements or events (with or without time)
Sequential Pattern Mining is the mining of frequently
occurring ordered events or subsequences as patterns
Example:
Customer shopping sequences:
First buy computer, then CD-ROM, and then
digital camera, within 3 months.
Web access patterns, Weather prediction
Usually categorical or symbolic data
Numeric data is handled by Time Series Analysis instead
I = {I1, I2, ..., Ip}: set of items
Sequence s = <e1 e2 e3 ... el>
Ordered list of events
Each event is an element of the sequence
In shopping databases, an event is a shopping trip
in which the customer bought the items (x1 x2 ... xq)
All trips by a customer form a sequence
An item can occur at most once in an event, but
several times in a sequence
Sequence with length l : l-sequence
A sequence α = <a1 a2 ... an> is a subsequence of
β = <b1 b2 ... bm> if there exist
integers 1 <= j1 < j2 < ... < jn <= m such that
a1 ⊆ bj1, a2 ⊆ bj2, ..., an ⊆ bjn
A sequence database, S, is a set of tuples, <SID, s>,
where SID is a sequence ID and s is a sequence
A tuple <SID, s> is said to contain a sequence α if α
is a subsequence of s
The support of a sequence α in a sequence database S
is the number of tuples in the database containing α
Given a minimum support threshold min_sup, a sequence α
is frequent in sequence database S if support_S(α) >=
min_sup
A frequent sequence is called a sequential pattern
A sequential pattern with length l is called an l-pattern
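For illustration, a minimal Python sketch of the containment and support definitions above, assuming a sequence is a list of events (each a frozenset of items) and a database is a list of (SID, sequence) tuples:

def is_subsequence(alpha, beta):
    """True if alpha = <a1...an> is a subsequence of beta = <b1...bm>:
    there exist 1 <= j1 < ... < jn <= m with each ai a subset of b_ji."""
    j = 0
    for a in alpha:
        while j < len(beta) and not a <= beta[j]:   # a ⊆ b_j ?
            j += 1
        if j == len(beta):
            return False
        j += 1                                      # later events only
    return True

def support(alpha, database):
    """Number of <SID, s> tuples whose sequence s contains alpha."""
    return sum(is_subsequence(alpha, s) for sid, s in database)

# Toy database: <a(abc)(ac)d(cf)> becomes a list of frozensets
S = [(1, [frozenset('a'), frozenset('abc'), frozenset('ac'),
          frozenset('d'), frozenset('cf')]),
     (2, [frozenset('ad'), frozenset('c'), frozenset('bc'), frozenset('ae')])]
print(support([frozenset('a'), frozenset('c')], S))  # support of <ac>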
Sequential Patterns
Sample sequence database used in the examples (min_sup = 2):
SID 1: <a(abc)(ac)d(cf)>
SID 2: <(ad)c(bc)(ae)>
SID 3: <(ef)(ab)(df)cb>
SID 4: <eg(af)cbc>
Length of Sequence 1 is 9 (length counts items, not events)
Sequence 1 contains <a> multiple times, but it
contributes only one to the support of <a>
Sequence <a(bc)df> is a subsequence of Sequence 1
Support of <(ab)c> is 2 (present in Sequences 1 and 3),
so it is frequent, as it satisfies the minimum support of 2
Scalable Methods for Mining Sequential Patterns
Full Set Vs Closed Set
A sequential pattern s is closed if there exists no
sequential pattern s' such that s' is a proper
super-sequence of s and s' has the same support as s
All subsequences of a frequent sequence are also
frequent
Mining closed sequential patterns avoids
generation of unnecessary sub-sequences
GSP: candidate generate-and-test approach on
horizontal data format
SPADE: candidate generate-and-test approach on
vertical data format
PrefixSpan: does not require candidate generation
All approaches exploit the Apriori property: every non-
empty subsequence of a sequential pattern is a
sequential pattern
GSP: Sequential Pattern Mining
GSP (Generalized Sequential Patterns)
Multi-pass, Candidate generate and test approach
proposed by Agrawal and Srikant
Outline of the method
Initially, every item in DB is a candidate of
length-1
for each level (i.e., sequences of length-k) do
Scan database to collect support count for
each candidate sequence
Generate candidate length-(k+1) sequences
from length-k frequent sequences using
Apriori
A (k-1)-sequence w1 is merged with
another (k-1)-sequence w2 to produce a
candidate k-sequence if the subsequence
obtained by removing the first event in w1
is the same as the subsequence obtained
by removing the last event in w2
repeat until no frequent sequence or no candidate
can be found
Major strength: candidate pruning by the Apriori property
Weakness: generates a large number of candidates
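For illustration, a simplified Python sketch of the GSP join and Apriori pruning steps described above, assuming each event holds a single item so that sequences are plain tuples of items:

def gsp_join(freq_k_minus_1):
    """Merge two frequent (k-1)-sequences w1, w2 into a candidate k-sequence
    when w1 minus its first event equals w2 minus its last event."""
    candidates = set()
    for w1 in freq_k_minus_1:
        for w2 in freq_k_minus_1:
            if w1[1:] == w2[:-1]:                 # join condition
                candidates.add(w1 + (w2[-1],))    # extend w1 by the last event of w2
    return candidates

def apriori_prune(candidates, freq_k_minus_1):
    """Apriori pruning: drop a candidate if some (k-1)-subsequence obtained
    by deleting one event is not frequent."""
    freq = set(freq_k_minus_1)
    return {c for c in candidates
            if all(c[:i] + c[i + 1:] in freq for i in range(len(c)))}

# Example: frequent 2-sequences -> pruned candidate 3-sequences
f2 = [('a', 'b'), ('b', 'c'), ('a', 'c'), ('c', 'c')]
print(sorted(apriori_prune(gsp_join(f2), f2)))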
SPADE: Sequential Pattern Mining on Vertical data
format
SPADE (Sequential PAttern Discovery using
Equivalence classes)
Vertical data format: <itemset: (sequence_ID, event_ID)>
ID_list : Set of (sequence_ID, event_ID) pairs for a
given itemset
Mapping from horizontal to vertical format requires
one scan
Support of k-sequences can be determined by joining
the ID lists of (k-1) sequences
To find candidate 2-sequences, two frequent single
items are joined if they share the same sequence
identifier and their event identifiers follow a
sequential ordering (one occurs before the other)
Patterns are grown similarly
Reduces scans of the sequence database
As the length of the frequent sequence increases, the
size of its ID_list decreases, resulting in fast joins
But a large set of candidates is still generated
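For illustration, a minimal Python sketch of the vertical format and temporal join, assuming single-item events and using the event position as the event_ID (a full SPADE implementation also handles itemsets and equivalence classes):

from collections import defaultdict

def to_vertical(database):
    """database: list of (SID, sequence); each sequence is a list of items,
    with the position used as the event_ID."""
    id_lists = defaultdict(list)
    for sid, seq in database:
        for eid, item in enumerate(seq, start=1):
            id_lists[item].append((sid, eid))
    return id_lists

def temporal_join(id_list_x, id_list_y):
    """ID_list of <x y>: pairs (sid, eid_y) such that x occurs earlier in sid."""
    return [(sid_y, eid_y) for sid_y, eid_y in id_list_y
            if any(sid_x == sid_y and eid_x < eid_y
                   for sid_x, eid_x in id_list_x)]

def support(id_list):
    return len({sid for sid, _ in id_list})      # count distinct sequences

db = [(1, ['a', 'b', 'c']), (2, ['b', 'a', 'c']), (3, ['a', 'c'])]
v = to_vertical(db)                              # one scan: horizontal -> vertical
print(support(temporal_join(v['a'], v['b'])))    # support of <a b> (here: 1)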
PrefixSpan: Prefix-Projected Sequential Pattern
Growth
Pattern growth does not require candidate generation
In frequent-itemset mining (FP-growth), the method
finds frequent single items, constructs an FP-tree, and
generates the projected databases associated with each
frequent item from the FP-tree
PrefixSpan applies the same pattern-growth idea to
sequences: it builds prefix patterns which it concatenates
with suffix patterns to find frequent sequential patterns
PrefixSpan
Given a sequence α = <e1 e2 ... en> (where each ei
corresponds to a frequent event), a sequence
β = <e'1 e'2 ... e'm> (m <= n) is called a prefix of α if
e'i = ei for i <= m-1
e'm ⊆ em
all frequent items in (em - e'm) are alphabetically
after those in e'm
γ = <e''m em+1 ... en> is called the suffix of α
w.r.t. prefix β, where e''m = (em - e'm)
Mining Sequential patterns:
Let {<x1>, <x2>, ..., <xn>} be the complete set of length-1
sequential patterns
The complete set of sequential patterns in S
can be partitioned into n disjoint subsets
The ith subset is the set of sequential patterns with
prefix <xi>
Let α be a length-l sequential pattern and {β1,
β2, ..., βm} the set of all length-(l+1) sequential
patterns with prefix α
The complete set of sequential patterns with prefix α,
apart from α itself, can be partitioned into m
disjoint subsets
The jth subset is the set of sequential patterns with
prefix βj
PrefixSpan Example
Step 1: find length-1 sequential patterns
<a>, <b>, <c>, <d>, <e>, <f>
Step 2: divide search space. The complete set of seq.
pat. can be partitioned into 6 subsets:
The ones having prefix <a>;
The ones having prefix <b>;
... ;
The ones having prefix <f>
To find the patterns having prefix <a>, only the
projections w.r.t. <a> need to be considered
<a>-projected database (only the first occurrence of
a in each sequence is considered):
<(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
In the <a>-projected database the frequent items are
a:2, b:4, _b:2, c:4, d:2 and f:2
Find all the length-2 seq. pat. having prefix <a>:
<aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
Further partition into 6 subsets:
Having prefix <aa>: {<(_bc)(ac)d(cf)>, <(_e)>}
No frequent subsequences
Having prefix <(ab)>: {<(_c)(ac)d(cf)>, <(df)cb>}
Frequent patterns <c>, <d>, <f>, <dc>, giving
<(ab)c>, <(ab)d>, <(ab)f>, <(ab)dc>
Having prefix <ac>: {<(ac)d(cf)>, <(bc)(ae)>, <b>, <bc>}
Frequent patterns <a>, <b>, <c>, giving
<aca>, <acb>, <acc>
... (the remaining prefixes are processed similarly)
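For illustration, the recursion underlying this example can be sketched in Python, assuming single-item events (so itemset prefixes such as <(ab)> are not handled):

from collections import Counter

def prefixspan(database, min_sup, prefix=()):
    """Grow patterns by recursing on the projected database of each frequent item."""
    patterns = []
    counts = Counter(item for seq in database for item in set(seq))  # once per sequence
    for item, sup in sorted(counts.items()):
        if sup >= min_sup:
            new_prefix = prefix + (item,)
            patterns.append((new_prefix, sup))
            # project: keep the suffix after the first occurrence of item
            projected = [seq[seq.index(item) + 1:] for seq in database if item in seq]
            patterns += prefixspan(projected, min_sup, new_prefix)
    return patterns

db = [['a', 'b', 'c', 'd'], ['b', 'a', 'c'], ['a', 'c', 'b'], ['a', 'b', 'd']]
for pattern, sup in prefixspan(db, min_sup=2):
    print(pattern, sup)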
PrefixSpan
No candidate sequence needs to be generated
Projected databases keep shrinking
Major cost of PrefixSpan: constructing projected
databases
Can be improved by pseudo-projections
Pseudo-projection
Registers the index of the corresponding sequence
and the starting position of the projected suffix in
the sequence instead of physical projection
Avoids physically copying postfixes
Efficient in running time and space when the
database can be held in main memory
For large databases, a combination of physical and
pseudo-projection is used
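For illustration, a minimal Python sketch of pseudo-projection, assuming single-item events: each projected "sequence" is just a (sequence index, start offset) pair into the in-memory database, so no suffix is copied.

def pseudo_project(db, pointers, item):
    """pointers: list of (seq_index, offset); project on the next occurrence of item."""
    projected = []
    for i, off in pointers:
        try:
            pos = db[i].index(item, off)
            projected.append((i, pos + 1))   # the suffix starts right after the match
        except ValueError:
            pass                             # item absent: the sequence drops out
    return projected

db = [['a', 'b', 'c', 'd'], ['b', 'a', 'c']]
start = [(i, 0) for i in range(len(db))]
print(pseudo_project(db, start, 'a'))        # [(0, 1), (1, 2)]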
Sequential Pattern Mining
Performance ranking (fastest first): PrefixSpan, SPADE, GSP
All three are slow when there is a large number of
frequent subsequences
Closed Sequential Patterns
A closed sequential pattern has no proper super-sequence
with the same support
Mining only closed patterns reduces the number of
sequences considered
CloSpan
Based on equivalence of projected databases
Two projected sequence databases are
equivalent iff their total numbers of items match
CloSpan prunes non-closed sequences
whenever two projected databases are exactly
the same by stopping the growth of one
Requires Post-processing to eliminate any
remaining non-closed sequential patterns
The BIDE (BI-Directional Extension) algorithm avoids
this additional checking
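For illustration, a minimal Python sketch of the post-processing step mentioned above, assuming single-item events: it keeps only the patterns with no proper super-sequence of equal support.

def is_subseq(a, b):
    it = iter(b)
    return all(x in it for x in a)      # subsequence test over item tuples

def closed_patterns(patterns):
    """patterns: dict mapping pattern tuple -> support; return the closed ones."""
    return {p: sup for p, sup in patterns.items()
            if not any(len(q) > len(p) and sq == sup and is_subseq(p, q)
                       for q, sq in patterns.items())}

pats = {('a',): 3, ('a', 'b'): 3, ('a', 'c'): 2}
print(closed_patterns(pats))            # <a> is absorbed by <ab>; <ab> and <ac> remain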
Multi-dimensional, multi-level Sequential patterns
Additional information may be associated with the
Sequence ID: customer age, address, group, and
profession
Additional information may be associated with items:
item category, brand, model type, model number,
place
Example: Retired customers who purchase a
digital camera are likely to purchase a color
printer within a month
Additional information can be attached to the
Sequence ID / Item ID
Constraint based Mining of Sequential Patterns
Un-focused mining reduces the efficiency and
usability of frequent-pattern mining
Constraint based mining incorporates user-specified
constraints to reduce the search space
Regular expressions can be used to specify
pattern templates
Helps to improve efficiency of mining and
interestingness of patterns
Constraint based Mining
Constraints related to duration T
Constraints related to maximal or minimal length
Such constraints can be anti-monotonic, monotonic, or succinct
Anti-monotonic constraint: T <= 10
Monotonic constraint: T > 10
Succinct constraint: T = 2005
Periodic patterns related to sets of partitioned
sequences such as every two weeks before and
after an earthquake
Event folding window w
specifies the time window within which events are
treated as occurring together
w = 0: no event sequence folding
w = T (the whole duration): time-insensitive frequent patterns
Gap between events
gap = 0: strictly consecutive sequential patterns
min_gap and max_gap bound the allowed gap between
consecutive events (a containment check is sketched below)
Exact gaps and approximate gaps
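For illustration, a minimal Python sketch of gap-constrained containment, assuming each sequence is a list of (timestamp, item) pairs with single-item events: consecutive matched events must be separated by a gap in [min_gap, max_gap].

def contains_with_gaps(pattern, seq, min_gap, max_gap):
    """True if pattern occurs in seq with every consecutive gap in [min_gap, max_gap]."""
    def search(p_idx, start, prev_time):
        if p_idx == len(pattern):
            return True
        for i in range(start, len(seq)):
            t, item = seq[i]
            if item == pattern[p_idx]:
                gap_ok = prev_time is None or min_gap <= t - prev_time <= max_gap
                if gap_ok and search(p_idx + 1, i + 1, t):
                    return True
        return False
    return search(0, 0, None)

s = [(1, 'a'), (3, 'b'), (10, 'c')]
print(contains_with_gaps(['a', 'b'], s, min_gap=1, max_gap=5))   # True  (gap of 2)
print(contains_with_gaps(['b', 'c'], s, min_gap=1, max_gap=5))   # False (gap of 7)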
Serial Episodes
Set of events occurring in total order
Parallel Episodes
Occurrence order does not matter
Examples
(A|B)C*(D|E) A and B first occur (relative
order is unimportant) followed by any number
of C events followed by D and E (in any
order)
C = <a*{bb|(bc)d|dd}> a-projected databases
followed by SuffixSpan
Periodicity Analysis for Time-Related Sequence data
Periodicity analysis mining of periodic patterns
searching for recurring patterns in time-related
sequence data
e.g., seasons, tides, traffic patterns, power consumption
Often performed over time-series data
Full periodic pattern
Every point in time contributes (precisely or
approximately) to the periodicity
Example: All days in a year contribute to
season cycle
Partial periodic pattern
Specifies periodic behavior of a time-related
sequence at some but not all points of time
Example: ABC reads the newspaper between
7:00 and 7:30 am every weekday
Periodicity Analysis
Synchronous periodicity: the event occurs at a relatively
fixed offset in each stable period,
e.g., 3 pm every day
Asynchronous periodicity: the event fluctuates within a
loosely defined period
Precise or approximate depending on data value or
offset within a period
Mining partial periodicity leads to the discovery of
cyclic or periodic association rules (rules that
associate a set of events that occur periodically)
Example: If tea sells well between 3 and 5 pm, then
dinner will also sell well between 7 and 9 pm on
weekends