Sequential pattern mining

A sequence : < (ef) (ab) (df) c b >
A sequence database
SID sequence An element may contain a set of items.
Items within an element are unordered
10 <a(abc)(ac)d(cf)>
and we list them alphabetically.
20 <(ad)c(bc)(ae)>
30 <(ef)(ab)(df)cb> <a(bc)df> is a subsequence of
40 <eg(af)cbc> <a(abc)(ac)d(cf)>

Given support threshold min_sup =2, <(ab)c> is a sequential
pattern 6

CHALLENGES ON SEQUENTIAL
PATTERN MINING
A huge number of possible sequential patterns are hidden in
databases

A mining algorithm should
 find the complete set of patterns, when possible, satisfying the
minimum support (frequency) threshold
 be highly efficient, scalable, involving only a small number of
database scans
 be able to incorporate various kinds of user-specific
constraints

7

The Apriori Algorithm—An Example
Supmin = 2 Itemset sup
Itemset sup
Database TDB {A} 2
Tid Items
L1 {A} 2
C1 {B} 3
{B} 3
10 A, C, D {C} 3
1st scan {C} 3
20 B, C, E {D} 1
{E} 3
30 A, B, C, E {E} 3
40 B, E
C2 Itemset sup C2 Itemset
{A, B} 1
L2 Itemset sup 2nd scan {A, B}
{A, C} 2
{A, C} 2 {A, C}
{A, E} 1
{B, C} 2
{B, C} 2 {A, E}
{B, E} 3
{B, E} 3 {B, C}
{C, E} 2
{C, E} 2 {B, E}
{C, E}

Itemset
3rd scan L3 Itemset sup
C3
{B, C, E} {B, C, E} 2
10

The Apriori Algorithm [Pseudo-Code]

Ck: Candidate itemset of size k
Lk : frequent itemset of size k

L1 = {frequent items};
for (k = 1; Lk != ; k++) do begin
Ck+1 = candidates generated from Lk;
for each transaction t in database do
increment the count of all candidates in Ck+1 that are
contained in t
Lk+1 = candidates in Ck+1 with min_support
end
return k Lk;
11

APRIORI ADV/DISADV

 Advantages:
 Uses large itemset property.
 Easily parallelized
 Easy to implement.

 Disadvantages:
 Assumes transaction database is memory resident.
 Requires up to m database scans.

 J. Han, J. Pei, and Y. Yin 2000
 Depth-first search
 Avoid explicit candidate generation
 Adopt divide-and-conquer strategy
 Two step approach
Step1:Build a compact data
structure called FP tree
Step2:Extract frequent itemsets
from FP tree.

Step 1: FP-Tree Construction
 FP-Tree is constructed using 2 passes over the data-set:

Pass 1:
 Scan data and find support for each item.
 Discard infrequent items.
 Sort frequent items in decreasing order based on
their support.

Pass 2:

Nodes correspond to items and have a counter

1. FP-Growth reads 1 transaction at a time and maps it to a path

2. Fixed order is used, so paths can overlap when transactions share items (when
they have the same prfix ).
– In this case, counters are incremented

3. Pointers are maintained between nodes containing the same item, creating singly
linked lists (dotted lines)
– The more paths that overlap, the higher the compression. FP-tree may fit in
memory.

4. Frequent itemsets extracted from the FP-Tree.

 Start from each frequent length-1 pattern (as an initial suffix
pattern)
 construct its conditional pattern base (a ―subdatabase,‖which
consists of the set of prefix paths in the FP-tree co-occurring
with the suffix pattern)
 Construct its (conditional) FP-tree, and perform mining
recursively on such a tree.
 The pattern growth is achieved by the concatenation of the
suffix pattern with the frequent patterns generated from a
conditional FP-tree.

Table : Table after
first scan of database
Table : Transactional data

Fig . FP – Tree Construction

EXAMPLE CONT

Table:Mining FP Tree by creating conditional (sub)-pattern bases

EXAMPLE CONT

Fig.The conditional FP-tree associated with the conditiona node I3

FP-FROWTH ADV/DISADV

Advantages of FP-Growth
• only 2 passes over data-set
• ―compresses‖ data-set
• no candidate generation
• much faster than Apriori

Disadvantages of FP-Growth
• FP-Tree may not fit in memory!!
• FP-Tree is expensive to build

APPLICATIONS

Customer shopping sequences:
 First buy computer, then CD-ROM, and then digital camera, within 3
months.

Medical treatments, natural disasters (e.g., earthquakes), science
& eng. processes, stocks and markets, etc.
Telephone calling patterns, Weblog click streams
DNA sequences and gene structures

22

Sequential pattern mining

More Related Content

What's hot

Similar to Sequential pattern mining

Recently uploaded

In this document

Sequential pattern mining