KEMBAR78
Sequential pattern mining | PPTX
GUIDE : MS. ANAGHA CHAUDHARI
A sequence : < (ef) (ab) (df) c b >
A sequence database
SID        sequence             An element may contain a set of items.
                                Items within an element are unordered
10     <a(abc)(ac)d(cf)>
                                and we list them alphabetically.
20      <(ad)c(bc)(ae)>
30      <(ef)(ab)(df)cb>        <a(bc)df> is a subsequence of
40        <eg(af)cbc>           <a(abc)(ac)d(cf)>

  Given support threshold min_sup =2, <(ab)c> is a sequential
  pattern                                                                6
CHALLENGES ON SEQUENTIAL
PATTERN MINING
 A huge number of possible sequential patterns are hidden in
  databases

 A mining algorithm should
    find the complete set of patterns, when possible, satisfying the
     minimum support (frequency) threshold
    be highly efficient, scalable, involving only a small number of
     database scans
    be able to incorporate various kinds of user-specific
     constraints

                                                        7
The Apriori Algorithm—An Example
                      Supmin = 2      Itemset       sup
                                                                     Itemset     sup
Database TDB                             {A}         2
 Tid        Items
                                                           L1          {A}         2
                               C1        {B}         3
                                                                       {B}         3
 10         A, C, D                      {C}         3
                          1st scan                                     {C}         3
 20         B, C, E                      {D}         1
                                                                       {E}         3
 30     A, B, C, E                       {E}         3
 40          B, E
                              C2     Itemset    sup               C2         Itemset
                                      {A, B}     1
 L2    Itemset        sup                                 2nd scan            {A, B}
                                      {A, C}     2
        {A, C}         2                                                      {A, C}
                                      {A, E}     1
        {B, C}         2
                                      {B, C}     2                            {A, E}
        {B, E}         3
                                      {B, E}     3                            {B, C}
        {C, E}         2
                                      {C, E}     2                            {B, E}
                                                                              {C, E}

              Itemset
                              3rd scan         L3   Itemset     sup
       C3
              {B, C, E}                             {B, C, E}    2
                                                                                       10
The Apriori Algorithm [Pseudo-Code]

Ck: Candidate itemset of size k
Lk : frequent itemset of size k

L1 = {frequent items};
for (k = 1; Lk != ; k++) do begin
  Ck+1 = candidates generated from Lk;
  for each transaction t in database do
    increment the count of all candidates in Ck+1 that are
     contained in t
  Lk+1 = candidates in Ck+1 with min_support
  end
return k Lk;
                                                             11
APRIORI ADV/DISADV

 Advantages:
   Uses large itemset property.
   Easily parallelized
   Easy to implement.

 Disadvantages:
   Assumes transaction database is memory resident.
   Requires up to m database scans.
   J. Han, J. Pei, and Y. Yin 2000
   Depth-first search
   Avoid explicit candidate generation
   Adopt divide-and-conquer strategy
   Two step approach
    Step1:Build a compact data
          structure called FP tree
    Step2:Extract frequent itemsets
           from FP tree.
Step 1: FP-Tree Construction
 FP-Tree is constructed using 2 passes over the data-set:

  Pass 1:
    Scan data and find support for each item.
    Discard infrequent items.
    Sort frequent items in decreasing order based on
      their support.
Pass 2:

Nodes correspond to items and have a counter

1.     FP-Growth reads 1 transaction at a time and maps it to a path

2.     Fixed order is used, so paths can overlap when transactions share items (when
       they have the same prfix ).
     – In this case, counters are incremented

3.      Pointers are maintained between nodes containing the same item, creating singly
       linked lists (dotted lines)
     – The more paths that overlap, the higher the compression. FP-tree may fit in
       memory.

4.     Frequent itemsets extracted from the FP-Tree.
 Start from each frequent length-1 pattern (as an initial suffix
  pattern)
 construct its conditional pattern base (a ―subdatabase,‖which
  consists of the set of prefix paths in the FP-tree co-occurring
  with the suffix pattern)
 Construct its (conditional) FP-tree, and perform mining
  recursively on such a tree.
 The pattern growth is achieved by the concatenation of the
  suffix pattern with the frequent patterns generated from a
  conditional FP-tree.
Table : Table after
                             first scan of database
Table : Transactional data
Fig . FP – Tree Construction
EXAMPLE CONT




Table:Mining FP Tree by creating conditional (sub)-pattern bases
EXAMPLE CONT




Fig.The conditional FP-tree associated with the conditiona node I3
FP-FROWTH ADV/DISADV

Advantages of FP-Growth
  • only 2 passes over data-set
  • ―compresses‖ data-set
  • no candidate generation
  • much faster than Apriori

Disadvantages of FP-Growth
  • FP-Tree may not fit in memory!!
  • FP-Tree is expensive to build
APPLICATIONS



Customer shopping sequences:
   First buy computer, then CD-ROM, and then digital camera, within 3
    months.

Medical treatments, natural disasters (e.g., earthquakes), science
 & eng. processes, stocks and markets, etc.
Telephone calling patterns, Weblog click streams
DNA sequences and gene structures


                                                                  22
THANK YOU
Sequential pattern mining

Sequential pattern mining

  • 1.
    GUIDE : MS.ANAGHA CHAUDHARI
  • 6.
    A sequence :< (ef) (ab) (df) c b > A sequence database SID sequence An element may contain a set of items. Items within an element are unordered 10 <a(abc)(ac)d(cf)> and we list them alphabetically. 20 <(ad)c(bc)(ae)> 30 <(ef)(ab)(df)cb> <a(bc)df> is a subsequence of 40 <eg(af)cbc> <a(abc)(ac)d(cf)> Given support threshold min_sup =2, <(ab)c> is a sequential pattern 6
  • 7.
    CHALLENGES ON SEQUENTIAL PATTERNMINING A huge number of possible sequential patterns are hidden in databases A mining algorithm should  find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold  be highly efficient, scalable, involving only a small number of database scans  be able to incorporate various kinds of user-specific constraints 7
  • 10.
    The Apriori Algorithm—AnExample Supmin = 2 Itemset sup Itemset sup Database TDB {A} 2 Tid Items L1 {A} 2 C1 {B} 3 {B} 3 10 A, C, D {C} 3 1st scan {C} 3 20 B, C, E {D} 1 {E} 3 30 A, B, C, E {E} 3 40 B, E C2 Itemset sup C2 Itemset {A, B} 1 L2 Itemset sup 2nd scan {A, B} {A, C} 2 {A, C} 2 {A, C} {A, E} 1 {B, C} 2 {B, C} 2 {A, E} {B, E} 3 {B, E} 3 {B, C} {C, E} 2 {C, E} 2 {B, E} {C, E} Itemset 3rd scan L3 Itemset sup C3 {B, C, E} {B, C, E} 2 10
  • 11.
    The Apriori Algorithm[Pseudo-Code] Ck: Candidate itemset of size k Lk : frequent itemset of size k L1 = {frequent items}; for (k = 1; Lk != ; k++) do begin Ck+1 = candidates generated from Lk; for each transaction t in database do increment the count of all candidates in Ck+1 that are contained in t Lk+1 = candidates in Ck+1 with min_support end return k Lk; 11
  • 12.
    APRIORI ADV/DISADV  Advantages:  Uses large itemset property.  Easily parallelized  Easy to implement.  Disadvantages:  Assumes transaction database is memory resident.  Requires up to m database scans.
  • 13.
    J. Han, J. Pei, and Y. Yin 2000  Depth-first search  Avoid explicit candidate generation  Adopt divide-and-conquer strategy  Two step approach Step1:Build a compact data structure called FP tree Step2:Extract frequent itemsets from FP tree.
  • 14.
    Step 1: FP-TreeConstruction  FP-Tree is constructed using 2 passes over the data-set: Pass 1:  Scan data and find support for each item.  Discard infrequent items.  Sort frequent items in decreasing order based on their support.
  • 15.
    Pass 2: Nodes correspondto items and have a counter 1. FP-Growth reads 1 transaction at a time and maps it to a path 2. Fixed order is used, so paths can overlap when transactions share items (when they have the same prfix ). – In this case, counters are incremented 3. Pointers are maintained between nodes containing the same item, creating singly linked lists (dotted lines) – The more paths that overlap, the higher the compression. FP-tree may fit in memory. 4. Frequent itemsets extracted from the FP-Tree.
  • 16.
     Start fromeach frequent length-1 pattern (as an initial suffix pattern)  construct its conditional pattern base (a ―subdatabase,‖which consists of the set of prefix paths in the FP-tree co-occurring with the suffix pattern)  Construct its (conditional) FP-tree, and perform mining recursively on such a tree.  The pattern growth is achieved by the concatenation of the suffix pattern with the frequent patterns generated from a conditional FP-tree.
  • 17.
    Table : Tableafter first scan of database Table : Transactional data
  • 18.
    Fig . FP– Tree Construction
  • 19.
    EXAMPLE CONT Table:Mining FPTree by creating conditional (sub)-pattern bases
  • 20.
    EXAMPLE CONT Fig.The conditionalFP-tree associated with the conditiona node I3
  • 21.
    FP-FROWTH ADV/DISADV Advantages ofFP-Growth • only 2 passes over data-set • ―compresses‖ data-set • no candidate generation • much faster than Apriori Disadvantages of FP-Growth • FP-Tree may not fit in memory!! • FP-Tree is expensive to build
  • 22.
    APPLICATIONS Customer shopping sequences:  First buy computer, then CD-ROM, and then digital camera, within 3 months. Medical treatments, natural disasters (e.g., earthquakes), science & eng. processes, stocks and markets, etc. Telephone calling patterns, Weblog click streams DNA sequences and gene structures 22
  • 26.