Università degli Studi di Milano
Master Degree in Computer Science
Information Management
         course
      Teacher: Alberto Ceselli
      Lecture 15: 25/11/2014
      Data Mining:
          Concepts and
           Techniques
                  (3rd ed.)
            — Chapter 6 —
  Jiawei Han, Micheline Kamber, and Jian Pei
 University of Illinois at Urbana-Champaign &
           Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
                                                2
Chapter 5: Mining Frequent Patterns, Association
 and Correlations: Basic Concepts and Methods
    Basic Concepts
    Frequent Itemset Mining Methods
    Which Patterns Are Interesting?—Pattern
     Evaluation Methods
    Summary
                                                   3
                What Is Frequent Pattern
                        Analysis?
   Frequent pattern: a pattern (a set of items, subsequences,
    substructures, etc.) that occurs frequently in a data set
   First proposed by Agrawal, Imielinski, and Swami (1993) in the
    context of frequent itemsets and association rule mining
   Motivation: Finding inherent regularities in data
       What products were often purchased together? Beer and
        diapers?!
       What are the subsequent purchases after buying a PC?
       What kinds of DNA are sensitive to this new drug?
       Can we automatically classify web documents?
   Applications
       Basket data analysis, cross-marketing ..., Web log (click stream)
        analysis, and DNA sequence analysis.                            4
                    Basic Concepts: Frequent
                            Patterns
Tid               Items bought                 itemset: A set of one or
10               Beer, Nuts, Diaper             more items
20           Beer, Coffee, Diaper              k-itemset X = {x1, …, xk}
30               Beer, Diaper, Eggs
40                Nuts, Eggs, Milk
                                               (absolute) support, or,
50       Nuts, Coffee, Diaper, Eggs,
                                                support count of X: number
                    Milk                        of occurrences of an itemset
                 Customer                       X in the dataset
                              Customer
                 buys both    buys diaper      (relative) support, s, is the
                                                fraction of transactions that
                                                contains X (i.e., the
                                                probability that a
                                                transaction contains X)
     Customer
     buys beer
                                               An itemset X is frequent if
                                                X’s support is no less than a
                                                minsup threshold                6
       Basic Concepts: Association Rules
 Tid        Items bought                  Find all the rules X  Y fixing
 10       Beer, Nuts, Diaper
                                           a minimum support and
 20      Beer, Coffee, Diaper
 30       Beer, Diaper, Eggs               confidence
 40        Nuts, Eggs, Milk                  support, s, probability that
 50 Nuts, Coffee, Diaper, Eggs, Milk
                                              a transaction contains X 
          Customer
                         Customer             Y
          buys both
                         buys
                         diaper
                                             confidence, c, conditional
                                              probability that a
                                              transaction having X also
   Customer                                   contains Y
   buys beer
Let minsup = 50%, minconf = 50% Association rules: (many more!)
Freq. Pat.: Beer:3, Nuts:3, Diaper:4,  Beer  Diaper (60%, 100%)
Eggs:3, {Beer, Diaper}:3              
                                        Diaper  Beer (60%, 75%)
                                                                             7
           Closed Patterns and Max-
                   Patterns
   A (long) pattern contains a combinatorial number
    of sub-patterns, e.g., {a1, …, a100} contains
             ( )( ) ( )
             100 + 100 +...+ 100 =2100−1=1.27⋅1030
              1     2        100
    sub-patterns!
   Idea: restrict to closed and maximal patterns
      An itemset X is a closed p. if X is frequent and
       there exists no super-pattern Y ⊃ X, with the
       same support as X
      An itemset X is a maximal p. if X is frequent and
       there exists no super-pattern Y ⊃ X, which is
       also frequent
   Closed pattern is a lossless compression of freq.
    Patterns: reducing the # of patterns and rules         8
           Closed Patterns and Max-
                   Patterns
   Exercise.
    DB = {<a1, …, a100>, < a1, …, a50>}
       Min_sup = 1.
   What is the set of closed itemset?
       <a1, …, a100>: 1
       < a1, …, a50>: 2
   What is the set of maximal pattern?
       <a1, …, a100>: 1
   What is the set of all patterns? !!
                                          9
Computational Complexity of Frequent
          Itemset Mining
   How many itemsets are potentially to be generated in the worst
    case?
       The number of frequent itemsets to be generated is sensitive
        to the minsup threshold
       When minsup is low, there exist potentially an exponential
        number of frequent itemsets
       The worst case: MN where M: # distinct items, and N: max
        length of transactions
   The worst case complexty vs. the expected probability
       Ex. Suppose Amazon has 104 kinds of products
         
             The chance to pick up one product 10-4
         
             The chance to pick up a particular set of 10 products: ~10-
             40
         
             What is the chance this particular set of 10 products to be
             frequent (e.g. 103 times in 109 transactions)?                10
Chapter 5: Mining Frequent Patterns, Association
 and Correlations: Basic Concepts and Methods
    Basic Concepts
    Frequent Itemset Mining Methods
    Which Patterns Are Interesting?—Pattern
     Evaluation Methods
    Summary
                                               11
    Scalable Frequent Itemset Mining
                Methods
   Apriori: Candidate Generate&Test Approach
   Improving the Efficiency of Apriori
   FPGrowth: A Frequent Pattern-Growth
    Approach
   ECLAT: Frequent Pattern Mining with
    Vertical Data Format
                                                12
    The Downward Closure Property and
         Scalable Mining Methods
   The downward closure property of frequent patterns
      Any subset of a frequent itemset is frequent
      If {beer, diaper, nuts} is frequent, so is {beer,
       diaper}
      i.e., every transaction having {beer, diaper, nuts}
       also contains {beer, diaper}
   Scalable mining methods: Three major approaches
      Apriori (Agrawal & Srikant@VLDB’94)
      Freq. pattern growth (FPgrowth—Han, Pei & Yin
       @SIGMOD’00)
      Vertical data format approach (Charm—Zaki &
       Hsiao @SDM’02)
                                                         13
    Apriori: A Candidate Generate & Test
                  Approach
   Apriori pruning principle: If there is any itemset
    which is infrequent, its superset should not be
    generated/tested! (Agrawal & Srikant @VLDB’94,
    Mannila, et al. @ KDD’ 94)
   Method:
       Initially, scan DB once to get frequent 1-itemset
       Generate length (k+1) candidate itemsets from
        length k frequent itemsets
       Test the candidates against DB
       Terminate when no frequent or candidate set can
        be generated                                    14
        The Apriori Algorithm—An Example
             Supmin = 2                Itemset       sup
                                                                    Itemset       sup
Database TDB                              {A}            2
                                                              L1         {A}       2
 Tid   Items        C1                    {B}            3
                                                                         {B}       3
 10         A, C, D                       {C}            3
 20         B, C, E        1st scan                                      {C}       3
                                          {D}            1
                                                                         {E}       3
 30     A, B, C, E                        {E}            3
 40          B, E
                               C2     Itemset    sup                 C2    Itemset
  L2                                  {A, B}         1
       Itemset        sup                                    2nd scan          {A, B}
        {A, C}         2              {A, C}         2
                                                                               {A, C}
        {B, C}         2              {A, E}         1
                                                                               {A, E}
        {B, E}         3              {B, C}         2
                                                                               {B, C}
        {C, E}         2              {B, E}         3
                                                                               {B, E}
                                      {C, E}         2
                                                                               {C, E}
       C3    Itemset
                               3rd scan         L3   Itemset       sup
             {B, C, E}                               {B, C, E}      2
                                                                                        15
  The Apriori Algorithm (Pseudo-
                     Code)
Ck: Candidate itemset of size k
Lk : frequent itemset of size k
L1 = {frequent items};
for (k = 1; Lk !=; k++) do begin
  Ck+1 = candidates generated from Lk;
  for each transaction t in database do
    increment the count of all candidates in Ck+1
     that are contained in t
  Lk+1 = candidates in Ck+1 with enough support
  end
return k Lk;                                       16
             Implementation of Apriori
   How to generate candidates?
       Step 1: self-joining Lk
       Step 2: pruning
   Example of Candidate-generation
       L3={abc, abd, acd, ace, bcd}
       Self-joining: L3*L3
         
             abcd from abc and abd
         
             acde from acd and ace
       Pruning:
            acde is removed because ade is not in L3
       C4 = {abcd}
                                                        17
    How to Count Supports of Candidates?
   Why counting supports of candidates is a problem?
        The total number of candidates can be huge
        Each transaction may contain many candidates
   Method:
        Candidate itemsets are stored in a hash-tree
        Leaf nodes of hash-tree contain a list of itemsets
         and counts
        Interior nodes contain a hash table
        Subset function: finds all the candidates contained
         in a transaction
                                                              18
       Counting Supports of Candidates
              Using Hash Tree
         Subset function
                                 Transaction: 1 2 3 5 6
                   3,6,9
        1,4,7
              2,5,8
              1+2356
           13+56                          234
                                          567
                           145                 345        356   367
                                         136                    368
                                                          357
           12+356
                                                          689
                           124
                           457 125     159
                               458
Build: store only frequent candidates and their count; do it
incrementally while building Lk
Query for a candidate: visit the tree;
Query for an itemset: perform a visit for each sub-itemset;           19
     Generating Association Rules from
             frequent itemsets
   When all frequent itemsets are found, generate
    strong association rules:
      Pick each frequent itemset F, generate all its
       nonempty subsets
      For each such subset S, test the rule
         
           S → (F \ S)
      support(S → (F \ S)) is above the threshold (as F is
       frequent by construction)
      confidence (S → (F \ S)) = P( (F \ S) | S ) =
           count(F) / count(S)
         
           count(F) and count(S) are known, and so
           checking is quick
                                                              20
           Candidate Generation: An SQL
                 Implementation
   SQL Implementation of candidate generation
      Suppose the items in Lk-1 are listed in an order
        Step 1: self-joining Lk-1
          insert into Ck
          select p.item1, p.item2, …, p.itemk-1, q.itemk-1
          from Lk-1 p, Lk-1 q
          where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 <
           q.itemk-1
        Step 2: pruning
          forall itemsets c in Ck do
               forall (k-1)-subsets s of c do
                   if (s is not in Lk-1) then delete c from Ck
   Use object-relational extensions like UDFs, BLOBs, and Table
    functions for efficient implementation [See: S. Sarawagi, S. Thomas,
    and R. Agrawal. Integrating association rule mining with relational
    database systems: Alternatives and implications. SIGMOD’98]            21
    Scalable Frequent Itemset Mining
                Methods
   Apriori: A Candidate Generation-and-Test Approach
   Improving the Efficiency of Apriori
   FPGrowth: A Frequent Pattern-Growth Approach
   ECLAT: Frequent Pattern Mining with Vertical Data
    Format
   Mining Close Frequent Patterns and Maxpatterns
                                                        22
Further Improvement of the Apriori Method
    Major computational challenges
        Multiple scans of transaction database
        Huge number of candidates
        Tedious workload of support counting for
         candidates
    Improving Apriori: general ideas
        Reduce passes of transaction database scans
        Shrink number of candidates
        Facilitate support counting of candidates
                                                       23
     Partition: Scan Database Only
                  Twice
   Any itemset that is potentially frequent in DB must
    be frequent in at least one of the partitions of DB
      Scan 1: partition database and find local
       frequent patterns
      Scan 2: consolidate global frequent patterns
   A. Savasere, E. Omiecinski and S. Navathe ’95
        Sampling for Frequent Patterns
   Select a sample of original database, mine
    frequent patterns within sample using Apriori
   Scan database once to verify frequent itemsets
    found in sample, only borders of closure of
    frequent patterns are checked
       Example: check abcd instead of ab, ac, …, etc.
   Scan database again to find missed frequent
    patterns
   H. Toivonen. Sampling large databases for
    association rules. In VLDB’96
                                                         26
                Dynamic Itemset Counting:
                 Reduce Number of Scans
              ABCD
                                              Once both A and D are determined
                                               frequent, the counting of AD begins
  ABC ABD ACD BCD                             Once all length-2 subsets of BCD are
                                               determined frequent, the counting of
                                               BCD begins
 AB   AC      BC       AD   BD     CD
                                                          Transactions
                                                        1-itemsets
      A     B      C    D
                                 Apriori                2-itemsets
                                                             …
               {}
          Itemset lattice                               1-itemsets
                                                                                      1st pass
                                                         2-items
S. Brin R. Motwani, J. Ullman,
and S. Tsur. Dynamic itemset DIC                                         3-items
counting and implication
                                                                                      2nd pass
rules for market basket data.
SIGMOD’97                                                                          27