DBM630: Data Mining and Data Warehousing
MS.IT. Rangsit University
Semester 2/2011


Lecture 5
Association Rule Mining

by Kritsada Sriphaew (sriphaew.k AT gmail.com)

1
Topics
 Association rule mining
 Mining single-dimensional association rules
 Mining multilevel association rules
 Other measurements: interest and conviction
 Association rule mining to correlation analysis




 2                          Data Warehousing and Data Mining by Kritsada Sriphaew
What is Association Mining?
      Association rule mining:
          Finding frequent patterns, associations, correlations, or causal
           structures among sets of items or objects in transaction
           databases, relational databases, and other information
           repositories.
      Applications:
          Basket data analysis, cross-marketing, catalog design,
           clustering, classification, etc.
   Ex.: Rule form: “Body ⇒ Head [support, confidence]”, i.e.,
        “Antecedent ⇒ Consequent [support, confidence]”
        buys(x, “diapers”) ⇒ buys(x, “beers”)                  [0.5%, 60%]
        major(x, “CS”) ^ takes(x, “DB”) ⇒ grade(x, “A”)        [1%, 75%]
   3                                  Data Warehousing and Data Mining by Kritsada Sriphaew
   3                                  Data Warehousing and Data Mining by Kritsada Sriphaew
A typical example of association rule mining is
market basket analysis.




4                      Data Warehousing and Data Mining by Kritsada Sriphaew
Rule Measures: Support/Confidence
       Find all the rules “Antecedent(s) ⇒ Consequent(s)” with minimum support and confidence
           support, s: probability that a transaction contains {A ∪ C}
           confidence, c: conditional probability that a transaction having A also contains C
       Let min. sup. = 50% and min. conf. = 50%:
           A ⇒ C (s=50%, c=66.7%)
           C ⇒ A (s=50%, c=100%)
           • Support = 50% means that 50% of all transactions under analysis show that
             A and C are purchased together
           • Confidence = 66.7% means that 66.7% of the customers who purchased A also bought C
       Typically association rules are considered interesting if they satisfy both
        a minimum support threshold and a minimum confidence threshold
       Such thresholds can be set by users or domain experts

       Transactional database
           Transaction ID      Items Bought
           2000                A, B, C
           1000                A, C
           4000                A, D
           5000                B, E, F

    5                                             Data Warehousing and Data Mining by Kritsada Sriphaew
Rule Measures: Support/Confidence
        TransID      Items Bought          Rule: A ⇒ C
        T001         A, B, C
        T002         A, C                  support(A ⇒ C)    = P(A ∪ C)
        T003         A, D                  confidence(A ⇒ C) = P(C|A) = P(A ∪ C) / P(A)
        T004         B, E, F

Frequencies: A = 3, B = 2, C = 2, AB = 1, AC = 2, BC = 1, ABC = 1
               •   A ⇒ C      (2/4 = 50%, 2/3 = 66.7%)
               •   C ⇒ A      (2/4 = 50%, 2/2 = 100%)
               •   A ⇒ B      (1/4 = 25%, 1/3 = 33.3%)
               •   B ⇒ A      (1/4 = 25%, 1/2 = 50%)
               •   A, B ⇒ C   (1/4 = 25%, 1/1 = 100%)
               •   A, C ⇒ B   (1/4 = 25%, 1/2 = 50%)
               •   B, C ⇒ A   (1/4 = 25%, 1/1 = 100%)

(Figure: Venn diagram of customers who buy beer (A), who buy diapers (C), and who buy both (A & C).)

    6                                           Data Warehousing and Data Mining by Kritsada Sriphaew
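
To make the two measures concrete, here is a minimal Python sketch (the set encoding of transactions and the helper names are ours, not from the slides) that computes support and confidence over the four-transaction toy database above:

    # Toy database from this slide: four transactions over items A-F.
    transactions = [
        {"A", "B", "C"},   # 2000
        {"A", "C"},        # 1000
        {"A", "D"},        # 4000
        {"B", "E", "F"},   # 5000
    ]

    def support(itemset, db):
        """Fraction of transactions that contain every item of `itemset`."""
        itemset = set(itemset)
        return sum(itemset <= t for t in db) / len(db)

    def confidence(antecedent, consequent, db):
        """conf(A => C) = support(A and C together) / support(A)."""
        return support(set(antecedent) | set(consequent), db) / support(antecedent, db)

    print(support({"A", "C"}, transactions))       # 0.5   -> s(A => C) = 50%
    print(confidence({"A"}, {"C"}, transactions))  # 0.667 -> c(A => C) = 66.7%
    print(confidence({"C"}, {"A"}, transactions))  # 1.0   -> c(C => A) = 100%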
Association Rule: Support/Confidence for
  Relational Tables
         In the case where each transaction is a row in a relational table
         Find: all rules that correlate the presence of one set of
          attributes with that of another set of attributes

    outlook    temp.  humidity  windy  Sponsor  play-time  play
    sunny      hot    high      True   Sony     85         Y
    sunny      hot    high      False  HP       90         Y
    overcast   hot    normal    True   Ford     63         Y
    rainy      mild   high      True   Ford     5          N
    rainy      cool   low       False  HP       56         Y
    sunny      hot    low       True   Sony     25         N
    rainy      cool   normal    True   Nokia    5          N
    overcast   mild   high      True   Honda    86         Y
    rainy      mild   low       False  Ford     78         Y
    overcast   hot    high      True   Sony     74         Y

  • If temperature = hot then humidity = high   (s=3/10, c=3/5)
  • If windy=true and play=Y then humidity=high and outlook=overcast   (s=2/10, c=2/4)
  • If windy=true and play=Y and humidity=high then outlook=overcast   (s=2/10, c=2/3)

      7                                      Data Warehousing and Data Mining by Kritsada Sriphaew
Association Rule Mining: Types
       Boolean vs. quantitative associations (Based on the types of
        values handled) (Single vs. multiple Dim.)
        SQLServer ^ DMBooks ⇒ DBMiner                                   [0.2%, 60%]
        buys(x, “SQLServer”) ^ buys(x, “DMBook”) ⇒ buys(x, “DBMiner”)   [0.2%, 60%]
        age(x, “30..39”) ^ income(x, “42..48K”) ⇒ buys(x, “PC”)         [1%, 75%]

       Single level vs. multilevel analysis
           What brands of beers are associated with what brands of diapers?
       Various extensions
           Maxpatterns and closed itemsets

    8                                     Data Warehousing and Data Mining by Kritsada Sriphaew
An Example (single dimensional Boolean
 association Rule Mining)
        For rule A  C:                                    Min. support 50%
            support = support({A, C}) = 50%                Min. confidence 50%
            confidence = support({A, C})/support({A}) = 66.7%
        The Apriori principle:
            Any subset of a frequent itemset must be
             frequent
Transaction ID           Items Bought              Frequent Itemset Support
    2000                 A,B,C                     {A}                 75%
    1000                 A,C                       {B}                 50%
    4000                 A,D                       {C}                 50%
    5000                 B,E,F                     {A,C}               50%

     9                                     Data Warehousing and Data Mining by Kritsada Sriphaew
Two Steps in Mining Association Rules
   A subset of a frequent itemset must also be a frequent
    itemset
      i.e., if {AB} is a frequent itemset, both {A} and {B} must
       be a frequent itemset
   Iteratively find frequent itemsets with cardinality from 1 to
    k (k-itemset)
Step1: Find the frequent itemsets: the sets of items that
         have minimum support

Step2: Use the frequent itemsets to generate association
       rules

 10                             Data Warehousing and Data Mining by Kritsada Sriphaew
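
Step 2 is typically implemented by splitting each frequent itemset into every non-empty antecedent/consequent pair and keeping the splits that pass the confidence threshold. A hedged Python sketch, assuming a dictionary that maps each frequent itemset to its support (the kind of output the Apriori sketch further below produces):

    from itertools import combinations

    def generate_rules(supports, min_conf):
        """supports: {frozenset(itemset): support}; returns (antecedent, consequent, confidence) triples."""
        rules = []
        for itemset, sup in supports.items():
            if len(itemset) < 2:
                continue
            for r in range(1, len(itemset)):
                for ante in combinations(itemset, r):
                    ante = frozenset(ante)
                    conf = sup / supports[ante]   # subsets of a frequent itemset are frequent,
                    if conf >= min_conf:          # so their supports are available
                        rules.append((set(ante), set(itemset - ante), conf))
        return rules

    # Supports from the earlier toy database (min. sup. 50%):
    supports = {frozenset({"A"}): 0.75, frozenset({"C"}): 0.5, frozenset({"A", "C"}): 0.5}
    print(generate_rules(supports, min_conf=0.5))
    # two rules: {A} => {C} with confidence 0.667, and {C} => {A} with confidence 1.0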
Find the frequent itemsets

The Apriori Algorithm
    Join Step: Ck is generated by joining Lk-1 with itself
    Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset
     of a frequent k-itemset
    Pseudo-code:
         Ck: candidate itemsets of size k
         Lk: frequent itemsets of size k

         L1 = {frequent 1-itemsets};
         for (k = 1; Lk != ∅; k++) do begin
             Ck+1 = candidates generated from Lk;
             for each transaction t in database do
                 increment the count of all candidates in Ck+1
                 that are contained in t;
             Lk+1 = candidates in Ck+1 with min_support;
         end
         return ∪k Lk;
    11                                Data Warehousing and Data Mining by Kritsada Sriphaew
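
The pseudo-code can be turned into a compact level-wise implementation; the sketch below (Python, our own naming) self-joins the previous level, prunes with the Apriori property, and makes one database scan per level:

    from itertools import combinations

    def apriori(transactions, min_support):
        """Return {frozenset: support} for all itemsets whose support (a fraction) meets min_support."""
        n = len(transactions)
        counts = {}
        for t in transactions:                       # first scan: count 1-itemsets
            for item in t:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        frequent = {s: c / n for s, c in counts.items() if c / n >= min_support}
        result, k = dict(frequent), 1
        while frequent:
            prev = list(frequent)
            candidates = set()
            for i in range(len(prev)):               # join step
                for j in range(i + 1, len(prev)):
                    union = prev[i] | prev[j]
                    if len(union) == k + 1 and all(  # prune step (Apriori property)
                        frozenset(sub) in frequent for sub in combinations(union, k)
                    ):
                        candidates.add(union)
            counts = {c: 0 for c in candidates}
            for t in transactions:                   # one scan per level to count candidates
                for c in candidates:
                    if c <= t:
                        counts[c] += 1
            frequent = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
            result.update(frequent)
            k += 1
        return result

    db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]   # the database of the next slide
    print(apriori(db, min_support=0.5))                 # includes frozenset({2, 3, 5}): 0.5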
The Apriori Algorithm — Example
     Database D
     TID    Items
     100    1 3 4
     200    2 3 5
     300    1 2 3 5
     400    2 5

     Scan D for counts of 1-itemsets:
     C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3        L1: {1}:2, {2}:3, {3}:3, {5}:3

     Join L1 with itself:
     C2: {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
     Scan D for counts:
     C2: {1 2}:1, {1 3}:2, {1 5}:1, {2 3}:2, {2 5}:3, {3 5}:2
     L2: {1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2

     Join L2 with itself and prune:
     C3: {2 3 5}
     Scan D for counts:
     L3: {2 3 5}:2
12
How to Generate Candidates?
   Suppose the items in Lk-1 are listed in an order
Step 1: self-joining Lk-1
   INSERT INTO Ck
   SELECT p.item1, p.item2, …, p.itemk-1, q.itemk-1
   FROM   Lk-1 p, Lk-1 q
   WHERE  p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
Step 2: pruning
   forall itemsets c in Ck do
      forall (k-1)-subsets s of c do
         if (s is not in Lk-1) then delete c from Ck
13
Example of Generating Candidates
             L3 = {abc, abd, acd, ace, bcd}

Self-joining: L3 × L3
    abc + abd → abcd
    acd + ace → acde

Pruning:
    acde is removed because ade is not in L3

C4 = {abcd}

 14                   Data Warehousing and Data Mining by Kritsada Sriphaew
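
The same join and prune steps, written directly after the SQL-style pseudo-code of the previous slide and applied to this L3 example (itemsets as sorted tuples; the helper name is ours):

    def gen_candidates(L_prev):
        """Ordered self-join plus prune; L_prev holds (k-1)-itemsets as sorted tuples."""
        L_set = set(L_prev)
        candidates = []
        for p in L_prev:
            for q in L_prev:
                # Join: first k-2 items equal, last item of p smaller than last item of q.
                if p[:-1] == q[:-1] and p[-1] < q[-1]:
                    c = p + (q[-1],)
                    # Prune: every (k-1)-subset of c must itself be in L(k-1).
                    if all(c[:i] + c[i + 1:] in L_set for i in range(len(c))):
                        candidates.append(c)
        return candidates

    L3 = [tuple("abc"), tuple("abd"), tuple("acd"), tuple("ace"), tuple("bcd")]
    print(gen_candidates(L3))   # [('a', 'b', 'c', 'd')] -- acde is pruned because ade is not in L3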
How to Count Supports of Candidates?
   Why is counting supports of candidates a problem?
       The total number of candidates can be very huge
       One transaction may contain many candidates
   Method:
       Candidate itemsets are stored in a hash-tree
       Leaf node of hash-tree contains a list of itemsets and
        counts
       Interior node contains a hash table
       Subset function: finds all the candidates contained in a
        transaction
15                                 Data Warehousing and Data Mining by Kritsada Sriphaew
Subset Function
     Subset function: finds all the candidates contained in a transaction.
      (1) Generate a hash tree over the candidate itemsets.
      (2) Hash each item of the transaction to locate and count the candidates it contains.

     C2 itemsets: {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

     Database
        TID      Items
        100      1 3 4
        200      2 3 5
        300      1 2 3 5
        400      2 5

     (Figure: the hash tree built over C2, with the counts that accumulate at its leaves
      as each transaction is hashed through it.)

     16                             Data Warehousing and Data Mining by Kritsada Sriphaew
Is Apriori Fast Enough? — Performance
Bottlenecks
    The core of the Apriori algorithm:
        Use frequent (k – 1)-itemsets to generate candidate frequent
         k-itemsets
        Use database scan and pattern matching to collect counts for
         the candidate itemsets
    The bottleneck of Apriori: candidate generation
        Huge candidate sets:
          10^4 frequent 1-itemsets will generate about 10^7 candidate 2-itemsets
          To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100},
           one needs to generate 2^100 ≈ 10^30 candidates.
        Multiple scans of database:
          Needs (n+1) scans, where n is the length of the longest pattern

    17                               Data Warehousing and Data Mining by Kritsada Sriphaew
Mining Frequent Patterns Without Candidate
Generation
   Compress a large database into a compact,
    Frequent-Pattern tree (FP-tree) structure
       highly condensed, but complete for frequent pattern
        mining
       avoid costly database scans
   Develop an efficient, FP-tree-based frequent pattern
    mining method
       A divide-and-conquer methodology: decompose mining
        tasks into smaller ones
       Avoid candidate generation: sub-database test only!
18                                Data Warehousing and Data Mining by Kritsada Sriphaew
Construct FP-tree from Transaction DB
     TID    Items bought                  (ordered) frequent items
     100    {f, a, c, d, g, i, m, p}      {f, c, a, m, p}
     200    {a, b, c, f, l, m, o}         {f, c, a, b, m}            min_support = 0.5
     300    {b, f, h, j, o}               {f, b}
     400    {b, c, k, s, p}               {c, b, p}
     500    {a, f, c, e, l, p, m, n}      {f, c, a, m, p}

Steps:
1.    Scan DB once, find frequent 1-itemsets (single item patterns)
2.    Order frequent items in frequency descending order
3.    Scan DB again, construct FP-tree

Header Table: f:4, c:4, a:3, b:3, m:3, p:3

(Figure: the resulting FP-tree. From the root {}: f:4 → c:3 → a:3 → m:2 → p:2,
 with side branches a:3 → b:1 → m:1 and f:4 → b:1, plus a second branch
 c:1 → b:1 → p:1; header-table links connect the nodes of each item.)

       19                                   Data Warehousing and Data Mining by Kritsada Sriphaew
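
A minimal sketch of the three construction steps (Python; the node layout, the list-based header table, and the alphabetical tie-break between equally frequent items are our own choices, so the exact node order may differ slightly from the figure):

    from collections import Counter, defaultdict

    class FPNode:
        def __init__(self, item, parent):
            self.item, self.parent = item, parent
            self.count, self.children = 0, {}

    def build_fp_tree(transactions, min_count):
        counts = Counter(item for t in transactions for item in t)     # step 1: first scan
        freq = {i: c for i, c in counts.items() if c >= min_count}
        root, header = FPNode(None, None), defaultdict(list)           # header: item -> its nodes
        for t in transactions:                                         # steps 2-3: second scan
            # keep only frequent items, ordered by descending frequency
            items = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
            node = root
            for item in items:
                if item not in node.children:
                    node.children[item] = FPNode(item, node)
                    header[item].append(node.children[item])
                node = node.children[item]
                node.count += 1
        return root, header, freq

    db = [set("facdgimp"), set("abcflmo"), set("bfhjo"), set("bcksp"), set("afcelpmn")]
    root, header, freq = build_fp_tree(db, min_count=3)   # min_support 0.5 of 5 transactions
    print(sorted(freq.items()))                           # the six frequent items a, b, c, f, m, p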
Mining Frequent Patterns using FP-tree
    General idea (divide-and-conquer)
        Recursively grow frequent pattern path using the FP-tree
    Method
        For each item, construct its conditional pattern-base, and then its
         conditional FP-tree
        Repeat the process on each newly created conditional FP-tree
        Until the resulting FP-tree is empty, or it contains only one path (single
         path will generate all the combinations of its sub-paths, each of which is a
         frequent pattern)
    Benefit: Completeness & Compactness
        Completeness: never breaks a long pattern of any transaction and
         preserves complete information for frequent pattern mining
        Compactness: reduces irrelevant information (infrequent items are gone),
         orders in frequency descending ordering (more frequent items are likely to
         be shared), and smaller than the original database.

    20                                     Data Warehousing and Data Mining by Kritsada Sriphaew
Step 1: From FP-tree to Conditional Pattern Base
    Starting at the frequent header table in the FP-tree
    Traverse the FP-tree by following the link of each frequent item
    Accumulate all of transformed prefix paths of that item to form a
     conditional pattern base

       Header Table: f:4, c:4, a:3, b:3, m:3, p:3
       (Figure: the FP-tree from the previous slide, with its header-table links.)

       Conditional pattern bases
       item      cond. pattern base
       c         f:3
       a         fc:3
       b         fca:1, f:1, c:1
       m         fca:2, fcab:1
       p         fcam:2, cb:1

  21                           Knowledge Management and Discovery © Kritsada Sriphaew
Step 2: Construct Conditional FP-tree
    For each pattern base
        Accumulate the count for each item in the base
        Construct the FP-tree for the frequent items of the pattern base

    m-conditional pattern base:  fca:2, fcab:1
    m-conditional FP-tree:       {} → f:3 → c:3 → a:3
    (b is dropped from the conditional tree because its count in the base, 1, is below min_support)

    All frequent patterns concerning m:
        m, fm, cm, am, fcm, fam, cam, fcam

    22
Mining Frequent Patterns by
(Creating Conditional Pattern-Bases)

 Item    Conditional pattern-base       Conditional FP-tree
   p     {(fcam:2), (cb:1)}             {(c:3)}|p
   m     {(fca:2), (fcab:1)}            {(f:3, c:3, a:3)}|m
   b     {(fca:1), (f:1), (c:1)}        Empty
   a     {(fc:3)}                       {(f:3, c:3)}|a
   c     {(f:3)}                        {(f:3)}|c
   f     Empty                          Empty

23                         Data Warehousing and Data Mining by Kritsada Sriphaew
Step 3: Recursively mine the conditional FP-tree

    m-conditional FP-tree:      {} → f:3 → c:3 → a:3
    am-conditional FP-tree:     {} → f:3 → c:3
    cm-conditional FP-tree:     {} → f:3
    cam-conditional FP-tree:    {} → f:3

    24                  Data Warehousing and Data Mining by Kritsada Sriphaew
Single FP-tree Path Generation
    Suppose an FP-tree T has a single path P
    The complete set of frequent patterns of T can be generated by
     enumerating all the combinations of the sub-paths of P

    Example: the m-conditional FP-tree is the single path {} → f:3 → c:3 → a:3,
     so all frequent patterns concerning m are
        m, fm, cm, am, fcm, fam, cam, fcam

    25
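
Reading the patterns off a single path amounts to enumerating the combinations of its items and attaching the suffix item; a tiny sketch, assuming the m-conditional path f:3, c:3, a:3 (helper name ours):

    from itertools import combinations

    def patterns_from_single_path(path, suffix):
        """path: [(item, count), ...] along the single path; suffix: the conditioning item."""
        patterns = {}
        for r in range(1, len(path) + 1):
            for combo in combinations(path, r):
                items = frozenset(i for i, _ in combo) | {suffix}
                patterns[items] = min(c for _, c in combo)   # support = smallest count on the combo
        return patterns

    print(patterns_from_single_path([("f", 3), ("c", 3), ("a", 3)], "m"))
    # fm, cm, am, fcm, fam, cam, fcam, each with count 3 (m itself comes from the suffix's own count)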
FP-growth vs. Apriori: Scalability With the
Support Threshold

    (Figure: run time in seconds, 0 to 100, plotted against the support threshold, 0 to 3%,
     on data set T25I20D10K, comparing D1 FP-growth runtime with D1 Apriori runtime.)

26                                 Data Warehousing and Data Mining by Kritsada Sriphaew
CHARM - Mining Closed Association Rules
    Instead of horizontal DB format, vertical format is used.
    Instead of traditional frequent itemsets, closed frequent
     itemsets are mined.
           Horizontal DB                           Vertical DB
         Transaction   Items                  Items Transaction
         1             ABDE                   A          1345
         2             BCE                    B          123456
         3             ABDE                   C          2456
         4             ABCE                   D          1356
         5             ABCDE                  E          12345
         6             BCD
    27                           Data Warehousing and Data Mining by Kritsada Sriphaew
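
Producing the vertical layout takes one pass over the horizontal database; itemset supports then become tidset intersections, which is exactly what CHARM exploits. A small Python sketch under those assumptions (names ours):

    from collections import defaultdict

    horizontal = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}

    def to_vertical(db):
        """item -> set of transaction ids (tidset) containing that item."""
        vertical = defaultdict(set)
        for tid, items in db.items():
            for item in items:
                vertical[item].add(tid)
        return vertical

    v = to_vertical(horizontal)
    print(sorted(v["A"]))                      # [1, 3, 4, 5]
    # The support of an itemset is the size of the intersection of its items' tidsets:
    print(len(v["A"] & v["B"] & v["E"]) / 6)   # ABE occurs in tids 1, 3, 4, 5 -> about 0.67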
CHARM – Frequent Itemsets and Their Supports
   An example database (vertical DB) and its frequent itemsets, min. support = 0.5

          Items   Trans.
          A       1345
          B       123456
          C       2456
          D       1356
          E       12345

          Support     Itemsets
          1.00        B
          0.83        BE, E
          0.67        A, C, D, AB, AE, BC, BD, ABE
          0.50        AD, CE, DE, ABD, ADE, BCE, BDE, ABDE

 28                         Data Warehousing and Data Mining by Kritsada Sriphaew
CHARM - Closed Itemsets
   Closed frequent itemsets and their corresponding
    frequent itemsets
     Closed
     Itemsets    Tidsets      Sup. Freq. Itemsets
     B           123456       1.00        B
     BE          12345        0.83        BE, E
     ABE         1345         0.67        ABE, AB, AE, A
     BD          1356         0.67        BD, D
     BC          2456         0.67        BC, C
     ABDE        135          0.50        ABDE, ABD, ADE,
                                          BDE, AD, DE
     BCE         245          0.50        CE, BCE
29                          Data Warehousing and Data Mining by Kritsada Sriphaew
The CHARM Algorithm

 CHARM (D ⊆ I × T, minsup):
 1. Nodes = { Ij × t(Ij) : Ij ∈ I and |t(Ij)| ≥ minsup }
 2. CHARM-EXTEND(Nodes, C)

 CHARM-EXTEND (Nodes, C):
 3. for each Xi × t(Xi) in Nodes
 4.    NewN = ∅ and X = Xi
 5.    for each Xj × t(Xj) in Nodes, with f(j) > f(i)
 6.        X’ = X ∪ Xj and Y = t(Xi) ∩ t(Xj)
 7.        CHARM-PROPERTY(Nodes, NewN)
 8.    if NewN ≠ ∅ then CHARM-EXTEND(NewN)
 9.    C = C ∪ {X} // if X is not subsumed

 CHARM-PROPERTY (Nodes, NewN):
 1. if (|Y| ≥ minsup) then
 2.    if t(Xi) = t(Xj) then           // Property 1
 3.        Remove Xj from Nodes
 4.        Replace all Xi with X’
 5.    else if t(Xi) ⊂ t(Xj) then      // Property 2
 6.        Replace all Xi with X’
 7.    else if t(Xi) ⊃ t(Xj) then      // Property 3
 8.        Remove Xj from Nodes
 9.        Add X’ × Y to NewN
 10.   else if t(Xi) ≠ t(Xj) then      // Property 4
 11.       Add X’ × Y to NewN

 (Figure: the CHARM search tree over itemset × tidset pairs, starting from ∅:
  A×1345, B×123456, C×2456, D×1356, E×12345; extending A gives AB×1345 and ABE×1345;
  further nodes include ABC×45, ABD×135, ABDE×135, BC×2456, BD×1356, BE×12345,
  BCD×56, BCE×245, BDE×135.)

 30                                                        Data Warehousing and Data Mining by Kritsada Sriphaew
Presentation of Association Rules (Table Form)




31                    Data Warehousing and Data Mining by Kritsada Sriphaew
Visualization of Association Rule Using
Plane Graph




32
Visualization of Association Rule Using Rule
Graph




33
Mining multilevel association rules from transactional databases

Multiple-Level Association Rules
    Items often form a hierarchy.
    Items at the lower level are expected to have lower support.
    Rules regarding itemsets at the appropriate levels could be quite useful.
    Transaction databases can be encoded based on dimensions and levels.
    We can explore shared multi-level mining.

      TID     ITEMS
      T1      {1121, 1122, 1212}
      T2      {1222, 1121, 1122, 1213}
      T3      {1124, 1213}
      T4      {1111, 1211, 1232, 1221, 1223}

    Encoded item hierarchy:
      Food (1)
        Milk (11):  Skim (111), 2% (112)
            2% milk brands: Fraser (1121), Sunset (1124)
        Bread (12): Wheat (121), White (122)
            Wheat brand: Wonder (1213); White brand: Wonder (1222)

     34                                Data Warehousing and Data Mining by Kritsada Sriphaew
Mining Multi-Level Associations
   A top-down, progressive deepening approach:
       First find high-level strong rules:
               milk ⇒ bread                           [20%, 60%]
       Then find their lower-level “weaker” rules:
               2% milk ⇒ wheat bread                  [6%, 50%]

   Variations in mining multiple-level association rules:
       Level-crossed association rules:
           2% milk ⇒ Wonder wheat bread               [3%, 60%]
       Association rules with multiple, alternative hierarchies:
           2% milk ⇒ Wonder bread                     [8%, 72%]
35                                  Data Warehousing and Data Mining by Kritsada Sriphaew
Multi-level Association: Redundancy Filtering
 Some rules may be redundant due to “ancestor”
  relationships between items.
 Example
       milk ⇒ wheat bread                    [s=8%, c=70%]
       2% milk ⇒ wheat bread                 [s=2%, c=72%]
 We say the first rule is an ancestor of the second
  rule.
 A rule is redundant if its support is close to the
  “expected” value, based on the rule’s ancestor.
 36                        Data Warehousing and Data Mining by Kritsada Sriphaew
Multi-Level Mining: Progressive Deepening
    A top-down, progressive deepening approach:
        First mine high-level frequent items:
          milk (15%), bread (10%)
        Then mine their lower-level “weaker” frequent itemsets:
          2% milk (5%), wheat bread (4%)
    Different min_support threshold across multi-levels lead
     to different algorithms:
        If adopting the same min_support across multi-levels
           then toss t if any of t’s ancestors is infrequent.
        If adopting reduced min_support at lower levels
           then examine only those descendants whose ancestor’s
            support is frequent/non-negligible.

    37                               Data Warehousing and Data Mining by Kritsada Sriphaew
Problem of Confidence
    Example: (Aggarwal & Yu, PODS98)
        Among 5000 students
            3000 play basketball
            3750 eat cereal
            2000 both play basketball and eat cereal
        play basketball ⇒ eat cereal [40%, 66.7%] is misleading because the overall
         percentage of students eating cereal is 75%, which is higher than 66.7%.
        play basketball ⇒ not eat cereal [20%, 33.3%] is far more accurate, although
         it has lower support and confidence


                               basketball not basketball sum(row)
                    cereal          2000           1750     3750
                    not cereal      1000            250     1250
                    sum(col.)       3000           2000     5000
    38                                           Data Warehousing and Data Mining by Kritsada Sriphaew
Interest/Lift/Correlation
    Interest (or lift, correlation):   lift(A, B) = P(A ∪ B) / (P(A) P(B))
        takes both P(A) and P(B) into consideration
        P(A ∪ B) = P(A) P(B) if A and B are independent events
        A and B are negatively correlated if the value is less than 1;
         otherwise A and B are positively correlated

                     basketball  not basketball  sum(row)
          cereal           2000            1750      3750
          not cereal       1000             250      1250
          sum(col.)        3000            2000      5000

        Lift(play basketball ⇒ eat cereal)     = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89
        Lift(play basketball ⇒ not eat cereal) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33

    39                                    Data Warehousing and Data Mining by Kritsada Sriphaew
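
A quick way to re-derive these lift values from the contingency table (plain Python arithmetic):

    n = 5000
    p_basketball, p_cereal, p_both = 3000 / n, 3750 / n, 2000 / n

    lift_cereal = p_both / (p_basketball * p_cereal)              # 0.4 / 0.45
    lift_not_cereal = (1000 / n) / (p_basketball * (1250 / n))    # 0.2 / 0.15
    print(round(lift_cereal, 3), round(lift_not_cereal, 3))       # 0.889 1.333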
Conviction
    Conviction (Brin, 1997)

        conviction(A ⇒ B) = (1 - support(B)) / (1 - confidence(A ⇒ B))

        0 ≤ conv(A ⇒ B) ≤ ∞
        A and B are statistically independent if and only if conv(A ⇒ B) = 1
        0 < conv(A ⇒ B) < 1 if and only if P(B|A) < P(B):
         B is negatively correlated with A.
        1 < conv(A ⇒ B) < ∞ if and only if P(B|A) > P(B):
         B is positively correlated with A.

                     basketball  not basketball  sum(row)
          cereal           2000            1750      3750
          not cereal       1000             250      1250
          sum(col.)        3000            2000      5000

         conviction(play basketball ⇒ eat cereal)     = (1 - 3750/5000) / (1 - 0.667) ≈ 0.75
         conviction(play basketball ⇒ not eat cereal) = (1 - 1250/5000) / (1 - 0.333) ≈ 1.125
    40                                        Data Warehousing and Data Mining by Kritsada Sriphaew
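
The same table pushed through the conviction formula leads to the same qualitative conclusion as lift: the cereal rule scores below 1 and the not-cereal rule above 1.

    def conviction(sup_consequent, conf_rule):
        return (1 - sup_consequent) / (1 - conf_rule)

    # play basketball => eat cereal: sup(cereal) = 3750/5000, conf = 2000/3000
    print(round(conviction(3750 / 5000, 2000 / 3000), 3))   # 0.75  (< 1: negatively correlated)
    # play basketball => not eat cereal: sup(not cereal) = 1250/5000, conf = 1000/3000
    print(round(conviction(1250 / 5000, 1000 / 3000), 3))   # 1.125 (> 1: positively correlated)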
From Association Mining to Correlation Analysis

    Ex. Strong rules are not necessarily interesting
         Of 10,000 transactions:
             • 6,000 customer transactions include computer games
             • 7,500 customer transactions include videos
             • 4,000 customer transactions include both computer games and videos

         • Suppose that a data mining program for discovering association rules is run
           on the data, using min_sup of 30% and min_conf of 60%
         • The following association rule is discovered:

         buys(X, “computer games”) ⇒ buys(X, “videos”)      [s=40%, c=66%]
                                          s = 4000/10000,  c = 4000/6000
    41
A misleading “strong” association rule

         buys(X, “computer games”)  buys(X, “videos”)
                     [support=40%, confidence=66%]

    This rule is misleading because the probability of purchasing video is
     75% (>66%)
    In fact, computer games and videos are negatively associated because
     the purchase of one of these items actually decreases the likelihood of
     purchasing the other. Therefore, we could easily make unwise business
     decisions based on this rule



    42
                                      Data Warehousing and Data Mining by Kritsada Sriphaew
From Association Analysis to Correlation
Analysis
    To help filter out misleading “strong” association
    Correlation rules
        A  B [support, confidence, correlation]
    Lift is a simple correlation measure that is given as follows
        The occurrence of itemset A is independent of the occurrence of itemset B if
         P(A ∪ B) = P(A)P(B); otherwise, itemsets A and B are dependent and correlated
        lift(A, B) = P(A ∪ B) / (P(A)P(B)) = P(B|A) / P(B) = conf(A ⇒ B) / sup(B)
        If lift(A, B) < 1, then the occurrence of A is negatively correlated with the
         occurrence of B
        If lift(A, B) > 1, then A and B are positively correlated, meaning that the
         occurrence of one implies the occurrence of the other.


    43
                                               Data Warehousing and Data Mining by Kritsada Sriphaew
From Association Analysis to Correlation
Analysis (Cont.)
    Ex. Correlation analysis using lift

         buys(X, “computer games”)  buys(X, “videos”)
                     [support=40%, confidence=66%]

        The lift of this rule is
         P{game,video} / (P{game} × P{video}) = 0.40/(0.6 ×0.75) = 0.89
        There is a negative correlation between the occurrence of {game} and {video}



    Ex. Is the following rule misleading?
        Buy walnuts ⇒ Buy milk   [1%, 80%]
        if 85% of customers buy milk

    44
                                               Data Warehousing and Data Mining by Kritsada Sriphaew
Homework
   Given a transactional database, a LOG file that records each user's visits to web pages
    over a period of time, find reliable association rules. Assume that you are the data
    analyst and may set the minimum support and minimum confidence yourself; explain the
    reasons for the values you choose, and check whether the resulting rules are misleading
    and, if so, how you would fix them.
                       TID                          List of items
                      T001          P1, P2, P3, P4
                      T002          P3, P6
                      T003          P2, P5, P1
                      T004          P5, P4, P3,P6
                      T005          P1, P3, P4, P2


                     (Figure: item hierarchy over items P1 to P6.)
    45
Feb 26, 2011 (14:00)
   Quiz I
       Star-net Query (Multidimensional Table)
       Data Cube Computation (Memory Calculation)
       Data Preprocessing (Normalization, Smoothing by binning)
       Association Rule Mining




46                               Data Warehousing and Data Mining by Kritsada Sriphaew
