Chapter 4: Association Analysis
Association Rule Mining
• Given a set of transactions, find rules that will predict the
occurrence of an item based on the occurrences of other items
in the transaction

Market-Basket Transactions

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!
Definition: Frequent Itemset

• Itemset
  – A collection of one or more items
  – Example: {Milk, Bread, Diaper}
  – k-itemset: an itemset that contains k items
• Support count (σ)
  – Frequency of occurrence of an itemset
  – E.g. σ({Milk, Bread, Diaper}) = 2
• Support (s)
  – Fraction of transactions that contain an itemset
  – E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
  – An itemset whose support is greater than or equal to a minsup threshold

(Counts refer to the market-basket transaction table above.)
Definition: Association Rule

• Association Rule
  – An implication expression of the form X → Y, where X and Y are itemsets
  – Example: {Milk, Diaper} → {Beer}
• Rule Evaluation Metrics (using the same transaction table as above)
  – Support (s)
    • Fraction of transactions that contain both X and Y
  – Confidence (c)
    • Measures how often items in Y appear in transactions that contain X

Example: {Milk, Diaper} → {Beer}

s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67
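A minimal Python sketch of these two metrics on the toy transaction table; the helper name support_count and the variables are illustrative, not part of any library.

def support_count(itemset, transactions):
    # sigma(X): number of transactions that contain every item of X
    return sum(1 for t in transactions if set(itemset) <= set(t))

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = support_count(X | Y, transactions) / len(transactions)               # 2/5 = 0.4
c = support_count(X | Y, transactions) / support_count(X, transactions)  # 2/3 ≈ 0.67
print(f"support = {s:.2f}, confidence = {c:.2f}")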
Association Rule Mining Task
• Given a set of transactions T, the goal of
association rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold

• Brute-force approach:
  – List all possible association rules
  – Compute the support and confidence for each rule
  – Prune rules that fail the minsup and minconf thresholds
  ⇒ Computationally prohibitive!
Mining Association Rules

Example of Rules (from the same transaction table):
{Milk, Diaper} → {Beer}   (s=0.4, c=0.67)
{Milk, Beer} → {Diaper}   (s=0.4, c=1.0)
{Diaper, Beer} → {Milk}   (s=0.4, c=0.67)
{Beer} → {Milk, Diaper}   (s=0.4, c=0.67)
{Diaper} → {Milk, Beer}   (s=0.4, c=0.5)
{Milk} → {Diaper, Beer}   (s=0.4, c=0.5)

Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements
Mining Association Rules

• Two-step approach:
  1. Frequent Itemset Generation
     – Generate all itemsets whose support ≥ minsup
  2. Rule Generation
     – Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset (see the sketch below)
• Frequent itemset generation is still computationally expensive
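As a rough illustration of the rule-generation step, the sketch below enumerates every binary partition of one frequent itemset and keeps the high-confidence rules. The function name and signature are assumptions for this example, not an established API.

from itertools import combinations

def rules_from_itemset(freq_itemset, transactions, minconf):
    # Every non-empty proper subset X of the frequent itemset F yields a
    # candidate rule X -> (F \ X); keep only rules with confidence >= minconf.
    def count(itemset):
        return sum(1 for t in transactions if itemset <= set(t))
    F = frozenset(freq_itemset)
    rules = []
    for r in range(1, len(F)):
        for lhs in map(frozenset, combinations(F, r)):
            conf = count(F) / count(lhs)
            if conf >= minconf:
                rules.append((set(lhs), set(F - lhs), conf))
    return rules

# E.g. rules_from_itemset({"Milk", "Diaper", "Beer"}, transactions, minconf=0.6)
# keeps only those partitions of {Milk, Diaper, Beer} with confidence >= 0.6.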
Frequent Itemset Generation

[Itemset lattice over items {A, B, C, D, E}: the null set at the top, then all 1-itemsets (A..E), all 2-itemsets (AB..DE), and so on, down to ABCDE at the bottom.]

Given d items, there are 2^d possible candidate itemsets.
Frequent Itemset Generation

• Brute-force approach:
  – Each itemset in the lattice is a candidate frequent itemset
  – Count the support of each candidate by scanning the database
  – Match each transaction against every candidate
  – Complexity ~ O(NMw), where N is the number of transactions, M is the number of candidates, and w is the maximum transaction width
  ⇒ Expensive, since M = 2^d !!!
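A naive sketch of this brute-force enumeration, workable only for tiny examples precisely because it touches all 2^d - 1 candidates; the function name is illustrative.

from itertools import combinations

def brute_force_frequent(transactions, minsup_count):
    items = sorted({i for t in transactions for i in t})   # d distinct items
    frequent = {}
    # Enumerate all 2^d - 1 non-empty candidate itemsets ...
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            # ... and match each candidate against every transaction: O(N*M*w).
            count = sum(1 for t in transactions if set(cand) <= set(t))
            if count >= minsup_count:
                frequent[frozenset(cand)] = count
    return frequent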
Reducing Number of Candidates

• Apriori principle:
  – If an itemset is frequent, then all of its subsets must also be frequent

• The Apriori principle holds due to the following property of the support measure:

  ∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)

  – The support of an itemset never exceeds the support of its subsets
  – This is known as the anti-monotone property of support
Illustrating Apriori Principle

[Lattice diagram: once an itemset (here the 2-itemset AB) is found to be infrequent, all of its supersets are pruned from the lattice and never need to be generated or counted.]
Illustrating Apriori Principle

Minimum Support = 3

Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Pairs (2-itemsets; no need to generate candidates involving Coke or Eggs):
Itemset          Count
{Bread, Milk}    3
{Bread, Beer}    2
{Bread, Diaper}  3
{Milk, Beer}     2
{Milk, Diaper}   3
{Beer, Diaper}   3

Triplets (3-itemsets):
Itemset                Count
{Bread, Milk, Diaper}  3

If every subset is considered: 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41 candidates.
With support-based pruning: 6 + 6 + 1 = 13 candidates.
Apriori Algorithm

• Method:
  – Let k = 1
  – Generate frequent itemsets of length 1
  – Repeat until no new frequent itemsets are identified:
    • Generate length (k+1) candidate itemsets from length-k frequent itemsets
    • Prune candidate itemsets containing subsets of length k that are infrequent
    • Count the support of each candidate by scanning the DB
    • Eliminate candidates that are infrequent, leaving only those that are frequent
  – Generate rules from the frequent itemsets, from level 2 upwards, using minimum confidence
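A compact sketch of the method above, assuming transactions are given as Python sets of items; it is illustrative only, not an optimized implementation.

from itertools import combinations

def apriori(transactions, minsup_count):
    # Level 1: frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {x: c for x, c in counts.items() if c >= minsup_count}
    all_frequent = dict(frequent)
    k = 1
    while frequent:
        # Generate (k+1)-candidates by joining frequent k-itemsets, pruning any
        # candidate that has an infrequent k-subset (Apriori principle).
        prev = list(frequent)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k + 1 and all(
                        frozenset(sub) in frequent
                        for sub in combinations(union, k)):
                    candidates.add(union)
        # Count support of the surviving candidates with one scan of the DB.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for cand in candidates:
                if cand <= t:
                    counts[cand] += 1
        frequent = {x: c for x, c in counts.items() if c >= minsup_count}
        all_frequent.update(frequent)
        k += 1
    return all_frequent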
Maximal Frequent Itemset

• An itemset is maximal frequent if none of its immediate supersets is frequent.

[Lattice diagram over items {A, B, C, D, E}: a border separates the frequent itemsets (above it) from the infrequent itemsets (below it); the maximal frequent itemsets are the frequent itemsets lying directly on that border.]
Closed Itemset

• An itemset is closed if none of its immediate supersets has the same support as the itemset.

Transaction data:
TID  Items
1    {A, B}
2    {B, C, D}
3    {A, B, C, D}
4    {A, B, D}
5    {A, B, C, D}

Itemset supports:
Itemset        Support
{A}            4
{B}            5
{C}            3
{D}            4
{A, B}         4
{A, C}         2
{A, D}         3
{B, C}         3
{B, D}         4
{C, D}         3
{A, B, C}      2
{A, B, D}      3
{A, C, D}      2
{B, C, D}      3
{A, B, C, D}   2
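A small sketch that checks the closed and maximal conditions directly from their definitions, using brute-force support counting; it is meant only for toy data like the table above, and the function names are made up for this example.

from itertools import combinations

def all_supports(transactions):
    # Support count of every itemset that occurs in at least one transaction.
    items = sorted({i for t in transactions for i in t})
    supports = {}
    for k in range(1, len(items) + 1):
        for cand in map(frozenset, combinations(items, k)):
            s = sum(1 for t in transactions if cand <= t)
            if s:
                supports[cand] = s
    return supports

def closed_and_maximal(supports, minsup):
    frequent = {x: s for x, s in supports.items() if s >= minsup}
    closed, maximal = [], []
    for x, s in frequent.items():
        supersets = [y for y in supports if x < y and len(y) == len(x) + 1]
        if all(supports[y] < s for y in supersets):        # closed frequent
            closed.append(set(x))
        if all(supports[y] < minsup for y in supersets):   # maximal frequent
            maximal.append(set(x))
    return closed, maximal

transactions = [{"A","B"}, {"B","C","D"}, {"A","B","C","D"}, {"A","B","D"}, {"A","B","C","D"}]
print(closed_and_maximal(all_supports(transactions), minsup=3))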
FP-growth Algorithm

• Uses a compressed representation of the database in the form of an FP-tree.
• Once the FP-tree has been constructed, it uses a recursive divide-and-conquer approach to mine the frequent itemsets.
FP-tree Construction

Transaction database:
TID  Items
1    {A, B}
2    {B, C, D}
3    {A, C, D, E}
4    {A, D, E}
5    {A, B, C}
6    {A, B, C, D}
7    {B, C}
8    {A, B, C}
9    {A, B, D}
10   {B, C, E}

After reading TID=1: the tree contains the single path null → A:1 → B:1.
After reading TID=2: a second branch null → B:1 → C:1 → D:1 is added alongside the first path.
FP-Tree Construction (continued)

After reading all ten transactions (items inserted in the order A, B, C, D, E, as in the figure), the complete FP-tree is:

null
  A:7
    B:5
      C:3
        D:1
      D:1
    C:1
      D:1
        E:1
    D:1
      E:1
  B:3
    C:3
      D:1
      E:1

A header table with one entry per item (A, B, C, D, E) keeps node-link pointers to every node of that item in the tree; these pointers are used to assist frequent itemset generation.
FP-growth

Conditional pattern base for D (the prefix paths in the FP-tree ending at D, with D's counts):

P = {(A:1, B:1, C:1),
     (A:1, B:1),
     (A:1, C:1),
     (A:1),
     (B:1, C:1)}

Recursively apply FP-growth on P.

Frequent itemsets found (with support > 1): AD, BD, CD, ACD, BCD
Frequent Pattern (FP) Growth Method

• Mines frequent itemsets without candidate generation.
• It is a divide-and-conquer strategy.
• It compresses the database representing frequent items into a frequent-pattern tree (FP-tree), which retains the itemset association information.
• It divides the compressed database into a set of conditional databases, each associated with one frequent item or pattern fragment, and then mines each such database separately.
• The FP-growth method transforms the problem of finding long frequent patterns into searching for shorter ones recursively and then concatenating the suffix.
• It uses the least frequent items as suffixes.
Advantages and Disadvantages

• Advantages:
  – Reduces search cost, has good selectivity, and is faster than Apriori.
• Disadvantages:
  – When the database is large, it is sometimes unrealistic to construct a main-memory-based FP-tree.
FP-Tree Algorithm (Tree Construction)

• Create the root node of the tree, labeled with null.
• Scan the transactional database.
• The items in each transaction are processed in sorted (descending frequency) order, and a branch is created for each transaction.
FP-Tree Algorithm (Mining)

• Start from each frequent length-1 pattern as an initial suffix pattern.
• Construct its conditional pattern base (a sub-database consisting of the set of prefix paths in the FP-tree that co-occur with the suffix pattern).
• Construct its conditional FP-tree and perform mining recursively on that tree.
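A minimal sketch of the two steps described above: building the FP-tree (with a header table of node links) and extracting the conditional pattern base for one suffix item. The names FPNode, build_fp_tree, and conditional_pattern_base are assumptions made for this sketch, not a standard API.

from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, minsup_count):
    # Pass 1: count item frequencies and keep only frequent items.
    counts = defaultdict(int)
    for t in transactions:
        for item in set(t):
            counts[item] += 1
    freq = {i for i, c in counts.items() if c >= minsup_count}
    root, header = FPNode(None, None), defaultdict(list)
    # Pass 2: insert each transaction with its frequent items in
    # descending-frequency order, sharing common prefixes.
    for t in transactions:
        ordered = sorted((i for i in set(t) if i in freq),
                         key=lambda i: (-counts[i], i))
        node = root
        for item in ordered:
            if item not in node.children:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)   # node-link pointer
            node = node.children[item]
            node.count += 1
    return root, header

def conditional_pattern_base(item, header):
    # For every node of `item`, walk up to the root and record the prefix
    # path together with that node's count.
    base = []
    for node in header[item]:
        path, parent = [], node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:
            base.append((list(reversed(path)), node.count))
    return base

On the ten-transaction example above with a support threshold of 2, conditional_pattern_base("D", header) recovers the same prefix paths as the conditional pattern base shown for D, although the order of items within each path may differ because this sketch orders items by descending frequency rather than alphabetically.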
Categorical data
• Categorical data is a statistical data type consisting of categorical
variables, used for observed data whose value is one of a fixed number of
nominal categories.
• More specifically, categorical data may derive from either or both of
observations made of qualitative data, where the observations are
summarized as counts or cross tabulations, or of quantitative data.
• Observations might be directly observed counts of events happening, or they might be counts of values that occur within given intervals.
• Often, purely categorical data are summarized in the form of a
contingency table.
• However, particularly when considering data analysis, it is common to
use the term "categorical data" to apply to data sets that, while containing
some categorical variables, may also contain non-categorical variables.
Potential Issues

• What if an attribute has many possible values? Example: the attribute country has more than 200 possible values, and many of the attribute values may have very low support.
  – Potential solution: aggregate the low-support attribute values.
• What if the distribution of attribute values is highly skewed? Example: 95% of the visitors have Buy = No, so most of the items will be associated with the (Buy = No) item.
  – Potential solution: drop the highly frequent items.

Handling Categorical Attributes

• Transform each categorical attribute into asymmetric binary variables (a binary variable is asymmetric if its two outcomes are not equally important).
• Introduce a new "item" for each distinct attribute-value pair.
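A tiny sketch of this transformation; the record fields (Country, Buy) and the function name are made-up illustrations.

def to_items(records):
    # One asymmetric binary "item" per distinct attribute=value pair;
    # each record becomes the set of items whose values it takes.
    return [{f"{attr}={val}" for attr, val in rec.items()} for rec in records]

records = [
    {"Country": "Nepal", "Buy": "No"},
    {"Country": "India", "Buy": "Yes"},
]
print(to_items(records))
# e.g. [{'Country=Nepal', 'Buy=No'}, {'Country=India', 'Buy=Yes'}] (set order may vary)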
Sequential Pattern
• Mining of frequently occurring ordered events or subsequences as
patterns. Eg: web sequence, book issued in library etc.
• Used mostly in marketing, customer analysis, prediction modeling.
• A sequence is an ordered list of events, where an item can occur at most once in an event of a sequence but can occur multiple times in different events of a sequence.
• Given a set of sequences, where each sequence consists of a list of events
or elements and each event consists of set of items, given a minimum
support threshold, sequential pattern mining finds all frequent
subsequences.
• Sequence with minimum support is called frequent sequence or
sequential pattern.
• A sequential pattern of length l is called an l-pattern.
Sequential Pattern

• Sequential pattern mining is computationally challenging because such mining may generate a combinatorially explosive number of intermediate subsequences.
• For efficient and scalable sequential pattern mining, two common approaches are:
  – Mining the full set of sequential patterns
  – Mining only the set of closed sequential patterns
• A sequence database is a set of tuples (sequence_ID, sequence). E.g.:

Sequence_ID  Sequence
1            {a, (a,b,c), (a,c), (b,c)}
2            {(a,b,c), (a,d), e, (d,e)}
3            {(c,d), (a,d,e), e}
4            {(e,f), d, (a,b,c), f}
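A small sketch of subsequence matching over such a sequence database: each sequence is represented as a list of events (sets of items), and a pattern is contained in a sequence if its events can be matched, in order, to events of the sequence. The names and the 2-event pattern are illustrative assumptions.

def is_subsequence(pattern, sequence):
    # Greedy left-to-right scan: each pattern event must be a subset of some
    # later event of the data sequence, with order preserved.
    j = 0
    for event in sequence:
        if j < len(pattern) and pattern[j] <= event:
            j += 1
    return j == len(pattern)

# The sequence database from the table above, with each event written as a set.
database = {
    1: [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"b", "c"}],
    2: [{"a", "b", "c"}, {"a", "d"}, {"e"}, {"d", "e"}],
    3: [{"c", "d"}, {"a", "d", "e"}, {"e"}],
    4: [{"e", "f"}, {"d"}, {"a", "b", "c"}, {"f"}],
}

pattern = [{"a"}, {"b", "c"}]   # a hypothetical 2-event pattern
support = sum(is_subsequence(pattern, s) for s in database.values())
print(support)                  # 1: only sequence 1 contains the pattern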
Sub-graph Patterns

• Finds characteristic sub-graphs within a network.
• It is a form of graph search.
• Given a labeled graph data set D = {G1, G2, ..., Gn}, a frequent graph is one whose support is not less than a minimum support threshold.
• Frequent sub-graph patterns can be discovered by generating frequent substructure candidates and then checking the frequency of each candidate.
• The Apriori method and the frequent-pattern-growth method are two common basic approaches for finding frequent sub-graphs.
• This extends association rule mining to finding frequent subgraphs.
What is frequent-pattern mining in Data Streams?
• Frequent-pattern mining finds a set of patterns that occur
frequently in a data set, where a pattern can be a set of items
(called an itemset), a subsequence, or a substructure.
• A pattern is considered frequent if its count satisfies a minimum
support. Scalable methods for mining frequent patterns have been
extensively studied for static data sets.
• Challenges in mining data streams:
– Many existing frequent-pattern mining algorithms require the system to scan
the whole data set more than once, but this is unrealistic for infinite data
streams.
– An infrequent itemset can later become frequent, and a frequent itemset can become infrequent as well. The number of infrequent itemsets is exponential, so it is impossible to keep track of all of them.
