Algorithms for Association Rule Mining - A General
Survey and Comparison

Jochen Hipp
Wilhelm-Schickard-Institute, University of Tübingen
72076 Tübingen, Germany
jochen.hipp@informatik.uni-tuebingen.de

Ulrich Güntzer
Wilhelm-Schickard-Institute, University of Tübingen
72076 Tübingen, Germany
guentzer@informatik.uni-tuebingen.de

Gholamreza Nakhaeizadeh
DaimlerChrysler AG, Research & Technology FT3/AD
89081 Ulm, Germany
rheza.nakhaeizadeh@daimlerchrysler.com
ABSTRACT

Today there are several efficient algorithms that cope with the popular and computationally expensive task of association rule mining. Actually, these algorithms are more or less described on their own. In this paper we explain the fundamentals of association rule mining and moreover derive a general framework. Based on this we describe today's approaches in context by pointing out common aspects and differences. After that we thoroughly investigate their strengths and weaknesses and carry out several runtime experiments. It turns out that the runtime behavior of the algorithms is much more similar than one would expect.

1. INTRODUCTION

1.1 Association Rules

Since its introduction in 1993 [1] the task of association rule mining has received a great deal of attention. Today the mining of such rules is still one of the most popular pattern-discovery methods in KDD.

In brief, an association rule is an expression X ⇒ Y, where X and Y are sets of items. The meaning of such rules is quite intuitive: Given a database D of transactions, where each transaction T ∈ D is a set of items, X ⇒ Y expresses that whenever a transaction T contains X then T probably contains Y also. The probability, or rule confidence, is defined as the percentage of transactions containing Y in addition to X with regard to the overall number of transactions containing X. That is, the rule confidence can be understood as the conditional probability p(Y ⊆ T | X ⊆ T).

The idea of mining association rules originates from the analysis of market-basket data where rules like "A customer who buys products x1 and x2 will also buy product y with probability c%." are found. Their direct applicability to business problems together with their inherent understandability, even for non-experts in data mining, made association rules a popular mining method. Moreover it became clear that association rules are not restricted to dependency analysis in the context of retail applications, but are successfully applicable to a wide range of business problems.

When mining association rules there are mainly two problems to deal with: First of all there is the algorithmic complexity. The number of rules grows exponentially with the number of items. Fortunately, today's algorithms are able to efficiently prune this immense search space based on minimal thresholds for quality measures on the rules. Second, interesting rules must be picked from the set of generated rules. This might be quite costly because the generated rule sets normally are quite large, e.g. more than 100,000 rules are not uncommon, while the percentage of useful rules is typically only a very small fraction. The work concerning this second problem mainly focuses on supporting the user when browsing the rule set, e.g. [14], and on the development of further useful quality measures on the rules, e.g. [7; 6; 22].

1.2 Outline of the Paper

In this paper we deal with the algorithmic aspects of association rule mining. In fact, a broad variety of efficient algorithms to mine association rules have been developed during the last years. These approaches are more or less described separately in the corresponding literature. To overcome this situation we give a general survey of the basic ideas behind association rule mining. In Section 2 we identify the basic strategies and describe them in detail. The resulting framework is used in Section 3 to systematize and present today's most common approaches in context. Furthermore we show the common principles and differences between the algorithms. Finally, in Section 4 we complete our overview with a comparison of the algorithms concerning efficiency. This comparison is based on theoretic considerations and concrete runtime experiments. In Section 5 we conclude with a short summary of our results.

1.3 Related Work

In our work we mainly restrict ourselves to what we call the "classic association rule problem", that is, the mining of all rules existing in a database D with respect to minimal thresholds on certain quality measures. D in this case consists of market-basket like data, that is, transactions containing 10-20 items on average out of a total set of 1,000-100,000 items.

SIGKDD Explorations. Volume 2, Issue 1 - page 58

Although the "classic problem" is still a topic of further research, during recent years many algorithms for specialized tasks have been developed: First of all, there are the approaches that enhance the association rules themselves, e.g. quantitative association rules [24], generalized association rules [23; 12], and to some extent the work on sequential patterns [3; 15]. Moreover there are several generalizations of the rule problem, e.g. [16; 27]. In addition, algorithms were developed that mine well-defined subsets of the rule set according to specified items, quality measures, etc., e.g. general constraints [17; 25], optimized rules [8; 20], maximal frequent itemsets [28], and frequent closed itemsets [18; 19]. Moreover there are algorithms to mine dense databases [5]. These approaches are supplemented by algorithms for online mining of association rules, e.g. [10], and incremental algorithms, e.g. [26; 4].

2. BASIC PRINCIPLES

2.1 Formal Problem Description

Let I = {x1, ..., xn} be a set of distinct literals, called items. A set X ⊆ I with k = |X| is called a k-itemset or simply an itemset. Let a database D be a multi-set of subsets of I. Each T ∈ D is called a transaction. We say that a transaction T ∈ D supports an itemset X ⊆ I if X ⊆ T holds. An association rule is an expression X ⇒ Y, where X, Y are itemsets and X ∩ Y = ∅ holds. The fraction of transactions T supporting an itemset X with respect to database D is called the support of X, supp(X) = |{T ∈ D | X ⊆ T}| / |D|. The support of a rule X ⇒ Y is defined as supp(X ⇒ Y) = supp(X ∪ Y). The confidence of this rule is defined as conf(X ⇒ Y) = supp(X ∪ Y) / supp(X), c.f. [2].

As mentioned before, the main challenge when mining association rules is the immense number of rules that theoretically must be considered. In fact the number of rules grows exponentially with |I|. Since it is neither practical nor desirable to mine such a huge set of rules, the rule sets are typically restricted by minimal thresholds for the quality measures support and confidence, minsupp and minconf respectively. This restriction allows us to split the problem into two separate parts [2]: An itemset X is frequent if supp(X) ≥ minsupp. Once F = {X ⊆ I | X frequent}, the set of all frequent itemsets together with their support values, is known, deriving the desired association rules is straightforward (see [2] for minor enhancements): For every X ∈ F check the confidence of all rules X \ Y ⇒ Y with Y ⊂ X, ∅ ≠ Y ≠ X, and drop those that do not achieve minconf. By the definition above, it suffices to know the support values of all subsets of X to determine the confidence of each such rule. The knowledge of the support values of all subsets of X is ensured by the downward closure property of itemset support: All subsets of a frequent itemset must also be frequent, c.f. [2].

With that in mind, the task of association rule mining can be reduced to the problem of finding all itemsets that are frequent with respect to a given minimal threshold minsupp. The rest of this paper, and most of the literature on association rule mining, addresses exactly this topic.

2.2 Traversing the Search Space

As explained, we need to find all itemsets that satisfy minsupp. For practical applications, looking at all subsets of I is doomed to failure by the huge search space. In fact, a linearly growing number of items still implies an exponentially growing number of itemsets that need to be considered. For the special case I = {1, 2, 3, 4} we visualize the search space, which forms a lattice, in Figure 1, c.f. [28].

Figure 1: Lattice for I = {1, 2, 3, 4}

The frequent itemsets are located in the upper part of the figure whereas the infrequent ones are located in the lower part. Although we do not explicitly specify support values for each of the itemsets, we assume that the bold border separates the frequent from the infrequent itemsets. The existence of such a border is independent of any particular database D and minsupp. Its existence is solely guaranteed by the downward closure property of itemset support.

The basic principle of the common algorithms is to employ this border to efficiently prune the search space. As soon as the border is found, we are able to restrict ourselves to determining the support values of the itemsets above the border and to ignore the itemsets below.

Let map: I → {1, ..., |I|} be a mapping that maps all items x ∈ I one-to-one onto natural numbers. The items can now be seen as totally ordered by the relation "<" between natural numbers. In addition, for X ⊆ I let X.item: {1, ..., |X|} → I, n ↦ X.item_n, be a mapping with X.item_n denoting the n-th item of the items x ∈ X increasingly sorted by "<". The n-prefix of an itemset X with n ≤ |X| is then defined by P = {X.item_m | 1 ≤ m ≤ n}, c.f. [12]. Let the classes E(P), P ⊆ I, with E(P) = {X ⊆ I | |X| = |P| + 1 and P is a prefix of X} be the nodes of a tree. Two nodes are connected by an edge if all itemsets of a class E can be generated by joining two itemsets of the parent class E', e.g. Figure 2.

Together with the downward closure property of itemset support this implies the following: If the parent class E' of a class E does not contain at least two frequent itemsets, then E cannot contain any frequent itemset either. If we encounter such a class E' on our way down the tree, then we have reached the border separating the infrequent from the frequent itemsets. We do not need to go beyond this border, so we prune E and all descendants of E from the search space. The latter procedure allows us to efficiently restrict the number of itemsets to investigate.
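To make the two-phase split of Section 2.1 concrete, the following sketch first mines all frequent itemsets by brute force and then derives the rules from their support values. It is purely illustrative: the function names and the toy database are ours, and the exhaustive enumeration over all subsets of I is exactly what the algorithms discussed below avoid.

```python
from itertools import combinations

def frequent_itemsets(db, minsupp):
    """Phase 1 (brute force): return {itemset: support} for all frequent
    itemsets. Exponential in |I|; real algorithms prune via downward closure."""
    items = sorted(set().union(*db))
    supp = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            s = sum(1 for t in db if set(cand) <= t) / len(db)
            if s >= minsupp:
                supp[frozenset(cand)] = s
    return supp

def rules(freq, minconf):
    """Phase 2: derive all rules X \\ Y => Y with confidence >= minconf."""
    out = []
    for x, s_x in freq.items():
        for r in range(1, len(x)):
            for y in combinations(x, r):
                y = frozenset(y)
                conf = s_x / freq[x - y]   # conf = supp(X) / supp(X \ Y)
                if conf >= minconf:
                    out.append((x - y, y, conf))
    return out

db = [{'a', 'b', 'c'}, {'a', 'b'}, {'a', 'c'}, {'b', 'c'}]
freq = frequent_itemsets(db, minsupp=0.5)
print(rules(freq, minconf=0.6))
```

On the four toy transactions, e.g. the rule {a} ⇒ {b} reaches confidence supp({a, b}) / supp({a}) = 0.5 / 0.75 ≈ 0.67.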
We simply determine the support values only of those itemsets that we "visit" on our search for the border between frequent and infrequent itemsets. Finally, the actual strategy to search for the border is at our own choice. Today's common approaches employ either breadth-first search (BFS) or depth-first search (DFS). With BFS the support values of all (k-1)-itemsets are determined before counting the support values of the k-itemsets. In contrast, DFS recursively descends following the tree structure defined above.

Figure 2: Tree for I = {1, 2, 3, 4}

2.3 Determine Itemset Supports

In the following, an itemset that is potentially frequent and for which we decide to determine its support during lattice traversal is called a candidate itemset, or simply a candidate. One common approach to determine the support value of an itemset is to directly count its occurrences in the database. For that purpose a counter is set up and initialized to zero for each itemset that is currently under investigation. Then all transactions are scanned, and whenever one of the candidates is recognized as a subset of a transaction, its counter is incremented. Typically, subset generation and candidate lookup are integrated and implemented on a hashtree or a similar data structure. In brief, not all subsets of each transaction are generated but only those that are contained in the candidates or that have a prefix in common with at least one of the candidates, c.f. [2] for further details.

Another approach is to determine the support values of candidates by set intersections. A tid is a unique transaction identifier. For a single item, the tidlist is the set of identifiers that correspond to the transactions containing this item. Accordingly, tidlists also exist for every itemset X and are denoted by X.tidlist. The tidlist of a candidate C = X ∪ Y is obtained by C.tidlist = X.tidlist ∩ Y.tidlist. The tidlists are sorted in ascending order to allow efficient intersections. Note that by buffering the tidlists of frequent candidates as intermediate results, we remarkably speed up the generation of the tidlists of the following candidates. Finally, the actual support of a candidate is obtained by determining |C.tidlist|.

3. COMMON ALGORITHMS

In this section we briefly describe and systematize the most common algorithms. We do this by referring to the fundamentals of frequent itemset generation that we identified in the previous section. Our goal is not to go too much into detail but to show the basic principles of and differences between the approaches.

3.1 Systematization

The algorithms that we consider in this paper are systematized in Figure 3. We characterize each of the algorithms a) by its strategy to traverse the search space and b) by its strategy to determine the support values of the itemsets. In addition, an algorithm may employ specific optimizations for further speedup.

Figure 3: Systematization of the Algorithms (BFS: Apriori, AprioriTID, DIC, Partition; DFS: FP-growth, Eclat)

3.2 BFS and Counting Occurrences

The most popular algorithm of this type is Apriori [2], where also the downward closure property of itemset support was introduced. Apriori makes additional use of this property by pruning those candidates that have an infrequent subset before counting their supports. This optimization becomes possible because BFS ensures that the support values of all subsets of a candidate are known in advance. Apriori counts all candidates of cardinality k together in one scan over the database. The critical part is looking up the candidates in each of the transactions. For this purpose [2] introduces a so-called hashtree structure. The items in each transaction are used to descend in the hashtree. Whenever we reach one of its leafs, we find a set of candidates having a common prefix that is contained in the transaction. Then these candidates are searched in the transaction, which has been encoded as a bitmap before. In the case of success, the counter of the candidate in the tree is incremented.

AprioriTID [2] is an extension of the basic Apriori approach. Instead of relying on the raw database, AprioriTID internally represents each transaction by the current candidates it contains. With AprioriHybrid both approaches are combined, c.f. [2]. To some extent also SETM [13] is an Apriori(TID)-like algorithm, which is intended to be implemented directly in SQL.

DIC is a further variation of the Apriori algorithm [7]. DIC softens the strict separation between counting and generating candidates: whenever a candidate reaches minsupp, that is, even when this candidate has not yet "seen" all transactions, DIC starts generating additional candidates based on it.
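The level-wise interplay of candidate generation, pruning by downward closure, and counting that underlies Apriori can be sketched as follows. This is a simplified illustration, not the implementation of [2]: plain Python dictionaries stand in for the hashtree, and the function name and toy database are ours.

```python
from itertools import combinations

def apriori_sketch(db, minsupp):
    """Level-wise (BFS) frequent-itemset mining: one counting pass per level k.
    Plain dict counting stands in for the hashtree of [2]."""
    n = len(db)
    freq = {}
    cands = {frozenset([i]) for t in db for i in t}   # level-1 candidates
    k = 1
    while cands:
        counts = dict.fromkeys(cands, 0)
        for t in db:                                  # one database scan per level
            for c in cands:
                if c <= t:
                    counts[c] += 1
        level = {c: s / n for c, s in counts.items() if s / n >= minsupp}
        freq.update(level)
        # Generate (k+1)-candidates by joining frequent k-itemsets and
        # prune those with an infrequent k-subset (downward closure).
        cands = set()
        for a in level:
            for b in level:
                u = a | b
                if len(u) == k + 1 and all(frozenset(s) in level
                                           for s in combinations(u, k)):
                    cands.add(u)
        k += 1
    return freq

db = [{'a', 'b', 'c'}, {'a', 'b'}, {'a', 'c'}, {'b', 'c'}]
print(apriori_sketch(db, minsupp=0.5))
```

On the toy database only the six itemsets up to size two survive the threshold; {a, b, c} is generated as a candidate but discarded after counting.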
For that purpose DIC employs a prefix-tree. In contrast to the hashtree, each node of the prefix-tree (leaf node or inner node) is assigned to exactly one candidate or frequent itemset, respectively. In contrast to the usage of a hashtree, this means that whenever we reach a node we can be sure that the itemset associated with this node is contained in the transaction. Furthermore, interlocking support determination and candidate generation decreases the number of database scans.

3.3 BFS and TID-List Intersections

The Partition algorithm [21] is an Apriori-like algorithm that uses set intersections to determine support values. As described above, Apriori determines the support values of all (k-1)-candidates before counting the k-candidates. The problem is that Partition of course wants to use the tidlists of the frequent (k-1)-itemsets to generate the tidlists of the k-candidates. Obviously the size of those intermediate results easily grows beyond the physical memory limitations of common machines. To overcome this, Partition splits the database into several chunks that are treated independently. The size of each chunk is chosen in such a way that all intermediate tidlists fit into main memory. After determining the frequent itemsets for each database chunk, an extra scan is necessary to ensure that the locally frequent itemsets are also globally frequent.

3.4 DFS and Counting Occurrences

Counting occurrences assumes candidate sets of a reasonable size. For each of those candidate sets a database scan is performed. E.g. Apriori, which relies on BFS, scans the database once for every candidate size k. When using DFS, the candidate sets consist only of the itemsets of one of the nodes of the tree from Section 2.2. Obviously, scanning the database for every node results in tremendous overhead. The simple combination of DFS and counting occurrences is therefore of no practical relevance, c.f. [11].

Recently in [9] a fundamentally new approach called FP-growth was introduced. In a preprocessing step, FP-growth derives a highly condensed representation of the transaction data, the so-called FP-tree. The generation of the FP-tree is done by counting occurrences and DFS. In contrast to former DFS approaches, FP-growth does not follow the nodes of the tree from Subsection 2.2, but directly descends to some part of the itemsets in the search space. In a second step, FP-growth uses the FP-tree to derive the support values of all frequent itemsets.

3.5 DFS and TID-List Intersections

In [28] the algorithm Eclat is introduced, which combines DFS with tidlist intersections. When using DFS, it suffices to keep the tidlists on the path from the root down to the class currently investigated in memory. That is, splitting the database as done by Partition is no longer needed. Eclat employs an optimization called "fast intersections": whenever we intersect two tidlists, we are only interested in the resulting tidlist if its cardinality reaches minsupp. In other words, we should break off each intersection as soon as it is sure that it will not achieve this threshold. Eclat originally generates only frequent itemsets of size ≥ 3. We modified Eclat to mine also the frequent 1- and 2-itemsets by calling it on the class that contains the 1-itemsets together with their tidlists.

In addition, in [28] algorithms that mine only the maximal frequent itemsets are introduced, e.g. MaxEclat. An itemset X is maximal frequent if for every frequent itemset Y, X ⊆ Y implies Y = X. We do not consider these algorithms because, although it is straightforward to derive the set of all frequent itemsets from the maximal frequent itemsets, this does not hold for the corresponding support values. Without those, we are not able to derive rule confidences and therefore cannot generate association rules.

4. COMPARISON OF THE ALGORITHMS

In this section we compare the algorithms and explain the observed differences in performance behavior.

4.1 Experiments

To carry out performance studies we implemented the most common algorithms to mine frequent itemsets, namely Apriori, DIC, Partition, and Eclat, in C++. Actually, we had to leave out DIC in the charts for reasons explained later. In addition, we did not consider AprioriTID and FP-growth because these algorithms were designed to mine data that is not typical for retail environments, that is, data containing quite long patterns.

The experiments were performed on a SUN UltraSPARC-II workstation clocked at 248 MHz. The experiments in Figures 4-11 were carried out on synthetic datasets from [2; 21]. These datasets were generated with a data generator [2] that simulates the buying behavior of customers in retail business. Dataset "T10.I4.D100K" means an average transaction size of 10, an average size of the maximal potentially frequent itemsets of 4, and 100,000 generated transactions. The number of patterns was set to 2,000 and the number of items to 1,000.

In addition to the experiments from [2; 21], we restricted the maximal length of generated itemsets from 1 up to 9 on the dataset "T20.I4.D100K" at minsupp = 0.33%, c.f. Figure 9. Figures 10 and 11 show the behavior of the algorithms on real-world applications. The basket data consists of about 70,000 customer transactions with approximately 60,000 different items. The average transaction size is ≈ 10.5 items. The car equipment data contains information about 700,000 cars with about 6,000 items. On average, ≈ 20 items are assigned to each car.

It is important to say that all algorithms scale linearly with the database size.

4.2 Counting Occurrences vs. Intersecting Sets

The basic question concerning the runtime of the algorithms is whether counting occurrences or intersecting tidlists shows better performance. The advantage of counting is that only candidates that actually occur in the transactions cause any effort. In contrast, an intersection means at least passing through all tids of the smaller of the two tidlists, even if the candidate is not contained in the database at all. ("Fast intersections" save some costs but we still need to pass a substantial number of tids.) But intersections also have their benefits. Counting implies looking up the candidates in the transactions. Of course this can get quite expensive for candidates of higher cardinality. On the contrary, when using intersections the size of the candidate under investigation does not have any influence. In practice, both effects seem to balance out on basket-like data.
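The intersection-based strategy of Eclat, including the early break of the "fast intersections" (Section 3.5), can be sketched as follows. This is our own simplified illustration rather than the implementation of [28]: support is kept as an absolute tid count (mincount), and all names and the toy database are ours.

```python
def intersect_fast(xs, ys, minlen):
    """Intersect two ascending tidlists; break off ("fast intersection")
    as soon as the result can no longer reach minlen tids."""
    out, i, j = [], 0, 0
    while i < len(xs) and j < len(ys):
        if len(out) + min(len(xs) - i, len(ys) - j) < minlen:
            return None                       # early break: threshold unreachable
        if xs[i] == ys[j]:
            out.append(xs[i]); i += 1; j += 1
        elif xs[i] < ys[j]:
            i += 1
        else:
            j += 1
    return out if len(out) >= minlen else None

def eclat_sketch(class_items, mincount, freq):
    """DFS over the tree of Section 2.2; class_items is a list of
    (itemset, tidlist) pairs forming one class of the tree."""
    for idx, (x, tx) in enumerate(class_items):
        freq[frozenset(x)] = len(tx)
        child = []
        for y, ty in class_items[idx + 1:]:
            t = intersect_fast(tx, ty, mincount)
            if t is not None:                 # frequent: keep for the child class
                child.append((x | y, t))
        if child:
            eclat_sketch(child, mincount, freq)

# Build tidlists for single items from a toy database.
db = [{'a', 'b', 'c'}, {'a', 'b'}, {'a', 'c'}, {'b', 'c'}]
tidlists = {}
for tid, t in enumerate(db):
    for item in t:
        tidlists.setdefault(item, []).append(tid)

freq = {}
seed = [(frozenset([i]), tl) for i, tl in sorted(tidlists.items())
        if len(tl) >= 2]                      # frequent 1-itemsets
eclat_sketch(seed, mincount=2, freq=freq)
print(freq)
```

Each recursion keeps only the tidlists of the current class in memory, which mirrors the observation above that Partition-style database splitting becomes unnecessary under DFS.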
[Runtime plots; the x-axis is minsupp in % (Figures 4-8, 10, 11) or the maximal frequent itemset size (Figure 9); curves not reproduced.]

Figure 4: T10.I2.D100K
Figure 5: T10.I4.D100K
Figure 6: T20.I2.D100K
Figure 7: T20.I4.D100K
Figure 8: T20.I6.D100K
Figure 9: Maximal Frequent Itemset Size
Figure 10: Basket Data
Figure 11: Car Equipment
The runtime behavior in Figures 4-8 does not show any substantial differences between the algorithm Apriori, which counts occurrences, and the tidlist-intersecting algorithms Partition and Eclat. Only at quite low average size of the maximal potentially frequent itemsets, e.g. "T10.I2.D100K" in Figure 4, is Apriori somewhat superior, whereas at larger average size of the maximal potentially frequent itemsets, e.g. "T20.I6.D100K" in Figure 8, Partition and Eclat perform better. The same explanation holds for the real-world experiments. With an average size of ≈ 2.2 items at minsupp = 0.06%, the frequent itemsets found in the basket data were rather short compared to the frequent itemsets from the car equipment database, which contained 4.1 items on average at minsupp = 3%.

In Figure 9 it becomes clear what happens behind the scenes: Eclat and Partition spend most of their time determining the support values of the 2- and 3-candidates, whereas Apriori handles such small itemsets efficiently. In contrast, for itemsets with size ≥ 4 the additional effort caused for Eclat and Partition is negligible, whereas this does not hold for Apriori.

4.3 Relaxing the Separation between Candidate Generation and Support Counting

On the one hand, the introduction of the prefix-tree with DIC allows relaxing the separation between candidate generation and support counting and therefore reduces the number of database scans. Moreover, each node is assigned to precisely one itemset. Consequently, looking up candidates in bitmap-encoded transactions is no longer necessary. On the other hand, the memory usage of the prefix-tree caused problems when we experimented with our own implementation of DIC: The prefix-tree is already set up before all frequent 1-itemsets are known. That means a mapping to frequent items as described in [2], which keeps the memory usage of the hash tables in each node of the tree small, is not possible. Moreover, this effect is strengthened by the fact that every candidate is stored in its own node and that in addition a separate node for each prefix of the candidate exists. Another drawback of the prefix-tree is that the frequent itemsets are stored in the same tree as the candidates. Each frequent k-itemset that is not a prefix of any k'-candidate with k' > k imposes overhead when counting that is avoided by the hashtree approach of Apriori.

Actually, we did not come to a final result concerning the efficiency of DIC. But what we want to say is that DIC's advantage of reducing the number of database scans should not be overestimated in a retail environment. In [7] a performance gain of only about 30% compared with basic Apriori is detected for data with quite small average size of the maximal potentially frequent itemsets.

4.4 Additional Candidate Pruning

Typically, the main task of the algorithms is determining support values. That is, the time spent with candidate generation and candidate pruning can be neglected. Consequently, Apriori's candidate pruning step helps to reduce the candidates to be counted but does not add any substantial overhead. It is important to note that basic DFS, as employed by Eclat, does not allow proper subset pruning. Only right-most DFS as introduced in [12] for the mining of generalized association rules allows transferring the additional prune step of Apriori to algorithms using DFS.

Our experiments suggest that the effect of additional candidate pruning is rather small for the datasets we took into consideration. Additional pruning does not help Partition to compensate Eclat's advantage based on the "fast intersections". This impression was also supported by further studies with an enhanced version of Eclat that incorporates additional candidate pruning by using right-most DFS.

4.5 Splitting the Database

Partition needs to split the database. Whereas this optimization helps to cope with large databases, it adds the additional overhead of an extra pass to determine the globally frequent itemsets. In our experiments the sizes of the transaction sets were always small enough that we could employ Partition without splitting. In [21], especially at lower values for minsupp, Partition with splitting suffers strongly. The reason is the increasing number of locally frequent itemsets that finally turn out to be globally infrequent.

4.6 "Fast Intersections"

Eclat's fast intersections are obviously an advantage. The overhead caused by checking whether minsupp is still reachable is clearly outweighed by breaking off unnecessary intersections. As a result, in all our experiments Eclat beats Partition by a nearly constant factor.

5. CONCLUSION

In this paper we dealt with the algorithmic aspects of association rule mining. We restricted ourselves to the "classic" association rule problem, that is, the generation of all association rules that exist in market basket-like data with respect to minimal thresholds for support and confidence.

From the broad variety of efficient algorithms that have been developed we compared the most important ones. We systematized the algorithms and analyzed their performance based on both runtime experiments and theoretic considerations. The results were quite surprising: Although we identified fundamental differences concerning the employed strategies, the algorithms show quite similar runtime behavior in our experiments. At least, no algorithm fundamentally outperforms the others. In fact, our experiments showed that the advantages and disadvantages we identified concerning the strategy to determine the support values of the frequent itemsets nearly balance out on market basket-like data.

In a forthcoming paper we pursue the development of a hybrid approach that efficiently combines counting occurrences and tidlist intersections [11].

6. REFERENCES

[1] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proc. of the ACM SIGMOD Int'l Conf. on Management of Data (ACM SIGMOD '93), Washington, USA, May 1993.

[2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. of the 20th Int'l Conf. on Very Large Databases (VLDB '94), Santiago, Chile, June 1994.
SIGKDD Explorations. Volume 2, Issue 1 - page 63
[3] R. Agrawal and R. Srikant. Mining sequential patterns. In Proc. of the Int'l Conf. on Data Engineering (ICDE), Taipei, Taiwan, March 1995.

[4] N. F. Ayan, A. U. Tansel, and E. Arkun. An efficient algorithm to update large itemsets with early pruning. In Proc. of the 5th Int'l Conf. on Knowledge Discovery and Data Mining (KDD '99), San Diego, California, USA, August 1999.

[5] R. J. Bayardo Jr., R. Agrawal, and D. Gunopulos. Constraint-based rule mining in large, dense databases. In Proc. of the 15th Int'l Conf. on Data Engineering, Sydney, Australia, March 1999.

[6] S. Brin, R. Motwani, and C. Silverstein. Beyond market baskets: Generalizing association rules to correlations. In Proc. of the ACM SIGMOD Int'l Conf. on Management of Data (ACM SIGMOD '97), 1997.

[7] S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In Proc. of the ACM SIGMOD Int'l Conf. on Management of Data, 1997.

[8] T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Mining optimized association rules for numeric attributes. In Proc. of the 15th ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems (PODS '96), Montreal, Canada, June 1996.

[9] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proc. of the 2000 ACM-SIGMOD Int'l Conf. on Management of Data, Dallas, Texas, USA, May 2000.

[10] C. Hidber. Online association rule mining. In Proc. of the 1999 ACM SIGMOD Conf. on Management of Data, 1999.

[11] J. Hipp, U. Güntzer, and G. Nakhaeizadeh. Mining association rules: Deriving a superior algorithm by analysing today's approaches. In Proc. of the 4th European Conf. on Principles and Practice of Knowledge Discovery, Lyon, France, September 2000. To appear.

[12] J. Hipp, A. Myka, R. Wirth, and U. Güntzer. A new algorithm for faster mining of generalized association rules. In Proc. of the 2nd European Symposium on Principles of Data Mining and Knowledge Discovery (PKDD '98), Nantes, France, September 1998.

[13] M. Houtsma and A. Swami. Set-oriented mining for association rules in relational databases. Technical Report RJ 9567, IBM Almaden Research Center, San Jose, California, October 1993.

[14] M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding interesting rules from large sets of discovered association rules. In Proc. of the 3rd Int'l Conf. on Information and Knowledge Management, Gaithersburg, Maryland, November 29 - December 2, 1994.

[15] H. Mannila, H. Toivonen, and I. Verkamo. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1(3), November 1997.

[16] R. Motwani, E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, J. D. Ullman, and C. Yang. Finding interesting associations without support pruning. In Proc. of the 16th Int'l Conf. on Data Engineering (ICDE). IEEE, 2000.

[17] R. Ng, L. S. Lakshmanan, J. Han, and A. Pang. Exploratory mining and pruning optimizations of constrained association rules. In Proc. of the 1998 ACM SIGMOD Int'l Conf. on Management of Data, Seattle, Washington, USA, June 1998.

[18] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. In Proc. of the 7th Int'l Conf. on Database Theory (ICDT '99), Jerusalem, Israel, January 1999.

[19] J. Pei, J. Han, and R. Mao. An efficient algorithm for mining frequent closed itemsets. In Proc. of the 2000 ACM-SIGMOD Int'l Conf. on Management of Data, Dallas, Texas, USA, May 2000.

[20] R. Rastogi and K. Shim. Mining optimized support rules for numeric attributes. In Proc. of the 15th Int'l Conf. on Data Engineering, Sydney, Australia, March 1999. IEEE Computer Society Press.

[21] A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In Proc. of the 21st Conf. on Very Large Databases (VLDB '95), Zürich, Switzerland, September 1995.

[22] C. Silverstein, S. Brin, R. Motwani, and J. D. Ullman. Scalable techniques for mining causal structures. In Proc. of the 1998 ACM SIGMOD Int'l Conf. on Management of Data, Seattle, Washington, USA, June 1998.

[23] R. Srikant and R. Agrawal. Mining generalized association rules. In Proc. of the 21st Conf. on Very Large Databases (VLDB '95), Zürich, Switzerland, September 1995.

[24] R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables. In Proc. of the 1996 ACM SIGMOD Conf. on Management of Data, Montreal, Canada, June 1996.

[25] R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item constraints. In Proc. of the 3rd Int'l Conf. on KDD and Data Mining (KDD '97), Newport Beach, California, August 1997.

[26] S. Thomas, S. Bodagala, K. Alsabti, and S. Ranka. An efficient algorithm for the incremental updation of association rules in large databases. In Proc. of the 3rd Int'l Conf. on KDD and Data Mining (KDD '97), Newport Beach, California, August 1997.

[27] D. Tsur, J. D. Ullman, S. Abiteboul, C. Clifton, R. Motwani, S. Nestorov, and A. Rosenthal. Query flocks: A generalization of association-rule mining. In Proc. of the 1998 ACM SIGMOD Int'l Conf. on Management of Data, Seattle, Washington, USA, June 1998.

[28] M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New algorithms for fast discovery of association rules. In Proc. of the 3rd Int'l Conf. on KDD and Data Mining (KDD '97), Newport Beach, California, August 1997.