Clickstream Analysis using
Association Rules
Web Analytics
▪ Web Structure Analytics – one of the method is network analysis which can be applied.
▪ Web Content Analytics - Text Analytics process can be applied.
▪ Web Usage Analytics – Application in Recommender system, Clickstream Analytics
Clickstream analytics
▪ Web log data of clicks recorded over a period of time can provide insights in
form of discovering patterns
Session Id Item
item1
Session ID 1
▪ Whenever item 4 is clicked item 5 is also clicked OR
Session ID 1 item3
▪ Whenever item4 is bought item 5 is also bought OR
Session ID 1 item4
Session ID 1 item5 ▪ Whenever item4 is viewed item 5 is also viewed
Session ID 2 item9
Session ID 2 item4
Session ID 2 item5
Discovering Patterns
▪ Pattern
▪ 12121?
▪ ’12’ pattern is found often enough So, with some confidence we can say ‘?’ is 2
▪ “If ‘1’ then ‘2’ follows”
▪ Pattern ➔ Model
Confidence
▪ 121212?
▪ 12121231212123121212?
▪ 121212➔ 3
▪ Models are created using historical data by detecting patterns. It is a calculated
guess about likelihood of repetition of pattern.
Discovering Patterns
What Is ASSOCIATION RULE MINING?
▪ Association rule mining is well know for Market Basket Analysis . It can also be
applied for clickstream analysis
▪ Finding frequent patterns, associations, correlations, or causal structures among
sets of items or objects in transaction databases, relational databases, and other
information repositories is called association rule mining.
▪ To simply put, given a set of records, each of which contain some number of
items from a given collection
▪ produce dependency rules which will predict occurrence of an item based on
occurrences of other items
What Is ASSOCIATION RULE MINING?
▪ Given a set of records, let say for a fresh farm ecommerce store, each of which
contain some number of items from a given collection
▪ produce dependency rules which will identify occurrence of an item based on
occurrences of other items
▪ For example set of records is in form of transaction data table from which rules are
found
TID Itemset Rules Found:
1 Bread, Coke, Milk {Milk} => {Coke}
{ Diaper, Milk} => {Beer}
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Association Rule Mining Basics
▪ It is assumed that we are given: (1) database of transactions, (2) each transaction
is a list of items (purchased/viewed/activity by a customer in a visit on
website)
▪ The transaction table have Transaction ID (TID) which is nominal field and
Itemset which is collection of items (one or more) (can be product or webpages)
TID Itemset
Then find: all rules that correlate the
1 Bread, Coke, Milk presence of one set of items with that of
2 Beer, Bread another set of items
E.g., 60% of people who viewed SonyAlpha12 also
3 Beer, Coke, Diaper, Milk tends to view Camera Tripod
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Applications of Association Rule Mining
▪ Market Basket Analysis or Association Rule Mining is applied in areas such as
▪ Clickstream Analytics – Consider rule for Goodreads as if you viewed the {Biography, … } -->
{Memoir}
▪ Marketing and Sales Promotion – Consider discovered rule as {Potato Chips , … } -->
{Cold drinks}
▪ Supermarket shelf management - Consider discovered rule as {Potato Chips , … } -->
{Cold drinks}
▪ Sequential Pattern Discovery - Given: set of objects, each associated with its own
timeline of events, find rules that predict strong sequential dependencies among
different events, of the form (A B) (C) (D E) --> (F). For
example, (Shoes) (Racket, Racketball) --> (Sports Jacket)
▪ Catalogue design for business, Product clustering , Credit/debit card analysis , Web
usage mining, Banking & Insurance products profiles
Translate Rules to Strategy
Strategy 1: Placing milk and bread as
frequently bought together on fresh
food website may further encourage
the sale of these items
Strategy 2: Placing milk and bread at opposite
ends. When people have put milk you remind
at the end of checkout process to add bread in
cart, along with other items associated
Strategy 3:Put these two items into a package
at reduced price.
10%
Off
Measures & Concepts of Association Rule
Mining
TID Itemset
▪ Itemset
1 Bread, Coke, Milk
▪ A collection of one or more items
2 Beer, Bread, Milk
▪ Example: {Milk, Bread, Diaper}
3 Beer, Coke, Diaper, Milk
▪ k-itemset
▪ An itemset that contains k items 4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
▪ Rule
▪ E.g. X=> Y
▪ Left Hand Side (LHS) of Rule = X is an itemset, for example X = { Beer, Bread}
▪ Right Hand Side (RHS) of Rule = Y is an itemset, for example Y = {Milk}
▪ => implication means co-occurrence & NOT causality , merely association
Support
TID Itemset
▪ Support count
1 Bread, Coke, Milk
▪ Frequency of occurrence of an itemset ,
▪ E.g. Support Count of ({Milk, Beer,Diaper}) = 2 2 Beer, Bread, Milk
3 Beer, Coke, Diaper, Milk
▪ Support
4 Beer, Bread, Diaper, Milk
▪ Fraction of transactions that contain an itemset
5 Coke, Diaper, Milk
▪ For a rule X => Y
▪ Probability that a transaction contains (X U Y)
▪ E.g. Support ({Milk, Beer, Diaper}) = 2/5 = 0.4
▪ Alternatively in probability notation,
▪ Support = P(XU Y) = Support Count (X U Y)/ Total no. of transactions
▪ Support is an indication of how frequently the itemset appears in the dataset
Confidence
TID Itemset
▪ Confidence
1 Bread, Coke, Milk
▪ For a rule X => Y
2 Beer, Bread, Milk
▪ conditional probability that a transaction having X also
contains Y 3 Beer, Coke, Diaper, Milk
▪ Measures how often itemset Y appear in transactions that 4 Beer, Bread, Diaper, Milk
contains X itemset 5 Coke, Diaper, Milk
▪ E.g. For Rule {Milk, Beer} => {Diaper}
▪ Confidence ({Milk, Beer, Diaper}) = 2/3 = 0.67
• Alternatively,
• Confidence(X=>Y) = Support (X U Y)/ Support (X)
• If we take Ex & Ey as events that a transaction contains itemset X & Y respectively then
• Support (X U Y) = P (Ex Ey)
• Confidence (X=>Y) = P(Ey/Ex) = P (Ex Ey) / P(Ex) = Support (X U Y) / Support (X)
• Confidence is an indication of how often the rule has been found to be true.
Association Rule Mining Task
▪ Now the Association Rule Mining Task can be broken down as
▪ Given a set of transactions T, the goal of association rule mining is to find all rules
having
▪ support ≥ minsup threshold ( user provided parameter)
▪ confidence ≥ minconf threshold ( user provided parameter)
▪ Also additionally we can consider Lift measure for evaluating rules
Generated Rule Evaluation
▪ Lift Measure
Coffee Coffee
▪ For a rule X => Y
▪ Lift (X=>Y) = Support (X U Y)/ (Support (X)*Support(Y))
Tea 15 5 20
Tea 75 5 80
90 10 100
▪ Lift & other measures can be used to prune/rank the derived patterns
▪ Let us test Association Rule: Tea → Coffee
Confidence = P(Coffee|Tea) = 15/20 = 0.75
▪ So it seems good rule. But note that Support (Coffee) = (15/100)/((20/100)*(90/100)) = 0.90
▪ Lift here is 0.75/0.9= 0.8333 (< 1, therefore is negatively associated i.e. substitute items) . So Lift is useful to judge
positive as well as negative association.
▪ Other rules interest measures are leverage, conviction, rule power factor , chisquare, cosine, coverage etc.
Association Rule Mining Task
▪ If we computer all rules for {Milk, Diaper, Beer} then we may realize that Rules
originating from the same itemset have identical support but have different
confidence
▪ So Task is broken down in two stages
▪ (1) Frequent Itemset Generation - Generate all itemsets whose support minsup
▪ (2) Rule Generation - Generate high confidence rules from each frequent itemset
where confidence of rule minconfidence
Additional Reference Slides – for those who
want to explore further ( not mandatory)
Approaches for association rule generation
▪ One approach can be Brute Force
▪ List all possible association rules
▪ Compute the support and confidence for each rule
▪ Prune rules that fail the minsup and minconf thresholds
▪ But this is Computationally Prohibitive
▪ Other is to use the algorithms like apriori, FP Growth, EClat
Alogrithms
▪ Apriori algorithm - uses a breadth-first search strategy to count the support of
itemsets and uses a candidate generation function which exploits the downward
closure property of support.
(a) Breadth first
▪ ECLAT algorithm - stands for Equivalence Class Transformation is a depth-first
search algorithm based on set intersection.
▪ FP-growth algorithm - FP stands for frequent pattern
(a) Breadth first (b) Depth first
Frequent Itemset Generation from Lattice
null
A B C D E
AB AC AD AE BC BD BE CD CE DE
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
ABCD ABCE ABDE ACDE BCDE
ABCDE
Apriori Algorithm in Action
TID Itemsets ▪ Let Bread be assigned code 1 Butter – 2; Milk -3
A {Bread, Butter, Milk, Sugar} ; Sugar-4
B {Bread, Butter, Sugar} ▪ Then Transaction dataset would be
C {Bread, Butter}
TID Itemsets
D {Butter, Milk, Sugar}
A {1,2,3,4}
E {Butter, Milk}
B {1,2,4}
F {Milk, Sugar}
C {1,2}
G {Butter, Sugar}
D {2,3,4}
E {2,3}
F {3,4}
G {2,4}
Apriori Algorithm in Action
Itemsets Support
Determine Frequent Item Support
{1} 3
1-Itemset {1,2} 3
{2} 6
Let minsupport = 3, so any {1,3} 1
{3} 4
itemset which appears more {1,4} 2
than equal to 3 will be {4} 5
{2,3} 3
frequent itemset
{2,4} 4
Determine Frequent 2-Itemset
{3,4} 3
Only {1,3} & {1,4} are not
frequent.
Apriori Algorithm make use of
the result that any superset of
these will not be frequent Item Support
{2,3,4} 2