CCST9047: The Age
of Big Data
Lecture 5
Feb. 28, 2024
Notice
‣ We have tutorial this week, homework 1 will be released after the
tutorial.
‣ We have the make-up course in reading week.
‣ The rst in-class quiz will be the week after reading week.
‣ We will allow one-page cheating sheet.
‣ The quiz will cover the material by today.
‣ We will open the portal for proposal submission, the proposal
presentation will be in the same week as the quiz.
fi
Recap of Previous Lecture
‣ Data Mining
‣ Non-trivial discovering interesting patterns and insights from the data
‣ Predictive Mining
‣ Classi cation: Decision Tree, Linear classi cation
‣ Regression: Linear regression, generalized linear regression
‣ Brief introduction of neural networks
fi
fi
Data Mining Tasks
‣ Predictive Mining (Previous lecture)
‣ Classi cation
‣ Regression
‣ Description Mining (This lecture)
‣ Clustering
‣ Association Rule Discovery
‣ Principle Component Analysis
fi
Description Mining
‣ Find human-interpretable patterns that describe the data:
‣ Which data points are similar to each other?
‣ Relation between data points or features?
‣ What’s the trend of the data?
‣ Causal relationships
‣ You do not have speci c ground-truth target of the data.
‣ Unsupervised learning
fi
Clustering
‣ What’s Clustering
‣ Organizing data into clusters such that
‣ Intra-cluster data are similar
‣ Inter-cluster data are less similar
‣ Finding natural groupings among objects
Similarity
‣ How to measure the similarity between data?
Data 1 Data 2
Distance function
D(data1, data2)
Similarity
‣ For Data in the continuous space
‣ Data is formulated as a continuous
vector
‣ Distance function can be
‣ ℓ2 norm: ∥x1 − x2∥2
‣ 1
ℓ norm: ∥x1 − x ∥
2 1
Similarity
‣ For a set of directions
‣ Data is formulated as a continuous
vector
‣ Distance function can be cosine
⟨x1, x2⟩
function: D(x1, x2) =
∥x1∥2∥x2∥2
Similarity
‣ For a set of images
‣ We rst need a feature extractor to
get semantic feature from the data.
‣ We then use ℓ1 or ℓ2 norm to
characterize the distance between
these semantic features.
fi
Clustering With Categorical Features
‣ If the data has categorical features, one can cluster datapoints directly.
Clustering With Categorical Features
Martial Taxable
Refund Cheat
status Income
Yes Single 125K No
Yes Married 120K No
Yes Divorced 220K No
Grouping by refund
Martial Taxable
Refund Cheat
status Income
No Married 100K No
No Single 70K No
No Divorced 95K Yes
No Married 60K No
No Single 85K Yes
No Married 75K No
No Single 90K Yes
Clustering by Key Words
‣ Each data point is a paper
‣ Each paper has a number of
key words
‣ Clustering can be performed
based on the key words
Clustering in Euclidean Space
‣ In more challenging case, the data in euclidean cannot be directly grouped.
‣ We do not know the label of data
points.
‣ We do not know how many
clusters should be considered.
‣ How to build the clusters?
Clustering in Euclidean Space
‣ In more challenging case, the data in euclidean cannot be directly grouped.
?
K-Means Clustering
‣ We rst need to decide the number of clusters, i.e., K.
K-means Clustering: Initialization
‣ Step 1: initialize the centers of K = 3 clusters
Decide K, and initialize K centers (randomly)
5
4
k1
3
2
k2
k3
0
0 1 2 3 4 5
fi
K-Means Clustering
‣ Step 2: assignK-means
all data to Clustering:
the nearest center Iteration 1 the centers to the
and move
Assign all objects to the nearest center.
means of the new clusters.
Move a center to the mean of its members.
5
4
k1
3
2
k2
k3
0
0 1 2 3 4 5
K-Means Clustering
K-means Clustering: Iteration 2
‣ Step 3:After
re-assign the data
moving centers, and move
re-assign centers
the objects again
to nearest
Move a center to the mean of its new members.
centers.
5
4
k1
k3
1 k2
0
0 1 2 3 4 5
K-Means Clustering
‣ K-means
Step 4 or more: Clustering:
re-assign the data andFinished!
move centers again, until the
Re-assign and move centers, until …
clustering result does not change.
no objects changed membership.
expression in condition 2 5
4
k1
k2
1
k3
0
0 1 2 3 4 5
General Algorithm of K-means
1. Decide on a value for K, the number of clusters
2. Initialize the K cluster centers (randomly or using prior
information)
3. Decide the class memberships of data points by assigning
them to the nearest cluster center.
4. Re-estimate the K cluster centers, by assuming the
memberships found above are correct.
5. Repeat the steps 4-5 until none of the data changed
membership in the last iteration.
General Algorithm of K-means
• Decide the class memberships of data points by assigning them
to the nearest cluster center.
• Nearest cluster center can be found using di erent similarity/
distance measure.
• Re-estimate the K cluster centers, by assuming the
memberships found above are correct.
• Estimating the centers can be performed by calculating the
mean or median, or weighted average.
ff
Why K-means Works
‣ What’s a good partition?
‣ High intra-cluster similarity
‣ The average distance to the center of the cluster should be as small as
possible.
Distance to the cluster
K
2
∑∑
min ∥xi − μk∥2
k=1 i∈Ck
Cluster k
Why K-means Works
‣ Find the best μk
‣ Take derivatives over μk
K 2
∂ ∑k=1 ∑i∈C ∥xi − μk∥2
∑
k
= (μk − xi)
∂μk i∈Ck
‣ Derivatives becomes zero implies μk is the average of data in Ck
‣ This is consistent with the goal of K-means
What if the distance is ℓ1 norm? ⟹ one may use median to get the center
rather than average.
Choice of K
jective function isWhen When k = 3, the objective function is 133.6
873.0k = 2, the objective function is 173.1
1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10
K=1 K=2 K=3
k = 2 can already provide reasonable clustering
Choice plotK
We can of the objective function values for k equals 1 to 6…
The abrupt change at k = 2, is highly
K suggestive of two clusters
2 of
∑ or∑“elbow finding”.
in the data. This technique
We can calculate the objectivefor ∥xi − μk∥2
determining the number
clusters is known as “knee finding”
k=1 i∈Ck
1.00E+03
9.00E+02
Objective Function
8.00E+02
7.00E+02
6.00E+02
5.00E+02
4.00E+02
3.00E+02
2.00E+02
1.00E+02
0.00E+00
1 2 3
k 4 5 6
k = 2 can already give a low objective value.
Summary of K-means
‣ Strengths
‣ Simple, easy to implement and debug
‣ Intuitive objective function: optimizes intra-cluster similarity
‣ E cient, especially when K ≪ n
ffi
Summary of K-means
‣ Weakness
‣ Only applicable when the center can be
calculated, thus cannot be applied to
categorical data.
‣ Relatively sensitive to initialization, cannot
guarantee to converge to global min
‣ Need to specify K
‣ Unable to handle noisy data and outliers
‣ Not suitable to discover clusters with non-
convex shape
Association Analysis
Customer Purchase
1 Beer, Chips, Diapers
2 Apple, Beer, Diapers
3 Beer, Egg
4 Chips
5 Beer, Egg, Diapers
Customers who like to buy diapers
6 Apple, Beer also like to buy beer.
7 Beer, Chips, Diapers, Egg
Association Rule Discovery
‣ Given a set of records each of which contain some number of items from a
given collection
‣ Produce dependency rules which will predict occurrence of an item
based on occurrences of other items
Customer Purchase
1 Beer, Coke, Milk
Rules discovered
2 Beer, Bread
3 Beer, Coke, Diaper, Milk {Milk} → {Coke}
4 Beer, Bread, Diaper, Milk {Diaper, Milk} → {Beer}
5 Coke, Diaper, Milk
Application of Association Rule Discovery
‣ Marketing and Sales promotion:
‣ Considering the rule: {Bagels, …}→{Chips}
‣ Chips as consequent
‣ Can be used to see what should be done to boost its sales
‣ Bagels in the antecedent
‣ Can be used to see which products will be a ected if the store
continues to sell Bagels
‣ Bagels in antecedent and Chips in consequent
‣ can be used to see what products should be sold with Bagels to
promote sale of chips
ff
Application of Association Rule Discovery
‣ Supermarket shelf management
‣ Goal: to identify items that are bought together by su ciently many
customers
‣ Approach: process the point-of-sale data collected with barcode
scanners to nd dependencies of items
‣ Classical rules:
‣ If a customer buys diaper and milk, then he is very likely to buy beer.
‣ So, don’t be surprised if you nd six-packs stacked next to diapers!
fi
fi
ffi
Basic Association Rule Mining Algorithm
‣ Given a set of purchase transactions, nd rules that
will predict the occurrence of an item based on the
occurrences of other items in the transaction. Customer Purchase
‣ Itemset: a collection of items, e.g., {milk, bread, 1 Bread, Milk
diaper} 2 Bread, Diaper, Beer, Eggs
‣ Support count (σ): frequency of occurrence of an 3 Milk, Diaper, Beer, Coke
items, e.g., σ({milk, bread}) = 3 4 Bread, Milk, Diaper, Beer
‣ Support: Fraction of transactions that contain an 5 Bread, Milk, Diaper, Coke
items, e.g., s({milk, bread})=3/5
fi
Basic Association Rule Mining Algorithm
‣ Association Rule Customer Purchase
‣ An implication expression of the form X → Y, 1 Bread, Milk
where X and Y are item sets
2 Bread, Diaper, Beer, Eggs
‣
Milk, Diaper, Beer, Coke
Example: {Milk, Diaper}→{Beer}
4 Bread, Milk, Diaper, Beer
‣ Rule Evaluation Metrics 5 Bread, Milk, Diaper, Coke
‣ Support (s): fraction of transactions
containing both X and Y s({Milk, Diaper, Beer}) = 2/5
‣ Con dence (c): measures how often items in
c({Milk, Diaper}→{Beer}) = 2/3
Y appear in the transactions that contain X
fi
Basic Association Rule Mining Algorithm
‣ Given a set of transactions T, the goal of association rule mining is to
nd all rules having
‣ Support is su ciently large: s > sthreshold
‣ Con dence is su ciently large: c > cthreshold
‣ Brute-force approach:
‣ List all possible rules
‣ Calculating the support and con dence
‣ Prune rules based on sthreshold and cthreshold
‣ Computationally expensive!!
fi
fi
ffi
ffi
fi
Basic Association Rule Mining Algorithm
Customer Purchase
{Milk,Diaper} →{Beer} (s=0.4, c=0.67)
1 Bread, Milk {Milk,Beer} → {Diaper} (s=0.4, c=1.0)
2 Bread, Diaper, Beer, Eggs {Diaper,Beer} → {Milk} (s=0.4, c=0.67)
3 Milk, Diaper, Beer, Coke {Beer} → {Milk,Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk,Beer} (s=0.4, c=0.5)
4 Bread, Milk, Diaper, Beer
{Milk} → {Diaper,Beer} (s=0.4, c=0.5)
5 Bread, Milk, Diaper, Coke
Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have
di erent con dence
• Thus, we may decouple the support and con dence requirements
ff
fi
fi
Dimension Reduction
Given data points in d dimensions
Convert them to data points in r<d dimensions
With minimal loss of informa on
‣ Clustering
‣ One way to summarize a complex real-valued data point with a
single categorical variable.
‣ Dimension Reduction
‣ Another way to simplify complex high-dimensional data
‣ Summarize data with a lower dimensional real valued vector
‣ Given data points in d dimensions
‣ Convert them to data points in r < d dimensions with
minimal loss of information
ti
Data Compression
Reduce 2D data to 1D
Data Compression
One idea: only use one feature
Data Compression
Reduce 3D data to 2D: use two features out of three
Effective Projection
Criteria of good data compression
‣ Projection: x d
→ proj(x): ℝ → ℝ k
‣ Reconstruction: proj(x) → recon(proj(x)) : ℝ k
→ ℝ d
‣ Projection error: ∥x − recon(proj(x)∥ 2
‣ We hope to nd the projection approach that minimizes the projection error!
fi
A 2D example
‣ Projection to one of the original
variable may not be good
‣ Projections to PC1 could be a
better choice.
Covariance
‣ Variance and covariance:
‣ Measure of the “spread” of a set of points around their center of mass (mean)
‣ Variance:
‣ Measure of the deviation from the mean for points in one dimension
‣ Covariance:
‣ Measure of how much each of the dimensions vary from the mean with
respect to each other
Covariance Matrix
X = [X1, …, XK] Covariance between two di erent dimensions
Variance of the dimension
ff
Covariance for Computing Principle Components
‣ Given the data, how to nd the
principle components/directions?
‣ PC1 and PC2 are orthogonal.
‣ It can be seen that the projection of
data onto PC1 has very large variance.
https://cs.wmich.edu/alfuqaha/summer14/cs6530/lectures/DecisionTrees.pdf
fi
Eigen-decomposition of the Covariance Matrix
‣ Let A = COV(X) denotes the covariance matrix
‣ Eigenvalue λ and corresponding eigenvector v: Av = λv
‣ For d × d matrix, we have a set of eigenvalues λ1 , …, λd and corresponding
eigenvectors v1, …, vd such that
Avi = λivi
‣ Larger λi implies higher variance of data along the direction of vi
Application of Principle Components
‣ Principle components for face image
Application of Principle Components
‣ Image representation and reconstruction
Application of Principle Components
‣ Calculating principle components using di erent sample size, then do image
reconstruction
ff
Application of Principle Components
‣ Compressing the image by only using its principle components
Application of Principle Components
Application of Principle Components
Other Dimension Reduction Methods
‣ PCA (principle component analysis):
‣ Find linear projections that maximize the variance
‣ ICA (independent component analysis):
‣ Similar to PCA but has di erent statistical assumptions
‣ Auto-encoder-decoder:
‣ Training two neural networks to compress and reconstruct by minimizing
the reconstruction error.
ff
Summary
‣ Description Data Mining
‣ Clustering: K-means
‣ Associate Rule Discovery
‣ Principle component analysis: Eigen-decomposition of covariance matrix.