IFN645 Large Scale Data Mining
Lecture 4 – Feature Selection
Faculty of Science and Engineering
Semester 2, 2021
CRICOS No. 00213J
Queensland University of Technology
Last Week
• Major types of Clustering algorithms
• Hierarchical
• EM
• DBSCAN
• Canopy
Practical
• Use Weka
– Running KMeans
– Running Hierarchical Clustering
– Running EM
– Running Canopy
• DBScan
– Install it
– Activate it
• Accessing Weka in Java
This Week's Lecture
• Feature Selection
– Manual Selection
– Wrappers
– Filters
FEATURE SELECTION
Feature Selection
Problems in Initial Features
• Irrelevant – Features that actively add noise to the system
• Redundant – Features that do not add anything new to the system
• Interacting features – Appear irrelevant individually, but can be relevant when combined with others
Feature Extraction vs Feature Selection
• Both deal with the curse-of-dimensionality issue
• Both improve data mining quality
Feature Selection
Use only a subset of the features/dimensions
Feature Selection
• Usually, the initial feature set is large and contains irrelevant or redundant
features.
• The goal of feature selection is to manually or automatically select a subset
of features which retain most of the relevant information.
– To improve performance by removing irrelevant or redundant features (noise)
– To increase execution speed
– To decrease memory requirements
– To increase generalization by reducing the risk of overfitting
Feature Selection
Feature Selection Approaches
1. Manual feature selection
2. Wrapper methods – consider feature selection as a search problem
3. Filter methods – assign a score to features
4. Embedded methods – combine wrapper and filter techniques, performing selection as part of model learning
MANUAL SELECTION
Manual Selection
• Usually we need to get domain experts’ advice about which features to keep.
• Experts on the application domain.
• E.g. the Glass dataset in Weka with the J48 algorithm (10-fold cross-validation); a Java sketch of this workflow appears below
– Using all features, accuracy is 66.8%
– Remove feature 'Fe': 67.3%
– Remove feature 'Al': 70.6%
– Remove 'RI' and 'Mg': 65.4%
– Remove everything except features 'RI' and 'Mg': 68.7%
– Remove everything except 'RI', 'Na', 'Mg', 'Ca', 'Ba': 73.8%
• For the Glass dataset
– Mg and RI look important
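The experiment above can be reproduced programmatically. Below is a minimal sketch using the Weka Java API (assuming Weka 3.8); the class name ManualSelectionDemo is illustrative, and the attribute indices assume the standard glass.arff ordering (RI, Na, Mg, Al, Si, K, Ca, Ba, Fe, Type).

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class ManualSelectionDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("glass.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Manually keep only RI, Na, Mg, Ca, Ba plus the class attribute.
        // Indices are 1-based and assume the standard glass.arff attribute order.
        Remove remove = new Remove();
        remove.setAttributeIndices("1,2,3,7,8,10");
        remove.setInvertSelection(true);   // keep the listed attributes, drop the rest
        remove.setInputFormat(data);
        Instances reduced = Filter.useFilter(data, remove);

        // Evaluate J48 with 10-fold cross-validation on the reduced feature set
        Evaluation eval = new Evaluation(reduced);
        eval.crossValidateModel(new J48(), reduced, 10, new Random(1));
        System.out.println("Accuracy: " + eval.pctCorrect() + "%");
    }
}

Changing the index string and re-running is the programmatic equivalent of the manual trial-and-error shown in the bullet list.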
Manual Selection
Manual Selection
• Problems with manual selection
1. It is an inefficient process, which does not scale.
2. It can only be applied to objects whose features can easily be manually
identified.
WRAPPER
Wrapper
• Wrapper-based feature selection uses a mining model to determine a set of features, based on the performance of the model built with those features.
• The feature selection process is a search process, e.g., using a greedy search.
• Basic strategy
– Create a subset of features
– Use a mining model to evaluate the accuracy obtained with that subset of features
– Based on the accuracy, either accept or reject the subset of features
Wrapper
• Basic approach of wrapper-based feature selection using exhaustive search
– Pick a feature subset and pass it into a learning algorithm for feature selection.
– Train the learning model with the training set
– Calculate the accuracy of the learning model with a test set
– Repeat for all feature subsets and pick the feature subset which has the highest
evaluation result.
• The basic approach is simple
• But it can be inefficient, since there is an exponential number of possible subsets
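Because exhaustive search is usually infeasible, a heuristic search is used in practice. Below is a minimal Weka/Java sketch of wrapper-based selection with a greedy forward search instead of the exhaustive search described above (assuming the Weka 3.8 API; the class name WrapperDemo is illustrative).

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.GreedyStepwise;
import weka.attributeSelection.WrapperSubsetEval;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WrapperDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("glass.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Wrapper evaluator: scores each candidate subset by
        // cross-validating J48 trained on that subset
        WrapperSubsetEval eval = new WrapperSubsetEval();
        eval.setClassifier(new J48());
        eval.setFolds(5);

        // Greedy forward search through the space of attribute subsets
        GreedyStepwise search = new GreedyStepwise();
        search.setSearchBackwards(false);

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(eval);
        selector.setSearch(search);
        selector.SelectAttributes(data);

        System.out.println(selector.toResultsString());
    }
}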
Search Strategies
Weka Wrapper: WrapperSubsetEval
Weka Wrapper: Cross-validation
Weka Wrapper: learning algorithm
Weka Wrapper
• Select attributes, then apply classifier J48 or IBk
– Dataset: glass.arff
– No attribute selection, default parameters everywhere
– Wrapper selection with J48 selects {RI, Mg, Al, K, Ba}
– Wrapper selection with IBk selects {RI, Mg, Al, K, Ca, Ba}

  Attributes                  J48    IBk
  All attributes              67%    71%
  {RI, Mg, Al, K, Ba}         71%     –
  {RI, Mg, Al, K, Ca, Ba}      –     78%
Weka Wrapper
• Attribute selection and classification together (in meta)
– Select and configure “AttributeSelectedClassifier”
Weka Wrapper
• Attribute selection and classification together (in meta)
– Use AttributeSelectedClassifier to wrap J48
– Use AttributeSelectedClassifier to wrap IBk

  Wrapper    J48    IBk
  J48        72%    74%
  IBk        70%    72%

• Without attribute selection

  Classifier    J48    IBk
                67%    71%

• Classification performance increases after attribute selection
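The meta setup above can also be driven from Java. Here is a hedged sketch using the standard AttributeSelectedClassifier with a wrapper evaluator and BestFirst search (assuming the Weka 3.8 API; the class name AttributeSelectedClassifierDemo and the choice of search are illustrative).

import java.util.Random;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.WrapperSubsetEval;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.AttributeSelectedClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AttributeSelectedClassifierDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("glass.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Wrapper evaluator using J48 to judge candidate attribute subsets
        WrapperSubsetEval wrapper = new WrapperSubsetEval();
        wrapper.setClassifier(new J48());

        // Meta-classifier: selection is performed inside each training fold,
        // so the cross-validation estimate is not biased by the selection step
        AttributeSelectedClassifier asc = new AttributeSelectedClassifier();
        asc.setEvaluator(wrapper);
        asc.setSearch(new BestFirst());
        asc.setClassifier(new J48());

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(asc, data, 10, new Random(1));
        System.out.println("Accuracy: " + eval.pctCorrect() + "%");
    }
}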
Wrapper-based feature selection
• Wrapper-based feature selection is simple and direct, but slow
• Either
– Use a single-attribute evaluator and eliminate irrelevant attributes, or
– Use a subset evaluator with a search method
• Wrapper methods are scheme-dependent
• Filters are scheme-independent
FILTERS
Feature Selection - Filters
• Filters work independently of any particular data mining model, i.e., they are scheme-independent.
• Filters seek a subset of features that maximizes (or minimizes) some merit score, e.g., feature similarity/distance, feature correlation, or information gain.
– Can score each feature individually, then choose the top features based on the score.
– Can score subsets of features together, then choose the best subset of features based
on the score.
Subset Filter: CfsSubsetEval
• CfsSubsetEval: a scheme-independent attribute subset evaluator.
– An attribute subset is good if the attributes in the subset are
• Highly correlated with the class attribute
• Not strongly correlated with one another
\[
\mathrm{Goodness}(A) = \frac{\sum_{x \in A} C(x, \mathrm{class\_attribute})}{\sum_{x \in A} \sum_{y \in A} C(x, y)}
\]
where A is a subset of attributes and C(x, y) measures the correlation between x and y.
Goodness(A) measures how good the subset is.
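A minimal Java sketch of running CfsSubsetEval with a BestFirst search through the Weka API (assuming Weka 3.8; the class name CfsDemo is illustrative):

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CfsDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("glass.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new CfsSubsetEval());  // correlation-based subset merit
        selector.setSearch(new BestFirst());         // search the space of subsets
        selector.SelectAttributes(data);

        // Print the names of the selected attributes
        // (the class attribute index is included at the end of the array)
        for (int i : selector.selectedAttributes()) {
            System.out.println(data.attribute(i).name());
        }
    }
}

Note that no classifier appears anywhere in this sketch, which is exactly what makes the filter scheme-independent.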
Single-attribute Filters
• Two Single-attribute filters:
InfoGainAttributeEval , GainRatioAttributeEval
– Measure attributes’ information gain or gain ratio with respect to the class
attribute.
– The higher the measurement score, the higher the correlation between
the attribute and the class attribute.
Ranking
• Attribute subset selection involves
– Attribute subset evaluation measure
– Search method
• Searching is slow because it finds a good subset by traversing a large space
of attribute subsets.
• Alternative: use a single-attribute evaluator with a ranking scheme for
individual attributes.
– Rank individual attributes by their individual evaluation score
– Eliminate irrelevant attributes with lower evaluation scores
Ranking Metric in Weka
• Metrics for evaluating individual attributes; we have seen some of them before
– InfoGainAttributeEval
information gain is used by decision tree algorithm C4.5 (i.e., J48)
– GainRatioAttributeEval
gain ratio is used by decision tree algorithm (i.e., J48)
– OneRAttributeEval
This is the evaluation value used in OneR algorithm
– SymmetricalUncertAttributeEval
symmetrical uncertainty
• The Ranker in Weka sorts attributes according to their evaluation scores and returns the top-k attributes.
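A minimal Java sketch of single-attribute ranking with InfoGainAttributeEval and the Ranker search (assuming the Weka 3.8 API; the class name RankingDemo and the choice of the top 5 attributes are illustrative):

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RankingDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("glass.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Score each attribute individually by its information gain
        InfoGainAttributeEval eval = new InfoGainAttributeEval();

        // Ranker sorts attributes by score; keep only the top-5
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(5);

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(eval);
        selector.setSearch(ranker);
        selector.SelectAttributes(data);

        System.out.println(selector.toResultsString());
    }
}

Swapping InfoGainAttributeEval for GainRatioAttributeEval, OneRAttributeEval, or SymmetricalUncertAttributeEval changes only the evaluator line.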
InfoGainAttributeEval
For a given dataset D and an attribute A, the information gain of A is defined as
\[
\mathrm{Gain}(A) = H(D) - H_A(D)
\]
where \(H(D)\) is the entropy of D:
\[
H(D) = -\sum_{i=1}^{k} P_i \log_2(P_i), \qquad P_i = \frac{|C_{i,D}|}{|D|}
\]
\(C_{i,D}\) is the subset of D containing the records of the ith class, \(|C_{i,D}|\) is the number of records in the ith class, and \(|D|\) is the number of records in D.
\[
H_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\, H(D_j)
\]
where \(D_j\) is the subset of D consisting of the records for which A takes its jth value.

Example dataset:

  Temperature   Wind   Class
  High          Low    Play
  Low           Low    Play
  High          Low    Play
  Low           High   Cancelled
  Low           Low    Play
  High          High   Cancelled
  High          Low    Play

class 1: Play; class 2: Cancelled
\(P_{\mathrm{play}} = 5/7 = 0.71\)
\(P_{\mathrm{cancelled}} = 2/7 = 0.29\)
InfoGainAttributeEval
\[
\mathrm{Gain}(\mathrm{Temperature}) = H(D) - H_{\mathrm{Temperature}}(D), \qquad P_{\mathrm{play}} = 0.71, \; P_{\mathrm{cancelled}} = 0.29
\]
\[
H(D) = -\sum_{i=1}^{k} P_i \log_2(P_i) = -(0.71 \log_2 0.71 + 0.29 \log_2 0.29) = 0.863
\]
Splitting D by Temperature (using the same example dataset as on the previous slide), \(D_{\mathrm{low}}\) has 3 records (2 Play, 1 Cancelled) and \(D_{\mathrm{high}}\) has 4 records (3 Play, 1 Cancelled):
\[
H(D_{\mathrm{low}}) = -\left(\tfrac{1}{3}\log_2\tfrac{1}{3} + \tfrac{2}{3}\log_2\tfrac{2}{3}\right) = 0.918
\]
\[
H(D_{\mathrm{high}}) = -\left(\tfrac{3}{4}\log_2\tfrac{3}{4} + \tfrac{1}{4}\log_2\tfrac{1}{4}\right) = 0.811
\]
\[
H_{\mathrm{Temperature}}(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\, H(D_j) = \frac{3}{7} \times 0.918 + \frac{4}{7} \times 0.811 = 0.857
\]
\[
\mathrm{Gain}(\mathrm{Temperature}) = H(D) - H_{\mathrm{Temperature}}(D) = 0.863 - 0.857 = 0.006
\]
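To make the arithmetic above easy to check, here is a small self-contained Java sketch (plain Java, not part of Weka; the class and method names are illustrative) that recomputes these entropies and the information gain of Temperature:

public class InfoGainByHand {
    // Entropy of a class distribution given as counts, e.g. (5, 2)
    static double entropy(int... counts) {
        int total = 0;
        for (int c : counts) total += c;
        double h = 0.0;
        for (int c : counts) {
            if (c == 0) continue;
            double p = (double) c / total;
            h -= p * (Math.log(p) / Math.log(2));   // log base 2
        }
        return h;
    }

    public static void main(String[] args) {
        double hD    = entropy(5, 2);                      // H(D)           ≈ 0.863
        double hLow  = entropy(2, 1);                      // H(D_low)       ≈ 0.918
        double hHigh = entropy(3, 1);                      // H(D_high)      ≈ 0.811
        double hTemp = 3.0 / 7 * hLow + 4.0 / 7 * hHigh;   // H_Temperature  ≈ 0.857
        System.out.println("Gain(Temperature) = " + (hD - hTemp));  // ≈ 0.006
    }
}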
CONCLUSION
This Week
• Feature Selection
– Manual Selection
– Wrappers
– Scheme-independent filters
– Attribute ranking
Practical
• In Weka, select attributes using
– Manual Selection
– Wrappers
– Filters
– Ranking
• Accessing Weka in Java
Thank You