Data Mining Term Project
Machine Learning with WEKA
Weka Explorer Tutorial
for Version 3.4.3
Svetlana S. Aksenova
Department of Computer Science
California State University, Sacramento
Fall 2004
Machine learning methods
for data mining
use techniques from computer science, statistics
and probability, and data visualization to search for
patterns and relationships in large data sets
Allow large amounts of data to be analyzed automatically
The results of the analysis help make
predictions faster and more accurately
The results of the analysis help make decisions faster and
more accurately
About WEKA
Developed by the University of Waikato in New
Zealand
Open source software issued under the GNU
General Public License
WEKA is a data mining system written in Java
Implements data mining algorithms
Compatible with most computer platforms
Algorithms are applied to a dataset through either the
command line or the graphical user interface
Introduction to the Tutorial
Created to help in the learning process
Consists of 8 parts:
Introduction
Launching WEKA
Preprocessing Data
Building Classifiers
Clustering Data
Finding Associations
Attribute Selection
Data Visualization
Launching WEKA
GUI Chooser: the Main Menu
Preprocessing
Data can be read from a
Local filesystem (in ARFF, CSV, C4.5, binary formats)
URL
SQL database (using JDBC)
File conversion
Preprocessing window
Preprocessing tools - filters
File Conversion
Excel
CSV
ARFF
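The CSV-to-ARFF step can be sketched in a few lines of Python. This is a minimal sketch and not the converter used in the tutorial: the function names csv_to_arff and is_number, the rule "a column is numeric if every value parses as a number", and the three sample rows are assumptions made for illustration.

```python
import csv
import io


def is_number(text):
    """Return True if the string parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def csv_to_arff(csv_text, relation):
    """Build ARFF text from CSV text whose first row holds attribute names.

    Assumption: names and values contain no spaces or commas, so no
    ARFF quoting is needed.
    """
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    header, data = rows[0], rows[1:]
    lines = ["@relation " + relation, ""]
    for i, name in enumerate(header):
        column = [row[i] for row in data]
        if all(is_number(v) for v in column):
            lines.append("@attribute %s numeric" % name)
        else:
            # Nominal attribute: list the distinct values in braces.
            lines.append("@attribute %s {%s}" % (name, ",".join(sorted(set(column)))))
    lines.append("")
    lines.append("@data")
    lines.extend(",".join(row) for row in data)
    return "\n".join(lines) + "\n"


sample = "outlook,temperature,play\nsunny,85,no\novercast,83,yes\nrainy,70,yes\n"
arff = csv_to_arff(sample, "weather")
print(arff)
```

The header section (@relation, @attribute) describes the columns, and the @data section carries the rows unchanged.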
Open File (from the local filesystem)
Open File (from a website)
http://gaia.ecs.csus.edu/~aksenovs/weather.arff
Preprocessing Window
Setting Filters
WEKA contains filters for discretization,
normalization, resampling, attribute selection,
transformation and combination of attributes.
Some techniques, such as association rule mining,
can only be performed on categorical data.
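To make discretization and normalization concrete, here is a minimal Python sketch. It is not the code of the WEKA filters: equal-width binning is only one possible discretization strategy, and the function names equal_width_bins and min_max_normalize, the bin labels, and the sample values are assumptions made for illustration.

```python
def equal_width_bins(values, n_bins):
    """Map each numeric value to a categorical label bin1..binN.

    The range [min, max] is split into n_bins intervals of equal width.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    labels = []
    for v in values:
        if width == 0:
            index = 0  # all values identical: everything falls in one bin
        else:
            # The maximum value would land in bin n_bins + 1, so clamp it.
            index = min(int((v - lo) / width), n_bins - 1)
        labels.append("bin%d" % (index + 1))
    return labels


def min_max_normalize(values):
    """Rescale numeric values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


temperatures = [64, 72, 85]  # illustrative values only
print(equal_width_bins(temperatures, 3))
print(min_max_normalize(temperatures))
```

After discretization the attribute is categorical, which is what techniques such as association rule mining require.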
Filter Configuration Options
Right-click on the filter
Building Classifiers
Choosing a classifier: J48 (C4.5)
Setting Test Options
Output the Result
Uses the weather data in weather.arff for classification
Analyzing Results
Visualizing Results
Tree Visualizer
Error Visualizer
Error Visualizer (contd)
Exercise
Given at the end of the section
Classification Exercise
Use the ID3 algorithm to classify the weather data
from the weather.arff file. As initial
preprocessing, create a version of the
dataset in which all numeric attributes
are converted to categorical data.
Clustering Data
The clustering schemes available in WEKA are
k-Means, EM, Cobweb, X-means, and FarthestFirst.
Uses customer data in customers.arff for clustering
Clustering Data (contd)
Choosing clustering scheme
k-Means
5 clusters
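For reference, the k-means procedure can be sketched in plain Python. This is a minimal sketch and not WEKA's implementation: the function name kmeans, the deterministic choice of the first k points as starting centroids, and the toy 2-D points are assumptions made for illustration. The slides use 5 clusters on customers.arff; the toy data below uses 2.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iterations=100):
    """Cluster points into k groups; returns (assignment, centroids).

    Assumption: the first k points serve as initial centroids, which
    keeps the sketch deterministic.
    """
    centroids = [list(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


toy_points = [(1, 1), (9, 9), (1, 2), (8, 9), (2, 1), (9, 8)]
assignment, centroids = kmeans(toy_points, 2)
print(assignment)
print(centroids)
```

The result depends on the starting centroids, so different initializations may give different clusters on real data.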
Setting test options
Analyzing results
Visualizing Results
Results of Clustering in ARFF File
Exercise
Given at the end of the section
Clustering Exercise
Use the k-means algorithm to cluster the bank data from
the bank.arff file. As initial
preprocessing, create a version of the
data set in which the ID field is
removed and the "children" attribute
is converted to categorical data.
Finding Associations
Apriori
Works only with discrete data
Identifies statistical dependencies between
groups of attributes
Uses grocery store data
from the grocery.arff file with
minimum confidence 40% and
minimum support 30%.
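The roles of support and confidence can be illustrated with a small level-wise Apriori sketch in Python. This is not WEKA's Apriori code: the function names, the five baskets, and the item names are hypothetical and are not taken from grocery.arff. Only the thresholds (30% support, 40% confidence) come from the slide.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets meeting min_support."""
    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items]
    frequent = {}
    size = 1
    while current:
        kept = {}
        for candidate in current:
            s = support(candidate, transactions)
            if s >= min_support:
                kept[candidate] = s
        frequent.update(kept)
        size += 1
        # Join step: unions of frequent sets that have the next size.
        candidates = {a | b for a in kept for b in kept if len(a | b) == size}
        # Prune step: every subset one item smaller must be frequent.
        current = [
            c for c in candidates
            if all(frozenset(sub) in kept for sub in combinations(c, size - 1))
        ]
    return frequent


def association_rules(frequent, min_confidence):
    """Return (lhs, rhs, support, confidence) for rules meeting min_confidence."""
    rules = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                confidence = sup / frequent[lhs]  # support(lhs and rhs) / support(lhs)
                if confidence >= min_confidence:
                    rules.append((lhs, itemset - lhs, sup, confidence))
    return rules


baskets = [
    frozenset(["bread", "milk"]),
    frozenset(["bread", "butter"]),
    frozenset(["bread", "milk", "butter"]),
    frozenset(["milk"]),
    frozenset(["bread", "milk"]),
]
frequent = frequent_itemsets(baskets, 0.30)
for lhs, rhs, sup, conf in association_rules(frequent, 0.40):
    print(sorted(lhs), "=>", sorted(rhs), "support=%.2f confidence=%.2f" % (sup, conf))
```

Support filters out rare item combinations first; confidence then measures how often the right-hand side appears in baskets that contain the left-hand side.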
Setting test options
Analyzing Results
Exercise
Given at the end of the section
Association Rules Exercise
Use the Apriori algorithm to generate association
rules for the Iris data from the iris.arff file.
As initial preprocessing, create a
version of the data set in which the
numeric attributes are converted to
categorical data.
Attribute Selection
Searches through all possible combinations of
attributes
Finds which subset of attributes works best for
prediction
Attribute selection methods contain two parts:
a search method: best-first, forward selection,
random, exhaustive, genetic algorithm, ranking
an evaluation method: correlation-based, wrapper,
information gain, chi-squared
Uses the weather data from the weather.arff file
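As an illustration of pairing a ranking search with the information gain evaluation method, the Python sketch below scores each attribute by how much it reduces the entropy of the class. It is a minimal sketch, not WEKA's attribute selection code: the function names and the small nominal table (in the style of the weather data, not read from weather.arff) are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus its expected entropy after splitting on attribute."""
    labels = [row[target] for row in rows]
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


def rank_attributes(rows, attributes, target):
    """Return (attribute, gain) pairs sorted from most to least informative."""
    gains = {a: information_gain(rows, a, target) for a in attributes}
    return sorted(gains.items(), key=lambda pair: pair[1], reverse=True)


columns = ("outlook", "windy", "play")
table = [
    ("sunny", "FALSE", "no"), ("sunny", "TRUE", "no"),
    ("overcast", "FALSE", "yes"), ("rainy", "FALSE", "yes"),
    ("rainy", "FALSE", "yes"), ("rainy", "TRUE", "no"),
    ("overcast", "TRUE", "yes"), ("sunny", "FALSE", "no"),
    ("sunny", "FALSE", "yes"), ("rainy", "FALSE", "yes"),
    ("sunny", "TRUE", "yes"), ("overcast", "TRUE", "yes"),
    ("overcast", "FALSE", "yes"), ("rainy", "TRUE", "no"),
]
rows = [dict(zip(columns, r)) for r in table]
ranking = rank_attributes(rows, ["outlook", "windy"], "play")
for name, gain in ranking:
    print("%-8s %.3f" % (name, gain))
```

On this toy table, outlook ranks above windy, with gains of roughly 0.247 and 0.048 bits.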
Attribute Selection (contd)
Data Visualization
Visualizes a 2-D plot of the current working relation
Helps determine the difficulty of the learning problem
Data Visualization (contd)
Selecting Instances
A group of points on the graph can be selected in
four ways:
1. Select Instance
2. Rectangle
3. Polygon
4. Polyline
Select Instance
Rectangle
Polygon
Polyline
Why should we use WEKA?
You can solve a machine learning
problem with a minimum of programming
WEKA includes
reading of data,
implementation of filtering,
result evaluation
Performance
Has not been evaluated in this project
Can it process large ARFF files (GB)?
An answer was found on the
wekalist mailing list:
Large files can be processed by schemes that are
either incrementally trainable or can be
made to be.
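The idea of an incrementally trainable scheme can be illustrated with a toy Python sketch: the model is updated one instance at a time, so the full data set never has to sit in memory. This is not a WEKA class and says nothing about which WEKA schemes are incremental. The class name IncrementalNaiveBayes, the Laplace smoothing, and the four-instance stream are assumptions made for illustration.

```python
from collections import defaultdict


class IncrementalNaiveBayes:
    """Toy Naive Bayes for nominal attributes, trained one instance at a time."""

    def __init__(self):
        self.class_counts = defaultdict(int)   # label -> count
        self.value_counts = defaultdict(int)   # (index, value, label) -> count
        self.values_seen = defaultdict(set)    # index -> distinct values

    def update(self, attributes, label):
        """Fold a single instance into the counts; nothing else is stored."""
        self.class_counts[label] += 1
        for i, value in enumerate(attributes):
            self.value_counts[(i, value, label)] += 1
            self.values_seen[i].add(value)

    def predict(self, attributes):
        """Return the label with the highest Laplace-smoothed score."""
        total = sum(self.class_counts.values())
        best_label, best_score = None, -1.0
        for label, n in self.class_counts.items():
            score = n / total
            for i, value in enumerate(attributes):
                seen = self.value_counts.get((i, value, label), 0)
                distinct = len(self.values_seen.get(i, ()))
                score *= (seen + 1) / (n + max(distinct, 1))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


model = IncrementalNaiveBayes()
# In practice the stream could be the rows of a large file read line by line.
stream = [
    (("sunny",), "no"),
    (("sunny",), "no"),
    (("overcast",), "yes"),
    (("overcast",), "yes"),
]
for attributes, label in stream:
    model.update(attributes, label)
print(model.predict(("sunny",)))
```

Memory use grows with the number of distinct attribute values and classes, not with the number of instances.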
Future Work
The following were not explored due to time constraints:
Simple CLI provides a simple command-line interface and allows direct execution of
Weka commands.
KnowledgeFlow is a Java-Beans-based
interface for setting up and running machine
learning experiments.