Unit 5: Open Source Data Mining Tools
WEKA (Waikato Environment for Knowledge Analysis) is a Java-
based open-source tool for machine learning, data mining, and
predictive modeling. Below is a breakdown of its core functions with
examples and use cases.
Enlist the features of the WEKA tool
Features of WEKA:
1. Data Preprocessing: WEKA supports various file formats such as
ARFF, CSV, JSON, and XLSX, as well as databases. This feature
mainly covers data cleaning tasks like handling missing values,
normalization, and discretization.
2. Machine Learning Algorithms: WEKA includes four main
categories of ML algorithms:
Classification: decision trees (J48), Naïve Bayes, SVM, Random Forest
Clustering: k-Means, DBSCAN, EM
Regression: Linear Regression, M5P
Association rules: Apriori, FP-Growth
3. Visualization Tools: WEKA lets you visualize your data through
scatter plots and histograms, and illustrate model evaluation
using ROC curves and confusion matrices, as well as tree/graph
visualizations of learned models.
4. Model Evaluation: WEKA measures model metrics such as
Accuracy, Precision, Recall, F1, and AUC, and provides
validation methods such as:
Train-test split
Cross-validation
Bootstrap
5. Integration & Extensibility: WEKA supports:
Java API: Embed WEKA in Java apps (see the sketch below).
Command Line: Run workflows via the terminal.
Python Integration: Use weka.core with Jython.
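To illustrate the Java API, here is a minimal sketch (not official WEKA tutorial code) that loads a dataset and trains a J48 decision tree; the file name iris.arff is a placeholder for your own dataset:

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WekaApiDemo {
    public static void main(String[] args) throws Exception {
        // DataSource auto-detects supported formats such as ARFF and CSV
        Instances data = DataSource.read("iris.arff"); // placeholder path
        // Assume the last attribute is the class attribute
        data.setClassIndex(data.numAttributes() - 1);

        // Train a J48 decision tree and print the learned model
        J48 tree = new J48();
        tree.buildClassifier(data);
        System.out.println(tree);
    }
}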
State and explain the advantages of the WEKA tool
1. User-Friendly GUI (No Coding Required)
Easy-to-use graphical interface for beginners and non-
programmers.
Drag-and-drop functionality in the Knowledge Flow interface.
No need for Python/R coding—ideal for quick prototyping.
Example:
Load a dataset (CSV/ARFF), select an algorithm (e.g., J48
Decision Tree), and visualize results—without writing a single
line of code.
2. Comprehensive Collection of ML Algorithms
Supports all major ML techniques:
Classification (J48, Naïve Bayes, SVM, Random Forest).
Clustering (k-Means, DBSCAN, EM).
Regression (Linear Regression, M5P).
Association Rule Mining (Apriori, FP-Growth).
Feature Selection (PCA, InfoGain).
Pre-built implementations—no need to write algorithms from
scratch.
Example:
Compare Random Forest vs SVM in Experimenter Mode with
statistical tests.
3. Built-in Data Preprocessing Tools
Handles missing values (remove/replace/impute).
Normalization & Standardization (min-max, z-score).
Discretization (convert numeric → categorical).
Filtering (remove outliers, noise, duplicates).
Example:
Normalize a dataset (Preprocess → Filters → Normalize) before
running k-Means clustering.
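As a rough Java sketch of that workflow (the file name is a placeholder), the Normalize filter is applied before SimpleKMeans, WEKA's core k-Means implementation:

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Normalize;

public class NormalizeThenCluster {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data.arff"); // placeholder path

        // Min-max scale all numeric attributes to [0, 1],
        // as Preprocess -> Filters -> Normalize does in the GUI
        Normalize normalize = new Normalize();
        normalize.setInputFormat(data);
        Instances scaled = Filter.useFilter(data, normalize);

        // Cluster the normalized data into 3 groups
        SimpleKMeans kMeans = new SimpleKMeans();
        kMeans.setNumClusters(3);
        kMeans.buildClusterer(scaled);
        System.out.println(kMeans);
    }
}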
4. Powerful Visualization Capabilities
Dataset Visualization: Scatter plots, histograms, box plots.
Model Evaluation: ROC curves, confusion matrices.
Decision Tree Visualization: View splits interactively.
Example:
Plot a ROC curve to compare classifiers (Visualize Threshold
Curve).
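The same threshold-curve data can also be computed from code; a minimal sketch, assuming a Naïve Bayes classifier and treating class value 0 as the positive class:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.evaluation.ThresholdCurve;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RocSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff"); // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        // Cross-validate and collect the predictions
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));

        // Build the threshold (ROC) curve for class value 0 and report AUC
        ThresholdCurve tc = new ThresholdCurve();
        Instances curve = tc.getCurve(eval.predictions(), 0);
        System.out.println("AUC = " + ThresholdCurve.getROCArea(curve));
    }
}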
5. Experiment Automation & Reproducibility
Experimenter Mode:
Compare multiple algorithms statistically.
Export results to CSV/ARFF for reporting.
Command-line & Java API for scripting and automation.
Example:
Run 10-fold cross-validation on 5 classifiers and analyze p-
values.
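The statistical tests themselves live in the Experimenter GUI, but the underlying cross-validation runs can be scripted; a sketch comparing three classifiers by accuracy (the dataset path is a placeholder):

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.SMO;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareClassifiers {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff"); // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] models = { new J48(), new NaiveBayes(), new SMO() };
        for (Classifier model : models) {
            // Fresh Evaluation per model; a fixed seed keeps folds comparable
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(model, data, 10, new Random(1));
            System.out.printf("%s: %.2f%% correct%n",
                    model.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}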
6. Open-Source & Cross-Platform
Free to use (GNU GPL license).
Runs on Windows, macOS, Linux.
Active community (forums, tutorials, plugins).
Example:
Extend WEKA with third-party
packages (e.g., DeepLearning4j for neural networks).
Clustering:
Example:
Another popular algorithm is DBSCAN (Density-Based Spatial
Clustering of Applications with Noise), which forms clusters based
on the density of data points. Unlike k-Means, DBSCAN does not
require specifying the number of clusters beforehand and can detect
outliers. For instance, in anomaly detection, DBSCAN can be used to
identify fraudulent transactions in banking data by clustering normal
transactions and flagging isolated points as suspicious.
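A rough sketch of DBSCAN from the Java API; note this assumes the optics_dbScan package is installed (in recent WEKA releases DBSCAN ships as a package rather than in the core jar), and the epsilon/min-points values are illustrative, not tuned:

import weka.clusterers.DBSCAN; // from the optics_dbScan package (assumption)
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class DbscanSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("transactions.arff"); // placeholder path

        DBSCAN dbscan = new DBSCAN();
        // -E = neighborhood radius (epsilon), -M = minimum points per cluster
        dbscan.setOptions(Utils.splitOptions("-E 0.5 -M 5"));
        dbscan.buildClusterer(data);
        System.out.println(dbscan); // isolated points are reported as noise
    }
}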
Classifiers:
Example:
Support Vector Machines (SVM) are another class of classifiers that
work by finding the optimal hyperplane separating different classes
in high-dimensional space. For instance, in image recognition, an
SVM can classify handwritten digits (0-9) by analyzing pixel
intensities and determining the best boundary between digit
categories.
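In WEKA, SVM training is exposed through the SMO classifier (sequential minimal optimization); a minimal sketch, where digits.arff is a hypothetical dataset of pixel-intensity features:

import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SvmSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("digits.arff"); // hypothetical dataset
        data.setClassIndex(data.numAttributes() - 1);

        // SMO trains a support vector classifier (pairwise for multiclass)
        SMO svm = new SMO();
        svm.buildClassifier(data);

        // Predict the class of the first instance
        double label = svm.classifyInstance(data.instance(0));
        System.out.println("Predicted class: "
                + data.classAttribute().value((int) label));
    }
}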
WEKA Installation Guide for Windows
1. Download WEKA
o Visit the official WEKA
website: https://www.waikato.ac.nz/ml/weka/
o Click on the Windows (64-bit) installer link to download
the .exe file.
2. Install WEKA
o Double-click the downloaded .exe file.
o Follow the setup wizard, accepting the license agreement.
o Choose the default installation location (C:\Program
Files\Weka).
o Select "Add WEKA to PATH" for command-line access
(optional).
o Click Install and wait for the process to complete.
3. Verify Java Installation
o Open Command Prompt and type java -version.
o If Java is not installed, download and install Java 8 or
later from Oracle’s website.
4. Launch WEKA
o Open WEKA from the Start Menu or desktop shortcut.
o Select the WEKA Explorer to begin using the tool.
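If "Add WEKA to PATH" was selected during installation, WEKA can also be launched from Command Prompt; a hedged example, assuming the default install location from step 2 (the folder name may include a version number, e.g. Weka-3-8):

java -jar "C:\Program Files\Weka\weka.jar"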
Feature Selection in WEKA Tool
Feature Selection (also called attribute selection) is the process of
identifying and selecting the most relevant features (attributes) from
a dataset to improve model performance, reduce overfitting, and
speed up training. WEKA provides several built-in methods for feature
selection, accessible through its GUI, command-line, and API.
1. Types of Feature Selection in WEKA
WEKA supports three main approaches to feature selection:
A) Filter Methods
Independent of any learning algorithm.
Uses statistical measures to rank features: the relevance of each
attribute to the target is checked, and only relevant attributes are
selected (e.g., a roll-number attribute has no bearing on whether a
student passes or fails, which is our target attribute, so it is dropped).
Examples in WEKA:
o InfoGain (Information Gain) – Measures reduction in
entropy.
o Chi-Squared (Chi²) – Tests independence between
features and class.
o Correlation-based Feature Selection (CFS) – Selects
features highly correlated with the class but uncorrelated
with each other.
B) Wrapper Methods
Uses a learning algorithm (e.g., Naïve Bayes, a decision tree) to
evaluate feature subsets: two or more attributes are wrapped
together, each subset is checked for accuracy and usefulness, and
after evaluating the candidate subsets the most relevant one is selected.
More computationally expensive than filter methods, but often more accurate.
Example in WEKA:
o WrapperSubsetEval – Uses a classifier (e.g., J48) to
assess feature subsets; subsets can be searched with strategies
such as recursive feature elimination or genetic algorithms.
C) Embedded Methods
Feature selection is built into the learning algorithm.
Example in WEKA:
o Decision Trees (J48, Random Forest) – Automatically
select important features during training.
2. How to Perform Feature Selection in WEKA (Step-by-Step)
Step 1: Open WEKA Explorer
Launch WEKA → Click "Explorer".
Step 2: Load Your Dataset
Go to "Preprocess" tab → "Open File" (e.g., iris.arff).
Step 3: Select a Feature Selection Method
Go to the "Select attributes" tab.
Choose an attribute evaluator and a search method (e.g.,
InfoGainAttributeEval with Ranker, or the default CfsSubsetEval
with BestFirst).
Step 4: Run Feature Selection
Click "Start" → WEKA displays the selected features.
Example output:
Selected attributes: 2,4 : petallength, class
Step 5: Apply Selected Features
Click "Apply" to keep only the selected features.
Now, train a model on the reduced dataset.
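The same selection can be scripted through the Java API; a minimal sketch of a filter-method run, ranking attributes by information gain (the file name is a placeholder):

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SelectFeatures {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff"); // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        // Filter method: InfoGain evaluator with a Ranker search
        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new InfoGainAttributeEval());
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(2); // keep the two best-ranked attributes
        selector.setSearch(ranker);

        selector.SelectAttributes(data); // note WEKA's capitalized method name
        System.out.println(selector.toResultsString());
    }
}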
File Formats:
ARFF (Attribute-Relation File Format):
An ARFF (Attribute-Relation File Format) file is an ASCII text file
that describes a list of instances sharing a set of attributes.
ARFF files have two distinct sections. The first section is the
Header information, which is followed by the Data information. The
Header of the ARFF file contains the name of the relation, a list of
the attributes (the columns in the data), and their types. An
example header on the standard IRIS dataset looks like this:
% 1. Title: Iris Plants Database
@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
The Data of the ARFF file looks like the following:
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
Lines that begin with a % are comments.
The @RELATION, @ATTRIBUTE and @DATA declarations are
case insensitive
The ARFF Header Section
The @relation Declaration
The relation name is defined as the first line in the ARFF file. The
format is:
@relation <relation-name>
where <relation-name> is a string. The string must be quoted if the
name includes spaces.
The @attribute Declarations
Attribute declarations take the form of an ordered sequence of
@attribute statements. Each attribute in the data set has its own
@attribute statement which uniquely defines the name of that
attribute and its data type. The order in which the attributes are
declared indicates the column position in the data section of the file.
For example, if an attribute is the third one declared, then Weka
expects that all of that attribute's values will be found in the third
comma-delimited column. The format for the @attribute statement is:
@attribute <attribute-name> <datatype>
where the <attribute-name> must start with an alphabetic character.
If spaces are to be included in the name then the entire name must
be quoted. The <datatype> can be any of the four types currently
(version 3.2.1) supported by
Weka:
numeric
<nominal-specification>
string
date [<date-format>]
where <nominal-specification> and <date-format> are defined
below.
The keywords numeric, string and date are case insensitive.
Nominal attributes
Nominal values are defined by providing an <nominal-specification>
listing the possible values: {<nominal-name1>, <nominal-name2>,
<nominal-name3>, ...} For example, the class value of the Iris dataset
can be defined as follows:
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
Values that contain spaces must be quoted.
String attributes
String attributes allow us to create attributes containing arbitrary
textual values. This is very useful in text-mining applications, as we
can create datasets with string attributes, then write Weka Filters to
manipulate strings (like StringToWordVectorFilter). String attributes
are declared as follows:
@ATTRIBUTE LCC string
ARFF Data Section
The ARFF Data section of the file contains the data declaration line
and the actual instance lines.
The @data Declaration
The @data declaration is a single line denoting the start of the data
segment in the file. The format is: @data
The instance data
Each instance is represented on a single line, with carriage returns
denoting the end of the instance. Attribute values for each instance
are delimited by commas. They must appear in the order that they
were declared in the header section (i.e. the data corresponding to
the nth @attribute declaration is always the nth field of the
instance). Missing values are represented by a single question mark,
as in:
@data
4.4,?,1.5,?,Iris-setosa
Values of string and nominal attributes are case sensitive, and any
that contain space must be quoted, as follows:
@relation LCCvsLCSH
@attribute LCC string
@attribute LCSH string
@data
AG5, 'Encyclopedias and dictionaries.;Twentieth century.'
AS262, 'Science -- Soviet Union -- History.'
AE5, 'Encyclopedias and dictionaries.'
AS281, 'Astronomy, Assyro-Babylonian.;Moon -- Phases.'
AS281, 'Astronomy, Assyro-Babylonian.;Moon -- Tables.'
Dates must be specified in the data section using the string
representation specified in the attribute declaration. For example:
@RELATION Timestamps
@ATTRIBUTE timestamp DATE "yyyy-MM-dd HH:mm:ss"
@DATA
"2001-04-03 12:12:12"
"2001-05-03 12:59:55"
Open Source Tools Used for Data Mining:
1. WEKA
2. RStudio
3. Rattle
4. Orange
5. KNIME
6. RapidMiner