
Unit 5: Open Source Data Mining Tools

WEKA (Waikato Environment for Knowledge Analysis) is a Java-based open-source tool for machine learning, data mining, and predictive modeling. Below is a breakdown of its core functions with examples and use cases.

Enlist the features of the WEKA tool

Features of WEKA:

1. Data Preprocessing: WEKA supports various file formats such as ARFF, CSV, JSON, XLSX, and databases. This feature mainly covers data cleaning tasks such as handling missing values, normalization, and discretization.
2. Machine Learning Algorithms: WEKA includes four main categories of ML algorithms:
 Classification: Decision trees (J48), Naïve Bayes, SVM, Random Forest
 Clustering: k-Means, DBSCAN, EM
 Regression: Linear Regression, M5P
 Association Rules: Apriori, FP-Growth
3. Visualization Tools: WEKA lets you visualize your data through scatter plots and histograms, and illustrate model evaluation using ROC curves and confusion matrices, including tree/graph visualization.
4. Model Evaluation: WEKA measures model metrics such as Accuracy, Precision, Recall, F1, and AUC, and provides validation methods such as:
 Train-test split
 Cross-validation
 Bootstrap
5. Integration & Extensibility: WEKA supports (see the sketch after this list):
 Java API: Embed WEKA in Java apps.
 Command Line: Run workflows via the terminal.
 Python Integration: Use weka.core with Jython.
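
As a hedged illustration of the Java API route, here is a minimal sketch that loads a dataset and trains a J48 tree (the file name iris.arff is an assumption for illustration):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.classifiers.trees.J48;

public class WekaDemo {
    public static void main(String[] args) throws Exception {
        // Load a dataset; "iris.arff" is an assumed example path
        Instances data = new DataSource("iris.arff").getDataSet();
        // Assume the last attribute is the class label
        data.setClassIndex(data.numAttributes() - 1);
        // Build a J48 decision tree on the full dataset
        J48 tree = new J48();
        tree.buildClassifier(data);
        System.out.println(tree);
    }
}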

State and explain the advantages of the WEKA tool


1. User-Friendly GUI (No Coding Required)

Easy-to-use graphical interface for beginners and non-programmers.
Drag-and-drop functionality in the Knowledge Flow interface.
No need for Python/R coding; ideal for quick prototyping.

Example:

 Load a dataset (CSV/ARFF), select an algorithm (e.g., J48 Decision Tree), and visualize results, all without writing a single line of code.
2. Comprehensive Collection of ML Algorithms

Supports all major ML techniques:

 Classification (J48, Naïve Bayes, SVM, Random Forest).

 Clustering (k-Means, DBSCAN, EM).

 Regression (Linear Regression, M5P).

 Association Rule Mining (Apriori, FP-Growth).

 Feature Selection (PCA, InfoGain).

Pre-built implementations; no need to write algorithms from scratch.

Example:

 Compare Random Forest vs. SVM in Experimenter mode with statistical tests.

3. Built-in Data Preprocessing Tools

Handles missing values (remove/replace/impute).
Normalization & standardization (min-max, z-score).
Discretization (convert numeric → categorical).
Filtering (remove outliers, noise, duplicates).

Example:

 Normalize a dataset (Preprocess → Filters → Normalize) before running k-Means clustering.
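
The same step can be scripted; a minimal sketch using WEKA's Normalize filter (the file name iris.arff is an assumed input):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Normalize;

public class NormalizeDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("iris.arff").getDataSet();
        // Rescale every numeric attribute to the [0, 1] range
        Normalize norm = new Normalize();
        norm.setInputFormat(data);   // let the filter learn attribute ranges
        Instances normalized = Filter.useFilter(data, norm);
        System.out.println(normalized.toSummaryString());
    }
}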

4. Powerful Visualization Capabilities


Dataset Visualization: Scatter plots, histograms, box plots.
Model Evaluation: ROC curves, confusion matrices.
Decision Tree Visualization: View splits interactively.

Example:

 Plot a ROC curve to compare classifiers (Visualize Threshold Curve).

5. Experiment Automation & Reproducibility

Experimenter Mode:

 Compare multiple algorithms statistically.

 Export results to CSV/ARFF for reporting.


Command-line & Java API for scripting and automation.

Example:

 Run 10-fold cross-validation on 5 classifiers and analyze p-values.
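
A minimal sketch of scripted cross-validation through the Java API (one classifier shown; iris.arff is an assumed input):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("iris.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);
        // 10-fold cross-validation with a fixed seed for reproducibility
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}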

6. Open-Source & Cross-Platform

Free to use (GNU GPL license).


Runs on Windows, macOS, Linux.
Active community (forums, tutorials, plugins).

Example:

 Extend WEKA with third-party packages (e.g., DeepLearning4j for neural networks).

Clustering:

Example :

Another popular algorithm is DBSCAN (Density-Based Spatial Clustering of Applications with Noise), which forms clusters based on the density of data points. Unlike k-Means, DBSCAN does not require specifying the number of clusters beforehand and can detect outliers. For instance, in anomaly detection, DBSCAN can be used to identify fraudulent transactions in banking data by clustering normal transactions and flagging isolated points as suspicious.
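
For contrast, here is a minimal k-Means sketch via the Java API (DBSCAN itself is available through WEKA's package manager rather than the core distribution; iris.arff and k = 3 are assumed for illustration):

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class ClusterDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("iris.arff").getDataSet();
        // Clusterers cannot handle a class attribute, so drop the last column
        Remove remove = new Remove();
        remove.setAttributeIndices("last");
        remove.setInputFormat(data);
        Instances features = Filter.useFilter(data, remove);

        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(3);   // unlike DBSCAN, k must be chosen up front
        km.buildClusterer(features);
        System.out.println(km);
    }
}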

Classifiers:

Example :

Support Vector Machines (SVM) are another class of classifiers that work by finding the optimal hyperplane separating different classes in high-dimensional space. For instance, in image recognition, an SVM can classify handwritten digits (0-9) by analyzing pixel intensities and determining the best boundary between digit categories.
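
A minimal sketch using WEKA's SVM implementation, the SMO classifier (iris.arff is an assumed input):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SvmDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("iris.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);
        // SMO trains an SVM via sequential minimal optimization
        SMO svm = new SMO();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(svm, data, 10, new Random(1));
        System.out.printf("Accuracy: %.2f%%%n", eval.pctCorrect());
    }
}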

WEKA Installation Guide for Windows


1. Download WEKA

o Visit the official WEKA website: https://www.waikato.ac.nz/ml/weka/

o Click on the Windows (64-bit) installer link to download the .exe file.

2. Install WEKA

o Double-click the downloaded .exe file.

o Follow the setup wizard, accepting the license agreement.


o Choose the default installation location (C:\Program Files\Weka).

o Select "Add WEKA to PATH" for command-line access (optional).

o Click Install and wait for the process to complete.

3. Verify Java Installation

o Open Command Prompt and type java -version.

o If Java is not installed, download and install Java 8 or later from Oracle's website.

4. Launch WEKA

o Open WEKA from the Start Menu or desktop shortcut.

o Select the WEKA Explorer to begin using the tool.

Feature Selection in WEKA Tool


Feature Selection (also called attribute selection) is the process of
identifying and selecting the most relevant features (attributes) from
a dataset to improve model performance, reduce overfitting, and
speed up training. WEKA provides several built-in methods for feature
selection, accessible through its GUI, command-line, and API.

1. Types of Feature Selection in WEKA

WEKA supports three main approaches to feature selection:

A) Filter Methods

 Independent of any learning algorithm.

 Use statistical measures to rank features: the relevance of each attribute is checked, and only relevant attributes are selected (e.g., a roll-number attribute is unrelated to whether a student passes or fails, our target attribute, so it would be dropped).

 Examples in WEKA:

o InfoGain (Information Gain) – Measures reduction in entropy.

o Chi-Squared (Chi²) – Tests independence between features and class.

o Correlation-based Feature Selection (CFS) – Selects features highly correlated with the class but uncorrelated with each other.
B) Wrapper Methods

 Use a learning algorithm (e.g., Naïve Bayes, Decision Tree) to evaluate feature subsets: the method wraps two or more attributes together, checks the resulting accuracy, and after evaluating the candidate subsets selects the most relevant one. In other words, wrappers check the usefulness of attributes, not just their statistical relevance.

 More computationally expensive but often more accurate.

 Examples in WEKA:

o WrapperSubsetEval – Uses a classifier (e.g., J48) to assess feature subsets.

o Common search strategies include recursive feature elimination and genetic algorithms.
C) Embedded Methods

 Feature selection is built into the learning algorithm.

 Example in WEKA:

o Decision Trees (J48, Random Forest) – Automatically select important features during training.

2. How to Perform Feature Selection in WEKA (Step-by-Step)

Step 1: Open WEKA Explorer

 Launch WEKA → Click "Explorer".

Step 2: Load Your Dataset

 Go to "Preprocess" tab → "Open File" (e.g., iris.arff).


Step 3: Select Feature Selection Method

 Go to "Select Attributes" tab.

 Choose an evaluator and a search method (e.g., the InfoGainAttributeEval evaluator with the Ranker search).

Step 4: Run Feature Selection

 Click "Start" → WEKA displays the selected features.

 Example output:


Selected attributes: 2,4 : petallength, class

Step 5: Apply Selected Features

 Click "Apply" to keep only the selected features.

 Now, train a model on the reduced dataset.
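
The same workflow can be scripted; a minimal sketch using WEKA's AttributeSelection API (the evaluator, search method, and iris.arff input are illustrative choices):

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class FeatureSelectDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("iris.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new InfoGainAttributeEval()); // rank by information gain
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(2);                           // keep the top 2 attributes
        selector.setSearch(ranker);
        selector.SelectAttributes(data);                    // WEKA's API capitalizes this method
        Instances reduced = selector.reduceDimensionality(data);
        System.out.println(reduced.toSummaryString());
    }
}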

File Formats

ARFF (Attribute-Relation File Format):

An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a list of instances sharing a set of attributes.

ARFF files have two distinct sections. The first section is the Header information, which is followed by the Data information. The Header of the ARFF file contains the name of the relation, a list of the attributes (the columns in the data), and their types; the Data section then lists the instances. An example header on the standard IRIS dataset, together with the start of its Data section, looks like the following.
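
A minimal sketch, reproducing the canonical Iris header and the first few data rows distributed with WEKA:

% 1. Title: Iris Plants Database
@RELATION iris

@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa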

 Lines that begin with a % are comments.

 The @RELATION, @ATTRIBUTE and @DATA declarations are case insensitive.

The ARFF Header Section

The @relation Declaration

The relation name is defined as the first line in the ARFF file. The
format is:

@relation <relation-name>

where <relation-name> is a string. The string must be quoted if the name includes spaces.

The @attribute Declarations

Attribute declarations take the form of an ordered sequence of @attribute statements. Each attribute in the data set has its own @attribute statement which uniquely defines the name of that attribute and its data type. The order in which the attributes are declared indicates the column position in the data section of the file. For example, if an attribute is the third one declared, then Weka expects that all of that attribute's values will be found in the third comma-delimited column. The format for the @attribute statement is:

@attribute <attribute-name> <datatype>

where the <attribute-name> must start with an alphabetic character. If spaces are to be included in the name then the entire name must be quoted. The <datatype> can be any of the four types currently (version 3.2.1) supported by Weka:

 numeric
 <nominal-specification>
 string
 date [<date-format>]

where <nominal-specification> and <date-format> are defined below.

The keywords numeric, string and date are case insensitive.

Nominal attributes

Nominal values are defined by providing a <nominal-specification> listing the possible values: {<nominal-name1>, <nominal-name2>, <nominal-name3>, ...}. For example, the class value of the Iris dataset can be defined as follows:

@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

Values that contain spaces must be quoted.

String attributes

String attributes allow us to create attributes containing arbitrary textual values. This is very useful in text-mining applications, as we can create datasets with string attributes, then write Weka Filters to manipulate strings (like StringToWordVectorFilter). String attributes are declared as follows:

@ATTRIBUTE LCC string

ARFF Data Section

The ARFF Data section of the file contains the data declaration line
and the actual instance lines.

The @data Declaration

The @data declaration is a single line denoting the start of the data
segment in the file. The format is: @data

The instance data

Each instance is represented on a single line, with carriage returns denoting the end of the instance. Attribute values for each instance are delimited by commas. They must appear in the order that they were declared in the header section (i.e. the data corresponding to the nth @attribute declaration is always the nth field of the instance). Missing values are represented by a single question mark, as in:

@data

4.4,?,1.5,?,Iris-setosa
Values of string and nominal attributes are case sensitive, and any that contain spaces must be quoted, as follows:

@relation LCCvsLCSH

@attribute LCC string

@attribute LCSH string

@data

AG5, 'Encyclopedias and dictionaries.;Twentieth century.'

AS262, 'Science -- Soviet Union -- History.'

AE5, 'Encyclopedias and dictionaries.'

AS281, 'Astronomy, Assyro-Babylonian.;Moon -- Phases.'

AS281, 'Astronomy, Assyro-Babylonian.;Moon -- Tables.'

Dates must be specified in the data section using the string representation specified in the attribute declaration. For example:

@RELATION Timestamps

@ATTRIBUTE timestamp DATE "yyyy-MM-dd HH:mm:ss"

@DATA

"2001-04-03 12:12:12"

"2001-05-03 12:59:55"

Open-source tools used for data mining:

1. WEKA
2. RStudio
3. Rattle
4. Orange
5. KNIME
6. RapidMiner
