Unit 5: Open Source Data Mining Tools
WEKA (Waikato Environment for Knowledge Analysis) is a Java-
based open-source tool for machine learning, data mining, and
predictive modeling. Below is a breakdown of its core functions with
examples and use cases.
Enlist the features of the WEKA tool
Features of WEKA:
1. Data Preprocessing: WEKA supports various file formats such as
ARFF, CSV, JSON, and XLSX, as well as databases. This feature
mainly covers data cleaning tasks like handling missing values,
normalization, and discretization.
2. Machine Learning Algorithms: WEKA includes four main
categories of ML algorithms:
Classification: decision trees (J48), Naïve Bayes, SVM, Random Forest
Clustering: k-Means, DBSCAN, EM
Regression: Linear Regression, M5P
Association rules: Apriori, FP-Growth
3. Visualization Tools: WEKA lets you visualize your data through
scatter plots and histograms, and illustrate model evaluation
using ROC curves and confusion matrices, as well as tree/graph
visualizations of learned models.
4. Model Evaluation: WEKA measures model metrics such as
Accuracy, Precision, Recall, F1, and AUC, and provides
validation methods such as:
Train-test split
Cross-validation
Bootstrap
5. Integration & Extensibility: WEKA supports:
Java API: Embed WEKA in Java apps (see the sketch below).
Command Line: Run workflows via the terminal.
Python Integration: Use weka.core with Jython.
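To illustrate the Java API, here is a minimal sketch (not official WEKA tutorial code) that loads a dataset and trains a J48 decision tree; the file name iris.arff is a placeholder for your own dataset:

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WekaApiDemo {
    public static void main(String[] args) throws Exception {
        // DataSource auto-detects supported formats such as ARFF and CSV
        Instances data = DataSource.read("iris.arff"); // placeholder path
        // Assume the last attribute is the class attribute
        data.setClassIndex(data.numAttributes() - 1);

        // Train a J48 decision tree and print the learned model
        J48 tree = new J48();
        tree.buildClassifier(data);
        System.out.println(tree);
    }
}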
State and explain the advantages of the WEKA tool
1. User-Friendly GUI (No Coding Required)
Easy-to-use graphical interface for beginners and non-
programmers.
Drag-and-drop functionality in the Knowledge Flow interface.
No need for Python/R coding—ideal for quick prototyping.
Example:
Load a dataset (CSV/ARFF), select an algorithm (e.g., J48
Decision Tree), and visualize results—without writing a single
line of code.
2. Comprehensive Collection of ML Algorithms
Supports all major ML techniques:
Classification (J48, Naïve Bayes, SVM, Random Forest).
Clustering (k-Means, DBSCAN, EM).
Regression (Linear Regression, M5P).
Association Rule Mining (Apriori, FP-Growth).
Feature Selection (PCA, InfoGain).
Pre-built implementations—no need to write algorithms from
scratch.
Example:
Compare Random Forest vs SVM in Experimenter Mode with
statistical tests.
3. Built-in Data Preprocessing Tools
Handles missing values (remove/replace/impute).
Normalization & Standardization (min-max, z-score).
Discretization (convert numeric → categorical).
Filtering (remove outliers, noise, duplicates).
Example:
Normalize a dataset (Preprocess → Filters → Normalize) before
running k-Means clustering.
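As a rough Java sketch of that workflow (the file name is a placeholder), the Normalize filter is applied before SimpleKMeans, WEKA's core k-Means implementation:

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Normalize;

public class NormalizeThenCluster {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data.arff"); // placeholder path

        // Min-max scale all numeric attributes to [0, 1],
        // as Preprocess -> Filters -> Normalize does in the GUI
        Normalize normalize = new Normalize();
        normalize.setInputFormat(data);
        Instances scaled = Filter.useFilter(data, normalize);

        // Cluster the normalized data into 3 groups
        SimpleKMeans kMeans = new SimpleKMeans();
        kMeans.setNumClusters(3);
        kMeans.buildClusterer(scaled);
        System.out.println(kMeans);
    }
}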
4. Powerful Visualization Capabilities
Dataset Visualization: Scatter plots, histograms, box plots.
Model Evaluation: ROC curves, confusion matrices.
Decision Tree Visualization: View splits interactively.
Example:
Plot a ROC curve to compare classifiers (Visualize Threshold
Curve).
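The same threshold-curve data can also be computed from code; a minimal sketch, assuming a Naïve Bayes classifier and treating class value 0 as the positive class:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.evaluation.ThresholdCurve;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RocSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff"); // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        // Cross-validate and collect the predictions
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));

        // Build the threshold (ROC) curve for class value 0 and report AUC
        ThresholdCurve tc = new ThresholdCurve();
        Instances curve = tc.getCurve(eval.predictions(), 0);
        System.out.println("AUC = " + ThresholdCurve.getROCArea(curve));
    }
}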
5. Experiment Automation & Reproducibility
Experimenter Mode:
Compare multiple algorithms statistically.
Export results to CSV/ARFF for reporting.
Command-line & Java API for scripting and automation.
Example:
Run 10-fold cross-validation on 5 classifiers and analyze p-
values.
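The statistical tests themselves live in the Experimenter GUI, but the underlying cross-validation runs can be scripted; a sketch comparing three classifiers by accuracy (the dataset path is a placeholder):

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.SMO;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareClassifiers {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff"); // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] models = { new J48(), new NaiveBayes(), new SMO() };
        for (Classifier model : models) {
            // Fresh Evaluation per model; a fixed seed keeps folds comparable
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(model, data, 10, new Random(1));
            System.out.printf("%s: %.2f%% correct%n",
                    model.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}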
6. Open-Source & Cross-Platform
Free to use (GNU GPL license).
Runs on Windows, macOS, Linux.
Active community (forums, tutorials, plugins).
Example:
Extend WEKA with third-party
packages (e.g., DeepLearning4j for neural networks).
Clustering:
Example:
Another popular algorithm is DBSCAN (Density-Based Spatial
Clustering of Applications with Noise), which forms clusters based
on the density of data points. Unlike k-Means, DBSCAN does not
require specifying the number of clusters beforehand and can detect
outliers. For instance, in anomaly detection, DBSCAN can be used to
identify fraudulent transactions in banking data by clustering normal
transactions and flagging isolated points as suspicious.
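A rough sketch of DBSCAN from the Java API; note this assumes the optics_dbScan package is installed (in recent WEKA releases DBSCAN ships as a package rather than in the core jar), and the epsilon/min-points values are illustrative, not tuned:

import weka.clusterers.DBSCAN; // from the optics_dbScan package (assumption)
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class DbscanSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("transactions.arff"); // placeholder path

        DBSCAN dbscan = new DBSCAN();
        // -E = neighborhood radius (epsilon), -M = minimum points per cluster
        dbscan.setOptions(Utils.splitOptions("-E 0.5 -M 5"));
        dbscan.buildClusterer(data);
        System.out.println(dbscan); // isolated points are reported as noise
    }
}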
Classifiers:
Example:
Support Vector Machines (SVM) are another class of classifiers that
work by finding the optimal hyperplane separating different classes
in high-dimensional space. For instance, in image recognition, an
SVM can classify handwritten digits (0-9) by analyzing pixel
intensities and determining the best boundary between digit
categories.
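In WEKA, SVM training is exposed through the SMO classifier (sequential minimal optimization); a minimal sketch, where digits.arff is a hypothetical dataset of pixel-intensity features:

import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SvmSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("digits.arff"); // hypothetical dataset
        data.setClassIndex(data.numAttributes() - 1);

        // SMO trains a support vector classifier (pairwise for multiclass)
        SMO svm = new SMO();
        svm.buildClassifier(data);

        // Predict the class of the first instance
        double label = svm.classifyInstance(data.instance(0));
        System.out.println("Predicted class: "
                + data.classAttribute().value((int) label));
    }
}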
WEKA Installation Guide for Windows
1. Download WEKA
o Visit the official WEKA
website: https://www.waikato.ac.nz/ml/weka/
o Click on the Windows (64-bit) installer link to download
the .exe file.
2. Install WEKA
o Double-click the downloaded .exe file.
o Follow the setup wizard, accepting the license agreement.
o Choose the default installation location (C:\Program
Files\Weka).
o Select "Add WEKA to PATH" for command-line access
(optional).
o Click Install and wait for the process to complete.
3. Verify Java Installation
o Open Command Prompt and type java -version.
o If Java is not installed, download and install Java 8 or
later from Oracle’s website.
4. Launch WEKA
o Open WEKA from the Start Menu or desktop shortcut.
o Select the WEKA Explorer to begin using the tool.
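If "Add WEKA to PATH" was selected during installation, WEKA can also be launched from Command Prompt; a hedged example, assuming the default install location from step 2 (the folder name may include a version number, e.g. Weka-3-8):

java -jar "C:\Program Files\Weka\weka.jar"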
Feature Selection in WEKA Tool
Feature Selection (also called attribute selection) is the process of
identifying and selecting the most relevant features (attributes) from
a dataset to improve model performance, reduce overfitting, and
speed up training. WEKA provides several built-in methods for feature
selection, accessible through its GUI, command-line, and API.
1. Types of Feature Selection in WEKA
WEKA supports three main approaches to feature selection:
A) Filter Methods
Independent of any learning algorithm.
Uses statistical measures to rank features: the relevance of each
attribute to the target is checked, and only relevant attributes are
selected (e.g., a roll-number attribute has no bearing on whether a
student passes or fails, which is our target attribute, so it is dropped).
Examples in WEKA:
o InfoGain (Information Gain) – Measures reduction in
entropy.
o Chi-Squared (Chi²) – Tests independence between
features and class.
o Correlation-based Feature Selection (CFS) – Selects
features highly correlated with the class but uncorrelated
with each other.
B) Wrapper Methods
Uses a learning algorithm (e.g., Naïve Bayes, a decision tree) to
evaluate feature subsets: two or more attributes are wrapped
together, each subset is checked for accuracy and usefulness, and
after evaluating the candidate subsets the most relevant one is selected.
More computationally expensive than filter methods, but often more accurate.
Example in WEKA:
o WrapperSubsetEval – Uses a classifier (e.g., J48) to
assess feature subsets; subsets can be searched with strategies
such as recursive feature elimination or genetic algorithms.
C) Embedded Methods
Feature selection is built into the learning algorithm.
Example in WEKA:
o Decision Trees (J48, Random Forest) – Automatically
select important features during training.
2. How to Perform Feature Selection in WEKA (Step-by-Step)
Step 1: Open WEKA Explorer
Launch WEKA → Click "Explorer".
Step 2: Load Your Dataset
Go to "Preprocess" tab → "Open File" (e.g., iris.arff).
Step 3: Select a Feature Selection Method
Go to the "Select attributes" tab.
Choose an attribute evaluator and a search method (e.g.,
InfoGainAttributeEval with Ranker, or the default CfsSubsetEval
with BestFirst).
Step 4: Run Feature Selection
Click "Start" → WEKA displays the selected features.
Example output:
Selected attributes: 2,4 : petallength, class
Step 5: Apply Selected Features
Click "Apply" to keep only the selected features.
Now, train a model on the reduced dataset.
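The same selection can be scripted through the Java API; a minimal sketch of a filter-method run, ranking attributes by information gain (the file name is a placeholder):

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SelectFeatures {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff"); // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        // Filter method: InfoGain evaluator with a Ranker search
        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new InfoGainAttributeEval());
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(2); // keep the two best-ranked attributes
        selector.setSearch(ranker);

        selector.SelectAttributes(data); // note WEKA's capitalized method name
        System.out.println(selector.toResultsString());
    }
}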
File Formats:
ARFF (Attribute-Relation File Format):
An ARFF (Attribute-Relation File Format) file is an ASCII text file
that describes a list of instances sharing a set of attributes.
ARFF files have two distinct sections. The first section is the
Header information, which is followed by the Data information. The
Header of the ARFF file contains the name of the relation, a list of
the attributes (the columns in the data), and their types. An
example header on the standard IRIS dataset looks like this:
% 1. Title: Iris Plants Database
@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
The Data of the ARFF file looks like the following:
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
Lines that begin with a % are comments.
The @RELATION, @ATTRIBUTE and @DATA declarations are
case insensitive
The ARFF Header Section
The @relation Declaration
The relation name is defined as the first line in the ARFF file. The
format is:
@relation <relation-name>
where <relation-name> is a string. The string must be quoted if the
name includes spaces.
The @attribute Declarations
Attribute declarations take the form of an ordered sequence of
@attribute statements. Each attribute in the data set has its own
@attribute statement which uniquely defines the name of that
attribute and its data type. The order in which the attributes are
declared indicates the column position in the data section of the file.
For example, if an attribute is the third one declared, then Weka
expects that all of that attribute's values will be found in the third
comma-delimited column. The format for the @attribute statement is:
@attribute <attribute-name> <datatype>
where the <attribute-name> must start with an alphabetic character.
If spaces are to be included in the name then the entire name must
be quoted. The <datatype> can be any of the four types currently
(version 3.2.1) supported by
Weka:
numeric
<nominal-specification>
string
date [<date-format>]
where <nominal-specification> and <date-format> are defined
below.
The keywords numeric, string and date are case insensitive.
Nominal attributes
Nominal values are defined by providing an <nominal-specification>
listing the possible values: {<nominal-name1>, <nominal-name2>,
<nominal-name3>, ...} For example, the class value of the Iris dataset
can be defined as follows:
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
Values that contain spaces must be quoted.
String attributes
String attributes allow us to create attributes containing arbitrary
textual values. This is very useful in text-mining applications, as we
can create datasets with string attributes, then write Weka Filters to
manipulate strings (like StringToWordVectorFilter). String attributes
are declared as follows:
@ATTRIBUTE LCC string
ARFF Data Section
The ARFF Data section of the file contains the data declaration line
and the actual instance lines.
The @data Declaration
The @data declaration is a single line denoting the start of the data
segment in the file. The format is: @data
The instance data
Each instance is represented on a single line, with carriage returns
denoting the end of the instance. Attribute values for each instance
are delimited by commas. They must appear in the order that they
were declared in the header section (i.e. the data corresponding to
the nth @attribute declaration is always the nth field of the
instance). Missing values are represented by a single question mark,
as in:
@data
4.4,?,1.5,?,Iris-setosa
Values of string and nominal attributes are case sensitive, and any
that contain space must be quoted, as follows:
@relation LCCvsLCSH
@attribute LCC string
@attribute LCSH string
@data
AG5, 'Encyclopedias and dictionaries.;Twentieth century.'
AS262, 'Science -- Soviet Union -- History.'
AE5, 'Encyclopedias and dictionaries.'
AS281, 'Astronomy, Assyro-Babylonian.;Moon -- Phases.'
AS281, 'Astronomy, Assyro-Babylonian.;Moon -- Tables.'
Dates must be specified in the data section using the string
representation specified in the attribute declaration. For example:
@RELATION Timestamps
@ATTRIBUTE timestamp DATE "yyyy-MM-dd HH:mm:ss"
@DATA
"2001-04-03 12:12:12"
"2001-05-03 12:59:55"
Open Source Tools Used for Data Mining:
1. WEKA
2. RStudio
3. Rattle
4. Orange
5. KNIME
6. RapidMiner