Data Analytics Using WEKA

Uploaded by

Joyce Wm Wong
NTW: Data analytics for Layman using WEKA

Introduction

“... we are actually living in the data age” (Han et al., 2012)

Data
• Data are any facts, numbers, or text that can be processed by a computer. This includes:
  – Operational or transactional data: sales, cost, inventory, payroll, accounting, etc.
  – Non-operational data: industry sales, forecast data, remote sensors on a satellite, microarrays generating gene expression data, microeconomic data, etc.
  – Metadata: data about the data itself (“data about data”), such as logical database design or data dictionary definitions.

Metadata

Data
• Modern ICT makes it very easy to generate large volumes of data, and because storage is cheap, there is a tendency to keep that data regardless of whether it serves a purpose.
• Every organization benefits from collecting and analysing its data.
• Analysing data is crucial for knowledge-driven decision making.
• The problem then becomes how to analyse these data.

Data Analysis
• Data analysis (also called analysis of data or data analytics) is a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision-making.
• Data analysis methods include simple query and reporting, statistical analysis, more complex multidimensional analysis, and data mining.

Definition of Data Mining
• Data mining is the process of finding interesting structure in data (Roiger, 2017).
• Data mining is the process of automatically discovering useful information in large data repositories (Tan et al., 2006).
• Data mining is the application of specific algorithms for extracting patterns from data (Fayyad, 1996).

“Data mining is a process of discovering various models, summaries and derived values from a given collection of data” (Kantardzic, 2020).

Data Mining Process
[Figure: the data mining process (Kantardzic, 2020)]
[Figure: Knowledge Discovery in Databases (KDD)]

Data Mining Techniques
Variable type | Unsupervised learning: independent variable (x) | Supervised learning: independent variable (x) | Supervised learning: dependent variable (y)
Numerical | Clustering (K-means) | Simple linear regression; multiple regression; logistic regression; decision trees | Simple linear regression; multiple regression
Categorical | Association rules (Apriori, FP-growth) | Logistic regression; decision trees | Logistic regression; decision trees

Data Mining Tools

Data Mining Applications
Area | Application
Finance/Banking | Credit card analysis, loyal customers
Insurance | Claims, fraud, churn analysis
Telecommunication | Call record analysis
Transport | Logistics management
Consumer goods | Promotion analysis
Scientific research | Image, video, speech pattern analysis
Utilities | Power usage analysis
Law enforcement | Crime analysis

NTW: Basic Data analytics using WEKA
Data Pre-Processing

Data, Objects and Attributes
• Data is a collection of objects and their attributes.
• An attribute is a property or characteristic of an object.
  – Examples: eye color of a person, temperature, etc.
• An attribute is also known as a variable, field, characteristic, or feature.
• A collection of attributes describes an object.
• An object is also known as a record, point, case, sample, entity, or instance.

In the table below, each column is an attribute and each row is an object:
ID | Age | Height (m) | Height (ft) | Books Category
101 | 20 | 1.75 | 5.74 | SDA321
102 | 18 | 1.80 | 5.91 | ENG364
103 | 21 | 1.53 | 5.02 | IT543
104 | 55 | 1.65 | 5.41 | SDQ735
105 | 20 | 1.58 | 5.18 | IT954
106 | 21 | 1.63 | 5.35 | ENG735
107 | 19 | 1.72 | 5.64 | SDM628

Types of Attribute
• Categorical (qualitative types): Nominal, Ordinal
• Numerical (quantitative types): Interval, Ratio

Properties of Attribute Values
• The type of an attribute depends on which of the following properties it possesses:
  ✓ Distinctness: = ≠
  ✓ Order: < >
  ✓ Addition/subtraction: + -
  ✓ Multiplication/division: * /
❖ Nominal attribute: distinctness
❖ Ordinal attribute: distinctness and order
❖ Interval attribute: distinctness, order, and addition/subtraction
❖ Ratio attribute: all four properties
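
As a rough illustration of the list above, the following Python sketch records which properties are meaningful for each attribute type. The dictionary `PROPERTIES` and the function `supports` are hypothetical names chosen only for this example.

```python
# Properties that are meaningful for each attribute type.
# The names used here are illustrative, not part of any library.
PROPERTIES = {
    "nominal":  {"distinctness"},
    "ordinal":  {"distinctness", "order"},
    "interval": {"distinctness", "order", "addition"},
    "ratio":    {"distinctness", "order", "addition", "multiplication"},
}

def supports(attribute_type, prop):
    """Return True if the property is meaningful for the attribute type."""
    return prop in PROPERTIES[attribute_type]

# Eye color is nominal: values can be compared for equality but not ranked.
print(supports("nominal", "order"))
# Height in metres is ratio: "twice as tall" is a meaningful statement.
print(supports("ratio", "multiplication"))
```

Each type inherits the properties of the one before it, which is why the sets grow from nominal to ratio.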
Why pre-process the data?
• Some data pre-processing is needed for all mining tools.
• The purpose of pre-processing is to transform data sets so that their information content is best exposed to the mining tools.
• Pre-processing data also prepares the miner, so that the miner produces better models when working with the prepared data.
Major tasks in data pre-processing
1. Data Cleaning
2. Data Integration
3. Data Transformation
4. Data Reduction

1. Data Cleaning
• Data in the real world is dirty: it contains lots of potentially incorrect data caused by, e.g., faulty instruments, human or computer error, transmission error, etc.
  – Incomplete (missing values): lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    ❖ e.g., Occupation = “ ” (missing values)
  – Noisy: containing noise, errors, or outliers
    ❖ e.g., Salary = “−10” (an error)
  – Inconsistent: containing discrepancies in codes or names, e.g.,
    ❖ Age = “42”, Birthday = “03/07/2010”
    ❖ Was rating “1, 2, 3”, now rating “A, B, C”
    ❖ discrepancy between duplicate records
  – Intentional (e.g., disguised missing data)
    ❖ Jan. 1 as everyone’s birthday?
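
The three kinds of dirty data listed above could be detected with simple checks. The Python sketch below is a minimal illustration only: the function name `find_problems`, the field `BirthYear` (a simplification of the Birthday field) and the reference year passed by the caller are all assumptions made for this example.

```python
def find_problems(record, current_year):
    """Return a list of data-quality problems found in one record."""
    problems = []
    # Incomplete: the value is empty or contains only whitespace.
    if not str(record.get("Occupation", "")).strip():
        problems.append("missing Occupation")
    # Noisy: a salary cannot be negative.
    if record["Salary"] < 0:
        problems.append("negative Salary")
    # Inconsistent: the stated age disagrees with the birth year
    # (a one-year tolerance allows for birthdays later in the year).
    expected_age = current_year - record["BirthYear"]
    if abs(record["Age"] - expected_age) > 1:
        problems.append("Age does not match BirthYear")
    return problems

# Hypothetical record that combines the examples from the slide.
record = {"Occupation": " ", "Salary": -10, "Age": 42, "BirthYear": 2010}
print(find_problems(record, current_year=2024))
```

Disguised missing data, such as Jan. 1 used as a default birthday, is harder to catch with a per-record rule; it usually shows up only when the distribution of a whole column is inspected.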
NTW: Data analytics for Layman using WEKA
Data Pre-Processing (HANDS-ON)

Preparing .csv file
• Open an Excel file (Data2.xlsx).
• Save it as Data2.csv.
• Close the Data2.csv file.

Data Pre-Processing in WEKA
• Open the Data2.csv file in WEKA.
• Check the attribute names.
• All attribute types were set as numeric.
• Is this correct?

Click Edit to view the whole dataset.

How can the NoMatrik attribute be changed to nominal type?

Choose > Filter > Unsupervised > Attribute > NumericToNominal

Right-click the filter and change the attribute range from first-last to first, so that only the first attribute is converted.
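
To make the effect of this filter concrete, here is a small Python sketch of the idea: every distinct number in the chosen column becomes a category label, with no binning. It is not Weka's implementation; the function name `numeric_to_nominal` and the sample rows are hypothetical.

```python
def numeric_to_nominal(rows, column):
    """Return (labels, converted_rows) with one column turned into labels."""
    # The distinct numeric values become the list of nominal labels.
    labels = [str(value) for value in sorted({row[column] for row in rows})]
    # Each row keeps its other values; the chosen column becomes a string label.
    converted = [
        row[:column] + [str(row[column])] + row[column + 1:]
        for row in rows
    ]
    return labels, converted

# Hypothetical rows: [NoMatrik, mark]. NoMatrik is an identifier, so
# arithmetic on it is meaningless and it is better treated as nominal.
rows = [[101, 85], [102, 62], [101, 47]]
labels, converted = numeric_to_nominal(rows, 0)
print(labels)        # ['101', '102']
print(converted[0])  # ['101', 85]
```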
Choose > Add > right-click, then:
• Change the attribute index to last.
• Change the attribute name to Grade and the attribute type to nominal.
• Type the labels A+, A, B+, B, C, D, E, F.

Set the Grade attribute to A+, A, B+, B, C, D, E or F according to the following ranges:

90 – 100   A+
80 – 89    A
70 – 79    B+
60 – 69    B
50 – 59    C
40 – 49    D
30 – 39    E
0 – 29     F
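
In the workshop the grades are entered by hand in the WEKA editor. For larger files the same ranges could be applied programmatically; the Python sketch below is one possible way, with `GRADE_BANDS` and `grade` as hypothetical names and integer marks between 0 and 100 assumed.

```python
# Lower bound of each band, highest first; anything below 30 is an F.
GRADE_BANDS = [(90, "A+"), (80, "A"), (70, "B+"), (60, "B"),
               (50, "C"), (40, "D"), (30, "E")]

def grade(score):
    """Map a mark between 0 and 100 to its letter grade."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    for lower_bound, label in GRADE_BANDS:
        if score >= lower_bound:
            return label
    return "F"

print([grade(s) for s in (95, 89, 30, 29, 0)])
# ['A+', 'A', 'E', 'F', 'F']
```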
Save the file.

NTW: Data analytics for Layman using WEKA

Data Pre-processing and Knowledge Discovery
Using Weka to Generate Association Rules with the Apriori Algorithm

Data Pre-Processing in WEKA
• Waikato Environment for Knowledge Analysis (WEKA)
• https://www.cs.waikato.ac.nz/ml/weka/index.html
WEKA
• Weka is a collection of machine learning algorithms for data mining tasks.
• The algorithms can either be applied directly to a dataset or called from your own Java code.
• Weka contains tools for data pre-processing, classification, regression, clustering, association rules and visualization.

The Explorer
• Gives access to all facilities of Weka using menu selection and form filling.
• Prepare the data, open the Explorer and load the data.
• Flip back and forth between results, evaluate models built on different datasets, and visualize graphically both models and datasets, including classification errors.

Preparing the data
• Data can be imported from a file in various formats, e.g.:
  – CSV (comma-separated values)
  – ARFF (attribute-relation file format)
  – Binary serialized instances
  – Matlab ASCII files

• ARFF files have two sections: HEADER and DATA.
• HEADER:
  @RELATION <dataset name>
  @ATTRIBUTE <attribute name> <type>
• DATA:
  @DATA
  <list of data rows>
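
The two-section layout can be produced with a few lines of Python. The sketch below is a simplified illustration: the function `to_arff` and the sample relation are hypothetical, and values that contain spaces or commas would need quoting, which is not handled here.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text.

    attributes is a list of (name, kind) pairs, where kind is "numeric",
    "string", or a list of nominal labels.
    """
    lines = [f"@RELATION {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, list):
            # Nominal attributes list their allowed values in braces.
            kind = "{" + ",".join(kind) + "}"
        lines.append(f"@ATTRIBUTE {name} {kind}")
    lines += ["", "@DATA"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"

text = to_arff(
    "example",
    [("age", "numeric"), ("color", ["red", "green", "blue"])],
    [[20, "red"], [18, "blue"]],
)
print(text)
```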
Attributes in WEKA
• Nominal: one of a predefined list of values
  – e.g. red, green, blue
• Numeric: a real or integer number
• String: enclosed in “double quotes”
• Date
• Relational

Apriori Algorithm in Weka

Using Weka to Generate Association Rules with the Apriori Algorithm
Open the Excel file.
Save it as a CSV file > click Save > click Yes.
Since most of the attributes in the data are not numeric, the file is converted to .arff first, so that the attribute types can be declared before it is opened in Weka.

Open the CSV file in MS Word and you will see the data as comma-separated text.

Above the data, add the @RELATION line, an @ATTRIBUTE line of type string for each attribute, and the @DATA line. Remove the empty entries (, ,).
Save As > click Save > click OK > close the file.

Open the MS Word version of Women_clothing in Notepad and save the file as Women_clothing.arff.

You now have an ARFF version of the Women_clothing file, which can be opened in Weka.
Open the Women_clothing.arff file in Weka and inspect the type of each attribute.

To run the Apriori algorithm, the string attributes need to be transformed to nominal type:
1. Select > filters > unsupervised > attribute > StringToNominal > right-click > Show properties > write 1-7
2. Click Apply
Inspect all the attribute types.

Remove Gender and Age, as we want to keep only the nominal attributes in order to run the Apriori algorithm.

Select Associate > right-click > Show properties, then set the minimum support.
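
To show what the minimum support setting controls, the Python sketch below computes support and confidence on a handful of made-up transactions and runs a simplified level-wise search in the spirit of Apriori. It omits the subset-pruning step and is not Weka's implementation; the transactions and all function names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

def frequent_itemsets(transactions, min_support):
    """Find all itemsets whose support is at least min_support."""
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    size = 1
    while candidates:
        # Keep only the candidates that reach the minimum support.
        kept = {c: s for c in candidates
                if (s := support(transactions, c)) >= min_support}
        frequent.update(kept)
        size += 1
        # Join step: combine frequent itemsets into larger candidates.
        candidates = list({a | b for a in kept for b in kept
                           if len(a | b) == size})
    return frequent

# Made-up shopping baskets, for illustration only.
transactions = [{"dress", "scarf"}, {"dress", "belt"},
                {"dress", "scarf", "belt"}, {"skirt"}]
result = frequent_itemsets(transactions, min_support=0.5)
for itemset, s in result.items():
    print(sorted(itemset), s)
print(confidence(transactions, {"dress"}, {"scarf"}))
```

Raising the minimum support leaves fewer itemsets and therefore fewer rules; lowering it produces more rules, including ones backed by very few transactions.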
Evaluate and interpret the results.

Save the file as Women_clothing1.arff.

NTW: Data analytics for Layman using WEKA
Pattern and Knowledge Discovery
Using WEKA for Clustering

Clustering
• Open the Iris data file in WEKA.
• Check the attributes.
• Choose No class.
• Cluster tab > Clusterer > Choose > SimpleKMeans.
• Right-click > Show properties…
• Change numClusters to 3 and click OK.
• Click Start.
• Analyze the result.
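
For readers curious about what happens after Start is clicked, here is a minimal Python sketch of the k-means idea with three clusters. It is not Weka's SimpleKMeans: the points are made up rather than taken from the Iris data, and the initial centroids are fixed by hand so the run is repeatable, which is an assumption of this sketch.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Made-up 2-D points forming three loose groups.
points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0),
          (5.2, 4.8), (9.0, 1.0), (8.8, 1.2)]
centroids, clusters = kmeans(points, [(1.0, 1.0), (5.0, 5.0), (9.0, 1.0)])
for centre, members in zip(centroids, clusters):
    print(centre, members)
```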
• Right-click > Visualize cluster assignments.
• Choose:
  X: TotalsalesQuantity(Num)
  Y: TotalSalesValue(Num)
• Drag Jitter to the right side.
• Examine the result of the cluster analysis.

Want to dig deeper?
Hope we have more time!
• Decision Tree
• Neural Network (with source code!)

References
• Data mining and KDD (SIGKDD: CDROM)
  – Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
  – Journals: Data Mining and Knowledge Discovery, KDD Explorations
• Database systems (SIGMOD: CDROM)
  – Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
  – Journals: ACM-TODS, IEEE-TKDE, JIIS, J. ACM, etc.
• AI & Machine Learning
  – Conferences: Machine Learning (ML), AAAI, IJCAI, COLT (Learning Theory), etc.
  – Journals: Machine Learning, Artificial Intelligence, etc.
• Statistics
  – Conferences: Joint Stat. Meeting, etc.
  – Journals: Annals of Statistics, etc.
• Visualization
  – Conference proceedings: CHI, ACM-SIGGRAPH, etc.
  – Journals: IEEE Trans. Visualization and Computer Graphics, etc.
• Website:
  – http://www.kdnuggets.com/
Thank you and see you again next time

Would you like to ask me personally, or are you interested in becoming my postgraduate student?
nurdiyana@upnm.edu.my

+0176962946
