NTW: Data analytics for Layman using WEKA
Introduction
“… we are actually living in the data age” (Han et al., 2012)
Data
• Data are any facts, numbers, or text that can be processed by a computer. This includes:
– Operational or transactional data: sales, cost, inventory, payroll, accounting, etc.
– Non-operational data: industry sales, forecast data, remote sensor readings from satellites, gene expression data from microarrays, microeconomic data, etc.
– Metadata: data about the data itself, such as a logical database design or data dictionary definitions
Metadata
Data
• Modern ICT makes it very easy to generate large volumes of data, and because storage is cheap, there is a tendency to keep that data regardless of whether it serves a purpose.
• Any organization can benefit from collecting and analysing its data.
• Analysing data is crucial for knowledge-driven decision making.
• The problem then becomes how to analyse these data.
Data Analysis
• Data analysis (also called data analytics) is a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision-making.
• Data analysis methods include simple query and reporting, statistical analysis, more complex multidimensional analysis, and data mining.
Definition of Data Mining
• Data mining is the process of finding interesting structure in data (Roiger, 2017).
• Data mining is the process of automatically discovering useful information in large data repositories (Tan et al., 2006).
• Data mining is the application of specific algorithms for extracting patterns from data (Fayyad, 1996).
• “Data mining is a process of discovering various models, summaries and derived values from a given collection of data” (Kantardzic, 2020).
Data Mining Process
[Figure: the data mining process (Kantardzic, 2020)]
[Figure: Knowledge Discovery in Databases (KDD)]
Data Mining Techniques
Variable type | Unsupervised Learning: Independent Variable (x) | Supervised Learning: Independent Variable (x) | Supervised Learning: Dependent Variable (y)
Numerical | Clustering (K-means) | Simple Linear Regression, Multiple Regression, Logistic Regression, Decision Trees | Simple Linear Regression, Multiple Regression
Categorical | Association rules (Apriori, FP-growth) | Logistic Regression, Decision Trees | Logistic Regression, Decision Trees
Data Mining Tools
Data Mining Applications
Area | Application
Finance/Banking | Credit card analysis, loyal customers
Insurance | Claims, fraud, churn analysis
Telecommunication | Call record analysis
Transport | Logistics management
Consumer goods | Promotion analysis
Scientific research | Image, video, speech pattern analysis
Utilities | Power usage analysis
Law enforcement | Crime analysis
NTW: Basic Data analytics using WEKA
Data Pre-Processing
Data, Objects and Attributes
• Data is a collection of objects and their attributes.
• An attribute is a property or characteristic of an object.
– Examples: eye color of a person, temperature, etc.
• An attribute is also known as a variable, field, characteristic, or feature.
• A collection of attributes describes an object.
• An object is also known as a record, point, case, sample, entity, or instance.
Example (columns are attributes, rows are objects):
ID | Age | Height (m) | Height (ft) | Books Category
101 | 20 | 1.75 | 5.74 | SDA321
102 | 18 | 1.80 | 5.91 | ENG364
103 | 21 | 1.53 | 5.02 | IT543
104 | 55 | 1.65 | 5.41 | SDQ735
105 | 20 | 1.58 | 5.18 | IT954
106 | 21 | 1.63 | 5.35 | ENG735
107 | 19 | 1.72 | 5.64 | SDM628
Types of Attribute
• Categorical (qualitative) types: Nominal, Ordinal
• Numerical (quantitative) types: Interval, Ratio
Properties of Attribute Values
• The type of an attribute depends on which of the
following properties it possesses:
✓ Distinctness: = ≠
✓ Order: < >
✓ Addition/subtraction: + -
✓ Multiplication/division: * /
❖ Nominal attribute: distinctness
❖ Ordinal attribute: distinctness and order
❖ Interval attribute: distinctness, order and addition/subtraction
❖ Ratio attribute: all 4 properties
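The nesting of the four properties can be sketched in a few lines of Python. This is only an illustration of the list above; the names PROPERTIES, LEVELS, supported_properties and supports are hypothetical and are not part of WEKA.

```python
# The four properties in the order they accumulate across attribute types.
PROPERTIES = ["distinctness", "order", "addition/subtraction", "multiplication/division"]

# Each attribute type possesses the first n properties of the list above.
LEVELS = {"nominal": 1, "ordinal": 2, "interval": 3, "ratio": 4}


def supported_properties(attribute_type):
    """Return the properties that an attribute of the given type possesses."""
    return PROPERTIES[:LEVELS[attribute_type]]


def supports(attribute_type, prop):
    """Return True if the attribute type possesses the given property."""
    return prop in supported_properties(attribute_type)


if __name__ == "__main__":
    for attribute_type in LEVELS:
        print(attribute_type, "->", ", ".join(supported_properties(attribute_type)))
```

For example, supports("interval", "multiplication/division") is False, which is why a ratio such as "twice as hot" is not meaningful for a temperature in degrees Celsius.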
Why pre-process the data?
• Some data pre-processing is needed for all mining
tools.
• The purpose of pre-processing is to transform data
sets so that their information content is best exposed
to the mining tools.
• Pre-processing data also prepares the miner so that
when using prepared data, the miner produces better
models.
Major tasks in data pre-processing
1. Data Cleaning
2. Data Integration
3. Data Transformation
4. Data Reduction
1. Data Cleaning
• Data in the real world is dirty: lots of potentially incorrect data, e.g., from faulty instruments, human or computer error, transmission errors, etc.
– Incomplete (missing values): lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
❖ e.g., Occupation = “ ” (missing values)
– Noisy: containing noise, errors, or outliers
❖ e.g., Salary = “−10” (an error)
– Inconsistent: containing discrepancies in codes or names, e.g.,
❖ Age = “42”, Birthday = “03/07/2010”
❖ Was rating “1, 2, 3”, now rating “A, B, C”
❖ discrepancy between duplicate records
– Intentional (e.g., disguised missing data)
❖ Jan. 1 as everyone’s birthday?
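The three kinds of dirty data can be checked programmatically. The Python sketch below is a minimal illustration, not a WEKA feature: the function find_problems, the field names (Occupation, Salary, Age, BirthYear) and the reference year are all assumptions made for the example.

```python
def find_problems(record, reference_year):
    """Return a list of data-quality problems found in one record (a dict).

    Assumed, hypothetical fields: Occupation, Salary, Age, BirthYear.
    """
    problems = []

    # Incomplete: a value that is absent or contains only whitespace.
    for name, value in record.items():
        if value is None or str(value).strip() == "":
            problems.append(f"missing value: {name}")

    # Noisy: a salary cannot be negative.
    salary = record.get("Salary")
    if isinstance(salary, (int, float)) and salary < 0:
        problems.append("noisy value: Salary is negative")

    # Inconsistent: the stated age should agree with the birth year
    # (a difference of one year is allowed for birthdays not yet reached).
    age = record.get("Age")
    birth_year = record.get("BirthYear")
    if isinstance(age, int) and isinstance(birth_year, int):
        implied_age = reference_year - birth_year
        if abs(age - implied_age) > 1:
            problems.append("inconsistent: Age does not match BirthYear")

    return problems


if __name__ == "__main__":
    dirty = {"Occupation": " ", "Salary": -10, "Age": 42, "BirthYear": 2010}
    for problem in find_problems(dirty, reference_year=2024):
        print(problem)
```

Disguised missing data, such as Jan. 1 as everyone's birthday, is harder to detect with a per-record rule; it usually shows up only when the distribution of a whole column is inspected.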
Data Pre-Processing (HANDS-ON)
Preparing the .csv file
• Open the Excel file (Data2.xlsx).
• Save it as Data2.csv.
• Close the Data2.csv file.
Data Pre-Processing in WEKA
• Open the Data2.csv file in WEKA.
• Check the attribute names.
• All attribute types were set as numeric.
• Is this correct?
• Click Edit to view the whole dataset.
• How can the NoMatrik attribute be changed to nominal type?
• Choose > Filter > Unsupervised > Attribute > NumericToNominal.
• Right-click the filter and change the attribute indices from first-last to first, so that only the first attribute is converted.
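Conceptually, converting a numeric attribute to nominal means treating each distinct number as a label instead of a quantity. The Python sketch below illustrates that idea only; it is not WEKA's implementation, and the function name numeric_to_nominal and the sample values are hypothetical.

```python
def numeric_to_nominal(rows, column):
    """Treat the distinct values of one column as nominal labels.

    rows is a list of dicts. Returns (labels, converted_rows), where labels is
    the sorted list of distinct values as text and converted_rows holds the
    same data with the chosen column stored as text.
    """
    labels = [str(value) for value in sorted({row[column] for row in rows})]
    converted_rows = [{**row, column: str(row[column])} for row in rows]
    return labels, converted_rows


if __name__ == "__main__":
    # Hypothetical student records; the ID numbers are made up.
    rows = [
        {"NoMatrik": 2021003, "Mark": 85},
        {"NoMatrik": 2021001, "Mark": 62},
    ]
    labels, converted = numeric_to_nominal(rows, "NoMatrik")
    print(labels)
    print(converted)
```

An identifier such as NoMatrik is a natural candidate for this conversion: its values only need to be distinguishable, and averaging or adding them has no meaning.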
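• Choose > Add, then right-click the filter to open its properties.
• Change the attribute index to last.
• Change the attribute name to Grade and the attribute type to Nominal attribute.
• Type the nominal labels: A+, A, B+, B, C, D, E, F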
• Set the Grade attribute to A+, A, B+, B, C, D, E or F using these ranges:
90–100 | A+
80–89 | A
70–79 | B+
60–69 | B
50–59 | C
40–49 | D
30–39 | E
20–29 | F
10–19 | F
0–9 | F
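The grade ranges can also be written as a small lookup function, which is handy for checking values entered by hand. This Python sketch is an illustration only; the function name grade and the parameter name mark are assumptions, and marks that fall between two integer ranges (for example 89.5) are assigned to the lower band.

```python
# Lower bound of each band, from highest to lowest; anything below 30 is F.
BANDS = [(90, "A+"), (80, "A"), (70, "B+"), (60, "B"), (50, "C"), (40, "D"), (30, "E")]


def grade(mark):
    """Return the grade label for a mark between 0 and 100."""
    if not 0 <= mark <= 100:
        raise ValueError(f"mark out of range: {mark}")
    for lower_bound, label in BANDS:
        if mark >= lower_bound:
            return label
    return "F"


if __name__ == "__main__":
    for mark in (95, 80, 59, 30, 12):
        print(mark, grade(mark))
```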
• Save the file.
Data Pre-processing and Knowledge Discovery
Using Weka to Run the Apriori Algorithm
Data Pre-Processing in WEKA
• Waikato Environment for Knowledge Analysis (WEKA)
• https://www.cs.waikato.ac.nz/ml/weka/index.html
WEKA
• Weka is a collection of machine learning
algorithms for data mining tasks.
• The algorithms can either be applied directly to a
dataset or called from your own Java code.
• Weka contains tools for data pre-processing,
classification, regression, clustering, association
rules and visualization.
The Explorer
• Gives access to all facilities of Weka using menu selection and form filling
• Prepare the data, open the Explorer and load the data
• Flip back and forth between results, evaluate models built on different datasets, and visualize both models and datasets graphically, including classification errors
Preparing the data
• Data can be imported from a file in various formats, e.g.:
– CSV (comma-separated values)
– ARFF (attribute-relation file format)
– Binary serialized instances
– Matlab ASCII files
• ARFF files have two sections, HEADER and DATA.
• HEADER:
@RELATION <dataset name>
@ATTRIBUTE <attribute name> <attribute type>
• DATA:
@DATA
<one line per instance, values separated by commas>
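The two-section layout can be produced with a few lines of Python. This is a minimal sketch under stated assumptions: the function name to_arff, the relation name and the sample rows are hypothetical, a nominal type is passed in as a ready-made {a,b,c} list, and values containing spaces or commas (which would need quoting) are not handled.

```python
def to_arff(relation, attributes, rows):
    """Build the text of an ARFF file.

    attributes is a list of (name, type) pairs, e.g. ("size", "NUMERIC") or
    ("colour", "{red,green,blue}"); rows is a list of value lists.
    """
    # HEADER section: relation name, then one line per attribute.
    lines = [f"@RELATION {relation}", ""]
    for name, attribute_type in attributes:
        lines.append(f"@ATTRIBUTE {name} {attribute_type}")

    # DATA section: one comma-separated line per instance.
    lines += ["", "@DATA"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = to_arff(
        "colours",
        [("colour", "{red,green,blue}"), ("size", "NUMERIC")],
        [["red", 1.5], ["blue", 2]],
    )
    print(text)
```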
Attributes in WEKA
• Nominal: one of a predefined list of values
- e.g. red, green, blue
• Numeric: A real or integer number
• String: Enclosed in “double quotes”
• Date
• Relational
Apriori Algorithm in Weka
• Open the Excel file.
• Save it as a .csv file: click Save, then click Yes.
• Since most of the attributes in the data are not numeric, the file needs to be converted to .arff before it is opened in Weka.
• Open the .csv file in MS Word; the data appear as comma-separated text.
• At the top of the file, add an @RELATION line, one @ATTRIBUTE line of type string for each attribute, and an @DATA line before the data rows.
• Remove the stray “, ,” sequences.
• Save As > click Save > click OK > close the file.
• Open the Women_clothing file saved from MS Word in Notepad.
• Save the file as Women_clothing.arff.
• The result is an ARFF version of the Women_clothing file, which can be opened in Weka.
• Open the Women_clothing.arff file in Weka.
• Inspect the type of each attribute.
• To run the Apriori algorithm, the string attributes must be converted to nominal type.
1. Choose > filters > unsupervised > attribute > StringToNominal > right-click > Show properties > enter 1-7 as the attribute range
2. Click Apply
• Inspect all the attribute types.
• Remove Gender and Age so that only the nominal attributes remain for the Apriori algorithm.
• Select Associate > right-click > Show properties.
• Check the minimum support setting.
• Evaluate and interpret the results.
• Save the file as Women_clothing1.arff.
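To help interpret the results, the Python sketch below shows how support and confidence are computed and how an Apriori-style, level-wise search uses the minimum support to prune itemsets. It is a simplified illustration, not Weka's implementation; the function names and the sample transactions are hypothetical and do not come from the Women_clothing data.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent -> consequent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)


def frequent_itemsets(transactions, min_support):
    """Level-wise search: return {itemset: support} for all frequent itemsets."""
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    size = 1
    while candidates:
        # Keep only the candidates that meet the minimum support.
        kept = {}
        for candidate in candidates:
            s = support(candidate, transactions)
            if s >= min_support:
                kept[candidate] = s
        frequent.update(kept)

        # Join frequent itemsets into candidates one item larger, and prune
        # any candidate that has an infrequent subset (the Apriori property).
        size += 1
        joined = {a | b for a in kept for b in kept if len(a | b) == size}
        candidates = [
            c for c in joined
            if all(frozenset(sub) in kept for sub in combinations(c, size - 1))
        ]
    return frequent


if __name__ == "__main__":
    # Hypothetical shopping baskets.
    baskets = [
        frozenset({"blouse", "skirt"}),
        frozenset({"blouse", "skirt", "scarf"}),
        frozenset({"blouse", "scarf"}),
        frozenset({"dress"}),
    ]
    for itemset, s in frequent_itemsets(baskets, min_support=0.5).items():
        print(sorted(itemset), s)
    print(confidence(frozenset({"skirt"}), frozenset({"blouse"}), baskets))
```

Raising the minimum support shrinks the list of frequent itemsets and therefore the number of rules; lowering it has the opposite effect.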
Pattern and Knowledge Discovery
Using WEKA for Clustering
Clustering
• Open the Iris data file supplied with WEKA.
• Check the attributes.
• Choose No class.
• Cluster tab > Clusterer > Choose > SimpleKMeans
• Right-click > Show properties…
• Change numClusters to 3.
• Click OK.
• Click Start.
• Analyze the result.
• Right-click > Visualize cluster assignments
• Choose:
X: TotalsalesQuantity(Num)
Y: TotalSalesValue(Num)
• Drag Jitter to the right side.
• Examine the result of the cluster analysis.
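The idea behind k-means can be sketched in plain Python: assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat until nothing changes. This is a minimal illustration, not WEKA's SimpleKMeans; the function name k_means and the sample points are hypothetical, and taking the first k points as starting centroids is a simplifying assumption made so that the result is repeatable.

```python
import math


def k_means(points, k, max_iterations=100):
    """Cluster points (tuples of numbers) into k groups.

    Returns (assignments, centroids), where assignments[i] is the cluster
    index of points[i].
    """
    # Assumption: the first k points serve as the initial centroids.
    centroids = [list(p) for p in points[:k]]
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # No point changed cluster, so the algorithm has converged.
        assignments = new_assignments

        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(values) / len(members) for values in zip(*members)]
    return assignments, centroids


if __name__ == "__main__":
    # Hypothetical two-dimensional points forming three loose groups.
    points = [(1.0, 1.0), (5.0, 5.0), (9.0, 1.0), (1.2, 0.8), (5.2, 4.8), (9.1, 1.2)]
    assignments, centroids = k_means(points, k=3)
    print(assignments)
    print(centroids)
```

The outcome of k-means depends on the starting centroids, so a different initialisation can give different clusters on the same data.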
Want to dig deeper?
Hope we have more time!
• Decision Tree
• Neural Network (with source code!)
Thank you and see you again next time
Have questions, or interested in becoming my postgraduate student? Contact me personally:
nurdiyana@upnm.edu.my
+0176962946