Chapter 1: Introduction
Overall Introduction
Zhang Jie
Nanyang Technological University
Outline
• Course Introduction
Information about Instructor
• Zhang Jie, Professor
College of Computing and Data Science, NTU
• Email: ZhangJ@ntu.edu.sg
• Phone: 6790-6245
• Office: N4-02C-100
• Webpage: https://www.ntu.edu.sg/home/zhangj/
Plan for MH6151
(subject to minor changes)
• 11 Lectures
– Overall Introduction (1 lecture)
– Data and Data Exploration (2 lectures)
– Classification Algorithms (2 lectures)
– Ensemble Learning (1 lecture)
– Regression (1 lecture)
– Association Rule Mining (1.5 lectures)
– Clustering (2.5 lectures)
• Course project demonstration and
presentation (Weeks 12 & 13)
• Final exam (Week 14)
Logistics
• Venue: check the timetable
• Q&A/Consultations
– After or between lectures
– By email
– In person: contact me by email or phone first to make an appointment
Course Description
• Data mining (data analytics) is a diverse field
that draws its foundations from many research
areas such as databases, machine learning, AI,
and statistics.
• This course aims to introduce the concepts,
algorithms, and techniques of data mining,
including 1) data preprocessing, 2) association
rule mining, 3) clustering, 4) classification, and
to cover some applications of data mining
techniques.
Course Objective
• At the end of the course, students should
• Have a good knowledge of data mining concepts
• Be familiar with various data mining algorithms.
• Given a dataset and problem statement/task,
know how to select appropriate data mining
techniques to analyze data and address the
problem.
• MH8111 and MH8112 (Analytics Software) will
give you more knowledge of analytics tools (e.g.
R, Python, Tableau, Weka) so that you can apply
the knowledge learned in this course
Target Audience
• Students who intend to solve problems and perform
tasks in data mining or related fields
• To bridge the gap between the maths/algorithms
knowledge needed for problem solving and what is
taught at the undergraduate level
• Students who intend to become data engineers and
apply data mining techniques to real-world
applications. Take time to practice, to sharpen your
skills, and to gain additional knowledge (programming,
fundamentals, experience processing different
data/problem statements), and integrate it with domain
knowledge (a purely data-driven approach may not work).
Is the content hard to learn?
• As a lecturer, I need to make sure most students
can understand the majority of the key content and
know how to apply it
• So, for students with a CS/CE/Math/Engineering
background, you should be able to follow my lectures
without much difficulty.
• If you want to learn something more challenging or very
theoretical (e.g. to do research), you might be a
little disappointed. You may need to take
additional courses, e.g. online courses/videos
from top professors in the world
Assessment
• Participation and attendance: 10%
• One assignment: 15%
• Late submission of the assignment will result in a
reduced grade for the assignment.
• Group project report and code: 30%
• 4-5 members (randomly formed)
• Topic: a data mining task (will be released later)
• Within a group, each member will receive an individual mark
depending on his/her contribution to the project
• Project demonstration and presentation: 15%
• Final exam: 30%
Textbook and Reference
• Textbook
• Introduction to Data Mining, by Pang-Ning Tan,
Michael Steinbach, and Vipin Kumar, Addison
Wesley, 2005.
• Reference:
• Data Mining: Concepts and Techniques, 3rd ed.,
by Jiawei Han, Micheline Kamber, and Jian Pei,
Morgan Kaufmann, 2011.
• Data Mining: Practical Machine Learning Tools
and Techniques with Java Implementations, 3rd
ed., by Ian H. Witten, Eibe Frank, and Mark A.
Hall, Morgan Kaufmann, 2011.
Useful Resources
• Google Scholar – http://scholar.google.com
• Related conferences:
• KDD, ICDM, SDM, ECML/PKDD, PAKDD, ICML, NIPS,
AAAI, IJCAI, WWW, SIGMOD, VLDB, ICDE, CIKM, etc.
• Related journals:
• TKDE, TKDD, DMKD etc.
• Data mining competition platform:
• Kaggle – http://www.kaggle.com/
• Data mining community's resource:
• KDnuggets – http://www.kdnuggets.com/
Data Mining Tools
• R
• Python
• Weka
• RapidMiner
• Orange
• IBM SPSS modeler
• SAS Data Mining
• Oracle Data Mining
• ……
Outline
• Module Introduction
A Motivating Example
• Background: flight efficiency management
• Opened 3 months of flight data usually not available to the public!
• Winning results: on average 40% and 45% lower errors for gate and runway arrival
time, respectively, compared to the standard industry benchmark, using a mixture of
prediction models (GBM and RF models)
• Publicity and potential impact: news report, huge potential impact, …
Big Data Analytics
• VOLUME: terabytes of transactions, tables, files
• VELOCITY: batch, time series (uniform time interval), streams (continuous)
• VARIETY: structured, unstructured, semi-structured data
• Processing + analytics turn this data into Value (actionable insights)
Big Data Analytics: Why Big Data?
http://www.youtube.com/watch?v=7D1CQ_LOizA
Motivation: Why Mine Data?
• Many data become available
• Facebook: 800 million active users; 60 billion photos in total, 250 million photos
uploaded per day; 80 groups/events per user (as of Feb 2011)
• Flickr: 60 million users; five billion photos; 10 million groups (as of Feb 2011)
• Twitter: 175 million registered users; 140 million tweets per day (as of June 2011)
• Weibo: 200 million users
Data Mining
• Jessica’s blocks: whose block is this?
• http://how-old.net/
What is Data Mining?
• Many Definitions
• Non-trivial extraction of implicit, previously
unknown and potentially useful information
from data
• Exploration & analysis, by automatic or
semi-automatic means, of large quantities of
data in order to discover meaningful patterns
https://www.youtube.com/watch?v=R-sGvh6tI04
Definitions: KDD & Data Mining
Data Mining: A KDD Process
• Data Mining: the core step of the KDD process
• Application of specific algorithms for extracting patterns from data
• We may not use a data warehouse
What is (not) Data Mining?
What is not Data Mining?
– Look up a phone number in a phone directory
– Query a Web search engine for information about “Amazon”
What is Data Mining?
– Certain names are more prevalent in certain US locations
(O’Brien, O’Rourke, O’Reilly, … in the Boston area)
– Group together similar documents returned by a search engine
according to their context (e.g. Amazon rainforest vs. Amazon.com;
the fruit apple vs. apple.com)
Origins of Data Mining
• Draws ideas from machine learning/AI, pattern recognition,
statistics, and database systems
• Traditional techniques may be unsuitable due to
• Volume: enormity of data
• Complexity
• High dimensionality of data
• Distribution of data
• Heterogeneity of data (variety)
• Time series / streams (velocity)
• ……
• (Venn diagram: data mining at the intersection of Statistics,
Machine Learning/AI/Pattern Recognition, and Database Systems)
Why not use classical data analysis?
1. Data Preprocessing
A. Data Integration
Combine multiple data sources (more relevant data, rich insights;
no data, no insights)
B. Data Cleaning
Remove noise and inconsistent data (GIGO)
C. Data Selection
Select task-relevant data
D. Data Transformation
Transform selected data for further analysis
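The four preprocessing steps (A-D) above can be illustrated with a few lines of pandas. A minimal sketch follows, assuming pandas is installed; the file names and column names (customers.csv, transactions.csv, id, age, income, amount) are hypothetical and for illustration only.

import pandas as pd

# A. Data Integration: combine two hypothetical data sources on a shared key
customers = pd.read_csv("customers.csv")        # columns: id, age, income
transactions = pd.read_csv("transactions.csv")  # columns: id, amount
data = customers.merge(transactions, on="id", how="inner")

# B. Data Cleaning: remove duplicates, fill missing values, drop inconsistent rows
data = data.drop_duplicates()
data["age"] = data["age"].fillna(data["age"].median())
data = data[data["amount"] >= 0]

# C. Data Selection: keep only the task-relevant attributes
data = data[["age", "income", "amount"]]

# D. Data Transformation: rescale each numeric attribute to [0, 1]
data = (data - data.min()) / (data.max() - data.min())
print(data.head())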
Major Steps of Data Mining (KDD)
2. Data Mining
Apply data mining & machine learning methods (e.g., association,
classification, clustering, regression, anomaly detection) to extract
patterns from data
3. Pattern Evaluation (Post Processing)
Evaluate the performance & identify truly interesting patterns or
models
4. Visualization (Post Processing; we also frequently use
visualization at the data mining stage to better understand
the data)
Present the mined patterns and prediction results to users
Although “Data Mining” is just one of the many steps,
it is usually used to refer to the whole process of KDD
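As a concrete illustration of step 3, a mined model is usually evaluated on data it has not seen. A minimal sketch, assuming scikit-learn is installed; the bundled breast-cancer toy dataset is used only for illustration.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Hold out 30% of the records for evaluation
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Step 2 (data mining): learn a model from the training portion
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Step 3 (pattern evaluation): measure how well the model generalises to unseen records
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))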
The Architecture of a Typical Data Mining System
• (Layered view, bottom to top): Databases → Data Preprocessing
(Integration/Cleaning/Selection/Reduction/Transformation) → Visualization → User
Data Mining & Business Intelligence
Oftentimes, you may not have a business analyst to present your mining results to
management. It is critical that you can explain them in business context and language.
Outline
• Module Introduction
Data Mining Tasks
• Prediction Methods
• Use some variables to predict unknown or future values of
other variables.
• Description Methods
• Find human-interpretable patterns that describe the data.
From [Fayyad et al.], Advances in Knowledge Discovery and Data Mining, 1996
Data Mining Tasks...
• Classification [Predictive]
• Regression [Predictive]
• Outlier Detection (Deviation Detection) [Predictive]
• Clustering [Descriptive]
• Association Rule Discovery [Descriptive]
• Sequential Pattern Discovery [Descriptive]
All major analytics tools (e.g. R, Python, SAS, …) cover all of these mining tasks, so you
can use them directly. What we teach in this course gives you the theoretical foundations of
these methods.
Data Mining Taxonomy
• Data mining tasks are divided into Descriptive and Predictive methods (taxonomy diagram)
Example (Association Rule Discovery):
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
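The two rules above can be recovered mechanically by counting itemsets in the five transactions. The following self-contained sketch is a brute-force illustration (not the Apriori algorithm covered later in the course); the support and confidence thresholds (0.4 and 0.6) are chosen here so that the listed rules appear among the output.

from itertools import combinations

transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]
items = sorted(set().union(*transactions))
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / n

# Enumerate itemsets of size 1..3 that are frequent (support >= 0.4)
frequent = [frozenset(c) for k in range(1, 4)
            for c in combinations(items, k) if support(frozenset(c)) >= 0.4]

# Generate rules X --> Y with confidence >= 0.6; the output includes
# {Milk} --> {Coke} (0.75) and {Diaper, Milk} --> {Beer} (0.67)
for itemset in frequent:
    if len(itemset) < 2:
        continue
    for k in range(1, len(itemset)):
        for lhs in map(frozenset, combinations(itemset, k)):
            rhs = itemset - lhs
            conf = support(itemset) / support(lhs)
            if conf >= 0.6:
                print(set(lhs), "-->", set(rhs), f"(confidence {conf:.2f})")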
Association Rule Mining: Application 1
Marketing and Sales Promotion
• Let the rule discovered be
{Coke, … } --> {Potato Chips}
• Potato Chips as consequent
– Can be used to determine what should be done to boost its
sales.
• Coke in the antecedent
– Can be used to see which products would be affected if the
store discontinues selling Coke.
• Coke in antecedent and Potato chips in consequent
– Can be used to see what products should be sold with Coke to
promote the sales of Potato Chips!
Association Rule Mining: Application 2
Supermarket shelf management.
• Goal: To identify items that are bought together by
sufficiently many customers.
• Approach: Process the point-of-sale data collected with
barcode scanners to find dependencies among items.
• Here is a classic rule: {diaper, milk} --> {beer}
• If a customer buys diapers and milk, then he is very likely to
buy beer.
• So, don’t be surprised if you find six-packs stacked next to
diapers!
Sequential Pattern Mining
Classification
• Also called Supervised Learning
• Learn from past experience/labels, and use the
learned knowledge to classify new data
• Knowledge learned by intelligent machine learning
algorithms
• Examples:
• Clinical diagnosis for patients
A Classification Example
Training examples (predictive attributes; the last value is the Class):
ID1  5  1  1  1  2  1  3  1  1 | 2
ID2  5  4  4  5  7 10  3  2  1 | 2
ID3  3  1  1  1  2  2  3  1  1 | 4
ID4  8 10 10  8  7 10  9  7  1 | 4
Find a model for the class attribute as a function of the values of the other attributes:
f(Clump Thickness, Uniformity of Cell Size, …, Mitoses) = Class
A test example: f(4, 6, 5, 6, 8, 9, 2, 4) = ?
Classification: Definition
• Given a collection of records (the training set), each record
contains a set of attributes, one of which is the class attribute.
• Find a model for the class attribute as a function of the
values of the other attributes.
• Goal: previously unseen records should be assigned a class as
accurately as possible.
(Watermelon example) There are 3 normal features and 1 target feature; this is a binary
classification problem. The objective is to learn an accurate model to help us pick good
watermelons, perhaps as a mobile app to make $.
If we have a lot of training data and quality sensors, we can build a very accurate model
using data mining. In addition, we could learn rules/insights, e.g. which feature is the
most important and which two could be used together to build a better model.
Classification Example
• (Figure: training data and the splitting attributes of a decision tree)
Decision Tree Classification Task
Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes
Induction: a tree induction algorithm learns a model (decision tree) from the training set.
Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Deduction: apply the learned model to the test set to predict the unknown class labels.
Apply Model to Test Data
Test Data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Start from the root of the tree and follow the branch that matches the test record at each node:
Refund?
  Yes -> NO
  No  -> MarSt?
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES
           Married -> NO
Refund = No leads to MarSt; Marital Status = Married reaches the leaf NO: assign Cheat to “No”
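The induction/deduction workflow above can be reproduced with scikit-learn. A minimal sketch, assuming pandas and scikit-learn are installed; note that the tree learned from these ten records is not necessarily the hand-drawn Refund/MarSt/TaxInc tree shown above.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Training set from the slide (Tid 1-10)
train = pd.DataFrame({
    "Attrib1": ["Yes","No","No","Yes","No","No","Yes","No","No","No"],
    "Attrib2": ["Large","Medium","Small","Medium","Large",
                "Medium","Large","Small","Medium","Small"],
    "Attrib3": [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],  # in K
    "Class":   ["No","No","No","No","Yes","No","No","Yes","No","Yes"],
})
# Test set from the slide (Tid 11-15, class unknown)
test = pd.DataFrame({
    "Attrib1": ["No","Yes","Yes","No","No"],
    "Attrib2": ["Small","Medium","Large","Small","Large"],
    "Attrib3": [55, 80, 110, 95, 67],
})

# One-hot encode the categorical attributes so the tree can split on them
X_train = pd.get_dummies(train.drop(columns="Class"))
X_test = pd.get_dummies(test).reindex(columns=X_train.columns, fill_value=0)

# Induction: learn the model from the training set
model = DecisionTreeClassifier(random_state=0).fit(X_train, train["Class"])
print(export_text(model, feature_names=list(X_train.columns)))

# Deduction: apply the model to the unseen test records
print(model.predict(X_test))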
Classification: Application 1
• Direct Marketing
• Goal: Reduce cost of mailing by targeting a set of consumers likely
to buy a new cell-phone product.
• Approach:
• Use the data for a similar product introduced before.
• We know which customers decided to buy and which decided otherwise.
This {buy, don’t buy} decision forms the class attribute.
• Collect various demographic, lifestyle, and teleco-interaction related
information about all such customers.
• Type of business, where they stay, how much they earn, etc.
• Use this information as input attributes to learn a classifier model.
Classification: Application 2
Fraud Detection
Goal: Predict fraudulent cases in credit card transactions.
Approach:
Use credit card transactions and the information on the account
holder as attributes: when does the customer buy, what does he buy,
how often does he pay on time, etc.
Label past transactions as fraud or fair transactions. This forms the
class attribute.
Learn a model for the class of the transactions.
Use this model to detect fraud by observing credit card transactions
on an account.
Classification: Application 3
• Customer Attrition/Churn:
• Goal: To predict whether a customer is likely to be lost to
a competitor (e.g. Singtel -> M1).
• Approach:
• Use detailed records of transactions with each of the past and
present customers to find attributes.
• How often the customer calls, where he calls, what time of the
day he calls most, his financial status, marital status, etc.
• Label the customers as loyal or disloyal.
• Find a model for loyalty.
Clustering
• Finding groups of objects such that the objects in a group
will be similar (or related) to one another and different
from (or unrelated to) the objects in other groups
• Automatically learn the structure of data
Main principle: intra-cluster distances are minimized; inter-cluster distances are
maximized (clusters C1 and C2 in the figure).
ID1  5  1  1  1  2  1  3  1  1
ID2  5  4  4  5  7 10  3  2  1
ID3  3  1  1  1  2  2  3  1  1
ID4  8 10 10  8  7 10  9  7  1
We can learn the relationships or structure of data by clustering, e.g. maybe ID1
and ID3 should be in one cluster?
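A minimal sketch of this idea, assuming scikit-learn is installed: running k-means with k = 2 on the four records above groups ID1 with ID3 (and ID2 with ID4), since those pairs are closest in Euclidean distance.

import numpy as np
from sklearn.cluster import KMeans

X = np.array([
    [5, 1, 1, 1, 2, 1, 3, 1, 1],     # ID1
    [5, 4, 4, 5, 7, 10, 3, 2, 1],    # ID2
    [3, 1, 1, 1, 2, 2, 3, 1, 1],     # ID3
    [8, 10, 10, 8, 7, 10, 9, 7, 1],  # ID4
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # e.g. [0 1 0 1]: ID1 and ID3 fall in the same cluster
print(kmeans.cluster_centers_)  # one centroid per cluster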
Clustering Definition
• Given a set of data points, each having a set of
attributes, and a similarity measure among them,
find clusters such that
• Data points in one cluster are more similar to one
another.
• Data points in separate clusters are less similar to one
another.
• Similarity Measures:
• Euclidean distance/cosine similarity if attributes are
continuous.
• Other Problem-specific Measures.
Clustering: Application 1
• Market Segmentation:
• Goal: subdivide a market into distinct subsets of customers
where any subset may conceivably be selected as a market
target to be reached.
• Approach:
• Collect different attributes of customers based on their
geographical and lifestyle related information.
• Find clusters of similar customers.
• Measure the clustering quality by observing buying patterns of
customers in same cluster vs. those from different clusters.
• Could be turned into a classification problem
Clustering: Application 2
• Document Clustering:
• Goal: To find groups of documents that are similar to each
other based on the important terms appearing in them.
• Approach: Identify frequently occurring terms in each
document. Form a similarity measure based on the
frequencies of different terms. Use a classic clustering
algorithm or topic modeling to perform the clustering.
Clusters can be described using keywords.
• Gain: Information Retrieval can utilize the clusters to relate a
new document or search term to clustered documents.
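A minimal sketch of this approach, assuming a recent scikit-learn is installed; the four toy documents are hypothetical and only illustrate the Amazon-rainforest vs. Amazon.com style of grouping mentioned earlier.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "the amazon rainforest hosts many plant and animal species",
    "rainforest deforestation threatens many species",
    "amazon.com reported strong online retail sales this quarter",
    "online retail sales and cloud revenue grew this quarter",
]

# Identify frequently occurring terms and weight them with TF-IDF
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Cluster the documents into two groups
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)

# Describe each cluster by the terms with the largest centroid weights (keywords)
terms = vectorizer.get_feature_names_out()
for c in range(2):
    top = np.argsort(km.cluster_centers_[c])[::-1][:3]
    print("cluster", c, "keywords:", [terms[i] for i in top])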
Illustrating Document Clustering
• Clustering Points: 3204 Articles of Los Angeles Times.
• Similarity Measure: How many words are common in these
documents (after some word filtering).
Prediction & Regression: Example 1
Relationship between systolic blood pressure (y), birthweight (x1), and age in days (x2)
 i   Birthweight in oz (x1)   Age in days (x2)   Systolic BP in mm Hg (y)
 1   135                      3                  89
 2   120                      4                  90
 3   100                      3                  83
 4   105                      2                  77
 5   130                      4                  92
 6   125                      5                  98
 7   125                      2                  82
 8   105                      3                  85
 9   120                      5                  96
10    90                      4                  95
11   120                      2                  80
12    95                      3                  79
13   120                      3                  86
14   150                      4                  97
15   160                      3                  92
16   125                      3                  88
Training the regression model: use the least-squares method to determine the regression equation:
y = 53.45 + 0.126 * x1 + 5.89 * x2
Prediction using the model: to predict the systolic BP of a baby with birthweight 8 lb (128 oz)
measured at 3 days of life:
y = 53.45 + 0.126 * (128) + 5.89 * (3) = 87.2 mm Hg
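The least-squares fit above can be reproduced directly. A minimal sketch, assuming NumPy and scikit-learn are installed; the fitted intercept and coefficients should come out close to 53.45, 0.126 and 5.89, and the prediction close to 87.2 mm Hg.

import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: birthweight in oz (x1), age in days (x2)
X = np.array([
    [135, 3], [120, 4], [100, 3], [105, 2], [130, 4], [125, 5],
    [125, 2], [105, 3], [120, 5], [90, 4], [120, 2], [95, 3],
    [120, 3], [150, 4], [160, 3], [125, 3],
])
# Systolic blood pressure in mm Hg (y)
y = np.array([89, 90, 83, 77, 92, 98, 82, 85, 96, 95, 80, 79, 86, 97, 92, 88])

# Least-squares fit of y = intercept + b1*x1 + b2*x2
model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)

# Predict systolic BP for a baby of 128 oz measured at 3 days of life
print(model.predict([[128, 3]]))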
Prediction & Regression: Example 2
Stock Market Prediction
Black dots: training data
Red Line (continuous and dashed): Predictions
Blue dots: test (unseen) actual data
http://www.gold-eagle.com/editorials_03/sornette112403.html
Difference and commonality between
Classification and Regression
Difference: regression predicts a continuous target (stock
price, inventory demand, flight arrival time, weight, etc.),
whereas classification predicts categorical/discrete
labels (e.g. stock up or down, good/bad watermelon,
cancer/normal, fraud/normal,
underweight/normal/overweight/obese, etc.)
Challenges of Data Mining
• Scalability
• Dimensionality
• Complex and Heterogeneous Data
• Data Quality
• Data Ownership and Distribution
• Privacy Preservation
• Streaming Data
• ……
Summary
• Mining can be performed on a variety of information
repositories
• Data mining functionalities: association, classification,
clustering, outlier and trend analysis, etc.
• Major issues in data mining include mining
methodologies, user interaction, and applications
Career in Data Mining
• 2011 salary survey (annual salary in US$)
• http://www.kdnuggets.com/polls/2011/data-mining-salary-income.html
• Median income = US$100K (vs. IT median of US$60K)
• KDD Conferences
– ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
– SIAM Data Mining Conf. (SDM)
– (IEEE) Int. Conf. on Data Mining (ICDM)
– Conf. on Principles and Practices of Knowledge Discovery and Data Mining (PKDD)
– Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
– WSDM
• Other related conferences
– ACM SIGMOD
– VLDB
– (IEEE) ICDE
– WWW, SIGIR
– ICML, CVPR, NIPS
• Journals
– Data Mining and Knowledge Discovery (DAMI or DMKD)
– IEEE Trans. on Knowledge and Data Eng. (TKDE)
– KDD Explorations
– ACM Trans. on KDD
Where to Find References? DBLP, CiteSeer, Google
• Data mining and KDD (SIGKDD)
• Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
• Journal: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD
• Database systems (SIGMOD: ACM SIGMOD Anthology)
• Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
• Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.
• AI & Machine Learning
• Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.
• Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-
PAMI, etc.
• Web and IR
• Conferences: SIGIR, WWW, CIKM, etc.
• Journals: WWW: Internet and Web Information Systems,
• Statistics
• Conferences: Joint Stat. Meeting, etc.
• Journals: Annals of statistics, etc.
• Visualization
• Conference proceedings: CHI, ACM-SIGGraph, etc.
• Journals: IEEE Trans. visualization and computer graphics, etc.
Recommended Reference Books
• P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Addison Wesley, 2005
• J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2nd ed., 2006
• D. J. Hand, H. Mannila, and P. Smyth, Principles of Data Mining, MIT Press, 2001
• T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer-
Verlag, 2001
• S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan Kaufmann, 2002
• T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003
• U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining.
AAAI/MIT Press, 1996
• U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann,
2001
• I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan
Kaufmann, 2nd ed., 2005
Contact: zhangj@ntu.edu.sg if you have questions