Chapter 1: Introduction

MH6151 Data Mining

Overall Introduction

Zhang Jie
Nanyang Technological University
Outline

• Course Introduction

• Definitions of Data Mining and KDD

• Data Mining Tasks

Information about Instructor
• Zhang Jie, Professor
College of Computing and Data Science, NTU
• Email: ZhangJ@ntu.edu.sg
• Phone: 6790-6245
• Office: N4-02C-100
• Webpage: https://www.ntu.edu.sg/home/zhangj/

Plan for MH6151
(subject to minor changes)
• 11 Lectures
– Overall Introduction (1 lecture)
– Data and Data Exploration (2 lectures)
– Classification Algorithms (2 lectures)
– Ensemble Learning (1 lecture)
– Regression (1 lecture)
– Association Rule Mining (1.5 lectures)
– Clustering (2.5 lectures)
• Course project demonstration and
presentation (Weeks 12 & 13)
• Final exam (Week 14)
Logistics
• Venue: check the timetable

• Time: 6.30-9.30pm (usually 1 break)

• Q&A/Consultations
– After/between the lectures
– Email me
– To meet in person, contact me first by email or phone
to make an appointment

Course Description
• Data mining (data analytics) is a diverse field
which draws its foundations from many research
areas such as databases, machine learning, AI,
statistics, etc.
• This course aims to introduce the concepts,
algorithms, and techniques of data mining,
including 1) data preprocessing, 2) association
rule mining, 3) clustering, and 4) classification, and
covers some applications of data mining
techniques.
Course Objective
• By the end of the course, students should:
• Have a good knowledge of data mining concepts
• Be familiar with various data mining algorithms
• Given a dataset and a problem statement/task,
know how to select appropriate data mining
techniques to analyze the data and address the
problem
• MH8111 and MH8112 (Analytics Software) will
give you more knowledge of analytics tools (e.g.
R, Python, Tableau, Weka) so that you can apply
the knowledge learned in this course
Target Audiences
• Students who intend to solve problems and perform
tasks in data mining or related fields
– To bridge the gap between the maths/algorithms
knowledge needed for problem solving and what is
taught at the undergraduate level
• Students who intend to become data engineers and
apply data mining techniques to real-world
applications
– Take time to practice, to sharpen your skills and gain
additional knowledge (programming, fundamentals,
experience with different data/PS), and to integrate it
with domain knowledge (a purely data-driven
approach may not work)
Is the content hard to learn?
• As a lecturer, I need to make sure most students
can understand the majority of the key content and
know how to apply it
• So, for CS/CE/Math/engineering students,
you should be able to follow my lectures without
much difficulty
• If you want to learn something more challenging or
very theoretical (e.g. to do research), you might be a
little disappointed. You may need to take
additional courses, e.g. online courses/videos
from top professors around the world
Assessment
• Participation and attendance: 10%
• One assignment: 15%
– Late submission would result in a reduced grade for
the assignment
• Group project report and code: 30%
– 4-5 members (randomly formed)
– Topic: a data mining task (to be released later)
– Within a group, each member receives an individual
mark depending on his/her contributions to the project
• Project demonstration and presentation: 15%
• Final exam: 30%
Textbook and Reference
• Textbook
• Introduction to Data Mining, by Pang-Ning Tan,
Michael Steinbach, and Vipin Kumar, Addison
Wesley, 2005.
• Reference:
• Data Mining: Concepts and Techniques, 3rd ed.,
by Jiawei Han, Micheline Kamber, and Jian Pei,
Morgan Kaufmann, 2011.
• Data Mining: Practical Machine Learning Tools
and Techniques with Java Implementations, 3rd
ed., by Ian H. Witten, Eibe Frank, and Mark A.
Hall, Morgan Kaufmann, 2011.
Useful Resources
• Google Scholar – http://scholar.google.com
• Related conferences:
• KDD, ICDM, SDM, ECML/PKDD, PAKDD, ICML, NIPS,
AAAI, IJCAI, WWW, SIGMOD, VLDB, ICDE, CIKM, etc.
• Related journals:
• TKDE, TKDD, DMKD etc.
• Data mining competition platform:
• Kaggle – http://www.kaggle.com/
• Data mining community's resource:
• Kdnuggets – http://www.kdnuggets.com/
Data Mining Tools
• R
• Python
• Weka
• RapidMiner
• Orange
• IBM SPSS modeler
• SAS Data Mining
• Oracle Data Mining
• ……
Outline

• Module Introduction

• Definitions of Data Mining and KDD

• Data Mining Tasks

A Motivating Example:
Background: flight efficiency management

Opened 3 months of flight data, usually not available to the public!

• Complex: many influencing factors

• Potential to benefit multiple parties:
o Airline: reduced buffer time, better fuel planning
o Airport: reduced gate congestion and idle time in crew management
o Passengers and logistics companies: savings in travelling time
Flight Efficiency Management

Predicting flight arrival timings at the runway and at the gate

179 teams, 242 participants, 3073 entries

http://www.gequest.com/c/flight
Big Real-World Flight Data
• Recent and novel
• Entire set of US domestic flights
• 87 days with 26K flights per day
• Many attributes per flight (airline, airport, planned
route, congestion, weather, ground delay, …)

ML Algorithm → Predictive Output
• Runway arrival time: average error = 3.2 minutes,
45% improvement over the industrial standard
• Gate arrival time: average error = 4.2 minutes,
40% improvement
Flight Efficiency Management

Competition data: 26,000 US domestic flights and weather data × 87 days
→ Extracted 258 features
→ Selected 84 features (only 58 used for predicting runway arrival time)
→ Winning prediction models: mixture of GBM and RF models
→ Results: on average 40%-45% less error for gate and runway arrival
time, respectively, compared to the standard industry benchmark
Publicity and Potential Impact

• Media report
• Huge potential impact

Big Data Analytics

• VOLUME: terabytes; transactions, tables, files
• VELOCITY: batch; time series (uniform time interval); streams (continuous)
• VARIETY: structured, unstructured, semi-structured

Processing + analytics turn these into Value (actionable insights)
Big Data Analytics: Why Big Data?

• 1 Bit = Binary Digit
• 8 Bits = 1 Byte
• 1024 Bytes = 1 Kilobyte
• 1024 Kilobytes = 1 Megabyte
• 1024 Megabytes = 1 Gigabyte
• 1024 Gigabytes = 1 Terabyte
• 1024 Terabytes = 1 Petabyte
• 1024 Petabytes = 1 Exabyte
• 1024 Exabytes = 1 Zettabyte
• 1024 Zettabytes = 1 Yottabyte
• 1024 Yottabytes = 1 Brontobyte
• 1024 Brontobytes = 1 Geopbyte
http://www.youtube.com/watch?v=7D1CQ_LOizA
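The powers-of-1024 ladder above can be checked in a few lines of Python; a minimal sketch (unit names as listed on the slide):

```python
# Each unit in the slide's ladder is 1024x the previous one (binary prefixes).
UNITS = ["Byte", "Kilobyte", "Megabyte", "Gigabyte", "Terabyte",
         "Petabyte", "Exabyte", "Zettabyte", "Yottabyte"]

def size_in_bytes(unit: str) -> int:
    """Return how many bytes one `unit` holds, per the 1024-based ladder."""
    return 1024 ** UNITS.index(unit)

print(size_in_bytes("Kilobyte"))  # 1024
print(size_in_bytes("Terabyte"))  # 1099511627776, i.e. 1024**4
```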
Motivation: Why Mine Data?
• Many data become available:

Facebook
• 800 million active users
• 60 billion photos in total, 250 million photos uploaded per day
• 80 groups/events per user (till Feb 2011)

Flickr
• 60 million users
• Five billion photos
• 10 million groups (till Feb 2011)

Twitter
• 175 million users (registered)
• 140 million tweets per day (till June 2011)

Weibo
• 200 million users

Telecom/manufacturing/transportation data: huge data sets

“Necessity is the mother of invention”
Data mining: analyze massive data sets to discover value
We are drowning in data but starving for knowledge

Data Mining

Data are sleeping in many organizations;
no value has been extracted from them
Why Mine Data? Commercial Viewpoint
• Lots of data is being collected
and warehoused
• Web data, e-commerce
• Purchases at department/
grocery stores
• Bank/Credit Card
transactions/Telecom
• Emails (content, customer network)
• Manufacturing data (sensory readings, breakdowns, quality…)

• Computers have become cheaper and more powerful


• Competitive Pressure is Strong
• Provide better, customized services for an edge (e.g. in Customer
Relationship Management)
Why Mine Data? Scientific Viewpoint
• Data collected and stored at
enormous speeds (GB/hour)
• remote sensors on a satellite
• telescopes scanning the skies
• microarrays generating gene
expression data
• scientific simulations
generating terabytes of data
• Traditional techniques infeasible for raw data
• Data mining may help scientists
• in classifying and segmenting data
• in Hypothesis Formation
Mining Large Data Sets - Motivation

• There is often information “hidden” in the data that is


not readily evident
• Human analysts may take weeks to discover useful information
• Much of the data is never analyzed at all
On the other hand, there is a lack of experienced data scientists; it is
your opportunity to extract knowledge and insights from data
Many data science projects succeed thanks to 1) data, 2) powerful
computers, and 3) advanced analytics techniques
What is Data Mining?
Jonathan’s blocks

Jessica’s blocks
Whose block
is this?

Jonathan’s rules: Blue or Circle

Jessica’s rules: All the rest

Perhaps we are able to learn knowledge from data directly


What is Data Mining?

We have no problem distinguishing man/woman (even when the faces are
half covered). But can you explain how, e.g. the rules you have used?
No one can write down a clear, comprehensive set of rules that works with high
precision. However, we can make fairly accurate predictions using data mining
algorithms.

http://how-old.net/
What is Data Mining?
• Many Definitions
• Non-trivial extraction of implicit, previously
unknown and potentially useful information
from data
• Exploration & analysis, by automatic or
semi-automatic means, of large quantities of
data in order to discover meaningful patterns
https://www.youtube.com/watch?v=R-sGvh6tI04
Definitions: KDD & Data Mining

• KDD (Knowledge Discovery in Databases)


• The overall process of non-trivial extraction of implicit,
previously unknown and potentially useful knowledge from
large amounts of data

Data Mining: A KDD Process
• Data Mining: the core steps of KDD
• Application of specific
algorithms for extracting
patterns from data
• We may not use a data warehouse
What is (not) Data Mining?
• What is not Data Mining?
– Look up a phone number in a phone directory
– Query a Web search engine for information about “Amazon”

• What is Data Mining?
– Certain names are more prevalent in certain US locations
(O’Brien, O’Rourke, O’Reilly… in the Boston area)
– Group together similar documents returned by a search
engine according to their context (e.g. Amazon rainforest,
Amazon.com), or search “apple” and group documents (fruit
apple, apple.com)
Origins of Data Mining
• Draws ideas from machine learning/AI, pattern recognition,
statistics, and database systems
• Traditional techniques may be unsuitable due to:
• Volume: enormity of data
• Complexity
• High dimensionality of data
• Distribution of data
• Heterogeneity of data (variety)
• Time series/streams (velocity)
Why not use classical data analysis?

• Tremendous amount of data
• Algorithms must be highly scalable to handle massive data,
such as terabytes/petabytes/exabytes of data
• High dimensionality of data
• E.g., microarray data may have tens of thousands of dimensions
• Text data could have hundreds of dimensions
• High complexity of data
• Data streams, time-series data, temporal data, sequence data
• Structured data, graphs, social networks
• Heterogeneous databases and legacy databases
• Spatial, spatiotemporal, multimedia, text and Web data
• Software programs, scientific simulations
• New and sophisticated applications
Major Steps of Data Mining (KDD)

Input data → Data Preprocessing → Data Mining → Post-processing → Knowledge

1. Data Preprocessing
A. Data Integration
Combine multiple data sources (more relevant data, richer insights;
no data, no insights)
B. Data Cleaning
Remove noise and inconsistent data (GIGO)
C. Data Selection
Select task-relevant data
D. Data Transformation
Transform selected data for further analysis
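The four preprocessing sub-steps (A-D) can be sketched on toy records; the sources, field names, and values below are made up purely for illustration:

```python
# Hypothetical records from two data sources, keyed by a shared id.
source_a = [{"id": 1, "income": 125_000}, {"id": 2, "income": None}]
source_b = [{"id": 1, "refund": "Yes"}, {"id": 2, "refund": "No"}]

# A. Data Integration: combine the two sources on the shared key.
merged = [{**a, **b} for a in source_a for b in source_b if a["id"] == b["id"]]

# B. Data Cleaning: remove records with missing (None) values.
clean = [r for r in merged if all(v is not None for v in r.values())]

# C. Data Selection: keep only the task-relevant attributes.
selected = [{"income": r["income"], "refund": r["refund"]} for r in clean]

# D. Data Transformation: rescale income to thousands for further analysis.
transformed = [{**r, "income": r["income"] / 1000} for r in selected]

print(transformed)  # [{'income': 125.0, 'refund': 'Yes'}]
```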
Major Steps of Data Mining (KDD)

2. Data Mining
Apply data mining & machine learning methods (e.g., association,
classification, clustering, regression, anomaly detection) to extract
patterns from data
3. Pattern Evaluation (Post Processing)
Evaluate the performance & identify truly interesting patterns or
models
4. Visualization (Post Processing; we also frequently use
visualization at the data mining stage to better understand the
data)
Present the mined patterns and prediction results to users

Although “Data Mining” is just one of the many steps,
it is usually used to refer to the whole process of KDD
The Architecture of a Typical Data Mining System

From bottom to top:
Databases
→ Data Preprocessing (Integration/Cleaning/Selection/Reduction/Transformation)
→ Data Mining Engine
→ Post Processing (Interesting Patterns)
→ Visualization
→ User
Data Mining & Business Intelligence

Oftentimes, you may not have a business analyst to present your mining results to
management. It is critical for you to be able to explain them in business context
and language.
Outline

• Module Introduction

• Definitions of Data Mining and KDD

• Data Mining Tasks

Data Mining Tasks

• Prediction Methods
• Use some variables to predict unknown or future values of
other variables.

• Description Methods
• Find human-interpretable patterns that describe the data.

From [Fayyad et al.] Advances in Knowledge Discovery and Data Mining, 1996
Data Mining Tasks...
• Classification [Predictive]
• Regression [Predictive]
• Outlier Detection (Deviation Detection) [Predictive]
• Clustering [Descriptive]
• Association Rule Discovery [Descriptive]
• Sequential Pattern Discovery [Descriptive]

All major analytics tools (e.g. R, Python, SAS, …) cover all of these mining tasks so that you
can use them directly. What we teach in this course gives you the theoretical foundations of
these methods.
Data Mining Taxonomy

Data Mining Tasks
• Descriptive: Association Rule Mining, Clustering, Sequence Pattern Mining
• Predictive: Classification, Regression, Outlier Detection

This taxonomy is based on the kinds of patterns output by data mining tasks.
Association Rules
Association Rule Mining:
“Market Basket Analysis”
Typical Data Mining Tasks
• Frequent pattern mining
• Finding groups of items that tend to occur
together {A, B, C}
• Also known as “frequent itemset mining” or
“market basket analysis”
• Grocery store: Beer and Diaper
• Amazon.com: People who bought this book also
bought other books
• Association Rule Mining
• Turn the frequent patterns, e.g. {A, B, C}, into
rule formats, e.g. A, C=>B
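Finding groups of items that tend to occur together can be sketched with a brute-force count (real algorithms such as Apriori prune the search space; the baskets below are hypothetical, with A, B, C standing in for products):

```python
from itertools import combinations
from collections import Counter

# Hypothetical shopping baskets.
baskets = [{"A", "B", "C"}, {"A", "B"}, {"A", "B", "C"}, {"B", "C"}]

def frequent_itemsets(baskets, size, min_support):
    """Count every `size`-item combination and keep those that occur in at
    least `min_support` baskets (a brute-force version of what Apriori
    does efficiently)."""
    counts = Counter()
    for basket in baskets:
        for combo in combinations(sorted(basket), size):
            counts[combo] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}

print(frequent_itemsets(baskets, 3, 2))  # {('A', 'B', 'C'): 2}
```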
Beer and diaper story
• A purported survey of behaviour of supermarket
shoppers discovered that customers who buy diapers
tend also to buy beer. This anecdote became popular as
an example of how unexpected association rules might
be found from everyday data.
• In 1992, Thomas Blischok, manager of a retail consulting
group at Teradata, and his staff prepared an analysis of
1.2 million market baskets. The mining techniques “did
discover that between 5:00 and 7:00 p.m.
consumers bought beer and diapers”.
Rationale behind "beer and diaper"

• After some serious thinking, the supermarket figured


out the rationale was that because diapers are
voluminous, the wife, who in most cases made the
household purchases, left the diaper purchase to her
husband who had the car.
• The husband and father, most often between 25 and 35
years old, usually bought the diapers at the end of the
working week. With the weekend, beer often becomes
a priority; and so, beer became the product most often
associated with the sale of diapers.
A Real-world Application:
AMAZON BOOK RECOMMENDATIONS
Frequent patterns and association rules

Frequent pattern: {data mining, mining the social web, data analysis with
open source tools}

Association rule: {data mining} -> {machine learning in action}
An Intuitive Example of Association Rules from Transactional Data

Shopping Cart (transaction) | Items Bought
1 | {fruits, milk, beef, eggs, …}
2 | {sugar, toothbrush, ice-cream, …}
3 | {pacifier, formula, blanket, …}
… | …
n | {battery, juice, beef, egg, chicken, …}

Association Rule Discovery: Definition

• Given a set of records, each of which contains some
number of items from a given collection,
• produce dependency rules which will predict the occurrence
of an item based on occurrences of other items.

TID | Items
1 | Bread, Coke, Milk
2 | Beer, Bread
3 | Beer, Coke, Diaper, Milk
4 | Beer, Bread, Diaper, Milk
5 | Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
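The two discovered rules can be checked against the five transactions above by computing their support and confidence; a minimal sketch:

```python
# Transactions from the slide's table (TID 1-5).
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """How often the rule holds among transactions where it applies."""
    return support(antecedent | consequent) / support(antecedent)

# {Milk} --> {Coke}: Milk appears in 4 transactions, Milk+Coke in 3.
print(round(confidence({"Milk"}, {"Coke"}), 2))            # 0.75
# {Diaper, Milk} --> {Beer}: antecedent in 3 transactions, with Beer in 2.
print(round(confidence({"Diaper", "Milk"}, {"Beer"}), 2))  # 0.67
```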
Association Rule Mining: Application 1
Marketing and Sales Promotion
• Let the rule discovered be
{Coke, … } --> {Potato Chips}
• Potato Chips as consequent
– Can be used to determine what should be done to boost its
sales.
• Coke in the antecedent
– Can be used to see which products would be affected if the
store discontinues selling Coke.
• Coke in antecedent and Potato chips in consequent
– Can be used to see what products should be sold with Coke to
promote sale of Potato chips!
Association Rule Mining: Application 2
Supermarket shelf management.
• Goal: To identify items that are bought together by
sufficiently many customers.
• Approach: Process the point-of-sale data collected with
barcode scanners to find dependencies among items.
• Here is a classical rule: {diaper, milk} --> {beer}
• If a customer buys diaper and milk, then he is very likely to
buy beer.
• So, don’t be surprised if you find six-packs stacked next to
diapers!

Sequential Pattern Mining

• Takes the sequential information into consideration.


• An example of a sequential pattern is “5% of
customers buy a bed first, then a mattress and then
pillows”.
• The items are not purchased at the same time, but
one after another. Such patterns are useful for
recommendation.
Sequential Pattern Mining
• Definition
• Given a set of objects, each associated with its own
timeline of events, find rules that predict strong
sequential dependencies among different events.
• Models regularities or trends for objects whose
behavior changes over time
• Example: Consider a cloth shop: with season the
demand of various items changes; it will be
appropriate to learn the trend and purchase
necessary items accordingly

Classification
• Also called Supervised Learning
• Learn from past experience/labels, and use the
learned knowledge to classify new data
• Knowledge learned by intelligent machine learning
algorithms
• Examples:
• Clinical diagnosis for patients
A Classification Example

Predictive attributes and the class attribute (2 for benign, 4 for malignant):

id number | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class
ID1 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2
ID2 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2
ID3 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 4
ID4 | 8 | 10 | 10 | 8 | 7 | 10 | 9 | 7 | 1 | 4
(ID1-ID4 are training examples)

Find a model for the class attribute as a function of the values of the other attributes:
f(Clump Thickness, Uniformity of Cell Size, …, Mitoses) = Class
f(4, 6, 5, 6, 8, 9, 2, 4) = ?   (a test example)
Classification: Definition
• Given a collection of records (training set ), each record
contains a set of attributes, one of the attributes is the
class attribute.
• Find a model for class attribute as a function of the
values of other attributes.
• Goal: previously unseen records should be assigned
a class as accurately as possible.

A test set is used to determine the accuracy of the model
or for real prediction. Usually, the given data set is divided
into training and test sets, with the training set used to
build the model and the test set used to validate it.
What is Classification?
Put things into groups according to their characteristics.
– Wiki
A systematic arrangement into classes or groups
– Dictionary.com

A person here uses a magnifier to measure some attributes of
this bug, such as 8 legs, round body shape, dark brown colour,
etc. These features conform well to our knowledge/rules that
define a spider. So we can tell that this must be a spider, not
a lizard or another category.
Example: Watermelon Ripeness Determination

Watermelon training set (3 normal features and 1 target feature):

ID | Color | Shape of Root | Sound | Ripeness
1 | Green | Curl | Dull | Y
2 | Black | Curl | Dull | Y
3 | Green | Stiff | Crisp | N
4 | Black | Stiff | Dull | N

This is a binary classification problem. The objective is to learn an accurate
model to help us pick a good watermelon (perhaps to build a mobile app to make $).

We go to the supermarket to buy a watermelon. Which one is a good one? You use
the model in your brain to predict; perhaps you will use size, colour, shape of
root, and the acoustic signal. If your model is not accurate, you cannot find
good ones. The reason could be the quality of your sensors, too little training
data, etc. The sales guy could be much more accurate.

If we have many training data and quality sensors, we can build a very accurate
model using data mining. In addition, we could learn rules/insights, e.g. which
feature is the most important one and which two could be used together to build
a better model.
Classification Example

Training Set:
Tid | Refund | Marital Status | Taxable Income | Cheat
1 | Yes | Single | 125K | No
2 | No | Married | 100K | No
3 | No | Single | 70K | No
4 | Yes | Married | 120K | No
5 | No | Divorced | 95K | Yes
6 | No | Married | 60K | No
7 | Yes | Divorced | 220K | No
8 | No | Single | 85K | Yes
9 | No | Married | 75K | No
10 | No | Single | 90K | Yes

Test Set:
Refund | Marital Status | Taxable Income | Cheat
No | Single | 75K | ?
Yes | Married | 50K | ?
No | Married | 150K | ?
Yes | Divorced | 90K | ?
No | Single | 40K | ?
No | Married | 80K | ?

Training Set → Learn Classifier → Model
Example of a Decision Tree

Training Data:
Tid | Refund | Marital Status | Taxable Income | Cheat
1 | Yes | Single | 125K | No
2 | No | Married | 100K | No
3 | No | Single | 70K | No
4 | Yes | Married | 120K | No
5 | No | Divorced | 95K | Yes
6 | No | Married | 60K | No
7 | Yes | Divorced | 220K | No
8 | No | Single | 85K | Yes
9 | No | Married | 75K | No
10 | No | Single | 90K | Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc):
Refund?
├── Yes → NO
└── No → MarSt?
    ├── Single, Divorced → TaxInc?
    │   ├── < 80K → NO
    │   └── > 80K → YES
    └── Married → NO
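The decision tree above can be transcribed directly into a small function; the slide only specifies < 80K and > 80K, so the boundary case of exactly 80K is treated as NO here:

```python
def classify(refund: str, marital_status: str, taxable_income: float) -> str:
    """Walk the slide's decision tree: Refund -> MarSt -> TaxInc."""
    if refund == "Yes":
        return "No"                 # every refunder in the training data is "No"
    if marital_status == "Married":
        return "No"
    # Single or Divorced: the decision depends on taxable income.
    return "Yes" if taxable_income > 80_000 else "No"

# Training record 5 (No, Divorced, 95K) should come out "Yes":
print(classify("No", "Divorced", 95_000))   # Yes
# The test record used on the later slides (No, Married, 80K):
print(classify("No", "Married", 80_000))    # No
```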
Another Example of a Decision Tree

Built from the same training data (Tid 1-10 above):
MarSt?
├── Married → NO
└── Single, Divorced → Refund?
    ├── Yes → NO
    └── No → TaxInc?
        ├── < 80K → NO
        └── > 80K → YES

There could be more than one tree that fits the same data!
Decision Tree Classification Task

Training Set:
Tid | Attrib1 | Attrib2 | Attrib3 | Class
1 | Yes | Large | 125K | No
2 | No | Medium | 100K | No
3 | No | Small | 70K | No
4 | Yes | Medium | 120K | No
5 | No | Large | 95K | Yes
6 | No | Medium | 60K | No
7 | Yes | Large | 220K | No
8 | No | Small | 85K | Yes
9 | No | Medium | 75K | No
10 | No | Small | 90K | Yes

Induction: Training Set → Tree Induction algorithm → Learn Model → Decision Tree Model

Test Set:
Tid | Attrib1 | Attrib2 | Attrib3 | Class
11 | No | Small | 55K | ?
12 | Yes | Medium | 80K | ?
13 | Yes | Large | 110K | ?
14 | No | Small | 95K | ?
15 | No | Large | 67K | ?

Deduction: apply the learned model to the Test Set
Apply Model to Test Data

Test Data:
Refund | Marital Status | Taxable Income | Cheat
No | Married | 80K | ?

Start from the root of the tree:
Refund?
├── Yes → NO
└── No → MarSt?
    ├── Single, Divorced → TaxInc?
    │   ├── < 80K → NO
    │   └── > 80K → YES
    └── Married → NO
Apply Model to Test Data

For the test record (Refund = No, Marital Status = Married, Taxable Income = 80K),
follow the Refund = No branch, then the MarSt = Married branch, reaching the leaf NO.

Assign Cheat to “No”
Classification: Application 1
• Direct Marketing
• Goal: Reduce cost of mailing by targeting a set of consumers likely
to buy a new cell-phone product.
• Approach:
• Use the data for a similar product introduced before.
• We know which customers decided to buy and which decided otherwise.
This {buy, don’t buy} decision forms the class attribute.
• Collect various demographic, lifestyle, and teleco-interaction related
information about all such customers.
• Type of business, where they stay, how much they earn, etc.
• Use this information as input attributes to learn a classifier model.
Classification: Application 2
 Fraud Detection
 Goal: Predict fraudulent cases in credit card transactions.
 Approach:
 Use credit card transactions and the information on its account-
holder as attributes. When does a customer buy, what does he buy,
how often he pays on time, etc
 Label past transactions as fraud or fair transactions. This forms the
class attribute.
 Learn a model for the class of the transactions.
 Use this model to detect fraud by observing credit card transactions
on an account.
Classification: Application 3
• Customer Attrition/Churn:
• Goal: To predict whether a customer is likely to be lost to
a competitor (e.g. Singtel -> M1).
• Approach:
• Use detailed record of transactions with each of the past and
present customers, to find attributes.
• How often the customer calls, where he calls, what time-of-the
day he calls most, his financial status, marital status, etc.
• Label the customers as loyal or disloyal.
• Find a model for loyalty.
Clustering
• Finding groups of objects such that the objects in a group
will be similar (or related) to one another and different
from (or unrelated to) the objects in other groups
• Automatically learn the structure of data

Main Principle
Inter-cluster distances
Intra-cluster distances C2 are maximized
are minimized

C1

C1, C2, C3 are good clusters, C3


while C4 is a bad cluster
The Input of Clustering

Objects/instances/points, each described by features f1, f2, …, fn:
D1 = (V11, V12, …, V1n)
D2 = (V21, V22, …, V2n)
……
Dm = (Vm1, Vm2, …, Vmn)

We could use raw data columns as features directly, but it is very important
to generate new features (based on domain knowledge, your creativity or
preliminary analysis), e.g. the ratio of cholesterol/LDL, which is an
especially important indicator for heart disease detection.

f1, f2, …, fn could be various demographic features (race, gender, income,
education level), medical features (weight, blood pressure, total cholesterol,
LDL (low-density lipoprotein cholesterol), HDL (high-density lipoprotein
cholesterol), triglycerides (fats carried in the blood)), activity features
(total steps per day, brisk exercise minutes, stand hours, calories burned),
or environmental features like PM2.5, etc.
Unsupervised Learning
Purely learn from the unlabeled data:

id number | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses
ID1 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1
ID2 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1
ID3 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1
ID4 | 8 | 10 | 10 | 8 | 7 | 10 | 9 | 7 | 1

We can learn the relationships or structures from data by clustering,
e.g. maybe ID1 and ID3 should be in one cluster?
Clustering Definition
• Given a set of data points, each having a set of
attributes, and a similarity measure among them,
find clusters such that
• Data points in one cluster are more similar to one
another.
• Data points in separate clusters are less similar to one
another.
• Similarity Measures:
• Euclidean distance/cosine similarity if attributes are
continuous.
• Other Problem-specific Measures.
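The Euclidean similarity measure above, plus the step of assigning a point to its most similar cluster center, can be sketched as follows (the centroids are hypothetical; algorithms like k-means repeat this assignment step until stable):

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def nearest_cluster(point, centroids):
    """Assign `point` to the closest (most similar) centroid's index."""
    return min(range(len(centroids)),
               key=lambda i: euclidean(point, centroids[i]))

centroids = [(0.0, 0.0), (10.0, 10.0)]          # hypothetical cluster centers
print(nearest_cluster((1.0, 2.0), centroids))   # 0
print(nearest_cluster((9.0, 8.0), centroids))   # 1
```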
Clustering: Application 1
• Market Segmentation:
• Goal: subdivide a market into distinct subsets of customers
where any subset may conceivably be selected as a market
target to be reached.
• Approach:
• Collect different attributes of customers based on their
geographical and lifestyle related information.
• Find clusters of similar customers.
• Measure the clustering quality by observing buying patterns of
customers in same cluster vs. those from different clusters.
• Could be turned into classification
Clustering: Application 2
• Document Clustering:
• Goal: To find groups of documents that are similar to each
other based on the important terms appearing in them.
• Approach: Identify frequently occurring terms in each
document. Form a similarity measure based on the
frequencies of different terms. Use a classic clustering
algorithm or topic modeling to perform the clustering;
clusters can be described using keywords.
• Gain: Information retrieval can utilize the clusters to relate a
new document or search term to clustered documents.
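A term-frequency similarity measure of the kind described can be sketched as cosine similarity over word counts (the example sentences are made up):

```python
import math
from collections import Counter

def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Similarity of two documents based on their term frequencies."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm

# Two words shared out of three in each document -> similarity 2/3.
print(round(cosine_similarity("data mining course", "data mining tasks"), 2))  # 0.67
# No shared words -> similarity 0.
print(cosine_similarity("data mining course", "football match"))               # 0.0
```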
Illustrating Document Clustering
• Clustering Points: 3204 Articles of Los Angeles Times.
• Similarity Measure: How many words are common in these
documents (after some word filtering).

Category | Total Articles | Correctly Placed
Financial | 555 | 364
Foreign | 341 | 260
National | 273 | 36
Metro | 943 | 746
Sports | 738 | 573
Entertainment | 354 | 278
Regression
• Goal:
• Predict a value of a given continuous valued variable based on
the values of other variables, assuming a linear or nonlinear
model of dependency.
• Extensively studied in statistics, neural network fields.
• Examples:
• Predicting when the aircraft will land into the airport.
• Predicting sales amounts of new product based on advertising
expenditure.
• Predicting wind velocities as a function of temperature,
humidity, air pressure, etc.
• Time series prediction of stock market indices.

Prediction & Regression: Example 1
Relationship between systolic blood pressure (y), birthweight (x1), and age in days (x2):

i | Birthweight in oz (x1) | Age in days (x2) | Systolic BP in mm Hg (y)
1 | 135 | 3 | 89
2 | 120 | 4 | 90
3 | 100 | 3 | 83
4 | 105 | 2 | 77
5 | 130 | 4 | 92
6 | 125 | 5 | 98
7 | 125 | 2 | 82
8 | 105 | 3 | 85
9 | 120 | 5 | 96
10 | 90 | 4 | 95
11 | 120 | 2 | 80
12 | 95 | 3 | 79
13 | 120 | 3 | 86
14 | 150 | 4 | 97
15 | 160 | 3 | 92
16 | 125 | 3 | 88

Training the regression model:
Use the least-squares method to determine the regression equation:
y = 53.45 + 0.126 * x1 + 5.89 * x2

Prediction using the model:
To predict the systolic BP of a baby with birthweight 8 lb (128 oz)
measured at 3 days of life:
y = 53.45 + 0.126 * (128) + 5.89 * (3) = 87.2 mm Hg
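The prediction step can be checked in a few lines; the coefficients are the ones reported on the slide (the least-squares fitting itself is not repeated here):

```python
# Coefficients of the least-squares fit reported on the slide.
b0, b1, b2 = 53.45, 0.126, 5.89

def predict_systolic_bp(birthweight_oz: float, age_days: float) -> float:
    """y = b0 + b1 * birthweight + b2 * age, the slide's fitted model."""
    return b0 + b1 * birthweight_oz + b2 * age_days

# Baby with birthweight 8 lb (128 oz), measured at 3 days of life:
print(round(predict_systolic_bp(128, 3), 1))   # 87.2 (mm Hg)
```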
Prediction & Regression: Example 2
 Stock Market Prediction
 Black dots: training data
 Red Line (continuous and dashed): Predictions
 Blue dots: test (unseen) actual data
 http://www.gold-eagle.com/editorials_03/sornette112403.html

Difference and commonality between Classification and Regression

• Difference: regression predicts a continuous target (stock price, inventory demand, flight arrival time, weight, etc.), whereas classification predicts categorical/discrete labels (e.g. stock up or down, good/bad watermelon, cancer/normal, fraud/normal, underweight/normal/overweight/obese, etc.).

• Commonality: both need training data to learn a model, and many methods can be used for both purposes, e.g. NN, SVM, DT.
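The commonality can be made concrete with a single learner used for both tasks. The sketch below uses k-nearest neighbours (my choice of illustrative method, not one named on the slide): the same neighbour search yields a regression prediction by averaging targets and a classification prediction by majority vote, on hypothetical 1-D toy data.

```python
# One method, two tasks: k-nearest neighbours for regression
# (average of neighbour targets) and classification (majority vote).
# Toy 1-D data for illustration only.
from collections import Counter

def knn(train, query, k, task):
    """train: list of (x, target) pairs; task: 'regression' or 'classification'."""
    neighbours = sorted(train, key=lambda p: abs(p[0] - query))[:k]
    targets = [t for _, t in neighbours]
    if task == "regression":
        return sum(targets) / len(targets)        # average of neighbours
    return Counter(targets).most_common(1)[0][0]  # majority vote

# Same feature values, two kinds of target.
reg_train = [(1, 10.0), (2, 12.0), (3, 14.0), (8, 40.0), (9, 42.0)]
clf_train = [(1, "down"), (2, "down"), (3, "down"), (8, "up"), (9, "up")]

print(knn(reg_train, 2.5, k=3, task="regression"))      # 12.0
print(knn(clf_train, 2.5, k=3, task="classification"))  # down
```

Neural networks, SVMs, and decision trees follow the same pattern: swap the output layer, loss, or leaf statistic and the same model family handles either target type.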
Challenges of Data Mining
• Scalability
• Dimensionality
• Complex and Heterogeneous Data
• Data Quality
• Data Ownership and Distribution
• Privacy Preservation
• Streaming Data
• ……
Summary

• Data mining: discovering interesting patterns from large amounts of data
• A natural evolution of database technology, in great demand, with wide applications
• A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
Summary (cont.)
• Mining can be performed on a variety of information
repositories
• Data mining functionalities: association, classification,
clustering, outlier and trend analysis, etc.
• Major issues in data mining include mining
methodologies, user interaction, and applications

Career in Data Mining
• 2011 Salary survey (Annual Salary in US$)
• http://www.kdnuggets.com/polls/2011/data-mining-salary-
income.html
• Median Income = US$100K (vs IT median of US$60K)

• DM Job postings: http://www.kdnuggets.com/jobs/index.html
A Brief History of Data Mining Society
• 1989 IJCAI Workshop on Knowledge Discovery in Databases
• Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley,
1991)
• 1991-1994 Workshops on Knowledge Discovery in Databases
• Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-
Shapiro, P. Smyth, and R. Uthurusamy, 1996)
• 1995-1998 International Conferences on Knowledge Discovery in Databases and
Data Mining (KDD’95-98)
• Journal of Data Mining and Knowledge Discovery (1997)
• ACM SIGKDD conferences since 1998 and SIGKDD Explorations
• More conferences on data mining
• PAKDD (1997), PKDD (1997), SDM(2001), ICDM (2001), WSDM
• ACM Transactions on KDD starting in 2007
Conferences and Journals on Data Mining

• KDD Conferences
  – ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
  – SIAM Data Mining Conf. (SDM)
  – (IEEE) Int. Conf. on Data Mining (ICDM)
  – Conf. on Principles and Practices of Knowledge Discovery and Data Mining (PKDD)
  – Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
  – WSDM
• Other related conferences
  – ACM SIGMOD
  – VLDB
  – (IEEE) ICDE
  – WWW, SIGIR
  – ICML, CVPR, NIPS
• Journals
  – Data Mining and Knowledge Discovery (DAMI or DMKD)
  – IEEE Trans. on Knowledge and Data Eng. (TKDE)
  – KDD Explorations
  – ACM Trans. on KDD
Where to Find References? DBLP, CiteSeer, Google
• Data mining and KDD (SIGKDD)
• Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
• Journal: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD
• Database systems (SIGMOD: ACM SIGMOD Anthology)
• Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
• Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.
• AI & Machine Learning
• Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.
• Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-
PAMI, etc.
• Web and IR
• Conferences: SIGIR, WWW, CIKM, etc.
• Journals: WWW: Internet and Web Information Systems,
• Statistics
• Conferences: Joint Stat. Meeting, etc.
• Journals: Annals of statistics, etc.
• Visualization
• Conference proceedings: CHI, ACM-SIGGraph, etc.
• Journals: IEEE Trans. visualization and computer graphics, etc.
Recommended Reference Books
• P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Wiley, 2005

• J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2nd ed., 2006

• D. J. Hand, H. Mannila, and P. Smyth, Principles of Data Mining, MIT Press, 2001

• T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer-
Verlag, 2001

• S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan Kaufmann, 2002

• R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2ed., Wiley-Interscience, 2000

• T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003

• U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining.
AAAI/MIT Press, 1996

• U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann,
2001

• T. M. Mitchell, Machine Learning, McGraw Hill, 1997

• G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991

• S. M. Weiss and N. Indurkhya, Predictive Data Mining, Morgan Kaufmann, 1998

• I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan
Kaufmann, 2nd ed. 2005
Contact: zhangj@ntu.edu.sg if you have questions
