Chapter 1: Introduction
Overall Introduction
Zhang Jie
Nanyang Technological University
Outline
• Course Introduction
Information about Instructor
• Zhang Jie, Professor
College of Computing and Data Science, NTU
• Email: ZhangJ@ntu.edu.sg
• Phone: 6790-6245
• Office: N4-02C-100
• Webpage: https://www.ntu.edu.sg/home/zhangj/
Plan for MH6151
(subject to minor changes)
• 11 Lectures
– Overall Introduction (1 lecture)
– Data and Data Exploration (2 lectures)
– Classification Algorithms (2 lectures)
– Ensemble Learning (1 lecture)
– Regression (1 lecture)
– Association Rule Mining (1.5 lectures)
– Clustering (2.5 lectures)
• Course project demonstration and
presentation (Weeks 12 & 13)
• Final exam (Week 14)
Logistics
• Venue: check the timetable
• Q&A/Consultations
– After or between lectures
– By email
– In person: contact me by email or phone first to make an appointment
Course Description
• Data mining (data analytics) is a diverse field
that draws its foundations from many research
areas such as databases, machine learning, AI,
and statistics.
• This course aims to introduce the concepts,
algorithms, and techniques of data mining,
including 1) data preprocessing, 2) association
rule mining, 3) clustering, 4) classification, and
to cover some applications of data mining
techniques.
Course Objective
• At the end of the course, students should
• Have a good knowledge of data mining concepts
• Be familiar with various data mining algorithms.
• Given a dataset and problem statement/task,
know how to select appropriate data mining
techniques to analyze data and address the
problem.
• MH8111 and MH8112 (Analytics Software) will
give you more knowledge of analytics tools (e.g.
R, Python, Tableau, Weka) so that you can apply
the knowledge learned in this course
Target Audience
• Students who intend to solve problems and perform
tasks in data mining or related fields
• To bridge the gap between the maths/algorithms
knowledge needed for problem solving and what is
taught at the undergraduate level
• Students who intend to become data engineers and
apply data mining techniques to real-world
applications. Take time to practice, to sharpen your
skills, and to gain additional knowledge (programming,
fundamentals, experience processing different
data/problem statements), and integrate it with domain
knowledge (a purely data-driven approach may not work).
Is the content hard to learn?
• As a lecturer, I need to make sure most students
can understand the majority of the key content and
know how to apply it
• So, for students with a CS/CE/Math/Engineering
background, you should be able to follow my lectures
without much difficulty.
• If you want to learn something more challenging or very
theoretical (e.g. to do research), you might be a
little disappointed. You may need to take
additional courses, e.g. online courses/videos
from top professors in the world
Assessment
• Participation and attendance: 10%
• One assignment: 15%
• Late submission of the assignment will result in a
reduced grade for the assignment.
• Group project report and code: 30%
• 4-5 members (randomly formed)
• Topic: a data mining task (will be released later)
• Within a group, each member will receive an individual mark
depending on his/her contribution to the project
• Project demonstration and presentation: 15%
• Final exam: 30%
Textbook and Reference
• Textbook
• Introduction to Data Mining, by Pang-Ning Tan,
Michael Steinbach, and Vipin Kumar, Addison
Wesley, 2005.
• Reference:
• Data Mining: Concepts and Techniques, 3rd ed.,
by Jiawei Han, Micheline Kamber, and Jian Pei,
Morgan Kaufmann, 2011.
• Data Mining: Practical Machine Learning Tools
and Techniques with Java Implementations, 3rd
ed., by Ian H. Witten, Eibe Frank, and Mark A.
Hall, Morgan Kaufmann, 2011.
Useful Resources
• Google Scholar – http://scholar.google.com
• Related conferences:
• KDD, ICDM, SDM, ECML/PKDD, PAKDD, ICML, NIPS,
AAAI, IJCAI, WWW, SIGMOD, VLDB, ICDE, CIKM, etc.
• Related journals:
• TKDE, TKDD, DMKD etc.
• Data mining competition platform:
• Kaggle – http://www.kaggle.com/
• Data mining community's resource:
• KDnuggets – http://www.kdnuggets.com/
Data Mining Tools
• R
• Python
• Weka
• RapidMiner
• Orange
• IBM SPSS modeler
• SAS Data Mining
• Oracle Data Mining
• ……
Outline
• Module Introduction
A Motivating Example
• Background: flight efficiency management
• Opened 3 months of flight data usually not available to the public!
• Winning results: on average 40% and 45% lower errors for gate and runway arrival
time, respectively, compared to the standard industry benchmark, using a mixture of
prediction models (GBM and RF models)
• Publicity and potential impact: news report, huge potential impact, …
Big Data Analytics
• VOLUME: terabytes of transactions, tables, files
• VELOCITY: batch, time series (uniform time interval), streams (continuous)
• VARIETY: structured, unstructured, semi-structured data
• Processing + analytics turn this data into Value (actionable insights)
Big Data Analytics: Why Big Data?
http://www.youtube.com/watch?v=7D1CQ_LOizA
Motivation: Why Mine Data?
• Many data become available
• Facebook: 800 million active users; 60 billion photos in total, 250 million photos
uploaded per day; 80 groups/events per user (as of Feb 2011)
• Flickr: 60 million users; five billion photos; 10 million groups (as of Feb 2011)
• Twitter: 175 million registered users; 140 million tweets per day (as of June 2011)
• Weibo: 200 million users
Data Mining
• Jessica’s blocks: whose block is this?
• http://how-old.net/
What is Data Mining?
• Many Definitions
• Non-trivial extraction of implicit, previously
unknown and potentially useful information
from data
• Exploration & analysis, by automatic or
semi-automatic means, of large quantities of
data in order to discover meaningful patterns
https://www.youtube.com/watch?v=R-sGvh6tI04
Definitions: KDD & Data Mining
Data Mining: A KDD Process
• Data Mining: the core step of the KDD process
• Application of specific algorithms for extracting patterns from data
• We may not use a data warehouse
What is (not) Data Mining?
What is not Data Mining?
– Look up a phone number in a phone directory
– Query a Web search engine for information about “Amazon”
What is Data Mining?
– Certain names are more prevalent in certain US locations
(O’Brien, O’Rourke, O’Reilly, … in the Boston area)
– Group together similar documents returned by a search engine
according to their context (e.g. Amazon rainforest vs. Amazon.com;
the fruit apple vs. apple.com)
Origins of Data Mining
• Draws ideas from machine learning/AI, pattern recognition,
statistics, and database systems
• Traditional techniques may be unsuitable due to
• Volume: enormity of data
• Complexity
• High dimensionality of data
• Distribution of data
• Heterogeneity of data (variety)
• Time series / streams (velocity)
• ……
• (Venn diagram: data mining at the intersection of Statistics,
Machine Learning/AI/Pattern Recognition, and Database Systems)
Why not use classical data analysis?
1. Data Preprocessing
A. Data Integration
Combine multiple data sources (more relevant data, rich insights;
no data, no insights)
B. Data Cleaning
Remove noise and inconsistent data (GIGO)
C. Data Selection
Select task-relevant data
D. Data Transformation
Transform selected data for further analysis
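The four preprocessing steps (A-D) above can be illustrated with a few lines of pandas. A minimal sketch follows, assuming pandas is installed; the file names and column names (customers.csv, transactions.csv, id, age, income, amount) are hypothetical and for illustration only.

import pandas as pd

# A. Data Integration: combine two hypothetical data sources on a shared key
customers = pd.read_csv("customers.csv")        # columns: id, age, income
transactions = pd.read_csv("transactions.csv")  # columns: id, amount
data = customers.merge(transactions, on="id", how="inner")

# B. Data Cleaning: remove duplicates, fill missing values, drop inconsistent rows
data = data.drop_duplicates()
data["age"] = data["age"].fillna(data["age"].median())
data = data[data["amount"] >= 0]

# C. Data Selection: keep only the task-relevant attributes
data = data[["age", "income", "amount"]]

# D. Data Transformation: rescale each numeric attribute to [0, 1]
data = (data - data.min()) / (data.max() - data.min())
print(data.head())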
Major Steps of Data Mining (KDD)
2. Data Mining
Apply data mining & machine learning methods (e.g., association,
classification, clustering, regression, anomaly detection) to extract
patterns from data
3. Pattern Evaluation (Post Processing)
Evaluate the performance & identify truly interesting patterns or
models
4. Visualization (Post Processing; we also frequently use
visualization at the data mining stage to better understand
the data)
Present the mined patterns and prediction results to users
Although “Data Mining” is just one of the many steps,
it is usually used to refer to the whole process of KDD
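As a concrete illustration of step 3, a mined model is usually evaluated on data it has not seen. A minimal sketch, assuming scikit-learn is installed; the bundled breast-cancer toy dataset is used only for illustration.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Hold out 30% of the records for evaluation
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Step 2 (data mining): learn a model from the training portion
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Step 3 (pattern evaluation): measure how well the model generalises to unseen records
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))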
The Architecture of a Typical Data Mining System
• (Layered view, bottom to top): Databases → Data Preprocessing
(Integration/Cleaning/Selection/Reduction/Transformation) → Visualization → User
Data Mining & Business Intelligence
Oftentimes, you may not have a business analyst to present your mining results to
management. It is critical that you can explain them in business context and language.
Outline
• Module Introduction
Data Mining Tasks
• Prediction Methods
• Use some variables to predict unknown or future values of
other variables.
• Description Methods
• Find human-interpretable patterns that describe the data.
From [Fayyad et al.], Advances in Knowledge Discovery and Data Mining, 1996
Data Mining Tasks...
• Classification [Predictive]
• Regression [Predictive]
• Outlier Detection (Deviation Detection) [Predictive]
• Clustering [Descriptive]
• Association Rule Discovery [Descriptive]
• Sequential Pattern Discovery [Descriptive]
All major analytics tools (e.g. R, Python, SAS, …) cover all of these mining tasks, so you
can use them directly. What we teach in this course gives you the theoretical foundations of
these methods.
Data Mining Taxonomy
• Data mining tasks are divided into Descriptive and Predictive methods (taxonomy diagram)
Example (Association Rule Discovery):
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
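The two rules above can be recovered mechanically by counting itemsets in the five transactions. The following self-contained sketch is a brute-force illustration (not the Apriori algorithm covered later in the course); the support and confidence thresholds (0.4 and 0.6) are chosen here so that the listed rules appear among the output.

from itertools import combinations

transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]
items = sorted(set().union(*transactions))
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / n

# Enumerate itemsets of size 1..3 that are frequent (support >= 0.4)
frequent = [frozenset(c) for k in range(1, 4)
            for c in combinations(items, k) if support(frozenset(c)) >= 0.4]

# Generate rules X --> Y with confidence >= 0.6; the output includes
# {Milk} --> {Coke} (0.75) and {Diaper, Milk} --> {Beer} (0.67)
for itemset in frequent:
    if len(itemset) < 2:
        continue
    for k in range(1, len(itemset)):
        for lhs in map(frozenset, combinations(itemset, k)):
            rhs = itemset - lhs
            conf = support(itemset) / support(lhs)
            if conf >= 0.6:
                print(set(lhs), "-->", set(rhs), f"(confidence {conf:.2f})")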
Association Rule Mining: Application 1
Marketing and Sales Promotion
• Let the rule discovered be
{Coke, … } --> {Potato Chips}
• Potato Chips as consequent
– Can be used to determine what should be done to boost its
sales.
• Coke in the antecedent
– Can be used to see which products would be affected if the
store discontinues selling Coke.
• Coke in antecedent and Potato chips in consequent
– Can be used to see what products should be sold with Coke to
promote the sales of Potato Chips!
Association Rule Mining: Application 2
Supermarket shelf management.
• Goal: To identify items that are bought together by
sufficiently many customers.
• Approach: Process the point-of-sale data collected with
barcode scanners to find dependencies among items.
• Here is a classic rule: {diaper, milk} --> {beer}
• If a customer buys diapers and milk, then he is very likely to
buy beer.
• So, don’t be surprised if you find six-packs stacked next to
diapers!
Sequential Pattern Mining
Classification
• Also called Supervised Learning
• Learn from past experience/labels, and use the
learned knowledge to classify new data
• Knowledge learned by intelligent machine learning
algorithms
• Examples:
• Clinical diagnosis for patients
A Classification Example
Training examples (predictive attributes; the last value is the Class):
ID1  5  1  1  1  2  1  3  1  1 | 2
ID2  5  4  4  5  7 10  3  2  1 | 2
ID3  3  1  1  1  2  2  3  1  1 | 4
ID4  8 10 10  8  7 10  9  7  1 | 4
Find a model for the class attribute as a function of the values of the other attributes:
f(Clump Thickness, Uniformity of Cell Size, …, Mitoses) = Class
A test example: f(4, 6, 5, 6, 8, 9, 2, 4) = ?
Classification: Definition
• Given a collection of records (the training set), each record
contains a set of attributes, one of which is the class attribute.
• Find a model for the class attribute as a function of the
values of the other attributes.
• Goal: previously unseen records should be assigned a class as
accurately as possible.
(Watermelon example) There are 3 normal features and 1 target feature; this is a binary
classification problem. The objective is to learn an accurate model to help us pick good
watermelons, perhaps as a mobile app to make $.
If we have a lot of training data and quality sensors, we can build a very accurate model
using data mining. In addition, we could learn rules/insights, e.g. which feature is the
most important and which two could be used together to build a better model.
Classification Example
• (Figure: training data and the splitting attributes of a decision tree)
Decision Tree Classification Task
Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes
Induction: a tree induction algorithm learns a model (decision tree) from the training set.
Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Deduction: apply the learned model to the test set to predict the unknown class labels.
Apply Model to Test Data
Test Data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Start from the root of the tree and follow the branch that matches the test record at each node:
Refund?
  Yes -> NO
  No  -> MarSt?
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES
           Married -> NO
Refund = No leads to MarSt; Marital Status = Married reaches the leaf NO: assign Cheat to “No”
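The induction/deduction workflow above can be reproduced with scikit-learn. A minimal sketch, assuming pandas and scikit-learn are installed; note that the tree learned from these ten records is not necessarily the hand-drawn Refund/MarSt/TaxInc tree shown above.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Training set from the slide (Tid 1-10)
train = pd.DataFrame({
    "Attrib1": ["Yes","No","No","Yes","No","No","Yes","No","No","No"],
    "Attrib2": ["Large","Medium","Small","Medium","Large",
                "Medium","Large","Small","Medium","Small"],
    "Attrib3": [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],  # in K
    "Class":   ["No","No","No","No","Yes","No","No","Yes","No","Yes"],
})
# Test set from the slide (Tid 11-15, class unknown)
test = pd.DataFrame({
    "Attrib1": ["No","Yes","Yes","No","No"],
    "Attrib2": ["Small","Medium","Large","Small","Large"],
    "Attrib3": [55, 80, 110, 95, 67],
})

# One-hot encode the categorical attributes so the tree can split on them
X_train = pd.get_dummies(train.drop(columns="Class"))
X_test = pd.get_dummies(test).reindex(columns=X_train.columns, fill_value=0)

# Induction: learn the model from the training set
model = DecisionTreeClassifier(random_state=0).fit(X_train, train["Class"])
print(export_text(model, feature_names=list(X_train.columns)))

# Deduction: apply the model to the unseen test records
print(model.predict(X_test))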
Classification: Application 1
• Direct Marketing
• Goal: Reduce cost of mailing by targeting a set of consumers likely
to buy a new cell-phone product.
• Approach:
• Use the data for a similar product introduced before.
• We know which customers decided to buy and which decided otherwise.
This {buy, don’t buy} decision forms the class attribute.
• Collect various demographic, lifestyle, and teleco-interaction related
information about all such customers.
• Type of business, where they stay, how much they earn, etc.
• Use this information as input attributes to learn a classifier model.
Classification: Application 2
Fraud Detection
Goal: Predict fraudulent cases in credit card transactions.
Approach:
Use credit card transactions and the information on the account
holder as attributes: when does the customer buy, what does he buy,
how often does he pay on time, etc.
Label past transactions as fraud or fair transactions. This forms the
class attribute.
Learn a model for the class of the transactions.
Use this model to detect fraud by observing credit card transactions
on an account.
Classification: Application 3
• Customer Attrition/Churn:
• Goal: To predict whether a customer is likely to be lost to
a competitor (e.g. Singtel -> M1).
• Approach:
• Use detailed records of transactions with each of the past and
present customers to find attributes.
• How often the customer calls, where he calls, what time of the
day he calls most, his financial status, marital status, etc.
• Label the customers as loyal or disloyal.
• Find a model for loyalty.
Clustering
• Finding groups of objects such that the objects in a group
will be similar (or related) to one another and different
from (or unrelated to) the objects in other groups
• Automatically learn the structure of data
Main principle: intra-cluster distances are minimized; inter-cluster distances are
maximized (clusters C1 and C2 in the figure).
ID1  5  1  1  1  2  1  3  1  1
ID2  5  4  4  5  7 10  3  2  1
ID3  3  1  1  1  2  2  3  1  1
ID4  8 10 10  8  7 10  9  7  1
We can learn the relationships or structure of data by clustering, e.g. maybe ID1
and ID3 should be in one cluster?
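A minimal sketch of this idea, assuming scikit-learn is installed: running k-means with k = 2 on the four records above groups ID1 with ID3 (and ID2 with ID4), since those pairs are closest in Euclidean distance.

import numpy as np
from sklearn.cluster import KMeans

X = np.array([
    [5, 1, 1, 1, 2, 1, 3, 1, 1],     # ID1
    [5, 4, 4, 5, 7, 10, 3, 2, 1],    # ID2
    [3, 1, 1, 1, 2, 2, 3, 1, 1],     # ID3
    [8, 10, 10, 8, 7, 10, 9, 7, 1],  # ID4
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # e.g. [0 1 0 1]: ID1 and ID3 fall in the same cluster
print(kmeans.cluster_centers_)  # one centroid per cluster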
Clustering Definition
• Given a set of data points, each having a set of
attributes, and a similarity measure among them,
find clusters such that
• Data points in one cluster are more similar to one
another.
• Data points in separate clusters are less similar to one
another.
• Similarity Measures:
• Euclidean distance/cosine similarity if attributes are
continuous.
• Other Problem-specific Measures.
Clustering: Application 1
• Market Segmentation:
• Goal: subdivide a market into distinct subsets of customers
where any subset may conceivably be selected as a market
target to be reached.
• Approach:
• Collect different attributes of customers based on their
geographical and lifestyle related information.
• Find clusters of similar customers.
• Measure the clustering quality by observing buying patterns of
customers in same cluster vs. those from different clusters.
• Could be turned into a classification problem
Clustering: Application 2
• Document Clustering:
• Goal: To find groups of documents that are similar to each
other based on the important terms appearing in them.
• Approach: Identify frequently occurring terms in each
document. Form a similarity measure based on the
frequencies of different terms. Use a classic clustering
algorithm or topic modeling to perform the clustering.
Clusters can be described using keywords.
• Gain: Information Retrieval can utilize the clusters to relate a
new document or search term to clustered documents.
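A minimal sketch of this approach, assuming a recent scikit-learn is installed; the four toy documents are hypothetical and only illustrate the Amazon-rainforest vs. Amazon.com style of grouping mentioned earlier.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "the amazon rainforest hosts many plant and animal species",
    "rainforest deforestation threatens many species",
    "amazon.com reported strong online retail sales this quarter",
    "online retail sales and cloud revenue grew this quarter",
]

# Identify frequently occurring terms and weight them with TF-IDF
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Cluster the documents into two groups
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)

# Describe each cluster by the terms with the largest centroid weights (keywords)
terms = vectorizer.get_feature_names_out()
for c in range(2):
    top = np.argsort(km.cluster_centers_[c])[::-1][:3]
    print("cluster", c, "keywords:", [terms[i] for i in top])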
Illustrating Document Clustering
• Clustering Points: 3204 Articles of Los Angeles Times.
• Similarity Measure: How many words are common in these
documents (after some word filtering).
Prediction & Regression: Example 1
Relationship between systolic blood pressure (y), birthweight (x1), and age in days (x2)
 i   Birthweight in oz (x1)   Age in days (x2)   Systolic BP in mm Hg (y)
 1   135                      3                  89
 2   120                      4                  90
 3   100                      3                  83
 4   105                      2                  77
 5   130                      4                  92
 6   125                      5                  98
 7   125                      2                  82
 8   105                      3                  85
 9   120                      5                  96
10    90                      4                  95
11   120                      2                  80
12    95                      3                  79
13   120                      3                  86
14   150                      4                  97
15   160                      3                  92
16   125                      3                  88
Training the regression model: use the least-squares method to determine the regression equation:
y = 53.45 + 0.126 * x1 + 5.89 * x2
Prediction using the model: to predict the systolic BP of a baby with birthweight 8 lb (128 oz)
measured at 3 days of life:
y = 53.45 + 0.126 * (128) + 5.89 * (3) = 87.2 mm Hg
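The least-squares fit above can be reproduced directly. A minimal sketch, assuming NumPy and scikit-learn are installed; the fitted intercept and coefficients should come out close to 53.45, 0.126 and 5.89, and the prediction close to 87.2 mm Hg.

import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: birthweight in oz (x1), age in days (x2)
X = np.array([
    [135, 3], [120, 4], [100, 3], [105, 2], [130, 4], [125, 5],
    [125, 2], [105, 3], [120, 5], [90, 4], [120, 2], [95, 3],
    [120, 3], [150, 4], [160, 3], [125, 3],
])
# Systolic blood pressure in mm Hg (y)
y = np.array([89, 90, 83, 77, 92, 98, 82, 85, 96, 95, 80, 79, 86, 97, 92, 88])

# Least-squares fit of y = intercept + b1*x1 + b2*x2
model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)

# Predict systolic BP for a baby of 128 oz measured at 3 days of life
print(model.predict([[128, 3]]))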
Prediction & Regression: Example 2
Stock Market Prediction
Black dots: training data
Red Line (continuous and dashed): Predictions
Blue dots: test (unseen) actual data
http://www.gold-eagle.com/editorials_03/sornette112403.html
Difference and commonality between
Classification and Regression
Difference: regression predicts a continuous target (stock
price, inventory demand, flight arrival time, weight, etc.),
whereas classification predicts categorical/discrete
labels (e.g. stock up or down, good/bad watermelon,
cancer/normal, fraud/normal,
underweight/normal/overweight/obese, etc.)
Challenges of Data Mining
• Scalability
• Dimensionality
• Complex and Heterogeneous Data
• Data Quality
• Data Ownership and Distribution
• Privacy Preservation
• Streaming Data
• ……
Summary
• Mining can be performed on a variety of information
repositories
• Data mining functionalities: association, classification,
clustering, outlier and trend analysis, etc.
• Major issues in data mining include mining
methodologies, user interaction, and applications
Career in Data Mining
• 2011 salary survey (annual salary in US$)
• http://www.kdnuggets.com/polls/2011/data-mining-salary-income.html
• Median income = US$100K (vs. IT median of US$60K)
• KDD Conferences
– ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
– SIAM Data Mining Conf. (SDM)
– (IEEE) Int. Conf. on Data Mining (ICDM)
– Conf. on Principles and Practices of Knowledge Discovery and Data Mining (PKDD)
– Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
– WSDM
• Other related conferences
– ACM SIGMOD
– VLDB
– (IEEE) ICDE
– WWW, SIGIR
– ICML, CVPR, NIPS
• Journals
– Data Mining and Knowledge Discovery (DAMI or DMKD)
– IEEE Trans. on Knowledge and Data Eng. (TKDE)
– KDD Explorations
– ACM Trans. on KDD
Where to Find References? DBLP, CiteSeer, Google
• Data mining and KDD (SIGKDD)
• Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
• Journal: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD
• Database systems (SIGMOD: ACM SIGMOD Anthology)
• Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
• Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.
• AI & Machine Learning
• Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.
• Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-
PAMI, etc.
• Web and IR
• Conferences: SIGIR, WWW, CIKM, etc.
• Journals: WWW: Internet and Web Information Systems,
• Statistics
• Conferences: Joint Stat. Meeting, etc.
• Journals: Annals of statistics, etc.
• Visualization
• Conference proceedings: CHI, ACM-SIGGraph, etc.
• Journals: IEEE Trans. visualization and computer graphics, etc.
Recommended Reference Books
• P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Addison Wesley, 2005
• J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2nd ed., 2006
• D. J. Hand, H. Mannila, and P. Smyth, Principles of Data Mining, MIT Press, 2001
• T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer-
Verlag, 2001
• S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan Kaufmann, 2002
• T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003
• U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining.
AAAI/MIT Press, 1996
• U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann,
2001
• I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan
Kaufmann, 2nd ed., 2005
Contact: zhangj@ntu.edu.sg if you have questions