Lec.01 Introduction To DM
Report
QT1 (10%): class attendance and discussion
QT2 (20%): Homework #1-2-3
Midterm (20%)
Exam.
Final report (50%)
Group presentation
Individual performance
Requirement:
Submit HW, Report, … before deadline
Presentation:
1) Understanding the problem clearly
2) Solution/ Algorithm
3) Demo code
Contents
Why data mining?
What is data mining?
What types of data can be mined?
Data mining functionalities/ Tasks
Interesting patterns
Classification of data mining systems
Major issues in data mining
Large-scale Data is Everywhere!
There has been enormous data
growth in both commercial and
scientific databases due to
advances in data generation and
collection technologies.
Cyber Security E-Commerce
New mantra
Gather whatever data you can
whenever and wherever possible.
Why Data Mining? Commercial Viewpoint
In hypothesis formation
Improving health care and reducing costs
Predicting the impact of climate change
Why Data Mining?—Potential Applications
Other Applications
Text mining (news group, email, documents) and Web
mining
Stream data mining
Bioinformatics and bio-data analysis
Market Analysis and Management
Where does the data come from?
Credit card transactions, discount coupons, customer
complaint calls
Target marketing
Find clusters of “model” customers who share the same
characteristics: interest, income level, spending habits, etc.
Determine customer purchasing patterns over time
Cross-market analysis
Associations/co-relations between product sales, &
prediction based on such association
Customer profiling
What types of customers buy what products
Customer requirement analysis
Identifying the best products for different customers
Predict what factors will attract new customers
Fraud Detection & Mining Unusual Patterns
Other Applications
Q2. What Is Data Mining?
Data mining: the core of the knowledge discovery process
[Figure: KDD process. Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Data Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]
Steps of a KDD Process
[Figure: steps of a KDD process, from Databases through the Data Warehouse to Pattern evaluation]
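The KDD steps named in the figure (data cleaning, data integration, data selection, data mining, pattern evaluation) can be sketched as a chain of small functions. This is only an illustrative sketch: the two sources, the record layout, the function names and the `min_count` threshold are all assumptions made for the example, and the "mining" step is reduced to simple frequency counting.

```python
from collections import Counter

# Hypothetical raw records from two sources: (customer, item).
# None marks a missing value.
SOURCE_A = [("ann", "milk"), ("bob", None), ("cat", "bread")]
SOURCE_B = [("dan", "milk"), ("eve", "milk"), (None, "beer")]


def clean(records):
    """Data cleaning: drop records that have a missing field."""
    return [r for r in records if None not in r]


def integrate(*sources):
    """Data integration: combine several sources into one collection."""
    return [r for source in sources for r in source]


def select(records):
    """Data selection: keep only the task-relevant attribute (the item)."""
    return [item for _customer, item in records]


def mine(items):
    """Data mining: here, simply count how often each item occurs."""
    return Counter(items)


def evaluate(patterns, min_count=2):
    """Pattern evaluation: keep patterns seen at least `min_count` times."""
    return {item: n for item, n in patterns.items() if n >= min_count}


warehouse = integrate(clean(SOURCE_A), clean(SOURCE_B))
knowledge = evaluate(mine(select(warehouse)))
print(knowledge)
```

Each function maps to one box in the process diagram, so a real mining algorithm could replace `mine` without touching the other steps.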
What is Data Mining?
Many Definitions
Non-trivial extraction of implicit, previously unknown and
potentially useful information from data
Exploration & analysis, by automatic or semi-automatic
means, of large quantities of data in order to discover
meaningful patterns
Example:
Analysing customer data to predict the credit risks of new
customers (based on previous data)
Analysing sales data - (any deviations)
Data warehouse (data cube)
A collection of data integrated from different sources,
supporting querying and decision making on the data
In a data warehouse, data is stored in a multidimensional
structure (data cube) in which each dimension corresponds to
an attribute
[Figure: Data Source-1, Data Source-2 and Data Source-3 feed the Data Warehouse, which serves querying and analysis for Client-1 and Client-2]
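To make the data cube idea concrete, the sketch below stores a few sales facts with three dimensions and aggregates them along any chosen subset of dimensions (a roll-up). The sales figures, the dimension names and the `roll_up` helper are hypothetical and exist only for illustration; a real warehouse would use an OLAP engine rather than a Python dictionary.

```python
from collections import defaultdict

# Hypothetical sales facts: (product, region, quarter, amount).
SALES = [
    ("milk", "north", "Q1", 120.0),
    ("milk", "south", "Q1", 80.0),
    ("bread", "north", "Q1", 50.0),
    ("milk", "north", "Q2", 100.0),
    ("bread", "south", "Q2", 70.0),
]
DIMENSIONS = ("product", "region", "quarter")


def roll_up(facts, keep):
    """Sum the amount over every dimension that is not listed in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(float)
    for row in facts:
        key = tuple(row[i] for i in positions)
        totals[key] += row[-1]
    return dict(totals)


by_product = roll_up(SALES, ["product"])
by_product_region = roll_up(SALES, ["product", "region"])
grand_total = roll_up(SALES, [])
print(by_product)
```

Keeping fewer dimensions gives a coarser view of the same facts, which is the kind of query a data cube is meant to answer quickly.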
Transactional data
Each record is called a transaction
Examples: sales, flight bookings, user clicks on a web page
Regression:
A statistical methodology used for numeric prediction: missing or
unavailable values are predicted from previous data
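As a minimal sketch of numeric prediction, the code below fits a straight line to previous observations with ordinary least squares and uses it to predict a value that has not been observed. The data points and the names `fit_line`, `xs` and `ys` are hypothetical; simple linear regression is only one of many regression methods.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical previous data, e.g. month number vs. units sold.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)
predicted = slope * 6 + intercept  # numeric prediction for the unseen x = 6
print(slope, intercept, predicted)
```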
Q4. Data Mining Functionalities
Cluster analysis (Group)
Class label is unknown: Group data to form new classes, e.g., cluster
Outlier analysis
Outlier: a data object that does not comply with the general behavior of
the data
Useful in fraud detection, rare events analysis
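One simple way to flag objects that do not comply with the general behavior of the data is a z-score test: a value is reported when it lies more than a chosen number of standard deviations from the mean. The sketch below assumes a threshold of 2.0 and a hypothetical list of transaction amounts; both are illustrative choices, and other outlier definitions (distance-based, density-based) are equally valid.

```python
from statistics import mean, pstdev


def z_score_outliers(values, threshold=2.0):
    """Return the values whose z-score magnitude exceeds `threshold`."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical transaction amounts; one is far larger than the rest.
amounts = [20, 22, 19, 21, 23, 20, 250]
suspicious = z_score_outliers(amounts)
print(suspicious)
```

In a fraud detection setting the flagged values would be candidates for manual review, not proof of fraud.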
Data Mining Tasks …
[Figure: Data at the center, surrounded by four task families: Clustering, Predictive Modeling, Anomaly Detection, Association Rules]
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
11   No      Married         60K             No
13   No      Single          85K             Yes
14   No      Married         75K             No
15   No      Single          90K             Yes
Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Graduate            5                           Yes
2    Yes       High School         2                           No
3    No        Undergrad           1                           No
4    Yes       High School         10                          Yes
…    …         …                   …                           …
[Figure: Training Set -> Learn Classifier -> Model. The model is a decision tree for the class Credit Worthy: it tests Employed first (No leads to the leaf No), then Education (Graduate vs. {High school, Undergrad}), then Number of years, ending in Yes/No leaves]
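A decision tree of the shape shown in the figure (Employed, then Education, then Number of years) can be written as nested conditions and checked against the training rows. This is a hand-written sketch, not a learned model: the two year thresholds are assumptions chosen only so that the tree agrees with the four visible rows, and the function name is hypothetical.

```python
# The four visible training rows:
# (employed, education, years_at_present_address, credit_worthy)
TRAINING_SET = [
    ("Yes", "Graduate", 5, "Yes"),
    ("Yes", "High School", 2, "No"),
    ("No", "Undergrad", 1, "No"),
    ("Yes", "High School", 10, "Yes"),
]

# Hypothetical split points (assumptions, not taken from the slide text).
MIN_YEARS_GRADUATE = 3
MIN_YEARS_OTHER = 7


def predict_credit_worthy(employed, education, years):
    """Walk the tree: Employed -> Education -> Number of years."""
    if employed == "No":
        return "No"
    if education == "Graduate":
        threshold = MIN_YEARS_GRADUATE
    else:  # High School or Undergrad
        threshold = MIN_YEARS_OTHER
    return "Yes" if years > threshold else "No"


predictions = [
    predict_credit_worthy(employed, education, years)
    for employed, education, years, _label in TRAINING_SET
]
accuracy = sum(
    p == row[3] for p, row in zip(predictions, TRAINING_SET)
) / len(TRAINING_SET)
print(predictions, accuracy)
```

A learning algorithm would choose the tests and thresholds automatically from the training set; here they are fixed by hand to show what the resulting model looks like when applied to a record.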
From [Fayyad et al.] Advances in Knowledge Discovery and Data Mining, 1996
Data Size:
• 72 million stars, 20 million galaxies
• Object Catalog: 9 GB
• Image Database: 150 GB
Intra-cluster distances are minimized
Inter-cluster distances are maximized
Use of K-means to partition Sea Surface [...] (NPP) into clusters that
reflect the Northern and Southern Hemispheres.
[Figure: map of the clusters plotted against longitude (-180 to 180); legend: Land Cluster 2, Sea Cluster 1, Sea Cluster 2, Ice or No NPP]
Source: Introduction to Data Mining, 2nd Edition, Tan, Steinbach, Karpatne, Kumar
Clustering: Application 1
Market Segmentation:
Goal: subdivide a market into distinct subsets of customers
where any subset may conceivably be selected as a market
target to be reached with a distinct marketing mix.
Approach:
Collect different attributes of customers based on their
geographical and lifestyle related information.
Find clusters of similar customers.
Measure the clustering quality by observing buying
patterns of customers in same cluster vs. those from
different clusters.
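The "find clusters of similar customers" step could be sketched with a small K-means implementation. The customer attributes (income in thousands, purchases per month), the choice of two clusters and the initial centroids are all assumptions made for the example; K-means is one possible clustering method, and its result depends on the initialization.

```python
import math


def kmeans(points, initial_centroids, max_iterations=10):
    """Plain K-means: assign points to the nearest centroid, then update."""
    centroids = list(initial_centroids)
    clusters = [[] for _ in centroids]
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else old
            for members, old in zip(clusters, centroids)
        ]
        if updated == centroids:
            break  # converged
        centroids = updated
    return centroids, clusters


# Hypothetical customers: (income in thousands, purchases per month).
CUSTOMERS = [(20, 2), (22, 3), (21, 2), (80, 9), (85, 10), (82, 8)]

centroids, clusters = kmeans(CUSTOMERS, [CUSTOMERS[0], CUSTOMERS[3]])
print(centroids)
print(clusters)
```

The clustering quality check described above would then compare buying patterns inside each returned cluster against patterns across clusters.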
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
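The two rules can be checked against the five transactions by computing their support (fraction of transactions containing every item of the rule) and confidence (fraction of transactions containing the left-hand side that also contain the right-hand side). The helper names below are hypothetical; the sketch only evaluates given rules and does not search for them as an algorithm such as Apriori would.

```python
# The five transactions from the table.
TRANSACTIONS = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in TRANSACTIONS)


def support(lhs, rhs):
    """Fraction of all transactions that contain both sides of the rule."""
    return count(lhs | rhs) / len(TRANSACTIONS)


def confidence(lhs, rhs):
    """Among transactions containing lhs, the fraction also containing rhs."""
    return count(lhs | rhs) / count(lhs)


milk_coke = (support({"Milk"}, {"Coke"}), confidence({"Milk"}, {"Coke"}))
diaper_milk_beer = (
    support({"Diaper", "Milk"}, {"Beer"}),
    confidence({"Diaper", "Milk"}, {"Beer"}),
)
print(milk_coke)
print(diaper_milk_beer)
```

For {Milk} --> {Coke}, three of the five transactions contain both items, and three of the four Milk transactions also contain Coke.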
Medical Informatics
Rules are used to find combination of patient symptoms and
test results associated with certain diseases
Data mining may generate thousands of patterns: Not all of them are
interesting
Suggested approach: Human-centered, query-based, focused mining
Interestingness measures
A pattern is interesting if it is easily understood by humans, valid on new or test
data with some degree of certainty, potentially useful, novel, or validates some
hypothesis that a user seeks to confirm
Objective vs. subjective interestingness measures
Objective: based on statistics and structures of patterns, e.g., support,
confidence, etc.
Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty.
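Objective measures such as support and confidence are often used as a filter over the many patterns a miner generates. The sketch below keeps only candidate rules that meet minimum thresholds. The thresholds (0.4 and 0.7) are assumptions, the rule {Beer} --> {Coke} is a hypothetical extra candidate, and the measure values are the ones obtained by counting over the five-transaction table shown earlier.

```python
# Candidate rules: (antecedent, consequent, support, confidence).
# Values are counted from the five-transaction table.
CANDIDATE_RULES = [
    ({"Milk"}, {"Coke"}, 3 / 5, 3 / 4),
    ({"Diaper", "Milk"}, {"Beer"}, 2 / 5, 2 / 3),
    ({"Beer"}, {"Coke"}, 1 / 5, 1 / 3),  # hypothetical extra candidate
]


def interesting(rules, min_support=0.4, min_confidence=0.7):
    """Keep rules that reach both objective thresholds."""
    return [
        rule for rule in rules
        if rule[2] >= min_support and rule[3] >= min_confidence
    ]


selected = interesting(CANDIDATE_RULES)
for lhs, rhs, sup, conf in selected:
    print(sorted(lhs), "-->", sorted(rhs), sup, conf)
```

Subjective measures such as unexpectedness cannot be computed from the data alone, which is why they are left out of this sketch.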
Q6. Data Mining: Classification Schemes
Multi-Dimensional View of Data Mining
Data to be mined
Relational, data warehouse, transactional, stream, object-oriented/relational,
active, spatial, time-series, text, multimedia, heterogeneous, WWW
Knowledge to be mined
Characterization, discrimination, association, classification,
clustering, trend/deviation, outlier analysis, etc.
Multiple/integrated functions and mining at multiple levels
Techniques utilized
Database-oriented, data warehouse (OLAP), machine
learning, statistics, visualization, etc.
Applications adapted
Retail, telecommunication, banking, fraud analysis, bio-data
mining, stock market analysis, Web mining, etc.
OLAP Mining: Integration of Data Mining and Data Warehousing
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center of contributing disciplines: Database Systems, Statistics, Machine Learning, Visualization, Algorithm, Other Disciplines]
Q7. Major Issues in Data Mining
Mining methodology
Mining different kinds of knowledge from diverse data
types, e.g., bio, stream, Web
Performance: efficiency, effectiveness, and scalability
Pattern evaluation: the interestingness problem
Incorporation of background knowledge
Handling noise and incomplete data
Parallel, distributed and incremental mining methods
Integration of the discovered knowledge with existing knowledge:
knowledge fusion
User interaction
Data mining query languages and ad-hoc mining
Expression and visualization of data mining results
Interactive mining of knowledge at multiple levels of
abstraction
Summary
Data mining: discovering interesting patterns from large amounts of data
A natural evolution of database technology, in great demand, with wide
applications
A KDD process includes data cleaning, data integration, data selection,
transformation, data mining, pattern evaluation, and knowledge presentation
Mining can be performed in a variety of information repositories
Data mining functionalities: characterization, discrimination, association,
classification, clustering, outlier and trend analysis, etc.
Data mining systems and architectures
Major issues in data mining
Where to Find References?
More conferences on data mining
PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.
Data mining and KDD
Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
Journal: Data Mining and Knowledge Discovery, KDD Explorations
Database systems
Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
Journals: ACM-TODS, IEEE-TKDE, JIIS, J. ACM, etc.
AI & Machine Learning
Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), etc.
Journals: Machine Learning, Artificial Intelligence, etc.
Statistics
Conferences: Joint Stat. Meeting, etc.
Journals: Annals of statistics, etc.
Visualization
Conference proceedings: CHI, ACM-SIGGraph, etc.
Journals: IEEE Trans. visualization and computer graphics, etc.