Lec.01 Introduction To DM
Data Mining
Report
n QT1 (10%): attending classes
n QT2 (20%): Homework #1-2-3
n Midterm (20%)
n Group presentation
n Individual performance
n Final report (50%)
n Group presentation
n Individual performance
n Requirement:
n Submit HW, Report, … before deadline
n Presentation:
n 1) Understanding the problem clearly
n 2) Solution/Algorithm
n 3) Demo code
Contents
n Why data mining?
n Interesting patterns
Large-scale Data is Everywhere!
§ There has been enormous data
growth in both commercial and
scientific databases due to
advances in data generation and
collection technologies.
[Figure panels: Cyber Security, E-Commerce]
§ New mantra
§ Gather whatever data you can
whenever and wherever possible.
Why Data Mining? Commercial Viewpoint
n In hypothesis formation
[Figure panels: Improving health care and reducing costs; Predicting the impact of climate change]
Why Data Mining?—Potential Applications
n Other Applications
n Text mining (news group, email, documents) and Web
mining
n Stream data mining
n Bioinformatics and bio-data analysis
Market Analysis and Management
n Target marketing
n Find clusters of “model” customers who share the same
characteristics: interest, income level, spending habits, etc.
n Determine customer purchasing patterns over time
n Cross-market analysis
n Associations/correlations between product sales, and
prediction based on such associations
n Customer profiling
n What types of customers buy what products
Fraud Detection & Mining Unusual Patterns
Other Applications
Q2. What Is Data Mining?
[Figure: KDD process diagram. Recoverable labels: Databases, Data Integration, Data Cleaning, Task-relevant Data]
Steps of a KDD Process
[Figure: KDD process diagram. Recoverable labels: Databases, Data Warehouse, Pattern evaluation]
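The KDD steps named in this lecture (data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, knowledge presentation) can be read as a chain of small functions. The sketch below is only an illustration under stated assumptions: the records, the function names, and the "count frequent items" mining step are all hypothetical and not part of the slides.

```python
from collections import Counter

# Hypothetical raw records from two sources (illustrative only).
source_a = [{"customer": "c1", "item": "milk"}, {"customer": "c2", "item": None}]
source_b = [{"customer": "c3", "item": "Milk "}, {"customer": "c4", "item": "bread"}]

def clean(records):
    """Data cleaning: drop records whose item is missing."""
    return [r for r in records if r["item"] is not None]

def integrate(*sources):
    """Data integration: combine records from several sources."""
    return [record for source in sources for record in source]

def select(records):
    """Data selection: keep only the task-relevant attribute."""
    return [r["item"] for r in records]

def transform(items):
    """Transformation: normalise spelling so equal items compare equal."""
    return [item.strip().lower() for item in items]

def mine(items):
    """Data mining (toy version): count how often each item occurs."""
    return Counter(items)

def evaluate(counts, min_count=2):
    """Pattern evaluation: keep only patterns that occur often enough."""
    return {item: n for item, n in counts.items() if n >= min_count}

integrated = integrate(clean(source_a), clean(source_b))
patterns = evaluate(mine(transform(select(integrated))))

# Knowledge presentation: here simply printed.
print(patterns)
```

Each stage takes the previous stage's output, so in practice a stage can be re-run or swapped without touching the others.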
What is Data Mining?
n Many Definitions
n Non-trivial extraction of implicit, previously unknown and
potentially useful information from data
n Exploration & analysis, by automatic or semi-automatic
means, of large quantities of data in order to discover
meaningful patterns
n RDBMS
n Set of tables – each has rows (tuples) and columns (attributes)
n While mining databases, we can search for trends or data
patterns
n Example:
n Analysing customer data to predict the credit risks of new
customers (based on previous data)
n Analysing sales data to detect any deviations
Data warehouse data
[Figure: Data Source-1, Data Source-2, and Data Source-3 feed a Data Warehouse (data cube), which Client-1 and Client-2 use for querying and analysis]
Transactional data
n Each record is called a transaction
n sales,
n flight booking,
n user clicks on a web page
n Regression:
n Statistical methodology for numeric prediction: missing or unknown numeric
values are estimated from previous data
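As a minimal sketch of numeric prediction from previous data, the code below fits a straight line by ordinary least squares and uses it to estimate an unseen value. The choice of simple linear regression, the function names, and the data points are assumptions made for illustration; the slides do not prescribe a particular regression method.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(x, slope, intercept):
    """Numeric prediction for a new x using the fitted line."""
    return slope * x + intercept

# Hypothetical previous data: x = years as a customer, y = yearly spend.
years = [1, 2, 3, 4, 5]
spend = [30, 35, 40, 45, 50]

slope, intercept = fit_line(years, spend)
estimate = predict(6, slope, intercept)  # value for a year not in the data
print(slope, intercept, estimate)
```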
Q4. Data Mining Functionalities
n Cluster analysis (Group)
n Class label is unknown: Group data to form new classes, e.g., cluster
n Outlier analysis
n Outlier: a data object that does not comply with the general behavior of
the data
n Useful in fraud detection, rare events analysis
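One simple way to flag objects that do not comply with the general behavior of the data is a z-score test. The sketch below is an assumption-laden illustration: the transaction amounts are hypothetical, the 2.5 threshold is an arbitrary choice, and z-scores are only one of many outlier tests.

```python
import statistics

def zscore_outliers(values, threshold=2.5):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mean) / std > threshold]

# Hypothetical transaction amounts; one is far from the rest.
amounts = [52, 48, 50, 51, 49, 50, 53, 47, 50, 500]

outliers = zscore_outliers(amounts)
print(outliers)
```

With very small samples a single extreme value inflates the standard deviation, so the threshold has to be chosen with the sample size in mind.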
Data Mining Tasks …
[Figure: a data table at the center, surrounded by the four task labels: Clustering, Predictive Modeling, Anomaly Detection, Association Rules]
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
11   No      Married         60K             No
13   No      Single          85K             Yes
14   No      Married         75K             No
15   No      Single          90K             Yes
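To make the predictive modeling task concrete, the sketch below learns a one-attribute rule (the OneR method) that predicts Cheat from the categorical columns of the table above. OneR is a swapped-in, deliberately simple technique chosen for illustration, not the model used in the slides; only the rows that survive in the table are used, and the function and variable names are hypothetical.

```python
from collections import Counter, defaultdict

# (Refund, Marital Status, Cheat) for the rows recoverable from the table.
rows = [
    ("Yes", "Single", "No"), ("No", "Married", "No"), ("No", "Single", "No"),
    ("Yes", "Married", "No"), ("No", "Divorced", "Yes"), ("No", "Married", "No"),
    ("Yes", "Divorced", "No"), ("No", "Single", "Yes"), ("No", "Married", "No"),
    ("No", "Single", "Yes"), ("No", "Married", "No"), ("No", "Single", "Yes"),
    ("No", "Married", "No"), ("No", "Single", "Yes"),
]
ATTRIBUTES = {"Refund": 0, "Marital Status": 1}
LABEL = 2  # index of the class attribute (Cheat)

def one_r(rows, attributes, label_index):
    """Pick the single attribute whose majority-class rule makes the fewest training errors."""
    best = None
    for name, idx in attributes.items():
        # Count class labels for each value of this attribute.
        by_value = defaultdict(Counter)
        for row in rows:
            by_value[row[idx]][row[label_index]] += 1
        # Rule: each attribute value predicts its most frequent class.
        rule = {value: counts.most_common(1)[0][0] for value, counts in by_value.items()}
        errors = sum(row[label_index] != rule[row[idx]] for row in rows)
        if best is None or errors < best[2]:
            best = (name, rule, errors)
    return best

best_attribute, rule, errors = one_r(rows, ATTRIBUTES, LABEL)
print(best_attribute, rule, errors)
```

The error count here is measured on the training rows themselves; a real evaluation would use a separate test set.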
Tid  Employed  Level of Education  # years at present address  Credit Worthy (Class)
1    Yes       Graduate            5                           Yes
2    Yes       High School         2                           No
3    No        Undergrad           1                           No
4    Yes       High School         10                          Yes
…    …         …                   …                           …
[Figure: decision tree with tests on Employed (Yes/No), Education (Graduate vs. {High school, Undergrad}), and Number of years, with Yes/No leaves]
[Figure: classification workflow. Recoverable labels: Training Set, Learn Classifier, Model]
account-holder as attributes.
n When does a customer buy, what does he buy, how
often he pays on time, etc
n Label past transactions as fraud or fair transactions.
transactions on an account.
From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996
Data Size:
• 72 million stars, 20 million galaxies
• Object Catalog: 9 GB
• Image Database: 150 GB
Intra-cluster distances are minimized
Inter-cluster distances are maximized
Use of K-means to partition Sea Surface [...] (NPP) into clusters that
reflect the Northern and [...]
[Figure: map of clusters with longitude on the horizontal axis. Recoverable legend entries: Land Cluster 2, Ice or No NPP, Sea Cluster 1]
Source: Introduction to Data Mining, 2nd Edition, Tan, Steinbach, Karpatne, Kumar
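A minimal sketch of K-means (Lloyd's algorithm) on hypothetical 2-D points is given below; it is not the sea-surface analysis from the figure. The points, the initial centroids, and all names are assumptions for illustration. The last lines compare the average point-to-centroid distance (intra-cluster) with the distance between centroids (inter-cluster).

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm: alternate point assignment and centroid update."""
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        new_centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, clusters

# Hypothetical data: two visibly separated groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, [points[0], points[3]])

intra = sum(math.dist(p, c)
            for c, cluster in zip(centroids, clusters)
            for p in cluster) / len(points)
inter = math.dist(centroids[0], centroids[1])
print(centroids, intra, inter)
```

The result depends on the initial centroids; here they are fixed to make the run deterministic, whereas practical implementations usually try several random starts.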
Clustering: Application 1
n Market Segmentation:
n Goal: subdivide a market into distinct subsets of customers
where any subset may conceivably be selected as a market
target to be reached with a distinct marketing mix.
n Approach:
n Collect different attributes of customers based on their
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
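The two discovered rules can be checked against the five transactions by computing support (the fraction of transactions containing all items of the rule) and confidence (among transactions containing the left-hand side, the fraction that also contain the right-hand side). The sketch below uses the transactions from the table; the function names are hypothetical.

```python
# The five market-basket transactions from the table.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Support of the whole rule divided by support of its left-hand side."""
    return support(lhs | rhs) / support(lhs)

rules = [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]
for lhs, rhs in rules:
    print(sorted(lhs), "-->", sorted(rhs),
          "support:", support(lhs | rhs),
          "confidence:", round(confidence(lhs, rhs), 2))
```

For {Milk} --> {Coke}, three of the five transactions contain both items and four contain Milk, giving support 0.6 and confidence 0.75.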
n Medical Informatics
n Rules are used to find combinations of patient symptoms and
test results associated with certain diseases
n Data mining may generate thousands of patterns: Not all of them are
interesting
n Suggested approach: Human-centered, query-based, focused mining
n Interestingness measures
n A pattern is interesting if it is easily understood by humans, valid on new or test
data with some degree of certainty, potentially useful, novel, or validates some
hypothesis that a user seeks to confirm
n Objective vs. subjective interestingness measures
n Objective: based on statistics and structures of patterns, e.g., support,
confidence, etc.
n Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty.
Q6. Data Mining: Classification Schemes
Multi-Dimensional View of Data Mining
n Data to be mined
n Relational, data warehouse, transactional, stream,
object-oriented/relational, active, spatial, time-series, text,
multimedia, heterogeneous, WWW
n Knowledge to be mined
n Characterization, discrimination, association, classification,
clustering, trend/deviation, outlier analysis, etc.
n Multiple/integrated functions and mining at multiple levels
n Techniques utilized
n Database-oriented, data warehouse (OLAP), machine
learning, statistics, visualization, etc.
n Applications adapted
n Retail, telecommunication, banking, fraud analysis, bio-data
mining, stock market analysis, Web mining, etc.
OLAP Mining: Integration of Data Mining and Data Warehousing
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, drawing on Database Systems, Statistics, Machine Learning, Visualization, Algorithm, and Other Disciplines]
Q7. Major Issues in Data Mining
n Mining methodology
n Mining different kinds of knowledge from diverse data types,
e.g., bio, stream, Web
n Performance: efficiency, effectiveness, and scalability
n Pattern evaluation: the interestingness problem
n Incorporation of background knowledge
n Handling noise and incomplete data
n Parallel, distributed and incremental mining methods
n Integration of the discovered knowledge with existing knowledge:
knowledge fusion
n User interaction
n Data mining query languages and ad-hoc mining
n Expression and visualization of data mining results
n Interactive mining of knowledge at multiple levels of
abstraction
Summary
n Data mining: discovering interesting patterns from large amounts of data
n A natural evolution of database technology, in great demand, with wide
applications
n A KDD process includes data cleaning, data integration, data selection,
transformation, data mining, pattern evaluation, and knowledge presentation
n Mining can be performed in a variety of information repositories
n Data mining functionalities: characterization, discrimination, association,
classification, clustering, outlier and trend analysis, etc.
n Data mining systems and architectures
n Major issues in data mining
Where to Find References?
n More conferences on data mining
n PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.
n Data mining and KDD
n Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
n Journal: Data Mining and Knowledge Discovery, KDD Explorations
n Database systems
n Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
n Journals: ACM-TODS, IEEE-TKDE, JIIS, J. ACM, etc.
n AI & Machine Learning
n Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), etc.
n Journals: Machine Learning, Artificial Intelligence, etc.
n Statistics
n Conferences: Joint Stat. Meeting, etc.
n Journals: Annals of Statistics, etc.
n Visualization
n Conference proceedings: CHI, ACM-SIGGRAPH, etc.
n Journals: IEEE Trans. Visualization and Computer Graphics, etc.