Lec.01 Introduction To DM

The document outlines a course on Data Mining and Knowledge Discovery, detailing the syllabus, assessment components, and the importance of data mining in various fields. It discusses the explosive growth of data, the necessity for automated analysis, and potential applications in commercial and scientific contexts. Key topics include data mining functionalities, types of data, and the knowledge discovery process.


Course: 505043

Data Mining and Knowledge Discovery

Lecture 1. Introduction to Data Mining


Types of Data

Dr. Anh HOANG

Report
• QT1 (10%): attending classes and discussion
• QT2 (20%): Homework #1-2-3
• Midterm (20%)
  - Exam
• Final report (50%)
  - Group presentation
  - Individual performance
• Requirement:
  - Submit HW, report, … before the deadline
  - Presentation:
    1) Understanding the problem clearly
    2) Solution / algorithm
    3) Demo code

Contents
• Why data mining?
• What is data mining?
• What types of data can be mined?
• Data mining functionalities / tasks
• Interesting patterns
• Classification of data mining systems
• Major issues in data mining

Large-scale Data is Everywhere!
• There has been enormous data growth in both commercial and scientific databases due to advances in data generation and collection technologies.
• New mantra
  - Gather whatever data you can whenever and wherever possible.
• Expectations
  - Gathered data will have value either for the purpose collected or for a purpose not envisioned.

[Figure panels: Cyber Security, E-Commerce, Social Networking: Twitter, Traffic Patterns, Sensor Networks, Computational Simulation]

Introduction to Data Mining, 2nd Edition
Tan, Steinbach, Karpatne, Kumar

Q1. Why Data Mining?
• The explosive growth of data: from terabytes to petabytes
  - Data collection and data availability
    - Automated data collection tools, database systems, Web, computerized society
  - Major sources of abundant data
    - Business: Web, e-commerce, transactions, stocks, …
    - Science: remote sensing, bioinformatics, scientific simulation, …
    - Society and everyone: news, digital cameras, …
• We are drowning in data but starving for knowledge!
• “Necessity is the mother of invention”: data mining is the automated analysis of massive data sets

Why Data Mining? Commercial Viewpoint
• Lots of data is being collected and warehoused
  - Web data
    - Google has petabytes of web data
    - Facebook has billions of active users
  - Purchases at department/grocery stores, e-commerce
    - Amazon handles millions of visits/day
  - Bank/credit card transactions
• Computers have become cheaper and more powerful
• Competitive pressure is strong
  - Provide better, customized services for an edge (e.g., in Customer Relationship Management)

Why Data Mining? Scientific Viewpoint
• Data collected and stored at enormous speeds
  - Remote sensors on a satellite
    - NASA EOSDIS archives over petabytes of earth science data / year
  - Telescopes scanning the skies
    - Sky survey data
  - High-throughput biological data
  - Scientific simulations
    - Terabytes of data generated in a few hours
• Data mining helps scientists
  - In automated analysis of massive datasets
  - In hypothesis formation

[Figure panels: fMRI Data from Brain, Sky Survey Data, Gene Expression Data, Surface Temperature of Earth]

Great opportunities to improve productivity in all walks of life

Great Opportunities to Solve Society’s Major Problems
• Improving health care and reducing costs
• Predicting the impact of climate change
• Reducing hunger and poverty by increasing agriculture production
• Finding alternative/green energy sources

Evolution of Database Technology
• 1960s:
  - Data collection, database creation, IMS and network DBMS
• 1970s:
  - Relational data model, relational DBMS implementation
• 1980s:
  - RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
  - Application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s:
  - Data mining, data warehousing, multimedia databases, and Web databases
• 2000s:
  - Stream data management and mining
  - Data mining and its applications
  - Web technology (XML, data integration) and global information systems

Why Data Mining? Potential Applications
• Data analysis and decision support/making
  - Market analysis and management
    - Target marketing, customer relationship management (CRM), market basket analysis, market segmentation
  - Risk analysis and management
    - Forecasting, customer retention, quality control, competitive analysis
  - Fraud detection and detection of unusual patterns (outliers)

Why Data Mining? Potential Applications (continued)
• Other applications
  - Text mining (news groups, email, documents) and Web mining
  - Stream data mining
  - Bioinformatics and bio-data analysis

Market Analysis and Management
• Where does the data come from?
  - Credit card transactions, discount coupons, customer complaint calls
• Target marketing
  - Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
  - Determine customer purchasing patterns over time

Market Analysis and Management (continued)
• Cross-market analysis
  - Associations/correlations between product sales, and prediction based on such associations
• Customer profiling
  - What types of customers buy what products
• Customer requirement analysis
  - Identify the best products for different customers
  - Predict what factors will attract new customers

Fraud Detection & Mining Unusual Patterns
• Approaches: clustering & model construction for frauds, outlier analysis
• Applications: health care, retail, credit card service, telecom
  - Medical insurance
    - Professional patients and rings of doctors
    - Unnecessary or correlated screening tests
  - Telecommunications
    - Phone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm
  - Retail industry
    - Analysts estimate that 38% of retail shrink is due to dishonest employees

Other Applications
• Internet Web Surf-Aid
  - IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preferences and behavior, analyze the effectiveness of Web marketing, improve Web site organization, etc.
• …

Q2. What Is Data Mining?
• Data mining (knowledge discovery from data)
  - Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
• Alternative name
  - Knowledge discovery in databases (KDD)
• Watch out: Is everything “data mining”?
  - Query processing
  - Expert systems
  - Statistical programs

Data Mining: KDD Process
• Data mining: the core of the knowledge discovery process

[Figure: Databases → Data Cleaning / Data Integration → Data Warehouse → Data Selection → Task-relevant Data → Data Mining → Pattern Evaluation]

Steps of a KDD Process
• Learning the application domain
  - Relevant prior knowledge and goals of application
• Creating a target data set: data selection
• Data cleaning and preprocessing (may take 60% - 80% of the effort!)
• Data reduction and transformation
  - Find useful features, dimensionality/variable reduction
• Choosing functions of data mining
  - Summarization, classification, regression, association, clustering
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation
  - Visualization, transformation, removing redundant patterns, etc.
• Use of discovered knowledge
• …

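The steps above can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the records, field names, and the choice of "summarization" as the mining function are hypothetical and not taken from the lecture.

```python
from statistics import mean

# Hypothetical raw records; None marks a missing value.
raw_records = [
    {"customer": "a", "age": 25, "income": 30_000, "note": "x"},
    {"customer": "b", "age": None, "income": 52_000, "note": "y"},
    {"customer": "c", "age": 41, "income": 75_000, "note": "z"},
    {"customer": "d", "age": 38, "income": None, "note": "w"},
]

def select(records, fields):
    """Data selection: keep only the task-relevant attributes."""
    return [{f: r[f] for f in fields} for r in records]

def clean(records):
    """Data cleaning: drop records that contain a missing value."""
    return [r for r in records if all(v is not None for v in r.values())]

def transform(records, field):
    """Data transformation: min-max scale one numeric attribute to [0, 1]."""
    values = [r[field] for r in records]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid division by zero when all values are equal
    return [{**r, field: (r[field] - lo) / span} for r in records]

def mine(records, field):
    """Data mining step (here only summarization): mean of one attribute."""
    return mean(r[field] for r in records)

selected = select(raw_records, ["age", "income"])
cleaned = clean(selected)
scaled = transform(cleaned, "income")
summary = mine(scaled, "income")
print(len(cleaned), summary)
```

In practice the cleaning step is rarely a simple drop of incomplete records; it is shown that way here only to keep the sketch short.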
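Architecture: Typical Data Mining System

[Figure: layered architecture, top to bottom: Graphical user interface; Pattern evaluation; Data mining engine; Database or data warehouse server; Data cleaning & data integration, Filtering; Databases and Data Warehouse. A Knowledge-base sits alongside the pattern evaluation and data mining engine layers.]
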
What is Data Mining?
• Many definitions
  - Non-trivial extraction of implicit, previously unknown and potentially useful information from data
  - Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

Origins of Data Mining
• Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
• Traditional techniques may be unsuitable due to data that is
  - Large-scale
  - High dimensional
  - Heterogeneous
  - Complex
  - Distributed
• A key component of the emerging field of data science and data-driven discovery

Q3. What types of data can be mined?
• Database data (RDBMS)
• Data warehouse
• Transactional data
• Other types of data:
  - Sequence data, data streams (cont.), spatial data (maps), engineering design data, hypertext, multimedia, web data, etc.
• Advanced database and information repository
  - Spatial and temporal data
  - Time-series data
  - Stream data
  - Multimedia database
  - Text databases & WWW

Database data (RDBMS): Relational -> tables
• RDBMS
  - Set of tables, each with rows (tuples) and columns (attributes)
  - When mining databases, we can search for trends or data patterns
• Example:
  - Analysing customer data to predict the credit risk of new customers (based on previous data)
  - Analysing sales data (any deviations)

Data warehouse
• Collection of data integrated from different sources, with querying and decision making on the data
• In a data warehouse, data is stored in a multidimensional structure (data cube) where each dimension corresponds to an attribute

[Figure: Data Source-1, Data Source-2, Data Source-3 → Data Warehouse → Querying / Analysis → Client-1, Client-2]

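To make the data cube idea concrete, here is a minimal sketch of a roll-up, i.e. summing a measure over the dimensions that are not of interest. The fact records, the three dimension names, and the `roll_up` helper are all hypothetical; a real warehouse would do this through its OLAP engine.

```python
from collections import defaultdict

# Hypothetical fact records: (region, product, quarter, sales amount).
facts = [
    ("North", "Milk", "Q1", 120),
    ("North", "Bread", "Q1", 80),
    ("South", "Milk", "Q1", 100),
    ("North", "Milk", "Q2", 150),
    ("South", "Bread", "Q2", 60),
]
DIMENSIONS = ("region", "product", "quarter")

def roll_up(records, keep):
    """Sum the sales measure over every dimension not listed in `keep`."""
    positions = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(int)
    for rec in records:
        key = tuple(rec[i] for i in positions)
        totals[key] += rec[-1]  # the last field is the measure
    return dict(totals)

by_region = roll_up(facts, ["region"])
by_product_quarter = roll_up(facts, ["product", "quarter"])
print(by_region)
print(by_product_quarter)
```

Each choice of `keep` corresponds to one view (cuboid) of the cube; keeping all three dimensions returns the base data.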
Transactional data
• Each record is called a transaction
  - Sales
  - Flight bookings
  - User clicks on a web page
• A transaction has a transaction ID and a list of the items making up the transaction
• From a transaction database, we can mine frequent patterns
• Other types of data:
  - Sequence data, data streams (cont.), spatial data (maps), engineering design data, hypertext, multimedia, web data, etc.

Q4. Data Mining Functionalities
• Class/concept descriptions (data is always associated with classes/concepts):
  - Data characterisation:
    - Refers to the summary of the class/concept
    - Output -> general overview
  - Data discrimination:
    - Compares the common features of the classes
    - Output -> bar charts, curves, etc.
• Mining frequent patterns, associations, and correlations
  - Frequent patterns:
    - Things that are found most commonly in the data
    - Frequent itemsets (data items/data objects)
    - Frequent subsequences
    - Frequent substructures
  - Association analysis (relationship):
    - A way of identifying the relation between various items
    - Example: used to determine sales of items that are frequently purchased together

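As an illustration of frequent itemsets, the sketch below counts every item pair by brute force and keeps those that appear in at least `min_count` transactions. The transactions and the threshold are made up for the example, and this is plain enumeration, not an optimized algorithm such as Apriori.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer"},
    {"milk", "diaper", "beer"},
    {"bread", "milk", "diaper"},
]

def frequent_itemsets(transactions, size, min_count):
    """Count all itemsets of the given size and keep the frequent ones."""
    counts = Counter()
    for items in transactions:
        # Sorting gives each itemset one canonical tuple form.
        for itemset in combinations(sorted(items), size):
            counts[itemset] += 1
    return {s: c for s, c in counts.items() if c >= min_count}

frequent_pairs = frequent_itemsets(transactions, size=2, min_count=2)
print(frequent_pairs)
```

Enumerating all combinations grows quickly with transaction length, which is why practical miners prune candidates instead.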
Q4. Data Mining Functionalities
• Correlation analysis:
  - Mathematical technique
  - Shows how strongly a pair of attributes is related
  - Example: tall people tend to weigh more
• Classification and regression for predictive analysis
  - Classification:
    - Process of finding a model that distinguishes classes of data items
    - A decision tree is one model used for classification
  - Regression:
    - Statistical methodology used for numeric prediction of missing values, based on previous data

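The height/weight example of correlation analysis can be made concrete with the Pearson correlation coefficient. The sketch below uses made-up height and weight values; only the formula itself is standard.

```python
from math import sqrt

# Hypothetical measurements for five people.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 58, 69, 77, 84]

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

r = pearson(heights_cm, weights_kg)
print(round(r, 3))
```

A value near +1 indicates a strong positive linear relationship, near -1 a strong negative one, and near 0 no linear relationship.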
Q4. Data Mining Functionalities
• Cluster analysis (grouping)
  - Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
  - Maximizing intra-class similarity & minimizing inter-class similarity
• Outlier analysis
  - Outlier: a data object that does not comply with the general behavior of the data
  - Useful in fraud detection, rare events analysis
• Trend and evolution analysis
  - Trend and deviation: regression analysis
  - Sequential pattern mining, periodicity analysis

Data Mining Tasks …

[Figure: the data table below at the center, surrounded by four tasks: Clustering, Predictive Modeling, Anomaly Detection, Association Rules]

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
11   No      Married         60K             No
12   Yes     Divorced        220K            No
13   No      Single          85K             Yes
14   No      Married         75K             No
15   No      Single          90K             Yes

Predictive Modeling: Classification
• Find a model for the class attribute as a function of the values of other attributes

Tid  Employed  Level of Education  # years at present address  Credit Worthy (class)
1    Yes       Graduate            5                           Yes
2    Yes       High School         2                           No
3    No        Undergrad           1                           No
4    Yes       High School         10                          Yes
…    …         …                   …                           …

[Figure: model for predicting credit worthiness (decision tree)]
Employed?
  No  → No
  Yes → Education?
          Graduate                 → Number of years: > 3 yr → Yes; < 3 yr → No
          {High school, Undergrad} → Number of years: > 7 yrs → Yes; < 7 yrs → No

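The credit-worthiness tree on this slide can be written as a small function. This is a sketch based on one reading of the flattened figure (Graduate branch splits at 3 years, the High school/Undergrad branch at 7 years); the function name is hypothetical, and because the slide does not say what happens exactly at 3 or 7 years, the strict `>` comparison is an assumption.

```python
def credit_worthy(employed, education, years_at_address):
    """Walk the slide's decision tree and return the predicted class."""
    if not employed:
        return "No"
    # Graduates split at 3 years; High School and Undergrad split at 7 years.
    threshold = 3 if education == "Graduate" else 7
    # Assumption: a value exactly on the threshold is treated as "No".
    return "Yes" if years_at_address > threshold else "No"

# The four training records shown on the slide: (employed, education, years, class).
training = [
    (True, "Graduate", 5, "Yes"),
    (True, "High School", 2, "No"),
    (False, "Undergrad", 1, "No"),
    (True, "High School", 10, "Yes"),
]

predictions = [credit_worthy(e, edu, years) for e, edu, years, _ in training]
print(predictions)
```

On the four visible training records the tree reproduces every class label, which is what one would expect of a model learned from them.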
Classification Example

Attribute types: Employed (categorical), Level of Education (categorical), # years at present address (quantitative), Credit Worthy (class)

Training Set:
Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Graduate            5                           Yes
2    Yes       High School         2                           No
3    No        Undergrad           1                           No
4    Yes       High School         10                          Yes
…    …         …                   …                           …

Test Set:
Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Undergrad           7                           ?
2    No        Graduate            3                           ?
3    Yes       High School         2                           ?
…    …         …                   …                           …

[Figure: Training Set → Learn Classifier → Model; the Model is then applied to the Test Set]

Examples of Classification Task
• Classifying credit card transactions as legitimate or fraudulent
• Classifying land covers (water bodies, urban areas, forests, etc.) using satellite data
• Categorizing news stories as finance, weather, entertainment, sports, etc.
• Identifying intruders in cyberspace
• Predicting tumor cells as benign or malignant
• Classifying secondary structures of proteins as alpha-helix, beta-sheet, or random coil

Classification: Application 1
• Fraud Detection
  - Goal: Predict fraudulent cases in credit card transactions.
  - Approach:
    - Use credit card transactions and the information on the account holder as attributes.
      - When does a customer buy, what does he buy, how often he pays on time, etc.
    - Label past transactions as fraud or fair transactions. This forms the class attribute.
    - Learn a model for the class of the transactions.
    - Use this model to detect fraud by observing credit card transactions on an account.

Classification: Application 2
• Churn prediction for telephone customers
  - Goal: To predict whether a customer is likely to be lost to a competitor.
  - Approach:
    - Use detailed records of transactions with each of the past and present customers to find attributes.
      - How often the customer calls, where he calls, what time of day he calls most, his financial status, marital status, etc.
    - Label the customers as loyal or disloyal.
    - Find a model for loyalty.

From [Berry & Linoff] Data Mining Techniques, 1997

Classification: Application 3
• Sky Survey Cataloging
  - Goal: To predict the class (star or galaxy) of sky objects, especially visually faint ones, based on the telescopic survey images (from Palomar Observatory).
    - 3000 images with 23,040 x 23,040 pixels per image.
  - Approach:
    - Segment the image.
    - Measure image attributes (features): 40 of them per object.
    - Model the class based on these features.
    - Success story: could find 16 new high red-shift quasars, some of the farthest objects that are difficult to find!

From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996

Classifying Galaxies
Courtesy: http://aps.umn.edu

Class:
• Stages of formation (Early, Intermediate, Late)
Attributes:
• Image features
• Characteristics of light waves received, etc.
Data Size:
• 72 million stars, 20 million galaxies
• Object Catalog: 9 GB
• Image Database: 150 GB

Regression
• Predict the value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
• Extensively studied in the statistics and neural network fields.
• Examples:
  - Predicting sales amounts of a new product based on advertising expenditure.
  - Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
  - Time series prediction of stock market indices.

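For the linear case, the first example (sales as a function of advertising expenditure) can be sketched with ordinary least squares on a single predictor. The numbers below are invented for illustration and the units are arbitrary.

```python
# Hypothetical data: advertising spend and resulting sales, same arbitrary units.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(ad_spend, sales)
predicted_sales = slope * 6.0 + intercept  # prediction for an unseen spend of 6.0
print(round(slope, 2), round(intercept, 2), round(predicted_sales, 2))
```

With several predictors (as in the wind velocity example) the same idea generalizes to multiple linear regression, which is usually solved with a linear algebra library rather than by hand.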
Clustering
• Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups

[Figure: intra-cluster distances are minimized; inter-cluster distances are maximized]

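One common way to find such groups is k-means, which alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its points. The sketch below is a minimal version on made-up 2-D points; the starting centroids are passed in explicitly (an assumption made here so the run is deterministic), whereas real implementations usually pick them at random.

```python
from math import dist

# Hypothetical 2-D points forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]

def k_means(points, centroids, iterations=10):
    """Basic k-means: assign points to the nearest centroid, then re-center."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # assignments are stable, so the algorithm has converged
        centroids = new_centroids
    return centroids, clusters

centroids, clusters = k_means(points, centroids=[points[0], points[3]])
print(centroids)
print(clusters)
```

Each re-centering step cannot increase the total distance between points and their centroids, which matches the slide's goal of minimizing intra-cluster distances.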
Applications of Cluster Analysis
• Understanding
  - Customer profiling for targeted marketing
  - Group related documents for browsing
  - Group genes and proteins that have similar functionality
  - Group stocks with similar price fluctuations
• Summarization
  - Reduce the size of large data sets

[Figure (courtesy: Michael Eisen)]
[Figure: Clusters for Raw SST and Raw NPP, plotted by longitude (-180 to 180) and latitude (-90 to 90); legend: Land Cluster 1, Land Cluster 2, Ice or No NPP, Sea Cluster 1, Sea Cluster 2. Use of K-means to partition Sea Surface Temperature (SST) and Net Primary Production (NPP) into clusters that reflect the Northern and Southern Hemispheres.]

Clustering: Application 1
• Market Segmentation:
  - Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
  - Approach:
    - Collect different attributes of customers based on their geographical and lifestyle related information.
    - Find clusters of similar customers.
    - Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.

Clustering: Application 2
• Document Clustering:
  - Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.
  - Approach: Identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.

[Figure: Enron email dataset]

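The slide does not name a specific similarity measure. One common choice, assumed here, is cosine similarity between term-frequency vectors. The three short documents below are invented for illustration and are not taken from the Enron dataset.

```python
from collections import Counter
from math import sqrt

def term_frequencies(text):
    """Count how often each whitespace-separated term occurs."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical documents.
docs = {
    "d1": "energy trading contract energy price",
    "d2": "energy price forecast contract",
    "d3": "lunch meeting friday lunch menu",
}
tf = {name: term_frequencies(text) for name, text in docs.items()}

sim_12 = cosine_similarity(tf["d1"], tf["d2"])
sim_13 = cosine_similarity(tf["d1"], tf["d3"])
print(round(sim_12, 3), sim_13)
```

Documents d1 and d2 share several terms and score high, while d1 and d3 share none and score zero; a clustering algorithm can then group documents using these pairwise similarities.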
Association Rule Discovery: Definition
• Given a set of records, each of which contains some number of items from a given collection
  - Produce dependency rules which will predict the occurrence of an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}

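How strongly the five transactions back each discovered rule can be checked with the usual support and confidence measures. The sketch below uses the table from this slide; the helper names are hypothetical.

```python
# The five transactions from the slide, keyed by TID.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

def count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for items in transactions.values() if itemset <= items)

def support(itemset):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    return count(antecedent | consequent) / count(antecedent)

milk_coke = (support({"Milk", "Coke"}), confidence({"Milk"}, {"Coke"}))
diaper_milk_beer = (
    support({"Diaper", "Milk", "Beer"}),
    confidence({"Diaper", "Milk"}, {"Beer"}),
)
print(milk_coke)         # (0.6, 0.75)
print(diaper_milk_beer)
```

So {Milk} --> {Coke} holds in 3 of the 4 transactions that contain Milk, and {Diaper, Milk} --> {Beer} holds in 2 of the 3 transactions that contain both Diaper and Milk.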
Association Analysis: Applications
• Market-basket analysis
  - Rules are used for sales promotion, shelf management, and inventory management
• Telecommunication alarm diagnosis
  - Rules are used to find combinations of alarms that occur together frequently in the same time period
• Medical informatics
  - Rules are used to find combinations of patient symptoms and test results associated with certain diseases

Association Analysis: Applications (continued)
• An example subspace differential co-expression pattern from lung cancer datasets
  - Three lung cancer datasets: [Bhattacharjee et al. 2001], [Stearman et al. 2005], [Su et al. 2007]
  - Enriched with the TNF/NFκB signaling pathway, which is well-known to be related to lung cancer
  - P-value: 1.4 × 10^-5 (6/10 overlap with the pathway)

[Fang et al PSB 2010]

Deviation/Anomaly/Change Detection
• Detect significant deviations from normal behavior
• Applications:
  - Credit card fraud detection
  - Network intrusion detection
  - Identify anomalous behavior from sensor networks for monitoring and surveillance
  - Detecting changes in the global forest cover

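A very simple way to flag deviations from normal behavior is the z-score: how many standard deviations a value lies from the mean. The sketch below uses invented transaction amounts, and the threshold of 2.0 is an assumption chosen for this tiny sample; the slide does not prescribe a method.

```python
from statistics import mean, pstdev

# Hypothetical credit card transaction amounts for one account.
amounts = [42.0, 38.5, 51.0, 45.2, 39.9, 47.3, 950.0, 44.1]

def z_score_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

outliers = z_score_outliers(amounts)
print(outliers)
```

Because a single extreme value inflates both the mean and the standard deviation, z-scores are a blunt tool on small samples; robust statistics or model-based detectors are common alternatives.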
Q5. Are All the “Discovered” Patterns Interesting?
• Data mining may generate thousands of patterns: not all of them are interesting
  - Suggested approach: human-centered, query-based, focused mining
• Interestingness measures
  - A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
• Objective vs. subjective interestingness measures
  - Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  - Subjective: based on the user’s belief in the data, e.g., unexpectedness, novelty

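The two objective measures named above have standard definitions. For a rule $X \rightarrow Y$ over a set of transactions $T$ (the symbols $X$, $Y$, $T$, $t$ are notation introduced here, not taken from the slides):

```latex
\[
\mathrm{support}(X \rightarrow Y) = \frac{|\{\, t \in T : X \cup Y \subseteq t \,\}|}{|T|},
\qquad
\mathrm{confidence}(X \rightarrow Y) = \frac{|\{\, t \in T : X \cup Y \subseteq t \,\}|}{|\{\, t \in T : X \subseteq t \,\}|}
\]

% Worked example with the five transactions from the association rule slide:
\[
\mathrm{support}(\{\mathrm{Milk}\} \rightarrow \{\mathrm{Coke}\}) = \frac{3}{5} = 0.6,
\qquad
\mathrm{confidence}(\{\mathrm{Milk}\} \rightarrow \{\mathrm{Coke}\}) = \frac{3}{4} = 0.75
\]
```

A miner typically keeps only rules whose support and confidence both exceed user-chosen minimum thresholds.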
Q6. Data Mining: Classification Schemes
• Different views, different classifications
  - Kinds of data to be mined
  - Kinds of knowledge to be discovered
  - Kinds of techniques utilized
  - Kinds of applications adapted

Multi-Dimensional View of Data Mining
• Data to be mined
  - Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multimedia, heterogeneous, WWW
• Knowledge to be mined
  - Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
  - Multiple/integrated functions and mining at multiple levels

Multi-Dimensional View of Data Mining (continued)
• Techniques utilized
  - Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
• Applications adapted
  - Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, Web mining, etc.

OLAP Mining: Integration of Data Mining and Data Warehousing
• Coupling of data mining systems, DBMS, and data warehouse systems
• On-line analytical mining of data
  - Integration of mining and OLAP technologies
• Interactive mining of multi-level knowledge
  - Necessity of mining knowledge and patterns at different levels of abstraction
• Integration of multiple mining functions
  - Characterized classification, first clustering and then association

Data Mining: Confluence of Multiple Disciplines

[Figure: Data Mining at the center, surrounded by Database Systems, Statistics, Machine Learning, Visualization, Algorithm, and Other Disciplines]

Q7. Major Issues in Data Mining
• Mining methodology
  - Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
  - Performance: efficiency, effectiveness, and scalability
  - Pattern evaluation: the interestingness problem
  - Incorporation of background knowledge
  - Handling noise and incomplete data
  - Parallel, distributed and incremental mining methods
  - Integration of the discovered knowledge with existing knowledge: knowledge fusion

Q7. Major Issues in Data Mining (continued)
• User interaction
  - Data mining query languages and ad-hoc mining
  - Expression and visualization of data mining results
  - Interactive mining of knowledge at multiple levels of abstraction
• Applications and social impacts
  - Domain-specific data mining & invisible data mining
  - Protection of data security, integrity, and privacy

Summary
• Data mining: discovering interesting patterns from large amounts of data
• A natural evolution of database technology, in great demand, with wide applications
• A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
• Mining can be performed in a variety of information repositories
• Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
• Data mining systems and architectures
• Major issues in data mining

Where to Find References?
• More conferences on data mining
  - PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.
• Data mining and KDD
  - Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
  - Journals: Data Mining and Knowledge Discovery, KDD Explorations
• Database systems
  - Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
  - Journals: ACM-TODS, IEEE-TKDE, JIIS, J. ACM, etc.
• AI & Machine Learning
  - Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), etc.
  - Journals: Machine Learning, Artificial Intelligence, etc.
• Statistics
  - Conferences: Joint Stat. Meeting, etc.
  - Journals: Annals of Statistics, etc.
• Visualization
  - Conference proceedings: CHI, ACM-SIGGraph, etc.
  - Journals: IEEE Trans. Visualization and Computer Graphics, etc.
