Lec.01 Introduction To DM

The document outlines the course structure for Data Mining, including assessment components and topics covered such as the importance of data mining, types of data, functionalities, and applications. It emphasizes the explosive growth of data and the necessity for automated analysis to extract valuable insights. Additionally, it discusses the evolution of database technology and various data mining tasks like classification, clustering, and fraud detection.

Uploaded by

khanhndn2005
Course: 505043

Data Mining

Lecture 1. Introduction to Data Mining


Types of Data

Dr. Anh HOANG

Report
n QT1 (10%): attending classes
n QT2 (20%): Homework #1-2-3
n Midterm (20%)
n Group presentation
n Individual performance
n Final report (50%)
n Group presentation
n Individual performance
n Requirement:
n Submit HW, Report, … before deadline
n Presentation:
n 1) Understanding the problem clearly
n 2) Solution/ Algorithm
n 3) Demo code
Contents
n Why data mining?

n What is data mining?

n What types of data can be mined?

n Data mining functionalities/ Tasks

n Interesting patterns

n Classification of data mining systems

n Major issues in data mining

Large-scale Data is Everywhere!
§ There has been enormous data growth in both commercial and scientific databases due to advances in data generation and collection technologies.

§ New mantra
§ Gather whatever data you can whenever and wherever possible.

§ Expectations
§ Gathered data will have value either for the purpose collected or for a purpose not envisioned.

[Figure panels: Cyber Security, E-Commerce, Social Networking: Twitter, Traffic Patterns, Sensor Networks, Computational Simulations]

Introduction to Data Mining, 2nd Edition
Tan, Steinbach, Karpatne, Kumar
Q1. Why Data Mining?
n The Explosive Growth of Data: from terabytes to petabytes
n Data collection and data availability
n Automated data collection tools, database systems, Web, computerized
society
n Major sources of abundant data
n Business: Web, e-commerce, transactions, stocks, …
n Science: Remote sensing, bioinformatics, scientific simulation, …
n Society and everyone: news, digital cameras,
n …
n We are drowning in data but starving for knowledge!
n “Necessity is the mother of invention”—Data mining—Automated analysis of
massive data sets

Why Data Mining? Commercial Viewpoint

n Lots of data is being collected and warehoused
n Web data
n Google has petabytes of web data
n Facebook has billions of active users
n Purchases at department/grocery stores, e-commerce
n Amazon handles millions of visits/day
n Bank/Credit Card transactions

n Computers have become cheaper and more powerful

n Competitive pressure is strong
n Provide better, customized services for an edge (e.g. in Customer Relationship Management)
Why Data Mining? Scientific Viewpoint
n Data collected and stored at enormous speeds
n Remote sensors on a satellite
n NASA EOSDIS archives over petabytes of earth science data / year
n Telescopes scanning the skies
n Sky survey data
n High-throughput biological data
n Scientific simulations
n Terabytes of data generated in a few hours
n Data mining helps scientists
n In automated analysis of massive datasets
n In hypothesis formation

[Figure panels: fMRI Data from Brain, Sky Survey Data, Gene Expression Data, Surface Temperature of Earth]
Great opportunities to improve productivity in all walks of life

Great Opportunities to Solve Society’s Major Problems

n Improving health care and reducing costs
n Predicting the impact of climate change
n Finding alternative/green energy sources
n Reducing hunger and poverty by increasing agriculture production
Evolution of Database Technology
n 1960s:
n Data collection, database creation, IMS and network DBMS
n 1970s:
n Relational data model, relational DBMS implementation
n 1980s:
n RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
n Application-oriented DBMS (spatial, scientific, engineering, etc.)
n 1990s:
n Data mining, data warehousing, multimedia databases, and Web databases
n 2000s:
n Stream data management and mining
n Data mining and its applications
n Web technology (XML, data integration) and global information systems

Why Data Mining?—Potential Applications

n Data analysis and decision support/making


n Market analysis and management
n Target marketing, customer relationship management
(CRM), market basket analysis, market segmentation
n Risk analysis and management
n Forecasting, customer retention, quality control,
competitive analysis
n Fraud detection and detection of unusual patterns (outliers)

Why Data Mining?—Potential Applications

n Other Applications
n Text mining (news group, email, documents) and Web
mining
n Stream data mining
n Bioinformatics and bio-data analysis

Market Analysis and Management

n Where does the data come from?


n Credit card transactions, discount coupons, customer
complaint calls

n Target marketing
n Find clusters of “model” customers who share the same
characteristics: interest, income level, spending habits, etc.
n Determine customer purchasing patterns over time

Market Analysis and Management

n Cross-market analysis
n Associations/correlations between product sales, and prediction based on such associations
n Customer profiling
n What types of customers buy what products

n Customer requirement analysis


n Identifying the best products for different customers
n Predict what factors will attract new customers

Fraud Detection & Mining Unusual Patterns

n Approaches: Clustering & model construction for frauds, outlier analysis

n Applications: Health care, retail, credit card service, telecom.


n Medical insurance
n Professional patients, and ring of doctors
n Unnecessary or correlated screening tests
n Telecommunications:
n Phone call model: destination of the call, duration, time of day or
week. Analyze patterns that deviate from an expected norm
n Retail industry
n Analysts estimate that 38% of retail shrink is due to dishonest
employees

Other Applications

n Internet Web Surf-Aid


n IBM Surf-Aid applies data mining algorithms to Web access
logs for market-related pages to discover customer
preference and behavior pages, analyzing effectiveness of
Web marketing, improving Web site organization, etc.
n …

Q2. What Is Data Mining?

n Data mining (knowledge discovery from data)


n Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
n Alternative name
n Knowledge discovery in databases (KDD)
n Watch out: Is everything “data mining”?
n Query processing
n Expert systems
n Statistical programs
Data Mining: KDD Process

n Data mining: the core of the knowledge discovery process

[Figure: KDD process, bottom to top: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
Steps of a KDD Process

n Learning the application domain


n Relevant prior knowledge and goals of application
n Creating a target data set: data selection
n Data cleaning and preprocessing: (may take 60% - 80% of effort!)
n Data reduction and transformation
n Find useful features, dimensionality/variable reduction.
n Choosing functions of data mining
n Summarization, classification, regression, association, clustering.
n Choosing the mining algorithm(s)
n Data mining: search for patterns of interest
n Pattern evaluation and knowledge presentation
n Visualization, transformation, removing redundant patterns, etc.
n Use of discovered knowledge
n …
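
As a rough illustration of how the steps above chain together, here is a minimal Python sketch. The records, field names, and helper functions (`select`, `clean`, `transform`, `mine`) are hypothetical and not part of the lecture, and the mining step is a trivial summarization (mean per group) standing in for a real algorithm.

```python
from statistics import mean

# Hypothetical raw records; None marks a missing value.
raw = [
    {"customer": "a", "region": "north", "spend": 120.0},
    {"customer": "b", "region": "north", "spend": None},
    {"customer": "c", "region": "south", "spend": 80.0},
    {"customer": "d", "region": "south", "spend": 100.0},
]

def select(records, fields):
    """Data selection: keep only the task-relevant attributes."""
    return [{f: r[f] for f in fields} for r in records]

def clean(records):
    """Data cleaning: drop records that contain a missing value."""
    return [r for r in records if all(v is not None for v in r.values())]

def transform(records):
    """Data transformation: group spend values by region."""
    groups = {}
    for r in records:
        groups.setdefault(r["region"], []).append(r["spend"])
    return groups

def mine(groups):
    """Data mining (here only summarization): mean spend per region."""
    return {region: mean(values) for region, values in groups.items()}

# Selection -> cleaning -> transformation -> mining, in that order.
patterns = mine(transform(clean(select(raw, ["region", "spend"]))))
print(patterns)
```

In practice the cleaning step is usually far more involved than dropping incomplete records, which is consistent with the slide's remark that it may take most of the effort.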
Architecture: Typical Data Mining System

[Figure: layered architecture, top to bottom]
n Graphical user interface
n Pattern evaluation
n Data mining engine
n Database or data warehouse server
n Data cleaning & data integration, filtering
n Databases, data warehouse
n Knowledge-base (drawn beside the engine layers)
What is Data Mining?
n Many Definitions
n Non-trivial extraction of implicit, previously unknown and
potentially useful information from data
n Exploration & analysis, by automatic or semi-automatic
means, of large quantities of data in order to discover
meaningful patterns

Origins of Data Mining

n Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems

n Traditional techniques may be unsuitable due to data that is
n Large-scale
n High dimensional
n Heterogeneous
n Complex
n Distributed

n A key component of the emerging field of data science and data-driven discovery
Q3. What types of data can be mined?

n Database data (RDBMs)


n Data warehouse
n Transactional data
n Other types of data:
n Sequence data, data streams (cont.), spatial data (maps), engineering
design data, hypertext, multimedia, web data, etc.

n Advanced database and information repository


n Spatial and temporal data
n Time-series data
n Stream data
n Multimedia database
n Text databases & WWW
Database data (RDBMs): Relational -> tables

n RDBMs
n Set of tables – has rows (tuples) and columns (attributes)
n While mining databases, we can search for trends or data patterns

n Example:
n Analysing customer data to predict the credit risk of new customers (based on previous data)
n Analysing sales data to detect deviations
Data warehouse data

n Collection of data integrated from different sources, with querying and decision making on the data
n In a data warehouse, data is stored in a multidimensional structure (data cube) where each dimension corresponds to an attribute

[Figure: Data Source-1, Data Source-2, Data Source-3 → Data Warehouse → Querying/Analysis → Client-1, Client-2]
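
To make the data cube idea a little more concrete, the following Python sketch stores a few facts keyed by three dimensions and aggregates them along a chosen subset of dimensions (a roll-up). The fact records, the dimension names, and the `roll_up` helper are hypothetical, chosen only for illustration; real warehouses use dedicated OLAP engines.

```python
from collections import defaultdict

# Hypothetical fact records: (product, region, quarter, sales amount).
facts = [
    ("milk", "north", "Q1", 10),
    ("milk", "south", "Q1", 5),
    ("bread", "north", "Q1", 7),
    ("milk", "north", "Q2", 12),
]
DIMENSIONS = ("product", "region", "quarter")

def roll_up(facts, keep):
    """Sum the sales measure over every dimension not listed in `keep`."""
    idx = [DIMENSIONS.index(d) for d in keep]
    cube = defaultdict(int)
    for row in facts:
        key = tuple(row[i] for i in idx)
        cube[key] += row[-1]  # last field is the measure
    return dict(cube)

by_product = roll_up(facts, ["product"])
by_product_quarter = roll_up(facts, ["product", "quarter"])
print(by_product)
print(by_product_quarter)
```

Each call corresponds to viewing the cube at a different level of aggregation: one dimension kept, or two.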
Transactional data
n Each record is called a transaction, e.g.
n sales,
n flight booking,
n user clicks on a web page

n A transaction has a transaction ID and a list of the items that make up the transaction

n From a transaction database, we can mine frequent patterns

n Other types of data:


n Sequence data, data streams (cont.), spatial data (maps),
engineering design data, hypertext, multimedia, web data, etc.
Q4. Data Mining Functionalities
n Class/concept descriptions: data is associated with classes or concepts
n Data characterisation:
n Summarises the general features of a class/concept
n Output -> general overview
n Data discrimination:
n Compares the features of a class against those of contrasting classes
n Output -> bar charts, curves, etc.

n Mining frequent patterns, associations, and correlations
n Frequent patterns:
n Things that occur most commonly in the data
n Frequent itemsets (data items/ data objects)
n Frequent subsequences
n Frequent substructures
n Association analysis: (relationship)
n A way of identifying relationships between items
n Example: used to identify items that are frequently purchased together
Q4. Data Mining Functionalities
n Correlation analysis:
n Mathematical technique
n Shows how strongly pairs of attributes are related
n Example: taller people tend to weigh more
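
One common way to quantify how strongly two numeric attributes are related is the Pearson correlation coefficient. The slide does not name a specific measure, so treat this Python sketch as one possible choice; the height and weight values are made up for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sum of products of deviations, and the two deviation norms.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)

# Hypothetical sample: heights in cm and weights in kg.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 58, 69, 77, 85]
r = pearson(heights_cm, weights_kg)
print(round(r, 3))
```

A value close to +1 indicates a strong positive linear relationship, a value close to -1 a strong negative one, and a value near 0 little linear relationship.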

n Classification and Regression for predictive analysis


n Classification:
n Process of finding a model that distinguishes classes of data items

n A decision tree is one model used for classification

n Regression:
n Statistical methodology used for numeric prediction (based on previous data), including estimating missing values
Q4. Data Mining Functionalities
n Cluster analysis (Group)
n Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
n Maximizing intra-class similarity & minimizing interclass similarity

n Outlier analysis
n Outlier: a data object that does not comply with the general behavior of the data
n Useful in fraud detection, rare events analysis

n Trend and evolution analysis
n Trend and deviation: regression analysis
n Sequential pattern mining, periodicity analysis
Data Mining Tasks …

[Figure: a data table surrounded by four task labels: Clustering, Predictive Modeling, Anomaly Detection, Association Rules]

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
11   No      Married         60K             No
12   Yes     Divorced        220K            No
13   No      Single          85K             Yes
14   No      Married         75K             No
15   No      Single          90K             Yes
Predictive Modeling: Classification

n Find a model for the class attribute as a function of the values of other attributes

Tid  Employed  Level of Education  # years at present address  Credit Worthy (class)
1    Yes       Graduate            5                           Yes
2    Yes       High School         2                           No
3    No        Undergrad           1                           No
4    Yes       High School         10                          Yes
…    …         …                   …                           …

[Figure: Model for predicting credit worthiness (decision tree). Root: Employed. Employed = No → No. Employed = Yes → Education. Education = Graduate → Number of years (> 3 yr → Yes, < 3 yr → No). Education in {High school, Undergrad} → Number of years (> 7 yrs → Yes, < 7 yrs → No).]
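
The decision tree in the figure can be written as a small function. This is only a sketch under stated assumptions: the pairing of branches with thresholds (Graduate with 3 years, High School/Undergrad with 7 years) is read from the flattened figure and chosen because it agrees with the four training rows, and the behaviour when the number of years equals a threshold exactly is not specified on the slide, so the code arbitrarily treats it as "No". The function name `credit_worthy` is hypothetical.

```python
def credit_worthy(employed, education, years_at_address):
    """Hand-coded reading of the slide's decision tree (branch layout assumed)."""
    if not employed:
        return "No"
    # Assumption: Graduate uses the 3-year threshold, all other levels 7 years.
    threshold = 3 if education == "Graduate" else 7
    # Assumption: strict ">"; the slide leaves the equal case undefined.
    return "Yes" if years_at_address > threshold else "No"

# The four labelled rows shown in the slide's table.
training = [
    (True, "Graduate", 5, "Yes"),
    (True, "High School", 2, "No"),
    (False, "Undergrad", 1, "No"),
    (True, "High School", 10, "Yes"),
]
predictions = [credit_worthy(e, edu, y) for e, edu, y, _ in training]
print(predictions)
```

A real classifier would learn such a tree from the training set instead of having it written by hand; the point here is only that the model is a function from attribute values to a class label.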
Classification Example
Attribute types: Employed (categorical), Level of Education (categorical), # years at present address (quantitative), Credit Worthy (class)

Training Set
Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Graduate            5                           Yes
2    Yes       High School         2                           No
3    No        Undergrad           1                           No
4    Yes       High School         10                          Yes
…    …         …                   …                           …

Test Set
Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Undergrad           7                           ?
2    No        Graduate            3                           ?
3    Yes       High School         2                           ?
…    …         …                   …                           …

[Figure: Training Set → Learn Classifier → Model; the Model is then applied to the Test Set]
Examples of Classification Task

n Classifying credit card transactions as legitimate or fraudulent
n Classifying land covers (water bodies, urban areas, forests, etc.) using satellite data
n Categorizing news stories as finance, weather, entertainment, sports, etc.
n Identifying intruders in the cyberspace
n Predicting tumor cells as benign or malignant
n Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
Classification: Application 1
n Fraud Detection
n Goal: Predict fraudulent cases in credit card transactions.
n Approach:
n Use credit card transactions and the information on its account-holder as attributes.
n When does a customer buy, what does he buy, how often he pays on time, etc.
n Label past transactions as fraud or fair transactions. This forms the class attribute.
n Learn a model for the class of the transactions.
n Use this model to detect fraud by observing credit card transactions on an account.
Classification: Application 2
n Churn prediction for telephone customers
n Goal: To predict whether a customer is likely to be lost to a
competitor.
n Approach:
n Use detailed record of transactions with each of the past
and present customers, to find attributes.
n How often the customer calls, where he calls, what time-
of-the day he calls most, his financial status, marital status,
etc.
n Label the customers as loyal or disloyal.
n Find a model for loyalty.

From [Berry & Linoff] Data Mining Techniques, 1997


Classification: Application 3
n Sky Survey Cataloging
n Goal: To predict class (star or galaxy) of sky objects, especially visually faint ones, based on the telescopic survey images (from Palomar Observatory).
n 3000 images with 23,040 x 23,040 pixels per image.
n Approach:
n Segment the image.
n Measure image attributes (features) - 40 of them per object.
n Model the class based on these features.
n Success Story: Could find 16 new high red-shift quasars, some of the farthest objects that are difficult to find!

From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996
Classifying Galaxies
Courtesy: http://aps.umn.edu

Class:
• Stages of Formation (Early, Intermediate, Late)

Attributes:
• Image features,
• Characteristics of light waves received, etc.

Data Size:
• 72 million stars, 20 million galaxies
• Object Catalog: 9 GB
• Image Database: 150 GB
Regression
n Predict a value of a given continuous valued variable based on
the values of other variables, assuming a linear or nonlinear
model of dependency.
n Extensively studied in statistics, neural network fields.
n Examples:
n Predicting sales amounts of new product based on
advertising expenditure.
n Predicting wind velocities as a function of temperature,
humidity, air pressure, etc.
n Time series prediction of stock market indices.
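
For the linear case with a single predictor, the model can be fitted by ordinary least squares. The sketch below is a minimal illustration of that one technique, not a general regression tool; the advertising and sales figures are invented, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising expenditure vs. sales amount.
ad_spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.1, 4.9, 7.2, 8.8]
slope, intercept = fit_line(ad_spend, sales)

# Predict the continuous target for an unseen expenditure value.
predicted = slope * 5.0 + intercept
print(slope, intercept, predicted)
```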

Clustering
n Finding groups of objects such that the objects in a group
will be similar (or related) to one another and different
from (or unrelated to) the objects in other groups

[Figure: Intra-cluster distances are minimized; inter-cluster distances are maximized]
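
K-means is one widely used algorithm for this kind of grouping. The following Python sketch implements the basic assign-and-update loop (Lloyd's algorithm) on made-up 2-D points. Seeding the centroids with the first k points and running a fixed number of iterations are simplifying assumptions made here for determinism, not recommendations.

```python
from math import dist

def kmeans(points, k, iterations=10):
    """Plain k-means; the first k points seed the centroids (assumption)."""
    centroids = list(points[:k])
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters

# Hypothetical points forming two visibly separate groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

Points in the same cluster end up close to their shared centroid, while the two centroids lie far apart, which is the intra-cluster versus inter-cluster contrast described above.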
Applications of Cluster Analysis
n Understanding
n Customer profiling for targeted marketing
n Group related documents for browsing
n Group genes and proteins that have similar functionality
n Group stocks with similar price fluctuations
n Summarization
n Reduce the size of large data sets

[Figure: clustered gene data. Courtesy: Michael Eisen]

[Figure: Clusters for Raw SST and Raw NPP. Axes: longitude (-180 to 180), latitude (-90 to 90). Regions: Land Cluster 1, Land Cluster 2, Sea Cluster 1, Sea Cluster 2, Ice or No NPP]
Use of K-means to partition Sea Surface Temperature (SST) and Net Primary Production (NPP) into clusters that reflect the Northern and Southern Hemispheres.
Clustering: Application 1
n Market Segmentation:
n Goal: subdivide a market into distinct subsets of customers
where any subset may conceivably be selected as a market
target to be reached with a distinct marketing mix.
n Approach:
n Collect different attributes of customers based on their geographical and lifestyle related information.
n Find clusters of similar customers.
n Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from different clusters.
Clustering: Application 2
n Document Clustering:
n Goal: To find groups of documents that are similar to each
other based on the important terms appearing in them.
n Approach: To identify frequently occurring terms in each
document. Form a similarity measure based on the
frequencies of different terms. Use it to cluster.

Enron email dataset
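
A minimal sketch of the similarity step: count term frequencies per document and compare documents with cosine similarity. Cosine similarity is one common choice and is an assumption here, since the slide only says "a similarity measure". The three short documents are invented, and tokenization is a plain whitespace split.

```python
from collections import Counter
from math import sqrt

def term_frequencies(text):
    """Count how often each lower-cased, whitespace-separated term occurs."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical documents: two about finance, one about sports.
docs = {
    "d1": "stock market prices fall",
    "d2": "market prices rise as stock rallies",
    "d3": "team wins the football match",
}
tf = {name: term_frequencies(text) for name, text in docs.items()}
sim_12 = cosine_similarity(tf["d1"], tf["d2"])
sim_13 = cosine_similarity(tf["d1"], tf["d3"])
print(sim_12, sim_13)
```

A clustering algorithm would then group documents whose pairwise similarity is high, so the two finance documents would tend to land in the same cluster.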

Association Rule Discovery: Definition
n Given a set of records each of which contain some
number of items from a given collection
n Produce dependency rules which will predict occurrence of
an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
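
The two rules can be checked against the five transactions by computing support and confidence, the usual objective measures for association rules. The sketch below uses the transaction table from the slide as data; the helper names `count`, `support`, and `confidence` are hypothetical.

```python
# The five transactions from the slide's table.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions)

def support(itemset):
    """Fraction of all transactions that contain `itemset`."""
    return count(itemset) / len(transactions)

def confidence(antecedent, consequent):
    """Among transactions containing the antecedent, the share that also contain the consequent."""
    return count(antecedent | consequent) / count(antecedent)

for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    print(sorted(lhs), "-->", sorted(rhs),
          "support:", support(lhs | rhs),
          "confidence:", confidence(lhs, rhs))
```

Rule mining algorithms typically keep only rules whose support and confidence exceed user-chosen thresholds.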
Association Analysis: Applications
n Market-basket analysis
n Rules are used for sales promotion, shelf management, and
inventory management

n Telecommunication alarm diagnosis
n Rules are used to find combinations of alarms that occur together frequently in the same time period

n Medical Informatics
n Rules are used to find combinations of patient symptoms and test results associated with certain diseases
Association Analysis: Applications

n An example subspace differential co-expression pattern from a lung cancer dataset
n Three lung cancer datasets: [Bhattacharjee et al. 2001], [Stearman et al. 2005], [Su et al. 2007]
n Enriched with the TNF/NF-κB signaling pathway, which is well-known to be related to lung cancer
n P-value: 1.4 × 10^-5 (6/10 overlap with the pathway)

[Fang et al PSB 2010]
Deviation/Anomaly/Change Detection

n Detect significant deviations from normal behavior
n Applications:
n Credit Card Fraud Detection
n Network Intrusion Detection
n Identify anomalous behavior from sensor networks for monitoring and surveillance.
n Detecting changes in the global forest cover.
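
A very simple way to flag deviations from normal behavior is a z-score test: mark values that lie more than a chosen number of standard deviations from the mean. This is only one basic technique, picked here for brevity; the transaction amounts and the threshold of 2.0 are assumptions made for the example.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return the values whose absolute z-score exceeds `threshold`."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:  # all values identical, so nothing deviates
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical card transaction amounts with one unusually large charge.
amounts = [20, 22, 19, 21, 23, 20, 18, 500]
outliers = zscore_outliers(amounts)
print(outliers)
```

Because the mean and standard deviation are themselves pulled toward extreme values, this test becomes unreliable on small samples or when several outliers are present; more robust statistics or model-based methods are usually preferred in practice.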
Q5. Are All the “Discovered” Patterns Interesting?

n Data mining may generate thousands of patterns: Not all of them are
interesting
n Suggested approach: Human-centered, query-based, focused mining
n Interestingness measures
n A pattern is interesting if it is easily understood by humans, valid on new or test
data with some degree of certainty, potentially useful, novel, or validates some
hypothesis that a user seeks to confirm
n Objective vs. subjective interestingness measures
n Objective: based on statistics and structures of patterns, e.g., support,
confidence, etc.
n Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty.

Q6. Data Mining: Classification Schemes

n Different views, different classifications


n Kinds of data to be mined
n Kinds of knowledge to be discovered
n Kinds of techniques utilized
n Kinds of applications adapted

Multi-Dimensional View of Data Mining
n Data to be mined
n Relational, data warehouse, transactional, stream, object-
oriented/relational, active, spatial, time-series, text, multi-
media, heterogeneous, WWW

n Knowledge to be mined
n Characterization, discrimination, association, classification,
clustering, trend/deviation, outlier analysis, etc.
n Multiple/integrated functions and mining at multiple levels

Multi-Dimensional View of Data Mining
n Techniques utilized
n Database-oriented, data warehouse (OLAP), machine
learning, statistics, visualization, etc.

n Applications adapted
n Retail, telecommunication, banking, fraud analysis, bio-data
mining, stock market analysis, Web mining, etc.

OLAP Mining: Integration of Data Mining and Data Warehousing

n Data mining systems, DBMS, data warehouse systems coupling
n On-line analytical mining of data
n Integration of mining and OLAP technologies

n Interactive mining of multi-level knowledge
n Necessity of mining knowledge and patterns at different levels of abstraction.

n Integration of multiple mining functions
n Characterized classification, first clustering and then association
Data Mining: Confluence of Multiple Disciplines

[Figure: Data Mining at the center, drawing on Database Systems, Statistics, Machine Learning, Visualization, Algorithm, and Other Disciplines]
Q7. Major Issues in Data Mining
n Mining methodology
n Mining different kinds of knowledge from diverse data types,
e.g., bio, stream, Web
n Performance: efficiency, effectiveness, and scalability
n Pattern evaluation: the interestingness problem
n Incorporation of background knowledge
n Handling noise and incomplete data
n Parallel, distributed and incremental mining methods
n Integration of the discovered knowledge with existing knowledge: knowledge fusion
Q7. Major Issues in Data Mining
n User interaction
n Data mining query languages and ad-hoc mining
n Expression and visualization of data mining results
n Interactive mining of knowledge at multiple levels of
abstraction

n Applications and social impacts


n Domain-specific data mining & invisible data mining
n Protection of data security, integrity, and privacy

Summary
n Data mining: discovering interesting patterns from large amounts of data
n A natural evolution of database technology, in great demand, with wide
applications
n A KDD process includes data cleaning, data integration, data selection,
transformation, data mining, pattern evaluation, and knowledge presentation
n Mining can be performed in a variety of information repositories
n Data mining functionalities: characterization, discrimination, association,
classification, clustering, outlier and trend analysis, etc.
n Data mining systems and architectures
n Major issues in data mining

Where to Find References?
n More conferences on data mining
n PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.
n Data mining and KDD
n Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
n Journal: Data Mining and Knowledge Discovery, KDD Explorations
n Database systems
n Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
n Journals: ACM-TODS, IEEE-TKDE, JIIS, J. ACM, etc.
n AI & Machine Learning
n Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), etc.
n Journals: Machine Learning, Artificial Intelligence, etc.
n Statistics
n Conferences: Joint Stat. Meeting, etc.
n Journals: Annals of statistics, etc.
n Visualization
n Conference proceedings: CHI, ACM-SIGGraph, etc.
n Journals: IEEE Trans. visualization and computer graphics, etc.