Introduction to Data Mining
Introduction
Why Data Mining?
What Is Data Mining?
What Kind of Data Can Be Mined?
What Kinds of Patterns Can Be Mined?
What Technologies Are Used?
What Kinds of Applications Are Targeted?
Summary
Why Data Mining?
Data vs. Information:
Data: recorded facts
Information: patterns underlying the data
The Explosive Growth of Data:
Data collection and data availability
Automated data collection tools, database systems, Web
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, …
Science: bioinformatics, …
Society and everyone: news, digital cameras, YouTube
We are drowning in data, but starving for knowledge!
We are data rich, but information poor.
What is Data Mining?
“Necessity is the mother of invention”: data mining arose from the need for automated analysis of massive data sets.
Data mining: searching for knowledge (interesting patterns) in your data.
Data mining (knowledge discovery from data)
Refers to extracting or “mining” knowledge from large amounts of data.
Extraction of interesting (implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data.
Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data dredging,
information harvesting, business intelligence, etc.
Knowledge Discovery Process
Data mining can be viewed as simply an essential step in the process of knowledge discovery.
This is a view typical of the database systems and data warehousing communities.
Data mining plays an essential role in the knowledge discovery process.
The knowledge discovery process consists of the following steps:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge, based on interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques are used to present mined knowledge to users)
Steps 1 through 4 are different forms of data preprocessing.
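The steps of the knowledge discovery process can be sketched as a chain of small functions. This is only an illustration under assumptions: the record layout, the function names, the thresholds, and the trivial "pattern" (customers who spend more than the mean) are all hypothetical and are not taken from the slides.

```python
from statistics import mean

# Hypothetical raw records from two sources; None marks a missing value.
store_sales = [
    {"customer": "C1", "amount": 120.0},
    {"customer": "C2", "amount": None},
    {"customer": "C1", "amount": 80.0},
]
web_sales = [
    {"customer": "C2", "amount": 40.0},
    {"customer": "C3", "amount": 300.0},
]

def clean(records):
    """Data cleaning: drop records whose amount is missing."""
    return [r for r in records if r["amount"] is not None]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, min_amount):
    """Data selection: keep only records relevant to the task."""
    return [r for r in records if r["amount"] >= min_amount]

def transform(records):
    """Data transformation: aggregate to total spend per customer."""
    totals = {}
    for r in records:
        totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["amount"]
    return totals

def mine(totals):
    """Data mining (toy version): customers who spend more than the mean."""
    average = mean(totals.values())
    return {c: t for c, t in totals.items() if t > average}

def evaluate(patterns, min_total):
    """Pattern evaluation: keep results that pass a chosen threshold."""
    return {c: t for c, t in patterns.items() if t >= min_total}

def present(patterns):
    """Knowledge presentation: render the results as readable lines."""
    return [f"{c}: {t:.2f}" for c, t in sorted(patterns.items())]

data = integrate(clean(store_sales), clean(web_sales))
totals = transform(select(data, min_amount=50.0))
report = present(evaluate(mine(totals), min_total=250.0))
print(report)
```

In a real project each step is far more involved; the point here is only the order of the steps and the fact that the first four prepare the data before any mining happens.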
Evolution of Database Technology
1960s:
Data collection, database creation, IMS and network DBMS
1970s:
Relational data model, relational DBMS implementation
1980s:
RDBMS, advanced data models (extended-relational, object-oriented, etc.)
Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s:
Data mining, data warehousing, multimedia databases, and Web
databases
2000s:
Stream data management and mining
Data mining and its applications
Web technology (XML, data integration) and global information systems
Modern GIS applications include address matching, location analysis or site selection, development of evacuation plans, weather forecasting, environmental studies, and natural hazard studies
What Kind of Data Can Be Mined?
Database-oriented data sets and applications
Relational database, data warehouse, transactional database
Advanced data sets and advanced applications
Data streams and sensor data
Time-series data, temporal data (data that vary over time), sequence data (including biological sequences such as DNA sequences)
Structured data, graphs, social networks
Heterogeneous databases and legacy databases
Spatial data (geographic) and spatiotemporal data
Multimedia databases
Text databases
The World-Wide Web
Database Data
Database data: a database management system (DBMS) consists of a collection of interrelated data, known as a database, and a set of software programs to manage and access the data.
The software programs provide mechanisms
for defining database structures and data storage;
for specifying and managing shared or distributed data access; and
for ensuring consistency and security of the information stored despite system crashes or attempts at unauthorized access.
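A small sketch of these mechanisms, using Python's built-in sqlite3 module with an in-memory database. The table and column names (customer, purchase, cust_id, and so on) and the rows are hypothetical and are not the schema shown in the slides. The example defines a structure, stores data inside a transaction, shows an inconsistent insert being rejected, and runs one query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # ask SQLite to enforce references

# Defining database structures (hypothetical schema).
conn.execute(
    "CREATE TABLE customer ("
    "cust_id INTEGER PRIMARY KEY, name TEXT NOT NULL, age INTEGER)"
)
conn.execute(
    "CREATE TABLE purchase ("
    "trans_id TEXT PRIMARY KEY, "
    "cust_id INTEGER NOT NULL REFERENCES customer(cust_id), "
    "amount REAL NOT NULL)"
)

# Storing data: the 'with' block commits on success and rolls back on error.
with conn:
    conn.executemany(
        "INSERT INTO customer VALUES (?, ?, ?)",
        [(1, "Smith", 45), (2, "Lee", 31)],
    )
    conn.executemany(
        "INSERT INTO purchase VALUES (?, ?, ?)",
        [("T100", 1, 1200.0), ("T200", 1, 300.0), ("T300", 2, 80.0)],
    )

# Consistency: a purchase that refers to a nonexistent customer is rejected.
try:
    with conn:
        conn.execute("INSERT INTO purchase VALUES ('T400', 99, 10.0)")
except sqlite3.IntegrityError:
    rejected = True
else:
    rejected = False

# Accessing the data: total spend per customer, highest first.
rows = conn.execute(
    "SELECT c.name, SUM(p.amount) AS total "
    "FROM customer AS c JOIN purchase AS p ON p.cust_id = c.cust_id "
    "GROUP BY c.cust_id ORDER BY total DESC"
).fetchall()
print(rows, rejected)
```

SQLite is used here only because it ships with Python; a multi-user DBMS would add the shared and distributed access control that an in-memory example cannot show.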
An example AllElectronics relational database
Data Warehouse
Data warehouse: a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site.
Data in a data warehouse are organized around major subjects (e.g., customer, item, supplier, and activity).
The data are stored to provide information from a historical perspective, such as the past 6 to 12 months.
Data warehouses are constructed via a process of data
cleaning, data integration, data transformation, data
loading, and periodic data refreshing.
Transactional Data
Transactional data: each record in a transactional database captures a transaction, such as a customer’s purchase, a flight booking, or a user’s clicks on a web page.
A transaction typically includes a unique transaction identity number (trans ID) and a list of the items making up the transaction, such as the items purchased in the transaction.
Transactions can be stored in a table, with one record per transaction.
Because most relational database systems do not support nested relational structures, the transactional database is usually stored in a flat file.
Fragment of a transactional database for sales at AllElectronics.
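A minimal sketch of how such transactions might be held in memory and written out, assuming hypothetical transaction IDs and item codes. It shows two layouts: one flat-file line per transaction (ID, then the item list) and, as an alternative that avoids nesting, one (trans ID, item) row per purchased item.

```python
# Hypothetical transactions: trans ID -> list of item codes.
transactions = {
    "T100": ["I1", "I3", "I8", "I16"],
    "T200": ["I2", "I8"],
}

def to_flat_file_lines(transactions):
    """One line per transaction: the ID, a tab, then comma-separated items."""
    return [f"{tid}\t{','.join(items)}" for tid, items in transactions.items()]

def unfold(transactions):
    """One (trans ID, item) row per item, so no nested list is needed."""
    return [(tid, item) for tid, items in transactions.items() for item in items]

flat_file_lines = to_flat_file_lines(transactions)
flat_rows = unfold(transactions)
print(flat_file_lines)
print(flat_rows)
```

The unfolded form fits an ordinary relational table, at the cost of repeating the transaction ID on every row.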
What Kinds of Patterns Can Be Mined?
Data mining functionalities:
Characterization and discrimination
Mining of frequent patterns, associations, and correlations
Classification and regression
Cluster analysis
Outlier analysis
Data mining functionalities are used to specify the kinds
of patterns to be found in data mining tasks.
Concept/Class Description
Characterization: summarization of the general
characteristics or features of a target class of data.
The output of data characterization can be presented in
various forms.
E.g., pie charts, bar charts, curves, multidimensional data cubes, etc.
Example:
A customer relationship manager at AllElectronics may order the
following data mining task: Summarize the characteristics of
customers who spend more than $5000 a year at AllElectronics.
The result is a general profile of these customers, such as that
they are 40 to 50 years old, employed, and have excellent credit
ratings.
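A minimal sketch of this characterization task, assuming a hypothetical list of customer records; the field names and values are invented for illustration. It selects the target class (customers who spend more than $5000 a year) and summarizes a few of its general features.

```python
from collections import Counter
from statistics import mean

# Hypothetical customer records.
customers = [
    {"age": 44, "employed": True, "credit": "excellent", "annual_spend": 6200},
    {"age": 47, "employed": True, "credit": "excellent", "annual_spend": 7500},
    {"age": 29, "employed": False, "credit": "fair", "annual_spend": 900},
    {"age": 41, "employed": True, "credit": "good", "annual_spend": 5100},
]

# Target class: customers who spend more than $5000 a year.
target = [c for c in customers if c["annual_spend"] > 5000]

# Summarize general characteristics of the target class.
profile = {
    "count": len(target),
    "mean_age": mean(c["age"] for c in target),
    "employed_share": sum(c["employed"] for c in target) / len(target),
    "most_common_credit": Counter(c["credit"] for c in target).most_common(1)[0][0],
}
print(profile)
```

The resulting dictionary plays the role of the "general profile"; it could just as well be rendered as a bar chart or a pie chart.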
Discrimination: Comparison of the general features of
the target class data objects against the general features
of objects from one or multiple contrasting classes.
The forms of output presentation are similar to those for
characteristic descriptions.
Example:
A customer relationship manager at AllElectronics may want to compare
two groups of customers—those who shop for computer products
regularly (e.g., more than twice a month) and those who rarely shop for
such products (e.g., fewer than three times a year). The resulting
description provides a general comparative profile of these customers,
such as that 80% of the customers who frequently purchase computer
products are between 20 and 40 years old and have a university
education, whereas 60% of the customers who infrequently buy such
products are either seniors or youths, and have no university degree.
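A minimal sketch of discrimination, assuming hypothetical shopper records with invented field names. The two contrasting classes are defined by visit frequency (more than 24 visits a year for "more than twice a month", fewer than 3 for rare shoppers), and the same features are then compared across both classes.

```python
# Hypothetical shopper records.
shoppers = [
    {"visits_per_year": 30, "age": 25, "degree": True},
    {"visits_per_year": 40, "age": 35, "degree": True},
    {"visits_per_year": 26, "age": 52, "degree": False},
    {"visits_per_year": 1, "age": 70, "degree": False},
    {"visits_per_year": 2, "age": 17, "degree": False},
]

def share(group, predicate):
    """Fraction of the group that satisfies the predicate (0.0 if empty)."""
    if not group:
        return 0.0
    return sum(1 for person in group if predicate(person)) / len(group)

# Target class and contrasting class.
frequent = [s for s in shoppers if s["visits_per_year"] > 24]
rare = [s for s in shoppers if s["visits_per_year"] < 3]

# Features to compare, each expressed as a yes/no test on one record.
features = {
    "aged 20 to 40": lambda s: 20 <= s["age"] <= 40,
    "has a degree": lambda s: s["degree"],
}

# For each feature: (share among frequent shoppers, share among rare shoppers).
comparison = {
    name: (share(frequent, test), share(rare, test))
    for name, test in features.items()
}
print(comparison)
```

Characterization would report only the first number of each pair; discrimination is the side-by-side comparison.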
Frequent Patterns, Association and Correlation Analysis
Mining frequent patterns: frequent patterns are patterns that occur frequently in data.
The kinds of frequent patterns include:
Frequent itemsets: sets of items that frequently appear together in a transactional data set, such as milk and bread.
Frequent sequential patterns: frequently occurring ordered sequences, such as customers tending to purchase first a PC, then a scanner, and then a printer.
Mining frequent patterns leads to the discovery of
interesting associations and correlations within data.
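A minimal sketch of finding frequent itemsets by brute-force counting, assuming a tiny hypothetical basket data set. This is not an optimized algorithm such as Apriori; it simply counts every item combination of a given size and keeps those whose support (the fraction of transactions containing them) reaches a chosen minimum.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
]

def frequent_itemsets(transactions, size, min_support):
    """Return {itemset: support} for itemsets of the given size."""
    counts = Counter()
    for basket in transactions:
        # Sorting gives each itemset one canonical tuple form.
        for itemset in combinations(sorted(basket), size):
            counts[itemset] += 1
    n = len(transactions)
    return {s: c / n for s, c in counts.items() if c / n >= min_support}

pairs = frequent_itemsets(transactions, size=2, min_support=0.5)
print(pairs)
```

Brute-force counting grows quickly with the number of items, which is why practical miners prune candidates rather than enumerate them all.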
An example of an association rule: [rule shown as an image on the original slide]
where X is a variable representing a customer.
This association rule involves a single attribute or predicate (i.e., buys) that repeats; such rules are referred to as single-dimensional association rules.
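A rule over the buys predicate is commonly scored by its support and confidence. The sketch below assumes hypothetical baskets and item names: support is the fraction of all transactions that contain both sides of the rule, and confidence is the fraction of transactions containing the left-hand side that also contain the right-hand side.

```python
def rule_measures(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent => consequent."""
    left, right = set(antecedent), set(consequent)
    n = len(transactions)
    n_left = sum(1 for t in transactions if left <= t)
    n_both = sum(1 for t in transactions if (left | right) <= t)
    support = n_both / n if n else 0.0
    confidence = n_both / n_left if n_left else 0.0
    return support, confidence

# Hypothetical baskets; the item names are illustrative only.
baskets = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"computer", "software", "printer"},
]

support, confidence = rule_measures(baskets, {"computer"}, {"software"})
print(support, confidence)
```

A rule is usually reported only if both numbers exceed user-chosen minimum thresholds.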
We may find association rules like: [rule shown as an image on the original slide]
This rule is an association among more than one attribute (i.e., age, income, and buys), so it is a multidimensional association rule.
Classification
Classification and label prediction
Construct models (functions) based on some training examples
Describe and distinguish classes or concepts for future
prediction
E.g., classify countries based on (climate), or classify cars
based on (gas mileage)
Predict some unknown class labels
Typical methods
Decision trees, naïve Bayesian classification, support vector
machines, neural networks, rule-based classification, pattern-
based classification, logistic regression, …
Typical applications:
Credit card fraud detection, direct marketing, classifying stars,
diseases, web-pages, …
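A minimal sketch of the train-then-predict idea, using a decision stump (a decision tree with a single split) to classify cars by gas mileage. The training data, the label names, and the function names are hypothetical; the stump picks the mileage threshold that misclassifies the fewest training examples.

```python
def train_stump(examples, high_label, low_label):
    """Learn a threshold on one numeric attribute.

    examples: list of (value, label) pairs.
    Assumes at least two distinct values are present.
    """
    values = sorted({value for value, _ in examples})
    # Candidate thresholds: midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(threshold):
        return sum(
            1
            for value, label in examples
            if (high_label if value >= threshold else low_label) != label
        )

    return min(candidates, key=errors)

def predict(threshold, value, high_label, low_label):
    """Apply the learned rule to a new, unlabeled value."""
    return high_label if value >= threshold else low_label

# Hypothetical training examples: (miles per gallon, class label).
training = [
    (15, "inefficient"), (18, "inefficient"), (22, "inefficient"),
    (31, "efficient"), (35, "efficient"), (40, "efficient"),
]

threshold = train_stump(training, "efficient", "inefficient")
label_for_28 = predict(threshold, 28, "efficient", "inefficient")
label_for_20 = predict(threshold, 20, "efficient", "inefficient")
print(threshold, label_for_28, label_for_20)
```

The learned model is equivalent to one IF-THEN rule; full decision trees repeat this kind of split recursively over many attributes.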
A classification model can be represented in various forms: (a)
IF-THEN rules, (b) a decision tree, or (c) a neural network.
Clustering
Unsupervised learning (i.e., class labels are unknown)
Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns
Principle: maximize intraclass similarity and minimize interclass similarity
Many methods and applications
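One widely used method is k-means, sketched below in plain Python. The slides do not name a specific algorithm, so this choice, the sample points, and the fixed starting centroids are assumptions made for illustration. Each round assigns every point to its nearest centroid and then moves each centroid to the mean of its assigned points.

```python
from math import dist  # Euclidean distance, available in Python 3.8+

def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))

def kmeans(points, centroids, iterations=10):
    """Run a fixed number of k-means rounds from the given start centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Move each centroid to the mean of its cluster; keep it if empty.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels

# Hypothetical 2-D customer locations forming two obvious groups.
points = [
    (1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
    (8.0, 8.0), (9.0, 8.5), (8.5, 9.0),
]
centroids, labels = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)
```

Starting centroids are passed in explicitly so the run is repeatable; in practice they are usually chosen at random and the result can depend on that choice.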
A 2-D plot of customer data with respect to customer locations
in a city, showing three data clusters.
The output takes the form of a diagram that shows how
the instances fall into clusters.
Different cases:
Simple 2D representation: involves associating a cluster
number with each instance
Venn diagram: allows one instance to belong to more than one cluster
Probabilistic assignment: associate instances with clusters
probabilistically
Dendrogram: produces a hierarchical structure of clusters
(dendron is the Greek word for tree)
Outlier Analysis
Outlier: a data object that does not comply with the general behavior of the data
Noise or exception?
Methods: clustering or regression analysis, …
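A minimal sketch of the regression approach, assuming hypothetical data and a hypothetical cutoff: fit a least-squares line, then flag points whose residual is more than k standard deviations of the residuals away from the line. The function name, the sample values, and the choice k = 2 are illustrative assumptions.

```python
from statistics import mean, pstdev

def regression_outliers(xs, ys, k=2.0):
    """Indices of points far from the least-squares line.

    Assumes xs contains at least two distinct values.
    """
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    spread = pstdev(residuals)
    if spread == 0:
        return []  # every point lies exactly on the line
    return [i for i, r in enumerate(residuals) if abs(r) > k * spread]

# Hypothetical data: y is roughly 2x, except for one unusual point.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2, 4, 6, 8, 30, 12, 14, 16]

outliers = regression_outliers(xs, ys)
print(outliers)
```

Whether a flagged point is noise to discard or an exception worth studying is a judgment the analyst still has to make.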
Are All Patterns Interesting?
Data mining may generate thousands of patterns; not all of them are interesting.
What makes a pattern interesting?
Easily understood by humans
Valid on new or test data
Novel
Potentially useful
Validates some hypothesis that a user seeks to confirm
Objective vs. subjective interestingness measures
Objective: based on statistics and structures of patterns, e.g.,
support, confidence, etc.
Subjective: based on the user’s beliefs about the data, e.g., unexpectedness, novelty, etc.
What Technologies Are Used?
What Kinds of Applications Are Targeted?
Web page analysis: from web page classification and clustering to the PageRank and HITS algorithms
Recommender systems
Basket data analysis
Biological and medical data analysis: classification, cluster analysis, biological sequence analysis, biological network analysis
Summary
Data mining: discovering interesting patterns and knowledge from massive amounts of data
A natural evolution of database technology, in great
demand, with wide applications
A KDD process includes data cleaning, data integration,
data selection, transformation, data mining, pattern
evaluation, and knowledge presentation
Mining can be performed on a variety of data
Data mining functionalities: characterization, discrimination,
association, classification, clustering, outlier and trend
analysis, etc.