Chapter 1. Introduction
Why Data Mining?
What Is Data Mining?
What Kind of Data Can Be Mined?
What Kinds of Patterns Can Be Mined?
What Technology is Used?
What Kind of Applications Are Targeted?
Major Issues in Data Mining
Summary
Why Data Mining?
The Explosive Growth of Data: from terabytes to petabytes
Data collection and data availability
Automated data collection tools, database systems, Web,
computerized society
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, …
Science: Remote sensing, bioinformatics, scientific simulation, …
Society and everyone: digital cameras, YouTube, social media
We are drowning in data, but starving for knowledge!
Data mining—Automated analysis of massive data sets to discover
knowledge
What Is Data Mining?
Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) patterns or knowledge from
huge amounts of data
Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, business intelligence, etc.
Is everything “data mining”? No. Differentiate it from:
Simple search and query processing
(Deductive) expert systems
Knowledge Discovery (KDD) Process
This is a view from typical database systems and data
warehousing communities
Data mining plays an essential role in the knowledge discovery
process
[Figure: Databases → Data Cleaning / Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
KDD Process: A Typical View from ML and Statistics
Input Data → Data Pre-Processing → Data Mining → Post-Processing
Data Pre-Processing: data integration, normalization, feature selection, dimension reduction, …
Data Mining: pattern discovery, association & correlation, classification, clustering, outlier analysis, …
Post-Processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization
This is a view from typical machine learning and statistics communities
Data Mining: On What Kinds of Data?
Database-oriented data sets and applications
Relational database, data warehouse, transactional database
Advanced data sets and advanced applications
Data streams and sensor data
Time-series data, temporal data, sequence data (incl. bio-sequences)
Structured data, graphs, social networks and multi-linked data
Object-relational databases
Heterogeneous databases and legacy databases
Spatial data and spatiotemporal data
Multimedia databases
Text databases
The World-Wide Web
Data Mining Function: (1) Generalization
Information integration and data warehouse construction
Data cleaning, transformation, integration, and
multidimensional data model
Data cube technology
Scalable methods for computing (i.e., materializing)
multidimensional aggregates
OLAP (online analytical processing)
Multidimensional concept description: Characterization
and discrimination
Generalize, summarize, and contrast data
characteristics
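A data cube materializes one aggregate table per subset of the dimensions. The following is a minimal sketch of that idea, not a scalable method: the function name data_cube, the toy sales rows, and the choice of SUM as the aggregate are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def data_cube(rows, dimensions, measure):
    """Sum `measure` for every subset of `dimensions` (one cuboid per subset)."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for group_by in combinations(dimensions, size):
            totals = defaultdict(float)
            for row in rows:
                # The key is the tuple of dimension values for this cuboid.
                key = tuple(row[d] for d in group_by)
                totals[key] += row[measure]
            cube[group_by] = dict(totals)
    return cube


# Hypothetical sales records with two dimensions and one measure.
sales = [
    {"city": "A", "item": "tv", "amount": 300.0},
    {"city": "A", "item": "phone", "amount": 200.0},
    {"city": "B", "item": "tv", "amount": 500.0},
    {"city": "B", "item": "tv", "amount": 100.0},
]

cube = data_cube(sales, ["city", "item"], "amount")
for group_by, totals in cube.items():
    print(group_by, totals)
```

With two dimensions the cube holds four cuboids: the grand total, totals by city, totals by item, and totals by (city, item). The number of cuboids doubles with each added dimension, which is why scalable materialization methods matter.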
Data Mining Function: (2) Association and
Correlation Analysis
Frequent patterns (or frequent itemsets)
What items are frequently purchased together by a
customer?
Association, correlation vs. causality
A typical association rule
Bread → Peanut Butter [0.5%, 75%] (support, confidence)
Support reflects utility while confidence reflects certainty of the conclusion
A high confidence value does not necessarily indicate a strong
correlation between the items.
If 80% of transactions contain Peanut Butter, the above rule reflects a negative
association between the two, because 75% is below that 80% baseline.
Additional correlation measures such as lift are used to detect such
cases when mining patterns and rules in large datasets.
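The three measures can be computed directly from transaction counts. The sketch below is illustrative only: the function name rule_metrics and the 20 toy transactions are assumptions, chosen so that the rule has 75% confidence while 80% of transactions contain peanut butter. The support in the toy data differs from the 0.5% on the slide.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    a, c = set(antecedent), set(consequent)
    n = len(transactions)
    count_a = sum(1 for t in transactions if a <= t)
    count_c = sum(1 for t in transactions if c <= t)
    count_both = sum(1 for t in transactions if (a | c) <= t)
    if count_a == 0 or count_c == 0:
        raise ValueError("antecedent and consequent must each occur at least once")
    support = count_both / n              # P(A and C)
    confidence = count_both / count_a     # P(C | A)
    lift = confidence / (count_c / n)     # P(C | A) / P(C)
    return support, confidence, lift


# Hypothetical toy data: 20 transactions.
transactions = (
    [{"bread", "peanut butter"}] * 3
    + [{"bread"}] * 1
    + [{"peanut butter"}] * 13
    + [{"milk"}] * 3
)

support, confidence, lift = rule_metrics(transactions, {"bread"}, {"peanut butter"})
print(support, confidence, lift)
```

A lift below 1 means the consequent is less likely when the antecedent is present than it is overall, which is the negative association described above. A lift above 1 indicates a positive association, and a lift of 1 indicates independence.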
Data Mining Function: (3) Classification
Classification and label prediction
Construct models (functions) based on labelled training examples
Describe and distinguish classes or concepts for future prediction
E.g., classify tourist locations based on (climate, affordability,
activities, # days, etc.), or estimate the cost of used cars based
on (mileage, age, model, fuel type, etc.)
Apply the models to predict class labels for unknown entities
Typical methods
Decision trees, naïve Bayesian classification, support vector
machines, neural networks, rule-based classification, pattern-
based classification, logistic regression, …
Typical applications:
Credit card fraud detection, direct marketing, diagnosing diseases,
etc.
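The construct-then-apply cycle can be sketched with the simplest possible decision tree, a single-split "decision stump" on one numeric feature. This is only an illustration: the function names, the transaction amounts, and the fraud labels are hypothetical, and real classifiers use many features and more robust split criteria.

```python
def train_stump(values, labels):
    """Learn a threshold from labelled examples: predict 1 when value > threshold.

    Tries the midpoint between each pair of adjacent distinct values and keeps
    the one that misclassifies the fewest training examples.
    """
    best_threshold, best_errors = None, len(values) + 1
    candidates = sorted(set(values))
    for low, high in zip(candidates, candidates[1:]):
        threshold = (low + high) / 2
        errors = sum((v > threshold) != bool(y) for v, y in zip(values, labels))
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def predict(threshold, value):
    """Apply the learned model to an unseen example."""
    return 1 if value > threshold else 0


# Hypothetical training set: transaction amount and whether it was fraudulent.
amounts = [12.0, 25.0, 40.0, 55.0, 900.0, 1200.0, 1500.0]
is_fraud = [0, 0, 0, 0, 1, 1, 1]

threshold = train_stump(amounts, is_fraud)
print(threshold, predict(threshold, 30.0), predict(threshold, 1000.0))
```

The first function corresponds to constructing a model from labelled training examples, the second to applying it to predict class labels for unknown entities.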
Data Mining Function: (4) Cluster Analysis
Unsupervised learning (i.e., the class label is unknown)
Group data to form new categories (i.e., clusters) of similar
entities,
e.g., cluster houses to find distribution patterns or neighborhoods
Principle: Maximizing intra-cluster similarity & minimizing inter-
cluster similarity
Typical methods:
Partitional clustering, e.g., K-means, K-medoids
Hierarchical clustering, e.g., AGNES, DIANA
Density-based clustering, e.g., DBSCAN, OPTICS
Applications: customer segmentation, taxonomy formation, topic
identification by document clustering, image quantization, pattern
recognition, etc.
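As an illustration of partitional clustering, here is a minimal K-means sketch. It assumes Euclidean distance, caller-supplied initial centroids, and a fixed iteration cap; the house coordinates are hypothetical. Production implementations add smarter initialization and convergence checks.

```python
import math


def kmeans(points, initial_centroids, max_iterations=10):
    """Alternate between assigning points to the nearest centroid and
    recomputing each centroid as the mean of its assigned points."""
    centroids = [tuple(c) for c in initial_centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                # Mean of each coordinate over the cluster members.
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids, clusters


# Hypothetical house locations forming two neighborhoods.
houses = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(houses, [(1, 1), (8, 8)])
print(centroids)
print(clusters)
```

Each iteration moves centroids toward the center of their members, which raises similarity within a cluster and keeps the clusters apart.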
Data Mining Function: (5) Outlier Analysis
Outlier analysis
Outlier: A data object that does not comply with the general
behavior of the data
Noise or exception? ― One person’s garbage could be another
person’s treasure
Methods: by-product of clustering or regression analysis, …
Useful in fraud detection, rare events analysis
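Outliers as a by-product of regression can be sketched as follows: fit a least-squares line and flag the points whose residual is unusually large. The cutoff of k times the root-mean-square residual, the function name, and the data are assumptions made for this example, not a standard prescribed by the slides.

```python
import math


def regression_outliers(xs, ys, k=2.0):
    """Return indices of points whose residual from the least-squares line
    exceeds k times the root-mean-square residual."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    spread = math.sqrt(sum(r * r for r in residuals) / n)
    return [i for i, r in enumerate(residuals) if abs(r) > k * spread]


# Hypothetical data following y = 2x, except for one exceptional point.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2, 4, 6, 8, 40, 12, 14, 16]

outliers = regression_outliers(xs, ys)
print(outliers)
```

Whether the flagged point is noise to discard or an exception worth investigating (for example a fraudulent transaction) depends on the application.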
Evaluation of Patterns
Interesting patterns represent knowledge
Are all mined patterns interesting?
One can mine a tremendous amount of “patterns” and knowledge
Patterns are interesting if they are:
Easily understood
Valid on new or unknown data with a high degree of certainty
Potentially useful and novel
Evaluate mined patterns, or directly mine only interesting
patterns / knowledge, using:
Objective measures: typicality, support, and confidence for
descriptive tasks; precision, recall, accuracy, etc. for
predictive tasks
Subjective assessments: novelty, timeliness, and actionability
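For predictive tasks, precision, recall, and accuracy follow from counting true and false positives and negatives. The sketch below assumes binary labels with 1 as the positive class; the function name and the label vectors are hypothetical.

```python
def evaluate(actual, predicted, positive=1):
    """Return (precision, recall, accuracy) for binary predictions."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0  # correct among predicted positives
    recall = tp / (tp + fn) if tp + fn else 0.0     # found among actual positives
    accuracy = (tp + tn) / len(pairs)               # correct overall
    return precision, recall, accuracy


# Hypothetical ground truth and model output for ten examples.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

precision, recall, accuracy = evaluate(actual, predicted)
print(precision, recall, accuracy)
```

The three numbers can disagree, so which one matters most depends on the cost of false alarms versus missed cases in the application.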
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by Machine Learning, Pattern Recognition, Statistics, Info Retrieval, Visualization, Algorithm, Database Technology, and High-Performance Computing]
Why Confluence of Multiple Disciplines?
Tremendous amount of data
Algorithms must be highly scalable to handle terabytes of data
High-dimensionality of data
Micro-array may have tens of thousands of dimensions
High complexity of data
Data streams and sensor data
Time-series data, temporal data, sequence data
Structured data, graphs, social networks and multi-linked data
Heterogeneous databases and legacy databases
Spatial, spatiotemporal, multimedia, text and Web data
Software programs, scientific simulations
New and sophisticated applications
Applications of Data Mining
Business Intelligence systems: Customer Relationship Management,
Predictive analytics for specific contexts, OLAP support for better
understanding of business scenarios
Web page analysis: Web page classification and clustering for search
engines, ranking with the PageRank & HITS algorithms, context-aware query
recommendations
Collaborative Filtering & Recommender systems
Market Basket analysis for targeted marketing
Medical data analysis: disease diagnosis, anomaly detection in medical
images, microarray data analysis
Weather modelling and prediction of future climatic conditions
Major Data Mining Issues related to …
Mining Methodology
Mining various, possibly new, kinds of knowledge
Mining knowledge in (the subspaces of) a multi-dimensional space
Data mining: An interdisciplinary effort (e.g., Q&A systems need NLP,
information retrieval, and mining)
Boosting the power of discovery in a networked environment (info
sharing among semantically linked heterogeneous data sources)
Handling noise, uncertainty, and incompleteness of data; sometimes
incorrect data due to attackers
Pattern evaluation and pattern- or constraint-guided mining (to focus
mining on specific topics or aspects of interest, context-aware
recommender systems, etc.)
User Interaction
Interactive mining (dynamically change focus based on previous results)
Incorporation of background knowledge (domain-specific relationships)
Presentation and visualization of data mining results
Major Data Mining Issues related to …
Efficiency and Scalability
Efficiency and scalability of data mining algorithms
Parallel, distributed, stream, and incremental mining methods
Diversity of data types
Handling complex types of data
Mining dynamic, networked, and global data repositories
Data mining and society
Social impacts of data mining
Privacy-preserving data mining
Invisible data mining