Data Mining
Lecture 1
Course Outline
1. Introduction to data mining
2. Data Pre-processing
3. Information Retrieval
4. Associations & Rule Generation
5. Classification and Prediction
6. ML Algorithms and Models
7. Clustering
8. Correlation analysis
Course Description
• Through this course students can learn:
• Basic principles, techniques, tools and applications of Data
Mining
• The concepts of data pre-processing, cluster analysis,
classification, prediction and frequent pattern mining
• Science of data mining as the automatic extraction of
patterns representing knowledge stored in large databases,
data warehouses, and other massive data repositories
Textbooks
• Textbook:
• Data Mining: Concepts and Techniques (Latest Edition) by
Jiawei Han and Micheline Kamber
• Reference book:
• Elements of Statistical Learning by Hastie, Tibshirani and
Friedman
• Freely available online
What Is Data Mining?
• Data mining is the process of sorting through large
amounts of data and picking out relevant information
• The extraction of knowledge from data is called data
mining
• Data mining can also be defined as the exploration
and analysis of large quantities of data in order to
discover meaningful patterns and rules
• The ultimate goal of data mining is to discover
knowledge
What Is Data Mining?
• Alternative names
• Knowledge discovery (mining) in databases (KDD),
knowledge extraction, data/pattern analysis, data
archeology, data dredging, information harvesting,
business intelligence, etc.
Data Rich, Information Poor
Motivation
• Lots of data is being collected and warehoused
  • Web data, e-commerce
  • Purchases at department/grocery stores
  • Bank/credit card transactions
• Computers have become cheaper and more powerful
• Data is collected and stored at enormous speeds (GB/hour)
Motivation
• Traditional techniques are infeasible for raw data
• Human analysts may take weeks to discover useful information
• We are drowning in data, but starving for knowledge!
• Data mining may help scientists in classifying and segmenting data
Why is data mining important?
• Rapid computerization of businesses produces huge amounts of data
• How can we make the best use of this data?
• A growing realization: knowledge discovered from data can be used
for competitive advantage
• Classification and future prediction
Why is data mining important?
• Data analysis and decision support
• Market analysis and management
• Risk analysis and management
• Fraud detection and detection of unusual
patterns (outliers)
• Other Applications
• Text mining (newsgroups, email) and Web mining
• Stream data mining
Why is data mining important?
• Ex. 1: Market Analysis and Management
• Target marketing
• Cross-market analysis
• Customer profiling
• Customer requirement analysis
• Ex. 2: Fraud Detection & Mining Unusual
Patterns
• Auto insurance
• Money laundering
• Medical insurance
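The "unusual patterns" in Ex. 2 are often framed as outlier detection. A minimal sketch, assuming a made-up list of transaction amounts and a simple z-score rule (the `zscore_outliers` helper and the threshold of 2.0 are illustrative choices, not a method prescribed by the course):

```python
from statistics import mean, stdev

# Hypothetical transaction amounts; one value is far from the rest.
amounts = [42.0, 38.5, 51.0, 45.2, 39.9, 47.3, 980.0, 44.1]

def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` sample standard
    deviations away from the mean."""
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []  # all values identical: nothing can be flagged
    return [v for v in values if abs(v - mu) / sigma > threshold]

outliers = zscore_outliers(amounts)
print(outliers)
```

A single extreme value inflates the standard deviation itself, so on real data a more robust rule (for example one based on the median) may be preferable.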
Why is data mining important?
• Ex. 3: Biomedical Applications
• Approaches: Clustering & Classification
• Applications:
• Automated diagnosis
• Discovery of disease trends
• Prediction of epidemics
• Discovering causes for certain conditions
• Patient data retrieval
Data Mining: Combination of Multiple Disciplines
[Figure: data mining at the center, drawing on the surrounding disciplines]
• Database technology
• Statistics
• Machine learning
• Visualization
• Pattern recognition
• Artificial intelligence
• Algorithms
Knowledge Discovery (KDD) Process
Data mining is the core of the knowledge discovery process
[Figure: KDD pipeline]
Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation
KDD Process: Several Key Steps
• Learning the application domain
  • Relevant prior knowledge and goals of the application
• Creating a target data set: data selection
• Data cleaning and preprocessing (may take 60% of the effort!)
• Data reduction and transformation
  • Find useful features, dimensionality/variable reduction, invariant representation
• Choosing functions of data mining
  • Summarization, classification, regression, association, clustering
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation
  • Visualization, transformation, removing redundant patterns, etc.
• Use of discovered knowledge
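The steps above can be sketched end to end on a toy data set. This is only an illustration under stated assumptions: the records, the field names (`age`, `income`, `buys`) and the helper functions are hypothetical, and the "mining" step is a deliberately simple search for a single threshold rule.

```python
# Hypothetical raw records; None marks a missing value.
raw = [
    {"age": 25, "income": 30000, "buys": "no"},
    {"age": 40, "income": None, "buys": "yes"},
    {"age": 35, "income": 60000, "buys": "yes"},
    {"age": 22, "income": 20000, "buys": "no"},
    {"age": 50, "income": 80000, "buys": "yes"},
]

def clean(records):
    """Data cleaning: drop records that contain a missing value."""
    return [r for r in records if all(v is not None for v in r.values())]

def select(records, fields):
    """Data selection/reduction: keep only the task-relevant fields."""
    return [{f: r[f] for f in fields} for r in records]

def mine_threshold(records, feature, target, positive):
    """Data mining: find the threshold t for which the rule
    'feature >= t implies positive' classifies the most records correctly."""
    best_t, best_correct = None, -1
    for t in sorted({r[feature] for r in records}):
        correct = sum((r[feature] >= t) == (r[target] == positive)
                      for r in records)
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t, best_correct / len(records)

cleaned = clean(raw)                                  # cleaning
task_data = select(cleaned, ["income", "buys"])       # selection
threshold, accuracy = mine_threshold(task_data, "income", "buys", "yes")

# Pattern evaluation and presentation: report the rule and how well it fits.
print(f"Rule: income >= {threshold} -> buys = yes (accuracy {accuracy:.2f})")
```

Here the accuracy is measured on the same records used to find the rule, which is enough to show the flow of the steps but would overstate quality on real data.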
Data Mining and Business Intelligence
[Figure: pyramid; the potential to support business decisions increases toward the top. Layers from top to bottom, with the typical role in parentheses]
• Decision Making (End User)
• Data Presentation: Visualization Techniques (Business Analyst)
• Data Mining: Information Discovery (Data Analyst)
• Data Exploration: Statistical Summary, Querying, and Reporting
• Data Preprocessing/Integration, Data Warehouses (DBA)
• Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
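Architecture: Typical Data Mining System
[Figure: layered architecture, top to bottom; the Knowledge-Base sits beside the Pattern Evaluation and Data Mining Engine layers]
• Graphical User Interface
• Pattern Evaluation
• Data Mining Engine
• Database or Data Warehouse Server
• Data cleaning, integration, and selection
• Data sources: Database, Data Warehouse, World-Wide Web, Other Info Repositories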
Evolution of Science
• Before 1600: Empirical Science
• 1600-1950s: Theoretical Science
• 1950s-1990s: Computational Science
• 1990-now: Data Science
  • The flood of data from new scientific instruments and simulations
  • The ability to economically store and manage petabytes of data online
  • The Internet and computing Grid that make all these archives universally accessible
Evolution of Database Technology
1960s:
Data collection, database creation, IMS and network DBMS
1970s:
Relational data model, relational DBMS implementation
1980s:
RDBMS, advanced data models (extended-relational, OO,
deductive, etc.) and application-oriented DBMS (spatial,
scientific, engineering, etc.)
1990s-2000s:
Data mining and data warehousing, multimedia databases,
and Web databases
Evolution of Database Technology
| Evolutionary Step | Business Question | Enabling Technologies | Product Providers | Characteristics |
| --- | --- | --- | --- | --- |
| Data Collection (1960s) | "What was my total revenue in the last five years?" | Computers, tapes, disks | IBM | Static data delivery |
| Data Access (1980s) | "What were unit sales last March?" | Relational databases (RDBMS), Structured Query Language (SQL), ODBC | Oracle, Sybase, Informix, IBM, Microsoft | Dynamic data delivery at record level |
| Data Warehousing (1990s) | "What were unit sales in New England last March? Drill down to Boston." | Multidimensional databases, data warehouses | Oracle, Pilot | Dynamic data delivery at multiple levels |
| Data Mining (emerging today) | "What's likely to happen to Boston unit sales next month? Why?" | Advanced algorithms, massive databases | Pilot, Lockheed, IBM, SGI, numerous startups (nascent industry) | Prospective, proactive information delivery |
Data Warehouse example
Data Warehouses: Data warehousing is defined as a process of
centralized data management and retrieval
A data warehouse is a repository of information collected from multiple sources, stored
under a unified schema, and usually residing at a single site
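The idea of "multiple sources, stored under a unified schema" can be sketched as follows. Everything here is hypothetical (the two sources, their field names, and the target schema `product`/`units`/`city`/`channel`); real warehouses use dedicated integration tooling rather than hand-written mappers.

```python
from collections import defaultdict

# Two hypothetical sources that describe sales with different field names.
store_sales = [
    {"sku": "A1", "qty": 2, "city": "Boston"},
    {"sku": "B7", "qty": 1, "city": "Providence"},
]
web_orders = [
    {"product_id": "A1", "quantity": 3, "ship_city": "Boston"},
]

def from_store(row):
    """Map a store record onto the unified schema."""
    return {"product": row["sku"], "units": row["qty"],
            "city": row["city"], "channel": "store"}

def from_web(row):
    """Map a web order onto the same unified schema."""
    return {"product": row["product_id"], "units": row["quantity"],
            "city": row["ship_city"], "channel": "web"}

# The "warehouse": one collection with one schema, regardless of origin.
warehouse = ([from_store(r) for r in store_sales]
             + [from_web(r) for r in web_orders])

# A centralized query that neither source could answer on its own.
units_by_city = defaultdict(int)
for row in warehouse:
    units_by_city[row["city"]] += row["units"]

print(dict(units_by_city))
```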
The Process of Data Mining
There are 3 main steps in the data mining process:
• Preparation: data is selected from the warehouse and "cleansed"
• Processing: different algorithms are used to process the data in order to make predictions
• Analysis: the output is evaluated
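The three steps can be illustrated with a small sketch. The customer rows, the labels (`loyal`, `casual`) and the choice of a 1-nearest-neighbour classifier are assumptions made for the example; any other algorithm could stand in for the processing step.

```python
# Hypothetical warehouse extract: (monthly_spend, visits, label); None = missing.
warehouse = [
    (120.0, 4, "loyal"), (15.0, 1, "casual"), (200.0, 6, "loyal"),
    (None, 2, "casual"), (30.0, 1, "casual"), (180.0, 5, "loyal"),
    (25.0, 2, "casual"), (150.0, 5, "loyal"),
]

# 1. Preparation: select complete rows ("cleansing"), then hold out a test set.
rows = [r for r in warehouse if None not in r]
train, test = rows[:5], rows[5:]

# 2. Processing: a 1-nearest-neighbour classifier makes the predictions.
def predict(x, training_rows):
    """Return the label of the training row closest to x (squared distance)."""
    def dist2(row):
        return (row[0] - x[0]) ** 2 + (row[1] - x[1]) ** 2
    return min(training_rows, key=dist2)[2]

predictions = [predict(r[:2], train) for r in test]

# 3. Analysis: evaluate the output against the known labels.
accuracy = sum(p == r[2] for p, r in zip(predictions, test)) / len(test)
print(predictions, accuracy)
```

The features are left unscaled to keep the sketch short, so `monthly_spend` dominates the distance; normalizing the features would normally be part of preparation.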
Reasons for growing popularity
• Growing data volume: an enormous amount of existing and
newly appearing data requires processing
• Limitations of human analysis: humans can lack objectivity
when analyzing data
• Low cost of machine learning: the data mining process costs
less than hiring highly trained professionals to analyze data
Applications of Data Mining
Data mining is applied in the following areas:
• Stock market prediction: predicting future trends
• Bankruptcy prediction: prediction based on computer-generated rules,
using models
• Foreign exchange market: data mining is used to identify trading rules
• Fraud detection: construction of algorithms and models that
help recognize a variety of fraud patterns
Results of Data Mining Include:
• Forecasting what may happen in the future
• Classifying people or things into groups by recognizing patterns
• Clustering people or things into groups based on their attributes
• Associating what events are likely to occur together
• Sequencing what events are likely to lead to later events
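One of the result types above, association, can be illustrated by counting which items occur together. The baskets below are made up, and `support` and `confidence` are the usual association-rule measures written as small hypothetical helpers rather than calls into any particular library.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets (one set of items per transaction).
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"butter", "milk"},
    {"bread", "butter", "milk"},
]

# Count how often each item and each (alphabetically ordered) pair occurs.
item_counts = Counter(item for b in baskets for item in b)
pair_counts = Counter(
    pair for b in baskets for pair in combinations(sorted(b), 2)
)

def support(pair):
    """Fraction of baskets containing both items of an ordered pair."""
    return pair_counts[pair] / len(baskets)

def confidence(antecedent, consequent):
    """Of the baskets containing `antecedent`, the fraction that also
    contain `consequent`."""
    pair = tuple(sorted((antecedent, consequent)))
    return pair_counts[pair] / item_counts[antecedent]

print(support(("bread", "milk")))
print(confidence("butter", "milk"))
print(confidence("milk", "butter"))
```

Confidence is not symmetric: in this toy data every basket with butter also has milk, but not every basket with milk has butter.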
Data Mining Functions
Two types of model:
• Predictive models predict unknown values based on known data
• Descriptive models identify patterns in data
Each type has several sub-categories, each of which has many algorithms
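The difference between the two types can be shown on the same toy data. The observations are invented, and least-squares linear regression is used only as one convenient example of a predictive model; the summary dictionary stands in for a descriptive one.

```python
from statistics import mean

# Hypothetical (advertising_spend, unit_sales) observations.
data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]
xs = [x for x, _ in data]
ys = [y for _, y in data]

# Descriptive: summarize patterns already present in the data.
summary = {"mean_spend": mean(xs), "mean_sales": mean(ys), "max_sales": max(ys)}

# Predictive: fit y = a*x + b by least squares, then estimate an unseen value.
x_bar, y_bar = mean(xs), mean(ys)
a = (sum((x - x_bar) * (y - y_bar) for x, y in data)
     / sum((x - x_bar) ** 2 for x in xs))
b = y_bar - a * x_bar

def predict(x):
    """Predicted unit sales for a spend value not in the data."""
    return a * x + b

print(summary)
print(predict(5.0))
```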
Data Mining Functions