Data Mining: Concepts and Techniques
By R. Deepa (IT),
Batch: 2016-2019,
Department of CS & IT,
Nadar Saraswathi College of Arts and Science, Theni.
CHAPTER 1: INTRODUCTION
• Evolution of Data Mining Technology
• What is Data Mining?
• Data Mining Tasks
• Data Mining: On What Kind of Data?
• Are All the Patterns Interesting?
• Data Mining: A KDD Process
• Classification of Data Mining Systems
• Major Issues in Data Mining
• Summary
EVOLUTION OF DATA MINING TECHNOLOGY
Evolution Step | Business Question | Enabling Technologies
Data Collection (1960s) | “What was my total revenue in the last five years?” | Computers, tapes, disks
Data Access (1980s) | “What were unit sales in New England last March?” | Relational databases (RDBMS), Structured Query Language (SQL), ODBC
Data Warehousing & Decision Support (1990s) | “What were unit sales in New England last March? Drill down to Boston.” | On-line analytic processing (OLAP), multidimensional databases, data warehouses
Data Mining (Emerging Today) | “What’s likely to happen to Boston unit sales next month? Why?” | Advanced algorithms, multiprocessor computers, massive databases
R. Deepa, IT | Data Mining: Concepts and Techniques
WHAT IS DATA MINING?
Data Mining:
Data mining refers to extracting or “mining” knowledge
from large amounts of data.
• Extraction of interesting (non-trivial, implicit, previously
unknown, and potentially useful) information or patterns from data
in large databases.
Alternative Names:
Knowledge Discovery from Data (KDD),
Knowledge extraction,
Data / pattern analysis,
Data archeology,
Data dredging,
Information harvesting,
Business intelligence, etc.
DECISIONS IN DATA MINING:
Databases to be mined
Relational, transactional, object-oriented, object-relational, spatial, time-series, text, legacy, multimedia, heterogeneous, WWW, etc.
Knowledge to be mined
Association, classification, clustering, etc.
Techniques utilized
Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, neural networks, etc.
Applications adapted
Retail, telecommunication, banking, fraud analysis, DNA mining, stock market analysis, Web mining, Weblog analysis, etc.
DATA MINING TASKS
• Prediction Tasks:
Use some variables to predict unknown or future values of other variables.
• Description Tasks:
Find human-interpretable patterns that describe the data.
Common Data Mining Tasks:
• Classification (predictive)
• Clustering (descriptive)
• Association Rule Discovery (descriptive)
• Sequential Pattern Discovery (descriptive)
• Regression (predictive)
• Deviation Detection (predictive)
DATA MINING - A KDD PROCESS
[Figure: the KDD process. Databases and flat files -> Cleaning and Integration -> Data Warehouse -> Selection and Transformation -> Data Mining -> Patterns]
7 STEPS IN THE KDD PROCESS
1. Data Cleaning:
to remove noise and inconsistent data
2. Data Integration:
where multiple data sources may be combined
3. Data Selection:
where data relevant to the analysis task are retrieved from the database
4. Data Transformation:
where data are transformed and consolidated into forms appropriate for
mining by performing summary or aggregation operations
5. Data Mining:
an essential process where intelligent methods are applied to extract
data patterns
6. Pattern Evaluation:
to identify the truly interesting patterns representing knowledge, based
on interestingness measures
7. Knowledge Presentation:
where visualization and knowledge representation techniques are used to
present mined knowledge to users
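The seven steps can be read as a chain of small functions, each feeding the next. The sketch below is only an illustration under stated assumptions: the sales records, the function names, and the "above average" pattern are all hypothetical and stand in for whatever data and mining method a real system would use.

```python
from statistics import mean

# Hypothetical raw records from two sources; None marks a missing value.
store_a = [{"city": "Boston", "month": "Mar", "units": 120},
           {"city": "Boston", "month": "Mar", "units": None}]
store_b = [{"city": "boston", "month": "Mar", "units": 80},
           {"city": "Salem", "month": "Mar", "units": 40}]

def clean(records):
    """Step 1: drop records with missing values and normalize city names."""
    return [{**r, "city": r["city"].title()} for r in records if r["units"] is not None]

def integrate(*sources):
    """Step 2: combine several data sources into one list."""
    return [r for src in sources for r in src]

def select(records, month):
    """Step 3: keep only the records relevant to the analysis task."""
    return [r for r in records if r["month"] == month]

def transform(records):
    """Step 4: aggregate unit sales per city."""
    totals = {}
    for r in records:
        totals[r["city"]] = totals.get(r["city"], 0) + r["units"]
    return totals

def mine(totals):
    """Step 5: a deliberately trivial 'pattern': cities selling above the average."""
    avg = mean(totals.values())
    return {city: units for city, units in totals.items() if units > avg}

def evaluate(patterns, min_units):
    """Step 6: keep only patterns that pass an interestingness threshold."""
    return {c: u for c, u in patterns.items() if u >= min_units}

def present(patterns):
    """Step 7: render the mined knowledge for the user."""
    return [f"{city}: {units} units (above average)" for city, units in sorted(patterns.items())]

data = integrate(clean(store_a), clean(store_b))
totals = transform(select(data, "Mar"))
report = present(evaluate(mine(totals), min_units=100))
print(report)
```

Keeping each step as its own function mirrors the slide: any stage can be swapped (for example, a different mining method in step 5) without touching the others.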
ARCHITECTURE OF A TYPICAL DATA MINING SYSTEM
[Figure: layered architecture, from top to bottom]
• Graphical user interface
• Pattern evaluation
• Data mining engine
• Database or data warehouse server
• Data cleaning, data integration, and filtering
• Databases and data warehouse
• Knowledge base (shown alongside the layers)
DATA MINING - ON WHAT KIND OF DATA?
1. Relational Databases
2. Data Warehouses
3. Transactional Databases
4. Advanced Data and Information Systems
• Object-Oriented and Object-Relational Databases
• Spatial and Spatiotemporal Databases
• Heterogeneous and Legacy Databases
• Text Databases
• Multimedia Databases
• Data Streams
• WWW
DATA MINING FUNCTIONALITIES
• Concept/Class Description: Characterization and Discrimination
Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions.
• Data Characterization is a summarization of the general characteristics or features of a target class of data.
• Data Discrimination is a comparison of the general features of target class data objects with the general features of objects from one or a set of contrasting classes.
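A small sketch may make the two terms concrete, using the slide's dry vs. wet regions. Everything here is an assumption for illustration: the region records, the 1000 mm rainfall cut-off, and the function names `characterize` and `discriminate` are hypothetical.

```python
from statistics import mean

# Hypothetical records: (region, annual rainfall in mm, mean temperature in C).
regions = [("A", 300, 31.0), ("B", 250, 29.0), ("C", 1800, 24.0), ("D", 2200, 22.0)]

def characterize(rows):
    """Characterization: summarize the general features of one class."""
    return {"count": len(rows),
            "avg_rain": mean(r[1] for r in rows),
            "avg_temp": mean(r[2] for r in rows)}

def discriminate(target, contrast):
    """Discrimination: compare the target summary with a contrasting class."""
    t, c = characterize(target), characterize(contrast)
    return {k: t[k] - c[k] for k in ("avg_rain", "avg_temp")}

dry = [r for r in regions if r[1] < 1000]    # target class (cut-off is an assumption)
wet = [r for r in regions if r[1] >= 1000]   # contrasting class

dry_summary = characterize(dry)
difference = discriminate(dry, wet)
print(dry_summary)
print(difference)
```

Characterization looks at one class in isolation; discrimination only becomes meaningful once a second class is supplied to compare against.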
• Association (Correlation and Causality)
* Multi-dimensional vs. single-dimensional association
* age(X, “20..29”) ^ income(X, “20..29K”) => buys(X, “PC”) [support = 2%, confidence = 60%]
* contains(T, “computer”) => contains(T, “software”) [support = 1%, confidence = 75%]
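Support and confidence for a rule such as computer => software can be computed by simple counting. The sketch below assumes a tiny hypothetical transaction list; the helper names `count`, `support`, and `confidence` are illustrative and not taken from any library.

```python
# Hypothetical transactions, each a set of purchased items.
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
    {"computer", "software"},
]

def count(itemset, db):
    """Number of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in db)

def support(itemset, db):
    """Fraction of all transactions that contain itemset."""
    return count(itemset, db) / len(db)

def confidence(antecedent, consequent, db):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    return count(antecedent | consequent, db) / count(antecedent, db)

s = support({"computer", "software"}, transactions)
c = confidence({"computer"}, {"software"}, transactions)
print(f"computer => software [support = {s:.0%}, confidence = {c:.0%}]")
# prints: computer => software [support = 60%, confidence = 75%]
```

Support measures how often the whole rule applies in the data, while confidence measures how reliable the implication is once the left-hand side holds.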
• Classification and Prediction
* Finding models (functions) that describe and distinguish classes or concepts for future prediction
* E.g., classify countries based on climate, or classify cars based on gas mileage.
Presentation: decision tree, classification rule, neural network
Prediction: predict some unknown or missing numerical values
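As an illustration of classifying cars by gas mileage, here is a minimal sketch of a one-level decision tree (a "decision stump"). The training data, the labels "low" and "high", and the assumption that the "high" class lies above the threshold are all hypothetical choices made for this example.

```python
# Hypothetical training data: (miles per gallon, class label).
cars = [(12, "low"), (15, "low"), (18, "low"), (27, "high"), (31, "high"), (35, "high")]

def predict(threshold, mpg):
    """Classify a car as 'high' mileage if it exceeds the threshold (assumed direction)."""
    return "high" if mpg > threshold else "low"

def fit_stump(samples):
    """Pick the mpg threshold that misclassifies the fewest training samples."""
    values = sorted(v for v, _ in samples)
    # Candidate thresholds: midpoints between consecutive sorted values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(th):
        return sum(predict(th, v) != label for v, label in samples)

    return min(candidates, key=errors)

threshold = fit_stump(cars)
print(threshold, predict(threshold, 30), predict(threshold, 14))
```

The learned threshold is the model: it describes how the classes differ in the training data and can then be applied to cars whose class is unknown.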
• Cluster Analysis
Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns.
Clustering is based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity.
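One common way to apply this principle is k-means, which the slide does not name; it is used here only as a familiar example. The sketch assumes one-dimensional data (hypothetical house prices) and arbitrarily chosen starting centers.

```python
def kmeans_1d(points, centers, iterations=10):
    """Plain k-means on 1-D data: assign each point to its nearest center, then recompute means."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical house prices (in thousands); the initial centers are an arbitrary choice.
prices = [100, 110, 120, 300, 310, 320]
centers, clusters = kmeans_1d(prices, centers=[100, 320])
print(centers, clusters)
```

No class labels are supplied; the two groups emerge only from how close the prices are to one another.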
• Outlier Analysis
Outlier: a data object that does not comply with the general behavior of the data.
It can be considered noise or an exception, but is quite useful in fraud detection and rare-event analysis.
• Trend and Evolution Analysis
Trend and deviation: regression analysis
Sequential pattern mining, periodicity analysis
Similarity-based analysis
• Other pattern-directed or statistical analyses
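For outlier analysis, a simple statistical sketch is to flag values that lie far from the mean in units of standard deviation (a z-score test). This is one basic approach among many; the transaction amounts and the threshold of 2.0 are assumptions chosen for illustration.

```python
from statistics import mean, pstdev

def find_outliers(values, z_threshold=2.0):
    """Return the values whose z-score magnitude exceeds z_threshold."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - mu) / sigma > z_threshold]

# Hypothetical card transaction amounts; 950 is the planted anomaly.
amounts = [20, 25, 22, 19, 24, 21, 23, 950]
outliers = find_outliers(amounts)
print(outliers)
```

In a fraud-detection setting the flagged value is the interesting result, not noise to be discarded, which is the point the slide makes.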
ARE ALL THE “DISCOVERED” PATTERNS INTERESTING?
• A data mining system/query may generate thousands of patterns; not all of them are interesting.
• Suggested approach: human-oriented, query-based, focused mining
• Interestingness measures: a pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm.
• Objective vs. subjective interestingness measures:
  - Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  - Subjective: based on the user’s beliefs about the data, e.g., unexpectedness, novelty, actionability, etc.
CAN WE FIND ALL AND ONLY INTERESTING PATTERNS?
• Find all the interesting patterns: completeness
  - Can a data mining system find all the interesting patterns?
  - Association vs. classification vs. clustering
• Search for only interesting patterns: optimization
  - Can a data mining system find only the interesting patterns?
  - Approaches
    • First generate all the patterns and then filter out the uninteresting ones.
    • Generate only the interesting patterns (mining query optimization).
DATA MINING: CONFLUENCE OF MULTIPLE DISCIPLINES
[Figure: data mining at the center, drawing on database technology, statistics, machine learning, visualization, information science, and other disciplines]
DATA MINING: CLASSIFICATION SCHEMES
• General functionality
  - Descriptive data mining
  - Predictive data mining
• Different views, different classifications
  - Kinds of databases to be mined
  - Kinds of knowledge to be discovered
  - Kinds of techniques utilized
  - Kinds of applications adapted
MAJOR ISSUES IN DATA MINING
• Mining methodology and user interaction
  - Mining different kinds of knowledge in databases
  - Interactive mining of knowledge at multiple levels of abstraction
  - Incorporation of background knowledge
  - Data mining query languages and ad-hoc data mining
  - Expression and visualization of data mining results
  - Handling noise and incomplete data
  - Pattern evaluation: the interestingness problem
• Performance and scalability
  - Efficiency and scalability of data mining algorithms
  - Parallel, distributed, and incremental mining methods
• Issues relating to the diversity of data types
  - Handling relational and complex types of data
  - Mining information from heterogeneous databases and global information systems (WWW)
• Issues related to applications and social impacts
  - Application of discovered knowledge
    • Domain-specific data mining tools
    • Intelligent query answering
    • Process control and decision making
  - Integration of the discovered knowledge with existing knowledge: a knowledge fusion problem
  - Protection of data security, integrity, and privacy
SUMMARY
• Data mining: discovering interesting patterns from large amounts of data
• A natural evolution of database technology, in great demand, with wide applications
• A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
• Mining can be performed in a variety of information repositories
• Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
• Classification of data mining systems
• Major issues in data mining
THANK YOU!
