1.3 Applications, Issues
Applications of Data Mining
Issues in Data Mining
Applications
 Financial Data Analysis
 Retail Industry
 Telecommunication Industry
 Biological Data Analysis
 Other Scientific Applications
 Intrusion Detection
Financial Data Analysis
 Financial Data
 Collected from Banks and Financial Institutions
 Usually complete and reliable
 Design and construction of data warehouses for multidimensional data analysis and mining
 Analysis – Changes by month, by region, by sector…and max,
min, total, average, trend etc.
 Characteristic and Comparative analysis, Outlier Analysis
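The by-month, by-region and by-sector summaries listed above can be sketched as a roll-up over a small fact table. This is only an illustration in plain Python: the records, the field layout and the `roll_up` helper are hypothetical and not taken from the slides.

```python
from collections import defaultdict

# Hypothetical fact table: (month, region, sector, amount). Values are made up.
transactions = [
    ("2024-01", "North", "Retail", 120.0),
    ("2024-01", "North", "Energy", 300.0),
    ("2024-01", "South", "Retail", 80.0),
    ("2024-02", "North", "Retail", 150.0),
    ("2024-02", "South", "Energy", 220.0),
]

# Position of each dimension inside a record.
DIMENSIONS = {"month": 0, "region": 1, "sector": 2}

def roll_up(rows, *dims):
    """Group rows by the chosen dimensions and compute max, min, total, average."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[DIMENSIONS[d]] for d in dims)
        groups[key].append(row[3])
    return {
        key: {
            "max": max(vals),
            "min": min(vals),
            "total": sum(vals),
            "average": sum(vals) / len(vals),
        }
        for key, vals in groups.items()
    }

by_region = roll_up(transactions, "region")
by_month_sector = roll_up(transactions, "month", "sector")
print(by_region[("North",)])
```

A real data warehouse would precompute such aggregates in a data cube; the dictionary here only stands in for that idea.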
 Loan payment and customer credit policy analysis
 Feature Selection and attribute relevance ranking (Debt ratio,
credit history, income, education level …)
 Loan granting policy can be adjusted
 Low risk Customers are granted loans
 Classification and Clustering of customers for targeted
marketing
 Customer group identification
 Multidimensional clustering techniques
 Can associate new customer with existing groups
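Associating a new customer with an existing group can be sketched as a nearest-centroid lookup. The group names, centroid values, attribute scales and the `assign_group` helper below are all assumptions made for illustration; the slides do not prescribe a particular clustering method.

```python
import math

# Hypothetical customer groups, each summarised by a centroid of
# (debt ratio, annual income in thousands). Values are made up.
group_centroids = {
    "low_risk": (0.10, 90.0),
    "medium_risk": (0.35, 55.0),
    "high_risk": (0.70, 25.0),
}

# Assumed rough scale of each attribute, so income does not dominate the distance.
SCALES = (1.0, 100.0)

def distance(a, b):
    """Scaled Euclidean distance between two attribute vectors."""
    return math.sqrt(sum(((x - y) / s) ** 2 for x, y, s in zip(a, b, SCALES)))

def assign_group(customer):
    """Return the name of the group whose centroid is closest to the customer."""
    return min(group_centroids, key=lambda g: distance(customer, group_centroids[g]))

new_customer = (0.30, 60.0)
print(assign_group(new_customer))
```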
 Detection of money laundering and financial crimes
 Data from several sources – integrated
 Data Analysis tools can be used to detect unusual patterns
 Data Visualization tools, Linkage Analysis tools
 Classification tools, Clustering tools
 Outlier Analysis tools
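As one possible form of outlier analysis, unusual transaction amounts can be flagged with a robust score based on the median absolute deviation. This is a sketch under assumptions: the amounts are invented, and the 0.6745 constant and 3.5 cutoff are a commonly used convention rather than anything stated in the slides.

```python
import statistics

def mad_outliers(values, threshold=3.5):
    """Return values whose modified z-score exceeds the threshold."""
    med = statistics.median(values)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread at all, so nothing can be scored
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical transfer amounts for one account.
amounts = [120, 95, 130, 110, 105, 9800, 125, 115]
print(mad_outliers(amounts))
```

A flagged value is only a candidate for review; linkage and visualization tools would be used to decide whether it is actually suspicious.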
Retail Industry
 Sales Data, Customer Shopping history, Goods
Transportation, E-Commerce
 Mining can help to
 Identify buying behaviour, discover shopping trends
 Improve the quality of customer service, retain customers
 Design and Construction of data warehouses
 Several ways to design a warehouse
 Entities involved: Sales, Customers, Employees, Goods transportation…
 Preliminary data mining exercises can help to guide the design
process
 Dimensions and levels to involve and pre-processing to be done
 Multi-dimensional analysis of sales, customers, products,
time and region
 Multi-feature data cube
 Visualization tools
 Analysis of effectiveness of sales campaigns
 Compare sales and transaction volume
 Multidimensional analysis
 Compare sales amount, number of transactions containing same items before
and after the campaign
 Association Analysis
 Identify items likely to be purchased together
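Identifying items likely to be purchased together can be sketched by counting item pairs across baskets and reporting support and confidence. The baskets, the 0.4 support threshold and the `pair_rules` helper are hypothetical; a full Apriori-style miner would also handle larger itemsets.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"butter", "milk"},
    {"bread", "butter", "milk"},
]

def pair_rules(baskets, min_support=0.4):
    """Return {(antecedent, consequent): (support, confidence)} for frequent pairs."""
    n = len(baskets)
    item_counts = Counter(item for b in baskets for item in b)
    pair_counts = Counter(
        pair for b in baskets for pair in combinations(sorted(b), 2)
    )
    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n
        if support >= min_support:
            # Confidence of "x => y" is count(x and y) / count(x).
            rules[(a, b)] = (support, count / item_counts[a])
            rules[(b, a)] = (support, count / item_counts[b])
    return rules

rules = pair_rules(baskets)
print(rules[("butter", "milk")])
```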
 Customer Retention
 Customer loyalty and trends
 Sequential pattern mining
 Adjust pricing strategy and goods range
 Purchase recommendation and cross-reference of items
 Recommender Systems
 Sales promotion by displaying deal information in association
with items of interest
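Sequential pattern mining, mentioned above for customer retention, can be illustrated by counting how many customers bought one item and some other item later. The purchase histories and the `sequential_pairs` helper are assumptions for illustration; real sequential pattern miners handle longer patterns and time constraints.

```python
from collections import Counter

# Hypothetical purchase histories, one ordered list per customer.
histories = [
    ["phone", "case", "charger"],
    ["phone", "charger"],
    ["laptop", "mouse"],
    ["phone", "case"],
]

def sequential_pairs(histories, min_support=0.5):
    """Return {(first, later): support} for ordered pairs seen in enough histories."""
    counts = Counter()
    for seq in histories:
        seen = set()  # count each ordered pair at most once per customer
        for i, first in enumerate(seq):
            for later in seq[i + 1:]:
                if first != later:
                    seen.add((first, later))
        counts.update(seen)
    n = len(histories)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

patterns = sequential_pairs(histories)
print(patterns)
```

A pattern such as ("phone", "case") could then drive a recommendation or a promotion shown to customers who have just bought a phone.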
Telecommunication Industry
 Computer and Web data transmission, fax, Mobile
phone, Telephone services
 Multidimensional analysis of telecommunication data
 Helps to identify and compare data traffic, system workload,
resource usage, user group behavior, profit…
 Time-of-day usage patterns
 Fraudulent pattern analysis
 Identify fraudulent users and atypical usage patterns
 Illegal Customer account access
 Automatic Dial-out equipment
 Switch and route congestion patterns
 Multidimensional association and sequential pattern
analysis
 Usage patterns for a set of communication services by customer
group, time of day
 Sales Promotion
 Mobile Telecommunication Services
 Spatio-temporal data mining
 Use of visualization tools
Biomedical and DNA Data Analysis
 Research in DNA Analysis has led to
 Development of new drugs
 Cancer therapies
 Human genome study
 Discovery of genetic causes for many diseases
 Genome Research
 Study of DNA Sequences
 Adenine, Cytosine, Guanine, Thymine
 An estimated 100,000 human genes (later estimates are closer to 20,000 to 25,000); each has hundreds of
nucleotides, which can be combined in a great number of ways
 Identifying Gene Sequence patterns is challenging
 Semantic Integration of Heterogeneous, distributed
genome databases
 Highly distributed generation and use of DNA data
 Integrated data warehouses and distributed federated databases
 Efficient Data Cleaning and Integration methods
 Similarity Search and Comparison among DNA
Sequences
 Gene sequences – isolated from healthy and diseased tissues
 Compare frequently occurring patterns in each class
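 Patterns more frequent in diseased samples help to identify genetic factors of the disease; patterns more frequent in healthy samples may indicate protective (immune) factors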
 Non-numeric nature of data poses difficulties
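Comparing frequently occurring patterns in each class can be sketched by counting short subsequences (k-mers) in healthy and diseased samples. The sequences below are invented, and the `kmer_counts` and `enriched` helpers are hypothetical names; this is a toy comparison, not a real sequence-analysis pipeline.

```python
from collections import Counter

def kmer_counts(sequences, k=3):
    """Count, for each k-mer, how many sequences contain it at least once."""
    counts = Counter()
    for seq in sequences:
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(kmers)
    return counts

def enriched(target, background, k=3, min_count=2):
    """k-mers found in at least min_count target sequences and in no background sequence."""
    t, b = kmer_counts(target, k), kmer_counts(background, k)
    return sorted(kmer for kmer, c in t.items() if c >= min_count and b[kmer] == 0)

# Made-up sequences over the alphabet A, C, G, T.
healthy = ["ACGTAC", "ACGTTT", "TTACGA"]
diseased = ["GGATCC", "CGGATC", "ACGGAT"]

print(enriched(diseased, healthy))
print(enriched(healthy, diseased))
```

Because the data are symbolic rather than numeric, the comparison relies on set membership and counting instead of arithmetic distance.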
 Association Analysis: Identification of co-occurring gene
sequences
 Diseases – triggered by a combination of genes acting together
 Association analysis helps to detect the kinds of genes that may
co-occur
 Study interactions and relationships between them
 Path Analysis: Linking genes to different stages of
disease development
 Different genes become active at different stages of the disease
 Develop drug interventions that target specific stages
 Visualization tools and genetic data analysis
 Complex gene structures are presented as graphs, trees and cuboids
using visualization tools
 Supports better understanding and interactive data
exploration
Intrusion Detection
 Intrusions
 Any set of actions that threaten the integrity, availability, or confidentiality of a
network resource
 Misuse detection: use patterns of well-known attacks to identify
intrusions
 Signatures – Must be updated
 Classification based on known intrusions
 E.g., three consecutive login failures: password guessing.
 Anomaly detection: use deviation from normal usage patterns to
identify intrusions
 Any significant deviations from the expected behavior are reported as possible
attacks
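The misuse-detection example above (three consecutive login failures suggesting password guessing) can be written as a simple signature rule. The event format, the log and the `password_guessing_suspects` helper are assumptions for illustration.

```python
def password_guessing_suspects(events, limit=3):
    """Flag users with `limit` or more consecutive failed logins.

    `events` is an ordered iterable of (user, outcome) pairs, where
    outcome is "fail" or "ok" (an assumed, simplified log format).
    """
    streak = {}
    flagged = set()
    for user, outcome in events:
        if outcome == "fail":
            streak[user] = streak.get(user, 0) + 1
            if streak[user] >= limit:
                flagged.add(user)
        else:
            streak[user] = 0  # a successful login resets the streak
    return flagged

# Hypothetical login log.
log = [
    ("alice", "fail"), ("bob", "fail"), ("alice", "ok"),
    ("bob", "fail"), ("alice", "fail"), ("bob", "fail"),
    ("alice", "fail"),
]
print(password_guessing_suspects(log))
```

Like any signature, this rule only catches the pattern it encodes, which is why signatures must be kept up to date.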
 Data Mining Algorithms
 Misuse detection
 training data labeled – normal / intrusion
 Classifier can be used to detect known intrusions
 Classification algorithms, Association rule mining
 Anomaly detection
 Builds models of normal behavior and detects significant deviations
 Supervised – ‘normal’ training data
 Unsupervised – no information about training data
 Classification, clustering
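Anomaly detection trained on 'normal' data can be sketched as a profile of normal behavior plus a deviation test. The feature (logins per hour), the sample values, the `NormalProfile` class and the three-standard-deviation cutoff are all assumptions; real systems model many attributes at once.

```python
import statistics

class NormalProfile:
    """Model of normal behavior built only from data labeled 'normal'."""

    def __init__(self, normal_values):
        self.mean = statistics.mean(normal_values)
        self.stdev = statistics.stdev(normal_values)

    def is_anomalous(self, value, k=3.0):
        """Report a value that deviates more than k standard deviations from normal."""
        return abs(value - self.mean) > k * self.stdev

# Hypothetical training data: logins per hour during normal operation.
normal_logins_per_hour = [4, 5, 6, 5, 4, 6, 5, 5]
profile = NormalProfile(normal_logins_per_hour)

print(profile.is_anomalous(6))
print(profile.is_anomalous(40))
```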
 Association and Correlation Analysis
 Finds relationships between system attributes describing the
network data
 Helps in selection of useful attributes
 Analysis of Stream data
 Transient and dynamic nature of intrusions
 An event may be normal on its own but malicious when viewed as
part of a sequence
 Distributed Data Mining
 Analysis of data from several locations
 Visualization and Querying tools
Data Mining in other Scientific Applications
 Old Scenario: Small, homogeneous data sets
 Formulate hypothesis, build model, evaluate results
 Current Scenario: High-dimensional data, stream data,
heterogeneous data (spatial, temporal)
 Collect and store data, mine for new hypotheses, confirm with
data or experimentation
 Vast amounts of data have been collected from Scientific
domains
 Climate and ecosystem modeling, Chemical engineering, fluid
dynamics, structural mechanics…
 Data Warehouses and data preprocessing
 Scientific applications – methods are needed for integrating
data from heterogeneous sources (Geospatial data
warehouse) and identifying events (Climate and Ecosystem
data)
 Mining complex data types
 Scientific data – Semi-structured and unstructured
 Multimedia and Spatial data
 Graph-based mining
 Labeled graphs – capture spatial, topological, geometric and
other relational characteristics present in scientific data
 Nodes – objects to be mined; edges – relationships
 Scalable and efficient mining methods are needed
 Visualization tools and domain specific knowledge
 High level GUIs and visualization tools are needed
 Integrated with existing domain-specific systems and database
systems
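The graph-based mining idea above, with nodes as objects and edges as relationships, can be illustrated by finding labeled edges that recur across a set of labeled graphs. The graph encoding, the tiny molecule-like examples and the `frequent_edges` helper are hypothetical; frequent subgraph miners extend this to larger structures.

```python
from collections import Counter

# Hypothetical labeled graphs: node id -> label, and (u, v, edge label) triples.
graphs = [
    {"nodes": {1: "C", 2: "O", 3: "H"}, "edges": [(1, 2, "double"), (1, 3, "single")]},
    {"nodes": {1: "C", 2: "O", 3: "N"}, "edges": [(1, 2, "double"), (1, 3, "single")]},
    {"nodes": {1: "C", 2: "H"}, "edges": [(1, 2, "single")]},
]

def frequent_edges(graphs, min_support=2):
    """Return labeled-edge patterns that appear in at least min_support graphs."""
    counts = Counter()
    for g in graphs:
        patterns = set()  # count each pattern once per graph
        for u, v, label in g["edges"]:
            # Sort endpoint labels so an undirected edge has one canonical form.
            a, b = sorted((g["nodes"][u], g["nodes"][v]))
            patterns.add((a, label, b))
        counts.update(patterns)
    return {p: c for p, c in counts.items() if c >= min_support}

print(frequent_edges(graphs))
```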
Issues in Data Mining
 Mining methodology and user interaction
 Mining different kinds of knowledge in databases
 Interactive mining of knowledge at multiple levels of abstraction
 Incorporation of background knowledge
 Data mining query languages and ad-hoc data mining
 Expression and visualization of data mining results
 Handling noise and incomplete data
 Pattern evaluation
 Issues relating to the diversity of data types
 Handling relational and complex types of data
 Mining information from heterogeneous databases and global
information systems (WWW)
 Performance and scalability
 Efficiency and scalability of data mining algorithms
 Parallel, distributed and incremental mining methods
