Data Mining: Applications and Techniques

The document provides an overview of data mining, including why it is useful given the large amount of data being collected, what data mining is, the major steps in the data mining process, examples of data mining tools and techniques, and potential applications of data mining such as market analysis, risk analysis, and fraud detection.

Uploaded by

jaineti
Data Mining and its Applications

Why Data Mining?

• The Explosive Growth of Data


• Data collection and data availability
• Automated data collection tools, database systems, Web, computerized
society
– Major sources of abundant data
• Business: Web, e-commerce, transactions, stocks, …
• Science: Remote sensing, bioinformatics, scientific simulation, …
• Society and everyone: news, digital cameras, …
• We are drowning in data, but starving for knowledge!
• “Necessity is the mother of invention”—Data mining—Automated analysis of massive
data sets
Why Mine Data? Commercial Viewpoint

• Lots of data is being collected and warehoused
– Web data, e-commerce
– purchases at department/grocery stores
– Bank/Credit Card transactions

• Computers have become cheaper and more powerful


• Competitive Pressure is Strong
– Provide better, customized services for an edge (e.g. in Customer
Relationship Management)
Why Mine Data? Scientific Viewpoint

• Data collected and stored at enormous speeds (GB/hour)
– remote sensors on a satellite
– telescopes scanning the skies
– microarrays generating gene expression data
– scientific simulations generating terabytes of data
• Traditional techniques infeasible for raw data
• Data mining may help scientists
– in classifying and segmenting data
– in Hypothesis Formation
What Is Data Mining?

• Data mining (knowledge discovery in databases):


– Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) information or patterns from
data in large databases

• Alternative names and their “inside stories”:


– Data mining: a misnomer?
– Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, business
intelligence, etc.
What Is Data Mining?

• Data mining (knowledge discovery from data)


– Extraction of interesting (non-trivial, implicit, previously unknown and
potentially useful) patterns or knowledge from huge amount of data
– Data mining: a misnomer?
• Alternative names
– Knowledge discovery (mining) in databases (KDD), knowledge extraction,
data/pattern analysis, data archeology, data dredging, information
harvesting, business intelligence, etc.
• Watch out: Is everything “data mining”?
– Simple search and query processing
– (Deductive) expert systems
Data Mining
• The non-trivial extraction of novel, implicit, and actionable knowledge
from large datasets.
– Extremely large datasets
– Discovery of the non-obvious
– Useful knowledge that can improve processes
– Cannot be done manually
• Technology to enable data exploration, data analysis, and data
visualization of very large databases at a high level of abstraction,
without a specific hypothesis in mind.
• Sophisticated data search capability that uses statistical algorithms to
discover patterns and correlations in data.
Data Mining (cont.)
• Data Mining is a step of Knowledge Discovery in
Databases (KDD) Process
– Data Warehousing
– Data Selection
– Data Preprocessing
– Data Transformation
– Data Mining
– Interpretation/Evaluation
• Data mining is sometimes referred to as KDD; the terms DM
and KDD tend to be used as synonyms
Major Issues in Data Warehousing and
Mining
• Mining methodology and user interaction
– Mining different kinds of knowledge in databases
– Interactive mining of knowledge at multiple levels of abstraction
– Incorporation of background knowledge
– Data mining query languages and ad-hoc data mining
– Expression and visualization of data mining results
– Handling noise and incomplete data
– Pattern evaluation: the interestingness problem
• Performance and scalability
– Efficiency and scalability of data mining algorithms
– Parallel, distributed and incremental mining methods
Major Issues in Data Warehousing and
Mining
• Issues relating to the diversity of data types
– Handling relational and complex types of data
– Mining information from heterogeneous databases and global
information systems (WWW)
• Issues related to applications and social impacts
– Application of discovered knowledge
• Domain-specific data mining tools
• Intelligent query answering
• Process control and decision making
– Integration of the discovered knowledge with existing knowledge: A
knowledge fusion problem
– Protection of data security, integrity, and privacy
Examples: What is (not) Data Mining?

• What is not Data Mining?
– Look up phone number in phone directory
– Query a Web search engine for information about “Amazon”

• What is Data Mining?
– Certain names are more prevalent in certain US locations
(O’Brien, O’Rourke, O’Reilly… in Boston area)
– Group together similar documents returned by search engine
according to their context (e.g. Amazon rainforest, Amazon.com)
Extraction of Knowledge from
Data
4 Phases of Data Mining
• Data Preparation
– Identify the main data sets to be used by the data
mining operation (usually the data warehouse)
• Data Analysis and Classification
– Study the data to identify common data
characteristics or patterns
• Data groupings, classifications, clusters, sequences
• Data dependencies, links, or relationships
• Data patterns, trends, deviation
4 Phases of Data Mining
• Knowledge Acquisition
– Uses the Results of the Data Analysis and Classification phase
– Data mining tool selects the appropriate modeling or knowledge-
acquisition algorithms
• Neural Networks
• Decision Trees
• Rules Induction
• Genetic algorithms
• Memory-Based Reasoning
• Prognosis
– Predict Future Behavior
– Forecast Business Outcomes
• 65% of customers who did not use a particular credit card in the last 6
months are 88% likely to cancel the account.
3 Steps Data Mining Process
• Stage 1: Exploration. This stage usually starts with data
preparation, which may involve cleaning data, transforming
data, and selecting subsets of records.
• Stage 2: Model building and validation. This stage involves
considering various models and choosing the best one based
on predictive performance.
• Stage 3: Deployment. This final stage involves applying the
model selected as best in the previous stage to new data in
order to generate predictions or estimates of the expected
outcome.
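
The three stages can be illustrated with a minimal sketch in plain Python. Everything here is an assumption made for illustration: the toy (spend, sales) records, the two candidate models (a constant mean and a least-squares line), and the use of mean squared error on a small hold-out set as the measure of predictive performance.

```python
# Hypothetical toy records: (advertising spend, sales); None marks a missing value.
raw = [(1.0, 2.1), (2.0, 3.9), (3.0, None), (4.0, 8.2), (5.0, 9.8),
       (6.0, 12.1), (None, 5.0), (7.0, 13.9), (8.0, 16.2)]

# Stage 1: exploration and preparation. Drop incomplete records,
# then hold out the last two clean records for validation.
clean = [(x, y) for x, y in raw if x is not None and y is not None]
train, valid = clean[:5], clean[5:]

def fit_mean(rows):
    """Baseline model: always predict the average outcome."""
    mean_y = sum(y for _, y in rows) / len(rows)
    return lambda x: mean_y

def fit_line(rows):
    """Least-squares straight line through the training records."""
    n = len(rows)
    mx = sum(x for x, _ in rows) / n
    my = sum(y for _, y in rows) / n
    sxx = sum((x - mx) ** 2 for x, _ in rows)
    sxy = sum((x - mx) * (y - my) for x, y in rows)
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def mse(model, rows):
    """Mean squared prediction error on a set of records."""
    return sum((model(x) - y) ** 2 for x, y in rows) / len(rows)

# Stage 2: build candidate models and keep the one that validates best.
candidates = {"mean": fit_mean(train), "line": fit_line(train)}
scores = {name: mse(model, valid) for name, model in candidates.items()}
best_name = min(scores, key=scores.get)

# Stage 3: deployment. Apply the selected model to new, unseen inputs.
best_model = candidates[best_name]
predictions = [best_model(x) for x in (9.0, 10.0)]
print(best_name, predictions)
```

In practice the candidate models and the validation scheme would be far richer; the point is only the order of the steps.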
Some of the tools used for data
mining are:
• Artificial neural networks - Non-linear predictive models that
learn through training and resemble biological neural
networks in structure.
• Decision trees - Tree-shaped structures that represent sets of
decisions. These decisions generate rules for the classification
of a dataset.
• Rule induction - The extraction of useful if-then rules from
data based on statistical significance.
• Genetic algorithms - Optimization techniques based on the
concepts of genetic combination, mutation, and natural
selection.
• Nearest neighbor - A classification technique that classifies
each record based on the records most similar to it in a
historical database.
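
As a small illustration of the nearest-neighbor idea from the list above, the following sketch classifies a new record by majority vote among its k closest records in a historical table. The records, labels, feature choice (age, income in thousands), and k = 3 are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical historical database: (age, income in $1000s) -> response to an offer.
history = [
    ((25, 30), "declined"), ((30, 35), "declined"), ((28, 40), "declined"),
    ((45, 90), "accepted"), ((50, 85), "accepted"), ((40, 95), "accepted"),
]

def knn_classify(record, history, k=3):
    """Label a record by majority vote among its k nearest historical records."""
    # Sort historical records by Euclidean distance to the new record.
    by_distance = sorted(history, key=lambda item: math.dist(record, item[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

print(knn_classify((27, 33), history))
print(knn_classify((48, 88), history))
```

The features are left unscaled here for brevity; with attributes on very different scales, they would normally be normalized before computing distances.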
Data Mining: On What Kinds of Data?

• Database-oriented data sets and applications


– Relational database, data warehouse, transactional database
• Advanced data sets and advanced applications
– Data streams and sensor data
– Time-series data, temporal data, sequence data (incl. bio-sequences)
– Structure data, graphs, social networks and multi-linked data
– Object-relational databases
– Heterogeneous databases and legacy databases
– Spatial data and spatiotemporal data
– Multimedia database
– Text databases
– The World-Wide Web
Why Data Mining?—Potential Applications

• Data analysis and decision support


– Market analysis and management
• Target marketing, customer relationship management (CRM), market
basket analysis, cross selling, market segmentation
– Risk analysis and management
• Forecasting, customer retention, improved underwriting, quality control,
competitive analysis
– Fraud detection and detection of unusual patterns (outliers)
• Other Applications
– Text mining (news group, email, documents) and Web mining
– Stream data mining
– Bioinformatics and bio-data analysis
Ex. 1: Market Analysis and Management
• Where does the data come from?—Credit card transactions, loyalty cards, discount coupons,
customer complaint calls, plus (public) lifestyle studies
• Target marketing
– Find clusters of “model” customers who share the same characteristics: interest, income
level, spending habits, etc.
– Determine customer purchasing patterns over time
• Cross-market analysis: find associations/correlations between product sales, and predict based
on such associations
• Customer profiling—What types of customers buy what products (clustering or classification)
• Customer requirement analysis
– Identify the best products for different customers
– Predict what factors will attract new customers
• Provision of summary information
– Multidimensional summary reports
– Statistical summary information (data central tendency and variation)
Ex. 2: Corporate Analysis & Risk Management

• Finance planning and asset evaluation


– cash flow analysis and prediction
– contingent claim analysis to evaluate assets
– cross-sectional and time series analysis (financial-ratio, trend analysis,
etc.)
• Resource planning
– summarize and compare the resources and spending
• Competition
– monitor competitors and market directions
– group customers into classes and set up a class-based pricing procedure
– set pricing strategy in a highly competitive market
Ex. 3: Fraud Detection & Mining Unusual
Patterns

• Approaches: Clustering & model construction for frauds, outlier analysis


• Applications: Health care, retail, credit card service, telecomm.
– Auto insurance: ring of collisions
– Money laundering: suspicious monetary transactions
– Medical insurance
• Professional patients, ring of doctors, and ring of references
• Unnecessary or correlated screening tests
– Telecommunications: phone-call fraud
• Phone call model: destination of the call, duration, time of day or week.
Analyze patterns that deviate from an expected norm
– Retail industry
• Analysts estimate that 38% of retail shrink is due to dishonest employees
– Anti-terrorism
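
One simple way to "analyze patterns that deviate from an expected norm" is a z-score test against historical behavior. The sketch below is an assumption-laden illustration, not a production fraud model: the call durations, call identifiers, and the threshold of 3 standard deviations are all hypothetical.

```python
import statistics

# Hypothetical history of one subscriber's call durations, in minutes.
history_minutes = [3, 4, 5, 4, 3, 5, 4, 4, 3, 5]

# Hypothetical new calls to be screened.
new_calls = {"call-1": 4, "call-2": 6, "call-3": 58}

# The "expected norm" is summarized by the historical mean and spread.
mean = statistics.mean(history_minutes)
stdev = statistics.stdev(history_minutes)

def deviates(value, threshold=3.0):
    """True if the value lies more than `threshold` standard deviations from the mean."""
    return abs(value - mean) / stdev > threshold

flagged = [call_id for call_id, minutes in new_calls.items() if deviates(minutes)]
print(flagged)
```

A real phone-call model would combine several attributes (destination, duration, time of day or week), but the principle of scoring each event against a learned norm is the same.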
Data Mining: Classification Schemes

• Decisions in data mining


– Kinds of databases to be mined
– Kinds of knowledge to be discovered
– Kinds of techniques utilized
– Kinds of applications adapted

• Data mining tasks


– Descriptive data mining
– Predictive data mining
Decisions in Data Mining

• Databases to be mined
– Relational, transactional, object-oriented, object-relational, active,
spatial, time-series, text, multi-media, heterogeneous, legacy, WWW,
etc.
• Knowledge to be mined
– Characterization, discrimination, association, classification, clustering,
trend, deviation and outlier analysis, etc.
– Multiple/integrated functions and mining at multiple levels
• Techniques utilized
– Database-oriented, data warehouse (OLAP), machine learning,
statistics, visualization, neural network, etc.
• Applications adapted
– Retail, telecommunication, banking, fraud analysis, DNA mining, stock market
analysis, Web mining, Weblog analysis, etc.
Data Mining Models and Tasks
Data Mining Tasks

• Prediction Tasks
– Use some variables to predict unknown or future values of other
variables
• Description Tasks
– Find human-interpretable patterns that describe the data.

Common data mining tasks


– Classification [Predictive]
– Clustering [Descriptive]
– Association Rule Discovery [Descriptive]
– Sequential Pattern Discovery [Descriptive]
– Regression [Predictive]
– Deviation Detection [Predictive]
Basic Data Mining Tasks
• Classification maps data into predefined groups
or classes
– Supervised learning
– Pattern recognition
– Prediction
• Regression is used to map a data item to a real-valued
prediction variable.
• Clustering groups similar data together into
clusters.
– Unsupervised learning
– Segmentation
– Partitioning
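
To make the clustering task concrete, here is a minimal one-dimensional k-means sketch. It is only one of many clustering methods and is used here as an illustrative choice; the spending values and the two starting centers are hypothetical.

```python
def kmeans_1d(values, centers, iterations=10):
    """Very small k-means for one-dimensional data.

    Repeatedly assigns each value to its nearest center, then moves
    each center to the mean of the values assigned to it.
    """
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous center.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical monthly spending of seven customers.
spend = [12, 15, 14, 80, 85, 90, 82]
centers, clusters = kmeans_1d(spend, [0.0, 100.0])
print(centers)
print(clusters)
```

No class labels are supplied: the two segments emerge from the data alone, which is what distinguishes clustering (unsupervised) from classification (supervised).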
Basic Data Mining Tasks
(cont’d)
• Summarization maps data into subsets with
associated simple descriptions.
– Characterization
– Generalization
• Link Analysis uncovers relationships among data.
– Affinity Analysis
– Association Rules
– Sequential Analysis determines sequential patterns.
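
Association rules are usually judged by two standard measures: support (how often an itemset occurs) and confidence (how often the consequent occurs when the antecedent does). The sketch below computes both for a handful of hypothetical market baskets; the items and baskets are invented for illustration.

```python
# Hypothetical market baskets, one set of items per transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

def support(itemset):
    """Fraction of baskets that contain every item in the itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    return support(antecedent | consequent) / support(antecedent)

# Rule under test: {diapers} -> {beer}
print(support({"diapers", "beer"}))
print(confidence({"diapers"}, {"beer"}))
```

Algorithms such as Apriori search for all rules above chosen support and confidence thresholds; this sketch only evaluates a single rule.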
Data Mining and Business Intelligence
[Figure: pyramid of layers; the potential to support business decisions increases toward the top. Layers from top to bottom, with the associated role where shown:]
• Making Decisions (End User)
• Data Presentation: Visualization Techniques (Business Analyst)
• Data Mining: Information Discovery (Data Analyst)
• Data Exploration: Statistical Analysis, Querying and Reporting
• Data Warehouses / Data Marts: OLAP, MDA (DBA)
• Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Data Mining: On What Kind of Data?

• Relational databases
• Data warehouses
• Transactional databases
• Advanced DB and information repositories
– Object-oriented and object-relational databases
– Spatial databases
– Time-series data and temporal data
– Text databases and multimedia databases
– Heterogeneous and legacy databases
– WWW
Data Mining: Confluence of Multiple
Disciplines
[Figure: Data Mining at the center, drawing on Database Technology, Statistics, Machine Learning, Visualization, Information Science, and Other Disciplines]
Data Mining vs. Statistical Analysis

Statistical Analysis:
• Ill-suited for Nominal and Structured Data Types
• Completely data driven - incorporation of domain knowledge not possible
• Interpretation of results is difficult and daunting
• Requires expert user guidance

Data Mining:
• Large Data sets
• Efficiency of Algorithms is important
• Scalability of Algorithms is important
• Real World Data
• Lots of Missing Values
• Pre-existing data - not user generated
• Data not static - prone to updates
• Efficient methods for data retrieval available for use
Data Mining vs. DBMS

• Example DBMS Reports


– Last month's sales for each service type
– Sales per service grouped by customer sex or age bracket
– List of customers who lapsed their policy

• Questions answered using Data Mining


– What characteristics do customers that lapse their policy
have in common and how do they differ from customers
who renew their policy?
– Which motor insurance policy holders would be potential
customers for my House Content Insurance policy?
Data Mining and Data Warehousing

• Data Warehouse: a centralized data repository which can be queried for business benefit.
• Data Warehousing makes it possible to
– extract archived operational data
– overcome inconsistencies between different legacy data formats
– integrate data throughout an enterprise, regardless of location,
format, or communication requirements
– incorporate additional or expert information
• OLAP: On-line Analytical Processing
• Multi-Dimensional Data Model (Data Cube)
• Operations:
– Roll-up
– Drill-down
– Slice and dice
– Rotate
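
The OLAP operations listed above can be sketched on a tiny in-memory fact table. This is only an illustration of the ideas; real OLAP servers use dedicated storage and query engines. The dimension names, rows, and helper functions `aggregate` and `slice_rows` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical fact table: (year, quarter, region, product, sales).
DIMENSIONS = ("year", "quarter", "region", "product")
facts = [
    (2023, "Q1", "East", "phone", 100),
    (2023, "Q1", "West", "phone", 80),
    (2023, "Q2", "East", "laptop", 150),
    (2023, "Q2", "West", "phone", 70),
    (2024, "Q1", "East", "laptop", 200),
]

def aggregate(rows, keep):
    """Sum sales over all dimensions except those named in `keep`."""
    idx = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)

def slice_rows(rows, dimension, value):
    """Slice: fix one dimension to a single value."""
    i = DIMENSIONS.index(dimension)
    return [row for row in rows if row[i] == value]

by_year = aggregate(facts, ("year",))                    # roll-up to year
by_year_quarter = aggregate(facts, ("year", "quarter"))  # drill-down to quarter
east_by_product = aggregate(slice_rows(facts, "region", "East"), ("product",))
by_region_year = aggregate(facts, ("region", "year"))    # rotate: reorder the axes

print(by_year)
print(by_year_quarter)
print(east_by_product)
```

Dicing would apply `slice_rows` on two or more dimensions in sequence.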
Major Issues in Data Mining
• Mining methodology
– Mining different kinds of knowledge from diverse data types, e.g., bio, stream,
Web
– Performance: efficiency, effectiveness, and scalability
– Pattern evaluation: the interestingness problem
– Incorporation of background knowledge
– Handling noise and incomplete data
– Parallel, distributed and incremental mining methods
– Integration of the discovered knowledge with existing one: knowledge fusion
• User interaction
– Data mining query languages and ad-hoc mining
– Expression and visualization of data mining results
– Interactive mining of knowledge at multiple levels of abstraction
• Applications and social impacts
– Domain-specific data mining & invisible data mining
– Protection of data security, integrity, and privacy
What makes data mining
possible?
• Advances in the following areas are making
data mining deployable:
– data warehousing
– better and more data (i.e., operational,
behavioral, and demographic)
– the emergence of easily deployed data mining
tools and
– the advent of new data mining techniques.
(Source: Gartner Group)
Data Mining Motivation
• Changes in the Business Environment
– Customers becoming more demanding
– Markets are saturated
• Databases today are huge:
– More than 1,000,000 entities/records/rows
– From 10 to 10,000 fields/attributes/variables
– Gigabytes and terabytes
• Databases are growing at an unprecedented rate
• Decisions must be made rapidly
• Decisions must be made with maximum knowledge
ADVANTAGES OF DATA
MINING
• Marketing/Retailing: Data mining can aid direct
marketers by providing them with useful and
accurate trends about their customers’
purchasing behavior.
• Banking/Crediting: Data mining can assist
financial institutions in areas such as credit
reporting and loan information.
ADVANTAGES OF DATA
MINING Cont…
• Law enforcement: Data mining can aid law enforcers
in identifying criminal suspects as well as
apprehending these criminals by examining trends in
location, crime type, habit, and other patterns of
behaviors.
• Researchers: Data mining can assist researchers by
speeding up their data analyzing process; thus,
allowing them more time to work on other
projects.
DISADVANTAGES OF DATA
MINING
• Privacy Issues: For example, according to the
Washington Post, in 1998 CVS sold its patients’
prescription purchase data to a different
company.
• American Express also sold their customers’
credit card purchases to another company.
DISADVANTAGES OF DATA
MINING Cont…
• Security issues: Although companies have a lot of personal
information about us available online, they do not have
sufficient security systems in place to protect that
information.
• Misuse of information: Some companies prioritize phone calls
based on purchase history. If you have spent a lot of money
with a company or bought many of its products, your call will
be answered sooner. You should therefore not assume that calls
are answered in the order in which they were received.
Data Mining Motivation
“The key in business is to know something that nobody
else knows.”
— Aristotle Onassis

“To understand is to perceive patterns.”


— Sir Isaiah Berlin
Data Mining Applications
Data Mining Applications:
Retail
• Performing basket analysis
– Which items customers tend to purchase together. This knowledge can
improve stocking, store layout strategies, and promotions.
• Sales forecasting
– Examining time-based patterns helps retailers make stocking
decisions. If a customer purchases an item today, when are they likely
to purchase a complementary item?
• Database marketing
– Retailers can develop profiles of customers with certain behaviors, for
example, those who purchase designer-label clothing or those who
attend sales. This information can be used to focus cost-effective
promotions.
• Merchandise planning and allocation
– When retailers add new stores, they can improve merchandise
planning and allocation by examining patterns in stores with similar
demographic characteristics. Retailers can also use data mining to
determine the ideal layout for a specific store.
Data Mining Applications:
Banking
• Card marketing
– By identifying customer segments, card issuers and acquirers can
improve profitability with more effective acquisition and retention
programs, targeted product development, and customized pricing.
• Cardholder pricing and profitability
– Card issuers can take advantage of data mining technology to price
their products so as to maximize profit and minimize loss of
customers. Includes risk-based pricing.
• Fraud detection
– Fraud is enormously costly. By analyzing past transactions that were
later determined to be fraudulent, banks can identify patterns.
• Predictive life-cycle management
– DM helps banks predict each customer’s lifetime value and service
each segment appropriately (for example, by offering special deals and
discounts).
Data Mining Applications:
Telecommunication
• Call detail record analysis
– Telecommunication companies accumulate detailed call records.
By identifying customer segments with similar use patterns, the
companies can develop attractive pricing and feature
promotions.
• Customer loyalty
– Some customers repeatedly switch providers, or “churn”, to take
advantage of attractive incentives by competing companies. The
companies can use DM to identify the characteristics of
customers who are likely to remain loyal once they switch, thus
enabling the companies to target their spending on customers
who will produce the most profit.
Data Mining Applications:
Other Applications
• Customer segmentation
– All industries can take advantage of DM to discover discrete segments
in their customer bases by considering additional variables beyond
traditional analysis.
• Manufacturing
– Through choice boards, manufacturers are beginning to customize
products for customers; therefore they must be able to predict which
features should be bundled to meet customer demand.
• Warranties
– Manufacturers need to predict the number of customers who will
submit warranty claims and the average cost of those claims.
• Frequent flier incentives
– Airlines can identify groups of customers that can be given incentives
to fly more.
Data Mining in CRM:
Customer Life Cycle
• Customer Life Cycle
– The stages in the relationship between a customer and a
business
• Key stages in the customer lifecycle
– Prospects: people who are not yet customers but are in the
target market
– Responders: prospects who show an interest in a product or
service
– Active Customers: people who are currently using the product
or service
– Former Customers: may be “bad” customers who did not pay
their bills or who incurred high costs
• It’s important to know life cycle events (e.g. retirement)
Data Mining in CRM:
Customer Life Cycle
• What marketers want: Increasing customer
revenue and customer profitability
– Up-sell
– Cross-sell
– Keeping the customers for a longer period of time
• Solution: Applying data mining
Data Mining in CRM
• DM helps to
– Determine the behavior surrounding a particular
lifecycle event
– Find other people in similar life stages and
determine which customers are following similar
behavior patterns
Data Mining in CRM (cont.)

[Figure: diagram relating Data Warehouse, Customer Profile, Data Mining, Customer Life Cycle Info., and Campaign Management]
Data Mining in Practice
Application Areas

Industry                  Application
Finance                   Credit card analysis
Insurance                 Claims, fraud analysis
Telecommunication         Call record analysis
Transport                 Logistics management
Consumer goods            Promotion analysis
Data service providers    Value added data
Utilities                 Power usage analysis
Why Now?
• Data is being produced
• Data is being warehoused
• The computing power is available
• The computing power is affordable
• The competitive pressures are strong
• Commercial products are available
Data Mining works with
Warehouse Data
• Data Warehousing provides the Enterprise with a memory
• Data Mining provides the Enterprise with intelligence
Usage scenarios
• Data warehouse mining:
– assimilate data from operational sources
– mine static data
• Mining log data
• Continuous mining: example in process control
• Stages in mining:
– data selection → pre-processing: cleaning →
transformation → mining → result evaluation →
visualization
Mining market
• Around 20 to 30 mining tool vendors
• Major tool players:
– Clementine,
– IBM’s Intelligent Miner,
– SGI’s MineSet,
– SAS’s Enterprise Miner.
• All pretty much the same set of tools
• Many embedded products:
– fraud detection:
– electronic commerce applications,
– health care,
– customer relationship management: Epiphany
Vertical integration:
Mining on the web
• Web log analysis for site design:
– what are popular pages,
– what links are hard to find.
• Electronic stores sales enhancements:
– recommendations, advertisement:
– Collaborative filtering: Net Perceptions, Wisewire
– Inventory control: what was a shopper looking for
and could not find
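
Collaborative filtering, mentioned above, can be sketched in a few lines: recommend items bought by shoppers whose purchase history overlaps with yours. This is a toy user-based variant written for illustration and does not represent any vendor's product; the shoppers, items, and scoring rule are hypothetical.

```python
from collections import Counter

# Hypothetical purchase histories of an online store.
purchases = {
    "ann": {"book", "lamp", "pen"},
    "bob": {"book", "pen", "desk"},
    "cat": {"lamp", "rug"},
    "dan": {"book", "desk"},
}

def recommend(user, purchases, top_n=2):
    """Suggest items the user lacks, weighted by overlap with other shoppers."""
    mine = purchases[user]
    scores = Counter()
    for other, items in purchases.items():
        if other == user:
            continue
        overlap = len(mine & items)  # number of items both shoppers bought
        for item in items - mine:
            scores[item] += overlap
    # Keep only items backed by at least one overlapping shopper.
    return [item for item, score in scores.most_common(top_n) if score > 0]

print(recommend("dan", purchases))
```

Here "dan" shares two purchases with "bob" and one with "ann", so the pen (bought by both) ranks ahead of the lamp.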
