Data Mining
Introduction
intro
Data mining is a powerful new
technology with great potential to help
companies focus on the most important
information in the data they have
collected about the behavior of their
customers and potential customers.
Data collections in the real world
• The ten largest transaction-processing databases range from 3 to 18 terabytes.
• The ten largest decision-support databases range from 10 to 29 terabytes.
• Sizes doubled or tripled between 2001 and the end of 2003.
Questions arise
• Is there any new, unexpected and potentially useful information contained in this data?
• Can we use historical data to predict future outcomes (e.g. customer behavior, fraud detection)?
Some examples of data mining
1. Telecommunications
Huge amounts of data are collected daily:
• Transactional data (about each phone call)
• Data on mobile phones, house-based phones, Internet, etc.
• Other customer data (billing, personal information, etc.)
• Additional data (network load, faults, etc.)
Questions that arise:
• Which customer groups are highly profitable, and which are not?
• To which customers should we advertise what kind of special offers?
• What kind of call rates would increase profit without losing good customers?
• How do customer profiles change over time?
• How can fraud be detected (stolen mobile phones or phone cards)?
Another example
2. Health
• Different aspects of the health system
  • Personal health records (at GPs, specialists, etc.)
  • Hospital data (e.g. admission data, midwives data, surgery data)
  • Billing information (Medicare, PBS)
Questions
• Are doctors following the procedures (e.g. prescription of medication)?
• Are there adverse drug reactions (analysis of different data collections to find correlations)?
• Are people committing fraud (e.g. doctor shoppers)?
• Are there correlations between social and environmental issues and people's health?
What is data mining?
• Data mining is the automated extraction of previously unrealized information from large data sources for the purpose of supporting business actions.
Some more definitions
• Knowledge discovery in databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
• An information extraction activity whose goal is to discover hidden facts contained in databases.
• Data mining, or knowledge discovery, is the computer-assisted process of digging through and analyzing enormous sets of data and then extracting the meaning of the data.
Data mining process
• Extract, transform, and load transaction data onto the data warehouse system.
• Store and manage the data in a multidimensional database system.
• Provide data access to business analysts and information technology professionals.
Data mining process contd.
• Analyze the data with application software.
• Present the data in a useful format, such as a graph or table.
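The five steps above can be sketched end to end on a toy scale. This is only an illustration under stated assumptions: the sample rows and column names are invented, and an in-memory SQLite table stands in for the data warehouse and multidimensional database that the slides mention.

```python
import csv
import io
import sqlite3

# Hypothetical raw transaction export; the columns are invented for illustration.
RAW_CSV = """customer,product,amount
alice,phone plan,30.00
bob,phone plan,30.00
alice,handset,199.50
carol,handset,
"""


def extract(text):
    """Extract: parse raw CSV text into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with a missing amount and convert amounts to float."""
    return [
        (r["customer"], r["product"], float(r["amount"]))
        for r in rows
        if r["amount"].strip()
    ]


def load(records):
    """Load: store the cleaned records in an in-memory SQLite table
    (an assumed stand-in for a real data warehouse)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (customer TEXT, product TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)
    return conn


def analyze(conn):
    """Analyze: number of sales and total revenue per product, largest first."""
    query = (
        "SELECT product, COUNT(*), SUM(amount) FROM sales "
        "GROUP BY product ORDER BY SUM(amount) DESC"
    )
    return conn.execute(query).fetchall()


def present(summary):
    """Present: format the result as a plain-text table."""
    lines = [f"{'product':<12}{'sales':>6}{'revenue':>10}"]
    for product, count, revenue in summary:
        lines.append(f"{product:<12}{count:>6}{revenue:>10.2f}")
    return "\n".join(lines)


summary = analyze(load(transform(extract(RAW_CSV))))
print(present(summary))
```

The row with the missing amount is dropped in the transform step, so it never reaches the analysis.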
DM is multidisciplinary
What they do
• Detect patterns in data: rules, patterns, classes, associations and functional dependencies, outliers, data distributions, clusters
How they do it
• Search through data and pattern space, non-parametric modelling, filtering, aggregation
How well they do it
• Errors and biases, over-fitting, confounding effects, speed, scalability
Challenges in DM
• Data size
  • The size of data collections grows faster than linearly, doubling every 18 months
  • Scalable algorithms are needed
• Data complexity
  • Different types of data (free text, HTML, XML, multimedia)
  • Dimensionality of the data increases (more attributes)
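To make the data-size point concrete, doubling every 18 months means the size is multiplied by 2^(months / 18). The short sketch below works through that arithmetic; the 10-terabyte starting size is an assumed example value, not a figure for any particular database.

```python
def projected_size(initial_tb, months, doubling_months=18):
    """Size after `months` if the collection doubles every `doubling_months`."""
    return initial_tb * 2 ** (months / doubling_months)


# Assumed starting point of 10 TB, chosen only for illustration.
for years in (0, 3, 6, 9):
    size = projected_size(10, years * 12)
    print(f"after {years} years: {size:.0f} TB")
```

Under this assumption the collection quadruples every three years, which is why algorithms whose cost grows faster than the data quickly become impractical.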
Challenges contd.
• The curse of dimensionality affects many algorithms (for example, finding nearest neighbors in high dimensions)
• Data quality
  • Real-world data is messy and dirty (missing and out-of-date values, typographical errors, different coding/formats, etc.)
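One way to see the curse of dimensionality in the nearest-neighbor case is to measure how much farther the farthest point is than the nearest one. The sketch below is a small experiment under assumed settings (200 uniformly random points in the unit cube, a fixed seed); the function name `relative_contrast` is made up for this illustration.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


for dim in (2, 10, 100, 1000):
    print(f"dim={dim:>4}  relative contrast={relative_contrast(dim):.2f}")
```

As the dimension grows, the contrast shrinks: all points end up at roughly the same distance from the query, so the "nearest" neighbor is barely nearer than any other point.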
Why mine data?
• Data is being recorded
• Recorded data is being warehoused
• Computing power is affordable
• Competitive pressure is strong
• Commercial DM products are available
• It provides support for business decisions
Value to business
• Market segmentation - Identify the common characteristics of customers who buy the same products from your company.
• Customer churn - Predict which customers are likely to leave your company and go to a competitor.
• Fraud detection - Identify which transactions are most likely to be fraudulent.
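As a rough illustration of the fraud-detection item, unusual transactions can be flagged as outliers. The slides do not name a method; the sketch below assumes a robust z-score based on the median absolute deviation, with the conventional constant 0.6745 and cutoff 3.5, and the transaction amounts are invented.

```python
import statistics


def flag_unusual(amounts, threshold=3.5):
    """Return indices of amounts whose robust z-score exceeds `threshold`."""
    median = statistics.median(amounts)
    mad = statistics.median([abs(a - median) for a in amounts])
    if mad == 0:
        return []  # no spread to measure against
    return [
        i
        for i, a in enumerate(amounts)
        if 0.6745 * abs(a - median) / mad > threshold
    ]


# Hypothetical transaction amounts for one customer.
amounts = [12.5, 9.9, 14.0, 11.2, 10.8, 950.0, 13.1, 12.0]
print(flag_unusual(amounts))
```

A flagged transaction is only a candidate for review; real fraud systems combine many more signals than the amount alone.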
Value to business contd.
• Interactive marketing - Predict what each individual accessing a Web site is most likely interested in seeing.
• Market basket analysis - Understand what products or services are commonly purchased together; e.g., beer and diapers.
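Market basket analysis is commonly expressed with three measures for a rule "X implies Y": support (share of baskets containing both), confidence (share of baskets with X that also contain Y), and lift (confidence divided by the overall share of Y). The following is a minimal sketch for item pairs only; the baskets and the `min_support` default are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets, invented for illustration.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "diapers"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def pair_rules(baskets, min_support=0.4):
    """Return (antecedent, consequent, support, confidence, lift) for item pairs."""
    n = len(baskets)
    item_counts = Counter(item for b in baskets for item in b)
    pair_counts = Counter(
        pair for b in baskets for pair in combinations(sorted(b), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # pair is too rare to be interesting
        for x, y in ((a, b), (b, a)):
            confidence = count / item_counts[x]
            lift = confidence / (item_counts[y] / n)
            rules.append((x, y, support, confidence, lift))
    return rules


for x, y, support, confidence, lift in pair_rules(baskets):
    print(f"{x} -> {y}: support={support:.2f} "
          f"confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 suggests the two items appear together more often than they would if purchases were independent.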
Value to business contd.
• Trend analysis - Reveal the difference between a typical customer this month and last month.
• Data mining can also effectively deal with missing, inconsistent, and noisy data.
• Direct marketing - Identify which prospects should be included in a mailing list to obtain the highest response rate.
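The direct-marketing item can be sketched very simply: estimate response rates from a past campaign and mail the prospects who look most like past responders. Everything here is assumed for illustration, including the segment labels, the sample history, and the idea of ranking by segment alone; a real model would use many more attributes.

```python
from collections import defaultdict

# Hypothetical past campaign: (segment, responded). Invented for illustration.
history = [
    ("student", True), ("student", False), ("student", True), ("student", True),
    ("retiree", False), ("retiree", False), ("retiree", True), ("retiree", False),
    ("family", True), ("family", False),
]


def response_rates(history):
    """Observed response rate per segment in the past campaign."""
    sent = defaultdict(int)
    replied = defaultdict(int)
    for segment, responded in history:
        sent[segment] += 1
        replied[segment] += int(responded)
    return {s: replied[s] / sent[s] for s in sent}


def build_mailing_list(prospects, rates, size):
    """Pick the `size` prospects whose segment responded best in the past."""
    ranked = sorted(prospects, key=lambda p: rates.get(p[1], 0.0), reverse=True)
    return [name for name, _ in ranked[:size]]


rates = response_rates(history)
prospects = [("ann", "retiree"), ("ben", "student"), ("cy", "family"), ("di", "student")]
print(build_mailing_list(prospects, rates, size=2))
```

Prospects from segments that never appeared in the history get a rate of 0.0 here, which is a deliberately cautious default.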
