Data mining
DEFINITION
Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses.
• Extract, transform, and load transaction data onto the data warehouse system.
• Store and manage the data in a multidimensional database system.
• Provide data access to business analysts and information technology professionals.
• Analyze the data with application software.
• Present the data in a useful format, such as a graph or table.
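A minimal sketch of these steps in Python, using toy in-memory transaction rows in place of a real warehouse (the store names and amounts are invented for illustration):

```python
from collections import defaultdict

# Extract: raw transaction rows (store, month, amount) -- toy data.
raw = [
    ("Boston", "2024-03", "120.50"),
    ("Boston", "2024-03", "80.00"),
    ("Hartford", "2024-03", "45.25"),
]

# Transform and load: parse amounts into a simple "warehouse" table.
warehouse = [(store, month, float(amount)) for store, month, amount in raw]

# Analyze: aggregate along the (store, month) dimensions.
sales = defaultdict(float)
for store, month, amount in warehouse:
    sales[(store, month)] += amount

# Present: a plain-text table.
for (store, month), total in sorted(sales.items()):
    print(f"{store:10} {month} {total:8.2f}")
```

A real pipeline would replace the list literals with database reads and the print loop with a reporting tool, but the extract/transform/store/analyze/present shape is the same.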
Types of patterns sought:
• Classes
• Clusters
• Associations
• Sequential patterns
Classes: Stored data is used to locate data in predetermined groups. For example, a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order. This information could be used to increase traffic by offering daily specials.
Clusters: Data items are grouped according to logical relationships or consumer preferences. For example, data can be mined to identify market segments or consumer affinities.
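Clustering can be sketched with a tiny k-means pass in pure Python; the 1-D monthly-spend figures and the choice of two clusters below are illustrative assumptions, not data from the slides:

```python
def kmeans_1d(points, centers, iters=10):
    """Cluster 1-D points around `centers` by alternating assign/update."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            # Assign each point to its nearest center.
            i = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        # Move each center to the mean of its assigned points.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Monthly spend: a low-spend segment and a high-spend segment.
spend = [12, 15, 14, 13, 95, 101, 98, 110]
centers, clusters = kmeans_1d(spend, centers=[0.0, 50.0])
```

The two final centers are the "typical" spend of each discovered segment, which is exactly the kind of market-segment summary the bullet describes.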
Associations: Data can be mined to identify associations between items. The classic beer-and-diapers example is an example of associative mining.
Sequential patterns: Data is mined to anticipate behavior patterns and trends. For example, an outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes.
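The backpack prediction is, at its simplest, a conditional probability estimated by counting purchase histories; the customer records below are invented toy data:

```python
# Estimate P(backpack | sleeping bag and hiking shoes) by counting.
histories = [
    {"sleeping bag", "hiking shoes", "backpack"},
    {"sleeping bag", "hiking shoes", "backpack"},
    {"sleeping bag", "hiking shoes"},
    {"sleeping bag", "tent"},
    {"hiking shoes"},
]

prior = {"sleeping bag", "hiking shoes"}
have_prior = [h for h in histories if prior <= h]
p_backpack = sum("backpack" in h for h in have_prior) / len(have_prior)
print(f"P(backpack | sleeping bag & hiking shoes) = {p_backpack:.2f}")
```

True sequential-pattern mining also respects the *order* of purchases, but the counting idea is the same.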
Evolutionary Step | Business Question | Enabling Technologies | Product Providers | Characteristics
Data Collection (1960s) | "What was my total revenue in the last five years?" | Computers, tapes, disks | IBM, CDC | Retrospective, static data delivery
Data Access (1980s) | "What were unit sales in New England last March?" | Relational databases (RDBMS), Structured Query Language (SQL), ODBC | Oracle, Sybase, Informix, IBM, Microsoft | Retrospective, dynamic data delivery at record level
Data Warehousing & Decision Support (1990s) | "What were unit sales in New England last March? Drill down to Boston." | On-line analytic processing (OLAP), multidimensional databases, data warehouses | Pilot, Comshare, Arbor, Cognos, Microstrategy | Retrospective, dynamic data delivery at multiple levels
Data Mining (Emerging Today) | "What's likely to happen to Boston unit sales next month? Why?" | Advanced algorithms, multiprocessor computers, massive databases | Pilot, Lockheed, IBM, SGI, numerous startups (nascent industry) | Prospective, proactive information delivery
Techniques
• Neural Network
• Decision Tree
• Visualisation
• Link Analysis
Neural Network
• Used in a black-box fashion.
• One creates a training data set, lets the neural network learn patterns based on known outcomes, then sets the neural network loose on huge amounts of data.
• For example, a credit card company has 3,000 records, 100 of which are known fraud records.
• This data set trains the neural network to distinguish the fraud records from the legitimate ones.
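A minimal sketch of this train-then-apply workflow, using a single-neuron perceptron rather than a full neural network; the fraud features and labels are invented toy data chosen to be linearly separable:

```python
def train(rows, epochs=20, lr=0.1):
    """Fit a perceptron: nudge weights toward each misclassified example."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in rows:  # y: 1 = fraud, 0 = legitimate
            pred = 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
            err = y - pred
            w = [w[0] + lr * err * x[0], w[1] + lr * err * x[1]]
            b += lr * err
    return w, b

def predict(w, b, x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0

# Features: (amount in $1000s, transactions per hour) -- separable by design.
known = [((0.1, 1), 0), ((0.2, 2), 0), ((5.0, 9), 1), ((4.5, 8), 1)]
w, b = train(known)
flags = [predict(w, b, x) for x, _ in known]
```

Once trained, `predict` can be "set loose" on the remaining records; a real network stacks many such neurons and uses gradient-based training instead of this rule.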
Link analysis
• Another technique for associating like records.
• Not widely used, but some tools are built specifically for it.
• As the name suggests, the technique tries to find links, whether among customers or transactions, and demonstrate those links.
Visualisation
• Helps users understand their data.
• Bridges text-based and graphical presentation.
• Decision tree, rule, cluster, and pattern visualisations help users see data relationships rather than read about them.
• Many of the stronger data mining programs have made strides in improving their visual content over the past few years.
Decision Tree
• Uses real data mining algorithms.
• Decision trees help with classification and produce very descriptive output, helping users understand their data.
• A decision tree process will generate the rules followed in a process.
• For example, a lender at a bank goes through a set of rules when approving a loan.
• Based on the loan data a bank has, the outcomes of those loans, and the limits of acceptable default levels, the decision tree can set up the guidelines for the lending institution.
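The loan example can be written out as an explicit rule tree. The thresholds below are hypothetical illustrations, not rules mined from real lending data, but a tree learned from loan outcomes has exactly this nested if/else shape:

```python
def approve(income, debt_ratio, defaults):
    """Walk a hand-written decision tree; a mined tree reads the same way."""
    if defaults > 0:
        return "reject"          # past default: highest-risk branch
    if debt_ratio > 0.4:
        # High debt load: income decides between reject and manual review.
        return "reject" if income < 50_000 else "review"
    return "approve"

decisions = [
    approve(income=30_000, debt_ratio=0.5, defaults=0),
    approve(income=80_000, debt_ratio=0.5, defaults=0),
    approve(income=60_000, debt_ratio=0.2, defaults=0),
    approve(income=60_000, debt_ratio=0.2, defaults=1),
]
```

This is why the slide calls tree output "descriptive": each path from root to leaf is a rule a loan officer can read and audit.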
PROCESS STAGES
1. The initial exploration
2. Model building or pattern identification with validation/verification
3. Deployment
Stage 1: Exploration
• This stage usually starts with data preparation, which may involve cleaning data, transforming data, selecting subsets of records and, for data sets with large numbers of variables ("fields"), performing preliminary feature selection to bring the number of variables down to a manageable range.
Stage 2: Model building and validation
• This stage involves considering various models and choosing the best one based on their predictive performance, i.e. explaining the variability in question and producing stable results across samples.
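Choosing "the best model by predictive performance" can be sketched as scoring candidates on a held-out sample; the two candidate models and the data points below are deliberately trivial placeholders:

```python
# Compare two candidate models on held-out (x, y) pairs and keep the best.
train_data = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 8.1)]
holdout = [(5, 10.2), (6, 11.8)]

def model_double(x):      # candidate 1: y ~ 2x
    return 2 * x

def model_plus_one(x):    # candidate 2: y ~ x + 1
    return x + 1

def mse(model, data):
    """Mean squared error of a model's predictions on labeled data."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

scores = {m.__name__: mse(m, holdout) for m in (model_double, model_plus_one)}
best = min(scores, key=scores.get)
```

Scoring on data the models never saw is what makes the comparison a check of stability across samples rather than of fit to the training set.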
Process Models
• CRISP-DM: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment
• DMAIC (Six Sigma): Define, Measure, Analyze, Improve, Control
• SEMMA (SAS): Sample, Explore, Modify, Model, Assess
Stage 3: Deployment
• This final stage involves applying the model selected as best in the previous stage to new data in order to generate predictions or estimates of the expected outcome.
• KDnuggets and Rexer Analytics run surveys asking people involved in data mining which software they use most.
• While the most popular software is not necessarily the best for a particular purpose, these surveys can help guide us in choosing which software to evaluate.
Weka
• Includes a wide variety of methods.
• An easy-to-use interface makes it accessible for general users.
• Flexibility and extensibility make it suitable for academic users.
• Written in Java and released under the GNU General Public License (GPL).
• Runs on Windows, Linux, Mac, and other platforms.
SAS Enterprise Miner
• Part of the SAS suite of analysis software; uses a client-server architecture with a Java-based client, allowing parallel processing and grid computing.
• Can be deployed on both Windows and Linux/Unix platforms.
• Easy-to-use data-flow GUI.
• Can integrate code written in the SAS language.
• A data mining package with multiple techniques and a data-flow interface.