Data Mining for Business
Intelligence
Data Mining Concepts and Definitions
Why Data Mining?
More intense competition at the global scale
Recognition of the value in data sources
Availability of quality data on customers,
vendors, transactions, Web, etc.
Consolidation and integration of data
repositories into data warehouses
The exponential increase in data processing
and storage capabilities; and decrease in cost
Movement toward conversion of information
resources into nonphysical form
Definition of Data Mining
The nontrivial process of identifying valid,
novel, potentially useful, and ultimately
understandable patterns in data stored in
structured databases (Fayyad et al., 1996)
Keywords in this definition: Process, nontrivial,
valid, novel, potentially useful, understandable
Data mining: a misnomer?
Other names: knowledge extraction, pattern
analysis, knowledge discovery, information
harvesting, pattern searching, data dredging
Data Mining at the Intersection of
Many Disciplines
[Figure: data mining shown at the intersection of pattern recognition, machine learning, mathematical modeling, databases, and management science & information systems]
Data Mining Characteristics/Objectives
Source of data for DM is often a consolidated
data warehouse (not always!).
DM environment is usually a client-server or a
Web-based information systems architecture.
Data is the most critical ingredient for DM
which may include soft/unstructured data.
The miner is often an end user.
Striking it rich requires creative thinking.
Data mining tools’ capabilities and ease of use
are essential (Web, Parallel processing, etc.).
Data in Data Mining
Data: a collection of facts usually obtained as the
result of experiences, observations, or experiments
Data may consist of numbers, words, and images
Data: lowest level of abstraction (from which
information and knowledge are derived)
[Figure: taxonomy of data]
Data
  Categorical: Nominal, Ordinal
  Numerical: Interval, Ratio
Questions raised on the slide: DM with different data types? Other data types?
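One common way to handle the question of mining different data types is to encode each type according to its scale. The sketch below is only an illustration: the records, the field names, and the helper names (`one_hot`, `ordinal_codes`, `min_max`) are hypothetical and do not come from the slides.

```python
# Hypothetical toy records; field names and values are made up for illustration.
records = [
    {"color": "red", "size": "small", "income": 30000.0},
    {"color": "blue", "size": "large", "income": 60000.0},
    {"color": "red", "size": "medium", "income": 45000.0},
]


def one_hot(values):
    """Nominal data has no order, so each category gets its own 0/1 column."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


def ordinal_codes(values, order):
    """Ordinal data has an order, so each level is mapped to its rank."""
    rank = {level: i for i, level in enumerate(order)}
    return [rank[v] for v in values]


def min_max(values):
    """Numerical (interval/ratio) data can be rescaled to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # all values equal: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


colors = one_hot([r["color"] for r in records])
sizes = ordinal_codes([r["size"] for r in records], ["small", "medium", "large"])
incomes = min_max([r["income"] for r in records])
```

Whether rank codes are appropriate for ordinal data depends on the algorithm; this is one possible convention, not the only one.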
What Does DM Do? How Does it Work?
DM extracts patterns from data
Pattern? A mathematical (numeric and/or symbolic)
relationship among data items
Types of patterns
Association: (Beer & diapers in a market basket analysis)
Prediction: Predicts future occurrences based on the past (Super
Bowl winner, temperature on a specific day)
Cluster: (segmentation based on demographics or past purchase
behavior)
Sequential (or time series) relationships: (an existing bank
customer with a checking account will open a savings account within a
year)
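The association pattern in the list above (beer and diapers) is usually quantified with support, confidence, and lift. The sketch below computes these standard measures on a handful of hypothetical baskets; the data and the function names are assumptions made for illustration, not part of the slides.

```python
# Hypothetical market baskets, made up for illustration.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimated P(consequent | antecedent): support of both over support of antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


def lift(antecedent, consequent, baskets):
    """Confidence relative to the consequent's baseline support (above 1 suggests association)."""
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)


rule_support = support({"beer", "diapers"}, baskets)
rule_confidence = confidence({"beer"}, {"diapers"}, baskets)
rule_lift = lift({"beer"}, {"diapers"}, baskets)
```

On this toy data the rule "beer implies diapers" holds in two of the three beer baskets; with so few baskets the numbers only illustrate the arithmetic.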
A Taxonomy for Data Mining Tasks
Data Mining Task     Learning Method   Popular Algorithms
Prediction           Supervised        Classification and Regression Trees, ANN, SVM, Genetic Algorithms
  Classification     Supervised        Decision trees, ANN/MLP, SVM, Rough sets, Genetic Algorithms
  Regression         Supervised        Linear/Nonlinear Regression, Regression trees, ANN/MLP, SVM
Association          Unsupervised      Apriori, OneR, ZeroR, Eclat
  Link analysis      Unsupervised      Expectation Maximization, Apriori Algorithm, Graph-based Matching
  Sequence analysis  Unsupervised      Apriori Algorithm, FP-Growth technique
Clustering           Unsupervised      K-means, ANN/SOM
  Outlier analysis   Unsupervised      K-means, Expectation Maximization (EM)
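As an illustration of the unsupervised side of the taxonomy, here is a minimal k-means sketch in plain Python. The customer data (age, yearly spend in thousands) is hypothetical, and starting the centroids at the first k points is a simplification chosen to keep the example deterministic; practical implementations usually use random or k-means++ initialization.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def kmeans(points, k, iterations=20):
    """Plain k-means on 2-D points; centroids start at the first k points."""
    centroids = [points[i] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                mean_x = sum(x for x, _ in members) / len(members)
                mean_y = sum(y for _, y in members) / len(members)
                new_centroids.append((mean_x, mean_y))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:  # converged: assignments will not change
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical customers: (age, yearly spend in thousands).
customers = [(25, 2), (60, 9), (27, 3), (62, 8), (24, 2.5), (58, 10)]
centroids, clusters = kmeans(customers, k=2)
```

No class labels are supplied; the two segments emerge from the distances alone, which is what distinguishes clustering from the supervised tasks in the table.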
Other Data Mining Tasks
These are in addition to the primary DM
tasks (prediction, association, clustering)
Time-series forecasting
Part of sequence or link analysis?
Visualization
Another data mining task?
Types of DM
Hypothesis-driven data mining
Discovery-driven data mining
Data Mining Applications
Customer Relationship Management
Maximize return on marketing campaigns
Improve customer retention (churn analysis)
Maximize customer value (cross- or up-selling)
Identify and treat most valued customers
Banking & Other Financial
Automate the loan application process
Detect fraudulent transactions
Maximize customer value (cross- and up-selling)
Optimize cash reserves with forecasting
Data Mining Applications (cont.)
Retailing and Logistics
Optimize inventory levels at different locations
Improve the store layout and sales promotions
Optimize logistics by predicting seasonal effects
Minimize losses due to limited shelf life
Manufacturing and Maintenance
Predict/prevent machinery failures
Identify anomalies in production systems to
optimize manufacturing capacity
Discover novel patterns to improve product quality
Data Mining Applications (cont.)
Brokerage and Securities Trading
Predict changes in certain bond prices
Forecast the direction of stock fluctuations
Assess the effect of events on market movements
Identify and prevent fraudulent activities in trading
Insurance
Forecast claim costs for better business planning
Determine optimal rate plans
Optimize marketing to specific customers
Identify and prevent fraudulent claim activities
Data Mining Applications (cont.)
Computer hardware and software
Science and engineering
Government and defense
Homeland security and law enforcement
Travel industry
Healthcare
Medicine
Entertainment industry
Sports
Etc.
[Callout on slide: Highly popular application areas for data mining]
Data Mining Methods: Classification
Most frequently used DM method
Part of the machine-learning family
Employs supervised learning
Learn from past data, classify new data
The output variable is categorical
(nominal or ordinal) in nature
Classification versus regression?
Classification versus clustering?
Classification Techniques
Decision tree analysis
Statistical analysis
Neural networks
Support vector machines
Case-based reasoning
Bayesian classifiers
Genetic algorithms
Rough sets
Decision Trees
Employs the divide and conquer method
Recursively divides a training set until each
division consists of examples from one class
A general algorithm for decision tree building:
1. Create a root node and assign all of the training
data to it.
2. Select the best splitting attribute.
3. Add a branch to the root node for each value of
the split. Split the data into mutually exclusive
subsets along the lines of the specific split.
4. Repeat steps 2 and 3 for each and every leaf
node until the stopping criterion is reached.
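The four steps above can be sketched in Python as follows. The slides do not say how the "best" splitting attribute is chosen, so this sketch assumes information gain (as in ID3) and categorical attributes only; the loan data, the attribute names, and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(rows, target):
    """Shannon entropy of the class labels in rows."""
    counts = Counter(r[target] for r in rows)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def best_attribute(rows, attributes, target):
    """Step 2: pick the attribute with the largest information gain (an assumption)."""
    base = entropy(rows, target)

    def gain(attr):
        remainder = 0.0
        for value in set(r[attr] for r in rows):
            subset = [r for r in rows if r[attr] == value]
            remainder += len(subset) / len(rows) * entropy(subset, target)
        return base - remainder

    return max(attributes, key=gain)


def build_tree(rows, attributes, target):
    """Step 1 is the initial call: the root node receives all training rows."""
    labels = [r[target] for r in rows]
    # Stopping criterion: the node is pure, or no attributes remain.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, attributes, target)
    remaining = [a for a in attributes if a != attr]
    node = {"attribute": attr, "branches": {}}
    # Step 3: one branch per value, giving mutually exclusive subsets.
    for value in sorted(set(r[attr] for r in rows)):
        subset = [r for r in rows if r[attr] == value]
        # Step 4: repeat steps 2 and 3 on each new node.
        node["branches"][value] = build_tree(subset, remaining, target)
    return node


def classify(tree, row):
    """Follow branches until a leaf (a class label) is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"][row[tree["attribute"]]]
    return tree


# Hypothetical loan applications, made up for illustration.
loans = [
    {"income": "high", "credit": "good", "approve": "yes"},
    {"income": "high", "credit": "bad", "approve": "yes"},
    {"income": "low", "credit": "good", "approve": "yes"},
    {"income": "low", "credit": "bad", "approve": "no"},
    {"income": "low", "credit": "bad", "approve": "no"},
    {"income": "low", "credit": "good", "approve": "yes"},
]
tree = build_tree(loans, ["income", "credit"], "approve")
prediction = classify(tree, {"income": "low", "credit": "bad"})
```

The sketch raises a `KeyError` for attribute values not seen during training and does no pruning; production tools such as those listed later in the deck handle both.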
Data Mining Software
Commercial
IBM SPSS Modeler (formerly Clementine)
SAS – Enterprise Miner
IBM – Intelligent Miner
StatSoft – Statistica Data Miner
… many more
Free and/or Open Source
RapidMiner
Weka
… many more
[Figure: chart of data mining tools with two series, "Total (w/ others)" and "Alone", on a scale of 0 to 120. Tools shown: SPSS PASW Modeler (formerly Clementine); RapidMiner; SAS / SAS Enterprise Miner; Microsoft Excel; Your own code; Weka (now Pentaho); KXEN; MATLAB; Other commercial tools; KNIME; Microsoft SQL Server; Other free tools; Zementis; Oracle DM; Statsoft Statistica; Salford CART, Mars, other; Orange; Angoss; C4.5, C5.0, See5; Bayesia; Insightful Miner/S-Plus (now TIBCO); Megaputer; Viscovery; Clario Analytics; Miner3D; Thinkanalytics]
Source: KDNuggets.com, May 2009
Data Mining Myths
Data mining …
provides instant solutions/predictions.
is not yet viable for business applications.
requires a separate, dedicated database.
can only be done by those with advanced
degrees.
is only for large firms that have lots of
customer data.
is another name for good-old statistics.
Common Data Mining Blunders
1. Selecting the wrong problem for data mining
2. Ignoring what your sponsor thinks data
mining is and what it really can/cannot do
3. Not leaving sufficient time for data
acquisition, selection and preparation
4. Looking only at aggregated results and not
at individual records/predictions
5. Being sloppy about keeping track of the data
mining procedure and results
Common Data Mining Blunders (cont.)
6. Ignoring suspicious (good or bad) findings
and quickly moving on
7. Running mining algorithms repeatedly and
blindly, without thinking about the next stage
8. Naively believing everything you are told
about the data
9. Naively believing everything you are told
about your own data mining analysis
10. Measuring your results differently from the
way your sponsor measures them