Data mining and neural networks
Lecturer:
Dr Evgeny Mirkes em322@le.ac.uk
A list of publications can be found on Google Scholar.
Marks:
Final exam 50%
3 Computational tasks 30%
3 Homework assignments 20%
Total 100%
The global Big Data Challenge
The age of analytics: Competing in a data-driven world
The McKinsey Global Institute
The potential shortage of data analysis professionals is estimated at
approximately 140,000 to 190,000 positions.
Gartner expected even more positions by 2015 (millions).
Every day we can find thousands of data-analysis related
vacancies in the UK.
IT Job Market, Database & Business Intelligence Category
in the UK
More than 11,500 positions per 6 months
Big data creates value in several ways
• Creating transparency
• Enabling experimentation to discover needs, expose
variability, and improve performance
• Segmenting populations to customize actions
• Replacing/supporting human decision making with
automated algorithms
• Innovating new business models, products, and services
Fundamental book: T. Hastie, R. Tibshirani, J. Friedman, The Elements of
Statistical Learning: Data Mining, Inference, and Prediction,
https://web.stanford.edu/~hastie/Papers/ESLII.pdf
A book of recipes and advice:
Data Mining and Knowledge Discovery Handbook; edited by Oded Maimon and
Lior Rokach, Springer, 2005
http://link.springer.com/book/10.1007%2Fb107408
“This timely book says out loud what has finally become apparent: in the
modern world, Data is Business, and you can no longer think business
without thinking data. Read this book and you will understand the Science
behind thinking data.”
Ron Bekkerman, Chief Data Officer at Carmel Ventures
What is Data Mining?
• Recently* coined term for confluence of ideas from
statistics and computer science (machine learning and
database methods) applied to large databases in science,
engineering and business.
• In a state of flux: many definitions and a lot of debate about
what it is and what it is not. Terminology is not standard,
e.g. bias, classification, prediction; feature = regressor =
independent variable; target = dependent variable =
response; case = exemplar = row = record = observation.
* The First International Conference on Knowledge Discovery
and Data Mining was held in 1995
What is Data Mining?
• Broad Definition includes traditional statistical methods
• Narrow Definition emphasizes automated and heuristic
methods
• Data mining, Data dredging, Data fishing
• Knowledge Discovery in Databases (KDD)
What is Data Mining?
Daryl Pregibon: Data mining is “statistics at scale and
speed”. A common extension is “and simplicity” (of logic).
Gartner Group: “Data mining is the process of discovering
meaningful correlations, patterns and trends by sifting
through large amounts of data stored in repositories. Data
mining employs pattern recognition technologies, as well
as statistical and mathematical techniques.”
Drivers
• Market: From focus on product/service to focus on
customer
• IT: From focus on up-to-date balances to focus on
patterns in transactions – Data Warehouses – OLAP
(Online Analytical Processing)
• Dramatic drop in storage costs: huge databases
  • Walmart: 20 million transactions per day, 10 terabyte database
  • Blockbuster: 36 million households
• Automatic data capture of transactions
  • Bar codes, POS devices, mouse clicks, location data (GPS)
• Internet: Personalised interaction, longitudinal data
What is a Data Warehouse?
A data warehouse is a copy of transaction data
specifically structured for querying and reporting.
OR
A data warehouse is a copy of transaction data
specifically structured for query and analysis.
OR
A data warehouse is a centralized store of an
organization's data resources implemented specifically
for query, reporting, and analysis purposes.
What is OLAP/FASMI?
OLAP: ‘On-Line Analytical Processing’
FASMI: ‘Fast Analysis of Shared Multidimensional Information’
F: FAST means that the system is targeted to deliver most responses
to users within about five seconds, with the simplest analyses taking
no more than one second and very few taking more than 20
seconds.
A: ANALYSIS means that the system can cope with any business logic and
statistical analysis that is relevant for the application and the user,
while keeping it easy enough for the target user.
S: SHARED means that the system implements all the security requirements
for confidentiality (possibly down to cell level) and, if multiple write
access is needed, concurrent update locking at an appropriate level.
M: MULTIDIMENSIONAL is the key requirement. If we had to pick a
one-word definition of OLAP, this is it.
I: INFORMATION refers to capacity: products are measured in terms of
how much input data they can handle, not how many gigabytes they take
to store it.
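To make the MULTIDIMENSIONAL requirement concrete, the following minimal Python sketch rolls up a tiny fact table along different sets of dimensions. The table, its dimension names (region, product, quarter) and the helper roll_up are hypothetical and exist only for illustration; real OLAP servers precompute and index such aggregates rather than scanning rows on every query.

```python
from collections import defaultdict

# Hypothetical fact table: one row per (region, product, quarter) cell.
SALES = [
    {"region": "North", "product": "A", "quarter": "Q1", "amount": 100},
    {"region": "North", "product": "B", "quarter": "Q1", "amount": 50},
    {"region": "South", "product": "A", "quarter": "Q1", "amount": 70},
    {"region": "South", "product": "A", "quarter": "Q2", "amount": 30},
    {"region": "North", "product": "A", "quarter": "Q2", "amount": 120},
]

def roll_up(rows, dims, measure="amount"):
    """Sum `measure` over all rows, grouped by the dimensions listed in `dims`."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in dims)
        totals[key] += row[measure]
    return dict(totals)

by_region = roll_up(SALES, ["region"])                      # one dimension
by_region_quarter = roll_up(SALES, ["region", "quarter"])   # two dimensions
grand_total = roll_up(SALES, [])                            # fully aggregated

print(by_region)
print(by_region_quarter)
print(grand_total)
```

Each call answers a different question about the same data simply by changing the list of dimensions, which is the essence of a multidimensional view.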
Core Disciplines
• Statistics (adapted for 21st century data sizes and speed
  requirements)
  • Descriptive: Visualisation
  • Models: Regression, Classification, Cluster analysis
• Machine Learning
  • Neural networks
• Database Retrieval
  • Association Rules
• Parallel developments
  • Tree methods
  • k Nearest Neighbours
  • OLAP-EDA
Process
1. Develop understanding of application, goals (data
understanding)
2. Create dataset for study (often from a data warehouse)
3. Data cleaning and preprocessing
4. Data reduction and projection
5. Choose Data Mining task
6. Choose Data Mining algorithms
7. Use algorithm to perform task
8. Interpret results and repeat steps 1-7 if necessary
9. Deploy: integrate into operational system
SEMMA Methodology (SAS)
• Sample from dataset, partition into Training, Validation
and Test data sets
• Explore data set statistically and graphically
• Modify: Transform variables, impute missing values
• Model: fit models, for example, regression, classification
tree, neural network
• Assess: Evaluate model accuracy using Test data set
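The five SEMMA steps can be sketched end to end in plain Python. This is only an illustration under stated assumptions, not SAS's implementation: the data are synthetic (y = 1 + 2x plus noise, with about 10% of x values missing), the 60/20/20 split is an arbitrary choice, and the function names sample, explore, modify, fit_line and assess are invented here to mirror the step names.

```python
import random
import statistics

def sample(rows, seed=0, fractions=(0.6, 0.2)):
    """Sample: shuffle, then partition into training, validation and test sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = int(fractions[0] * len(rows))
    n_valid = int(fractions[1] * len(rows))
    return rows[:n_train], rows[n_train:n_train + n_valid], rows[n_train + n_valid:]

def explore(rows):
    """Explore: basic summary statistics of the observed x values."""
    xs = [x for x, _ in rows if x is not None]
    return {"n": len(rows), "missing": len(rows) - len(xs),
            "mean_x": statistics.mean(xs), "stdev_x": statistics.stdev(xs)}

def modify(rows, fill_value):
    """Modify: impute missing x values with a value learned on the training set."""
    return [(fill_value if x is None else x, y) for x, y in rows]

def fit_line(rows):
    """Model: ordinary least squares for y = a + b * x; returns (a, b)."""
    xs, ys = zip(*rows)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    b = sum((x - mx) * (y - my) for x, y in rows) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def assess(model, rows):
    """Assess: mean squared error of the fitted line on held-out rows."""
    a, b = model
    return statistics.mean((y - (a + b * x)) ** 2 for x, y in rows)

# Synthetic data, assumed purely for illustration.
rng = random.Random(42)
data = []
for _ in range(200):
    x = rng.uniform(0, 10)
    y = 1 + 2 * x + rng.gauss(0, 1)
    data.append((None if rng.random() < 0.1 else x, y))

train, valid, test = sample(data)
summary = explore(train)
train, valid, test = (modify(part, summary["mean_x"]) for part in (train, valid, test))
model = fit_line(train)
valid_mse = assess(model, valid)  # would guide model selection or tuning
test_mse = assess(model, test)    # final estimate on data not used so far
print(summary, model, valid_mse, test_mse)
```

Note that the imputation value is computed on the training partition only and then reused for the validation and test partitions, so no information from the held-out data leaks into the model.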
Illustrative Applications
• Customer Relationship Management
• Target marketing
• Attrition Prediction/Churn analysis
• Fraud Detection
• Credit Scoring (Risk analysis)
• Finance
• E-commerce and Internet
• Recommendation systems
• Clicks to Customers
Target marketing
• Business problem: Use list of prospects for direct mailing
campaign
• Solution: Use Data Mining to identify the most promising
respondents combining demographic and geographic
data with data on past purchase behaviour
• Benefit: Better response rate, savings in campaign costs
Fraud Detection
• Business problem: Fraud increases costs or decreases
revenue
• Solution: Use logistic regression or neural networks to
identify characteristics of fraudulent cases, in order to prevent
them in future or prosecute them more vigorously
• Benefit: Increases profits by reducing undesirable
customers
• Example: Automobile Insurance Bureau of Massachusetts
  • Past reports of claims adjusters were scrutinised by experts to
    identify cases of fraud
  • Several characteristics (over 60) of the claimant, type of accident
    and type of injury are coded into a database
  • Dimensionality reduction methods are used to obtain weighted
    variables. Stepwise subset selection in multiple regression is
    used to identify characteristics strongly correlated with fraud
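As a rough illustration of the logistic regression option mentioned in the solution, the sketch below fits a two-feature model by batch gradient descent. Everything about the data is an assumption: the features are synthetic and already standardised, the two classes are balanced (real fraud data are usually highly imbalanced), and the learning rate and number of epochs are arbitrary. It is not the method used in the Massachusetts example.

```python
import math
import random

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(w, x):
    """Estimated probability of fraud; w[0] is the intercept."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

def fit_logistic(X, y, lr=0.3, epochs=500):
    """Fit the weights by batch gradient descent on the mean log-loss."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            grad[0] += err
            for j in range(d):
                grad[j + 1] += err * xi[j]
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w

# Synthetic, standardised claim features (assumed for illustration only):
# x[0] stands for claim amount, x[1] for number of previous claims.
rng = random.Random(0)
X, y = [], []
for _ in range(100):
    X.append([rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)])  # legitimate claim
    y.append(0)
    X.append([rng.gauss(2.0, 1.0), rng.gauss(1.5, 1.0)])  # fraudulent claim
    y.append(1)

weights = fit_logistic(X, y)
predictions = [int(predict_proba(weights, xi) >= 0.5) for xi in X]
accuracy = sum(p == yi for p, yi in zip(predictions, y)) / len(y)
print(weights, accuracy)
```

The signs and sizes of the fitted weights indicate which characteristics push a claim towards the fraudulent class, which is the kind of interpretation the slide refers to.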
Risk analysis
• Business problem: Reduce risk of loans to delinquent
customers
• Solution: Use credit scoring models based on
discriminant analysis to create a score function that
separates out risky customers
• Benefit: Decrease in cost of bad debts
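A score function of this kind can be sketched with Fisher's linear discriminant, one common form of discriminant analysis. The sketch below is restricted to two features so that the matrix inverse can be written out by hand. The feature meanings (debt ratio, number of late payments), the synthetic data and the midpoint threshold (which assumes equal priors and equal misclassification costs) are all assumptions made for illustration.

```python
import random
import statistics

def mean_vector(rows):
    return [statistics.mean(column) for column in zip(*rows)]

def scatter_matrix(rows, centre):
    """2x2 matrix of summed outer products of deviations from `centre`."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for row in rows:
        dev = [row[0] - centre[0], row[1] - centre[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += dev[i] * dev[j]
    return s

def score(w, x):
    """Linear credit score: higher means riskier."""
    return w[0] * x[0] + w[1] * x[1]

def fit_score_function(good, bad):
    """Fisher's linear discriminant for two features.

    Returns (w, threshold); a customer x is flagged as risky when
    score(w, x) > threshold.
    """
    m_good, m_bad = mean_vector(good), mean_vector(bad)
    s_good, s_bad = scatter_matrix(good, m_good), scatter_matrix(bad, m_bad)
    sw = [[s_good[i][j] + s_bad[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    diff = [m_bad[0] - m_good[0], m_bad[1] - m_good[1]]
    # w = Sw^{-1} (m_bad - m_good), using the closed-form 2x2 inverse.
    w = [(sw[1][1] * diff[0] - sw[0][1] * diff[1]) / det,
         (sw[0][0] * diff[1] - sw[1][0] * diff[0]) / det]
    # Midpoint threshold: assumes equal priors and equal error costs.
    threshold = (score(w, m_good) + score(w, m_bad)) / 2
    return w, threshold

# Synthetic customers (assumed): (debt ratio, number of late payments).
rng = random.Random(7)
good = [(rng.gauss(0.3, 0.1), rng.gauss(1.0, 1.0)) for _ in range(100)]
bad = [(rng.gauss(0.5, 0.1), rng.gauss(3.0, 1.0)) for _ in range(100)]

w, threshold = fit_score_function(good, bad)
correct = sum(score(w, x) <= threshold for x in good)
correct += sum(score(w, x) > threshold for x in bad)
accuracy = correct / (len(good) + len(bad))
print(w, threshold, accuracy)
```

In practice the threshold would be moved away from the midpoint to reflect how much more a bad loan costs than a rejected good customer.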
Finance
• Business problem: The pricing of corporate bonds depends on
several factors: risk profile of company, seniority of debt,
dividends, prior history, etc.
• Solution: Through Data Mining, develop more accurate
models for price prediction
Recommendation systems
• Business opportunity: Users rate items (Amazon.com,
CDNOW.com, FilmFestivals.com) on the web. How can
information from other users be used to infer a rating for a
particular user?
• Solution: Use the technique known as collaborative
filtering
• Benefit: Increase revenues by cross-selling and up-selling
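A minimal sketch of user-based collaborative filtering, assuming a tiny hypothetical ratings table: the rating a user would give to an unseen item is estimated as the similarity-weighted average of the ratings that other users gave to it. The user names, item names and the choice of cosine similarity on co-rated items are illustrative assumptions; production systems typically mean-centre ratings and use far larger neighbourhoods.

```python
import math

# Hypothetical ratings on a 1-5 scale: user -> {item: rating}.
RATINGS = {
    "ann":  {"book1": 5, "book2": 4, "book3": 1},
    "bob":  {"book1": 4, "book2": 5, "book3": 2, "book4": 5},
    "cara": {"book1": 1, "book2": 2, "book3": 5, "book4": 1},
}

def cosine(u, v):
    """Cosine similarity computed on the items both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def predict(ratings, user, item):
    """Similarity-weighted average of other users' ratings for `item`.

    Returns None when no other user has rated the item.
    """
    num = den = 0.0
    for other, their in ratings.items():
        if other == user or item not in their:
            continue
        sim = cosine(ratings[user], their)
        num += sim * their[item]
        den += sim
    return num / den if den else None

ann_book4 = predict(RATINGS, "ann", "book4")
print(round(ann_book4, 2))  # roughly 3.6: pulled towards bob, whose tastes match ann's
```

Because all raw ratings are positive, cosine similarity stays fairly high even for users with opposite tastes (ann and cara here), which is one reason mean-centred similarities are often preferred.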
Converting Clicks to Customers
• Business problem: 50% of Dell’s clients order their
computers through the Internet. However, the conversion rate
is only 0.5%: just 0.5% of visitors to Dell’s web pages become
customers.
• Solution: Use the sequence of each visitor’s clicks to cluster
customers, and change the website design to maximise the
number of visitors who eventually buy.
• Benefit: Increase revenues
Emerging Major Data Mining applications
• Spam detection
• Bioinformatics/Genomics
• Medical History Data – Insurance Claims
• Personalisation of services in e-commerce
• Security
• RF Tags: Gillette
• Container Shipments
• Network Intrusion Detection
Core concepts
• Types of data:
  • Numeric
    • Continuous – ratio and interval
    • Discrete
    • Need for Binning
  • Categorical
    • Ordinal (ordered)
    • Nominal (unordered)
    • Binary
• Overfitting and Generalisation
• Regularisation: penalty for model complexity
• Distance and Curse of Dimensionality
• Random and stratified sampling, resampling
• Loss function
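Of the concepts listed above, stratified sampling is easy to show in a few lines. The sketch below draws the same fraction from each class separately, so that a rare class keeps its share of the sample. The data set (950 "legit" and 50 "fraud" rows) and the helper name stratified_sample are hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(rows, label_of, fraction, seed=0):
    """Draw `fraction` of the rows from each class (stratum) separately."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[label_of(row)].append(row)
    chosen = []
    for label in sorted(strata):
        members = strata[label]
        k = max(1, round(fraction * len(members)))  # keep at least one per class
        chosen.extend(rng.sample(members, k))
    return chosen

# Hypothetical imbalanced data: (row id, class label).
rows = [(i, "legit") for i in range(950)] + [(i, "fraud") for i in range(950, 1000)]

subset = stratified_sample(rows, label_of=lambda row: row[1], fraction=0.1)
counts = Counter(label for _, label in subset)
print(counts)
```

A simple random sample of 100 rows would contain 5 fraud cases only on average and could easily contain far fewer; the stratified version fixes the class proportions by construction.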
Examples of overfitting
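One way to reproduce overfitting in a few lines is a k-nearest-neighbour classifier trained on noisy labels. In the sketch below the data are synthetic (an assumption): points lie on [0, 1], the true class is 1 when x > 0.5, and each label is flipped with probability 0.2. With k = 1 the classifier memorises the training set, including its noise.

```python
import random

def make_data(n, rng, noise=0.2):
    """Points on [0, 1]; true class is 1 when x > 0.5, flipped with prob. `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < noise:
            label = 1 - label
        data.append((x, label))
    return data

def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return int(2 * votes > k)

def error_rate(train, data, k):
    """Fraction of points in `data` misclassified by k-NN fitted on `train`."""
    wrong = sum(knn_predict(train, x, k) != label for x, label in data)
    return wrong / len(data)

rng = random.Random(1)
train, test = make_data(200, rng), make_data(1000, rng)

results = {}
for k in (1, 15):
    results[k] = (error_rate(train, train, k), error_rate(train, test, k))
    print(f"k={k}: training error {results[k][0]:.3f}, test error {results[k][1]:.3f}")
```

With k = 1 every training point is its own nearest neighbour, so the training error is zero while the test error is clearly positive: a perfect fit to the training data that does not generalise. A larger k typically gives a higher training error but a smaller gap between training and test error.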
Typical characteristics of mining data
• “Standard” format is spreadsheet:
  • Each row is one observation (object)
  • Each column is one variable (feature)
• Many rows, many columns
  • Many rows, moderate number of columns (e.g. phone calls)
  • Many columns, moderate number of rows (e.g. genomics)
• Opportunistic (often a by-product of transactions)
  • Not from designed experiments
• Often has outliers and missing data
Course Topics
• Supervised Techniques
  • Classification
    • k Nearest Neighbours, Naïve Bayes, Decision Tree, Neural Networks
  • Prediction
    • Regression, Neural Networks
• Unsupervised Techniques
  • Cluster Analysis, Principal components
• Time series analysis
• Data preprocessing