Lecture 1
CS 590M Fall 2001: Security
Issues in Data Mining
Chris Clifton
Tuesdays and Thursdays, 9-10:15
Heavilon Hall 123
Course Goals:
Knowledge
At the end of this course, you will:
• Have a basic understanding of the
technology involved in Data Mining
• Know how data mining impacts
information security
• Understand leading-edge research on
data mining and security
Course Goals:
Skills
At the end of this course, you will:
• Be able to understand new technology
through reading the research literature
• Have given conference-style
presentations on difficult research topics
• Have written journal-style critical
reviews of research papers
Course Topics
• Data Mining (as necessary)
– What is it?
– How does it work?
• Research in the use of Data Mining to
improve security
• Research in the security problems posed
by the availability of Data Mining
technology
Process
Initial phase of course: Data Mining
background
• Lectures, handouts, suggested reading
• Length/material to be determined by
what you already know
Expect a quiz at the end of this phase
Process
• Phase 2: Student Presentations
• Two paper presentations per class
– Student presenting will read paper and prepare
presentation materials
You must prepare materials yourself – no fair using
material obtained from the authors
• Any week you do not present, you will write a
journal-quality review of one of the papers
being presented that week
You may request papers to review/present; I will make the
final assignment
Evaluation/Grading
Evaluation will be a subjective process; however,
it will be based primarily on your
understanding of the material as evidenced in:
• Your presentations
• Your written reviews
• Your contribution to classroom discussions
• Post phase-1 quiz
Policy on Academic Integrity
• Basic idea: You are learning to do Original
Research
– Work you do for the class should be original
(yours)
– Don’t borrow the authors’ slides for presentations, even
if they are available
(Copying images/graphs is okay where necessary)
• More details on course web site:
http://www.cs.purdue.edu/homes/clifton/cs590m
• When in doubt, ASK!
What is Data Mining?
Searching through large amounts of data for
correlations, sequences, and trends.
Current “driving applications” in sales (targeted
marketing, inventory) and finance (stock
picking)
[Figure: mining sales data – select the information to be mined, choose a mining tool based on the type of results wanted (sequence, classification, inference, clustering), and evaluate the results; example result: “70% of customers who purchase comforters later purchase curtains”]
Knowledge Discovery in
Databases: Process
[Figure: the KDD process – Data → (Selection) → Target Data → (Preprocessing) → Preprocessed Data → (Data Mining) → Patterns → (Interpretation/Evaluation) → Knowledge]
adapted from:
U. Fayyad, et al. (1995), “From Knowledge Discovery to Data
Mining: An Overview,” Advances in Knowledge Discovery and
Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press
See also: http://www.crisp-dm.org
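To make the process concrete, here is a minimal Python sketch of the Selection → Preprocessing → Data Mining → Interpretation/Evaluation flow. The data, function names, and support threshold are invented for illustration and are not taken from any particular tool.

```python
# A minimal, illustrative KDD pipeline: Data -> Selection -> Preprocessing
# -> Data Mining -> Interpretation/Evaluation -> Knowledge.
# All records, names, and thresholds here are hypothetical.
from itertools import combinations
from collections import Counter

raw_data = [
    {"customer": "c1", "items": ["comforter", "curtains"], "note": None},
    {"customer": "c2", "items": ["comforter", "curtains"]},
    {"customer": "c3", "items": ["comforter"]},
    {"customer": "c4", "items": []},                 # will be cleaned out
]

def select(records):                                 # Selection: keep the fields we mine
    return [r["items"] for r in records]

def preprocess(baskets):                             # Preprocessing: drop empty/dirty rows
    return [sorted(set(b)) for b in baskets if b]

def mine(baskets):                                   # Data Mining: count item pairs
    return Counter(p for b in baskets for p in combinations(b, 2))

def evaluate(patterns, n, min_support=0.5):          # Interpretation/Evaluation
    return {p: c / n for p, c in patterns.items() if c / n >= min_support}

baskets = preprocess(select(raw_data))
print(evaluate(mine(baskets), len(baskets)))         # the resulting "knowledge": frequent pairs
```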
What is Data Mining?
History
• Knowledge Discovery in Databases workshops
started ‘89
– Now a conference under the auspices of ACM
SIGKDD
– IEEE conference series starting 2001
• Key founders / technology contributors:
– Usama Fayyad, JPL (then Microsoft, now has his
own company, Digimine)
– Gregory Piatetsky-Shapiro (then GTE, now his own
data mining consulting company, Knowledge
Stream Partners)
– Rakesh Agrawal (IBM Research)
What Can Data Mining Do?
• Cluster
• Classify
– Categorical, Regression
• Summarize
– Summary statistics, Summary rules
• Link Analysis / Model Dependencies
– Association rules
• Sequence analysis
– Time-series analysis, Sequential associations
• Detect Deviations
Clustering
• Find groups of similar data
items
• Statistical techniques require
definition of “distance” (e.g.
between travel profiles),
conceptual techniques use
background concepts and
logical descriptions
Uses:
• Demographic analysis
Technologies:
• Self-Organizing Maps
• Probability Densities
• Conceptual Clustering
“Group people with
similar travel
profiles”
– George, Patricia
– Jeff, Evelyn, Chris
– Rob
[Figure: the resulting clusters; e.g., a clustering of top news stories]
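As a toy illustration of distance-based clustering, the sketch below runs a small pure-Python k-means over invented two-dimensional “travel profiles” (trips per year, average miles per trip). The numbers and the choice of k are assumptions made only for the example.

```python
# Toy distance-based clustering of "travel profiles" (trips/year, avg miles).
# Pure-Python k-means; the profiles and k are made-up examples.
import math, random

profiles = {"George": (2, 300), "Patricia": (3, 250),
            "Jeff": (12, 2000), "Evelyn": (11, 1800), "Chris": (10, 2200),
            "Rob": (40, 150)}

def dist(a, b):
    return math.dist(a, b)              # Euclidean "distance" between profiles

def kmeans(points, k, iters=20):
    random.seed(0)
    centers = random.sample(list(points.values()), k)
    for _ in range(iters):
        groups = {i: [] for i in range(k)}
        for name, p in points.items():  # assign each profile to nearest center
            i = min(range(k), key=lambda i: dist(p, centers[i]))
            groups[i].append(name)
        for i, members in groups.items():          # move centers to group means
            if members:
                pts = [points[m] for m in members]
                centers[i] = tuple(sum(c) / len(pts) for c in zip(*pts))
    return groups

print(kmeans(profiles, k=3))   # e.g. {George, Patricia}, {Jeff, Evelyn, Chris}, {Rob}
```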
Classification
• Find ways to separate data
items into pre-defined groups
– We know X and Y belong
together, find other things in
same group
• Requires “training data”:
Data items where group is
known
Uses:
• Profiling
Technologies:
• Generate decision trees
(results are human
understandable)
• Neural Nets
“Route documents to
most likely interested
parties”
– English or non-English?
– Domestic or Foreign?
[Figure: training data with known groups is fed to the mining tool, which produces a classifier]
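A minimal sketch of classification from training data: a one-level decision tree (a decision stump) that picks the single yes/no feature which best separates the labeled examples, then routes new items. The features and routing labels are invented.

```python
# Miniature decision-stump classifier: from labeled training data, pick the
# single yes/no feature that best predicts the group, then route new items.
# Features and labels are invented for illustration.
from collections import Counter

# training data: (features, group)
train = [({"english": True,  "domestic": True},  "route-to-A"),
         ({"english": True,  "domestic": False}, "route-to-B"),
         ({"english": False, "domestic": True},  "route-to-B"),
         ({"english": False, "domestic": False}, "route-to-B")]

def learn_stump(data):
    best = None
    for feat in data[0][0]:
        # majority label on each side of the split
        yes = Counter(g for f, g in data if f[feat]).most_common(1)[0][0]
        no  = Counter(g for f, g in data if not f[feat]).most_common(1)[0][0]
        acc = sum((yes if f[feat] else no) == g for f, g in data) / len(data)
        if best is None or acc > best[0]:
            best = (acc, feat, yes, no)
    return best[1:]                      # (feature, label-if-yes, label-if-no)

feat, if_yes, if_no = learn_stump(train)
print(feat, if_yes, if_no)               # which single question routes best
new_doc = {"english": True, "domestic": True}
print(if_yes if new_doc[feat] else if_no)   # classify a new document
```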
Association Rules
• Identify dependencies in
the data:
– X makes Y likely
• Indicate significance of
each dependency
• Bayesian methods
Uses:
• Targeted marketing
Technologies:
• AIS, SETM, Hugin,
TETRAD II
“Find groups of items
commonly purchased
together”
– People who purchase fish
are extraordinarily likely
to purchase wine
– People who purchase
Turkey are
extraordinarily likely to
purchase cranberries
Date  Time   Register  Fish  Turkey  Cranberries  Wine  …
12/6  13:15  2         N     Y       Y            Y     …
12/6  13:16  3         Y     N       N            Y     …
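What “X makes Y likely” means computationally is just support and confidence. The sketch below computes both for item pairs over a tiny, hand-typed basket list loosely based on the register data above (two extra hypothetical baskets are added so the rates are not all 100%).

```python
# Support and confidence for "customers who buy X also buy Y", computed over
# a tiny, hand-typed basket list (illustrative only).
baskets = [
    {"turkey", "cranberries", "wine"},        # 12/6 13:15, register 2
    {"fish", "wine"},                         # 12/6 13:16, register 3
    {"fish", "wine"},                         # extra hypothetical basket
    {"turkey", "cranberries"},                # extra hypothetical basket
]

def rule_stats(x, y, baskets):
    n = len(baskets)
    both  = sum(1 for b in baskets if x in b and y in b)
    has_x = sum(1 for b in baskets if x in b)
    support = both / n                              # how often X and Y co-occur
    confidence = both / has_x if has_x else 0.0     # estimate of P(Y | X)
    return support, confidence

print("fish -> wine:", rule_stats("fish", "wine", baskets))
print("turkey -> cranberries:", rule_stats("turkey", "cranberries", baskets))
```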
Sequential Associations
• Find event sequences that are
unusually likely
• Requires “training” event list,
known “interesting” events
• Must be robust in the face of
additional “noise” events
Uses:
• Failure analysis and
prediction
Technologies:
• Dynamic programming
(Dynamic time warping)
• “Custom” algorithms
“Find common sequences
of warnings/faults
within 10 minute
periods”
– Warn 2 on Switch C
preceded by Fault 21 on
Switch B
– Fault 17 on any switch preceded by Warn 2 on any switch

Time   Switch  Event
21:10  B       Fault 21
21:11  A       Warn 2
21:13  C       Warn 2
21:20  A       Fault 17
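A minimal sketch of the windowed counting behind such sequences: count how often one event precedes another within a 10-minute window, using the switch log above (times converted to minutes since midnight for simplicity).

```python
# Count how often one event precedes another within a 10-minute window,
# using the switch log above. Times are minutes since midnight.
from itertools import combinations
from collections import Counter

log = [(21*60 + 10, "B", "Fault21"),
       (21*60 + 11, "A", "Warn2"),
       (21*60 + 13, "C", "Warn2"),
       (21*60 + 20, "A", "Fault17")]

pairs = Counter()
for (t1, _, e1), (t2, _, e2) in combinations(log, 2):   # log is time-ordered
    if 0 < t2 - t1 <= 10:            # e1 happened at most 10 minutes before e2
        pairs[(e1, e2)] += 1

# e.g. (Fault21, Warn2) and (Warn2, Fault17) show up as candidate sequences
for (a, b), count in pairs.most_common():
    print(f"{a} precedes {b}: {count} time(s)")
```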
Deviation Detection
• Find unexpected values,
outliers
Uses:
• Failure analysis
• Anomaly discovery for analysis
Technologies:
• Clustering/classification methods
• Statistical techniques
• Visualization
“Find unusual occurrences in IBM stock prices”
Date      Close   Volume  Spread
58/07/02  369.50  314.08  .022561
58/07/03  369.25  313.87  .022561
58/07/04  Market Closed
58/07/07  370.00  314.50  .022561

Sample date  Event            Occurrences
58/07/04     Market closed    317 times
59/01/06     2.5% dividend    2 times
59/04/04     50% stock split  7 times
73/10/09     not traded       1 time
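A minimal statistical sketch of deviation detection: flag any value more than k standard deviations from the mean of the series. The closing prices below are invented, and a more robust method (e.g. median-based) would usually be preferred; this only illustrates the idea.

```python
# Flag unexpected values: anything more than `k` standard deviations from the
# mean of the series. The closing prices here are invented for illustration.
import statistics

closes = [369.50, 369.25, 370.00, 369.75, 370.25, 120.00]  # last value is the anomaly

def deviations(values, k=2.0):
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(i, v) for i, v in enumerate(values) if abs(v - mean) > k * sd]

print(deviations(closes))        # -> [(5, 120.0)] with these numbers
```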
Large-scale Endeavors
[Table: coverage of Clustering, Classification, Association, Sequence, and Deviation methods –
Products: SAS (decision trees), SPSS, Oracle (Darwin; ANN), IBM (time series, decision trees, and several other checked categories);
Research: DBMiner (Simon Fraser)]
War Stories:
Warehouse Product Allocation
The second project, identified as "Warehouse Product Allocation," was also initiated in
late 1995 by RS Components' IS and Operations Departments. In addition to their
warehouse in Corby, the company was in the process of opening another 500,000-
square-foot site in the Midlands region of the U.K. To efficiently ship product from
these two locations, it was essential that RS Components know in advance what
products should be allocated to which warehouse. For this project, the team used IBM
Intelligent Miner and additional optimization logic to split RS Components' product
sets between these two sites so that the number of partial orders and split shipments
would be minimized.
Parker says that the Warehouse Product Allocation project has directly contributed to a
significant savings in the number of parcels shipped, and therefore in shipping costs. In
addition, he says that the Opportunity Selling project not only increased the level of
service, but also made it easier to provide new subsidiaries with the value-added
knowledge that enables them to quickly ramp-up sales.
"By using the data mining tools and some additional optimization logic, IBM helped us
produce a solution which heavily outperformed the best solution that we could have
arrived at by conventional techniques," said Parker. "The IBM group tracked historical
order data and conclusively demonstrated that data mining produced increased revenue
that will give us a return on investment 10 times greater than the amount we spent on
the first project."
http://direct.boulder.ibm.com/dss/customer/rscomp.html
War Stories:
Inventory Forecasting
American Entertainment Company
Forecasting demand for inventory is a central problem for any
distributor. Ship too much and the distributor incurs the cost of
restocking unsold products; ship too little and sales opportunities
are lost.
IBM Data Mining Solutions assisted this customer by providing
an inventory forecasting model, using segmentation and predictive
modeling. This new model has proven to be considerably more
accurate than any prior forecasting model.
More war stories (many humorous) starting with slide 21 of:
http://robotics.stanford.edu/~ronnyk/chasm.pdf
Data Mining as a Threat to
Security
• Data mining gives us “facts” that are not obvious to human
analysts of the data
• Enables inspection and analysis of huge amounts of data
• Possible threats:
– Predict information about classified work from correlation with
unclassified work (e.g. budgets, staffing)
– Detect “hidden” information based on “conspicuous” lack of
information
– Mining “Open Source” data to determine predictive events (e.g.,
Pizza deliveries to the Pentagon)
• It isn’t the data we want to protect, but correlations among data
items
• Published in Chris Clifton and Don Marks, “Security and Privacy
Implications of Data Mining”, Proceedings of the 1996 ACM
SIGMOD Workshop on Research Issues in Data Mining and
Knowledge Discovery
Background – Inference
Problem
• MLS database – “high” and “low” data
– Problem if we can infer “high” data from “low” data
– Progress has been made (Morgenstern, Marks, ...)
• Problem: What if the inference isn’t “strict”?
– “Default inference” problems – Birds fly, an Ostrich is a bird,
so Ostriches fly – not true, so we can’t infer birds fly (and we
don’t prevent such an inference)
– But “birds fly” is useful, even if not strictly true
– Only limited work in detecting/preventing “imprecise”
inferences (Rath, Jones, Hale, Shenoi)
• Data mining specializes in finding imprecise inferences
Data mining – Inference from
Large Data
• Data mining gives us probabilistic “inferences”:
– 25% of group X is Y, but only 2% of population is Y.
• Key to data mining: Don’t need to pre-specify X and
Y.
– Define total population
– Define parameters that can be used to create group X
– Define parameters that can be used to create group Y
– Note the combinatorial explosion in the number of possible
groups: if three parameters (each with n values) are used to
create group X, there are on the order of n³ possible groups
• Data mining tool determines groups X and Y where
“inference” is unusually likely
• Existing inference prevention is based on guaranteed
truth of the inference – but is this good enough?
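The “25% of group X is Y, but only 2% of the population is Y” pattern is a lift computation. The sketch below enumerates candidate groups X defined by combinations of parameter values and reports those whose rate of Y is well above the base rate; all records and thresholds are invented.

```python
# Enumerate candidate groups X (combinations of parameter values) and flag
# those where P(Y | X) is far above the base rate P(Y). Data is invented.
from itertools import combinations

people = [  # (attributes, is_Y)
    ({"dept": "a", "site": "hq"},  True),
    ({"dept": "a", "site": "hq"},  True),
    ({"dept": "a", "site": "lab"}, False),
    ({"dept": "b", "site": "hq"},  False),
    ({"dept": "b", "site": "lab"}, False),
    ({"dept": "b", "site": "lab"}, False),
]

base_rate = sum(y for _, y in people) / len(people)

def candidate_groups(records, max_params=2):
    keys = sorted(records[0][0])
    for r in range(1, max_params + 1):
        for ks in combinations(keys, r):
            for attrs, _ in records:                 # groups seen in the data
                yield tuple((k, attrs[k]) for k in ks)

for group in set(candidate_groups(people)):
    members = [y for attrs, y in people if all(attrs[k] == v for k, v in group)]
    rate = sum(members) / len(members)
    if len(members) >= 2 and rate >= 2 * base_rate:  # "unusually likely"
        print(group, f"rate={rate:.0%} vs base={base_rate:.0%}")
```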
Motivating Example:
Mortgage Application
• Idea: Mortgage company buys market research data to develop
profile of people likely to default
– Marketing data available
– Mortgage companies have history of current client defaults
• Problem: If 20% of profile defaults, it may make business sense
to reject all – but is it fair to the 80% that wouldn’t?
• Information Provider doesn’t want this done (potential public
backlash, e.g. Lotus)
Name    Golfs  Skis  Mail-order  Car   ...  Default
Dennis  Y      N     $25         BMW        N
Chris   N      Y     $815        Ford       Y
Denise  N      Y     $790        Ford       N
...
Eric    N      Y     $830        Ford       ?
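With the invented numbers above, a mined “default profile” can be as crude as a single rule. The sketch below scores the historical data against one such hypothetical rule and shows the dilemma: Eric matches the profile even though his own outcome is unknown.

```python
# A crude "default profile" mined from the (invented) table above: pick an
# attribute combination associated with default, then score a new applicant.
history = [  # training data: mortgage history with known outcomes
    {"name": "Dennis", "skis": False, "mail_order": 25,  "car": "BMW",  "default": False},
    {"name": "Chris",  "skis": True,  "mail_order": 815, "car": "Ford", "default": True},
    {"name": "Denise", "skis": True,  "mail_order": 790, "car": "Ford", "default": False},
]

def default_rate(rows, pred):
    matching = [r for r in rows if pred(r)]
    return sum(r["default"] for r in matching) / len(matching) if matching else 0.0

# one candidate profile: skis, drives a Ford, heavy mail-order spender
profile = lambda r: r["skis"] and r["car"] == "Ford" and r["mail_order"] > 500

print("default rate within profile:", default_rate(history, profile))  # 50% here
eric = {"name": "Eric", "skis": True, "mail_order": 830, "car": "Ford"}
print("Eric matches profile:", profile(eric))  # rejected despite unknown outcome?
```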
Goal – Technical Solution
We want to protect the information
provider.
• Prevent others from finding any meaningful
correlations
– Must still provide access to individual data
elements (e.g. phone book)
• Prevent specific correlations (or classes of
correlations)
– Preserve ability to mine in desired fashion (e.g.
targeted marketing, inventory prediction)
What Can We Do?
• Prevent useful results from mining
– Algorithms only find “facts” with sufficient confidence and
support
– Limit data access to ensure low confidence and support
– Extra data (“cover stories”) to give “false” results with high
confidence and support
• Exploit weaknesses in mining algorithms
– Performance “blowups” under certain conditions
– Alter data to prevent exact matches
• Example: Extra digit at end of telephone number
• Remove information providing unwanted correlations
– Strip identifiers
– Group identifiers (e.g. census blocks, not addresses)
• “You mine the data, I’ll send the mailings”
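The “extra digit at the end of the telephone number” idea, as a minimal sketch: append a random digit before release so exact-match joins across copies of the data no longer line up, while a human reader can still use the number. The phone number and seeds are placeholders.

```python
# Alter data to prevent exact matches across releases: append a random extra
# digit to each phone number before publishing. Purely illustrative.
import random

def perturb_phone(phone: str, rng: random.Random) -> str:
    return phone + str(rng.randrange(10))       # extra digit at the end

rng_a, rng_b = random.Random(1), random.Random(2)       # two independent releases
release_a = [perturb_phone("765-494-6000", rng_a)]
release_b = [perturb_phone("765-494-6000", rng_b)]
print(release_a, release_b, release_a == release_b)     # very likely unequal
```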
What We Have Learned So Far:
Qualitative Results
• Avoid unnecessary groupings of data
– Ranges of instances can give information
• Department encodes center, division
• Employee number encodes hire date
– Knowing the meaning of a grouping is not necessary; the
existence of a meaningful grouping allows us to mine
– Moral: Assign “id numbers” randomly (still serve to identify)
• Providing only samples of data can lower confidence
in mining results
– Key: Provable limits for validity of mining results given a
sample
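The “assign id numbers randomly” moral, as a minimal sketch: draw identifiers from a shuffled space so that id ranges no longer encode hire date or department, while each id still uniquely identifies an employee. The names and id-space size are invented.

```python
# Assign id numbers randomly so that id ranges do not encode hire date or
# department: ids still identify uniquely, but their ordering carries no
# information a miner could group on. Illustrative sketch only.
import random

def assign_random_ids(employees, id_space=10**6, seed=42):
    rng = random.Random(seed)
    ids = rng.sample(range(id_space), len(employees))   # unique, non-sequential
    return dict(zip(employees, ids))

hires_in_order = ["alice", "bob", "carol", "dave"]       # hire order
print(assign_random_ids(hires_in_order))                 # ids do not reflect that order
```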
Data Mining to Handle
Security Problems
• Data mining tools can be used to examine audit data
and flag abnormal behavior
• Some work in Intrusion detection
– e.g., Neural networks to detect abnormal patterns
• SRI work on IDES
• Harris Corporation work
• Tools are being examined as a means to determine
abnormal patterns and also to determine the type of
problem
– Classification techniques
• Can draw heavily on Fraud detection
– Credit cards, calling cards, etc.
– Work by SRA Corporation
Data Mining to Improve
Security
• Intrusion Detection
– Relies on “training data”
– We’ll go into detail on this area (lots of new work)
• User profiling (what is normal behavior for a
user)
– Lots of work in the telecommunications industry
(caller fraud)
– Work is happening in the computer security community
– Various work in “command sequence” profiles (a sketch follows below)
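One simple way to realize “command sequence” profiling, sketched below with invented commands and an invented threshold: learn the set of command bigrams seen in a user's normal sessions, then flag a new session whose bigrams are mostly unseen.

```python
# Per-user "command sequence" profiling: learn the command bigrams seen in
# normal sessions, then flag a session if too many of its bigrams are unseen.
# Commands and the threshold are invented for illustration.
def bigrams(cmds):
    return set(zip(cmds, cmds[1:]))

normal_sessions = [["ls", "cd", "ls", "vi", "make"],
                   ["cd", "ls", "vi", "make", "ls"]]
profile = set().union(*(bigrams(s) for s in normal_sessions))

def is_anomalous(session, profile, max_unseen_frac=0.5):
    bg = bigrams(session)
    unseen = len(bg - profile) / len(bg) if bg else 0.0
    return unseen > max_unseen_frac

print(is_anomalous(["ls", "cd", "ls", "vi", "make"], profile))        # False
print(is_anomalous(["nc", "wget", "chmod", "./backdoor"], profile))   # True
```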

Editor's Notes

  • #11: Mine for: selection, aggregation, abstraction, visualization, transformation/conversion, statistical analysis, “cleaning”
  • #22: The problem is that we may not know what may be learned from mining. We can’t “classify everything,” as some data is open source or may have large benefits to being accessible. This is the opposite of statistical queries – we are concerned with preventing generalities from specifics, rather than specifics from generalities – but it is conceptually similar. It is not the same as induction – data mining finds “rules” that are generally true (high confidence and support), but not necessarily exact.