Lecture 1
CS 590M Fall 2001: Security
Issues in Data Mining
Chris Clifton
Tuesdays and Thursdays, 9-10:15
Heavilon Hall 123
Course Goals:
Knowledge
At the end of this course, you will:
• Have a basic understanding of the
technology involved in Data Mining
• Know how data mining impacts
information security
• Understand leading-edge research on
data mining and security
Course Goals:
Skills
At the end of this course, you will:
• Be able to understand new technology
through reading the research literature
• Have given conference-style
presentations on difficult research topics
• Have written journal-style critical
reviews of research papers
Course Topics
• Data Mining (as necessary)
– What is it?
– How does it work?
• Research in the use of Data Mining to
improve security
• Research in the security problems posed
by the availability of Data Mining
technology
Process
Initial phase of course: Data Mining
background
• Lectures, handouts, suggested reading
• Length/material to be determined by
what you already know
Expect a quiz at the end of this phase
Process
• Phase 2: Student Presentations
• Two paper presentations per class
– Student presenting will read paper and prepare
presentation materials
You must prepare materials yourself – no fair using
material obtained from the authors
• Any week you do not present, you will write a
journal-quality review of one of the papers
being presented that week
You may request papers to review/present; I will make the
final assignment
Evaluation/Grading
Evaluation will be a subjective process; however,
it will be based primarily on your
understanding of the material as evidenced in:
• Your presentations
• Your written reviews
• Your contribution to classroom discussions
• Post phase-1 quiz
Policy on Academic Integrity
• Basic idea: You are learning to do Original
Research
– Work you do for the class should be original
(yours)
– Don’t borrow the authors’ slides for presentations, even
if they are available
(Copying images/graphs is okay where necessary)
• More details on course web site:
http://www.cs.purdue.edu/homes/clifton/cs590m
• When in doubt, ASK!
What is Data Mining?
Searching through large amounts of data for
correlations, sequences, and trends.
Current “driving applications” in sales (targeted
marketing, inventory) and finance (stock
picking)
[Figure: mining sales data – select the information to be mined, choose a mining tool based on the type of results wanted (sequence, classification, inference, clustering), and evaluate the results; example result: “70% of customers who purchase comforters later purchase curtains”]
Knowledge Discovery in
Databases: Process
[Figure: the KDD process – Data → (Selection) → Target Data → (Preprocessing) → Preprocessed Data → (Data Mining) → Patterns → (Interpretation/Evaluation) → Knowledge]
adapted from:
U. Fayyad, et al. (1995), “From Knowledge Discovery to Data
Mining: An Overview,” Advances in Knowledge Discovery and
Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press
See also: http://www.crisp-dm.org
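To make the process concrete, here is a minimal Python sketch of the Selection → Preprocessing → Data Mining → Interpretation/Evaluation flow. The data, function names, and support threshold are invented for illustration and are not taken from any particular tool.

```python
# A minimal, illustrative KDD pipeline: Data -> Selection -> Preprocessing
# -> Data Mining -> Interpretation/Evaluation -> Knowledge.
# All records, names, and thresholds here are hypothetical.
from itertools import combinations
from collections import Counter

raw_data = [
    {"customer": "c1", "items": ["comforter", "curtains"], "note": None},
    {"customer": "c2", "items": ["comforter", "curtains"]},
    {"customer": "c3", "items": ["comforter"]},
    {"customer": "c4", "items": []},                 # will be cleaned out
]

def select(records):                                 # Selection: keep the fields we mine
    return [r["items"] for r in records]

def preprocess(baskets):                             # Preprocessing: drop empty/dirty rows
    return [sorted(set(b)) for b in baskets if b]

def mine(baskets):                                   # Data Mining: count item pairs
    return Counter(p for b in baskets for p in combinations(b, 2))

def evaluate(patterns, n, min_support=0.5):          # Interpretation/Evaluation
    return {p: c / n for p, c in patterns.items() if c / n >= min_support}

baskets = preprocess(select(raw_data))
print(evaluate(mine(baskets), len(baskets)))         # the resulting "knowledge": frequent pairs
```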
What is Data Mining?
History
• Knowledge Discovery in Databases workshops
started ‘89
– Now a conference under the auspices of ACM
SIGKDD
– IEEE conference series starting 2001
• Key founders / technology contributors:
– Usama Fayyad, JPL (then Microsoft, now has his
own company, Digimine)
– Gregory Piatetsky-Shapiro (then GTE, now his own
data mining consulting company, Knowledge
Stream Partners)
– Rakesh Agrawal (IBM Research)
What Can Data Mining Do?
• Cluster
• Classify
– Categorical, Regression
• Summarize
– Summary statistics, Summary rules
• Link Analysis / Model Dependencies
– Association rules
• Sequence analysis
– Time-series analysis, Sequential associations
• Detect Deviations
Clustering
• Find groups of similar data
items
• Statistical techniques require
definition of “distance” (e.g.
between travel profiles),
conceptual techniques use
background concepts and
logical descriptions
Uses:
• Demographic analysis
Technologies:
• Self-Organizing Maps
• Probability Densities
• Conceptual Clustering
“Group people with
similar travel
profiles”
– George, Patricia
– Jeff, Evelyn, Chris
– Rob
[Figure: the resulting clusters; e.g., a clustering of top news stories]
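As a toy illustration of distance-based clustering, the sketch below runs a small pure-Python k-means over invented two-dimensional “travel profiles” (trips per year, average miles per trip). The numbers and the choice of k are assumptions made only for the example.

```python
# Toy distance-based clustering of "travel profiles" (trips/year, avg miles).
# Pure-Python k-means; the profiles and k are made-up examples.
import math, random

profiles = {"George": (2, 300), "Patricia": (3, 250),
            "Jeff": (12, 2000), "Evelyn": (11, 1800), "Chris": (10, 2200),
            "Rob": (40, 150)}

def dist(a, b):
    return math.dist(a, b)              # Euclidean "distance" between profiles

def kmeans(points, k, iters=20):
    random.seed(0)
    centers = random.sample(list(points.values()), k)
    for _ in range(iters):
        groups = {i: [] for i in range(k)}
        for name, p in points.items():  # assign each profile to nearest center
            i = min(range(k), key=lambda i: dist(p, centers[i]))
            groups[i].append(name)
        for i, members in groups.items():          # move centers to group means
            if members:
                pts = [points[m] for m in members]
                centers[i] = tuple(sum(c) / len(pts) for c in zip(*pts))
    return groups

print(kmeans(profiles, k=3))   # e.g. {George, Patricia}, {Jeff, Evelyn, Chris}, {Rob}
```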
Classification
• Find ways to separate data
items into pre-defined groups
– We know X and Y belong
together, find other things in
same group
• Requires “training data”:
Data items where group is
known
Uses:
• Profiling
Technologies:
• Generate decision trees
(results are human
understandable)
• Neural Nets
“Route documents to
most likely interested
parties”
– English or non-English?
– Domestic or Foreign?
[Figure: training data with known groups is fed to the mining tool, which produces a classifier]
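A minimal sketch of classification from training data: a one-level decision tree (a decision stump) that picks the single yes/no feature which best separates the labeled examples, then routes new items. The features and routing labels are invented.

```python
# Miniature decision-stump classifier: from labeled training data, pick the
# single yes/no feature that best predicts the group, then route new items.
# Features and labels are invented for illustration.
from collections import Counter

# training data: (features, group)
train = [({"english": True,  "domestic": True},  "route-to-A"),
         ({"english": True,  "domestic": False}, "route-to-B"),
         ({"english": False, "domestic": True},  "route-to-B"),
         ({"english": False, "domestic": False}, "route-to-B")]

def learn_stump(data):
    best = None
    for feat in data[0][0]:
        # majority label on each side of the split
        yes = Counter(g for f, g in data if f[feat]).most_common(1)[0][0]
        no  = Counter(g for f, g in data if not f[feat]).most_common(1)[0][0]
        acc = sum((yes if f[feat] else no) == g for f, g in data) / len(data)
        if best is None or acc > best[0]:
            best = (acc, feat, yes, no)
    return best[1:]                      # (feature, label-if-yes, label-if-no)

feat, if_yes, if_no = learn_stump(train)
print(feat, if_yes, if_no)               # which single question routes best
new_doc = {"english": True, "domestic": True}
print(if_yes if new_doc[feat] else if_no)   # classify a new document
```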
Association Rules
• Identify dependencies in
the data:
– X makes Y likely
• Indicate significance of
each dependency
• Bayesian methods
Uses:
• Targeted marketing
Technologies:
• AIS, SETM, Hugin,
TETRAD II
“Find groups of items
commonly purchased
together”
– People who purchase fish
are extraordinarily likely
to purchase wine
– People who purchase
Turkey are
extraordinarily likely to
purchase cranberries
Date  Time   Register  Fish  Turkey  Cranberries  Wine  …
12/6  13:15  2         N     Y       Y            Y     …
12/6  13:16  3         Y     N       N            Y     …
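What “X makes Y likely” means computationally is just support and confidence. The sketch below computes both for item pairs over a tiny, hand-typed basket list loosely based on the register data above (two extra hypothetical baskets are added so the rates are not all 100%).

```python
# Support and confidence for "customers who buy X also buy Y", computed over
# a tiny, hand-typed basket list (illustrative only).
baskets = [
    {"turkey", "cranberries", "wine"},        # 12/6 13:15, register 2
    {"fish", "wine"},                         # 12/6 13:16, register 3
    {"fish", "wine"},                         # extra hypothetical basket
    {"turkey", "cranberries"},                # extra hypothetical basket
]

def rule_stats(x, y, baskets):
    n = len(baskets)
    both  = sum(1 for b in baskets if x in b and y in b)
    has_x = sum(1 for b in baskets if x in b)
    support = both / n                              # how often X and Y co-occur
    confidence = both / has_x if has_x else 0.0     # estimate of P(Y | X)
    return support, confidence

print("fish -> wine:", rule_stats("fish", "wine", baskets))
print("turkey -> cranberries:", rule_stats("turkey", "cranberries", baskets))
```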
Sequential Associations
• Find event sequences that are
unusually likely
• Requires “training” event list,
known “interesting” events
• Must be robust in the face of
additional “noise” events
Uses:
• Failure analysis and
prediction
Technologies:
• Dynamic programming
(Dynamic time warping)
• “Custom” algorithms
“Find common sequences
of warnings/faults
within 10 minute
periods”
– Warn 2 on Switch C
preceded by Fault 21 on
Switch B
– Fault 17 on any switch preceded by Warn 2 on any switch

Time   Switch  Event
21:10  B       Fault 21
21:11  A       Warn 2
21:13  C       Warn 2
21:20  A       Fault 17
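A minimal sketch of the windowed counting behind such sequences: count how often one event precedes another within a 10-minute window, using the switch log above (times converted to minutes since midnight for simplicity).

```python
# Count how often one event precedes another within a 10-minute window,
# using the switch log above. Times are minutes since midnight.
from itertools import combinations
from collections import Counter

log = [(21*60 + 10, "B", "Fault21"),
       (21*60 + 11, "A", "Warn2"),
       (21*60 + 13, "C", "Warn2"),
       (21*60 + 20, "A", "Fault17")]

pairs = Counter()
for (t1, _, e1), (t2, _, e2) in combinations(log, 2):   # log is time-ordered
    if 0 < t2 - t1 <= 10:            # e1 happened at most 10 minutes before e2
        pairs[(e1, e2)] += 1

# e.g. (Fault21, Warn2) and (Warn2, Fault17) show up as candidate sequences
for (a, b), count in pairs.most_common():
    print(f"{a} precedes {b}: {count} time(s)")
```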
Deviation Detection
• Find unexpected values,
outliers
Uses:
• Failure analysis
• Anomaly discovery for analysis
Technologies:
• Clustering/classification methods
• Statistical techniques
• Visualization
“Find unusual occurrences in IBM stock prices”
Date      Close   Volume  Spread
58/07/02  369.50  314.08  .022561
58/07/03  369.25  313.87  .022561
58/07/04  Market Closed
58/07/07  370.00  314.50  .022561

Sample date  Event            Occurrences
58/07/04     Market closed    317 times
59/01/06     2.5% dividend    2 times
59/04/04     50% stock split  7 times
73/10/09     not traded       1 time
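A minimal statistical sketch of deviation detection: flag any value more than k standard deviations from the mean of the series. The closing prices below are invented, and a more robust method (e.g. median-based) would usually be preferred; this only illustrates the idea.

```python
# Flag unexpected values: anything more than `k` standard deviations from the
# mean of the series. The closing prices here are invented for illustration.
import statistics

closes = [369.50, 369.25, 370.00, 369.75, 370.25, 120.00]  # last value is the anomaly

def deviations(values, k=2.0):
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(i, v) for i, v in enumerate(values) if abs(v - mean) > k * sd]

print(deviations(closes))        # -> [(5, 120.0)] with these numbers
```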
Large-scale Endeavors
[Table: coverage of Clustering, Classification, Association, Sequence, and Deviation methods –
Products: SAS (decision trees), SPSS, Oracle (Darwin; ANN), IBM (time series, decision trees, and several other checked categories);
Research: DBMiner (Simon Fraser)]
War Stories:
Warehouse Product Allocation
The second project, identified as "Warehouse Product Allocation," was also initiated in
late 1995 by RS Components' IS and Operations Departments. In addition to their
warehouse in Corby, the company was in the process of opening another 500,000-
square-foot site in the Midlands region of the U.K. To efficiently ship product from
these two locations, it was essential that RS Components know in advance what
products should be allocated to which warehouse. For this project, the team used IBM
Intelligent Miner and additional optimization logic to split RS Components' product
sets between these two sites so that the number of partial orders and split shipments
would be minimized.
Parker says that the Warehouse Product Allocation project has directly contributed to a
significant savings in the number of parcels shipped, and therefore in shipping costs. In
addition, he says that the Opportunity Selling project not only increased the level of
service, but also made it easier to provide new subsidiaries with the value-added
knowledge that enables them to quickly ramp-up sales.
"By using the data mining tools and some additional optimization logic, IBM helped us
produce a solution which heavily outperformed the best solution that we could have
arrived at by conventional techniques," said Parker. "The IBM group tracked historical
order data and conclusively demonstrated that data mining produced increased revenue
that will give us a return on investment 10 times greater than the amount we spent on
the first project."
http://direct.boulder.ibm.com/dss/customer/rscomp.html
War Stories:
Inventory Forecasting
American Entertainment Company
Forecasting demand for inventory is a central problem for any
distributor. Ship too much and the distributor incurs the cost of
restocking unsold products; ship too little and sales opportunities
are lost.
IBM Data Mining Solutions assisted this customer by providing
an inventory forecasting model, using segmentation and predictive
modeling. This new model has proven to be considerably more
accurate than any prior forecasting model.
More war stories (many humorous) starting with slide 21 of:
http://robotics.stanford.edu/~ronnyk/chasm.pdf
Data Mining as a Threat to
Security
• Data mining gives us “facts” that are not obvious to human
analysts of the data
• Enables inspection and analysis of huge amounts of data
• Possible threats:
– Predict information about classified work from correlation with
unclassified work (e.g. budgets, staffing)
– Detect “hidden” information based on “conspicuous” lack of
information
– Mining “Open Source” data to determine predictive events (e.g.,
Pizza deliveries to the Pentagon)
• It isn’t the data we want to protect, but correlations among data
items
• Published in Chris Clifton and Don Marks, “Security and Privacy
Implications of Data Mining”, Proceedings of the 1996 ACM
SIGMOD Workshop on Research Issues in Data Mining and
Knowledge Discovery
Background – Inference
Problem
• MLS database – “high” and “low” data
– Problem if we can infer “high” data from “low” data
– Progress has been made (Morgenstern, Marks, ...)
• Problem: What if the inference isn’t “strict”?
– “Default inference” problems – Birds fly, an Ostrich is a bird,
so Ostriches fly – not true, so we can’t infer birds fly (and we
don’t prevent such an inference)
– But “birds fly” is useful, even if not strictly true
– Only limited work in detecting/preventing “imprecise”
inferences (Rath, Jones, Hale, Shenoi)
• Data mining specializes in finding imprecise inferences
Data mining – Inference from
Large Data
• Data mining gives us probabilistic “inferences”:
– 25% of group X is Y, but only 2% of population is Y.
• Key to data mining: Don’t need to pre-specify X and
Y.
– Define total population
– Define parameters that can be used to create group X
– Define parameters that can be used to create group Y
– Note the combinatorial explosion in the number of possible
groups: if three parameters (each with n values) are used to
create group X, there are on the order of n³ possible groups
• Data mining tool determines groups X and Y where
“inference” is unusually likely
• Existing inference prevention is based on guaranteed
truth of the inference – but is this good enough?
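The “25% of group X is Y, but only 2% of the population is Y” pattern is a lift computation. The sketch below enumerates candidate groups X defined by combinations of parameter values and reports those whose rate of Y is well above the base rate; all records and thresholds are invented.

```python
# Enumerate candidate groups X (combinations of parameter values) and flag
# those where P(Y | X) is far above the base rate P(Y). Data is invented.
from itertools import combinations

people = [  # (attributes, is_Y)
    ({"dept": "a", "site": "hq"},  True),
    ({"dept": "a", "site": "hq"},  True),
    ({"dept": "a", "site": "lab"}, False),
    ({"dept": "b", "site": "hq"},  False),
    ({"dept": "b", "site": "lab"}, False),
    ({"dept": "b", "site": "lab"}, False),
]

base_rate = sum(y for _, y in people) / len(people)

def candidate_groups(records, max_params=2):
    keys = sorted(records[0][0])
    for r in range(1, max_params + 1):
        for ks in combinations(keys, r):
            for attrs, _ in records:                 # groups seen in the data
                yield tuple((k, attrs[k]) for k in ks)

for group in set(candidate_groups(people)):
    members = [y for attrs, y in people if all(attrs[k] == v for k, v in group)]
    rate = sum(members) / len(members)
    if len(members) >= 2 and rate >= 2 * base_rate:  # "unusually likely"
        print(group, f"rate={rate:.0%} vs base={base_rate:.0%}")
```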
Motivating Example:
Mortgage Application
• Idea: Mortgage company buys market research data to develop
profile of people likely to default
– Marketing data available
– Mortgage companies have history of current client defaults
• Problem: If 20% of profile defaults, it may make business sense
to reject all – but is it fair to the 80% that wouldn’t?
• Information Provider doesn’t want this done (potential public
backlash, e.g. Lotus)
Name    Golfs  Skis  Mail-order  Car   ...  Default
Dennis  Y      N     $25         BMW        N
Chris   N      Y     $815        Ford       Y
Denise  N      Y     $790        Ford       N
...
Eric    N      Y     $830        Ford       ?
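With the invented numbers above, a mined “default profile” can be as crude as a single rule. The sketch below scores the historical data against one such hypothetical rule and shows the dilemma: Eric matches the profile even though his own outcome is unknown.

```python
# A crude "default profile" mined from the (invented) table above: pick an
# attribute combination associated with default, then score a new applicant.
history = [  # training data: mortgage history with known outcomes
    {"name": "Dennis", "skis": False, "mail_order": 25,  "car": "BMW",  "default": False},
    {"name": "Chris",  "skis": True,  "mail_order": 815, "car": "Ford", "default": True},
    {"name": "Denise", "skis": True,  "mail_order": 790, "car": "Ford", "default": False},
]

def default_rate(rows, pred):
    matching = [r for r in rows if pred(r)]
    return sum(r["default"] for r in matching) / len(matching) if matching else 0.0

# one candidate profile: skis, drives a Ford, heavy mail-order spender
profile = lambda r: r["skis"] and r["car"] == "Ford" and r["mail_order"] > 500

print("default rate within profile:", default_rate(history, profile))  # 50% here
eric = {"name": "Eric", "skis": True, "mail_order": 830, "car": "Ford"}
print("Eric matches profile:", profile(eric))  # rejected despite unknown outcome?
```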
Goal – Technical Solution
We want to protect the information
provider.
• Prevent others from finding any meaningful
correlations
– Must still provide access to individual data
elements (e.g. phone book)
• Prevent specific correlations (or classes of
correlations)
– Preserve ability to mine in desired fashion (e.g.
targeted marketing, inventory prediction)
What Can We Do?
• Prevent useful results from mining
– Algorithms only find “facts” with sufficient confidence and
support
– Limit data access to ensure low confidence and support
– Extra data (“cover stories”) to give “false” results with high
confidence and support
• Exploit weaknesses in mining algorithms
– Performance “blowups” under certain conditions
– Alter data to prevent exact matches
• Example: Extra digit at end of telephone number
• Remove information providing unwanted correlations
– Strip identifiers
– Group identifiers (e.g. census blocks, not addresses)
• “You mine the data, I’ll send the mailings”
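The “extra digit at the end of the telephone number” idea, as a minimal sketch: append a random digit before release so exact-match joins across copies of the data no longer line up, while a human reader can still use the number. The phone number and seeds are placeholders.

```python
# Alter data to prevent exact matches across releases: append a random extra
# digit to each phone number before publishing. Purely illustrative.
import random

def perturb_phone(phone: str, rng: random.Random) -> str:
    return phone + str(rng.randrange(10))       # extra digit at the end

rng_a, rng_b = random.Random(1), random.Random(2)       # two independent releases
release_a = [perturb_phone("765-494-6000", rng_a)]
release_b = [perturb_phone("765-494-6000", rng_b)]
print(release_a, release_b, release_a == release_b)     # very likely unequal
```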
What We Have Learned So Far:
Qualitative Results
• Avoid unnecessary groupings of data
– Ranges of instances can give information
• Department encodes center, division
• Employee number encodes hire date
– Knowing the meaning of a grouping is not necessary; the
existence of a meaningful grouping allows us to mine
– Moral: Assign “id numbers” randomly (still serve to identify)
• Providing only samples of data can lower confidence
in mining results
– Key: Provable limits for validity of mining results given a
sample
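The “assign id numbers randomly” moral, as a minimal sketch: draw identifiers from a shuffled space so that id ranges no longer encode hire date or department, while each id still uniquely identifies an employee. The names and id-space size are invented.

```python
# Assign id numbers randomly so that id ranges do not encode hire date or
# department: ids still identify uniquely, but their ordering carries no
# information a miner could group on. Illustrative sketch only.
import random

def assign_random_ids(employees, id_space=10**6, seed=42):
    rng = random.Random(seed)
    ids = rng.sample(range(id_space), len(employees))   # unique, non-sequential
    return dict(zip(employees, ids))

hires_in_order = ["alice", "bob", "carol", "dave"]       # hire order
print(assign_random_ids(hires_in_order))                 # ids do not reflect that order
```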
Data Mining to Handle
Security Problems
• Data mining tools can be used to examine audit data
and flag abnormal behavior
• Some work in Intrusion detection
– e.g., Neural networks to detect abnormal patterns
• SRI work on IDES
• Harris Corporation work
• Tools are being examined as a means to determine
abnormal patterns and also to determine the type of
problem
– Classification techniques
• Can draw heavily on Fraud detection
– Credit cards, calling cards, etc.
– Work by SRA Corporation
Data Mining to Improve
Security
• Intrusion Detection
– Relies on “training data”
– We’ll go into detail on this area (lots of new work)
• User profiling (what is normal behavior for a
user)
– Lots of work in the telecommunications industry
(caller fraud)
– Work is happening in the computer security community
– Various work in “command sequence” profiles (a sketch follows below)
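One simple way to realize “command sequence” profiling, sketched below with invented commands and an invented threshold: learn the set of command bigrams seen in a user's normal sessions, then flag a new session whose bigrams are mostly unseen.

```python
# Per-user "command sequence" profiling: learn the command bigrams seen in
# normal sessions, then flag a session if too many of its bigrams are unseen.
# Commands and the threshold are invented for illustration.
def bigrams(cmds):
    return set(zip(cmds, cmds[1:]))

normal_sessions = [["ls", "cd", "ls", "vi", "make"],
                   ["cd", "ls", "vi", "make", "ls"]]
profile = set().union(*(bigrams(s) for s in normal_sessions))

def is_anomalous(session, profile, max_unseen_frac=0.5):
    bg = bigrams(session)
    unseen = len(bg - profile) / len(bg) if bg else 0.0
    return unseen > max_unseen_frac

print(is_anomalous(["ls", "cd", "ls", "vi", "make"], profile))        # False
print(is_anomalous(["nc", "wget", "chmod", "./backdoor"], profile))   # True
```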

Editor's Notes

  • #11: Mine for: selection, aggregation, abstraction, visualization, transformation/conversion, statistical analysis, “cleaning”
  • #22: The problem is that we may not know what may be learned from mining. We can’t “classify everything,” as some data is open source or may have large benefits to being accessible. This is the opposite of statistical queries – we are concerned with preventing generalities from specifics, rather than specifics from generalities – but it is conceptually similar. It is not the same as induction – data mining finds “rules” that are generally true (high confidence and support), but not necessarily exact.