Ch 1 Intro to Data Mining
INTRODUCTION TO DATA MINING - SUSHIL KULKARNI
INTENTIONS  Define data mining in brief. What are the misunderstandings about data mining? List the different steps in a data mining analysis. What are the different areas of expertise required for data mining? Explain how a data mining algorithm is developed. Differentiate between the database and data mining processes. SUSHIL KULKARNI
DATA SUSHIL KULKARNI
DATA  The data: massive, operational, and opportunistic. Data is growing at a phenomenal rate. SUSHIL KULKARNI
Since 1963 Moore's Law: The information density on silicon integrated circuits doubles every 18 to 24 months. Parkinson's Law: Work expands to fill the time available for its completion. DATA SUSHIL KULKARNI
Users expect more sophisticated information. How? DATA → UNCOVER HIDDEN INFORMATION → DATA MINING SUSHIL KULKARNI
DATA MINING DEFINITION SUSHIL KULKARNI
Data Mining is: The efficient discovery of previously unknown, valid, potentially useful, understandable patterns in large datasets The analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner DEFINE DATA MINING SUSHIL KULKARNI
Data: a set of facts (items) D, usually stored in a database. Pattern: an expression E in a language L that describes a subset of facts. Attribute: a field in an item i in D. Interestingness: a function I_{D,L} that maps an expression E in L into a measure space M. FEW TERMS SUSHIL KULKARNI
The Data Mining Task: For a given dataset D, language of facts L, interestingness function I_{D,L} and threshold c, find the expressions E such that I_{D,L}(E) > c, efficiently. FEW TERMS SUSHIL KULKARNI
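A minimal Python sketch of this formal task (illustrative only; the toy data and the support-based interestingness function are assumptions, not from the slides):

```python
# Sketch of the formal task: keep the expressions whose interestingness exceeds c.

def interesting_patterns(D, candidate_expressions, interestingness, c):
    """Return every expression E with interestingness(D, E) > c."""
    return [E for E in candidate_expressions if interestingness(D, E) > c]

def support(data, item):
    """Interestingness of an item = its relative frequency in the data."""
    return data.count(item) / len(data)

# Toy usage: facts are purchased items, candidate patterns are single items.
D = ["milk", "bread", "milk", "eggs", "milk", "bread"]
print(interesting_patterns(D, set(D), support, c=0.4))   # ['milk']
```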
EXAMPLES OF LARGE DATASETS  Government: IGSI, … Large corporations: WALMART - 20M transactions per day; MOBIL - 100 TB geological databases; AT&T - 300M calls per day. Scientific: NASA EOS project - 50 GB per hour; environmental datasets. SUSHIL KULKARNI
EXAMPLES OF DATA MINING APPLICATIONS Fraud detection: credit cards, phone cards Marketing: customer targeting Data Warehousing: Walmart Astronomy Molecular biology SUSHIL KULKARNI
Advanced methods for exploring and modeling relationships in large amounts of data THUS : DATA MINING SUSHIL KULKARNI
Finding hidden information in a database Fit data to a model Similar terms Exploratory data analysis Data driven discovery Deductive learning THUS : DATA MINING SUSHIL KULKARNI
NUGGETS SUSHIL KULKARNI
“IF YOU’VE GOT TERABYTES OF DATA, AND YOU ARE RELYING ON DATA MINING TO FIND INTERESTING THINGS IN THERE FOR YOU, YOU’VE LOST BEFORE YOU’VE EVEN BEGUN” - HERB EDELSTEIN NUGGETS SUSHIL KULKARNI
“… You really need people who understand what it is they are looking for and what they can do with it once they find it” - BECK (1997) NUGGETS SUSHIL KULKARNI
Data mining means magically discovering hidden nuggets of information without  having to formulate the problem and without  regard to the structure or content of the data PEOPLE THINK SUSHIL KULKARNI
DATA MINING PROCESS SUSHIL KULKARNI
Understand the Domain - understand the particulars of the business or scientific problem. Create a Data Set - understand the structure, size, and format of the data; select the interesting attributes; data cleaning and preprocessing. The Data Mining Process SUSHIL KULKARNI
Choose the data mining task and the specific algorithm - Understand capabilities and limitations of algorithms that may be relevant to the problem Interpret the results, and possibly return to bullet 2 The Data Mining Process SUSHIL KULKARNI
Specify Objectives - In terms of subject matter Example :  Understand customer base Re-engineer our customer retention strategy Detect actionable patterns EXAMPLE SUSHIL KULKARNI
2.   Translation into Analytical Methods Examples : Implement Neural Networks Apply Visualization tools Cluster Database 3.   Refinement and Reformulation EXAMPLE SUSHIL KULKARNI
DATA MINING QUERIES SUSHIL KULKARNI
DB VS DM PROCESSING
Query: DB - well defined, SQL | DM - poorly defined, no precise query language
Data: DB - operational data | DM - not operational data
Output: DB - precise, a subset of the database | DM - fuzzy, not a subset of the database
SUSHIL KULKARNI
QUERY EXAMPLES
Database: Find all customers who have purchased milk. | Data Mining: Find all items which are frequently purchased with milk. (association rules)
Database: Find all credit applicants with first name of Sane. | Data Mining: Find all credit applicants who are poor credit risks. (classification)
Database: Identify customers who have purchased more than Rs.10,000 in the last month. | Data Mining: Identify customers with similar buying habits. (clustering)
SUSHIL KULKARNI
INTENTIONS  Write a short note on the KDD process. How is it different from data mining? Explain basic data mining tasks. Write short notes on: 1. Classification  2. Regression  3. Time Series Analysis  4. Prediction  5. Clustering  6. Summarization  7. Link Analysis SUSHIL KULKARNI
KDD PROCESS SUSHIL KULKARNI
KDD PROCESS  Knowledge discovery in databases (KDD) is a multi-step process of finding useful information and patterns in data, while Data Mining is one of the steps in KDD: the use of algorithms for the extraction of patterns. SUSHIL KULKARNI
STEPS OF KDD PROCESS  1. Selection - Data Extraction: obtaining data from heterogeneous data sources - databases, data warehouses, the World Wide Web, or other information repositories. 2. Preprocessing - Data Cleaning: incomplete, noisy, inconsistent data is cleaned; missing data may be ignored or predicted, erroneous data may be deleted or corrected. SUSHIL KULKARNI
STEPS OF KDD PROCESS  3. Transformation - Data Integration: combines data from multiple sources into a coherent store; data can be encoded in common formats, normalized, reduced. 4. Data Mining - apply algorithms to the transformed data and extract patterns. SUSHIL KULKARNI
STEPS OF KDD PROCESS  5. Pattern Interpretation/Evaluation - evaluate the interestingness of the resulting patterns, or apply interestingness measures to filter out discovered patterns. Knowledge presentation - present the mined knowledge; visualization techniques can be used. SUSHIL KULKARNI
VISUALIZATION TECHNIQUES  Graphical - bar charts, pie charts, histograms. Geometric - boxplot, scatter plot. Icon-based - using colored figures as icons. Pixel-based - data as colored pixels. Hierarchical - hierarchically dividing the display area. Hybrid - combination of the above approaches.
KDD PROCESS  KDD is the nontrivial extraction of implicit, previously unknown and potentially useful knowledge from data. [Diagram: Operational Databases and Data Warehouses → Selection → Data Preprocessing (Data Cleaning, Data Integration) → Data Transformation → Data Mining → Pattern Evaluation → Knowledge] SUSHIL KULKARNI
KDD PROCESS EX: WEB LOG Selection:   Select log data (dates and locations) to  use Preprocessing:   Remove identifying URLs Remove error logs Transformation:   Sessionize (sort and group) SUSHIL KULKARNI
KDD PROCESS EX: WEB LOG Data Mining:   Identify and count patterns Construct data structure Interpretation/Evaluation: Identify and display frequently accessed sequences. Potential User Applications: Cache prediction Personalization  SUSHIL KULKARNI
DATA MINING VS. KDD Knowledge Discovery in Databases  (KDD)  - Process of finding useful information and  patterns in data. Data Mining:   Use of algorithms to extract the information and patterns derived by the KDD process.  SUSHIL KULKARNI
KDD ISSUES  Human Interaction  Overfitting  Outliers  Interpretation  Visualization  Large Datasets  High Dimensionality SUSHIL KULKARNI
KDD ISSUES Multimedia Data Missing Data Irrelevant Data Noisy Data Changing Data Integration Application SUSHIL KULKARNI
DATA MINING TASKS AND METHODS SUSHIL KULKARNI
ARE ALL THE ‘DISCOVERED’ PATTERNS INTERESTING? Interestingness measures :   A pattern is  interesting  if it is  easily understood  by humans,  valid on new or test data  with some degree of certainty, potentially useful ,  novel, or validates some hypothesis  that a user seeks to confirm  SUSHIL KULKARNI
Objective vs. subjective interestingness measures: Objective:  based on statistics and structures of patterns, e.g., support, confidence, etc. Subjective:  based on user’s belief in the data, e.g., unexpectedness, novelty, actionability, etc. ARE ALL THE ‘DISCOVERED’ PATTERNS INTERESTING? SUSHIL KULKARNI
CAN WE FIND ALL AND ONLY INTERESTING PATTERNS? Find all the interesting patterns: completeness. Can a data mining system find all the interesting patterns? Association vs. classification vs. clustering SUSHIL KULKARNI
Search for only interesting patterns: optimization. Can a data mining system find only the interesting patterns? Approaches: First generate all the patterns and then filter out the uninteresting ones. Generate only the interesting patterns - mining query optimization. CAN WE FIND ALL AND ONLY INTERESTING PATTERNS? SUSHIL KULKARNI
Data Mining tasks: Predictive - Classification, Regression, Time Series Analysis, Prediction. Descriptive - Clustering, Summarization, Association Rules, Sequence Discovery. SUSHIL KULKARNI
Data Mining Tasks Classification:  learning a function that maps an item into one of a set of predefined classes Regression:  learning a function that maps an item to a real value Clustering:  identify a set of groups of similar items SUSHIL KULKARNI
Data Mining Tasks Dependencies and associations: identify significant dependencies between data attributes Summarization: find a compact  description of the dataset or a subset of the dataset SUSHIL KULKARNI
Data Mining Methods Decision Tree Classifiers:  Used for modeling, classification Association Rules: Used to find associations between sets of attributes Sequential patterns: Used to find temporal associations in time Series Hierarchical clustering: used to group customers, web users, etc SUSHIL KULKARNI
DATA PREPROCESSING SUSHIL KULKARNI
DIRTY DATA Data in the real world is dirty: incomplete:  lacking  attribute values , lacking certain  attributes of interest , or containing only aggregate data noisy:  containing errors or outliers inconsistent:  containing discrepancies in codes or names SUSHIL KULKARNI
WHY DATA PREPROCESSING? No quality data, no quality mining results! Quality decisions must be based on quality data Data warehouse needs consistent integration of quality data Required for both OLAP and Data Mining! SUSHIL KULKARNI
Why can Data be  Incomplete ? Attributes of interest are not available (e.g., customer information for sales transaction data) Data were not considered important at the time of transactions, so they were not recorded! SUSHIL KULKARNI
Why can Data be Incomplete? Data not recorded because of misunderstanding or malfunctions. Data may have been recorded and later deleted! Missing/unknown values for some data. SUSHIL KULKARNI
Why can Data be   Noisy / Inconsistent  ? Faulty instruments for data collection  Human or computer errors Errors in data transmission Technology limitations (e.g., sensor data come at a faster rate than they can be processed) SUSHIL KULKARNI
Why can Data be Noisy/Inconsistent? Inconsistencies in naming conventions or data codes (e.g., 2/5/2002 could be 2 May 2002 or 5 Feb 2002). Duplicate tuples, which were received twice, should also be removed. SUSHIL KULKARNI
TASKS IN DATA PREPROCESSING SUSHIL KULKARNI
Major Tasks in Data Preprocessing Data cleaning Fill in missing values, smooth noisy data, identify or remove  outliers , and resolve inconsistencies Data integration Integration of multiple databases or files Data transformation Normalization and aggregation outliers=exceptions! SUSHIL KULKARNI
Major Tasks in Data Preprocessing Data reduction Obtains reduced representation in volume but produces the same or similar analytical results Data discretization Part of data reduction but with particular importance, especially for numerical data SUSHIL KULKARNI
Forms of data preprocessing   SUSHIL KULKARNI
DATA CLEANING SUSHIL KULKARNI
Data cleaning tasks - Fill in missing values - Identify outliers and smooth out noisy data  - Correct inconsistent data DATA CLEANING SUSHIL KULKARNI
Ignore the tuple:   usually done when class label is missing (assuming the tasks in classification)— not effective when the percentage of missing values per attribute varies considerably. Fill in the missing value manually:  tedious + infeasible? HOW TO HANDLE MISSING DATA? SUSHIL KULKARNI
Use a global constant to fill in the missing value:  e.g., “unknown”, a new class?!  Use the attribute mean to fill in the missing value Use the attribute mean for all samples belonging to the same class to fill in the missing value:  smarter Use the most probable value to fill in the missing value:  inference-based such as Bayesian formula or decision tree HOW TO HANDLE MISSING DATA? SUSHIL KULKARNI
HOW TO HANDLE MISSING DATA? Fill missing values using aggregate functions (e.g., average) or probabilistic estimates on the global value distribution. E.g., put the average income here, or put the most probable income based on the fact that the person is 39 years old. E.g., put the most frequent team here. SUSHIL KULKARNI
Gender | Team    | Income | Age
M      | Red Sox | 24,200 | 23
F      | Yankees | ?      | 39
F      | ?       | 45,390 | 45
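A small illustration of these strategies on the table above, assuming pandas is available (pandas is not mentioned in the slides): the missing Income is filled with the attribute mean and the missing Team with the most frequent value.

```python
# Sketch only: fill missing values with the attribute mean / most frequent value.
import pandas as pd

df = pd.DataFrame({
    "Age":    [23, 39, 45],
    "Income": [24200, None, 45390],
    "Team":   ["Red Sox", "Yankees", None],
    "Gender": ["M", "F", "F"],
})

df["Income"] = df["Income"].fillna(df["Income"].mean())   # mean of 24,200 and 45,390
df["Team"]   = df["Team"].fillna(df["Team"].mode()[0])    # most frequent team (ties broken arbitrarily)
print(df)
```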
The process of partitioning continuous variables into categories is called  Discretization. HOW TO HANDLE NOISY DATA?    Discretization SUSHIL KULKARNI
Binning method: - first sort data and partition into (equi-depth) bins - then one can  smooth by bin means,  smooth by bin median, smooth by bin boundaries , etc. Clustering - detect and remove outliers HOW TO HANDLE NOISY DATA?    Discretization  :  Smoothing techniques   SUSHIL KULKARNI
Combined computer and human inspection -  computer detects suspicious values, which are then checked by humans Regression -  smooth by fitting the data into regression functions HOW TO HANDLE NOISY DATA?    Discretization  :  Smoothing techniques   SUSHIL KULKARNI
Equal-width (distance) partitioning: - It divides the range into  N  intervals of equal size:  uniform grid if  A  and  B  are the lowest and highest values of the attribute, the width of intervals will be:  W  = ( B - A )/ N. - The most straightforward - But outliers may dominate presentation - Skewed data is not handled well. SIMPLE DISCRETISATION    METHODS: BINNING SUSHIL KULKARNI
Equal-depth (frequency) partitioning: - It divides the range into N intervals, each containing approximately the same number of samples - Good data scaling - good handling of skewed data SIMPLE DISCRETISATION METHODS: BINNING SUSHIL KULKARNI
Binning is applied to each individual feature (attribute). The set of values can then be discretized by replacing each value in a bin by the bin mean, bin median, or bin boundaries. Example: set of values of attribute Age: 0, 4, 12, 16, 16, 18, 23, 26, 28 BINNING : EXAMPLE SUSHIL KULKARNI
Example: Set of values of attribute Age: 0, 4, 12, 16, 16, 18, 23, 26, 28. Take bin width = 10. EXAMPLE: EQUI-WIDTH BINNING SUSHIL KULKARNI
Bin # | Bin Boundaries | Bin Elements
1     | [-, 10)        | {0, 4}
2     | [10, 20)       | {12, 16, 16, 18}
3     | [20, +)        | {23, 26, 28}
Example: Set of values of attribute Age: 0, 4, 12, 16, 16, 18, 23, 26, 28. Take bin depth = 3. EXAMPLE: EQUI-DEPTH BINNING SUSHIL KULKARNI
Bin # | Bin Boundaries | Bin Elements
1     | [-, 14)        | {0, 4, 12}
2     | [14, 21)       | {16, 16, 18}
3     | [21, +)        | {23, 26, 28}
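The two worked examples above can be checked with a short plain-Python sketch (the code itself is an illustration, not part of the slides):

```python
# Equi-width vs. equi-depth binning of the Age values from the slides.
ages = [0, 4, 12, 16, 16, 18, 23, 26, 28]

# Equi-width binning: fixed bin width of 10
width = 10
equi_width = {}
for a in ages:
    low = (a // width) * width
    equi_width.setdefault((low, low + width), []).append(a)
print(equi_width)   # {(0, 10): [0, 4], (10, 20): [12, 16, 16, 18], (20, 30): [23, 26, 28]}

# Equi-depth binning: fixed bin depth of 3, taken from the sorted values
depth = 3
equi_depth = [sorted(ages)[i:i + depth] for i in range(0, len(ages), depth)]
print(equi_depth)   # [[0, 4, 12], [16, 16, 18], [23, 26, 28]]
```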
SMOOTHING USING BINNING METHODS  Sorted data for price (in Rs): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Partition into (equi-depth) bins: - Bin 1: 4, 8, 9, 15 - Bin 2: 21, 21, 24, 25 - Bin 3: 26, 28, 29, 34
Smoothing by bin means: - Bin 1: 9, 9, 9, 9 - Bin 2: 23, 23, 23, 23 - Bin 3: 29, 29, 29, 29
Smoothing by bin boundaries [4,15], [21,25], [26,34]: - Bin 1: 4, 4, 4, 15 - Bin 2: 21, 21, 25, 25 - Bin 3: 26, 26, 26, 34
SUSHIL KULKARNI
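A plain-Python sketch of the smoothing step on the same price data (illustrative only); it reproduces the bin means and bin boundaries shown above.

```python
# Smoothing by bin means and by bin boundaries on already-partitioned equi-depth bins.
bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]

# Smoothing by bin means: replace every value by its bin's (rounded) mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]
print(by_means)    # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]

# Smoothing by bin boundaries: replace every value by the closer of its bin's min/max
def to_boundary(value, low, high):
    return low if value - low <= high - value else high

by_bounds = [[to_boundary(v, min(b), max(b)) for v in b] for b in bins]
print(by_bounds)   # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```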
SIMPLE DISCRETISATION METHODS: BINNING  Example: customer ages. Equi-width binning (equal bin width): 0-10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80. Equi-depth binning (equal number of values per bin): 0-22, 22-31, 32-38, 38-44, 44-48, 48-55, 55-62, 62-80. SUSHIL KULKARNI
FEW TASKS SUSHIL KULKARNI
BASIC DATA MINING TASKS Clustering   groups similar data together  into clusters. - Unsupervised learning - Segmentation - Partitioning SUSHIL KULKARNI
CLUSTERING Partitions data set into clusters, and models it by one representative from each cluster Can be very effective if data is clustered but not if data is “smeared” There are many choices of clustering definitions and clustering algorithms, more later! SUSHIL KULKARNI
CLUSTER ANALYSIS  [Scatter plot of salary vs. age: the points form dense clusters, with an isolated outlier]
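A minimal clustering sketch in the spirit of the picture above; scikit-learn and the sample (age, salary) points are assumptions used only for illustration.

```python
# Cluster (age, salary) points and report one representative (centroid) per cluster.
from sklearn.cluster import KMeans

points = [[25, 20000], [27, 22000], [26, 21000],   # one group
          [45, 60000], [47, 62000], [46, 61000],   # another group
          [30, 90000]]                              # a point far from both groups

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(model.labels_)           # cluster id assigned to each point
print(model.cluster_centers_)  # one representative per cluster
```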
CLASSIFICATION  Classification maps data into predefined groups or classes. - Supervised learning - Pattern recognition - Prediction SUSHIL KULKARNI
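A minimal classification sketch (supervised learning); scikit-learn, the toy training data and the class labels are assumptions for illustration.

```python
# Learn a mapping from [age, income] to a predefined class, then classify new items.
from sklearn.tree import DecisionTreeClassifier

X_train = [[22, 15000], [25, 18000], [40, 60000], [45, 70000]]  # [age, income]
y_train = ["low", "low", "high", "high"]                        # predefined classes

clf = DecisionTreeClassifier().fit(X_train, y_train)
print(clf.predict([[30, 20000], [50, 65000]]))   # -> ['low' 'high']
```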
REGRESSION Regression  is used to map a data item to a real valued prediction variable. SUSHIL KULKARNI
REGRESSION  [Plot: line y = x + 1 fitted through (age, salary) points, with a new point X1 mapped to its predicted value Y1] Example of linear regression SUSHIL KULKARNI
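A minimal regression sketch matching the picture: fit a straight line to (x, y) pairs that happen to satisfy y = x + 1; scikit-learn and the sample values are assumptions.

```python
# Map an item (x) to a real-valued prediction (y) with a fitted line.
from sklearn.linear_model import LinearRegression

X = [[1], [2], [3], [4]]      # e.g. age (scaled)
y = [2, 3, 4, 5]              # e.g. salary (scaled); exactly y = x + 1 here

reg = LinearRegression().fit(X, y)
print(reg.coef_[0], reg.intercept_)   # ~1.0 and ~1.0, i.e. y = x + 1
print(reg.predict([[10]]))            # ~[11.0]
```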
DATA    INTEGRATION SUSHIL KULKARNI
DATA INTEGRATION  Data integration: combines data from multiple sources into a coherent store. Schema integration - integrate metadata from different sources. Metadata: data about the data (i.e., data descriptors). Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#. SUSHIL KULKARNI
DATA INTEGRATION  Detecting and resolving data value conflicts: for the same real-world entity, attribute values from different sources are different (e.g., S.A. Dixit and Suhas Dixit may refer to the same person). Possible reasons: different representations, different scales, e.g., metric vs. British units (inches vs. cm). SUSHIL KULKARNI
DATA  TRANSFORMATION SUSHIL KULKARNI
DATA TRANSFORMATION  Smoothing: remove noise from data. Aggregation: summarization, data cube construction. Generalization: concept hierarchy climbing. SUSHIL KULKARNI
Normalization:  scaled to fall within a small, specified range -  min-max normalization - z-score normalization normalization by decimal scaling Attribute/feature construction -  New attributes constructed from the given ones DATA TRANSFORMATION SUSHIL KULKARNI
NORMALIZATION  Min-max normalization: v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A. Z-score normalization: v' = (v - mean_A) / stand_dev_A. SUSHIL KULKARNI
NORMALIZATION  Normalization by decimal scaling: v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1. SUSHIL KULKARNI
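A plain-Python sketch of the three normalization methods above, applied to an assumed toy column of values (the numbers are illustrative, not from the slides).

```python
# Min-max, z-score, and decimal-scaling normalization of one numeric attribute.
values = [200, 300, 400, 600, 1000]

# Min-max: rescale into [new_min, new_max], here [0, 1]
lo, hi = min(values), max(values)
min_max = [(v - lo) / (hi - lo) for v in values]

# Z-score: subtract the mean, divide by the standard deviation
mean = sum(values) / len(values)
std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
z_score = [(v - mean) / std for v in values]

# Decimal scaling: divide by 10^j, with j the smallest integer making max(|v'|) < 1
j = len(str(int(max(abs(v) for v in values))))   # 1000 -> j = 4 (works for this data; a sketch, not a general routine)
decimal = [v / 10 ** j for v in values]

print(min_max)   # [0.0, 0.125, 0.25, 0.5, 1.0]
print(z_score)
print(decimal)   # [0.02, 0.03, 0.04, 0.06, 0.1]
```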
SUMMARIZATION  Summarization maps data into subsets with associated simple descriptions. Also called: - Characterization - Generalization SUSHIL KULKARNI
DATA EXTRACTION, SELECTION, CONSTRUCTION, COMPRESSION SUSHIL KULKARNI
TERMS  Feature Extraction: a process that extracts a set of new features from the original features through some functional mapping or transformation. Feature Selection: a process that chooses a subset of M features from the original set of N features, so that the feature space is optimally reduced according to certain criteria. SUSHIL KULKARNI
TERMS  Feature Construction: a process that discovers missing information about the relationships between features and augments the space of features by inference or by creating additional features. Feature Compression: a process to compress the information about the features. SUSHIL KULKARNI
SELECTION: DECISION TREE INDUCTION: Example  Initial attribute set: {A1, A2, A3, A4, A5, A6}. [Decision tree: root split on A4, then splits on A1 and A6, with leaves labeled Class 1 and Class 2] Reduced attribute set: {A1, A4, A6} SUSHIL KULKARNI
DATA COMPRESSION String compression -  There are extensive theories and well-tuned  algorithms Typically lossless But only limited manipulation is possible without expansion Audio/video compression: Typically lossy compression, with progressive refinement Sometimes small fragments of signal can be reconstructed without reconstructing the  whole SUSHIL KULKARNI
DATA COMPRESSION Time sequence is not audio Typically short and varies slowly with time SUSHIL KULKARNI
DATA COMPRESSION  [Diagram: lossless compression - Original Data ↔ Compressed Data; lossy compression - Original Data → Approximated Data] SUSHIL KULKARNI
NUMEROSITY REDUCTION: Reduce the volume of data. Parametric methods: assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers). Log-linear models: obtain the value at a point in m-D space as the product of appropriate marginal subspaces. Non-parametric methods: do not assume models. Major families: histograms, clustering, sampling. SUSHIL KULKARNI
HISTOGRAM Popular data reduction technique Divide data into buckets and store  average (or sum) for each bucket Can be constructed optimally in one dimension using dynamic programming Related to quantization problems. SUSHIL KULKARNI
HISTOGRAM SUSHIL KULKARNI
HISTOGRAM TYPES Equal-width   histograms: It divides the range into  N  intervals of equal size Equal-depth   (frequency) partitioning: It divides the range into  N  intervals, each containing approximately same number of samples SUSHIL KULKARNI
HISTOGRAM TYPES V-optimal: It considers all histogram types for a given number of buckets and chooses the one with the least variance. MaxDiff: After sorting the data to be approximated, it defines the borders of the buckets at points where the adjacent values have the maximum difference SUSHIL KULKARNI
HISTOGRAM TYPES  EXAMPLE: Split 1, 1, 4, 5, 5, 7, 9, 14, 16, 18, 27, 30, 30, 32 into three buckets. MaxDiff: the two largest adjacent differences are 27-18 and 14-9, so the buckets are {1, 1, 4, 5, 5, 7, 9}, {14, 16, 18}, {27, 30, 30, 32}. SUSHIL KULKARNI
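A plain-Python sketch of MaxDiff bucketing that reproduces the split above (illustrative only):

```python
# MaxDiff: for k buckets, cut the sorted data at the k-1 largest adjacent gaps.
data = [1, 1, 4, 5, 5, 7, 9, 14, 16, 18, 27, 30, 30, 32]
n_buckets = 3

# Indices of the (n_buckets - 1) largest adjacent differences
gaps = sorted(range(len(data) - 1), key=lambda i: data[i + 1] - data[i], reverse=True)
cuts = sorted(gaps[:n_buckets - 1])

buckets, start = [], 0
for c in cuts:
    buckets.append(data[start:c + 1])
    start = c + 1
buckets.append(data[start:])
print(buckets)   # [[1, 1, 4, 5, 5, 7, 9], [14, 16, 18], [27, 30, 30, 32]]
```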
HIERARCHICAL REDUCTION Use multi-resolution structure with different degrees of reduction Hierarchical clustering is often performed but tends to define partitions of data sets rather than “clusters” SUSHIL KULKARNI
HIERARCHICAL REDUCTION Hierarchical aggregation  An index tree hierarchically divides a data set into partitions by value range of some attributes Each partition can be considered as a bucket Thus an index tree with aggregates stored at each node is a hierarchical histogram SUSHIL KULKARNI
MULTIDIMENSIONAL INDEX STRUCTURES CAN BE USED FOR DATA REDUCTION  Each level of the tree can be used to define a multi-dimensional equi-depth histogram. E.g., R3, R4, R5, R6 define multidimensional buckets which approximate the points. [Example: an R-tree with root R0, internal nodes R1 and R2, and leaf rectangles R3-R6 grouping the points a-i] SUSHIL KULKARNI
SAMPLING Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data Choose a  representative  subset of the data - Simple random sampling may have very poor  performance in the presence of skew SUSHIL KULKARNI
SAMPLING Develop adaptive sampling methods Stratified sampling:  Approximate the percentage of each class (or subpopulation of interest) in the overall database  Used in conjunction with skewed data Sampling may not reduce database I/Os (page at a time). SUSHIL KULKARNI
SAMPLING  [Diagram: SRSWOR (simple random sample without replacement) and SRSWR (simple random sample with replacement) drawn from the raw data] SUSHIL KULKARNI
SAMPLING  Raw Data → Cluster/Stratified Sample. The number of samples drawn from each cluster/stratum is proportional to its size. Thus the samples represent the data better and outliers are avoided. SUSHIL KULKARNI
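A standard-library Python sketch of SRSWOR, SRSWR and stratified sampling; the toy population and strata are assumptions for illustration.

```python
# Simple random sampling with/without replacement, plus proportional stratified sampling.
import random

random.seed(0)
population = list(range(100))                 # raw data: 100 tuples

srswor = random.sample(population, 10)        # without replacement
srswr  = random.choices(population, k=10)     # with replacement (duplicates possible)

# Stratified: draw from each stratum in proportion to its size
strata = {"gold": list(range(20)), "regular": list(range(20, 100))}
rate = 0.1
stratified = {name: random.sample(items, int(len(items) * rate))
              for name, items in strata.items()}
print(srswor, srswr, stratified, sep="\n")
```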
LINK ANALYSIS Link Analysis  uncovers relationships  among data. - Affinity Analysis - Association Rules - Sequential Analysis determines sequential patterns SUSHIL KULKARNI
EX: TIME SERIES ANALYSIS Example:  Stock Market Predict future values Determine similar patterns over time Classify behavior SUSHIL KULKARNI
DATA MINING DEVELOPMENT  [Diagram: techniques contributing to data mining - similarity measures, hierarchical clustering, IR systems, imprecise queries, textual data, web search engines (information retrieval); Bayes theorem, regression analysis, EM algorithm, K-means clustering, time series analysis (statistics); neural networks, decision tree algorithms (machine learning); algorithm design techniques, algorithm analysis, data structures (algorithms); relational data model, SQL, association rule algorithms, data warehousing, scalability techniques (databases)] SUSHIL KULKARNI
INTENTIONS  List the various data mining metrics. What are the different visualization techniques of data mining? Write a short note on the "Database perspective of data mining". Write a short note on each of the related concepts of data mining. SUSHIL KULKARNI
VIEW DATA USING DATA MINING  SUSHIL KULKARNI
DATA MINING METRICS Usefulness Return on Investment (ROI) Accuracy Space/Time SUSHIL KULKARNI
VISUALIZATION TECHNIQUES Graphical Geometric Icon-based Pixel-based Hierarchical Hybrid SUSHIL KULKARNI
DATA BASE PERSPECTIVE ON DATA MINING Scalability Real World Data Updates Ease of Use SUSHIL KULKARNI
RELATED CONCEPTS OUTLINE Database/OLTP Systems Fuzzy Sets and Logic Information Retrieval(Web Search Engines) Dimensional Modeling Goal:  Examine some areas which are related to data mining. SUSHIL KULKARNI
RELATED CONCEPTS OUTLINE Data Warehousing OLAP Statistics Machine Learning Pattern Matching SUSHIL KULKARNI
DB AND OLTP SYSTEMS  Schema: (ID, Name, Address, Salary, JobNo). Data Model: ER and Relational. Transaction. Query: SELECT Name FROM T WHERE Salary > 10000. DM: only imprecise queries. SUSHIL KULKARNI
FUZZY SETS AND LOGIC Fuzzy Set:   Set membership function is a real valued function with output in the range [0,1]. f(x):  Probability x is in F. 1-f(x):  Probability x is not in F. Example: T = {x | x is a person and x is tall}  Let f(x) be the probability that x is tall. Here f is the membership function DM:  Prediction and classification are fuzzy. SUSHIL KULKARNI
FUZZY SETS SUSHIL KULKARNI
FUZZY SETS  The fuzzy sets show a triangular view of the membership values: the membership value for "short" gradually decreases, the value for "medium" gradually increases and then decreases, and the value for "tall" gradually increases. SUSHIL KULKARNI
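A small Python sketch of triangular membership functions like the ones described; the height break points (in cm) for "short", "medium" and "tall" are illustrative assumptions.

```python
# Triangular fuzzy membership: rises from 0 at a to 1 at b, then falls back to 0 at c.
def triangular(x, a, b, c):
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def short(h):  return 1.0 if h <= 150 else triangular(h, 140, 150, 165)
def medium(h): return triangular(h, 150, 165, 180)
def tall(h):   return 1.0 if h >= 180 else triangular(h, 165, 180, 190)

for h in (150, 163, 170, 178, 185):
    print(h, round(short(h), 2), round(medium(h), 2), round(tall(h), 2))
```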
CLASSIFICATION/PREDICTION IS FUZZY  [Graph of decision vs. loan amount: a simple (crisp) classifier switches abruptly between Reject and Accept at a threshold, while a fuzzy classifier moves gradually between Reject and Accept] SUSHIL KULKARNI
INFORMATION RETRIEVAL  Information Retrieval (IR): retrieving desired information from textual data. 1. Library Science  2. Digital Libraries  3. Web Search Engines  4. Traditionally keyword based. Sample query: "Find all documents about data mining." DM: similarity measures; mine text/Web data. SUSHIL KULKARNI
INFORMATION RETRIEVAL  Similarity: a measure of how close a query is to a document. Documents which are "close enough" are retrieved. Metrics: Precision = |Relevant and Retrieved| / |Retrieved|; Recall = |Relevant and Retrieved| / |Relevant|. SUSHIL KULKARNI
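A tiny Python sketch of both metrics computed from sets of document ids; the ids and the query outcome are made up for illustration.

```python
# Precision and recall for one query result.
relevant  = {1, 2, 3, 4, 5}         # documents that are actually relevant
retrieved = {3, 4, 5, 6, 7, 8}      # documents the query returned

hit = relevant & retrieved              # relevant AND retrieved
precision = len(hit) / len(retrieved)   # 3 / 6 = 0.5
recall    = len(hit) / len(relevant)    # 3 / 5 = 0.6
print(precision, recall)
```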
IR QUERY RESULT MEASURES AND CLASSIFICATION  [Diagram: IR query results partitioned into relevant/retrieved regions, shown alongside the analogous classification view] SUSHIL KULKARNI
DIMENSION MODELING View data in a hierarchical manner more as business executives might Useful in decision support systems and mining Dimension:  collection of logically related attributes; axis for modeling data. SUSHIL KULKARNI
DIMENSION MODELING Facts:  data stored Example:  Dimensions – products, locations, date   Facts – quantity, unit price DM: May view data as dimensional. SUSHIL KULKARNI
AGGREGATION HIERARCHIES SUSHIL KULKARNI
STATISTICS Simple descriptive models Statistical inference:   generalizing a model created from a sample of the data to the entire dataset. Exploratory Data Analysis:   1. Data can actually drive the creation of the model 2. Opposite of traditional statistical  view. SUSHIL KULKARNI
STATISTICS Data mining targeted to business user DM: Many data mining methods come  from statistical techniques.  SUSHIL KULKARNI
MACHINE LEARNING Machine Learning:  area of AI that examines how to write programs that can learn. Often used in classification and prediction  Supervised Learning:   learns by example. SUSHIL KULKARNI
MACHINE LEARNING Unsupervised Learning:   learns without knowledge of correct answers. Machine learning often deals with small static datasets.  DM:  Uses many machine learning techniques. SUSHIL KULKARNI
PATTERN MATCHING (RECOGNITION) Pattern Matching:  finds occurrences of a  predefined pattern in the data. Applications include speech recognition, information retrieval, time series analysis. DM:  Type of classification. SUSHIL KULKARNI
T H A N K S ! SUSHIL KULKARNI
