Introduction to Knowledge Discovery in Databases
Chapter 1
Presented by: Kartik N. Kalpande
What is Knowledge Acquisition?
• Also known as: data mining, knowledge discovery, knowledge extraction, information discovery, information harvesting, etc.
• The process of discovering useful information, hidden patterns or rules in large quantities of data (non-trivial, previously unknown)
• By automatic or semi-automatic means
• At this scale it is practically impossible to find such patterns manually.
Why Knowledge Acquisition?
• Data explosion (a tremendous amount of data is available)
• Data is being warehoused
• Computing power
• Competitive pressure
Hard disks nowadays have capacities of more than 100 GB
Is Data Mining Appropriate for My Problem?
• Four general questions to consider:
  • Can we clearly define the problem?
  • Does potentially meaningful data exist?
  • Does the data contain hidden knowledge, or is the data factual and useful for reporting purposes only?
  • Will the cost of processing the data be less than the likely increase in profit from applying any knowledge gained from the data mining project?
Traditional Approaches
• Traditional database queries: access a database using a well-defined query written in a language such as SQL
• The query output consists of data from the database
• The output is usually a subset of the database
[Figure: SQL, DBMS, DB]
Data Mining or Data Query
• Four general types of knowledge can be defined to help us determine when data mining is appropriate:
Shallow Knowledge
Multidimensional Knowledge
Hidden Knowledge
Deep Knowledge
Shallow Knowledge
• Factual in nature
• Can be easily stored and manipulated in a database
• Database query languages such as SQL are excellent tools for extracting shallow knowledge from data
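As a minimal sketch of shallow knowledge, the snippet below runs a factual lookup with SQL through Python's built-in sqlite3 module. The customers table, its columns and its rows are hypothetical and exist only for illustration.

```python
import sqlite3

# Hypothetical table and rows, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (name TEXT, annual_income INTEGER, owns_home INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [("Ali", 32000, 1), ("Mei", 52000, 0), ("Siva", 28000, 1)],
)

# Shallow knowledge: a fact read directly from the stored data.
rows = conn.execute(
    "SELECT name FROM customers WHERE annual_income >= ? ORDER BY name",
    (30000,),
).fetchall()
high_income = [name for (name,) in rows]
print(high_income)
conn.close()
```

The query only returns a subset of what was stored; no new pattern is discovered.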
Multidimensional Knowledge
• Also factual in nature
• Data is stored in a multidimensional format
• On-line Analytical Processing (OLAP) tools are used on multidimensional data
Hidden Knowledge
• Patterns or regularities in data that cannot be easily found using a database query language such as SQL
• Data mining algorithms are designed to find such patterns.
Deep Knowledge
• Knowledge stored in a database that can only be found if we are given some direction about what we are looking for.
• Current data mining tools are not able to locate deep knowledge.
What can computers learn?
• Four levels of learning can be differentiated (Merrill & Tennyson, 1977):
  • Facts: simple statements of truth
  • Concepts: sets of objects, symbols, or events grouped together because they share certain characteristics
  • Procedures: step-by-step courses of action to achieve a goal
  • Principles: the highest level of learning; general truths or laws that are basic to other truths
What can computers learn?
• Computers are good at learning concepts.
• Concepts are the output of a data mining session.
• There are three common concept views:
a. Classical view
b. Probabilistic view
c. Exemplar view
Three Concept Views
a. Classical View:
• Definite defining properties
• These properties determine if an individual item is an
example of a particular concept.
• Crisp and leaves no room for misinterpretation.
• Example: Good Credit Rating
IF Annual Income >= 30,000
& Years at Current Position >= 5
& Owns Home = True
THEN Good Credit Risk = True
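The rule above can be sketched directly as a function. The function and parameter names are hypothetical; the thresholds are the ones given in the rule.

```python
def is_good_credit_risk(annual_income, years_at_position, owns_home):
    """Classical view: every defining property must hold, with no degrees of membership."""
    return bool(annual_income >= 30_000 and years_at_position >= 5 and owns_home)

print(is_good_credit_risk(32_000, 6, True))
print(is_good_credit_risk(27_000, 4, True))
```

An applicant who misses any single condition is rejected outright, which is what makes the classical view crisp.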
Three Concept Views
b. Probabilistic View:
• Concepts are represented by properties that are probable of concept members.
• The assumption is that people store and recall concepts as generalizations created from observations of individual instances.
• Cannot be applied directly to reach an answer, but can be used to help in the decision-making process.
• Associates a probability of membership with a specific classification.
• Example: Good Credit Rating
- The mean annual income for individuals who consistently make loan payments on time is $30,000.
- Most individuals who are good credit risks have been working for the same company for at least five years.
- The majority of good credit risks own their own home.
A homeowner with an annual income of $27,000, employed at the same position for 4 years, might be classified as a good credit risk with a probability of 0.85.
Three Concept Views
c. Exemplar View:
• A given instance is determined to be an example of a particular concept if the instance is similar enough to a set of one or more known examples of the concept.
• The assumption is that people store and recall likely concept exemplars, which are then used to classify new instances.
• Can associate a probability of concept membership with each classification.
Three Concept Views
c. Exemplar View:
• Example:
Exemplar #1:
Annual Income = 32,000
Number of years at current position = 6
Homeowner
Exemplar #2:
Annual Income = 52,000
Number of years at current position = 16
Renter
Exemplar #3:
Annual Income = 28,000
Number of years at current position = 12
Homeowner
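A minimal sketch of exemplar-based classification using the three exemplars listed above. Several things here are assumptions and not part of the slides: that all three exemplars represent the good credit risk concept, the scaling inside the distance function, and the similarity threshold of 1.0.

```python
# The three exemplars from the slide. Treating all of them as examples of the
# "good credit risk" concept is an assumption of this sketch.
EXEMPLARS = [
    {"income": 32_000, "years": 6, "homeowner": True},
    {"income": 52_000, "years": 16, "homeowner": False},
    {"income": 28_000, "years": 12, "homeowner": True},
]

def distance(a, b):
    """Hypothetical dissimilarity: scaled attribute gaps, plus 1 for a home-status mismatch."""
    income_gap = abs(a["income"] - b["income"]) / 10_000
    years_gap = abs(a["years"] - b["years"]) / 10
    home_gap = 0 if a["homeowner"] == b["homeowner"] else 1
    return income_gap + years_gap + home_gap

def matches_concept(instance, exemplars=EXEMPLARS, threshold=1.0):
    """Exemplar view: accept the instance if it is close enough to any known exemplar."""
    return min(distance(instance, e) for e in exemplars) <= threshold

applicant = {"income": 27_000, "years": 4, "homeowner": True}
print(matches_concept(applicant))
```

Unlike the classical rule, nothing here states why an applicant qualifies; the decision rests entirely on similarity to stored cases.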
What can be mined?
Concepts that can be mined
a. Classes:
• Stored data is used to locate data in predetermined groups.
• E.g.: A restaurant chain could mine customer purchase data to determine when customers visit and what they typically order.
Concepts that can be mined
b. Clusters:
• Data items are grouped by logical relationships.
• E.g.: Data can be mined to identify market segments or customer affinities.
Concepts that can be mined
c. Associations:
• Data can be mined to identify associations between items.
• E.g.: The beer-diaper example (shoppers who buy diapers also tend to buy beer) is typical of association mining.
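To make the association idea concrete, the sketch below computes the two usual measures for a rule such as "diapers implies beer": support and confidence. The baskets are invented for illustration and are not data from the slides.

```python
# Hypothetical market baskets, invented for illustration.
baskets = [
    {"beer", "diapers", "chips"},
    {"diapers", "milk"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"beer", "diapers", "milk"},
]

def support(itemset):
    """Fraction of baskets that contain every item in the itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Share of baskets containing the antecedent that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"diapers", "beer"}))
print(confidence({"diapers"}, {"beer"}))
```

Here three of the five baskets contain both items, and three of the four diaper baskets also contain beer.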
Concepts that can be mined
d. Sequential patterns:
• Data is mined to anticipate behavior patterns and trends.
• E.g.: An outdoor equipment retailer could predict the likelihood of a backpack purchase based on sales of sleeping bags or hiking shoes.
Multidisciplinary
[Figure: KDD and data mining draw on several disciplines: databases, statistics, pattern recognition, machine learning, AI and neurocomputing]
Disciplines of Data Mining
[Figure: data mining and its contributing disciplines: information retrieval, algorithms, machine learning, visualization, statistics and database systems]
Data Mining Models & Tasks
Data mining models are either predictive or descriptive:
• Predictive: classification, regression, time series analysis, prediction
• Descriptive: clustering, summarization, association rules, sequence discovery
Predictive Model
• Makes predictions about values of data using known results found from different data
• Or based on the use of other historical data
• Examples: credit card fraud, breast cancer early warning, terrorist acts, tsunamis, etc.
Predictive Model
• Performs inference on the current data to make predictions.
• We know what to predict (based on historical data).
• Never 100% accurate
• Concentrates on the input-output relationship (x, f(x))
• Typical questions:
  • Which customers are likely to buy this product in the next four months?
  • What kinds of transactions are likely to be fraudulent?
  • Who is likely to drop this paper?
Predictive Model
[Figure: scatter plot of profit (RM) against months; x marks current data and O marks the future value to be predicted]
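The profit-against-months picture can be sketched as a simple least-squares trend line that learns f(x) from current data and evaluates it at a future month. The monthly figures below are invented and deliberately lie on a straight line; real data would be noisy and the forecast would be only approximate.

```python
# Hypothetical monthly profit figures (RM), invented for illustration.
months = [1, 2, 3, 4, 5, 6]
profit = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Learn the input-output relationship from current data, then predict month 7.
slope, intercept = fit_line(months, profit)
forecast_month_7 = slope * 7 + intercept
print(round(forecast_month_7, 2))
```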
Descriptive Model
• Identifies patterns or relationships in data.
• Serves as a way to explore the properties of the data examined, not to predict new properties
• Always requires a domain expert
• Examples:
  • Segmenting marketing areas
  • Profiling student performance
Descriptive Model
• Discovers new patterns inside the data
• We may have no idea what the data looks like
• Explores the properties of the data examined
• Patterns at various granularities (e.g. student: university -> faculty -> program -> major)
• Typical questions:
  • What is the data?
  • What does it look like?
  • What does the data suggest for advertising to groups of customers?
Descriptive Model
[Figure: scatter plot with axes labelled major and Results; the points (x, o, y) fall into three clusters labelled Group 1, Group 2 and Group 3]
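Grouping points into clusters like the three groups in the figure can be sketched with k-means, one common clustering technique. The slides do not say which method produced the groups, so this choice is an assumption. The points and the hand-picked starting centroids below are hypothetical.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: (p[0] - centroids[i][0]) ** 2 + (p[1] - centroids[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical 2-D points forming three visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (1, 8), (2, 9), (1, 9)]
# Starting centroids picked by hand, one per group, to keep the run deterministic.
centroids, clusters = kmeans(points, [(1, 1), (8, 8), (1, 8)])
print([len(c) for c in clusters])
```

No class labels are supplied; the groups emerge from the data alone, which is what makes the model descriptive.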
Views of DM
• Data to be mined
  • Data warehouse, WWW, time series, textual, spatial, multimedia, transactional
• Knowledge to be mined
  • Classification, prediction, summarization, trend
• Techniques utilized
  • Database, machine learning, visualization, statistics
• Applications adapted
  • Marketing, demographic segmentation, stock analysis
DM in Action
• Medical applications (clinical diagnosis, drug analysis)
• Business (marketing segmentation & strategies, insolvency prediction, loan risk assessment)
• Education (online learning)
• Internet (search engines)
• Etc.
Data Mining Methodology
• Hypothesis Testing vs Knowledge Discovery
• Hypothesis Testing
  • Top-down approach
  • Attempts to substantiate or disprove a preconceived idea
• Knowledge Discovery
  • Bottom-up approach
  • Starts with the data and tries to get it to tell us something we didn't already know
Data Mining Methodology
• Hypothesis Testing
  • Generate good ideas
  • Determine what data would allow these hypotheses to be tested
  • Locate the data
  • Prepare the data for analysis
  • Build computer models based on the data
  • Evaluate the computer models to confirm or reject the hypotheses
Data Mining Methodology
• Knowledge Discovery
• Directed
  • Identify sources of preclassified data
  • Prepare the data for analysis
  • Select appropriate KD techniques based on the data characteristics and the data mining goal
  • Divide the data into training, testing and evaluation sets
  • Use the training dataset to build the model
  • Tune the model by applying it to the test dataset
  • Take action based on the data mining results
  • Measure the effect of the action taken
  • Restart the DM process, taking advantage of new data generated by the action taken
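The step of dividing data into training, testing and evaluation sets can be sketched as follows. The 60/20/20 proportions and the fixed random seed are assumptions chosen for illustration; the slides do not prescribe any ratio.

```python
import random

def split_dataset(records, train_pct=60, test_pct=20, seed=42):
    """Shuffle a copy of the records, then cut it into training, test and evaluation sets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_pct // 100
    n_test = len(shuffled) * test_pct // 100
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_test],
        shuffled[n_train + n_test:],  # whatever remains is held back for evaluation
    )

records = list(range(100))  # stand-in for preclassified records
training, testing, evaluation = split_dataset(records)
print(len(training), len(testing), len(evaluation))
```

Keeping the evaluation set untouched until the end gives an estimate of performance that is not biased by the tuning step.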
Data Mining Methodology
• Knowledge Discovery
• Undirected
  • Identify available data sources
  • Prepare the data for analysis
  • Select appropriate undirected KD techniques based on the data characteristics and the data mining goal
  • Use the selected technique to uncover hidden structure in the data
  • Identify potential targets for directed KD
  • Generate new hypotheses to test
Question for Group Dis
Revision: Two Approaches in Data Mining
• Predictive (predict the future value): classification, regression, time series analysis, prediction
• Descriptive (define relationships among data): clustering, summarization, association rules, sequence discovery
Knowledge Discovery Process
• 1.0 Selection
  • The data needed for the data mining process may be obtained from many different and heterogeneous data sources
  • Examples:
    • Business transactions
    • Scientific data
    • Video and pictures
Knowledge Discovery Process
• 2.0 Pre-processing
  • Main idea: ensure that the data is clean (high-quality data).
  • The data to be used by the process may contain incorrect or missing values.
  • There may be anomalous data from multiple sources involving different data types and metrics
  • Erroneous data may be corrected or removed, whereas missing data must be supplied or predicted (often using data mining tools)
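A minimal sketch of two typical cleaning steps: removing replicated records and supplying missing values. The records are hypothetical, and filling gaps with the mean of the known values is only one simple choice among many.

```python
# Hypothetical student records; None marks a missing grade.
records = [
    {"id": 1, "grade": 3.0},
    {"id": 2, "grade": None},
    {"id": 2, "grade": None},  # replicated record
    {"id": 3, "grade": 4.0},
]

# Remove replicated records, keeping the first occurrence of each id.
seen = set()
unique = []
for r in records:
    if r["id"] not in seen:
        seen.add(r["id"])
        unique.append(dict(r))

# Supply missing grades with the mean of the known grades.
known = [r["grade"] for r in unique if r["grade"] is not None]
mean_grade = sum(known) / len(known)
for r in unique:
    if r["grade"] is None:
        r["grade"] = mean_grade

print(unique)
```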
Knowledge Discovery Process
• 3.0 Transformation
  • Data from different sources must be converted into a common format for processing
  • Some data may be encoded or transformed into more usable formats
  • Examples: data cleaning, data integration, data transformation, data reduction and data discretization
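Two common transformations can be sketched briefly: rescaling a numeric attribute to a common range and encoding a categorical attribute as integers. The helper names are hypothetical, and the sample values are borrowed from the credit-risk exemplars purely for illustration.

```python
def min_max_scale(values):
    """Rescale numbers linearly to [0, 1]; assumes the values are not all equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def encode_categories(labels):
    """Map each distinct label to an integer code, in order of first appearance."""
    codes = {}
    return [codes.setdefault(label, len(codes)) for label in labels]

incomes = [28_000, 32_000, 52_000]
housing = ["Homeowner", "Renter", "Homeowner"]

scaled_incomes = min_max_scale(incomes)
housing_codes = encode_categories(housing)
print(scaled_incomes)
print(housing_codes)
```

After this step every attribute is numeric and on a comparable scale, which many mining algorithms require.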
Knowledge Discovery Process
• 4.0 Data Mining
  • Main idea: use intelligent methods to extract patterns and knowledge from the database
  • This step applies algorithms to the transformed data to generate the desired results.
  • The heart of the KD process (where unknown patterns are revealed).
  • Examples of algorithms: regression (classification, prediction), neural networks (prediction, classification, clustering), Apriori (association rules), K-Means (clustering), K-Nearest Neighbor (classification), decision trees (classification), instance-based learning (classification).
Knowledge Discovery Process
• 5.0 Interpretation/Evaluation
  • How the data mining results are presented to the users is extremely important, because the usefulness of the results depends on it
  • Examples:
    • Graphical
    • Geometric
    • Icon-based
    • Pixel-based
    • Hierarchical
    • Hybrid
Case Study: Predicting FSK Final-Year Students' Performance
• Source: student database (contains 30,000 records) covering activities and academics
• Selection: selected academic records {matric, PMK, grades}; only 2,000 records (containing incomplete records etc.)
• Pre-processing: clean the records {replace missing values, remove replicated records}
• Transformation: transform the data into numerical form for use with neural networks
• Data mining: generated model Y = w1x1 + w2x2 + b1 (pattern for performance prediction)
• Interpretation & evaluation: testing result 90% correct → accept model
• Knowledge: apply the model
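The last two stages of the case study can be sketched as evaluating the linear model Y = w1x1 + w2x2 + b1 on held-out records and measuring how often it is correct. The weights, the pass mark and the test records below are all hypothetical, since the slides do not give them. They are chosen so that the sketch reaches the 90% figure quoted in the case study.

```python
W1, W2, B1 = 0.5, 0.25, 1.0  # hypothetical weights and bias
PASS_MARK = 2.5              # hypothetical cut-off on Y

def predict(x1, x2):
    """Linear model from the case study: Y = w1*x1 + w2*x2 + b1."""
    return W1 * x1 + W2 * x2 + B1

# Hypothetical test records: (x1, x2, student actually passed).
test_set = [
    (3, 4, True), (1, 0, False), (2, 2, True), (0, 4, False), (4, 0, True),
    (1, 1, False), (2, 4, True), (0, 0, False), (3, 0, True), (1, 2, True),
]

# Count how many predictions agree with the known outcome.
correct = sum((predict(x1, x2) >= PASS_MARK) == passed for x1, x2, passed in test_set)
accuracy = correct / len(test_set)
print(accuracy)
```

Whether 90% is good enough to accept a model is a judgment for the domain expert, not something the code decides.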
