Data Mining Primitives

Uploaded by

pradibirajdar57

Chapter-II
Data Mining Primitives, Languages and System Architecture
Content
 Data mining primitives
 Languages
 System architecture
Introduction
 Motivation: the need to extract useful information and knowledge from large amounts of data (the data explosion problem).
 Data mining tools perform data analysis and may uncover important data patterns, contributing greatly to business strategies, knowledge bases, and scientific and medical research.
What is Data Mining?
 Data mining refers to extracting or “mining” knowledge from large amounts of data. It is also referred to as Knowledge Discovery in Databases (KDD).
 It is the process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses, or other information repositories.
Architecture of a typical data mining system
[Figure: components of a typical data mining system, listed from top to bottom: graphical user interface; pattern evaluation; knowledge base; data mining engine; database or data warehouse server; data cleansing, data integration, and filtering; database and data warehouse.]
 Misconception: data mining systems can autonomously dig out all of the valuable knowledge from a given large database, without human intervention.
 Without user intervention, the system would uncover a set of patterns so large that it may even exceed the size of the database. Hence, user interaction is required.
 This communication between the user and the system is provided by a set of data mining primitives.
Data Mining Primitives
Data mining primitives define a data mining task, which can be specified in the form of a data mining query.
 Task-relevant data
 Kinds of knowledge to be mined
 Background knowledge
 Interestingness measures
 Presentation and visualization of discovered patterns
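One way to picture how the five primitives combine into a single query is a small record type. The sketch below is only an illustration: the class name `MiningTask`, its field names, and the helper `missing_primitives` are hypothetical and do not come from any data mining system.

```python
from dataclasses import dataclass, field


@dataclass
class MiningTask:
    """One data mining query, described by the five primitives (hypothetical container)."""
    task_relevant_data: dict                                    # database, tables, conditions, attributes
    kind_of_knowledge: str                                      # e.g. "association", "classification"
    background_knowledge: dict = field(default_factory=dict)    # concept hierarchies
    interestingness: dict = field(default_factory=dict)         # measure name -> threshold
    presentation: list = field(default_factory=lambda: ["rules"])

    def missing_primitives(self):
        """Return the names of the primitives that were left empty."""
        return [name for name, value in vars(self).items() if not value]


task = MiningTask(
    task_relevant_data={"database": "AllElectronics_db", "tables": ["item", "customer"]},
    kind_of_knowledge="association",
    interestingness={"support": 0.05, "confidence": 0.70},
)
print(task.missing_primitives())  # background knowledge was not supplied
```

Keeping the primitives in one record makes it easy to see which parts of a query the user has not yet specified.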
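Task-relevant data
 The portion of the data to be investigated.
 Attributes of interest (relevant attributes) can be specified.
 Initial data relation: the relation that results from collecting the task-relevant data.
 Minable view: the task-relevant data regarded as a (possibly virtual) relation on which mining is performed.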
Example
 If a data mining task is to study associations between items frequently purchased at AllElectronics by customers in Canada, the task-relevant data can be specified by providing the following information:
 Name of the database or data warehouse to be used (e.g., AllElectronics_db)
 Names of the tables or data cubes containing relevant data (e.g., item, customer, purchases and items_sold)
 Conditions for selecting the relevant data (e.g., retrieve data pertaining to purchases made in Canada for the current year)
 The relevant attributes or dimensions (e.g., name and price from the item table and income and age from the customer table)
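The selection described above can be sketched with an in-memory SQLite database standing in for AllElectronics_db. The column names, the sample rows, and the value of `CURRENT_YEAR` are assumptions made for illustration. The items_sold table is omitted to keep the sketch short, and "purchases made in Canada" is approximated by the customer's country.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stands in for AllElectronics_db
conn.executescript("""
CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, income INTEGER, age INTEGER, country TEXT);
CREATE TABLE item (item_id INTEGER PRIMARY KEY, name TEXT, price REAL);
CREATE TABLE purchases (cust_id INTEGER, item_id INTEGER, year INTEGER);
INSERT INTO customer VALUES (1, 45000, 34, 'Canada'), (2, 60000, 51, 'USA');
INSERT INTO item VALUES (10, 'VCR', 199.0), (11, 'computer', 899.0);
INSERT INTO purchases VALUES (1, 10, 2024), (1, 11, 2023), (2, 11, 2024);
""")

CURRENT_YEAR = 2024  # assumption for the toy data

# Condition: purchases by Canadian customers in the current year.
# Relevant attributes: name and price from item, income and age from customer.
initial_relation = conn.execute(
    """
    SELECT i.name, i.price, c.income, c.age
    FROM purchases p
    JOIN customer c ON c.cust_id = p.cust_id
    JOIN item i ON i.item_id = p.item_id
    WHERE c.country = 'Canada' AND p.year = ?
    """,
    (CURRENT_YEAR,),
).fetchall()
print(initial_relation)
```

The rows returned play the role of the initial data relation that a mining step would then work on.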
Kind of knowledge to be mined
 It is important to specify the kind of knowledge to be mined, as this determines the data mining function to be performed.
 Kinds of knowledge include concept description, association, classification, prediction and clustering.
 The user can also provide pattern templates that the discovered patterns must match. These are also called metapatterns, metarules, or metaqueries.
Example
A user studying the buying habits of AllElectronics customers may choose to mine association rules of the form:
P(X: customer, W) ^ Q(X, Y) => buys(X, Z)
Rules such as the following match this metarule (the bracketed values are support and confidence):
age(X, “30...39”) ^ income(X, “40K...49K”) => buys(X, “VCR”) [2.2%, 60%]
occupation(X, “student”) ^ age(X, “20...29”) => buys(X, “computer”) [1.4%, 70%]
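A metarule acts as a shape filter on candidate rules. The sketch below checks a rule against the template P(X: customer, W) ^ Q(X, Y) => buys(X, Z). The tuple encoding of predicates and the function name `matches_metarule` are assumptions made for this illustration, not part of DMQL.

```python
def matches_metarule(antecedents, consequent):
    """Check a rule against the shape P(X, W) ^ Q(X, Y) => buys(X, Z).

    Each predicate is encoded as (name, variable, value).
    """
    if len(antecedents) != 2:
        return False
    name, subject, _ = consequent
    if name != "buys":
        return False
    # P and Q may be any predicates, but all of them must describe the same X.
    return all(var == subject for _, var, _ in antecedents)


rule = ([("age", "X", "30...39"), ("income", "X", "40K...49K")], ("buys", "X", "VCR"))
too_short = ([("age", "X", "30...39")], ("buys", "X", "VCR"))
print(matches_metarule(*rule), matches_metarule(*too_short))
```

Only rules with exactly two antecedent predicates about the same customer and a buys consequent pass the check.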
Background knowledge
 It is information about the domain to be mined.
 Concept hierarchies are a powerful form of background knowledge.
 Four major types of concept hierarchies:
schema hierarchies
set-grouping hierarchies
operation-derived hierarchies
rule-based hierarchies
Concept hierarchies (1)
 A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level (more general) concepts.
 It allows data to be mined at multiple levels of abstraction.
 It lets users view data from different perspectives, giving further insight into the relationships in the data.
 Example (location)
Example
Level 0: all
Level 1: Canada, USA
Level 2: British Columbia, Ontario (Canada); New York, Illinois (USA)
Level 3: Vancouver, Victoria (British Columbia); Toronto, Ottawa (Ontario); New York, Buffalo (New York); Chicago (Illinois)

Concept hierarchies (2)
 Rolling up: generalization of data
Allows data to be viewed at more meaningful and explicit abstractions.
Makes the data easier to understand.
Compresses the data.
Mining the compressed data requires fewer input/output operations.
 Drilling down: specialization of data
Concept values are replaced by lower-level concepts.
 There may be more than one concept hierarchy for a given attribute or dimension, based on different user viewpoints.
 Example:
A regional sales manager may prefer the previous concept hierarchy, but a marketing manager might prefer to see location organized along linguistic lines in order to facilitate the distribution of commercial ads.
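Rolling up along the location hierarchy can be sketched as replacing each city by its parent concept and summing. The two mapping tables follow the location example. The sales figures and the function name `roll_up` are invented for illustration.

```python
from collections import Counter

# city -> province/state -> country, following the location example
CITY_TO_REGION = {
    "Vancouver": "British Columbia", "Victoria": "British Columbia",
    "Toronto": "Ontario", "Ottawa": "Ontario",
    "New York": "New York", "Buffalo": "New York", "Chicago": "Illinois",
}
REGION_TO_COUNTRY = {
    "British Columbia": "Canada", "Ontario": "Canada",
    "New York": "USA", "Illinois": "USA",
}


def roll_up(sales_by_city, levels=1):
    """Generalize city-level counts by one level (region) or two levels (country)."""
    totals = Counter()
    for city, amount in sales_by_city.items():
        key = CITY_TO_REGION[city]
        if levels == 2:
            key = REGION_TO_COUNTRY[key]
        totals[key] += amount
    return dict(totals)


sales = {"Vancouver": 5, "Victoria": 2, "Toronto": 7, "Chicago": 4}  # made-up counts
print(roll_up(sales))
print(roll_up(sales, levels=2))
```

Drilling down is the reverse direction and needs the detailed data, since the generalized totals alone cannot be split back into cities.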
Schema hierarchies
 A schema hierarchy is a total or partial order among attributes in the database schema.
 It may formally express existing semantic relationships between attributes.
 It provides metadata information.
 Example: location hierarchy
street < city < province/state < country
Set-grouping hierarchies
 Organizes values for a given attribute into groups, sets, or ranges of values.
 A total or partial order can be defined among groups.
 Used to refine or enrich schema-defined hierarchies.
 Typically used for small sets of object relationships.
 Example: set-grouping hierarchy for age
{young, middle_aged, senior} ⊂ all(age)
{20...39} ⊂ young
{40...59} ⊂ middle_aged
{60...89} ⊂ senior
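A set-grouping hierarchy amounts to a lookup from raw values to named groups. The sketch below uses the ranges from the DMQL definition of age_hierarchy (20 to 39, 40 to 59, 60 to 89). The names `AGE_GROUPS` and `age_group` are hypothetical.

```python
# Level-1 groups and the level-2 values they cover (upper bounds are exclusive in range()).
AGE_GROUPS = {
    "young": range(20, 40),
    "middle_aged": range(40, 60),
    "senior": range(60, 90),
}


def age_group(age):
    """Map a raw age to its level-1 concept, or None if the hierarchy does not cover it."""
    for label, ages in AGE_GROUPS.items():
        if age in ages:
            return label
    return None


print([age_group(a) for a in (25, 40, 89, 15)])
```

Ages outside the defined sets return None here. A real system would have to decide how to treat such values.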
Operation-derived hierarchies
 Operation-derived:
Based on specified operations.
Operations may include:
decoding of information-encoded strings
information extraction from complex data objects
data clustering
Example: URL or email address
xyz@cs.iitm.in gives login name < dept. < univ. < country
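Decoding an information-encoded string can be sketched as follows. The function assumes every address has exactly the shape login@dept.univ.country, as in the example address. Real addresses vary, so this is an illustration rather than a general parser, and the name `email_hierarchy` is hypothetical.

```python
def email_hierarchy(address):
    """Decode login < dept < univ < country from an address shaped like xyz@cs.iitm.in."""
    login, _, domain = address.partition("@")
    labels = domain.split(".")
    if not login or len(labels) != 3:
        raise ValueError(f"unexpected address shape: {address!r}")
    dept, univ, country = labels
    return {"login": login, "dept": dept, "univ": univ, "country": country}


print(email_hierarchy("xyz@cs.iitm.in"))
```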
Rule-based hierarchies
 Rule-based:
Occurs when the whole or a portion of a concept hierarchy is defined as a set of rules and is evaluated dynamically based on the current database data and the rule definition.
 Example: the following rules categorize items as low_profit_margin, medium_profit_margin, and high_profit_margin.
low_profit_margin(X) <= price(X,P1)^cost(X,P2)^((P1-P2)<50)
medium_profit_margin(X) <= price(X,P1)^cost(X,P2)^((P1-P2)≥50)^((P1-P2)≤250)
high_profit_margin(X) <= price(X,P1)^cost(X,P2)^((P1-P2)>250)
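The three rules translate directly into a function that is evaluated on demand for each item. This is a minimal sketch, and the function name `profit_margin_level` is hypothetical.

```python
def profit_margin_level(price, cost):
    """Apply the three profit-margin rules to one item."""
    margin = price - cost
    if margin < 50:
        return "low_profit_margin"
    if margin <= 250:  # reached only when margin >= 50
        return "medium_profit_margin"
    return "high_profit_margin"


for price, cost in [(100, 60), (150, 100), (350, 100), (351, 100)]:
    print(price, cost, profit_margin_level(price, cost))
```

The boundary values 50 and 250 both fall into the medium group, matching the ≥ and ≤ in the rules.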
Interestingness measure (1)
 Used to limit the number of uninteresting patterns returned by the process.
 Based on the structure of patterns and the statistics underlying them.
 Each measure is associated with a threshold that can be controlled by the user.
 Patterns not meeting the threshold are not presented to the user.
 Objective measures of pattern interestingness:
simplicity
certainty (confidence)
utility (support)
novelty
Interestingness measure (2)
 Simplicity
A pattern's interestingness is based on its overall simplicity for human comprehension.
Example: rule length is a simplicity measure.
 Certainty (confidence)
Assesses the validity or trustworthiness of a pattern.
Confidence is a certainty measure:
confidence(A=>B) = (# tuples containing both A and B) / (# tuples containing A)
A confidence of 85% for the rule buys(X, “computer”) => buys(X, “software”) means that 85% of all customers who purchased a computer also bought software.
Interestingness measure (3)
 Utility (support)
Assesses the usefulness of a pattern.
support(A=>B) = (# tuples containing both A and B) / (total # of tuples)
A support of 30% for the previous rule means that 30% of all customers in the computer department purchased both a computer and software.
 Association rules that satisfy both the minimum confidence threshold and the minimum support threshold are referred to as strong association rules.
 Novelty
Patterns that contribute new information to the given pattern set are called novel patterns (example: a data exception).
Removing redundant patterns is a strategy for detecting novelty.
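The support and confidence formulas can be checked on a handful of transactions. The transactions below are toy data invented for illustration. Each transaction is a set of items, and A and B are item sets.

```python
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
]  # toy data, not from the slides


def support(a, b, transactions):
    """Fraction of all transactions that contain both A and B."""
    both = sum(1 for t in transactions if a <= t and b <= t)
    return both / len(transactions)


def confidence(a, b, transactions):
    """Fraction of the transactions containing A that also contain B."""
    has_a = sum(1 for t in transactions if a <= t)
    both = sum(1 for t in transactions if a <= t and b <= t)
    return both / has_a if has_a else 0.0


A, B = {"computer"}, {"software"}
print(support(A, B, transactions), confidence(A, B, transactions))
```

Here two of the four transactions contain both items, and two of the three computer purchases also include software.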
Presentation and visualization
 For data mining to be effective, data mining systems should be able to display the discovered patterns in multiple forms, such as rules, tables, crosstabs (cross-tabulations), pie or bar charts, decision trees, cubes, or other visual representations.
 The user must be able to specify the forms of presentation to be used for displaying the discovered patterns.
Data mining query languages
 A data mining query language must be designed to facilitate flexible and effective knowledge discovery.
 Having a query language for data mining may help standardize the development of platforms for data mining systems.
 But designing a language is challenging because data mining covers a wide spectrum of tasks, and each task has different requirements.
 Hence, the design of a language requires a deep understanding of the limitations and underlying mechanisms of the various kinds of tasks.
Data mining query languages (2)
 So how would you design an efficient query language?
 Base it on the primitives discussed earlier.
 DMQL (Data Mining Query Language) allows mining of different kinds of knowledge from relational databases and data warehouses at multiple levels of abstraction.
DMQL
 Adopts an SQL-like syntax.
 Hence, it can be easily integrated with relational query languages.
 Defined in a BNF grammar. (Backus–Naur form, or Backus normal form (BNF), is a notation technique for context-free grammars, often used to describe the syntax of languages used in computing, such as programming languages, document formats, instruction sets and communication protocols.)
[ ] represents 0 or one occurrence
{ } represents 0 or more occurrences
Words in sans serif represent keywords
DMQL-Syntax for task-relevant data specification
 The names of the relevant database or data warehouse, the conditions, and the relevant attributes or dimensions must be specified:
 use database ‹database_name› or use data warehouse ‹data_warehouse_name›
 from ‹relation(s)/cube(s)› [where ‹condition›]
 in relevance to ‹attribute_or_dimension_list›
 order by ‹order_list›
 group by ‹grouping_list›
 having ‹condition›
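To see how these clauses fit together, the sketch below assembles the text of a task-relevant data specification from its parts. It is written in Python rather than DMQL, emits the clauses in the order listed in the syntax, and covers only the use database, from/where, and in relevance to clauses. The function name `task_relevant_clause` and the sample condition are assumptions.

```python
def task_relevant_clause(database, relations, attributes, condition=None):
    """Build the text of a DMQL task-relevant data specification."""
    from_clause = f"from {', '.join(relations)}"
    if condition:  # the where part is optional in the grammar
        from_clause += f" where {condition}"
    return "\n".join([
        f"use database {database}",
        from_clause,
        f"in relevance to {', '.join(attributes)}",
    ])


print(task_relevant_clause(
    "AllElectronics_db",
    ["item", "customer", "purchases", "items_sold"],
    ["item.name", "item.price", "customer.income", "customer.age"],
    condition='customer.country = "Canada"',  # hypothetical condition syntax
))
```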
Syntax for Kind of Knowledge to be Mined
 Characterization:
‹Mine_Knowledge_Specification› ::=
mine characteristics [as ‹pattern_name›]
analyze ‹measure(s)›
 Example:
mine characteristics as customerPurchasing analyze count%
 Discrimination:
‹Mine_Knowledge_Specification› ::=
mine comparison [as ‹pattern_name›]
for ‹target_class› where ‹target_condition›
{versus ‹contrast_class_i› where ‹contrast_condition_i›}
analyze ‹measure(s)›
 Example:
mine comparison as purchaseGroups
for bigspenders where avg(I.price) >= $100
versus budgetspenders where avg(I.price) < $100
analyze count
Syntax for Kind of Knowledge to be Mined (2)
 Association:
‹Mine_Knowledge_Specification› ::=
mine associations [as ‹pattern_name›]
[matching ‹metapattern›]
 Example:
mine associations as buyingHabits
matching P(X: customer, W) ^ Q(X, Y) => buys(X, Z)
 Classification:
‹Mine_Knowledge_Specification› ::=
mine classification [as ‹pattern_name›]
analyze ‹classifying_attribute_or_dimension›
 Example:
mine classification as classifyCustomerCreditRating
analyze credit_rating
Syntax for concept hierarchy specification
 More than one concept hierarchy per attribute can be specified.
 use hierarchy ‹hierarchy_name› for ‹attribute_or_dimension›
 Examples:
Schema concept hierarchy (ordering is important)
 define hierarchy location_hierarchy on address as [street, city, province_or_state, country]
Set-grouping concept hierarchy
 define hierarchy age_hierarchy for age on customer as
level1: {young, middle_aged, senior} < level0: all
level2: {20, ..., 39} < level1: young
level2: {40, ..., 59} < level1: middle_aged
level2: {60, ..., 89} < level1: senior
Syntax for concept hierarchy specification (2)
 Operation-derived concept hierarchy
 define hierarchy age_hierarchy for age on customer as
{age_category(1), ..., age_category(5)} := cluster(default, age, 5) < all(age)
 Rule-based concept hierarchy
 define hierarchy profit_margin_hierarchy on item as
level_1: low_profit_margin < level_0: all
if (price - cost) < $50
level_1: medium_profit_margin < level_0: all
if ((price - cost) >= $50) and ((price - cost) <= $250)
level_1: high_profit_margin < level_0: all
if (price - cost) > $250
Syntax for interestingness measure specification
 with [‹interest_measure_name›] threshold = ‹threshold_value›
 Example:
with support threshold = 5%
with confidence threshold = 70%
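The effect of the two thresholds in the example can be sketched as a filter over candidate rules. The rules and their support and confidence values are the ones quoted earlier in these slides. The tuple layout and the function name `strong_rules` are assumptions.

```python
# (rule, support, confidence), using the figures quoted in the earlier examples
rules = [
    ("age(30...39) ^ income(40K...49K) => buys(VCR)", 0.022, 0.60),
    ("occupation(student) ^ age(20...29) => buys(computer)", 0.014, 0.70),
    ("buys(computer) => buys(software)", 0.30, 0.85),
]


def strong_rules(rules, min_support=0.05, min_confidence=0.70):
    """Keep only rules that meet both the support and the confidence threshold."""
    return [rule for rule, sup, conf in rules
            if sup >= min_support and conf >= min_confidence]


print(strong_rules(rules))
```

With a 5% support threshold and a 70% confidence threshold, only the computer and software rule is kept. The other two fail on support.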
Syntax for pattern presentation and visualization specification
 display as ‹result_form›
 The result form can be rules, tables, cubes, crosstabs, pie or bar charts, decision trees, curves or surfaces.
 To facilitate interactive viewing at different concept levels or from different angles, the following syntax is defined:
‹Multilevel_Manipulation› ::= roll up on ‹attribute_or_dimension›
| drill down on ‹attribute_or_dimension›
| add ‹attribute_or_dimension›
| drop ‹attribute_or_dimension›
Architectures of Data Mining System
 With the popular and diverse applications of data mining, it is expected that a good variety of data mining systems will be designed and developed.
 Comprehensive information processing and data analysis infrastructures will continue to be built systematically around data warehouses and databases.
 A critical design question is whether data mining systems should be integrated with database systems.
 This gives rise to four architectures:
- No coupling
- Loose coupling
- Semi-tight coupling
- Tight coupling
Cont.
 No coupling: the DM system does not utilize any functionality of a DB or DW system.
 Loose coupling: the DM system uses some facilities of a DB or DW system, such as storing the data in either system and using that system for data retrieval.
 Semi-tight coupling: besides linking a DM system to a DB/DW system, efficient implementations of a few DM primitives are provided.
 Tight coupling: the DM system is smoothly integrated with the DB/DW system. Each of DM and DB/DW is treated as a main functional component of the information retrieval system.
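The difference between loose and semi-tight coupling can be sketched with an in-memory SQLite database. In the loose variant the database only stores and returns rows, and the counting happens in the mining code. In the semi-tight variant one primitive (counting per item) is pushed into the database. The table, its rows, and the choice of counting as the primitive are assumptions made for illustration.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (cust_id INTEGER, item TEXT)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [(1, "computer"), (1, "software"), (2, "computer"), (3, "printer")],
)

# Loose coupling: the database is used for storage and retrieval only.
rows = conn.execute("SELECT item FROM purchases").fetchall()
loose_counts = Counter(item for (item,) in rows)

# Semi-tight coupling: a mining primitive (per-item counting) runs inside the database.
semi_tight_counts = dict(
    conn.execute("SELECT item, COUNT(*) FROM purchases GROUP BY item").fetchall()
)

print(dict(loose_counts) == semi_tight_counts)
```

Both variants produce the same counts. They differ in where the work is done and in how much data has to leave the database.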
