Dr. A. MURUGANANDAM
Head cum Associate Professor, Data Mining And Warehousing
Karan Arts and Science College, TVM Department of Computer Science
Cell: 9842636119
UNIT – II
Data Mining Primitives – A data mining task can be specified in the form of a data mining query,
which is input to the data mining system. A data mining query is defined in terms of data mining
task primitives.
1. Task-relevant data:
o Specifies the portion of the database to be mined (e.g., attributes, tuples).
2. Kind of knowledge to be mined:
o Specifies the type of patterns to be discovered (e.g., classification, association,
clustering).
Associations: Discovering relationships between items (e.g., "customers who buy bread
also tend to buy milk").
Classifications: Categorizing data into predefined classes (e.g., classifying emails as
spam or not spam).
Clusters: Grouping similar data points together (e.g., segmenting customers based on
purchasing behavior).
Sequential patterns: Discovering patterns of events that occur in a specific order (e.g.,
identifying steps in a customer's online buying process).
Predictions: Forecasting future values based on historical data (e.g., predicting sales for
the next quarter).
3. Background knowledge to be used in the discovery process:
o Information about the domain to be mined (e.g., concept hierarchies, ontologies).
o A concept hierarchy is a powerful form of background knowledge; it allows data to be
mined at multiple levels of abstraction.
o Four major types of concept hierarchies, with examples:
Schema hierarchies:
o Defined in the database schema; e.g., City → State → Country.
Set-grouping hierarchies:
o Formed by grouping values; e.g., {apple, banana} → fruit.
Operation-derived hierarchies:
o Created by applying operations; e.g., age → age group (20-29, 30-39).
Rule-based hierarchies:
o Defined using domain rules; e.g., if salary > 100K → high income.
4. Interestingness measures and pattern evaluation:
o Criteria for selecting interesting patterns (e.g., support, confidence, lift).
5. Presentation of discovered patterns:
o Specifies how results should be displayed (e.g., rules, charts, tables).
6. Data mining query language (DMQL):
o Used to define mining tasks using the primitives.
o A data mining language must be designed to facilitate flexible and effective knowledge
discovery.
o DMQL allows mining of different kinds of knowledge from relational databases and
data warehouses at multiple levels of abstraction.
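The primitives listed above can be pictured as the fields of one query object. The following is a minimal sketch in Python, not DMQL itself: the class name MiningQuery, its field names, and the sample values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class MiningQuery:
    """Hypothetical container holding one value per task primitive."""
    task_relevant_data: dict                                   # table and attributes to mine
    knowledge_kind: str                                        # kind of knowledge to be mined
    background_knowledge: dict = field(default_factory=dict)   # e.g. concept hierarchies
    interestingness: dict = field(default_factory=dict)        # thresholds as fractions
    presentation: str = "rules"                                # how results are displayed

    def validate(self) -> None:
        """Reject unknown task kinds and thresholds outside the range 0..1."""
        allowed = {"association", "classification", "clustering",
                   "sequential_pattern", "prediction"}
        if self.knowledge_kind not in allowed:
            raise ValueError(f"unknown kind of knowledge: {self.knowledge_kind}")
        for name in ("min_support", "min_confidence"):
            value = self.interestingness.get(name)
            if value is not None and not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0 and 1")


# Assumed example values; the table and attribute names are illustrative only.
query = MiningQuery(
    task_relevant_data={"table": "transactions", "attributes": ["customer_id", "item"]},
    knowledge_kind="association",
    background_knowledge={"location": ["city", "state", "country"]},
    interestingness={"min_support": 0.30, "min_confidence": 0.70},
    presentation="rules",
)
query.validate()
```

Keeping the five primitives in separate fields mirrors how a query language lets the user state each one independently.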
*****************
DATA MINING LANGUAGE AND SYSTEM ARCHITECTURE
Data Mining Architecture: Data mining architecture refers to the structural design and system
components that facilitate the process of extracting valuable insights and patterns from large
datasets.
1. Data Mining Query Language (DMQL):
Specialized language used to define data mining tasks.
Enables users to specify:
Data to be mined
Type of knowledge to discover
Constraints and thresholds
Example:
Mine association rules from transaction_data
where support ≥ 30% and confidence ≥ 70%
2. System Architecture Components:
1. Data Sources
Databases, data warehouses, and external data repositories (flat files, web data, etc.).
Provide raw data for mining.
2. Data Warehouse
Centralized repository to store integrated data.
Supports OLAP operations for multidimensional analysis.
3. Data Cleaning and Integration
Cleaning: Removes noise and inconsistencies.
Integration: Combines data from multiple sources.
4. Data Selection and Transformation
Selection: Extracts relevant data for mining.
Transformation: Converts data into suitable format (e.g., normalization).
5. Data Mining Engine
Core component that applies mining algorithms to extract patterns.
Supports tasks such as classification, clustering, and association.
6. Pattern Evaluation Module
Identifies interesting patterns using measures like support and confidence.
Filters and ranks discovered patterns based on interestingness.
7. Knowledge Base
Stores background knowledge (e.g., concept hierarchies, user constraints).
8. User Interface
Allows users to interact with the system.
Supports query submission and result visualization.
9. Data Preprocessing Module
Groups the cleaning, integration, selection, and transformation steps (components 3 and 4)
into a single module.
Together, these components support efficient, flexible, and scalable data mining operations.
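The preprocessing components (cleaning, selection, transformation) can be sketched as small functions chained together. This is only an illustration under assumed data: the record layout, the function names, and the choice of min-max scaling as the normalization method are assumptions, not part of the architecture description.

```python
def clean(records):
    """Cleaning: drop records that contain a missing (None) value."""
    return [r for r in records if all(v is not None for v in r.values())]


def select(records, attributes):
    """Selection: keep only the attributes relevant to the mining task."""
    return [{a: r[a] for a in attributes} for r in records]


def min_max_normalize(records, attribute):
    """Transformation: rescale one numeric attribute to the range 0..1."""
    values = [r[attribute] for r in records]
    lo, hi = min(values), max(values)
    span = hi - lo
    return [
        {**r, attribute: (r[attribute] - lo) / span if span else 0.0}
        for r in records
    ]


# Assumed sample records; customer c2 has a missing income.
raw = [
    {"customer": "c1", "age": 25, "income": 30000},
    {"customer": "c2", "age": 40, "income": None},
    {"customer": "c3", "age": 60, "income": 90000},
    {"customer": "c4", "age": 35, "income": 60000},
]

prepared = min_max_normalize(select(clean(raw), ["age", "income"]), "income")
for row in prepared:
    print(row)
```

The output of this chain is what the data mining engine would receive as input.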
Types of Data Mining Architecture:
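1. No Coupling:
o The mining system uses no database or data warehouse functions; it reads data from
external sources such as flat files, which is simple but inefficient for large datasets.
2. Loose Coupling:
o Retrieves data from a database or data warehouse and then mines it in memory;
suitable for memory-based mining, but scalability is limited.
3. Semi-Tight Coupling:
o Uses database features such as sorting, indexing, and aggregation for better performance.
4. Tight Coupling:
o Fully integrates with the database or data warehouse for high performance and scalability.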
Advantages of Data Mining:
1. Helps predict future trends.
2. Supports key decision-making.
3. Converts raw data into useful info.
4. Identifies new trends and patterns.
5. Analyzes large datasets efficiently.
6. Helps attract and retain customers.
7. Improves customer relationships.
8. Optimizes production and reduces costs.
Disadvantages of Data Mining:
1. Requires skilled teams and training.
2. Involves high investment costs.
3. May risk data security and privacy.
4. Wrong data can give false results.
5. Managing large databases is complex.
*************************
DATA MINING QUERY LANGUAGE:
Data Mining Query Language (DMQL) is a high-level language designed to define and
control data mining tasks. It enables users to express what patterns to mine and how to mine them,
using a declarative syntax.
Key Features:
1. Specification of data:
o Define the subset of data to be mined (e.g., table, attributes).
2. Type of knowledge to be mined:
o Supports mining tasks like association, classification, clustering, etc.
3. Pattern constraints:
o Set thresholds like minimum support, confidence, interestingness.
4. Background knowledge:
o Incorporate concept hierarchies or domain ontologies.
5. Presentation preferences:
o Specify output format (rules, tables, charts).
Example DMQL Syntax:
use database sales_data;
mine association_rules
from transactions
where support ≥ 30% and confidence ≥ 70%
display as rule_table;
Advantages:
User-friendly and high-level.
Allows customization of mining tasks.
Bridges the gap between users and mining algorithms.
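To make the thresholds in the example query concrete, the sketch below applies support ≥ 30% and confidence ≥ 70% to a handful of transactions. It is a simplified illustration, not how a DMQL engine is implemented: it only considers rules with one item on each side, and the transactions and the function name mine_pair_rules are assumptions.

```python
from itertools import permutations

# Assumed sample transactions (each is a set of purchased items).
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "butter"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def mine_pair_rules(transactions, min_support, min_confidence):
    """Return rules A -> B (single items) that meet both thresholds."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in permutations(items, 2):
        s = support({a, b}, transactions)
        if s < min_support:
            continue
        confidence = s / support({a}, transactions)
        if confidence >= min_confidence:
            lift = confidence / support({b}, transactions)
            rules.append((a, b, s, confidence, lift))
    return rules


rules = mine_pair_rules(transactions, min_support=0.30, min_confidence=0.70)
for a, b, s, c, lift in rules:
    print(f"{a} -> {b}: support={s:.2f}, confidence={c:.2f}, lift={lift:.2f}")
```

With this data only bread -> milk and milk -> bread pass both thresholds; the butter rules meet the support threshold but fall short on confidence.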
*****************
CONCEPT DESCRIPTION IN DATA MINING:
Definition:
Concept description is a form of data generalization that provides a concise and high-level
summary of data, describing the characteristics of a target class or concept.
Types of Concept Description:
1. Characteristic Rule:
o Describes general features of a class (e.g., "Graduate students are mostly aged 22–30").
2. Discriminant Rule:
o Compares features between different classes (e.g., "Graduate students have more
research hours than undergraduates").
Diagrammatic Representation:
[Raw Data] → [Data Preprocessing (Cleaning, Integration, Transformation)] →
[Data Generalization/Aggregation] → [Concept Characterization/Comparison] →
[Summarized/Compared Concepts] → [Knowledge/Insights]
Techniques Used:
Attribute-oriented induction
Data summarization
OLAP operations (like roll-up, drill-down)
Applications:
Data summarization
Pattern discovery
Report generation
Decision support systems
Example:
Describing "Senior Customers":
Age: 60+
Preferred Products: Health supplements, reading materials
Purchase Frequency: Monthly
Goal:
To make large datasets understandable by summarizing key characteristics of data
classes.
***********
DATA GENERALIZATION AND SUMMARIZATION IN DATA MINING
Data generalization and summarization are crucial processes in data mining that simplify
large, complex datasets into more manageable and understandable forms.
They involve transforming detailed data into higher-level, abstract representations to reveal
broader patterns and trends, making it easier to extract meaningful insights and facilitate
decision-making.
1. Data Generalization:
Definition: Transforms detailed low-level data into higher-level abstract forms using concept
hierarchies.
Purpose:
Reduces the complexity of data attributes.
Replaces specific values with generalized concepts, typically through attribute-oriented
induction.
Summarizes detailed data into higher-level concepts.
Benefits:
Simplifies data, making it easier to understand and analyze.
Reduces noise and redundancy, revealing hidden patterns.
Enables the extraction of actionable insights.
Example:
"New York" → "USA"
"25" → "Young Adult"
Attribute-Oriented Induction: Generalizing data by replacing low-level attribute values with
higher-level concepts, such as replacing specific ages with age ranges (e.g., "young",
"middle-aged", "senior").
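Attribute-oriented induction can be sketched in a few lines: each value is replaced by its higher-level concept, and tuples that become identical are merged with a count. The city-to-country mapping, the age cut-offs (40 and 60), and the sample tuples below are assumptions made only for this illustration.

```python
from collections import Counter

# Assumed concept hierarchy for location: city -> country.
CITY_TO_COUNTRY = {"New York": "USA", "Los Angeles": "USA", "Chennai": "India"}


def age_group(age):
    """Assumed operation-derived hierarchy: age -> age range label."""
    if age < 40:
        return "young"
    if age < 60:
        return "middle-aged"
    return "senior"


def generalize(tuples):
    """Replace low-level values with concepts and merge identical tuples."""
    generalized = Counter()
    for city, age in tuples:
        generalized[(CITY_TO_COUNTRY[city], age_group(age))] += 1
    return generalized


# Assumed raw (city, age) tuples.
raw = [
    ("New York", 25),
    ("Los Angeles", 31),
    ("New York", 45),
    ("Chennai", 62),
    ("Los Angeles", 28),
]

result = generalize(raw)
for (country, group), count in result.items():
    print(country, group, count)
```

Five detailed tuples collapse into three generalized tuples, each carrying the number of original tuples it represents.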
2. Data Summarization:
Definition:
Produces compact, concise descriptions of data sets or data classes.
Data summarization reduces large datasets to concise, representative summaries. It often uses
statistical measures such as the mean, median, mode, or quartiles, and tabular or graphical
formats that highlight key trends and patterns.
Purpose:
Provides a high-level overview of the data.
Highlights key characteristics and trends.
Uses statistical measures (mean, count, max, min).
Supports data cube and OLAP operations.
Provides summary reports.
Example:
Average age of employees = 35
Total sales in Q1 = $1.2M
Types of Data Summarization in Data Mining:
Tabular Summarization: Presents patterns such as frequency distributions and cumulative
frequencies in table form.
Data Visualization: Presents the data in a chosen graph style such as a histogram, time-series
line graph, or column/bar graph.
Techniques Used:
Attribute-Oriented Induction:
Example: Replace "New York", "Los Angeles" → "USA"
Concept Hierarchies:
Example: age "30" → "Adult" (a value at the age-group level)
OLAP (Roll-up, Drill-down):
Example: Roll-up from "City" → "Country", Drill-down from "Year" → "Month"
Statistical Aggregation:
Example: Average salary = $55,000; Total sales = $1M
Simplified Methods for Data Generalization and Summarization:
1. Descriptive Statistics:
o Uses measures like mean and median to describe data.
2. Data Cubes:
o Summarizes data across multiple dimensions.
3. Clustering:
o Groups similar data points for easy analysis.
4. Sampling:
o Uses a subset of data to represent the whole dataset.
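The roll-up and drill-down examples above amount to grouping the same facts at different levels of a concept hierarchy. The sketch below shows this with plain Python; the Sale record, its fields, and the sales figures are assumed sample data, and a real OLAP server would do this over a data cube rather than a list.

```python
from collections import defaultdict, namedtuple

# Assumed fact record: one row per city and month.
Sale = namedtuple("Sale", "city country year month amount")

sales = [
    Sale("New York", "USA", 2024, "Jan", 100.0),
    Sale("New York", "USA", 2024, "Feb", 150.0),
    Sale("Los Angeles", "USA", 2024, "Jan", 200.0),
    Sale("Chennai", "India", 2024, "Jan", 50.0),
    Sale("Chennai", "India", 2024, "Feb", 70.0),
]


def aggregate(facts, key):
    """Sum the amount for every distinct value of key(fact)."""
    totals = defaultdict(float)
    for fact in facts:
        totals[key(fact)] += fact.amount
    return dict(totals)


by_city = aggregate(sales, lambda s: s.city)                    # detailed level
by_country = aggregate(sales, lambda s: s.country)              # roll-up: city -> country
by_country_month = aggregate(sales, lambda s: (s.country, s.month))  # drill-down: year -> month

print(by_city)
print(by_country)
print(by_country_month)
```

Rolling up produces fewer, coarser totals; drilling down splits a total into finer ones, and the grand total stays the same in every view.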
***********************
ANALYTICAL CHARACTERIZATION IN DATA MINING
Analytical characterization is the process of summarizing and comparing data characteristics
of target classes using statistical and data mining techniques.
It involves attribute relevance analysis and comparative analysis between different classes or
clusters.
It uses descriptive statistics (mean, median, variance) to highlight key patterns.
Enables comparison between classes (e.g., high-income vs. low-income groups).
Often uses OLAP, data cube, or visualization tools.
Key Aspects:
Attribute Relevance Analysis: This is a crucial part of analytical characterization. It
involves evaluating how strongly each attribute contributes to describing the target class or
concept.
Data Generalization: Analytical characterization often involves generalizing the data to a
higher level of abstraction to reveal broader patterns and characteristics.
Class Comparison: It can also be used to compare the characteristics of different classes or
groups of data objects, helping to identify what distinguishes them.
Advantages:
Simplifies complex data into understandable summaries.
Identifies patterns and key differentiators between groups.
Benefits:
Helps in decision-making and market segmentation.
Supports business analysis and customer profiling.
Example of Analytical Characterization:
A university wants to understand the difference between high-performing and low-
performing students.
They analyze:
Average attendance
o High performers: 90%
o Low performers: 60%
Study hours per week
o High performers: 20 hours
o Low performers: 5 hours
Participation in activities
o High performers: Active in 2+ clubs
o Low performers: Rarely participate
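Attribute relevance analysis needs some numeric score for how well an attribute separates the classes. One common choice is information gain; the sketch below uses it on a small student table. The eight records, the attribute names, and the deliberately irrelevant "section" attribute are assumptions made for illustration, and other relevance measures could be used instead.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Reduction in class entropy obtained by splitting rows on attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Assumed sample data: attendance level, class section, and performance class.
students = [
    {"attendance": "high", "section": "A", "performance": "high"},
    {"attendance": "high", "section": "B", "performance": "high"},
    {"attendance": "high", "section": "A", "performance": "high"},
    {"attendance": "high", "section": "B", "performance": "low"},
    {"attendance": "low", "section": "A", "performance": "low"},
    {"attendance": "low", "section": "B", "performance": "low"},
    {"attendance": "low", "section": "A", "performance": "low"},
    {"attendance": "low", "section": "B", "performance": "high"},
]

gain_attendance = information_gain(students, "attendance", "performance")
gain_section = information_gain(students, "section", "performance")
print(f"attendance: {gain_attendance:.3f}, section: {gain_section:.3f}")
```

In this sample, attendance has a clearly positive gain while section has none, so attendance would be kept for the characterization and section dropped.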
*******************
MINING CLASS COMPARISON
Mining class comparisons involves comparing two or more data classes to discover
similarities and differences between them. It helps in understanding the contrasting
characteristics of different groups.
Key Aspects:
Data Collection and Partitioning: The first step involves gathering relevant data and dividing
it into a target class and one or more contrasting classes.
Example: Comparing Graduate and Undergraduate Students
If the task is to compare graduate students (target class) with undergraduate students
(contrasting class):
Data on both groups is collected (e.g., age, GPA, major).
Only the most relevant dimensions (e.g., GPA and age) are kept for analysis.
Both groups are summarized to the same level (e.g., average GPA by major).
The results are displayed as a table, chart, or a set of rules (e.g., "Graduate students have
higher GPAs in Computer Science, while undergraduates have higher GPAs in
Humanities").
Dimension Relevance Analysis: This step focuses on identifying the most relevant attributes or
dimensions for comparison.
o For instance, in the student example, attributes like GPA, major, and research experience
might be relevant, while attributes like name or phone number might be less so.
Synchronous Generalization: Both the target and contrasting classes are generalized to the same
level of abstraction.
o For instance, you might generalize GPA to ranges (e.g., "high," "medium," "low") or
generalize research experience to the number of publications.
Presentation of Comparison Results: The final step involves presenting the comparison results in
a clear and informative manner.
o Common methods include tables, charts, and rules.
o These presentations highlight the differences between the classes, often using
contrasting measures like percentage differences or discriminant rules.
Class comparisons can be presented to users in various ways, similar to class characterizations.
These include:
Generalized relations
Crosstabs
Bar charts
Pie charts
Curves
Rules
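The steps of a class comparison (partitioning, synchronous generalization, presentation as a crosstab) can be sketched as follows. The GPA cut-offs (3.5 and 2.5), the student records, and the function names are assumptions chosen for illustration only.

```python
from collections import Counter


def gpa_level(gpa):
    """Assumed generalization of GPA into three levels."""
    if gpa >= 3.5:
        return "high"
    if gpa >= 2.5:
        return "medium"
    return "low"


# Assumed sample data: (class, GPA). Graduate is the target class,
# undergraduate the contrasting class.
students = [
    ("graduate", 3.8), ("graduate", 3.6), ("graduate", 3.0), ("graduate", 3.9),
    ("undergraduate", 3.7), ("undergraduate", 2.8),
    ("undergraduate", 2.2), ("undergraduate", 3.1),
]


def crosstab(records):
    """Share of each class falling into each GPA level.

    Both classes pass through the same gpa_level function, so they are
    generalized to the same level of abstraction before being compared.
    """
    counts = Counter((cls, gpa_level(gpa)) for cls, gpa in records)
    totals = Counter(cls for cls, _ in records)
    return {(cls, level): n / totals[cls] for (cls, level), n in counts.items()}


table = crosstab(students)
for level in ("high", "medium", "low"):
    grad = table.get(("graduate", level), 0.0)
    undergrad = table.get(("undergraduate", level), 0.0)
    print(f"{level:<7} graduate={grad:.0%} undergraduate={undergrad:.0%}")
```

Reporting shares rather than raw counts keeps the comparison fair when the two classes differ in size.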
Types of Class Comparisons:
1. Discriminant Analysis (Discriminative Comparison):
Purpose: Identifies attributes that distinguish one class from another.
Example:
Comparing high-income vs low-income customers:
o High-income: Spend more on luxury goods
o Low-income: Spend more on basic needs
2. Class Characterization (Descriptive Comparison):
Purpose: Describes typical features of a class.
Example:
Characterizing graduate students:
o Age: 22–30
o Study Hours: 20+ per week
o Common Field: Engineering
3. Cluster-Based Comparison:
Purpose: Compares user-defined clusters to understand group behavior.
Example:
o Cluster A: Young adults, tech-savvy, high online shopping
o Cluster B: Middle-aged, low online activity
4. Attribute Relevance Analysis:
Purpose: Identifies which attributes are most useful in class differentiation.
Example:
In a health dataset, attributes like exercise and diet may be more relevant than hair color in
predicting fitness level.
***************************
ELABORATE ON STATISTICAL MEASURES WITH EXAMPLE.
Statistical measures in data mining are used to analyze and interpret data, extracting
meaningful insights from large datasets.
These measures help summarize, describe, and understand data characteristics, enabling
informed decision-making and predictive analytics.
Descriptive Statistics:
Measures of Central Tendency: These describe the center or typical value of a dataset. Common
measures include:
Mean: The average of all values.
Median: The middle value when the data is sorted.
Mode: The most frequent value.
Measures of Dispersion: These describe how spread out the data is. Examples include:
Range: The difference between the highest and lowest values.
Standard Deviation: Measures the variability of data around the mean.
Inferential Statistics:
Hypothesis Testing: Used to make inferences about a population based on a
sample. Techniques include t-tests, chi-square tests, and ANOVA.
Regression Analysis: Used to predict a dependent variable based on one or more
independent variables.
Correlation Analysis: Measures the strength and direction of the relationship between
variables.
Related Data Mining Techniques:
Clustering: Groups similar data points together.
Classification: Assigns data points to predefined categories.
Outlier Detection: Identifies unusual data points that deviate significantly from the norm.
Other Statistical Techniques:
Time Series Analysis: Analyzes data points collected over time to identify trends and
patterns.
Factor Analysis: Reduces the number of variables by identifying underlying factors.
Discriminant Analysis: Predicts a categorical outcome based on predictor variables.
Survival Analysis: Analyzes time-to-event data, such as customer churn or equipment
failure.
These statistical measures and techniques are essential tools for data mining, enabling the extraction
of valuable knowledge from data and supporting data-driven decision-making.
1. Statistical measures help summarize and describe key features of a dataset.
2. Common measures include mean, median, mode, variance, standard deviation, and
correlation.
3. Mean (average): Sum of values divided by count.
Example: Average salary of employees = ₹50,000.
4. Median: Middle value when data is sorted.
Example: Median age in a dataset = 30.
5. Mode: Most frequently occurring value.
Example: Mode of purchase category = "Electronics".
6. Standard Deviation: Measures data spread from the mean.
Example: Low std. dev. in marks = consistent performance.
7. Variance: Square of standard deviation, shows dispersion.
8. Correlation: Shows relationship between two variables.
Example: Strong positive correlation between study hours and grades.
9. These measures help in pattern discovery and data comparison.
10. Useful in preprocessing, data summarization, and decision making.
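The measures listed above can be computed with Python's standard statistics module. The marks and study-hours values below are assumed sample data; population variance and standard deviation are used, and the Pearson correlation is written out by hand so the formula is visible.

```python
import statistics
from math import sqrt

# Assumed sample data for five students.
marks = [60, 70, 70, 80, 90]
study_hours = [2, 4, 6, 8, 10]

mean_marks = statistics.mean(marks)        # sum of values divided by count
median_marks = statistics.median(marks)    # middle value of the sorted data
mode_marks = statistics.mode(marks)        # most frequent value
variance = statistics.pvariance(marks)     # population variance
std_dev = statistics.pstdev(marks)         # square root of the variance


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


r = pearson(study_hours, marks)

print("mean:", mean_marks, "median:", median_marks, "mode:", mode_marks)
print("variance:", variance, "std dev:", round(std_dev, 3))
print("correlation (hours vs marks):", round(r, 3))
```

A correlation close to +1 here matches the example of a strong positive relationship between study hours and grades.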
***********************