Data Mining Notes Jntuh

Course: Data Mining (68--=09876)

University: Jawaharlal Nehru Technological University, Hyderabad

DATA MINING
Notes By Jayanth

Units
1) Data Mining
2) Association Rule Mining
3) Classification
4) Clustering and Applications
5) Advanced Concepts

UNIT-1

Data Mining
● Data mining is the process of extracting valuable and actionable insights from large
volumes of data. It involves analyzing and exploring vast datasets to discover patterns,
relationships, and trends that are not immediately apparent.
● Data mining utilizes various techniques, such as statistical analysis, machine learning
algorithms, and pattern recognition, to uncover hidden information and make informed
decisions.

Essential step in the process of knowledge discovery in databases

Knowledge discovery as a process is depicted in the following figure and consists of an
iterative sequence of the following steps:

● Data cleaning: to remove noise or irrelevant data
● Data integration: where multiple data sources may be combined
● Data selection: where data relevant to the analysis task are retrieved from the
database
● Data transformation: where data are transformed or consolidated into forms
appropriate for mining by performing summary or aggregation operations
● Data mining: an essential process where intelligent methods are applied in order
to extract data patterns
● Pattern evaluation: to identify the truly interesting patterns representing knowledge
based on some interestingness measures
● Knowledge presentation: where visualization and knowledge representation
techniques are used to present the mined knowledge to the user
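
The sequence of steps above can be sketched as a chain of small functions. This is only an illustrative sketch: the records, the field names (tid, item, qty) and the function names are assumptions made for the example, not part of the notes.

```python
from collections import Counter

# Two hypothetical data sources; None marks a missing value to be cleaned.
store_a = [
    {"tid": 1, "item": "milk", "qty": 2},
    {"tid": 1, "item": "bread", "qty": 1},
    {"tid": 2, "item": "milk", "qty": None},
]
store_b = [
    {"tid": 3, "item": "milk", "qty": 1},
    {"tid": 3, "item": "bread", "qty": 3},
    {"tid": 4, "item": "eggs", "qty": 12},
]

def clean(rows):
    """Data cleaning: drop records with a missing quantity."""
    return [r for r in rows if r["qty"] is not None]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(rows, items):
    """Data selection: keep only the records relevant to the task."""
    return [r for r in rows if r["item"] in items]

def transform(rows):
    """Data transformation: consolidate records into one item set per transaction."""
    baskets = {}
    for r in rows:
        baskets.setdefault(r["tid"], set()).add(r["item"])
    return baskets

def mine(baskets):
    """Data mining: count how many baskets contain each item."""
    return Counter(item for items in baskets.values() for item in items)

def evaluate(counts, min_count):
    """Pattern evaluation: keep only items that meet a minimum count."""
    return {item: n for item, n in counts.items() if n >= min_count}

rows = integrate(clean(store_a), clean(store_b))
baskets = transform(select(rows, {"milk", "bread"}))
patterns = evaluate(mine(baskets), min_count=2)
print(patterns)  # knowledge presentation, reduced here to printing
```

In a real system each step would be far richer, and the process is iterative, so later steps often send the analyst back to earlier ones.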

Architecture of a typical data mining system/Major Components

Data–Types of Data

Data mining can be applied to different types of data:

Flat files: Flat files are actually the most common data source for data mining algorithms,
especially at the research level. Flat files are simple data files in text or binary format with a
structure known by the data mining algorithm to be applied. The data in these files can be
transactions, time-series data, scientific measurements, etc.

Relational Database (RDBMS):


● RDBMS stands for Relational Database Management System. Data mining techniques
can be used to analyze data stored in relational databases.
● These databases organize data into tables, consisting of rows and columns.
● By applying data mining algorithms and queries, valuable patterns, associations, and
trends can be discovered from the structured data in the database.

Data Warehouse:
● A data warehouse is a large centralized repository that consolidates data from various
sources within an organization.
● It is designed to support analytical processing and decision-making.
● Data mining can be performed on data warehouses to extract insights and knowledge.

Transactional Data:
● Transactional data captures records of individual transactions or activities, such as
customer purchases, financial transactions, online interactions, and user behavior.
● Data mining techniques can be applied to transactional data to discover patterns, detect
anomalies, and make predictions. For example, analyzing transactional data can help
identify customer behavior patterns, recommend products, or detect fraudulent activities.

Data Mining Functionalities

Data mining functionalities are used to specify the kind of patterns to be found in data mining
tasks. In general, data mining tasks can be classified into two categories:
• Descriptive
• Predictive

Descriptive mining tasks involve analyzing the data in a database to understand its general
properties and characteristics. This includes summarizing the data, identifying patterns,
and gaining insights into its distribution and relationships.
Descriptive mining aims to provide a comprehensive overview of the data without making any
predictions or inferences about future outcomes.

Predictive mining tasks, on the other hand, apply statistical and machine learning techniques
to the current data in order to make predictions or inferences about future outcomes.

Data mining Functionalities:

1. Concept/class description
● Data characterization
● Data discrimination
2. Mining Frequent Patterns, Association and Correlations
3. Classification and Regression For Predictive Analysis
4. Cluster Analysis
5. Outlier Analysis
6. Correlation analysis

1) Concept/class description:
Data can be associated with classes or concepts. Data characterization and data discrimination
are two approaches used in data analysis to understand and describe a given set of data.

Data characterization
● Summarizing the general characteristics or features of a specific class of data, often
referred to as the target class.
● The goal is to provide a concise and informative description of the data, highlighting its
interesting properties.
● For example, in a student dataset, data characterization could involve summarizing the
characteristics of students who have obtained more than 75% in every semester. This
characterization could result in a general profile of such students, outlining their common
attributes.

Data discrimination
● Comparing the general features of the target class with those of one or more contrasting
classes. The objective is to identify the distinguishing properties or patterns that
differentiate the target class from others.
● Continuing with the student dataset example, data discrimination might involve
comparing the general features of students with high GPAs to those with low GPAs. This
comparison could reveal insights such as "75% of the students with high GPAs are
fourth-year computing science students, while 65% of the students with low GPAs are
not."
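
The difference between the two approaches can be sketched in a few lines of Python. The student records, the GPA cut-off of 3.5 and the helper names below are all hypothetical, chosen only to illustrate the idea.

```python
# Hypothetical student records: (year, major, gpa)
students = [
    (4, "CS", 3.9), (4, "CS", 3.7), (4, "Math", 3.8), (3, "CS", 3.6),
    (1, "CS", 2.1), (2, "Math", 2.4), (1, "Physics", 2.0), (4, "CS", 2.3),
]
HIGH_GPA = 3.5  # assumed cut-off between the target and contrasting class

def share(records, predicate):
    """Fraction of the records that satisfy the predicate."""
    return sum(1 for r in records if predicate(r)) / len(records)

def is_fourth_year_cs(student):
    year, major, _gpa = student
    return year == 4 and major == "CS"

target = [s for s in students if s[2] >= HIGH_GPA]    # target class
contrast = [s for s in students if s[2] < HIGH_GPA]   # contrasting class

# Characterization: summarize the target class on its own.
target_share = share(target, is_fourth_year_cs)

# Discrimination: compare the same feature across the two classes.
contrast_share = share(contrast, is_fourth_year_cs)

print(f"high GPA, fourth-year CS: {target_share:.0%}")
print(f"low GPA, fourth-year CS:  {contrast_share:.0%}")
```

Characterization reports only the first figure; discrimination is the comparison of the two.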

2) Mining Frequent Patterns, Association and Correlations


Mining frequent patterns, association, and correlations are essential functionalities in data
mining that help discover meaningful relationships and patterns in large datasets. Here are
simple points to explain each concept:

Mining Frequent Patterns:


● Frequent patterns refer to sets of items or events that frequently occur together in a
dataset.
● It involves finding combinations of items whose frequency of co-occurrence meets a
minimum support threshold.
● For example, in a transaction dataset, mining frequent patterns can help identify
commonly co-occurring items in customer purchases, such as "bread and milk" or
"shampoo and conditioner."

Association Mining:

● Association mining focuses on finding associations or relationships among items or
events in a dataset.
● It discovers rules that describe the dependencies between different items or events.
● For example, in a supermarket dataset, association mining can uncover rules like "If a
customer buys bread and eggs, they are likely to also purchase butter."

Correlation Mining:
● Correlation mining identifies the statistical relationships or dependencies between
different variables in a dataset.
● It measures how changes in one variable relate to changes in another variable.
● For example, in a marketing campaign dataset, correlation mining can reveal if there is
a correlation between the amount spent on advertising and the increase in sales.

3) Classification and Regression For Predictive Analysis


Classification and regression are two fundamental techniques used in data mining for predictive
analysis. Here are simple points to explain each concept:

Classification:
● Classification is a predictive analysis technique that assigns categorical labels or classes
to instances based on their features.
● It involves learning from labeled training data to build a model that can classify new,
unseen instances into predefined classes.
● For example, in email spam detection, a classification model can be trained on labeled
emails (spam or not spam) to predict the class of incoming emails.
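
As a minimal sketch of learning from labeled data, the following uses a nearest-centroid classifier. The technique, the two features (number of links and number of exclamation marks) and the training values are assumptions made for illustration; the notes do not prescribe a particular classifier.

```python
import math

# Hypothetical labeled training data: (links, exclamation marks) -> class
training = [
    ((8, 5), "spam"),
    ((6, 7), "spam"),
    ((1, 0), "not spam"),
    ((0, 1), "not spam"),
]

def fit_centroids(samples):
    """Learn one centroid (mean feature vector) per class."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: tuple(v / counts[label] for v in acc)
            for label, acc in sums.items()}

def predict(centroids, features):
    """Assign the class whose centroid is closest to the new instance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))

centroids = fit_centroids(training)
print(predict(centroids, (7, 4)))
print(predict(centroids, (1, 1)))
```

The model here is just the pair of centroids; new, unseen emails are classified by distance to them.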

Regression:
● Regression is a predictive analysis technique that predicts continuous numerical values
based on input features.
● It aims to find a mathematical relationship between the input variables and the output
variable.
● For example, in house price prediction, regression can be used to build a model that
predicts the price of a house based on factors like its area, number of rooms, location,
etc.
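
A minimal sketch of regression with a single input variable, using ordinary least squares to fit a straight line. The areas and prices are made-up values, and restricting the model to one feature is a simplification for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: area in square metres, price in thousands.
areas = [50, 80, 100, 120]
prices = [150, 240, 300, 360]

slope, intercept = fit_line(areas, prices)
predicted = slope * 90 + intercept  # predicted price for a 90 square metre house
print(slope, intercept, predicted)
```

Unlike classification, the output is a continuous number rather than a class label.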

4) Cluster analysis
● Cluster analysis is a data mining technique used to group similar data points together
based on their characteristics or attributes.
● It helps in identifying patterns and structures within the data that may not be immediately
apparent.
● Similarity measures, such as distance metrics, are used to determine the similarity
between data points and form clusters.

● Example: Imagine you have a dataset of customer data, including information such as
age, income, and purchasing behavior. Cluster analysis can be used to group customers
with similar characteristics into clusters, such as creating a cluster for young, high-
income customers who frequently make online purchases.
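
Grouping by a distance metric can be sketched with a very small k-means loop. k-means is only one of many clustering methods and is an assumption here, as are the customer values and the simplistic choice of the first k points as initial centroids.

```python
import math

def kmeans(points, k, iterations=10):
    """Tiny k-means: assign each point to the nearest centroid, then recompute."""
    centroids = list(points[:k])  # simplistic initialization (an assumption)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters

# Hypothetical customers: (age, income in thousands)
customers = [(22, 80), (60, 30), (25, 85), (58, 35), (24, 90), (62, 28)]
centroids, clusters = kmeans(customers, k=2)
print(clusters[0])  # the young, high-income group
print(clusters[1])  # the older, lower-income group
```

In practice the attributes would usually be normalized first so that no single attribute dominates the distance.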

5) Outlier Analysis
Outlier analysis is a data mining technique used to identify and analyze data points that deviate
significantly from the normal or expected patterns within a dataset. Here are some points to
explain outlier analysis:

● Identification of deviant data: Outlier analysis focuses on detecting and identifying
data points that are significantly different from the majority of the data.
● Anomaly detection: Outlier analysis is often used for anomaly detection, which involves
finding data points that are inconsistent or contradictory compared to the expected
behavior or normal distribution.
● Statistical methods and algorithms: Outlier analysis employs various statistical methods
and algorithms to identify and quantify outliers.
● Visualization and interpretation: Outlier analysis often involves visualizing the
detected outliers to understand their characteristics and potential implications.
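
One simple statistical method is the z-score rule, sketched below. The choice of this rule, the threshold of 2.0 and the sample values are assumptions for illustration; other methods (for example, the interquartile range) are equally common.

```python
import statistics

def zscore_outliers(values, threshold=2.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - mean) / std > threshold]

# Hypothetical daily spending amounts with one unusual value.
daily_spend = [10, 12, 11, 13, 12, 11, 10, 12, 11, 95]
outliers = zscore_outliers(daily_spend)
print(outliers)
```

Because the mean and standard deviation are themselves affected by extreme values, this rule works best when outliers are few.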

6) Correlation analysis
Correlation analysis is a technique used to measure the association between two
variables. A correlation coefficient (r) is a statistic that measures the strength of a
linear association between two variables. Correlations range from -1.0 to +1.0
in value.
A correlation coefficient of 1.0 indicates a perfect positive relationship in which high
values of one variable are related perfectly to high values of the other variable, and
conversely, low values of one variable are perfectly related to low values of the other
variable.
A correlation coefficient of 0.0 indicates no linear relationship between the two variables.
That is, one cannot use the scores on one variable to tell anything about the scores on the
second variable.
A correlation coefficient of -1.0 indicates a perfect negative relationship in which high
values of one variable are related perfectly to low values of the other variable, and
conversely, low values of one variable are perfectly related to high values of the other
variable.
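
The three cases described above can be reproduced with a short computation. The sketch assumes r is Pearson's product-moment coefficient, which matches the description of a linear association; the data values are made up.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

ad_spend = [1, 2, 3, 4, 5]
r_positive = pearson_r(ad_spend, [2, 4, 6, 8, 10])  # rises with ad_spend
r_negative = pearson_r(ad_spend, [10, 8, 6, 4, 2])  # falls as ad_spend rises
r_none = pearson_r(ad_spend, [2, 1, 3, 1, 2])       # no linear trend
print(round(r_positive, 3), round(r_negative, 3), round(r_none, 3))
```

The function is undefined when either variable is constant, because the denominator is then zero.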

Interestingness of Patterns

A pattern is interesting if it is:

(1) Easily understood by humans,
(2) Valid on new or test data with some degree of certainty,
(3) Potentially useful, and
(4) Novel.
A pattern is also interesting if it validates a hypothesis that the user sought to confirm.
An interesting pattern represents knowledge.

Factors Influencing Interestingness:

Support: The pattern should occur frequently enough in the dataset to be considered
interesting. High support indicates that the pattern is not an isolated occurrence.
Confidence: The pattern should have a high level of confidence or accuracy, indicating that the
observed relationships are reliable and not due to chance.
Novelty: The pattern should reveal new or previously unknown information, providing insights
that were not evident before.
Actionability: The pattern should be actionable, meaning it can be utilized to make informed
decisions or drive actions that lead to desirable outcomes.
Interpretability: The pattern should be easily interpretable and understandable by domain
experts or end-users.

Data Mining Systems and Generating Interesting Patterns:


● Data mining systems can generate patterns based on predefined measures of
interestingness, such as support and confidence thresholds.
● However, it is not feasible for data mining systems to generate all possible patterns from
a large dataset due to computational constraints and the potential overwhelming number
of patterns.
● Instead, data mining systems focus on generating patterns that surpass a certain
threshold of interestingness, as defined by the user or domain expert.
Generating Only Interesting Patterns:

● Data mining systems do not generate only interesting patterns by default.
● It is essential to set appropriate thresholds and criteria for interestingness to guide the
pattern generation process.

● The user or analyst needs to define what is considered interesting based on the specific
problem, domain knowledge, and the goals of the data mining task.

Classification of Data Mining systems

Data mining systems can be classified based on various criteria. Here are some common
classification approaches:

Classification according to the type of data source mined:


● This classification categorizes data mining systems based on the type of data they
handle, such as spatial data, multimedia data, time-series data, text data, or data from
the World Wide Web.
● Example: A data mining system designed specifically for analyzing social media data
would fall into the category of text data mining systems.

Classification according to the data model used:


● This classification categorizes data mining systems based on the data model they
employ, such as relational databases, object-oriented databases, data warehouses, or
transactional data.
● Example: A data mining system that focuses on analyzing data stored in a data
warehouse would belong to the data warehouse-oriented classification.

Classification according to the type of knowledge discovered:


● This classification categorizes data mining systems based on the kind of knowledge they
uncover or the data mining functionalities they provide, such as characterization,
discrimination, association, classification, clustering, etc.
● Example: A data mining system that specializes in discovering associations between
items in a transactional database would fall into the association mining category.

Classification according to the mining techniques used:


● Data mining systems employ various techniques for analyzing data. This classification
categorizes them based on the data analysis approaches they utilize, such as machine
learning, neural networks, genetic algorithms, statistics, visualization, or database-
oriented and data warehouse-oriented methods.
● Example: A data mining system that employs machine learning algorithms to predict
customer churn in a telecommunications dataset would belong to the machine learning-
based classification.

Classification according to the degree of user interaction:


● This classification considers the level of user involvement in the data mining process,
such as query-driven systems where users actively specify mining tasks, interactive
exploratory systems where users interactively explore the data and analyze patterns, or
autonomous systems that automate the mining process without much user intervention.
● Example: A data mining system that allows users to iteratively explore and visualize
patterns in a dataset would be classified as an interactive exploratory system.

Data mining Task primitives

Data mining primitives refer to the fundamental components or elements involved in the data
mining process. These primitives provide the necessary instructions and specifications to
perform effective data mining tasks. They include:

Task-relevant data:
● This primitive focuses on identifying the specific data that will be used for mining.
● It involves selecting the relevant database or data warehouse, specifying conditions to
choose the appropriate data, determining the relevant attributes or dimensions for
exploration, and providing instructions for data ordering or grouping.

Knowledge type to be mined:


● This primitive defines the specific type of knowledge or patterns that are to be
discovered through data mining.
● It includes specifying the data mining function, such as characterization, discrimination,
association, classification, clustering, or evolution analysis.
● Users can also provide pattern templates or meta patterns to guide the mining process
and shape the expected patterns.

Background knowledge:
● This primitive allows users to incorporate their domain knowledge or existing knowledge
about the data being mined.
● Users can provide additional information, rules, or constraints to guide the knowledge
discovery process and evaluate the patterns that are discovered.
● For example, incorporating concept hierarchies or domain-specific rules can assist in
pattern interpretation.

Pattern interestingness measure:


● This primitive involves defining measures or criteria to determine the interestingness of
discovered patterns.

● Users specify functions that help differentiate between uninteresting and valuable
patterns.
● Interestingness measures can be based on factors such as simplicity, certainty, utility,
novelty, or domain-specific requirements.

Visualization of discovered patterns:


● This primitive focuses on how the discovered patterns should be presented or visualized
to users.
● It involves choosing appropriate visual representations, such as rules, tables, cross tabs,
charts, decision trees, or other visual formats.
● Effective visualization helps users understand and interpret the discovered patterns
more easily.

Integration of Data mining system with a Data warehouse

Integration of a Data Mining System with a Data Warehouse refers to the process of combining
and utilizing data mining techniques within a data warehouse environment.

There are different architectures for integrating a data mining system with a database or data
warehouse system. Here are the differences between these architectures:

No coupling:
● In this architecture, the data mining system operates independently of the database or
data warehouse system.
● The data mining system obtains the initial data set from flat files or other sources,
without utilizing the functionalities of the database or data warehouse system.
● This architecture is considered a poor design choice since it lacks integration and
misses out on the benefits provided by database systems.

Loose coupling:
● In this architecture, the data mining system is not tightly integrated with the database or
data warehouse system.
● The database or data warehouse system is used as the source of the initial data set for
mining and may be used for storing the results.
● While this architecture allows the data mining system to leverage the flexibility and
efficiency of the database system, it may face scalability and performance challenges,
especially with large datasets.

Semitight coupling:
● This architecture involves implementing some data mining primitives or functions within
the database or data warehouse system.
● Operations like aggregation, sorting, or pre-computation of statistical functions are
efficiently performed in the database or data warehouse system.

● Frequently used intermediate mining results can be pre-computed and stored in the
database or data warehouse, improving the performance of the data mining system.

Tight coupling:
● In this architecture, the data mining system is fully integrated into the database or
data warehouse system.
● The data mining sub-system is treated as a functional component of the overall
information system.
● Tight coupling enables optimized data mining query processing, leading to efficient
implementations, high system performance, and an integrated information processing
environment.

Major issues in Data Mining

Major issues in data mining can be categorized into the following types:

1. Mining Methodology and User Interaction Issues:


● Mining different kinds of knowledge in databases: Data mining should cover a wide
range of data analysis and knowledge discovery tasks, including characterization,
association, classification, clustering, and more.
● Interactive mining of knowledge at multiple levels of abstraction: The data mining
process should allow users to interactively explore and discover knowledge from the
data.
● Incorporation of background knowledge: Domain knowledge can be used to guide the
data mining process and evaluate the discovered patterns.
● Data mining query languages and ad-hoc data mining: Users should be able to pose
ad-hoc queries for data retrieval and exploration.

2. Presentation and Visualization of Data Mining Results:


● Discovered knowledge should be expressed in a form that is easily understandable and
usable by humans.
● Visual representations and high-level languages can aid in effectively communicating
the insights gained from data mining.

3. Handling Outliers or Incomplete Data:


● Data stored in databases may contain outliers, noise, or incomplete data, which can
affect the accuracy of the discovered patterns.
● Data cleaning methods and techniques capable of handling outliers are necessary for
improving the quality of data mining results.

4. Pattern Evaluation and Interestingness:


● Data mining systems can uncover a large number of patterns, but not all of them are
interesting or valuable.
● Developing techniques to assess the interestingness of discovered patterns is a
challenge in data mining.

5. Performance Issues:
● Efficiency and scalability of data mining algorithms are crucial for effectively processing
large volumes of data.
● Parallelization and distributed algorithms can help improve the performance of data
mining tasks.

6. Issues Related to the Diversity of Database Types:


● Efficiently handling relational databases and complex data types is important in data
mining.
● Mining information from heterogeneous databases and global information systems,
which may have diverse data semantics, poses challenges in data integration and
analysis.

Data Preprocessing
Data preprocessing refers to the process of transforming raw data into a format that is suitable
for further analysis or processing. It is a common practice in data mining to improve the quality
and usability of the data for users.

The main reasons for performing data preprocessing are:

Data Quality Improvement: Real-world data is often dirty, meaning it can be incomplete, noisy,
or inconsistent. Preprocessing helps improve data quality, which in turn improves the quality of
mining results. High-quality data is crucial for making reliable and accurate decisions based on
the data.

Handling Incomplete Data: Incomplete data refers to missing attribute values or attributes of
interest. Preprocessing techniques can handle missing values by filling them in based on certain
criteria or imputing them using statistical methods.

Dealing with Noisy Data: Noisy data contains errors or outliers, which can negatively impact
the accuracy of mining results. Preprocessing methods can identify and handle noisy data by
smoothing or removing outliers to ensure the data is more reliable.

Resolving Inconsistent Data: Inconsistent data arises when there are discrepancies in codes,
names, or values. Data preprocessing techniques can detect and resolve such inconsistencies,
ensuring data integrity and consistency.

Tasks involved in data preprocessing

Data Cleaning:
● Handling missing values: Dealing with cases where some data points have no values by
filling them in or removing them.
● Smoothing noisy data: Removing or reducing random errors or outliers in the data.
● Removing outliers: Identifying and eliminating data points that significantly deviate from
the overall pattern.
● Resolving inconsistencies: Correcting discrepancies or conflicts in codes, names, or
values across the data.
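
The two options for missing values mentioned above (filling in or removing) can be sketched as follows. Using the mean as the fill value is one common choice and an assumption here; the sample ages are made up, and None stands for a missing value.

```python
def fill_missing_with_mean(values):
    """Replace each missing value (None) with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def drop_missing(values):
    """Alternative: remove the missing values entirely."""
    return [v for v in values if v is not None]

ages = [20, None, 30, 40, None]  # hypothetical attribute with gaps
filled = fill_missing_with_mean(ages)
dropped = drop_missing(ages)
print(filled)
print(dropped)
```

Filling keeps every record but can distort the distribution; dropping keeps the values honest but loses data.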

Data Integration:
● Combining data from multiple sources: Bringing together data from different databases,
files, or data cubes into a single, unified format for analysis.

Data Transformation:
● Normalizing data: Scaling the values of different attributes to a common range, ensuring
they are on the same scale for accurate analysis.
● Aggregating data: Summarizing or grouping data to a higher level of abstraction, such as
calculating averages or totals.
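
Both transformations can be sketched briefly. Min-max scaling is assumed as the normalization method (others, such as z-score normalization, exist), and the income and sales figures are hypothetical.

```python
from collections import defaultdict

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so that they fall in [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [new_min for _ in values]  # constant attribute: no spread to scale
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

def total_by_key(records):
    """Aggregate (key, amount) records into one total per key."""
    totals = defaultdict(int)
    for key, amount in records:
        totals[key] += amount
    return dict(totals)

incomes = [20000, 50000, 80000]
scaled = min_max_normalize(incomes)

daily_sales = [("Jan", 100), ("Jan", 150), ("Feb", 200)]
monthly_sales = total_by_key(daily_sales)

print(scaled)
print(monthly_sales)
```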

Data Reduction:
● Reducing data volume: Applying techniques to reduce the size of the dataset without
losing essential information.
● Preserving important information: Ensuring that the reduced dataset still retains key
patterns, trends, or characteristics present in the original data.

Data Discretization:
● Converting continuous numerical data into categories or intervals: Grouping numerical
data into discrete ranges or classes, making it suitable for certain types of analysis or
algorithms that require categorical input.
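
A minimal sketch of discretization by equal-width binning. Equal-width intervals are one possible scheme and an assumption here (equal-frequency binning is another); the ages and the choice of three bins are made up.

```python
def equal_width_bins(values, k):
    """Map each value to a bin index 0..k-1 using k intervals of equal width."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]  # no spread, so everything shares one bin
    width = (hi - lo) / k
    # The maximum value would land in bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

ages = [18, 25, 33, 47, 52, 60]
bins = equal_width_bins(ages, 3)  # intervals: [18, 32), [32, 46), [46, 60]
print(bins)
```

The bin indices can then be given readable labels such as "young", "middle-aged" and "senior".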

UNIT-2

Association Rule Mining, also known as frequent pattern mining, aims to discover relationships,
associations, or correlations among items or objects in large databases. It involves finding
itemsets that occur frequently in a given dataset and deriving rules from them that satisfy
minimum support and confidence thresholds.

Association Rule Mining:


The process of finding patterns, associations, correlations, or causal structures among sets of
items or objects in databases.
It helps identify relationships between different items or objects based on their co-occurrence in
the data.
Applications:
● Basket data analysis: Analyzing customer shopping habits to understand which items
are frequently purchased together.
● Cross-marketing: Identifying related or complementary products to promote together.
● Catalog design: Optimizing the arrangement of items in a catalog based on their
associations.
● Loss-leader analysis: Identifying items that act as attractors or incentives for customers
to make other purchases.
● Clustering: Grouping similar items or customers based on their associations.
● Classification: Assigning items or customers to predefined categories based on their
associations.

Market Basket Analysis:

● Market basket analysis is a specific application of association rule mining.
● It involves analyzing customer buying habits by finding associations between items
placed in their shopping baskets.
● The goal is to understand which items are frequently purchased together by customers.
● This information helps retailers develop marketing strategies, optimize product
placement, and improve cross-selling opportunities.
● In market basket analysis, association rule mining helps retailers gain insights into
customer behavior, make informed decisions about product placement, and devise
effective marketing strategies based on the discovered associations between items.

Associations and correlations

Data and Terminology:

Set of Items (I):


The set of items, denoted as I = {I1, I2, ..., Im}, represents a collection of distinct items that can
appear in transactions within a dataset. Each item is a unique element that can be part of a
transaction.

Dataset or Database (D):


The dataset or database, denoted as D, refers to the collection of transactional data that is
being analyzed. It consists of a set of transactions, where each transaction represents a specific
set of items.

Transaction (T):
A transaction, denoted as T, is a subset of the item set I. It represents a single occurrence or
instance of a collection of items that are associated with each other. In other words, a
transaction is a record or entry in the dataset that contains a set of items.

Transaction Identifier (TID):


Each transaction in the dataset is assigned a unique identifier known as the Transaction
Identifier or TID. This identifier serves to differentiate one transaction from another, allowing for
individual transactions to be identified and referenced.

For example, let's consider a dataset containing information about customer purchases
in a supermarket. The set of items (I) could include various products like milk, bread,
eggs, and so on. The dataset (D) would consist of multiple transactions, each
representing a specific customer's purchase. Each transaction (T) would be a set of
items, such as {milk, bread}, {eggs, bread}, {milk, eggs, bread}, and so on. The
transaction identifier (TID) would provide a unique identifier for each transaction,
allowing us to distinguish and refer to specific purchases.
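
The supermarket example above maps naturally onto Python sets and a dictionary. The TID values (T100, T200, T300) are hypothetical labels introduced for the sketch.

```python
# I: the set of all items that can appear in a transaction
I = {"milk", "bread", "eggs"}

# D: the dataset, keyed by transaction identifier (TID);
# each transaction T is a set of items
D = {
    "T100": {"milk", "bread"},
    "T200": {"eggs", "bread"},
    "T300": {"milk", "eggs", "bread"},
}

# Every transaction T must be a subset of the item set I.
all_valid = all(T <= I for T in D.values())

# The TID lets us refer back to the specific purchases containing an item.
tids_with_milk = sorted(tid for tid, T in D.items() if "milk" in T)

print(all_valid)
print(tids_with_milk)
```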

Association Rule Definition:


● An association rule, denoted as A ⇒ B, represents a relationship or implication between
two itemsets, A and B. A and B are subsets of the itemset I, meaning they contain items
from the available set of items.
● In an association rule, A and B should not have any common items; their intersection (A
∩ B) is empty. This means that A and B are distinct and do not share any items.
● The rule A ⇒ B holds in the dataset D with support s. The support (s) is a measure that
indicates the percentage of transactions in the dataset D that contain both A and B. It
quantifies how frequently the combination of items A and B appears together in the
dataset.
● Additionally, the rule A ⇒ B has confidence c in the dataset D. The confidence (c) is the
percentage of transactions in D containing A that also contain B. It provides an estimate
of the conditional probability of B given A.
● In summary, association rules help us discover relationships between different itemsets
in a dataset. The support indicates how frequently the items in the rule appear together,
while the confidence measures the reliability of the inference made by the rule. By
analyzing association rules, we can gain insights into patterns, correlations, or
associations among items in the dataset.

Finding Frequent Itemsets:


● To start the association rule mining process, we first need to find all frequent itemsets.
● A frequent itemset is an itemset whose support is at least a predefined minimum support
threshold. The threshold can be given as an absolute count or as a percentage of
transactions.
● The set of frequent k-itemsets is commonly denoted by Lk.

Generating Strong Association Rules:
● Once we have identified the frequent itemsets, we can generate strong association rules
from them.
● Strong association rules are those that satisfy minimum support and minimum
confidence thresholds.
● The minimum support threshold eliminates uninteresting rules with low support, as they
may not be profitable for business purposes.
● The confidence of a rule measures the reliability of the inference made by the rule.
Higher confidence indicates a stronger association between the items.

Support:
Support measures the frequency or prevalence of an itemset in a dataset. It indicates the
proportion or percentage of transactions that contain a specific itemset. A higher support value
indicates that the itemset is more frequently occurring in the dataset.
Mathematically, support (s) is calculated as the number of transactions containing the itemset
divided by the total number of transactions in the dataset. It can also be represented as a
percentage.

Support(A) = (Number of transactions containing itemset A) / (Total number of transactions)

Confidence:
Confidence measures the reliability or strength of an association rule. It indicates the conditional
probability of the consequent (B) given the antecedent (A). In other words, it measures how
often the items in B appear in transactions that already contain A.
Mathematically, confidence (c) is calculated as the number of transactions containing both A
and B divided by the number of transactions containing A. It can also be represented as a
percentage.

Confidence(A ⇒ B) = (Number of transactions containing both A and B) / (Number of
transactions containing A)
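
The two formulas above can be checked on a small example. The following is a minimal Python
sketch; the transactions and the function names support and confidence are made up for
illustration and are not part of the notes.

```python
# Hypothetical supermarket transactions, each a set of purchased items.
transactions = [
    {"milk", "bread"},
    {"eggs", "bread"},
    {"milk", "eggs", "bread"},
    {"milk", "eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of A and B together, divided by the support of A."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)

s = support({"milk", "bread"}, transactions)       # 2 of the 4 transactions
c = confidence({"milk"}, {"bread"}, transactions)  # 2 of the 3 milk transactions
```

Here the rule {milk} ⇒ {bread} has support 2/4 and confidence 2/3, because three
transactions contain milk and two of those also contain bread.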

Mining Methods

1) The Apriori Algorithm
2) FP-Tree Growth Algorithm

The Apriori Algorithm


Apriori is an important algorithm in the field of data mining, specifically for discovering frequent
itemsets and association rules in a dataset. It was introduced by R. Agrawal and R. Srikant in
1994.

How the Apriori algorithm works:

Level-wise search: The Apriori algorithm employs a level-wise search approach, where it
iteratively explores higher-level itemsets based on the frequent itemsets discovered in the
previous iteration. It starts with finding frequent 1-itemsets and then uses them to find frequent
2-itemsets, which are used to find frequent 3-itemsets, and so on.

Finding frequent 1-itemsets: In the first iteration, the algorithm scans the entire database to
count the occurrences of each item and identifies the items that satisfy a minimum support
threshold. Support is a measure of how frequently an itemset appears in the dataset. The set of
frequent 1-itemsets is denoted as L1.

Generating candidate k-itemsets: The frequent (k-1)-itemsets found in the previous iteration
are used to generate candidate itemsets of size k (k > 1). Candidates are generated by joining
pairs of frequent (k-1)-itemsets that share the same first k-2 items. For example, if {A, B}
and {A, C} are frequent 2-itemsets, their combination {A, B, C} is a candidate 3-itemset. By
the Apriori property, every subset of a frequent itemset must itself be frequent, so a
candidate with any infrequent (k-1)-subset (here, {B, C}) can be discarded before counting.

Scanning and pruning: After generating the candidate itemsets of size k, the algorithm scans
the database once again to count their occurrences. Any candidate itemset that does not meet
the minimum support threshold is pruned, as it cannot be a frequent itemset. The remaining
frequent k-itemsets are added to the set Lk.


Iteration until no more frequent itemsets: The candidate generation step and the scanning and
pruning step are repeated until no more frequent k-itemsets can be found. At each iteration,
the algorithm generates candidate itemsets, scans the database, prunes non-frequent itemsets,
and adds the frequent itemsets to Lk.
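
The level-wise search described above can be sketched in Python. This is a simplified
illustration, not an optimized implementation: the function name apriori, the min_count
parameter, and the sample transactions are assumptions, and the join step unions any two
frequent (k-1)-itemsets instead of using the prefix-based join.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: count} for every itemset with count >= min_count."""
    transactions = [frozenset(t) for t in transactions]

    # First pass: count single items to obtain L1.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: n for s, n in counts.items() if n >= min_count}
    frequent = dict(current)

    k = 2
    while current:
        # Join: union pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune (Apriori property): every (k-1)-subset must be frequent.
            if all(frozenset(sub) in current for sub in combinations(union, k - 1)):
                candidates.add(union)

        # Scan: count each surviving candidate against the database.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: n for s, n in counts.items() if n >= min_count}
        frequent.update(current)  # these itemsets form Lk
        k += 1
    return frequent

# Hypothetical transactions, with a minimum support count of 2.
result = apriori(
    [{"milk", "bread"}, {"eggs", "bread"}, {"milk", "eggs", "bread"}, {"milk", "eggs"}],
    min_count=2,
)
```

With these transactions all three single items and all three pairs are frequent, while the
candidate {milk, eggs, bread} is counted once and therefore dropped.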

How to improve the efficiency of the Apriori algorithm:

Five common methods are:

Hash-based itemset counting: Use hashing techniques to efficiently count the occurrences of
itemsets. This allows for early elimination of itemsets that do not meet the support threshold,
reducing the number of itemsets that need to be considered.

Transaction reduction: Eliminate transactions that do not contain any frequent k-itemsets.
These transactions do not contribute to the discovery of frequent itemsets and can be ignored in
subsequent scans, reducing the amount of data to process.

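Partitioning: Divide the database into partitions and find the frequent itemsets of each
partition separately (the local frequent itemsets). Any itemset that is frequent in the whole
database must be frequent in at least one partition, so the union of the local frequent
itemsets forms the candidate set, which is then verified in one more scan of the full database.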

Sampling: Perform mining on a subset of the given data instead of the entire dataset. Because
a sample can miss some frequent itemsets, a lower support threshold is typically used on the
sample, and the rest of the database can be used to check the results. Sampling trades some
accuracy for a lower computational cost.

Dynamic itemset counting: Add new candidate itemsets only when all of their subsets are
estimated to be frequent. This avoids generating candidate itemsets that are unlikely to be
frequent, reducing the number of candidates to be considered.

Apriori Algorithm Example (Important)


Video (YouTube): #10 Mining Methods - APRIORI algorithm with Example |DM|

FP-GROWTH Algorithm

The FP-Growth algorithm is a frequent pattern mining algorithm that efficiently discovers
frequent itemsets in a dataset. It avoids the need for generating candidate itemsets like the
Apriori algorithm by using a data structure called the FP-Tree.

The steps involved in the FP-Growth algorithm are as follows:

1. Construct the FP-Tree:


● Scan the dataset once to count the frequency of each item, and discard items that do not
meet the minimum support threshold.
● Sort the remaining items in descending order of their frequency.
● Scan the dataset a second time and construct the FP-Tree by inserting each transaction
into the tree, with its items arranged in that frequency order.

2. Mine the FP-Tree:


● Starting with the least frequent item, create a conditional pattern base for each item.
● From the conditional pattern base, create a conditional FP-Tree.
● Recursively mine the conditional FP-Tree to find frequent itemsets.
● Combine the frequent itemsets found at each level to obtain the complete set of
frequent itemsets.

Overall, the FP-Growth algorithm is an efficient and effective method for mining frequent
itemsets, allowing for valuable insights and pattern discovery in large datasets.
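
The construction step (step 1) can be sketched as follows. This is a minimal illustration
that builds only the tree: the header table, node-links, and the mining step are omitted,
and the names FPNode and build_fp_tree as well as the sample data are assumptions.

```python
from collections import Counter

class FPNode:
    """One node of the FP-Tree: an item, its count, and its children."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_count):
    # Scan 1: count how many transactions contain each item.
    freq = Counter(item for t in transactions for item in set(t))
    frequent = {item: n for item, n in freq.items() if n >= min_count}

    root = FPNode(None)
    # Scan 2: insert each transaction with items in descending frequency order.
    for t in transactions:
        items = sorted(
            (i for i in set(t) if i in frequent),
            key=lambda i: (-frequent[i], i),  # ties broken alphabetically
        )
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, parent=node)
                node.children[item] = child
            child.count += 1  # shared prefixes only increase the count
            node = child
    return root, frequent

# Hypothetical transactions, with a minimum support count of 2.
tree, frequent_items = build_fp_tree(
    [["a", "b"], ["b", "c"], ["a", "b", "c"], ["a", "b", "d"]], min_count=2
)
```

In this example item d is infrequent and never enters the tree, and all four transactions
share the single prefix node for b, which is what keeps the structure compact.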

Benefits of the FP-Tree

Completeness:
The FP-Tree structure ensures completeness by preserving the complete information for
frequent pattern mining.
It never breaks a long pattern of any transaction: the frequent items of each transaction are
kept together on a single path of the tree (reordered by frequency), so no information needed
for mining is lost.
This completeness allows for accurate analysis and discovery of frequent patterns in the
dataset.

Compactness:
The FP-Tree structure helps in reducing irrelevant information by eliminating infrequent items
from the tree.
Infrequent items are pruned from the tree, resulting in a compact representation of the dataset.
This compactness reduces the memory space required to store the dataset and subsequent
mining operations.

Frequency Descending Ordering:


The items in the FP-Tree are arranged in descending order of their frequency.
This ordering makes it more likely that frequent items are shared as common prefixes by many
transactions, which keeps the tree small and makes the mining of frequent patterns efficient.
It helps in identifying and focusing on the most significant and relevant frequent patterns in the
dataset.

Size Efficiency:
The FP-Tree structure is typically smaller in size compared to the original database.
The tree structure itself, excluding node-links and counts, never exceeds the size of the original
database.
This size efficiency reduces memory consumption and speeds up the mining process, especially
for large datasets.

FP-GROWTH Algorithm Example [IMPORTANT]


Video (YouTube): #11 Mining Methods - FP Growth algorithm with Example |DM|

Mining Various kinds of Association Rules

Mining various kinds of association rules refers to the process of discovering different types of
relationships and patterns within a dataset. Association rule mining aims to find associations,
correlations, or dependencies among items or attributes in the data.
The different kinds of association rules that can be mined include:

Mining Multi-Level Association Rules:


● Multi-level association rule mining helps find connections between items at different
levels of detail.
● It goes beyond individual items and explores associations among item categories or
groups.
● For example, it can reveal relationships between product categories like "Fruits" and
"Healthy Snacks."
● This approach provides a broader view and deeper understanding of the connections
between different levels of items.

Mining Multi-Dimensional Association Rules from Databases or Data Warehouses:


● Multi-dimensional association rule mining analyzes data along multiple dimensions (e.g.,
time, location, customer segment).
● It discovers associations specific to different combinations of dimension values.

● For instance, it may uncover relationships between specific products sold in a particular
region during a specific period.
● This method enables a comprehensive understanding of complex relationships across
various dimensions.

Mining Multi-Dimensional Association Rules from Static Discretization of Quantitative
Attributes:
● This type of mining involves converting continuous numerical attributes into discrete
intervals or categories.
● By discretizing the attributes, associations can be discovered based on the defined
intervals.
● For example, it can find relationships between income levels (e.g., low, medium, high)
and purchasing behavior.
● This approach provides insights into patterns and relationships within the data,
considering different intervals or categories.
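
As a small illustration of static discretization, the sketch below maps a numeric income to
fixed interval labels before mining. The thresholds, the function name discretize_income,
and the customer records are all hypothetical.

```python
def discretize_income(income):
    """Map a numeric income to a fixed interval label (illustrative thresholds)."""
    if income < 30000:
        return "low"
    if income < 80000:
        return "medium"
    return "high"

# Hypothetical customer records with a quantitative attribute and purchased items.
customers = [
    {"income": 25000, "bought": {"bread"}},
    {"income": 55000, "bought": {"bread", "wine"}},
    {"income": 120000, "bought": {"wine", "cheese"}},
]

# Each customer becomes a transaction that mixes the interval label with the items,
# so an ordinary frequent itemset miner can treat "income=high" like any other item.
transactions = [
    {"income=" + discretize_income(c["income"])} | c["bought"] for c in customers
]
```

Because the intervals are fixed before mining starts, the discretization is called static.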

Mining Quantitative Association Rules:


● Quantitative association rule mining focuses on finding connections between numerical
attributes.
● It identifies associations based on values or numerical relationships rather than discrete
items.
● For instance, it may reveal that if the temperature exceeds a certain threshold, ice cream
sales tend to increase.
● This type of mining helps understand numerical dependencies and correlations between
attributes in the dataset.

Correlation Analysis

Correlation analysis, specifically Pearson's correlation coefficient, is a statistical measure
that tells us how closely two continuous variables are related to each other. It helps us
understand if there is a linear relationship between the variables.

The formula for Pearson's correlation coefficient (r) is:

r = (Σ((X - X )(Y - Ȳ))) / (sqrt(Σ(X - X)²) * sqrt(Σ(Y - Ȳ)²))

Where:
● X and Y are the values of the two variables being analyzed.
● X and Ȳ are the means of X and Y, respectively.
● Σ represents the summation of values across all observations.

The resulting value of r ranges from -1 to +1.

A positive value indicates a positive linear relationship, meaning that as one variable increases,
the other tends to increase as well.
A negative value indicates a negative linear relationship, where as one variable increases, the
other tends to decrease.
A value of 0 suggests no linear relationship between the variables (a nonlinear relationship
may still exist).
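
The formula can be evaluated directly. The following is a minimal Python sketch using only
the standard library; the function name pearson_r and the sample values are assumptions
chosen so the result is easy to verify by hand.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of products of the deviations from the means.
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: product of the square roots of the summed squared deviations.
    denominator = math.sqrt(sum((x - mean_x) ** 2 for x in xs)) * math.sqrt(
        sum((y - mean_y) ** 2 for y in ys)
    )
    return numerator / denominator

r_up = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])    # Y rises in step with X
r_down = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])  # Y falls as X rises
```

The first pair is perfectly positively correlated and the second perfectly negatively
correlated, so the two results sit at the two ends of the range from -1 to +1.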

Watch Example Problem On Youtube [IMPORTANT]

Video (YouTube): #13 Correlation Analysis - Pearson's Correlation Coefficient |DM|

Constraint based Association mining

Constraint-Based Association Mining is a data mining technique that involves incorporating
various types of constraints to guide and customize the process of discovering association rules
from a dataset. It allows the mining process to be focused, efficient, and more relevant to
specific requirements and domain knowledge.

There are 5 types:

Knowledge Type Constraints:
● Knowledge type constraints specify the type of knowledge to be mined, such as
association rules or correlation rules.
● Example: When mining a healthcare dataset, we might specify that we want association
rules relating symptoms, diseases, and treatments.

Data Constraints:
● Data constraints are conditions or restrictions applied to the dataset to filter or focus the
association mining process.
● Example: We might apply a data constraint to consider only transactions made within a
specific time period, such as the last month or year.

Dimension/Level Constraints:
● Dimension/level constraints specify which dimensions (attributes) of the data, and which
levels of their concept hierarchies, are to be used in mining.
● Example: In a retail dataset, we can restrict mining to the product, customer
demographics, and geographic location dimensions, and work at the product-category
level rather than the individual-item level. This helps identify specific patterns for
different customer segments and regions.

Interestingness Constraints:
● Interestingness constraints set thresholds on statistical measures of the discovered
associations, such as support and confidence.
● Example: We can set an interestingness constraint to consider only associations with a
minimum support of 5% and a minimum confidence of 80%. This ensures that the
discovered associations are frequent and reliable enough to be of interest.

Rule Constraints:
● Rule constraints specify additional conditions or requirements that association rules
must satisfy.
● Example: We can set a rule constraint that an association rule should have a certain
item in the antecedent (left-hand side) or consequent (right-hand side). For example, we
might require an association rule to include the item "milk" in the consequent.
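
The interestingness and rule constraints from the examples above can be applied as a simple
filter over mined rules. The sketch below is only an illustration: the rules, their support
and confidence values, and the function name satisfies_constraints are made up.

```python
# Hypothetical mined rules: (antecedent, consequent, support, confidence).
rules = [
    ({"bread"}, {"milk"}, 0.12, 0.85),
    ({"eggs"}, {"milk"}, 0.03, 0.90),     # fails the minimum support
    ({"bread"}, {"butter"}, 0.10, 0.88),  # consequent does not include "milk"
    ({"cereal"}, {"milk"}, 0.08, 0.60),   # fails the minimum confidence
]

MIN_SUPPORT = 0.05     # interestingness constraint: support of at least 5%
MIN_CONFIDENCE = 0.80  # interestingness constraint: confidence of at least 80%

def satisfies_constraints(rule, required_consequent_item):
    antecedent, consequent, support, confidence = rule
    interesting = support >= MIN_SUPPORT and confidence >= MIN_CONFIDENCE
    # Rule constraint: a given item must appear in the consequent.
    matches_rule_constraint = required_consequent_item in consequent
    return interesting and matches_rule_constraint

selected = [r for r in rules if satisfies_constraints(r, "milk")]
```

In practice such constraints are usually pushed into the mining algorithm itself so that
unwanted rules are never generated; filtering afterwards is just the simplest way to show
their effect.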

Graph Pattern Mining

Graph Pattern Mining in Data Mining (DM) refers to the process of discovering meaningful
patterns or relationships within graph-structured data. Graphs consist of nodes (vertices) and
edges that represent relationships or connections between the nodes. Graph pattern mining
aims to uncover frequent subgraphs, that is, graph patterns whose number of occurrences in a
given dataset meets a support threshold.

Apriori-based Approach:
The Apriori-based approach for graph pattern mining is inspired by the traditional Apriori
algorithm used for association rule mining. It involves a level-wise search strategy where
frequent subgraphs of increasing size are generated and evaluated. The algorithm starts by
identifying frequent individual nodes and edges in the graph, and then uses these frequent
subgraphs to generate larger subgraphs. The process continues iteratively until no more
frequent subgraphs can be found. The support threshold is used to determine the minimum
occurrence frequency required for a subgraph to be considered frequent.

Pattern Growth Approach:


The pattern growth approach is an alternative method for graph pattern mining. It aims to find
frequent subgraphs by recursively growing patterns from smaller subgraphs. The algorithm
begins with frequent individual nodes and edges as the initial patterns. It then extends these
patterns by adding new nodes or edges and checks their frequency against the dataset. This
process continues recursively, with each iteration generating larger patterns by appending new
nodes or edges to existing patterns. The algorithm terminates when no more frequent
subgraphs can be generated.

Applications:
Graph pattern mining is used in social network analysis, biological network analysis, web
mining, and fraud detection to uncover important patterns and relationships within complex
networks.

Sequential Pattern Mining (SPM)

Example video (YouTube): #16 Sequential Pattern Mining ( SPM ) |DM|

Sequential Pattern Mining (SPM) is a data mining technique that focuses on discovering
sequential patterns or temporal relationships in sequential data. It involves analyzing sequences
of events or items over time to identify frequent patterns that occur in a specific order. Here are
the key points to understand about Sequential Pattern Mining:

Definition: Sequential patterns represent the ordered occurrence of events or items in a
sequence. They capture the temporal dependencies and relationships between elements within
a sequence.

Frequent Sequential Patterns: SPM aims to find frequent sequential patterns, which are the
patterns that occur frequently above a predefined support threshold. These patterns provide
insights into the regularities or recurring sequences in the data.

Sequence Representation: Sequential data can be represented in various forms, such as
transactional databases, time-stamped event logs, clickstream data, or DNA sequences. Each
element in the sequence is associated with a timestamp or an order index.

Algorithms: There are different algorithms for sequential pattern mining, such as the
Apriori-based algorithm and the PrefixSpan algorithm. These algorithms employ different
strategies to efficiently discover frequent patterns from large sequential datasets.

Example: Let's consider a retail dataset with customer transaction sequences. Each transaction
sequence represents the items purchased by a customer over time. A frequent sequential
pattern could be <A, B, C>, indicating that customers often buy item A, then B, then C, in
that order (not necessarily in consecutive purchases). This information can be valuable for
market basket analysis and personalized recommendations.
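
Counting the support of a sequential pattern can be sketched as below. This is a simplified
illustration in which each element of a sequence is a single item (real sequence databases
allow itemsets as elements); the function names and the customer sequences are assumptions.

```python
def contains_pattern(sequence, pattern):
    """True if `pattern` occurs in `sequence` in order, with gaps allowed."""
    remaining = iter(sequence)
    # Each `item in remaining` consumes the iterator up to the first match,
    # so later pattern items can only match later positions.
    return all(item in remaining for item in pattern)

def sequence_support(sequences, pattern):
    """Fraction of sequences that contain the pattern as a subsequence."""
    hits = sum(1 for s in sequences if contains_pattern(s, pattern))
    return hits / len(sequences)

# Hypothetical purchase sequences, one per customer.
customer_sequences = [
    ["A", "B", "C"],
    ["A", "D", "B", "C"],
    ["B", "A", "C"],
    ["A", "C", "B"],
]

pattern_support = sequence_support(customer_sequences, ["A", "B", "C"])
```

Only the first two customers buy A, then B, then C in that order, so the pattern is
supported by half of the sequences.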

Applications: Sequential Pattern Mining finds applications in various domains. For instance, it
is used in market basket analysis to understand customer purchasing behavior, in web usage
mining to analyze user navigation patterns, in healthcare for analyzing patient treatment
sequences, and in manufacturing for optimizing process workflows.

UNIT-3

Classification: Classification is the process of categorizing data into predefined classes or
categories. It is a supervised learning task where a model is trained on labeled data to classify
new, unseen instances into one of the predefined classes. For example, classifying emails as
spam or non-spam based on their content and attributes.

Prediction: Prediction (numeric prediction), commonly carried out with regression, aims to
estimate a numerical value or a continuous outcome based on input variables or features. It
is also a supervised learning task where the model learns from labeled data to predict a
numerical value for new instances.
For example, predicting housing prices based on factors like location, size, and number of
rooms.

Applications: Classification and prediction have numerous applications in various fields. They
are used in customer segmentation for targeted marketing, credit scoring for assessing
creditworthiness, disease diagnosis based on patient symptoms, stock market forecasting, and
many other domains where making predictions or classifications is crucial for decision-making.

Decision tree induction

Decision tree induction is a popular data mining technique that involves constructing a tree-like
model to make decisions or predictions based on input data. Each component of a decision tree
serves a specific purpose in the process. Here's an explanation of each component and of the
measures used to build the tree:

Root Node: The topmost node of the decision tree represents the entire dataset. It is the
starting point for making decisions and contains the attribute that best splits the data based on
certain criteria.

Internal Nodes: Internal nodes represent decision points in the tree where different attributes
are evaluated to determine the next step. Each internal node contains an attribute and
corresponding splitting criteria.

Branches: Branches emanating from internal nodes represent the possible outcomes or values
of the attribute being evaluated. They lead to subsequent internal nodes or leaf nodes.

Leaf Nodes: Leaf nodes represent the final decision or prediction made by the decision tree.
They do not contain any further attributes. Instead, they indicate the class label or outcome
associated with a specific combination of attribute values.

Information Gain: Information Gain is a measure that quantifies the amount of information
gained by splitting the data based on a particular attribute. It is based on the concept of entropy,
which represents the impurity or disorder in the dataset. The attribute with the highest
Information Gain is selected as the splitting attribute at each node.

Entropy: Entropy is a measure of impurity or disorder in a set of examples. It measures the
uncertainty or randomness in the distribution of class labels within the data. A higher entropy
value indicates a more mixed or uncertain distribution, while a lower entropy value indicates a
more pure or certain distribution.

Entropy formula: Entropy(D) = - Σ (p_i * log2(p_i))

In the formula, p_i represents the proportion of examples in the dataset that belong to class i,
and the sum runs over all classes. The logarithm base 2 is used to calculate the entropy in bits.
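
Entropy and Information Gain can be computed as in the following Python sketch. The function
names and the tiny dataset are assumptions; Information Gain is taken here as the entropy of
the whole set minus the weighted entropy of the partitions produced by an attribute.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(D) = - sum of p_i * log2(p_i) over the classes present in `labels`."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy before the split minus the weighted entropy after splitting."""
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder

# Hypothetical training set with two attributes and a yes/no class label.
rows = [
    {"outlook": "sunny", "windy": True},
    {"outlook": "sunny", "windy": False},
    {"outlook": "rain", "windy": True},
    {"outlook": "rain", "windy": False},
]
labels = ["yes", "yes", "no", "no"]

gain_outlook = information_gain(rows, labels, "outlook")
gain_windy = information_gain(rows, labels, "windy")
```

Here outlook separates the classes perfectly while windy leaves each partition as mixed as
before, so outlook has the higher gain and would be chosen as the splitting attribute.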

Bayesian classification

Bayesian classification is a probabilistic approach used in data mining and machine learning for
classifying instances based on their probabilities of belonging to different classes. Here's an
explanation of Bayesian classification in simple points with an example:

Bayesian approach: Bayesian classification is based on Bayes' theorem, which calculates the
probability of a hypothesis given the observed evidence. It uses prior knowledge and updates it
with new evidence to make predictions.

Example: Suppose we have a dataset of emails labeled as "spam" or "non-spam" along with
their attributes like the presence of certain keywords, length of the email, and the number of
exclamation marks. Bayesian classification can be used to classify new emails as spam or
non-spam based on the probabilities of these attributes given each class. For example, if the
probability of an email being spam given the presence of specific keywords is higher than the
probability of it being non-spam, the classifier will classify it as spam.

Naive Bayes Classifier:

The Naive Bayes classifier is a specific type of Bayesian classifier used for classification
tasks. It assumes that the features or attributes are conditionally independent of one another
given the class. It simplifies the calculation of probabilities by assuming that, within a
class, the presence or absence of a particular feature is unrelated to the presence or
absence of other features. Despite this simplifying assumption, the Naive Bayes classifier often
performs well in practice and is widely used.

Independence Assumption: The classifier assumes that, within each class, the features or
attributes used for classification are independent of each other. This means that, once the
class is known, the presence or absence of one feature does not affect the probability of
another feature.

Probability Calculation: The classifier calculates the probability of an instance belonging to a
particular class from the prior probability of that class and the probabilities of the
instance's feature values within that class.

Simplified Probability Calculation: The independence assumption simplifies the probability
calculation. Instead of estimating the joint probability of all features together, the classifier
estimates the probability of each feature given the class independently and multiplies these
probabilities.

Feature-Based Classification: The classifier evaluates the probability of an instance belonging
to each class separately based on the presence or absence of each feature. It then selects the
class with the highest probability as the predicted class for the instance.
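
The calculation can be sketched for two boolean features. This is a minimal illustration
under stated assumptions: the training emails are made up, the function names are
hypothetical, and add-one (Laplace) smoothing is added to avoid zero probabilities, which
the notes do not mention.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count classes and, per (class, feature index), how often each value occurs."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)
    for row, label in zip(rows, labels):
        for idx, value in enumerate(row):
            feature_counts[(label, idx)][value] += 1
    return class_counts, feature_counts

def predict(row, class_counts, feature_counts):
    total = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for cls, n_cls in class_counts.items():
        score = n_cls / total  # prior probability of the class
        for idx, value in enumerate(row):
            # P(feature value | class), with add-one smoothing over the two
            # possible values True/False (an assumption of this sketch).
            score *= (feature_counts[(cls, idx)][value] + 1) / (n_cls + 2)
        if score > best_score:
            best_class, best_score = cls, score
    return best_class

# Hypothetical emails: (contains keyword, has many exclamation marks).
rows = [(True, True), (True, False), (False, False),
        (False, False), (True, True), (False, True)]
labels = ["spam", "spam", "non-spam", "non-spam", "spam", "non-spam"]

model = train_naive_bayes(rows, labels)
prediction = predict((True, True), *model)
```

The score for each class is the product of the prior and the per-feature probabilities, which
is exactly where the independence assumption is used.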

Rule Based Classifier

● The Rule-Based Classifier is a classification algorithm that makes use of IF-THEN rules
to predict the class of new instances. The rules are structured as IF conditions are met,
THEN predict a certain class.
● Rules are represented as IF-THEN statements, where the IF part specifies the
conditions or criteria based on the input features, and the THEN part indicates the class
or category to which the instance belongs.
● A Rule-Based Classifier uses a set of rules to classify instances. The rules can be written
by experts or generated from training data; in the latter case the classifier generates
rules, evaluates their quality, selects the most relevant rules, and applies them to
classify new instances.

Here's an example to illustrate how the Rule-Based Classifier works:

IF Outlook = Sunny AND Temperature = Hot THEN PlayTennis = Yes

In this example, we have a rule that predicts whether a person will play tennis based on the
outlook and temperature. The IF part specifies the conditions, and the THEN part indicates the
predicted class.

Let's say we have a new instance with the following attributes:

Outlook = Sunny
Temperature = Hot
Humidity = High
Wind = Weak

To classify this instance, we check if the conditions of any of the rules are satisfied. In this case,
the conditions of our example rule are met (Outlook = Sunny and Temperature = Hot); Humidity
and Wind are not tested by this rule. Therefore, we predict that the person will play tennis
(PlayTennis = Yes).
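
The same check can be written as a short Python sketch. The rule list holds only the example
rule from above; the function name classify and the default of None for unmatched
instances are assumptions.

```python
# Each rule is (conditions, predicted class); conditions map attribute -> value.
rules = [
    ({"Outlook": "Sunny", "Temperature": "Hot"}, "Yes"),
]

def classify(instance, rules, default=None):
    """Return the class of the first rule whose IF part is fully satisfied."""
    for conditions, predicted_class in rules:
        if all(instance.get(attr) == value for attr, value in conditions.items()):
            return predicted_class
    return default  # no rule fired

new_instance = {
    "Outlook": "Sunny",
    "Temperature": "Hot",
    "Humidity": "High",
    "Wind": "Weak",
}
play_tennis = classify(new_instance, rules)
```

Real rule-based classifiers also need a strategy for instances matched by several rules or
by none, for example rule ordering and a default class.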

Lazy learners

Lazy learners, also known as instance-based learners or memory-based learners, are a type of
machine learning algorithm used in data mining. Unlike eager learners that build a model during
the training phase, lazy learners do not construct a specific model. Instead, they memorize the
training instances and make predictions based on the stored instances at the time of testing.
The K-NN algorithm below illustrates the key points of lazy learners:

EXAMPLE: K-NN ALGORITHM


● K-Nearest Neighbour (K-NN) is one of the simplest machine learning algorithms
based on the supervised learning technique.
● The K-NN algorithm assumes that similar cases lie close together: it compares
the new case with the available cases and puts the new case into the category
whose cases it is most similar to.
● The K-NN algorithm stores all the available data and classifies a new data point
based on similarity. This means that when new data appears, it can be
easily classified into a well-suited category using the K-NN algorithm.
● The K-NN algorithm can be used for regression as well as for classification,
but it is mostly used for classification problems.
● K-NN is a non-parametric algorithm, which means it does not make any
assumption about the underlying data distribution.
● It is also called a lazy learner algorithm because it does not learn from the
training set immediately; instead it stores the dataset and, at the time of
classification, performs its computation on the dataset.
● During the training phase the K-NN algorithm just stores the dataset; when it
gets new data, it classifies that data into the category that is most
similar to the new data.

How does K-NN work?


The working of K-NN can be explained on the basis of the algorithm below:

● Step-1: Select the number K of neighbors.
● Step-2: Calculate the Euclidean distance from the new data point to each
training data point.
● Step-3: Take the K nearest neighbors as per the calculated Euclidean
distance.
● Step-4: Among these K neighbors, count the number of data points in
each category.
● Step-5: Assign the new data point to the category for which the number
of neighbors is maximum.
● Step-6: The classification of the new data point is complete.
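
The steps above can be sketched in a few lines of Python. The function name knn_classify
and the sample points are assumptions; math.dist (Python 3.8 and later) computes the
Euclidean distance.

```python
import math
from collections import Counter

def knn_classify(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Step 2: Euclidean distance from the query to every training point.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Step 3: keep the k nearest neighbors.
    nearest_labels = [label for _, label in distances[:k]]
    # Steps 4 and 5: count the categories and pick the most common one.
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical 2-D training data with two categories.
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["A", "A", "A", "B", "B", "B"]

category = knn_classify(points, labels, query=(2, 2), k=3)
```

Note that there is no training function at all: the stored dataset is the model, which is
why K-NN is called a lazy learner.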

UNIT-4

Cluster analysis, also known as clustering, is a technique in data mining used to group similar
objects together based on their inherent characteristics or patterns. It aims to discover natural
groupings within a dataset without any prior knowledge of the class labels or target variables.

The properties of cluster analysis in data mining:

Clustering Scalability: Cluster analysis algorithms should be able to handle large datasets
efficiently. Scalability refers to the ability to process and analyze increasingly larger amounts of
data without a significant increase in computational time and resources.

Algorithm Usability with Multiple Types of Data: Clustering algorithms should be applicable
to various types of data, including numerical, categorical, or mixed data. Different algorithms are
designed to handle different types of data, ensuring versatility in accommodating diverse
datasets.

Dealing with Unstructured Data: Cluster analysis is useful for discovering patterns and
structures within unstructured data. Unstructured data refers to information that does not have a
predefined data model or organization. Clustering algorithms can help identify hidden
relationships and groupings in such data.

Interoperability: Cluster analysis algorithms should be interoperable with other data mining
techniques and tools. They should be able to integrate seamlessly into the data analysis
workflow, allowing for the combination of clustering results with other analyses and
visualizations.

DATA STRUCTURES:-

Data Matrix:
● A data matrix is a structured representation of the data used in cluster analysis. It is a
table-like structure where rows represent objects (e.g., observations, samples) and
columns represent variables (e.g., features, attributes). Each cell in the matrix holds the
value of a specific variable for a particular object.
● For example, let's consider a dataset of houses where each row represents a house and
each column represents a variable such as price, size, number of rooms, etc. The data
matrix would have houses as rows and variables as columns, with each cell containing
the corresponding values.
● Data matrices are commonly used in clustering algorithms that rely on the values of
variables to determine similarities or dissimilarities between objects.

Dissimilarity Matrix:
● A dissimilarity matrix, also known as a distance matrix, represents the dissimilarities or
distances between pairs of objects in a dataset. Instead of directly providing the values
of variables, it focuses on capturing the dissimilarity or distance measures between
objects.
● In a dissimilarity matrix, each row and column represent an object, and the cell values
represent the dissimilarity or distance between the corresponding pair of objects. The
values in the dissimilarity matrix are typically calculated using distance metrics such as
Euclidean distance, Manhattan distance, or correlation distance.
● For instance, if we have a set of images and want to cluster them based on visual
similarity, we can create a dissimilarity matrix by calculating the distances between pairs
of images using image comparison techniques.
● Dissimilarity matrices are commonly used in hierarchical clustering and density-based
clustering algorithms, where the focus is on measuring the dissimilarities between
objects rather than analyzing the specific values of variables.
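
The relationship between the two structures can be shown by deriving a dissimilarity matrix
from a small data matrix. This is a minimal sketch using Euclidean distance; the function
name dissimilarity_matrix and the values are assumptions.

```python
import math

def dissimilarity_matrix(data):
    """Return an n x n matrix of Euclidean distances between the rows of `data`."""
    n = len(data)
    return [[math.dist(data[i], data[j]) for j in range(n)] for i in range(n)]

# Hypothetical data matrix: 3 objects (rows) described by 2 variables (columns).
data_matrix = [
    (0, 0),
    (3, 4),
    (6, 8),
]

d = dissimilarity_matrix(data_matrix)
```

The result is symmetric with zeros on the diagonal, so in practice only the lower triangle
needs to be stored.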

Types of Data in Cluster Analysis


Interval-Scaled Variables:
● Interval-scaled variables are continuous variables that have a specific numerical value
and maintain the same interval between each value. They can take on any real value
within a specified range.
● Examples of interval-scaled variables include temperature (measured in Celsius or
Fahrenheit), time (measured in seconds), or weight (measured in kilograms or pounds).

Binary Variables:

● Symmetric Binary Variables: Symmetric binary variables are categorical variables that
can take on only two distinct values, typically represented as 0 and 1. The two values
have equal importance and are interchangeable. Examples of symmetric binary
variables include gender (male/female), presence/absence of a certain characteristic, or
yes/no responses.

● Asymmetric Binary Variables: Asymmetric binary variables are also categorical variables
with two distinct values, but the two outcomes are not equally important. The rarer or more
significant outcome is usually coded as 1 and the common outcome as 0. Examples of
asymmetric binary variables include a positive/negative result in a test for a rare disease
or the success/failure of an uncommon event.
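The distinction matters when dissimilarity is computed. A common convention, sketched below under stated assumptions, is to use the simple matching coefficient for symmetric binary variables and a Jaccard-style coefficient, which ignores 0/0 matches, for asymmetric ones. The function names and the patient vectors are hypothetical.

```python
def binary_counts(x, y):
    """Count the four match types between two 0/1 vectors of equal length."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)  # both 1
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)  # 1 in x only
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)  # 1 in y only
    t = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)  # both 0
    return q, r, s, t

def symmetric_dissimilarity(x, y):
    """Simple matching: both kinds of agreement count equally."""
    q, r, s, t = binary_counts(x, y)
    return (r + s) / (q + r + s + t)

def asymmetric_dissimilarity(x, y):
    """Jaccard-style: 0/0 matches are treated as uninformative and dropped."""
    q, r, s, _ = binary_counts(x, y)
    return (r + s) / (q + r + s) if (q + r + s) else 0.0

# Hypothetical test results for two patients (1 = positive, 0 = negative).
patient_a = [1, 0, 1, 0, 0, 0]
patient_b = [1, 0, 0, 0, 0, 0]
print(symmetric_dissimilarity(patient_a, patient_b))
print(asymmetric_dissimilarity(patient_a, patient_b))
```

With these vectors the two patients look quite similar under simple matching (they share four negatives) but far less similar once the shared negatives are ignored.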

Categorical Variables:

● Nominal Variables: Nominal variables are categorical variables without any inherent
order or ranking. They represent distinct categories that are not numerically related.
Examples of nominal variables include colors (red, blue, green), different types of fruits,
or categories of products.

● Ordinal Variables: Ordinal variables are categorical variables with an inherent order or
ranking between categories. The categories have a meaningful relationship in terms of
their order but not necessarily in terms of the exact numerical difference between them.
Examples of ordinal variables include ratings (e.g., low, medium, high), educational
levels (e.g., elementary, middle school, high school), or levels of satisfaction (e.g., very
dissatisfied, dissatisfied, neutral, satisfied, very satisfied).

Mixed Variables: Mixed variables refer to datasets that include a combination of different types
of variables, such as a mix of interval-scaled, binary, and categorical variables. In many real-
world datasets, it is common to have variables of different types. For example, a dataset
containing information about customers may include age (interval-scaled), gender (binary), and
occupation (categorical).

Partitioning Methods

Partitioning methods are a class of clustering algorithms used in data analysis to group objects
into distinct partitions or clusters. These algorithms aim to find an optimal partitioning of the
data, where objects within each cluster are more similar to each other than to objects in other
clusters. Here's a simple explanation of partitioning methods:

K-means Clustering: K-means is one of the most widely used partitioning methods. It aims to
partition the data into K clusters, where K is a predefined number chosen by the user. The
algorithm starts by randomly selecting K initial cluster centroids and then iteratively assigns
each object to the nearest centroid and recalculates the centroids based on the mean values of
the assigned objects. This process continues until convergence, where the assignment of
objects to clusters remains unchanged.

For example, if we have a dataset of customer information and want to cluster them into three
groups based on their purchasing behavior, we can use K-means clustering to find three distinct
clusters of customers.
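The assign-and-recompute loop described above can be sketched as follows. This is a minimal illustration rather than a production implementation: the function name `kmeans`, the fixed random seed, and the sample points are assumptions made for the example.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Cluster points into k groups; returns (centroids, assignment)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)      # K initial centroids chosen at random
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:   # converged: nothing changed
            break
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                    # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, assignment

# Hypothetical data: two visibly separated groups of 2-D points.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
centroids, labels = kmeans(data, k=2)
print(centroids)
print(labels)
```

Because the initial centroids are random, the numeric cluster labels can differ between runs with different seeds, and on harder data K-means may settle in a local optimum. Running it several times with different seeds is a common precaution.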

WATCH VIDEO ON YOUTUBE-EXAMPLE K-MEANS(IMPORTANT)

#24 Partitioning Clustering - K Means Algorithm |DM|

Hierarchical clustering

Hierarchical clustering is a class of clustering algorithms that organizes data objects into a
hierarchical structure, often represented as a tree-like structure called a dendrogram. This
method creates a hierarchical sequence of nested clusters, where clusters at higher levels
encompass smaller clusters at lower levels. There are two main types of hierarchical methods:
agglomerative and divisive.

Agglomerative Hierarchical Clustering:

Agglomerative clustering starts with each data point as an individual cluster and iteratively
merges the closest pairs of clusters until a single cluster containing all the data points is formed.

The process involves the following steps:

● Calculate the similarity or distance between each pair of objects.
● Merge the two closest clusters based on the chosen similarity or distance measure.
● Update the similarity or distance matrix to reflect the newly merged cluster.
● Repeat the merge and update steps until all objects belong to a single cluster.

For example, let's say we have a dataset of animals, and we want to cluster them based on
their characteristics. Agglomerative hierarchical clustering would start by considering each
animal as a separate cluster, and then it would progressively merge the closest pairs of clusters,
grouping similar animals together at each step.
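A minimal sketch of the agglomerative procedure is shown below. For brevity it recomputes cluster distances on every pass instead of maintaining an updated distance matrix, and it assumes the nearest-pair (single linkage) distance between clusters. The function names and the one-dimensional sample points are illustrative.

```python
import math

def single_link(a, b):
    """Distance between two clusters: the closest pair of points across them."""
    return min(math.dist(p, q) for p in a for q in b)

def agglomerative(points, cluster_distance=single_link):
    """Merge clusters bottom-up; returns the list of merges in order."""
    clusters = [[p] for p in points]       # every point starts as its own cluster
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest distance.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        d = cluster_distance(clusters[i], clusters[j])
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return merges

# Hypothetical one-dimensional data.
points = [(0.0,), (1.0,), (5.0,), (7.0,), (20.0,)]
merges = agglomerative(points)
for left, right, d in merges:
    print(left, "+", right, "at distance", d)
```

The recorded merge distances are what a dendrogram draws as the heights of its joins; cutting the tree at a chosen height yields a flat clustering.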

Three linkage methods for measuring the distance between clusters:
Single Linkage:
● Single linkage, also known as the nearest neighbor linkage, measures the similarity or
distance between two clusters by considering the shortest distance between any pair of
objects belonging to the two clusters. In other words, it looks at the closest neighboring
points between clusters. The distance between clusters is defined as the minimum
distance between any two points from different clusters.

● For example, if we have three clusters A, B, and C, and we are using single linkage, the
distance between clusters A and B would be the shortest distance between any point in
A and any point in B.

Complete Linkage:
● Complete linkage, also known as the farthest neighbor linkage, measures the similarity
or distance between two clusters by considering the maximum distance between any
pair of objects belonging to the two clusters. It looks at the farthest neighboring points
between clusters. The distance between clusters is defined as the maximum distance
between any two points from different clusters.

● For example, if we have three clusters A, B, and C, and we are using complete linkage,
the distance between clusters A and B would be the maximum distance between any
point in A and any point in B.

Average Linkage:
● Average linkage measures the similarity or distance between two clusters by considering
the average distance between all pairs of objects belonging to the two clusters. It takes
into account the distances between all points from different clusters and calculates their
average. The distance between clusters is defined as the mean of the distances over all
pairs made up of one point from each cluster.

● For example, if we have three clusters A, B, and C, and we are using average linkage,
the distance between clusters A and B would be the average distance between all points
in A and all points in B.
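The three linkage definitions differ only in how they summarize the same set of cross-cluster distances, which the sketch below makes explicit. The function names and the two sample clusters are assumptions for illustration.

```python
import math
from itertools import product

def pairwise_distances(a, b):
    """All distances between one point of cluster a and one point of cluster b."""
    return [math.dist(p, q) for p, q in product(a, b)]

def single_linkage(a, b):
    return min(pairwise_distances(a, b))       # closest pair

def complete_linkage(a, b):
    return max(pairwise_distances(a, b))       # farthest pair

def average_linkage(a, b):
    d = pairwise_distances(a, b)
    return sum(d) / len(d)                     # mean over all pairs

# Hypothetical clusters A and B.
cluster_a = [(0.0, 0.0), (0.0, 1.0)]
cluster_b = [(3.0, 0.0), (4.0, 0.0)]
print(single_linkage(cluster_a, cluster_b))
print(complete_linkage(cluster_a, cluster_b))
print(average_linkage(cluster_a, cluster_b))
```

For any pair of clusters the single-linkage distance is never larger than the average-linkage distance, which in turn is never larger than the complete-linkage distance.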

Divisive Hierarchical Clustering:

Divisive clustering takes the opposite approach of agglomerative clustering. It starts with a
single cluster containing all data points and recursively splits it into smaller clusters until each
data point becomes a separate cluster.

The process involves the following steps:

● Start with a single cluster containing all data points.
● Divide the cluster into two subclusters using a chosen criterion (e.g., similarity, distance).
● Recursively divide each subcluster into smaller subclusters until each data point
becomes a separate cluster.

For instance, consider a dataset of plants, and we want to cluster them based on their
botanical features. Divisive hierarchical clustering would begin with a single cluster containing
all plants and then successively split it into subclusters, resulting in a hierarchical structure of
smaller and more specific clusters.

Density–Based Methods

Density-based methods are a class of clustering algorithms that group data points based on
their density within the dataset. One popular density-based algorithm is DBSCAN (Density-
Based Spatial Clustering of Applications with Noise). DBSCAN is particularly effective at
identifying clusters of arbitrary shape and handling noise or outliers. Let's delve into DBSCAN
and its inputs and types of data points:

DBSCAN Inputs:

● Dataset: The input to DBSCAN is a dataset consisting of data points. Each data point is
represented by its feature values or coordinates in a multi-dimensional space.

● Epsilon (ε): Epsilon is a parameter in DBSCAN that defines the maximum distance
between two data points for them to be considered as neighbors. It determines the
neighborhood size around each data point.

● Minimum Points (MinPts): MinPts is another parameter that specifies the minimum
number of data points required within the ε-neighborhood for a point to be considered a
core point. Core points play a crucial role in forming clusters.

Types of Data Points in DBSCAN:

● Core Points: Core points are data points within the dataset that have a sufficient
number of neighboring points within the ε-neighborhood (specified by MinPts). These
points are considered as central to their respective clusters.

● Border Points: Border points are data points that have fewer neighboring points than
the required MinPts within the ε-neighborhood. They are not dense enough to be core
points but are within the neighborhood of a core point. Border points can be part of a
cluster but are less central than core points.

● Noise Points: Noise points, also known as outliers, are data points that have fewer
neighboring points than the required MinPts within the ε-neighborhood and are not within
the neighborhood of any core point. They are considered as noise or non-clustered
points.
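The three point types can be sketched directly from their definitions. The example assumes that a point counts as a member of its own ε-neighborhood, a convention that varies between textbooks and libraries; the function name and the sample points are illustrative.

```python
import math

def classify_points(points, eps, min_pts):
    """Label each point as 'core', 'border', or 'noise'."""
    # Assumption: the eps-neighborhood of a point includes the point itself.
    neighbors = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    core = {i for i, nb in enumerate(neighbors) if len(nb) >= min_pts}
    labels = []
    for i, nb in enumerate(neighbors):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nb):
            labels.append("border")        # not dense itself, but near a core point
        else:
            labels.append("noise")
    return labels

# Hypothetical data: a small dense square, one point on its edge, one far away.
pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0), (10.0, 10.0)]
point_types = classify_points(pts, eps=1.1, min_pts=3)
print(point_types)
```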

DBSCAN Algorithm:

1. Select a data point that has not been visited (the order is arbitrary).
2. Find all the data points within its ε-neighborhood. If the number of points is greater
than or equal to MinPts, the point is a core point and starts a new cluster containing
its neighbors. Repeat the process for each neighbor that is itself a core point,
expanding the cluster further.
3. If the point does not have enough neighboring points within its ε-neighborhood to be a
core point, it is labeled as noise for now. It may later be reassigned to a cluster as a
border point if it lies within the ε-neighborhood of a core point.
4. Continue the process until all points have been visited and assigned to a cluster or
labeled as noise.
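The steps above can be sketched as follows. This is a simplified illustration under a few assumptions: points are visited in index order rather than randomly, a point counts as part of its own ε-neighborhood, and neighbors are found by brute force. The names `dbscan`, `region_query`, and `NOISE` are chosen for the example.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE (-1) marks outliers."""
    def region_query(i):
        # Brute-force eps-neighborhood, including the point itself.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)          # None means "not visited yet"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(i)
        if len(neighbors) < min_pts:
            labels[i] = NOISE              # provisional; may become a border point
            continue
        cluster_id += 1                    # i is a core point: start a new cluster
        labels[i] = cluster_id
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id     # former noise becomes a border point
            if labels[j] is not None:
                continue                   # already assigned; do not expand again
            labels[j] = cluster_id
            j_neighbors = region_query(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also core: keep expanding
    return labels

# Hypothetical data: one dense group with an edge point, plus an isolated point.
sample = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0), (10.0, 10.0)]
cluster_labels = dbscan(sample, eps=1.1, min_pts=3)
print(cluster_labels)
```

The brute-force neighborhood search makes this sketch quadratic in the number of points; practical implementations usually rely on a spatial index to speed up the neighborhood queries.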

WATCH VIDEO ON YOUTUBE-EXAMPLE


#26 Density Based Clustering - DBSCAN Algorithm |DM|

Grid–Based Methods

Grid-based methods are a class of clustering algorithms that partition the data space into a grid
or a set of cells. These methods are efficient for handling large datasets by reducing the
computational complexity of clustering. One popular grid-based algorithm is STING (Statistical
Information Grid).

STING (Statistical Information Grid):


STING is a grid-based clustering algorithm that organizes data into a hierarchical grid structure.
It combines the advantages of grid-based and hierarchical clustering methods to provide a
scalable and efficient approach for clustering large datasets. Here's an explanation of STING:

Grid Construction:
● The algorithm starts by dividing the entire data space into a rectangular grid of cells.
● The number and size of the cells can be pre-defined based on the characteristics of the
dataset or adaptively determined.
● Each cell in the grid represents a spatial region in the data space.

Statistical Information:
● For each cell, STING computes statistical information (e.g., mean, standard deviation)
about the data objects contained within that cell.
● The statistical information provides a summary of the data distribution within each cell.

Hierarchical Structure:
● STING constructs a hierarchical structure with several levels of cells at different
resolutions: each cell at a higher level is partitioned into a fixed number of smaller
subcells at the next lower level.
● The statistical parameters of a higher-level cell can be computed directly from those of
its subcells, so the raw data needs to be scanned only once.
● The subdivision continues down to a chosen bottom level, such as a minimum cell
size or a desired level of clustering granularity.

Cluster Extraction:
● At each level of the hierarchy, STING identifies clusters by analyzing the statistical
properties of the cells.
● Clusters can be defined based on thresholds or statistical tests applied to the information
of the cells.
● The hierarchical structure of the grid allows for different levels of clustering granularity,
enabling users to explore clusters at various resolutions.
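The sketch below illustrates the grid idea in a simplified form and is not the full STING algorithm. It assumes two-dimensional records of the form (x, y, value), keeps only the count, mean, minimum, and maximum per cell, and builds one parent level by combining 2 x 2 blocks of cells. All names and thresholds are hypothetical.

```python
from collections import defaultdict

def cell_statistics(records, cell_size):
    """Summarize (x, y, value) records per bottom-level grid cell."""
    buckets = defaultdict(list)
    for x, y, value in records:
        buckets[(int(x // cell_size), int(y // cell_size))].append(value)
    return {
        cell: {"count": len(v), "mean": sum(v) / len(v), "min": min(v), "max": max(v)}
        for cell, v in buckets.items()
    }

def parent_statistics(child_stats):
    """Build the next level up from the stored statistics alone (no raw data)."""
    groups = defaultdict(list)
    for (col, row), stats in child_stats.items():
        groups[(col // 2, row // 2)].append(stats)   # 2 x 2 children per parent
    parents = {}
    for cell, children in groups.items():
        n = sum(s["count"] for s in children)
        parents[cell] = {
            "count": n,
            "mean": sum(s["mean"] * s["count"] for s in children) / n,
            "min": min(s["min"] for s in children),
            "max": max(s["max"] for s in children),
        }
    return parents

# Hypothetical records: (x, y, attribute value).
records = [(0.5, 0.5, 10.0), (0.7, 0.2, 20.0), (1.5, 0.5, 30.0), (3.5, 3.5, 40.0)]
leaf = cell_statistics(records, cell_size=1.0)
parents = parent_statistics(leaf)
# A simple threshold rule: treat cells holding at least 2 records as dense.
dense_cells = [cell for cell, s in leaf.items() if s["count"] >= 2]
print(leaf, parents, dense_cells, sep="\n")
```

Because parent cells are derived from child summaries, queries at coarse resolution never need to touch the original records.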

Outlier Analysis

Outlier analysis, also known as outlier detection, is the process of identifying and examining
data points that deviate significantly from the majority of the dataset. Outliers are data points
that exhibit different characteristics or behaviors compared to the rest of the data. Outlier
analysis is important in various fields, including data mining, statistics, and anomaly detection. It
helps in understanding unusual patterns, detecting errors or anomalies, and making informed
decisions. There are different approaches to outlier analysis, including statistical and proximity-
based methods.

Outlier Detection

1) Statistical Methods for Outlier Detection:

● Parametric Methods: Parametric methods assume a specific distribution for the data and
use statistical techniques to identify outliers. These methods estimate the parameters of
the assumed distribution and detect outliers based on their deviation from the expected
values. Examples of parametric methods include Z-score, Grubbs' test, and Dixon's test.
● Non-parametric Methods: Non-parametric methods make minimal assumptions about
the distribution of data and focus on ranking or ordering the data points. These methods
use statistical ranks or order statistics to identify outliers. Examples of non-parametric
methods include the Median Absolute Deviation (MAD), percentile-based methods, and
the box plot.
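Two of the methods named above, the Z-score and the Median Absolute Deviation, can be sketched as follows. The thresholds (3.0 for the Z-score, 3.5 for the MAD-based score) and the scaling constant 0.6745 are commonly used conventions assumed here, not values given in these notes; the exam scores are made up.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []
    return [v for v in values if abs(v - mean) / std > threshold]

def mad_outliers(values, threshold=3.5):
    """Flag values using the median and the median absolute deviation."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []
    # 0.6745 rescales the MAD so the score is comparable to a Z-score
    # when the data is roughly normal (an assumed convention).
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical exam scores with one extreme value.
scores = [52, 55, 58, 60, 61, 62, 63, 64, 65, 67, 70, 150]
print(zscore_outliers(scores))
print(mad_outliers(scores))
```

The Z-score relies on the mean and standard deviation, which the outlier itself inflates, so on very small samples an extreme value can escape detection. The MAD-based score is less affected because the median barely moves.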

2) Proximity-Based Methods for Outlier Detection:

● Density-Based Methods: Density-based methods analyze the density of data points in
a given region to identify outliers. They identify points with significantly lower density
compared to their neighbors as outliers. One popular density-based method is Local
Outlier Factor (LOF).
● Distance-Based Methods: Distance-based methods measure the distance or
dissimilarity between data points to identify outliers. Points that are far away from their
neighbors or have unusually large distances are considered outliers. Examples of
distance-based methods include the k-nearest neighbors (k-NN) approach and the
Distance to the kth Nearest Neighbor (k-Distance) method.
● Grid-Based Methods: Grid-based methods partition the data space into a grid or set of
cells and analyze the distribution of points within each cell. Outliers are identified based
on their deviation from the expected density within the grid cells.
● Deviation-Based Methods: Deviation-based methods compare each data point to a
model or expected pattern and identify points that deviate significantly. These methods
can utilize statistical techniques or machine learning algorithms to build the expected
model.
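As an illustration of the distance-based approach, the sketch below flags points whose distance to their k-th nearest neighbor exceeds a threshold. The function names, the value of k, and the threshold are assumptions chosen for the example.

```python
import math

def kth_neighbor_distance(points, k):
    """For each point, the distance to its k-th nearest other point."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return result

def knn_outliers(points, k, threshold):
    """Points whose k-th nearest neighbor is farther away than `threshold`."""
    k_dist = kth_neighbor_distance(points, k)
    return [p for p, d in zip(points, k_dist) if d > threshold]

# Hypothetical data: a tight group and one distant point.
observations = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (8.0, 8.0)]
flagged = knn_outliers(observations, k=2, threshold=2.0)
print(flagged)
```

The threshold is data dependent; a common alternative is to rank the points by their k-distance and inspect the top few.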

Types of Outliers

Global/Point Outliers:

● Global outliers, also known as point outliers, are individual data points that
significantly deviate from the majority of the dataset. These outliers are isolated
and distinct from other data points, and they have a noticeable impact on the
overall distribution. Global outliers can arise due to measurement errors, data
entry mistakes, or rare events. They are typically easy to detect because they
stand out from the rest of the data.
● For example, in a dataset of students' exam scores, a global outlier may
represent a student who achieved an extremely high or low score compared to
other students.

Collective Outliers:

● Collective outliers, also known as group outliers, are a group or subset of data
points that collectively exhibit anomalous behavior. These outliers may not be
apparent when analyzing individual data points but become noticeable when
considering their collective characteristics. Collective outliers can occur due to
specific circumstances, events, or patterns within subgroups of the data.
● For instance, in a sales dataset, a group of customers who make unusually large
purchases during a specific time period can be considered collective outliers.

Conditional Outliers:

● Conditional outliers, also known as contextual outliers or conditional anomalies,
are data points that become outliers only under certain conditions or contexts.
These outliers may be normal within one context but anomalous within another
context. They are identified by considering the conditional relationships or
dependencies between variables.
● For example, in a weather dataset, a temperature value may not be considered
an outlier on its own, but if it is extremely high or low given the corresponding
humidity and pressure readings, it could be a conditional outlier.

UNIT-5

Basic concepts in Mining data stream

Data stream mining is the process of extracting knowledge and valuable insights from a
continuous stream of data using stream processing software. It can be considered a subset
of the general fields of machine learning, knowledge extraction, and data mining. In data
stream mining, large amounts of data must be analyzed in real time. The extracted
knowledge is represented as models and patterns over potentially infinite streams of
information.

A data stream in data mining has the following characteristics:

● Continuous Stream of Data: A data stream is a continuous, potentially infinite
flow of data, which results in very large data volumes. Multiple data streams
may arrive simultaneously.
● Time Sensitive: Data streams are time-sensitive, and their elements carry
timestamps. An element is relevant only for a certain period, after which it
loses its significance.
● Data Volatility: Stream data is volatile and is generally not stored in full. Once
the mining and analysis are done, the information is summarized or discarded.
● Concept Drifting: Data streams are unpredictable. The underlying data changes
or evolves with time, so models built on older data can become outdated.
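Because stream elements are summarized rather than stored, statistics have to be maintained incrementally. The sketch below keeps a running mean and variance in constant memory using Welford's update; the class name `RunningStats` and the sample readings are assumptions for illustration.

```python
class RunningStats:
    """Running mean and variance over a stream, without storing the elements."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0          # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new stream element into the summary, then forget it."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of everything seen so far."""
        return self._m2 / self.count if self.count else 0.0

# Hypothetical stream of sensor readings, consumed one element at a time.
stats = RunningStats()
for reading in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.update(reading)
print(stats.count, stats.mean, stats.variance)
```

A summary over the whole history reacts slowly to concept drift. Restricting the summary to a recent window of elements is a common way to let older data fade out.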

Mining Time-Series Data:

● Mining time-series data involves analyzing data that is recorded over time.
● Time-series data consists of observations or measurements taken over time, typically at
regular intervals, such as stock prices, temperature readings, or sensor data.
● The goal of mining time-series data is to discover patterns, trends, or anomalies that can
provide valuable insights for forecasting, prediction, or anomaly detection.

Applications of mining time-series data are wide-ranging and include:

Predictive Analytics: Time-series data mining can be used to build predictive models that
forecast future trends, patterns, or events based on historical data. This is valuable in various
domains, such as finance, weather forecasting, stock market analysis, and energy consumption
prediction.

Anomaly Detection: Time-series data mining techniques can identify unusual patterns or
outliers in the data, indicating potential anomalies or abnormalities. This is beneficial in
detecting fraud, network intrusion, equipment failure, and other abnormal events.

Pattern Recognition: Time-series data mining can uncover recurring patterns, periodicities, or
trends in the data. This is useful in fields like signal processing, sensor data analysis, and
biological signal analysis.

Resource Optimization: Mining time-series data can help optimize resource allocation and
utilization by analyzing patterns and trends in data related to resource consumption, production,
or demand. This is applicable in industries such as manufacturing, logistics, and energy
management.

Characteristics of time-series data include:

Time-Dependent: Time-series data is inherently dependent on time, with data points ordered
chronologically. The temporal dimension is a crucial aspect of time-series analysis.

Sequential Correlation: Time-series data often exhibits correlation or dependency between
consecutive data points. The values at different time points can influence each other, and the
order of data points is significant.

Irregular Sampling: Time-series data can have irregular or uneven sampling intervals, where
data points are not uniformly spaced in time. Dealing with irregular sampling requires
specialized techniques for interpolation or handling missing data.

Seasonality and Trends: Time-series data can exhibit periodic patterns, seasonality, or long-
term trends. Identifying and modeling these patterns are important for accurate analysis and
prediction.

Noise and Outliers: Time-series data can be subject to noise or contain outliers, which can
affect the accuracy of analysis and modeling.
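One simple way to reduce noise and expose a trend is a moving average, sketched below. The function name, the window length, and the sample series are illustrative assumptions.

```python
def moving_average(series, window):
    """Return the mean of each run of `window` consecutive values."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    total = sum(series[:window])
    smoothed = [total / window]
    for i in range(window, len(series)):
        # Slide the window: add the newest value, drop the oldest.
        total += series[i] - series[i - window]
        smoothed.append(total / window)
    return smoothed

# Hypothetical daily readings with one spike (30).
readings = [10, 12, 11, 13, 30, 14, 15]
trend = moving_average(readings, window=3)
print(trend)
```

A larger window smooths more aggressively but reacts more slowly to genuine changes, so the window length is usually matched to the seasonality of the data.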

Mining Sequence Patterns in Transactional Databases:

Mining sequence patterns in transactional databases focuses on discovering sequential patterns
or dependencies in a collection of transactions. A transaction is a set of items purchased or
accessed together, such as items in a shopping basket or a sequence of web pages visited by a
user. By analyzing transactional data, sequence mining techniques can identify common
sequences or patterns of item sets, which can be useful for market basket analysis,
recommendation systems, or process optimization.
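The basic quantity behind sequence mining is the support of a pattern, that is, the fraction of sequences that contain it in order (not necessarily contiguously). The sketch below simplifies each sequence element to a single item rather than an item set; the function names and the web sessions are hypothetical.

```python
def contains_subsequence(sequence, pattern):
    """True if the pattern's items occur in the sequence in the same order."""
    it = iter(sequence)
    # `item in it` advances the iterator, so later items must appear later.
    return all(item in it for item in pattern)

def support(sequences, pattern):
    """Fraction of sequences that contain the pattern as a subsequence."""
    hits = sum(contains_subsequence(s, pattern) for s in sequences)
    return hits / len(sequences)

# Hypothetical page-visit sequences, one per user session.
sessions = [
    ["home", "search", "product", "cart"],
    ["home", "product", "cart", "checkout"],
    ["search", "home", "product"],
    ["home", "cart"],
]
print(support(sessions, ["home", "product", "cart"]))
print(support(sessions, ["home", "cart"]))
```

A pattern is called frequent when its support meets a user-chosen minimum. Mining algorithms differ mainly in how they avoid checking every possible pattern.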

Applications of mining sequence patterns in transactional databases are diverse and include:

Market Basket Analysis: Sequence pattern mining can be used in market basket analysis to
uncover frequently occurring sequences of items purchased together. This information is
valuable for cross-selling, product recommendation systems, and optimizing store layouts.

Web Usage Mining: Mining sequence patterns in web usage data can reveal the sequential
navigation patterns of website visitors. This information can be used for personalization,
improving website design, and identifying bottlenecks or anomalies in user behavior.

Customer Behavior Analysis: Sequence pattern mining can be applied to customer
transaction data to discover patterns of behavior, such as the order in which customers perform
certain actions or make specific purchases. This helps in understanding customer preferences,
identifying valuable customer segments, and enhancing customer relationship management.

Fraud Detection: Mining sequence patterns in transactional databases can assist in detecting
fraudulent activities or patterns of suspicious behavior. By identifying abnormal or fraudulent
sequences of events, such as unauthorized transactions or unusual access patterns, fraud can
be detected and prevented.

Characteristics of mining sequence patterns in transactional databases include:

Unstructured Data: Text mining deals with unstructured textual data, which lacks a predefined
structure or format. Analyzing and extracting insights from unstructured data pose challenges
due to the absence of a standardized schema or organization.

Text Preprocessing: Before applying text mining techniques, textual data usually undergoes
preprocessing steps. These steps involve tokenization, removing stop words, stemming or
lemmatization, and handling noise, punctuation, and special characters.

Feature Extraction: Text mining involves extracting meaningful features from text, such as
word frequencies, n-grams, part-of-speech tags, or semantic representations. These features
serve as inputs for machine learning algorithms or statistical analysis.
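The preprocessing and feature extraction steps can be sketched with the standard library alone. The stop-word list below is a tiny illustrative assumption, and stemming or lemmatization is left out because it normally requires a dedicated NLP library.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def preprocess(text):
    """Tokenize and drop stop words; punctuation is removed by the tokenizer."""
    return [t for t in tokenize(text) if t not in STOP_WORDS]

def ngrams(tokens, n=2):
    """Consecutive runs of n tokens (bigrams by default)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical input sentence.
text = "The mining of text data is the mining of patterns in text."
tokens = preprocess(text)
frequencies = Counter(tokens)       # word-frequency features
bigrams = ngrams(tokens, 2)         # n-gram features
print(tokens)
print(frequencies.most_common(2))
print(bigrams)
```

The resulting counts can be arranged into a document-term matrix, which is the usual input for the classification and clustering algorithms used in text mining.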

Language and Context: Text mining considers the linguistic and contextual aspects of textual
data. It deals with challenges like word ambiguity, language variations, sarcasm, irony, and
understanding the meaning of words and phrases in different contexts.

Statistical and Machine Learning Techniques: Text mining employs a range of statistical and
machine learning techniques. These include text classification algorithms (e.g., Naive Bayes,
Support Vector Machines), clustering algorithms (e.g., k-means, hierarchical clustering), topic
modeling methods (e.g., LDA), and sentiment analysis models (e.g., lexicon-based or machine
learning-based approaches).

Integration with NLP: Text mining techniques often leverage natural language processing
(NLP) techniques, such as part-of-speech tagging, named entity recognition, parsing, and
dependency analysis, to enhance the analysis and understanding of textual data.

Mining the World Wide Web:


Mining the World Wide Web involves the application of data mining techniques to analyze and
extract knowledge from web data sources, including web pages, web logs, user clickstreams, or
social media data. Web mining can involve tasks like web content mining, web structure mining,
or web usage mining to understand user behavior, improve search engines, or detect web
anomalies.

Applications of Web Mining:


Web Search and Ranking: Web mining techniques are extensively used in search engines to
crawl and index web pages, improve search results, and rank web pages based on relevance
and popularity. This application enables efficient and accurate retrieval of web content in
response to user queries.

Web Personalization and Recommendation: Web mining helps in personalizing web
experiences for users by analyzing their browsing behavior, clickstreams, and preferences. It
enables the recommendation of relevant content, products, and services, enhancing user
satisfaction and engagement.
