DATA MINING
What is “Science”
Systematic, comprehensive investigation and
exploration of natural causes and effects.
What is “Data”
Data refers to a collection of facts, information, and
statistics that can be in various forms such as numbers,
text, sound, images, or any other format.
DATA MINING : Data mining is the process of
extracting knowledge or insights from large
amounts of data using various statistical and
computational techniques.
The data can be structured, semi-structured or
unstructured, and can be stored in various forms
such as databases, data warehouses, and data lakes.
The primary goal of data mining is to discover
hidden patterns and relationships in the data that can
be used to make informed decisions or predictions.
This involves exploring the data using various
techniques such as clustering, classification,
regression analysis, association rule mining, and
anomaly detection.
Data mining has a wide range of applications across
various industries, including marketing, finance,
healthcare, and telecommunications.
For example, in marketing, data mining can be used
to identify customer segments and target marketing
campaigns, while in healthcare, it can be used to
identify risk factors for diseases and develop
personalized treatment plans.
KDD (KNOWLEDGE DISCOVERY IN DATABASES)
Vs DATA MINING
Difference between KDD and Data Mining
Parameter: Definition
KDD: KDD refers to a process of identifying valid,
novel, potentially useful, and ultimately
understandable patterns and relationships in data.
Data Mining: Data mining refers to a process of
extracting useful and valuable information or
patterns from large data sets.
Parameter: Objective
KDD: To find useful knowledge from data.
Data Mining: To extract useful information from data.
Parameter: Techniques Used
KDD: Data cleaning, data integration, data
selection, data transformation, data mining, pattern
evaluation, and knowledge representation and
visualization.
Data Mining: Association rules, classification,
clustering, regression, decision trees, neural
networks, and dimensionality reduction.
Parameter: Output
KDD: Structured information, such as rules and
models, that can be used to make decisions or
predictions.
Data Mining: Patterns, associations, or insights that
can be used to improve decision-making or
understanding.
Parameter: Focus
KDD: The focus is on the discovery of useful
knowledge, rather than simply finding patterns in
data.
Data Mining: The focus is on the discovery of
patterns or relationships in data.
Parameter: Role of domain expertise
KDD: Domain expertise is important in KDD, as it
helps in defining the goals of the process, choosing
appropriate data, and interpreting the results.
Data Mining: Domain expertise is less critical in data
mining, as the algorithms are designed to identify
patterns without relying on prior knowledge.
DATABASE Vs DATAMINING :
DATAMINING TECHNIQUES :
1. Classification:
This technique is used to obtain important and
relevant information about data and metadata. It
helps to classify data into different classes.
Data mining frameworks themselves can also be
classified by different criteria, as follows:
Classification of data mining frameworks as per the
type of data sources mined:
This classification is as per the type of data handled, for
example multimedia, spatial data, text data, time-series
data, World Wide Web data, and so on.
Classification of data mining frameworks as per
the database involved:
This classification is based on the data model involved, for
example object-oriented databases, transactional
databases, relational databases, and so on.
Classification of data mining frameworks as per
the kind of knowledge discovered:
This classification depends on the types of knowledge
discovered or the data mining functionalities, for example
discrimination, classification, clustering, characterization,
etc. Some frameworks are comprehensive and offer
several data mining functionalities together.
Classification of data mining frameworks according
to data mining techniques used:
This classification is as per the data analysis approach
utilized, such as neural networks, machine learning,
genetic algorithms, visualization, statistics, and data
warehouse-oriented or database-oriented approaches.
The classification can also take into account the level of
user interaction involved in the data mining procedure,
such as query-driven systems, autonomous systems, or
interactive exploratory systems.
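As an illustration of classification as a technique (assigning a record to one of several known classes), here is a minimal sketch of a nearest-neighbor classifier. The customer features, the segment labels and the function name are hypothetical, chosen only for this example; real classifiers (decision trees, neural networks, etc.) are considerably more involved.

```python
import math

def nearest_neighbor_classify(train, point):
    """Return the label of the training example closest to `point`."""
    # `train` is a list of (features, label) pairs with numeric feature tuples.
    _, closest_label = min(
        train, key=lambda example: math.dist(example[0], point)
    )
    return closest_label

# Hypothetical labeled data: (monthly_visits, average_spend) -> segment
training_data = [
    ((1, 10.0), "occasional"),
    ((2, 15.0), "occasional"),
    ((12, 80.0), "frequent"),
    ((15, 95.0), "frequent"),
]

print(nearest_neighbor_classify(training_data, (3, 20.0)))   # occasional
print(nearest_neighbor_classify(training_data, (14, 90.0)))  # frequent
```

The class labels are given in advance, which is what distinguishes classification from the clustering technique described next.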
2. Clustering:
Clustering is the division of data into groups of
similar objects. Describing the data by a few clusters
inevitably loses certain fine details, but achieves
simplification.
It models data by its clusters. Data modeling puts
clustering in a historical perspective rooted
in statistics, mathematics, and numerical analysis.
From a machine learning point of view, clusters correspond to
hidden patterns, the search for clusters is unsupervised
learning, and the resulting system represents a data
concept.
From a practical point of view, clustering plays an
important role in data mining applications such as
scientific data exploration, text mining,
information retrieval, spatial database applications, CRM,
Web analysis, computational biology, medical
diagnostics, and many more.
In other words, cluster analysis is a
data mining technique for identifying similar data. This
technique helps to recognize the differences and
similarities between data items.
Clustering is similar to classification in that it
groups chunks of data together based on their
similarities, but the groups are discovered from the data
itself rather than taken from predefined class labels.
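To make the idea of grouping similar data concrete, the following is a minimal sketch of the k-means procedure on one-dimensional data. k-means is only one of many clustering algorithms; the ages, the starting centers and the function name are assumptions made for this example.

```python
def k_means_1d(values, centers, iterations=10):
    """Cluster numbers around `centers` with a basic k-means loop."""
    centers = list(centers)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each value joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical customer ages with two visible groups.
ages = [18, 19, 21, 22, 40, 42, 45, 47]
centers, clusters = k_means_1d(ages, centers=[18, 47])
print(centers)   # [20.0, 43.5]
print(clusters)  # [[18, 19, 21, 22], [40, 42, 45, 47]]
```

No labels are supplied: the two groups emerge from the distances between the values, which is why clustering is described as unsupervised learning.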
3. Regression:
Regression analysis is a data mining technique used to
identify and analyze the relationship between variables,
that is, how one variable changes in the presence of
other factors.
It is used to estimate the likely value of a specific
variable, given the values of the others.
Regression is primarily a form of planning and modeling.
For example, we might use it to project certain costs,
depending on other factors such as availability, consumer
demand, and competition.
It gives an estimated, rather than exact, relationship
between two or more variables in the given data set.
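The cost projection example can be sketched with simple linear regression fitted by ordinary least squares. The demand and cost figures below are invented for illustration, and a single explanatory variable is assumed; real projections would usually involve several factors.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data that follows cost = 2 * demand + 5 exactly.
demand = [10, 20, 30, 40]
cost = [25.0, 45.0, 65.0, 85.0]

slope, intercept = fit_line(demand, cost)
print(slope, intercept)          # 2.0 5.0
print(slope * 50 + intercept)    # projected cost at demand 50: 105.0
```

With noisy real-world data the fitted line would only approximate the relationship, which is why regression results are estimates.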
4. Association Rules:
This data mining technique helps to discover a link between
two or more items. It finds hidden patterns in the data
set.
Association rules are if-then statements that help to
show the probability of relationships between data items
within large data sets in different types of databases.
Association rule mining has several applications and is
commonly used to find sales correlations in transactional
data or patterns in medical data sets.
The algorithm works on a collection of transactions,
for example a list of the grocery items that you have
been buying for the last six months. It calculates the
percentage of transactions in which items are purchased
together.
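The percentage of transactions in which items occur together is usually called support, and the strength of an if-then rule is usually measured by confidence. The sketch below computes both for a tiny, made-up set of grocery baskets; the basket contents and function names are assumptions for illustration only.

```python
# Hypothetical grocery baskets, one set of items per shopping trip.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "eggs"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(1 for basket in baskets if itemset <= basket) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimated probability of `consequent` given `antecedent` (if-then rule)."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

print(support({"bread", "milk"}, baskets))        # 0.5
print(confidence({"bread"}, {"milk"}, baskets))   # 2 of the 3 bread baskets also hold milk
```

Algorithms such as Apriori build on these two measures to search efficiently for all rules above chosen support and confidence thresholds.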
5. Outlier detection:
This type of data mining technique relates to the
observation of data items in the data set that do not
match an expected pattern or expected behavior.
This technique may be used in various domains such as
intrusion detection, fraud detection, etc. It is also known
as outlier analysis or outlier mining.
An outlier is a data point that diverges too much from
the rest of the dataset. Most real-world
datasets contain outliers.
Outlier detection plays a significant role in the data
mining field.
It is valuable in numerous fields such as
network intrusion detection, credit or debit card
fraud detection, detecting outlying values in wireless sensor
network data, etc.
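One simple way to flag points that diverge too much from the rest is the z-score rule: measure how many standard deviations each value lies from the mean. The sketch below assumes roughly bell-shaped data and a cut-off of 2 standard deviations; both the threshold and the transaction amounts are illustrative choices, not fixed rules.

```python
import statistics

def z_score_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` std. deviations from the mean."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)  # population standard deviation
    if spread == 0:
        return []  # all values identical, nothing can be an outlier
    return [v for v in values if abs(v - mean) / spread > threshold]

# Hypothetical card transaction amounts with one suspicious value.
amounts = [10, 12, 11, 13, 12, 11, 95]
print(z_score_outliers(amounts))  # [95]
```

Because a large outlier also inflates the mean and standard deviation, more robust measures (for example, based on the median) are often preferred in practice.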
6. Sequential Patterns:
Sequential pattern mining is a data mining technique
specialized for evaluating sequential data to discover
sequential patterns.
It consists of finding interesting subsequences in a set
of sequences, where the interestingness of a subsequence
can be measured in terms of different criteria such as
length, occurrence frequency, etc.
In other words, this technique of data mining helps to
discover or recognize similar patterns in transaction data
over a period of time.
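As a small illustration of finding frequent subsequences, the sketch below counts ordered pairs of items ("a was bought, and later b") across purchase sequences and keeps those that occur in at least a minimum number of sequences. Restricting patterns to length 2 is a simplification made here; the purchase histories and function name are hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_ordered_pairs(sequences, min_support):
    """Count (earlier, later) item pairs and keep those seen in enough sequences."""
    counts = Counter()
    for seq in sequences:
        # combinations() preserves order, so (a, b) means a occurred before b.
        # The set ensures each pair is counted once per sequence.
        counts.update(set(combinations(seq, 2)))
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Hypothetical purchase histories, each in chronological order.
purchases = [
    ["phone", "case", "charger"],
    ["phone", "charger"],
    ["case", "phone", "charger"],
]

patterns = frequent_ordered_pairs(purchases, min_support=2)
print(patterns)
```

Here occurrence frequency is the interestingness criterion: "phone then charger" appears in all three sequences, "case then charger" in two.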
7. Prediction:
Prediction uses a combination of other data mining
techniques such as trend analysis, clustering, classification,
etc. It analyzes past events or instances in the right
sequence to predict a future event.
PROBLEMS, ISSUES AND CHALLENGES IN
DATAMINING :
Data mining is not an easy task, as the algorithms used
can get very complex and data is not always available in
one place. It needs to be integrated from various
heterogeneous data sources.
These factors also create some issues. The major issues
discussed here concern −
Mining Methodology and User Interaction
Performance Issues
Diverse Data Types Issues
The following diagram describes the major issues.
Data Mining issues
Mining Methodology and User Interaction Issues
It refers to the following kinds of issues −
Mining different kinds of knowledge in databases −
Different users may be interested in different kinds
of knowledge. Therefore it is necessary for data
mining to cover a broad range of knowledge
discovery tasks.
Interactive mining of knowledge at multiple levels of
abstraction − The data mining process needs to be
interactive because it allows users to focus the
search for patterns, providing and refining data
mining requests based on the returned results.
Incorporation of background knowledge − To guide
discovery process and to express the discovered
patterns, the background knowledge can be used.
Background knowledge may be used to express the
discovered patterns not only in concise terms but
at multiple levels of abstraction.
Data mining query languages and ad hoc data mining
− A data mining query language that allows the user to
describe ad hoc mining tasks should be integrated
with a data warehouse query language and optimized
for efficient and flexible data mining.
Presentation and visualization of data mining results
− Once the patterns are discovered, they need to be
expressed in high-level languages and visual
representations. These representations should be
easily understandable.
Handling noisy or incomplete data − Data
cleaning methods are required to handle noise
and incomplete objects while mining the data
regularities. Without data cleaning methods,
the accuracy of the discovered patterns
will be poor.
Pattern evaluation − The patterns discovered should
be interesting. Many discovered patterns are not,
because they either represent common knowledge or
lack novelty, so interestingness measures are needed
to evaluate them.
Performance Issues
There can be performance-related issues such as the following −
Efficiency and scalability of data mining algorithms
− In order to effectively extract information from
huge amounts of data in databases, data mining
algorithms must be efficient and scalable.
Parallel, distributed, and incremental mining
algorithms − Factors such as the huge size of
databases, the wide distribution of data, and the
complexity of data mining methods motivate the
development of parallel and distributed data mining
algorithms.
These algorithms divide the data into partitions, which are
then processed in parallel. The results from the
partitions are then merged.
Incremental algorithms incorporate database updates
without mining the data again from scratch.
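The partition, process, merge and incremental-update pattern described above can be sketched as follows. Item counting stands in for a real mining step, and a thread pool is used only to show the structure (in CPython, threads do not speed up CPU-bound work; a real system would use processes or a cluster). The transactions and function names are assumptions for this example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_items(partition):
    """Mine one partition: count how often each item occurs."""
    counts = Counter()
    for transaction in partition:
        counts.update(transaction)
    return counts

def mine_in_partitions(transactions, n_partitions):
    """Split the data, process the partitions in parallel, merge the results."""
    partitions = [transactions[i::n_partitions] for i in range(n_partitions)]
    with ThreadPoolExecutor() as pool:
        partial_counts = list(pool.map(count_items, partitions))
    total = Counter()
    for counts in partial_counts:
        total.update(counts)  # merge step: add up the per-partition counts
    return total

# Hypothetical transaction database.
transactions = [["a", "b"], ["a", "c"], ["b", "c"], ["a"]]
totals = mine_in_partitions(transactions, n_partitions=2)

# Incremental update: only the new transactions are processed,
# and their counts are folded into the existing result.
new_transactions = [["c", "d"]]
totals.update(count_items(new_transactions))
print(totals)
```

The merge is simple here because counts are additive; for many mining tasks, combining partial results is the hard part of designing a parallel or incremental algorithm.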
Diverse Data Types Issues
Handling of relational and complex types of data −
The database may contain complex data objects,
multimedia data objects, spatial data, temporal data,
etc. It is not possible for one system to mine all these
kinds of data.
Mining information from heterogeneous databases
and global information systems − The data is
available at different data sources on a LAN or WAN.
These data sources may be structured, semi-structured,
or unstructured. Therefore mining
knowledge from them adds challenges to data
mining.
PROBLEMS IN DATAMINING :
1. Poor data quality such as noisy data, dirty data, missing
values, inexact or incorrect values, inadequate data size
and poor representation in data sampling.
2. Integrating conflicting or redundant data from
different sources and forms: multimedia files (audio,
video and images), geo data, text, social, numeric, etc.
3. Proliferation of security and privacy concerns
by individuals, organizations and governments.
4. Unavailability of data or difficult access to data.
5. Efficiency and scalability of data mining algorithms
to effectively extract information from huge amounts
of data in databases.
6. Dealing with huge datasets that require distributed
approaches.
7. Dealing with non-static, unbalanced and cost-sensitive
data.
8. Mining information from heterogeneous databases
and global information systems.
9. Constant updating of models to handle data velocity
or new incoming data.
10. High cost of buying and maintaining powerful
software, servers and storage hardware that handle
large amounts of data.
11. Processing of large, complex and unstructured
data into a structured format.
12. Sheer quantity of output from many data mining
methods.
CHALLENGES IN DATAMINING:
1. Data Quality
The quality of data used in data mining is one of the most
significant challenges. The accuracy, completeness, and
consistency of the data affect the accuracy of the results
obtained.
The data may contain errors, omissions, duplications, or
inconsistencies, which may lead to inaccurate results.
Moreover, the data may be incomplete, meaning that
some attributes or values are missing, making it
challenging to obtain a complete understanding of the
data.
Data quality issues can arise due to a variety of reasons,
including data entry errors, data storage issues, data
integration problems, and data transmission errors.
To address these challenges, data mining practitioners
must apply data cleaning and data preprocessing
techniques to improve the quality of the data.
Data cleaning involves detecting and correcting errors,
while data preprocessing involves transforming the data
to make it suitable for data mining.
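A minimal sketch of two common cleaning steps, removing duplicate records and filling in missing values, is shown below. Filling gaps with the mean of the known values is just one possible strategy, and the record layout and function name are hypothetical. The sketch assumes at least one known value exists for the field.

```python
def clean_records(records, field):
    """Drop exact duplicate records, then fill missing `field` values with the mean."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(sorted(record.items()))  # hashable fingerprint of the record
        if key not in seen:
            seen.add(key)
            unique.append(dict(record))      # copy so the input is not modified
    known = [r[field] for r in unique if r[field] is not None]
    mean = sum(known) / len(known)           # assumes at least one known value
    for r in unique:
        if r[field] is None:
            r[field] = mean
    return unique

# Hypothetical customer records with a duplicate and a missing age.
records = [
    {"id": 1, "age": 30},
    {"id": 2, "age": None},
    {"id": 1, "age": 30},
    {"id": 3, "age": 50},
]

print(clean_records(records, "age"))
# [{'id': 1, 'age': 30}, {'id': 2, 'age': 40.0}, {'id': 3, 'age': 50}]
```

Whether to impute, discard or flag incomplete records depends on the data and the mining task, so this choice should be made with domain knowledge.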
2. Data Complexity
Data complexity refers to the vast amounts of data
generated by various sources, such as sensors, social
media, and the internet of things (IoT).
The complexity of the data may make it challenging to
process, analyze, and understand. In addition, the data
may be in different formats, making it challenging to
integrate into a single dataset.
To address this challenge, data mining practitioners use
advanced techniques such as clustering, classification, and
association rule mining. These techniques help to identify
patterns and relationships in the data, which can then be
used to gain insights and make predictions.
3. Data Privacy and Security
Data privacy and security are further significant
challenges in data mining. As more data is collected,
stored, and analyzed, the risk of data breaches and
cyber-attacks increases.
The data may contain personal, sensitive, or confidential
information that must be protected.
Moreover, data privacy regulations such as GDPR,
CCPA, and HIPAA impose strict rules on how data can
be collected, used, and shared.
To address this challenge, data mining practitioners must
apply data anonymization and data encryption
techniques to protect the privacy and security of the data.
Data anonymization involves removing personally
identifiable information (PII) from the data, while data
encryption involves using algorithms to encode the data to
make it unreadable to unauthorized users.
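The idea of removing PII before mining can be sketched as below: direct identifiers are dropped and the customer id is replaced by a salted hash. Strictly speaking, a salted hash is pseudonymization, which is weaker than full anonymization, and it is used here only as a simple stand-in. The field names, the salt value and the function name are assumptions for illustration. Encryption is not shown.

```python
import hashlib

# Hypothetical set of fields treated as direct identifiers in this sketch.
PII_FIELDS = {"name", "email"}

def anonymize(record, salt):
    """Drop direct identifiers and replace the customer id with a pseudonym."""
    cleaned = {k: v for k, v in record.items() if k not in PII_FIELDS}
    raw = (salt + str(record["customer_id"])).encode("utf-8")
    # A truncated SHA-256 digest serves as a stable pseudonym for the id.
    cleaned["customer_id"] = hashlib.sha256(raw).hexdigest()[:12]
    return cleaned

record = {
    "customer_id": 1001,
    "name": "A. Customer",
    "email": "a@example.com",
    "total_spend": 250.0,
}

safe = anonymize(record, salt="demo-salt")
print(sorted(safe))  # ['customer_id', 'total_spend']
```

The same input and salt always produce the same pseudonym, so records of one customer can still be linked for mining without exposing the name or email.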
4. Scalability
Data mining algorithms must be scalable to handle large
datasets efficiently. As the size of the dataset increases,
the time and computational resources required to perform
data mining operations also increase.
Moreover, the algorithms must be able to handle
streaming data, which is generated continuously and must
be processed in real-time.
To address this challenge, data mining practitioners use
distributed computing frameworks such as Hadoop and
Spark.
These frameworks distribute the data and processing
across multiple nodes, making it possible to process large
datasets quickly and efficiently.
5. Interpretability
Data mining algorithms can produce complex models that
are difficult to interpret. This is because the algorithms
use a combination of statistical and mathematical
techniques to identify patterns and relationships in the
data.
Moreover, the models may not be intuitive, making
it challenging to understand how the model arrived
at a particular conclusion.
To address this challenge, data mining practitioners use
visualization techniques to represent the data and the
models visually.
Visualization makes it easier to understand the patterns
and relationships in the data and to identify the most
important variables.
6. Ethics
Data mining raises ethical concerns related to the
collection, use, and dissemination of data. The data may
be used to discriminate against certain groups, violate
privacy rights, or perpetuate existing biases.
Moreover, data mining algorithms may not be transparent,
making it challenging to detect biases or discrimination.
DATAMINING APPLICATIONS
Scientific Analysis: Scientific simulations
generate large volumes of data every day. This
includes data collected from nuclear
laboratories, data about human psychology,
etc.
Data mining techniques are capable of
analyzing these data. We can now capture and
store new data faster than we can analyze
the old data already accumulated. Examples of
scientific analysis:
Sequence analysis in bioinformatics
Classification of astronomical objects
Medical decision support.
Intrusion Detection: A network intrusion refers
to any unauthorized activity on a digital
network.
Network intrusions often involve stealing
valuable network resources. Data mining
techniques play a vital role in detecting
intrusions, network attacks, and
anomalies.
These techniques help in selecting and refining
useful and relevant information from large data
sets, and in classifying relevant data for an
intrusion detection system.
The intrusion detection system raises alarms when
network traffic indicates a foreign invasion of
the system. For example:
Detect security violations
Misuse Detection
Anomaly Detection
Business Transactions: Every business
transaction is memorized for perpetuity. Such transactions
are usually time-related and can be inter-business
deals or intra-business operations.
The effective use of this data within a
reasonable time frame for competitive decision-making
is one of the most important problems
to solve for businesses that struggle to survive in
a highly competitive world.
Data mining helps to analyze these business
transactions, identify marketing approaches,
and support decision-making. Examples:
Direct mail targeting
Stock trading
Customer segmentation
Market Basket Analysis: Market basket
analysis is a technique for the careful
study of the purchases made by a customer in a
supermarket.
It identifies patterns of items that customers
frequently purchase together. Companies can use this
analysis to promote deals, offers, and sales,
and data mining techniques help to
carry out the analysis. Examples:
Data mining concepts are used in sales and
marketing to provide better customer service, to
improve cross-selling opportunities, and to increase
direct mail response rates.
Customer retention, in the form of pattern
identification and prediction of likely defections,
is possible with data mining.
Risk assessment and fraud detection also use
data mining concepts to identify
inappropriate or unusual behavior.
Education: For analyzing the education sector,
data mining uses the Educational Data Mining
(EDM) method.
This method generates patterns that can be used
by both learners and educators. Using
EDM, we can perform educational
tasks such as:
Predicting student admission to
higher education
Student profiling
Predicting student performance
Evaluating teachers' teaching performance
Curriculum development
Predicting student placement opportunities
Research: Data mining techniques can perform
prediction, classification, clustering,
association, and grouping of data
in the research area.
The rules generated by data mining are used to
derive results. In most technical research in
data mining, we create a training model and a
testing model.
The training/testing approach is a strategy to
measure the accuracy of the proposed model. It
is called Train/Test because we split the data set
into two sets: a training data set and a testing
data set.
The training data set is used to build the
model, whereas the testing data set is used to
evaluate it. Examples:
Classification of uncertain data.
Information-based clustering.
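The Train/Test strategy described above can be sketched in a few lines: shuffle the data, hold back part of it for testing, build a model on the rest, and measure accuracy only on the held-back part. The study-hours data, the 25% test fraction, the very simple threshold model and all function names are assumptions made for this illustration.

```python
import random

def train_test_split(rows, test_fraction, seed):
    """Shuffle a copy of `rows` and split it into (train, test)."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit_threshold(train):
    """Learn a cut-off halfway between the mean hours of the two classes."""
    failed = [hours for hours, passed in train if not passed]
    passed_ = [hours for hours, passed in train if passed]
    return (sum(failed) / len(failed) + sum(passed_) / len(passed_)) / 2

def accuracy(threshold, rows):
    """Fraction of rows whose predicted label matches the true label."""
    correct = sum(1 for hours, passed in rows if (hours > threshold) == passed)
    return correct / len(rows)

# Hypothetical data: (hours studied, passed the exam?)
data = [(1, False), (2, False), (3, False), (4, False),
        (6, True), (7, True), (8, True), (9, True)]

train, test = train_test_split(data, test_fraction=0.25, seed=0)
threshold = fit_threshold(train)
print(len(train), len(test), accuracy(threshold, test))
```

Evaluating on rows the model has never seen gives a fairer estimate of how it will behave on new data than measuring accuracy on the training set.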
Healthcare and Insurance: A pharmaceutical
company can examine its recent sales force activity
and its outcomes to improve the targeting of
high-value physicians and figure out which
marketing activities will have the best effect
in the upcoming months.
In the insurance sector, data mining can
help to predict which customers will buy new
policies, identify behavior patterns of risky
customers, and identify fraudulent behavior of
customers.
Financial/Banking Sector: A credit card
company can leverage its vast warehouse of
customer transaction data to identify
customers most likely to be interested in a new
credit product.
Credit card fraud detection.
Identify ‘Loyal’ customers.
Extraction of information related
to customers.
Determine credit card spending by
customer groups.