Unit 1 Data Mining

Data mining is the process of extracting knowledge from large datasets using various techniques such as classification, clustering, regression, and association rule mining. It aims to discover hidden patterns and relationships to inform decision-making across various industries, including marketing and healthcare. Challenges in data mining include data quality, complexity, privacy concerns, and the need for scalable algorithms.

Uploaded by

animestudio0707
DATA MINING

What is "Science"?
Science is the systematic, comprehensive investigation and exploration of natural causes and effects.
What is "Data"?
Data refers to a collection of facts, information, and statistics that can take various forms, such as numbers, text, sound, images, or any other format.
DATA MINING: Data mining is the process of extracting knowledge or insights from large amounts of data using various statistical and computational techniques.

The data can be structured, semi-structured, or unstructured, and can be stored in various forms such as databases, data warehouses, and data lakes.

The primary goal of data mining is to discover hidden patterns and relationships in the data that can be used to make informed decisions or predictions.

This involves exploring the data using techniques such as clustering, classification, regression analysis, association rule mining, and anomaly detection.

Data mining has a wide range of applications across industries, including marketing, finance, healthcare, and telecommunications. For example, in marketing, data mining can be used to identify customer segments and target marketing campaigns, while in healthcare, it can be used to identify risk factors for diseases and develop personalized treatment plans.

KDD (KNOWLEDGE DISCOVERY IN DATABASES) Vs DATA MINING

Difference between KDD and Data Mining

Parameter: Definition
KDD: KDD refers to a process of identifying valid, novel, potentially useful, and ultimately understandable patterns and relationships in data.
Data Mining: Data mining refers to a process of extracting useful and valuable information or patterns from large data sets.

Parameter: Objective
KDD: To find useful knowledge from data.
Data Mining: To extract useful information from data.

Parameter: Techniques Used
KDD: Data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge representation and visualization.
Data Mining: Association rules, classification, clustering, regression, decision trees, neural networks, and dimensionality reduction.

Parameter: Output
KDD: Structured information, such as rules and models, that can be used to make decisions or predictions.
Data Mining: Patterns, associations, or insights that can be used to improve decision-making or understanding.

Parameter: Focus
KDD: The focus is on the discovery of useful knowledge, rather than simply finding patterns in data.
Data Mining: The focus is on the discovery of patterns or relationships in data.

Parameter: Role of domain expertise
KDD: Domain expertise is important in KDD, as it helps in defining the goals of the process, choosing appropriate data, and interpreting the results.
Data Mining: Domain expertise is less critical in data mining, as the algorithms are designed to identify patterns without relying on prior knowledge.

DATABASE Vs DATA MINING :

DATA MINING TECHNIQUES :

1. Classification:
This technique is used to obtain important and relevant information about data and metadata. It assigns data items to different classes.
Data mining frameworks can themselves be classified by different criteria, as follows:
• Classification of data mining frameworks as per the type of data sources mined: This classification is based on the type of data handled, for example multimedia, spatial data, text data, time-series data, the World Wide Web, and so on.
• Classification of data mining frameworks as per the database involved: This classification is based on the data model involved, for example object-oriented databases, transactional databases, relational databases, and so on.
• Classification of data mining frameworks as per the kind of knowledge discovered: This classification depends on the types of knowledge discovered or the data mining functionalities offered, for example discrimination, classification, clustering, characterization, etc. Some frameworks are extensive and offer several data mining functionalities together.
• Classification of data mining frameworks according to the data mining techniques used: This classification is based on the data analysis approach utilized, such as neural networks, machine learning, genetic algorithms, visualization, statistics, or data warehouse-oriented or database-oriented approaches.
The classification can also take into account the level of user interaction involved in the data mining procedure, such as query-driven systems, autonomous systems, or interactive exploratory systems.
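As a rough illustration of assigning data items to classes, the sketch below labels a new point with the class of its nearest labelled example (a 1-nearest-neighbour rule, chosen here only for brevity; these notes do not prescribe a particular algorithm). The `TRAINING` data and the `classify` helper are hypothetical.

```python
from math import dist

# Hypothetical labelled examples: (height_cm, weight_kg) -> class label.
TRAINING = [
    ((150.0, 50.0), "small"),
    ((155.0, 55.0), "small"),
    ((180.0, 85.0), "large"),
    ((185.0, 90.0), "large"),
]

def classify(point, training=TRAINING):
    """Return the label of the training example closest to `point`."""
    _, nearest_label = min(training, key=lambda example: dist(example[0], point))
    return nearest_label

print(classify((152.0, 52.0)))
print(classify((182.0, 88.0)))
```

The key point is that the class labels are known in advance; the model only decides which existing class a new item belongs to.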
2. Clustering:
Clustering is the division of data into groups of similar objects. Describing the data by a few clusters inevitably loses certain fine details, but achieves simplification.
It models data by its clusters. From a historical point of view, data modeling by clustering is rooted in statistics, mathematics, and numerical analysis.
From a machine learning point of view, clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting framework represents a data concept.
From a practical point of view, clustering plays an outstanding role in data mining applications such as scientific data exploration, text mining, information retrieval, spatial database applications, CRM, Web analysis, computational biology, medical diagnostics, and many more.
In other words, cluster analysis is a data mining technique for identifying similar data. It helps to recognize the differences and similarities between data items.
Clustering is very similar to classification, but it groups chunks of data together based on their similarities rather than assigning them to predefined classes.
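To make the idea of grouping by similarity concrete, here is a minimal sketch of k-means clustering on one-dimensional data. K-means is only one of many clustering methods and is used here as an assumption; the `ages` data, the starting centroids, and the `k_means_1d` helper are all hypothetical.

```python
def k_means_1d(values, centroids, iterations=10):
    """Group 1-D values around the given starting centroids (k-means)."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments are stable, so stop early
        centroids = new_centroids
    return centroids, clusters

ages = [18, 21, 22, 45, 47, 50]
centroids, clusters = k_means_1d(ages, centroids=[20.0, 40.0])
print(clusters)
```

No class labels are supplied: the two groups emerge purely from how close the values are to each other.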
3. Regression:
Regression analysis is the data mining process used to identify and analyze the relationship between variables in the presence of other factors.
It is used to estimate the likely value of a specific variable. Regression is primarily a form of planning and modeling. For example, we might use it to project certain costs, depending on other factors such as availability, consumer demand, and competition.
Primarily, it estimates the relationship between two or more variables in the given data set.
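The cost-projection example can be sketched with ordinary least squares for a single predictor. This is one common regression method, assumed here for illustration; the `demand` and `cost` figures are invented and `fit_line` is a hypothetical helper.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Invented data in which cost rises linearly with consumer demand.
demand = [10, 20, 30, 40]
cost = [25, 45, 65, 85]

slope, intercept = fit_line(demand, cost)
projected_cost = slope * 50 + intercept  # projection for a demand of 50
print(slope, intercept, projected_cost)
```

With real data the points rarely lie exactly on a line, so the fitted line is an estimate of the relationship rather than an exact description of it.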
4. Association Rules:
This data mining technique helps to discover a link between two or more items. It finds hidden patterns in the data set.
Association rules are if-then statements that help to show the probability of interactions between data items within large data sets in different types of databases.
Association rule mining has several applications and is commonly used to find sales correlations in transactional data or in medical data sets.
The algorithm works on a collection of data, for example a list of grocery items that you have been buying for the last six months, and calculates the percentage of items being purchased together.
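The "percentage of items purchased together" is usually expressed as support, and the strength of an if-then rule as confidence. The sketch below computes both for a handful of invented grocery baskets; the `baskets` data and the helper names are hypothetical.

```python
# Invented shopping baskets, one set of items per purchase.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(1 for basket in baskets if itemset <= basket) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """How often the rule 'if antecedent then consequent' holds when the antecedent occurs."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

print(support({"bread", "milk"}, baskets))
print(confidence({"bread"}, {"milk"}, baskets))
```

Here bread and milk appear together in half of the baskets, and two out of the three baskets containing bread also contain milk.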
5. Outlier detection:
This type of data mining technique relates to the observation of data items in the data set that do not match an expected pattern or expected behavior.
This technique may be used in various domains such as intrusion detection, fraud detection, etc. It is also known as outlier analysis or outlier mining.
An outlier is a data point that diverges too much from the rest of the dataset. The majority of real-world datasets contain outliers.
Outlier detection plays a significant role in the data mining field. It is valuable in numerous fields, such as network intrusion identification, credit or debit card fraud detection, and detecting outlying readings in wireless sensor network data.
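One simple way to flag points that diverge from the rest is the interquartile range (IQR) rule, sketched below. The rule and its conventional factor of 1.5 are assumptions made for illustration, not a method stated in these notes; the `amounts` data and `iqr_outliers` helper are hypothetical.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # first, second, and third quartile
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Invented card transaction amounts with one suspiciously large payment.
amounts = [10, 12, 11, 13, 12, 95]
print(iqr_outliers(amounts))
```

In a fraud detection setting, the flagged transactions would be passed on for closer inspection rather than rejected automatically.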
6. Sequential Patterns:
Sequential pattern mining is a data mining technique specialized for evaluating sequential data to discover sequential patterns.
It consists of finding interesting subsequences in a set of sequences, where the value of a subsequence can be measured in terms of different criteria such as length, occurrence frequency, etc.
In other words, this technique helps to discover or recognize similar patterns in transaction data over some time.
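Occurrence frequency can be measured as the share of sequences that contain a candidate pattern in the same order. The sketch below shows that check for invented purchase histories; the data and helper names are hypothetical, and a full sequential pattern miner would also generate the candidate patterns itself.

```python
def contains_in_order(sequence, pattern):
    """True if the items of `pattern` appear in `sequence` in order (gaps allowed)."""
    remaining = iter(sequence)
    # `item in remaining` advances the iterator, so order is enforced.
    return all(item in remaining for item in pattern)

def support(pattern, sequences):
    """Fraction of sequences that contain `pattern` as an ordered subsequence."""
    return sum(contains_in_order(s, pattern) for s in sequences) / len(sequences)

# Invented purchase histories, one list per customer, oldest purchase first.
sequences = [
    ["phone", "case", "charger"],
    ["phone", "charger"],
    ["case", "phone"],
    ["laptop", "mouse"],
]
print(support(["phone", "charger"], sequences))
```

Order matters here: a customer who bought a case and then a phone does not count towards the pattern "phone, then case".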
7. Prediction:
Prediction uses a combination of other data mining techniques, such as trend analysis, clustering, and classification. It analyzes past events or instances in the right sequence to predict a future event.

PROBLEMS, ISSUES AND CHALLENGES IN DATA MINING :
Data mining is not an easy task: the algorithms used can get very complex, and the data is not always available in one place. It needs to be integrated from various heterogeneous data sources.
These factors also create some issues. The major issues concern −
• Mining Methodology and User Interaction
• Performance Issues
• Diverse Data Types Issues
[Figure: Data Mining issues]
Mining Methodology and User Interaction Issues
It refers to the following kinds of issues −
• Mining different kinds of knowledge in databases − Different users may be interested in different kinds of knowledge. Therefore, data mining must cover a broad range of knowledge discovery tasks.
• Interactive mining of knowledge at multiple levels of abstraction − The data mining process needs to be interactive so that users can focus the search for patterns, providing and refining data mining requests based on the returned results.
• Incorporation of background knowledge − Background knowledge can be used to guide the discovery process and to express the discovered patterns, not only in concise terms but at multiple levels of abstraction.
• Data mining query languages and ad hoc data mining − A data mining query language that allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.
• Presentation and visualization of data mining results − Once patterns are discovered, they need to be expressed in high-level languages and visual representations that are easily understandable.
• Handling noisy or incomplete data − Data cleaning methods are required to handle noise and incomplete objects while mining data regularities. Without them, the accuracy of the discovered patterns will be poor.
• Pattern evaluation − Discovered patterns must be evaluated for interestingness, because many of them merely represent common knowledge or lack novelty.


Performance Issues
There can be performance-related issues such as the following −
• Efficiency and scalability of data mining algorithms − In order to effectively extract information from the huge amounts of data in databases, data mining algorithms must be efficient and scalable.
• Parallel, distributed, and incremental mining algorithms − Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions, which are processed in parallel; the results from the partitions are then merged. Incremental algorithms incorporate database updates without mining the data again from scratch.
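The partition, process, and merge idea, followed by an incremental update, can be sketched with simple item counting. This is a toy stand-in for a real mining algorithm: the `partitions` data and the helper names are hypothetical, and a thread pool is assumed here only to show the partitions being handled independently.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def mine_partition(transactions):
    """Count item occurrences within one partition of the data."""
    return Counter(item for transaction in transactions for item in transaction)

def merge(partial_counts):
    """Merge the per-partition results into one overall count."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total

# Invented transactions, already split into two partitions.
partitions = [
    [["bread", "milk"], ["bread"]],
    [["milk"], ["bread", "butter"]],
]

# Each partition is processed independently, then the results are merged.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(mine_partition, partitions))
counts = merge(partials)

# Incremental step: fold in a new batch without recounting the old data.
new_batch = [["butter", "milk"]]
counts.update(mine_partition(new_batch))
print(counts)
```

Counting merges cleanly because partial counts simply add up; many real mining algorithms need a more careful merge step.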
Diverse Data Types Issues
• Handling of relational and complex types of data − The database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of data.
• Mining information from heterogeneous databases and global information systems − The data is available at different data sources on a LAN or WAN. These data sources may be structured, semi-structured, or unstructured. Therefore, mining knowledge from them adds challenges to data mining.
PROBLEMS IN DATA MINING :
1. Poor data quality, such as noisy data, dirty data, missing values, inexact or incorrect values, inadequate data size, and poor representation in data sampling.
2. Integrating conflicting or redundant data from different sources and forms: multimedia files (audio, video, and images), geo data, text, social, numeric, etc.
3. Proliferation of security and privacy concerns by individuals, organizations, and governments.
4. Unavailability of data or difficult access to data.
5. Efficiency and scalability of data mining algorithms to effectively extract information from huge amounts of data in databases.
6. Dealing with huge datasets that require distributed approaches.
7. Dealing with non-static, unbalanced, and cost-sensitive data.
8. Mining information from heterogeneous databases and global information systems.
9. Constant updating of models to handle data velocity or new incoming data.
10. High cost of buying and maintaining powerful software, servers, and storage hardware that handle large amounts of data.
11. Processing of large, complex, and unstructured data into a structured format.
12. Sheer quantity of output from many data mining methods.
CHALLENGES IN DATA MINING:
1. Data Quality
The quality of data used in data mining is one of the most
significant challenges. The accuracy, completeness, and
consistency of the data affect the accuracy of the results
obtained.
The data may contain errors, omissions, duplications, or
inconsistencies, which may lead to inaccurate results.
Moreover, the data may be incomplete, meaning that
some attributes or values are missing, making it
challenging to obtain a complete understanding of the
data.
Data quality issues can arise due to a variety of reasons,
including data entry errors, data storage issues, data
integration problems, and data transmission errors.
To address these challenges, data mining practitioners
must apply data cleaning and data preprocessing
techniques to improve the quality of the data.
Data cleaning involves detecting and correcting errors,
while data preprocessing involves transforming the data
to make it suitable for data mining.
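A minimal sketch of both steps follows: cleaning removes duplicate records and fills missing values, and preprocessing rescales a numeric attribute. The record layout, the mean-imputation choice, and the helper names are assumptions made for illustration only.

```python
# Invented records with a duplicate entry and a missing age.
records = [
    {"id": 1, "age": 30},
    {"id": 2, "age": None},
    {"id": 2, "age": None},  # duplicate of the previous record
    {"id": 3, "age": 50},
]

def clean(records):
    """Drop duplicate ids and replace missing ages with the mean known age."""
    seen, unique = set(), []
    for record in records:
        if record["id"] not in seen:
            seen.add(record["id"])
            unique.append(dict(record))
    known = [r["age"] for r in unique if r["age"] is not None]
    mean_age = sum(known) / len(known)
    for r in unique:
        if r["age"] is None:
            r["age"] = mean_age
    return unique

def min_max_scale(values):
    """Rescale values to the range 0..1 (all zeros if the values are identical)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

cleaned = clean(records)
scaled_ages = min_max_scale([r["age"] for r in cleaned])
print(cleaned, scaled_ages)
```

Filling gaps with the mean is only one option; depending on the data, dropping incomplete records or using a model-based estimate may be more appropriate.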
2. Data Complexity
Data complexity refers to the vast amounts of data
generated by various sources, such as sensors, social
media, and the internet of things (IoT).
The complexity of the data may make it challenging to
process, analyze, and understand. In addition, the data
may be in different formats, making it challenging to
integrate into a single dataset.
To address this challenge, data mining practitioners use
advanced techniques such as clustering, classification, and
association rule mining. These techniques help to identify
patterns and relationships in the data, which can then be
used to gain insights and make predictions.
3. Data Privacy and Security
Data privacy and security are another significant challenge in data mining. As more data is collected, stored, and analyzed, the risk of data breaches and cyber-attacks increases.
The data may contain personal, sensitive, or confidential
information that must be protected.
Moreover, data privacy regulations such as GDPR,
CCPA, and HIPAA impose strict rules on how data can
be collected, used, and shared.
To address this challenge, data mining practitioners must
apply data anonymization and data encryption
techniques to protect the privacy and security of the data.
Data anonymization involves removing personally
identifiable information (PII) from the data, while data
encryption involves using algorithms to encode the data to
make it unreadable to unauthorized users.
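Removing PII can be sketched as dropping the identifying fields from each record before analysis. The field names in `PII_FIELDS` and the sample record are hypothetical, and encryption is deliberately not shown because it should rely on a vetted cryptography library rather than hand-written code.

```python
# Hypothetical set of fields treated as personally identifiable information.
PII_FIELDS = {"name", "email", "phone"}

def anonymize(record, pii_fields=PII_FIELDS):
    """Return a copy of `record` without the personally identifiable fields."""
    return {key: value for key, value in record.items() if key not in pii_fields}

patient = {
    "name": "Jane Doe",
    "email": "jane@example.com",
    "age": 41,
    "diagnosis": "flu",
}
print(anonymize(patient))
```

Dropping obvious identifiers is a first step only: combinations of remaining attributes can still identify a person, so real projects typically add further safeguards.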
4. Scalability
Data mining algorithms must be scalable to handle large
datasets efficiently. As the size of the dataset increases,
the time and computational resources required to perform
data mining operations also increase.
Moreover, the algorithms must be able to handle
streaming data, which is generated continuously and must
be processed in real-time.
To address this challenge, data mining practitioners use
distributed computing frameworks such as Hadoop and
Spark.
These frameworks distribute the data and processing
across multiple nodes, making it possible to process large
datasets quickly and efficiently.
5. Interpretability
Data mining algorithms can produce complex models that
are difficult to interpret. This is because the algorithms
use a combination of statistical and mathematical
techniques to identify patterns and relationships in the
data.
Moreover, the models may not be intuitive, making
it challenging to understand how the model arrived
at a particular conclusion.
To address this challenge, data mining practitioners use
visualization techniques to represent the data and the
models visually.
Visualization makes it easier to understand the patterns
and relationships in the data and to identify the most
important variables.
6. Ethics
Data mining raises ethical concerns related to the
collection, use, and dissemination of data. The data may
be used to discriminate against certain groups, violate
privacy rights, or perpetuate existing biases.
Moreover, data mining algorithms may not be transparent,
making it challenging to detect biases or discrimination.
DATA MINING APPLICATIONS

• Scientific Analysis: Scientific simulations generate large volumes of data every day. This includes data collected from nuclear laboratories, data about human psychology, etc. Data mining techniques are capable of analyzing these data. We can now capture and store new data faster than we can analyze the old data already accumulated. Examples of scientific analysis:
  • Sequence analysis in bioinformatics
  • Classification of astronomical objects
  • Medical decision support

• Intrusion Detection: A network intrusion refers to any unauthorized activity on a digital network. Network intrusions often involve stealing valuable network resources. Data mining techniques play a vital role in intrusion detection and in finding network attacks and anomalies. These techniques help in selecting and refining useful and relevant information from large data sets, and in classifying relevant data for an Intrusion Detection System. An Intrusion Detection System generates alarms about foreign invasions detected in the network traffic. For example:
  • Detecting security violations
  • Misuse detection
  • Anomaly detection

• Business Transactions: Every transaction in the business industry is recorded in perpetuity. Such transactions are usually time-related and can be inter-business deals or intra-business operations. The effective and timely use of this data for competitive decision-making is the most important problem to solve for businesses that struggle to survive in a highly competitive world. Data mining helps to analyze these business transactions and to inform marketing approaches and decision-making. Examples:
  • Direct mail targeting
  • Stock trading
  • Customer segmentation

• Market Basket Analysis: Market Basket Analysis is a technique that involves the careful study of purchases made by a customer in a supermarket. It identifies patterns of items frequently purchased by customers. This analysis can help companies promote deals, offers, and sales, and data mining techniques help to accomplish this analysis task. Examples:
  • Data mining concepts are used in sales and marketing to provide better customer service, improve cross-selling opportunities, and increase direct mail response rates.
  • Customer retention, in the form of pattern identification and prediction of likely defections, is possible with data mining.
  • Risk assessment and fraud detection also use data mining to identify inappropriate or unusual behavior.

• Education: For analyzing the education sector, data mining uses the Educational Data Mining (EDM) method. This method generates patterns that can be used by both learners and educators. Using EDM, we can perform educational tasks such as:
  • Predicting student admission to higher education
  • Student profiling
  • Predicting student performance
  • Evaluating teachers' teaching performance
  • Curriculum development
  • Predicting student placement opportunities

• Research: Data mining techniques can perform prediction, classification, clustering, association, and grouping of data in research settings. Rules generated by data mining are used to derive results. In most technical research in data mining, we create a training model and a testing model. The training/testing model is a strategy to measure the accuracy of the proposed model. It is called Train/Test because we split the data set into two sets: a training data set and a testing data set. The training data set is used to build the model, whereas the testing data set is used to evaluate it. Examples:
  • Classification of uncertain data
  • Information-based clustering
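The Train/Test strategy described under Research can be sketched as a shuffled split of the data set. The 25% test fraction, the fixed seed, and the `train_test_split` helper written here are assumptions for illustration, not values given in these notes.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and split them into a training set and a testing set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(20))  # stand-in for 20 data records
train, test = train_test_split(rows)
print(len(train), len(test))
```

Keeping the two sets disjoint matters: a model evaluated on records it was trained on will appear more accurate than it really is.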

• Healthcare and Insurance: A pharmaceutical company can examine its recent sales force activity and its outcomes to improve the targeting of high-value physicians and determine which marketing activities will have the greatest effect in the upcoming months. In the insurance sector, data mining can help to predict which customers will buy new policies, identify behavior patterns of risky customers, and identify fraudulent behavior of customers.

• Financial/Banking Sector: A credit card company can leverage its vast warehouse of customer transaction data to identify customers most likely to be interested in a new credit product. Examples:
  • Credit card fraud detection
  • Identifying 'loyal' customers
  • Extraction of information related to customers
  • Determining credit card spending by customer groups
