
Unit - 2

Data preprocessing:

Data preprocessing is an important step in the data mining process. It refers to the cleaning,
transforming, and integrating of data in order to make it ready for analysis. The goal of data
preprocessing is to improve the quality of the data and to make it more suitable for the specific
data mining task.

Some common steps in data preprocessing include:

Data Cleaning: This involves identifying and correcting errors or inconsistencies in the data,
such as missing values, outliers, and duplicates. Various techniques can be used for data
cleaning, such as imputation, removal, and transformation.
Data Integration: This involves combining data from multiple sources to create a unified
dataset. Data integration can be challenging as it requires handling data with different formats,
structures, and semantics. Techniques such as record linkage and data fusion can be used for
data integration.
Data Transformation: This involves converting the data into a suitable format for analysis.
Common techniques used in data transformation include normalization, standardization, and
discretization. Normalization is used to scale the data to a common range, while
standardization is used to transform the data to have zero mean and unit variance.
Discretization is used to convert continuous data into discrete categories.
Data Reduction: This involves reducing the size of the dataset while preserving the important
information. Data reduction can be achieved through techniques such as feature selection and
feature extraction. Feature selection involves selecting a subset of relevant features from the
dataset, while feature extraction involves transforming the data into a lower-dimensional space
while preserving the important information.
Data Discretization: This involves dividing continuous data into discrete categories or
intervals. Discretization is often used in data mining and machine learning algorithms that
require categorical data. Discretization can be achieved through techniques such as equal width
binning, equal frequency binning, and clustering.
Data Normalization: This involves scaling the data to a common range, such as between 0
and 1 or -1 and 1. Normalization is often used to handle data with different units and scales.
Common normalization techniques include min-max normalization, z-score normalization, and
decimal scaling.
Data preprocessing plays a crucial role in ensuring the quality of the data and the accuracy of the analysis results. The specific steps involved in data preprocessing may vary depending on the nature of the data and the analysis goals. By performing these steps, the data mining process becomes more efficient and the results become more accurate.
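The steps above can be chained together in practice. The following is a minimal sketch, not a prescribed method, using pandas and scikit-learn on an invented two-column DataFrame: missing values are imputed (cleaning), values are scaled to [0, 1] (normalization), and then placed into three equal-width bins (discretization).

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, KBinsDiscretizer

# Toy data with missing values; column names are illustrative only.
df = pd.DataFrame({"age": [23, 45, None, 37, 29, 61],
                   "income": [40000, 85000, 52000, None, 47000, 90000]})

# Cleaning (mean imputation), transformation (min-max scaling),
# and discretization (3 equal-width bins) chained in one pipeline.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", MinMaxScaler()),
    ("discretize", KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")),
])
print(pipe.fit_transform(df))
```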

Data Cleaning in Data Mining

Data cleaning is a crucial process in data mining and plays an important part in building a model. Although it is a necessary step, it is often neglected. Data quality is the central issue in quality information management: data quality problems can occur anywhere in an information system, and data cleaning is how they are addressed.

Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. If the data is incorrect, outcomes and algorithms are unreliable, even though they may look correct. When combining multiple data sources, there are many opportunities for data to be duplicated or mislabeled.

Generally, data cleaning reduces errors and improves data quality. Correcting errors and eliminating bad records can be time-consuming and tedious, but it cannot be ignored. Data mining itself is a key technique for data cleaning: it discovers interesting information in data, and data quality mining is a recent approach that applies data mining techniques to identify and resolve data quality problems in large databases. Because data mining automatically extracts hidden, intrinsic information from collections of data, several of its techniques are well suited to data cleaning.

Understanding and correcting the quality of your data is imperative in getting to an accurate final
analysis. The data needs to be prepared to discover crucial patterns. Data mining is considered
exploratory. Data cleaning in data mining allows the user to discover inaccurate or incomplete
data before the business analysis and insights.

Steps of Data Cleaning

While the techniques used for data cleaning may vary according to the types of data your company stores, you can follow these basic steps to clean your data:

1. Remove duplicate or irrelevant observations

Remove unwanted observations from your dataset, including duplicate observations or irrelevant
observations. Duplicate observations will happen most often during data collection. When you
combine data sets from multiple places, scrape data, or receive data from clients or multiple
departments, there are opportunities to create duplicate data. De-duplication is one of the largest
areas to be considered in this process. Irrelevant observations are those that do not fit the specific problem you are trying to analyze.

For example, if you want to analyze data regarding millennial customers, but your dataset
includes older generations, you might remove those irrelevant observations. This can make analysis more efficient, minimize distraction from your primary target, and create a more manageable, better-performing dataset.

2. Fix structural errors

Structural errors are when you measure or transfer data and notice strange naming conventions,
typos, or incorrect capitalization. These inconsistencies can cause mislabeled categories or
classes. For example, you may find both "N/A" and "Not Applicable" in the same dataset, but they should be analyzed as a single category.

3. Filter unwanted outliers

Often, there will be one-off observations where, at a glance, they do not appear to fit within the
data you are analyzing. If you have a legitimate reason to remove an outlier, like improper data
entry, doing so will help the performance of the data you are working with.

However, sometimes, the appearance of an outlier will prove a theory you are working on. And
just because an outlier exists doesn't mean it is incorrect. This step is needed to determine the
validity of that number. If an outlier proves to be irrelevant for analysis or is a mistake, consider
removing it.
4. Handle missing data

You can't ignore missing data, because many algorithms will not accept missing values. There are a few ways to deal with missing data. None of them is optimal, but all can be considered (a small sketch follows this list):

o You can drop observations with missing values, but this drops or loses information, so be careful before removing them.
o You can impute missing values based on other observations; again, there is an opportunity to lose the integrity of the data, because you may be operating from assumptions rather than actual observations.
o You might alter how the data is used so that null values are navigated effectively.
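A minimal sketch of the first two options, using pandas on a hypothetical DataFrame with missing values (column names and figures are invented for illustration):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [25, np.nan, 31, 45],
                   "salary": [50000, 60000, np.nan, 80000]})

dropped = df.dropna()                             # option 1: drop rows with missing values
imputed = df.fillna(df.mean(numeric_only=True))   # option 2: impute with the column mean
print(dropped)
print(imputed)
```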

5. Validate and QA

At the end of the data cleaning process, you should be able to answer these questions as a part of
basic validation, such as:

o Does the data make sense?


o Does the data follow the appropriate rules for its field?
o Does it prove or disprove your working theory or bring any insight to light?
o Can you find trends in the data that help you form your next theory?
o If not, is that because of a data quality issue?

Incorrect or noisy data can lead to false conclusions that inform poor business strategy and decision-making. False conclusions can also lead to an embarrassing moment in a reporting meeting when you realize your data doesn't stand up to scrutiny. Before you get there, it is important to create a culture of quality data in your organization, and to document the tools you might use to create that culture.

Methods of Data Cleaning

There are many data cleaning methods through which the data should be run. The methods are
described below:
1. Ignore the tuples: This method is not very feasible, as it is only useful when a tuple has several attributes with missing values.
2. Fill in the missing value: This approach is also not always effective or feasible, and it can be time-consuming. The missing value is usually filled in manually, but it can also be filled with the attribute mean or the most probable value.
3. Binning method: This approach is simple to understand. The sorted data is divided into several segments (bins) of equal size, and each value is then smoothed using the values around it, for example by replacing it with the bin mean (see the sketch after this list).
4. Regression: The data is smoothed by fitting it to a regression function. The regression can be linear or multiple: linear regression has one independent variable, while multiple regression has more than one.
5. Clustering: This method operates on groups. Similar values are arranged into "groups" or "clusters", and values that fall outside the clusters are detected as outliers.
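A minimal sketch of the binning method, assuming smoothing by bin means over three equal-size bins (the values are invented for illustration):

```python
import numpy as np

values = np.sort(np.array([4, 8, 15, 21, 21, 24, 25, 28, 34]))
bins = np.array_split(values, 3)                      # three equal-size (equal-frequency) bins
smoothed = [np.full(len(b), b.mean()) for b in bins]  # replace each value by its bin mean
print(np.concatenate(smoothed))
```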
Data Integration in Data Mining

Data integration in data mining refers to the process of combining data from multiple sources into a single, unified view. This can involve cleaning and transforming the data, as well as resolving any inconsistencies or conflicts that may exist between the different sources. The goal of data integration is to make the data more useful and meaningful for the purposes of analysis and decision making. Techniques used in data integration include data warehousing, ETL (extract, transform, load) processes, and data federation.

Data Integration is a data preprocessing technique that combines data from multiple
heterogeneous data sources into a coherent data store and provides a unified view of the data.
These sources may include multiple data cubes, databases, or flat files.
Data integration approaches are formally defined as a triple <G, S, M>, where:
G stands for the global schema,
S stands for the heterogeneous source schemas,
M stands for the mappings between queries over the source and global schemas.
What is data integration?
Data integration is the process of combining data from multiple sources into a cohesive and
consistent view. This process involves identifying and accessing the different data sources,
mapping the data to a common format, and reconciling any inconsistencies or discrepancies
between the sources. The goal of data integration is to make it easier to access and analyze data
that is spread across multiple systems or platforms, in order to gain a more complete and
accurate understanding of the data.
Data integration can be challenging due to the variety of data formats, structures, and
semantics used by different data sources. Different data sources may use different data types,
naming conventions, and schemas, making it difficult to combine the data into a single view.
Data integration typically involves a combination of manual and automated processes,
including data profiling, data mapping, data transformation, and data reconciliation.
Data integration is used in a wide range of applications, such as business intelligence, data
warehousing, master data management, and analytics. Data integration can be critical to the
success of these applications, as it enables organizations to access and analyze data that is
spread across different systems, departments, and lines of business, in order to make better
decisions, improve operational efficiency, and gain a competitive advantage.
There are two major approaches to data integration: the "tight coupling approach" and the "loose coupling approach".

Tight Coupling:
This approach involves creating a centralized repository or data warehouse to store the
integrated data. The data is extracted from various sources, transformed and loaded into a data
warehouse. Data is integrated in a tightly coupled manner, meaning that the data is integrated
at a high level, such as at the level of the entire dataset or schema. This approach is also known
as data warehousing, and it enables data consistency and integrity, but it can be inflexible and
difficult to change or update.
 Here, a data warehouse is treated as an information retrieval component.
 In this coupling, data is combined from different sources into a single physical location
through the process of ETL – Extraction, Transformation, and Loading.

Loose Coupling:
This approach involves integrating data at the lowest level, such as at the level of individual
data elements or records. Data is integrated in a loosely coupled manner, meaning that the data
is integrated at a low level, and it allows data to be integrated without having to create a central
repository or data warehouse. This approach is also known as data federation, and it enables
data flexibility and easy updates, but it can be difficult to maintain consistency and integrity
across multiple data sources.
 Here, an interface is provided that takes the query from the user, transforms it in a way the
source database can understand, and then sends the query directly to the source databases to
obtain the result.
 And the data only remains in the actual source databases.
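The contrast between the two approaches can be sketched with two invented in-memory "sources"; real systems would use ETL tools or query mediators, so this is only an illustration of the idea:

```python
import pandas as pd

# Two hypothetical source systems.
crm = pd.DataFrame({"cust_id": [1, 2], "name": ["Asha", "Ravi"]})
billing = pd.DataFrame({"cust_id": [1, 2], "amount": [120.0, 75.5]})

# Tight coupling: extract, transform, and load into one materialized store.
warehouse = crm.merge(billing, on="cust_id")

# Loose coupling (federation): leave data at the sources and answer queries on demand.
def federated_lookup(cust_id):
    name = crm.loc[crm.cust_id == cust_id, "name"].iloc[0]
    amount = billing.loc[billing.cust_id == cust_id, "amount"].iloc[0]
    return {"name": name, "amount": amount}

print(warehouse)
print(federated_lookup(2))
```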
Issues in Data Integration:
There are several issues that can arise when integrating data from multiple sources, including:
1. Data Quality: Inconsistencies and errors in the data can make it difficult to combine and
analyze.
2. Data Semantics: Different sources may use different terms or definitions for the same
data, making it difficult to combine and understand the data.
3. Data Heterogeneity: Different sources may use different data formats, structures, or
schemas, making it difficult to combine and analyze the data.
4. Data Privacy and Security: Protecting sensitive information and maintaining security can
be difficult when integrating data from multiple sources.
5. Scalability: Integrating large amounts of data from multiple sources can be
computationally expensive and time-consuming.
6. Data Governance: Managing and maintaining the integration of data from multiple
sources can be difficult, especially when it comes to ensuring data accuracy, consistency,
and timeliness.
7. Performance: Integrating data from multiple sources can also affect the performance of
the system.
8. Integration with existing systems: Integrating new data sources with existing systems can
be a complex task, requiring significant effort and resources.
9. Complexity: The complexity of integrating data from multiple sources can be high,
requiring specialized skills and knowledge.

There are three issues to consider during data integration: Schema Integration,
Redundancy Detection, and resolution of data value conflicts. These are explained in
brief below.

1. Schema Integration:
 Integrate metadata from different sources.
 Matching real-world entities from multiple sources is referred to as the entity identification problem.

2. Redundancy Detection:
 An attribute may be redundant if it can be derived or obtained from another attribute or set
of attributes.
 Inconsistencies in attributes can also cause redundancies in the resulting data set.
 Some redundancies can be detected by correlation analysis (see the sketch after this list).

3. Resolution of data value conflicts:


 This is the third critical issue in data integration.
 Attribute values from different sources may differ for the same real-world entity.
 An attribute in one system may be recorded at a lower level of abstraction than the “same”
attribute in another.
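A minimal sketch of redundancy detection by correlation analysis, using an invented DataFrame in which one attribute is directly derivable from another; the 0.98 threshold is an arbitrary choice:

```python
import pandas as pd

df = pd.DataFrame({"height_cm": [150, 160, 170, 180],
                   "height_in": [59.1, 63.0, 66.9, 70.9],   # derived from height_cm
                   "weight_kg": [55, 70, 60, 81]})

corr = df.corr()
threshold = 0.98
# Flag attribute pairs whose absolute correlation exceeds the threshold.
redundant = [(a, b) for a in corr.columns for b in corr.columns
             if a < b and abs(corr.loc[a, b]) > threshold]
print(redundant)   # expected: [('height_cm', 'height_in')]
```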
Data Reduction:

Data reduction is a technique used in data mining to reduce the size of a dataset while still
preserving the most important information. This can be beneficial in situations where the
dataset is too large to be processed efficiently, or where the dataset contains a large amount of
irrelevant or redundant information.

There are several different data reduction techniques that can be used in data mining,
including:

1. Data Sampling: This technique involves selecting a subset of the data to work with, rather
than using the entire dataset. This can be useful for reducing the size of a dataset while still
preserving the overall trends and patterns in the data.
2. Dimensionality Reduction: This technique involves reducing the number of features in the
dataset, either by removing features that are not relevant or by combining multiple features
into a single feature.
3. Data Compression: This technique involves using techniques such as lossy or lossless
compression to reduce the size of a dataset.
4. Data Discretization: This technique involves converting continuous data into discrete data
by partitioning the range of possible values into intervals or bins.
5. Feature Selection: This technique involves selecting a subset of features from the dataset
that are most relevant to the task at hand.
It's important to note that data reduction involves a trade-off between the size of the data and the accuracy of the results: the more aggressively the data is reduced, the more information may be lost, which can make the resulting model less accurate and less generalizable.
In conclusion, data reduction is an important step in data mining, as it can help to improve the
efficiency and performance of machine learning algorithms by reducing the size of the dataset.
However, it is important to be aware of the trade-off between the size and accuracy of the data,
and carefully assess the risks and benefits before implementing it.
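Data sampling (technique 1 above) is often the simplest reduction to apply. A minimal sketch with pandas, drawing a 10% random sample from an invented DataFrame:

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=10_000),
                   "y": rng.integers(0, 5, size=10_000)})

sample = df.sample(frac=0.1, random_state=0)   # keep 10% of the rows
print(len(df), "->", len(sample))
```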

Methods of data reduction:
These are explained as following below.

1. Data Cube Aggregation:
This technique aggregates data into a simpler form. For example, imagine that the information you gathered for your analysis covers the years 2012 to 2014 and includes your company's revenue every three months. If you are interested in annual sales rather than quarterly figures, you can summarize the data so that it shows the total sales per year instead of per quarter (see the sketch below).
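A minimal sketch of this roll-up with pandas; the quarterly revenue figures are invented:

```python
import pandas as pd

quarterly = pd.DataFrame({
    "year":    [2012, 2012, 2012, 2012, 2013, 2013, 2013, 2013],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "revenue": [220, 240, 260, 300, 250, 270, 290, 330],
})

# Aggregate quarterly revenue up to annual totals.
annual = quarterly.groupby("year", as_index=False)["revenue"].sum()
print(annual)
```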

2. Dimension reduction:
Whenever we come across data that is only weakly relevant, we keep just the attributes required for our analysis. This reduces the data size by eliminating outdated or redundant features.
 Step-wise Forward Selection –
The selection begins with an empty set of attributes; at each step, the best of the remaining original attributes is added to the set based on its relevance (for example, assessed with a statistical significance measure such as a p-value).

Suppose a data set has the following attributes, of which a few are redundant.
Initial attribute Set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: { }

Step-1: {X1}
Step-2: {X1, X2}
Step-3: {X1, X2, X5}

Final reduced attribute set: {X1, X2, X5}

 Step-wise Backward Selection –
This selection starts with the complete set of attributes in the original data and, at each step, eliminates the worst remaining attribute from the set.
Suppose a data set has the following attributes, of which a few are redundant.
Initial attribute Set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: {X1, X2, X3, X4, X5, X6}

Step-1: {X1, X2, X3, X4, X5}
Step-2: {X1, X2, X3, X5}
Step-3: {X1, X2, X5}

Final reduced attribute set: {X1, X2, X5}

 Combination of Forward and Backward Selection –
It allows us to select the best attributes and remove the worst ones at the same time, saving time and making the process faster (see the sketch below).
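A minimal sketch of step-wise selection using scikit-learn's SequentialFeatureSelector; the estimator, data set, and target of two selected features are stand-ins chosen for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
est = LogisticRegression(max_iter=1000)

# Forward selection starts empty and adds attributes; backward starts full and removes them.
forward = SequentialFeatureSelector(est, n_features_to_select=2, direction="forward").fit(X, y)
backward = SequentialFeatureSelector(est, n_features_to_select=2, direction="backward").fit(X, y)
print(forward.get_support(), backward.get_support())
```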

3. Data Compression:
Data compression reduces the size of files using different encoding mechanisms (e.g., Huffman encoding and run-length encoding). It can be divided into two types based on the compression technique used; a small sketch follows these two types.
 Lossless Compression –
Encoding techniques such as run-length encoding provide simple, modest reductions in data size. Lossless compression uses algorithms that restore the precise original data from the compressed data.
 Lossy Compression –
Methods such as the discrete wavelet transform and PCA (principal component analysis) are examples of this type of compression. For example, the JPEG image format uses lossy compression, yet the result remains visually equivalent to the original image. In lossy compression, the decompressed data may differ from the original data but is still useful enough to retrieve information from.
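A minimal sketch contrasting the two types: zlib gives lossless compression (exact recovery), while keeping only two PCA components gives a lossy, lower-dimensional approximation. The data here is synthetic and purely illustrative.

```python
import zlib
import numpy as np
from sklearn.decomposition import PCA

# Lossless: the original bytes are restored exactly.
text = b"data mining, data mining, data mining" * 10
packed = zlib.compress(text)
assert zlib.decompress(packed) == text

# Lossy: reconstruct from a reduced number of principal components.
X = np.random.default_rng(0).normal(size=(100, 5))
pca = PCA(n_components=2).fit(X)
X_approx = pca.inverse_transform(pca.transform(X))
print(len(text), "->", len(packed),
      "| mean reconstruction error:", np.abs(X - X_approx).mean().round(3))
```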

4. Numerosity Reduction:
In this technique, the actual data is replaced with a mathematical model or a smaller representation of the data, so that only the model parameters need to be stored (parametric methods), or with non-parametric representations such as clustering, histograms, and sampling.

5. Discretization & Concept Hierarchy Operation:
Data discretization techniques are used to divide continuous attributes into intervals. Many continuous values of an attribute are replaced by labels of small intervals, so that mining results can be presented in a concise, easily understandable way.
 Top-down discretization –
If you first choose one or a few points (so-called breakpoints or split points) to divide the whole range of values and then repeat this recursively on the resulting intervals, the process is known as top-down discretization, also called splitting.
 Bottom-up discretization –
If you first consider all the continuous values as potential split points and then discard some of them by merging neighboring values into intervals, the process is called bottom-up discretization, also called merging.

Concept Hierarchies:
A concept hierarchy reduces the data size by collecting and replacing low-level concepts (such as the age value 43) with higher-level concepts (categorical values such as middle age or senior).
For numeric data, the following techniques can be used (a short sketch follows this list):
 Binning –
Binning is the process of changing numerical variables into categorical counterparts. The
number of categorical counterparts depends on the number of bins specified by the user.
 Histogram analysis –
Like binning, a histogram is used to partition the values of an attribute X into disjoint ranges called buckets. There are several partitioning rules:
1. Equal Frequency partitioning: partition the values so that each bucket contains roughly the same number of occurrences from the data set.
2. Equal Width partitioning: partition the values into buckets of a fixed width determined by the number of bins (e.g., a range of values from 0-20 split into equal-width intervals).
3. Clustering: group similar data together.
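A minimal sketch of equal-width and equal-frequency partitioning, plus a simple concept hierarchy mapping ages to labels; the values, bin counts, and label boundaries are all invented for illustration:

```python
import pandas as pd

ages = pd.Series([43, 21, 25, 34, 52, 67, 18, 45, 60, 71])

equal_width = pd.cut(ages, bins=3)    # equal-width partitioning
equal_freq = pd.qcut(ages, q=3)       # equal-frequency partitioning
hierarchy = pd.cut(ages, bins=[0, 30, 55, 120],
                   labels=["young", "middle age", "senior"])
print(pd.DataFrame({"age": ages, "width": equal_width,
                    "freq": equal_freq, "level": hierarchy}))
```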

ADVANTAGES OR DISADVANTAGES OF Data Reduction in Data Mining:

Data reduction in data mining can have a number of advantages and disadvantages.

Advantages:

1. Improved efficiency: Data reduction can help to improve the efficiency of machine
learning algorithms by reducing the size of the dataset. This can make it faster and more
practical to work with large datasets.
2. Improved performance: Data reduction can help to improve the performance of machine
learning algorithms by removing irrelevant or redundant information from the dataset. This
can help to make the model more accurate and robust.
3. Reduced storage costs: Data reduction can help to reduce the storage costs associated with
large datasets by reducing the size of the data.
4. Improved interpretability: Data reduction can help to improve the interpretability of the
results by removing irrelevant or redundant information from the dataset.

Disadvantages:

1. Loss of information: Data reduction can result in a loss of information, if important data is
removed during the reduction process.
2. Impact on accuracy: Data reduction can impact the accuracy of a model, as reducing the
size of the dataset can also remove important information that is needed for accurate
predictions.
3. Impact on interpretability: Data reduction can make it harder to interpret the results, as
removing irrelevant or redundant information can also remove context that is needed to
understand the results.
4. Additional computational costs: Data reduction can add additional computational costs to
the data mining process, as it requires additional processing time to reduce the data.
In conclusion, data reduction can have both advantages and disadvantages. It can improve the efficiency and performance of machine learning algorithms by reducing the size of the dataset. However, it can also result in a loss of information and make it harder to interpret the results. It's important to weigh the pros and cons of data reduction and carefully assess the risks and benefits before implementing it.

Data Transformation:

INTRODUCTION:
Data transformation in data mining refers to the process of converting raw data into a format
that is suitable for analysis and modeling. The goal of data transformation is to prepare the data
for data mining so that it can be used to extract useful insights and knowledge. Data
transformation typically involves several steps, including:
1. Data cleaning: Removing or correcting errors, inconsistencies, and missing values in the
data.
2. Data integration: Combining data from multiple sources, such as databases and
spreadsheets, into a single format.
3. Data normalization: Scaling the data to a common range of values, such as between 0 and
1, to facilitate comparison and analysis.
4. Data reduction: Reducing the dimensionality of the data by selecting a subset of relevant
features or attributes.
5. Data discretization: Converting continuous data into discrete categories or bins.
6. Data aggregation: Combining data at different levels of granularity, such as by summing
or averaging, to create new features or attributes.
Data transformation is an important step in the data mining process, as it helps to ensure that the data is in a format suitable for analysis and modeling and that it is free of errors and inconsistencies. Data transformation can also help to improve the performance of data mining algorithms by reducing the dimensionality of the data and by scaling the data to a common range of values.
The data is transformed in ways that make it suitable for mining. Data transformation involves the following steps:
1. Smoothing: Smoothing is a process used to remove noise from the dataset using certain algorithms. It helps highlight the important features present in the dataset and assists in predicting patterns. When collecting data, it can be manipulated to eliminate or reduce variance and other forms of noise. The idea behind data smoothing is that it identifies simple changes that help predict different trends and patterns. This is helpful to analysts or traders who need to look at a lot of data, which can often be difficult to digest, to find patterns they would not otherwise see.
2. Aggregation: Data collection or aggregation is the method of storing and presenting data in
a summary format. The data may be obtained from multiple data sources to integrate these data
sources into a data analysis description. This is a crucial step since the accuracy of data
analysis insights is highly dependent on the quantity and quality of the data used. Gathering
accurate data of high quality and a large enough quantity is necessary to produce relevant
results. Aggregated data is useful for everything from decisions concerning financing or the business strategy of a product to pricing, operations, and marketing strategies. For example, sales data may be aggregated to compute monthly and annual total amounts.

3. Discretization: This is a process of transforming continuous data into a set of small intervals. Most real-world data mining activities involve continuous attributes, yet many existing data mining frameworks are unable to handle them. Even when a data mining task can manage a continuous attribute, its efficiency can be significantly improved by replacing the continuous attribute with its discrete values. For example, values can be grouped into intervals (1-10, 11-20) or age categories (young, middle age, senior).
4. Attribute Construction: New attributes are created from the given set of attributes and applied to assist the mining process. This simplifies the original data and makes mining more efficient.

5. Generalization: This converts low-level data attributes to high-level data attributes using a concept hierarchy. For example, age values initially in numerical form (22, 25) are converted into categorical values (young, old), and categorical attributes such as house addresses may be generalized to higher-level concepts such as town or country.
6. Normalization: Data normalization involves converting all data variables into a given
range. Techniques that are used for normalization are:

Min-Max Normalization:
 This transforms the original data linearly.
 Suppose min_A is the minimum and max_A is the maximum value of an attribute A, and [new_min_A, new_max_A] is the new range.
 A value v of attribute A is normalized to v' by computing:
v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
 Here v is the value you want to map into the new range, and v' is the new value you get after normalizing the old value.
Z-Score Normalization:
 In z-score normalization (or zero-mean normalization), the values of an attribute A are normalized based on the mean of A and its standard deviation.
 A value v of attribute A is normalized to v' by computing:
v' = (v - mean_A) / std_A
Decimal Scaling:
 This normalizes the values of an attribute by changing the position of their decimal point.
 The number of positions by which the decimal point is moved is determined by the maximum absolute value of attribute A.
 A value v of attribute A is normalized to v' by computing:
v' = v / 10^j
 where j is the smallest integer such that Max(|v'|) < 1.
 Suppose the values of an attribute P vary from -99 to 99. The maximum absolute value of P is 99. To normalize the values, we divide each number by 100 (i.e., j = 2, the number of digits in the largest absolute value), so the values become 0.98, 0.97, and so on.
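A minimal sketch of the three techniques above, applied with numpy to a handful of invented values drawn from the -99 to 99 range used in the decimal-scaling example:

```python
import numpy as np

v = np.array([-99, -45, 0, 37, 99], dtype=float)

# Min-max normalization to the new range [0, 1].
min_max = (v - v.min()) / (v.max() - v.min())

# Z-score normalization: zero mean, unit variance.
z_score = (v - v.mean()) / v.std()

# Decimal scaling: divide by 10^j, with j the smallest integer such that max(|v'|) < 1.
j = int(np.ceil(np.log10(np.abs(v).max())))
decimal_scaled = v / (10 ** j)

print(min_max, z_score, decimal_scaled, sep="\n")
```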

ADVANTAGES OR DISADVANTAGES:

Advantages of Data Transformation in Data Mining:

1. Improves Data Quality: Data transformation helps to improve the quality of data by
removing errors, inconsistencies, and missing values.
2. Facilitates Data Integration: Data transformation enables the integration of data from
multiple sources, which can improve the accuracy and completeness of the data.
3. Improves Data Analysis: Data transformation helps to prepare the data for analysis and
modeling by normalizing, reducing dimensionality, and discretizing the data.
4. Increases Data Security: Data transformation can be used to mask sensitive data, or to
remove sensitive information from the data, which can help to increase data security.
5. Enhances Data Mining Algorithm Performance: Data transformation can improve the
performance of data mining algorithms by reducing the dimensionality of the data and
scaling the data to a common range of values.

Disadvantages of Data Transformation in Data Mining:

1. Time-consuming: Data transformation can be a time-consuming process, especially when dealing with large datasets.
2. Complexity: Data transformation can be a complex process, requiring specialized skills and
knowledge to implement and interpret the results.
3. Data Loss: Data transformation can result in data loss, such as when discretizing
continuous data, or when removing attributes or features from the data.
4. Biased transformation: Data transformation can result in bias, if the data is not properly
understood or used.
5. High cost: Data transformation can be an expensive process, requiring significant
investments in hardware, software, and personnel.

Discretization in data mining

Data discretization refers to a method of converting a huge number of data values into smaller
ones so that the evaluation and management of data become easy. In other words, data
discretization is a method of converting attribute values of continuous data into a finite set of intervals with minimum data loss. There are two forms of data discretization: supervised discretization and unsupervised discretization. Supervised discretization makes use of the class information, while unsupervised discretization does not; it is characterized by the direction in which the operation proceeds, i.e., a top-down splitting strategy or a bottom-up merging strategy.

We can understand this concept with the help of an example.

Suppose we have an attribute Age with the given values:

Age: 1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19, 31, 33, 36, 42, 44, 46, 70, 74, 78, 77

Table before Discretization: the attribute Age holds the raw values listed above.

Table after Discretization:

1, 5, 4, 9, 7 -> Child
11, 14, 17, 13, 18, 19 -> Young
31, 33, 36, 42, 44, 46 -> Mature

Another example is web analytics, where we gather data about website visitors. For example, all visitors whose IP addresses are from India are grouped together at the country level.
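A minimal sketch of the Age example with pandas. The bin edges (0-10, 10-20, 20-50, 50+) and the extra "Old" label for the values above 46 (which the table above leaves unlabeled) are assumptions made only for illustration:

```python
import pandas as pd

age = pd.Series([1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19,
                 31, 33, 36, 42, 44, 46, 70, 74, 78, 77])
groups = pd.cut(age, bins=[0, 10, 20, 50, 120],
                labels=["Child", "Young", "Mature", "Old"])
print(pd.DataFrame({"age": age, "group": groups}))
```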

Some Famous techniques of data discretization

Histogram analysis

A histogram is a plot used to represent the underlying frequency distribution of a continuous data set. It assists in inspecting the data distribution, for example revealing outliers, skewness, or whether the data is approximately normally distributed.

Binning

Binning refers to a data smoothing technique that groups a huge number of continuous values into a smaller number of bins. This technique can also be used for data discretization and for developing concept hierarchies.

Cluster Analysis

Cluster analysis is a form of data discretization in which a clustering algorithm is applied to the values of an attribute X, dividing them into clusters (groups) that can then be treated as discrete categories of X.

Data discretization using decision tree analysis

Data discretization by decision tree analysis uses a top-down splitting technique and is a supervised procedure. To discretize a numeric attribute, you first select the attribute value (split point) with the least entropy and then apply the procedure recursively: the recursion divides the values into discretized, disjoint intervals from top to bottom, using the same splitting criterion (a small sketch follows).
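A minimal sketch of the idea: the split thresholds learned by a shallow decision tree on one attribute become the interval boundaries. The data, class labels, and tree depth are invented for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
age = rng.integers(1, 80, size=200).reshape(-1, 1)
label = np.digitize(age.ravel(), [20, 50])      # a toy class attribute with 3 levels

tree = DecisionTreeClassifier(max_depth=2).fit(age, label)
# Internal nodes carry split thresholds; leaves are marked with -2.
thresholds = sorted(t for t in tree.tree_.threshold if t != -2)
print("interval boundaries:", thresholds)
```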

Data discretization using correlation analysis

Discretizing data by correlation analysis identifies the best (most similar) neighboring intervals and then combines adjacent intervals step by step into larger ones to form the final set of intervals. It is a supervised procedure.
