
UNIT-1 INTRODUCTION TO DATA MINING

INTRODUCTION:
Data mining is the process of extracting meaningful information from large datasets. It
involves algorithms to identify patterns, trends, and relationships among data.

DEFINITIONS OF DATA MINING:


1. "Data mining is the process of sorting through large data sets to identify patterns and
relationships that can help solve business problems through data analysis. Data mining
techniques and tools help enterprises to predict future trends and make more informed
business decisions."
2. "Data mining is the process of searching and analyzing a large batch of raw data in order
to identify patterns and extract useful information."
3. "Data mining is the process of using statistical analysis and machine learning to discover
hidden patterns, correlations, and anomalies within large datasets."
4. "Data mining is the process of extracting and discovering patterns in large datasets
involving methods at the intersection of machine learning, statistics, and database
systems."

Why is Data Mining Important?


In today's data-driven world, businesses and organizations of all sizes are generating massive
amounts of data. Data mining provides a way to make sense of this data and gain valuable
insights. Here are some key reasons why data mining is important:
1. Improved Decision Making: By uncovering hidden patterns and trends, data mining
can help businesses make more informed decisions.
2. Increased Efficiency: Data mining can help automate processes, identify areas for
improvement, and reduce costs.
3. Enhanced Customer Experience: By understanding customer behaviour, data mining
can help businesses personalize their offerings and improve customer satisfaction.
4. Competitive Advantage: Businesses that can effectively use data mining can gain a
competitive edge by identifying new opportunities and responding to market changes
more quickly.

KNOWLEDGE DISCOVERY IN DATABASES (KDD):
 Knowledge Discovery in Databases (KDD) refers to the complete process of
uncovering valuable knowledge from large datasets.
 It starts with the selection of relevant data, followed by pre-processing to clean and
organize it, transformation to prepare it for analysis, data mining to uncover patterns
and relationships, and concludes with the evaluation and interpretation of results,
ultimately producing valuable knowledge or insights.
 KDD is widely utilized in fields like machine learning, pattern recognition, statistics,
artificial intelligence, and data visualization.
 The KDD process is iterative, involving repeated refinements to ensure the accuracy
and reliability of the knowledge extracted.
 The whole process consists of the following steps:
1. Data Selection
2. Data Cleaning and Pre-processing
3. Data Transformation and Reduction
4. Data Mining
5. Evaluation and Interpretation of Results

1. Data Selection:
 Data Selection is the initial step in the Knowledge Discovery in Databases (KDD)
process, where relevant data is identified and chosen for analysis.
 It involves selecting a dataset or focusing on specific variables, samples, or subsets of
data that will be used to extract meaningful insights.
 It ensures that only the most relevant data is used for analysis, improving efficiency and
accuracy. Depending on the task’s goals, this may mean using the entire dataset or
narrowing it down to particular features or subsets. Careful selection at this stage delivers
accurate and relevant results.
2. Data Cleaning and Pre-Processing:
 In the KDD process, Data Cleaning is essential for ensuring that the dataset is accurate
and reliable by correcting errors, handling missing values, removing duplicates, and
addressing noisy or outlier data. Data cleaning is crucial in KDD to enhance the quality
of the data and improve the effectiveness of data mining.
 Missing Values: Gaps in data are filled with the mean or most probable value to
maintain dataset completeness.
 Removing Duplicates: Duplicate records are removed to maintain consistency
and avoid errors in analysis.
 Noisy Data: Noise is reduced using techniques like binning, regression, or
clustering to smooth or group the data.
3. Data Transformation and Reduction:
 Data Transformation involves converting data into a format that is suitable for analysis.
 Normalization: Scaling data to a common range for consistency across variables.
 Discretization: Converting continuous data into discrete categories for simpler
analysis.
 Data Aggregation: Summarizing multiple data points (e.g., averages or totals) to
simplify analysis.
 Concept Hierarchy Generation: Organizing data into hierarchies for a clearer,
higher-level view.
 Data Reduction helps simplify the dataset while preserving key information.
1. Dimensionality Reduction: Reducing the number of variables while keeping
essential data.
2. Numerosity Reduction: Reducing data points using methods like sampling to
maintain critical patterns.
3. Data Compression: Compacting data for easier storage and processing.
4. Data Mining:
 Data Mining is the process of discovering valuable, previously unknown patterns from
large datasets through automatic or semi-automatic means.
 It involves exploring vast amounts of data to extract useful information that can drive
decision-making.
 In the KDD process, choosing the data mining task is critical. Depending on the objective,
the task could involve classification, regression, clustering, or association rule mining.
 After determining the task, selecting the appropriate data mining algorithms is essential.
 These algorithms are chosen based on their ability to efficiently and accurately identify
patterns that align with the goals of the analysis.
5. Evaluation and Interpretation of Results:
 Evaluation in KDD involves assessing the patterns identified during data mining to
determine their relevance and usefulness.
 It includes calculating the “interestingness score” for each pattern, which helps to
identify valuable insights.
 Visualization and summarization techniques are then applied to make the data more
understandable and accessible for the user.
 Interpretation of Results focuses on presenting these understandings in a way that is
meaningful and actionable.
 By effectively communicating the findings, decision-makers can use the results to drive
informed actions and strategies.
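The end-to-end flow can be illustrated with a small, hypothetical sketch in Python using pandas and scikit-learn (the file name, column names, and the choice of a clustering model are assumptions made only for illustration):

    # Hypothetical KDD walk-through: selection -> cleaning -> transformation -> mining -> evaluation
    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    # 1. Data Selection: keep only the variables relevant to the task (assumed columns)
    data = pd.read_csv("customers.csv")[["age", "income", "spend"]]

    # 2. Data Cleaning: fill missing values with the column mean, drop duplicate records
    data = data.fillna(data.mean()).drop_duplicates()

    # 3. Data Transformation: scale all attributes to a common 0-1 range
    X = MinMaxScaler().fit_transform(data)

    # 4. Data Mining: discover natural groups (clusters) in the prepared data
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    # 5. Evaluation: score how well separated the discovered clusters are
    print("silhouette:", silhouette_score(X, labels))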

FUNCTIONALITIES OF DATA MINING:


 Data mining functionalities are used to represent the type of patterns that have to be
discovered in data mining tasks.
 Data mining tasks can be classified into two types: descriptive and predictive.
 Descriptive mining tasks characterize the general properties of the data in the database.
 Predictive mining tasks perform inference on the current data in order to make predictions.
There are various data mining functionalities which are as follows −
1. Data characterization: It is a summarization of the general characteristics of an object
class of data. The data corresponding to the user-specified class is generally collected
by a database query. The output of data characterization can be presented in multiple
forms.
2. Data discrimination: It is a comparison of the general characteristics of target class
data objects with the general characteristics of objects from one or a set of contrasting
classes. The target and contrasting classes can be specified by the user, and the
corresponding data objects are retrieved through database queries.
3. Association Analysis: It analyses the set of items that generally occur together in a
transactional dataset. Two parameters are used for determining the association rules
(a small worked sketch follows the summary table below):
 Support, which identifies how frequently the item set appears in the database.
 Confidence, which is the conditional probability that an item occurs in a transaction
when another item occurs.
4. Classification: Classification is the procedure of discovering a model that represents
and distinguishes data classes or concepts, for the objective of being able to use the
model to predict the class of objects whose class label is unknown. The derived model
is established on the analysis of a set of training data.
5. Prediction: Prediction is used to estimate unavailable data values or upcoming trends.
An object can be predicted based on the attribute values of the object and the attribute
values of the classes. It can be a prediction of missing numerical values or of
increase/decrease trends in time-related information.
6. Clustering: It is similar to classification, but the classes are not predefined; the classes
are derived from the data attributes. It is unsupervised learning. The objects are clustered
or grouped based on the principle of maximizing the intra-class similarity and
minimizing the inter-class similarity.
7. Outlier analysis: Outliers are data elements that cannot be grouped into a given class or
cluster. These are data objects whose behaviour differs from the general behaviour of
the other data objects.
8. Evolution analysis: It defines trends for objects whose behaviour changes over time.

Summary of Functionalities:
Functionality            | Goal                    | Example Application
Association Rule Mining  | Discover relationships  | Market basket analysis
Classification           | Predict categories      | Spam email detection
Prediction               | Forecast outcomes       | Sales or demand forecasting
Clustering               | Group similar data      | Customer segmentation
Anomaly Detection        | Find outliers           | Fraud detection
Summarization            | Provide data overviews  | Business reports
Sequence Analysis        | Analyse ordered data    | Stock price trends
Knowledge Discovery      | Reveal hidden insights  | Trend detection in customer data
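To make support and confidence concrete, here is a minimal Python sketch over a made-up set of five transactions (the item names are assumptions) that evaluates the rule {bread} -> {butter}:

    transactions = [
        {"bread", "butter", "milk"},
        {"bread", "butter"},
        {"bread", "milk"},
        {"butter", "milk"},
        {"bread", "butter", "jam"},
    ]
    n = len(transactions)
    both = sum(1 for t in transactions if {"bread", "butter"} <= t)   # bread AND butter
    bread = sum(1 for t in transactions if "bread" in t)

    support = both / n          # fraction of all transactions containing both items
    confidence = both / bread   # P(butter | bread)
    print(support, confidence)  # 0.6 and 0.75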

INTRODUCTION TO DATA WAREHOUSE:
 A Data Warehouse consists of data from multiple heterogeneous data sources and is used
for analytical reporting and decision making.
 Data Warehouse is a central place where data is stored from different data sources and
applications.
 The term Data Warehouse was first coined by Bill Inmon in 1990.
 Data Warehouse is always kept separate from Operational Database.
 The data in a Data warehouse system is loaded from operational transaction systems like –
Sales, Marketing, HR etc.
 A Data Warehouse is used for reporting and analyzing of information and stores both
historical and current data.
 The data in Data warehouse system is used for Analytical reporting, which is later used by
Business Analysts, Sales Managers or Knowledge workers for decision-making.
 Data in data warehouse is accessed by BI (Business Intelligence) users for Analytical
Reporting, Data Mining and Analysis.
 This is used for decision making by Business Users, Sales Manager, Analysts to define
future strategy.
 Data flows from multiple heterogeneous data sources into the Data Warehouse. Common
data sources for a data warehouse include operational databases, reports, data mining
patterns, flat files (XLS, CSV, TXT files), etc.

CHARACTERISTICS OF A DATA WAREHOUSE:
1. Subject Oriented: In a DW system, the data is categorized and stored by business
subject (such as equity plans, shares, loans) rather than by application.
2. Integrated: Data from multiple data sources are integrated in a Data Warehouse.
3. Non Volatile: Data in data warehouse is non-volatile. It means when data is loaded in
Data Warehouse system, it is not altered.
4. Time Variant: A Data Warehouse system contains historical data as compared to
Transactional system which contains only current data. In a Data warehouse you can
see data for 3 months, 6 months, 1 year, 5 years, etc.

FEATURES OF DATA WAREHOUSING:


1. Centralized Data Repository: It provides a centralized repository for all enterprise
data from various sources, such as transactional databases, operational systems, and
external sources. This enables organizations to have a comprehensive view of their data,
which can help in making informed business decisions.
2. Data Integration: Data warehousing integrates data from different sources into a
single, unified view, which can help in eliminating data silos and reducing data
inconsistencies.
3. Historical Data Storage: Data warehousing stores historical data, which enables
organizations to analyze data trends over time. This can help in identifying patterns and
anomalies in the data, which can be used to improve business performance.
4. Query and Analysis: Data warehousing provides powerful query and analysis
capabilities that enable users to explore and analyze data in different ways. This can
help in identifying patterns and trends, and can also help in making informed business
decisions.
5. Data Transformation: Data warehousing includes a process of data transformation,
which involves cleaning, filtering, and formatting data from various sources to make it
consistent and usable. This can help in improving data quality and reducing data
inconsistencies.
6. Data Mining: Data warehousing provides data mining capabilities, which enable
organizations to discover hidden patterns and relationships in their data. This can help
in identifying new opportunities, predicting future trends, and mitigating risks.
7. Data Security: It provides robust data security features, such as access controls, data
encryption, and data backups, which ensure that the data is secure and protected from
unauthorized access.

Data Warehouse vs DBMS:
1. Processing: A database is based on operational or transactional processing, where each
operation is an indivisible transaction; a data warehouse is based on analytical processing.
2. Data stored: A database generally stores current, up-to-date data used for daily
operations; a data warehouse maintains historical data kept over years, which can be used
for trend analysis, future predictions and decision support.
3. Scope: A database is generally application specific (for example, a database stores related
data such as the student details in a school); a data warehouse is generally integrated at the
organization level by combining data from different databases (for example, it integrates
data from one or more databases so that analysis can identify the best performing school
in a city).
4. Cost: Constructing a database is not so expensive; constructing a data warehouse can be
expensive.

ADVANTAGES OF DATA WAREHOUSING:


1. Intelligent Decision-Making: With centralized data in warehouses, decisions may be
made more quickly and intelligently.
2. Business Intelligence: Provides strong operational views through business intelligence.
3. Historical Analysis: Predictions and trend analysis are made easier by storing past data.
4. Data Quality: Guarantees data quality and consistency for trustworthy reporting.
5. Scalability: Capable of managing massive data volumes and expanding to meet
changing requirements.
6. Effective Queries: Fast and effective data retrieval is enabled by an optimized structure.
7. Cost reductions: Data warehousing can result in cost savings over time by reducing
data management procedures and increasing overall efficiency, even when there are
setup costs initially.
8. Data security: Data warehouses employ security protocols to safeguard confidential
information, guaranteeing that only authorized personnel are granted access to certain
data.

DISADVANTAGES OF DATA WAREHOUSING:
1. Cost: Building a data warehouse can be expensive, requiring significant investments in
hardware, software, and personnel.
2. Complexity: Data warehousing can be complex, and businesses may need to hire
specialized personnel to manage the system.
3. Time-consuming: Building a data warehouse can take a significant amount of time,
requiring businesses to be patient and committed to the process.
4. Data integration challenges: Data from different sources can be challenging to
integrate, requiring significant effort to ensure consistency and accuracy.
5. Data security: Data warehousing can pose data security risks, and businesses must take
measures to protect sensitive data from unauthorized access or breaches.

APPLICATIONS OF DATA WAREHOUSING:


Data Warehousing can be applied anywhere where we have a huge amount of data and we
want to see statistical results that help in decision making.
1. Social Media Websites: Social networking websites like Facebook, Twitter,
LinkedIn, etc. are based on analyzing large data sets. These sites gather data related to
members, groups, locations, etc., and store it in a single central repository. Because the
volume of data is so large, a data warehouse is needed to manage it.
2. Banking: Most of the banks these days use warehouses to see the spending patterns of
account/cardholders. They use this to provide them with special offers, deals, etc.
3. Government: Government uses a data warehouse to store and analyze tax payments
which are used to detect tax thefts.

DATA PRE-PROCESSING:
Data pre-processing is an important step in data mining. In this process, raw data is
converted into an understandable format and made ready for further analysis. The motive is to
improve data quality and make it suitable for specific tasks.
Tasks in Data Pre-processing:
1. Data Cleaning
2. Data Integration
3. Data Transformation
4. Data Reduction

DATA CLEANING:
 Data cleaning defines to clean the data by filling in the missing values, smoothing noisy
data, analyzing and removing outliers, and removing inconsistencies in the data.
 Sometimes data at multiple levels of detail can be different from what is required, for
example, it can need the age ranges of 20-30, 30-40, 40-50, and the imported data
includes birth date. The data can be cleans by splitting the data into appropriate types.
There are various types of data cleaning which are as follows:
1. Missing values
2. Noisy data
3. Combined computer and human inspection
4. Inconsistent data
1. Missing Values: Missing values are filled with appropriate values. There are the following
approaches to fill the values.
a) The tuple is ignored when it includes several attributes with missing values.
b) The values are filled manually for the missing value.
c) The same global constant can fill the values.
d) The attribute mean can fill the missing values.
e) The most probable value can fill the missing values.
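A minimal pandas sketch of approaches (d) and (e) above (the column names and values are hypothetical):

    import pandas as pd

    df = pd.DataFrame({"age": [25, None, 31, 40, None],
                       "city": ["Pune", "Delhi", None, "Delhi", "Delhi"]})

    # (d) fill a numeric attribute with its mean
    df["age"] = df["age"].fillna(df["age"].mean())

    # (e) fill a categorical attribute with its most probable (most frequent) value
    df["city"] = df["city"].fillna(df["city"].mode()[0])

    # (a) ignoring tuples and removing duplicates would be df.dropna() / df.drop_duplicates()
    print(df)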
2. Noisy data: Noise is a random error or variance in a measured variable. There are the
following smoothing methods to handle noise. They are: binning, regression, clustering.
a) Binning: These methods smooth a sorted data value by consulting its
“neighbourhood”. The sorted values are distributed into a number of buckets or bins.
Because binning methods consult the neighbourhood of values, they perform local
smoothing.
Types of binning techniques: bin by means, bin by medians, bin by boundary.
Example: The data for price are first sorted and then partitioned into equal-frequency / equi-
depth bins of size 3 (i.e., each bin contains three values).
In smoothing by bin means, each value in a bin is replaced by the mean value of the
bin. For example, the mean of the values 4, 8, and 15 in Bin 1 is 9. Therefore, each
original value in this bin is replaced by the value 9.
In smoothing by bin medians, each bin value is replaced by the bin median.
In smoothing by bin boundaries, the minimum and maximum values in a given bin
are identified as the bin boundaries. Each bin value is then replaced by the closest
boundary value.
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
Smoothing by bin medians:
Bin 1: 8, 8, 8
Bin 2: 21, 21, 21
Bin 3: 28, 28, 28
Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
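The same example can be reproduced with a short Python sketch (equal-frequency bins of size 3, as above):

    import numpy as np

    prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
    bins = [prices[i:i + 3] for i in range(0, len(prices), 3)]      # equal-frequency bins

    by_means  = [[round(np.mean(b))] * len(b) for b in bins]        # Bin 1 -> 9, 9, 9
    by_median = [[int(np.median(b))] * len(b) for b in bins]        # Bin 1 -> 8, 8, 8
    # replace each value by the closer of the two bin boundaries    # Bin 1 -> 4, 4, 15
    by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

    print(by_means, by_median, by_bounds, sep="\n")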

b) Regression: Regression functions are used to smooth the data. Regression can be
linear (one independent variable) or multiple (several independent variables).
c) Clustering: Clustering helps in identifying outliers. Similar values are organized
into clusters, and values that fall outside the clusters are treated as outliers.
3. Combined computer and human inspection: Outliers can also be identified through a
combination of computer and human inspection: patterns flagged by the computer are
reviewed by a human and classified as genuinely interesting or as garbage.
4. Inconsistent data: Inconsistencies can be recorded in various transactions, during data
entry, or arising from integrating information from multiple databases. Some redundancies
can be recognized by correlation analysis. Accurate and proper integration of the data from
various sources can decrease and avoid redundancy.

DATA INTEGRATION:
 Data integration in data mining refers to the process of combining data from multiple
heterogeneous data sources into a single, combined view. These sources may include
multiple data cubes, databases, or flat files.
 This can involve cleaning the data, as well as resolving any inconsistencies or conflicts
that may exist between the different sources.
 The goal of data integration is to make the data more useful and meaningful for the
purposes of analysis and decision making.
 It makes it easier to access and analyse data that is spread across multiple systems or
platforms, in order to gain a more complete and accurate understanding of the data.
 The data integration approach is formally defined as a triple (G, S, M), where:
 G stands for the global schema,
 S stands for the heterogeneous source schemas,
 M stands for the mapping between the queries of the source and global schemas.
There are three issues to consider during data integration:
1. Schema Integration:
 Integrate metadata from different sources. Different data sources may have varying
schemas, structures, or representations of the same data. Matching real-world entities
from multiple sources is referred to as the entity identification problem.
 Example: One system may use "CustomerID" while another uses "ClientID" for the
same entity.
 Solution: Schema matching algorithms or manual mapping can help resolve
inconsistencies, though it can be time-consuming.
2. Redundancy Detection:
 An attribute may be redundant if it can be derived or obtained from another attribute
or set of attributes. Inconsistencies in attributes can also cause redundancies in the
resulting data set. Some redundancies can be detected by correlation analysis.
3. Resolution of data value conflicts:
 Attribute values from different sources may differ for the same real-world entity.
Conflicting data values may arise from different sources. Without proper conflict
resolution, integrated data may remain inconsistent or incorrect.
 Example: A customer’s address may differ between two databases.
 Solution: Use conflict resolution strategies, such as prioritizing specific sources or
averaging numerical values.
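A minimal pandas sketch of schema integration and conflict resolution (the table names, column names, and the rule of preferring the CRM source are assumptions):

    import pandas as pd

    crm   = pd.DataFrame({"CustomerID": [1, 2], "address": ["12 Main St", "5 Oak Ave"]})
    sales = pd.DataFrame({"ClientID":   [1, 2], "address": ["12 Main Street", None]})

    # Schema integration: map "ClientID" onto the global attribute "CustomerID"
    sales = sales.rename(columns={"ClientID": "CustomerID"})

    merged = crm.merge(sales, on="CustomerID", suffixes=("_crm", "_sales"))

    # Conflict resolution: prefer the CRM address, fall back to the sales address if missing
    merged["address"] = merged["address_crm"].fillna(merged["address_sales"])
    print(merged[["CustomerID", "address"]])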
DATA TRANSFORMATION:
 Data transformation is a pre-processing step in which raw data is converted into an
appropriate format to be used effectively by mining algorithms.
 The goal is to improve the quality, efficiency, and accuracy of analysis. Transformation
ensures that data is consistent, relevant, and suitable for extracting meaningful patterns.
In this, data are transformed or combined into forms suitable for mining.
Data Transformation Techniques are:
1. Smoothing: In smoothing, we remove noise from the data. Such methods include
binning, regression, and clustering.
2. Aggregation: In aggregation, summary or aggregation operations are applied to
the data. For example, daily sales data may be aggregated to compute monthly and
annual totals. This step is generally used when constructing a data cube for the analysis
of the data at multiple granularities.
3. Generalization: In generalization, low-level or “primitive” (raw) data are
replaced by higher-level concepts through the use of concept hierarchies. For example,
categorical attributes, such as street, can be generalized to higher-level concepts, such as
city or country. Similarly, values for numerical attributes, such as age, can be mapped
to higher-level concepts, like youth, middle-aged, and senior.
4. Normalization: In normalization, the attribute data are scaled so that they fall within a
small specified range, such as -1.0 to 1.0, or 0.0 to 1.0.
The methods for data normalization are: min-max, z-score, and decimal scaling.
a) Min-Max Normalization: It is a way to rescale data to a fixed range, typically
between 0 and 1. It's done by taking the minimum and maximum values of a dataset
and then proportionally transforming each data point to fit within that range.
Formula: X_normalized = (X - X_min) / (X_max - X_min)
Where:
 X_normalized is the normalized value of X.
 X is the original value.
 X_min is the minimum value in the dataset.
 X_max is the maximum value in the dataset.
Example: Suppose we have the following dataset of house prices (in thousands of
dollars): 200, 250, 300, 350, 400
Find the minimum and maximum values: X_min = 200, X_max = 400
Normalize each value:
For 200: (200 - 200) / (400 - 200) = 0
For 250: (250 - 200) / (400 - 200) = 0.25
For 300: (300 - 200) / (400 - 200) = 0.5
For 350: (350 - 200) / (400 - 200) = 0.75
For 400: (400 - 200) / (400 - 200) = 1

Result: The normalized dataset becomes: 0 0.25 0.5 0.75 1
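The same calculation can be written in a few lines of Python (a sketch using the house-price values above):

    prices = [200, 250, 300, 350, 400]
    lo, hi = min(prices), max(prices)
    normalized = [(x - lo) / (hi - lo) for x in prices]
    print(normalized)   # [0.0, 0.25, 0.5, 0.75, 1.0]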

b) Z-score normalization: It is also called standardization. It is a way to rescale data
so that it has a mean of 0 and a standard deviation of 1. This is useful when you want
to compare data from different sources that might have different scales.
Formula: The formula for calculating a Z-score is: Z = (X - μ) / σ
Where:
 Z is the Z-score
 X is the original value
 μ is the mean of the data
 σ is the standard deviation of the data
Example: Let's say we have the following dataset of exam scores: 85, 92, 78, 88, 95
Calculate the mean (μ): μ = ( 85 + 92 + 78 + 88 + 95 ) / 5 = 87.6
Calculate the standard deviation (σ) (using the sample standard deviation, i.e., dividing by n - 1):
σ = sqrt [ ( (85-87.6)^2 + (92-87.6)^2 + (78-87.6)^2 + (88-87.6)^2 + (95-87.6)^2 ) / (5 - 1) ]
σ ≈ 6.5
Calculate the Z-score for each value:
For 85: Z = (85 - 87.6) / 6.5 ≈ -0.4
For 92: Z = (92 - 87.6) / 6.5 ≈ 0.7
For 78: Z = (78 - 87.6) / 6.5 ≈ -1.5
For 88: Z = (88 - 87.6) / 6.5 ≈ 0.1
For 95: Z = (95 - 87.6) / 6.5 ≈ 1.1

Result: The normalized dataset becomes: -0.4 0.7 -1.5 0.1 1.1
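A short Python sketch reproducing the exam-score example (it uses the sample standard deviation, i.e., division by n - 1, which is what the value 6.5 above corresponds to):

    import statistics

    scores = [85, 92, 78, 88, 95]
    mu = statistics.mean(scores)        # 87.6
    sigma = statistics.stdev(scores)    # sample standard deviation, ~6.58
    z = [round((x - mu) / sigma, 1) for x in scores]
    print(z)                            # [-0.4, 0.7, -1.5, 0.1, 1.1]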
c) Decimal Scaling: It is a normalization technique that scales data by moving the
decimal point of the values. The number of decimal places to move is determined by
the largest absolute value in the dataset. This ensures that all values fall within a
defined range, usually between -1 and 1.
Formula: Normalized value = Value / 10^j
Where: j is the smallest integer such that max(|Value|) < 10^j
Example: Suppose we have the following dataset of house prices (in thousands of
dollars): 150, 250, 100, 300, 200
Find the maximum absolute value: The maximum value in the dataset is 300.
Determine the value of j:
We need to find the smallest integer j such that 300 < 10^j.
In this case, j = 3 (since 300 has 3 digits), and 300 < 10^3, i.e., 300 < 1000.
Normalize each value:
For 150: 150 / 10^3 = 0.15
For 250: 250 / 10^3 = 0.25
For 100: 100 / 10^3 = 0.10
For 300: 300 / 10^3 = 0.30
For 200: 200 / 10^3 = 0.20

Result: The normalized dataset becomes: 0.15 0.25 0.10 0.30 0.20
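A small Python sketch of the same computation:

    import math

    values = [150, 250, 100, 300, 200]
    # smallest integer j such that max(|value|) < 10^j
    j = math.ceil(math.log10(max(abs(v) for v in values) + 1))
    scaled = [v / 10 ** j for v in values]
    print(j, scaled)   # 3 [0.15, 0.25, 0.1, 0.3, 0.2]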

DATA REDUCTION:
 Data reduction is a technique used in data mining to minimize the volume of data while
maintaining its integrity and the quality of the analysis.
 It helps improve efficiency, reduce storage costs, and enhance processing speed without
significant loss of important information.
 This can be beneficial in situations where dataset is too large to be processed efficiently,
or where the dataset contains large amount of irrelevant or redundant information.
Data Reduction Techniques are:
1. Data Cube Aggregation
2. Attribute Subset Selection
3. Data Compression
4. Numerosity Reduction
5. Discretization and Concept Hierarchy Generation
1. Data Cube Aggregation:
 Data Cube Aggregation is a multidimensional aggregation that uses aggregation at
various levels of a data cube to represent the original data set, thus achieving data
reduction. This technique is used to aggregate data in a simpler form.
 For example, Suppose we have the data of All Electronics sales per quarter for the year
2018 to the year 2020. If we want to get the annual sale per year, we just have to
aggregate the sales per quarter for each year.
 In this way, aggregation provides you with the required data, which is much smaller in
size, and thereby we achieve data reduction even without losing any data.
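A minimal pandas sketch of this roll-up (the quarterly figures are made up):

    import pandas as pd

    sales = pd.DataFrame({
        "year":    [2018] * 4 + [2019] * 4 + [2020] * 4,
        "quarter": ["Q1", "Q2", "Q3", "Q4"] * 3,
        "amount":  [200, 220, 250, 300, 210, 230, 260, 310, 190, 240, 270, 330],
    })

    # Aggregate quarterly sales up to annual totals: twelve rows reduce to three
    annual = sales.groupby("year", as_index=False)["amount"].sum()
    print(annual)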

2. Attribute Subset Selection:
 Attribute subset selection removes weakly relevant or redundant attributes and keeps
only those required for the analysis. It reduces data size as it eliminates redundant features.
 The methods for attribute selection are:
a) Step-wise Forward Selection
b) Step-wise Backward Elimination
c) Combination of Forward and Backward Selection
a) Step-wise Forward Selection:
 The selection begins with an empty set of attributes; at each step, the best of the
remaining original attributes is added to the set based on its relevance to the mining task.
 Example: Suppose there are the following attributes in the data set in which few
attributes are redundant.
Initial attribute Set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: { }
Step-1: {X1}
Step-2: {X1, X2}
Step-3: {X1, X2, X5}
Final reduced attribute set: {X1, X2, X5}
b) Step-wise Backward Elimination:
 This selection starts with the complete set of attributes in the original data and, at
each step, it eliminates the worst remaining attribute in the set.
 Example: Suppose there are the following attributes in the data set in which few
attributes are redundant.
Initial attribute Set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: {X1, X2, X3, X4, X5, X6 }
Step-1: {X1, X2, X3, X4, X5}
Step-2: {X1, X2, X3, X5}
Step-3: {X1, X2, X5}
Final reduced attribute set: {X1, X2, X5}
c) Combination of Forward and Backward Selection:
 At each step it selects the best attribute and removes the worst from the remaining
attributes, combining both strategies and making the process faster.
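For reference, scikit-learn provides a sequential selector that follows the forward and backward strategies described above; the sketch below is an assumption-laden illustration (the estimator, dataset, and number of attributes to keep are arbitrary choices):

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)

    # Step-wise forward selection: start empty, add the best attribute at each step
    forward = SequentialFeatureSelector(
        LogisticRegression(max_iter=1000), n_features_to_select=2, direction="forward")
    forward.fit(X, y)
    print(forward.get_support())    # boolean mask of the selected attributes

    # Step-wise backward elimination: start with all attributes, drop the worst at each step
    backward = SequentialFeatureSelector(
        LogisticRegression(max_iter=1000), n_features_to_select=2, direction="backward")
    backward.fit(X, y)
    print(backward.get_support())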
3. Data Compression:
 Data Compression technique reduces the size of the files using different encoding
mechanisms (Huffman Encoding & run-length Encoding).
The two types of Data Compression Techniques are:
a) Lossless Compression:
 In this, no information is lost during compression and decompression. It is used
when exact data recovery is required (e.g., text, medical data, and financial
records). The Common techniques are:
1. Run-Length Encoding (RLE): Replaces consecutive identical values with a
single value and count
2. Huffman Coding: Uses variable-length codes based on frequency
(shorter codes for frequent values)
3. Lempel-Ziv-Welch (LZW): Dictionary-based method used in ZIP files and
GIF images
For Example:
Original Data: AAAAABBBCCDAA
RLE Compression: 5A3B2C1D2A
b) Lossy Compression:
 In this, some data is discarded to achieve a higher compression ratio. It is used
when exact data recovery is not required (e.g., images, audio, video). The
Common techniques are: JPEG Compression (for images), MP3 Compression
(for audio), MPEG Compression (for video).
 Example (JPEG Compression): It removes minor colour variations that are not
noticeable to the human eye. It achieves high compression while retaining visual
quality.
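A minimal Python sketch of the run-length encoding example shown above:

    from itertools import groupby

    def rle_encode(text):
        # replace each run of identical characters with "<count><character>"
        return "".join(f"{len(list(group))}{char}" for char, group in groupby(text))

    print(rle_encode("AAAAABBBCCDAA"))   # 5A3B2C1D2A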

4. Numerosity Reduction:
 Numerosity reduction is a technique that replaces the original dataset with a smaller,
more compact representation while preserving its essential information. Instead of
storing the entire dataset, a mathematical model or sample is used to represent the
data, reducing storage and processing costs.
The types of Numerosity Reduction Techniques are: Parametric and Non-Parametric.
a) Parametric Methods:
 These methods approximate the dataset using mathematical models. Instead of
storing the entire dataset, only the model parameters are stored.
 They include regression models (which fit the data to a function) and log-linear
models (which use probability distributions).
b) Non-Parametric Methods:
 These methods do not assume any model but reduce data volume through
techniques like: clustering (Groups similar data points together), sampling
(Selects representative subset of data rather than entire dataset), and histograms
(Groups numeric data into bins and stores only the bin boundaries instead of
every individual value).
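A brief sketch of two non-parametric reductions, sampling and a histogram (the synthetic data and the sample size are assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    values = rng.normal(loc=50, scale=10, size=10_000)

    # Sampling: keep a small representative subset instead of the full dataset
    sample = rng.choice(values, size=100, replace=False)

    # Histogram: store only bin boundaries and counts instead of every value
    counts, edges = np.histogram(values, bins=10)
    print(round(sample.mean(), 2), counts, edges)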
5. Discretization & Concept Hierarchy Generation:
Discretization:
 It is the process of converting continuous data values into a finite set of intervals or
categories. It simplifies data and makes it easier to analyse and understand.
 Many algorithms work better with discrete data, so discretization can improve their
performance. It also reduces noise by smoothing out minor variations in continuous data.
 For example, Imagine we have the ages of customers. Instead of treating each age as
a unique value, we can group them into ranges like "18-25", "26-35", etc.
 The Methods used are: Binning, Histogram and Clustering.
Concept Hierarchy Generation:
 It creates a hierarchical structure for categorical data, where more general
concepts sit at higher levels and more specific concepts at lower levels.
 It provides a structured way to organize and interpret categorical data and improves data
understanding. It helps in data mining tasks like association rule mining and classification.
The methods are manual definition, schema-based generation, and automatic generation.
 For example, "city" can be grouped into "state", then "country". It allows analysis
at different levels of detail. We can explore data at a broad level (country) or a more
specific level (city).
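A short Python sketch of both ideas, using pd.cut for discretization and a hand-built mapping as a concept hierarchy (the ages, bin edges, and city-to-country table are made up):

    import pandas as pd

    ages = pd.Series([19, 23, 31, 45, 62])
    # Discretization: continuous ages -> labelled intervals
    age_groups = pd.cut(ages, bins=[18, 25, 35, 50, 100],
                        labels=["18-25", "26-35", "36-50", "50+"])

    # Concept hierarchy: city -> (state, country), defined manually
    hierarchy = {"Mumbai": ("Maharashtra", "India"),
                 "Pune":   ("Maharashtra", "India"),
                 "Austin": ("Texas", "USA")}

    print(age_groups.tolist())        # ['18-25', '18-25', '26-35', '36-50', '50+']
    print(hierarchy["Pune"][1])       # roll up from city to country -> India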
