
SENGUNTHAR ARTS AND SCIENCE COLLEGE

(Affiliated to Periyar University, Salem and Approved by AICTE, New Delhi)


An ISO 9001:2015 Certified Institution
Recognised under section 2(f) and 12(B) of the UGC Act 1956
Since 1991 Accredited by NAAC
PG & RESEARCH DEPARTMENT OF COMPUTER SCIENCE

Unit 1

DATA MINING

Introduction to data mining:

Data Mining, which is also known as Knowledge Discovery in Databases (KDD), is a process of discovering useful information from large volumes of data stored in databases and data warehouses.

Data Mining refers to "extracting or mining knowledge from large amounts of data" or "knowledge mining from data".

Data Mining is a process of discovering interesting patterns and knowledge from large
amounts of data. The data sources can include databases, data warehouses, the web, and other
information repositories or data that are streamed into the system dynamically.

 Knowledge discovery in the database is the process of searching for hidden knowledge in the
massive amounts of data that we are technically capable of generating and storing.
 The basic task of KDD is to extract knowledge (or information) from lower-level data (databases).
 It is the non-trivial (significant) process of identifying valid, novel, potentially useful, and
ultimately understandable patterns in data.
 The goal is to extract from unprocessed data something that may not be obvious but is valuable or enlightening once discovered.
 The overall process of finding and interpreting patterns from data involves the repeated
application of the following steps:

Step by step process of KDD:

 Data Cleaning
 Data Integration
 Data Selection
 Data Transformation
 Data Mining
 Pattern Evaluation
 Knowledge Presentation

Steps 1 to 4 are different forms of data preprocessing, where data are prepared for mining.
Fig. Data Mining as a step in the process of knowledge discovery
Data Cleaning
 Removal of noise, inconsistent data, and outliers
 Strategies to handle missing data fields.
Data Integration
 Data from various sources such as databases, data warehouses, and transactional files are integrated.
 Multiple data sources may be combined into a single data format.
Data Selection
 Data relevant to the analysis task is retrieved from the database.
 Collecting only necessary information to the model.
 Finding useful features to represent data depending on the goal of the task.
Data Transformation
 Data are transformed and consolidated into forms appropriate for mining by performing
summary or aggregation operations.
 Transformation methods are used to find invariant representations of the data.
Data Mining
 An essential process where intelligent methods are applied to extract data patterns.
 Deciding which model and parameter may be appropriate.
Pattern Evaluation
 To identify the truly interesting patterns representing knowledge based on interesting
measures.
Knowledge Presentation
 Visualization and knowledge representation techniques are used to present mined
knowledge to users.
 Visualizations can be in the form of graphs, charts, or tables.
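
The following is a minimal, illustrative sketch of these preprocessing steps in Python, assuming the pandas library is available; the tables and column names are invented for the example.

import pandas as pd
import numpy as np

# Data integration: combine two hypothetical sources into one table
sales = pd.DataFrame({"customer_id": [1, 1, 2, 2],
                      "amount": [120.0, np.nan, 80.0, 80.0]})
customers = pd.DataFrame({"customer_id": [1, 2], "age": [34, 51]})
data = sales.merge(customers, on="customer_id")

# Data cleaning: remove duplicate records and fill a missing value with the mean
data = data.drop_duplicates()
data["amount"] = data["amount"].fillna(data["amount"].mean())

# Data selection: keep only the attributes relevant to the analysis task
data = data[["customer_id", "age", "amount"]]

# Data transformation: aggregate (summarize) purchases per customer
summary = data.groupby("customer_id", as_index=False)["amount"].sum()
print(summary)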

Data Mining Architecture:

Data mining is a significant method where previously unknown and potentially useful
information is extracted from the vast amount of data. The data mining process involves several
components, and these components constitute a data mining system architecture.

The significant components of data mining systems are a data source, data mining engine,
data warehouse server, the pattern evaluation module, graphical user interface, and knowledge
base.

Data Source:

The actual source of data is the Database, data warehouse, World Wide Web (WWW), text
files, and other documents. You need a huge amount of historical data for data mining to be
successful. Organizations typically store data in databases or data warehouses. Data warehouses
may comprise one or more databases, text files, spreadsheets, or other repositories of data.
Sometimes, even plain text files or spreadsheets may contain information. Another primary source
of data is the World Wide Web or the internet.
Data cleaning, Integration and Selection:

Before passing the data to the database or data warehouse server, the data must be cleaned,
integrated, and selected. As the information comes from various sources and in different formats, it cannot be used directly for the data mining procedure because the data may not be complete and accurate. So, the data first needs to be cleaned and unified.

Database or Data Warehouse Server:

The database or data warehouse server consists of the original data that is ready to be
processed. Hence, the server is responsible for retrieving the relevant data based on the user's data mining request.

Data Mining Engine:

The data mining engine is a major component of any data mining system. It contains several
modules for operating data mining tasks, including association, characterization, classification,
clustering, prediction, time-series analysis, etc.

Pattern Evaluation Module:

The pattern evaluation module is primarily responsible for measuring how interesting a discovered pattern is, typically by using a threshold value. It collaborates with the data mining engine to focus the search on interesting patterns and may use an interestingness threshold to filter out discovered patterns.

Graphical User Interface:

The graphical user interface (GUI) module communicates between the data mining system and the
user. This module helps the user to easily and efficiently use the system without knowing the
complexity of the process. This module cooperates with the data mining system when the user
specifies a query or a task and displays the results.

Knowledge Base:

The knowledge base is helpful in the entire process of data mining. It can be used to guide the search or to evaluate the interestingness of the resulting patterns. The knowledge base may even contain user views and data from user experiences that might be helpful in the data mining process.

Type of Data that can be mined


Different kinds of data can be mined. Some examples are mentioned below.

1. Spatial Databases
2. Flat Files
3. Relational Databases
4. Transactional Databases
5. Multimedia Databases
6. Data Warehouse
7. World Wide Web(WWW)
8. Time Series Databases

Spatial Database
A spatial database is a suitable way to store geographical information. It stores data in the form of coordinates, lines, and other shapes. Maps and global positioning systems are well-known applications of spatial databases.

Flat files
Flat files are in binary or text form and have a structure that can easily be processed by data mining algorithms. Flat files can be described with a data dictionary; the most common example is the CSV file. Flat files are popular in data warehousing for several reasons:

1. Flat files can be used to store the data.
2. Flat files can be used to carry data to and from the server.

Relational Databases
A relational database is an organized collection of related data, organized in the form of tables with rows and columns. Different kinds of schemas are used in relational databases; the physical schema and the logical schema are the most common.

 In the physical schema, we define the structure of the tables.
 In the logical schema, we define the relationships among the tables.

The standard API of a relational database is Structured Query Language (SQL).
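
As a small illustration of the SQL interface, the sketch below queries a toy relational table through Python's built-in sqlite3 module; the table and its rows are made up for the example.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customer (id INTEGER, name TEXT, city TEXT)")
cur.executemany("INSERT INTO customer VALUES (?, ?, ?)",
                [(1, "Asha", "Salem"), (2, "Ravi", "Chennai")])

# A declarative query over the logical schema (rows and columns)
for row in cur.execute("SELECT name FROM customer WHERE city = 'Salem'"):
    print(row)      # ('Asha',)
conn.close()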

Transactional Databases
A transactional database is an organized collection of data in which each record represents a transaction, typically organized by a timestamp or date. A transactional database must have the capability to roll back any transaction. Object databases, ATMs, banking systems, and distributed systems are well-known applications of transactional databases.
Multimedia Databases
Multimedia databases are databases that can store the following:

 Video
 Images
 Audio
 Text
Multimedia databases can be built on object-oriented databases. E-book databases, video website databases, and news website databases are well-known applications of multimedia databases.

Data Warehouse:
A data warehouse is a collection of data that is collected and integrated from one or more sources. Later, this data can be mined for business decision making.
Three well-known types of data warehouse are mentioned below:

1. Virtual Warehouse
2. Data Mart
3. Enterprise data warehouse

Business decision making and Data mining are very useful applications of the data warehouse.

WWW
WWW stands for World Wide Web. The WWW is a collection of documents and resources that can contain different kinds of data such as video, audio, and text. Each resource is identified by a Uniform Resource Locator (URL) and accessed through web browsers. Online tools, video, image, and text search sites are well-known applications of the WWW.

Time-series Databases
Time-series databases are databases that store time-indexed data, such as stock exchange data. Graphite and eXtremeDB are well-known time-series database systems.

Data mining functionalities

Data mining functionalities are used to represent the type of patterns that have to be
discovered in data mining tasks. In general, data mining tasks can be classified into two types
including descriptive and predictive. Descriptive mining tasks define the general characteristics of the data in the database, while predictive mining tasks perform inference on the current data in order to make predictions.

There are various data mining functionalities which are as follows −

 Data characterization − It is a summarization of the general characteristics of an object class of data. The data corresponding to the user-specified class is generally collected by a database query. The output of data characterization can be presented in multiple forms.

 Data discrimination − It is a comparison of the general characteristics of target class data objects with the general characteristics of objects from one or a set of contrasting classes. The target and contrasting classes can be specified by the user, and the corresponding data objects are retrieved through database queries.
 Association Analysis − It analyses the set of items that generally occur together in a transactional dataset. There are two parameters that are used for determining the association rules −

o Support, which identifies the common item sets in the database.

o Confidence, which is the conditional probability that an item occurs in a transaction when another item occurs.

Example:
buys(X, "computer") ⇒ buys(X, "software") [support = 1%, confidence = 50%]

where X is a variable representing a customer. Confidence = 50% means that if a customer buys a computer, there is a 50% chance that she will buy software as well.

Support = 1% means that 1% of all of the transactions under analysis showed that computer and software were purchased together.
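
The sketch below computes support and confidence for such a rule over a toy list of transactions; the items and transactions are invented for illustration.

transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer", "software"},
    {"computer", "software", "printer"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"computer", "software"} <= t)
computer = sum(1 for t in transactions if "computer" in t)

support = both / n            # fraction of all transactions containing both items
confidence = both / computer  # P(software | computer)
print(f"support={support:.2f}, confidence={confidence:.2f}")   # 0.50, 0.67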

 Classification − Classification is the procedure of discovering a model that represents and distinguishes data classes or concepts, with the objective of using the model to predict the class of objects whose class label is unknown. The derived model is based on the analysis of a set of training data (i.e., data objects whose class label is known).

A decision tree is a flow-chart-like tree structure, where each node denotes a test
on an attribute value, each branch represents an outcome of the test, and tree leaves
represent classes or class distributions.
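
A minimal classification sketch is shown below using a decision tree from scikit-learn (assumed to be installed); the training data and attribute choices are invented for illustration.

from sklearn.tree import DecisionTreeClassifier

# Each row is [age, income]; the label 1 means "buys computer", 0 means "does not"
X = [[25, 30000], [45, 60000], [35, 40000], [50, 80000], [23, 20000]]
y = [0, 1, 0, 1, 0]

model = DecisionTreeClassifier(max_depth=2).fit(X, y)

# Predict the class of an object whose class label is unknown
print(model.predict([[40, 55000]]))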

 Prediction − It predicts unavailable data values or pending trends. An object can be anticipated based on the attribute values of the object and the attribute values of the classes. The prediction may concern missing numerical values or increase/decrease trends in time-related information.

 Clustering − It is similar to classification, but the classes are not predefined; they are derived from the data attributes. It is unsupervised learning. The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity.
 Outlier analysis − Outliers are data elements that cannot be grouped into a given class or cluster. They are data objects whose behaviour deviates from the general behaviour of the other data objects. The analysis of this type of data can be essential for mining knowledge.

 Evolution analysis − It describes and models trends for objects whose behaviour changes over time.

Major Issues in Data Mining:

Mining Methodology and User Interaction Issues:


It refers to the following kinds of issues −
 Mining different kinds of knowledge in databases − Different users may be interested in different kinds of knowledge. Therefore, it is necessary for data mining to cover a broad range of knowledge discovery tasks.
 Interactive mining of knowledge at multiple levels of abstraction − The data mining
process needs to be interactive because it allows users to focus the search for patterns,
providing and refining data mining requests based on the returned results.
 Incorporation of background knowledge − Background knowledge can be used to guide the discovery process and to express the discovered patterns, not only in concise terms but at multiple levels of abstraction.

 Data mining query languages and ad hoc data mining − A data mining query language that allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.

 Presentation and visualization of data mining results − Once the patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable.
 Handling noisy or incomplete data − Data cleaning methods are required to handle noise and incomplete objects while mining data regularities. Without data cleaning methods, the accuracy of the discovered patterns will be poor.

 Pattern evaluation − The patterns discovered should be interesting; a pattern may be considered uninteresting if it represents common knowledge or lacks novelty.

Performance Issues:

There can be performance-related issues such as follows −

 Efficiency and scalability of data mining algorithms − In order to effectively extract information from the huge amount of data in databases, data mining algorithms must be efficient and scalable.

 Parallel, distributed, and incremental mining algorithms − The factors such as huge
size of databases, wide distribution of data, and complexity of data mining methods
motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions, which are processed in parallel, and the results from the partitions are then merged. Incremental algorithms incorporate database updates without having to mine the entire data again from scratch.

Diverse Data Types Issues

 Handling of relational and complex types of data − The database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of data.
 Mining information from heterogeneous databases and global information systems − The data is available at different data sources on a LAN or WAN. These data sources may be structured, semi-structured, or unstructured. Therefore, mining knowledge from them adds challenges to data mining.

Data Preprocessing
Data preprocessing involves transforming raw data into well-formed data sets so that data mining analytics can be applied. Raw data is often incomplete and has inconsistent formatting. The adequacy or inadequacy of data preparation has a direct correlation with the success of any project that involves data analytics.

Data Cleaning in Data Mining:


The quality of your data is critical to the final analysis. Data that is incomplete, noisy, or inconsistent can affect your results. Data cleaning in data mining is the process of detecting and removing (or correcting) corrupt or inaccurate records from a record set, table, or database.

Some data cleaning methods for handling missing values are listed below (a short code sketch follows the list):


1. You can ignore the tuple. This is done when the class label is missing. This method is not very effective, unless the tuple contains several attributes with missing values.
2. You can fill in the missing value manually. This approach is effective on a small data set with only a few missing values.
3. You can replace all missing attribute values with a global constant, such as a label like "Unknown" or minus infinity.
4. You can use the attribute mean to fill in the missing value. For example, if the average customer income is 25,000, you can use this value to replace a missing value for income.
5. Use the most probable value to fill in the missing value.
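
The sketch below applies some of these strategies with pandas; the column names and values are hypothetical.

import pandas as pd
import numpy as np

df = pd.DataFrame({"income": [25000, np.nan, 30000, np.nan],
                   "label": ["A", "B", None, "A"]})

df = df.dropna(subset=["label"])                         # 1. ignore tuples missing the class label
df["income"] = df["income"].fillna(df["income"].mean())  # 4. fill with the attribute mean
# df["income"] = df["income"].fillna(-1)                 # 3. or use a global constant instead
print(df)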

Noisy Data:

Noisy data is meaningless data. Noise is a random error or variance in a measured variable. Noisy data may be due to faulty data collection instruments, data entry problems, or technology limitations. Noisy data can be handled by the following methods:

1. Binning method
2. Regression
3. Clustering

I Binning:

Binning methods smooth a sorted data value by consulting its "neighborhood," that is, the values around it. The sorted values are distributed into a number of "buckets," or bins, of equal size, and various smoothing methods are then performed within each bin. There are three types of binning methods for data smoothing:

1. Smoothing by bin means
2. Smoothing by bin medians
3. Smoothing by bin boundaries

For example
Price = 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins:

Bin a: 4, 8, 15
Bin b: 21, 21, 24
Bin c: 25, 28, 34
In this example, the data for price are first sorted and then partitioned into equal-frequency bins of
size 3.

Smoothing by bin means:

Bin a: 9, 9, 9
Bin b: 22, 22, 22
Bin c: 29, 29, 29
In smoothing by bin means, each value in a bin is replaced by the mean value of the bin.
Smoothing by bin boundaries:

Bin a: 4, 4, 15
Bin b: 21, 21, 24
Bin c: 25, 25, 34
In smoothing by bin boundaries, each bin value is replaced by the closest boundary value.
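
The small sketch below reproduces this example in Python, building equal-frequency bins of size 3 and smoothing them by bin means and by bin boundaries.

prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = [prices[i:i + 3] for i in range(0, len(prices), 3)]

by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]
by_boundaries = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
                 for b in bins]

print(by_means)        # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_boundaries)   # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]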

II Regression
Data can be smoothed by fitting the data to a regression function.

1. Linear regression
2. Multiple linear regression

Linear Regression:
Finds the best line to fit two attributes (or variables), so that one attribute can be used to predict the other.
Multiple linear regression:
More than two attributes are involved and the data are fit to a multidimensional surface.
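
A brief regression-smoothing sketch is given below using numpy.polyfit; the x and y values are invented attribute values.

import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

w, b = np.polyfit(x, y, deg=1)   # best-fit line y ≈ w*x + b
smoothed = w * x + b             # noisy y values replaced by values on the line
print(smoothed)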

III Clustering:
Outliers may be detected by clustering, where similar values are organized into groups, or "clusters." Values that fall outside of the set of clusters may be considered outliers.


Data Integration In Data Mining


Data Integration is a data preprocessing technique that combines data from multiple sources and provides users with a unified view of these data. These sources may include multiple databases, data cubes, or flat files. One of the most well-known implementations of data integration is building an enterprise data warehouse, which enables a business to perform analyses based on the data it contains.

There are two major approaches to data integration:

1 Tight Coupling

In tight coupling, data is combined from different sources into a single physical location through the process of ETL − Extraction, Transformation, and Loading.

2 Loose Coupling

In loose coupling, the data remains only in the actual source databases. In this approach, an interface is provided that takes a query from the user, transforms it into a form the source databases can understand, and then sends the query directly to the source databases to obtain the result.
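
A tight-coupling-style sketch is shown below: data is extracted from two hypothetical sources, transformed to a common schema, and loaded into a single unified table with pandas.

import pandas as pd

crm = pd.DataFrame({"cust_id": [1, 2], "name": ["Asha", "Ravi"]})
billing = pd.DataFrame({"customer": [1, 2], "amount": [1200.0, 980.0]})

billing = billing.rename(columns={"customer": "cust_id"})  # unify the schema
unified = crm.merge(billing, on="cust_id", how="inner")    # single integrated view
print(unified)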

Issues in data Integration:

1. Schema Integration:
Metadata from different sources must be integrated. Matching equivalent entities from multiple sources is referred to as the entity identification problem.
2. Redundancy:
An attribute may be redundant if it can be derived from another attribute or set of attributes. Some redundancies can be detected by correlation analysis.

Data Transformation:

Data transformation is a technique used to convert raw data into a suitable format that eases data mining and the retrieval of strategic information. Data transformation includes data cleaning techniques and data reduction techniques to convert the data into the appropriate form.
Data Transformation Techniques

There are several data transformation techniques that can help structure and clean up the
data before analysis or storage in a data warehouse. Let's study all techniques used for data
transformation, some of which we have already studied in data reduction and data cleaning.

1. Data Smoothing

Data smoothing is a process that is used to remove noise from the dataset using certain algorithms. It allows the important features of the dataset to stand out and helps in predicting patterns. When collecting data, it can be processed to eliminate or reduce variance and other forms of noise.

The concept behind data smoothing is that it will be able to identify simple changes to help
predict different trends and patterns. This serves as a help to analysts or traders who need to look
at a lot of data which can often be difficult to digest for finding patterns that they wouldn't see
otherwise.

We have seen how noise is removed from the data using techniques such as binning, regression, and clustering.

o Binning: This method splits the sorted data into a number of bins and smoothens the data values in each bin by considering the neighborhood values around them.
o Regression: This method identifies the relation between two dependent attributes so that if we have one attribute, it can be used to predict the other attribute.
o Clustering: This method groups similar data values to form clusters. The values that lie outside a cluster are known as outliers.

2. Attribute Construction

In the attribute construction method, new attributes are constructed from the existing attributes to produce a data set that eases data mining. The new attributes are created and applied to assist the mining process. This simplifies the original data and makes the mining more efficient.

For example, suppose we have a data set referring to measurements of different plots, i.e., we may have the height and width of each plot. Here, we can construct a new attribute 'area' from the attributes 'height' and 'width'. This also helps in understanding the relations among the attributes in a data set.

3. Data Aggregation

Data aggregation is the method of storing and presenting data in a summary format. The data may be obtained from multiple data sources and integrated into a single description for data analysis. This is a crucial step, since the accuracy of data analysis insights is highly dependent on the quantity and quality of the data used.

Gathering accurate data of high quality and a large enough quantity is necessary to produce
relevant results. The collection of data is useful for everything from decisions concerning financing
or business strategy of the product, pricing, operations, and marketing strategies.

For example, we have a data set of sales reports of an enterprise that has quarterly sales of
each year. We can aggregate the data to get the enterprise's annual sales report.
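
A small aggregation sketch with pandas is given below; the quarterly figures are invented.

import pandas as pd

sales = pd.DataFrame({
    "year":    [2021, 2021, 2021, 2021, 2022, 2022, 2022, 2022],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "amount":  [120, 150, 130, 170, 140, 160, 155, 180],
})

annual = sales.groupby("year", as_index=False)["amount"].sum()
print(annual)   # one row per year with the summed quarterly amounts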

4. Data Normalization

Normalizing the data refers to scaling the data values to a much smaller range such as [-1, 1] or [0.0,
1.0]. There are different methods to normalize the data, as discussed below.

Consider that we have a numeric attribute A with n observed values V1, V2, V3, …, Vn.

o Min-max normalization: This method performs a linear transformation on the original data. Suppose minA and maxA are the minimum and maximum values observed for attribute A, and Vi is a value of attribute A to be normalized. Min-max normalization maps Vi to V'i in a new, smaller range [new_minA, new_maxA] using the formula:

V'i = ((Vi − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA

For example, suppose the minimum and maximum values for the attribute income are $12,000 and $98,000, and [0.0, 1.0] is the range into which we have to map the value $73,600. Min-max normalization transforms $73,600 to ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0.0) + 0.0 = 0.716.

o Z-score normalization: This method normalizes the values of attribute A using the mean and standard deviation of A. The following formula is used for z-score normalization:

V'i = (Vi − Ā) / σA

Here Ā and σA are the mean and standard deviation of attribute A, respectively.
For example, suppose the mean and standard deviation of attribute A are $54,000 and $16,000. Using z-score normalization, the value $73,600 is transformed to (73,600 − 54,000) / 16,000 = 1.225.

o Decimal Scaling: This method normalizes the value of attribute A by moving the decimal point in the value. The movement of the decimal point depends on the maximum absolute value of A. The formula for decimal scaling is:

V'i = Vi / 10^j

Here j is the smallest integer such that max(|V'i|) < 1.

For example, the observed values for attribute A range from -986 to 917, and the maximum absolute value for attribute A is 986. To normalize each value of attribute A using decimal scaling, we divide each value by 1,000, i.e., j = 3. So, the value -986 would be normalized to -0.986, and 917 would be normalized to 0.917.
The normalization parameters, such as the mean, standard deviation, and maximum absolute value, must be preserved so that future data can be normalized uniformly.
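
The sketch below implements the three normalization methods as plain Python functions and checks them against the worked values above.

def min_max(v, vmin, vmax, new_min=0.0, new_max=1.0):
    return (v - vmin) / (vmax - vmin) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    return (v - mean) / std

def decimal_scaling(v, j):
    return v / (10 ** j)

print(round(min_max(73600, 12000, 98000), 3))             # 0.716
print(round(z_score(73600, 54000, 16000), 3))              # 1.225
print(decimal_scaling(-986, 3), decimal_scaling(917, 3))   # -0.986 0.917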

5. Data Discretization

This is a process of converting continuous data into a set of data intervals. Continuous attribute values are replaced by small interval labels. This makes the data easier to study and analyze. If a data mining task handles a continuous attribute, replacing its values with discrete interval labels improves the efficiency of the task.
This method is also called a data reduction mechanism as it transforms a large dataset into a set of
categorical data. Discretization also uses decision tree-based algorithms to produce short, compact,
and accurate results when using discrete values.

Data discretization can be classified into two types: supervised discretization, where the class information is used, and unsupervised discretization, where it is not. Depending on which direction the process proceeds, it follows either a 'top-down splitting strategy' or a 'bottom-up merging strategy'.

For example, the values for the age attribute can be replaced by the interval labels such as (0-10,
11-20…) or (kid, youth, adult, senior).
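
A short sketch of unsupervised discretization of an age attribute into interval labels, using pandas.cut, is shown below; the bin edges and label names are chosen arbitrarily.

import pandas as pd

ages = pd.Series([4, 15, 23, 38, 67, 80])
labels = pd.cut(ages, bins=[0, 10, 20, 60, 100],
                labels=["kid", "youth", "adult", "senior"])
print(labels.tolist())   # ['kid', 'youth', 'adult', 'adult', 'senior', 'senior']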

6. Data Generalization

It converts low-level data attributes to high-level data attributes using concept hierarchy. This
conversion from a lower level to a higher conceptual level is useful to get a clearer picture of the
data. Data generalization can be divided into two approaches:

o Data cube process (OLAP) approach.


o Attribute-oriented induction (AOI) approach.

For example, age data may appear in a dataset as numeric values such as 20 or 30. Data generalization transforms them to a higher conceptual level, such as the categorical values (young, old).

Data Reduction:

Data reduction is a process that reduces the volume of original data and represents it in a
much smaller volume. Data reduction techniques are used to obtain a reduced representation of the
dataset that is much smaller in volume by maintaining the integrity of the original data. By reducing
the data, the efficiency of the data mining process is improved, which produces the same analytical
results.

Data reduction does not affect the result obtained from data mining. That means the result
obtained from data mining before and after data reduction is the same or almost the same.

Data reduction aims to represent the data more compactly. When the data size is smaller, it is simpler to apply sophisticated and computationally expensive algorithms. The reduction of the data may be in terms of the number of rows (records) or the number of columns (dimensions).

Techniques of Data Reduction:

Here are the following techniques or methods of data reduction in data mining, such as:
1. Dimensionality Reduction

Whenever we encounter weakly relevant data, we keep only the attributes required for our analysis. Dimensionality reduction eliminates attributes from the data set under consideration, thereby reducing the volume of the original data. It reduces data size by eliminating outdated or redundant features. Three methods of dimensionality reduction are described below.

i. Wavelet Transform: In the wavelet transform, a data vector A is transformed into a numerically different data vector A' such that both A and A' are of the same length. It is useful in reducing data because the transformed data can be truncated: a compressed approximation of the data is obtained by retaining only a small fraction of the strongest wavelet coefficients. The wavelet transform can be applied to data cubes, sparse data, or skewed data.
ii. Principal Component Analysis: Suppose we have a data set to be analyzed that has tuples with n attributes. Principal component analysis searches for k n-dimensional orthogonal vectors (principal components) that can best be used to represent the data, as sketched after this list.
In this way, the original data can be cast onto a much smaller space, and dimensionality reduction is achieved. Principal component analysis can be applied to sparse and skewed data.
iii. Attribute Subset Selection: The large data set has many attributes, some of which are
irrelevant to data mining or some are redundant. The core attribute subset selection
reduces the data volume and dimensionality. The attribute subset selection reduces the
volume of data by eliminating redundant and irrelevant attributes.
The attribute subset selection ensures that we get a good subset of the original attributes even after eliminating the unwanted attributes. The probability distribution of the reduced data should be as close as possible to the distribution of the original data using all the attributes.
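
Below is a minimal principal component analysis sketch using scikit-learn (assumed available); the data matrix is random, invented data projected onto k = 2 components.

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 5)     # 100 tuples with n = 5 attributes
pca = PCA(n_components=2)      # keep k = 2 principal components
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)               # (100, 2): same tuples in a smaller space
print(pca.explained_variance_ratio_) # variance captured by each component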

2. Numerosity Reduction
The numerosity reduction reduces the original data volume and represents it in a much smaller form. This technique includes two types: parametric and non-parametric numerosity reduction.

i. Parametric: Parametric numerosity reduction stores only data parameters instead of the original data. One method of parametric numerosity reduction is the regression and log-linear method.
o Regression and Log-Linear: Linear regression models a relationship between the
two attributes by modeling a linear equation to the data set. Suppose we need to
model a linear function between two attributes.
y = wx +b
Here, y is the response attribute, and x is the predictor attribute. If we discuss in
terms of data mining, attribute x and attribute y are the numeric database attributes,
whereas w and b are regression coefficients.
Multiple linear regression lets the response variable y be modeled as a linear function of two or more predictor variables.
Log-linear model discovers the relation between two or more discrete attributes in
the database. Suppose we have a set of tuples presented in n-dimensional space.
Then the log-linear model is used to study the probability of each tuple in a
multidimensional space.
Regression and log-linear methods can be used for sparse data and skewed data.
ii. Non-Parametric: A non-parametric numerosity reduction technique does not assume any model. Non-parametric techniques result in a more uniform reduction, irrespective of data size, but they may not achieve as high a volume of data reduction as the parametric methods. There are at least four types of non-parametric data reduction techniques: histograms, clustering, sampling, and data cube aggregation.
o Histogram: A histogram is a graph that represents a frequency distribution, which describes how often a value appears in the data. A histogram uses the binning method to represent an attribute's data distribution, using disjoint subsets called bins or buckets.
A histogram can represent dense, sparse, uniform, or skewed data. Instead of only one attribute, a histogram can be implemented for multiple attributes, and it can effectively represent up to five attributes.
o Clustering: Clustering techniques group similar objects from the data so that the objects in a cluster are similar to each other but dissimilar to objects in other clusters.
How similar the objects inside a cluster are can be calculated using a distance function: the more similar the objects in a cluster, the closer they appear within the cluster.
The quality of a cluster depends on the diameter of the cluster, i.e., the maximum distance between any two objects in the cluster.
The cluster representation replaces the original data. This technique is more effective if the data can be classified into distinct clusters.
o Sampling: One of the methods used for data reduction is sampling, as it can reduce a large data set to a much smaller data sample. Below are the different ways in which we can sample a large data set D containing N tuples (a short code sketch follows this list):
a. Simple random sample without replacement (SRSWOR) of size s: s tuples are drawn from the N tuples of data set D (s < N). The probability of drawing any tuple from data set D is 1/N, which means all tuples have an equal probability of being sampled.
b. Simple random sample with replacement (SRSWR) of size s: It is similar to SRSWOR, but each tuple drawn from data set D is recorded and then replaced into D so that it can be drawn again.

c. Cluster sample: The tuples in data set D are grouped into M mutually disjoint subsets (clusters). Data reduction can then be applied by performing SRSWOR on these clusters: a simple random sample of s clusters is taken, where s < M.
d. Stratified sample: The large data set D is partitioned into mutually disjoint
sets called 'strata'. A simple random sample is taken from each stratum to
get stratified data. This method is effective for skewed data.
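
The sketch below draws an SRSWOR and an SRSWR sample of size s from a toy data set D of N tuples, using Python's random module.

import random

D = list(range(1, 101))   # a hypothetical data set of N = 100 tuples
s = 10

srswor = random.sample(D, s)                   # without replacement
srswr = [random.choice(D) for _ in range(s)]   # with replacement
print(srswor)
print(srswr)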

3. Data Cube Aggregation

This technique is used to aggregate data in a simpler form. Data Cube Aggregation is a
multidimensional aggregation that uses aggregation at various levels of a data cube to represent the
original data set, thus achieving data reduction.

For example, suppose you have the data of All Electronics sales per quarter for the year 2018 to the
year 2022. If you want to get the annual sale per year, you just have to aggregate the sales per
quarter for each year. In this way, aggregation provides you with the required data, which is much
smaller in size, and thereby we achieve data reduction even without losing any data.
The data cube aggregation is a multidimensional aggregation that eases multidimensional analysis. The data cube presents precomputed and summarized data, which gives data mining fast access to the relevant information.

4. Data Compression

Data compression employs modification, encoding, or converting the structure of data in a way that
consumes less space. Data compression involves building a compact representation of information
by removing redundancy and representing data in binary form. Data that can be restored successfully from its compressed form is said to be losslessly compressed; in contrast, when the original form cannot be restored exactly from the compressed form, the compression is lossy. Dimensionality and numerosity reduction methods are also used for data compression.

This technique reduces the size of the files using different encoding mechanisms, such as Huffman
Encoding and run-length Encoding. We can divide it into two types based on their compression
techniques.

i. Lossless Compression: Encoding techniques (such as Run Length Encoding) allow a simple and minimal reduction in data size. Lossless data compression uses algorithms to restore the precise original data from the compressed data; a short run-length encoding sketch follows this list.
ii. Lossy Compression: In lossy-data compression, the decompressed data may differ from
the original data but are useful enough to retrieve information from them. For example, the
JPEG image format is a lossy compression, but we can find the meaning equivalent to the
original image. Methods such as the discrete wavelet transform and PCA (principal component analysis) are examples of this type of compression.
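
The toy run-length encoding sketch below illustrates lossless compression: the original string can be restored exactly from the encoded (character, count) pairs.

from itertools import groupby

def rle_encode(text):
    return [(ch, len(list(run))) for ch, run in groupby(text)]

def rle_decode(pairs):
    return "".join(ch * count for ch, count in pairs)

encoded = rle_encode("AAAABBBCCD")
print(encoded)              # [('A', 4), ('B', 3), ('C', 2), ('D', 1)]
print(rle_decode(encoded))  # AAAABBBCCD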

5. Discretization Operation

The data discretization technique is used to divide continuous attributes into data intervals. Many constant values of the attributes are replaced with labels of small intervals, so that mining results are presented in a concise and easily understandable way.

i. Top-down discretization: If you first consider one or a couple of points (so-called breakpoints or split points) to divide the whole range of attribute values and repeat this method on the resulting intervals until the end, then the process is known as top-down discretization, also known as splitting.
ii. Bottom-up discretization: If you first consider all the continuous values as potential split points and then discard some of them by merging neighborhood values to form intervals, the process is called bottom-up discretization, also known as merging.

Unit I Completed
