Data Mining Notes UNIT I
Unit 1
DATA MINING
Data Mining refers to "extracting or mining knowledge from large amount of data" or
"knowledge mining from data" or otherwise called as Knowledge Discovery in Database or KDD.
Data Mining is a process of discovering interesting patterns and knowledge from large
amounts of data. The data sources can include databases, data warehouses, the web, and other
information repositories or data that are streamed into the system dynamically.
Knowledge discovery in databases is the process of searching for hidden knowledge in the
massive amounts of data that we are technically capable of generating and storing.
The basic task of KDD is to extract knowledge (or information) from lower-level data
(databases).
It is the non-trivial (significant) process of identifying valid, novel, potentially useful, and
ultimately understandable patterns in data.
The goal is to distinguish, within unprocessed data, something that may not be obvious but is
valuable or enlightening once discovered.
The overall process of finding and interpreting patterns from data involves the repeated
application of the following steps:
1. Data Cleaning
2. Data Integration
3. Data Selection
4. Data Transformation
5. Data Mining
6. Pattern Evaluation
7. Knowledge Presentation
Steps 1 to 4 are different forms of data preprocessing, where data are prepared for mining.
Fig. Data Mining as a step in the process of knowledge discovery
Data Cleaning
Removal of noise, inconsistent data, and outliers
Strategies to handle missing data fields.
Data Integration
Data from various sources such as databases, data warehouses, and transactional data are
integrated, i.e., multiple data sources may be combined into a single, consistent data store.
Data Selection
Data relevant to the analysis task is retrieved from the database.
Collecting only necessary information to the model.
Finding useful features to represent data depending on the goal of the task.
Data Transformation
Data are transformed and consolidated into forms appropriate for mining by performing
summary or aggregation operations.
Transformation methods are used to find invariant representations for the data.
Data Mining
An essential process where intelligent methods are applied to extract data patterns.
This includes deciding which models and parameters may be appropriate.
Pattern Evaluation
To identify the truly interesting patterns representing knowledge, based on interestingness
measures.
Knowledge Presentation
Visualization and knowledge representation techniques are used to present mined
knowledge to users.
Visualizations can be in the form of graphs, charts, or tables.
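Putting these steps together, the following is a minimal Python sketch of a toy KDD pipeline. Every step is deliberately trivial and every function and field name is a hypothetical placeholder chosen for illustration, not an actual library API.

def run_kdd_pipeline(raw_records):
    # 1-2. Data cleaning and integration: drop incomplete records (a single source is assumed here).
    cleaned = [r for r in raw_records if r.get("value") is not None]
    # 3. Data selection: keep only the fields relevant to the task.
    selected = [{"value": r["value"]} for r in cleaned]
    # 4. Data transformation: a simple scaling step.
    transformed = [{"value": r["value"] / 100.0} for r in selected]
    # 5. Data mining: here a trivial "pattern", the mean of the transformed values.
    pattern = sum(r["value"] for r in transformed) / len(transformed)
    # 6. Pattern evaluation: keep the pattern only if it passes an interestingness threshold.
    interesting = pattern > 0.5
    # 7. Knowledge presentation: return a small summary for the user.
    return {"mean_value": pattern, "interesting": interesting}

print(run_kdd_pipeline([{"value": 40}, {"value": 90}, {"value": None}]))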
Architecture of a Data Mining System:
Data mining is a significant method by which previously unknown and potentially useful
information is extracted from vast amounts of data. The data mining process involves several
components, and these components constitute the data mining system architecture.
The significant components of a data mining system are the data source, data mining engine,
data warehouse server, pattern evaluation module, graphical user interface, and knowledge
base.
Data Source:
The actual source of data is the Database, data warehouse, World Wide Web (WWW), text
files, and other documents. You need a huge amount of historical data for data mining to be
successful. Organizations typically store data in databases or data warehouses. Data warehouses
may comprise one or more databases, text files, spreadsheets, or other repositories of data.
Sometimes, even plain text files or spreadsheets may contain information. Another primary source
of data is the World Wide Web or the internet.
Data cleaning, Integration and Selection:
Before passing the data to the database or data warehouse server, the data must be cleaned,
integrated, and selected. As the information comes from various sources and in different formats, it
can't be used directly for the data mining procedure because the data may not be complete and
accurate. So, the data first needs to be cleaned and unified.
Database or Data Warehouse Server:
The database or data warehouse server contains the original data that is ready to be
processed. The server is responsible for retrieving the relevant data, based on the user's data
mining request.
Data Mining Engine:
The data mining engine is a major component of any data mining system. It contains several
modules for performing data mining tasks, including association, characterization, classification,
clustering, prediction, time-series analysis, etc.
Pattern Evaluation Module:
The pattern evaluation module is primarily responsible for measuring how interesting a
discovered pattern is, typically by using a threshold value. It collaborates with the data mining
engine to focus the search on interesting patterns, and it may use an interestingness threshold to
filter out discovered patterns.
Graphical User Interface:
The graphical user interface (GUI) module communicates between the data mining system and the
user. This module helps the user use the system easily and efficiently without needing to know the
complexity of the process. It passes the user's query or task to the data mining system and displays
the results.
Knowledge Base:
The knowledge base is helpful throughout the data mining process. It may be used to guide the
search or to evaluate the interestingness of the resulting patterns. The knowledge base may even
contain user views and data from user experiences that can be helpful in the data mining process.
Types of Data Sources in Data Mining:
Data mining can be performed on the following kinds of data sources and repositories:
1. Spatial Databases
2. Flat Files
3. Relational Databases
4. Transactional Databases
5. Multimedia Databases
6. Data Warehouse
7. World Wide Web(WWW)
8. Time Series Databases
Spatial Database
A spatial database is a suitable way to store geographical information. It stores
data in the form of coordinates, lines, and other shapes. Maps and global positioning systems are
well-known applications of spatial databases.
Flat files
Flat files are in binary or text form and have a structure that can be easily
processed by data mining algorithms. Flat files can be described with a data
dictionary; the most common example is the CSV file. Flat files are widely used in data
warehousing.
Relational Databases
A relational database is an organized collection of related data, stored in the
form of tables with rows and columns. Relational databases are described by schemas; the
physical schema and the logical schema are the most common kinds.
Transactional Databases
A transactional database is an organized collection of records that represent transactions,
typically organized by transaction identifiers and timestamps. Transactional
databases must have the capability to roll back any transaction. ATM
machines, banking systems, and distributed systems are well-known
applications of transactional databases.
Multimedia Databases
Multimedia databases are databases that can store the following kinds of data:
Video
Images
Audio
Text
Multimedia databases can be implemented on top of object-oriented databases. E-book databases,
video websites, and news websites are well-known applications of multimedia
databases.
Data Warehouse:
A data warehouse is the collection of data that is collected and integrated from one or more
sources. Later this data can be mined for business decision making.
Three common types of data warehouse are:
1. Virtual Warehouse
2. Data Mart
3. Enterprise data warehouse
Business decision making and data mining are common applications of a data warehouse.
WWW
WWW stands for World Wide Web. The WWW is a collection of documents and resources that
can contain different kinds of data such as video, audio, and text. Each resource is identified by a
Uniform Resource Locator (URL) and accessed through web browsers. Online tools, online video,
image, and text search sites are well-known applications of the WWW.
Time-series Databases
Time-series databases store data that changes over time, such as stock exchange data.
Graphite and eXtremeDB are well-known examples of time-series database systems.
Data Mining Functionalities:
Data mining functionalities are used to specify the kinds of patterns to be
discovered in data mining tasks. In general, data mining tasks can be classified into two categories:
descriptive and predictive. Descriptive mining tasks characterize the general properties of the
data in the database, while predictive mining tasks perform inference on the current data to
make predictions.
Association analysis discovers rules showing attribute-value conditions that occur frequently
together. Example:
buys(X, "computer") => buys(X, "software") [support = 1%, confidence = 50%]
Support = 1% means that 1% of all the transactions under analysis showed that computer
and software were purchased together. Confidence = 50% means that 50% of the customers who
purchased a computer also bought the software.
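A minimal Python sketch (with a tiny made-up transaction list) showing how the support and confidence of such a rule would be computed:

# Toy transactions; each transaction is the set of items bought together.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"computer", "software", "printer"},
]
antecedent, consequent = {"computer"}, {"software"}

both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
ante = sum(1 for t in transactions if antecedent <= t)

support = both / len(transactions)   # fraction of all transactions containing both items
confidence = both / ante             # fraction of computer buyers who also bought software
print(f"support = {support:.0%}, confidence = {confidence:.0%}")
# Prints support = 50%, confidence = 67% for this toy data; the 1% / 50% figures in the
# rule above refer to a much larger hypothetical data set.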
Classification derives a model that describes and distinguishes data classes, for example a
decision tree. A decision tree is a flow-chart-like tree structure, where each node denotes a test
on an attribute value, each branch represents an outcome of the test, and tree leaves
represent classes or class distributions.
Prediction − It predicts some unavailable data values or upcoming trends. An object
can be anticipated based on the attribute values of the object and the attribute values of the
classes. It can be a prediction of missing numerical values or of increase/decrease trends in
time-related data.
Clustering − It is similar to classification, but the classes are not predefined; the classes are
derived from the data attributes. It is unsupervised learning. The objects are clustered or
grouped based on the principle of maximizing the intraclass similarity and minimizing
the interclass similarity.
Outlier analysis − Outliers are data elements that cannot be grouped into a given class or
cluster. These are the data objects whose behaviour differs greatly from the general
behaviour of other data objects. The analysis of this type of data can be essential for mining
knowledge.
Evolution analysis − It describes and models trends for objects whose behaviour changes
over time.
Major Issues in Data Mining
Mining Methodology and User Interaction Issues:
Data mining query languages and ad hoc data mining − A data mining query language
that allows the user to describe ad hoc mining tasks should be integrated with a data
warehouse query language and optimized for efficient and flexible data mining.
Presentation and visualization of data mining results − Once the patterns are
discovered, they need to be expressed in high-level languages and visual representations.
These representations should be easily understandable.
Handling noisy or incomplete data − The data cleaning methods are required to handle
the noise and incomplete objects while mining the data regularities. If the data cleaning
methods are not there then the accuracy of the discovered patterns will be poor.
Pattern evaluation − The patterns discovered may be uninteresting because they
represent common knowledge or lack novelty, so effective interestingness measures are
needed to evaluate them.
Performance Issues:
Efficiency and scalability of data mining algorithms − In order to effectively extract
information from huge amounts of data in databases, data mining algorithms must be
efficient and scalable.
Parallel, distributed, and incremental mining algorithms − The factors such as huge
size of databases, wide distribution of data, and complexity of data mining methods
motivate the development of parallel and distributed data mining algorithms. These
algorithms divide the data into partitions, which are processed in parallel.
The results from the partitions are then merged. Incremental algorithms update the
mining results as new data arrives, without mining the entire data again from scratch.
Handling of relational and complex types of data − The database may contain complex
data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for
one system to mine all these kinds of data.
Mining information from heterogeneous databases and global information systems −
The data is available at different data sources on a LAN or WAN. These data sources may be
structured, semi-structured, or unstructured. Therefore, mining knowledge from them
adds challenges to data mining.
Data Preprocessing
Data preprocessing involves transforming raw data into well-formed data sets so that data
mining analytics can be applied. Raw data is often incomplete and has inconsistent formatting. The
adequacy or inadequacy of data preparation has a direct correlation with the success of any project
that involves data analytics.
Noisy Data:
Noisy data is meaningless data; noise is a random error or variance in a measured
variable. Noisy data may be due to faulty data collection instruments, data entry problems, and
technology limitations. Noisy data can be handled by the following methods:
1. Binning method
2. Regression
3. Clustering
I Binning:
Binning methods smooth a sorted data value by consulting its "neighborhood," that is, the values
around it. The sorted data is divided into segments (bins) of equal size, and then a smoothing
operation is applied within each bin. The sorted values are distributed into a number of "buckets,"
or bins. There are three binning methods for data smoothing: smoothing by bin means, smoothing
by bin medians, and smoothing by bin boundaries.
For example
Price = 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins:
Bin a: 4, 8, 15
Bin b: 21, 21, 24
Bin c: 25, 28, 34
In this example, the data for price are first sorted and then partitioned into equal-frequency bins of
size 3.
Smoothing by bin means:
Bin a: 9, 9, 9
Bin b: 22, 22, 22
Bin c: 29, 29, 29
In smoothing by bin means, each value in a bin is replaced by the mean value of the bin.
Smoothing by bin boundaries:
Bin a: 4, 4, 15
Bin b: 21, 21, 24
Bin c: 25, 25, 34
In smoothing by bin boundaries, each bin value is replaced by the closest boundary value.
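The same example can be reproduced with a short Python sketch (the bin size of 3 and the sorted price list are taken from above):

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # already sorted
bin_size = 3
bins = [prices[i:i + bin_size] for i in range(0, len(prices), bin_size)]

# Smoothing by bin means: every value is replaced by the mean of its bin.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value is replaced by the closer of min(bin) and max(bin).
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(by_means)   # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]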
II Regression
Data can be smoothed by fitting the data to a regression function.
1. Linear regression
2. Multiple linear regression
Linear Regression:
Linear regression finds the best line to fit two attributes (or variables), so that one attribute can
be used to predict the other.
Multiple linear regression
More than two attributes are involved, and the data are fit to a multidimensional surface.
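A small Python sketch of smoothing by linear regression, using NumPy's polyfit on made-up attribute values (the data is an assumption for illustration only):

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)    # predictor attribute
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8, 12.3])    # noisy response attribute

w, b = np.polyfit(x, y, deg=1)                   # best-fit slope and intercept
y_smoothed = w * x + b                           # fitted line used as the smoothed values

print(f"fitted line: y = {w:.2f} * x + {b:.2f}")
print(y_smoothed)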
III Clustering:
Outliers may be detected by clustering, where similar values are organized into groups, or
"clusters." Values that fall outside of the set of clusters may be considered outliers.
Data Integration:
Data integration merges data from multiple sources into a coherent data store. Two common
approaches are tight coupling and loose coupling.
1 Tight Coupling
In tight coupling data is combined from different sources into a single physical location
through the process of ETL - Extraction, Transformation and Loading.
2 Loose Coupling
In loose coupling, data remains only in the actual source databases. In this approach, an
interface is provided that takes a query from the user, transforms it into a form the source databases
can understand, and then sends the query directly to the source databases to obtain the result.
1. Schema Integration:
Schema integration integrates metadata from different sources. Matching entities from multiple
sources is referred to as the entity identification problem.
2. Redundancy
An attribute may be redundant if it can be derived from another attribute or set of
attributes. Some redundancies can be detected by correlation analysis.
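A brief Python sketch of correlation analysis for redundancy detection: a Pearson correlation coefficient close to +1 or -1 suggests that one numeric attribute can largely be derived from the other (the attributes here are invented for illustration):

import numpy as np

height_cm = np.array([150.0, 160.0, 170.0, 180.0, 190.0])
height_in = height_cm / 2.54                  # an attribute fully derivable from height_cm

r = np.corrcoef(height_cm, height_in)[0, 1]   # Pearson correlation coefficient
print(f"Pearson r = {r:.3f}")                 # close to 1.000, so one of the attributes is redundant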
Data Transformation:
Data transformation is a technique used to convert raw data into a suitable format that
eases data mining and the retrieval of strategic information. Data transformation includes data
cleaning and data reduction techniques to convert the data into the appropriate form.
Data Transformation Techniques
There are several data transformation techniques that can help structure and clean up the
data before analysis or storage in a data warehouse. Let's study all techniques used for data
transformation, some of which we have already studied in data reduction and data cleaning.
1. Data Smoothing
Data smoothing is a process that is used to remove noise from the dataset using some
algorithms. It allows for highlighting important features present in the dataset. It helps in
predicting the patterns. When collecting data, it can be manipulated to eliminate or reduce any
variance or any other noise form.
The concept behind data smoothing is that it identifies simple changes that help
predict trends and patterns. This helps analysts or traders who need to look
at large amounts of data, which can often be difficult to digest, to find patterns they would not see
otherwise.
We have seen how noise is removed from the data using techniques such as binning,
regression, and clustering.
o Binning: This method splits the sorted data into a number of bins and smooths the
data values in each bin by considering the neighboring values around them.
o Regression: This method identifies the relation between two attributes so that, given
one attribute, it can be used to predict the other attribute.
o Clustering: This method groups similar data values into clusters. The values that lie
outside a cluster are known as outliers.
2. Attribute Construction
In the attribute construction method, new attributes are constructed from the existing attributes
to ease data mining. New attributes are created from the given attributes and applied to assist the
mining process. This simplifies the original data and makes the mining
more efficient.
For example, suppose we have a data set referring to measurements of different plots, i.e., we
may have the height and width of each plot. Here, we can construct a new attribute 'area' from the
attributes 'height' and 'width'. This also helps in understanding the relations among the attributes in
a data set.
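A minimal pandas sketch of this attribute construction example, using toy plot measurements:

import pandas as pd

plots = pd.DataFrame({"height": [10, 12, 8], "width": [4, 5, 6]})
plots["area"] = plots["height"] * plots["width"]    # newly constructed attribute
print(plots)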
3. Data Aggregation
Data aggregation is the method of storing and presenting data in a summary
format. The data may be obtained from multiple data sources and integrated into a single
summary for analysis. This is a crucial step since the accuracy of data analysis insights is highly
dependent on the quantity and quality of the data used.
Gathering accurate, high-quality data in sufficient quantity is necessary to produce
relevant results. Aggregated data is useful for everything from decisions concerning financing
and business strategy to pricing, operations, and marketing.
For example, we have a data set of sales reports of an enterprise that has quarterly sales of
each year. We can aggregate the data to get the enterprise's annual sales report.
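A small pandas sketch of this aggregation, rolling invented quarterly sales figures up to annual totals:

import pandas as pd

sales = pd.DataFrame({
    "year":    [2021, 2021, 2021, 2021, 2022, 2022, 2022, 2022],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "amount":  [200, 250, 180, 320, 220, 270, 190, 350],
})

annual = sales.groupby("year", as_index=False)["amount"].sum()   # one summarized row per year
print(annual)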
4. Data Normalization
Normalizing the data refers to scaling the data values to a much smaller range such as [-1, 1] or [0.0,
1.0]. There are different methods to normalize the data, as discussed below.
Consider that we have a numeric attribute A with n observed values v1, v2, v3, ..., vn.
o Z-score normalization: This method normalizes a value v of attribute A using
the mean and standard deviation of A. The formula for Z-score normalization is:
v' = (v - Ᾱ) / σA
Here Ᾱ and σA are the mean and standard deviation of attribute A, respectively.
For example, suppose attribute A has mean $54,000 and standard deviation $16,000.
Then the value $73,600 is normalized to (73,600 - 54,000) / 16,000 = 1.225.
o Decimal Scaling: This method normalizes a value v of attribute A by moving the decimal
point. The number of places the decimal point is moved depends on the maximum absolute
value of A. The formula for decimal scaling is:
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1.
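A minimal Python sketch of both normalization methods, using the mean and standard deviation from the example above and a few illustrative values of A:

import numpy as np

values = np.array([73600.0, 54000.0, 38000.0])   # illustrative observed values of A

# Z-score normalization: v' = (v - mean) / std
mean_A, std_A = 54000.0, 16000.0
z = (values - mean_A) / std_A
print(z)          # 73,600 maps to (73600 - 54000) / 16000 = 1.225

# Decimal scaling: divide by 10^j for the smallest j with max(|v'|) < 1
j = len(str(int(np.abs(values).max())))          # digits in the largest absolute value
scaled = values / 10 ** j
print(scaled)     # 73,600 maps to 0.736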
5. Data Discretization
This is a process of converting continuous data into a set of data intervals. Continuous attribute
values are substituted by small interval labels. This makes the data easier to study and analyze. If a
data mining task handles a continuous attribute, its continuous values can be replaced by
discrete interval labels. This improves the efficiency of the task.
This method is also called a data reduction mechanism as it transforms a large dataset into a set of
categorical data. Discretization also uses decision tree-based algorithms to produce short, compact,
and accurate results when using discrete values.
Data discretization can be classified into two types: supervised discretization, where the class
information is used, and unsupervised discretization, where it is not. Depending on which direction
the process proceeds, discretization follows either a 'top-down splitting strategy' or a 'bottom-up
merging strategy'.
For example, the values for the age attribute can be replaced by the interval labels such as (0-10,
11-20…) or (kid, youth, adult, senior).
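A short pandas sketch of this discretization; the bin edges and labels are illustrative assumptions:

import pandas as pd

ages = pd.Series([4, 15, 27, 45, 70])
labels = pd.cut(ages, bins=[0, 10, 20, 60, 120], labels=["kid", "youth", "adult", "senior"])
print(labels.tolist())    # ['kid', 'youth', 'adult', 'adult', 'senior']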
6. Data Generalization
It converts low-level data attributes to high-level data attributes using a concept hierarchy. This
conversion from a lower conceptual level to a higher one is useful for getting a clearer picture of the
data. Data generalization can be divided into two approaches: the data cube (OLAP) approach and
the attribute-oriented induction approach.
For example, age data may appear as values such as 20 or 30 in a dataset. It can be transformed
to a higher conceptual level, such as a categorical value (young, old).
Data Reduction:
Data reduction is a process that reduces the volume of original data and represents it in a
much smaller volume. Data reduction techniques are used to obtain a reduced representation of the
dataset that is much smaller in volume while maintaining the integrity of the original data. By reducing
the data, the efficiency of the data mining process is improved, which produces the same analytical
results.
Data reduction does not affect the result obtained from data mining. That means the result
obtained from data mining before and after data reduction is the same or almost the same.
Data reduction aims to represent the data more compactly. When the data size is smaller, it is
simpler to apply sophisticated and computationally expensive algorithms. The reduction of the data
may be in terms of the number of rows (records) or the number of columns (dimensions).
The following are common techniques of data reduction in data mining:
1. Dimensionality Reduction
Whenever we encounter weakly relevant or redundant attributes, we keep only the attributes
required for our analysis. Dimensionality reduction eliminates such attributes from the data set
under consideration, thereby reducing the volume of the original data. It reduces data size by
eliminating outdated or redundant features. Here are three methods of dimensionality reduction.
i. Wavelet Transform: In the wavelet transform, suppose a data vector A is transformed into
a numerically different data vector A' such that both A and A' vectors are of the same length.
It is useful for data reduction because the data obtained from the wavelet
transform can be truncated: compressed data is obtained by retaining only a small
fraction of the strongest wavelet coefficients. Wavelet transforms can be applied to data
cubes, sparse data, or skewed data.
ii. Principal Component Analysis: Suppose we have a data set to be analyzed that has tuples
with n attributes. Principal component analysis searches for k n-dimensional orthogonal
vectors (the principal components), with k ≤ n, that can best represent the data set.
In this way, the original data can be projected onto a much smaller space, and dimensionality
reduction is achieved. Principal component analysis can be applied to sparse and
skewed data (see the sketch after this list).
iii. Attribute Subset Selection: The large data set has many attributes, some of which are
irrelevant to data mining or redundant. Attribute subset selection
reduces the data volume and dimensionality by eliminating such redundant and
irrelevant attributes.
The attribute subset selection ensures that we get a good subset of original attributes even
after eliminating the unwanted attributes. The resulting probability of data distribution is
as close as possible to the original data distribution using all the attributes.
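As mentioned under item ii, here is a rough scikit-learn sketch of principal component analysis projecting tuples with n = 4 attributes onto k = 2 principal components; the data matrix is random and purely illustrative:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))            # 100 tuples, 4 attributes

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)         # the same tuples described by 2 components

print(X_reduced.shape)                   # (100, 2)
print(pca.explained_variance_ratio_)     # share of variance retained by each component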
2. Numerosity Reduction
The numerosity reduction technique reduces the original data volume and represents it in a much
smaller form. It includes two types: parametric and non-parametric numerosity reduction.
Sampling is a common non-parametric technique, in which a large data set D of N tuples is
represented by a much smaller random sample (a sketch follows the list below):
a. Simple random sample without replacement (SRSWOR) of size s: s tuples are
drawn from D (s < N), and a tuple, once drawn, cannot be drawn again.
b. Simple random sample with replacement (SRSWR) of size s: similar to SRSWOR,
except that a drawn tuple is placed back in D and may be drawn again.
c. Cluster sample: The tuples in data set D are clustered into M mutually
disjoint subsets. The data reduction can be applied by implementing
SRSWOR on the clusters themselves: a simple random sample of s clusters is
drawn, where s < M.
d. Stratified sample: The large data set D is partitioned into mutually disjoint
sets called 'strata'. A simple random sample is taken from each stratum to
get stratified data. This method is effective for skewed data.
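A brief pandas sketch of the sampling schemes above on a toy data set D; the column names and sample sizes are assumptions for illustration:

import pandas as pd

D = pd.DataFrame({
    "cluster": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "stratum": ["A", "A", "B", "A", "B", "B", "A", "B", "B"],
    "value":   [10, 12, 9, 20, 22, 19, 30, 31, 29],
})

# Simple random sample without replacement (SRSWOR) of size s = 3.
srswor = D.sample(n=3, replace=False, random_state=0)

# Cluster sample: draw s = 2 of the M = 3 clusters and keep all their tuples.
chosen = pd.Series(D["cluster"].unique()).sample(n=2, random_state=0)
cluster_sample = D[D["cluster"].isin(chosen)]

# Stratified sample: one tuple drawn at random from each stratum.
stratified = D.groupby("stratum", group_keys=False).sample(n=1, random_state=0)

print(srswor, cluster_sample, stratified, sep="\n\n")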
3. Data Cube Aggregation
This technique is used to aggregate data in a simpler form. Data cube aggregation is a
multidimensional aggregation that uses aggregation at various levels of a data cube to represent the
original data set, thus achieving data reduction.
For example, suppose you have the data of All Electronics sales per quarter for the year 2018 to the
year 2022. If you want to get the annual sale per year, you just have to aggregate the sales per
quarter for each year. In this way, aggregation provides you with the required data, which is much
smaller in size, and thereby we achieve data reduction even without losing any data.
The data cube aggregation is a multidimensional aggregation that eases multidimensional analysis.
The data cube holds precomputed and summarized data, which gives data mining fast access to the
summarized information.
4. Data Compression
Data compression employs modification, encoding, or converting the structure of data in a way that
consumes less space. Data compression involves building a compact representation of information
by removing redundancy and representing data in binary form. Compression from which the
original data can be restored exactly is called lossless compression; compression from which the
original cannot be fully restored is called lossy compression. Dimensionality reduction and
numerosity reduction methods can also be considered forms of data compression.
This technique reduces the size of the files using different encoding mechanisms, such as Huffman
Encoding and run-length Encoding. We can divide it into two types based on their compression
techniques.
i. Lossless Compression: Encoding techniques (Run Length Encoding) allow a simple and
minimal data size reduction. Lossless data compression uses algorithms to restore the
precise original data from the compressed data.
ii. Lossy Compression: In lossy-data compression, the decompressed data may differ from
the original data but are useful enough to retrieve information from them. For example, the
JPEG image format is a lossy compression, but we can find the meaning equivalent to the
original image. Methods such as the discrete wavelet transform and PCA (principal
component analysis) are examples of this kind of compression.
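A minimal Python sketch of lossless compression using run-length encoding, one of the encoding mechanisms mentioned above; the input string is illustrative:

from itertools import groupby

def rle_encode(text):
    # Store each run of identical characters as a (character, run length) pair.
    return [(ch, len(list(run))) for ch, run in groupby(text)]

def rle_decode(pairs):
    # Expand the (character, run length) pairs back into the original string.
    return "".join(ch * count for ch, count in pairs)

data = "AAAABBBCCD"
encoded = rle_encode(data)
print(encoded)                        # [('A', 4), ('B', 3), ('C', 2), ('D', 1)]
assert rle_decode(encoded) == data    # lossless: the exact original is recovered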
5. Discretization Operation
The data discretization technique is used to divide attributes of a continuous nature
into data with intervals. We replace the many continuous values of an attribute with labels of small
intervals, so that mining results are presented in a concise and easily understandable way.
Unit I Completed