UNIT-II
DATA MINING – INTRODUCTION
Introduction to Data Mining Systems – Knowledge Discovery Process – Data
Mining Techniques – Issues – applications- Data Objects and attribute types, Statistical
description of data, Data Preprocessing – Cleaning, Integration, Reduction,
Transformation and discretization, Data Visualization, Data similarity and dissimilarity
measures.
INTRODUCTION
Data mining refers to extracting or mining knowledge from large amounts of
data. The term is actually a misnomer; it would more appropriately have been
named knowledge mining, which emphasizes mining knowledge from large
amounts of data. It is the computational process of discovering patterns in
large data sets using methods at the intersection of artificial intelligence,
machine learning, statistics, and database systems. The overall goal of the
data mining process is to extract information from a data set and transform
it into an understandable structure for further use.
The key properties of data mining are
Automatic discovery of patterns
Prediction of likely outcomes
Creation of actionable information
Focus on large datasets and databases
KNOWLEDGE DISCOVERY PROCESS
Data Mining is defined as extracting information from huge sets of data. In other words,
we can say that data mining is the procedure of mining knowledge from data. The
information or knowledge extracted in this way can be used for any of the following applications −
Market Analysis
Fraud Detection
Customer Retention
Production Control
Science Exploration
Some people treat data mining as a synonym for Knowledge Discovery from Data (KDD),
while others view data mining as an essential step in the process of knowledge discovery.
Here is the list of steps involved in the knowledge discovery process –
Data Cleaning − In this step, noise and inconsistent data are removed.
Data Integration − In this step, multiple data sources are combined.
Data Selection − In this step, data relevant to the analysis task are retrieved from the
database.
Data Transformation − In this step, data is transformed or consolidated into forms
appropriate for mining by performing summary or aggregation operations.
Data Mining − In this step, intelligent methods are applied in order to extract data
patterns.
Pattern Evaluation − In this step, data patterns are evaluated.
Knowledge Presentation − In this step, knowledge is represented.
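To make the flow of these steps concrete, here is a minimal sketch in Python with pandas. It is only an outline of how the steps might be chained, not a prescribed implementation; the input files sales.csv and customers.csv, the column names, and the simple "high spenders" pattern are invented assumptions for illustration.

import pandas as pd

# Data cleaning: drop duplicate rows and fill missing numeric values (hypothetical file)
df = pd.read_csv("sales.csv")
df = df.drop_duplicates()
df["amount"] = df["amount"].fillna(df["amount"].mean())

# Data integration: merge a second (hypothetical) customer table
customers = pd.read_csv("customers.csv")
df = df.merge(customers, on="customer_id", how="left")

# Data selection: keep only the columns relevant to the analysis task
data = df[["customer_id", "item", "amount"]]

# Data transformation: aggregate to one summary row per customer
summary = data.groupby("customer_id")["amount"].sum().reset_index()

# Data mining (a trivial "pattern" here): customers with unusually high total spend
threshold = summary["amount"].mean() + 2 * summary["amount"].std()
patterns = summary[summary["amount"] > threshold]

# Pattern evaluation / knowledge presentation
print(patterns.sort_values("amount", ascending=False).head())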
DATA MINING TECHNIQUES
Data mining functionalities are used to specify the kind of patterns to be found in
data mining tasks. In general, data mining tasks can be classified into two
categories: descriptive and predictive.
Descriptive mining tasks characterize the general properties of the data in the
database. Predictive mining tasks perform inference on the current data in order
to make predictions. Data mining functionalities, and the kinds of patterns they
can discover, are described below.
1. Concept/Class Description: Characterization and Discrimination
Data can be associated with classes or concepts.
It can be useful to describe individual classes and concepts in
summarized, concise, and yet precise terms.
Such descriptions of a class or a concept are called class/concept descriptions.
These descriptions can be derived via
(1)Data characterization, by summarizing the data of the class under
study (often called the target class) in general terms, or
(2)Data discrimination, by comparison of the target class with one or a set
of comparative classes (often called the contrasting classes), or
(3)Both data characterization and discrimination. Data characterization is
a summarization of the general characteristics or features of a target class
of data.
The data corresponding to the user-specified class are typically collected by
a database query.
2. Mining of Frequent Patterns
Frequent patterns are those patterns that occur frequently in transactional data. The
kinds of frequent patterns are −
Frequent Item Set − It refers to a set of items that frequently appear together, for
example, milk and bread.
Frequent Subsequence − A sequence of patterns that occurs frequently, such as
purchasing a camera followed by a memory card.
Frequent Sub Structure − Substructure refers to different structural forms, such as
graphs, trees, or lattices, which may be combined with item-sets or subsequences.
3. Mining of Association
Associations are used in retail sales to identify items that are frequently purchased
together. Association mining is the process of uncovering relationships among
data and determining association rules.
For example, a retailer might generate an association rule showing that 70% of the time
milk is sold with bread, while only 30% of the time biscuits are sold with bread.
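As a rough illustration of how the support and confidence behind such a rule could be computed, consider the sketch below; the tiny transaction list and the "milk implies bread" rule are invented for the example.

# Sketch: support and confidence for the rule {milk} -> {bread}
transactions = [
    {"milk", "bread"}, {"milk", "bread", "butter"}, {"bread", "biscuits"},
    {"milk", "bread"}, {"milk"}, {"bread"},
]

n = len(transactions)
count_milk = sum(1 for t in transactions if "milk" in t)
count_milk_bread = sum(1 for t in transactions if {"milk", "bread"} <= t)

support = count_milk_bread / n              # fraction of all transactions containing milk and bread
confidence = count_milk_bread / count_milk  # of the milk transactions, how many also contain bread

print(f"support = {support:.2f}, confidence = {confidence:.2f}")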
4. Mining of Correlations
It is a kind of additional analysis performed to uncover interesting statistical
correlations between associated attribute-value pairs or between two item sets, in order
to analyze whether they have a positive, negative, or no effect on each other.
5. Mining of Clusters
A cluster refers to a group of similar objects. Cluster analysis refers to forming
groups of objects that are very similar to each other but highly different from the
objects in other clusters.
6. Classification and Prediction
Classification is the process of finding a model that describes the data classes or
concepts. The purpose is to be able to use this model to predict the class of objects
whose class label is unknown. This derived model is based on the analysis of sets of
training data. The derived model can be presented in the following forms −
1. Classification (IF-THEN) Rules
2. Prediction
3. Decision Trees
4. Mathematical Formulae
5. Neural Networks
6. Outlier Analysis
7. Evolution Analysis
The list of functions involved in these processes is as follows −
1. Classification − It predicts the class of objects whose class label is unknown. Its
objective is to find a derived model that describes and distinguishes data classes or
concepts. The derived model is based on the analysis of a set of training data, i.e.,
data objects whose class labels are known.
2. Prediction − It is used to predict missing or unavailable numerical data values
rather than class labels. Regression Analysis is generally used for prediction.
Prediction can also be used for identification of distribution trends based on
available data.
3. Decision Trees − A decision tree is a structure that includes a root node,
branches, and leaf nodes. Each internal node denotes a test on an attribute, each
branch denotes the outcome of a test, and each leaf node holds a class label (a small
scikit-learn sketch of such a tree appears after this list).
4. Mathematical Formulae – Data can be mined by using some mathematical
formulas.
5. Neural Networks − Neural networks represent a brain metaphor for information
processing. These models are biologically inspired rather than an exact replica of
how the brain actually functions. Neural networks have been shown to be very
promising systems in many forecasting applications and business classification
applications due to their ability to “learn” from the data.
6. Outlier Analysis − Outliers may be defined as the data objects that do not
comply with the general behavior or model of the data available.
7. Evolution Analysis − Evolution analysis refers to the description and modeling of
regularities or trends for objects whose behavior changes over time.
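As noted under item 3 above, here is a hedged sketch of building a small decision-tree classifier with scikit-learn; the toy feature matrix (age, income), the labels, and the max_depth setting are illustrative assumptions, not a prescribed model.

from sklearn.tree import DecisionTreeClassifier, export_text

# Toy training data: [age, income] -> buys (1 = yes, 0 = no); values are invented
X_train = [[25, 30000], [45, 80000], [35, 60000], [22, 20000], [50, 90000]]
y_train = [0, 1, 1, 0, 1]

model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X_train, y_train)

# Predict the class of an object whose class label is unknown
print(model.predict([[40, 70000]]))

# Each internal node tests an attribute; each leaf holds a class label
print(export_text(model, feature_names=["age", "income"]))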
MAJOR ISSUES IN DATA MINING
Data mining is not an easy task, as the algorithms used can get very complex and
data is not always available at one place. It needs to be integrated from various
heterogeneous data sources. These factors also create some issues. The major issues
fall under the following headings −
Mining Methodology and User Interaction
Performance Issues
Diverse Data Types Issues
These issues are described below.
Mining Methodology and User Interaction Issues
It refers to the following kinds of issues −
Mining different kinds of knowledge in databases − Different users may be
interested in different kinds of knowledge. Therefore it is necessary for data mining
to cover a broad range of knowledge discovery tasks.
Interactive mining of knowledge at multiple levels of abstraction − The data
mining process needs to be interactive because it allows users to focus the search for
patterns, providing and refining data mining requests based on the returned results.
Incorporation of background knowledge − Background knowledge can be used to
guide the discovery process and to express the discovered patterns. It may be used to
express the discovered patterns not only in concise terms but also at multiple levels
of abstraction.
Data mining query languages and ad hoc data mining − A data mining query
language that allows the user to describe ad hoc mining tasks should be integrated
with a data warehouse query language and optimized for efficient and flexible data
mining.
Presentation and visualization of data mining results − Once patterns are
discovered, they need to be expressed in high-level languages and visual
representations. These representations should be easily understandable.
Handling noisy or incomplete data − Data cleaning methods are required to
handle noise and incomplete objects while mining data regularities. Without such
methods, the accuracy of the discovered patterns will be poor.
Pattern evaluation − The patterns discovered may turn out to be uninteresting because
they represent common knowledge or lack novelty; evaluating the interestingness of
patterns is therefore a challenge.
Performance Issues
There can be performance-related issues such as follows −
Efficiency and scalability of data mining algorithms − In order to effectively
extract information from the huge amount of data in databases, data mining
algorithms must be efficient and scalable.
Parallel, distributed, and incremental mining algorithms − Factors such as the
huge size of databases, the wide distribution of data, and the complexity of data mining
methods motivate the development of parallel and distributed data mining
algorithms. These algorithms divide the data into partitions, which are processed in
parallel, and the results from the partitions are then merged. Incremental algorithms
update existing mining results to incorporate database updates without mining the
entire data again from scratch.
Diverse Data Types Issues
Handling of relational and complex types of data − The database may contain
complex data objects, multimedia data objects, spatial data, temporal data, etc. It is
not possible for one system to mine all these kinds of data.
Mining information from heterogeneous databases and global information
systems − The data is available from different data sources on a LAN or WAN. These
data sources may be structured, semi-structured, or unstructured. Mining knowledge
from them therefore adds challenges to data mining.
DATA MINING APPLICATIONS
Here is the list of areas where data mining is widely used −
Financial Data Analysis
Retail Industry
Telecommunication Industry
Biological Data Analysis
Education
Research
Healthcare and Insurance
Transportation
Other Scientific Applications
Intrusion Detection
Financial Data Analysis
Financial data in the banking and financial industry is generally reliable and of high
quality, which facilitates systematic data analysis and data mining. Typical applications include:
Loan payment prediction and customer credit policy analysis.
Detection of money laundering and other financial crimes.
Retail Industry
Data mining has great application in the retail industry because the industry collects
large amounts of data on sales, customer purchasing history, goods transportation,
consumption, and services. It is natural that the quantity of data collected will
continue to expand rapidly because of the increasing ease, availability, and
popularity of the web.
Telecommunication Industry
Today the telecommunication industry is one of the fastest-growing industries,
providing various services such as fax, pager, cellular phone, internet messenger,
images, e-mail, web data transmission, etc. Due to the development of new
computer and communication technologies, the telecommunication industry is
rapidly expanding. This is why data mining has become very important in helping
to understand the business.
Biological Data Analysis
In recent times, we have seen tremendous growth in fields of biology such as
genomics, proteomics, functional genomics, and biomedical research. Biological
data mining is a very important part of bioinformatics.
Education: The education sector is analyzed using Educational Data Mining (EDM)
methods, which generate patterns that can be used by both learners and educators.
Using EDM we can perform educational tasks such as:
Predicting student admission in higher education
Student profiling
Predicting student performance
Evaluating teachers' teaching performance
Curriculum development
Predicting student placement opportunities
Research: Data mining techniques can perform prediction, classification,
clustering, association, and grouping of data in research. The rules and patterns
generated by data mining help researchers find results. In most technical research in
data mining, we build a training model and a testing model; the train/test split is a
strategy to measure the accuracy of the proposed model. It is called train/test because
we split the data set into two sets: a training data set, used to build the model, and a
testing data set, used to evaluate it. Examples:
Classification of uncertain data.
Information-based clustering.
Decision support system
Web Mining
Domain-driven data mining
IoT (Internet of Things) and cybersecurity
Smart farming with IoT (Internet of Things)
Healthcare and Insurance: A pharmaceutical company can examine its sales
force activity and its outcomes to improve the targeting of high-value physicians
and figure out which marketing activities will have the greatest effect in the
coming months. In the insurance sector, data mining can help predict which
customers will buy new policies, identify behavior patterns of risky customers,
and identify fraudulent behavior. Typical uses include:
Claims analysis, i.e., which medical procedures are claimed together.
Identifying successful medical therapies for different illnesses.
Characterizing patient behavior to predict office visits.
Transportation: A diversified transportation company with a large direct sales
force can apply data mining to identify the best prospects for its services. A large
consumer goods company can apply data mining to improve its sales process to
retailers. Typical uses include:
Determining distribution schedules among outlets.
Analyzing loading patterns.
Other Scientific Applications
The applications discussed above tend to handle relatively small and homogeneous
data sets for which statistical techniques are appropriate. In contrast, huge amounts of
data have been collected from scientific domains such as the geosciences, astronomy, etc.
Large data sets are also being generated by fast numerical simulations in various
fields such as climate and ecosystem modeling, chemical engineering, fluid
dynamics, etc.
Intrusion Detection
Intrusion refers to any kind of action that threatens the integrity, confidentiality, or
availability of network resources. In this world of connectivity, security has become
a major issue. The increased usage of the internet and the availability of tools and
tricks for intruding into and attacking networks have made intrusion detection a
critical component of network administration.
DATA OBJECTS AND ATTRIBUTE TYPES
Data Object: An object is a real-world entity, such as a customer, an item, or a patient,
described by a set of attributes.
Attribute:
An attribute is a data field that represents a characteristic or feature of a data
object. For a customer object, attributes can be customer ID, address, etc. The
attribute types can be represented as follows—
1. Nominal Attributes – related to names: The values of a nominal attribute are
names of things or symbols. They represent some category or state, which is why
nominal attributes are also referred to as categorical attributes.
Example: Attribute: Color; Values: Black, Green, Brown, Red
2. Binary Attributes: Binary attributes have only two values or states, for example
yes or no, affected or unaffected, true or false.
i) Symmetric: Both values are equally important (e.g., gender).
ii) Asymmetric: The two values are not equally important (e.g., a test result, where
the positive outcome matters more).
3. Ordinal Attributes: Ordinal attributes contain values that have a meaningful
sequence or ranking (order) between them, but the magnitude between successive
values is not known; the order shows which values rank higher, but not by how much.
Example: Attribute: Grade; Values: O, S, A, B, C, D, F
4. Numeric: A numeric attribute is quantitative: it is a measurable quantity,
represented in integer or real values. Numeric attributes are of two types.
i. An interval-scaled attribute has values whose differences are interpretable, but it
has no true reference (zero) point. Interval-scaled data can be added and subtracted
but cannot be meaningfully multiplied or divided. Consider temperature in degrees
Centigrade: if one day's temperature is numerically twice another day's, we cannot
say that the first day is twice as hot as the other.
ii. A ratio-scaled attribute is a numeric attribute with a fixed zero point. If a
measurement is ratio-scaled, we can speak of a value as being a multiple (or ratio) of
another value. The values are ordered, we can compute the difference between
values, and the mean, median, mode, quantile range, and five-number summary
can be given.
5. Discrete: Discrete attributes have a finite or countably infinite set of values; they
can be numeric or categorical.
Example: Attribute: Profession; Values: Teacher, Businessman, Peon
Attribute: ZIP Code; Values: 521157, 521301
6. Continuous: Continuous attributes have an infinite number of possible values and
are typically represented as real (floating-point) numbers; there can be many values
between 2 and 3.
Example: Attribute: Height; Values: 5.4, 5.7, 6.2, etc.
Attribute: Weight; Values: 50, 65, 70, 73, etc.
BASIC STATISTICAL DESCRIPTIONS OF DATA
Basic Statistical descriptions of data can be used to identify properties of the data and
highlight which data values should be treated as noise or outliers.
1. Measuring the central Tendency:
There are many ways to measure the central tendency.
a) Mean: Let x_1, x_2, ..., x_N be a set of N observed values (observations) of X.
The most common numeric measure of the “center” of a set of data is the (arithmetic) mean:
\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i
Sometimes each value x_i in the set may be associated with a weight w_i, for i = 1, 2, ..., N; then the mean is
\bar{x} = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i}
This is called the weighted arithmetic mean or weighted average.
b) Median: The median is the middle value of the ordered data.
If N is odd, the median is the middle value of the ordered list.
If N is even, the median is the average of the two middle values.
For grouped data, the median can be estimated by interpolation.
c) Mode:
The mode is another measure of central tendency: it is the value that occurs most
frequently in the data set.
Datasets with one, two, or three modes are respectively called unimodal, bimodal, and
trimodal. In general, a dataset with two or more modes is multimodal. If each value
occurs only once, there is no mode.
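The sketch below, using Python's standard statistics module on invented sample values, shows how the mean, weighted mean, median, and mode(s) might be computed.

import statistics

values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]   # illustrative observations
weights = [1] * len(values)                                   # illustrative (equal) weights

mean = statistics.mean(values)
weighted_mean = sum(w * x for w, x in zip(weights, values)) / sum(weights)
median = statistics.median(values)            # N is even here, so the two middle values are averaged
modes = statistics.multimode(values)          # [52, 70]: this sample is bimodal

print(mean, weighted_mean, median, modes)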
2. Measuring the Dispersion of Data:
The dispersion (spread) of the data can be measured using the range, quantiles,
quartiles, percentiles, and the interquartile range (IQR).
The kth percentile of a set of data in numerical order is the value xi having the
property that k percent of the data entries lie at or below xi.
The measures of data dispersion are: range, five-number summary, interquartile range
(IQR), variance, and standard deviation.
Five-Number Summary:
This consists of five values: Minimum, Q1 (25th percentile), Median, Q3 (75th
percentile), and Maximum. These five numbers are represented graphically as a boxplot.
In a boxplot, the data is represented with a box.
The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR.
The median is marked by a line within the box.
Whiskers: two lines outside the box extend to the minimum and maximum observations.
To show outliers, the whiskers extend to the extreme low and high observations only if
these values lie within 1.5 * IQR of the quartiles; values beyond that are plotted
individually as outliers.
Variance and Standard Deviation:
Let x_1, x_2, ..., x_N be a set of N observed values (observations) of X. The variance is
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})^2
The standard deviation \sigma is the square root of the variance.
\sigma measures spread about the mean and should be used only when the mean is
chosen as the measure of center.
\sigma = 0 only when there is no spread, that is,
when all observations have the same value.
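A short NumPy sketch (on invented data) of the five-number summary, IQR, variance, standard deviation, and the 1.5 * IQR outlier rule described above:

import numpy as np

x = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])   # illustrative data

q1, median, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1
five_number = (x.min(), q1, median, q3, x.max())

variance = x.var()      # population variance: mean of squared deviations from the mean
std_dev = x.std()       # sigma, the square root of the variance

# Values beyond 1.5 * IQR past the quartiles are flagged as potential outliers
outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]
print(five_number, iqr, variance, std_dev, outliers)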
3. Graphical Displays of Basic Statistical data:
There are many types of graphs for the display of data summaries and distributions,
such as:
Bar charts
Pie charts
Line graphs
Boxplot
Histograms
Quantile plots, Quantile - Quantile plots
Scatter plots
The data values can be represented as bar charts, pie charts, line graphs, etc.
Quantile plots:
A quantile plot is a simple and effective way to have a first look at a univariate data
distribution. It plots quantile information:
for data x_i sorted in increasing order, f_i indicates that approximately 100*f_i % of
the data are at or below the value x_i.
Note that
the 0.25 quantile corresponds to quartile Q1,
the 0.50 quantile is the median, and
the 0.75 quantile is Q3.
Quantile-Quantile (Q-Q) plots:
In statistics, a Q-Q plot is a probability plot, a graphical method for comparing
two probability distributions by plotting their quantiles against each other.
Histograms or frequency histograms:
A histogram is a univariate graphical method. It consists of a set of rectangles that
reflect the counts or frequencies of the classes present in the given data.
If the attribute is categorical, one rectangle is drawn for each known value of the
attribute, and the resulting graph is more commonly referred to as a bar chart.
If the attribute is numeric, the term histogram is preferred.
Scatter Plot:
A scatter plot is one of the most effective graphical methods for determining whether
there appears to be a relationship, clusters of points, or outliers between two numeric
attributes.
Each pair of values is treated as a pair of coordinates and plotted as a point in the
plane.
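The following hedged Matplotlib sketch draws a histogram, a boxplot, a scatter plot, and a simple quantile plot for synthetic data; the random values, figure size, and bin count are arbitrary choices for illustration.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(50, 10, 200)          # illustrative numeric attribute
y = 0.8 * x + rng.normal(0, 5, 200)  # second attribute, loosely related to x

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes[0, 0].hist(x, bins=15)                          # histogram of a numeric attribute
axes[0, 1].boxplot(x)                                # boxplot: five-number summary plus whiskers
axes[1, 0].scatter(x, y, s=10)                       # scatter plot of two attributes
f = np.linspace(0.01, 0.99, 50)
axes[1, 1].plot(f, np.quantile(x, f), marker=".")    # quantile plot: f_i versus sorted x values
plt.tight_layout()
plt.show()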
DATA PREPROCESSING
The major steps involved in data preprocessing are data cleaning, data
integration, data reduction, and data transformation, as follows −
Data Cleaning − Data cleaning routines work to “clean” the data by filling
in missing values, smoothing noisy data, identifying or eliminating outliers,
and resolving inconsistencies. If users believe the data are dirty, they are unlikely to
trust the results of any data mining that has been applied to it.
Moreover, dirty data can confuse the mining procedure, resulting in unreliable
output. Although some mining routines have steps for dealing with incomplete or noisy
data, they are not always robust; instead, they may concentrate on avoiding
overfitting the data to the function being modeled.
Data Integration − Data integration is the procedure of merging data from several
disparate sources. While performing data integration, one must deal with data
redundancy, inconsistency, duplication, etc. In data mining, data integration is a data
preprocessing technique that merges data from a number of heterogeneous
data sources into coherent data, to retain and provide a unified perspective of the data.
Data integration is especially important in the healthcare industry. Integrating data from
multiple patient records and clinics assists clinicians in recognizing medical disorders and
diseases by combining data from multiple systems into a single perspective of
beneficial data from which useful insights can be derived.
Data Reduction − The objective of data reduction is to represent the data more compactly.
When the data size is smaller, it is simpler to use sophisticated and computationally
expensive algorithms. The reduction of the data can be in terms of the number of rows
(records) or the number of columns (dimensions).
In dimensionality reduction, data encoding schemes are used so as to acquire a
reduced or “compressed” description of the initial data. Examples include data
compression methods (e.g., wavelet transforms and principal components analysis),
attribute subset selection (e.g., removing irrelevant attributes), and attribute
construction (e.g., where a small set of more useful attributes is derived from the
initial set).
In numerosity reduction, the data are replaced by an alternative, smaller representation
using parametric models (such as regression or log-linear models) or nonparametric
models (such as histograms, clusters, sampling, or data aggregation).
Data transformation − In data transformation, data are transformed or consolidated
into forms appropriate for mining, for example by performing summary or aggregation
operations. Data transformation includes −
Smoothing − It works to remove noise from the data. Such techniques include
binning, regression, and clustering.
Aggregation − Summary or aggregation operations are applied to the data. For
instance, daily sales data can be aggregated to compute monthly and annual total
amounts. This procedure is typically used in constructing a data cube for the analysis
of the data at multiple granularities.
DATA CLEANING
Data cleaning refers to cleaning the data by filling in missing values, smoothing
noisy data, identifying and removing outliers, and removing inconsistencies in the data.
Sometimes the level of detail in the data differs from what is required; for example,
the analysis may need age ranges such as 20-30, 30-40, and 40-50, while the imported
data contain birth dates. The data can be cleaned by converting them into the
appropriate form.
Types of data cleaning
There are various types of data cleaning which are as follows −
Missing Values − Missing values are filled with appropriate values. The following
approaches can be used to fill in the values:
The tuple is ignored when it contains several attributes with missing values.
The missing value is filled in manually.
A global constant is used to fill in the missing value.
The attribute mean is used to fill in the missing value.
The most probable value is used to fill in the missing value.
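A minimal pandas sketch of the fill-in approaches listed above, on an invented DataFrame; the "most probable value" strategy is only noted in a comment because it requires a predictive model.

import pandas as pd

df = pd.DataFrame({"age": [25, None, 40, 35, None],
                   "income": [30000, 45000, None, 52000, 61000]})  # illustrative data

df_dropped = df.dropna()                          # ignore tuples with missing values
df_constant = df.fillna(-1)                       # fill with a global constant
df_mean = df.fillna(df.mean(numeric_only=True))   # fill with the attribute mean
# The "most probable value" would typically come from a model (e.g., regression
# or k-nearest neighbours); that step is omitted in this sketch.
print(df_mean)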
Noisy data − Noise is a random error or variance in a measured variable. The following
smoothing methods are used to handle noise −
Binning − Binning methods smooth a sorted data value by consulting its
“neighborhood,” that is, the values around it. The sorted values are distributed into a
number of buckets, or bins. Because binning methods consult the neighborhood of
values, they perform local smoothing; a small sketch follows.
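Below is a small NumPy sketch of equal-frequency binning with smoothing by bin means and by bin boundaries; the nine sorted price values and the bin depth of 3 are invented for illustration.

import numpy as np

prices = np.sort(np.array([4, 8, 15, 21, 21, 24, 25, 28, 34]))   # sorted, illustrative values
bins = prices.reshape(3, 3)                                       # equal-frequency bins of depth 3

smoothed_by_means = np.repeat(bins.mean(axis=1), 3)               # each value -> its bin mean
smoothed_by_boundaries = np.where(                                # each value -> nearest bin boundary
    np.abs(bins - bins[:, [0]]) <= np.abs(bins - bins[:, [-1]]),
    bins[:, [0]], bins[:, [-1]]).ravel()

print(smoothed_by_means)
print(smoothed_by_boundaries)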
Regression − Data can be smoothed by fitting the data to a function, such as with
regression. Linear regression involves finding the “best” line to fit two attributes
(or variables) so that one attribute can be used to predict the other. Multiple linear
regression is an extension of linear regression in which more than two attributes are
involved and the data are fit to a multidimensional surface.
Clustering − Clustering helps in identifying outliers. Similar values are
organized into clusters, and values that fall outside the clusters are treated as
outliers.
Combined computer and human inspection − Outliers can also be identified
through a combination of computer and human inspection. A detected pattern may be
genuinely informative or merely garbage; patterns with a high “surprise” value can be
output to a list for a human to review.
Inconsistent data − Inconsistencies can arise in recorded transactions,
during data entry, or from integrating data from multiple databases.
Some redundancies can be detected by correlation analysis. Careful and proper
integration of the data from various sources can reduce and avoid redundancy.
DATA INTEGRATION
Data integration is the process of combining data from several disparate sources. While
performing data integration, one must deal with data redundancy, inconsistency,
duplication, etc. In data mining, data integration is a data preprocessing technique that
merges data from numerous heterogeneous data sources into a coherent data store,
providing a consolidated, unified view of the information.
It combines data from various sources into a coherent data store, as in data
warehousing. These sources can include multiple databases, data cubes, flat files,
etc. There are several issues to consider during data integration.
Schema integration and object matching can be complex. For example, how can the
analyst be sure that emp_id in one database and emp_no in another refer to the same
entity? Such issues can be addressed using metadata.
Redundancy is another issue. An attribute such as annual revenue, for instance, can
be redundant if it can be derived from another attribute or set of attributes.
Inconsistencies in attribute or dimension naming can also cause redundancies in the
resulting data set.
Some redundancies can be detected by correlation analysis. Given two attributes,
such analysis can measure how strongly one attribute implies the other, based on the
available data. For numerical attributes, we can evaluate the correlation between two
attributes, A and B, by computing the correlation coefficient (also known as Pearson's
product-moment coefficient, named after its inventor, Karl Pearson). This is
r_{A,B} = \frac{\sum_{i=1}^{N}(a_i - \bar{A})(b_i - \bar{B})}{N\sigma_A\sigma_B} = \frac{\sum_{i=1}^{N} a_i b_i - N\bar{A}\bar{B}}{N\sigma_A\sigma_B}
where N is the number of tuples, a_i and b_i are the respective values of A and B in tuple
i, \bar{A} and \bar{B} are the respective mean values of A and B, \sigma_A and \sigma_B are the respective
standard deviations of A and B, and \sum a_i b_i is the sum of the AB cross-products (that is,
for each tuple, the value for A is multiplied by the value for B in that tuple).
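The sketch below evaluates this correlation coefficient directly from the formula on two invented attribute vectors and cross-checks it against NumPy's built-in Pearson correlation.

import numpy as np

a = np.array([2.0, 4.0, 6.0, 8.0, 10.0])      # attribute A, illustrative values
b = np.array([1.5, 3.9, 6.1, 7.8, 10.2])      # attribute B, illustrative values

n = len(a)
# r_{A,B} = (sum(a_i * b_i) - N * mean(A) * mean(B)) / (N * sigma_A * sigma_B)
r_ab = ((a * b).sum() - n * a.mean() * b.mean()) / (n * a.std() * b.std())

# Should agree with NumPy's built-in Pearson correlation coefficient
print(r_ab, np.corrcoef(a, b)[0, 1])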
Correlation does not imply causality. That is, if A and B are correlated, this does not
necessarily imply that A causes B or that B causes A. For example, in analyzing a
demographic database, we may find that the number of hospitals and the number of
car thefts in a region are correlated. This does not mean that one causes the
other; both are typically linked to a third attribute, such as population.
A third important issue in data integration is the detection and resolution of data value
conflicts. For example, for the same real-world entity, attribute values from multiple
sources can differ. This can be because of differences in representation, scaling, or
encoding.
DATA REDUCTION:
Data mining is applied to selected data in large databases. When data analysis
and mining are performed on a huge amount of data, processing can take a very long
time, making it impractical or infeasible. To reduce the processing time for data
analysis, data reduction techniques are used to obtain a reduced representation
of the dataset that is much smaller in volume while maintaining the integrity of the
original data. By reducing the data, the efficiency of the data mining process is
improved while producing essentially the same analytical results.
Data reduction aims to represent the data more compactly. When the data size is smaller,
it is simpler to apply sophisticated and computationally expensive algorithms. The
reduction of the data may be in terms of the number of rows (records) or the
number of columns (dimensions).
There are various strategies for data reduction which are as follows −
Data cube aggregation − In this method, aggregation operations are applied to the
data in the construction of a data cube. Suppose, for example, the data consist of
AllElectronics sales per quarter for the years 2002 to 2004, while the analysis is
interested in the annual sales (total per year) rather than the total per quarter. The data
can be aggregated so that the resulting data summarize the total sales per year instead
of per quarter. The resulting data set is smaller in volume, without loss of the data
essential for the analysis task.
Attribute subset selection − In this method, irrelevant, weakly relevant, or
redundant attributes or dimensions are detected and removed. Data sets for
analysis can include hundreds of attributes, some of which may be irrelevant to the
mining task or redundant. For instance, if the task is to classify customers as to
whether or not they are likely to purchase a popular new CD at AllElectronics when
notified of a sale, attributes such as the customer's telephone number are likely to be
irrelevant, unlike attributes such as age or music_taste.
Dimensionality reduction − Encoding mechanisms are used to reduce the data set
size. In dimensionality reduction, data encoding or transformations are applied to
obtain a reduced or “compressed” representation of the original data. If the original
data can be reconstructed from the compressed data without any loss of information,
the data reduction is called lossless.
Numerosity reduction − The data are replaced or estimated by alternative, smaller
data representations such as parametric models (which need to store only the
model parameters rather than the actual data) or nonparametric methods such as
clustering, sampling, and the use of histograms.
Discretization and concept hierarchy generation − In this method, raw data
values for attributes are replaced by ranges or higher conceptual levels. Data
discretization is a form of numerosity reduction that is very useful for the
automatic generation of concept hierarchies. Discretization and concept hierarchy
generation are powerful tools for data mining in that they enable the mining of data at
multiple levels of abstraction.
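As a rough illustration of two of these strategies, the sketch below applies principal components analysis for dimensionality reduction and simple random sampling for numerosity reduction; the random 1000 x 10 matrix, the 95% variance threshold, and the sample size of 100 are arbitrary assumptions for the example.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))               # illustrative data: 1000 records, 10 attributes

# Dimensionality reduction: keep enough principal components to explain ~95% of the variance
X_reduced = PCA(n_components=0.95).fit_transform(X)

# Numerosity reduction by simple random sampling of rows
sample = X[rng.choice(len(X), size=100, replace=False)]

print(X.shape, X_reduced.shape, sample.shape)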
DATA TRANSFORMATION AND DISCRETIZATION
Data transformation in data mining involves converting data, for example combining
unstructured data with structured data, so that it can be analyzed later. It is also
important when data is migrated to a new cloud data warehouse.
When the data is homogeneous and well-structured, it is easier to analyze and look for
patterns.
For example, a company has acquired another firm and now has to consolidate all the
business data. The smaller company may be using a different database than the parent
firm. Also, the data in these databases may have different IDs, keys, and values. All this
needs to be reformatted so that the records are consistent and can be evaluated together.
This is why data transformation methods are applied; they are described below:
Data Smoothing
This method is used for removing noise from a dataset. Noise refers to
distorted and meaningless data within a dataset.
Smoothing uses algorithms to highlight the important features in the data.
After removing noise, the process can detect small changes in the data and reveal
particular patterns.
Any data modification or trend can be identified by this method.
Data Aggregation
Aggregation is the process of collecting data from a variety of sources and storing it in
a single format. Here, data is collected, stored, analyzed, and presented in a report or
summary format.
It helps in gathering more information about a particular data cluster. The method
helps in collecting vast amounts of data.
This is a crucial step as accuracy and quantity of data is important for proper analysis.
Companies collect data about their website visitors. This gives them an idea about
customer demographics and behavior metrics. This aggregated data assists them in
designing personalized messages, offers, and discounts.
Discretization
This is a process of converting continuous data into a set of data intervals. Continuous
attribute values are substituted by small interval labels. This makes the data easier to
study and analyze.
If a data mining task handles a continuous attribute, replacing its continuous values
with discrete interval labels improves the efficiency of the task.
This method is also called a data reduction mechanism as it transforms a large dataset
into a set of categorical data.
Discretization can be done by binning, histogram analysis, and correlation analysis.
Discretization can also use decision-tree-based algorithms to produce short, compact, and
accurate results when using discrete values.
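A small pandas sketch of discretization by binning follows; the age values, the bin edges, and the interval labels are invented for the example.

import pandas as pd

ages = pd.Series([15, 22, 27, 34, 41, 48, 55, 63, 70])   # illustrative continuous attribute

# Equal-width binning into labelled intervals (a simple form of discretization)
age_bins = pd.cut(ages, bins=[0, 20, 40, 60, 100],
                  labels=["teen", "young", "middle-aged", "senior"])

# Equal-frequency (quantile) binning is an alternative
age_quartiles = pd.qcut(ages, q=4)

print(age_bins.value_counts())
print(age_quartiles.value_counts())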
Generalization
In this process, low-level data attributes are transformed into high-level data attributes
using concept hierarchies. This conversion from a lower level to a higher conceptual
level is useful to get a clearer picture of the data.
For example, age data may appear as numeric values such as 20 or 30 in a dataset; these
can be transformed to a higher conceptual level as categorical values such as (young, old).
Data generalization can be divided into two approaches:
the data cube (OLAP) approach and the attribute-oriented induction (AOI) approach.
Attribute construction
In the attribute construction method, new attributes are created from an existing set of
attributes.
For example, in a dataset of employee information, the attributes can be employee
name, employee ID, and address.
These attributes can be used to construct another dataset that contains information
about the employees who have joined in the year 2019 only.
This method of construction makes mining more efficient and helps in creating new
datasets quickly.
Normalization
Normalization is one of the crucial data transformation techniques in data mining,
usually carried out as part of data pre-processing.
Here, the data is transformed so that it falls within a given range. When attributes are
on different ranges or scales, data modeling and mining can be difficult.
Normalization helps in applying data mining algorithms and extracting data faster.
The popular normalization methods are:
Min-max normalization
In this technique of data normalization, a linear transformation is performed on the
original data. The minimum and maximum values of the attribute are obtained, and each
value is replaced according to the following formula:
v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A
where A is the attribute, min_A and max_A are the minimum and maximum values of A,
v is the old value of an entry, v' is its new (normalized) value, and new\_min_A and
new\_max_A are the minimum and maximum of the new range (i.e., the boundary values
of the required range).
Example: Suppose the income range $10,000 to $95,000 is normalized to [0.0, 1.0].
By min-max normalization, a value of $64,300 for income is transformed to
\frac{64300 - 10000}{95000 - 10000}(1.0 - 0.0) + 0.0 = 0.6388
Z-score normalization
In this technique, the values of an attribute A are normalized based on the mean \bar{A}
and standard deviation \sigma_A of A:
v' = \frac{v - \bar{A}}{\sigma_A}
Decimal scaling
It normalizes by moving the decimal point of the values of the data. Each value v_i is
normalized to
v_i' = \frac{v_i}{10^j}
where j is the smallest integer such that max(|v_i'|) < 1.
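The sketch below applies all three normalization methods to a small invented income vector; it reproduces the min-max example above (64,300 maps to about 0.6388) and uses the population standard deviation for the z-score, both of which are assumptions of this illustration.

import numpy as np

income = np.array([10_000, 25_000, 64_300, 80_000, 95_000], dtype=float)  # illustrative values

# Min-max normalization to the new range [0.0, 1.0]
new_min, new_max = 0.0, 1.0
minmax = (income - income.min()) / (income.max() - income.min()) * (new_max - new_min) + new_min

# Z-score normalization: subtract the mean, divide by the standard deviation
zscore = (income - income.mean()) / income.std()

# Decimal scaling: divide by 10^j, with j the smallest integer so that max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(income).max() + 1)))
decimal_scaled = income / 10 ** j

print(minmax)          # 64,300 maps to about 0.6388, matching the worked example above
print(zscore, decimal_scaled)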
DATA VISUALIZATION
Visualization is the use of computer graphics to create visual images which aid in the
understanding of complex, often massive representations of data.
Categorization of visualization methods:
a) Pixel-oriented visualization techniques
b) Geometric projection visualization techniques
c) Icon-based visualization techniques
d) Hierarchical visualization techniques
e) Visualizing complex data and relations
a) Pixel-oriented visualization techniques
For a data set of m dimensions, create m windows on the screen, one for each
dimension
The m dimension values of a record are mapped to m pixels at the corresponding
positions in the windows
The colors of the pixels reflect the corresponding values
To save space and show the connections among multiple dimensions, space filling
is often done in a circle segment
b) Geometric projection visualization techniques
Visualization of geometric transformations and projections of the data. Methods include:
Direct visualization
Scatterplot and scatterplot matrices
Landscapes
Projection pursuit technique: Help users find meaningful projections of
multidimensional data
Prosection views
Hyperslice
Parallel coordinates
c) Icon-based visualization techniques
Visualization of the data values as features of icons. Typical visualization methods:
Chernoff Faces
Stick Figures
General techniques:
Shape coding: Use shape to represent certain information encoding
Color icons: Use color icons to encode more information
Tile bars: Use small icons to represent the relevant feature vectors in document
retrieval
d) Hierarchical visualization techniques
Visualization of the data using a hierarchical partitioning into subspaces. Methods include:
Dimensional Stacking
Worlds-within-Worlds
Tree-Map
Cone Trees
InfoCube
e) Visualizing complex data and relations
Visualizing non-numerical data: text and social networks
Tag cloud: visualizing user-generated tags
The importance of a tag is represented by its font size/color.
Besides text data, there are also methods to visualize relationships, such as
visualizing social networks.
DATA SIMILARITY AND DISSIMILARITY MEASURES
Distance or similarity measures are essential to solve many pattern recognition
problems such as classification and clustering. Various distance/similarity
measures are available in the literature to compare two data distributions. As the
names suggest, a similarity measure indicates how close two distributions (or data
objects) are. For multivariate data, more complex summary methods are developed
to answer this question.
Similarity Measure
Numerical measure of how alike two data objects are.
Often falls between 0 (no similarity) and 1 (complete similarity).
Dissimilarity Measure
Numerical measure of how different two data objects are.
Ranges from 0 (objects are alike) to ∞ (objects are completely different).
Proximity refers to either a similarity or a dissimilarity.
Similarity/Dissimilarity for Simple Attributes
Here, p and q are the attribute values for two data objects.
Common Properties of Dissimilarity Measures
Distance, such as the Euclidean distance, is a dissimilarity measure and has some
well known properties:
1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 if and only if p = q,
2. d(p, q) = d(q,p) for all p and q,
3. d(p, r) ≤ d(p, q) + d(q, r) for all p, q, and r, where d(p, q) is the distance
(dissimilarity) between points (data objects), p and q.
A distance that satisfies these properties is called a metric. The following is a list of
several common distance measures used to compare multivariate data. We will assume
that the attributes are all continuous.
a) Euclidean Distance
Assume that we have measurements x_{ik}, i = 1, ..., N, on variables k = 1, ..., p (also
called attributes).
The Euclidean distance between the ith and jth objects is
d_E(i, j) = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2}
for every pair (i, j) of observations. The weighted Euclidean distance is
d_{WE}(i, j) = \sqrt{\sum_{k=1}^{p} w_k (x_{ik} - x_{jk})^2}
If the scales of the attributes differ substantially, standardization is necessary.
b) Minkowski Distance
The Minkowski distance is a generalization of the Euclidean distance.
With the measurements x_{ik}, i = 1, ..., N, k = 1, ..., p, the Minkowski distance is
d_M(i, j) = \left( \sum_{k=1}^{p} |x_{ik} - x_{jk}|^{\lambda} \right)^{1/\lambda}
where λ ≥ 1. It is also called the L_λ metric.
λ = 1 : L_1 metric, Manhattan or city-block distance.
λ = 2 : L_2 metric, Euclidean distance.
λ → ∞ : L_∞ metric, supremum distance.
Note that λ and p are two different parameters: λ is the order of the metric, while p is
the dimension (number of attributes) of the data matrix, which remains finite.
c) Mahalanobis Distance
Let X be an N × p matrix. Then the ith row of X is
x_i = (x_{i1}, x_{i2}, ..., x_{ip})
The Mahalanobis distance between objects i and j is
d_{MH}(i, j) = \sqrt{(x_i - x_j)^T \Sigma^{-1} (x_i - x_j)}
where Σ is the p × p sample covariance matrix.
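A hedged SciPy/NumPy sketch of these distance measures is given below; the two three-dimensional points and the four-observation sample used to estimate the covariance matrix are invented for illustration.

import numpy as np
from scipy.spatial import distance

x = np.array([2.0, 3.0, 5.0])
y = np.array([4.0, 1.0, 8.0])

euclidean = distance.euclidean(x, y)            # L2 metric
manhattan = distance.cityblock(x, y)            # L1 metric
supremum = distance.chebyshev(x, y)             # L-infinity metric
minkowski3 = distance.minkowski(x, y, p=3)      # general L-lambda metric with lambda = 3

# Mahalanobis distance needs the inverse covariance estimated from a sample of observations
sample = np.array([[2.0, 3.0, 5.0], [4.0, 1.0, 8.0], [3.0, 2.0, 6.0], [5.0, 4.0, 7.0]])
inv_cov = np.linalg.inv(np.cov(sample, rowvar=False))
mahalanobis = distance.mahalanobis(x, y, inv_cov)

print(euclidean, manhattan, supremum, minkowski3, mahalanobis)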
Common Properties of Similarity Measures
Similarities have some well known properties:
1. s(p, q) = 1 (or maximum similarity) only if p = q,
2. s(p, q) = s(q, p) for all p and q, where s(p, q) is the similarity between data objects, p
and q.
Similarity Between Two Binary Variables
The above similarity or distance measures are appropriate for continuous variables.
However, for binary variables a different approach is necessary.
Simple Matching and Jaccard Coefficients
Simple matching coefficient (SMC) = (n_{1,1} + n_{0,0}) / (n_{1,1} + n_{1,0} + n_{0,1} + n_{0,0})
Jaccard coefficient = n_{1,1} / (n_{1,1} + n_{1,0} + n_{0,1})
where n_{x,y} is the number of attributes for which object p has value x and object q has value y.
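A minimal sketch computing both coefficients for two invented binary vectors:

# Sketch: simple matching and Jaccard coefficients for two binary vectors
p = [1, 0, 1, 1, 0, 0, 1]
q = [1, 1, 1, 0, 0, 0, 0]

n11 = sum(a == 1 and b == 1 for a, b in zip(p, q))
n10 = sum(a == 1 and b == 0 for a, b in zip(p, q))
n01 = sum(a == 0 and b == 1 for a, b in zip(p, q))
n00 = sum(a == 0 and b == 0 for a, b in zip(p, q))

smc = (n11 + n00) / (n11 + n10 + n01 + n00)   # counts both 1-1 and 0-0 matches
jaccard = n11 / (n11 + n10 + n01)             # ignores 0-0 matches (asymmetric binary attributes)

print(smc, jaccard)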