Data Mining
Data mining is defined as the process of extracting useful information from very large sets
of data. In other words, data mining is mining knowledge from data.
Data mining, also known as Knowledge Discovery in Data (KDD), is the process
of uncovering patterns and other valuable information from large data sets. Over
the last few decades, the development of data warehousing technology and the
growth of big data have rapidly accelerated the adoption of data mining techniques,
helping companies transform their raw data into useful information. However, even
as the technology continuously evolves to handle data at a large scale,
leaders still face challenges with scalability and automation.
Data mining enables organizations to make better decisions through intelligent
data analyses. The data mining techniques that underlie these analyses serve two
main purposes: they can describe the target dataset, or they can predict outcomes
using machine learning algorithms. These methods are used to organize and filter
data, surfacing the most interesting information, such as fraud indicators, user
behavior, bottlenecks, or even security failures.
Data Mining Process
The data mining process consists of several phases that are executed step by step.
Understand Business
Identify the Company's and Project's Objectives first
Problems that need to be addressed
Project constraints or limitations
The business impact of potential solutions
Understand the Data
Identify what type of data is needed to solve the issue, i.e., begin a preliminary
analysis of the data
Collect it from authentic sources; obtain access rights, and prepare a data
description report
Prepare the Data
Clean the data: handle missing data, data errors, default values, and data
corrections.
Integrate the data: combine two disparate data sets to get the final target data
set.
Format the data: convert data types or configure the data into the format required
by the specific mining technology being used.
Model the Data
Employ algorithms to ascertain data patterns
Create the model, test it, and validate it
Evaluation
Validate the models against the business goals
Change the model, adjust the business goals, or revisit the data if needed
Deployment
Generate business intelligence
Continually monitor and maintain the data mining application. A minimal code
sketch of these phases appears below.
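To make these phases concrete, the following is a minimal, hypothetical sketch in Python using pandas and scikit-learn. The file name customers.csv, the churn target column, and the choice of model are illustrative assumptions, not part of any specific project.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Understand and prepare the data: load it, drop rows missing the target,
# and fill remaining numeric gaps with column medians (hypothetical columns).
df = pd.read_csv("customers.csv")
df = df.dropna(subset=["churn"])
df = df.fillna(df.median(numeric_only=True))

X = df.drop(columns=["churn"]).select_dtypes("number")
y = df["churn"]

# Model the data: create the model, then test and validate it on held-out data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Evaluation: check the model against the business goal (here, prediction accuracy).
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))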
Why Data Mining?
Data mining is important to learn for several reasons:
Extracting Insights: Data mining techniques allow users to extract useful
information and patterns from vast amounts of data. Businesses can make
sound decisions, identify trends, and compete with their peers through
analysis of these patterns.
Decision Making: Data mining contributes to the decision-making process.
Businesses can predict future trends and outcomes with a high degree of
confidence through the analysis of historical data.
Customer Understanding: By analyzing the behavior, preferences, and
purchasing patterns of customers, data mining enables enterprises to gain a
more accurate understanding of their clients. This information can be used
for personalized marketing strategies, improving customer satisfaction, and
enhancing their loyalty.
Risk Management: Using data mining techniques to analyze patterns and
anomalies in the data, businesses can identify possible risks or frauds. In
sectors such as finance, insurance, and healthcare where risk management is
of paramount importance, this capability is particularly valuable.
Improved Efficiency: Data mining, which can greatly enhance the
efficiency of operations, aids in automatically discovering patterns and
insights from data. By automating repetitive analysis tasks, businesses can free
up time and resources to focus on more strategic initiatives.
Innovation: Analyzing data can uncover hidden patterns and relationships that
lead to new product ideas, innovations, or business opportunities. Businesses
can stay ahead of the competition and drive innovation through creative data
exploration and analysis.
Personal Development: Learning data mining strengthens analytical and
problem-solving skills. It provides you with valuable
tools and techniques for handling and analyzing large datasets, which are
essential skills in today's data-driven world.
In general, data mining is worth learning because it enables businesses to extract
useful information from their data so that they can make informed decisions,
mitigate risks, increase efficiency, understand customers more effectively,
innovate, and grow.
Data Mining Applications
Data mining applications are vast and varied, with applications across industries
and disciplines. Here are some common areas where data mining techniques are
applied:
Business and Marketing: Data mining in business and marketing is used
for market basket (shopping cart) analysis to understand customer purchasing
behavior, customer segmentation for targeted marketing campaigns, and
predictive modeling for sales forecasting and customer churn prediction.
Sentiment analysis of social media data helps businesses understand customer
opinions and feedback, and recommendation systems suggest personalized
products.
Finance: Data mining techniques are most commonly used for detecting
fraud in banking transactions, risk assessment and credit scoring for loan
approval, stock market analysis and forecasting, and predicting customer
lifetime value for marketing strategies.
Healthcare: Healthcare data mining is the discovery of patterns,
correlations, and insights from large data sets generated in the healthcare
industry. The most common tasks of data mining in healthcare include
disease prediction and diagnosis, drug discovery and development, patient
monitoring and personalized treatment recommendations, and health
outcome prediction for patient care management.
Telecommunications: Data mining techniques are commonly used for
customer churn prediction, fraud detection in call records, network fault
detection and performance optimization, and customer segmentation for
targeted service plans.
Manufacturing and Supply Chain: Predictive maintenance of machinery
and systems, supply chain optimization, demand forecasting, quality control,
and error detection in manufacturing processes.
Education: Student performance prediction and early intervention, dropout
prediction and prevention strategies, and adaptive learning systems for
personalized education.
Government and Public Sector: To extract useful information and patterns
from large amounts of data collected by government agencies and
organizations, data mining uses advanced analytical techniques. Common
tasks include fraud detection in public welfare programs, crime pattern
analysis for law enforcement, and traffic flow prediction and optimization.
E-commerce and Retail: Data mining plays a crucial role in the E-
commerce and retail industries, offering insights into customer behavior,
market trends, product performance, and more. Common tasks include
product recommendation systems, price optimization and dynamic pricing,
and inventory management and demand forecasting.
Energy and Utilities: Data mining within the energy and utilities sector
involves extracting important insights and patterns from large datasets
produced by different operations within these businesses. Common tasks
include energy consumption prediction and optimization, equipment failure
prediction for maintenance planning, and renewable energy forecasting.
Media and Entertainment: Data mining is the process of collecting
valuable information and patterns from a large amount of data on various
aspects of media consumption, audience behavior, content preferences, or
anything else that might be relevant to this industry. Common tasks include
content recommendation systems, audience segmentation for targeted
advertising, and box office revenue estimation.
The above-mentioned are some of the most common applications; as new data
sources and technologies become available, the use of data mining is growing.
What are the functionalities of data mining?
Data mining functionalities are used to represent the type of patterns that have to
be discovered in data mining tasks. In general, data mining tasks can be classified
into two types: descriptive and predictive. Descriptive mining tasks
characterize the general features of the data in the database, while predictive
mining tasks perform inference on the current data in order to make predictions.
There are various data mining functionalities which are as follows −
Data characterization − It is a summarization of the general characteristics
of an object class of data. The data corresponding to the user-specified class
is generally collected by a database query. The output of data
characterization can be presented in multiple forms.
Data discrimination − It is a comparison of the general characteristics of
target class data objects with the general characteristics of objects from one
or a set of contrasting classes. The target and contrasting classes can be
specified by the user, and the corresponding data objects are fetched through
database queries.
Association Analysis − It analyses the set of items that generally occur
together in a transactional dataset. There are two parameters that are used for
determining the association rules −
o Support, which identifies how frequently the item set appears in the database.
o Confidence, which is the conditional probability that an item occurs in a
transaction given that another item occurs.
(A small computation sketch of these two measures appears after this list.)
Classification − Classification is the procedure of discovering a model that
represents and distinguishes data classes or concepts, with the objective of
being able to use the model to predict the class of objects whose class label
is unknown. The derived model is based on the analysis of a set of
training data (i.e., data objects whose class label is known). A minimal
classification sketch also appears after this list.
Prediction − It refers to predicting unavailable data values or future
trends. A value can be predicted based on the attribute values of the
object and the attribute values of similar objects or classes. Prediction may
concern missing numerical values or increase/decrease trends in time-related
information.
Clustering − It is similar to classification, but the classes are not predefined;
they are derived from the data attributes. It is unsupervised learning.
The objects are clustered or grouped based on the principle of
maximizing the intraclass similarity and minimizing the interclass similarity.
A small clustering sketch appears after this list as well.
Outlier analysis − Outliers are data elements that cannot be grouped into a
given class or cluster. These are data objects whose behaviour deviates
from the general behaviour of the other data objects. The analysis of
this type of data can be essential for mining knowledge.
Evolution analysis − It describes trends for objects whose behaviour
changes over time.
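As a small illustration of association analysis, the sketch below computes support and confidence for the rule {bread} -> {milk} over a handful of made-up transactions; the items and numbers are purely illustrative.

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
    {"bread", "milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Conditional probability of the consequent given the antecedent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print("support({bread, milk}) =", support({"bread", "milk"}, transactions))          # 0.6
print("confidence(bread -> milk) =", confidence({"bread"}, {"milk"}, transactions))  # 0.75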
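For classification, a minimal sketch using scikit-learn's bundled iris data set could look like the following: a model is learned from labeled training objects and then used to predict the class of objects whose label is withheld.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Derive the model from training data whose class labels are known.
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Predict the class of objects whose class label is unknown to the model.
print("Predicted classes for unseen objects:", clf.predict(X_test[:5]))
print("Test accuracy:", clf.score(X_test, y_test))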
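And for clustering, a short sketch with k-means groups unlabeled points so that intraclass similarity is high and interclass similarity is low; the two-dimensional points below are invented for illustration.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D data with two obvious groups.
points = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
                   [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("Cluster labels:", kmeans.labels_)        # e.g. [0 0 0 1 1 1]
print("Cluster centers:\n", kmeans.cluster_centers_)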
Data mining is the process of finding useful new correlations, patterns, and
trends by sifting through large amounts of data stored in repositories,
using pattern recognition technologies, including statistical and mathematical
techniques. It is the analysis of observational datasets to discover unsuspected
relationships and to summarize the records in novel ways that are both
understandable and useful to the data owner.
Data mining techniques can be used to make three kinds of models for three
kinds of tasks such as descriptive profiling, directed profiling, and
prediction.
Descriptive Profiling − Descriptive profiling in data mining focuses on
understanding the characteristics of data through summarizing, visualizing,
and identifying patterns within it. It helps answer questions like "what
happened?" and "what is happening?" by revealing the core nature of the
data, rather than predicting future outcomes. Common techniques include
summarization, clustering, association rule generation, and sequence
discovery.
Key Aspects of Descriptive Profiling:
Summarization:
This involves presenting general characteristics of the data, often using
statistical measures like mean, median, mode, and standard deviation, or
techniques like OLAP (Online Analytical Processing) and attribute-oriented
induction.
Clustering:
Grouping similar data points together based on certain criteria, such as
customer buying habits or product affinities.
Association Rule Mining:
Discovering relationships between items in a dataset, like identifying
products frequently purchased together (e.g., "if a customer buys bread, they
are also likely to buy milk").
Sequence Discovery:
Identifying patterns in sequential data, like the order in which users browse a
website or the steps in a manufacturing process.
Examples:
Market Basket Analysis:
Analyzing which products are frequently purchased together to inform
product placement and promotions.
Customer Segmentation:
Grouping customers based on their demographics, purchasing history, or
online behavior to tailor marketing campaigns.
Anomaly Detection:
Identifying unusual patterns or outliers in data, such as fraudulent
transactions or system errors.
Recency, Frequency, Monetary (RFM) Analysis:
Categorizing customers based on how recently they made a purchase, how
often they buy, and how much they spend.
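As a rough illustration of RFM analysis, the sketch below derives recency, frequency, and monetary values per customer with pandas; the order data, reference date, and column names are hypothetical.

import pandas as pd

orders = pd.DataFrame({
    "customer": ["A", "A", "B", "C", "C", "C"],
    "date": pd.to_datetime(["2024-01-05", "2024-03-01", "2023-11-20",
                            "2024-02-10", "2024-02-25", "2024-03-02"]),
    "amount": [120, 80, 40, 200, 150, 90],
})
today = pd.Timestamp("2024-03-05")

# Recency: days since last purchase; frequency: number of orders; monetary: total spend.
rfm = orders.groupby("customer").agg(
    recency=("date", lambda d: (today - d.max()).days),
    frequency=("date", "count"),
    monetary=("amount", "sum"),
)
print(rfm)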
Benefits:
Provides a clear understanding of the data's structure and characteristics.
Reveals hidden patterns and relationships that can be used for decision-
making.
Helps in data exploration and preparation for other data mining tasks,
like predictive modeling.
Enables better reporting and monitoring of business performance.
In essence, descriptive profiling is a fundamental step in data mining that
provides valuable insights into the data's current state, paving the way for
more advanced analyses and applications.
Directed Profiling − Profiling is a familiar approach to many problems. It
need not involve any sophisticated data analysis. Surveys, for instance, are
one common method of building customer profiles. Surveys reveal what
customers and prospects look like, or at least the way survey responders
answer questions.
Profiles are often based on demographic variables, such as geographic
location, gender, and age. Since advertising is sold according to these same
variables, demographic profiles can turn directly into media strategies.
Prediction − Profiling uses data from the past to describe what happened in
the past. Prediction goes one step further. The prediction uses data from the
past to predict what is likely to happen in the future. This is a dynamic use
of information.
While the correlation between low savings balances and CD ownership
may not be useful in a profile of CD holders, a high savings
balance is likely (in combination with other indicators) a predictor of future
CD purchases.
Building a predictive model requires a separation in time between the
model inputs, or predictors, and the model output, the thing to be predicted. If
this separation is not maintained, the model will not work.
Data Mining Task Primitives
A data mining task can be defined by a data mining query, which is entered into
the data mining system. A data mining query is made up of task primitives, which
are the basic components that help users communicate with the system during the
data mining process. These primitives allow users to guide the system in
discovering patterns or to examine the findings from different perspectives. The
data mining primitives define the following:
1. Set of data that is relevant to the task.
2. Type of knowledge the user wants to extract.
3. Background knowledge that should be considered during the process.
4. Evaluation measures for judging the usefulness of the discovered patterns.
5. Visual representation for displaying the patterns found.
A data mining query language can be developed to include these primitives,
giving users the ability to interact with the data mining system more easily. Such a
language provides the foundation for building user-friendly graphical interfaces.
Creating a complete data mining language is challenging because data mining
includes many different tasks, such as describing data or analyzing changes over
time. Each task requires different approaches. Designing a good data mining query
language involves understanding the capabilities and limitations of the different
data mining tasks, which helps the system communicate with other information
systems and fit into the broader information processing environment.
List of Data mining Task Primitives
1. The Set of Task-Relevant Data to be Mined
This refers to the specific parts of the database or data that the user wants to
analyze. It includes the relevant database attributes or dimensions (in data
warehouses) that are important for the task.
In a relational database, this data can be collected using a relational query, which
might involve operations like selecting, projecting, joining, or aggregating data.
The process of collecting the data creates a new data set, known as the initial data
relation. This data can be ordered or grouped based on the conditions specified in
the query. This step is part of the data mining process.
The initial data relation may not always match a physical table in the database. In
databases, virtual tables are called Views, so the set of relevant data for data
mining is referred to as a minable view.
2. The Kind of Knowledge to be Mined
This defines the data mining tasks to be carried out, such as characterization,
discrimination, association or correlation analysis, classification, prediction,
clustering, outlier detection, or trend analysis.
3. The Background Knowledge to be Used in the Discovery Process
Background knowledge about the domain being mined is helpful for guiding the
data discovery process and evaluating the patterns that are found. One common
form of background knowledge is concept hierarchies, which allow data to be
mined at different levels of abstraction.
A concept hierarchy is a mapping of low-level concepts to higher-level, more
general concepts. It helps to organize data in a way that can provide more
meaningful insights.
Rolling Up (Generalization): This process involves viewing data at higher,
more general levels of abstraction. It simplifies and compresses the data,
making it easier to understand and reducing the need for input/output
operations.
Drilling Down (Specialization): This is the opposite of rolling up, where
higher-level concepts are replaced with more detailed, lower-level ones.
Depending on the user’s perspective, there may be multiple concept
hierarchies for a given attribute or dimension.
For example, a concept hierarchy for the "age" attribute might go from a broad
category like "adult" to more specific ranges such as "18-25 years" or "26-35
years."
Another form of background knowledge involves user beliefs about relationships
in the data, which can further guide the mining process.
4. The Interestingness Measures and Thresholds for Pattern Evaluation
Different types of knowledge may require different measures to assess their
relevance or "interestingness." These measures help guide the mining process or,
after patterns are discovered, evaluate their significance. For example, in
association rule mining, common interestingness measures
include support and confidence.
Support refers to how frequently an item appears in the dataset, while confidence
measures the likelihood that a rule will hold true. If the support and confidence
values of a rule fall below user-defined thresholds, the rule is considered
uninteresting and may be ignored.
5. The Expected Representation for Visualizing the Discovered Patterns
This refers to how the discovered patterns will be displayed. The representation
can include various formats such as rules, tables, cross-tabulations, charts, graphs,
decision trees, cubes, or other visual formats.
Users should be able to specify which forms of representation to use when
displaying the patterns. Some forms of representation may be more effective than
others, depending on the type of knowledge being presented.
Integration of Data mining system with a Data warehouse
The data mining system is integrated with a database or data warehouse
system so that it can carry out its tasks effectively. A data mining system
operates in an environment where it needs to communicate with other data
systems, such as a database or data warehouse system.
There are different possible integration (coupling) schemes, as follows:
No Coupling
Loose Coupling
Semi-Tight Coupling
Tight Coupling
No Coupling
No coupling means that a Data Mining system will not utilize any function of a
Data Base or Data Warehouse system.
It may fetch data from a particular source (such as a file system), process data
using some data mining algorithms, and then store the mining results in
another file.
Drawbacks of No Coupling
First, without using a Database/Data Warehouse system, a Data Mining
system may spend a substantial amount of time finding, collecting,
cleaning, and transforming data.
Second, there are many tested, scalable algorithms and data structures
implemented in Database and Data Warehouse systems that a no-coupling
design cannot take advantage of.
Loose Coupling
In this Loose coupling, the data mining system uses some facilities / services of
a database or data warehouse system. The data is fetched from a data
repository managed by these (DB/DW) systems.
Data mining approaches are used to process the data and then the processed
data is saved either in a file or in a designated area in a database or data
warehouse.
Loose coupling is better than no coupling because it can fetch any portion of
data stored in Databases or Data Warehouses by using query processing,
indexing, and other system facilities.
Drawbacks of Loose Coupling
It is difficult for loose coupling to achieve high scalability and good
performance with large data sets.
Semi-Tight Coupling
Semi-tight coupling means that besides linking a Data Mining system to a
Database/Data Warehouse system, efficient implementations of a few essential data
mining primitives can be provided in the DB/DW system. These primitives can
include sorting, indexing, aggregation, histogram analysis, multiway join, and
precomputation of some essential statistical measures, such as sum, count,
max, min, and standard deviation.
Advantage of Semi-Tight Coupling
This coupling enhances the performance of Data Mining systems.
Tight Coupling
Tight coupling means that a Data Mining system is smoothly integrated into
the Database/Data Warehouse system. The data mining subsystem is treated
as one functional component of the information system. Data mining queries and
functions are optimized based on mining query analysis, data structures,
indexing schemes, and query processing methods of the DB or DW system.
Major issues in Data Mining
Data mining, the process of extracting knowledge from data, has become
increasingly important as the amount of data generated by individuals,
organizations, and machines has grown exponentially. Data mining is not an easy
task, as the algorithms used can be very complex, and the data is not always
available in one place; it often needs to be integrated from various heterogeneous
data sources.
The above factors may lead to some issues in data mining. These issues are mainly
divided into three categories, which are given below:
1. Mining Methodology and User Interaction
2. Performance Issues
3. Diverse Data Types Issues
Mining Methodology and User Interaction
It refers to the following kinds of issues
Mining different kinds of knowledge in databases − Different users may
be interested in different kinds of knowledge. Therefore, it is necessary for
data mining to cover a broad range of knowledge discovery tasks.
Interactive mining of knowledge at multiple levels of abstraction − The
data mining process needs to be interactive because it allows users to focus
the search for patterns, providing and refining data mining requests based on
the returned results.
Data mining query languages and ad hoc data mining − A data mining
query language that allows the user to describe ad hoc mining tasks should
be integrated with a data warehouse query language and optimized for
efficient and flexible data mining.
Presentation and visualization of data mining results − Once patterns
are discovered, they need to be expressed in high-level languages and visual
representations. These representations should be easily understandable.
Handling noisy or incomplete data − Data cleaning methods are
required to handle noise and incomplete objects while mining the data
regularities. Without such methods, the accuracy of the discovered patterns
will be poor.
Pattern evaluation − The patterns discovered should be interesting; patterns
that merely represent common knowledge or lack novelty need to be
identified and filtered out.
Performance Issues
There can be performance-related issues such as follows
Efficiency and scalability of data mining algorithms − In order to
effectively extract information from the huge amounts of data in databases,
data mining algorithms must be efficient and scalable.
Parallel, distributed, and incremental mining algorithms − The factors
such as huge size of databases, wide distribution of data, and complexity of
data mining methods motivate the development of parallel and distributed
data mining algorithms. These algorithms divide the data into partitions,
which are then processed in parallel. Incremental algorithms update the
mined patterns as the database changes, without mining the data again from
scratch.
Diverse Data Types Issues
Handling of relational and complex types of data − The database may
contain complex data objects, multimedia data objects, spatial data, temporal
data etc. It is not possible for one system to mine all these kinds of data.
Mining information from heterogeneous databases and global
information systems − The data is available at different data sources on
LAN or WAN. These data sources may be structured, semi-structured, or
unstructured. Therefore, mining knowledge from them adds challenges
to data mining.
Data Preprocessing
What is Data Preprocessing?
Data preprocessing is a crucial step in data mining. It involves transforming raw
data into a clean, structured, and suitable format for mining. Proper data
preprocessing helps improve the quality of the data, enhances the performance of
algorithms, and ensures more accurate and reliable results.
Why Preprocess the Data?
In the real world, many databases and data warehouses have noisy, missing,
and inconsistent data due to their huge size. Low-quality data leads to low-quality
mining results.
Noisy: Containing errors or outliers. E.g., Salary = “-10”
Noisy data may come from
Human or computer error at data entry.
Errors in data transmission.
Missing: lacking certain attribute values or containing only aggregate data. E.g.,
Occupation = “”
Missing (incomplete) data may come from
“Not applicable” data value when collected.
Human/hardware/software problems.
Inconsistent: Data inconsistency means that different versions of the same
data appear in different places. For example, the ZIP code may be stored in one
table in the format 1234-567, while in another table it may be represented
as 1234567.
Inconsistent data may come from
Errors in data entry.
Merging data from different sources with varying formats.
Differences in the data collection process.
Data preprocessing is used to improve the quality of the data and of the mining
results. The goal of data preprocessing is to enhance the accuracy, efficiency, and
reliability of data mining algorithms.
Major Tasks in Data Preprocessing
Data preprocessing is an essential step in the knowledge discovery process, because
quality decisions must be based on quality data. Data preprocessing
involves Data Cleaning, Data Integration, Data Reduction, and Data Transformation.
Steps in Data Preprocessing
1. Data Cleaning
Data cleaning is a process that "cleans" the data by filling in missing values,
smoothing noisy data, identifying and removing outliers, and resolving
inconsistencies in the data.
If users believe the data are dirty, they are unlikely to trust the results of any data
mining that has been applied.
Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or
data cleansing) routines attempt to fill in missing values, smooth out noise while
identifying outliers, and correct inconsistencies in the data.
Missing Values
Imagine that you need to analyze All Electronics sales and customer data. You note
that many tuples have no recorded value for several attributes, such as customer
income. How can you go about filling in the missing values for this attribute? There
are several methods to fill in the missing values.
Those are:
a. Ignore the tuple: This is usually done when the class label is
missing (classification). This method is not very effective unless the tuple
contains several attributes with missing values.
b. Fill in the missing value manually: In general, this approach is time
consuming and may not be feasible given a large data set with many missing
values.
c. Use a global constant to fill in the missing value: Replace all missing
attribute values by the same constant, such as a label like "Unknown" or −∞.
d. Use the attribute mean or median to fill in the missing value: Replace all
missing values in the attribute by the mean or median of that attribute's
values.
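A short pandas sketch of these filling strategies is shown below; the column names and the "Unknown" constant are hypothetical.

import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [52000, np.nan, 61000, np.nan, 48000],
                   "occupation": ["engineer", None, "teacher", "clerk", None]})

# Attribute mean or median (method d): fill numeric gaps with the column median.
df["income"] = df["income"].fillna(df["income"].median())

# Global constant (method c): fill categorical gaps with a label like "Unknown".
df["occupation"] = df["occupation"].fillna("Unknown")

# Ignoring the tuple (method a) would instead drop rows missing the class label:
# df = df.dropna(subset=["class_label"])
print(df)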
Noisy Data
Noise is a random error or variance in a measured variable. Data smoothing
techniques are used to eliminate noise and extract the useful patterns. The different
techniques used for data smoothing are:
a. Binning: Binning methods smooth a sorted data value by consulting its
“neighbourhood,” that is, the values around it. The sorted values are
distributed into several “buckets,” or bins. Because binning methods consult
the neighbourhood of values, they perform local smoothing.
There are three kinds of binning. They are:
o Smoothing by Bin Means: In this method, each value in a bin is
replaced by the mean value of the bin. For example, the mean of the
values 4, 8, and 15 in Bin 1 is 9. Therefore, each original value in this
bin is replaced by the value 9.
o Smoothing by Bin Medians: In this method, each value in a bin is
replaced by the median value of the bin. For example, the median of
the values 4, 8, and 15 in Bin 1 is 8. Therefore, each original value in
this bin is replaced by the value 8.
o Smoothing by Bin Boundaries: In this method, the minimum and
maximum values in each bin are identified as the bin boundaries. Each
bin value is then replaced by the closest boundary value. For example,
the middle value 8 of the values 4, 8, and 15 in Bin 1 is replaced with
the nearest boundary, i.e., 4. (A code sketch of all three smoothings
appears after this list.)
Example:
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
Smoothing by bin medians:
Bin 1: 8, 8, 8
Bin 2: 21, 21, 21
Bin 3: 28, 28, 28
Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
b. Regression: Data smoothing can also be done by regression, a technique
used to predict numeric values in a given data set. It analyses the
relationship between a target (dependent) variable and its predictor
(independent) variables.
o Regression is a supervised machine learning technique that tries to
predict a continuous-valued attribute.
o Regression can be done in two ways: linear regression involves finding the
"best" line to fit two attributes (or variables) so that one attribute can
be used to predict the other, while multiple linear regression is an extension
of linear regression, where more than two attributes are involved and
the data are fit to a multidimensional surface. (A small regression-smoothing
sketch also appears after this list.)
c. Clustering: It helps in identifying outliers. Similar values are organized
into clusters, and values that fall outside the clusters are treated as
outliers.
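The sketch below reproduces the equal-frequency binning example above (prices 4, 8, 15, ..., 34) and applies the three smoothing rules in plain Python.

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]                   # already sorted
bins = [prices[i:i + 3] for i in range(0, len(prices), 3)]    # equal-frequency bins of size 3

# Replace every value in a bin by the bin mean, the bin median, or the closest boundary.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]
by_medians = [[sorted(b)[len(b) // 2]] * len(b) for b in bins]
by_boundaries = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print("means:     ", by_means)       # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print("medians:   ", by_medians)     # [[8, 8, 8], [21, 21, 21], [28, 28, 28]]
print("boundaries:", by_boundaries)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]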
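And a minimal regression-smoothing sketch with scikit-learn: fit the "best" line relating a predictor attribute to a target attribute, then use the fitted line as the smoothed values. The numbers are invented for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[1], [2], [3], [4], [5]])        # predictor (independent) attribute
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])        # target (dependent) attribute

model = LinearRegression().fit(x, y)           # find the best-fitting line
print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("smoothed values:", model.predict(x))    # points on the fitted line replace noisy values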
2. Data Integration
Data integration is the process of combining data from multiple sources into a
single, unified view. This process involves identifying and accessing the different
data sources and mapping the data to a common format. Different data sources may
include multiple data cubes, databases, or flat files.
The goal of data integration is to make it easier to access and analyze data that is
spread across multiple systems or platforms, in order to gain a more complete and
accurate understanding of the data.
Data integration strategy is typically described using a triple (G, S, M) approach,
where G denotes the global schema, S denotes the schema of the heterogeneous
data sources, and M represents the mapping between the queries of the source and
global schema.
Example: To understand the (G, S, M) approach, let us consider a data integration
scenario that aims to combine employee data from two different HR databases,
database A and database B. The global schema (G) would define the unified view
of employee data, including attributes like EmployeeID, Name, Department, and
Salary.
In the schema of heterogeneous sources, database A (S1) might have attributes like
EmpID, FullName, Dept, and Pay, while database B's schema (S2) might have
attributes like ID, EmployeeName, DepartmentName, and Wage. The mappings
(M) would then define how the attributes in S1 and S2 map to the attributes in G,
allowing for the integration of employee data from both systems into the global
schema.
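A toy pandas sketch of this (G, S, M) idea is shown below: the two source schemas S1 and S2 are mapped onto the global schema G and combined into a unified view. All table contents are invented for illustration.

import pandas as pd

source_a = pd.DataFrame({"EmpID": [1], "FullName": ["Ana"], "Dept": ["IT"], "Pay": [50000]})
source_b = pd.DataFrame({"ID": [2], "EmployeeName": ["Ben"],
                         "DepartmentName": ["HR"], "Wage": [42000]})

# M: mappings from each source schema (S1, S2) to the global schema G.
map_a = {"EmpID": "EmployeeID", "FullName": "Name", "Dept": "Department", "Pay": "Salary"}
map_b = {"ID": "EmployeeID", "EmployeeName": "Name",
         "DepartmentName": "Department", "Wage": "Salary"}

global_view = pd.concat([source_a.rename(columns=map_a),
                         source_b.rename(columns=map_b)], ignore_index=True)
print(global_view)   # unified view under schema G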
Issues in Data Integration
There are several issues that can arise when integrating data from multiple sources,
including:
a. Data Quality: Data from different sources may have varying levels of
accuracy, completeness, and consistency, which can lead to data quality
issues in the integrated data.
b. Data Semantics: Integrating data from different sources can be challenging
because the same data element may have different meanings across sources.
c. Data Heterogeneity: Different sources may use different data formats,
structures, or schemas, making it difficult to combine and analyze the data.
3. Data Reduction
Imagine that you have selected data from the All Electronics data warehouse for
analysis. The data set will likely be huge! Complex data analysis and mining on
huge amounts of data can take a long time, making such analysis impractical or
infeasible.
Data reduction techniques can be applied to obtain a reduced representation of
the data set that is much smaller in volume, yet closely maintains the integrity of the
original data. That is, mining on the reduced data set should be more efficient yet
produce the same (or almost the same) analytical results.
In simple words, data reduction is a technique used in data mining to reduce the
size of a dataset while still preserving the most important information. This can be
beneficial in situations where the dataset is too large to be processed efficiently, or
where the dataset contains a large amount of irrelevant or redundant information.
There are several different data reduction techniques that can be used in data
mining, including:
a. Data Sampling: This technique involves selecting a subset of the data to
work with, rather than using the entire dataset. This can be useful for
reducing the size of a dataset while still preserving the overall trends and
patterns in the data.
b. Dimensionality Reduction: This technique involves reducing the number of
features in the dataset, either by removing features that are not relevant or by
combining multiple features into a single feature.
c. Data compression: This is the process of altering, encoding, or transforming
the structure of data in order to save space. By reducing duplication and
encoding data in binary form, data compression creates a compact
representation of the information. It involves techniques such as lossy or
lossless compression to reduce the size of a dataset.
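A brief sketch of the first two techniques (sampling and dimensionality reduction) on synthetic data, assuming pandas and scikit-learn are available:

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(10000, 20)))      # a "large" synthetic dataset

sample = df.sample(frac=0.1, random_state=0)         # data sampling: keep 10% of the rows
reduced = PCA(n_components=5).fit_transform(df)      # dimensionality reduction: 20 -> 5 features

print(sample.shape, reduced.shape)                   # (1000, 20) (10000, 5)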
4. Data Transformation
Data transformation in data mining refers to the process of converting raw data into
a format that is suitable for analysis and modelling. The goal of data transformation
is to prepare the data for data mining so that it can be used to extract useful insights
and knowledge.
Data transformation typically involves several steps, including:
1. Smoothing: It is a process that is used to remove noise from the dataset
using techniques such as binning, regression, and clustering.
2. Attribute construction (or feature construction): In this, new attributes
are constructed and added from the given set of attributes to help the mining
process.
3. Aggregation: In this, summary or aggregation operations are applied to the
data. For example, the daily sales data may be aggregated to compute
monthly and annual total amounts.
4. Data normalization: This process involves scaling data values into a small
range, such as -1.0 to 1.0 or 0.0 to 1.0.
5. Generalization: It converts low-level data attributes to high-level data
attributes using a concept hierarchy. For example, the attribute Age, initially
in numerical form (e.g., 22), is converted into a categorical value (young, old).
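The sketch below illustrates two of these steps, aggregation (daily sales rolled up to monthly totals) and min-max normalization into [0.0, 1.0], on hypothetical data.

import pandas as pd

daily = pd.DataFrame({"date": pd.date_range("2024-01-01", periods=90, freq="D"),
                      "sales": range(90)})

# Aggregation: summarize daily sales into monthly totals.
monthly = daily.set_index("date")["sales"].resample("MS").sum()

# Min-max normalization: rescale the sales attribute into the range [0.0, 1.0].
s = daily["sales"]
daily["sales_norm"] = (s - s.min()) / (s.max() - s.min())

print(monthly)
print(daily[["sales", "sales_norm"]].head())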
Method Name         | Irregularity handled                                               | Output
Data Cleaning       | Missing, noisy, and inconsistent data                              | Quality data before integration
Data Integration    | Different data sources (data cubes, databases, or flat files)      | A unified view of the data
Data Reduction      | Huge amounts of data that make analysis impractical or infeasible  | A reduced dataset that maintains the integrity of the original
Data Transformation | Raw data                                                            | Data prepared for data mining