Unit-1 Data Mining
INTRODUCTION:
Data mining is the process of extracting meaningful information from large datasets. It
involves algorithms to identify patterns, trends, and relationships among data.
KNOWLEDGE DISCOVERY IN DATABASES (KDD):
Knowledge Discovery in Databases (KDD) refers to the complete process of
uncovering valuable knowledge from large datasets.
It starts with the selection of relevant data, followed by pre-processing to clean and
organize it, transformation to prepare it for analysis, data mining to uncover patterns
and relationships, and concludes with the evaluation and interpretation of results,
ultimately producing valuable knowledge or insights.
KDD is widely utilized in fields like machine learning, pattern recognition, statistics,
artificial intelligence, and data visualization.
The KDD process is iterative, involving repeated refinements to ensure the accuracy
and reliability of the knowledge extracted.
The whole process consists of the following steps:
1. Data Selection
2. Data Cleaning and Pre-processing
3. Data Transformation and Reduction
4. Data Mining
5. Evaluation and Interpretation of Results
1. Data Selection:
Data Selection is the initial step in the Knowledge Discovery in Databases (KDD)
process, where relevant data is identified and chosen for analysis.
It involves selecting a dataset or focusing on specific variables, samples, or subsets of
data that will be used to extract meaningful insights.
It ensures that only the most relevant data is used for analysis, improving efficiency and
accuracy. Selection may range from the entire dataset to particular features or subsets,
depending on the task's goals. Careful data selection delivers accurate and relevant results.
2. Data Cleaning and Pre-Processing:
In the KDD process, Data Cleaning is essential for ensuring that the dataset is accurate
and reliable by correcting errors, handling missing values, removing duplicates, and
addressing noisy or outlier data. Data cleaning is crucial in KDD to enhance the quality
of the data and improve the effectiveness of data mining.
Missing Values: Gaps in data are filled with the mean or most probable value to
maintain dataset completeness.
Removing Duplicates: Duplicate records are removed to maintain consistency
and avoid errors in analysis.
Noisy Data: Noise is reduced using techniques like binning, regression, or
clustering to smooth or group the data.
3. Data Transformation and Reduction:
Data Transformation involves converting data into a format that is suitable for analysis.
Normalization: Scaling data to a common range for consistency across variables.
Discretization: Converting continuous data into discrete categories for simpler
analysis.
Data Aggregation: Summarizing multiple data points (e.g., averages or totals) to
simplify analysis.
Concept Hierarchy Generation: Organizing data into hierarchies for a clearer,
higher-level view.
Data Reduction helps simplify the dataset while preserving key information.
1. Dimensionality Reduction: Reducing the number of variables while keeping
essential data.
2. Numerosity Reduction: Reducing data points using methods like sampling to
maintain critical patterns.
3. Data Compression: Compacting data for easier storage and processing.
4. Data Mining:
Data Mining is the process of discovering valuable, previously unknown patterns from
large datasets through automatic or semi-automatic means.
It involves exploring vast amounts of data to extract useful information that can drive
decision-making.
In the KDD process, choosing the data mining task is critical. Depending on the objective,
the task could involve classification, regression, clustering, or association rule mining.
After determining the task, selecting the appropriate data mining algorithms is essential.
These algorithms are chosen based on their ability to efficiently and accurately identify
patterns that align with the goals of the analysis.
5. Evaluation and Interpretation of Results:
Evaluation in KDD involves assessing the patterns identified during data mining to
determine their relevance and usefulness.
It includes calculating the “interestingness score” for each pattern, which helps to
identify valuable insights.
Visualization and summarization techniques are then applied to make the data more
understandable and accessible for the user.
Interpretation of Results focuses on presenting these understandings in a way that is
meaningful and actionable.
By effectively communicating the findings, decision-makers can use the results to drive
informed actions and strategies.
Summary of Functionalities:
Functionality - Goal - Example Application
Association Rule Mining - Discover relationships - Market basket analysis
Classification - Predict categories - Spam email detection
Prediction - Forecast outcomes - Sales or demand forecasting
Clustering - Group similar data - Customer segmentation
Anomaly Detection - Find outliers - Fraud detection
Summarization - Provide data overviews - Business reports
Sequence Analysis - Analyse ordered data - Stock price trends
Knowledge Discovery - Reveal hidden insights - Trend detection in customer data
INTRODUCTION TO DATA WAREHOUSE:
A Data Warehouse consists of data from multiple heterogeneous data sources and is used
for analytical reporting and decision making.
Data Warehouse is a central place where data is stored from different data sources and
applications.
The term Data Warehouse was first coined by Bill Inmon in 1990.
Data Warehouse is always kept separate from Operational Database.
The data in a Data warehouse system is loaded from operational transaction systems like –
Sales, Marketing, HR etc.
A Data Warehouse is used for reporting and analysis of information and stores both
historical and current data.
The data in Data warehouse system is used for Analytical reporting, which is later used by
Business Analysts, Sales Managers or Knowledge workers for decision-making.
Data in data warehouse is accessed by BI (Business Intelligence) users for Analytical
Reporting, Data Mining and Analysis.
This is used for decision making by Business Users, Sales Manager, Analysts to define
future strategy.
Data from multiple heterogeneous data sources flows into the Data Warehouse. Common data
sources for a data warehouse include operational databases, reports, data mining patterns,
and flat files (xls, csv, txt files), etc.
CHARACTERISTICS OF A DATA WAREHOUSE:
1. Subject Oriented: In a DW system, the data is categorized and stored by business
subject (for example, equity plans, shares, loans) rather than by application.
2. Integrated: Data from multiple data sources are integrated in a Data Warehouse.
3. Non Volatile: Data in data warehouse is non-volatile. It means when data is loaded in
Data Warehouse system, it is not altered.
4. Time Variant: A Data Warehouse system contains historical data as compared to
Transactional system which contains only current data. In a Data warehouse you can
see data for 3 months, 6 months, 1 year, 5 years, etc.
Data Warehouse vs DBMS:
Processing: A database is based on operational or transactional processing, where each
operation is an indivisible transaction. A data warehouse is based on analytical processing.
Data stored: A database generally stores current and up-to-date data which is used for daily
operations. A data warehouse maintains historical data over time; historical data is the data
kept over years and can be used for trend analysis, future predictions, and decision support.
Scope: A database is generally application specific. Example: a database stores related data,
such as the student details in a school. A data warehouse is generally integrated at the
organization level, by combining data from different databases. Example: a data warehouse
integrates the data from one or more databases, so that analysis can be done to get results,
such as the best-performing school in a city.
Cost: Constructing a database is not so expensive. Constructing a data warehouse can be
expensive.
DISADVANTAGES OF DATA WAREHOUSING:
1. Cost: Building a data warehouse can be expensive, requiring significant investments in
hardware, software, and personnel.
2. Complexity: Data warehousing can be complex, and businesses may need to hire
specialized personnel to manage the system.
3. Time-consuming: Building a data warehouse can take a significant amount of time,
requiring businesses to be patient and committed to the process.
4. Data integration challenges: Data from different sources can be challenging to
integrate, requiring significant effort to ensure consistency and accuracy.
5. Data security: Data warehousing can pose data security risks, and businesses must take
measures to protect sensitive data from unauthorized access or breaches.
DATA PRE-PROCESSING:
Data pre-processing is an important process of data mining. In this process, raw data is
converted into an understandable format and made ready for further analysis. The motive is to
improve data quality and make it suitable for the specific mining task.
Tasks in Data Pre-processing:
1. Data Cleaning
2. Data Integration
3. Data Transformation
4. Data Reduction
DATA CLEANING:
Data cleaning refers to cleaning the data by filling in missing values, smoothing noisy
data, identifying and removing outliers, and removing inconsistencies in the data.
Sometimes the level of detail in the data differs from what is required; for example, the
analysis may need age ranges such as 20-30, 30-40, 40-50, while the imported data
contains only the birth date. The data can be cleaned by converting it into the required form.
There are various types of data cleaning which are as follows:
1. Missing values
2. Noisy data
3. Combined computer and human inspection
4. Inconsistent data
1. Missing Values: Missing values are filled with appropriate values. The following
approaches can be used to fill them (a short pandas sketch is given after the list).
a) The tuple is ignored when it includes several attributes with missing values.
b) The missing values are filled in manually.
c) A global constant is used to fill in all missing values.
d) The attribute mean is used to fill in the missing values.
e) The most probable value is used to fill in the missing values.
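As an illustration, below is a minimal pandas sketch of approaches (a), (d), and (e); the DataFrame, its column names, and the values are hypothetical.

import pandas as pd

# Hypothetical data with missing values (NaN) in "age" and "city"
df = pd.DataFrame({"age": [25, None, 40, 35], "city": ["A", "B", None, "A"]})

df_ignored = df.dropna()                                # (a) ignore tuples with missing values
df["age"] = df["age"].fillna(df["age"].mean())          # (d) fill with the attribute mean
df["city"] = df["city"].fillna(df["city"].mode()[0])    # (e) fill with the most frequent value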
2. Noisy data: Noise is a random error or variance in a measured variable. There are the
following smoothing methods to handle noise. They are: binning, regression, clustering.
a) Binning: Binning methods smooth a sorted data value by consulting its
“neighbourhood”. The sorted values are distributed into a number of buckets or bins.
Because binning methods consult the neighbourhood of values, they perform local
smoothing.
Types of binning techniques: smoothing by bin means, by bin medians, and by bin boundaries.
Example: The data for price are first sorted and then partitioned into equal-frequency / equi-
depth bins of size 3 (i.e., each bin contains three values).
In smoothing by bin means, each value in a bin is replaced by the mean value of the
bin. For example, the mean of the values 4, 8, and 15 in Bin 1 is 9. Therefore, each
original value in this bin is replaced by the value 9.
In smoothing by bin medians, each bin value is replaced by the bin median.
In smoothing by bin boundaries, the minimum and maximum values in a given bin
are identified as the bin boundaries. Each bin value is then replaced by the closest
boundary value.
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
Smoothing by bin medians:
Bin 1: 8, 8, 8
Bin 2: 21, 21, 21
Bin 3: 28, 28, 28
Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
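The worked example above can be reproduced with a small plain-Python sketch; the variable names are only for illustration.

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]                    # sorted price data
bins = [prices[i:i + 3] for i in range(0, len(prices), 3)]     # equi-depth bins of size 3

by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]   # Bin 1 -> [9, 9, 9]
by_medians = [[sorted(b)[len(b) // 2]] * len(b) for b in bins] # Bin 1 -> [8, 8, 8]
by_boundaries = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]
# by_boundaries replaces each value with the closer bin boundary, e.g. Bin 1 -> [4, 4, 15]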
b) Regression: Regression functions are used to smooth the data. Regression can be
linear (one independent variable) or multiple (several independent variables).
c) Clustering: Clustering helps in identifying outliers. Similar values are organized
into clusters, and values that fall outside the clusters are treated as outliers.
3. Combined computer and human inspection: Outliers can also be recognized through a
combination of computer and human inspection. The detected outlier patterns may turn out
to be genuinely interesting or simply garbage.
4. Inconsistent data: Inconsistencies can be introduced in various transactions, during data
entry, or when integrating information from multiple databases. Some redundancies can be
recognized by correlation analysis. Accurate and proper integration of the data from various
sources can reduce and avoid redundancy.
DATA INTEGRATION:
Data integration in data mining refers to the process of combining data from multiple
heterogeneous data sources into a single, combined view. These sources may include
multiple data cubes, databases, or flat files.
This can involve cleaning the data, as well as resolving any inconsistencies or conflicts
that may exist between the different sources.
The goal of data integration is to make the data more useful and meaningful for the
purposes of analysis and decision making.
It becomes easier to access and analyse data that is spread across multiple systems or
platforms, in order to gain a more complete and accurate understanding of the data.
The data integration approaches are formally defined as triple (G, S, M) where,
G stands for the global schema,
S stands for the heterogeneous source of schema,
M stands for mapping between the queries of source and global schema.
There are three issues to consider during data integration (a short integration sketch is given after this list):
1. Schema Integration:
Integrate metadata from different sources. Different data sources may have varying
schemas, structures, or representations of the same data. Matching real-world entities
across multiple sources is referred to as the entity identification problem.
Example: One system may use "CustomerID" while another uses "ClientID" for the
same entity.
Solution: Schema matching algorithms or manual mapping can help resolve
inconsistencies, though it can be time-consuming.
2. Redundancy Detection:
An attribute may be redundant if it can be derived or obtained from another attribute
or set of attributes. Inconsistencies in attributes can also cause redundancies in the
resulting data set. Some redundancies can be detected by correlation analysis.
3. Resolution of data value conflicts:
Attribute values from different sources may differ for the same real-world entity.
Conflicting data values may arise from different sources. Without proper conflict
resolution, integrated data may remain inconsistent or incorrect.
Example: A customer’s address may differ between two databases.
Solution: Use conflict resolution strategies, such as prioritizing specific sources or
averaging numerical values.
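As a small illustration of schema integration and conflict resolution together, the pandas sketch below renames ClientID to CustomerID and prefers one source's address when the other is missing; the tables, column names, and resolution rule are hypothetical.

import pandas as pd

crm = pd.DataFrame({"CustomerID": [1, 2], "address": ["12 Oak St", None]})
sales = pd.DataFrame({"ClientID": [1, 2], "address": ["12 Oak Street", "9 Elm Rd"]})

sales = sales.rename(columns={"ClientID": "CustomerID"})                 # schema integration
merged = crm.merge(sales, on="CustomerID", suffixes=("_crm", "_sales"))

# Conflict resolution: prefer the CRM address, fall back to the sales system when missing
merged["address"] = merged["address_crm"].fillna(merged["address_sales"])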
DATA TRANSFORMATION:
Data transformation is a pre-processing step in which raw data is converted into an
appropriate format to be used effectively by mining algorithms.
The goal is to improve the quality, efficiency, and accuracy of analysis. Transformation
ensures that data is consistent, relevant, and suitable for extracting meaningful patterns.
In this, data are transformed or combined into forms suitable for mining.
Data Transformation Techniques are:
1. Smoothing: Smoothing removes noise from the data. Such methods include binning,
regression, and clustering.
2. Aggregation: Summary or aggregation operations are applied to the data. For example,
daily sales data may be aggregated to compute monthly and annual totals. This step is
generally used when constructing a data cube for analysis of the data at multiple
granularities.
3. Generalization: Low-level or “primitive” (raw) data are replaced by higher-level
concepts through the use of concept hierarchies. For example, categorical attributes such
as street can be generalized to higher-level concepts, such as city or country. Similarly,
values of numerical attributes, such as age, can be mapped to higher-level concepts, like
youth, middle-aged, and senior.
4. Normalization: The attribute data are scaled to fall within a small specified range,
such as -1.0 to 1.0, or 0.0 to 1.0.
The methods for data normalization are: min-max, z-score, and decimal scaling.
a) Min-Max Normalization: It is a way to rescale data to a fixed range, typically
between 0 and 1. It's done by taking the minimum and maximum values of a dataset
and then proportionally transforming each data point to fit within that range.
Formula: X_normalized = (X - X_min) / (X_max - X_min)
Where:
X_normalized is the normalized value of X.
X is the original value.
X_min is the minimum value in the dataset.
X_max is the maximum value in the dataset.
Example: Suppose we have the following dataset of house prices (in thousands of
dollars): 200, 250, 300, 350, 400
Find the minimum and maximum values: X_min = 200, X_max = 400
Normalize each value:
For 200: (200 - 200) / (400 - 200) = 0
For 250: (250 - 200) / (400 - 200) = 0.25
For 300: (300 - 200) / (400 - 200) = 0.5
For 350: (350 - 200) / (400 - 200) = 0.75
For 400: (400 - 200) / (400 - 200) = 1
Result: The normalized dataset becomes: 0, 0.25, 0.5, 0.75, 1
b) Z-Score Normalization: It rescales values using the mean and standard deviation of the
attribute, so that the transformed values are centred around 0.
Formula: X_normalized = (X - mean) / std_dev
Example: For the house prices 200, 250, 300, 350, 400, the mean is 300 and the (population)
standard deviation is about 70.71, so the normalized values are approximately
-1.41, -0.71, 0, 0.71, 1.41.
c) Decimal Scaling: It is a normalization technique that scales data by moving the
decimal point of the values. The number of decimal places to move is determined by
the largest absolute value in the dataset. This ensures that all values fall within a
defined range, usually between -1 and 1.
Formula: Normalized value = Value / 10^j
Where: j is the smallest integer such that max(|Value|) < 10^j
Example: Suppose we have the following dataset of house prices (in thousands of
dollars): 150, 250, 100, 300, 200
Find the maximum absolute value: The maximum value in the dataset is 300.
Determine the value of j:
We need to find the smallest integer j such that 300 < 10^j.
In this case, j = 3 (since 300 has 3 digits), and 300 < 10^3 = 1000.
Normalize each value:
For 150: 150 / 10^3 = 0.15
For 250: 250 / 10^3 = 0.25
For 100: 100 / 10^3 = 0.10
For 300: 300 / 10^3 = 0.30
For 200: 200 / 10^3 = 0.20
Result: The normalized dataset becomes: 0.15 0.25 0.10 0.30 0.20
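Below is a short plain-Python sketch of all three normalization methods, using the example datasets from above (z-score uses the population standard deviation).

import math

prices = [200, 250, 300, 350, 400]                 # min-max / z-score example data
lo, hi = min(prices), max(prices)
min_max = [(x - lo) / (hi - lo) for x in prices]   # -> [0.0, 0.25, 0.5, 0.75, 1.0]

mean = sum(prices) / len(prices)                   # 300
std = math.sqrt(sum((x - mean) ** 2 for x in prices) / len(prices))   # about 70.71
z_score = [(x - mean) / std for x in prices]       # -> about [-1.41, -0.71, 0, 0.71, 1.41]

data = [150, 250, 100, 300, 200]                   # decimal-scaling example data
j = len(str(int(max(abs(x) for x in data))))       # digit count of max|x|; works here because
                                                   # all values are positive whole numbers (j = 3)
dec_scaled = [x / 10 ** j for x in data]           # -> [0.15, 0.25, 0.10, 0.30, 0.20]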
DATA REDUCTION:
Data reduction is a technique used in data mining to minimize the volume of data while
maintaining its integrity and the quality of the analysis.
It helps improve efficiency, reduce storage costs, and enhance processing speed without
significant loss of important information.
This can be beneficial in situations where the dataset is too large to be processed efficiently,
or where it contains a large amount of irrelevant or redundant information.
Data Reduction Techniques are:
1. Data Cube Aggregation
2. Attribute Subset Selection
3. Data Compression
4. Numerosity Reduction
5. Discretization and Concept Hierarchy Generation
1. Data Cube Aggregation:
Data Cube Aggregation is a multidimensional aggregation that uses aggregation at
various levels of a data cube to represent the original data set, thus achieving data
reduction. This technique is used to aggregate data in a simpler form.
For example, suppose we have the data of All Electronics sales per quarter for the years
2018 to 2020. If we want the annual sales per year, we just have to aggregate the sales
per quarter for each year.
In this way, aggregation provides you with the required data, which is much smaller in
size, and thereby we achieve data reduction even without losing any data.
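Below is a small pandas sketch of this roll-up from quarterly to annual sales; the figures are made up purely for illustration.

import pandas as pd

quarterly = pd.DataFrame({
    "year":    [2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "sales":   [100, 120, 90, 130, 110, 140, 95, 150],
})

# Aggregate the quarterly figures to one (much smaller) row per year
annual = quarterly.groupby("year", as_index=False)["sales"].sum()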
2. Attribute Subset Selection:
Attribute subset selection removes attributes that are irrelevant or only weakly relevant,
so that only the attributes required for our analysis are retained. It reduces data size by
eliminating redundant features.
The methods for attribute selection are:
a) Step-wise Forward Selection
b) Step-wise Backward Elimination
c) Combination of Forward Selection and Backward Elimination
a) Step-wise Forward Selection:
The selection begins with an empty set of attributes. At each step, the best of the
remaining original attributes is determined and added to the set, based on its relevance
to the mining task.
Example: Suppose there are the following attributes in the data set in which few
attributes are redundant.
Initial attribute Set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: { }
Step-1: {X1}
Step-2: {X1, X2}
Step-3: {X1, X2, X5}
Final reduced attribute set: {X1, X2, X5}
b) Step-wise Backward Elimination:
This selection starts with the full set of attributes in the original data and, at each
step, it eliminates the worst remaining attribute from the set.
Example: Suppose there are the following attributes in the data set in which few
attributes are redundant.
Initial attribute Set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: {X1, X2, X3, X4, X5, X6 }
Step-1: {X1, X2, X3, X4, X5}
Step-2: {X1, X2, X3, X5}
Step-3: {X1, X2, X5}
Final reduced attribute set: {X1, X2, X5}
c) Combination of Forward Selection and Backward Elimination:
It combines the two approaches, selecting the best attributes and removing the worst
ones at each step, which saves time and makes the process faster (a small sketch of
forward selection is given below).
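A minimal sketch of step-wise forward selection is shown below. The scoring function score(subset) is a hypothetical relevance measure (for example, the accuracy of a model trained on that subset) that the user supplies, and k is the number of attributes to keep.

def forward_selection(attributes, score, k):
    selected = []                               # start with an empty attribute set
    remaining = list(attributes)
    while remaining and len(selected) < k:
        # add the remaining attribute that improves the score the most
        best = max(remaining, key=lambda a: score(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected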
3. Data Compression:
The data compression technique reduces the size of the data using different encoding
mechanisms (e.g., Huffman encoding and run-length encoding).
The two types of Data Compression Techniques are:
a) Lossless Compression:
In this, No information is lost during compression and decompression. It is used
when exact data recovery is required (e.g., text, medical data, and financial
records). The Common techniques are:
1. Run-Length Encoding (RLE): Replaces consecutive identical values with a
single value and count
2. Huffman Coding: Uses variable-length codes based on frequency
(shorter codes for frequent values)
3. Lempel-Ziv-Welch (LZW): Dictionary-based method used in ZIP files and
GIF images
For Example:
Original Data: AAAAABBBCCDAA
RLE Compression: 5A3B2C1D2A
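The RLE example above can be reproduced with a few lines of Python (a minimal sketch, not a production encoder):

from itertools import groupby

def rle_encode(text):
    # Replace each run of identical characters with "<count><character>"
    return "".join(f"{len(list(group))}{ch}" for ch, group in groupby(text))

print(rle_encode("AAAAABBBCCDAA"))   # -> 5A3B2C1D2A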
b) Lossy Compression:
In this, some data is discarded to achieve a higher compression ratio. It is used
when exact data recovery is not required (e.g., images, audio, video). The
Common techniques are: JPEG Compression (for images), MP3 Compression
(for audio), MPEG Compression (for video).
Example (JPEG Compression): It removes minor colour variations that are not
noticeable to the human eye. It achieves high compression while retaining visual
quality.
4. Numerosity Reduction:
Numerosity reduction is a technique that replaces the original dataset with a smaller,
more compact representation while preserving its essential information. Instead of
storing the entire dataset, a mathematical model or sample is used to represent the
data, reducing storage and processing costs.
The types of Numerosity Reduction Techniques are: Parametric and Non-Parametric.
a) Parametric Methods:
These methods approximate the dataset using mathematical models. Instead of
storing the entire dataset, only the model parameters are stored.
They include regression models (which fit the data to a function) and log-linear
models (which use probability distributions).
b) Non-Parametric Methods:
These methods do not assume any model but reduce data volume through techniques
such as clustering (grouping similar data points together), sampling (selecting a
representative subset of the data rather than the entire dataset), and histograms
(grouping numeric data into bins and storing only the bin ranges and counts instead
of every individual value), as in the sketch below.
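A small plain-Python sketch of two non-parametric reductions, sampling and an equi-width histogram; the data and bin width are hypothetical.

import random

values = list(range(1, 1001))            # hypothetical numeric attribute with 1000 values

sample = random.sample(values, 50)       # sampling: keep a representative subset of 50 values

# Histogram: store only the bin ranges and counts instead of every individual value
bin_width = 100
histogram = {}
for v in values:
    lower = (v - 1) // bin_width * bin_width + 1    # bin ranges 1-100, 101-200, ...
    key = (lower, lower + bin_width - 1)
    histogram[key] = histogram.get(key, 0) + 1      # 10 bins, each with a count of 100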
5. Discretization & Concept Hierarchy Generation:
Discretization:
It is the process of converting continuous data values into a finite set of intervals or
categories. It simplifies data and makes it easier to analyse and understand.
It improves the performance of algorithms that work better with discrete data. It reduces
noise and helps to smooth out minor variations in continuous data.
For example, Imagine we have the ages of customers. Instead of treating each age as
a unique value, we can group them into ranges like "18-25", "26-35", etc.
The Methods used are: Binning, Histogram and Clustering.
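For the customer-age example above, a short pandas sketch using binning (pd.cut); the age values and cut points are hypothetical.

import pandas as pd

ages = pd.Series([19, 23, 31, 45, 52, 67])
groups = pd.cut(ages, bins=[18, 25, 35, 50, 100],
                labels=["18-25", "26-35", "36-50", "50+"])
# 19 and 23 fall in "18-25", 31 in "26-35", 45 in "36-50", 52 and 67 in "50+"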
Concept Hierarchy Generation:
It is creating a hierarchical structure for categorical data, where more general
concepts are at higher levels and more specific concepts are at lower levels.
It provides a structured way to organize and interpret categorical data and improves data
understanding. It helps in DM tasks like association rule mining and classification.
The methods are manual definition, schema-based generation, and automatic generation.
For example, "city" can be grouped into "state", then "country". It allows analysis
at different levels of detail. We can explore data at a broad level (country) or a more
specific level (city).
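A minimal sketch of a manually defined city -> state -> country hierarchy; the places used are only examples.

# Manually defined concept hierarchy: city -> (state, country)
hierarchy = {
    "Mumbai":    ("Maharashtra", "India"),
    "Hyderabad": ("Telangana", "India"),
    "Chicago":   ("Illinois", "USA"),
}

def roll_up(city, level):
    # level 0 = city, 1 = state, 2 = country
    state, country = hierarchy[city]
    return (city, state, country)[level]

print(roll_up("Hyderabad", 2))   # -> India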