Data Warehouse and Data Mining: Definition and Concepts
What is Data Mining?
Data mining is the process of extracting knowledge or insights from large
amounts of data using various statistical and computational techniques. The
data can be structured, semi-structured or unstructured, and can be stored in
various forms such as databases, data warehouses, and data lakes.
The primary goal of data mining is to discover hidden patterns and
relationships in the data that can be used to make informed decisions or
predictions. This involves exploring the data using various techniques such as
clustering, classification, regression analysis, association rule mining, and
anomaly detection.
Data Preprocessing in Data Mining
Data preprocessing is the process of preparing raw data for analysis by
cleaning and transforming it into a usable format. In data mining, it involves
tasks such as cleaning, transforming, and organizing raw data into a format
suitable for mining algorithms.
● The goal is to improve the quality of the data.
● Helps in handling missing values, removing duplicates, and
normalizing data.
● Ensures the accuracy and consistency of the dataset.
Steps in Data Preprocessing
Some key steps in data preprocessing are Data Cleaning, Data Integration,
Data Transformation, and Data Reduction.
1. Data Cleaning: It is the process of identifying and correcting errors or
inconsistencies in the dataset. It involves handling missing values, removing
duplicates, and correcting incorrect or outlier data to ensure the dataset is
accurate and reliable. Clean data is essential for effective analysis, as it
improves the quality of results and enhances the performance of data
models.
● Missing Values: These occur when data is absent from the dataset. You
can either ignore the rows with missing data or fill the gaps
manually, with the attribute mean, or with the most probable
value. This keeps the dataset accurate and complete for
analysis.
● Noisy Data: It refers to irrelevant or incorrect data that is difficult for
machines to interpret, often caused by errors in data collection or
entry. It can be handled in several ways:
○ Binning Method: The data is sorted and divided into
equal-sized segments (bins), and each segment is smoothed by
replacing its values with the bin mean or the bin boundary values.
○ Regression: Data can be smoothed by fitting it to a
regression function, either linear or multiple, to
predict values.
○ Clustering: This method groups similar data points
together, with outliers falling outside the clusters,
where they can be detected and removed. These techniques help
remove noise and improve data quality.
● Removing Duplicates: It involves identifying and eliminating
repeated data entries to ensure accuracy and consistency in the
dataset. This process prevents errors and ensures reliable analysis
by keeping only unique records.
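The cleaning tasks above can be sketched with pandas. This is a minimal, illustrative example; the column names, fill strategies, and bin count are assumptions, not part of any particular dataset.

```python
import pandas as pd
import numpy as np

# Hypothetical raw data with missing values and duplicate rows (illustrative only).
df = pd.DataFrame({
    "age":    [25, np.nan, 47, 25, 31],
    "salary": [50000, 62000, np.nan, 50000, 58000],
})

# Handle missing values: fill numeric gaps with the attribute mean.
df["age"] = df["age"].fillna(df["age"].mean())
df["salary"] = df["salary"].fillna(df["salary"].mean())

# Remove duplicate records, keeping only unique rows.
df = df.drop_duplicates()

# Smooth noisy values by binning: equal-width bins, each value replaced by its bin mean.
df["age_binned"] = pd.cut(df["age"], bins=3)
df["age_smoothed"] = df.groupby("age_binned", observed=True)["age"].transform("mean")

print(df)
```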
2. Data Integration: It involves merging data from various sources into a
single, unified dataset. It can be challenging due to differences in data
formats, structures, and meanings. Techniques like record linkage and data
fusion help in combining data efficiently, ensuring consistency and accuracy.
● Record Linkage is the process of identifying and matching records
from different datasets that refer to the same entity, even if they are
represented differently. It helps in combining data from various
sources by finding corresponding records based on common
identifiers or attributes.
● Data Fusion involves combining data from multiple sources to
create a more comprehensive and accurate dataset. It integrates
information that may be inconsistent or incomplete from different
sources, ensuring a unified and richer dataset for analysis.
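The integration step can be illustrated with a small pandas sketch. The two sources and the shared customer_id key are hypothetical; real record linkage often also requires fuzzy matching on names or addresses.

```python
import pandas as pd

# Two hypothetical sources describing the same customers (illustrative only).
crm = pd.DataFrame({"customer_id": [1, 2, 3],
                    "name": ["Ana", "Bob", "Cara"]})
sales = pd.DataFrame({"customer_id": [2, 3, 4],
                      "total_spent": [120.0, 75.5, 40.0]})

# Record linkage on the shared identifier; an outer join keeps unmatched records
# from both sources so nothing is silently dropped.
unified = crm.merge(sales, on="customer_id", how="outer")
print(unified)
```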
3. Data Transformation: It involves converting data into a format suitable for
analysis. Common techniques include normalization, which scales data to a
common range; standardization, which adjusts data to have zero mean and
unit variance; and discretization, which converts continuous data into discrete
categories. These techniques help prepare the data for more accurate
analysis.
● Data Normalization: The process of scaling data to a common range
to ensure consistency across variables.
● Discretization: Converting continuous data into discrete categories
for easier analysis.
● Data Aggregation: Combining multiple data points into a summary
form, such as averages or totals, to simplify analysis.
● Concept Hierarchy Generation: Organizing data into a hierarchy of
concepts to provide a higher-level view for better understanding
and analysis.
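A brief pandas sketch of the transformation techniques above. The data, bin edges, and labels are illustrative assumptions.

```python
import pandas as pd

# Hypothetical numeric data (illustrative only).
df = pd.DataFrame({"region": ["N", "N", "S", "S"],
                   "income": [30000, 45000, 60000, 90000]})

# Normalization: min-max scaling to the [0, 1] range.
df["income_norm"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())

# Standardization: zero mean and unit variance.
df["income_std"] = (df["income"] - df["income"].mean()) / df["income"].std()

# Discretization: continuous income converted into labeled categories.
df["income_band"] = pd.cut(df["income"], bins=[0, 40000, 70000, float("inf")],
                           labels=["low", "medium", "high"])

# Aggregation: summarize income per region.
summary = df.groupby("region")["income"].agg(["mean", "sum"])
print(df)
print(summary)
```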
4. Data Reduction: It reduces the dataset’s size while maintaining key
information. This can be done through feature selection, which chooses the
most relevant features, and feature extraction, which transforms the data into
a lower-dimensional space while preserving important details. Common
reduction techniques include:
● Dimensionality Reduction (e.g., Principal Component Analysis): A
technique that reduces the number of variables in a dataset while
retaining its essential information.
● Numerosity Reduction: Reducing the number of data points by
methods like sampling to simplify the dataset without losing critical
patterns.
● Data Compression: Reducing the size of data by encoding it in a
more compact form, making it easier to store and process.
● Concept Hierarchy Generation: The main idea behind a concept
hierarchy is that the same data can be represented at different levels of
granularity or detail; organizing the data in a hierarchical fashion
makes it easier to understand and analyze.
● Data Cube Aggregation: A data cube enables data to be modeled and
viewed in multiple dimensions. It is defined by dimensions and
facts; dimensions are the perspectives or entities with respect to
which an organization keeps records.
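The reduction techniques can be sketched with scikit-learn's PCA plus simple random sampling. The synthetic data and the 95% variance target are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical dataset: 200 records with 10 mostly redundant features (illustrative only).
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
X = np.hstack([base, base @ rng.normal(size=(3, 7))])  # 10 columns built from 3 factors

# Dimensionality reduction: keep enough principal components for 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)

# Numerosity reduction: simple random sampling of rows.
sample_idx = rng.choice(len(X_reduced), size=50, replace=False)
X_sample = X_reduced[sample_idx]
print(X_sample.shape)
```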
Uses of Data Preprocessing
Data preprocessing is utilized across various fields to ensure that raw data is
transformed into a usable format for analysis and decision-making. Here are
some key areas where data preprocessing is applied:
1. Data Warehousing: In data warehousing, preprocessing is essential for
cleaning, integrating, and structuring data before it is stored in a centralized
repository. This ensures the data is consistent and reliable for future queries
and reporting.
2. Data Mining: Data preprocessing in data mining involves cleaning and
transforming raw data to make it suitable for analysis. This step is crucial for
identifying patterns and extracting insights from large datasets.
3. Machine Learning: In machine learning, preprocessing prepares raw data
for model training. This includes handling missing values, normalizing
features, encoding categorical variables, and splitting datasets into training
and testing sets to improve model performance and accuracy.
4. Data Science: Data preprocessing is a fundamental step in data science
projects, ensuring that the data used for analysis or building predictive
models is clean, structured, and relevant. It enhances the overall quality of
insights derived from the data.
5. Web Mining: In web mining, preprocessing helps analyze web usage logs
to extract meaningful user behavior patterns. This can inform marketing
strategies and improve user experience through personalized
recommendations.
6. Business Intelligence (BI): Preprocessing supports BI by organizing and
cleaning data to create dashboards and reports that provide actionable
insights for decision-makers.
7. Deep Learning: Similar to machine learning, deep learning
applications require preprocessing to normalize or enhance features of the
input data, optimizing the model training process.
Advantages of Data Preprocessing
● Improved Data Quality: Ensures data is clean, consistent, and
reliable for analysis.
● Better Model Performance: Reduces noise and irrelevant data,
leading to more accurate predictions and insights.
● Efficient Data Analysis: Streamlines data for faster and easier
processing.
● Enhanced Decision-Making: Provides clear and well-organized data
for better business decisions.
Disadvantages of Data Preprocessing
● Time-Consuming: Requires significant time and effort to clean,
transform, and organize data.
● Resource-Intensive: Demands computational power and skilled
personnel for complex preprocessing tasks.
● Potential Data Loss: Incorrect handling may result in losing
valuable information.
● Complexity: Handling large or diverse datasets can make
preprocessing a complex, multi-step task.
Statistical Measures in Large Databases
Relational database systems support five built-in aggregate functions:
count(), sum(), avg(), max(), and min(). These aggregate functions can be
used as basic measures in the descriptive mining of multidimensional
data. Two classes of descriptive statistical measures, measures of
central tendency and measures of data dispersion, can be used effectively in
large multidimensional databases.
Measures of central tendency − Measures of central tendency include the mean,
median, mode, and mid-range.
Mean − The arithmetic average is computed by adding together all the
values and dividing by the number of values; it therefore uses data from every
single value. Let x1, x2, ..., xN be a set of N values or observations, such as
salaries. The mean of this set of values is:
Mean = (x1 + x2 + ... + xN) / N, i.e. Mean = sum / count
This corresponds to the built-in aggregate function average (avg())
supported in relational database systems. In many data cubes, sum and
count are saved during pre-computation, so the derivation of the average is
straightforward.
Median − There are two cases for computing the median, based on the
number of values.
If x1, x2, ..., xn are arranged in sorted order and n is odd, then:
Median = the ((n + 1) / 2)th value
For example, for 1, 4, 6, 7, 12, 14, 18:
Median = 7
When n is even, the median is the average of the two middle values:
Median = [ (n/2)th value + ((n/2) + 1)th value ] / 2
For example, for 1, 4, 6, 7, 8, 12, 14, 16:
Median = (7 + 8) / 2 = 7.5
The median is neither a distributive measure nor an algebraic measure; it is a
holistic measure. Although it is not easy to compute the exact median in
a huge database, an approximate median can be computed efficiently.
Mode − It is the most common value in a set of values. Distributions can be
unimodal, bimodal, or multimodal. If the data is categorical (measured on the
nominal scale), then only the mode can be computed. The mode can also be
computed for ordinal and numeric data, although other measures are usually
more informative for such data.
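A short Python check of these measures using the standard library's statistics module, on a small illustrative set of values.

```python
import statistics

# Example salary values in thousands (illustrative only); n = 11.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 110]

print("mean:", statistics.mean(values))      # sum / count, corresponds to SQL avg()
print("median:", statistics.median(values))  # the ((n + 1) / 2)th value, since n is odd
print("mode:", statistics.mode(values))      # the most common value (52 appears twice)
```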
Measuring the dispersion of data − The degree to which numerical data
tends to spread is known as the dispersion, or variance, of the data.
The most common measures of data dispersion are the range, the interquartile
range, and the standard deviation.
Range − The range is represented as the difference between the largest value
and the smallest value in the set of data.
Range = MaxValue − MinValue
Quartiles − The most commonly used percentiles other than the median are
quartiles. The first quartile, denoted by Q1, is the 25th percentile; the third
quartile, denoted by Q3, is the 75th percentile. The quartiles, together with the
median, give some indication of the center, spread, and shape of a distribution.
The distance between the first and third quartiles is a simple measure of spread
that gives the range covered by the middle half of the data. This is known as
the interquartile range (IQR) and is defined as −
IQR = Q3-Q1
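The dispersion measures can be computed with NumPy, here on the even-length example used above for the median. Note that np.percentile uses linear interpolation by default, so its quartiles can differ slightly from some textbook conventions.

```python
import numpy as np

data = np.array([1, 4, 6, 7, 8, 12, 14, 16])  # the even-length example from above

data_range = data.max() - data.min()            # Range = MaxValue - MinValue
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1                                   # IQR = Q3 - Q1

print("range:", data_range)
print("Q1:", q1, "median:", median, "Q3:", q3, "IQR:", iqr)
```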
Association rule:
Association rule learning is one of the most important techniques in data mining,
and it is employed in market basket analysis, web usage mining, continuous
production, and other applications. In market basket analysis, it is used by many
large retailers to find relationships between items.
In market basket analysis, customer buying habits are analyzed by finding
associations between the different items that customers place in their shopping
baskets. By discovering such associations, retailers can develop marketing
strategies based on which items are frequently purchased together. These
associations can lead to increased sales by helping retailers do selective
marketing and plan their shelf space.
Rule Evaluation Metrics –
● Support(s) – The number of transactions that include items in both the {X} and
{Y} parts of the rule, as a percentage of the total number of transactions. It
is a measure of how frequently the collection of items occurs together, as a
fraction of all transactions.
● Support(X => Y) = Freq(X ∪ Y) / (total number of transactions) – It is interpreted
as the fraction of transactions that contain both X and Y.
● Confidence(c) – The ratio of the number of transactions that include all
items in both {X} and {Y} to the number of transactions that include all items
in {X}.
● Conf(X => Y) = Supp(X ∪ Y) / Supp(X) – It measures how often the items in
Y appear in transactions that also contain the items in X.
● Lift(l) – The lift of the rule X => Y is the confidence of the rule divided by
the expected confidence, assuming that the itemsets X and Y are
independent of each other. The expected confidence is simply the support
(frequency) of {Y}.
● Lift(X => Y) = Supp(X ∪ Y) / (Supp(X) × Supp(Y)) – A lift value near 1 indicates
that X and Y appear together about as often as expected under independence,
greater than 1 means they appear together more often than expected, and less
than 1 means they appear together less often than expected. Larger lift values
indicate a stronger association.
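These three metrics can be computed directly from a toy set of transactions; the items and the rule bread => butter below are illustrative assumptions.

```python
# Toy transaction data (illustrative only).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / n

X, Y = {"bread"}, {"butter"}
supp_xy = support(X | Y)                    # Support(X => Y) = Freq(X and Y) / total
conf = supp_xy / support(X)                 # Conf(X => Y) = Supp(X and Y) / Supp(X)
lift = supp_xy / (support(X) * support(Y))  # Lift(X => Y)

print(f"support={supp_xy:.2f} confidence={conf:.2f} lift={lift:.2f}")
```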
Types of Association Rule Learning
The main types of association rule learning algorithms are as follows −
Apriori Algorithm − This algorithm uses frequent itemsets to generate
association rules. It is designed to work on databases that contain transactions.
It uses a breadth-first search and a hash tree to count candidate
itemsets efficiently.
It is generally used for market basket analysis and helps discover products
that are likely to be purchased together. It can also be used in the healthcare
domain to discover drug reactions in patients.
Eclat Algorithm − Eclat stands for Equivalence Class
Transformation. This algorithm uses a depth-first search to discover
frequent itemsets in a transaction database. It typically executes faster
than the Apriori algorithm.
F-P Growth Algorithm − The FP-Growth algorithm stands for Frequent
Pattern Growth. It is an improved version of the Apriori algorithm. It represents
the database as a tree structure referred to as a frequent pattern tree
(FP-tree). This frequent pattern tree is used to extract the most frequent patterns.
Apriori Algorithm
Apriori Algorithm is a foundational method in data mining used for discovering frequent
itemsets and generating association rules. Its significance lies in its ability to identify
relationships between items in large datasets which is particularly valuable in market
basket analysis.
For example, if a grocery store finds that customers who buy bread often also buy butter, it
can use this information to optimize product placement or marketing strategies.
How Does the Apriori Algorithm Work?
The Apriori Algorithm operates through a systematic process that involves several
key steps:
1. Identifying Frequent Itemsets: The algorithm begins by scanning the
dataset to identify individual items (1-itemsets) and their frequencies. It then
applies a minimum support threshold, which determines whether an
itemset is considered frequent.
2. Creating Candidate Itemsets: Once frequent 1-itemsets (single items)
are identified, the algorithm generates candidate 2-itemsets by
combining frequent items. This process continues iteratively, forming
larger itemsets (k-itemsets) until no more frequent itemsets can be
found.
3. Removing Infrequent Itemsets: The algorithm employs a pruning
technique based on the Apriori property, which states that if an itemset is
infrequent, all its supersets must also be infrequent. This significantly
reduces the number of combinations that need to be evaluated.
4. Generating Association Rules: After identifying frequent itemsets, the
algorithm generates association rules that illustrate how items relate to
one another, using metrics like support, confidence, and lift to evaluate
the strength of these relationships.
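The steps above can be sketched in plain Python. The transactions and the 0.4 minimum support threshold are illustrative assumptions; this is a teaching sketch of candidate generation and pruning, not an optimized implementation (no hash tree).

```python
from itertools import combinations

# Toy transactions (illustrative only).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]
min_support = 0.4
n = len(transactions)

def support(itemset):
    return sum(itemset <= t for t in transactions) / n

# Step 1: frequent 1-itemsets.
items = {item for t in transactions for item in t}
frequent = [{frozenset([i]) for i in items if support({i}) >= min_support}]

# Steps 2-3: generate candidate k-itemsets from frequent (k-1)-itemsets and prune.
k = 2
while frequent[-1]:
    prev = frequent[-1]
    candidates = {a | b for a in prev for b in prev if len(a | b) == k}
    # Apriori property: every (k-1)-subset of a candidate must already be frequent.
    candidates = {c for c in candidates
                  if all(frozenset(s) in prev for s in combinations(c, k - 1))}
    frequent.append({c for c in candidates if support(c) >= min_support})
    k += 1

# Step 4 would generate rules from these frequent itemsets using confidence and lift.
for level, sets in enumerate(frequent, start=1):
    for s in sets:
        print(level, set(s), round(support(s), 2))
```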
Key Metrics of Apriori Algorithm
● Support: This metric measures how frequently an item appears in the
dataset relative to the total number of transactions. A higher support
indicates a more significant presence of the itemset in the dataset.
Support tells us how often a particular item or combination of items
appears in all the transactions (“Bread is bought in 20% of all
transactions.”)
● Confidence: Confidence assesses the likelihood that an item Y is
purchased when item X is purchased. It provides insight into the strength
of the association between two items. Confidence tells us how often items
go together. (“If bread is bought, butter is bought 75% of the time.”)
● Lift: Lift evaluates how much more likely two items are to be purchased
together compared to being purchased independently. A lift greater than
1 suggests a strong positive association. Lift shows how strong the
connection is between items. (“Bread and butter are much more likely to
be bought together than by chance.”)
Data Mining Multidimensional Association Rule
Multidimensional Association Rule Mining
In multidimensional association rules, attributes can be categorical or
quantitative.
● Quantitative attributes are numeric and have an implicit ordering.
● Numeric attributes must be discretized before mining.
● A multidimensional association rule involves more than one
dimension or predicate.
● Example – age(X, “20...29”) ∧ buys(X, “IBM Laptop Computer”) =>
buys(X, “HP Inkjet Printer”)
Approaches to mining multidimensional association rules:
There are three approaches to mining multidimensional association rules, as
follows.
1. Using static discretization of quantitative attributes (a small
discretization sketch follows this list):
● Discretization is static and occurs prior to mining.
● Discretized attributes are treated as categorical.
● The Apriori algorithm is used to find all k-frequent predicate
sets (this requires k or k+1 table scans). Every subset of a
frequent predicate set must also be frequent.
Example –
If in a data cube the 3-D cuboid (age, income, buys) is
frequent, this implies that (age, income), (age, buys), and (income, buys) are
also frequent.
Note –
Data cubes are well suited to mining because they make mining
faster. The cells of an n-dimensional data cuboid correspond to the
predicate sets.
2. Using dynamic discretization of quantitative attributes:
● Known as mining quantitative association rules.
● Numeric attributes are dynamically discretized during mining.
3. Using distance-based discretization with clustering –
This is a dynamic discretization process that considers the distance
between data points. It involves a two-step mining process,
as follows.
● Perform clustering to find the intervals (clusters) involved.
● Obtain association rules by searching for groups of clusters that occur
together.
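As mentioned under approach 1, here is a small sketch of static discretization with pandas. The age and income values, bin edges, and labels are illustrative assumptions; the resulting categorical attributes could then be fed to an Apriori-style predicate-set search.

```python
import pandas as pd

# Hypothetical customer records (illustrative only).
df = pd.DataFrame({"age": [23, 31, 38, 45, 52, 67],
                   "income": [28000, 42000, 51000, 63000, 72000, 30000]})

# Static discretization before mining: numeric attributes become categorical bins,
# which can then be treated like any other nominal attribute.
df["age_group"] = pd.cut(df["age"], bins=[0, 29, 49, 120],
                         labels=["20s-or-younger", "30s-40s", "50s+"])
df["income_band"] = pd.cut(df["income"], bins=[0, 40000, 60000, float("inf")],
                           labels=["low", "medium", "high"])
print(df[["age_group", "income_band"]])
```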
Multilevel Association Rule in Data Mining
Multilevel Association Rule:
Association rules generated from mining data at multiple levels of abstraction
are called multiple-level or multilevel association rules.
Multilevel association rules can be mined efficiently using concept hierarchies
under a support-confidence framework.
Rules at a high concept level may add to common-sense knowledge, while rules at
a low concept level may not always be useful.
Using uniform minimum support for all levels:
● When a uniform minimum support threshold is used, the search
procedure is simplified.
● The method is also simple, in that users are required
to specify only a single minimum support threshold.
● The same minimum support threshold is used when mining at each level of
abstraction (for example, for mining from “computer” down to “laptop
computer”). With such a threshold, both “computer” and “laptop computer” may
be found to be frequent, while “desktop computer” is not.
Needs of Multilevel Association Rules:
● Sometimes at a low level of abstraction, data does not show any significant
pattern, yet useful information may be hiding behind it.
● The aim is to find the hidden information in or between levels of
abstraction.
Approaches to multilevel association rule mining:
1. Uniform Support (using a uniform minimum support for all levels)
2. Reduced Support (using a reduced minimum support at lower levels)
3. Group-based Support (using item or group based minimum support)
Let’s discuss each one.
1. Uniform Support –
When a uniform minimum support threshold is used, the search
procedure is simplified. The technique is likewise basic in that
users are required to specify only a single minimum support threshold.
An optimization can be adopted, based on the
knowledge that an ancestor is a superset of its descendants: the
search avoids examining itemsets containing any item whose
ancestors do not satisfy minimum support. The uniform support approach,
however, has some difficulties. It is unlikely that items at lower levels
of abstraction will occur as frequently as those at higher levels of
abstraction. If the minimum support threshold is set too high, it could
miss several meaningful associations occurring at low abstraction
levels. This provides the motivation for the following approach.
2. Reduced Support –
For mining multilevel associations with reduced support, there are
several alternative search strategies, as follows (a small sketch comparing
uniform and reduced minimum support follows this list).
a. Level-by-level independent –
This is a full-breadth search, where no background knowledge
of frequent itemsets is used for pruning. Each node is examined,
regardless of whether its parent node is found to be
frequent.
b. Level-cross filtering by single item –
An item at the ith level is examined if and only if its parent node at
the (i-1)th level is frequent. In other words, we investigate a more specific
association from a more general one. If a node is frequent, its children
will be examined; otherwise, its descendants are pruned from the
search.
c. Level-cross filtering by k-itemset –
A k-itemset at the ith level is examined if and only if its
corresponding parent k-itemset at the (i-1)th level is frequent.
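Below is a small sketch comparing uniform and reduced minimum support across two concept levels. The concept hierarchy, transactions, and thresholds are illustrative assumptions, chosen so that “computer” is frequent at the higher level while only “laptop computer” passes the reduced threshold at the lower level.

```python
# Uniform vs. reduced minimum support across concept levels (illustrative sketch).
hierarchy = {"laptop computer": "computer", "desktop computer": "computer"}

transactions = [
    {"laptop computer"}, {"laptop computer"}, {"laptop computer"},
    {"desktop computer"}, {"desktop computer"},
    {"printer"}, {"printer"}, {"printer"}, {"printer"}, {"printer"},
]
n = len(transactions)

def support(item, level):
    """Support of an item; at level 1, child items are rolled up to their ancestor."""
    if level == 1:
        return sum(any(hierarchy.get(i, i) == item for i in t) for t in transactions) / n
    return sum(item in t for t in transactions) / n

uniform = 0.4                   # the same threshold at every level
reduced = {1: 0.4, 2: 0.25}     # a lower threshold at the lower (more specific) level

for item, level in [("computer", 1), ("laptop computer", 2), ("desktop computer", 2)]:
    s = support(item, level)
    print(f"{item:18s} level={level} support={s:.2f} "
          f"uniform_frequent={s >= uniform} reduced_frequent={s >= reduced[level]}")
```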