
Data Warehouse and Data Mining: Definition and Concepts

What is Data Mining?


Data mining is the process of extracting knowledge or insights from large
amounts of data using various statistical and computational techniques. The
data can be structured, semi-structured or unstructured, and can be stored in
various forms such as databases, data warehouses, and data lakes.

The primary goal of data mining is to discover hidden patterns and relationships in the data that can be used to make informed decisions or predictions. This involves exploring the data using various techniques such as clustering, classification, regression analysis, association rule mining, and anomaly detection.

Data Preprocessing in Data Mining


Data preprocessing is the process of preparing raw data for analysis by cleaning and transforming it into a usable format. In data mining, it refers to preparing raw data for mining by performing tasks like cleaning, transforming, and organizing it into a format suitable for mining algorithms.

● The goal is to improve the quality of the data.

● It helps in handling missing values, removing duplicates, and normalizing data.

● It ensures the accuracy and consistency of the dataset.

Steps in Data Preprocessing


Some key steps in data preprocessing are Data Cleaning, Data Integration,
Data Transformation, and Data Reduction.
1. Data Cleaning: It is the process of identifying and correcting errors or
inconsistencies in the dataset. It involves handling missing values, removing
duplicates, and correcting incorrect or outlier data to ensure the dataset is
accurate and reliable. Clean data is essential for effective analysis, as it
improves the quality of results and enhances the performance of data
models.

● Missing Values: These occur when data is absent from the dataset. You can either ignore the rows with missing data or fill the gaps manually, with the attribute mean, or with the most probable value. This keeps the dataset accurate and complete for analysis.

● Noisy Data: This refers to irrelevant or incorrect data that is difficult for machines to interpret, often caused by errors in data collection or entry. It can be handled in several ways:

○ Binning Method: The data is sorted and divided into equal-sized segments, and each segment is smoothed by replacing its values with the segment mean or the boundary values.

○ Regression: Data can be smoothed by fitting it to a regression function, either linear or multiple, to predict values.

○ Clustering: This method groups similar data points together, with outliers either going undetected or falling outside the clusters. These techniques help remove noise and improve data quality.

● Removing Duplicates: This involves identifying and eliminating repeated data entries to ensure accuracy and consistency in the dataset. Keeping only unique records prevents errors and ensures reliable analysis.
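A minimal sketch of these cleaning steps in Python, assuming the pandas library is available; the customer table, column names, and chosen strategies (mean fill, capping the outlier, equal-width binning) are hypothetical illustrations rather than anything prescribed above:

```python
import numpy as np
import pandas as pd

# Hypothetical customer table with a missing value, an outlier, and a duplicate row
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4, 5],
    "age":         [25, 31, 31, 47, np.nan, 230],          # NaN = missing, 230 = outlier
    "salary":      [30000, 42000, 42000, 58000, np.nan, 61000],
})

# Handle missing values: fill the gaps with the attribute mean
df["age"] = df["age"].fillna(df["age"].mean())
df["salary"] = df["salary"].fillna(df["salary"].mean())

# Correct an obvious outlier by capping 'age' at a plausible maximum
df["age"] = df["age"].clip(upper=100)

# Smooth noisy 'salary' values by binning: two equal-width bins, replaced by bin means
bins = pd.cut(df["salary"], bins=2)
df["salary_smoothed"] = df.groupby(bins, observed=False)["salary"].transform("mean")

# Remove duplicate records so only unique rows remain
df = df.drop_duplicates()
print(df)
```

Equal-frequency bins (pd.qcut) could be used instead of equal-width bins, depending on how skewed the attribute is.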

2. Data Integration: It involves merging data from various sources into a single, unified dataset. It can be challenging due to differences in data formats, structures, and meanings. Techniques like record linkage and data fusion help in combining data efficiently, ensuring consistency and accuracy.

● Record Linkage is the process of identifying and matching records from different datasets that refer to the same entity, even if they are represented differently. It helps in combining data from various sources by finding corresponding records based on common identifiers or attributes.

● Data Fusion involves combining data from multiple sources to create a more comprehensive and accurate dataset. It integrates information that may be inconsistent or incomplete from different sources, ensuring a unified and richer dataset for analysis.
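A small hypothetical sketch of record linkage and a simple form of fusion using pandas; the two source tables and the choice of email as the common identifier are made up for illustration:

```python
import pandas as pd

# Two hypothetical sources describing overlapping customers with different schemas
crm = pd.DataFrame({"cust_id": [1, 2, 3],
                    "name": ["Ann Lee", "Bob Roy", "Cara Diaz"],
                    "email": ["ann@x.com", "bob@x.com", "cara@x.com"]})
orders = pd.DataFrame({"customer_email": ["bob@x.com", "cara@x.com", "dan@x.com"],
                       "total_spend": [120.0, 75.5, 40.0]})

# Record linkage: match records from both sources on a common identifier (email)
linked = crm.merge(orders, left_on="email", right_on="customer_email", how="outer")

# Simple fusion step: produce one unified spend column, filling gaps from missing sources
linked["total_spend"] = linked["total_spend"].fillna(0.0)
print(linked)
```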

3. Data Transformation: It involves converting data into a format suitable for analysis. Common techniques include normalization, which scales data to a common range; standardization, which adjusts data to have zero mean and unit variance; and discretization, which converts continuous data into discrete categories. These techniques help prepare the data for more accurate analysis.

● Data Normalization: The process of scaling data to a common range to ensure consistency across variables.

● Discretization: Converting continuous data into discrete categories for easier analysis.

● Data Aggregation: Combining multiple data points into a summary form, such as averages or totals, to simplify analysis.

● Concept Hierarchy Generation: Organizing data into a hierarchy of concepts to provide a higher-level view for better understanding and analysis.
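These transformations can be sketched in a few lines of pandas; the data, bin edges, and labels below are illustrative assumptions:

```python
import pandas as pd

df = pd.DataFrame({"region": ["N", "N", "S", "S"],
                   "income": [30000.0, 45000.0, 60000.0, 90000.0]})

# Min-max normalization: scale 'income' to the common range [0, 1]
df["income_norm"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())

# Standardization: zero mean and unit variance
df["income_std"] = (df["income"] - df["income"].mean()) / df["income"].std()

# Discretization: convert the continuous attribute into discrete categories
df["income_band"] = pd.cut(df["income"], bins=[0, 40000, 70000, float("inf")],
                           labels=["low", "medium", "high"])

# Aggregation: summarize incomes per region
summary = df.groupby("region")["income"].agg(["mean", "sum"])
print(df, summary, sep="\n\n")
```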

4. Data Reduction: It reduces the dataset's size while maintaining key information. This can be done through feature selection, which chooses the most relevant features, and feature extraction, which transforms the data into a lower-dimensional space while preserving important details. Common reduction techniques include:

● Dimensionality Reduction (e.g., Principal Component Analysis): A technique that reduces the number of variables in a dataset while retaining its essential information.

● Numerosity Reduction: Reducing the number of data points through methods like sampling to simplify the dataset without losing critical patterns.

● Data Compression: Reducing the size of data by encoding it in a more compact form, making it easier to store and process.

● Concept Hierarchy Generation: The main idea is that the same data can be viewed at different levels of granularity or detail, and organizing the data hierarchically makes it easier to understand and analyze.

● Data Cube Aggregation: A data cube enables data to be modeled and viewed in several dimensions. It is represented by dimensions and facts; the dimensions are the perspectives or entities with respect to which an organization wants to keep records.
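A brief sketch of two of these reduction techniques, assuming NumPy and scikit-learn are available; the random data and the choice of 2 components and 20 sampled rows are arbitrary:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))           # hypothetical dataset: 100 rows, 10 features

# Dimensionality reduction: project the data onto 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                   # (100, 2)
print(pca.explained_variance_ratio_)     # how much variance each component retains

# Numerosity reduction: simple random sampling of rows
sample_idx = rng.choice(len(X), size=20, replace=False)
X_sample = X[sample_idx]
print(X_sample.shape)                    # (20, 10)
```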

Uses of Data Preprocessing


Data preprocessing is utilized across various fields to ensure that raw data is
transformed into a usable format for analysis and decision-making. Here are
some key areas where data preprocessing is applied:

1. Data Warehousing: In data warehousing, preprocessing is essential for cleaning, integrating, and structuring data before it is stored in a centralized repository. This ensures the data is consistent and reliable for future queries and reporting.

2. Data Mining: Data preprocessing in data mining involves cleaning and transforming raw data to make it suitable for analysis. This step is crucial for identifying patterns and extracting insights from large datasets.

3. Machine Learning: In machine learning, preprocessing prepares raw data for model training. This includes handling missing values, normalizing features, encoding categorical variables, and splitting datasets into training and testing sets to improve model performance and accuracy.
4. Data Science: Data preprocessing is a fundamental step in data science
projects, ensuring that the data used for analysis or building predictive
models is clean, structured, and relevant. It enhances the overall quality of
insights derived from the data.

5. Web Mining: In web mining, preprocessing helps analyze web usage logs
to extract meaningful user behavior patterns. This can inform marketing
strategies and improve user experience through personalized
recommendations.

6. Business Intelligence (BI): Preprocessing supports BI by organizing and cleaning data to create dashboards and reports that provide actionable insights for decision-makers.

7. Deep Learning: Similar to machine learning, deep learning applications require preprocessing to normalize or enhance features of the input data, optimizing the model training process.

Advantages of Data Preprocessing


● Improved Data Quality: Ensures data is clean, consistent, and reliable for analysis.

● Better Model Performance: Reduces noise and irrelevant data, leading to more accurate predictions and insights.

● Efficient Data Analysis: Streamlines data for faster and easier processing.

● Enhanced Decision-Making: Provides clear and well-organized data for better business decisions.

Disadvantages of Data Preprocessing


● Time-Consuming: Requires significant time and effort to clean, transform, and organize data.

● Resource-Intensive: Demands computational power and skilled personnel for complex preprocessing tasks.

● Potential Data Loss: Incorrect handling may result in losing valuable information.

● Complexity: Handling large datasets or complex preprocessing pipelines can be difficult and error-prone.

Statistical measures in large databases


Relational database systems support five built-in aggregate functions: count(), sum(), avg(), max(), and min(). These aggregate functions can be used as basic measures in the descriptive mining of multidimensional data. Two kinds of descriptive statistical measures, measures of central tendency and measures of data dispersion, can be used effectively in large multidimensional databases.

Measures of central tendency − These include the mean, median, mode, and midrange.

Mean − The arithmetic average is computed simply by adding together all values and dividing by the number of values; it uses every single value. Let x1, x2, ..., xN be a set of N values or observations, such as salaries. The mean of this set of values is:

Mean = (x1 + x2 + ... + xN) / N = sum / count

This corresponds to the built-in aggregate function average (avg()) supported in relational database systems. In many data cubes, sum and count are saved in pre-computation, so deriving the average is straightforward.

Median − There are two ways to compute the median, depending on whether the number of values n is odd or even.

If x1, x2, ..., xn are arranged in ascending order and n is odd, then:

Median = ((n + 1) / 2)th value

For example, for 1, 4, 6, 7, 12, 14, 18: Median = 7

When n is even, the median is the average of the two middle values:

Median = [ (n/2)th value + (n/2 + 1)th value ] / 2

For example, for 1, 4, 6, 7, 8, 12, 14, 16: Median = (7 + 8) / 2 = 7.5

The median is neither a distributive measure nor an algebraic measure; it is a holistic measure. Although it is not simple to compute the exact median in a huge database, an approximate median can be computed efficiently.

Mode − It is the most common value in a set of values. Distributions can be unimodal, bimodal, or multimodal. If the data is categorical (measured on the nominal scale), only the mode can be computed. The mode can also be computed for ordinal and higher-level data, but it is usually less informative there.

Measuring the dispersion of data − The degree to which numerical data tends to spread is known as the dispersion, or variance, of the data. The most common measures of data dispersion are the range, the interquartile range, and the standard deviation.

Range − The range is the difference between the largest value and the smallest value in the data set:

Range = MaxValue − MinValue

Quartiles − The most commonly used percentiles other than the median are the quartiles. The first quartile, denoted Q1, is the 25th percentile; the third quartile, denoted Q3, is the 75th percentile. The quartiles, together with the median, give some indication of the center, spread, and shape of the distribution. The distance between the first and third quartiles is a simple measure of spread that gives the range covered by the middle half of the data. It is known as the interquartile range (IQR) and is defined as:

IQR = Q3 − Q1
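These measures can be checked with Python's standard statistics module; the salary values below are hypothetical:

```python
import statistics as st

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 110, 115]  # hypothetical values

mean = sum(salaries) / len(salaries)          # Mean = sum / count
median = st.median(salaries)                  # n is even, so average of the two middle values
mode = st.mode(salaries)                      # most common value (52)

value_range = max(salaries) - min(salaries)   # Range = MaxValue - MinValue
q1, q2, q3 = st.quantiles(salaries, n=4)      # quartiles; q2 equals the median
iqr = q3 - q1                                 # interquartile range

print(mean, median, mode, value_range, iqr)
```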

Association rule:
Association rule learning is one of the most important approaches in data mining, and it is employed in market basket analysis, web usage mining, continuous production, and more. In market basket analysis, it is an approach used by many large retailers to find relationships between items.
In market basket analysis, customer buying habits are analyzed by finding associations between the different items that customers place in their shopping baskets. By discovering such associations, retailers can develop marketing strategies based on which items are frequently purchased together. These associations can lead to increased sales by supporting retailers in doing selective marketing and planning their shelf space.
Rule Evaluation Metrics –

● Support (s) – The number of transactions that include the items in both the {X} and {Y} parts of the rule, as a percentage of the total number of transactions. It is a measure of how frequently the collection of items occurs together, as a fraction of all transactions.

● Support(X ⇒ Y) = Freq(X ∪ Y) / total transactions – interpreted as the fraction of transactions that contain both X and Y.

● Confidence (c) – The ratio of the number of transactions that include all items in both {X} and {Y} to the number of transactions that include all items in {X}.

● Conf(X ⇒ Y) = Supp(X ∪ Y) / Supp(X) – it measures how often the items in Y appear in transactions that also contain the items in X.

● Lift (l) – The lift of the rule X ⇒ Y is the confidence of the rule divided by the expected confidence, assuming that the itemsets X and Y are independent of each other; under that assumption the expected confidence equals the support of {Y}.

● Lift(X ⇒ Y) = Supp(X ∪ Y) / (Supp(X) × Supp(Y)) – a lift value near 1 indicates that X and Y appear together about as often as expected, greater than 1 means they appear together more often than expected, and less than 1 means they appear together less often than expected. Greater lift values indicate a stronger association.

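A small, self-contained sketch of these three metrics on a handful of hypothetical transactions (the item names and the rule {bread} ⇒ {butter} are illustrative only):

```python
# Hypothetical market-basket transactions
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"bread"}, {"butter"}
supp_xy = support(X | Y)                       # Support = Freq(X U Y) / total
conf = supp_xy / support(X)                    # Conf(X => Y) = Supp(X U Y) / Supp(X)
lift = supp_xy / (support(X) * support(Y))     # Lift(X => Y) = Supp(X U Y) / (Supp(X) * Supp(Y))

print(f"support={supp_xy:.2f}, confidence={conf:.2f}, lift={lift:.2f}")
```

On these made-up baskets the lift comes out slightly below 1 (0.6 / 0.64), meaning bread and butter co-occur a little less often than independence would predict.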
Types of Association Rule Learning

The following are the main types of association rule learning algorithms −

Apriori Algorithm − This algorithm uses frequent itemsets to generate association rules. It is designed to work on databases that contain transactions. It uses a breadth-first search and a hash tree to count candidate itemsets efficiently.

It is widely used for market basket analysis and helps to learn which products can be purchased together. It can also be used in the healthcare domain, for example to discover drug reactions for patients.

Eclat Algorithm − Eclat stands for Equivalence Class Transformation. This algorithm uses a depth-first search to discover frequent itemsets in a transaction database. It generally executes faster than the Apriori algorithm.

F-P Growth Algorithm − FP-Growth stands for Frequent Pattern Growth. It is an enhanced version of the Apriori algorithm. It represents the database in the form of a tree structure known as a frequent pattern tree (FP-tree). The purpose of this tree is to extract the most frequent patterns efficiently.
Apriori Algorithm
Apriori Algorithm is a foundational method in data mining used for discovering frequent itemsets and generating association rules. Its significance lies in its ability to identify relationships between items in large datasets, which is particularly valuable in market basket analysis.

For example, if a grocery store finds that customers who buy bread often also buy butter, it can use this information to optimize product placement or marketing strategies.

How the Apriori Algorithm Works?

The Apriori Algorithm operates through a systematic process that involves several key steps:

1. Identifying Frequent Itemsets: The algorithm begins by scanning the dataset to identify individual items (1-itemsets) and their frequencies. It then applies a minimum support threshold, which determines whether an itemset is considered frequent.

2. Creating Candidate Itemsets: Once frequent 1-itemsets (single items) are identified, the algorithm generates candidate 2-itemsets by combining frequent items. This process continues iteratively, forming larger itemsets (k-itemsets), until no more frequent itemsets can be found.

3. Removing Infrequent Itemsets: The algorithm employs a pruning technique based on the Apriori property, which states that if an itemset is infrequent, all its supersets must also be infrequent. This significantly reduces the number of combinations that need to be evaluated.

4. Generating Association Rules: After identifying frequent itemsets, the algorithm generates association rules that illustrate how items relate to one another, using metrics like support, confidence, and lift to evaluate the strength of these relationships.
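A hand-rolled sketch of the frequent-itemset part of Apriori (steps 1–3), using the same kind of hypothetical transactions as before; the minimum support of 0.4 is an arbitrary choice:

```python
from itertools import combinations

# Hypothetical transactions and minimum support threshold
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]
min_support = 0.4  # an itemset must appear in at least 40% of transactions

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

# Step 1: frequent 1-itemsets
items = {i for t in transactions for i in t}
frequent = [{frozenset([i]) for i in items if support({i}) >= min_support}]

# Steps 2-3: iteratively join frequent k-itemsets into (k+1)-itemset candidates,
# pruning any candidate that has an infrequent k-subset (the Apriori property)
k = 1
while frequent[-1]:
    candidates = {a | b for a in frequent[-1] for b in frequent[-1] if len(a | b) == k + 1}
    candidates = {c for c in candidates
                  if all(frozenset(s) in frequent[-1] for s in combinations(c, k))}
    frequent.append({c for c in candidates if support(c) >= min_support})
    k += 1

for level, sets in enumerate(frequent[:-1], start=1):
    print(f"frequent {level}-itemsets:", [sorted(s) for s in sets])
```

With these transactions and threshold, the sketch finds four frequent 1-itemsets and three frequent 2-itemsets, and the Apriori property prunes every 3-itemset candidate.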

Key Metrics of Apriori Algorithm

● Support: This metric measures how frequently an itemset appears in the dataset relative to the total number of transactions. A higher support indicates a more significant presence of the itemset in the dataset. Support tells us how often a particular item or combination of items appears across all the transactions ("Bread is bought in 20% of all transactions.").

● Confidence: Confidence assesses the likelihood that an item Y is purchased when item X is purchased. It provides insight into the strength of the association between two items. Confidence tells us how often the items go together ("If bread is bought, butter is bought 75% of the time.").

● Lift: Lift evaluates how much more likely two items are to be purchased together compared to being purchased independently. A lift greater than 1 suggests a strong positive association. Lift shows how strong the connection is between items ("Bread and butter are much more likely to be bought together than by chance.").
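For completeness, these metrics can also be obtained end-to-end with the third-party mlxtend library, assuming it is installed; the usage below follows mlxtend's documented TransactionEncoder / apriori / association_rules helpers and is an illustrative assumption, not something specified in this text:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "jam"],
    ["butter", "milk"],
    ["bread", "butter", "jam"],
]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Frequent itemsets with at least 40% support, then rules filtered by confidence
frequent_itemsets = apriori(onehot, min_support=0.4, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```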

Data Mining Multidimensional Association Rule


This section discusses the multidimensional association rule, with examples of each approach.

Multidimensional Association Rule Mining

In multidimensional association rules, attributes can be categorical or quantitative.

● Quantitative attributes are numeric and have an implicit ordering.

● Numeric attributes should be discretized.

● A multidimensional association rule consists of more than one dimension (predicate).

● Example – buys(X, "IBM Laptop Computer") ⇒ buys(X, "HP Inkjet Printer")

Approaches to mining multidimensional association rules:
Three approaches to mining multidimensional association rules are as follows.

1. Using static discretization of quantitative attributes:

● Discretization is static and occurs prior to mining.

● Discretized attributes are treated as categorical.

● The Apriori algorithm is used to find all frequent k-predicate sets (this requires k or k+1 table scans). Every subset of a frequent predicate set must also be frequent.

Example –
If, in a data cube, the 3-D cuboid (age, income, buys) is frequent, this implies that (age, income), (age, buys), and (income, buys) are also frequent.

Note –
Data cubes are well suited to this kind of mining, since they make mining faster. The cells of an n-dimensional data cuboid correspond to the predicate sets.

2. Using dynamic discretization of quantitative attributes:

● This is known as mining quantitative association rules.

● Numeric attributes are dynamically discretized during mining.

3. Using distance-based discretization with clustering –

This is a dynamic discretization process that considers the distance between data points. It involves a two-step mining process:

● Perform clustering to find the intervals involved.

● Obtain association rules by searching for groups of clusters that occur together.
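A rough sketch of the static-discretization approach: quantitative attributes are binned first, each record becomes a set of (dimension, value) predicates, and a multidimensional rule is then evaluated over those predicate sets. The customer data, bin edges, and the example rule below are all hypothetical:

```python
import pandas as pd

# Hypothetical customer records with one categorical and two quantitative attributes
customers = pd.DataFrame({
    "age":    [23, 27, 34, 45, 52, 29],
    "income": [28000, 35000, 52000, 64000, 71000, 33000],
    "buys":   ["laptop", "laptop", "printer", "printer", "laptop", "laptop"],
})

# Static discretization: bin the quantitative attributes before mining
customers["age_band"] = pd.cut(customers["age"], bins=[0, 30, 50, 120],
                               labels=["young", "middle", "senior"])
customers["income_band"] = pd.cut(customers["income"], bins=[0, 40000, 80000],
                                  labels=["low", "high"])

# Each row becomes a set of (dimension, value) predicates, e.g. age(X, "young")
predicate_sets = [
    {("age", r.age_band), ("income", r.income_band), ("buys", r.buys)}
    for r in customers.itertuples()
]

def supp(s):
    return sum(s <= p for p in predicate_sets) / len(predicate_sets)

# Support and confidence of the multidimensional rule:
# age(X, "young") AND income(X, "low") => buys(X, "laptop")
lhs = {("age", "young"), ("income", "low")}
rhs = {("buys", "laptop")}
print("support:", supp(lhs | rhs), "confidence:", supp(lhs | rhs) / supp(lhs))
```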

Multilevel Association Rule in data mining


Multilevel Association Rule:
Association rules generated from mining data at multiple levels of abstraction are called multiple-level, or multilevel, association rules.
Multilevel association rules can be mined efficiently using concept hierarchies under a support-confidence framework.
Rules at a high concept level may add to common sense, while rules at a low concept level may not always be useful.

Using uniform minimum support for all levels:

● When a uniform minimum support threshold is used, the search procedure is simplified.

● The method is also simple, in that users are required to specify only a single minimum support threshold.

● The same minimum support threshold is used when mining at each level of abstraction (for example, when mining from "computer" down to "laptop computer"). Both "computer" and "laptop computer" may be found to be frequent, while a more specific item such as "desktop computer" may not be.

Need for Multilevel Association Rules:

● Sometimes, at a low data level, the data does not show any significant pattern, yet useful information may be hiding behind it.

● The aim is to find the hidden information within and between levels of abstraction.

Approaches to multilevel association rule mining:

1. Uniform Support (using a uniform minimum support for all levels)

2. Reduced Support (using a reduced minimum support at lower levels)

3. Group-based Support (using item- or group-based support)

Let's discuss these one by one.

1. Uniform Support –
When a uniform minimum support threshold is used, the search procedure is simplified. The technique is also simple, in that users need to specify only a single minimum support threshold. An optimization technique can be adopted, based on the knowledge that an ancestor is a superset of its descendants: the search avoids examining itemsets containing any item whose ancestors do not have minimum support. The uniform support approach, however, has some difficulties. It is unlikely that items at lower levels of abstraction will occur as frequently as those at higher levels of abstraction. If the minimum support threshold is set too high, it could miss several meaningful associations occurring at low abstraction levels. This provides the motivation for the following approach.

2. Reduced Support –
For mining multilevel associations with reduced support, there are several alternative search strategies, as follows.

a. Level-by-level independent –
This is a full-breadth search, where no background knowledge of frequent itemsets is used for pruning. Each node is examined, regardless of whether or not its parent node is found to be frequent.

b. Level-cross filtering by single item –
An item at the i-th level is examined if and only if its parent node at the (i−1)-th level is frequent. In other words, we investigate a more specific association from a more general one. If a node is frequent, its children will be examined; otherwise, its descendants are pruned from the search.

c. Level-cross filtering by k-itemset –
A k-itemset at the i-th level is examined if and only if its corresponding parent k-itemset at the (i−1)-th level is frequent.
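A toy illustration of uniform versus reduced support across two levels of a concept hierarchy; the hierarchy, transactions, and thresholds are invented for the example:

```python
# Hypothetical concept hierarchy mapping low-level items to a higher-level concept
hierarchy = {
    "laptop computer": "computer",
    "desktop computer": "computer",
    "inkjet printer": "printer",
    "laser printer": "printer",
}

transactions = [
    {"laptop computer", "inkjet printer"},
    {"laptop computer"},
    {"desktop computer", "laser printer"},
    {"laptop computer", "laser printer"},
    {"inkjet printer"},
]

def support(item, txns):
    return sum(item in t for t in txns) / len(txns)

def frequent_items(txns, min_support):
    items = {i for t in txns for i in t}
    return {i for i in items if support(i, txns) >= min_support}

# Mining at the higher level: replace each item with its ancestor concept
high_level = [{hierarchy[i] for i in t} for t in transactions]

# Uniform support: the same threshold at every level of abstraction
print(frequent_items(high_level, 0.6))     # {'computer', 'printer'} are frequent
print(frequent_items(transactions, 0.6))   # only 'laptop computer'; rarer low-level items drop out

# Reduced support: a lower threshold at the lower level keeps meaningful
# low-level items that the uniform threshold would miss
print(frequent_items(transactions, 0.3))   # also keeps 'inkjet printer' and 'laser printer'
```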
