Data Warehouse and Data Mining: Definition and Concepts
What is Data Mining?
Data mining is the process of extracting knowledge or insights from large
amounts of data using various statistical and computational techniques. The
data can be structured, semi-structured or unstructured, and can be stored in
various forms such as databases, data warehouses, and data lakes.
The primary goal of data mining is to discover hidden patterns and
relationships in the data that can be used to make informed decisions or
predictions. This involves exploring the data using various techniques such as
clustering, classification, regression analysis, association rule mining, and
anomaly detection.
Data Preprocessing in Data Mining
Data preprocessing is the process of preparing raw data for analysis by
cleaning and transforming it into a usable format. In data mining, it involves
tasks such as cleaning, transforming, and organizing raw data into a format
suitable for mining algorithms.
● The goal is to improve the quality of the data.
● Helps in handling missing values, removing duplicates, and
normalizing data.
● Ensures the accuracy and consistency of the dataset.
Steps in Data Preprocessing
Some key steps in data preprocessing are Data Cleaning, Data Integration,
Data Transformation, and Data Reduction.
1. Data Cleaning: It is the process of identifying and correcting errors or
inconsistencies in the dataset. It involves handling missing values, removing
duplicates, and correcting incorrect or outlier data to ensure the dataset is
accurate and reliable. Clean data is essential for effective analysis, as it
improves the quality of results and enhances the performance of data
models.
● Missing Values: These occur when data is absent from the dataset. You
can either ignore the rows with missing data or fill the gaps
manually, with the attribute mean, or with the most probable
value. This keeps the dataset accurate and complete for
analysis.
● Noisy Data: It refers to irrelevant or incorrect data that is difficult for
machines to interpret, often caused by errors in data collection or
entry. It can be handled in several ways:
○ Binning Method: The data is sorted and divided into
equal-sized segments (bins), and each segment is smoothed by
replacing its values with the bin mean or the bin boundary values.
○ Regression: Data can be smoothed by fitting it to a
regression function, either linear or multiple, to
predict values.
○ Clustering: This method groups similar data points
together, with outliers falling outside the clusters,
where they can be detected and removed. These techniques help
remove noise and improve data quality.
● Removing Duplicates: It involves identifying and eliminating
repeated data entries to ensure accuracy and consistency in the
dataset. This process prevents errors and ensures reliable analysis
by keeping only unique records.
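The cleaning tasks above can be sketched with pandas. This is a minimal, illustrative example; the column names, fill strategies, and bin count are assumptions, not part of any particular dataset.

```python
import pandas as pd
import numpy as np

# Hypothetical raw data with missing values and duplicate rows (illustrative only).
df = pd.DataFrame({
    "age":    [25, np.nan, 47, 25, 31],
    "salary": [50000, 62000, np.nan, 50000, 58000],
})

# Handle missing values: fill numeric gaps with the attribute mean.
df["age"] = df["age"].fillna(df["age"].mean())
df["salary"] = df["salary"].fillna(df["salary"].mean())

# Remove duplicate records, keeping only unique rows.
df = df.drop_duplicates()

# Smooth noisy values by binning: equal-width bins, each value replaced by its bin mean.
df["age_binned"] = pd.cut(df["age"], bins=3)
df["age_smoothed"] = df.groupby("age_binned", observed=True)["age"].transform("mean")

print(df)
```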
2. Data Integration: It involves merging data from various sources into a
single, unified dataset. It can be challenging due to differences in data
formats, structures, and meanings. Techniques like record linkage and data
fusion help in combining data efficiently, ensuring consistency and accuracy.
● Record Linkage is the process of identifying and matching records
from different datasets that refer to the same entity, even if they are
represented differently. It helps in combining data from various
sources by finding corresponding records based on common
identifiers or attributes.
● Data Fusion involves combining data from multiple sources to
create a more comprehensive and accurate dataset. It integrates
information that may be inconsistent or incomplete from different
sources, ensuring a unified and richer dataset for analysis.
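The integration step can be illustrated with a small pandas sketch. The two sources and the shared customer_id key are hypothetical; real record linkage often also requires fuzzy matching on names or addresses.

```python
import pandas as pd

# Two hypothetical sources describing the same customers (illustrative only).
crm = pd.DataFrame({"customer_id": [1, 2, 3],
                    "name": ["Ana", "Bob", "Cara"]})
sales = pd.DataFrame({"customer_id": [2, 3, 4],
                      "total_spent": [120.0, 75.5, 40.0]})

# Record linkage on the shared identifier; an outer join keeps unmatched records
# from both sources so nothing is silently dropped.
unified = crm.merge(sales, on="customer_id", how="outer")
print(unified)
```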
3. Data Transformation: It involves converting data into a format suitable for
analysis. Common techniques include normalization, which scales data to a
common range; standardization, which adjusts data to have zero mean and
unit variance; and discretization, which converts continuous data into discrete
categories. These techniques help prepare the data for more accurate
analysis.
● Data Normalization: The process of scaling data to a common range
to ensure consistency across variables.
● Discretization: Converting continuous data into discrete categories
for easier analysis.
● Data Aggregation: Combining multiple data points into a summary
form, such as averages or totals, to simplify analysis.
● Concept Hierarchy Generation: Organizing data into a hierarchy of
concepts to provide a higher-level view for better understanding
and analysis.
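A brief pandas sketch of the transformation techniques above. The data, bin edges, and labels are illustrative assumptions.

```python
import pandas as pd

# Hypothetical numeric data (illustrative only).
df = pd.DataFrame({"region": ["N", "N", "S", "S"],
                   "income": [30000, 45000, 60000, 90000]})

# Normalization: min-max scaling to the [0, 1] range.
df["income_norm"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())

# Standardization: zero mean and unit variance.
df["income_std"] = (df["income"] - df["income"].mean()) / df["income"].std()

# Discretization: continuous income converted into labeled categories.
df["income_band"] = pd.cut(df["income"], bins=[0, 40000, 70000, float("inf")],
                           labels=["low", "medium", "high"])

# Aggregation: summarize income per region.
summary = df.groupby("region")["income"].agg(["mean", "sum"])
print(df)
print(summary)
```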
4. Data Reduction: It reduces the dataset’s size while maintaining key
information. This can be done through feature selection, which chooses the
most relevant features, and feature extraction, which transforms the data into
a lower-dimensional space while preserving important details. Common
reduction techniques include:
● Dimensionality Reduction (e.g., Principal Component Analysis): A
technique that reduces the number of variables in a dataset while
retaining its essential information.
● Numerosity Reduction: Reducing the number of data points by
methods like sampling to simplify the dataset without losing critical
patterns.
● Data Compression: Reducing the size of data by encoding it in a
more compact form, making it easier to store and process.
● Concept Hierarchy Generation: The main idea behind a concept
hierarchy is that the same data can be represented at different levels of
granularity or detail; organizing the data in a hierarchical fashion
makes it easier to understand and analyze.
● Data Cube Aggregation: A data cube enables data to be modeled and
viewed in multiple dimensions. It is defined by dimensions and
facts; dimensions are the perspectives or entities with respect to
which an organization keeps records.
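The reduction techniques can be sketched with scikit-learn's PCA plus simple random sampling. The synthetic data and the 95% variance target are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical dataset: 200 records with 10 mostly redundant features (illustrative only).
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
X = np.hstack([base, base @ rng.normal(size=(3, 7))])  # 10 columns built from 3 factors

# Dimensionality reduction: keep enough principal components for 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)

# Numerosity reduction: simple random sampling of rows.
sample_idx = rng.choice(len(X_reduced), size=50, replace=False)
X_sample = X_reduced[sample_idx]
print(X_sample.shape)
```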
Uses of Data Preprocessing
Data preprocessing is utilized across various fields to ensure that raw data is
transformed into a usable format for analysis and decision-making. Here are
some key areas where data preprocessing is applied:
1. Data Warehousing: In data warehousing, preprocessing is essential for
cleaning, integrating, and structuring data before it is stored in a centralized
repository. This ensures the data is consistent and reliable for future queries
and reporting.
2. Data Mining: Data preprocessing in data mining involves cleaning and
transforming raw data to make it suitable for analysis. This step is crucial for
identifying patterns and extracting insights from large datasets.
3. Machine Learning: In machine learning, preprocessing prepares raw data
for model training. This includes handling missing values, normalizing
features, encoding categorical variables, and splitting datasets into training
and testing sets to improve model performance and accuracy.
4. Data Science: Data preprocessing is a fundamental step in data science
projects, ensuring that the data used for analysis or building predictive
models is clean, structured, and relevant. It enhances the overall quality of
insights derived from the data.
5. Web Mining: In web mining, preprocessing helps analyze web usage logs
to extract meaningful user behavior patterns. This can inform marketing
strategies and improve user experience through personalized
recommendations.
6. Business Intelligence (BI): Preprocessing supports BI by organizing and
cleaning data to create dashboards and reports that provide actionable
insights for decision-makers.
7. Deep Learning: Similar to machine learning, deep learning
applications require preprocessing to normalize or enhance features of the
input data, optimizing the model training process.
Advantages of Data Preprocessing
● Improved Data Quality: Ensures data is clean, consistent, and
reliable for analysis.
● Better Model Performance: Reduces noise and irrelevant data,
leading to more accurate predictions and insights.
● Efficient Data Analysis: Streamlines data for faster and easier
processing.
● Enhanced Decision-Making: Provides clear and well-organized data
for better business decisions.
Disadvantages of Data Preprocessing
● Time-Consuming: Requires significant time and effort to clean,
transform, and organize data.
● Resource-Intensive: Demands computational power and skilled
personnel for complex preprocessing tasks.
● Potential Data Loss: Incorrect handling may result in losing
valuable information.
● Complexity: Handling large or diverse datasets can make
preprocessing a complex, multi-step task.
Statistical Measures in Large Databases
Relational database systems support five built-in aggregate functions:
count(), sum(), avg(), max(), and min(). These aggregate functions can be
used as basic measures in the descriptive mining of multidimensional
data. Two classes of descriptive statistical measures, measures of
central tendency and measures of data dispersion, can be used effectively in
large multidimensional databases.
Measures of central tendency − Measures of central tendency include the mean,
median, mode, and mid-range.
Mean − The arithmetic average is computed by adding together all the
values and dividing by the number of values; it therefore uses data from every
single value. Let x1, x2, ..., xN be a set of N values or observations, such as
salaries. The mean of this set of values is:
Mean = (x1 + x2 + ... + xN) / N, i.e. Mean = sum / count
This corresponds to the built-in aggregate function average (avg())
supported in relational database systems. In many data cubes, sum and
count are saved during pre-computation, so the derivation of the average is
straightforward.
Median − There are two cases for computing the median, based on the
number of values.
If x1, x2, ..., xn are arranged in sorted order and n is odd, then:
Median = the ((n + 1) / 2)th value
For example, for 1, 4, 6, 7, 12, 14, 18:
Median = 7
When n is even, the median is the average of the two middle values:
Median = [ (n/2)th value + ((n/2) + 1)th value ] / 2
For example, for 1, 4, 6, 7, 8, 12, 14, 16:
Median = (7 + 8) / 2 = 7.5
The median is neither a distributive measure nor an algebraic measure; it is a
holistic measure. Although it is not easy to compute the exact median in
a huge database, an approximate median can be computed efficiently.
Mode − It is the most common value in a set of values. Distributions can be
unimodal, bimodal, or multimodal. If the data is categorical (measured on the
nominal scale), then only the mode can be computed. The mode can also be
computed for ordinal and numeric data, although other measures are usually
more informative for such data.
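A short Python check of these measures using the standard library's statistics module, on a small illustrative set of values.

```python
import statistics

# Example salary values in thousands (illustrative only); n = 11.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 110]

print("mean:", statistics.mean(values))      # sum / count, corresponds to SQL avg()
print("median:", statistics.median(values))  # the ((n + 1) / 2)th value, since n is odd
print("mode:", statistics.mode(values))      # the most common value (52 appears twice)
```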
Measuring the dispersion of data − The degree to which numerical data
tends to spread is known as the dispersion, or variance, of the data.
The most common measures of data dispersion are the range, the interquartile
range, and the standard deviation.
Range − The range is represented as the difference between the largest value
and the smallest value in the set of data.
Range = MaxValue − MinValue
Quartiles − The most commonly used percentiles other than the median are
quartiles. The first quartile, denoted by Q1, is the 25th percentile; the third
quartile, denoted by Q3, is the 75th percentile. The quartiles, together with the
median, give some indication of the center, spread, and shape of a distribution.
The distance between the first and third quartiles is a simple measure of spread
that gives the range covered by the middle half of the data. This is known as
the interquartile range (IQR) and is defined as −
IQR = Q3-Q1
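The dispersion measures can be computed with NumPy, here on the even-length example used above for the median. Note that np.percentile uses linear interpolation by default, so its quartiles can differ slightly from some textbook conventions.

```python
import numpy as np

data = np.array([1, 4, 6, 7, 8, 12, 14, 16])  # the even-length example from above

data_range = data.max() - data.min()            # Range = MaxValue - MinValue
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1                                   # IQR = Q3 - Q1

print("range:", data_range)
print("Q1:", q1, "median:", median, "Q3:", q3, "IQR:", iqr)
```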
Association rule:
Association rule learning is one of the most important techniques in data mining,
and it is employed in market basket analysis, web usage mining, continuous
production, and other applications. In market basket analysis, it is used by many
large retailers to find relationships between items.
In market basket analysis, customer buying habits are analyzed by finding
associations between the different items that customers place in their shopping
baskets. By discovering such associations, retailers can develop marketing
strategies based on which items are frequently purchased together. These
associations can lead to increased sales by helping retailers do selective
marketing and plan their shelf space.
Rule Evaluation Metrics –
● Support(s) – The number of transactions that include items in both the {X} and
{Y} parts of the rule, as a percentage of the total number of transactions. It
is a measure of how frequently the collection of items occurs together, as a
fraction of all transactions.
● Support(X => Y) = Freq(X ∪ Y) / (total number of transactions) – It is interpreted
as the fraction of transactions that contain both X and Y.
● Confidence(c) – The ratio of the number of transactions that include all
items in both {X} and {Y} to the number of transactions that include all items
in {X}.
● Conf(X => Y) = Supp(X ∪ Y) / Supp(X) – It measures how often the items in
Y appear in transactions that also contain the items in X.
● Lift(l) – The lift of the rule X => Y is the confidence of the rule divided by
the expected confidence, assuming that the itemsets X and Y are
independent of each other. The expected confidence is simply the support
(frequency) of {Y}.
● Lift(X => Y) = Supp(X ∪ Y) / (Supp(X) × Supp(Y)) – A lift value near 1 indicates
that X and Y appear together about as often as expected under independence,
greater than 1 means they appear together more often than expected, and less
than 1 means they appear together less often than expected. Larger lift values
indicate a stronger association.
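These three metrics can be computed directly from a toy set of transactions; the items and the rule bread => butter below are illustrative assumptions.

```python
# Toy transaction data (illustrative only).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / n

X, Y = {"bread"}, {"butter"}
supp_xy = support(X | Y)                    # Support(X => Y) = Freq(X and Y) / total
conf = supp_xy / support(X)                 # Conf(X => Y) = Supp(X and Y) / Supp(X)
lift = supp_xy / (support(X) * support(Y))  # Lift(X => Y)

print(f"support={supp_xy:.2f} confidence={conf:.2f} lift={lift:.2f}")
```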
Types of Association Rule Learning
The main types of association rule learning algorithms are as follows −
Apriori Algorithm − This algorithm uses frequent itemsets to generate
association rules. It is designed to work on databases that contain transactions.
It uses a breadth-first search and a hash tree to count candidate
itemsets efficiently.
It is generally used for market basket analysis and helps discover products
that are likely to be purchased together. It can also be used in the healthcare
domain to discover drug reactions in patients.
Eclat Algorithm − Eclat stands for Equivalence Class
Transformation. This algorithm uses a depth-first search to discover
frequent itemsets in a transaction database. It typically executes faster
than the Apriori algorithm.
F-P Growth Algorithm − The FP-Growth algorithm stands for Frequent
Pattern Growth. It is an improved version of the Apriori algorithm. It represents
the database as a tree structure referred to as a frequent pattern tree
(FP-tree). This frequent pattern tree is used to extract the most frequent patterns.
Apriori Algorithm
Apriori Algorithm is a foundational method in data mining used for discovering frequent
itemsets and generating association rules. Its significance lies in its ability to identify
relationships between items in large datasets which is particularly valuable in market
basket analysis.
For example, if a grocery store finds that customers who buy bread often also buy butter, it
can use this information to optimize product placement or marketing strategies.
How Does the Apriori Algorithm Work?
The Apriori Algorithm operates through a systematic process that involves several
key steps:
1. Identifying Frequent Itemsets: The algorithm begins by scanning the
dataset to identify individual items (1-itemsets) and their frequencies. It then
applies a minimum support threshold, which determines whether an
itemset is considered frequent.
2. Creating Candidate Itemsets: Once frequent 1-itemsets (single items)
are identified, the algorithm generates candidate 2-itemsets by
combining frequent items. This process continues iteratively, forming
larger itemsets (k-itemsets) until no more frequent itemsets can be
found.
3. Removing Infrequent Itemsets: The algorithm employs a pruning
technique based on the Apriori property, which states that if an itemset is
infrequent, all its supersets must also be infrequent. This significantly
reduces the number of combinations that need to be evaluated.
4. Generating Association Rules: After identifying frequent itemsets, the
algorithm generates association rules that illustrate how items relate to
one another, using metrics like support, confidence, and lift to evaluate
the strength of these relationships.
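The steps above can be sketched in plain Python. The transactions and the 0.4 minimum support threshold are illustrative assumptions; this is a teaching sketch of candidate generation and pruning, not an optimized implementation (no hash tree).

```python
from itertools import combinations

# Toy transactions (illustrative only).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]
min_support = 0.4
n = len(transactions)

def support(itemset):
    return sum(itemset <= t for t in transactions) / n

# Step 1: frequent 1-itemsets.
items = {item for t in transactions for item in t}
frequent = [{frozenset([i]) for i in items if support({i}) >= min_support}]

# Steps 2-3: generate candidate k-itemsets from frequent (k-1)-itemsets and prune.
k = 2
while frequent[-1]:
    prev = frequent[-1]
    candidates = {a | b for a in prev for b in prev if len(a | b) == k}
    # Apriori property: every (k-1)-subset of a candidate must already be frequent.
    candidates = {c for c in candidates
                  if all(frozenset(s) in prev for s in combinations(c, k - 1))}
    frequent.append({c for c in candidates if support(c) >= min_support})
    k += 1

# Step 4 would generate rules from these frequent itemsets using confidence and lift.
for level, sets in enumerate(frequent, start=1):
    for s in sets:
        print(level, set(s), round(support(s), 2))
```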
Key Metrics of Apriori Algorithm
● Support: This metric measures how frequently an item appears in the
dataset relative to the total number of transactions. A higher support
indicates a more significant presence of the itemset in the dataset.
Support tells us how often a particular item or combination of items
appears in all the transactions (“Bread is bought in 20% of all
transactions.”)
● Confidence: Confidence assesses the likelihood that an item Y is
purchased when item X is purchased. It provides insight into the strength
of the association between two items. Confidence tells us how often items
go together. (“If bread is bought, butter is bought 75% of the time.”)
● Lift: Lift evaluates how much more likely two items are to be purchased
together compared to being purchased independently. A lift greater than
1 suggests a strong positive association. Lift shows how strong the
connection is between items. (“Bread and butter are much more likely to
be bought together than by chance.”)
Data Mining Multidimensional Association Rule
Multidimensional Association Rule Mining
In multidimensional association rules, attributes can be categorical or
quantitative.
● Quantitative attributes are numeric and have an implicit ordering.
● Numeric attributes must be discretized before mining.
● A multidimensional association rule involves more than one
dimension or predicate.
● Example – age(X, “20...29”) ∧ buys(X, “IBM Laptop Computer”) =>
buys(X, “HP Inkjet Printer”)
Approaches to mining multidimensional association rules:
There are three approaches to mining multidimensional association rules, as
follows.
1. Using static discretization of quantitative attributes (a small
discretization sketch follows this list):
● Discretization is static and occurs prior to mining.
● Discretized attributes are treated as categorical.
● The Apriori algorithm is used to find all k-frequent predicate
sets (this requires k or k+1 table scans). Every subset of a
frequent predicate set must also be frequent.
Example –
If in a data cube the 3-D cuboid (age, income, buys) is
frequent, this implies that (age, income), (age, buys), and (income, buys) are
also frequent.
Note –
Data cubes are well suited to mining because they make mining
faster. The cells of an n-dimensional data cuboid correspond to the
predicate sets.
2. Using dynamic discretization of quantitative attributes:
● Known as mining quantitative association rules.
● Numeric attributes are dynamically discretized during mining.
3. Using distance-based discretization with clustering –
This is a dynamic discretization process that considers the distance
between data points. It involves a two-step mining process,
as follows.
● Perform clustering to find the intervals (clusters) involved.
● Obtain association rules by searching for groups of clusters that occur
together.
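As mentioned under approach 1, here is a small sketch of static discretization with pandas. The age and income values, bin edges, and labels are illustrative assumptions; the resulting categorical attributes could then be fed to an Apriori-style predicate-set search.

```python
import pandas as pd

# Hypothetical customer records (illustrative only).
df = pd.DataFrame({"age": [23, 31, 38, 45, 52, 67],
                   "income": [28000, 42000, 51000, 63000, 72000, 30000]})

# Static discretization before mining: numeric attributes become categorical bins,
# which can then be treated like any other nominal attribute.
df["age_group"] = pd.cut(df["age"], bins=[0, 29, 49, 120],
                         labels=["20s-or-younger", "30s-40s", "50s+"])
df["income_band"] = pd.cut(df["income"], bins=[0, 40000, 60000, float("inf")],
                           labels=["low", "medium", "high"])
print(df[["age_group", "income_band"]])
```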
Multilevel Association Rule in Data Mining
Multilevel Association Rule:
Association rules generated from mining data at multiple levels of abstraction
are called multiple-level or multilevel association rules.
Multilevel association rules can be mined efficiently using concept hierarchies
under a support-confidence framework.
Rules at a high concept level may add to common-sense knowledge, while rules at
a low concept level may not always be useful.
Using uniform minimum support for all levels:
● When a uniform minimum support threshold is used, the search
procedure is simplified.
● The method is also simple, in that users are required
to specify only a single minimum support threshold.
● The same minimum support threshold is used when mining at each level of
abstraction (for example, for mining from “computer” down to “laptop
computer”). With such a threshold, both “computer” and “laptop computer” may
be found to be frequent, while “desktop computer” is not.
Needs of Multilevel Association Rules:
● Sometimes at a low level of abstraction, data does not show any significant
pattern, yet useful information may be hiding behind it.
● The aim is to find the hidden information in or between levels of
abstraction.
Approaches to multilevel association rule mining:
1. Uniform Support (using a uniform minimum support for all levels)
2. Reduced Support (using a reduced minimum support at lower levels)
3. Group-based Support (using item or group based minimum support)
Let’s discuss each one.
1. Uniform Support –
When a uniform minimum support threshold is used, the search
procedure is simplified. The technique is likewise basic in that
users are required to specify only a single minimum support threshold.
An optimization can be adopted, based on the
knowledge that an ancestor is a superset of its descendants: the
search avoids examining itemsets containing any item whose
ancestors do not satisfy minimum support. The uniform support approach,
however, has some difficulties. It is unlikely that items at lower levels
of abstraction will occur as frequently as those at higher levels of
abstraction. If the minimum support threshold is set too high, it could
miss several meaningful associations occurring at low abstraction
levels. This provides the motivation for the following approach.
2. Reduced Support –
For mining multilevel associations with reduced support, there are
several alternative search strategies, as follows (a small sketch comparing
uniform and reduced minimum support follows this list).
a. Level-by-level independent –
This is a full-breadth search, where no background knowledge
of frequent itemsets is used for pruning. Each node is examined,
regardless of whether its parent node is found to be
frequent.
b. Level-cross filtering by single item –
An item at the ith level is examined if and only if its parent node at
the (i-1)th level is frequent. In other words, we investigate a more specific
association from a more general one. If a node is frequent, its children
will be examined; otherwise, its descendants are pruned from the
search.
c. Level-cross filtering by k-itemset –
A k-itemset at the ith level is examined if and only if its
corresponding parent k-itemset at the (i-1)th level is frequent.
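Below is a small sketch comparing uniform and reduced minimum support across two concept levels. The concept hierarchy, transactions, and thresholds are illustrative assumptions, chosen so that “computer” is frequent at the higher level while only “laptop computer” passes the reduced threshold at the lower level.

```python
# Uniform vs. reduced minimum support across concept levels (illustrative sketch).
hierarchy = {"laptop computer": "computer", "desktop computer": "computer"}

transactions = [
    {"laptop computer"}, {"laptop computer"}, {"laptop computer"},
    {"desktop computer"}, {"desktop computer"},
    {"printer"}, {"printer"}, {"printer"}, {"printer"}, {"printer"},
]
n = len(transactions)

def support(item, level):
    """Support of an item; at level 1, child items are rolled up to their ancestor."""
    if level == 1:
        return sum(any(hierarchy.get(i, i) == item for i in t) for t in transactions) / n
    return sum(item in t for t in transactions) / n

uniform = 0.4                   # the same threshold at every level
reduced = {1: 0.4, 2: 0.25}     # a lower threshold at the lower (more specific) level

for item, level in [("computer", 1), ("laptop computer", 2), ("desktop computer", 2)]:
    s = support(item, level)
    print(f"{item:18s} level={level} support={s:.2f} "
          f"uniform_frequent={s >= uniform} reduced_frequent={s >= reduced[level]}")
```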