Unit:2
Data mining query languages- Integration of Data mining system with a Data warehouse
issues, Data pre-processing-Data cleaning, Data transformation-feature selection-
Dimensionality reduction.
What is a Data Mining Query Language (DMQL)?
Data mining query languages are specialized languages for specifying data mining tasks and for querying
their results. They let users express data mining operations declaratively, much as SQL does for relational
databases.
1. Purpose:
o Enable users to specify patterns to be discovered, data to be analyzed, and the conditions
under which data mining operations should be performed.
o Support complex queries involving aggregation, filtering, and pattern recognition.
2. Examples:
o DMQL (Data Mining Query Language): Designed for specifying data mining tasks and providing
access to data mining models.
o SQL with Data Mining Extensions: Some database systems extend SQL to include data mining
capabilities, allowing users to integrate data mining tasks directly within their SQL queries.
3. Typical Functions:
o Specifying target variables and patterns (e.g., mining frequent itemsets, building classification
models).
o Supporting the retrieval of discovered patterns and associations for further analysis.
Data Mining Query Language Using R
R has no dedicated Data Mining Query Language (DMQL), but data mining tasks such as association rule
mining, classification, and clustering can be performed with R packages and functions. The examples below
use small sample data sets, R code, and sample output to walk through association rule mining (Apriori) and
classification (decision tree).
1. Association Rule Mining Using Apriori
Objective: Find frequent patterns in transactional data.
Example Code:
We’ll use the arules package to mine association rules.
# Install and load the necessary package
if (!require("arules")) install.packages("arules", dependencies=TRUE)
library(arules)
# Sample transactional data
transactions <- list(
c("Bread", "Milk"),
c("Bread", "Diaper", "Beer", "Eggs"),
c("Milk", "Diaper", "Beer", "Cola"),
c("Bread", "Milk", "Diaper", "Beer"),
c("Bread", "Milk", "Diaper", "Cola")
)
# Convert the list to transaction data format
trans <- as(transactions, "transactions")
# Apply Apriori algorithm to find frequent itemsets and association rules
rules <- apriori(trans, parameter = list(supp = 0.2, conf = 0.6))
# Inspect the rules generated
inspect(rules)
Explanation:
transactions: A list simulating transactional data (products bought together).
Apriori: We specify a minimum support of 20% and a confidence level of 60% to mine the association
rules.
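Sample Output (excerpt; only two of the generated rules are shown):
lhs rhs support confidence lift
1 {Milk} => {Bread} 0.6 0.75 0.9375
2 {Diaper} => {Milk} 0.6 0.75 0.9375
This output means:
Rule 1: Milk appears in 4 of the 5 transactions and Milk together with Bread in 3, so 75% of the
transactions that contain Milk also contain Bread.
Rule 2: Likewise, 75% of the transactions that contain Diaper also contain Milk.
In both rules the lift is 0.75 / 0.8 = 0.9375. A lift slightly below 1 means the two items occur together a little
less often than they would if they were independent, so a high confidence alone does not prove a strong
association.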
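Support, confidence, and lift for a single rule can also be checked by hand, which helps when reading the output of inspect(). The base R sketch below recomputes them for {Milk} => {Bread} on the five sample transactions. The helper name rule_metrics is made up for this illustration and is not part of the arules package.

```r
# The same five sample transactions used with apriori()
transactions <- list(
  c("Bread", "Milk"),
  c("Bread", "Diaper", "Beer", "Eggs"),
  c("Milk", "Diaper", "Beer", "Cola"),
  c("Bread", "Milk", "Diaper", "Beer"),
  c("Bread", "Milk", "Diaper", "Cola")
)

# Hypothetical helper: metrics for the rule {lhs} => {rhs}
rule_metrics <- function(tx, lhs, rhs) {
  n      <- length(tx)
  n_lhs  <- sum(sapply(tx, function(t) all(lhs %in% t)))
  n_rhs  <- sum(sapply(tx, function(t) all(rhs %in% t)))
  n_both <- sum(sapply(tx, function(t) all(c(lhs, rhs) %in% t)))
  c(support    = n_both / n,                       # P(lhs and rhs)
    confidence = n_both / n_lhs,                   # P(rhs | lhs)
    lift       = (n_both * n) / (n_lhs * n_rhs))   # confidence / P(rhs)
}

m <- rule_metrics(transactions, "Milk", "Bread")
print(m)   # support 0.6, confidence 0.75, lift 0.9375
```

Lift divides the confidence by the overall frequency of the right-hand side, so it corrects for items that are simply popular.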
2. Classification Using Decision Tree
Objective: Classify if a customer will churn based on age and income.
Example Code:
Using the rpart package for decision tree classification.
# Load rpart library
if (!require("rpart")) install.packages("rpart", dependencies=TRUE)
library(rpart)
# Sample customer data
customer_data <- data.frame(
Age = c(25, 45, 35, 50, 23),
Income = c(40000, 90000, 60000, 85000, 45000),
Churn = as.factor(c('No', 'Yes', 'No', 'Yes', 'No'))
)
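# Build a decision tree model.
# rpart's default minsplit = 20 would never split a 5-row data set,
# so the control settings are relaxed for this toy example.
churn_model <- rpart(Churn ~ Age + Income, data = customer_data, method = "class",
control = rpart.control(minsplit = 2, minbucket = 1, cp = 0))
# Display the decision tree
print(churn_model)
Explanation:
churn_model: Builds a decision tree using Age and Income to predict Churn.
The model divides the dataset based on attribute values to make classifications. In this sample, every
customer aged 45 or more churned and every younger customer did not, so a single split separates the two
classes.
Sample Output (probabilities rounded; the exact layout may differ between rpart versions):
n= 5
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 5 2 No (0.6 0.4)
2) Age< 40 3 0 No (1.0 0.0) *
3) Age>=40 2 0 Yes (0.0 1.0) *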
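A decision tree prefers splits that make the child nodes purer. As an illustration only (rpart's internal computation has more detail), the Gini impurity before and after a candidate split on Age can be computed in base R for the five sample customers. The helper gini and the threshold 40 (the midpoint between the ages 35 and 45) are assumptions chosen for this sketch.

```r
customer_data <- data.frame(
  Age    = c(25, 45, 35, 50, 23),
  Income = c(40000, 90000, 60000, 85000, 45000),
  Churn  = factor(c("No", "Yes", "No", "Yes", "No"))
)

# Gini impurity of a vector of class labels: 1 - sum(p_k^2)
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Candidate split: Age < 40 versus Age >= 40
left  <- customer_data$Churn[customer_data$Age <  40]
right <- customer_data$Churn[customer_data$Age >= 40]

gini_root  <- gini(customer_data$Churn)
gini_split <- (length(left) * gini(left) + length(right) * gini(right)) /
  nrow(customer_data)

cat("Gini before split:", gini_root, "\n")                  # 0.48
cat("Weighted Gini after Age < 40 split:", gini_split, "\n") # 0
```

The impurity falls from 0.48 to 0 because each side of the split contains a single class.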
Advantages of Using R for Data Mining:
1. Flexibility: R offers a flexible and extensible environment for data mining, with many libraries
available for different tasks.
2. Wide Range of Algorithms: R supports a broad spectrum of data mining algorithms (e.g., decision
trees, k-means, apriori) through its packages.
3. Visualization Support: R integrates data mining with advanced visualization tools to help users
understand patterns.
4. Scalability: R works in memory by default, but it can handle large datasets and complex mining tasks
with the help of parallel computing and big data packages.
Conclusion: While R does not have a formal Data Mining Query Language (DMQL), its extensive packages
such as rpart, arules, and dplyr provide a rich framework to perform data mining tasks. These packages offer
functionalities that mimic DMQL, enabling users to classify, cluster, and mine association rules effectively.
With its versatility and powerful statistical tools, R is a go-to solution for data mining tasks in academic and
professional environments.
INTEGRATION OF DATA MINING SYSTEM WITH A DATA WAREHOUSE ISSUES
Integration of a data mining system with a data warehouse:
When a data mining (DM) system works with database (DB) and data warehouse (DW) systems, possible
integration schemes include no coupling, loose coupling, semi-tight coupling, and tight coupling. We examine
each of these schemes as follows:
1. No coupling: No coupling means that a DM system will not utilize any function of a DB or DW system. It
may fetch data from a particular source (such as a file system), process data using some data mining
algorithms, and then store the mining results in another file.
2. Loose coupling: Loose coupling means that a DM system will use some facilities of a DB or DW system,
fetching data from a data repository managed by these systems, performing data mining, and then storing
the mining results either in a file or in a designated place in a database or data warehouse. Loose coupling
is better than no coupling because it can fetch any portion of data stored in databases or data warehouses
by using query processing, indexing, and other system facilities.
However, many loosely coupled mining systems are main memory-based. Because mining does not exploit
the data structures and query optimization methods provided by DB or DW systems, it is difficult for loose
coupling to achieve high scalability and good performance with large data sets.
3. Semi-tight coupling: Semi-tight coupling means that besides linking a DM system to a DB/DW system,
efficient implementations of a few essential data mining primitives (identified by the analysis of frequently
encountered data mining functions) can be provided in the DB/DW system. These primitives can include
sorting, indexing, aggregation, histogram analysis, multiway join, and precomputation of some essential
statistical measures, such as sum, count, max, min, and standard deviation.
4. Tight coupling: Tight coupling means that a DM system is smoothly integrated into the DB/DW system.
The data mining subsystem is treated as one functional component of an information system. Data mining
queries and functions are optimized based on mining query analysis, data structures, indexing schemes, and
query processing methods of a DB or DW system.
Fig: Integration of a data mining system with a data warehouse
Issues in Data Integration
Integrating data for data mining raises several issues. The main ones are:
1. Entity Identification Problem
Records come from heterogeneous sources, so the question is how to match the real-world entities across
them. For example, suppose customer data arrives from two different data sources: one identifies each
customer by a customer ID, while the other uses a customer number. The analyst or the system must
recognize that both attributes refer to the same entity. Examining the metadata of each attribute (name,
meaning, data type, and permitted values) helps avoid errors during schema integration.
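Once the metadata shows that two differently named attributes identify the same entity, the sources can be joined on that pair of columns. The following base R sketch uses two made-up tables; the column names customer_id and customer_number and all values are hypothetical.

```r
# Two hypothetical sources that name the customer key differently
crm <- data.frame(
  customer_id = c(101, 102, 103),
  name        = c("Asha", "Ben", "Chen"),
  stringsAsFactors = FALSE
)
billing <- data.frame(
  customer_number = c(102, 103, 104),
  balance         = c(250.0, 0.0, 75.5)
)

# customer_id and customer_number denote the same real-world entity,
# so the join is made on that pair of columns (full outer join).
merged <- merge(crm, billing,
                by.x = "customer_id", by.y = "customer_number",
                all = TRUE)
print(merged)
```

Customers present in only one source appear with NA in the columns of the other, which makes unmatched entities easy to spot.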
2. Redundancy and Correlation Analysis
Redundancy is one of the major issues in data integration. An attribute is redundant if it can be derived
from another attribute or set of attributes in the data set. For example, if one data set contains the
customer's age and another contains the customer's date of birth, then age is a redundant attribute because
it can be derived from the date of birth. Some redundancies can be detected by correlation analysis, which
measures how strongly one attribute implies another.
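A correlation check is one simple way to flag a derivable attribute. The sketch below assumes a hypothetical customer table in which age is computed from the birth year relative to an assumed reference year; the 0.99 cutoff is likewise an arbitrary choice for illustration.

```r
# Hypothetical integrated customer table with a derivable attribute
ref_year  <- 2024   # assumed reference year for this sketch
customers <- data.frame(birth_year = c(1990, 1985, 2000, 1975, 1995))
customers$age <- ref_year - customers$birth_year

# A correlation close to +1 or -1 signals that one attribute
# carries no information beyond the other.
r <- cor(customers$birth_year, customers$age)
cat("Correlation between birth_year and age:", r, "\n")

# Drop the derived attribute when the correlation is (near) perfect
if (abs(r) > 0.99) customers$age <- NULL
print(names(customers))
```

Correlation only captures linear dependence between numeric attributes; nominal attributes need a different test, such as chi-square.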
3. Tuple Duplication
In addition to redundancy between attributes, data integration must also deal with duplicate tuples.
Duplicate tuples may appear in the resulting data if a denormalized table was used as a source for data
integration.
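Exact duplicate tuples can be found with base R's duplicated(). The table below is made up for illustration.

```r
# Hypothetical integrated table in which one order appears twice
orders <- data.frame(
  order_id = c(1, 2, 2, 3),
  customer = c("Asha", "Ben", "Ben", "Chen"),
  amount   = c(20, 35, 35, 50),
  stringsAsFactors = FALSE
)

dup_flags <- duplicated(orders)     # TRUE for every repeated copy of a row
deduped   <- orders[!dup_flags, ]   # keep the first occurrence only

cat("Duplicate rows found:", sum(dup_flags), "\n")
print(deduped)
```

This only catches rows that match in every column; near-duplicates (for example, different spellings of a name) need record linkage techniques.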
4. Data Value Conflict Detection and Resolution
A data value conflict arises when records combined from several sources do not agree. Attribute values
for the same real-world entity can differ between data sets because they are represented, scaled, or encoded
differently. For example, the price of a hotel room may be expressed in different currencies in different
cities. Such conflicts are detected and resolved during the data integration process.
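One way to resolve a unit conflict is to convert every value to a common unit before merging. The sketch below does this for hotel prices; the cities, prices, and especially the exchange rates are invented for illustration and are not real market data.

```r
# Hypothetical hotel prices recorded in local currencies
hotels <- data.frame(
  city     = c("Paris", "New York", "Mumbai"),
  price    = c(100, 120, 8300),
  currency = c("EUR", "USD", "INR"),
  stringsAsFactors = FALSE
)

# Assumed, illustrative conversion rates to USD (not real market data)
rate_to_usd <- c(EUR = 1.10, USD = 1.00, INR = 0.012)

# Resolve the conflict by expressing every price in one currency
hotels$price_usd <- hotels$price * unname(rate_to_usd[as.character(hotels$currency)])
print(hotels)
```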
DATA PREPROCESSING
Data preprocessing is a data mining technique that involves transforming raw data into an understandable
format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and
is likely to contain many errors. Data preprocessing is a proven method of resolving such issues and
prepares raw data for further processing. It is used in database-driven applications such as customer
relationship management and in machine learning applications (such as neural networks).
Steps of Data Preprocessing
Data preprocessing is an important step in the data mining process that involves cleaning and transforming
raw data to make it suitable for analysis. Some common steps in data preprocessing include:
1. Data Cleaning: This involves identifying and correcting errors or inconsistencies in the data, such as
missing values, outliers, and duplicates. Various techniques can be used for data cleaning, such as
imputation, removal, and transformation.
2. Data Integration: This involves combining data from multiple sources to create a unified dataset. Data
integration can be challenging as it requires handling data with different formats, structures, and
semantics. Techniques such as record linkage and data fusion can be used for data integration.
3. Data Transformation: This involves converting the data into a suitable format for analysis. Common
techniques used in data transformation include normalization, standardization, and discretization.
Normalization is used to scale the data to a common range, while standardization is used to transform
the data to have zero mean and unit variance. Discretization is used to convert continuous data into
discrete categories.
4. Data Reduction: This involves reducing the size of the dataset while preserving the important
information. Data reduction can be achieved through techniques such as feature selection and feature
extraction. Feature selection involves selecting a subset of relevant features from the dataset, while
feature extraction involves transforming the data into a lower-dimensional space while preserving
the important information.
5. Data Discretization: This involves dividing continuous data into discrete categories or intervals.
Discretization is often used in data mining and machine learning algorithms that require categorical
data. Discretization can be achieved through techniques such as equal width binning, equal frequency
binning, and clustering.
6. Data Normalization: This involves scaling the data to a common range, such as between 0 and 1 or
between -1 and 1. Normalization is often used to handle data with different units and scales. Common
normalization techniques include min-max normalization, z-score normalization, and decimal scaling.
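The three normalization techniques named in the list can be written in a few lines of base R. This is a minimal sketch on made-up values; note that sd() uses the sample (n - 1) standard deviation, and decimal scaling divides by the smallest power of 10 that brings every absolute value below 1.

```r
x <- c(200, 300, 400, 600, 1000)   # made-up attribute values

# Min-max normalization: maps the values onto [0, 1]
min_max <- (x - min(x)) / (max(x) - min(x))

# Z-score normalization: zero mean, unit (sample) standard deviation
z_score <- (x - mean(x)) / sd(x)

# Decimal scaling: divide by 10^j, with j the smallest integer
# such that max(|x / 10^j|) < 1
j <- floor(log10(max(abs(x)))) + 1
decimal <- x / 10^j

print(round(min_max, 3))
print(round(z_score, 3))
print(decimal)
```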
Data preprocessing plays a crucial role in ensuring the quality of data and the accuracy of the analysis results.
The specific steps involved in data preprocessing may vary depending on the nature of the data and the
analysis goals.
By performing these steps, the data mining process becomes more efficient and the results become more
accurate.
Preprocessing in Data Mining
Data preprocessing is a data mining technique used to transform raw data into a useful and efficient
format.
Steps Involved in Data Preprocessing
1. Data Cleaning: The data can have many irrelevant and missing parts. Data cleaning handles these
problems, including missing data and noisy data.
Missing Data: This situation arises when some values are absent from the data. It can be handled in various
ways.
Some of them are:
o Ignore the tuples: This approach is suitable only when the dataset we have is quite large and
multiple values are missing within a tuple.
o Fill the Missing values: There are various ways to do this task. You can choose to fill the
missing values manually, by attribute mean or the most probable value.
Noisy Data: Noisy data is data containing random errors or meaningless values that cannot be interpreted
correctly. It can be generated by faulty data collection, data entry errors, etc. It can be handled in the
following ways:
o Binning Method: This method works on sorted data in order to smooth it. The data is divided
into segments (bins) of equal size, and each segment is handled separately. All values in a
segment can be replaced by the segment mean, or each value can be replaced by the nearest
segment boundary value.
o Regression: Here data is smoothed by fitting it to a regression function. The regression
used may be linear (having one independent variable) or multiple (having multiple
independent variables).
o Clustering: This approach groups similar data into clusters. Values that fall outside every
cluster can be treated as outliers, although some outliers may go undetected.
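The binning method can be sketched in base R as follows. The price list and the bin size of 3 are chosen only for illustration, and the helper snap is a made-up name.

```r
# Sorted sample values split into equal-size bins of 3
prices   <- sort(c(4, 8, 15, 21, 21, 24, 25, 28, 34))
bin_size <- 3
bin_id   <- ceiling(seq_along(prices) / bin_size)   # 1 1 1 2 2 2 3 3 3

# Smoothing by bin means: every value becomes the mean of its bin
by_mean <- ave(prices, bin_id, FUN = mean)

# Smoothing by bin boundaries: every value snaps to the nearer of
# its bin's minimum and maximum
snap <- function(v) ifelse(v - min(v) <= max(v) - v, min(v), max(v))
by_boundary <- ave(prices, bin_id, FUN = snap)

print(by_mean)       # 9 9 9 22 22 22 29 29 29
print(by_boundary)   # 4 4 15 21 21 24 25 25 34
```

Smoothing by means removes more noise but also more detail; smoothing by boundaries keeps the extremes of each bin intact.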
Data Cleaning Techniques
Automated Data Cleaning Tools:
o Use tools and libraries (e.g., OpenRefine, pandas in Python, dplyr in R) that provide
functionalities for cleaning data efficiently.
Regular Expressions (Regex):
o Utilize regex for pattern matching and string manipulation to clean and standardize text data.
Data Profiling:
o Perform data profiling to understand the data's structure, quality, and content, which aids in
identifying cleaning needs.
Data Integration:
o When merging data from multiple sources, ensure consistent formats and values across
datasets.
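As a small illustration of regex-based cleaning combined with mean imputation of missing values, the base R sketch below standardizes a text column and fills one missing number. The table and its values are made up.

```r
# Hypothetical raw records with inconsistent text and a missing value
raw <- data.frame(
  city   = c(" mumbai", "Mumbai ", "DELHI", "delhi  "),
  income = c(52000, NA, 61000, 58000),
  stringsAsFactors = FALSE
)

# Regex cleaning: strip leading/trailing whitespace ...
city <- gsub("^\\s+|\\s+$", "", raw$city)
# ... then standardize the case (first letter upper, rest lower)
raw$city <- paste0(toupper(substr(city, 1, 1)),
                   tolower(substr(city, 2, nchar(city))))

# Fill the missing income with the attribute mean
raw$income[is.na(raw$income)] <- mean(raw$income, na.rm = TRUE)

print(raw)
```

Mean imputation is easy but shrinks the variance of the attribute, so it is best reserved for cases with few missing values.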
2. Data Transformation: This step transforms the data into forms appropriate for the mining process. It
involves the following:
Normalization: Scales the data values into a specified range (-1.0 to 1.0 or 0.0 to 1.0).
Attribute Construction: New attributes are constructed from the given set of attributes to help the
mining process.
Discretization: Replaces the raw values of a numeric attribute with interval labels or conceptual
labels.
Concept Hierarchy Generation: Attributes are generalized from a lower level to a higher level in a
hierarchy. For example, the attribute “city” can be generalized to “country”.
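Discretization and concept hierarchy generation can both be sketched in base R. The ages, the labels low/mid/high, and the city-to-country lookup below are all invented for illustration.

```r
age <- c(18, 22, 25, 31, 38, 45, 52, 67)   # made-up numeric attribute

# Equal-width binning: three intervals of equal length over the range
equal_width <- cut(age, breaks = 3, labels = c("low", "mid", "high"))

# Equal-frequency binning: cut points placed at the tertiles
q <- quantile(age, probs = seq(0, 1, length.out = 4))
equal_freq <- cut(age, breaks = q, include.lowest = TRUE,
                  labels = c("low", "mid", "high"))

print(table(equal_width))
print(table(equal_freq))

# Concept hierarchy: generalize "city" to "country" with a lookup table
city_to_country <- c(Mumbai = "India", Delhi = "India", Paris = "France")
city    <- c("Paris", "Delhi", "Mumbai")
country <- unname(city_to_country[city])
print(country)
```

Equal-width bins are sensitive to outliers, while equal-frequency bins keep the group sizes balanced.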
Data Transformation Techniques
1. ETL (Extract, Transform, Load):
o The ETL process is crucial in data warehousing, where data is extracted from various sources,
transformed into a usable format, and then loaded into the data warehouse.
2. Scripting and Programming:
o Data transformation can be accomplished using programming languages such as Python (with
libraries like pandas and NumPy) or R, which offer extensive functionalities for data
manipulation and transformation.
3. Data Transformation Tools:
o Utilize data transformation tools (e.g., Talend, Apache Nifi, Alteryx) that provide user-friendly
interfaces for performing various transformation tasks.
4. Regular Expressions (Regex):
o Use regex for pattern matching and text manipulation, particularly for cleaning and
standardizing string data.
3. Data Integration: Data from heterogeneous sources is combined into a single dataset.
There are two types of data integration:
1. Tight coupling: Data from the sources is physically combined in one location.
2. Loose coupling: Only an interface is created, and data is combined through the interface while
remaining in the source systems.
4. Data Reduction: Data reduction is a crucial step in the data mining process that involves reducing the size
of the dataset while preserving the important information. This is done to improve the efficiency of data
analysis and to avoid overfitting of the model. Some common steps involved in data reduction are:
Feature Selection: This involves selecting a subset of relevant features from the dataset. Feature
selection is often performed to remove irrelevant or redundant features from the dataset. It can be
done using various techniques such as correlation analysis, mutual information, and chi-square
tests.
Feature Extraction: This involves transforming the data into a lower-dimensional space while
preserving the important information. Feature extraction is often used when the original features are
high-dimensional and complex. It can be done using techniques such as PCA, linear discriminant
analysis (LDA), and non-negative matrix factorization (NMF).
Sampling: This involves selecting a subset of data points from the dataset. Sampling is often used to
reduce the size of the dataset while preserving the important information. It can be done using
techniques such as random sampling, stratified sampling, and systematic sampling.
Clustering: This involves grouping similar data points together into clusters. Clustering is often used
to reduce the size of the dataset by replacing similar data points with a representative centroid. It can
be done using techniques such as k-means, hierarchical clustering, and density-based clustering.
Compression: This involves compressing the dataset while preserving the important information.
Compression is often used to reduce the size of the dataset for storage and transmission purposes. It
can be done using techniques such as wavelet compression, JPEG compression, and GIF compression.
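The three sampling schemes mentioned in the list can be sketched with base R on the built-in iris data. The sample size of 30 and the seed are arbitrary choices for this illustration.

```r
set.seed(42)   # fixed seed so the sketch is reproducible
data(iris)
n <- 30        # desired sample size out of 150 rows

# Simple random sampling without replacement
random_idx <- sample(nrow(iris), n)

# Stratified sampling: the same number of rows from each Species
rows_by_species <- split(seq_len(nrow(iris)), iris$Species)
strat_idx <- unlist(lapply(rows_by_species,
                           function(idx) sample(idx, n / 3)))

# Systematic sampling: every k-th row from a random starting point
k       <- nrow(iris) / n
start   <- sample(k, 1)
sys_idx <- seq(start, nrow(iris), by = k)

print(table(iris$Species[random_idx]))   # class counts may be unbalanced
print(table(iris$Species[strat_idx]))    # exactly 10 per class
print(length(sys_idx))
```

Stratified sampling preserves the class proportions, which matters when some classes are rare.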
What is Dimensionality Reduction?
The number of input features, variables, or columns present in a given dataset is known as dimensionality,
and the process to reduce these features is called dimensionality reduction.
In many cases a dataset contains a huge number of input features, which makes the predictive modeling
task more complicated. Because it is very difficult to visualize or make predictions for a training dataset
with a high number of features, dimensionality reduction techniques are needed in such cases.
Dimensionality reduction can be defined as "a way of converting a higher-dimensional dataset into a
lower-dimensional dataset while ensuring that it provides similar information."
It is commonly used in the fields that deal with high-dimensional data, such as speech recognition, signal
processing, bioinformatics, etc. It can also be used for data visualization, noise reduction, cluster analysis,
etc.
The Curse of Dimensionality
Handling high-dimensional data is very difficult in practice, a problem commonly known as the curse of
dimensionality. As the dimensionality of the input dataset increases, any machine learning algorithm and
model becomes more complex. As the number of features increases, the number of samples needed to cover
the feature space grows rapidly (roughly exponentially), so a fixed-size dataset becomes increasingly sparse
and the chance of overfitting increases. A machine learning model trained on high-dimensional data with
too few samples tends to overfit and perform poorly on new data.
Hence, it is often required to reduce the number of features, which can be done with dimensionality
reduction.
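A back-of-the-envelope calculation shows how quickly data becomes sparse. Assume, purely for illustration, that each feature is divided into 10 bins and that 1000 samples are available; the sketch counts the resulting grid cells and the largest fraction of them the samples could possibly occupy.

```r
dims          <- c(1, 2, 3, 5, 10)   # number of features
bins_per_axis <- 10                  # assumed resolution per feature
n_samples     <- 1000                # assumed dataset size

# Number of cells in the grid grows exponentially with the dimension
cells <- bins_per_axis ^ dims

# Upper bound on the fraction of cells that can contain a sample
max_coverage <- pmin(1, n_samples / cells)

print(data.frame(dims, cells, max_coverage))
```

With 10 features, at most one cell in ten million can contain a sample, so almost every region of the feature space is empty.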
Benefits of applying Dimensionality Reduction
Some benefits of applying dimensionality reduction technique to the given dataset are given below:
o By reducing the dimensions of the features, the space required to store the dataset also gets reduced.
o Less computation and training time is required for reduced dimensions of features.
o Reduced dimensions of features of the dataset help in visualizing the data quickly.
o It removes the redundant features (if present) by taking care of multicollinearity.
Disadvantages of dimensionality Reduction
There are also some disadvantages of applying the dimensionality reduction, which are given below:
o Some data may be lost due to dimensionality reduction.
o In the PCA dimensionality reduction technique, the number of principal components to keep is
sometimes not known in advance.
Approaches of Dimension Reduction
There are two ways to apply the dimension reduction technique, which are given below:
Feature Selection
Feature selection is the process of selecting the subset of the relevant features and leaving out the irrelevant
features present in a dataset to build a model of high accuracy. In other words, it is a way of selecting the
optimal features from the input dataset.
Three methods are used for the feature selection:
1. Filter Methods
In this method, the dataset is filtered, and a subset that contains only the relevant features is taken. Some
common techniques of the filter method are:
o Correlation
o Chi-Square Test
o ANOVA
o Information Gain, etc.
2. Wrapper Methods
The wrapper method has the same goal as the filter method, but it uses a machine learning model for its
evaluation. In this method, a subset of features is fed to the ML model and its performance is evaluated. The
performance decides whether to add or remove features to increase the accuracy of the model. This
method is usually more accurate than the filter method but computationally more expensive. Some common
techniques of wrapper methods are:
o Forward Selection
o Backward Selection
o Bi-directional Elimination
3. Embedded Methods: Embedded methods check the different training iterations of the machine learning
model and evaluate the importance of each feature. Some common techniques of Embedded methods are:
o LASSO
o Elastic Net
o Ridge Regression, etc.
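As a sketch of the filter approach, features can be ranked by the absolute value of their correlation with the target, without training any model. The example uses the built-in mtcars data with mpg as the target; keeping the top three features is an arbitrary choice.

```r
data(mtcars)
target   <- "mpg"
features <- setdiff(names(mtcars), target)

# Filter score: absolute Pearson correlation with the target
scores <- sapply(features,
                 function(f) abs(cor(mtcars[[f]], mtcars[[target]])))

# Rank the features and keep the k best
ranked <- sort(scores, decreasing = TRUE)
top_k  <- names(ranked)[1:3]

print(round(ranked, 3))
print(top_k)
```

Because each feature is scored on its own, a filter like this is fast but cannot see that two highly ranked features may carry the same information.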
Feature Extraction:
Feature extraction is the process of transforming a space with many dimensions into a space with fewer
dimensions. This approach is useful when we want to keep most of the information but use fewer
resources while processing it.
Some common feature extraction techniques are:
1. Principal Component Analysis
2. Linear Discriminant Analysis
3. Kernel PCA
4. Quadratic Discriminant Analysis
Common techniques of Dimensionality Reduction
1. Principal Component Analysis
2. Backward Elimination
3. Forward Selection
4. Score comparison
5. Missing Value Ratio
6. Low Variance Filter
7. High Correlation Filter
8. Random Forest
9. Factor Analysis
10. Auto-Encoder
Principal Component Analysis (PCA)
Principal Component Analysis is a statistical process that converts the observations of correlated features
into a set of linearly uncorrelated features with the help of orthogonal transformation. These new
transformed features are called the Principal Components. It is one of the popular tools that is used for
exploratory data analysis and predictive modeling.
PCA works by considering the variance of each attribute: directions in which the data varies most are
assumed to carry the most information, so keeping only the leading components reduces the dimensionality.
Some real-world applications of PCA are image processing, movie recommendation systems, and optimizing
the power allocation in various communication channels.
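PCA is available in base R through prcomp(). The sketch below applies it to the four numeric columns of the built-in iris data and keeps the smallest number of components that explain at least 95% of the variance; the 95% cutoff is an assumption for illustration, not a fixed rule.

```r
data(iris)
X <- iris[, 1:4]   # the four numeric measurements

# Center and scale so that every feature contributes equally
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Proportion of total variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 4))

# Keep the fewest components that reach 95% cumulative variance
k       <- which(cumsum(var_explained) >= 0.95)[1]
reduced <- pca$x[, 1:k, drop = FALSE]

cat("Components kept:", k, "\n")
print(dim(reduced))
```

The columns of reduced are new, uncorrelated features; each is a linear combination of the original four measurements.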
Backward Feature Elimination
The backward feature elimination technique is mainly used while developing Linear Regression or Logistic
Regression model. Below steps are performed in this technique to reduce the dimensionality or in feature
selection:
o In this technique, firstly, all the n variables of the given dataset are taken to train the model.
o The performance of the model is checked.
o Now we will remove one feature each time and train the model on n-1 features for n times, and will
compute the performance of the model.
o We will check the variable that has made the smallest or no change in the performance of the model,
and then we will drop that variable or features; after that, we will be left with n-1 features.
o Repeat the complete process until no feature can be dropped.
In this technique, by selecting the optimum performance of the model and the maximum tolerable error rate,
we can define the optimal number of features required for the machine learning algorithm.
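The backward elimination steps can be sketched in base R. This version makes several assumptions that the text leaves open: linear regression as the model, adjusted R-squared as the performance measure, and a tolerance tol for the largest acceptable drop in performance. The function names are made up for the example.

```r
data(mtcars)

# Performance of a linear model on a given feature set (adjusted R^2)
score <- function(feats, target, data) {
  fit <- lm(reformulate(feats, response = target), data = data)
  summary(fit)$adj.r.squared
}

backward_eliminate <- function(data, target, tol = 0.01) {
  features <- setdiff(names(data), target)
  current  <- score(features, target, data)   # model with all n features
  while (length(features) > 1) {
    # Retrain with each feature left out in turn
    without <- sapply(features, function(f) {
      score(setdiff(features, f), target, data)
    })
    # Stop when even the least harmful removal costs more than tol
    if (current - max(without) > tol) break
    features <- setdiff(features, names(which.max(without)))
    current  <- max(without)
  }
  features
}

kept <- backward_eliminate(mtcars, "mpg")
print(kept)
```

A larger tol removes more features at the price of a weaker model, which corresponds to the maximum tolerable error rate mentioned in the text.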
Forward Feature Selection
Forward feature selection follows the inverse process of the backward elimination process. It means, in this
technique, we don't eliminate the feature; instead, we will find the best features that can produce the highest
increase in the performance of the model. Below steps are performed in this technique:
o We start with no features and add one feature at a time.
o At each step, we train the model once for every feature not yet selected, adding that feature to the
current set.
o The feature that gives the best performance is kept.
o The process is repeated until adding another feature no longer gives a significant increase in the
performance of the model.
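Forward selection can be sketched in the same way. As assumptions for this example, the model is a linear regression, performance is measured by adjusted R-squared, and min_gain is the smallest improvement that still counts as significant. The function names are hypothetical.

```r
data(mtcars)

# Performance of a linear model on a given feature set (adjusted R^2)
score <- function(feats, target, data) {
  fit <- lm(reformulate(feats, response = target), data = data)
  summary(fit)$adj.r.squared
}

forward_select <- function(data, target, min_gain = 0.01) {
  remaining <- setdiff(names(data), target)
  selected  <- character(0)
  current   <- 0   # adjusted R^2 of a model with no features
  while (length(remaining) > 0) {
    # Try adding each remaining feature to the current set
    with_f <- sapply(remaining, function(f) {
      score(c(selected, f), target, data)
    })
    # Stop when the best candidate no longer helps enough
    if (max(with_f) - current < min_gain) break
    best      <- names(which.max(with_f))
    selected  <- c(selected, best)
    remaining <- setdiff(remaining, best)
    current   <- max(with_f)
  }
  selected
}

chosen <- forward_select(mtcars, "mpg")
print(chosen)
```

The first feature chosen is the one with the strongest individual relationship to the target; later choices depend on what is already in the set.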
Missing Value Ratio
If a variable has too many missing values, we drop it because it does not carry much useful
information. To do this, we set a threshold, and if the fraction of missing values in a variable exceeds that
threshold, we drop the variable. The lower the threshold, the more variables are dropped and the more
aggressive the reduction.
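The missing value ratio filter takes only a few lines of base R. The data frame and the threshold of 0.5 below are made up for illustration.

```r
# Hypothetical data set with different amounts of missing data
df <- data.frame(
  a = c(1, 2, NA, 4, 5),
  b = c(NA, NA, NA, 4, NA),
  c = c(1, NA, 3, 4, 5)
)

# Fraction of missing values in each variable
missing_ratio <- colMeans(is.na(df))
print(missing_ratio)   # a 0.2, b 0.8, c 0.2

# Drop every variable whose ratio exceeds the threshold
threshold <- 0.5
reduced   <- df[, missing_ratio <= threshold, drop = FALSE]
print(names(reduced))  # "a" "c"
```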
Random Forest
Random Forest is a popular and very useful algorithm for feature selection in machine learning. Most
implementations report built-in feature importance scores, so we do not need to program them separately.
In this technique, we grow a large set of trees against the target variable and use the usage statistics of
each attribute to find the most informative subset of features.
Many random forest implementations accept only numerical inputs, so categorical input data may need to
be converted into numeric data using one-hot encoding.
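One-hot encoding of a categorical variable can be done in base R with model.matrix(). The data frame below is invented for illustration; the "- 1" in the formula removes the intercept so that every level gets its own 0/1 column.

```r
# Hypothetical data with one numeric and one categorical feature
df <- data.frame(
  size  = c(10, 12, 9, 15),
  color = factor(c("red", "blue", "green", "blue"))
)

# One 0/1 indicator column per level of color
one_hot <- model.matrix(~ color - 1, data = df)

# Combine with the numeric feature into an all-numeric matrix
encoded <- cbind(size = df$size, one_hot)
print(encoded)
```

Each row has exactly one indicator set to 1, and the column names are formed from the variable name followed by the level.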
Methods of Dimensionality Reduction:
1. Principal Component Analysis (PCA): Transforms the original features into a new set of orthogonal
features (principal components) that capture the maximum variance in the data.
o Useful for linear relationships and when preserving variance is crucial.
2. t-Distributed Stochastic Neighbor Embedding (t-SNE): A non-linear technique that is particularly
useful for visualizing high-dimensional data in two or three dimensions.
o Preserves local relationships, making it ideal for clustering and visualization.
3. Linear Discriminant Analysis (LDA): A supervised dimensionality reduction technique that aims to
project data in a way that maximizes class separability.
o Useful for classification problems where class labels are available.
4. Autoencoders: Neural network-based techniques that learn to compress data into a lower-
dimensional space and then reconstruct it back.
o Effective for capturing complex relationships in data.
Key Differences
Aspect              | Feature Selection                     | Dimensionality Reduction
Objective           | Selects a subset of original features | Transforms data into a new feature space
Approach            | Subset selection                      | Transformation and projection
Resulting Features  | Original features retained            | New features (components) generated
Interpretability    | Easier to interpret                   | Less interpretable (new dimensions)
When to Use
Use Feature Selection when:
o You have a large number of features and want to identify the most relevant ones.
o You aim to improve model performance and interpretability without altering the feature
space.
Use Dimensionality Reduction when:
o You need to visualize high-dimensional data.
o You want to reduce noise and redundancy while preserving the underlying structure of the
data.
o You are dealing with data that has multicollinearity issues (highly correlated features).