Unit 3 Data Mining
KDD (Knowledge Discovery in Databases) is a field of computer science, which includes the
tools and theories to help humans in extracting useful and previously unknown information (i.e.,
knowledge) from large collections of digitized data. KDD consists of several steps, and Data Mining is one of them: Data Mining is the application of specific algorithms to extract patterns from data. Nonetheless, the two terms are often used interchangeably.
What is KDD?
KDD is a computer science field specializing in extracting previously unknown and interesting
information from raw data. KDD is the whole process of trying to make sense of data by
developing appropriate methods or techniques. This process involves mapping low-level data into other forms that are more compact, abstract, and useful. This is achieved by creating short
reports, modeling the process of generating data, and developing predictive models that can
predict future cases.
Due to the exponential growth of data, especially in areas such as business, KDD has become a
very important process to convert this large wealth of data into business intelligence, as manual
extraction of patterns has become seemingly impossible in the past few decades.
For example, it is currently used for various applications such as social network analysis, fraud
detection, science, investment, manufacturing, telecommunications, data cleaning, sports,
information retrieval, and marketing. KDD is typically used to answer questions such as: which products are likely to yield the highest profit at V-Mart next year?
Many people treat data mining as a synonym for another popularly used term,
knowledge discovery from data, or KDD, while others view data mining as merely an
essential step in the process of knowledge discovery. The knowledge discovery process is
shown in Figure 1.4 as an iterative sequence of the following steps:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques are used to present mined knowledge to users)
Steps 1 through 4 are different forms of data preprocessing, where data are prepared for mining.
The data mining step may interact with the user or a knowledge base. The interesting patterns are
presented to the user and may be stored as new knowledge in the knowledge base. The preceding
view shows data mining as one step in the knowledge discovery process, albeit an essential one
because it uncovers hidden patterns for evaluation. However, in industry, in media, and in the
research milieu, the term data mining is often used to refer to the entire knowledge discovery
process (perhaps because the term is shorter than knowledge discovery from data).
Therefore, we adopt a broad view of data mining functionality: Data mining is the process of
discovering interesting patterns and knowledge from large amounts of data. The data sources can
include databases, data warehouses, the Web, other information repositories, or data that are
streamed into the system dynamically.
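To make the step sequence concrete, here is a minimal sketch of a few KDD steps chained together as plain Python functions. It is illustrative only: the helper names (clean, select, mine) and the toy sales records are assumptions, not part of any standard library or of the process definition above.

# Illustrative sketch of some KDD steps as a tiny pipeline (hypothetical helpers).
raw_sales = [
    {"store": "A", "item": "laptop", "amount": 1200.0},
    {"store": "A", "item": "laptop", "amount": 1200.0},  # duplicate record (noise)
    {"store": "B", "item": "phone", "amount": None},     # incomplete record
]

def clean(records):
    # Data cleaning: drop incomplete and duplicate records.
    seen, out = set(), []
    for r in records:
        key = tuple(sorted(r.items()))
        if r["amount"] is not None and key not in seen:
            seen.add(key)
            out.append(r)
    return out

def select(records, store):
    # Data selection: keep only the data relevant to the analysis task.
    return [r for r in records if r["store"] == store]

def mine(records):
    # Data mining: a trivial "pattern" -- total sales per item.
    totals = {}
    for r in records:
        totals[r["item"]] = totals.get(r["item"], 0.0) + r["amount"]
    return totals

patterns = mine(select(clean(raw_sales), store="A"))
print(patterns)  # pattern evaluation and presentation would follow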
Data mining is not an easy task, as the algorithms used can get very complex, and data is not always available in one place; it needs to be integrated from various heterogeneous data sources. These factors also create some issues. Here we discuss two of the major issues:
Diverse Data Type Issues
Handling of relational and complex types of data − The database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of data.
Mining information from heterogeneous databases and global information systems − The data is available at different data sources on a LAN or WAN. These data sources may be structured, semi-structured, or unstructured. Therefore, mining knowledge from them adds challenges to data mining.
Data mining is a form of artificial intelligence that uses perception models, analytical models, and multiple algorithms to simulate techniques of the human brain. Data mining helps machines make human-like decisions and choices.
The user of a data mining tool has to supply rules, preferences, and even experiences to the machine to obtain decision support. Common data mining metrics are as follows −
Usefulness − Usefulness involves several metrics that tell us whether the model provides useful information. For instance, a data mining model that correlates store location with sales can be both accurate and reliable, yet still not useful, because the result cannot be generalized by opening more stores at the same location.
Furthermore, it does not answer the fundamental business question of why specific locations
have more sales. A model that appears successful may in fact be meaningless because it depends on cross-correlations in the data.
Return on Investment (ROI) − Data mining tools will find interesting patterns buried inside the
data and develop predictive models. These models will have several measures for denoting how
well they fit the records. It is not always clear how to make a decision based on some of the measures reported as part of a data mining analysis.
Access Financial Information during Data Mining − The simplest way to frame decisions in
financial terms is to augment the raw information that is generally mined to also contain
financial data. Some organizations are investing in and developing data warehouses and data marts.
The design of a warehouse or mart involves considerations about the types of analyses and data needed for expected queries. Designing warehouses in a way that allows access to financial information, alongside the more typical data on product attributes, user profiles, etc., can be useful.
Converting Data Mining Metrics into Financial Terms − A general data mining metric is the
measure of "Lift". Lift is a measure of what is achieved by using the specific model or pattern
relative to the base rate in which the model is not used. High values mean much is achieved. It may seem, then, that one can simply make a decision based on lift.
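As a concrete illustration, the snippet below computes the lift of a hypothetical association rule from a handful of toy transactions; the items, the transactions, and the rule itself are assumptions made purely for illustration.

# Lift of a rule A -> B: confidence(A -> B) divided by the base-rate support of B.
transactions = [                       # toy market-basket data (assumed)
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    # Fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def lift(antecedent, consequent):
    confidence = support(antecedent | consequent) / support(antecedent)
    return confidence / support(consequent)   # lift > 1: positive association, < 1: negative

print(round(lift({"bread"}, {"milk"}), 3))    # 0.833 for this toy data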
Accuracy − Accuracy is a measure of how well the model correlates an outcome with the attributes in the data that has been supplied. There are several measures of accuracy, but all of them depend on the data that is used. In reality, values can be missing or approximate, or the data can have been changed by several processes.
Because mining is a procedure of exploration and development, it can be acceptable to tolerate a certain amount of error in the data, especially if the data is fairly uniform in its characteristics. For example, a model that predicts sales for a specific store based on past sales can be strongly correlated and very accurate, even if that store consistently used the wrong accounting techniques. Thus, measurements of accuracy should be balanced by assessments of reliability.
4. Data Mining Architecture
The significant components of data mining systems are a data source, data mining engine, data
warehouse server, the pattern evaluation module, graphical user interface, and knowledge base.
Data Source:
The actual source of data is the Database, data warehouse, World Wide Web (WWW), text files,
and other documents. You need a huge amount of historical data for data mining to be
successful. Organizations typically store data in databases or data warehouses. Data warehouses may comprise one or more databases, text files, spreadsheets, or other repositories of data.
Sometimes, even plain text files or spreadsheets may contain information. Another primary
source of data is the World Wide Web or the internet.
Different processes:
Before passing the data to the database or data warehouse server, the data must be cleaned,
integrated, and selected. As the information comes from various sources and in different formats,
it can't be used directly for the data mining procedure because the data may not be complete and
accurate. So, the data first needs to be cleaned and unified. More information than needed will be collected from the various data sources, and only the data of interest has to be selected and
passed to the server. These procedures are not as easy as we think. Several methods may be
performed on the data as part of selection, integration, and cleaning.
Database or Data Warehouse Server:
The database or data warehouse server contains the original data that is ready to be processed. Hence, the server is responsible for retrieving the relevant data based on the user's data mining request.
Data Mining Engine:
The data mining engine is a major component of any data mining system. It contains several modules for performing data mining tasks, including association, characterization, classification, clustering, prediction, time-series analysis, etc.
In other words, the data mining engine is the core of the data mining architecture. It comprises the instruments and software used to obtain insights and knowledge from data collected from various data sources and stored within the data warehouse.
Pattern Evaluation Module:
The pattern evaluation module is primarily responsible for evaluating patterns using a threshold value. It collaborates with the data mining engine to focus the search on interesting patterns.
This module commonly employs interestingness measures that cooperate with the data mining modules to focus the search towards interesting patterns. It may use an interestingness threshold to filter out discovered patterns. Alternatively, the pattern evaluation module may be integrated with the mining module, depending on the implementation of the data mining techniques used. For efficient data mining, it is highly recommended to push the evaluation of pattern interestingness as deep as possible into the mining procedure to confine the search to only interesting patterns.
Graphical User Interface:
The graphical user interface (GUI) module communicates between the data mining system and the user. This module helps the user to easily and efficiently use the system without knowing the complexity of the process. This module cooperates with the data mining system when the user specifies a query or a task and displays the results.
Knowledge Base:
The knowledge base is helpful throughout the data mining process. It can be used to guide the search or to evaluate the interestingness of the resulting patterns. The knowledge base may even contain user views and data from user experiences that can be helpful in the data mining process. The data mining engine may receive inputs from the knowledge base to make the results more accurate and reliable. The pattern evaluation module regularly interacts with the knowledge base to get inputs and also to update it.
5. Data Cleaning in Data Mining
Data cleaning is a crucial process in Data Mining. It plays an important part in building a model. Data cleaning is a necessary process, but it is often neglected. Data quality is the main issue in quality information management; data quality problems can occur anywhere in an information system, and they are addressed through data cleaning.
Generally, data cleaning reduces errors and improves data quality. Correcting errors in data and
eliminating bad records can be a time-consuming and tedious process, but it cannot be ignored.
Data mining is a key technique for data cleaning. Data mining is a technique for discovering
interesting information in data. Data quality mining is a recent approach that applies data mining techniques to identify and repair data quality problems in large databases. Data mining
automatically extracts hidden and intrinsic information from the collections of data. Data mining
has various techniques that are suitable for data cleaning.
Understanding and correcting the quality of your data is imperative in getting to an accurate final
analysis. The data needs to be prepared to discover crucial patterns. Data mining is considered
exploratory. Data cleaning in data mining allows the user to discover inaccurate or incomplete
data before the business analysis and insights.
In most cases, data cleaning in data mining can be a laborious process and typically requires IT
resources to help in the initial step of evaluating your data because data cleaning before data
mining is so time-consuming. But without proper data quality, your final analysis will suffer
inaccuracy, or you could potentially arrive at the wrong conclusion.
While the techniques used for data cleaning may vary according to the types of data your
company stores, you can follow these basic steps to cleaning your data, such as:
1. Remove unwanted observations
Remove unwanted observations from your dataset, including duplicate and irrelevant observations. Duplicate observations most often arise during data collection. When you
combine data sets from multiple places, scrape data, or receive data from clients or multiple
departments, there are opportunities to create duplicate data. De-duplication is one of the largest
areas to be considered in this process. Irrelevant observations are when you notice observations
that do not fit into the specific problem you are trying to analyze.
For example, if you want to analyze data regarding millennial customers, but your dataset
includes older generations, you might remove those irrelevant observations. This can make
analysis more efficient, minimize distraction from your primary target, and create a more manageable and better-performing dataset.
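A minimal pandas sketch of this step, assuming a hypothetical customer table whose column names (customer_id, birth_year, spend) are invented for illustration:

import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "birth_year": [1992, 1992, 1958, 1995],
    "spend": [120.0, 120.0, 80.0, 45.5],
})

df = df.drop_duplicates()                                # remove duplicate observations
millennials = df[df["birth_year"].between(1981, 1996)]   # drop observations irrelevant to the task
print(millennials)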
2. Fix structural errors
Structural errors arise when you measure or transfer data and notice strange naming conventions, typos, or incorrect capitalization. These inconsistencies can cause mislabeled categories or classes. For example, you may find both "N/A" and "Not Applicable" in the same sheet, but they should be analyzed as the same category.
3. Manage unwanted outliers
Often, there will be one-off observations that, at a glance, do not appear to fit within the data you are analyzing. If you have a legitimate reason to remove an outlier, such as improper data entry, doing so will improve the performance of the data you are working with.
However, sometimes, the appearance of an outlier will prove a theory you are working on. And
just because an outlier exists doesn't mean it is incorrect. This step is needed to determine the
validity of that number. If an outlier proves to be irrelevant for analysis or is a mistake, consider
removing it.
4. Handle missing data
You can't ignore missing data, because many algorithms will not accept missing values. There are a couple of ways to deal with missing data (a short pandas sketch follows this list). Neither is optimal, but both can be considered:
You can drop observations with missing values, but this will drop or lose information, so
be careful before removing it.
You can input missing values based on other observations; again, there is an opportunity
to lose the integrity of the data because you may be operating from assumptions and not
actual observations.
You might alter how the data is used to navigate null values effectively.
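The first two options above can be sketched with pandas as follows; the sales table and its column names are assumptions for illustration only:

import pandas as pd

df = pd.DataFrame({"store": ["A", "B", "C"], "sales": [200.0, None, 150.0]})

dropped = df.dropna(subset=["sales"])          # option 1: drop observations with missing values
imputed = df.assign(                           # option 2: impute from the other observations
    sales=df["sales"].fillna(df["sales"].mean())
)
print(dropped)
print(imputed)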
5. Validate and QA
At the end of the data cleaning process, you should be able to answer some basic validation questions, such as: Does the data make sense? Does it follow the appropriate rules for its field? Does it prove or disprove your working theory, or bring any insight to light?
The following steps show the process of data cleaning in data mining.
1. Monitor the errors: Keep a record of where most mistakes arise. This will make it easier to identify and correct false or corrupt information, which is especially important when integrating another data source with established management software.
2. Standardize the mining process: Standardize the point of entry to help reduce the chances of duplication.
3. Validate data accuracy: Analyze and invest in data tools that can clean records in real time. Some tools use artificial intelligence to better check for correctness.
4. Scrub for duplicate data: Identify duplicates to save time when analyzing data. Repeated processing of the same data can be avoided by analyzing and investing in separate data-cleaning tools that can process raw data in bulk and automate the operation.
5. Research the data: Before this activity, the data must be standardized, validated, and scrubbed for duplicates. There are many third-party sources, and these approved and authorized sources can capture information directly from our databases. They help us clean and compile the data to ensure completeness, accuracy, and reliability for business decision-making.
6. Communicate with the team: Keeping the team in the loop will help develop and strengthen client relationships and allow more targeted information to be sent to prospective customers.
6. Data Transformation in Data Mining
Raw data is difficult to trace or understand, which is why it needs to be preprocessed before retrieving any information from it. Data transformation is a technique used to convert the raw data into a suitable format that eases data mining and the retrieval of strategic information. Data transformation includes data cleaning and data reduction techniques to convert the data into the appropriate form.
Data transformation is an essential data preprocessing technique that must be performed on the
data before data mining to provide patterns that are easier to understand.
Data transformation changes the format, structure, or values of the data and converts them into
clean, usable data. Data may be transformed at two stages of the data pipeline for data analytics
projects. Organizations that use on-premises data warehouses generally use an ETL (extract,
transform, and load) process, in which data transformation is the middle step. Today, most
organizations use cloud-based data warehouses to scale compute and storage resources with
latency measured in seconds or minutes. The scalability of the cloud platform lets organizations
skip preload transformations and load raw data into the data warehouse, then transform it at
query time.
Data integration, migration, data warehousing, and data wrangling may all involve data transformation. Data transformation increases the efficiency of business and analytic processes, and it enables businesses to make better data-driven decisions. During the data transformation process, an analyst determines the structure of the data and the transformations it requires.
There are several data transformation techniques that can help structure and clean up the data
before analysis or storage in a data warehouse. Let's study all techniques used for data
transformation, some of which we have already studied in data reduction and data cleaning.
1. Data Smoothing
Data smoothing is a process that is used to remove noise from the dataset using some algorithms.
It helps highlight important features present in the dataset and assists in predicting patterns. When collecting data, the data can be manipulated to eliminate or reduce variance and other forms of noise.
The concept behind data smoothing is that it identifies simple changes that help predict trends and patterns. This helps analysts or traders who need to look at a lot of data, which can often be difficult to digest, to find patterns they would not see otherwise.
Noise can be removed from the data using techniques such as binning, regression, and clustering (a short binning sketch follows this list):
Binning: This method splits the sorted data into a number of bins and smooths the data values in each bin by considering the neighboring values around them.
Regression: This method identifies the relationship between two attributes so that, given one attribute, it can be used to predict the other.
Clustering: This method groups similar data values into clusters. Values that lie outside any cluster are known as outliers.
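As a sketch of smoothing by bin means, the snippet below sorts a small list of assumed values, splits it into equal-frequency bins, and replaces each value by its bin's mean:

# Smoothing by bin means: sort, split into equal-frequency bins, replace by bin means.
values = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])   # toy data (assumed)
n_bins = 3
bin_size = len(values) // n_bins

smoothed = []
for i in range(0, len(values), bin_size):
    bin_vals = values[i:i + bin_size]
    bin_mean = sum(bin_vals) / len(bin_vals)
    smoothed.extend([bin_mean] * len(bin_vals))

print(smoothed)   # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]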
2. Attribute Construction
In the attribute construction method, new attributes are derived from the existing attributes to construct a data set that eases data mining. The new attributes created from the given attributes assist the mining process, simplify the original data, and make mining more efficient.
For example, suppose we have a data set referring to measurements of different plots, i.e., we
may have the height and width of each plot. So here, we can construct a new attribute 'area' from the attributes 'height' and 'width'. This also helps in understanding the relations among the attributes in a
data set.
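A minimal pandas sketch of this idea, assuming a hypothetical table of plot measurements:

import pandas as pd

plots = pd.DataFrame({"height": [10.0, 12.5, 8.0], "width": [4.0, 3.0, 6.5]})
plots["area"] = plots["height"] * plots["width"]   # construct the new attribute 'area'
print(plots)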
3. Data Aggregation
Data collection or aggregation is the method of storing and presenting data in a summary format.
The data may be obtained from multiple data sources to integrate these data sources into a data
analysis description. This is a crucial step since the accuracy of data analysis insights is highly
dependent on the quantity and quality of the data used.
Gathering accurate data of high quality and a large enough quantity is necessary to produce
relevant results. The collection of data is useful for everything from decisions concerning
financing or business strategy of the product, pricing, operations, and marketing strategies.
For example, we have a data set of sales reports of an enterprise that has quarterly sales of each
year. We can aggregate the data to get the enterprise's annual sales report.
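A short pandas sketch of the quarterly-to-annual aggregation just described; the figures are assumptions for illustration:

import pandas as pd

sales = pd.DataFrame({
    "year": [2021, 2021, 2021, 2021, 2022, 2022, 2022, 2022],
    "quarter": [1, 2, 3, 4, 1, 2, 3, 4],
    "amount": [200, 250, 180, 300, 220, 270, 190, 310],
})
annual = sales.groupby("year", as_index=False)["amount"].sum()   # aggregate quarters into years
print(annual)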
4. Data Normalization
Normalizing the data refers to scaling the data values to a much smaller range such as [-1, 1] or
[0.0, 1.0]. There are different methods to normalize the data, as discussed below.
Consider a numeric attribute A for which we have n observed values V1, V2, V3, ..., Vn.
Min-max normalization: This method implements a linear transformation on the original data. Let minA and maxA be the minimum and maximum values observed for attribute A, and let Vi be a value of attribute A that has to be normalized.
Min-max normalization maps Vi to V'i in a new, smaller range [new_minA, new_maxA]. The formula for min-max normalization is:
V'i = ((Vi - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA
For example, suppose $12,000 and $98,000 are the minimum and maximum values for the attribute income, and [0.0, 1.0] is the range to which we have to map the value $73,600.
The value $73,600 would be transformed using min-max normalization as follows:
V' = ((73,600 - 12,000) / (98,000 - 12,000)) * (1.0 - 0.0) + 0.0 = 0.716
Z-score normalization: This method normalizes the values of attribute A using the mean and standard deviation. The formula for Z-score normalization is:
V'i = (Vi - Ā) / σA
Here Ā and σA are the mean and standard deviation of attribute A, respectively.
For example, suppose the mean and standard deviation for attribute A are $54,000 and $16,000, and we have to normalize the value $73,600 using z-score normalization:
V' = (73,600 - 54,000) / 16,000 = 1.225
Decimal scaling: This method normalizes the values of attribute A by moving the decimal point. How far the decimal point is moved depends on the maximum absolute value of A. The formula for decimal scaling is:
V'i = Vi / 10^j
where j is the smallest integer such that max(|V'i|) < 1. For example, if the values of A range from -986 to 917, the maximum absolute value is 986, so each value is divided by 1,000 (j = 3), and -986 normalizes to -0.986.
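The three normalization methods can be sketched as small Python functions. The income figures below reuse the worked examples above; the decimal-scaling value (986) is an extra assumed value for illustration:

import math

def min_max(v, lo, hi, new_lo=0.0, new_hi=1.0):
    # Min-max normalization of v from [lo, hi] to [new_lo, new_hi].
    return (v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo

def z_score(v, mean, std):
    # Z-score normalization using the attribute's mean and standard deviation.
    return (v - mean) / std

def decimal_scaling(v, max_abs):
    # Divide by 10^j, where j is the smallest integer making all |values| < 1.
    j = math.floor(math.log10(max_abs)) + 1 if max_abs >= 1 else 0
    return v / (10 ** j)

print(round(min_max(73600.0, lo=12000.0, hi=98000.0), 3))     # 0.716
print(round(z_score(73600.0, mean=54000.0, std=16000.0), 3))  # 1.225
print(decimal_scaling(986.0, max_abs=986.0))                  # 0.986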
5. Data Discretization
This is a process of converting continuous data into a set of data intervals. Continuous attribute
values are substituted by small interval labels. This makes the data easier to study and analyze. If a data mining task handles a continuous attribute, its numeric values can be replaced by these interval labels, which improves the efficiency of the task.
This method is also called a data reduction mechanism as it transforms a large dataset into a set
of categorical data. Discretization also uses decision tree-based algorithms to produce short,
compact, and accurate results when using discrete values.
Data discretization can be classified into two types: supervised discretization, where class information is used, and unsupervised discretization, where it is not. Discretization can also be characterized by the direction in which the process proceeds, i.e., a 'top-down splitting strategy' or a 'bottom-up merging strategy'.
For example, the values for the age attribute can be replaced by the interval labels such as (0-10,
11-20…) or (kid, youth, adult, senior).
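A short pandas sketch of this example, with assumed bin boundaries for the age groups:

import pandas as pd

ages = pd.Series([4, 15, 27, 45, 63, 81])
labels = pd.cut(ages, bins=[0, 12, 19, 59, 120],
                labels=["kid", "youth", "adult", "senior"])
print(labels.tolist())   # ['kid', 'youth', 'adult', 'adult', 'senior', 'senior']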
6. Data Generalization
It converts low-level data attributes to high-level data attributes using concept hierarchy. This
conversion from a lower level to a higher conceptual level is useful to get a clearer picture of the
data. Data generalization can be divided into two approaches: the data cube (OLAP-based) approach and the attribute-oriented induction approach.
For example, age data may appear as raw values such as 20 or 30 in a dataset; it can be transformed to a higher conceptual level such as the categorical values (young, old).
7. Data Reduction in Data Mining
Data mining is applied to selected data in very large databases. When data analysis and mining are performed on a huge amount of data, they take a very long time, making the process impractical and infeasible.
Data reduction techniques ensure the integrity of data while reducing the data. Data reduction is
a process that reduces the volume of original data and represents it in a much smaller volume.
Data reduction techniques are used to obtain a reduced representation of the dataset that is much
smaller in volume by maintaining the integrity of the original data. By reducing the data, the
efficiency of the data mining process is improved, which produces the same analytical results.
Data reduction does not affect the result obtained from data mining. That means the result
obtained from data mining before and after data reduction is the same or almost the same.
Data reduction aims to represent the data more compactly. When the data size is smaller, it is simpler to apply sophisticated and computationally expensive algorithms. The reduction of the data may be in terms of the number of rows (records) or the number of columns (dimensions).
Here are the following techniques or methods of data reduction in data mining, such as:
1. Dimensionality Reduction
Whenever we encounter weakly relevant or redundant attributes, we keep only the attributes required for our analysis. Dimensionality reduction eliminates such attributes from the data set under consideration, thereby reducing the volume of the original data. It reduces data size by eliminating outdated or redundant features. Three common methods of dimensionality reduction are wavelet transforms, principal component analysis (PCA), and attribute subset selection.
2. Numerosity Reduction
Numerosity reduction reduces the original data volume and represents it in a much smaller form. This technique includes two types: parametric and non-parametric numerosity reduction.
i. Parametric: Parametric numerosity reduction incorporates storing only data parameters
instead of the original data. One method of parametric numerosity reduction is the
regression and log-linear method.
o Regression and Log-Linear: Linear regression models a relationship between two attributes by fitting a linear equation to the data set. Suppose we need to model a linear function between two attributes:
y = wx + b
Here, y is the response attribute, and x is the predictor attribute. If we discuss in
terms of data mining, attribute x and attribute y are the numeric database
attributes, whereas w and b are regression coefficients.
Multiple linear regression models the response variable y as a linear function of two or more predictor variables.
The log-linear model discovers relationships between two or more discrete attributes in the database. Suppose we have a set of tuples in n-dimensional space; the log-linear model can then be used to estimate the probability of each tuple in that multidimensional space.
Regression and log-linear methods can be used for sparse data and skewed data.
ii. Non-Parametric: A non-parametric numerosity reduction technique does not assume any model. Non-parametric techniques result in a more uniform reduction, irrespective of data size, but they may not achieve as high a volume of data reduction as parametric ones. Common non-parametric data reduction techniques include histograms, clustering, sampling, data cube aggregation, and data compression.
o Histogram: A histogram is a graph that represents a frequency distribution, describing how often each value appears in the data. A histogram uses binning to represent an attribute's data distribution; it partitions the values into disjoint subsets called bins or buckets.
A histogram can represent dense, sparse, uniform, or skewed data. Instead of only one attribute, a histogram can also be built over multiple attributes; it can effectively represent up to five attributes.
o Clustering: Clustering techniques group similar objects from the data so that the objects in a cluster are similar to each other but dissimilar to objects in other clusters.
How similar the objects inside a cluster are can be calculated using a distance function: the more similar two objects are, the closer they appear within the cluster.
The quality of a cluster depends on the diameter of the cluster, i.e., the maximum distance between any two objects in the cluster.
The cluster representations replace the original data. This technique is more effective if the data can be grouped into distinct clusters.
o Sampling: One of the methods used for data reduction is sampling, as it can reduce a large data set to a much smaller data sample. Below we discuss the different methods by which we can sample a large data set D containing N tuples (a pandas sketch follows this list):
1. Simple random sample without replacement (SRSWOR) of size s: Here, s tuples are drawn from the N tuples of data set D (s < N). The probability of drawing any tuple from D is 1/N, which means all tuples have an equal probability of being sampled.
2. Simple random sample with replacement (SRSWR) of size s: This is similar to SRSWOR, but each tuple drawn from data set D is recorded and then placed back into D so that it can be drawn again.
3. Cluster sample: The tuples in data set D are clustered into M mutually
disjoint subsets. The data reduction can be applied by implementing
SRSWOR on these clusters. A simple random sample of size s could be
generated from these clusters where s<M.
4. Stratified sample: The large data set D is partitioned into mutually
disjoint sets called 'strata'. A simple random sample is taken from each
stratum to get stratified data. This method is effective for skewed data.
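The sampling schemes above can be sketched with pandas; the tiny data set D and the sample sizes are assumptions for illustration:

import pandas as pd

D = pd.DataFrame({"id": range(1, 11), "region": ["north"] * 5 + ["south"] * 5})

srswor = D.sample(n=4, replace=False, random_state=0)   # simple random sample without replacement
srswr = D.sample(n=4, replace=True, random_state=0)     # simple random sample with replacement
stratified = (
    D.groupby("region", group_keys=False)
     .apply(lambda g: g.sample(n=2, random_state=0))    # sample within each stratum
)
print(srswor, srswr, stratified, sep="\n\n")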
3. Data Cube Aggregation
This technique is used to aggregate data in a simpler form. Data cube aggregation is a multidimensional aggregation that uses aggregation at various levels of a data cube to represent the original data set, thus achieving data reduction.
For example, suppose you have the data of All Electronics sales per quarter for the year 2018 to
the year 2022. If you want to get the annual sale per year, you just have to aggregate the sales per
quarter for each year. In this way, aggregation provides you with the required data, which is
much smaller in size, and thereby we achieve data reduction even without losing any data.
Data cube aggregation is a multidimensional aggregation that eases multidimensional analysis. The data cube stores precomputed and summarized data, which gives data mining fast access to it.
4. Data Compression
Data compression modifies, encodes, or converts the structure of the data in a way that consumes less space. It involves building a compact representation of information by removing redundancy and representing data in binary form. Compression from which the original data can be restored exactly is called lossless compression; compression from which the original form cannot be fully restored is called lossy compression. Dimensionality and numerosity reduction methods can also be considered forms of data compression.
This technique reduces the size of files using different encoding mechanisms, such as Huffman encoding and run-length encoding. We can divide it into two types based on the compression technique used.
i. Lossless Compression: Encoding techniques such as run-length encoding allow a simple but modest reduction in data size. Lossless data compression uses algorithms to restore the precise original data from the compressed data.
ii. Lossy Compression: In lossy compression, the decompressed data may differ from the original data but are still useful enough to retrieve information from. For example, the JPEG image format uses lossy compression, yet we can find the meaning equivalent to the original image. Methods such as the discrete wavelet transform and principal component analysis (PCA) are examples of lossy compression.
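As a minimal sketch of lossless compression, here is a toy run-length encoder and decoder; the input string is an assumption for illustration:

def rle_encode(text):
    # Encode runs of repeated characters as (character, count) pairs.
    if not text:
        return []
    runs, prev, count = [], text[0], 1
    for ch in text[1:]:
        if ch == prev:
            count += 1
        else:
            runs.append((prev, count))
            prev, count = ch, 1
    runs.append((prev, count))
    return runs

def rle_decode(runs):
    # Restore the original string exactly -- lossless.
    return "".join(ch * count for ch, count in runs)

encoded = rle_encode("aaaabbbcca")
print(encoded)                             # [('a', 4), ('b', 3), ('c', 2), ('a', 1)]
assert rle_decode(encoded) == "aaaabbbcca"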
5. Discretization Operation
The data discretization technique is used to divide attributes of a continuous nature into data with intervals. Many constant values of the attributes are replaced with labels of small intervals, so that the mining results are presented in a concise and easily understandable way.
The main benefit of data reduction is simple: the more data you can fit into a terabyte of disk
space, the less capacity you will need to purchase. Here are some benefits of data reduction, such
as:
Data reduction greatly increases the efficiency of a storage system and directly impacts your total
spending on capacity.
8. Data Mining Task Primitives
Task-relevant data: This is the portion of the database to be investigated. For example, suppose that you are a manager of All Electronics in charge of sales in the United States and Canada, and you would like to study the buying trends of customers in Canada. Rather than mining the entire database, you can specify that only the data relevant to this task be retrieved; the attributes involved are referred to as relevant attributes.
The kinds of knowledge to be mined: This specifies the data mining functions to be performed,
such as characterization, discrimination, association, classification, clustering, or evolution
analysis. For instance, if studying the buying habits of customers in Canada, you may choose to
mine associations between customer profiles and the items that these customers like to buy
Background knowledge: Users can specify background knowledge, or knowledge about the
domain to be mined. This knowledge is useful for guiding the knowledge discovery process, and
for evaluating the patterns found. There are several kinds of background knowledge.
Interestingness measures: These functions are used to separate uninteresting patterns from
knowledge. They may be used to guide the mining process, or after discovery, to evaluate the
discovered patterns. Different kinds of knowledge may have different interestingness measures.
Presentation and visualization of discovered patterns: This refers to the form in which
discovered patterns are to be displayed. Users can choose from different forms for knowledge
presentation, such as rules, tables, charts, graphs, decision trees, and cubes.