Knowledge Discovery in Databases
Knowledge discovery in databases (KDD) is the process of discovering useful
knowledge from a collection of data. Data mining is one step within this broader
process, which also includes data selection and preparation, data cleansing,
incorporating prior knowledge about the data sets, and interpreting accurate
solutions from the observed results.
KDD is a multidisciplinary activity. It encompasses data storage and access,
scaling algorithms to massive data sets, and interpreting results. The data
cleansing and data access processes included in data warehousing facilitate the
KDD process. Artificial intelligence also supports KDD by discovering empirical
laws from experimentation and observations. The patterns recognized in the data
must be valid on new data and hold with some degree of certainty. These patterns
are considered new knowledge.
The steps involved in the entire KDD process are:
1. Data Cleaning
Data cleaning is the removal of noisy and irrelevant data from the
collection.
1. Cleaning in case of missing values.
2. Cleaning noisy data, where noise is a random error or variance in a
measured variable.
3. Cleaning with data discrepancy detection and data
transformation tools.
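The first two cleaning cases can be sketched in a few lines of Python. This is a
minimal illustration, not a prescribed method: filling gaps with the mean and
smoothing by bin means are two common choices among many, and the function names
and sample values below are assumptions made up for the example.

```python
from statistics import mean


def fill_missing(values):
    """Replace None entries with the mean of the observed values.

    Assumes at least one value is present.
    """
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of bin_size, and replace
    every value with the mean of its bin to dampen random noise."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        current_bin = ordered[start:start + bin_size]
        smoothed.extend([mean(current_bin)] * len(current_bin))
    return smoothed


# Made-up sample data; None marks a missing value.
readings = [10, None, 20, None, 30]
print(fill_missing(readings))

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, bin_size=3))
```

Mean imputation is simple but shrinks the variance of the column, so for skewed
data the median is often a safer fill value.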
2. Data Integration
Data integration is the process of combining heterogeneous data from multiple
sources into a common store (data warehouse). It is carried out using data
migration tools, data synchronization tools and the
ETL (Extract-Transform-Load) process.
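As a rough sketch of what integration means in practice, the snippet below merges
records from two sources on a shared key. The source names (crm, billing), the
field names and the key customer_id are hypothetical, and a real ETL pipeline
would also reconcile conflicting values and differing schemas.

```python
def integrate(sources, key):
    """Combine rows from several sources into one record per key value.

    Later sources overwrite earlier ones when the same field appears twice.
    """
    merged = {}
    for rows in sources:
        for row in rows:
            merged.setdefault(row[key], {}).update(row)
    # Return the records ordered by key for a stable result.
    return [merged[k] for k in sorted(merged)]


# Hypothetical extracts from two operational systems.
crm = [
    {"customer_id": 1, "name": "Ana"},
    {"customer_id": 2, "name": "Ben"},
]
billing = [
    {"customer_id": 2, "balance": 40.0},
    {"customer_id": 3, "balance": 12.5},
]

for record in integrate([crm, billing], key="customer_id"):
    print(record)
```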
3. Data Selection
Data selection is the process where data relevant to the analysis is
decided and retrieved from the data collection. For this we can use neural
networks, decision trees, Naive Bayes, clustering, and regression methods.
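In its simplest form, selection is a filter over rows plus a projection onto the
attributes the analysis needs. The sketch below shows only that basic case; the
helper name select, the field names and the sample rows are assumptions for
illustration, and model-based selection is not shown.

```python
def select(rows, columns, predicate):
    """Keep rows that satisfy predicate and only the listed columns."""
    return [{c: row[c] for c in columns} for row in rows if predicate(row)]


# Made-up order records.
orders = [
    {"order_id": 1, "region": "EU", "amount": 120.0, "notes": "gift"},
    {"order_id": 2, "region": "US", "amount": 80.0, "notes": ""},
    {"order_id": 3, "region": "EU", "amount": 45.0, "notes": "rush"},
]

# Task-relevant data: EU orders, without the free-text notes field.
eu_orders = select(
    orders,
    columns=["order_id", "amount"],
    predicate=lambda row: row["region"] == "EU",
)
print(eu_orders)
```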
4. Data Transformation
Data transformation is the process of transforming data into the
form required by the mining procedure. It is a two-step
process:
1. Data mapping: assigning elements from the source base to the destination
to capture transformations.
2. Code generation: creation of the actual transformation program.
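The two steps can be illustrated with a small sketch: a mapping table that pairs
each source field with a destination field and a conversion, and a function that
applies it. The field names (cust_nm, amt_usd) and helpers are hypothetical.
Min-max normalization is included as one common example of reshaping values for
a mining algorithm, not as a required step.

```python
# Data mapping: source field -> (destination field, conversion function).
FIELD_MAP = {
    "cust_nm": ("name", str.strip),
    "amt_usd": ("amount", float),
}


def transform(record, field_map):
    """Apply the mapping to one source record and return the destination record."""
    return {
        dest: convert(record[src])
        for src, (dest, convert) in field_map.items()
    }


def min_max(values):
    """Rescale numeric values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


raw = {"cust_nm": "  Ana ", "amt_usd": "12.50"}
print(transform(raw, FIELD_MAP))
print(min_max([10, 20, 30]))
```

Keeping the mapping as data rather than hard-coded logic makes it easier to
review and to regenerate the transformation program when the schema changes.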
5. Data Mining
Data mining is the application of techniques to extract potentially useful
patterns. It transforms task-relevant data into patterns, and decides the
purpose of the model using classification or characterization.
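As one concrete instance of pattern extraction, the sketch below counts how often
pairs of items occur together and keeps the pairs whose support meets a
threshold. Frequent-pair counting is chosen here only for illustration; the
function name, the threshold and the basket data are assumptions.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Return item pairs whose support (share of transactions that
    contain both items) is at least min_support."""
    counts = Counter()
    for items in transactions:
        # Sort and de-duplicate so each pair has one canonical form.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    total = len(transactions)
    return {
        pair: count / total
        for pair, count in counts.items()
        if count / total >= min_support
    }


# Made-up market baskets.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
    ["bread", "milk"],
]
print(frequent_pairs(baskets, min_support=0.5))
```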
6. Pattern Evaluation
Pattern evaluation is the identification of the truly interesting patterns
representing knowledge, based on given measures. It finds an interestingness
score for each pattern, and uses summarization and visualization to make the
data understandable to the user.
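One way to score interestingness, assuming the patterns are association rules of
the form "antecedent implies consequent", is to compute support, confidence and
lift. These are standard measures, but which measure fits depends on the task;
the function name and basket data below are made up for the example.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Compute support, confidence and lift for antecedent -> consequent."""
    a, c = set(antecedent), set(consequent)
    total = len(transactions)
    n_a = sum(1 for t in transactions if a <= set(t))
    n_c = sum(1 for t in transactions if c <= set(t))
    n_both = sum(1 for t in transactions if (a | c) <= set(t))
    if n_a == 0 or n_c == 0:
        raise ValueError("antecedent and consequent must each occur at least once")
    support = n_both / total          # share of baskets containing both sides
    confidence = n_both / n_a         # share of antecedent baskets with consequent
    lift = confidence / (n_c / total) # above 1 suggests positive association
    return {"support": support, "confidence": confidence, "lift": lift}


# Made-up market baskets.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
    ["bread", "milk"],
]
print(rule_metrics(baskets, ["bread"], ["milk"]))
```

A rule can have high confidence and still be uninteresting if the consequent is
common anyway, which is why lift compares confidence with the consequent's
baseline frequency.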
7. Knowledge Representation
This involves presenting the results in a way that is meaningful and can be
used to make decisions.
Note: KDD is an iterative process in which evaluation measures can be
enhanced, mining can be refined, and new data can be integrated and
transformed in order to get different and more appropriate
results. Preprocessing of databases consists of data cleaning and data
integration.
Advantages of KDD
1. Improves decision-making: KDD provides valuable insights and
knowledge that can help organizations make better decisions.
2. Increased efficiency: KDD automates repetitive and
time-consuming tasks and makes the data ready for analysis, which
saves time and money.
3. Better customer service: KDD helps organizations gain a better
understanding of their customers’ needs and preferences, which can
help them provide better customer service.
4. Fraud detection: KDD can be used to detect fraudulent activities by
identifying patterns and anomalies in the data that may indicate
fraud.
5. Predictive modeling: KDD can be used to build predictive models
that can forecast future trends and patterns.
Disadvantages of KDD
1. Privacy concerns: KDD can raise privacy concerns as it involves
collecting and analyzing large amounts of data, which can include
sensitive information about individuals.
2. Complexity: KDD can be a complex process that requires
specialized skills and knowledge to implement and interpret the
results.
3. Unintended consequences: KDD can lead to unintended
consequences, such as bias or discrimination, if the data or models
are not properly understood or used.
4. Data quality: The KDD process depends heavily on the quality of the data;
if the data is not accurate or consistent, the results can be misleading.
5. High cost: KDD can be an expensive process, requiring significant
investments in hardware, software, and personnel.