Unit 1
Data Mining
Data mining got its start in the 1950s, when the first computers were created and put to use for mathematical and scientific research. As computing power and data storage methods advanced, researchers began exploring ways to use computers to analyze and draw conclusions from massive data sets.
Dr. Herbert Simon, a Nobel laureate in economics and widely regarded as one of the founders of artificial intelligence, was among the earliest and most significant pioneers of data mining. In the 1950s and 1960s, Simon and his colleagues created a variety of algorithms and methods for drawing insightful conclusions and valuable information from data, such as decision trees, classification, and clustering.
As data mining continued to advance in the 1980s and 1990s, new methods and algorithms were created to handle the difficulties of working with big, complicated data sets. Applying data mining techniques to an organization's data became simpler with the introduction of data mining platforms and software such as SAS, SPSS, and RapidMiner.
Data mining is the process of discovering patterns, trends, and useful information from large
datasets. It involves using methods from various fields like statistics, machine learning, and
database systems to extract knowledge that can be used for decision-making and other purposes.
What Kind of Information Are We Collecting?
1. Customer Data:
Demographics: Age, gender, location, income, education, etc.
Transactional Data: Purchase history, website browsing activity, app usage, customer
service interactions.
Behavioral Data: Online behavior, product preferences, responses to marketing
campaigns.
Social Media Data: Publicly available information from social media profiles, including
interests, connections, and opinions.
2. Business Data:
Sales Data: Revenue, sales volume, product performance, sales channels.
Financial Data: Stock prices, market trends, economic indicators.
Operational Data: Supply chain information, manufacturing processes, logistics.
Human Resources Data: Employee demographics, performance reviews, training
records.
3. Sensor Data:
Environmental Data: Temperature, humidity, air quality, weather conditions.
Machine Data: Data from industrial equipment, vehicles, and other devices, including
performance metrics, maintenance records, and error logs.
Medical Data: Patient vital signs, medical images, electronic health records.
4. Web Data:
Website Traffic Data: Page views, click-through rates, bounce rates, user navigation
patterns.
Search Engine Data: Search queries, search results, website rankings.
Social Media Data: Posts, comments, shares, likes, and other interactions on social
media platforms.
5. Multimedia Data:
Image Data: Photos, videos, medical images.
Audio Data: Music, speech recordings, sound effects.
Video Data: Movies, TV shows, surveillance footage.
Motivation Behind Data Mining
1. The Data Explosion:
Increased data generation: We live in a world of ever-increasing data. From social
media interactions and online transactions to sensor readings and scientific experiments,
data is being generated at an unprecedented rate. This sheer volume of data makes it
impossible for humans to analyze it manually, creating a need for automated data mining
techniques.
Need for knowledge discovery: Hidden within this massive data are valuable insights,
patterns, and trends that can drive better decision-making and innovation. Data mining
provides the tools to extract this knowledge.
2. Business Needs:
Competitive advantage: In today's competitive business environment, organizations
need to make informed decisions quickly. Data mining helps businesses understand their
customers, markets, and operations better, giving them a competitive edge.
Improved customer relationship management (CRM): By analyzing customer data,
businesses can personalize marketing campaigns, improve customer service, and build
stronger customer relationships.
Fraud detection: Data mining techniques can identify patterns indicative of fraudulent
activities, helping businesses prevent losses and protect their assets.
Risk management: Data mining can help organizations assess and manage risks by
identifying potential problems and predicting future outcomes.
3. Scientific and Technological Advancements:
Advancements in computing power: The development of powerful computers and
distributed computing systems has made it possible to process and analyze massive
datasets efficiently.
Development of sophisticated algorithms: Researchers have developed increasingly
sophisticated data mining algorithms that can discover complex patterns and relationships
in data.
Advances in database technology: The development of advanced database management
systems and data warehousing technologies has provided the infrastructure for storing
and managing large datasets.
4. Societal Needs:
Improved healthcare: Data mining can help healthcare providers improve patient care,
develop new treatments, and predict disease outbreaks.
Enhanced security: Data mining can be used to detect and prevent terrorist attacks,
cybercrime, and other security threats.
Environmental protection: Data mining can help us understand and address
environmental challenges such as climate change and pollution.
“Data Mining” can be referred to as knowledge mining from data, knowledge extraction, data/pattern analysis, data archaeology, and data dredging. Data mining, also known as Knowledge Discovery in Databases, refers to the nontrivial extraction of implicit, previously unknown, and potentially useful information from data stored in databases.
The need for data mining is to extract useful information from large datasets and use it to make predictions or support better decision-making. Nowadays, data mining is used in almost all places where a large amount of data is stored and processed.
Examples include the banking sector, market basket analysis, and network intrusion detection.
KDD Process
KDD (Knowledge Discovery in Databases) is a process that involves the extraction of useful, previously unknown, and potentially valuable information from large datasets. KDD is an iterative process: it usually requires multiple passes through its steps to extract accurate knowledge from the data. The KDD process includes the following steps:
Data Cleaning
Data cleaning is defined as the removal of noisy and irrelevant data from the collection. It includes:
1. Handling missing values.
2. Smoothing noisy data, where noise is a random error or variance in a measured variable.
3. Detecting and resolving discrepancies with data discrepancy detection and data transformation tools.
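The first two cleaning steps can be sketched in a few lines of Python. This is a minimal, illustrative sketch only: it fills missing values with the attribute mean, and smooths noisy values by replacing each bin of sorted values with the bin mean ("binning"); the sample ages and prices are invented for the example.

```python
def fill_missing(values, missing=None):
    """Replace missing entries with the mean of the known values."""
    known = [v for v in values if v is not missing]
    mean = sum(known) / len(known)
    return [mean if v is missing else v for v in values]

def smooth_by_bin_means(values, bin_size):
    """Smooth noise by replacing each bin of sorted values with its mean."""
    ordered = sorted(values)
    smoothed = []
    for i in range(0, len(ordered), bin_size):
        bin_ = ordered[i:i + bin_size]
        smoothed.extend([sum(bin_) / len(bin_)] * len(bin_))
    return smoothed

ages = [23, None, 31, 27, None, 29]          # invented sample with gaps
print(fill_missing(ages))                    # missing ages -> mean 27.5

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26]   # invented noisy sample
print(smooth_by_bin_means(prices, 3))        # each bin of 3 -> its mean
```

In practice, whether to fill, smooth, or simply drop bad records depends on the dataset and the mining task.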
Data Integration
Data integration is defined as combining heterogeneous data from multiple sources into a common store (a data warehouse). Data integration is performed using data migration tools, data synchronization tools, and the ETL (Extract, Transform, Load) process.
Data Selection
Data selection is defined as the process of deciding which data is relevant to the analysis and retrieving it from the data collection. Methods such as neural networks, decision trees, naive Bayes, clustering, and regression can support this step.
Data Transformation
Data transformation is defined as the process of transforming data into the form required by the mining procedure. Data transformation is a two-step process:
1. Data mapping: assigning elements from the source to the destination to capture the transformations.
2. Code generation: creating the actual transformation program.
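One common transformation applied at this stage is normalization. The sketch below, with invented income figures, shows min-max normalization, which rescales a numeric attribute linearly into the range [0, 1] so that attributes with different units become comparable.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly into [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    return [(v - old_min) / span * (new_max - new_min) + new_min
            for v in values]

incomes = [12000, 73600, 98000]   # invented annual incomes
print(min_max_normalize(incomes)) # smallest -> 0.0, largest -> 1.0
```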
Data Mining
Data mining is defined as the step in which techniques are applied to extract potentially useful patterns. It transforms task-relevant data into patterns and decides the purpose of the model, for example classification or characterization.
Pattern Evaluation
Pattern evaluation is defined as identifying the truly interesting patterns representing knowledge, based on given interestingness measures. It computes an interestingness score for each pattern and uses summarization and visualization to make the results understandable to the user.
Knowledge Representation
This involves presenting the results in a way that is meaningful and can be used to make
decisions.
Note: KDD is an iterative process in which evaluation measures can be enhanced, mining can be refined, and new data can be integrated and transformed in order to get different and more appropriate results. Preprocessing of databases consists of data cleaning and data integration.
Advantages of KDD
1. Improves decision-making: KDD provides valuable insights and knowledge that can help
organizations make better decisions.
2. Increased efficiency: KDD automates repetitive and time-consuming tasks and makes the
data ready for analysis, which saves time and money.
3. Better customer service: KDD helps organizations gain a better understanding of their
customers’ needs and preferences, which can help them provide better customer service.
4. Fraud detection: KDD can be used to detect fraudulent activities by identifying patterns
and anomalies in the data that may indicate fraud.
5. Predictive modeling: KDD can be used to build predictive models that can forecast future
trends and patterns.
Disadvantages of KDD
1. Privacy concerns: KDD can raise privacy concerns as it involves collecting and analyzing
large amounts of data, which can include sensitive information about individuals.
2. Complexity: KDD can be a complex process that requires specialized skills and knowledge
to implement and interpret the results.
3. Unintended consequences: KDD can lead to unintended consequences, such as bias or
discrimination, if the data or models are not properly understood or used.
4. Data Quality: The KDD process depends heavily on the quality of the data; if the data is not accurate or consistent, the results can be misleading.
5. High cost: KDD can be an expensive process, requiring significant investments in
hardware, software, and personnel.
6. Overfitting: The KDD process can lead to overfitting, a common problem in machine learning where a model learns the detail and noise in the training data to the extent that it negatively impacts performance on new, unseen data.
Difference between KDD and Data Mining
Definition:
KDD refers to the process of identifying valid, novel, potentially useful, and ultimately understandable patterns and relationships in data. Data mining refers to the process of extracting useful and valuable information or patterns from large data sets.

Objective:
KDD aims to find useful knowledge from data. Data mining aims to extract useful information from data.

Techniques Used:
KDD covers data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge representation and visualization. Data mining uses association rules, classification, clustering, regression, decision trees, neural networks, and dimensionality reduction.

Output:
KDD produces structured information, such as rules and models, that can be used to make decisions or predictions. Data mining produces patterns, associations, or insights that can be used to improve decision-making or understanding.

Focus:
KDD focuses on the discovery of useful knowledge, rather than simply finding patterns in data. Data mining focuses on the discovery of patterns or relationships in data.

Role of Domain Expertise:
Domain expertise is important in KDD, as it helps in defining the goals of the process, choosing appropriate data, and interpreting the results. Domain expertise is less critical in data mining, as the algorithms are designed to identify patterns without relying on prior knowledge.
Data Mining Architecture
The general layout and composition of a data mining system are referred to as its data mining
architecture. In order to complete data mining activities and extract valuable insights and
information from data, a data mining architecture usually consists of a number of essential
components. A typical data mining architecture's essential elements include the following:
Data sources: The sources of data used in data mining are known as data sources. These may
consist of both structured and unstructured data from files, databases, sensors, and other sources.
To produce a useful data collection for analysis, data sources supply the raw data required in data
mining, which can then be cleaned, processed, and transformed.
Data Preprocessing: The process of getting data ready for analysis is known as data preprocessing. Usually, this entails cleaning and converting the data to remove errors, inconsistencies, and unnecessary information. Data preprocessing is a crucial stage in data mining, as it ensures that the data is high-quality and prepared for analysis.
Data Mining Algorithms: These are the models and algorithms that are used to carry out data mining. They can be supervised or unsupervised learning algorithms, such as classification, regression, and clustering, as well as more task-specific algorithms like anomaly detection and association rule mining. Data mining algorithms are applied to extract valuable information and insights from the data.
Data Visualization: Data visualization is the process of presenting data and insights in a clear and
effective manner, typically using charts, graphs, and other visualizations. Data visualization is an
important part of data mining, as it allows data miners to communicate their findings and insights
to others in a way that is easy to understand and interpret.
Data Mining Techniques
There is a wide array of data mining techniques used in data science and data analytics.
Predictive modeling is a fundamental component of data mining and is widely used to make
predictions or forecasts based on historical data patterns.
The top 10 data mining techniques are:
1. Classification
Classification is a technique used to categorize data into predefined classes or categories based on the
features or attributes of the data instances. It involves training a model on labeled data and using it to
predict the class labels of new, unseen data instances.
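A minimal sketch of classification is a 1-nearest-neighbour classifier: a new instance receives the label of the closest labeled training instance. The tiny (height, weight) dataset below is invented purely for illustration.

```python
def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def classify(instance, training_data):
    """Predict the label of the training instance nearest to `instance`."""
    nearest = min(training_data, key=lambda row: euclidean(instance, row[0]))
    return nearest[1]

# (height_cm, weight_kg) -> class label, invented for the example
training = [((150, 50), "small"), ((160, 60), "small"),
            ((180, 85), "large"), ((190, 95), "large")]
print(classify((155, 55), training))   # nearest neighbours are "small"
print(classify((185, 90), training))   # nearest neighbours are "large"
```

Real classifiers (decision trees, naive Bayes, neural networks) learn more compact models, but the train-then-predict pattern is the same.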
2. Regression
Regression is employed to predict numeric or continuous values based on the relationship between input
variables and a target variable. It aims to find a mathematical function or model that best fits the data to
make accurate predictions.
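The simplest such function is a straight line fitted by ordinary least squares. The sketch below fits y = a*x + b to invented experience/salary data; the data points lie exactly on a line so the fit is easy to check by eye.

```python
def fit_line(xs, ys):
    """Return slope a and intercept b minimizing squared error."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    a = (sum(x * y for x, y in zip(xs, ys)) - n * mean_x * mean_y) / \
        (sum(x * x for x in xs) - n * mean_x ** 2)
    b = mean_y - a * mean_x
    return a, b

xs = [1, 2, 3, 4, 5]        # e.g. years of experience (invented)
ys = [30, 35, 40, 45, 50]   # e.g. salary in thousands (invented)
a, b = fit_line(xs, ys)
print(a, b)                 # the data follow y = 5x + 25 exactly
print(a * 6 + b)            # prediction for x = 6 -> 55.0
```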
3. Clustering
Clustering is a technique used to group similar data instances together based on their intrinsic
characteristics or similarities. It aims to discover natural patterns or structures in the data without any
predefined classes or labels.
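The best-known clustering algorithm is k-means, sketched here on one-dimensional data for brevity: points are repeatedly assigned to the nearest centroid, and each centroid then moves to the mean of its assigned points. The points and starting centroids are invented for illustration.

```python
def k_means(points, centroids, iterations=10):
    """Minimal 1-D k-means: assign points, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [1, 2, 3, 10, 11, 12]   # two obvious groups
centroids, clusters = k_means(points, centroids=[1.0, 12.0])
print(centroids)   # settles at [2.0, 11.0]
print(clusters)    # [[1, 2, 3], [10, 11, 12]]
```

Note that no labels were given: the grouping emerges from the data alone, which is what distinguishes clustering from classification.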
4. Association Rule
Association rule mining focuses on discovering interesting relationships or patterns among a set of items
in transactional or market basket data. It helps identify frequently co-occurring items and generates rules
such as "if X, then Y" to reveal associations between items.
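The two measures behind association rules can be computed directly: support (how often X and Y occur together across all transactions) and confidence (how often Y appears in the transactions that contain X). The market-basket transactions below are invented for illustration.

```python
# Four invented market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "eggs"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y):
    """Confidence of the rule 'if x then y' = support(x and y) / support(x)."""
    return support(x | y) / support(x)

print(support({"bread", "milk"}))        # 2 of 4 transactions -> 0.5
print(confidence({"bread"}, {"milk"}))   # 2 of 3 bread baskets -> ~0.67
```

Algorithms such as Apriori use these same measures, but search the space of itemsets efficiently instead of checking every combination.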
5. Anomaly Detection
Anomaly detection, sometimes called outlier analysis, aims to identify rare or unusual data instances that
deviate significantly from the expected patterns. It is useful in detecting fraudulent transactions, network
intrusions, manufacturing defects, or any other abnormal behavior.
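One simple way to flag such deviations is the z-score: values lying more than a chosen number of standard deviations from the mean are reported as outliers. The transaction amounts below are invented; real fraud detection uses far more features than a single amount.

```python
def z_score_outliers(values, threshold=2.0):
    """Return values more than `threshold` standard deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [v for v in values if abs(v - mean) / std > threshold]

amounts = [40, 42, 38, 41, 39, 40, 500]   # one suspicious transaction
print(z_score_outliers(amounts))          # flags the 500
```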
6. Time Series Analysis
Time series analysis focuses on analyzing and predicting data points collected over time. It involves
techniques such as forecasting, trend analysis, seasonality detection, and anomaly detection in time-
dependent datasets.
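The most basic of these techniques is the simple moving average, which smooths short-term fluctuation so the underlying trend is easier to see. The monthly sales figures below are invented for illustration.

```python
def moving_average(series, window):
    """Average of each consecutive `window`-sized slice of the series."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

sales = [10, 12, 11, 13, 40, 14, 15]   # one spike in the raw data
print(moving_average(sales, 3))        # the spike is spread out and damped
```

Forecasting and seasonality detection build on the same idea of summarizing windows of past observations.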
7. Neural Networks
Neural networks are a type of machine learning or AI model inspired by the human brain's structure
and function. They are composed of interconnected nodes (neurons) and layers that can learn from data to
recognize patterns, perform classification, regression, or other tasks.
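The smallest possible neural network is a single neuron (a perceptron). The sketch below trains one to compute the logical AND function by nudging its weights toward every example it misclassifies; the learning rate and epoch count are illustrative choices.

```python
def train_perceptron(samples, epochs=20, rate=0.1):
    """Train one neuron with the classic perceptron update rule."""
    w, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            output = 1 if w[0] * inputs[0] + w[1] * inputs[1] + bias > 0 else 0
            error = target - output            # 0 when the neuron is right
            w = [w[0] + rate * error * inputs[0],
                 w[1] + rate * error * inputs[1]]
            bias += rate * error
    return w, bias

def predict(w, bias, inputs):
    return 1 if w[0] * inputs[0] + w[1] * inputs[1] + bias > 0 else 0

and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, bias = train_perceptron(and_data)
print([predict(w, bias, x) for x, _ in and_data])   # learns AND: [0, 0, 0, 1]
```

Modern networks stack many such neurons in layers and train them with backpropagation, but the learn-from-error loop is the same idea.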
8. Decision Trees
Decision trees are graphical models that use a tree-like structure to represent decisions and their possible
consequences. They recursively split the data based on different attribute values to form a hierarchical
decision-making process.
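A decision tree is easy to read as nested conditions. The sketch below hand-writes the classic "play tennis?" tree as code; in practice the splits are learned from data, so this structure is illustrative only.

```python
def decide(outlook, humidity, wind):
    """A hand-written decision tree for the toy 'play tennis?' problem."""
    if outlook == "sunny":                        # first split: outlook
        return "no" if humidity == "high" else "yes"
    if outlook == "overcast":                     # overcast -> always play
        return "yes"
    return "no" if wind == "strong" else "yes"    # remaining case: rain

print(decide("sunny", "high", "weak"))       # sunny + high humidity -> no
print(decide("overcast", "high", "strong"))  # overcast -> yes
print(decide("rain", "normal", "weak"))      # rain + weak wind -> yes
```

Tree-learning algorithms such as ID3 and CART choose each split automatically, typically by maximizing information gain or minimizing impurity.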
9. Ensemble Methods
Ensemble methods combine multiple models to improve prediction accuracy and generalization.
Techniques like Random Forests and Gradient Boosting utilize a combination of weak learners to create a
stronger, more accurate model.
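The core ensemble idea can be sketched as a majority vote: several weak classifiers each make a prediction, and the most common answer wins. The threshold rules and the loan applicant below are invented purely for illustration.

```python
def majority_vote(classifiers, instance):
    """Return the most common prediction among the classifiers."""
    votes = [clf(instance) for clf in classifiers]
    return max(set(votes), key=votes.count)

# Three weak, individually unreliable rules on (income, debt, age).
weak_learners = [
    lambda x: "approve" if x[0] > 30000 else "reject",   # income rule
    lambda x: "approve" if x[1] < 10000 else "reject",   # debt rule
    lambda x: "approve" if x[2] >= 21 else "reject",     # age rule
]

applicant = (45000, 15000, 30)                   # fails the debt rule only
print(majority_vote(weak_learners, applicant))   # 2 of 3 vote "approve"
```

Random Forests and Gradient Boosting refine this idea by training the weak learners on varied samples of the data and weighting their votes.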
10. Text Mining
Text mining techniques are applied to extract valuable insights and knowledge from unstructured text
data. Text mining includes tasks such as text categorization, sentiment analysis, topic modeling, and
information extraction, enabling organizations to derive meaningful insights from large volumes of
textual data, such as customer reviews, social media posts, emails, and articles.
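Sentiment analysis in its crudest form can be sketched as keyword counting: score a review by how many positive versus negative words it contains. The word lists and reviews are invented for illustration; real systems use far richer language models.

```python
# Tiny invented sentiment lexicons.
POSITIVE = {"great", "excellent", "love", "good"}
NEGATIVE = {"bad", "poor", "terrible", "hate"}

def sentiment(text):
    """Label text by counting positive vs. negative keywords."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("Great product, I love it"))           # two positive hits
print(sentiment("terrible quality and poor support"))  # two negative hits
```

Even this toy version shows the text-mining pipeline: turn raw text into features (here, word membership), then apply a decision rule to those features.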
Clustering in Data Mining
The process of grouping a collection of abstract objects into classes of similar objects is known as clustering.
Requirements of clustering in data mining:
The following are the main requirements that clustering algorithms should satisfy in data mining.
Scalability – We require highly scalable clustering algorithms to work with large databases.
Ability to deal with different kinds of attributes – Algorithms should be able to work with different types of data, such as categorical, numerical, and binary data.
Discovery of clusters with arbitrary shape – The algorithm should be able to detect clusters of arbitrary shape and should not be bound only to distance measures that favor spherical clusters.
Interpretability – The results should be comprehensible, usable, and interpretable.
High dimensionality – The algorithm should be able to handle high
dimensional space instead of only handling low dimensional data.