Unit 1
Data Mining
Data mining got its start in the 1950s, when the first computers were created and put to use for mathematical and scientific research. As computing power and data storage methods advanced, researchers began exploring ways to use computers to analyze and draw conclusions from massive data sets.
Dr. Herbert Simon, a Nobel laureate in economics and widely regarded as one of the founders of artificial intelligence, was among the earliest and most significant pioneers of data mining. In the 1950s and 1960s, Simon and his colleagues created a variety of algorithms and methods for drawing insightful conclusions and valuable information from data, such as decision trees, classification, and clustering.
As data mining continued to advance in the 1980s and 1990s, new methods and algorithms were created to handle the difficulties of working with big, complicated data sets. Applying data mining techniques to an organization's data became simpler with the introduction of data mining platforms and software such as SAS, SPSS, and RapidMiner.
Data mining is the process of discovering patterns, trends, and useful information from large
datasets. It involves using methods from various fields like statistics, machine learning, and
database systems to extract knowledge that can be used for decision-making and other purposes.
What Kind of Information Are We Collecting?
1. Customer Data:
Demographics: Age, gender, location, income, education, etc.
Transactional Data: Purchase history, website browsing activity, app usage, customer
service interactions.
Behavioral Data: Online behavior, product preferences, responses to marketing
campaigns.
Social Media Data: Publicly available information from social media profiles, including
interests, connections, and opinions.
2. Business Data:
Sales Data: Revenue, sales volume, product performance, sales channels.
Financial Data: Stock prices, market trends, economic indicators.
Operational Data: Supply chain information, manufacturing processes, logistics.
Human Resources Data: Employee demographics, performance reviews, training
records.
3. Sensor Data:
Environmental Data: Temperature, humidity, air quality, weather conditions.
Machine Data: Data from industrial equipment, vehicles, and other devices, including
performance metrics, maintenance records, and error logs.
Medical Data: Patient vital signs, medical images, electronic health records.
4. Web Data:
Website Traffic Data: Page views, click-through rates, bounce rates, user navigation
patterns.
Search Engine Data: Search queries, search results, website rankings.
Social Media Data: Posts, comments, shares, likes, and other interactions on social
media platforms.
5. Multimedia Data:
Image Data: Photos, videos, medical images.
Audio Data: Music, speech recordings, sound effects.
Video Data: Movies, TV shows, surveillance footage.
Motivation Behind Data Mining
1. The Data Explosion:
Increased data generation: We live in a world of ever-increasing data. From social
media interactions and online transactions to sensor readings and scientific experiments,
data is being generated at an unprecedented rate. This sheer volume of data makes it
impossible for humans to analyze it manually, creating a need for automated data mining
techniques.
Need for knowledge discovery: Hidden within this massive data are valuable insights,
patterns, and trends that can drive better decision-making and innovation. Data mining
provides the tools to extract this knowledge.
2. Business Needs:
Competitive advantage: In today's competitive business environment, organizations
need to make informed decisions quickly. Data mining helps businesses understand their
customers, markets, and operations better, giving them a competitive edge.
Improved customer relationship management (CRM): By analyzing customer data,
businesses can personalize marketing campaigns, improve customer service, and build
stronger customer relationships.
Fraud detection: Data mining techniques can identify patterns indicative of fraudulent
activities, helping businesses prevent losses and protect their assets.
Risk management: Data mining can help organizations assess and manage risks by
identifying potential problems and predicting future outcomes.
3. Scientific and Technological Advancements:
Advancements in computing power: The development of powerful computers and
distributed computing systems has made it possible to process and analyze massive
datasets efficiently.
Development of sophisticated algorithms: Researchers have developed increasingly
sophisticated data mining algorithms that can discover complex patterns and relationships
in data.
Advances in database technology: The development of advanced database management
systems and data warehousing technologies has provided the infrastructure for storing
and managing large datasets.
4. Societal Needs:
Improved healthcare: Data mining can help healthcare providers improve patient care,
develop new treatments, and predict disease outbreaks.
Enhanced security: Data mining can be used to detect and prevent terrorist attacks,
cybercrime, and other security threats.
Environmental protection: Data mining can help us understand and address
environmental challenges such as climate change and pollution.
“Data Mining” can be referred to as knowledge mining from data, knowledge extraction, data/pattern analysis, data archaeology, and data dredging. Data mining, also known as Knowledge Discovery in Databases, refers to the nontrivial extraction of implicit, previously unknown, and potentially useful information from data stored in databases.
The need for data mining is to extract useful information from large datasets and use it to make predictions or support better decision-making. Nowadays, data mining is used in almost all places where a large amount of data is stored and processed.
Examples include the banking sector, market basket analysis, and network intrusion detection.
KDD Process
KDD (Knowledge Discovery in Databases) is a process that involves the extraction of useful, previously unknown, and potentially valuable information from large datasets. KDD is an iterative process: it usually requires multiple passes through its steps to extract accurate knowledge from the data. The KDD process includes the following steps:
Data Cleaning
Data cleaning is defined as the removal of noisy and irrelevant data from the collection. It includes:
1. Handling missing values.
2. Smoothing noisy data, where noise is a random error or variance in a measured variable.
3. Detecting and resolving discrepancies with data discrepancy detection and data transformation tools.
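The first two cleaning steps can be sketched in a few lines of Python. This is a minimal, illustrative sketch only: it fills missing values with the attribute mean, and smooths noisy values by replacing each bin of sorted values with the bin mean ("binning"); the sample ages and prices are invented for the example.

```python
def fill_missing(values, missing=None):
    """Replace missing entries with the mean of the known values."""
    known = [v for v in values if v is not missing]
    mean = sum(known) / len(known)
    return [mean if v is missing else v for v in values]

def smooth_by_bin_means(values, bin_size):
    """Smooth noise by replacing each bin of sorted values with its mean."""
    ordered = sorted(values)
    smoothed = []
    for i in range(0, len(ordered), bin_size):
        bin_ = ordered[i:i + bin_size]
        smoothed.extend([sum(bin_) / len(bin_)] * len(bin_))
    return smoothed

ages = [23, None, 31, 27, None, 29]          # invented sample with gaps
print(fill_missing(ages))                    # missing ages -> mean 27.5

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26]   # invented noisy sample
print(smooth_by_bin_means(prices, 3))        # each bin of 3 -> its mean
```

In practice, whether to fill, smooth, or simply drop bad records depends on the dataset and the mining task.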
Data Integration
Data integration is defined as combining heterogeneous data from multiple sources into a common store (a data warehouse). Data integration is performed using data migration tools, data synchronization tools, and the ETL (Extract, Transform, Load) process.
Data Selection
Data selection is defined as the process of deciding which data is relevant to the analysis and retrieving it from the data collection. Methods such as neural networks, decision trees, naive Bayes, clustering, and regression can support this step.
Data Transformation
Data transformation is defined as the process of transforming data into the form required by the mining procedure. Data transformation is a two-step process:
1. Data mapping: assigning elements from the source to the destination to capture the transformations.
2. Code generation: creating the actual transformation program.
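One common transformation applied at this stage is normalization. The sketch below, with invented income figures, shows min-max normalization, which rescales a numeric attribute linearly into the range [0, 1] so that attributes with different units become comparable.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly into [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    return [(v - old_min) / span * (new_max - new_min) + new_min
            for v in values]

incomes = [12000, 73600, 98000]   # invented annual incomes
print(min_max_normalize(incomes)) # smallest -> 0.0, largest -> 1.0
```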
Data Mining
Data mining is defined as the step in which techniques are applied to extract potentially useful patterns. It transforms task-relevant data into patterns and decides the purpose of the model, for example classification or characterization.
Pattern Evaluation
Pattern evaluation is defined as identifying the truly interesting patterns representing knowledge, based on given interestingness measures. It computes an interestingness score for each pattern and uses summarization and visualization to make the results understandable to the user.
Knowledge Representation
This involves presenting the results in a way that is meaningful and can be used to make
decisions.
Note: KDD is an iterative process in which evaluation measures can be enhanced, mining can be refined, and new data can be integrated and transformed in order to get different and more appropriate results. Preprocessing of databases consists of data cleaning and data integration.
Advantages of KDD
1. Improves decision-making: KDD provides valuable insights and knowledge that can help
organizations make better decisions.
2. Increased efficiency: KDD automates repetitive and time-consuming tasks and makes the
data ready for analysis, which saves time and money.
3. Better customer service: KDD helps organizations gain a better understanding of their
customers’ needs and preferences, which can help them provide better customer service.
4. Fraud detection: KDD can be used to detect fraudulent activities by identifying patterns
and anomalies in the data that may indicate fraud.
5. Predictive modeling: KDD can be used to build predictive models that can forecast future
trends and patterns.
Disadvantages of KDD
1. Privacy concerns: KDD can raise privacy concerns as it involves collecting and analyzing
large amounts of data, which can include sensitive information about individuals.
2. Complexity: KDD can be a complex process that requires specialized skills and knowledge
to implement and interpret the results.
3. Unintended consequences: KDD can lead to unintended consequences, such as bias or
discrimination, if the data or models are not properly understood or used.
4. Data Quality: The KDD process depends heavily on the quality of the data; if the data is not accurate or consistent, the results can be misleading.
5. High cost: KDD can be an expensive process, requiring significant investments in
hardware, software, and personnel.
6. Overfitting: The KDD process can lead to overfitting, a common problem in machine learning where a model learns the detail and noise in the training data to the extent that it negatively impacts performance on new, unseen data.
Difference between KDD and Data Mining
Definition:
KDD refers to the process of identifying valid, novel, potentially useful, and ultimately understandable patterns and relationships in data. Data mining refers to the process of extracting useful and valuable information or patterns from large data sets.

Objective:
KDD aims to find useful knowledge from data. Data mining aims to extract useful information from data.

Techniques Used:
KDD covers data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge representation and visualization. Data mining uses association rules, classification, clustering, regression, decision trees, neural networks, and dimensionality reduction.

Output:
KDD produces structured information, such as rules and models, that can be used to make decisions or predictions. Data mining produces patterns, associations, or insights that can be used to improve decision-making or understanding.

Focus:
KDD focuses on the discovery of useful knowledge, rather than simply finding patterns in data. Data mining focuses on the discovery of patterns or relationships in data.

Role of Domain Expertise:
Domain expertise is important in KDD, as it helps in defining the goals of the process, choosing appropriate data, and interpreting the results. Domain expertise is less critical in data mining, as the algorithms are designed to identify patterns without relying on prior knowledge.
Data Mining Architecture
The general layout and composition of a data mining system are referred to as its data mining
architecture. In order to complete data mining activities and extract valuable insights and
information from data, a data mining architecture usually consists of a number of essential
components. A typical data mining architecture's essential elements include the following:
Data sources: The sources of data used in data mining are known as data sources. These may
consist of both structured and unstructured data from files, databases, sensors, and other sources.
To produce a useful data collection for analysis, data sources supply the raw data required in data
mining, which can then be cleaned, processed, and transformed.
Data Preprocessing: The process of getting data ready for analysis is known as data preprocessing. Usually, this entails cleaning and converting the data to remove errors, inconsistencies, and unnecessary information. Data preprocessing is a crucial stage in data mining, as it ensures that the data is high-quality and prepared for analysis.
Data Mining Algorithms: These are the models and algorithms that are used to carry out data mining. They can be supervised or unsupervised learning algorithms, such as classification, regression, and clustering, as well as more task-specific algorithms like anomaly detection and association rule mining. Data mining algorithms are applied to extract valuable information and insights from the data.
Data Visualization: Data visualization is the process of presenting data and insights in a clear and
effective manner, typically using charts, graphs, and other visualizations. Data visualization is an
important part of data mining, as it allows data miners to communicate their findings and insights
to others in a way that is easy to understand and interpret.
Data Mining Techniques
There is a wide array of data mining techniques used in data science and data analytics.
Predictive modeling is a fundamental component of data mining and is widely used to make
predictions or forecasts based on historical data patterns.
The top 10 data mining techniques are:
1. Classification
Classification is a technique used to categorize data into predefined classes or categories based on the
features or attributes of the data instances. It involves training a model on labeled data and using it to
predict the class labels of new, unseen data instances.
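A minimal sketch of classification is a 1-nearest-neighbour classifier: a new instance receives the label of the closest labeled training instance. The tiny (height, weight) dataset below is invented purely for illustration.

```python
def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def classify(instance, training_data):
    """Predict the label of the training instance nearest to `instance`."""
    nearest = min(training_data, key=lambda row: euclidean(instance, row[0]))
    return nearest[1]

# (height_cm, weight_kg) -> class label, invented for the example
training = [((150, 50), "small"), ((160, 60), "small"),
            ((180, 85), "large"), ((190, 95), "large")]
print(classify((155, 55), training))   # nearest neighbours are "small"
print(classify((185, 90), training))   # nearest neighbours are "large"
```

Real classifiers (decision trees, naive Bayes, neural networks) learn more compact models, but the train-then-predict pattern is the same.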
2. Regression
Regression is employed to predict numeric or continuous values based on the relationship between input
variables and a target variable. It aims to find a mathematical function or model that best fits the data to
make accurate predictions.
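The simplest such function is a straight line fitted by ordinary least squares. The sketch below fits y = a*x + b to invented experience/salary data; the data points lie exactly on a line so the fit is easy to check by eye.

```python
def fit_line(xs, ys):
    """Return slope a and intercept b minimizing squared error."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    a = (sum(x * y for x, y in zip(xs, ys)) - n * mean_x * mean_y) / \
        (sum(x * x for x in xs) - n * mean_x ** 2)
    b = mean_y - a * mean_x
    return a, b

xs = [1, 2, 3, 4, 5]        # e.g. years of experience (invented)
ys = [30, 35, 40, 45, 50]   # e.g. salary in thousands (invented)
a, b = fit_line(xs, ys)
print(a, b)                 # the data follow y = 5x + 25 exactly
print(a * 6 + b)            # prediction for x = 6 -> 55.0
```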
3. Clustering
Clustering is a technique used to group similar data instances together based on their intrinsic
characteristics or similarities. It aims to discover natural patterns or structures in the data without any
predefined classes or labels.
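The best-known clustering algorithm is k-means, sketched here on one-dimensional data for brevity: points are repeatedly assigned to the nearest centroid, and each centroid then moves to the mean of its assigned points. The points and starting centroids are invented for illustration.

```python
def k_means(points, centroids, iterations=10):
    """Minimal 1-D k-means: assign points, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [1, 2, 3, 10, 11, 12]   # two obvious groups
centroids, clusters = k_means(points, centroids=[1.0, 12.0])
print(centroids)   # settles at [2.0, 11.0]
print(clusters)    # [[1, 2, 3], [10, 11, 12]]
```

Note that no labels were given: the grouping emerges from the data alone, which is what distinguishes clustering from classification.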
4. Association Rule
Association rule mining focuses on discovering interesting relationships or patterns among a set of items
in transactional or market basket data. It helps identify frequently co-occurring items and generates rules
such as "if X, then Y" to reveal associations between items.
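The two measures behind association rules can be computed directly: support (how often X and Y occur together across all transactions) and confidence (how often Y appears in the transactions that contain X). The market-basket transactions below are invented for illustration.

```python
# Four invented market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "eggs"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y):
    """Confidence of the rule 'if x then y' = support(x and y) / support(x)."""
    return support(x | y) / support(x)

print(support({"bread", "milk"}))        # 2 of 4 transactions -> 0.5
print(confidence({"bread"}, {"milk"}))   # 2 of 3 bread baskets -> ~0.67
```

Algorithms such as Apriori use these same measures, but search the space of itemsets efficiently instead of checking every combination.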
5. Anomaly Detection
Anomaly detection, sometimes called outlier analysis, aims to identify rare or unusual data instances that
deviate significantly from the expected patterns. It is useful in detecting fraudulent transactions, network
intrusions, manufacturing defects, or any other abnormal behavior.
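One simple way to flag such deviations is the z-score: values lying more than a chosen number of standard deviations from the mean are reported as outliers. The transaction amounts below are invented; real fraud detection uses far more features than a single amount.

```python
def z_score_outliers(values, threshold=2.0):
    """Return values more than `threshold` standard deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [v for v in values if abs(v - mean) / std > threshold]

amounts = [40, 42, 38, 41, 39, 40, 500]   # one suspicious transaction
print(z_score_outliers(amounts))          # flags the 500
```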
6. Time Series Analysis
Time series analysis focuses on analyzing and predicting data points collected over time. It involves
techniques such as forecasting, trend analysis, seasonality detection, and anomaly detection in time-
dependent datasets.
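The most basic of these techniques is the simple moving average, which smooths short-term fluctuation so the underlying trend is easier to see. The monthly sales figures below are invented for illustration.

```python
def moving_average(series, window):
    """Average of each consecutive `window`-sized slice of the series."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

sales = [10, 12, 11, 13, 40, 14, 15]   # one spike in the raw data
print(moving_average(sales, 3))        # the spike is spread out and damped
```

Forecasting and seasonality detection build on the same idea of summarizing windows of past observations.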
7. Neural Networks
Neural networks are a type of machine learning or AI model inspired by the human brain's structure
and function. They are composed of interconnected nodes (neurons) and layers that can learn from data to
recognize patterns, perform classification, regression, or other tasks.
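The smallest possible neural network is a single neuron (a perceptron). The sketch below trains one to compute the logical AND function by nudging its weights toward every example it misclassifies; the learning rate and epoch count are illustrative choices.

```python
def train_perceptron(samples, epochs=20, rate=0.1):
    """Train one neuron with the classic perceptron update rule."""
    w, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            output = 1 if w[0] * inputs[0] + w[1] * inputs[1] + bias > 0 else 0
            error = target - output            # 0 when the neuron is right
            w = [w[0] + rate * error * inputs[0],
                 w[1] + rate * error * inputs[1]]
            bias += rate * error
    return w, bias

def predict(w, bias, inputs):
    return 1 if w[0] * inputs[0] + w[1] * inputs[1] + bias > 0 else 0

and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, bias = train_perceptron(and_data)
print([predict(w, bias, x) for x, _ in and_data])   # learns AND: [0, 0, 0, 1]
```

Modern networks stack many such neurons in layers and train them with backpropagation, but the learn-from-error loop is the same idea.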
8. Decision Trees
Decision trees are graphical models that use a tree-like structure to represent decisions and their possible
consequences. They recursively split the data based on different attribute values to form a hierarchical
decision-making process.
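A decision tree is easy to read as nested conditions. The sketch below hand-writes the classic "play tennis?" tree as code; in practice the splits are learned from data, so this structure is illustrative only.

```python
def decide(outlook, humidity, wind):
    """A hand-written decision tree for the toy 'play tennis?' problem."""
    if outlook == "sunny":                        # first split: outlook
        return "no" if humidity == "high" else "yes"
    if outlook == "overcast":                     # overcast -> always play
        return "yes"
    return "no" if wind == "strong" else "yes"    # remaining case: rain

print(decide("sunny", "high", "weak"))       # sunny + high humidity -> no
print(decide("overcast", "high", "strong"))  # overcast -> yes
print(decide("rain", "normal", "weak"))      # rain + weak wind -> yes
```

Tree-learning algorithms such as ID3 and CART choose each split automatically, typically by maximizing information gain or minimizing impurity.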
9. Ensemble Methods
Ensemble methods combine multiple models to improve prediction accuracy and generalization.
Techniques like Random Forests and Gradient Boosting utilize a combination of weak learners to create a
stronger, more accurate model.
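The core ensemble idea can be sketched as a majority vote: several weak classifiers each make a prediction, and the most common answer wins. The threshold rules and the loan applicant below are invented purely for illustration.

```python
def majority_vote(classifiers, instance):
    """Return the most common prediction among the classifiers."""
    votes = [clf(instance) for clf in classifiers]
    return max(set(votes), key=votes.count)

# Three weak, individually unreliable rules on (income, debt, age).
weak_learners = [
    lambda x: "approve" if x[0] > 30000 else "reject",   # income rule
    lambda x: "approve" if x[1] < 10000 else "reject",   # debt rule
    lambda x: "approve" if x[2] >= 21 else "reject",     # age rule
]

applicant = (45000, 15000, 30)                   # fails the debt rule only
print(majority_vote(weak_learners, applicant))   # 2 of 3 vote "approve"
```

Random Forests and Gradient Boosting refine this idea by training the weak learners on varied samples of the data and weighting their votes.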
10. Text Mining
Text mining techniques are applied to extract valuable insights and knowledge from unstructured text
data. Text mining includes tasks such as text categorization, sentiment analysis, topic modeling, and
information extraction, enabling organizations to derive meaningful insights from large volumes of
textual data, such as customer reviews, social media posts, emails, and articles.
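Sentiment analysis in its crudest form can be sketched as keyword counting: score a review by how many positive versus negative words it contains. The word lists and reviews are invented for illustration; real systems use far richer language models.

```python
# Tiny invented sentiment lexicons.
POSITIVE = {"great", "excellent", "love", "good"}
NEGATIVE = {"bad", "poor", "terrible", "hate"}

def sentiment(text):
    """Label text by counting positive vs. negative keywords."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("Great product, I love it"))           # two positive hits
print(sentiment("terrible quality and poor support"))  # two negative hits
```

Even this toy version shows the text-mining pipeline: turn raw text into features (here, word membership), then apply a decision rule to those features.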
Clustering in Data Mining
The process of grouping a collection of abstract objects into classes of similar objects is known as clustering.
Requirements of clustering in data mining:
The following are the main requirements that clustering algorithms should satisfy in data mining.
Scalability – We require highly scalable clustering algorithms to work with large databases.
Ability to deal with different kinds of attributes – Algorithms should be able to work with different types of data, such as categorical, numerical, and binary data.
Discovery of clusters with arbitrary shape – The algorithm should be able to detect clusters of arbitrary shape and should not be bound only to distance measures that favor spherical clusters.
Interpretability – The results should be comprehensible, usable, and interpretable.
High dimensionality – The algorithm should be able to handle high
dimensional space instead of only handling low dimensional data.