Introduction To Data Mining Unit1
DATA MINING
Dr.M.Hemalatha
Department of Computer Science
Sri Ramakrishna College of Arts & Science
Coimbatore
SYLLABUS
Introduction - Data mining: an essential step in
knowledge discovery - Diversity of data types for
data mining - Mining various kinds of knowledge
- Multidimensional data summarization - Mining
frequent patterns, associations, and correlations -
Classification and regression for predictive
analysis - Cluster analysis - Deep learning - Outlier
analysis - Database technology and data mining -
Data mining and data science.
Data
• Datum means "an item given"
• Individual pieces of information
• Structure that is often tabular (represented by
rows and columns)
• A tree (a set of nodes with parent-children
relationship)
• A graph (a set of connected nodes).
• Raw data, i.e., unprocessed data
Data, Information and Knowledge
• Data, information and knowledge frequently
overlap, mainly differing in abstraction
• Data is least abstract, information next least,
and knowledge most.
• Example:
– the height of Mt. Everest – Data
– a book on Mt. Everest geological characteristics –
Information
– a report containing practical information on the
best way to reach Mt. Everest's peak – Knowledge
What is Data Mining?
• Extracting useful patterns from large datasets.
• Also known as Knowledge Discovery from Data (KDD).
• Example: Market basket analysis in retail.
• Data mining is often defined as finding hidden
information in, or extracting meaningful information
from, a large database.
• It is also called exploratory data analysis, data-driven
discovery and deductive learning.
Why Data Mining is Important
• Helps in decision-making by revealing hidden trends.
• Used in business, healthcare, finance, and more.
Example: Fraud detection in banking transactions.
• Alternative names:
– Knowledge discovery (mining) in databases (KDD)
– Knowledge extraction
– Data/pattern analysis
– Data archeology
– Data dredging
– Information harvesting
– Business intelligence, etc.
Steps in Knowledge Discovery
• Key techniques:
🔹 Regression
🔹 Association Rule Discovery
🔹 Classification
🔹 Clustering
Regression Analysis
• Predictive technique used to estimate values
• Example: Predicting revenue based on
previous sales
• 📈 Use Case: Forecasting stock prices,
predicting housing prices
• Output: Continuous values
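A minimal sketch of the regression idea above, assuming a single predictor and an ordinary least-squares line. The month and revenue figures are made up for illustration, and fit_line is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: y ~ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: month number vs. revenue (illustrative figures only).
months = [1, 2, 3, 4, 5]
revenue = [10.0, 12.0, 14.0, 16.0, 18.0]

slope, intercept = fit_line(months, revenue)
forecast = slope * 6 + intercept  # predicted revenue for month 6
print(slope, intercept, forecast)  # 2.0 8.0 20.0
```

The output is a continuous number rather than a category, which is what separates regression from classification.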
Association Rule Discovery
• Identifies relationships between items in
datasets
• 📦 Example: Customers who buy bread often buy
butter
• Common in:
🔹 E-commerce recommendations
🔹 Market Basket Analysis
• Format: IF {Item A} THEN {Item B}
Classification
• Assigns data items to predefined categories
• Example: Classifying emails as spam or not
spam
• Used in:
🔹 Fraud detection
🔹 Medical diagnosis
🔹 Document sorting
Clustering
• Groups similar data points together
• No predefined labels (unsupervised learning)
• 🎯 Used in:
🔹 Customer segmentation
🔹 Image compression
🔹 Social network analysis
Frequent Pattern Mining
• Sub-field of data mining for discovering
recurring patterns
• Finds frequent itemsets (e.g., milk + bread
bought together)
• Basis for association rules
• Methods:
🔹 Apriori
🔹 FP-Growth
Apriori Algorithm (Brief)
• Starts by finding frequent individual items
• Extends them to larger itemsets level by level, keeping
only those that meet the minimum support
• Good for market basket data
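The level-wise idea can be sketched as follows. This is a simplified illustration rather than an optimized implementation: the baskets are made-up data, min_support is taken as a fraction of transactions, and the function name apriori is chosen here for convenience.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support (a fraction)."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: frequent individual items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        current = {c for c in candidates if support(c) >= min_support}
    return frequent

# Hypothetical market basket data.
baskets = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "eggs"},
]
result = apriori(baskets, min_support=0.5)
for itemset, sup in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

With these baskets, {milk, bread, butter} is pruned before its support is ever counted, because its subset {milk, butter} is not frequent.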
FP-Growth Algorithm (Brief)
• Typically faster than Apriori on large datasets
• Uses a special data structure: FP-tree
• Compresses data and mines frequent patterns
without candidate generation
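A rough sketch of how an FP-tree compresses transactions, covering only the tree-building step and not the mining step of FP-Growth. The Node class, build_fp_tree, count_nodes, the baskets, and the use of an absolute min_count are all assumptions made for illustration.

```python
from collections import Counter

class Node:
    """One FP-tree node: an item, its count, and its child nodes."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_count):
    # Pass 1: count how many transactions contain each item.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i for i, c in counts.items() if c >= min_count}
    root = Node(None)
    # Pass 2: insert each transaction as a path, most frequent items first,
    # so that transactions sharing a prefix share nodes.
    for t in transactions:
        ordered = sorted((i for i in set(t) if i in frequent),
                         key=lambda i: (-counts[i], i))
        node = root
        for item in ordered:
            if item not in node.children:
                node.children[item] = Node(item, node)
            node = node.children[item]
            node.count += 1
    return root

def count_nodes(node):
    """Number of nodes below the given node."""
    return sum(1 + count_nodes(child) for child in node.children.values())

# Hypothetical market basket data.
baskets = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "eggs"},
]
tree = build_fp_tree(baskets, min_count=2)
print(count_nodes(tree))  # 5 nodes hold the 8 frequent-item occurrences
```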
Technique        Goal                 Example Use Case
Association      Find relationships   Product recommendation
Classification   Categorize data      Spam email detection
Clustering       Group similar data   Customer segmentation
Why Use Database Technology?
• Efficient storage, indexing, and retrieval of large datasets
• Built-in query processing (e.g., SQL)
• Seamless integration with data mining algorithms
• High scalability, consistency, and data integrity
Types of Databases Used
• Relational Databases (RDBMS): Structured data with tables
and keys
• Data Warehouses: Integrated, historical data for analytics
• NoSQL Databases: For unstructured/semi-structured data
(e.g., MongoDB, Cassandra)
• Distributed Databases: Handle large-scale data across nodes
What is Data Mining?
• The process of discovering meaningful patterns in large
datasets
• Uses statistical techniques and machine learning
Common techniques:
• Clustering
• Classification
• Association rule mining
• Regression
What is Data Science?
• A broader discipline involving the entire data lifecycle
• Combines statistics, machine learning, data engineering,
and domain knowledge
Tasks include:
• Data collection & preprocessing
• Model building
Role of SQL in Data Mining
• SQL is used to select, filter, and aggregate data for
mining
• Useful operations: GROUP BY, JOIN, WHERE,
HAVING
• Extensions like DMX (Data Mining Extensions) in
Microsoft SQL Server
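As an illustration of the standard SQL operations listed above (not of DMX), the sketch below uses Python's built-in sqlite3 module with an in-memory database. The customers and purchases tables and all their rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE purchases (customer_id INTEGER, item TEXT, amount REAL);
    INSERT INTO customers VALUES (1, 'Coimbatore'), (2, 'Chennai'), (3, 'Coimbatore');
    INSERT INTO purchases VALUES
        (1, 'bread', 40.0), (1, 'butter', 60.0),
        (2, 'bread', 40.0), (3, 'milk', 30.0), (3, 'bread', 45.0);
""")

# JOIN links purchases to customers, WHERE filters rows,
# GROUP BY aggregates per city, HAVING filters the groups.
rows = conn.execute("""
    SELECT c.city, COUNT(*) AS n_purchases, SUM(p.amount) AS total
    FROM purchases AS p
    JOIN customers AS c ON c.id = p.customer_id
    WHERE p.amount >= 35
    GROUP BY c.city
    HAVING COUNT(*) >= 2
    ORDER BY c.city
""").fetchall()
conn.close()
print(rows)  # [('Coimbatore', 3, 145.0)]
```

A summary table like this is a typical input for a mining step, since the raw transactions have already been selected, filtered, and aggregated.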
Data Warehousing
• A centralized repository of data from multiple
sources
• Supports multidimensional analysis
• Enables summarization and trend analysis over
time
• Ideal for preparing data before mining
What Are Frequent Patterns?
• Patterns that occur frequently in data
• Help in identifying valuable relationships
• Applications: Market basket analysis,
recommendation systems
Types of Frequent Patterns
• Frequent Itemsets: Items often bought together
– Example: Milk & Bread
• Sequential Patterns: Items bought in a sequence
– Example: Laptop → Camera → Memory Card
• Frequent Substructures: Repeated structures like
graphs, trees
– Example: Social network connections
Association Rule Mining
• 📊 Example Rule:
buys(X, "computer") ⇒ buys(X, "software")
• Support: 1% (computer and software appear together
in 1% of all transactions)
• Confidence: 50% (of customers who buy a computer,
50% also buy software)
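Support and confidence can be computed directly from a transaction list. The sketch below uses five made-up transactions, so the resulting numbers differ from the 1% and 50% in the rule above. The function names are chosen here for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical transactions.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"computer", "software", "printer"},
    {"computer", "printer"},
]
sup = support(transactions, {"computer", "software"})
conf = confidence(transactions, {"computer"}, {"software"})
print(sup, conf)  # 0.4 0.5
```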
Correlation Analysis
• Goes beyond simple association
• Finds statistical relationships between
attribute-value pairs
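One commonly used correlation measure for association rules is lift. It is used here as an assumption, since the slide does not name a specific measure. Lift compares how often two itemsets occur together with how often they would if they were independent. The data below is hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)

def lift(transactions, a, b):
    """lift(A, B) = support(A and B) / (support(A) * support(B))."""
    joint = support(transactions, set(a) | set(b))
    return joint / (support(transactions, a) * support(transactions, b))

# Hypothetical transactions.
transactions = [
    {"computer", "software"},
    {"computer", "software"},
    {"computer"},
    {"printer"},
]
value = lift(transactions, {"computer"}, {"software"})
# lift > 1: bought together more often than independence would predict
# lift = 1: independent; lift < 1: negatively correlated
print(round(value, 3))  # 1.333
```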
What is Classification?
• Classification = Predicting a category or class
• Builds a model from training data (with known
class labels)
• Model is used to predict class labels of
new/unseen data
How Classification Works
• Training Data – Data with known outcomes
• Model Building – Learn patterns from data
• Prediction – Use model to classify new inputs
📈 Example: Email → Spam or Not Spam
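The three steps can be sketched with a very simple nearest-centroid classifier. This is only one possible model, chosen here for brevity. The two features (number of links, number of exclamation marks) and the training emails are hypothetical.

```python
def train(examples):
    """Model building: compute the mean feature vector (centroid) per class label."""
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(model, features):
    """Prediction: return the label whose centroid is closest to the input."""
    def squared_distance(label):
        return sum((c - f) ** 2 for c, f in zip(model[label], features))
    return min(model, key=squared_distance)

# Training data: (links, exclamation marks) with known class labels.
training_data = [
    ((8, 5), "spam"), ((6, 7), "spam"),
    ((1, 0), "not spam"), ((0, 1), "not spam"),
]
model = train(training_data)
print(predict(model, (7, 4)))  # spam
print(predict(model, (1, 1)))  # not spam
```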
Decision Tree Overview
• 🌳 A Decision Tree is a flowchart-like structure:
– Node: Test on an attribute
– Branch: Outcome of the test
– Leaf: Final class label
• 📊 Example:
IF age < 30
→ IF student = yes → buy = yes
→ ELSE → buy = no
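The example tree can be written as nested conditions. The sketch assumes the ELSE belongs to the student test. The example gives no branch for age >= 30, so the function returns None there rather than guessing. The name predict_buy is hypothetical.

```python
def predict_buy(age, student):
    """Walk the example tree from the root to a leaf."""
    if age < 30:                  # node: test on attribute "age"
        if student == "yes":      # node: test on attribute "student"
            return "yes"          # leaf: buy = yes
        return "no"               # leaf: buy = no
    return None                   # branch for age >= 30 is not given in the example

print(predict_buy(25, "yes"))  # yes
print(predict_buy(25, "no"))   # no
```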
Regression vs Classification
Feature          Classification             Regression
Output Type      Categorical (Labels)       Continuous (Numbers)
Example          Spam / Not Spam            Predict house price
Algorithm Used   Decision Tree, SVM, etc.   Linear Regression, etc.
Application Areas
• Classification:
– Spam Detection
– Disease Diagnosis
– Credit Approval
• Regression:
– Stock Price Prediction
– Sales Forecasting
– Temperature Estimation
What is Cluster Analysis?
• A form of unsupervised learning
• Finds natural groupings in data
• No class labels are used or needed
• Groups similar objects into clusters
Common Clustering Algorithms
• K-Means Clustering – Partitions data into K
clusters
• Hierarchical Clustering – Builds a tree of clusters
• DBSCAN – Finds arbitrarily shaped clusters,
handles noise
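A minimal K-Means sketch on one-dimensional data, assuming the caller supplies the starting centroids (real implementations usually pick them randomly or with a seeding heuristic). The spending figures and the name kmeans_1d are made up for illustration.

```python
def kmeans_1d(points, centroids, iterations=10):
    """K-Means on 1-D data: alternate assignment and update steps."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:  # converged: nothing moved
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical monthly spend per customer.
spend = [10, 12, 11, 50, 52, 49]
centroids, clusters = kmeans_1d(spend, centroids=[10, 50])
print(clusters)  # [[10, 12, 11], [50, 52, 49]]
```

No labels are supplied anywhere: the two groups emerge only from the distances between points.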
Applications of Clustering
• Customer segmentation in marketing
• Social network analysis
• Image recognition & compression
• Medical data grouping
What is an Outlier?
• A data object that does not follow the general pattern
of the data
• Also called an anomaly
• Often treated as noise or exceptions, but can be
valuable
Outlier Analysis (Anomaly Mining)
• Process of detecting and analyzing these unusual data
points
• Can reveal rare but important events
• Common in:
– Fraud Detection
– Network Security
– Health Monitoring
Example – Credit Card Fraud
• 💳 Unusual behavior detection:
– Large purchases not typical for a customer
– Purchases in a different country or city
– Unusual purchase frequency
• → Might signal credit card misuse or theft
Outlier Detection Methods
• Statistical Methods:
– Based on probability distributions
– Identify points far from the expected values
• Distance-Based Methods:
– Outliers = Points far from their nearest neighbors or from any cluster
• Density-Based Methods:
– Flag points whose local density is much lower than that of their neighbors
– Useful when global models fail
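A small sketch of the statistical approach, assuming the data is roughly normally distributed and using a z-score cutoff. The threshold of 2.0 and the purchase amounts are illustrative choices, not values from the slides.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []           # all values identical: nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical card purchases for one customer; the last one is unusually large.
amounts = [20, 25, 22, 19, 24, 21, 23, 500]
print(zscore_outliers(amounts))  # [500]
```

In a small sample a single extreme value inflates the standard deviation, so a stricter cutoff such as 3.0 would miss the 500 purchase here. This is one reason distance-based and density-based methods are also used.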
Applications of Outlier Analysis
• 📊 Financial Sector: Fraud and risk detection
• 🌐 Cybersecurity: Detecting intrusions
• 🚑 Healthcare: Identifying abnormal patient
records
• 📦 Quality Control: Detecting faulty products