Introduction To Data Mining Unit1

Introduction to datamining for beginners

UNIT – I

DATA MINING

Dr. M. Hemalatha
Department of Computer Science
Sri Ramakrishna College of Arts & Science
Coimbatore
SYLLABUS
Introduction - Data mining: an essential step in
knowledge discovery - Diversity of data types for
data mining - Mining various kinds of knowledge
- Multidimensional data summarization - Mining
frequent patterns, associations, and correlations
- Classification and regression for predictive
analysis - Cluster analysis - Deep learning - Outlier
analysis - Database technology and data mining -
Data mining and data science.
Data
• Datum means "an item given"
• Individual pieces of information
• Often organized in a structure:
– A table (represented by rows and columns)
– A tree (a set of nodes with parent-children
relationships)
– A graph (a set of connected nodes)
• Raw data, i.e., unprocessed data
Data, Information and Knowledge
• Data, information and knowledge frequently
overlap, mainly differing in abstraction
• Data is least abstract, information next least,
and knowledge most.
• Example:
– the height of Mt. Everest – Data
– a book on Mt. Everest geological characteristics –
Information
– a report containing practical information on the
best way to reach Mt. Everest's peak – Knowledge
What is Data Mining?
• Extracting useful patterns from large datasets.
• Also known as Knowledge Discovery from Data (KDD).
• Example: Market basket analysis in retail.
• Data mining is often defined as finding hidden
information or extracting meaningful information
from large databases.
• It is also called exploratory data analysis, data-driven
discovery and deductive learning.
Why Data Mining Is Important
• Helps in decision-making by revealing hidden trends.
• Used in business, healthcare, finance, and more.
• Example: Fraud detection in banking transactions.
• Alternative names:
– Knowledge discovery (mining) in databases (KDD)
– Knowledge extraction
– Data/pattern analysis
– Data archeology
– Data dredging
– Information harvesting
– Business intelligence, etc.
Steps in Knowledge Discovery

• Selection: Obtain data from various sources.
• Preprocessing: Cleanse data.
• Transformation: Convert to a common format or
transform to a new format.
• Data Mining: Obtain desired results.
• Interpretation/Evaluation: Present results to the
user in a meaningful manner.
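The five steps above can be sketched on a toy example. This is only an illustration: the two "sources", the record fields, and the choice of "total spend per customer" as the mining result are all hypothetical.

```python
# Hypothetical raw records from two sources (illustrative data only).
store_a = [{"customer": "c1", "amount": "120.50"}, {"customer": "c2", "amount": None}]
store_b = [{"customer": "c3", "amount": "80"}, {"customer": "c1", "amount": "300"}]

# Selection: obtain data from various sources.
selected = store_a + store_b

# Preprocessing: cleanse data (here: drop records with a missing amount).
cleaned = [r for r in selected if r["amount"] is not None]

# Transformation: convert to a common format (amounts as floats).
transformed = [{"customer": r["customer"], "amount": float(r["amount"])} for r in cleaned]

# Data mining: obtain the desired result (here: total spend per customer).
totals = {}
for r in transformed:
    totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["amount"]

# Interpretation/evaluation: present results in a meaningful manner.
for customer, total in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{customer}: {total:.2f}")
```

In practice the mining step would be one of the techniques covered later (classification, clustering, and so on); a simple aggregate is used here only to keep the sketch short.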
Data Mining Application Areas

1. Business and E-Commerce Data
• Major source category of data for DM
applications
• Back office, front office and network
applications produce large amounts of
data about business processes
– Business Transactions
– Electronic Commerce
2. Scientific, Engineering and Health Care
Data
• Genomic Data
• Sensor Data
• Simulation Data
• Healthcare Data
• Web Data
• Multimedia Documents
• Data Web
3. Other Application Areas
• Risk Analysis
• Targeted Marketing
• Customer Retention
• Portfolio Management
• Brand Loyalty
• Banking
Types of Data
Data Type         Explanation                                Examples
Structured        Organized in tables with rows/columns      Customer databases, Excel files
Semi-Structured   Flexible structure using tags (XML/JSON)   JSON from APIs, XML product listings
Unstructured      No fixed structure; rich in content        Emails, documents, images, videos
Time-Series       Data over time intervals                   Stock prices, weather data
Spatial           Geographical or location-based data        GPS data, map coordinates
Multimedia        Rich content like audio/video/images       Medical scans, surveillance videos
Multidimensional Data Models

• Data cube structure with dimensions and
measures.
• Enables OLAP (Online Analytical Processing).
• Example: Sales data by product, time, region.
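A data cube can be pictured as a set of fact rows, each carrying dimension values and a measure. The sketch below shows a roll-up (summing the measure over the dimensions that are dropped). The rows, the dimension names, and the `roll_up` helper are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, quarter, region, sales).
facts = [
    ("laptop", "Q1", "north", 100),
    ("laptop", "Q1", "south", 150),
    ("laptop", "Q2", "north", 120),
    ("phone", "Q1", "north", 80),
    ("phone", "Q2", "south", 60),
]
DIMENSIONS = ("product", "quarter", "region")

def roll_up(rows, keep):
    """Sum the sales measure over every dimension not listed in `keep`."""
    idx = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in idx)
        totals[key] += row[3]
    return dict(totals)

by_product = roll_up(facts, ["product"])
by_product_quarter = roll_up(facts, ["product", "quarter"])
grand_total = roll_up(facts, [])
print(by_product)
```

An OLAP engine precomputes or indexes such aggregates so that moving between levels of detail stays fast on large data; the dictionary here only mimics the idea.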
Database technology and data mining
• Databases are used to store, manage, and
retrieve vast amounts of structured data.
• Data mining is the process of discovering
patterns and knowledge from this data.
• Database technology provides the foundation
for efficient and scalable data mining.
Introduction to Data Mining Techniques

• Two major goals: Descriptive Analysis and
Predictive Modeling
• Key techniques:
🔹 Regression
🔹 Association Rule Discovery
🔹 Classification
🔹 Clustering
Regression Analysis
• Predictive technique used to estimate values
• Example: Predicting revenue based on
previous sales
• 📈 Use Case: Forecasting stock prices,
predicting housing prices
• Output: Continuous values
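As a minimal sketch of predicting revenue from previous sales, the code below fits a straight line by ordinary least squares. The monthly figures are made up, and the `fit_line` helper is a hypothetical name; real forecasting would usually need more than a single straight line.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical revenue for months 1..5 (illustrative numbers only).
months = [1, 2, 3, 4, 5]
revenue = [10.0, 12.0, 14.0, 16.0, 18.0]

slope, intercept = fit_line(months, revenue)
forecast = slope * 6 + intercept  # continuous-valued prediction for month 6
print(slope, intercept, forecast)
```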
Association Rule Discovery
• Identifies relationships between items in
datasets
• 📦 Example: Customers who buy bread often buy
butter
• Common in:
🔹 E-commerce recommendations
🔹 Market Basket Analysis
• Format: IF {Item A} THEN {Item B}
Classification
• Assigns data items to predefined categories
• Example: Classifying emails as spam or not
spam
• Used in:
🔹 Fraud detection
🔹 Medical diagnosis
🔹 Document sorting
Clustering
• Groups similar data points together
• No predefined labels (unsupervised learning)
• 🎯 Used in:
🔹 Customer segmentation
🔹 Image compression
🔹 Social network analysis
Frequent Pattern Mining
• Sub-field of data mining for discovering
recurring patterns
• Finds frequent itemsets (e.g., milk + bread
bought together)
• Basis for association rules
• Methods:
🔹 Apriori
🔹 FP-Growth
Apriori Algorithm (Brief)
• Works by identifying frequent individual items
• Expands them to larger itemsets based on
minimum support
• Good for market basket data
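The level-wise idea can be sketched as follows. This is a compact, unoptimized illustration, assuming `min_support` is given as an absolute transaction count; the baskets are hypothetical.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: count} for every itemset found in >= min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1 candidates: every individual item.
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all of its (k-1)-subsets are frequent.
        current = {c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent

# Hypothetical market baskets.
baskets = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "eggs"},
]
frequent = apriori(baskets, min_support=2)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The prune step is what keeps the search manageable: an itemset cannot be frequent if any of its subsets is infrequent, so such candidates are never counted.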
FP-Growth Algorithm (Brief)
• Faster than Apriori
• Uses a special data structure: FP-tree
• Compresses data and mines frequent patterns
without candidate generation
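To show how an FP-tree compresses transactions, the sketch below builds the tree only; the recursive mining step of FP-Growth is left out. The `FPNode` class, the helper names, and the baskets are hypothetical, and ties in item frequency are broken alphabetically as an arbitrary choice.

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_support):
    # Pass 1: count items and keep only the frequent ones.
    freq = Counter(item for t in transactions for item in set(t))
    freq = {i: c for i, c in freq.items() if c >= min_support}
    root = FPNode(None)
    # Pass 2: insert each transaction, items ordered by descending frequency,
    # so that transactions with common items share a prefix path.
    for t in transactions:
        ordered = sorted((i for i in set(t) if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
            child.count += 1
            node = child
    return root, freq

def count_nodes(node):
    """Number of item nodes below `node` (the root itself is not counted)."""
    return sum(1 + count_nodes(c) for c in node.children.values())

baskets = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "eggs"},
]
root, freq = build_fp_tree(baskets, min_support=2)
print(count_nodes(root), "tree nodes for", sum(len(b & set(freq)) for b in baskets), "item occurrences")
```

Shared prefixes are stored once with a count, which is where the compression comes from.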
Technique        Goal                 Example Use Case
Regression       Predict values       Predict housing price
Association      Find relationships   Product recommendation
Classification   Categorize data      Spam email detection
Clustering       Group similar data   Customer segmentation
Why Use Database Technology?
• Efficient storage, indexing, and retrieval of large datasets
• Built-in query processing (e.g., SQL)
• Seamless integration with data mining algorithms
• High scalability, consistency, and data integrity
Types of Databases Used
• Relational Databases (RDBMS): Structured data with tables
and keys
• Data Warehouses: Integrated, historical data for analytics
• NoSQL Databases: For unstructured/semi-structured data
(e.g., MongoDB, Cassandra)
• Distributed Databases: Handle large-scale data across nodes
What is Data Mining?
• The process of discovering meaningful patterns in large
datasets
• Uses statistical techniques and machine learning
Common techniques:
• Clustering
• Classification
• Association rule mining
• Regression
What is Data Science?
• A broader discipline involving the entire data lifecycle
• Combines statistics, machine learning, data engineering,
and domain knowledge
Tasks include:
• Data collection & preprocessing
• Model building
Role of SQL in Data Mining
• SQL is used to select, filter, and aggregate data for
mining
• Useful operations: GROUP BY, JOIN, WHERE,
HAVING
• Extensions like DMX (Data Mining Extensions) in
Microsoft SQL Server
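The operations listed above can be tried with Python's built-in `sqlite3` module. The table names, columns, and rows below are hypothetical; the query simply combines JOIN, WHERE, GROUP BY and HAVING to prepare an aggregated view that a mining step could consume.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE orders (customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Coimbatore'), (2, 'Chennai'), (3, 'Coimbatore');
INSERT INTO orders VALUES (1, 500), (1, 700), (2, 50), (3, 900), (2, 20);
""")

rows = conn.execute("""
SELECT c.city, COUNT(*) AS n_orders, SUM(o.amount) AS total
FROM orders AS o
JOIN customers AS c ON c.id = o.customer_id   -- combine the two tables
WHERE o.amount >= 50                          -- filter individual rows
GROUP BY c.city                               -- aggregate per city
HAVING SUM(o.amount) > 100                    -- filter whole groups
ORDER BY total DESC
""").fetchall()
conn.close()

for city, n_orders, total in rows:
    print(city, n_orders, total)
```

WHERE removes rows before grouping, while HAVING removes groups after aggregation, which is why both appear in the same query.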
Data Warehousing
• A centralized repository of data from multiple
sources
• Supports multidimensional analysis
• Enables summarization and trend analysis over
time
• Ideal for preparing data before mining
What Are Frequent Patterns?
• Patterns that occur frequently in data
• Help in identifying valuable relationships
• Applications: Market basket analysis,
recommendation systems
Types of Frequent Patterns
• Frequent Itemsets: Items often bought together
– Example: Milk & Bread
• Sequential Patterns: Items bought in a sequence
– Example: Laptop → Camera → Memory Card
• Frequent Substructures: Repeated structures like
graphs, trees
– Example: Social network connections
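For sequential patterns, order matters but the items need not be adjacent. A minimal sketch of counting how often a sequence occurs, assuming each customer's purchases are stored as an ordered list (the histories and the helper name are hypothetical):

```python
def contains_sequence(history, pattern):
    """True if the items of `pattern` appear in `history` in the same order."""
    it = iter(history)
    # `item in it` advances the iterator, so each match must come after the previous one.
    return all(item in it for item in pattern)

# Hypothetical purchase histories, one ordered list per customer.
histories = [
    ["laptop", "mouse", "camera", "memory card"],
    ["camera", "laptop"],
    ["laptop", "camera", "memory card"],
    ["phone"],
]
pattern = ["laptop", "camera", "memory card"]

matches = sum(contains_sequence(h, pattern) for h in histories)
sequence_support = matches / len(histories)
print(matches, sequence_support)
```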
Association Rule Mining
• 📊 Example Rule:
buys(X, "computer") ⇒ buys(X, "software")
• Support: 1% (1% of all transactions contain both
a computer and software)
• Confidence: 50% (of the customers who buy a
computer, 50% also buy software)
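Support and confidence are simple ratios over the transaction set. The sketch below computes both for the computer ⇒ software rule on five hypothetical transactions, so the resulting numbers differ from the 1% and 50% quoted above.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical transactions.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"computer", "software", "printer"},
    {"computer", "printer"},
]

rule_support = support(transactions, {"computer", "software"})
rule_confidence = confidence(transactions, {"computer"}, {"software"})
print(rule_support, rule_confidence)
```

Here 2 of 5 transactions contain both items (support 0.4), and 2 of the 4 transactions with a computer also contain software (confidence 0.5).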
Correlation Analysis
• Goes beyond simple association
• Finds statistical relationships between
attribute-value pairs
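One commonly used correlation measure for association rules is lift; the slide does not name a specific measure, so its use here is an assumption. Lift compares how often two items occur together with how often they would co-occur if they were independent. The baskets are hypothetical.

```python
def lift(transactions, a, b):
    """P(a and b) / (P(a) * P(b)); 1 means independent, > 1 positive, < 1 negative."""
    n = len(transactions)
    p_a = sum(1 for t in transactions if a in t) / n
    p_b = sum(1 for t in transactions if b in t) / n
    p_ab = sum(1 for t in transactions if a in t and b in t) / n
    return p_ab / (p_a * p_b)

# Hypothetical baskets.
baskets = [
    {"bread", "butter"},
    {"bread", "butter"},
    {"bread"},
    {"butter"},
    {"milk"},
]

bread_butter_lift = lift(baskets, "bread", "butter")
print(round(bread_butter_lift, 3))
```

A rule can have high confidence simply because its consequent is very common; a measure such as lift helps expose that case.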
What is Classification?
• Classification = Predicting a category or class
• Builds a model from training data (with known
class labels)
• Model is used to predict class labels of
new/unseen data
How Classification Works
• Training Data – Data with known outcomes
• Model Building – Learn patterns from data
• Prediction – Use model to classify new inputs
📈 Example: Email → Spam or Not Spam
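The three stages can be sketched with a deliberately simple model, a nearest-centroid classifier. This is one possible choice for illustration, not the method the slides prescribe. The two features (number of links, number of exclamation marks) and all data values are hypothetical.

```python
def train(examples):
    """Model building: compute the mean feature vector (centroid) of each class."""
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def predict(model, features):
    """Prediction: return the class whose centroid is closest to `features`."""
    def dist2(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: dist2(model[label]))

# Training data: (links, exclamation marks) with known outcomes.
training = [
    ((8, 5), "spam"),
    ((6, 7), "spam"),
    ((1, 0), "not spam"),
    ((0, 1), "not spam"),
]

model = train(training)
print(predict(model, (7, 4)))
print(predict(model, (1, 1)))
```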
Decision Tree Overview
• 🌳 A Decision Tree is a flowchart-like structure:
• Node: Test on an attribute
• Branch: Outcome of the test
• Leaf: Final class label
• 📊 Example:
IF age < 30
    IF student = yes → buy = yes
    ELSE → buy = no
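The example tree maps directly onto nested conditionals. In the sketch below, the function name is hypothetical, and the branch for age 30 or more is not given on the slide, so it returns `None` rather than guessing a label.

```python
def predict_buy(age, student):
    """Walk the example tree: each `if` is a node test, each `return` is a leaf."""
    if age < 30:                  # node: test on the attribute `age`
        if student == "yes":      # node: test on the attribute `student`
            return "yes"          # leaf: buy = yes
        return "no"               # leaf: buy = no
    return None                   # branch not specified in the example

print(predict_buy(25, "yes"))
print(predict_buy(25, "no"))
print(predict_buy(40, "yes"))
```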
Regression vs Classification
Feature          Classification             Regression
Output Type      Categorical (Labels)       Continuous (Numbers)
Example          Spam / Not Spam            Predict house price
Algorithm Used   Decision Tree, SVM, etc.   Linear Regression, etc.

Application Areas
• Classification:
– Spam Detection
– Disease Diagnosis
– Credit Approval
• Regression:
– Stock Price Prediction
– Sales Forecasting
– Temperature Estimation
What is Cluster Analysis?
• A form of unsupervised learning
• Finds natural groupings in data
• No class labels are used or needed
• Groups similar objects into clusters
Common Clustering Algorithms
• K-Means Clustering – Partitions data into K
clusters
• Hierarchical Clustering – Builds a tree of clusters
• DBSCAN – Finds arbitrarily shaped clusters,
handles noise
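Of the three algorithms, K-Means is the shortest to sketch: it alternates between assigning points to the nearest centroid and recomputing each centroid as the mean of its points. The points and the starting centroids below are hypothetical, and passing the initial centroids in explicitly (instead of choosing them at random) is a simplification to keep the run repeatable.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid_of(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))

def k_means(points, centroids, iterations=10):
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [centroid_of(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, centroids=[(1, 1), (8, 8)])
print(centroids)
print(clusters)
```

No labels are supplied anywhere; the two groups emerge from the distances alone.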
Applications of Clustering
• Customer segmentation in marketing
• Social network analysis
• Image recognition & compression
• Medical data grouping
What is an Outlier?
• A data object that does not follow the general pattern
of the data
• Also called an anomaly
• Often treated as noise or exceptions, but can be
valuable
Outlier Analysis (Anomaly Mining)
• Process of detecting and analyzing these unusual data
points
• Can reveal rare but important events
• Common in:
– Fraud Detection
– Network Security
– Health Monitoring
Example – Credit Card Fraud
• 💳 Unusual behavior detection:
• Large purchases not typical for a customer
• Purchases in a different country or city
• Unusual purchase frequency
• → Might signal credit card misuse or theft
Outlier Detection Methods
• Statistical Methods:
– Based on probability distributions
– Identify points far from the expected values
• Distance-Based Methods:
– Outliers = Points far from their nearest
neighbors or from any cluster
• Density-Based Methods:
– Look for points whose local density is much
lower than that of their neighbors (local anomalies)
– Useful when global models fail
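As a minimal sketch of the statistical approach, the code below flags values whose z-score (distance from the mean in standard deviations) exceeds a threshold. The purchase amounts, the helper name, and the threshold of 2.0 are hypothetical, and the method assumes roughly bell-shaped data.

```python
import statistics

def z_score_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        return []  # all values identical: nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical purchase amounts for one customer.
purchases = [10, 12, 11, 13, 12, 11, 95]
outliers = z_score_outliers(purchases)
print(outliers)
```

Because the mean and standard deviation are themselves pulled by extreme values, this simple test can miss outliers in small samples; robust statistics or the distance-based and density-based methods are common alternatives.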
Applications of Outlier Analysis
• 📊 Financial Sector: Fraud and risk detection
• 🌐 Cybersecurity: Detecting intrusions
• 🚑 Healthcare: Identifying abnormal patient
records
• 📦 Quality Control: Detecting faulty products
