Unit IV Ensemble Unsupervised Learning

The document covers Ensemble Learning and Unsupervised Learning, detailing techniques to improve model performance and analyze unlabeled data. Ensemble Learning combines multiple models through methods like Bagging, Boosting, and Stacking, while Unsupervised Learning includes clustering and dimensionality reduction techniques. Key concepts such as Random Forest, AdaBoost, K-Means, and PCA are discussed, highlighting their advantages and disadvantages.

Uploaded by

Boomika G

Unit IV: Ensemble Learning & Unsupervised Learning – Study Material

Ensemble Learning

Ensemble Learning is a technique in which multiple models are combined to improve overall performance. A well-built ensemble typically has lower error and handles data variability better than any of its individual models.

### Key Features:


1. Combines multiple weak learners to create a strong learner.
2. Improves generalization and reduces overfitting.
3. Works well for both classification and regression tasks.

### Types of Ensemble Learning:


- **Bagging**: Reduces variance by training multiple models on random bootstrap samples of the training data (e.g., Random Forest).
- **Boosting**: Reduces bias by training models sequentially, giving more weight to
misclassified instances (e.g., AdaBoost, Gradient Boosting).
- **Stacking**: Combines multiple models using a meta-learner for final predictions.

Model Combination Schemes

Different strategies exist for combining multiple models in ensemble learning.

1. **Voting**: In classification, multiple models vote, and the majority class is selected.
2. **Error-Correcting Output Codes (ECOC)**: Decomposes multi-class problems into
multiple binary classifications.
3. **Bagging (Bootstrap Aggregating)**: Trains models independently on different bootstrap samples of the data and averages their results (or takes a majority vote for classification).
4. **Boosting**: Models are trained sequentially, correcting errors from previous
models.
5. **Stacking**: Outputs from base learners are combined using another model (meta-learner) for final predictions.
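
The voting scheme from the list above can be sketched in a few lines of Python. This is only an illustration: the three prediction lists and the helper name `majority_vote` are made up for the example and do not come from any particular library.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common label among the base models' predictions."""
    # most_common(1) returns [(label, count)]; on a tie the label seen first wins.
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical outputs of three classifiers for the same four samples.
model_outputs = [
    ["cat", "dog", "dog", "cat"],  # model A
    ["cat", "cat", "dog", "dog"],  # model B
    ["dog", "dog", "dog", "cat"],  # model C
]

# zip(*...) regroups the predictions per sample before voting.
ensemble = [majority_vote(votes) for votes in zip(*model_outputs)]
print(ensemble)  # ['cat', 'dog', 'dog', 'cat']
```

Each individual model makes at least one prediction that differs from the ensemble output, yet the vote settles on a single label per sample.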

Bagging: Random Forest

Bagging (Bootstrap Aggregating) improves stability and accuracy by combining many models trained on bootstrap samples, which reduces variance and therefore overfitting.


### **Random Forest**:
- Uses multiple Decision Trees trained on different subsets of data.
- Predictions are averaged (regression) or majority-voted (classification).
- Scales well to large datasets; some implementations can also handle missing values.
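
The bagging idea behind Random Forest can be sketched as follows. This is a simplified illustration, not a full Random Forest: the base learner is a one-dimensional threshold classifier (a stump) rather than a Decision Tree, there is no random feature selection, and the data and function names (`fit_stump`, `bagging_fit`, `bagging_predict`) are assumptions made for the example.

```python
import random
from collections import Counter

def fit_stump(sample):
    """Fit a 1-D threshold classifier that predicts 1 when x > threshold."""
    best = None
    for threshold, _ in sample:
        errors = sum((x > threshold) != bool(y) for x, y in sample)
        if best is None or errors < best[0]:
            best = (errors, threshold)
    return best[1]

def bagging_fit(data, n_models, rng):
    """Train n_models stumps, each on its own bootstrap sample."""
    models = []
    for _ in range(n_models):
        # Bootstrap sample: draw len(data) points with replacement.
        sample = [rng.choice(data) for _ in data]
        models.append(fit_stump(sample))
    return models

def bagging_predict(models, x):
    """Classify x by majority vote over all stumps."""
    votes = [int(x > threshold) for threshold in models]
    return Counter(votes).most_common(1)[0][0]

rng = random.Random(0)  # fixed seed so the run is repeatable
data = [(1.0, 0), (2.0, 0), (3.0, 0), (6.0, 1), (7.0, 1), (8.0, 1)]
models = bagging_fit(data, n_models=25, rng=rng)
print(bagging_predict(models, 1.5), bagging_predict(models, 7.5))
```

An individual stump trained on an unlucky bootstrap sample can place its threshold badly, but the vote across 25 stumps smooths that out, which is the variance reduction bagging aims for.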

### **Advantages**:
- Reduces overfitting.
- Works well with high-dimensional data.
- Can be used for feature importance ranking.

### **Disadvantages**:
- Requires more computational power.
- Loses interpretability compared to individual Decision Trees.

Boosting: AdaBoost

Boosting combines weak models sequentially, giving more weight to misclassified instances.

### **AdaBoost (Adaptive Boosting)**:


- Assigns weights to each sample and updates them iteratively.
- Focuses on misclassified samples to improve predictions.
- Uses weak classifiers like Decision Stumps.
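
The weighting scheme described above can be sketched in plain Python. This is a minimal version of the standard binary AdaBoost update with labels in {-1, +1} and one-dimensional Decision Stumps; the toy data and the function names are assumptions chosen for illustration.

```python
import math

def stump_predict(x, threshold, polarity):
    """Predict `polarity` when x > threshold, otherwise the opposite label."""
    return polarity if x > threshold else -polarity

def best_stump(xs, ys, weights):
    """Pick the threshold and polarity with the lowest weighted error."""
    best = None
    for threshold in xs:
        for polarity in (1, -1):
            error = sum(
                w for x, y, w in zip(xs, ys, weights)
                if stump_predict(x, threshold, polarity) != y
            )
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best

def adaboost_fit(xs, ys, n_rounds):
    n = len(xs)
    weights = [1.0 / n] * n  # start with uniform sample weights
    ensemble = []
    for _ in range(n_rounds):
        error, threshold, polarity = best_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)  # weight of this stump
        ensemble.append((alpha, threshold, polarity))
        # Misclassified samples (y * prediction == -1) gain weight, the rest lose it.
        weights = [
            w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
            for x, y, w in zip(xs, ys, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]  # renormalize to sum to 1
    return ensemble

def adaboost_predict(ensemble, x):
    """Sign of the alpha-weighted sum of stump predictions."""
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1

# No single stump can separate this pattern (+, +, -, -, +, +).
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
ensemble = adaboost_fit(xs, ys, n_rounds=3)
print([adaboost_predict(ensemble, x) for x in xs])
```

On this toy data a single stump misclassifies at least two points, while three boosted stumps reproduce all six labels.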

### **Advantages**:
- Reduces bias, improving weak classifiers.
- Often more accurate than bagging when the base learners underfit the data.

### **Disadvantages**:
- Sensitive to noise in the dataset.
- Slower training due to sequential model building.

Unsupervised Learning

Unsupervised Learning finds patterns in **unlabeled data**. Unlike supervised learning, it does not rely on predefined outputs.

### **Key Features**:


1. Works with **unlabeled** data.
2. Groups similar data points or reduces dimensionality.
3. Used in anomaly detection, recommendation systems, and exploratory data analysis.

### **Main Types**:


- **Clustering**: Groups similar data points.
- **Dimensionality Reduction**: Reduces dataset complexity while preserving essential
information (e.g., PCA, LLE, Factor Analysis).

Clustering: Introduction

Clustering is an unsupervised learning technique that **groups similar data points** based on some similarity measure.

### **Types of Clustering**:


1. **Hierarchical Clustering**: Builds a hierarchy of clusters (e.g., AGNES, DIANA).
2. **Partitional Clustering**: Divides data into distinct clusters (e.g., K-Means, K-Mode).
3. **Density-Based Clustering**: Identifies clusters based on dense regions (e.g.,
DBSCAN, Mean-Shift).

Hierarchical Clustering: AGNES & DIANA

Hierarchical Clustering builds a nested structure of clusters.

### **AGNES (Agglomerative Nesting)**:


- A **bottom-up** approach: Each data point starts as its own cluster and merges step by
step.
- Uses linkage methods (single, complete, average).
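
The bottom-up merging can be sketched as below for one-dimensional points. This is a naive illustration with hypothetical helper names; it stops once a target number of clusters is reached, which is one way of cutting the hierarchy, and it recomputes all pairwise linkages on every step rather than using an efficient update.

```python
def single_linkage(a, b):
    """Distance between the closest pair of points, one from each cluster."""
    return min(abs(p - q) for p in a for q in b)

def complete_linkage(a, b):
    """Distance between the farthest pair of points, one from each cluster."""
    return max(abs(p - q) for p in a for q in b)

def agnes(points, n_clusters, linkage=single_linkage):
    clusters = [[p] for p in points]  # each point starts as its own cluster
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into cluster i
        del clusters[j]
    return clusters

print(agnes([1.0, 1.5, 5.0, 5.2, 9.0], n_clusters=3))
# [[1.0, 1.5], [5.0, 5.2], [9.0]]
```

Recording each merge and its distance, instead of stopping early, would give the information needed to draw a dendrogram.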

### **DIANA (Divisive Analysis)**:


- A **top-down** approach: All data points start in one cluster and are split iteratively.

### **Advantages**:
- No need to predefine the number of clusters.
- Dendrograms provide visual insights.

### **Disadvantages**:
- Computationally expensive for large datasets.
- Sensitive to noise and outliers.

Partitional Clustering: K-Means & K-Mode

Partitional Clustering divides data into **fixed K clusters**.

### **K-Means Clustering**:


- Assigns data points to **K clusters** based on distance (usually Euclidean).
- Iteratively updates centroids to minimize within-cluster variance (the sum of squared distances from each point to its centroid).
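
The assign-then-update loop can be sketched as follows. This is a minimal version of the usual iteration (often called Lloyd's algorithm), assuming the initial centroids are passed in explicitly and the points are tuples of floats; the sample data is made up.

```python
import math

def kmeans(points, centroids, n_iters=100):
    """Alternate between assigning points and recomputing centroids."""
    for _ in range(n_iters):
        # Assignment step: attach each point to its nearest centroid (Euclidean).
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
        if new_centroids == centroids:  # converged: assignments are stable
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

Passing the starting centroids as an argument makes the dependence on initialization explicit: a different starting pair can lead to a different final partition.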

### **K-Mode Clustering**:


- Used for categorical data instead of numerical values.
- Replaces means with **modes** (most frequent values).
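
The two ingredients that distinguish K-Mode from K-Means can be sketched as below: a column-wise mode in place of the mean, and a simple matching dissimilarity (the number of attributes that differ) in place of Euclidean distance. The records and function names are made up for the example, and the matching dissimilarity is the commonly used choice rather than something specified in these notes.

```python
from collections import Counter

def matching_distance(a, b):
    """Count the attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))

def cluster_mode(records):
    """Column-wise mode: the most frequent category for each attribute."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))

records = [
    ("red", "small", "round"),
    ("red", "large", "round"),
    ("blue", "small", "round"),
]
mode = cluster_mode(records)
print(mode)                                                # ('red', 'small', 'round')
print(matching_distance(("blue", "large", "square"), mode))  # 3
```

Plugging these two functions into the assign-then-update loop of K-Means gives the basic K-Mode procedure.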

### **Advantages**:
- Fast and scalable for large datasets.
- Works well when clusters are well-separated.

### **Disadvantages**:
- Sensitive to initial cluster centers.
- Does not handle outliers well.

Dimensionality Reduction: PCA & LLE

Dimensionality reduction techniques help reduce the number of features while preserving
important information.

### **Principal Component Analysis (PCA)**:


- Finds new feature axes (principal components) that maximize variance.
- Used in image compression and face recognition.
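
The variance-maximizing axis can be sketched without any external library. The example below centers the data, builds the sample covariance matrix, and uses power iteration to approximate the eigenvector with the largest eigenvalue, which is the first principal component. Power iteration is a choice made for this sketch (library implementations typically use an SVD or a full eigendecomposition), and the data points are invented.

```python
import math

def first_principal_component(points, n_iters=100):
    n, dim = len(points), len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(dim)]
    centered = [[p[d] - means[d] for d in range(dim)] for p in points]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(dim)]
        for i in range(dim)
    ]
    # Power iteration: repeatedly apply cov and renormalize.
    # Assumes the start vector is not orthogonal to the leading eigenvector.
    v = [1.0] * dim
    for _ in range(n_iters):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Variance of the data projected onto v (equals v^T cov v).
    explained = sum(v[i] * cov[i][j] * v[j] for i in range(dim) for j in range(dim))
    total = sum(cov[d][d] for d in range(dim))
    return v, explained / total

points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8)]
direction, ratio = first_principal_component(points)
print(direction, ratio)
```

Because the sample points lie close to a straight line, almost all of the variance falls along the first component, so projecting onto it would reduce two features to one with little loss.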

### **Locally Linear Embedding (LLE)**:


- A nonlinear technique preserving local relationships in data.
- Suitable for highly nonlinear structures.

### **Advantages**:
- Reduces noise and redundancy.
- Speeds up model training.

### **Disadvantages**:
- Can lose interpretability.
- Assumes linearity (for PCA).
