PR - Unit 4
Fisher Discriminant Analysis (FDA), also known as Linear Discriminant Analysis (LDA), is a
technique used in statistics and machine learning to classify or separate data points into
different groups. It's commonly applied when you have two or more classes (or categories) and
want to find a way to distinguish between them.
1. Class Separation: The main goal of FDA is to find a line or direction that best separates
the classes in the data. For instance, if you have two types of flowers, FDA will try to
find the line that best divides them based on their characteristics, like petal length and
width.
2. Maximizing Distance Between Classes: FDA works by finding a direction that
maximizes the distance between the means (centers) of each class while minimizing the
variation (spread) within each class. This way, the classes are as separate as possible in
this new direction.
3. Projection to a New Axis: Once the best direction is found, all data points are projected
onto this axis. This makes the data more distinguishable along a single dimension, even
if it initially had many dimensions.
4. Classification: After projecting the data, we can set a threshold along the new axis to
decide the boundary between classes. If a new data point falls on one side of the
threshold, it belongs to one class; if it falls on the other side, it belongs to the other
class.
Steps in FDA
1. Calculate Class Means: For each class, calculate the average (mean) value for each feature.
2. Compute Scatter Matrices:
o Within-class scatter: Measures the spread of points within each class.
o Between-class scatter: Measures the distance between the means of each class.
3. Find the Optimal Direction: The next step is to find a direction that maximizes the
ratio of between-class scatter to within-class scatter. This is the "discriminant" direction
that best separates the classes.
4. Project Data: Project the original data points onto this direction to get a single value
for each point. This value represents where it falls along the new axis.
5. Classify New Points: Use the projection to assign new points to one of the classes by checking where they fall relative to the threshold on the new axis (a minimal code sketch of these steps follows below).
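For two classes, these steps can be written compactly: the optimal direction is w ∝ Sw⁻¹(m1 − m2), where Sw is the within-class scatter and m1, m2 are the class means. Below is a minimal NumPy sketch of this computation; the function name fisher_direction and the midpoint threshold are illustrative choices, not a fixed standard.

```python
import numpy as np

def fisher_direction(X1, X2):
    """Two-class Fisher discriminant (illustrative sketch).
    Returns the projection direction w and a midpoint threshold."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)   # 1. class means
    S1 = (X1 - m1).T @ (X1 - m1)                # 2. within-class scatter
    S2 = (X2 - m2).T @ (X2 - m2)
    Sw = S1 + S2
    w = np.linalg.solve(Sw, m1 - m2)            # 3. w ∝ Sw^-1 (m1 - m2); assumes Sw is invertible
    threshold = w @ (m1 + m2) / 2               # 4./5. midpoint between the projected class means
    return w, threshold

# To classify a new point x: assign class 1 if w @ x > threshold, else class 2.
```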
Imagine you have two types of flowers, and you measured the petal length and petal width of
each. The two types of flowers form two clusters in this feature space. FDA will try to find a
line that best divides these two clusters by considering the centers and spreads of the clusters.
Once it finds this line, you can project new flowers onto it and determine which type they are
based on where they fall on the line.
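As a quick illustration, with made-up petal measurements and reusing the fisher_direction sketch above:

```python
import numpy as np

# Hypothetical petal measurements (length, width) for two flower types
type_a = np.array([[4.7, 1.4], [4.5, 1.5], [4.9, 1.5], [4.6, 1.3]])
type_b = np.array([[1.4, 0.2], [1.3, 0.2], [1.5, 0.4], [1.4, 0.3]])

w, threshold = fisher_direction(type_a, type_b)   # from the sketch above
new_flower = np.array([4.8, 1.4])
print("type A" if w @ new_flower > threshold else "type B")   # -> type A
```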
Advantages:
• Simple and interpretable, as it finds a single direction that best separates classes.
• Effective for linearly separable data (where a line or plane can split the classes).
Limitations:
• Assumes that the data in each class is roughly normally distributed with the same spread (covariance), which may not always be true.
• May not work well if classes are not linearly separable.
In summary, Fisher Discriminant Analysis helps separate classes by finding a single direction
where the differences between them are most pronounced, making classification easier and
more accurate.
Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a technique for reducing the number of variables in a dataset while keeping the most important information. Common reasons to use it:
1. Data Simplification:
o Imagine trying to understand a dataset with hundreds of variables. PCA helps
condense these variables into a smaller number of "principal components" that
capture the most important patterns.
2. Remove Redundancy:
o Many datasets have correlated variables (e.g., height and weight are related).
PCA identifies and removes this redundancy.
3. Better Visualization:
o High-dimensional data is hard to visualize. PCA reduces the dataset to two or
three dimensions so you can plot and see patterns.
4. Speed Up Computation:
o Fewer variables mean faster computations, especially in machine learning
models.
Think of PCA as finding a new way to look at the data. Instead of using the original variables,
it creates principal components, which are new variables that capture most of the variation in
the data.
How PCA Works
1. Standardize the Data: Data is standardized so that all variables have the same scale (mean = 0, standard deviation = 1). This is important because variables measured on different scales (e.g., height in meters vs. weight in kilograms) can dominate the results.
2. Compute the Covariance Matrix: The covariance matrix measures how variables relate to each other. If two variables change together (e.g., height and weight increase together), they will have a high covariance.
3. Compute Eigenvalues and Eigenvectors: Principal components are ranked by their eigenvalues. The first principal component captures the most variation, the second captures the next most, and so on.
4. Select Components: You choose how many components to keep, typically enough to explain 90-95% of the total variation in the data.
5. Project the Data: The original data is projected onto the selected principal components, creating a new dataset with fewer dimensions (a code sketch of these steps follows after this list).
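A minimal NumPy sketch of these five steps; the function name pca and the choice to keep a fixed number of components are illustrative assumptions.

```python
import numpy as np

def pca(X, n_components=2):
    """Minimal PCA via the covariance matrix (illustrative sketch)."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)            # 1. standardize
    C = np.cov(Xs, rowvar=False)                         # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)                 # 3. eigenvalues / eigenvectors
    order = np.argsort(eigvals)[::-1]                    #    rank components by eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    W = eigvecs[:, :n_components]                        # 4. keep the top components
    explained = eigvals[:n_components].sum() / eigvals.sum()
    return Xs @ W, explained                             # 5. project the data
```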
Key Terms
• Principal Components:
o These are new variables created by PCA. Each principal component is a linear
combination of the original variables.
• Variance:
o A measure of how spread out the data is. PCA aims to capture the directions
where the data varies the most.
• Dimensionality Reduction:
o The process of reducing the number of variables (dimensions) in a dataset while keeping as much of the important variation as possible.
Example
Imagine you're analyzing the preferences of customers in a bakery. You've collected data on how much they like cookies, cakes, and pastries. Here's what PCA would do:
1. Standardize: Make sure that "likes cookies," "likes cakes," and "likes pastries" are
measured on the same scale.
2. Covariance Matrix: Calculate how much liking cookies is related to liking cakes and
pastries.
3. Eigenvalues and Eigenvectors: Identify the directions (principal components) that
capture the most customer preference patterns.
4. Select Components: If the first two components capture 95% of the variation, you’ll
use only these two.
5. Transform: Combine "likes cookies," "likes cakes," and "likes pastries" into just two
new variables (the first two principal components).
Now, instead of analyzing three variables, you work with two, making the analysis simpler
without losing much information.
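In practice these steps are usually a few library calls. The snippet below uses scikit-learn; the 1-5 ratings are invented purely for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical 1-5 ratings: columns = likes_cookies, likes_cakes, likes_pastries
ratings = np.array([[5, 4, 5], [2, 1, 2], [4, 5, 4], [1, 2, 1], [5, 5, 4]], dtype=float)

scaled = StandardScaler().fit_transform(ratings)   # step 1: standardize
pca = PCA(n_components=2).fit(scaled)              # steps 2-4: covariance, eigenvectors, selection
scores = pca.transform(scaled)                     # step 5: project onto 2 new variables

print(scores.shape)                         # (5, 2): two components per customer
print(pca.explained_variance_ratio_.sum())  # fraction of the variation they retain
```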
Advantages of PCA
1. Simplifies Analysis:
o Reduces the number of variables while keeping the core patterns.
2. Improves Performance:
o Speeds up machine learning algorithms by reducing dimensions.
3. Removes Noise:
o Filters out less important information (lower-variance components).
4. Visualization:
o Helps visualize high-dimensional data in 2D or 3D plots.
Limitations of PCA
1. Loss of Interpretability:
o Principal components are combinations of original variables, so they may not
have a clear meaning.
When to Use PCA
• When you have too many variables and they're highly correlated.
• When you want to visualize high-dimensional data.
• When you want to reduce the complexity of your machine learning model.
In summary, PCA is a powerful tool for reducing the number of variables in a dataset while
keeping the most important information. It simplifies analysis, speeds up computation, and
makes data easier to visualize and interpret.
In pattern recognition, factor analysis and linear discriminant analysis (LDA) are techniques
used to identify and classify patterns in data. Gradient descent is an optimization method often
used to "teach" these models by finding the best parameters that minimize some form of error.
1. Factor Analysis
Factor analysis is a technique used to reduce the number of variables (features) in data by
finding a few "factors" that can explain the relationships among the observed variables. This
is useful in pattern recognition because it simplifies complex data, making it easier to identify
meaningful patterns.
For example, imagine we have a dataset with numerous observed features like age, height,
weight, income, and education level. These features could be combined into fewer "factors"
(e.g., lifestyle or socioeconomic status) that capture the main patterns without needing all the
original data.
2. Linear Discriminant Analysis (LDA)
LDA is a technique used for classifying data into different categories (or classes) by finding
a line (or a decision boundary) that best separates those classes. It’s commonly used in pattern
recognition to tell which category a new data point should belong to.
For example, suppose you have a dataset with two classes, say, "Apples" and "Oranges." Each
fruit has features like weight and color intensity. LDA finds a decision boundary (or line) that
separates the "Apples" from the "Oranges" by maximizing the distance between the two
classes.
3. Gradient Descent
Gradient descent is an optimization technique that can be used in both factor analysis and LDA to "tune" the model by adjusting its parameters (weights) so as to minimize an error measure, for example a reconstruction error in factor analysis or the classification error in LDA. The basic loop is:
1. Initialize Parameters: Start with some random values for the parameters (like the
factors or weights in the model).
2. Calculate the Error: Check how far off the current model predictions are from the
actual class labels.
3. Update Parameters: Adjust the parameters a little in the direction that reduces the
error, based on the gradient.
4. Repeat: Keep doing this until the error stops decreasing or becomes very small, meaning the model has learned the best parameters to recognize patterns in the data (a minimal sketch of this loop follows below).
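A minimal sketch of this loop, assuming a linear decision function f(x) = w·x + b trained with a squared-error loss on labels coded as +1/−1 (one simple choice among many):

```python
import numpy as np

def train_linear_boundary(X, y, lr=0.01, epochs=500):
    """Gradient descent on a squared-error loss for a linear decision
    function f(x) = w.x + b, with labels y in {+1, -1} (illustrative)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0                 # 1. initialize parameters
    for _ in range(epochs):
        error = X @ w + b - y               # 2. how far predictions are from the labels
        grad_w = (2 / n) * (X.T @ error)    # 3. gradient of the mean squared error
        grad_b = (2 / n) * error.sum()
        w -= lr * grad_w                    #    step a little against the gradient
        b -= lr * grad_b
    return w, b                             # 4. repeat until the error stops shrinking

# Classify a new point x as +1 if x @ w + b > 0, else -1.
```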
Example
Imagine you’re a farmer with a data set of fruits with features like size and color intensity.
You want to classify these fruits as either apples or oranges.
1. Applying LDA: You’ll use LDA to find the best boundary (line) that separates apples
from oranges.
2. Using Gradient Descent in LDA:
o You start by randomly setting the boundary line and calculate how many fruits
are misclassified.
o If 20% of apples are misclassified as oranges and 15% of oranges as apples,
that’s the "error."
o Gradient descent will adjust the boundary line slightly to reduce this error, step
by step.
o After enough iterations, the boundary line will have adjusted to minimize
misclassification.
Without gradient descent, it would be challenging to find the ideal parameters that best separate
your classes. Gradient descent systematically reduces the error, leading to a model that
effectively recognizes patterns by tuning itself based on the data.
So, in pattern recognition using factor analysis or LDA, gradient descent is crucial because it
enables the model to "learn" the parameters that best explain and classify the data.
The Perceptron is a basic unit in neural networks and an algorithm for classification in pattern
recognition. It’s a type of linear classifier that’s particularly useful for understanding how
models like linear discriminant analysis (LDA) work, as it shares a similar goal: separating
data points into different classes.
In the context of pattern recognition, factor analysis, and linear discriminant functions, a
Perceptron can be used to classify data into categories by finding a boundary (usually a line)
that separates those categories based on their features.
1. What is a Perceptron?
A Perceptron is a simple algorithm that takes multiple input values (features) and outputs a
single value (a decision about classifying the data into one of two classes). It's based on a
weighted sum of inputs and a threshold:
1. Weighted Sum of Inputs: Each input (feature) is multiplied by a weight that shows its
importance, and all the weighted inputs are added up.
2. Activation/Threshold: If this weighted sum exceeds a certain threshold, the Perceptron
"fires" (outputs 1), and if not, it does not fire (outputs 0). This is a very simple way of
deciding if an input belongs to one class or another.
2. Perceptron in Pattern Recognition
In pattern recognition, you might have data with two classes that you want to separate, such as
images of cats and dogs. Each image has features like size, color, and shape. The Perceptron
will try to learn weights for each feature to create a decision boundary that separates cats from
dogs.
3. Training a Perceptron
The Perceptron learns how to classify data by adjusting its weights through a process called
training, using gradient descent or Perceptron learning rule. During training:
1. The Perceptron receives an input and makes an initial guess about the class.
2. If it’s wrong, it adjusts the weights a little bit to improve its next prediction.
3. This process is repeated until the Perceptron can separate the classes correctly (a minimal training sketch follows below).
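A minimal sketch of the Perceptron learning rule; the function names perceptron_fit and perceptron_predict are illustrative.

```python
import numpy as np

def perceptron_fit(X, y, lr=0.1, epochs=100):
    """Perceptron learning rule (illustrative sketch); labels y in {0, 1}."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            guess = 1 if xi @ w + b > 0 else 0   # 1. guess with the current weights
            error = yi - guess                   # 2. if wrong, error is +1 or -1
            w += lr * error * xi                 #    nudge weights toward the right answer
            b += lr * error
    return w, b                                  # 3. repeat until the classes are separated

def perceptron_predict(X, w, b):
    return (X @ w + b > 0).astype(int)
```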
Example
Suppose you want to classify emails as either "spam" or "not spam" based on features like:
• whether the email contains the word "free"
• the length of the email
• the presence of other specific keywords
The Perceptron will learn weights for each of these features. For example:
• "free" might have a high weight because it’s often in spam emails.
• Email length might have a smaller weight because both spam and non-spam emails
can be short or long.
• Specific keywords might have weights based on their relevance.
During training, the Perceptron will adjust these weights to minimize misclassifications. After
training, if you input a new email, it will use these learned weights to classify the email as spam
or not spam.
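As a toy illustration, reusing the perceptron_fit sketch above with hypothetical, hand-made feature values:

```python
import numpy as np

# Hypothetical features: contains "free" (0/1), email length (scaled), count of other keywords
X = np.array([[1, 0.2, 3],
              [0, 0.8, 0],
              [1, 0.1, 2],
              [0, 0.5, 1]], dtype=float)
y = np.array([1, 0, 1, 0])                   # 1 = spam, 0 = not spam

w, b = perceptron_fit(X, y)                  # learn the feature weights
new_email = np.array([[1, 0.3, 4]])          # contains "free", short, many keywords
print(perceptron_predict(new_email, w, b))   # -> [1], classified as spam
```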
In linear discriminant functions like LDA, the goal is similar: to find a decision boundary
that separates classes. The Perceptron, as a linear classifier, also finds a boundary, but it does
so with an iterative learning process that adjusts weights based on errors. The Perceptron only
works well when the classes are linearly separable (can be separated by a straight line or
plane).
Summary
• The Perceptron is a simple linear classifier that decides between two classes based on
a weighted sum of inputs.
• It’s trained to minimize classification errors by adjusting weights, much like how
gradient descent tunes parameters in other models.
• In pattern recognition, the Perceptron can classify objects (like spam vs. not spam) by
finding a linear boundary based on input features.
Support Vector Machines (SVMs) are typically used for classification tasks and work well with
numeric data. However, when dealing with non-numeric (categorical) data or when a non-
metric approach is needed, modifications are required. Let's break down the concepts in simple
language, focusing on how SVMs can be adapted for non-metric data (data that isn't just
numbers or doesn't follow traditional distance-based metrics).
Non-metric data includes things like categorical variables (e.g., color, type of product, or even
text data). This data doesn't have a natural ordering or numeric relationship, so measuring
"distances" between points isn't straightforward.
1. Kernel Trick
The kernel trick is a core part of SVMs that allows it to work in non-linear spaces, meaning
you don’t always need numeric inputs in the traditional sense. For non-metric data, the kernel
function is modified to measure the similarity between different categories rather than
measuring "distance."
Example:
Imagine you're classifying animals into "mammals" and "non-mammals." You might have categorical features like "has fur," "lays eggs," "number of legs," etc. Here, instead of numeric distances, you can use a kernel function that compares these categories directly. One such kernel is the Hamming kernel, which compares two sets of categorical features position by position, counting how many entries match or mismatch.
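One simple way to realize such a kernel (an illustrative choice; textbook definitions vary) is the fraction of matching positions between two categorical feature vectors:

```python
import numpy as np

def overlap_kernel(A, B):
    """Hamming-style similarity: fraction of categorical positions that match."""
    K = np.zeros((len(A), len(B)))
    for i, a in enumerate(A):
        for j, b in enumerate(B):
            K[i, j] = sum(x == y for x, y in zip(a, b)) / len(a)
    return K

animals = [("has fur", "no eggs", "4 legs"),
           ("no fur", "lays eggs", "2 legs"),
           ("has fur", "no eggs", "2 legs")]
K = overlap_kernel(animals, animals)
print(K)   # 3x3 similarity matrix, e.g. K[0, 2] = 2/3 (two attributes match)

# K can be handed to an SVM that accepts precomputed kernels,
# e.g. sklearn.svm.SVC(kernel="precomputed").fit(K, labels)
```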
2. Encoding Categorical Data
One way to handle non-metric or categorical data is by converting it into a numeric format. Some common methods for encoding are:
• One-Hot Encoding: Each category gets its own binary (0/1) column, so no artificial ordering is introduced (e.g., blue becomes [0, 1, 0] across the color columns).
• Label Encoding: Each category is assigned a unique number (e.g., red = 1, blue = 2, green = 3). This is simpler but assumes some implicit ordering or relationship between categories, which might not always be appropriate.
Once the data is transformed into numeric form, the traditional SVM approach can be applied.
Example: You're classifying clothing items based on "color" and "type" (shirt, pants, jacket). After one-hot encoding, each clothing item is represented numerically (e.g., [0,1,0] for blue and [1,0,0] for a shirt), and SVM can be used as usual (a short encoding sketch follows below).
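A short sketch of both encodings using pandas, with a made-up clothing table:

```python
import pandas as pd

# Made-up clothing data with two categorical features
items = pd.DataFrame({"color": ["blue", "red", "green"],
                      "type":  ["shirt", "pants", "shirt"]})

# One-hot encoding: every category becomes its own 0/1 column
one_hot = pd.get_dummies(items, dtype=int)      # e.g. color_blue, type_shirt, ...

# Label encoding: every category becomes an integer (implies an ordering!)
label_encoded = items.apply(lambda col: col.astype("category").cat.codes)

print(one_hot.values)        # numeric matrix a standard SVM can consume
print(label_encoded.values)
```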
3. String Kernels
For text data or any non-numeric sequences (like DNA sequences), you can use specialized
kernels like the String Kernel. This kernel measures the similarity between two strings or
sequences based on their common subsequences.
Example:
Suppose you are comparing two short DNA sequences:
• ACGT
• ACTT
The String Kernel would look for shared subsequences like "AC" and "T" to calculate how
similar they are, without needing to turn the data into numbers. This similarity measure can
then be used in the SVM algorithm.
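Full subsequence kernels are somewhat involved; the sketch below uses a simplified "spectrum" variant that counts shared k-length substrings, which conveys the same idea of scoring string similarity without converting the data to numbers.

```python
from collections import Counter

def spectrum_kernel(s, t, k=2):
    """Simplified string kernel: number of shared k-length substrings,
    counted with multiplicity (a 'spectrum' kernel)."""
    kmers_s = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    kmers_t = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    return sum(kmers_s[m] * kmers_t[m] for m in kmers_s)

print(spectrum_kernel("ACGT", "ACTT"))   # the 2-mer "AC" is shared -> 1
print(spectrum_kernel("ACGT", "ACGT"))   # all three 2-mers shared  -> 3
```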
4. Graph Kernels
If your data has a structured, non-numeric form (e.g., graphs representing chemical compounds
or social networks), graph kernels can be used. These kernels compare graphs based on their
structure and can work with SVM to classify them.
Example:
Imagine you're classifying different chemical molecules. Each molecule can be represented as
a graph, where atoms are nodes, and bonds are edges. A graph kernel would compare the
structures of these molecules, rather than comparing numeric values, and feed that similarity
information to the SVM.
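Real graph kernels (random-walk, Weisfeiler-Lehman, and others) are more sophisticated, but a toy version that compares node-degree histograms shows the general idea of turning graph structure into a similarity score:

```python
import numpy as np

def degree_histogram_kernel(adj_a, adj_b, max_degree=5):
    """Toy graph kernel: dot product of the two graphs' node-degree histograms."""
    def hist(adj):
        degrees = np.asarray(adj).sum(axis=1).astype(int)
        return np.bincount(degrees, minlength=max_degree + 1)[:max_degree + 1]
    return float(hist(adj_a) @ hist(adj_b))

# Two tiny molecules given as adjacency matrices (made-up structures)
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
path     = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
print(degree_histogram_kernel(triangle, path))   # similarity score between the two graphs
```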
Example: Classifying Animals
Problem: Classify animals based on features like "type of movement" (walks, flies, swims), "skin type" (fur, feathers, scales), and "habitat" (land, water, air).
After one-hot encoding these categorical features, each animal is represented as a vector of binary values like [1, 0, 0, 1, 0, 0, 1, 0, 0], which corresponds to "walks", "fur", and "land."
After encoding, the dataset becomes numeric. The SVM can now look for the best hyperplane
that separates animals based on the encoded features.
If the data isn’t linearly separable (i.e., can’t be classified with a straight hyperplane), the kernel
trick can be applied. A specialized kernel (like Hamming or String kernel) can help measure
similarity in a way that makes classification more accurate.
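An end-to-end sketch of this example with scikit-learn, using a tiny made-up animal table:

```python
import pandas as pd
from sklearn.svm import SVC

# Tiny made-up animal table with categorical features
animals = pd.DataFrame({
    "movement": ["walks", "flies", "swims", "walks"],
    "skin":     ["fur", "feathers", "scales", "fur"],
    "habitat":  ["land", "air", "water", "land"],
})
labels = ["mammal", "bird", "fish", "mammal"]

X = pd.get_dummies(animals, dtype=int)      # one-hot encode the categories
clf = SVC(kernel="linear").fit(X, labels)   # find the separating hyperplane
print(clf.predict(X[:1]))                   # expected: ['mammal'] on this toy data
```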
Conclusion
In SVMs, handling non-metric or non-numeric data requires adapting how the data is processed or how the similarity between data points is calculated. Key approaches include encoding categorical data numerically (one-hot or label encoding) and using specialized kernels (Hamming-style, string, or graph kernels) that measure similarity directly on the original data.
The idea is that even though SVM traditionally works with numeric data, there are ways to
extend it to handle more complex, non-metric types of data by transforming the problem into
something that SVM can still handle effectively.
A Decision Tree is a simple and intuitive model used for classification and regression tasks.
It makes decisions by splitting data into smaller subsets based on the value of the input features.
Decision trees are often visualized as trees where each node represents a decision, and each
branch represents an outcome of that decision.
The CART (Classification and Regression Trees) algorithm is one of the most widely used
methods for constructing decision trees. As its name suggests, it can handle both classification
(when predicting categories like "spam" or "not spam") and regression (when predicting
continuous values like temperature or price).
1. Root Node: This is the first node of the tree where the decision-making process starts.
It represents the entire dataset.
2. Internal Nodes: These are the decision points in the tree, where a specific feature
(input) is selected to split the data into two or more groups based on a condition.
3. Branches: These represent the possible outcomes of a decision. Each branch
corresponds to a specific subset of the data.
4. Leaf Nodes: These are the terminal nodes (endpoints) of the tree where no further
decisions are made. In a classification tree, each leaf node represents a predicted class;
in a regression tree, it represents a predicted value.
The CART algorithm splits the data into smaller and smaller subsets, step by step, in a greedy
fashion, which means it tries to make the best split at each step, aiming to improve the accuracy
of the tree. The splits are based on certain criteria:
• For classification (CART for classification): CART uses a measure called Gini Impurity to choose the best splits. The Gini Impurity measures how "pure" a node is, i.e., how much the data in the node belong to one class. A pure node contains data from only one class (a small sketch of this split criterion follows after these bullets).
• For regression (CART for regression): CART minimizes the Mean Squared Error
(MSE) at each step to make better predictions. This helps the model predict continuous
values like house prices or temperatures.
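A minimal sketch of the Gini computation and the search for the best split threshold on a single numeric feature (real CART implementations also handle categorical splits and search over all features):

```python
import numpy as np

def gini(labels):
    """Gini Impurity of a node: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(x, y):
    """Threshold on one numeric feature x that minimizes the weighted
    Gini Impurity of the two child nodes."""
    x, y = np.asarray(x), np.asarray(y)
    best_t, best_score = None, float("inf")
    for t in np.unique(x):
        left, right = y[x <= t], y[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Example: ages and whether the person bought a computer (made-up data)
print(best_split([22, 25, 47, 52, 46], ["yes", "yes", "no", "no", "no"]))
# best threshold 25 gives two pure children (weighted impurity 0.0)
```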
Let's say you are trying to build a decision tree to classify whether a person will buy a computer based on the following features:
1. Age (Young, Middle-aged, Senior)
2. Student (Yes or No)
3. Income (High or Low)
The CART algorithm starts by analyzing the data and selecting the feature that best separates
the people who buy computers from those who don’t.
For example, let's say that Age is the best feature to split on. The root node might ask: "Is the person young?"
Now, each branch becomes a new node, and the algorithm looks for another feature to split on.
• For the Yes (Young) branch, the best feature might be whether they are a student.
• For the No branch (Middle-aged and Senior), the algorithm may choose Income as the
best feature to split on.
This process continues until the algorithm can’t make any more useful splits, and the tree
reaches its leaf nodes.
• One leaf node might predict: If the person is young and a student, they will buy a
computer.
• Another leaf node might predict: If the person is middle-aged and has high income,
they will buy a computer.
At each split, the CART algorithm tries to minimize the Gini Impurity for classification tasks.
It keeps choosing the best feature and threshold for splitting the data.
Now, let’s use a regression example where we want to predict house prices based on:
1. Number of Bedrooms
2. Square Footage
3. Distance from City Center
The CART algorithm examines the data and decides that Square Footage is the most important feature for predicting house prices. The root node might ask: "Is the Square Footage greater than 2000?"
For the branch where Square Footage > 2000, the algorithm might decide that Number of Bedrooms is the next best feature to split on. The tree might then ask: "Is the Number of Bedrooms greater than 3?"
The tree continues to split the data until it reaches leaf nodes, where each leaf represents a
predicted house price.
• One leaf node might predict that if Square Footage > 2000 and Number of Bedrooms
> 3, the price is $500,000.
• Another leaf node might predict that if Square Footage ≤ 2000 and Distance from
City Center > 10 miles, the price is $250,000.
At each step, the CART algorithm tries to minimize the Mean Squared Error (MSE) for
regression tasks, ensuring that the splits lead to more accurate predictions.
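The regression version swaps Gini Impurity for MSE; a sketch of the corresponding split search (with made-up square-footage and price numbers):

```python
import numpy as np

def mse(values):
    """Mean squared error when the node predicts its own mean."""
    return float(np.mean((values - values.mean()) ** 2)) if len(values) else 0.0

def best_regression_split(x, y):
    """Threshold on a numeric feature that minimizes the weighted MSE
    of the two child nodes (regression analogue of the Gini search)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    best_t, best_score = None, float("inf")
    for t in np.unique(x):
        left, right = y[x <= t], y[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        score = (len(left) * mse(left) + len(right) * mse(right)) / len(y)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Example: square footage vs. price (made-up numbers)
sqft  = [1200, 1500, 1800, 2200, 2600]
price = [200_000, 230_000, 260_000, 480_000, 520_000]
print(best_regression_split(sqft, price))   # splits at 1800 sq ft on this toy data
```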
• Stopping Criteria: The tree-building process will stop when the data is perfectly
separated, or when a minimum number of data points are left in each node (to avoid
overfitting). Sometimes, growing a tree until every data point is classified perfectly can
lead to overly complex trees, which don’t generalize well to new data.
• Pruning: To prevent overfitting (where the tree is too complex and captures noise in
the training data), we can "prune" the tree. This means removing branches that add
complexity but don’t improve accuracy much. This leads to a simpler, more
generalizable model.
1. Easy to Interpret: Decision trees are very intuitive and easy to understand, even by
non-experts. You can visualize the entire decision-making process as a flowchart.
2. Handles Both Types of Data: CART can handle both classification and regression
tasks.
3. No Need for Feature Scaling: Decision trees don’t require features to be scaled or
normalized, as the splits are based on the actual values of the features.
4. Non-linear Relationships: Decision trees can capture non-linear relationships between
features and target variables, as they repeatedly split the data into smaller chunks.
1. Overfitting: Decision trees can easily overfit the training data, especially if they grow
too deep. This can make them sensitive to small changes in the data.
2. Instability: Small changes in the data can lead to very different trees being generated.
3. Less Accurate Compared to Other Models: On their own, decision trees may not be
as accurate as more complex models like random forests or boosting techniques.
Example Summary
• Task: Predict if someone will buy a house based on income and credit score.
• CART Decision Tree:
1. Root Node: Is income > $50,000?
2. Split: Yes (high-income) or No (low-income).
3. Further Split: For high-income people, check credit score > 700.
4. Predictions: Leaf nodes might say: Yes, they will buy a house (if income >
$50,000 and credit score > 700) or No, they won’t buy a house (if income ≤
$50,000).
The tree follows a series of decisions based on the features to arrive at a prediction.
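A quick way to reproduce this kind of tree with scikit-learn, on hypothetical income and credit-score data invented for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical data: income (in $1000s), credit score, and whether the person buys
X = np.array([[60, 720], [80, 650], [40, 700], [90, 750], [30, 600], [70, 710]])
y = np.array([1, 0, 0, 1, 0, 1])    # 1 = buys a house, 0 = does not

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["income", "credit_score"]))  # shows the learned splits
```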
Conclusion