Unit 2
The core idea behind KNN is that similar data points tend to exist in close proximity within a
feature space. When a new, unlabeled data point needs to be classified, the algorithm
follows these steps:
1. Choose the value of K: This is a crucial hyperparameter that determines how many
nearest neighbors will be considered. 'K' is typically a small, odd integer to avoid ties
in classification.
2. Calculate Distances: For the new data point (query point), the algorithm calculates its distance to every other data point in the training dataset. Common distance metrics include Euclidean distance (the most common choice), Manhattan distance, and Minkowski distance.
3. Identify K Nearest Neighbors: After calculating all distances, the algorithm selects
the 'K' data points from the training set that are closest to the query point.
4. Majority Vote (for Classification): For classification tasks, the algorithm looks at the
class labels of these 'K' nearest neighbors. The class that appears most frequently
among these neighbors is assigned as the predicted class for the new data point.
5. Predict Value (for Regression): If KNN is used for regression, it typically takes the
average or a weighted average of the values of the 'K' nearest neighbors to predict
the value for the new data point.
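The five steps above correspond closely to the standard scikit-learn workflow. Below is a minimal sketch; the Iris dataset is used merely as a stand-in for any labeled data, and the parameter values are illustrative, not prescriptive.

```python
# Minimal KNN classification sketch with scikit-learn (illustrative only).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)                      # any labeled dataset works here
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

scaler = StandardScaler().fit(X_train)                 # distance-based: feature scaling matters
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=3)              # step 1: choose K
knn.fit(X_train, y_train)                              # "lazy" learning: the data is simply stored
print(knn.predict(X_test[:5]))                         # steps 2-4: distances, neighbors, majority vote
print("accuracy:", knn.score(X_test, y_test))
```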
Key Characteristics:
• Instance-based/Lazy Learning: KNN stores all training data and defers computation until prediction time. No explicit model is built during training.
KNN is widely used in various domains due to its simplicity and effectiveness:
• Finance: Fraud detection (e.g., identifying suspicious patterns in credit card usage),
credit risk assessment, stock market forecasting.
• Image Recognition and Computer Vision: Classifying images (e.g., handwritten digit
recognition, object recognition).
Advantages of KNN:
• Handles Multi-Class Problems Naturally: Can easily be extended to classify data into
more than two classes.
• Effective for Small Datasets: Can perform well when the amount of training data is
limited.
Disadvantages of KNN:
• Requires Optimal K Value: The choice of 'K' is critical and can significantly impact
performance. A small 'K' can lead to overfitting (sensitive to noise/outliers), while a
large 'K' can lead to underfitting (oversimplifying decision boundaries). Cross-
validation is often used to find the best 'K'.
Despite its disadvantages, KNN remains a valuable algorithm, especially as a baseline model
or for problems with well-defined clusters and lower dimensionality. Preprocessing
techniques like feature scaling and dimensionality reduction are crucial for optimizing its
performance.
Let's walk through a simple, illustrative example of K-Nearest Neighbors (KNN) classification.
Imagine we have a dataset of fruits, with two features: Sweetness and Crunchiness. We
want to classify a new, unknown fruit as either an Apple or an Orange.
Fruit     Sweetness   Crunchiness
Apple         7            8
Apple         6            7
Apple         8            7
Orange        2            3
Orange        3            2
Orange        1            4
Apple         7            9
Orange        3            1
New fruit to classify:
• Sweetness: 5
• Crunchiness: 6
We want to classify this new fruit. Let's choose K = 3 (meaning we'll consider the 3 nearest
neighbors).
Steps:
1. Calculate Distances
We'll use Euclidean Distance as our metric. The formula for Euclidean distance between two points (x₁, y₁) and (x₂, y₂) is:
d = √((x₂ − x₁)² + (y₂ − y₁)²)
Let's calculate the distance from our new fruit (Sweetness=5, Crunchiness=6) to each fruit in
our training data:
• Apple (7, 8): √((7−5)² + (8−6)²) = √(4 + 4) = √8 ≈ 2.83
• Apple (6, 7): √((6−5)² + (7−6)²) = √(1 + 1) = √2 ≈ 1.41
• Apple (8, 7): √((8−5)² + (7−6)²) = √(9 + 1) = √10 ≈ 3.16
• Orange (2, 3): √((2−5)² + (3−6)²) = √(9 + 9) = √18 ≈ 4.24
• Orange (3, 2): √((3−5)² + (2−6)²) = √(4 + 16) = √20 ≈ 4.47
• Orange (1, 4): √((1−5)² + (4−6)²) = √(16 + 4) = √20 ≈ 4.47
• Apple (7, 9): √((7−5)² + (9−6)²) = √(4 + 9) = √13 ≈ 3.61
• Orange (3, 1): √((3−5)² + (1−6)²) = √(4 + 25) = √29 ≈ 5.39
2. Identify the K = 3 Nearest Neighbors
The three smallest distances are 1.41 (Apple), 2.83 (Apple), and 3.16 (Apple).
3. Majority Vote
All three nearest neighbors are Apples, so the new fruit is classified as an Apple.
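These calculations and the K = 3 vote can be checked with a few lines of Python. This is a minimal sketch using only NumPy and the toy fruit table above; the variable names are illustrative.

```python
# Verifying the K=3 nearest-neighbor vote for the fruit example (NumPy only).
import numpy as np
from collections import Counter

train = np.array([[7, 8], [6, 7], [8, 7], [2, 3], [3, 2], [1, 4], [7, 9], [3, 1]])
labels = ["Apple", "Apple", "Apple", "Orange", "Orange", "Orange", "Apple", "Orange"]
query = np.array([5, 6])                            # new fruit: Sweetness=5, Crunchiness=6

distances = np.linalg.norm(train - query, axis=1)   # Euclidean distance to each training fruit
nearest = np.argsort(distances)[:3]                 # indices of the 3 closest fruits
vote = Counter(labels[i] for i in nearest).most_common(1)[0][0]

print(np.round(distances, 2))   # [2.83 1.41 3.16 4.24 4.47 4.47 3.61 5.39]
print(vote)                     # Apple
```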
Imagine plotting these points on a 2D graph with Sweetness on the X-axis and Crunchiness
on the Y-axis.
• You'd see the 'Apple' points generally clustered in the upper-right (sweeter,
crunchier).
• You'd see the 'Orange' points generally clustered in the lower-left (less sweet, less
crunchy).
• Our new fruit (5, 6) would fall into the region dominated by the 'Apple' points.
Drawing a circle with a radius that encompasses 3 points around (5,6) would likely
include only 'Apple' data points.
This simple example illustrates the core mechanics of KNN classification: measuring
proximity and taking a majority vote among the closest neighbors.
Decision Trees
Decision Trees are a popular and intuitive supervised machine learning algorithm used for
both classification and regression tasks. They work by creating a model that predicts the
value of a target variable by learning simple decision rules inferred from the data features.
Imagine a flowchart-like structure. A decision tree classifies a data point by asking a series of
questions about its features, much like you'd follow a series of "if-then-else" statements.
1. Start at the Root Node: The entire dataset is represented at the top of the tree, called the "root node."
2. Find the Best Split:
o The algorithm examines all features and their possible values to find the "best" way to divide the data into two or more subsets.
o "Best" typically means creating subsets that are as "pure" as possible in terms of the target class. For classification, purity means that most data points in a subset belong to the same class.
o The feature and split point that maximize information gain or minimize Gini impurity are chosen to create the first "decision node."
3. Branching: From the decision node, branches extend to "child nodes," each
representing a subset of the data based on the chosen split.
4. Recursion: Steps 2 and 3 are repeated recursively for each new child node. This means the algorithm continues to split the data into smaller, purer subsets until certain stopping criteria are met.
5. Stopping Criteria: Splitting stops at a node when, for example:
o All data points in the node belong to the same class (a "pure" node).
o A maximum tree depth or a minimum number of samples per node is reached.
o No further split improves purity.
6. Leaf Nodes: The final nodes in the tree, which are not split further, are called "leaf nodes" (or "terminal nodes"). Each leaf node represents a class label (the predicted outcome for classification). The class assigned to a leaf node is typically the majority class of the data points that end up in that leaf.
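As a concrete counterpart to these steps, here is a minimal sketch using scikit-learn's DecisionTreeClassifier; the dataset and hyperparameter values are placeholders chosen only for illustration.

```python
# Minimal decision-tree classification sketch with scikit-learn (illustrative only).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# criterion selects the purity measure ("gini" or "entropy");
# max_depth limits growth, which helps control overfitting.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X, y)

print(export_text(tree))        # the learned if-then structure, from root to leaves
print(tree.predict(X[:3]))      # classify by walking each sample down the tree
```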
Advantages of Decision Trees:
• Easy to Understand and Interpret: Their visual, flowchart-like structure makes them highly interpretable and explainable.
• Minimal Data Preparation: Less data cleaning and preprocessing (like feature scaling)
are typically required compared to some other algorithms.
• Handles Both Numerical and Categorical Data: Can directly process different data
types.
• Can Handle Missing Values: Some implementations can naturally handle missing
values.
• Useful for Feature Selection: The tree structure implicitly highlights the most
important features that contribute to the classification decisions.
Disadvantages of Decision Trees:
• Prone to Overfitting: If the tree is allowed to grow too deep, it can capture noise in the training data rather than the underlying patterns, leading to poor generalization on unseen data. This is a significant drawback.
• Instability: Small changes in the training data can lead to a completely different tree
structure.
• Bias Towards Dominant Classes: In imbalanced datasets, the tree might be biased
towards the majority class.
• Greedy Approach: The algorithm makes locally optimal decisions at each split
(choosing the best split at that moment) which doesn't guarantee a globally optimal
tree structure.
• Can Create Complex Trees: For very complex datasets, the tree can become very
large and difficult to visualize or interpret.
• Computationally Expensive (for Optimal Tree): Finding the absolute optimal decision
tree is an NP-hard problem, so heuristic algorithms are used.
Decision trees are widely used in various fields due to their interpretability and
effectiveness:
• Finance: Credit risk assessment (predicting loan defaults), fraud detection, stock
market analysis.
To mitigate the overfitting issue, techniques like pruning (removing branches that provide
little predictive power) and ensemble methods like Random Forests (which build multiple
decision trees and combine their predictions) and Gradient Boosting are commonly
employed. These ensemble methods leverage the strengths of individual decision trees
while addressing their weaknesses.
Let's illustrate how a Decision Tree would classify an instance with a simple, common
example: predicting whether a person will play tennis based on weather conditions.
We have historical data about playing tennis, with features like Outlook, Temperature,
Humidity, and Wind, and the target variable "Play Tennis" (Yes/No).
Total Samples: 14
• Yes: 9
• No: 5
The algorithm (e.g., ID3, C4.5, CART) would analyze this data to find the best feature to split
on at each step, aiming to create the purest possible subsets (nodes where most outcomes
are either "Yes" or "No"). We'll use Information Gain (or Gini Impurity) to guide our choices.
The algorithm calculates the information gain for splitting on each feature: Outlook,
Temperature, Humidity, and Wind.
Let's say, after calculations (which are more involved but follow the principles of
entropy/Gini), Outlook is found to provide the highest information gain. This means splitting
by Outlook creates the most homogeneous subsets.
• Outlook = Sunny: (5 samples: 2 Yes, 3 No) - Impure; needs further splitting.
• Outlook = Overcast: (4 samples: 4 Yes, 0 No) - This is a pure node! All "Overcast" days resulted in "Yes". This branch terminates here.
• Outlook = Rain: (5 samples: 3 Yes, 2 No) - Impure; needs further splitting.
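The information gain behind this choice can be reproduced from the class counts given above (9 Yes / 5 No overall; Sunny 2/3, Overcast 4/0, Rain 3/2). A minimal sketch:

```python
# Entropy and information gain for splitting on Outlook (counts taken from the text above).
from math import log2

def entropy(pos, neg):
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            result -= p * log2(p)
    return result

parent = entropy(9, 5)                               # entropy of the full 14-sample dataset
branches = {"Sunny": (2, 3), "Overcast": (4, 0), "Rain": (3, 2)}

weighted = sum((p + n) / 14 * entropy(p, n) for p, n in branches.values())
gain = parent - weighted
print(round(parent, 3), round(gain, 3))              # ≈ 0.940 and ≈ 0.247
```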
Now, the algorithm focuses on the 'Sunny' subset (2 Yes, 3 No). It recalculates information
gain for Temperature, Humidity, and Wind only for these 5 samples.
Let's assume Humidity provides the best split for the 'Sunny' branch.
• Sunny AND Humidity = High: (3 samples: 0 Yes, 3 No) - Pure Node! All "No".
• Sunny AND Humidity = Normal: (2 samples: 2 Yes, 0 No) - Pure Node! All "Yes".
Similarly, the algorithm focuses on the 'Rain' subset (3 Yes, 2 No). It recalculates information
gain for Temperature, Humidity, and Wind for these 5 samples.
Let's assume Wind provides the best split for the 'Rain' branch.
• Rain AND Wind = Weak: (3 samples: 3 Yes, 0 No) - Pure Node! All "Yes".
• Rain AND Wind = Strong: (2 samples: 0 Yes, 2 No) - Pure Node! All "No".
Outlook
├── Sunny    → Humidity
│                ├── High   → No
│                └── Normal → Yes
├── Overcast → Yes
└── Rain     → Wind
                 ├── Weak   → Yes
                 └── Strong → No
New Day: Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong
1. Start at Root (Outlook): The Outlook is 'Sunny'. Follow the 'Sunny' branch.
2. Next Node (Humidity): The Humidity is 'High'. Follow the 'High' branch under 'Sunny'.
3. Leaf Node: This branch ends in a leaf labeled 'No'.
Prediction: The Decision Tree predicts that on this new day, the person will NOT Play Tennis.
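The learned tree translates directly into nested if-then-else rules. A minimal sketch, hard-coding the tree above and classifying the new day:

```python
# The Play Tennis decision tree above, expressed as plain if/else rules.
def play_tennis(outlook, humidity, wind):
    if outlook == "Overcast":
        return "Yes"
    if outlook == "Sunny":
        return "Yes" if humidity == "Normal" else "No"
    if outlook == "Rain":
        return "Yes" if wind == "Weak" else "No"
    return None  # unseen Outlook value

# New day: Outlook=Sunny, Humidity=High, Wind=Strong (Temperature is not used by the tree)
print(play_tennis("Sunny", "High", "Strong"))   # -> "No"
```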
• Interpretability: You can easily see the rules that led to the prediction. "If Outlook is
Sunny AND Humidity is High, then Don't Play Tennis."
• Feature Importance: The tree highlights that 'Outlook' is the most important factor
overall, followed by 'Humidity' for Sunny days and 'Wind' for Rainy days.
• Overfitting Risk: If our training data was very small or noisy, and the tree was
allowed to grow extremely deep, it might create overly specific rules that don't
generalize well. For example, if we had only one 'Sunny, High Humidity' day and it
happened to be a 'Yes', the tree might mistakenly predict 'Yes' for all such days.
Techniques like pruning or setting a maximum depth would prevent this.
This example demonstrates the step-by-step process of how a decision tree makes a
classification decision by traversing its branches based on the input features.
Rule induction:
Rule induction is a machine learning technique that focuses on extracting explicit, human-
readable "if-then" rules from data for classification. Unlike some "black box" models, rule
induction aims to provide models that are highly interpretable, making their decision-
making process transparent.
The core idea is to find a set of rules that accurately describe the relationships between
features and class labels in a dataset. These rules typically take the form:
IF (condition 1 AND condition 2 AND ... AND condition n) THEN class = C
For example:
IF Outlook = Sunny AND Humidity = Normal THEN Play Tennis = Yes
There are several approaches and algorithms for rule induction, but they generally follow a
"separate-and-conquer" or "covering" strategy:
1. Learn One Rule:
o The algorithm starts by trying to find a rule that covers a significant number of instances from one specific class (often the minority class first, to ensure it's not overlooked).
o This process often involves heuristics or search strategies (like a greedy search
or beam search) to find the best conditions. Measures like Gini impurity,
information gain, or other specific rule evaluation metrics (e.g., accuracy,
coverage, confidence) are used to assess the quality of a potential rule.
2. Remove Covered Instances:
o Once a good rule is found, all the training instances that are correctly classified by this rule are "covered" and typically removed from the training set (or their weight is reduced). This is the "separate" part of the "separate-and-conquer" strategy.
3. Iteration:
o Steps 1 and 2 are repeated until all (or most) instances in the training set are
covered by at least one rule, or until no more good rules can be found.
o The result is a set of "if-then" rules. These rule sets can be either ordered (a "decision list", where rules are checked in sequence and the first matching rule fires) or unordered (where an instance may match several rules, so a conflict-resolution strategy such as voting or rule confidence is needed).
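The separate-and-conquer loop can be sketched in a few dozen lines. This is an illustrative simplification (greedy growth of one rule at a time using a purity-then-coverage heuristic, no pruning), not a specific published algorithm; all function names are assumptions, and different heuristics will produce different rule sets.

```python
# Simplified sequential-covering (separate-and-conquer) sketch. A rule is a dict
# {attribute: required_value}; data is a list of (instance_dict, label) pairs.
def covers(rule, instance):
    return all(instance[a] == v for a, v in rule.items())

def learn_one_rule(data, target, attributes):
    """Greedily add conditions until the rule covers only 'target' instances."""
    rule, covered = {}, list(data)
    while any(lab != target for _, lab in covered):
        best = None                                  # (purity, coverage, attribute, value)
        for attr in (a for a in attributes if a not in rule):
            for value in {inst[attr] for inst, _ in covered}:
                subset = [(i, lab) for i, lab in covered if i[attr] == value]
                pos = sum(1 for _, lab in subset if lab == target)
                candidate = (pos / len(subset), pos, attr, value)
                if best is None or candidate > best:
                    best = candidate
        if best is None:                             # no attribute left to refine the rule
            break
        _, _, attr, value = best
        rule[attr] = value
        covered = [(i, lab) for i, lab in covered if i[attr] == value]
    return rule, covered

def sequential_covering(data, target, attributes):
    """Learn rules for 'target'; everything a learned rule covers is removed (simplification)."""
    rules, remaining = [], list(data)
    while any(lab == target for _, lab in remaining):
        rule, _ = learn_one_rule(remaining, target, attributes)
        rules.append(rule)
        remaining = [(i, lab) for i, lab in remaining if not covers(rule, i)]
    return rules
```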
Decision trees can be seen as a special case of rule induction, or rather, rule sets can often
be extracted from decision trees. Each path from the root node to a leaf node in a decision
tree directly translates into an "if-then" rule.
• Structure: Decision trees are hierarchical and sequential (one path at a time). Rule
induction algorithms can generate a set of independent rules that might not strictly
follow a tree structure.
• Completeness: A decision tree always provides a classification for every input (unless
a path is incomplete, which is rare for a fully grown tree). Rule sets might not be
exhaustive, meaning some instances might not be covered by any rule, or they might
be covered by multiple conflicting rules. This necessitates specific handling (e.g., a
default rule, voting).
• Overlapping Rules: In rule induction, rules can often overlap, meaning an instance
might satisfy the conditions of multiple rules. Decision tree paths, by definition, are
mutually exclusive.
• Optimality: Decision tree algorithms typically use a greedy, top-down approach. Rule
induction algorithms, especially "separate-and-conquer" ones, might build rules
more incrementally and can sometimes find simpler, more robust rules by focusing
on covering specific subsets of data.
Advantages of Rule Induction:
• High Interpretability and Explainability: The "if-then" format makes the model's logic very transparent and easy for humans to understand, which is crucial in domains like healthcare, finance, or law where explainability is paramount.
• Human-Readable: The rules are often directly actionable and can even be used to
build expert systems or guide human decision-making.
• Handles Mixed Data Types: Can naturally handle both categorical and numerical
features (by discretizing numerical features into ranges).
• Good for Knowledge Discovery: Beyond prediction, rule induction can reveal
interesting patterns and relationships hidden in the data.
Disadvantages of Rule Induction:
• Complexity with Many Rules: For complex datasets, the number of rules can become very large, making the rule set difficult to manage and interpret, defeating the purpose of interpretability.
• Overfitting: Like decision trees, if not properly controlled (e.g., through pruning or
minimum coverage criteria), rules can become too specific to the training data and
perform poorly on new data.
• Conflict Resolution: For unordered rule sets, strategies for resolving conflicts when
multiple rules apply to an instance are needed, which can add complexity.
Common Rule Induction Algorithms:
• PART: A hybrid algorithm that generates a "partial" decision tree, extracts a rule,
removes covered instances, and repeats.
• Apriori / Eclat (for Association Rule Mining): While primarily used for finding
associations (e.g., "customers who buy bread also buy milk"), they can be adapted
for classification tasks by finding rules where the consequent is a class label.
Applications: Rule induction is particularly useful where:
• Transparency and explainability are critical: Medical diagnosis, credit scoring, legal
compliance.
• Expert systems are being built: Automating decision-making based on learned rules.
In essence, rule induction provides a "white box" approach to classification, offering clear
insights into how predictions are made, which is a significant advantage in many real-world
scenarios.
Let's use the same "Play Tennis" dataset we used for Decision Trees to illustrate Rule
Induction.
Total Samples: 14
• Yes: 9
• No: 5
Let's assume our rule induction algorithm prioritizes finding rules for the "Yes" class first, then the "No" class, and it aims for rules that are as accurate and general as possible.
Phase 1: Inducing Rules for "Play Tennis = Yes"
1. First iteration:
o It might find: IF Outlook = Overcast THEN Play Tennis = Yes.
o Let's check how many samples this covers: Days 3, 7, 12, 13. All 4 of these days resulted in "Yes".
o Rule 1: IF Outlook = Overcast THEN Play Tennis = Yes
o Separate: Remove these 4 covered instances.
2. Second iteration:
o It might find: IF Outlook = Sunny AND Humidity = Normal THEN Play Tennis = Yes.
o Check samples: Day 9 (Sunny, Normal, Yes), Day 11 (Sunny, Normal, Yes).
o Rule 2: IF Outlook = Sunny AND Humidity = Normal THEN Play Tennis = Yes
o Separate: Remove these 2 covered instances.
3. Third iteration:
o It might find: IF Outlook = Rain AND Wind = Weak THEN Play Tennis = Yes.
o Check samples: Day 4 (Rain, Weak, Yes), Day 5 (Rain, Weak, Yes), Day 10 (Rain, Weak, Yes).
o Rule 3: IF Outlook = Rain AND Wind = Weak THEN Play Tennis = Yes
o Separate: Remove these 3 covered instances.
o All "Yes" instances have been covered by rules. We can stop generating "Yes" rules.
Phase 2: Inducing Rules for "Play Tennis = No" (or a Default Rule)
Now, the algorithm focuses on the remaining instances, which are all "No" instances.
1. Fourth iteration:
o The algorithm might find a rule that covers a significant portion of these. For example: IF Outlook = Sunny AND Humidity = High THEN Play Tennis = No.
o Check samples: Day 1 (Sunny, High, No), Day 2 (Sunny, High, No), Day 8 (Sunny, High, No).
o Rule 4: IF Outlook = Sunny AND Humidity = High THEN Play Tennis = No
o Separate: Remove these 3 covered instances.
2. Fifth iteration:
o It might find: IF Outlook = Rain AND Wind = Strong THEN Play Tennis = No.
o Check samples: Day 6 (Rain, Strong, No), Day 14 (Rain, Strong, No).
o Rule 5: IF Outlook = Rain AND Wind = Strong THEN Play Tennis = No
o Separate: Remove these 2 instances. All "No" instances are now covered.
Since we built them sequentially and covered instances as we went, this often forms an ordered list. When classifying a new instance, we go through the rules in order and apply the first one that matches.
1. IF Outlook = Overcast THEN Play Tennis = Yes
2. ELSE IF Outlook = Sunny AND Humidity = Normal THEN Play Tennis = Yes
3. ELSE IF Outlook = Rain AND Wind = Weak THEN Play Tennis = Yes
4. ELSE IF Outlook = Sunny AND Humidity = High THEN Play Tennis = No
5. ELSE IF Outlook = Rain AND Wind = Strong THEN Play Tennis = No
6. ELSE Default Class = No (This acts as a catch-all if no other rule matches, though in this small dataset, all cases are covered.)
Let's classify the new day: Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong
1. Rule 1 (Outlook = Overcast): False (Outlook is Sunny, not Overcast)
2. Rule 2 (Outlook = Sunny AND Humidity = Normal): False (Humidity is High, not Normal)
3. Rule 3 (Outlook = Rain AND Wind = Weak): False (Outlook is Sunny, not Rain)
4. Rule 4 (Outlook = Sunny AND Humidity = High): TRUE! (Outlook is Sunny, Humidity is High)
Since Rule 4 matches, we stop and assign the class from Rule 4.
Prediction: The Rule Induction model predicts that on this new day, the person will NOT Play
Tennis.
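The ordered rule list behaves exactly like a chain of if/elif statements. A minimal sketch encoding the five learned rules plus the default:

```python
# The ordered rule list (decision list) from the Play Tennis example.
def classify(outlook, humidity, wind):
    if outlook == "Overcast":
        return "Yes"                                   # Rule 1
    elif outlook == "Sunny" and humidity == "Normal":
        return "Yes"                                   # Rule 2
    elif outlook == "Rain" and wind == "Weak":
        return "Yes"                                   # Rule 3
    elif outlook == "Sunny" and humidity == "High":
        return "No"                                    # Rule 4
    elif outlook == "Rain" and wind == "Strong":
        return "No"                                    # Rule 5
    else:
        return "No"                                    # Rule 6: default class

print(classify("Sunny", "High", "Strong"))             # -> "No" (Rule 4 fires first)
```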
Key Observations:
• Interpretability: Each rule is a clear, logical statement. You can directly read and
understand why a prediction is made.
• Conciseness: For some datasets, rule sets can be more compact than a full decision
tree, especially if there are redundant paths in a tree.
• Flexibility: Rule induction algorithms can sometimes find rules that are not easily
represented as a single path in a decision tree, or they might identify shorter, more
direct rules for certain outcomes.
• Ordered vs. Unordered: This example produced an ordered list. If it were unordered,
an instance might match multiple rules, and a conflict resolution strategy (like voting
or rule confidence) would be needed.
This example highlights how rule induction systematically discovers explicit "if-then"
conditions that explain and predict class labels in a dataset.
Ensemble learners
The core idea behind ensemble learning is simple: combine the predictions of multiple
individual "base" models (often called "weak learners") to create a stronger, more
accurate "ensemble" model.
Single machine learning models often suffer from one or more of these issues:
1. High Bias (Underfitting): The model is too simplistic and cannot capture the
underlying patterns in the data, leading to consistent errors.
2. High Variance (Overfitting): The model is too complex and learns the noise in the
training data, performing poorly on unseen data.
3. Computational Problem: Finding the single best hypothesis in a large search space
can be computationally challenging.
For classification tasks, the predictions from individual base models are combined, typically
using a majority voting scheme. If a new data point is presented to the ensemble, each base
model makes a prediction, and the class that receives the most votes among the base
models is the final predicted class.
1. Bagging (Bootstrap Aggregating)
• Process:
1. Bootstrap Sampling: From the original training dataset (N samples), 'M' new
datasets are created by randomly sampling with replacement (bootstrapping).
Each new dataset typically has the same size as the original, but some original
samples will appear multiple times, while others might not appear at all.
2. Model Training: A separate base model (often a decision tree) is trained independently on each of the 'M' bootstrapped datasets.
3. Aggregation (Voting): For a new instance, each of the 'M' models makes a prediction. For classification, the final prediction is determined by majority voting among the 'M' base models.
• Key Characteristic: The base models are trained in parallel, independently of each
other.
• Famous Example: Random Forest
o Highly popular and effective due to its robustness, accuracy, and ability to handle high-dimensional data.
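A minimal bagging sketch with scikit-learn, using decision trees as the base models; the synthetic dataset and parameter values are placeholders.

```python
# Bagging decision trees, and the closely related Random Forest (illustrative only).
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# M bootstrapped datasets, one tree per dataset, trained independently, majority vote.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
bagging.fit(X, y)

# Random Forest = bagging of trees plus random feature selection at each split.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

print(bagging.predict(X[:5]), forest.predict(X[:5]))
```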
2. Boosting
• Process:
1. Sequential Training: Base models are trained one after another, with the first model trained on the original training data.
2. Error Correction: Each new base model is trained to give more weight to the instances that were misclassified (or had high errors) by the previous models.
3. Weighted Combination: The final prediction is a weighted sum (or weighted
majority vote) of the predictions from all base models. Models that
performed better on the training data might have higher weights.
• Key Characteristic: Base models are trained sequentially, with each learning from the
mistakes of its predecessors.
• Famous Examples: AdaBoost, Gradient Boosting Machines (GBM), XGBoost, and LightGBM.
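A minimal boosting sketch with scikit-learn's AdaBoost and gradient boosting implementations; the dataset and parameters are illustrative.

```python
# Boosting: models trained sequentially, each focusing on the errors of its predecessors.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

ada = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X, y)   # re-weights misclassified samples
gbm = GradientBoostingClassifier(random_state=0).fit(X, y)             # each tree fits the remaining errors

print(ada.score(X, y), gbm.score(X, y))   # training accuracy, shown for illustration only
```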
3. Stacking (Stacked Generalization)
• Principle: Stacking combines predictions from multiple diverse base models using a "meta-learner" (or "level-1 model").
• Process:
1. Base Model Training (Level-0 Models): Multiple different base models (e.g., Decision Tree, KNN, SVM, Logistic Regression) are trained on the entire training dataset.
2. Meta-Learner Training (Level-1 Model): The predictions of the base models (typically generated via cross-validation to avoid leakage) are used as input features to train a meta-learner, which produces the final prediction.
• Key Characteristic: It's a multi-level learning process where the output of base
models becomes the input for another model. This often leads to highly accurate
models.
• Diversity: Stacking benefits greatly from using diverse base models (e.g., a tree-
based model, a linear model, a kernel-based model) as they capture different aspects
of the data.
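A minimal stacking sketch with scikit-learn, combining two diverse level-0 models with a logistic-regression meta-learner; all model choices are illustrative.

```python
# Stacking: level-0 models feed their predictions to a level-1 meta-learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

stack = StackingClassifier(
    estimators=[("knn", KNeighborsClassifier(n_neighbors=5)),
                ("tree", DecisionTreeClassifier(max_depth=5, random_state=0))],
    final_estimator=LogisticRegression(),   # meta-learner trained on the base predictions
    cv=5,                                   # out-of-fold predictions reduce leakage
)
stack.fit(X, y)
print(stack.predict(X[:5]))
```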
4. Voting
• Principle: This is one of the simplest ensemble methods. It combines the predictions of multiple diverse models by simply taking a majority vote (for classification) or averaging (for regression).
• Process:
1. Train Diverse Models: Train several different models (e.g., Logistic Regression,
SVM, Decision Tree, Naive Bayes) on the same training data.
2. Combine Predictions: For a new instance, each model predicts a class label.
▪ Hard Voting: The class label that receives the most votes from the individual models is chosen as the final prediction.
▪ Soft Voting: The predicted class probabilities from the individual models are averaged, and the class with the highest average probability is chosen.
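A minimal voting sketch with scikit-learn showing both hard and soft voting; the models and dataset are placeholders.

```python
# Voting: combine diverse classifiers by majority vote or by averaged probabilities.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
models = [("lr", LogisticRegression(max_iter=1000)),
          ("nb", GaussianNB()),
          ("tree", DecisionTreeClassifier(max_depth=5, random_state=0))]

hard = VotingClassifier(models, voting="hard").fit(X, y)   # majority of predicted labels
soft = VotingClassifier(models, voting="soft").fit(X, y)   # average of predicted probabilities
print(hard.predict(X[:5]), soft.predict(X[:5]))
```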
Ensemble learning is a cornerstone of modern machine learning and is widely used in both
research and industry for its significant ability to boost model performance across a wide
range of classification tasks.
Linear Regression
Although linear regression is designed to predict continuous values, it can, in principle, be pressed into service for classification as follows:
1. Encode Classes Numerically: For binary classification, you would typically encode your two classes as numerical values, often 0 and 1.
o Example: Predict "Will Default" (1) or "Won't Default" (0) for a loan applicant.
2. Train a Linear Regression Model: You then train a standard linear regression model
using your features (X) to predict this numerical target (Y). The model will learn a
linear relationship:
Y = β₀ + β₁X₁ + β₂X₂ + ⋯ + βₙXₙ
3. Apply a Threshold: Since linear regression outputs continuous values (e.g., 0.2, 0.75, 1.3, -0.1), you need to set a threshold (commonly 0.5) to convert these continuous predictions into discrete class labels: predictions at or above the threshold are assigned class 1, and those below it class 0.
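A minimal sketch of this thresholding idea on a toy 0/1 target (purely illustrative); it also hints at the problem discussed next, since the raw predictions are not confined to the [0, 1] range.

```python
# Linear regression pressed into binary classification via a 0.5 threshold (illustrative).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (X[:, 0] > 0).astype(int)             # toy binary target encoded as 0/1

reg = LinearRegression().fit(X, y)
scores = reg.predict(X)                   # continuous; typically falls outside [0, 1]
labels = (scores >= 0.5).astype(int)      # the 0.5 cut-off turns scores into class labels

print(f"prediction range: {scores.min():.2f} to {scores.max():.2f}")
print("fraction classified correctly:", (labels == y).mean())
```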
While technically possible, using linear regression directly for classification has several
significant drawbacks:
1. Optimizes the Wrong Objective:
o Linear regression tries to fit a line (or hyperplane in higher dimensions) that minimizes the squared error between predicted and actual values. This line is optimized for predicting continuous magnitudes, not for separating distinct classes.
2. Poor Decision Boundaries:
o The decision boundary (the threshold point) derived from a linear regression is often not as effective or robust as boundaries learned by true classification algorithms.
3. Sensitivity to Outliers:
o Example: If you have mostly 0s and 1s, but a few 0s are at a very high feature
value, linear regression might try to fit a line that passes closer to those
outliers, potentially misclassifying other 1s as 0s.
4. Outputs Are Not Probabilities:
o Linear regression predictions can fall below 0 or above 1, so they cannot be interpreted as probabilities. For classification tasks, having probability estimates (e.g., "this email has a 95% chance of being spam") is often very valuable for understanding the model's confidence and for setting appropriate action thresholds.
For classification, especially binary classification, the go-to algorithm that builds upon linear
concepts is Logistic Regression.
• Probability Output: The sigmoid function squashes any real-valued input Z into a probability between 0 and 1, P(Y=1|X) = 1 / (1 + e^(−Z)). This naturally gives a meaningful probability estimate for the positive class.
In summary, while you could technically rig linear regression for classification, its
fundamental design for continuous output makes it an ill-suited choice. Logistic Regression
is the proper linear model for classification tasks, providing probability estimates and a
statistically sound framework.
Logistic Regression
Despite its name, Logistic Regression is a classification algorithm, not a regression algorithm. It's a linear model in the sense that it models the log-odds of the probability of the positive class as a linear combination of the input features.
Logistic Regression bridges the gap between linear regression's continuous output and
classification's discrete output by using a sigmoid (or logistic) function.
1. Linear Combination: First, a weighted sum of the input features is computed:
Z = β₀ + β₁X₁ + β₂X₂ + ⋯ + βₙXₙ
Where:
o β₀ is the intercept.
o β₁, ..., βₙ are the coefficients (weights) learned for the features X₁, ..., Xₙ.
2. Sigmoid Transformation: This Z value (which can range from −∞ to +∞) is then
passed through the sigmoid function (or logistic function). The sigmoid function
maps any real number into a value between 0 and 1, which can be interpreted as a
probability.
P(Y=1|X) = 1 / (1 + e^(−Z))
Where:
o P(Y=1∣X) is the probability that the output Y belongs to the positive class
(usually denoted as 1), given the input features X.
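The two steps above can be sketched directly and checked against scikit-learn's LogisticRegression; the synthetic dataset is a placeholder.

```python
# Logistic regression: linear combination Z, sigmoid squashing, 0.5 decision threshold.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))        # maps any real Z to a probability in (0, 1)

X, y = make_classification(n_samples=300, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)

# Manual computation of P(Y=1|X) from the learned coefficients (betas) and intercept.
Z = X @ model.coef_.ravel() + model.intercept_[0]
probs = sigmoid(Z)

print(np.allclose(probs, model.predict_proba(X)[:, 1]))     # True: identical probabilities
print(model.predict(X[:5]), (probs[:5] >= 0.5).astype(int)) # same labels via the 0.5 threshold
```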
Key Characteristics:
• Binary Classification (primarily): Most commonly used for problems with two
classes. Can be extended to multi-class classification using strategies like One-vs-Rest
(OvR) or Multinomial Logistic Regression.
• Probabilistic Output: Outputs a probability score, which can be very useful for
decision-making (e.g., "This customer has an 85% chance of churning").
• Linear Decision Boundary: In the feature space, Logistic Regression finds a linear
decision boundary (a line in 2D, a plane in 3D, or a hyperplane in higher dimensions)
that separates the classes.
• Interpretability: The coefficients (β values) indicate the strength and direction of the
relationship between each feature and the log-odds of the positive class. For
example, a positive βi means that an increase in Xi increases the probability of the
positive class.
• Cost Function (Loss Function): Logistic Regression typically uses log-loss (also known
as binary cross-entropy) as its cost function, which aims to maximize the likelihood of
the observed training data. This loss function heavily penalizes confident but
incorrect predictions.
Unlike Linear Regression, which uses Ordinary Least Squares (OLS), Logistic Regression cannot use OLS because its output is non-linear due to the sigmoid function. Instead, it uses iterative optimization algorithms such as Gradient Descent (and its variants) or Newton-type methods, which find the coefficients that maximize the likelihood of the training data (Maximum Likelihood Estimation).
Advantages:
• Good Baseline Model: Often serves as an excellent baseline against which more complex models can be compared.
Limitations:
• Not the Best for Highly Complex Data: For highly complex, non-linear datasets, more advanced algorithms (like SVMs with non-linear kernels, neural networks, or ensemble methods) often outperform Logistic Regression.
Applications:
• Finance: Credit scoring (predicting the likelihood of loan default), fraud detection.
In essence, Logistic Regression is a powerful and versatile tool for classification, especially
when interpretability and probabilistic outputs are important, and when the underlying
relationship between features and the log-odds of the outcome is reasonably linear.