
ML Unit 2

The document discusses various proximity measures and distance metrics used in machine learning and data analysis, including Euclidean, Manhattan, Minkowski, Cosine, and Jaccard distances. It also covers classification algorithms based on these distance measures, particularly focusing on K-Nearest Neighbor (KNN) methods for both classification and regression tasks. Additionally, it highlights the importance of measuring similarity and dissimilarity in binary patterns using methods like Hamming distance and Jaccard similarity.


UNIT-II: Nearest Neighbor-Based Models: Introduction to Proximity Measures, Distance Measures, Non-Metric

Similarity Functions, Proximity Between Binary Patterns, Different Classification Algorithms Based on the Distance
Measures , K-Nearest Neighbor Classifier, Radius Distance Nearest Neighbor Algorithm, KNN Regression, Performance
of Classifiers, Performance of Regression Algorithms.

Introduction to Proximity Measures:


Proximity measures are mathematical tools used to quantify the similarity or dissimilarity between
objects, typically in the context of data analysis, machine learning, and pattern recognition. These measures
are essential for clustering, classification, and recommendation systems, among other applications.
Types of Proximity Measures
Proximity measures are broadly categorized into similarity measures (which indicate how alike two objects
are) and dissimilarity measures (which indicate how different two objects are).
1. Distance Metrics (Dissimilarity Measures)
o Euclidean Distance: The most common distance metric, defined as the straight-line distance
between two points in a multidimensional space.
o Manhattan Distance (Taxicab Distance): Measures distance by summing the absolute
differences along each dimension.
o Minkowski Distance: A generalized form of both Euclidean and Manhattan distances.
o Cosine Dissimilarity: Measures the angle between two vectors, commonly used in text
analysis.
2. Similarity Measures
o Cosine Similarity: Measures the cosine of the angle between two vectors, often used in text
and recommendation systems.
o Jaccard Similarity: Used for comparing the similarity of two sets, defined as the size of the
intersection divided by the size of the union.
o Pearson Correlation: Measures the linear correlation between two variables.

Distance Measures
Each distance measure is suited for specific types of data patterns. Below, we discuss the types of data suitable for
each measure, along with calculations and applications.

1. Euclidean Distance
**Pattern Type:** Continuous numerical data, typically used in Cartesian space.
Formula: d(A, B) = √((x₂ - x₁)² + (y₂ - y₁)²)
**Example Calculation:**
For A(2,3) and B(6,7):
d(A, B) = √((6-2)² + (7-3)²) = √32 ≈ 5.66
✅ Clustering (e.g., K-Means)
✅ Image Processing
✅ Biometrics (e.g., face recognition)
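The Euclidean calculation above can be sketched in Python (a minimal illustration, not tied to any particular library):

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# For A(2,3) and B(6,7): sqrt((6-2)^2 + (7-3)^2) = sqrt(32)
print(round(euclidean((2, 3), (6, 7)), 2))  # 5.66
```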
2. Manhattan Distance
**Pattern Type:** Grid-based movement, where only horizontal/vertical moves are allowed.
Formula: d(A, B) = |x₂ - x₁| + |y₂ - y₁|
**Example Calculation:**
For A(2,3) and B(6,7):
d(A, B) = |6-2| + |7-3| = 8
✅ Pathfinding (e.g., A* Algorithm)
✅ Urban Planning
✅ Logistics
3. Minkowski Distance
**Pattern Type:** Generalized distance metric; parameterized by 'p'.
Formula: d(A, B) = (∑ |xᵢ₂ - xᵢ₁|^p)^(1/p)
**Example Calculation:**
For A(2,3) and B(6,7), with p=3:
d(A, B) = (4³ + 4³)^(1/3) = 128^(1/3) ≈ 5.04
✅ Machine Learning (e.g., K-NN)
✅ Dimensionality Reduction
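Since Minkowski distance generalizes both Manhattan (p=1) and Euclidean (p=2), one small function reproduces all three worked examples above. A quick Python sketch:

```python
def minkowski(a, b, p):
    """Minkowski distance; p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

A, B = (2, 3), (6, 7)
print(minkowski(A, B, 1))           # 8.0 (Manhattan)
print(round(minkowski(A, B, 2), 2)) # 5.66 (Euclidean)
print(round(minkowski(A, B, 3), 2)) # 5.04
```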
4. Cosine Distance
**Pattern Type:** High-dimensional text and vector data, measuring the angle between vectors.
Formula: d(A, B) = 1 - (∑ xᵢ yᵢ) / (√(∑ xᵢ²) * √(∑ yᵢ²))
**Example Calculation:**
For vectors A = (1,2,3) and B = (4,5,6):
cosθ ≈ 0.97
d(A, B) = 1 - 0.97 = 0.03
✅ NLP (e.g., text similarity)
✅ Recommendation Systems
✅ Document Clustering
5. Jaccard Distance
**Pattern Type:** Categorical and set data, measuring overlap between sets.
Formula: d(A, B) = 1 - |A ∩ B| / |A ∪ B|
**Example Calculation:**
For sets A={1,2,3,4} and B={3,4,5,6}:
d(A, B) = 1 - (2/6) ≈ 0.67
✅ Plagiarism Detection
✅ Bioinformatics
✅ Recommendation Systems
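The Jaccard calculation maps directly onto Python's set operations, as in this small sketch:

```python
def jaccard_distance(a, b):
    """1 minus |intersection| / |union| of two sets."""
    a, b = set(a), set(b)
    return 1 - len(a & b) / len(a | b)

# A = {1,2,3,4}, B = {3,4,5,6}: intersection {3,4}, union has 6 elements
print(round(jaccard_distance({1, 2, 3, 4}, {3, 4, 5, 6}), 2))  # 0.67
```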

Distance Measure     Best For                       Example Use Case
Euclidean Distance   Continuous numerical data      K-Means Clustering, Face Recognition
Manhattan Distance   Grid-based movement            Pathfinding, Supply Chain Optimization
Minkowski Distance   Mixed feature types            ML Models (K-NN, SVM)
Cosine Distance      High-dimensional vector data   Text Similarity, Document Clustering
Jaccard Distance     Categorical & set data         Plagiarism Detection, Bioinformatics

Metric Similarity (Distance) Functions


A metric is a measure that satisfies three key properties:
1. Positive Reflexivity: d(x, x) = 0
2. Symmetry: d(x, y) = d(y, x)
3. Triangle Inequality: d(x, y) ≤ d(x, z) + d(z, y)
These properties ensure that the metric behaves like a proper distance measure.
Metric Functions: Follow strict mathematical properties and can be used for Euclidean-based distance
computations.

Non-metric Similarity Functions


These similarity functions do not necessarily obey the triangular inequality or symmetry. They are
commonly used in scenarios like:
 Image processing
 String matching
 Noisy data analysis
Examples of Non-Metric Similarity Measures:
1. k-Median Distance:
o Defined as the k-th median of the absolute differences between corresponding elements of
two vectors.
2. Cosine Similarity:
o Given two vectors x and y, the similarity measure is: S(x, y) = (x · y) / (‖x‖ ‖y‖)
o The corresponding distance function: d(x, y) = 1 − S(x, y)
o While it is symmetric, it violates the triangle inequality.
Example Demonstration (Violating the Triangle Inequality)
 Take three vectors x, y, z in 2D space, where the angle between x and z is 45°, the angle
between z and y is 45°, and hence the angle between x and y is 90°.
 Then d(x, z) + d(z, y) = (1 − cos 45°) + (1 − cos 45°) ≈ 0.586, while d(x, y) = 1 − cos 90° = 1.
 Since 0.586 < 1, we have d(x, z) + d(z, y) < d(x, y), violating the triangle inequality.
Non-Metric Functions: Provide more flexibility, especially for similarity measurements in data science,
image processing, and text analysis, but do not always obey distance axioms.
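The triangle-inequality violation described above can be verified numerically; here is a minimal Python sketch using unit vectors at 0°, 45°, and 90°:

```python
import math

def cos_dist(u, v):
    """Cosine distance: 1 minus cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    return 1 - dot / (math.hypot(*u) * math.hypot(*v))

x = (1.0, 0.0)
z = (math.cos(math.pi / 4), math.sin(math.pi / 4))  # 45° from x
y = (0.0, 1.0)                                       # 90° from x, 45° from z

print(round(cos_dist(x, z) + cos_dist(z, y), 3))  # 0.586
print(round(cos_dist(x, y), 3))                   # 1.0
# 0.586 < 1.0, so the triangle inequality fails for cosine distance
```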

Proximity Between Binary Patterns


Binary patterns consist of data represented using only two values (typically 0 and 1). Measuring proximity
(similarity or dissimilarity) between such patterns is crucial in machine learning, pattern recognition, and
bioinformatics.

(a) Hamming Distance


• Measures the number of differing bit positions.
• Formula: d_H(x, y) = Σ | x_i - y_i |
• Example: x = 10110, y = 10011 → Hamming distance = 2 (differences at positions 3 and 5).
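As a quick sketch, the Hamming distance for the bit strings above can be computed in Python by counting differing positions:

```python
def hamming(x, y):
    """Number of positions where two equal-length bit strings differ."""
    return sum(a != b for a, b in zip(x, y))

# x = 10110, y = 10011 differ at positions 3 and 5
print(hamming("10110", "10011"))  # 2
```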

b) Jaccard Similarity

• Measures the proportion of shared 1s between two binary vectors.
• Formula: S(x, y) = M_11 / (M_11 + M_10 + M_01)

Where:

M_11= Count of 1s in both x and y

M_10= Count of 1 in x, 0 in y

M_01= Count of 0 in x, 1 in y

Example: x = 10110, y = 10011

M_11 = 2, M_10 = 1, M_01 = 1 → S(x, y) = 2/4 = 0.5
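The binary Jaccard calculation above can be sketched in Python by tallying the M counts directly:

```python
def jaccard_binary(x, y):
    """Jaccard similarity for two equal-length binary strings."""
    m11 = sum(a == "1" and b == "1" for a, b in zip(x, y))
    m10 = sum(a == "1" and b == "0" for a, b in zip(x, y))
    m01 = sum(a == "0" and b == "1" for a, b in zip(x, y))
    return m11 / (m11 + m10 + m01)

# x = 10110, y = 10011: M_11 = 2, M_10 = 1, M_01 = 1
print(jaccard_binary("10110", "10011"))  # 0.5
```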

c) Simple Matching Coefficient (SMC)


The Simple Matching Coefficient (SMC) is a similarity measure used to compare binary data. It is defined as:

SMC = (f11 + f00) / (f11 + f10 + f01 + f00)

Explanation:

 f11: Number of positions where both vectors have 1s.

 f00: Number of positions where both vectors have 0s.

 f10: Number of positions where the first vector has 1 and the second has 0.

 f01: Number of positions where the first vector has 0 and the second has 1.

Hamming and Euclidean distances measure dissimilarity.

Jaccard, Dice, and Cosine similarity measure closeness.
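Since matching positions are exactly f11 + f00, the SMC reduces to matches over total length; a minimal Python sketch, reusing the x and y strings from the Jaccard example:

```python
def smc(x, y):
    """Simple Matching Coefficient for two equal-length binary strings."""
    matches = sum(a == b for a, b in zip(x, y))  # f11 + f00
    return matches / len(x)                      # f11 + f10 + f01 + f00 = len(x)

# x = 10110, y = 10011 match at positions 1, 2, 4
print(smc("10110", "10011"))  # 0.6
```

Unlike Jaccard similarity, SMC counts shared 0s as agreement, so it is better suited to symmetric binary attributes.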


Different Classification Algorithms Based on the Distance Measures

Worked example (figure not reproduced here): the target point to be classified is (2.1, 0.7).

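A distance-based classifier assigns the target point (2.1, 0.7) the majority label among its nearest training points. As a sketch (the training data below is hypothetical, since the original example's dataset is not shown here):

```python
import math
from collections import Counter

def knn_classify(train, target, k=3):
    """train: list of ((x, y), label); predict by majority vote of k nearest."""
    nearest = sorted(train, key=lambda p: math.dist(p[0], target))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical labeled training points
train = [((1.0, 1.0), "A"), ((2.0, 1.0), "A"), ((2.5, 0.8), "A"),
         ((3.5, 0.5), "B"), ((4.0, 1.5), "B")]
print(knn_classify(train, (2.1, 0.7), k=3))  # A
```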

KNN Regression
KNN Regression is a simple yet effective non-parametric technique used to predict continuous values. It
works by averaging the target values of the nearest neighbors of a given input.
 KNN can be used for regression, not just classification.
 Given a dataset of labeled examples X={(x1,y1),(x2,y2),...,(xn,yn)}
 Where xi are data vectors and yi are scalar values.
 The process involves:
1. Finding the k nearest neighbors of a new data point x.
2. Averaging the corresponding y-values to predict the output.
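The two steps above can be sketched in a few lines of Python (the dataset is made up for illustration):

```python
import math

def knn_regress(train, target, k=3):
    """train: list of (x_vector, y_value); predict the mean y of the k nearest."""
    nearest = sorted(train, key=lambda p: math.dist(p[0], target))[:k]
    return sum(y for _, y in nearest) / k

# Hypothetical 1-D dataset where y is roughly 2*x
train = [((1.0,), 2.1), ((2.0,), 3.9), ((3.0,), 6.2),
         ((4.0,), 8.1), ((5.0,), 9.8)]
print(knn_regress(train, (2.5,), k=2))  # mean of 3.9 and 6.2 = 5.05
```

Using the mean of the neighbors makes the prediction piecewise-constant; a common variant weights each neighbor's y-value by inverse distance instead.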
Performance of Classifiers
Performance of Regression Algorithms
