UNIT-II: Nearest Neighbor-Based Models: Introduction to Proximity Measures, Distance Measures, Non-Metric
Similarity Functions, Proximity Between Binary Patterns, Different Classification Algorithms Based on the Distance
Measures, K-Nearest Neighbor Classifier, Radius Distance Nearest Neighbor Algorithm, KNN Regression, Performance
of Classifiers, Performance of Regression Algorithms.
Introduction to Proximity Measures
Proximity measures are mathematical tools used to quantify the similarity or dissimilarity between
objects, typically in the context of data analysis, machine learning, and pattern recognition. These measures
are essential for clustering, classification, and recommendation systems, among other applications.
Types of Proximity Measures
Proximity measures are broadly categorized into similarity measures (which indicate how alike two objects
are) and dissimilarity measures (which indicate how different two objects are).
1. Distance Metrics (Dissimilarity Measures)
o Euclidean Distance: The most common distance metric, defined as the straight-line distance
between two points in a multidimensional space.
o Manhattan Distance (Taxicab Distance): Measures distance by summing the absolute
differences along each dimension.
o Minkowski Distance: A generalized form of both Euclidean and Manhattan distances.
o Cosine Distance: One minus the cosine of the angle between two vectors, commonly used in text
analysis.
2. Similarity Measures
o Cosine Similarity: Measures the cosine of the angle between two vectors, often used in text
and recommendation systems.
o Jaccard Similarity: Used for comparing the similarity of two sets, defined as the size of the
intersection divided by the size of the union.
o Pearson Correlation: Measures the linear correlation between two variables.
Distance Measures
Each distance measure is suited for specific types of data patterns. Below, we discuss the types of data suitable for
each measure, along with calculations and applications.
1. Euclidean Distance
**Pattern Type:** Continuous numerical data, typically used in Cartesian space.
Formula: d(A, B) = √((x₂ - x₁)² + (y₂ - y₁)²)
**Example Calculation:**
For A(2,3) and B(6,7):
d(A, B) = √((6-2)² + (7-3)²) = √32 ≈ 5.66
✅ Clustering (e.g., K-Means)
✅ Image Processing
✅ Biometrics (e.g., face recognition)
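A quick check of the calculation above, using only the Python standard library (the point coordinates are taken from the example):

```python
import math

# Points from the worked example above
A = (2, 3)
B = (6, 7)

# Straight-line (Euclidean) distance: sqrt((6-2)^2 + (7-3)^2)
d = math.dist(A, B)
print(round(d, 2))  # 5.66
```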
2. Manhattan Distance
**Pattern Type:** Grid-based movement, where only horizontal/vertical moves are allowed.
Formula: d(A, B) = |x₂ - x₁| + |y₂ - y₁|
**Example Calculation:**
For A(2,3) and B(6,7):
d(A, B) = |6-2| + |7-3| = 8
✅ Pathfinding (e.g., A* Algorithm)
✅ Urban Planning
✅ Logistics
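The same example in Python, summing the absolute differences per dimension:

```python
# Manhattan distance: sum of absolute coordinate differences
A = (2, 3)
B = (6, 7)

d = sum(abs(a - b) for a, b in zip(A, B))
print(d)  # 8
```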
3. Minkowski Distance
**Pattern Type:** Generalized distance metric; parameterized by 'p'.
Formula: d(A, B) = (∑ᵢ |aᵢ - bᵢ|^p)^(1/p)
**Example Calculation:**
For A(2,3) and B(6,7), with p=3:
d(A, B) = (4³ + 4³)^(1/3) = 128^(1/3) ≈ 5.04
✅ Machine Learning (e.g., K-NN)
✅ Dimensionality Reduction
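A small generic implementation; note that p=1 and p=2 recover the Manhattan and Euclidean distances from the previous sections:

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

A, B = (2, 3), (6, 7)
print(round(minkowski(A, B, 1), 2))  # 8.0  -> Manhattan (p=1)
print(round(minkowski(A, B, 2), 2))  # 5.66 -> Euclidean (p=2)
print(round(minkowski(A, B, 3), 2))  # 5.04 -> example above
```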
4. Cosine Distance
**Pattern Type:** High-dimensional text and vector data, measuring the angle between vectors.
Formula: d(A, B) = 1 - (∑ xᵢ yᵢ) / (√(∑ xᵢ²) * √(∑ yᵢ²))
**Example Calculation:**
For vectors A = (1,2,3) and B = (4,5,6):
A·B = 32, ‖A‖ = √14, ‖B‖ = √77
cosθ = 32 / (√14 · √77) ≈ 0.975
d(A, B) = 1 - 0.975 = 0.025
✅ NLP (e.g., text similarity)
✅ Recommendation Systems
✅ Document Clustering
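The same computation in Python, from scratch (vectors from the example above):

```python
import math

def cosine_distance(a, b):
    """1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)

print(round(cosine_distance((1, 2, 3), (4, 5, 6)), 3))  # 0.025
```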
5. Jaccard Distance
**Pattern Type:** Categorical and set data, measuring overlap between sets.
Formula: d(A, B) = 1 - |A ∩ B| / |A ∪ B|
**Example Calculation:**
For sets A={1,2,3,4} and B={3,4,5,6}:
A ∩ B = {3,4} (size 2), A ∪ B = {1,2,3,4,5,6} (size 6)
d(A, B) = 1 - (2/6) ≈ 0.67
✅ Plagiarism Detection
✅ Bioinformatics
✅ Recommendation Systems
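A direct translation of the set formula, reproducing the example:

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| for two sets."""
    return 1 - len(a & b) / len(a | b)

A = {1, 2, 3, 4}
B = {3, 4, 5, 6}
print(round(jaccard_distance(A, B), 2))  # 0.67
```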
| Distance Measure | Best For | Example Use Case |
|---|---|---|
| Euclidean Distance | Continuous numerical data | K-Means Clustering, Face Recognition |
| Manhattan Distance | Grid-based movement | Pathfinding, Supply Chain Optimization |
| Minkowski Distance | Mixed feature types | ML Models (K-NN, SVM) |
| Cosine Distance | High-dimensional vector data | Text Similarity, Document Clustering |
| Jaccard Distance | Categorical & set data | Plagiarism Detection, Bioinformatics |
Metric Similarity (Distance) Functions
A metric is a measure that satisfies three key properties:
1. Positive Reflexivity: d(x, x) = 0
2. Symmetry: d(x, y) = d(y, x)
3. Triangle Inequality: d(x, y) ≤ d(x, z) + d(z, y)
These properties ensure that the metric behaves like a proper distance measure.
Metric Functions: Follow strict mathematical properties and can be used for Euclidean-based distance
computations.
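As an illustration, the sketch below spot-checks the three axioms for the Euclidean distance on randomly sampled 2D points (a numerical sanity check, not a proof):

```python
import math
import random

random.seed(0)

for _ in range(1000):
    x, y, z = [(random.random(), random.random()) for _ in range(3)]
    assert math.dist(x, x) == 0                    # positive reflexivity
    assert math.dist(x, y) == math.dist(y, x)      # symmetry
    # small epsilon tolerates floating-point rounding
    assert math.dist(x, y) <= math.dist(x, z) + math.dist(z, y) + 1e-12  # triangle inequality

print("All three metric axioms hold on the sampled points.")
```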
Non-metric Similarity Functions
These similarity functions do not necessarily obey the triangular inequality or symmetry. They are
commonly used in scenarios like:
• Image processing
• String matching
• Noisy data analysis
Examples of Non-Metric Similarity Measures:
1. k-Median Distance:
o Defined as the k-th median of the absolute differences between corresponding elements of
two vectors.
2. Cosine Similarity:
o Given two vectors x and y, the similarity measure is: S(x, y) = (x · y) / (‖x‖ ‖y‖)
o The corresponding distance function is: d(x, y) = 1 - S(x, y)
o While it is symmetric, it violates the triangle inequality.
Example Demonstration (Violating the Triangle Inequality)
Given three vectors x, y, z in 2D space, where the angles between vectors are:
o x and z → 45°
o z and y → 45° (so the angle between x and y is 90°)
Then d(x, z) + d(z, y) = 2(1 - cos 45°) ≈ 0.59, while d(x, y) = 1 - cos 90° = 1.
Since d(x, z) + d(z, y) < d(x, y), the triangle inequality is violated.
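This can be verified numerically. Taking unit vectors at 0°, 45°, and 90° (an illustrative construction of the scenario above):

```python
import math

def cosine_distance(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    return 1 - dot / (math.hypot(*a) * math.hypot(*b))

x = (1, 0)                                           # at 0°
z = (math.cos(math.pi / 4), math.sin(math.pi / 4))   # at 45°
y = (0, 1)                                           # at 90°

print(cosine_distance(x, z) + cosine_distance(z, y))  # ≈ 0.586
print(cosine_distance(x, y))                          # 1.0 -> inequality violated
```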
Non-Metric Functions: Provide more flexibility, especially for similarity measurements in data science,
image processing, and text analysis, but do not always obey distance axioms.
Proximity Between Binary Patterns
Binary patterns consist of data represented using only two values (typically 0 and 1). Measuring proximity
(similarity or dissimilarity) between such patterns is crucial in machine learning, pattern recognition, and
bioinformatics.
(a) Hamming Distance
• Measures the number of differing bit positions.
• Formula: d_H(x, y) = Σ | x_i - y_i |
• Example: x = 10110, y = 10011 → Hamming distance = 2 (differences at positions 3 and 5).
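A one-line implementation over bit strings, reproducing the example:

```python
def hamming(x, y):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(a != b for a, b in zip(x, y))

print(hamming("10110", "10011"))  # 2
```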
(b) Jaccard Similarity
• Measures the proportion of shared 1s between two binary vectors.
• Formula: S(x, y) = M_11 / (M_11 + M_10 + M_01)
Where:
M_11= Count of 1s in both x and y
M_10= Count of 1 in x, 0 in y
M_01= Count of 0 in x, 1 in y
Example: x = 10110, y = 10011
M_11 = 2, M_10 = 1, M_01 = 1 → S(x, y) = 2 / (2 + 1 + 1) = 0.5
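The same computation in Python (vectors from the example above):

```python
def jaccard_binary(x, y):
    """M11 / (M11 + M10 + M01) for two equal-length binary vectors."""
    m11 = sum(a == 1 and b == 1 for a, b in zip(x, y))
    m10 = sum(a == 1 and b == 0 for a, b in zip(x, y))
    m01 = sum(a == 0 and b == 1 for a, b in zip(x, y))
    return m11 / (m11 + m10 + m01)

x = [1, 0, 1, 1, 0]
y = [1, 0, 0, 1, 1]
print(jaccard_binary(x, y))  # 0.5
```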
(c) Simple Matching Coefficient (SMC)
The Simple Matching Coefficient (SMC) is a similarity measure used to compare binary data. It is defined as:
SMC = (f11 + f00) / (f11 + f10 + f01 + f00)
Explanation:
f11: Number of positions where both vectors have 1s.
f00: Number of positions where both vectors have 0s.
f10: Number of positions where the first vector has 1 and the second has 0.
f01: Number of positions where the first vector has 0 and the second has 1.
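Unlike Jaccard similarity, SMC also counts matching 0s. A minimal sketch, reusing the example vectors from the Jaccard section:

```python
def smc(x, y):
    """(f11 + f00) / total positions for two equal-length binary vectors."""
    matches = sum(a == b for a, b in zip(x, y))  # f11 + f00
    return matches / len(x)

x = [1, 0, 1, 1, 0]
y = [1, 0, 0, 1, 1]
print(smc(x, y))  # 0.6 (3 matching positions out of 5)
```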
Hamming and Euclidean distances measure dissimilarity.
Jaccard, Dice, and Cosine similarity measure closeness.
Different Classification Algorithms Based on the Distance Measures
Example: suppose the query point to classify is (2.1, 0.7).
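The training data for this example is not shown here, so the sketch below uses a hypothetical labeled 2D dataset to classify the query point (2.1, 0.7) with a from-scratch K-Nearest Neighbor classifier, which takes a majority vote among the k closest training points. A radius distance nearest neighbor classifier would instead vote among all training points within a fixed radius of the query.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Majority vote among the k training points nearest to the query."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical labeled 2D training data (illustrative only)
train = [((1.0, 1.0), "A"), ((1.5, 0.5), "A"), ((2.0, 1.0), "A"),
         ((5.0, 5.0), "B"), ((5.5, 4.5), "B"), ((6.0, 5.0), "B")]

print(knn_classify(train, (2.1, 0.7), k=3))  # "A"
```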
KNN Regression
KNN Regression is a simple yet effective non-parametric technique used to predict continuous values. It
works by averaging the target values of the nearest neighbors of a given input.
KNN can be used for regression, not just classification.
Given a dataset of labeled examples X = {(x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ)},
where the xᵢ are data vectors and the yᵢ are scalar target values.
The process involves:
1. Finding the k nearest neighbors of a new data point x.
2. Averaging the corresponding y-values to predict the output (see the sketch below).
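A minimal from-scratch sketch (the one-dimensional training data below is an illustrative assumption):

```python
import math

def knn_regress(train, query, k=3):
    """Average the y-values of the k training points nearest to the query."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return sum(y for _, y in neighbors) / k

# Hypothetical regression data: (x vector, scalar y)
train = [((1.0,), 1.2), ((2.0,), 1.9), ((3.0,), 3.1),
         ((4.0,), 4.2), ((5.0,), 4.8)]

# Prediction = mean y of the 3 nearest neighbors of x = 2.5
print(round(knn_regress(train, (2.5,), k=3), 2))  # 2.07
```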
Performance of Classifiers
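Classifier performance is commonly summarized with the confusion matrix (counts of true/false positives and negatives) and the measures derived from it: accuracy (fraction of correct predictions), precision (fraction of predicted positives that are truly positive), recall (fraction of actual positives that are found), and F1 score (harmonic mean of precision and recall). A minimal from-scratch sketch with made-up binary labels:

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive).
    Assumes at least one positive label and one positive prediction."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(classification_metrics(y_true, y_pred))  # (0.75, 0.75, 0.75, 0.75)
```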
Performance of Regression Algorithms
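Regression performance is typically measured by mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), and the coefficient of determination R², which gives the fraction of variance in the targets explained by the model. A minimal sketch with made-up values:

```python
import math

def regression_metrics(y_true, y_pred):
    """MAE, MSE, RMSE, and R² for paired true/predicted values."""
    n = len(y_true)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    rmse = math.sqrt(mse)
    mean_t = sum(y_true) / n
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)   # total variance
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual variance
    r2 = 1 - ss_res / ss_tot
    return mae, mse, rmse, r2

y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 3.0, 8.0]
mae, mse, rmse, r2 = regression_metrics(y_true, y_pred)
print(round(mae, 3), round(mse, 3), round(rmse, 3), round(r2, 3))  # 0.5 0.375 0.612 0.882
```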