Distance and Similarity Measures for Machine Learning

A distance measure quantifies the intuitive notion of how far apart two objects or events are.

Distance measures are used by a wide variety of algorithms, both supervised and unsupervised. Metrics such as Euclidean distance and cosine similarity appear in the k-nearest neighbours algorithm, document similarity search, clustering, anomaly detection, and more.

It is therefore important to have a solid understanding of the various distance measures: no single measure suits every type of data. Can we use Euclidean distance for high-dimensional data? Not reliably, and computational convenience is yet another factor to take into account when selecting a distance measure.

In this article, I’ll go through several different types of distance and similarity measures, as well as their geometric representations, potential applications, and limitations.

Euclidean distance

In geometry, the most commonly used distance measure is the Euclidean distance. The Euclidean distance between two points in the plane is the length of the straight line segment between them. Given only the Cartesian coordinates of two points, their Euclidean distance can be determined with the Pythagorean theorem.

Geometrically, we can represent the Euclidean distance between points A and B as follows:

[Figure: geometric representation of the Euclidean distance between points A and B]

Let A = {a1, a2, …, an} and B = {b1, b2, …, bn} be two vectors of dimension n. The Euclidean distance between A and B is computed as:

D(A, B) = sqrt( (a1 − b1)² + (a2 − b2)² + … + (an − bn)² )

Example:

Consider A = {1, 2, 3, 4, 5} and B = {0, 2, 4, 6, 8}. The Euclidean distance between A and B is then:

D(A, B) = sqrt( (1 − 0)² + (2 − 2)² + (3 − 4)² + (4 − 6)² + (5 − 8)² )

= sqrt(1 + 0 + 1 + 4 + 9)

= sqrt(15)

≈ 3.87
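
As a quick sanity check, here is a minimal NumPy sketch of the same computation:

    import numpy as np

    # The two example vectors from the text.
    A = np.array([1, 2, 3, 4, 5])
    B = np.array([0, 2, 4, 6, 8])

    # Euclidean distance is the L2 norm of the difference vector.
    print(np.linalg.norm(A - B))  # 3.8729... ≈ 3.87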

In machine learning lingo, the Euclidean distance is also known as the L2 norm (strictly, the L2 norm of the difference between the two vectors). In several domains, including machine learning, computer vision, and data mining, the Euclidean distance is the de facto standard. The benefits and drawbacks of relying on it are as follows:

Advantages of Euclidean distance:

  • Euclidean distance is simple and fast to compute.
  • It is intuitive, since it directly reflects the physical separation between two points.
  • Euclidean distance is the default distance metric in many algorithms and sees extensive use in a wide range of disciplines.
  • It is a proper metric (non-negative, symmetric, and satisfying the triangle inequality), so it can be used with any algorithm that requires a true distance.

Limitations of Euclidean distance:

  • Scale sensitivity: Euclidean distance is sensitive to the scale of the features, so features with larger numeric ranges can dominate the distance calculation (a standardization sketch follows this list).
  • Because it implicitly treats all features as having comparable variance, Euclidean distance is biased towards high-variance features. This can be problematic if the features with smaller variances are also the most informative.
  • Since it treats the data as continuous, Euclidean distance is not suitable for categorical data.
  • Euclidean distance does not account for outliers, which can distort distance estimates.
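
One common mitigation for the scale-sensitivity issue is to standardize each feature (zero mean, unit variance) before computing distances. A minimal NumPy sketch with made-up height (metres) and weight (grams) values, chosen only to illustrate the effect of very different scales:

    import numpy as np

    # Hypothetical data: column 0 in metres, column 1 in grams.
    X = np.array([[1.70, 65000.0],
                  [1.80, 80000.0],
                  [1.65, 62000.0]])

    # Standardize each feature so no column dominates purely due to its units.
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)

    # Distance between the first two rows, before and after standardization.
    print(np.linalg.norm(X[0] - X[1]))          # dominated by the gram-scale column
    print(np.linalg.norm(X_std[0] - X_std[1]))  # both features contribute comparably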

In conclusion, while Euclidean distance is a useful distance metric, it has limitations that make it less than ideal for certain uses.

Cosine Similarity Measure

The degree of resemblance between two vectors in an inner product space is quantified by the cosine similarity. It uses the cosine of the angle between two vectors to determine whether they point in roughly the same direction. In text analysis, it is frequently employed to gauge the degree of similarity between two documents.

[Figure: cosine similarity]

The cosine similarity between two vectors equals the cosine of the angle between them. Equivalently, it is the inner product of the two vectors after each has been normalized to unit length.

similarity(A, B) = cos(θ) = (A · B) / (‖A‖ ‖B‖)

Similarity values range from −1 to 1. The smaller the angle between two vectors, the larger the cosine and the higher their similarity. For example (each case is verified in the short sketch after this list):

  • When two vectors point in the same direction, the angle between them is 0 degrees and their cosine similarity is 1.
  • When two vectors are perpendicular, the angle between them is 90 degrees and their cosine similarity is 0.
  • When two vectors point in opposite directions (180 degrees apart), their cosine similarity is −1.
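
The three cases above can be verified with a minimal NumPy sketch (the cosine_similarity helper is written out here for illustration; libraries such as scikit-learn ship their own implementation):

    import numpy as np

    def cosine_similarity(a, b):
        # Cosine of the angle between vectors a and b.
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    a = np.array([1.0, 0.0])
    print(cosine_similarity(a, np.array([3.0, 0.0])))   #  1.0 (same direction)
    print(cosine_similarity(a, np.array([0.0, 2.0])))   #  0.0 (perpendicular)
    print(cosine_similarity(a, np.array([-1.0, 0.0])))  # -1.0 (opposite directions)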

Cosine similarity is a popular measure of similarity used in many different contexts, including information retrieval and text mining. Its advantages and disadvantages are as follows.

Advantages of cosine similarity:

  • Cosine similarity ignores the magnitudes of the vectors being compared, so features with large absolute values do not overpower the comparison even when feature scales vary widely.
  • Its relative robustness to outliers makes it a good option for noisy or unreliable datasets.
  • Cosine similarity tends to work well with high-dimensional, sparse data (such as text), where many other similarity metrics struggle.
  • It is a straightforward concept: the similarity of two vectors is simply the cosine of the angle between them.

Limitations of cosine similarity:

  • Cosine similarity can be hard to interpret on datasets with negative feature values, since those can produce negative similarity scores.
  • Unlike Euclidean distance, cosine similarity is insensitive to vector magnitude: two vectors pointing in the same direction score a similarity of 1 regardless of their lengths (demonstrated in the sketch after this list).
  • Cosine similarity assumes continuous data, making it inappropriate for categorical data.
  • Cosine similarity is not a true distance metric: the derived cosine distance (1 − similarity) violates the triangle inequality, so it cannot be used with algorithms that require a proper metric.
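
The magnitude-insensitivity point is easy to demonstrate: scaling a vector changes its Euclidean distance from another vector but leaves the cosine similarity untouched. A minimal sketch:

    import numpy as np

    def cosine_similarity(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    a = np.array([1.0, 2.0, 3.0])
    b = 10 * a  # same direction, ten times the magnitude

    print(cosine_similarity(a, b))  # 1.0: direction is identical
    print(np.linalg.norm(a - b))    # ~33.67: Euclidean distance grows with the scale gap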

In conclusion, while cosine similarity is a valuable similarity metric, its applicability depends on the context and the data.
