Distances and Distance Calculation

Introduction

Generally speaking, a distance is a number between 0.0 (highest similarity) and 1.0 (lowest similarity) that tells you how similar two items are.

In our case, when speaking of a Distance, we sometimes mean munin.distance.Distance, which gathers single distances (in the general sense), weights them and combines them into one value by calculating the weighted average. In most cases, however, both terms can be used interchangeably.
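The weighted-average combination described above can be sketched in a few lines. This is a minimal illustration, not the actual munin.distance.Distance API; the function name and weights are made up:

```python
# Minimal sketch of combining several sub-distances into one value via a
# weighted average. Each sub-distance lies in [0.0, 1.0], so the result
# does too. Illustrative only; libmunin's real Distance class may differ.
def combine(distances, weights):
    """Weighted average of single distances (each in [0.0, 1.0])."""
    total_weight = sum(weights)
    return sum(d * w for d, w in zip(distances, weights)) / total_weight

# Two attribute distances: genre matches well (0.1), title less so (0.7).
# The genre attribute is weighted twice as heavily.
print(combine([0.1, 0.7], [2.0, 1.0]))  # (0.1*2 + 0.7*1) / 3 = 0.3
```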

See also

Distance in the Glossary.

A DistanceFunction is something that calculates a Distance. Let’s explain this with some examples. Imagine you have to compare the following values and tell how similar they are:

See also

DistanceFunction in the Glossary.

Stupid and simple:

>>> 1.0 - float('Elch' == 'Pech')
1.0
>>> 1.0 - float('Elch' == 'Elch')
0.0

To nobody’s surprise this yields only two values, 0.0 and 1.0, which is nothing but unsatisfying. We need a way to get values in between. And what about comparisons like 'elch' == 'Elch'?

I-googled-and-found-stuff-on-Stackoverflow:

>>> from difflib import SequenceMatcher
>>> 1 - SequenceMatcher(a='Elch', b='Pech').ratio()
0.5
>>> 1 - SequenceMatcher(a='Elch', b='elch').ratio()
0.25
>>> 1 - SequenceMatcher(a='Elch'.lower(), b='elch'.lower()).ratio()
0.0
>>> 1 - SequenceMatcher(a='PECH', b='Stuff').ratio()
1.0

Now, that’s a lot better already. We can see that there is always a normalization step (str.lower()) and an actual comparison (difflib.SequenceMatcher()).

The normalization part is done by something called a Provider. We will learn about that in the next chapter (What are Providers?).
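The normalize-then-compare split can be illustrated with a tiny stand-in class. This is a hypothetical sketch of the idea only; the real Provider API lives in munin.provider and may look different:

```python
# Hypothetical illustration of the normalize/compare split described
# above. 'LowercaseProvider' and its process() method are made-up names,
# not libmunin's actual Provider interface.
from difflib import SequenceMatcher

class LowercaseProvider:
    """A 'Provider' in the loose sense: it normalizes input values."""
    def process(self, value):
        return value.lower().strip()

def distance(a, b, provider=LowercaseProvider()):
    # Normalization first, then the actual comparison.
    a, b = provider.process(a), provider.process(b)
    return 1.0 - SequenceMatcher(a=a, b=b).ratio()

print(distance('Elch', 'elch'))  # 0.0 after normalization
```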

Real life version:

A bit more sophisticated comparison:

>>> # This needs PyStemmer and pyxDamerauLevenshtein
>>> from pyxdameraulevenshtein import damerau_levenshtein_distance, normalized_damerau_levenshtein_distance
>>> damerau_levenshtein_distance('elch', 'Pech')
2  # Number of insertions, substitutions and deletions to get from 'elch' to 'Pech'
>>> normalized_damerau_levenshtein_distance('elch', 'Pech')
0.5  # With string length taken into account - like difflib!

Now with additional normalization:

>>> from Stemmer import Stemmer
>>> normalize = lambda word: Stemmer('english').stemWord(word.lower().strip())
>>> normalized_damerau_levenshtein_distance(normalize('elch'), normalize('Elchy'))
0.2

This makes use of the Damerau-Levenshtein distance (Levenshtein distance on Wikipedia) and of the Porter stemming algorithm (stemming on Wikipedia) to normalize a single word to its word stem (kindness becomes kind).

Note

The input value needs to be stemmed only once, which can save a lot of time.
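One simple way to ensure each word is normalized only once is to cache the result. In the sketch below a plain lower()/strip() stands in for the stemmer so the snippet stays self-contained; in practice the cached function would call PyStemmer's stemWord() as shown above:

```python
# Sketch: normalize each distinct input at most once by caching the
# result. lower()/strip() stands in for the (more expensive) stemming
# step here; this is an illustration, not libmunin code.
from functools import lru_cache

@lru_cache(maxsize=None)
def normalize(word):
    # Expensive normalization would happen here, at most once per word.
    return word.lower().strip()

normalize('Elch')   # computed
normalize('Elch')   # served from the cache
print(normalize.cache_info().hits)  # 1
```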

Implementation in libmunin

We have a base class for distance calculation:

Many subclasses in the munin.distance submodules inherit from it.

Note

DistanceFunctions take two lists of values as input, even for single values.


List of concrete DistanceFunctions

Note

If no DistanceFunction is given, a default will be used which sorts both input lists, zips them and compares each pair of items with ==. The number of matching items is then divided by the maximum length of both lists.
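The default behaviour from the note can be sketched as a plain function. Note one assumption: the docs describe counting matches and dividing by the longer length, which is a similarity; I subtract it from 1.0 here so the result reads as a distance, which may differ from libmunin's exact return convention:

```python
# Sketch of the default comparison described in the note: sort both
# lists, zip them, count == matches and divide by the longer length.
# Returning 1.0 minus that ratio (so it behaves as a distance) is an
# assumption on my part.
def default_distance(list_a, list_b):
    a, b = sorted(list_a), sorted(list_b)
    matches = sum(x == y for x, y in zip(a, b))
    longest = max(len(a), len(b)) or 1  # avoid division by zero
    return 1.0 - matches / longest

print(default_distance(['jazz', 'rock'], ['jazz']))  # 0.5: one of two match
```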


Reference

class munin.distance.DistanceFunction(provider=None)[source]

A DistanceFunction calculates a Distance.

This class is supposed to be overridden, but can also be used as a fallback.

__call__ is implemented as a shortcut to compute().

Parameters:	name (String) – Name of the DistanceFunction (used for display)
do_compute(list_a, list_b)[source]

Compare both lists with == by default.

This goes through both lists and counts the matching elements at the same index.

Returns:	Number of matches divided by the max length of both lists.
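Putting the reference together, the class layout (__call__ as a shortcut to compute(), subclasses overriding do_compute()) can be sketched as follows. Both class names and the Ratio-based subclass are hypothetical, assumed from the reference above rather than taken from libmunin's source:

```python
# Sketch of the class layout in the reference: __call__ delegates to
# compute(), and subclasses override do_compute(). RatioDistance is a
# made-up example subclass, not part of libmunin.
from difflib import SequenceMatcher

class DistanceFunctionSketch:
    def __call__(self, list_a, list_b):
        return self.compute(list_a, list_b)

    def compute(self, list_a, list_b):
        # A real implementation might validate or clamp values here.
        return self.do_compute(list_a, list_b)

    def do_compute(self, list_a, list_b):
        # Fallback: count matches at the same index, as a distance.
        matches = sum(a == b for a, b in zip(list_a, list_b))
        return 1.0 - matches / max(len(list_a), len(list_b))

class RatioDistance(DistanceFunctionSketch):
    def do_compute(self, list_a, list_b):
        # Average pairwise SequenceMatcher distance instead of plain ==.
        pairs = list(zip(list_a, list_b))
        return sum(1 - SequenceMatcher(a=a, b=b).ratio()
                   for a, b in pairs) / len(pairs)

print(RatioDistance()(['Elch'], ['Pech']))  # 0.5
```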