Generally spoken a distance is a number between 0.0 (highest similarity) and 1.0 (lowest similarity) that tells you about the similarity of two items.
In our case, when speaking of a Distance, we sometimes mean munin.distance.Distance, which gathers single distances (in the general meaning) and combines them to one value by weighting them and calculating the average. But in most cases both terms can be used interchangeably.
See also
Distance in the Glossary.
A DistanceFunction is a something that calculates a Distance. Let’s explain this with some examples. Imagine you have to compare the following values and tell about their similarity:
See also
DistanceFunction in the Glossary.
>>> 1.0 - float('Elch' == 'Pech')
1.0
>>> 1.0 - float('Elch' == 'Elch')
0.0
This will yield to nobodie’s surprise only two values 0.0 and 1.0, which is nothing but unsatisfying. We need a way to get values in between. And what about comparasions like elch == Elch?
>>> from difflib import SequenceMatcher
>>> 1 - SequenceMatcher(a='Elch', b='Pech').ratio()
0.5
>>> 1 - SequenceMatcher(a='Elch', b='elch').ratio()
0.25
>>> 1 - SequenceMatcher(a='Elch'.lower(), b='elch'.lower()).ratio()
0.0
>>> 1 - SequenceMatcher(a='PECH', b='Stuff').ratio()
1.0
Now, that’s a lot better already. We can see that we always have a normalization (str.lower()) and an actual comparasion (difflib.SequenceMatcher()).
The Normalization part is done by something called a Provider. We learn about that in the next chapter (What are Providers?).
A bit more sophisticated comparasion:
>>> # This needs PyStemmer and pyxDamerauLevenshtein
>>> from pyxdameraulevenshtein import *
>>> damerau_levenshtein_distance('elch', 'Pech')
2 # Number of adds., subs. and del. to get from 'elch' to 'pech'
>>> normalized_damerau_levenshtein_distance('elch', 'Pech')
0.5 # With string length taken into account - like difflib!
Now with addtional normalization:
>>> from Stemmer import Stemmer
>>> normalize = lambda word: Stemmer('english').stemWord(word.lower().strip())
>>> normalized_damerau_levenshtein_distance(normalize('elch'), normalize('Elchy'))
0.2
This makes use of the DamereauLevenshtein-Distance (levenshtein on wikipedia) and of the Porter Stemming (stemming on wikipedia) algorithm to normalize a single word to it’s Wordstem (kindness becomes kind).
Note
The input value needs to be stemmed only once, which can save a lot time.
We have a base-class for distance calculation there:
From this many subclasses in the munin.distance. submodules are inherited.
Note
DistanceFunction use two list of values as input. Even for single values.
Note
If no DistanceFunction is given, a default will be used which will sort both input list, zip them and compare each item with ==. The number of matching items is then divided through the maximum length of both lists.
A DistanceFunction calculates a Distance
This class is supposed to be overriden, but can also be used as fallback.
__call__ is implemented as shortcut to compute()
Parameters: | name (String.) – Name of the DistanceFunction (used for display) |
---|