When using thefuzz in Python to calculate a simple ratio between two strings, a result of 0 means they are totally different while a result of 100 represents a 100% match. What do intermediate results mean? Does a result of 82, say, mean that the two strings are 82% similar? Or is it just an abstract idea of 'bigger is better'?
The documentation is sadly lacking in any detail to answer this question, so far as I can tell.
There are a bunch of string matching algorithms that have been developed over the last... hundred years or so. I believe the string matching algorithm under the hood of this library is InDel.
InDel is a variation of the much more common Levenshtein distance algorithm. Levenshtein distance essentially counts the number of insertions, deletions, and substitutions needed to get from the first string to the second string.
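To make the counting concrete, here is a minimal pure-Python sketch of Levenshtein distance. It is only an illustration of the definition above, not the code thefuzz runs; the function name `levenshtein_distance` is my own.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn `a` into `b` (classic dynamic programming)."""
    # prev[j] = distance between the processed prefix of `a` and b[:j]
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into "" takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(
                prev[j] + 1,         # delete ch_a
                curr[j - 1] + 1,     # insert ch_b
                prev[j - 1] + cost,  # substitute (or keep if equal)
            ))
        prev = curr
    return prev[-1]


print(levenshtein_distance("kitten", "sitting"))  # 3
print(levenshtein_distance("abc", "abd"))         # 1
```

Note that "abc" to "abd" costs 1 here because a substitution counts as a single edit.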
With InDel, only insertions and deletions are counted. The ratio is calculated by dividing the number of insertions and deletions by the combined length of both strings, and then subtracting the result from 1. So the closer to 1, the closer the match, as it took fewer insertions and deletions to get from one string to the other. thefuzz scales this value to the 0 to 100 range, so 1 corresponds to 100.
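The calculation just described can be sketched in plain Python. This is my own illustration under the assumption that the library really does use an InDel-based ratio; the names `lcs_length`, `indel_distance`, and `indel_ratio` are hypothetical, and the real implementation is optimized native code.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of `a` and `b`."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_distance(a: str, b: str) -> int:
    # Every character outside the longest common subsequence has to be
    # deleted from `a` or inserted from `b`.
    return len(a) + len(b) - 2 * lcs_length(a, b)


def indel_ratio(a: str, b: str) -> float:
    """1 - distance / combined length, scaled to 0..100."""
    total = len(a) + len(b)
    if total == 0:
        return 100.0  # assumption of this sketch: two empty strings match
    return 100.0 * (1 - indel_distance(a, b) / total)


print(round(indel_ratio("this is a test", "this is a test!"), 2))  # 96.55
print(round(indel_ratio("abc", "abd"), 2))                         # 66.67
```

Under this definition the ratio simplifies to 2 * LCS / (len(a) + len(b)), so a score of 82 would mean that 82% of the characters in the two strings combined belong to their longest common subsequence. It also shows why "abc" vs "abd" scores only 66.67: without substitutions, changing one character costs a deletion plus an insertion.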
The real question you have to determine for yourself is how far away from 1 (a perfect match) you are willing to go and still treat two strings as the same. Likely, no matter what you choose, you will end up with false positives/negatives.