python

Is there a way to compare two strings for likeness?


I need to link certain sample names to each other. The problem is that the sample names I want to match differ slightly from the keys of a dictionary, and I need to look up the correct value in that dictionary.

Example:

sample = "foo_foo.bar.12"
matching_dict = {"foo_foo-bar-12": "foo.bar.12"}

I have about 5500 samples, each with a different type of arrangement, so not every sample looks like the example I gave.

Ideally I want a dynamic way of comparing the two strings with each other and then getting the value from the dict for the key that is most alike.


Solution

  • You could use Levenshtein distance. It counts the single-character insertions, deletions, and substitutions needed to turn one string into the other, so a lower distance means the strings are more alike. There is an easy-to-use Python library for it called python-levenshtein. With it you could compare your sample to every key in the dictionary and take the value whose key has the lowest Levenshtein distance.
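
A minimal sketch of that lookup, under a few assumptions: the distance is written out in plain Python so the example runs without extra installs (if python-levenshtein is installed, its `Levenshtein.distance` function should be a drop-in replacement), the helper names `levenshtein` and `closest_key` are made up for illustration, and the second dictionary entry is hypothetical, added only so there is more than one candidate.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def closest_key(sample: str, mapping: dict) -> str:
    """Return the key of mapping with the lowest distance to sample."""
    return min(mapping, key=lambda key: levenshtein(sample, key))


sample = "foo_foo.bar.12"
matching_dict = {
    "foo_foo-bar-12": "foo.bar.12",
    "baz_baz-qux-34": "baz.qux.34",  # hypothetical extra entry, not from the question
}

best_key = closest_key(sample, matching_dict)
value = matching_dict[best_key]
print(best_key, value)
```

Note that `min` always returns some key, even when nothing is a good match, so with 5500 differently arranged samples it is probably worth also checking the distance of the winner against a cutoff you choose and reviewing the cases that exceed it by hand.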