[SOLVED] Efficient unordered substring matching

Efficient unordered substring matching

I want to match if one string is contained in another, regardless the order of characters. For example, if I have a string submarine I want to be able to detect marines as a match.

The way I'm currently handling this is through lists:

def match(x, y):
    x, y = list(x), list(y)
    for i in x:
        try:
            y.remove(i)
        except ValueError:
            return False
    return True

But this is inefficient when I try to match many combinations.

I thought then to use regex, but didn't make it.

Any ideas?

Solution

You can make use of a character class [SEARCH_WORD] where each character will be searched for independently. By setting the + quantifier after it, you will look for 1 or more characters, and by adding \b word boundaries, you will only match whole words:

r'\b[submarine]+\b'

See the regex demo and the IDEONE demo:

import re
s = "I have a string submarine I want to be able to detect marines as a match"
kw = "submarine"
r  = re.compile(r"\b[{0}]+\b".format(kw))
print(r.findall(s))

NOTE: If your input can contain non-word characters, especially characters like ^, ], \ or -, escape those with re.escape and use r"(?<!\w)[{0}]+(?!\w)".format(re.escape("submarine")).

import re
s = "I have a string ^submarine I want to be able to detect ^marines as a match"
kw = "^submarine"
r  = re.compile(r"(?<!\w)[{0}]+(?!\w)".format(re.escape(kw)))
print(r.findall(s))

See IDEONE demo