pythonregexpattern-matchingpython-2.5

Efficient unordered substring matching


I want to match if one string is contained in another, regardless the order of characters. For example, if I have a string submarine I want to be able to detect marines as a match.

The way I'm currently handling this is through lists:

def match(x, y):
    x, y = list(x), list(y)
    for i in x:
        try:
            y.remove(i)
        except ValueError:
            return False
    return True

But this is inefficient when I try to match many combinations.

I thought then to use regex, but didn't make it.

Any ideas?


Solution

  • You can make use of a character class [SEARCH_WORD] where each character will be searched for independently. By setting the + quantifier after it, you will look for 1 or more characters, and by adding \b word boundaries, you will only match whole words:

    r'\b[submarine]+\b'
    

    See the regex demo and the IDEONE demo:

    import re
    s = "I have a string submarine I want to be able to detect marines as a match"
    kw = "submarine"
    r  = re.compile(r"\b[{0}]+\b".format(kw))
    print(r.findall(s))
    

    NOTE: If your input can contain non-word characters, especially characters like ^, ], \ or -, escape those with re.escape and use r"(?<!\w)[{0}]+(?!\w)".format(re.escape("submarine")).

    import re
    s = "I have a string ^submarine I want to be able to detect ^marines as a match"
    kw = "^submarine"
    r  = re.compile(r"(?<!\w)[{0}]+(?!\w)".format(re.escape(kw)))
    print(r.findall(s))
    

    See IDEONE demo