Tags: python, regex, performance, security, non-greedy

Adversarial input for this regex


Someone asked me an interview question: write a function match(s, t) that decides whether a string s is a generalized substring of another string t. More concretely, match should return True if and only if we can remove some characters from t to make it equal to s. For example, match("abc", "abbbc") is True, because we can remove the two extra bs in t.

Surely the interviewer was expecting some kind of recursive solution, but I was feeling adventurous and wrote:

import re

def match(s, t):
    return re.search('.*?'.join(s), t) is not None
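As a sanity check on the construction (this snippet is my illustration, not part of the question), the joined pattern is just the characters of s separated by lazy wildcards:

```python
import re

def match(s, t):
    return re.search('.*?'.join(s), t) is not None

print('.*?'.join("abc"))      # a.*?b.*?c
print(match("abc", "abbbc"))  # True
print(match("abc", "acb"))    # False
```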

This seems to satisfy all the test cases, but then I started wondering whether there exists any adversarial input that can make this operation pathologically slow (and potentially make our service vulnerable to DoS attacks). See this great article for an extended discussion, but as a quick example, try

re.match("a?"*29+"a"*29, "a"*29)

and it will take several seconds before finding a match. The reason is that re implements a backtracking regex engine:

Consider the regular expression a?ⁿaⁿ. It matches the string aⁿ when the a? are chosen not to match any letters, leaving the entire string to be matched by the aⁿ. Backtracking regular expression implementations implement the zero-or-one ? by first trying one and then zero. There are n such choices to make, a total of 2ⁿ possibilities. Only the very last possibility, choosing zero for all the ?, will lead to a match. The backtracking approach thus requires O(2ⁿ) time, so it will not scale much beyond n=25.
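To see that growth concretely, here is a small timing sketch of the quoted pattern (my illustration; absolute times depend on the machine):

```python
import re
import time

for n in (5, 10, 15, 20):
    pattern = "a?" * n + "a" * n
    start = time.perf_counter()
    re.match(pattern, "a" * n)  # matches only after ~2**n backtracking steps
    print(f"n={n:2d}: {time.perf_counter() - start:.4f}s")
```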

Back to the interview question: technically '.*?'.join(s) gives us at least len(s) - 1 choices to make, which means there can be adversarial inputs similar to the "a?"*29+"a"*29 and "a"*29 above, but I failed to find them after some trial and error.

Question: what s and t can make match ridiculously slow?


Solution

  • Lazy quantifiers are generally quite good for performance, but AFAIK they do not prevent the pathological behaviour.

    This is especially true when the beginning of the regexp matches the beginning of the text, but the lazy match is made too early and fails near the end of the text, requiring a lot of backtracking to "fix" the bad early lazy matches at the beginning of the regexp.

    In your case, here is an example of pathological input requiring an exponential number of steps:

    # This should take at least a few minutes to compute
    n = 12
    match('ab'*(n+1), 'abbbbb'*n+'a')
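For contrast, the plain subsequence check the interviewer was presumably expecting handles this input instantly; a minimal linear-time sketch (match_linear is my naming, not from the answer):

```python
def match_linear(s, t):
    # s can be obtained by deleting characters from t iff s is a
    # subsequence of t: scan t once, advancing through s on each hit.
    it = iter(t)
    return all(c in it for c in s)

n = 12
print(match_linear('ab' * (n + 1), 'abbbbb' * n + 'a'))  # False, instantly
print(match_linear('abc', 'abbbc'))                      # True
```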
    

    Here is the number of steps and the Python match time required based on the value of n:

     n |  steps |    time
     1 |     44 |   2.4 µs
     2 |    374 |   6.4 µs
     3 |   2621 |  22.1 µs
     4 |  18353 | 131.3 µs
     5 | 126211 | 925.8 µs
     6 |    -   |   6.2 ms
     7 |    -   |  42.2 ms
     8 |    -   | 288.7 ms
     9 |    -   |  1.97 s
    

    You can understand and analyse the resulting regexp matching behaviour on this great website (more specifically backtracking in this case).