Someone asked me an interview question: write a function match(s, t) to decide whether a string s is a generalized substring of another string t. More concretely, match should return True if and only if removing some characters from t can make it equal to s. For example, match("abc", "abbbc") is True, because we can remove the two extra b's in t.
Surely the interviewer is expecting some kind of recursive solution, but I was feeling adventurous and wrote:

import re

def match(s, t):
    return re.search('.*?'.join(s), t) is not None
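The pattern '.*?'.join(s) simply interleaves the characters of s with lazy gaps, so re.search finds a match exactly when s appears in t as a subsequence. A quick sanity check (the acb case is an extra example of my own):

print('.*?'.join("abc"))      # a.*?b.*?c
print(match("abc", "abbbc"))  # True, as above
print(match("abc", "acb"))    # False: the c comes before the b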
This seems to satisfy all the test cases, but then I started wondering whether there is any adversarial input that can bog this operation down (and potentially make our service vulnerable to DoS attacks). See this great article for an extended discussion, but as a quick example, try
re.match("a?"*29+"a"*29, "a"*29)
and it will take several seconds before finding a match. The reason is that re implements a backtracking regex engine:

Consider the regular expression a?ⁿaⁿ. It matches the string aⁿ when the a? are chosen not to match any letters, leaving the entire string to be matched by the aⁿ. Backtracking regular expression implementations implement the zero-or-one ? by first trying one and then zero. There are n such choices to make, a total of 2ⁿ possibilities. Only the very last possibility—choosing zero for all the ?—will lead to a match. The backtracking approach thus requires O(2ⁿ) time, so it will not scale much beyond n = 25.
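To see the exponential growth concretely, here is a small timing sketch of that a?ⁿaⁿ example (exact timings are machine-dependent; the roughly doubling time per step is what matters):

import re
import time

# Each extra "a?" roughly doubles the work, because the engine
# tries "one" before "zero" for every optional a.
for n in range(18, 26):
    pattern = "a?" * n + "a" * n
    start = time.perf_counter()
    re.match(pattern, "a" * n)
    print(n, f"{time.perf_counter() - start:.4f} s")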
Back to the interview question: technically '.*?'.join(s) gives us at least len(s) - 1 choices to make, which means there could be adversarial inputs similar to the "a?"*29+"a"*29 and "a"*29 above, but I failed to find any after some trial and error.
Question: what s and t can make match ridiculously slow?
Lazy quantifiers are generally quite good for performance, but AFAIK they do not prevent pathological behaviour. This is especially true when the beginning of the regexp matches the beginning of the text, but that early match is wrong and fails near the end of the text, requiring a lot of backtracking to "fix" the bad early lazy matches at the beginning of the regexp.
In your case, here is an example of pathological input requiring an exponential number of steps: the text contains exactly as many a's as the pattern needs, but it ends with an a, so the final b of the pattern can never be matched, and the engine tries exponentially many ways of placing the lazy gaps before failing.
# This should take at least a few minutes to compute
n = 12
match('ab'*(n+1), 'abbbbb'*n+'a')
Here are the number of steps and the Python match time required, depending on the value of n:
n |  steps |     time
1 |     44 |   2.4 µs
2 |    374 |   6.4 µs
3 |   2621 |  22.1 µs
4 |  18353 | 131.3 µs
5 | 126211 | 925.8 µs
6 |      - |   6.2 ms
7 |      - |  42.2 ms
8 |      - | 288.7 ms
9 |      - |   1.97 s
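A minimal sketch to reproduce the timing column on your machine (absolute numbers will differ, and the step counts need a regex debugger rather than this wall-clock timing):

import re
import time

# The search always fails: the engine explores exponentially many
# ways of distributing the b's among the lazy .*? gaps before giving up.
for n in range(1, 10):
    s, t = 'ab' * (n + 1), 'abbbbb' * n + 'a'
    start = time.perf_counter()
    re.search('.*?'.join(s), t)
    print(n, f"{time.perf_counter() - start:.6f} s")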
You can understand and analyse the resulting regexp matching behaviour (more specifically, the backtracking in this case) on this great website.