If I have a collection of strings
is there a data structure or function that could improve the speed of checking if any of the elements of the collections are substrings
on my main string?
Right now I'm looping through my array of strings and using the in
operator. Is there a faster way?
import timing
## string match in first do_not_scan
## 0:00:00.029332
## string not in do_not_scan
## 0:00:00.035179
def check_if_substring():
for x in do_not_scan:
if x in string:
return True
return False
## string match in first do_not_scan
## 0:00:00.046530
## string not in do_not_scan
## 0:00:00.067439
def index_of():
for x in do_not_scan:
try:
string.index(x)
return True
except:
return False
## string match in first do_not_scan
## 0:00:00.047654
## string not in do_not_scan
## 0:00:00.070596
def find_def():
for x in do_not_scan:
if string.find(x) != -1:
return True
return False
string = '/usr/documents/apps/components/login'
do_not_scan = ['node_modules','bower_components']
for x in range(100000):
find_def()
index_of()
check_if_substring()
No, there is no faster built-in way.
If you have large amounts of strings to test for, then you may be better off using a third-party Aho-Corasick package, as J.F. Sebastian's answer shows.
Using the built-in methods, the worst-case scenario is: there is no match, which means you have tested every item in the list and nearly every offset in every item.
Fortunately, the in
operator is very quick (at least in CPython) and was faster by nearly a factor of three in my tests:
0.3364804992452264 # substring()
0.867534976452589 # any_substring()
0.8401796016842127 # find_def()
0.9342398950830102 # index_of()
2.7920695478096604 # re implementation
Here is the script I used for testing:
from timeit import timeit
import re
def substring():
for x in do_not_scan:
if x in string:
return True
return False
def any_substring():
return any(x in string for x in do_not_scan)
def find_def():
for x in do_not_scan:
if string.find(x) != -1:
return True
return False
def index_of():
for x in do_not_scan:
try:
string.index(x)
return True
except:
return False
def re_match():
for x in do_not_scan:
if re.search(string, x):
return True
return False
string = 'a'
do_not_scan = ['node_modules','bower_components']
print(timeit('substring()', setup='from __main__ import substring'))
print(timeit('any_substring()', setup='from __main__ import any_substring'))
print(timeit('find_def()', setup='from __main__ import find_def'))
print(timeit('index_of()', setup='from __main__ import index_of'))
print(timeit('re_match()', setup='from __main__ import re_match'))