I am using Aho-Corasick to perform some string searches on documents. The original code uses a numpy array to store, in an efficient way, the matches of each string from a string list:
import ahocorasick
import numpy as np

def ahocorasickFind(search_list, input):
    A = ahocorasick.Automaton()
    for idx, s in enumerate(search_list):
        A.add_word(s, (idx, s))  # payload: (index in search_list, the string itself)
    A.make_automaton()
    index_list = []
    for item in A.iter(input):
        print(item)
        index_list.append(item[1][0])  # index of the matched string
    output_list = np.array([0] * len(search_list))
    output_list[index_list] = 1  # mark every matched index with 1
    return output_list.tolist()

search_list = ['joão', 'maria', 'thiago']  # thousands of words in the real code
result = ahocorasickFind(search_list, 'asdasdasd joão 1231231 thiago')  # huge text in the real code
for index, element in enumerate(result):
    if element == 1:
        print(search_list[index])
Using the above approach took too much time and memory to iterate over the result and test each element (if element == 1). So, how can I get the "original" strings found in the input text in a performant way?
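For clarity, the output I am ultimately after is just the collection of matched strings. A rough sketch of what I mean (reusing the same pyahocorasick automaton but skipping the numpy indicator array; the name ahocorasickFindWords is only for illustration):

import ahocorasick

def ahocorasickFindWords(search_list, text):
    A = ahocorasick.Automaton()
    for s in set(search_list):   # duplicates in search_list are not needed
        A.add_word(s, s)         # store the string itself as the payload
    A.make_automaton()
    # A.iter() yields (end_index, payload); a set keeps each found word once
    return {payload for _, payload in A.iter(text)}

print(ahocorasickFindWords(['joão', 'maria', 'thiago'],
                           'asdasdasd joão 1231231 thiago'))
# {'joão', 'thiago'}  (set order is arbitrary)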
If you are only interested in matching whole words (i.e. tokens separated by whitespace), rather than searching the full text with the automaton, it might be faster to use a set of words. Note, however, that this uses some additional memory. One straightforward way to replicate your behaviour would be:
words = set(text.split())
for w in search_list:
    if w in words:
        print(w)
or even shorter (although this changes the order of the results and drops duplicates from the search list):
for w in set(search_list).intersection(text.split()):
    print(w)
I've quickly tested it on a relatively large text object (143M characters, 23M words) and a rather short search_list object (606 words, of which 295 unique), and the times I got are:

ahocorasick: 14.5s

However, the first (ahocorasick) version uses a (relatively) negligible amount of additional memory, while the other versions use quite a lot of it (for the data I was using, almost 2GB of additional memory).
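If the extra memory from text.split() is a concern, a possible middle ground (a rough sketch I have not benchmarked; the helper name find_words_lazily is only for illustration) is to iterate over the tokens lazily with re.finditer instead of materializing the whole word list:

import re

def find_words_lazily(search_list, text):
    search_set = set(search_list)
    found = set()
    # re.finditer yields one token at a time, so the full list of
    # ~23M words never has to be held in memory at once.
    for m in re.finditer(r'\S+', text):
        w = m.group()
        if w in search_set:
            found.add(w)
            if len(found) == len(search_set):
                break  # every search term has been seen; stop early
    return found

This trades some speed (a Python-level loop over the tokens) for the roughly 2GB of additional memory mentioned above.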