Imagine there is a big array of Strings S. From that array I need to get only those strings which contain a specific substring. For example if my array is
String s [] = {"hello world", "back to hell", "say hello world"};
and my keyword is "hello", then it should return me first and last element.
I tried using KMP and Boyer-Moor algorithms to check each string in array whether it contains a substring or not, but it takes too much time.
Then I learned about Aho-Corasick algorithm. I am still looking it up, but seemingly it needs an array of substrings and one big string to match,while what I want is exactly opposite.
So I was looking for a suggestions on how to modify Aho-Corasick algorithm for my purposes, or another means to achieve those. Would be thankful for any suggestions.
Build a suffix tree using Ukkonen algorithm or the one suggested in this source(PDF):
McCreight’s algorithm can be easily adapted to build the generalized suffix tree for a set S={s1, s2, . . . , s_k} of strings of total length N in O(N) time...
Then use the created suffix tree to search for a given pattern. The problem is to find all occurrences of pattern P (length m) in a suffix tree T. According to the above source:
The pattern matching problem can be solved in optimal O(m+k) time ..., where k is the number of occurrences of P in T
Note that the length of the text (or the number of strings in the array) does not affect search efficiency. Therefore, you could pay for constructing a suffix tree once and then use it many times to search efficiently for short pattern strings.
EDIT: if you are in a hurry and do not mind a little bit of extra time complexity, you can construct suffix arrays instead of suffix trees using this approach(PDF) in just O(n*log^2(n)) with a very small piece of code. Here is the core idea of this approach:
The algorithm is mainly based on maintaining the order of the string’s suffixes sorted by their 2^k long prefixes.
And here is the pseudo-code reproduced from the above source:
n ←length(T)
for i←0 : n – 1
P(0, i)← position of T(i) in the ordered array of T‘s characters
cnt ← 1
for k←1 : [log2n] (ceil)
for i←0 : n – 1
L(i)← (P(k – 1, i), P(k – 1, i + cnt), i)
sort L
compute P(k, i) , i = 0, n - 1
cnt←2 * cnt
After running this code, P
will contain the suffix array. Search using this approach is also straightforward:
Since the suffix array offers the order of T’s suffixes, searching a string P into T is easily done with a binary search. Since comparing is done in O(|P|)