arraysstringalgorithmaho-corasick

Is there an efficient algorithm to extract from an array of string only those which contain a specific substring?


Imagine there is a big array of Strings S. From that array I need to get only those strings which contain a specific substring. For example if my array is String s [] = {"hello world", "back to hell", "say hello world"}; and my keyword is "hello", then it should return me first and last element. I tried using KMP and Boyer-Moor algorithms to check each string in array whether it contains a substring or not, but it takes too much time. Then I learned about Aho-Corasick algorithm. I am still looking it up, but seemingly it needs an array of substrings and one big string to match,while what I want is exactly opposite. So I was looking for a suggestions on how to modify Aho-Corasick algorithm for my purposes, or another means to achieve those. Would be thankful for any suggestions.


Solution

  • Build a suffix tree using Ukkonen algorithm or the one suggested in this source(PDF):

    McCreight’s algorithm can be easily adapted to build the generalized suffix tree for a set S={s1, s2, . . . , s_k} of strings of total length N in O(N) time...

    Then use the created suffix tree to search for a given pattern. The problem is to find all occurrences of pattern P (length m) in a suffix tree T. According to the above source:

    The pattern matching problem can be solved in optimal O(m+k) time ..., where k is the number of occurrences of P in T

    Note that the length of the text (or the number of strings in the array) does not affect search efficiency. Therefore, you could pay for constructing a suffix tree once and then use it many times to search efficiently for short pattern strings.

    EDIT: if you are in a hurry and do not mind a little bit of extra time complexity, you can construct suffix arrays instead of suffix trees using this approach(PDF) in just O(n*log^2(n)) with a very small piece of code. Here is the core idea of this approach:

    The algorithm is mainly based on maintaining the order of the string’s suffixes sorted by their 2^k long prefixes.

    And here is the pseudo-code reproduced from the above source:

    n ←length(T) 
      for i←0 : n – 1
        P(0, i)← position of T(i) in the ordered array of T‘s characters 
    cnt ← 1 
    for k←1 : [log2n] (ceil)
      for i←0 : n – 1
        L(i)← (P(k – 1, i), P(k – 1, i + cnt), i)             
        sort L
        compute P(k, i) , i = 0, n - 1 
        cnt←2 * cnt
    

    After running this code, P will contain the suffix array. Search using this approach is also straightforward:

    Since the suffix array offers the order of T’s suffixes, searching a string P into T is easily done with a binary search. Since comparing is done in O(|P|)