python-3.xfind-occurrencesaho-corasick

number of occurrences of list of words in a string with O(n)


I have already seen this answer to a similar question: https://stackoverflow.com/a/44311921/5881884

Where the ahocorasick algorithm is used to show if each word in a list exists in a string or not with O(n). But I want to get the frequency of each word in a list in a string.

For example if

my_string = "some text yes text text some"
my_list = ["some", "text", "yes", "not"]

I would want the result:

[2, 3, 1, 0]

I did not find an exact example for this in the documentation, any idea how to accomplish this?

Other O(n) solutions than using ahocorasick would also be appreciated.


Solution

  • Implementation:

    Here's an Aho-Corasick frequency counter:

    import ahocorasick
    
    def ac_frequency(needles, haystack):
        frequencies = [0] * len(needles)
        # Make a searcher
        searcher = ahocorasick.Automaton()
        for i, needle in enumerate(needles):
            searcher.add_word(needle, i)
        searcher.make_automaton()
        # Add up all frequencies
        for _, i in searcher.iter(haystack):
            frequencies[i] += 1
        return frequencies
    

    (For your example, you'd call ac_frequency(my_list, my_string) to get the list of counts)

    For medium-to-large inputs this will be substantially faster than other methods.

    Notes:

    For real data, this method will potentially yield different results than the other solutions posted, because Aho-Corasick looks for all occurrences of the target words, including substrings.

    If you want to find full-words only, you can call searcher.add_word with space/punctuation-padded versions of the original string:

        ...
        padding_start = [" ", "\n", "\t"]
        padding_end = [" ", ".", ";", ",", "-", "–", "—", "?", "!", "\n"]
        for i, needle in enumerate(needles):
            for s, e in [(s,e) for s in padding_start for e in padding_end]:
                searcher.add_word(s + needle + e, i)
        searcher.make_automaton()
        # Add up all frequencies
        for _, i in searcher.iter(" " + haystack + " "):
        ...