algorithm, string-algorithm

Algorithm for finding bags of elements in a sequence


Say that I have a sequence of elements of interest A, B, C... interspersed with don't-care symbols x. I want to identify bags of elements from a predefined set of interesting combinations that occur within a predefined distance. The spans of matched symbols can overlap. For instance, in the string C x x A A x x C the algorithm would detect the pattern A A C twice if the maximum distance is 5.

For instance, say that my set of interesting combinations is:

A A C
A B C

that I have a sequence:

B A x x C x x x A A x x x C

and that the maximum span is 5.

My algorithm should output:

B A x x C -> A B C

and will fail to identify the pattern A A C, since the span between its elements of interest is greater than 5.

My intuition says that it is some kind of dynamic programming, but maybe it is just an instance of a well-known algorithm I'm not able to spot.

Any hint on what would be the approach/solution?


Solution

  • Let's assign some names to describe the problem:

    m = length of the array sequence (14 in your example)
    n = number of unique elements of interest in the sequence (3 in example)
    k = length of each search region (5 in example)
    g = number of groups that you're looking for (2 in example)

    One option would be to summarise your data in each search region of size k. In your example, like this:

    {B A x x C}
    {A x x C x}
    ...
    

    We make a vector of size n for each of these sections, where the first entry counts appearances of the first element type (say A), the second counts the second type, and so on:

    B A x x C --> [1,1,1] (one appearance of each)
    A x x C x --> [1,0,1]
    

    and so on.
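
    For concreteness, here is one way those summaries could be built (a minimal sketch only; the function name, the integer encoding of the symbols, and the use of -1 for "x" are my own assumptions, not part of the original description):

    #include <vector>

    // sketch: count vector for the window of k symbols starting at `start`
    // (assumes the sequence has been mapped so element types are 0..n-1 and "x" is -1)
    std::vector<int> summarise(const std::vector<int>& seq, int start, int k, int n)
    {
        std::vector<int> counts(n, 0);
        for (int j = start; j < start + k && j < (int)seq.size(); ++j)
            if (seq[j] >= 0)            // skip don't-care symbols
                ++counts[seq[j]];
        return counts;
    }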

    We can do the same for our groups that we're searching for:

    {A A C} --> [2,0,1]  
    {A B C} --> [1,1,1]
    

    Now the problem becomes obvious. If we consider the summary of a search region, say [3,2,5], and the summary of a group we're searching for, say [0,1,2], we can calculate the number of combinations by recognising that we have 2 options for the second element and (5x4)/(1x2) = 10 options for the third, so that's 20 options overall.

    So for a section summary S = [s1, s2, ..., sn] and a single group of interest G = [g1, g2, ..., gn], we can calculate the total number of ways we can extract G from S (C++-style code, except "!" means factorial):

    int total_options = 1; // total ways to select G from S
    for (int i = 0; i < n; ++i)
    {
        if (g[i] == 0)
            continue; // this element doesn't appear in G, so it shouldn't affect our count

        if (s[i] < g[i])
            return 0; // not enough elements in S for G

        for (int d = 1, f = s[i]; f > max(g[i], s[i] - g[i]); --f, ++d)
            total_options = total_options * f / d; // f and d walk the factorial terms; multiply before dividing so the division stays exact

        // the previous loop is a more efficient version of:
        // total_options *= (s[i]!) / (g[i]! * (s[i] - g[i])!);
    }

    return total_options;
    

    You would do this for each section, and each group that you're searching for.
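
    Putting the pieces together, the outer loop could look something like this (again just a sketch: `summarise` is the helper above, and `count_options` is assumed to be the earlier snippet wrapped in a function that takes the two count vectors):

    #include <iostream>
    #include <vector>

    // sketch: slide a window of size k over the (integer-encoded) sequence and
    // test every group; prints each window/group pair with a non-zero option count
    void report_matches(const std::vector<int>& seq, int k, int n,
                        const std::vector<std::vector<int>>& groups)
    {
        for (int start = 0; start + k <= (int)seq.size(); ++start)
        {
            std::vector<int> s = summarise(seq, start, k, n);
            for (size_t j = 0; j < groups.size(); ++j)
            {
                long long options = count_options(s, groups[j]); // hypothetical wrapper of the snippet above
                if (options > 0)
                    std::cout << "window " << start << " matches group " << j
                              << " in " << options << " ways\n";
            }
        }
    }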

    Time complexity: O(g*m*(k + n)) (we must include k here because of the worst-case factorial calculation)

    Space complexity: O(m + g*n) (each section we could calculate as we go, so there's no need to store multiple sections simultaneously)

    We can then improve on this by realising that consecutive "sections" differ only by the departing "tail" element and the entering "head" element, so we only need to calculate how these two elements change the "option count" as we iterate to the next section. We do this by maintaining the previous "option count" and NF (Number of Fails), the number of element types whose count in the region is lower than the search group requires. The trick is to maintain a positive "option count" that gets added to the grand total only if NF is 0 (see the sketch after the complexity figures below). This gives you constant-time results for each G as you iterate over the main array of size m.

    Time complexity: O(g*m + g*n)
    Space complexity: O(g*n + m)
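
    If it helps, here is a rough sketch of that head/tail update for a single group (the names `binom`, `update`, `product` and `nf` are mine; for brevity this recomputes one binomial coefficient per step rather than updating it multiplicatively, so it is O(min(g[i], s[i]-g[i])) per step instead of strictly O(1), but the bookkeeping is the same):

    #include <algorithm>
    #include <vector>

    // `product` = product of C(s[i], g[i]) over indices with g[i] > 0 and s[i] >= g[i]
    // `nf`      = number of indices with g[i] > 0 but s[i] < g[i]
    // the current window matches in `product` ways when nf == 0, and 0 ways otherwise

    long long binom(int s, int g)              // C(s, g); same idea as the inner loop above
    {
        long long r = 1;
        for (int d = 1, f = s; f > std::max(g, s - g); --f, ++d)
            r = r * f / d;
        return r;
    }

    // called once with delta = +1 for the entering "head" element
    // and once with delta = -1 for the departing "tail" element
    void update(int e, int delta, std::vector<int>& s, const std::vector<int>& g,
                long long& product, int& nf)
    {
        if (g[e] == 0) { s[e] += delta; return; }                    // untracked element type
        if (s[e] >= g[e]) product /= binom(s[e], g[e]); else --nf;   // retire old contribution
        s[e] += delta;
        if (s[e] >= g[e]) product *= binom(s[e], g[e]); else ++nf;   // apply new contribution
    }

    Starting from an empty window (all s[i] = 0, product = 1, nf = number of non-zero entries in the group vector), each shift of the window triggers one update call for the entering element and one for the departing element, and the window's contribution is product whenever nf is 0.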

    This algorithm has worst-case performance when each element in the main array is unique and each of these elements appears in at least one of the search groups (otherwise we can treat all elements that don't appear in any search group as the same, like "x" in your example). So the worst-case complexities simplify to:

    Time complexity: O(g*m)
    Space complexity: O(g*m)

    I can't see how it'd be possible to get a better time complexity, but I'd be interested to know if some clever person can think of a method with lower space complexity.

    If you don't know what I'm talking about when it comes to constant-time iteration by considering only the head and the tail, let me know and I'll explain with an example.