c++algorithmtreesuffix-tree

Generalised suffix tree traversal to find longest common substring


I'm working with suffix trees. As far as I can tell, I have Ukkonen's algorithm running correctly to build a generalised suffix tree from an arbitrary number of strings. I'm now trying to implement a find_longest_common_substring() method to do exactly that. For this to work, I understand that I need to find the deepest shared edge (with depth in terms of characters, rather than edges) between all strings in the tree, and I've been struggling for a few days to get the traversal right.

Right now I have the following in C++. I'll spare you all my code, but for context, I'm keeping the edges of each node in an unordered_map called outgoing_edges, and each edge has a vector of ints recorded_strings containing integers identifying the added strings. The child field of an edge is the node it is going to, and l and r identify its left and rightmost indices, respectively. Finally, current_string_number is the current number of strings in the tree.

SuffixTree::Edge * SuffixTree::find_deepest_shared_edge(SuffixTree::Node * start, int current_length, int &longest) {
    Edge * deepest_shared_edge = new Edge;
    auto it = start->outgoing_edges.begin();
    while (it != start->outgoing_edges.end()) {
        if (it->second->recorded_strings.size() == current_string_number + 1) {
            int edge_length = it->second->r - it->second->l + 1;
            int path_length = current_length + edge_length;
            find_deepest_shared_edge(it->second->child, path_length, longest);
            if (path_length > longest) {
                longest = path_length;
                deepest_shared_edge = it->second;
            }
        }
        it++;
    }
    return deepest_shared_edge;
}

When trying to debug, as best I can tell, the traversal runs mostly fine, and correctly records the path length and sets longest. However, for reasons I don't quite understand, in the innermost conditional, deepest_shared_edge sometimes seems to get updated to a mistaken edge. I suspect I maybe don't quite understand how it->second is updated throughout the recursion. Yet I'm not quite sure how to go about fixing this.

I'm aware of this similar question, but the approach seems sufficiently different that I'm not quite sure how it applies here.

I'm mainly during this for fun and learning, so I don't necessarily need working code to replace the above - pseudocode or just any explanation of where I'm confused would be just as well.


Solution

  • Your handling of deepest_shared_edge is wrong. First, the allocation you do at the start of the function is a memory leak, since you never free the memory. Secondly, the result of the recursive call is ignored, so whatever deepest edge it finds is lost (although you update the depth, you don't keep track of the deepest edge).

    To fix this, you should either pass deepest_shared_edge as a reference parameter (like you do for longest), or you can initialize it to nullptr, then check the return from your recursive call for nullptr and update it appropriately.