algorithmstringsubstringprogramming-pearls

Longest Non-Overlapping Substring


I wonder if anyone knows the (optimal?) algorithm for longest recurring non-overlapping sub string.

For example, in the string

ABADZEDGBADEZ

the longest recurring would be "BAD". Incidentally if there is no such result, the algorithm should alert that such a thing has occurred. My guess is that this involves suffix trees.

I'm sure this must exist somewhere already. Thanks for the help!


Solution

  • Since you're applying this to music, you're probably not looking for 100% matches; there will be expected discrepencies from one instance of the theme to another. You might try looking up gene sequence analysis algorithms - there's lots of this variety of stuff going on there. Try BLAST or other alignment algorithms.

    You could also try the WinEPI family of algorithms, as well as various prefix tree implementations of this specific result set (these results allow gaps in the substring; for instance, ABCID and AFBCD both contain ABCD). I actually have a paper on an algorithm that might be worth looking at if you would be interested; I would need to attain distribution authorization, so let me know.

    Note that this is actually a VERY complicated subject (to do efficiently) for large datasets.

    Source: Research for a year or two on a comparable (theme-detection) topic.