Say I have the string "Torcellite" and another string "Tor" - the similarity length of these two strings is 3 since both of them begin with "Tor". Now another string "christmas" and "mas" would have a similarity of 0 since they do not begin with the same set of characters.
In both cases, the second string is a suffix of the first string.
A clearer example:
Length of string: 1 to 10^5
String: abaabc
Suffixes: abaabc
, baabc
, aabc
, abc
, bc
, c
Similarity: abaabc
, none, a
, ab
, none, none
Similarity Length: 6, 0, 1, 2, 0, 0
Answer: 6+0+1+2+0+0 = 9
I have an inefficient logic to find these partial suffixes matches using regex.
Algorithm:
Make a pattern from the substrings of the suffixes.
for(int i=1; i<substrings[i].length; i++) {
Pattern p = Pattern.compile("^"+substrings[i].substring(0, i));
Matcher m = p.find(string); //the given string for which similarities need to be calculated
if(m.find())
similaryLengths += i;
}
The complexity for this becomes roughly O(n^2) since I need to run through the string for suffixes and then the substrings for patterns.
I've thought of using grouping in the pattern to find the groups but I'm unsure what the regex would look like. What I have in mind is for the very first substring is: ((((((a)b)a)a)b)c)
and then find the longest group match.
Is there a more efficient algorithm that can achieve his?
The best approach, by far, would be to build a suffix tree on the input string. Building suffix trees takes only O(n) time where n is the length of the string. A suffix tree consists logically of a tree in which all the suffixes of the string can be found by walking from the root to each leaf. You can read Wikipedia for more details on how these trees work.
Essentially, a suffix tree will allow you to simply recast your current problem as one of "finding" the original string in the suffix tree. As you walk down the tree, you count the number of suffixes in each subtree, and multiply by your current match length to determine your score. This "search" takes O(n) time too.
So the final result is that you can solve the problem in guaranteed O(n) time and O(n) space, with O(n) preprocessing time. This is pretty efficient! And, there are no "worst cases" that produce quadratic behaviour. With that you can probably handle strings up to 10^7 in length easily.
The only difficulty in implementation will be building the suffix tree, but you can find freely available code for that.