stringsubstringtext-comparison

How to find all Common substrings which is 3 chars or longer


Are there an efficient algorithm to search and dump all common substrings (which length is 3 or longer) between 2 strings?

Example input:

Length   : 0    5    10   15   20   25   30

String 1 : ABC-DEF-GHI-JKL-ABC-ABC-STU-MWX-Y
String 2 : ABC-JKL-MNO-ABC-DEF-PQR-DEF-ZWX-Y

Example output:

In string 1           2
---------------------------
ABC-DEF   0           12
ABC-DE    0           12
BC-DEF    1           13
:
-ABC-     15,19       11
-JKL-     11          3
-DEF-     3           15
-JKL      11          3
JKL-      12          4
-DEF      3           15,23
DEF-      4           16
WX-Y      29          29
ABC-      0,16,20     0,12
-ABC      15,19       11
DEF-      4           16,24
DEF       4           16,24
ABC       0,16,20     0,12
JKL       12          4
WX-       29          29
X-Y       30          30
-AB       15,19       11
BC-       1,17,21     1,13
-DE       3           15,23
EF-       5           17,25
-JK       11          3
KL-       13          5
:

In the example, "-D", "-M" is also a common substring but is not required, because it's length is only 2. (There might be some missing outputs in example because there are so many of them...)


Solution

  • You can find all common substrings using a data structure called a Generalized suffix tree

    Libstree contains some example code for finding the longest common substring. That example code can be modified to obtain all common substrings.