javaalgorithmstring-algorithm

Find all concatenations of two string in a huge set


Given a set of 50k strings, I need to find all pairs (s, t), such that s, t and s + t are all contained in this set.

What I've tried

, there's an additional constraint: s.length() >= 4 && t.length() >= 4. This makes it possible to group the strings by length 4 prefixes and, separately, suffixes. Then for every string composed of length at least 8, I look up the set of candidates for s using the first four characters of composed and the set of candidates for t using its last four characters. This works, but it needs to look at 30M candidate pairs (s, t) for finding the 7k results.

This surprisingly high number of candidates comes from the fact, that the string are (mostly German) words from a limited vocabulary and the word starts and ends often the same. It's still much better than trying all 2.5G pairs, but much worse than I hoped.

What I need

As the additional constraint may get dropped and the set will grow, I'm looking for a better algorithm.

The "missing" question

There were complaints about me not asking a question. So the missing question mark is at the end of the next sentence. How can this be done more efficiently, ideally without using the constraint?


Solution

  • Algorithm 1: Test pairs, not singles

    One way could be, instead of working from all possible pairs to all possible composite strings containing those pairs, work from all possible composite strings and see if they contain pairs. This changes the problem from n^2 lookups (where n is the number of strings >= 4 characters) to m * n lookups (where m is the average length of all strings >= 8 characters, minus 7, and n is now the number of strings >= 8 characters). Here's one implementation of that:

    int minWordLength = 4;
    int minPairLength = 8;
    
    Set<String> strings = Stream
       .of(
          "a", "abc", "abcdef", "def", "sun", "sunshine", "shine",
          "bear", "hug", "bearhug", "cur", "curlique", "curl",
          "down", "downstream", "stream"
       )
       .filter(s -> s.length() >= minWordLength)
       .collect(ImmutableSet.toImmutableSet());
    
    strings
       .stream()
       .filter(s -> s.length() >= minPairLength)
       .flatMap(s -> IntStream
          .rangeClosed(minWordLength, s.length() - minWordLength)
          .mapToObj(splitIndex -> ImmutableList.of(
             s.substring(0, splitIndex),
             s.substring(splitIndex)
          ))
          .filter(pair ->
              strings.contains(pair.get(0))
              && strings.contains(pair.get(1))
          )
       )
       .map(pair ->
          pair.get(0) + pair.get(1) + " = " + pair.get(0) + " + " + pair.get(1)
       )
       .forEach(System.out::println);
    

    Gives the result:

    downstream = down + stream
    

    This has average algorithmic complexity of m * n as shown above. So in effect, O(n). In the worst case, O(n^2). See hash table for more on the algorithmic complexity.

    Explanation

    1. Put all strings four or more characters long into a hash set (which takes average O(1) complexity for search). I used Guava's ImmutableSet for convenience. Use whatever you like.
    2. filter: Restrict to only the items that are eight or more characters in length, representing our candidates for being a composition of two other words in the list.
    3. flatMap: For each candidate, compute all possible pairs of sub-words, ensuring each is at least 4 characters long. Since there can be more than one result, this is in effect a list of lists, so flatten it into a single-deep list.
      1. rangeClosed: Generate all integers representing the number of characters that will be in the first word of the pair we will check.
      2. mapToObj: Use each integer combined with our candidate string to output a list of two items (in production code you'd probably want something more clear like a two-property value class, or an appropriate existing class).
      3. filter: Restrict to only pairs where both are in the list.
    4. map: Pretty up the results a little.
    5. forEach: Output to the console.

    Algorithm Choice

    This algorithm is tuned to words that are way shorter than the number of items in the list. If the list were very short and the words were very long, then switching back to a composition task instead of a decomposition task would work better. Given that the list is 50,000 strings in size, and German words while long are very unlikely to exceed 50 characters, that is a 1:1000 factor in favor of this algorithm.

    If on the other hand, you had 50 strings that were on average 50,000 characters long, a different algorithm would be far more efficient.

    Algorithm 2: Sort and keep a candidate list

    One algorithm I thought about for a little while was to sort the list, with the knowledge that if a string represents the start of a pair, all candidate strings that could be one of its pairs will be immediately after it in order, among the set of items that start with that string. Sorting my tricky data above, and adding some confounders (downer, downs, downregulate) we get:

    a
    abc
    abcdef
    bear
    bearhug
    cur
    curl
    curlique
    def
    down ---------\
    downs         |
    downer        | not far away now!
    downregulate  |
    downstream ---/
    hug
    shine
    stream
    sun
    sunshine
    

    Thus if a running set of all items to check were kept, we could find candidate composites in essentially constant time per word, then probe directly into a hash table for the remainder word:

    int minWordLength = 4;
    
    Set<String> strings = Stream
       .of(
          "a", "abc", "abcdef", "def", "sun", "sunshine", "shine",
          "bear", "hug", "bearhug", "cur", "curlique", "curl",
          "down", "downs", "downer", "downregulate", "downstream", "stream")
       .filter(s -> s.length() >= minWordLength)
       .collect(ImmutableSet.toImmutableSet());
    
    ImmutableList<String> orderedList = strings
       .stream()
       .sorted()
       .collect(ImmutableList.toImmutableList());
    List<String> candidates = new ArrayList<>();
    List<Map.Entry<String, String>> pairs = new ArrayList<>();
    
    for (String currentString : orderedList) {
       List<String> nextCandidates = new ArrayList<>();
       nextCandidates.add(currentString);
       for (String candidate : candidates) {
          if (currentString.startsWith(candidate)) {
             nextCandidates.add(candidate);
             String remainder = currentString.substring(candidate.length());
             if (remainder.length() >= minWordLength && strings.contains(remainder)) {
                pairs.add(new AbstractMap.SimpleEntry<>(candidate, remainder));
             }
          }
       }
       candidates = nextCandidates;
    }
    pairs.forEach(System.out::println);
    

    Result:

    down=stream
    

    The algorithmic complexity on this one is a little more complicated. The searching part I think is O(n) average, with O(n^2) worst case. The most expensive part might be the sorting—which depends on the algorithm used and the characteristics of the unsorted data. So use this one with a grain of salt, but it has possibility. It seems to me that this is going to be way less expensive than building a Trie out of an enormous data set, because you only probe it once comprehensively and don’t get any amortization of the build cost.

    Also, this time I chose a Map.Entry to hold the pair. It's completely arbitrary how you do it. Making a custom Pair class or using some existing Java class would be fine.