javaalgorithmsequencesubsequence

Find the supersequence of given ordered subsequences in Java


I came across this problem in an unrelated program I'm writing and I've spent quite a few hours trying to solve it because I though it would be fun. It was but I was unable to do it all the way. My code only solves the sequence of some subsets. This problem also feels like a general math problem that has probably been solved in a wide variety of ways over the decades but I lack the mathematical skills and terminology to find a solution or indeed anything about this specific problem online.

I have a set of subsequences that I know to be a part of a larger, unknown (super?)sequence. I don't think these subsequences are sets in the mathematical sense because they are ordered but they are similar in that they do not contain duplicate elements. Same goes for the master/super/whateversequence. (For clarity I will be referring to this as supersequence.)

The subsequences all contain the same type of data, however the data is not ordered alphabetically, in an ascending order, or anything like that. In a sense the data is in an arbitrary order: in that of the supersequence. And that is what I am interested in. I want to find the unknown supersequence of these subsequences.

For the sake of simplicity I've tried to solve this problem using letters of the alphabet, but I can later refactor the code to suit my needs. Obviously because I'm still trying to solve this problem I started by coming up with a suitable word for supersequence that does not contain duplicate elements: FLOWCHARTS.

Then I came up with the following six subsequences:

F,W,C,R
L,H,A
L,O,H,A,R,S
C,S
R,T,S
F,O,W,H,A,S

Here's my sequence ordering method:

// LinkedHashMappedKeyValueList keeps the data in the order it was inserted and allows one key to have multiple values.
private static LinkedHashSet<Character> orderSequence(final Set<Character> unorderedSequence, final LinkedHashMappedKeyValueList ruleMap)
{
    List<Character> orderedSequence = new ArrayList<Character>(unorderedSequence);

    // Order the sequence according to the rules.
    System.out.println("---- ORDERING SEQUENCE ----");

    for (Map.Entry<Character, LinkedHashSet<Character>> rule : ruleMap.entrySet())
    {
        char currentChar = rule.getKey();
        LinkedHashSet<Character> ruleChars = rule.getValue();

        System.out.println("Processing rule " + currentChar + "<" + ruleChars.toString());

        if (orderedSequence.contains(currentChar))
        {
            int ruleCharIndex = -1;
            int smallestRuleCharIndex = Integer.MAX_VALUE;
            Iterator<Character> it = ruleChars.iterator();

            // Find the rule character with the smallest index.
            while (it.hasNext())
            {
                char ruleChar = it.next();
                ruleCharIndex = orderedSequence.indexOf(ruleChar);
                System.out.println("\tChecking for rule character: " + ruleChar + " (" + ruleCharIndex + ")");

                if (ruleCharIndex > -1 && smallestRuleCharIndex > ruleCharIndex)
                    smallestRuleCharIndex = ruleCharIndex;
            }

            if (smallestRuleCharIndex != Integer.MAX_VALUE)
                System.out.println("\tMoving '" + currentChar + "' before '"
                        + orderedSequence.get(smallestRuleCharIndex) + "'.");
            else
                System.out.println("\tMoving '" + currentChar + "' to the end of the sequence.");

            if (!moveBefore(orderedSequence.indexOf(currentChar), smallestRuleCharIndex, orderedSequence))
                System.out.println("\tAlready in correct position.");
            else
                System.out.println("\tCurrent sequence: " + listToString(orderedSequence));
        }
        else
            throw new ArithmeticException("Element of a subsequence not a part of the sequence.");
    }

    return new LinkedHashSet<Character>(orderedSequence);
}

In the end my code found the supersequence F,L,O,W,H,C,A,R,T,S for these subsequences which is pretty close but not perfect. I also need to run my ordering method multiple times so the "algorithm" I came up with isn't perfect either. The "rule map" thing is a hash map where the key is another hash map of Character objects that came after the key Character in the subsequences (and thus in the supersequence).

Is there a Java library of some sort I could use that does this kind of sequence finding? Can someone point me in the right direction in terms of telling me what this is called and/or helping me find the right algorithms for the job?

Additionally, a shortened console output of my program:

---- BUILDING RULE MAP ----
Subsequences:   F,W,C,R
        L,H,A
        L,O,H,A,R,S
        C,S
        R,T,S
        F,O,W,H,A,S

All subsequences processed. Number of ordering rules: 10
Rule map: (F<[W, O]),(W<[C, H]),(C<[R, S]),(R<[, S, T]),(L<[H, O]),(H<[A]),(A<[, R, S]),(O<[H, W]),(S<[]),(T<[S])

---- BUILDING UNORDERED SEQUENCE ----
Sequence size is 10.
Unordered sequence: F,W,C,R,L,H,A,O,S,T

---- ORDERING SEQUENCE ----
Processing rule F<[W, O]
    Moving 'F' before 'W'.
    Already in correct position.
Processing rule W<[C, H]
    Moving 'W' before 'C'.
    Already in correct position.
Processing rule C<[R, S]
    Moving 'C' before 'R'.
    Already in correct position.
Processing rule R<[, S, T]
    Moving 'R' before 'S'.
    Current sequence: F,W,C,L,H,A,O,R,S,T
Processing rule L<[H, O]
    Moving 'L' before 'H'.
    Already in correct position.
Processing rule H<[A]
    Moving 'H' before 'A'.
    Already in correct position.
Processing rule A<[, R, S]
    Moving 'A' before 'R'.
    Current sequence: F,W,C,L,H,O,A,R,S,T
Processing rule O<[H, W]
    Moving 'O' before 'W'.
    Current sequence: F,O,W,C,L,H,A,R,S,T
Processing rule S<[]
    Moving 'S' to the end of the sequence.
    Current sequence: F,O,W,C,L,H,A,R,T,S
Processing rule T<[S]
    Moving 'T' before 'S'.
    Already in correct position.
Previous sequence:  F,W,C,R,L,H,A,O,S,T
Ordered sequence:   F,O,W,C,L,H,A,R,T,S
Sequences match:    false

---- ORDERING SEQUENCE ----
Processing rule F<[W, O]
    Moving 'F' before 'O'.
    Already in correct position.
Processing rule W<[C, H]
    Moving 'W' before 'C'.
    Already in correct position.
Processing rule C<[R, S]
    Moving 'C' before 'R'.
    Current sequence: F,O,W,L,H,A,C,R,T,S
Processing rule R<[, S, T]
    Moving 'R' before 'T'.
    Already in correct position.
Processing rule L<[H, O]
    Moving 'L' before 'O'.
    Current sequence: F,L,O,W,H,A,C,R,T,S
Processing rule H<[A]
    Moving 'H' before 'A'.
    Already in correct position.
Processing rule A<[, R, S]
    Moving 'A' before 'R'.
    Current sequence: F,L,O,W,H,C,A,R,T,S
Processing rule O<[H, W]
    Moving 'O' before 'W'.
    Already in correct position.
Processing rule S<[]
    Moving 'S' to the end of the sequence.
    Already in correct position.
Processing rule T<[S]
    Moving 'T' before 'S'.
    Already in correct position.
Previous sequence:  F,O,W,C,L,H,A,R,T,S
Ordered sequence:   F,L,O,W,H,C,A,R,T,S
Sequences match:    false

---- ORDERING SEQUENCE ----
Processing rule F<[W, O]
    Moving 'F' before 'O'.
    Current sequence: L,F,O,W,H,C,A,R,T,S
Processing rule W<[C, H]
    Moving 'W' before 'H'.
    Already in correct position.
Processing rule C<[R, S]
    Moving 'C' before 'R'.
    Current sequence: L,F,O,W,H,A,C,R,T,S
Processing rule R<[, S, T]
    Moving 'R' before 'T'.
    Already in correct position.
Processing rule L<[H, O]
    Moving 'L' before 'O'.
    Current sequence: F,L,O,W,H,A,C,R,T,S
Processing rule H<[A]
    Moving 'H' before 'A'.
    Already in correct position.
Processing rule A<[, R, S]
    Moving 'A' before 'R'.
    Current sequence: F,L,O,W,H,C,A,R,T,S
Processing rule O<[H, W]
    Moving 'O' before 'W'.
    Already in correct position.
Processing rule S<[]
    Moving 'S' to the end of the sequence.
    Already in correct position.
Processing rule T<[S]
    Moving 'T' before 'S'.
    Already in correct position.
Previous sequence:  F,L,O,W,H,C,A,R,T,S
Ordered sequence:   F,L,O,W,H,C,A,R,T,S
Sequences match:    true
Sequence ordered according to the limits of the rule map.
Sequence found after 2 tries.

Expected sequence:  F,L,O,W,C,H,A,R,T,S FLOWCHARTS
Found sequence:     F,L,O,W,H,C,A,R,T,S FLOWHCARTS
Sequences match:    false

Solution

  • The problem I am describing in my question is the shortest common supersequence problem. I did not search this online because I was too focused on mathematical sets while working on the code, and only once I had started writing out my question did I realize that I was working with sequences. In fact I thought I just made up the word "supersequence" on the spot. Turns out that is precisely the word I needed to search online to find several pages of material related to this problem.

    Oliver Charlesworth's comment is absolutely correct. The given subsequences lack information to make the correct supersequence so in my example, there is no way of finding the actual supersequence. Coffee's comment about bioinformatics most likely refers to shortest common superstring problem which has applications in reconstruction of DNA sequences, something discussed in this question. The shortest common supersequence and superstring problems are quite similar at first look, however, in the superstring problem the substrings must be composed of adjacent elements of the superstring unlike with the supersequence problem. (Illustrated below.)

    Difference between the shortest common supersequence and superstring problems.

    Both problems are also NP-complete or "NP-hard". The way I understand it, is that these problems do not have an optimal solution or some magical algorithm that you can just copy-paste into your code. There are only approximations and "good enough" solutions.

    Surprisingly enough I was not able to find full Java code that approximates this problem, but I did come across a few sites with code in other languages. I've listed below some useful resources I found while researching this problem. I have also included resources related to the shortest common superstring problem as it is related.

    Additional resources for further study