algorithmcombinationsoeis

Finding certain arrangements of all 2-combinatons for a given list


Given a list L of an even number (2k) of elements, I'm looking for an algorithm to produce a list of 2k-1 sublists with the following properties:

  1. each sublist includes exactly k 2-combinations (pairs where the order does not matter) of elements from L,
  2. each sublist includes every elements from L exactly once, and
  3. the union of all elements from all sublists is exactly the set of all possible 2-combinations of the elements from L.

For example, if the input list is L = [a, b, c, d], we have k = 2 with 3 sublists, each including 2 pairs. A possible solution would look like [[ab, cd], [ac, bd], [ad, bc]]. If we ignore the ordering for all elements in the lists (think of all lists as sets), it turns out that this is also the only solution for k = 2.

My aim now is not only to find a single solution but all possible solutions. As the number of involved combinations grows pretty quickly, it would be nice to have all results be constructed in a clever way instead of generating a huge list of candidates and removing the elements from it that don't satisfy the given properties. Such a naïve algorithm could look like the following:

  1. Find the set C of all 2-combinations for L.
  2. Find the set D of all k-combinations for C.
  3. Choose all sets from D that union equals L, call the new set D'.
  4. Find the set E of all (2k-1)-combinations for D'.
  5. Choose all sets from E that union is the set C, and let the new set be the final output.

This algorithm is easy to implement but it's incredibly slow for bigger input lists. So is there a way to construct the result list more efficently?


Edit: Here is the result for L = [a,b,c,d,e,f] with k = 3, calculated by the above algorithm:

[[[ab,cd,ef],[ac,be,df],[ad,bf,ce],[ae,bd,cf],[af,bc,de]],
 [[ab,cd,ef],[ac,bf,de],[ad,be,cf],[ae,bc,df],[af,bd,ce]],
 [[ab,ce,df],[ac,bd,ef],[ad,be,cf],[ae,bf,cd],[af,bc,de]],
 [[ab,ce,df],[ac,bf,de],[ad,bc,ef],[ae,bd,cf],[af,be,cd]],
 [[ab,cf,de],[ac,bd,ef],[ad,bf,ce],[ae,bc,df],[af,be,cd]],
 [[ab,cf,de],[ac,be,df],[ad,bc,ef],[ae,bf,cd],[af,bd,ce]]]

All properties are satisfied:

  1. each sublist has k = 3 2-combinations,
  2. each sublist only includes each element once, and
  3. the union of all 2k-1 = 5 sublists for one solution is exactly the set of all possible 2-combinations for L.

Edit 2: Based on user58697's answer, I improved the calculation algorithm by using the round-robin tournament scheduling:

  1. Let S be the result set, starting with an empty set, and P be the set of all permutations of L.
  2. Repeat the following until P is empty:
    • Select an arbitrary permutation from P
    • Perform full RRT scheduling for this permutation. In each round, the arrangement of elements from L forms a permutation of L. Remove all these 2k permutations from P.
    • Add the resulting schedule to S.
  3. Remove all lists from S if the union of their sublists has duplicate elements (i.e. doesn't add up to all 2-combinations of L).

This algorithm is much more performant than the first one. I was able to calculate the number of results for k = 4 as 960 and k = 5 as 67200. The fact that there doesn't seem to be an OEIS result for this sequence makes me wonder if the numbers are actually correct, though, i.e. if the algorithm is producing the complete solution set.


Solution

  • This was an interesting question. In the process of answering it (basically after writing the program included below, and looking up the sequence on OEIS), I learned that the problem has a name and rich theory: what you want is to generate all 1-factorizations of the complete graph K2k.


    Let's first restate the problem in that language:

    You are given a number k, and a list (set) L of size 2k. We can view L as the vertex set of a complete graph K2k.

    A 1-factor (aka perfect matching) is a partition of L into unordered pairs (sets of size 2). That is, it is a set of k pairs, whose disjoint union is L.

    Let S (called C in the question) denote the set of all pairs of elements of L. (In terms of the complete graph, if L is its vertex set, S is its edge set.) Note that S contains (2k choose 2) = k(2k-1) pairs. So for k = 0, 1, 2, 3, 4, 5, 6…, S has size 0, 1, 6, 15, 28, 45, 66….

    A 1-factorization is a partition of S into sets, each of which is itself a 1-factor (perfect matching). Note that as each of these matchings has k pairs, and S has size k(2k-1), the partition has size 2k-1 (i.e., is made of 2k-1 matchings).

    In other words, every element of S (every pair) occurs in exactly one element of the 1-factorization, and every element of L occurs exactly once in each element of the 1-factorization.

    The problem asks to generate all 1-factorizations.


    Let M denote the set of all 1-factors (all perfect matchings) of L. It is easy to prove that M contains (2k)!/(k!2^k) = 1×3×5×…×(2k-1) matchings. For k = 0, 1, 2, 3, 4, 5, 6…, the size of M is 1, 1, 3, 15, 105, 945, 10395….

    M is easy to generate:

    def perfect_matchings(l):
        if len(l) == 0:
            yield []
        for i in range(1, len(l)):
            first_pair = l[0] + l[i]
            for matching in perfect_matchings(l[1:i] + l[i+1:]):
                yield [first_pair] + matching
    

    For example, calling perfect_matchings('abcdef') yields the 15 elements ['ab', 'cd', 'ef'], ['ab', 'ce', 'df'], ['ab', 'cf', 'de'], ['ac', 'bd', 'ef'], ['ac', 'be', 'df'], ['ac', 'bf', 'de'], ['ad', 'bc', 'ef'], ['ad', 'be', 'cf'], ['ad', 'bf', 'ce'], ['ae', 'bc', 'df'], ['ae', 'bd', 'cf'], ['ae', 'bf', 'cd'], ['af', 'bc', 'de'], ['af', 'bd', 'ce'], ['af', 'be', 'cd'] as expected.

    By definition, a 1-factorization is a partition of S into elements from M. Or equivalently, any (2k-1) disjoint elements of M form a 1-factorization. This lends itself to a straightforward backtracking algorithm:

    In code:

    matching_list = []
    pair_used = defaultdict(lambda: False)
    known_matchings = []  # Populate this list using perfect_matchings()
    def extend_matching_list(r, need):
        """Finds ways of extending the matching list by `need`, using matchings r onwards."""
        if need == 0:
            use_result(matching_list)
            return
        for i in range(r, len(known_matchings)):
            matching = known_matchings[i]
            conflict = any(pair_used[pair] for pair in matching)
            if conflict:
                continue  # Can't use this matching. Some of its pairs have already appeared.
            # Else, use this matching in the current matching list.
            for pair in matching:
                pair_used[pair] = True
            matching_list.append(matching)
            extend_matching_list(i + 1, need - 1)
            matching_list.pop()
            for pair in matching:
                pair_used[pair] = False
    

    If you call it with extend_matching_list(0, len(l) - 1) (after populating known_matchings), it generates all 1-factorizations. I've put the full program that does this here. With k=4 (specifically, the list 'abcdefgh'), it outputs 6240 1-factorizations; the full output is here.


    It was at this point that I fed the sequence 1, 6, 6240 into OEIS, and discovered OEIS A000438, sequence 1, 1, 6, 6240, 1225566720, 252282619805368320,…. It shows that for k=6, the number of solutions ≈2.5×1017 means that we can give up hope of generating all solutions. Even for k=5, the ≈1 billion solutions (recall that we're trying to find 2k-1=9 disjoint sets out of the |M|=945 matchings) will require some carefully optimized programs.

    The first optimization (which, embarrassingly, I only realized later by looking closely at trace output for k=4) is that (under natural lexicographic numbering) the index of the first matching chosen in the partition cannot be greater than the number of matchings for k-1. This is because the lexicographically first element of S (like "ab") occurs only in those matchings, and if we start later than this one we'll never find it again in any other matching.

    The second optimization comes from the fact that the bottleneck of a backtracking program is usually the testing for whether a current candidate is admissible. We need to test disjointness efficiently: whether a given matching (in our partial factorization) is disjoint with the union of all previous matchings. (Whether any of its k pairs is one of the pairs already covered by earlier matchings.) For k=5, it turns out that the size of S, which is (2k choose 2) = 45, is less than 64, so we can compactly represent a matching (which is after all a subset of S) in a 64-bit integer type: if we number the pairs as 0 to 44, then any matching can be represented by an integer having 1s in the positions corresponding to elements it contains. Then testing for disjointness is a simple bitwise operation on integers: we just check whether the bitwise-AND of the current candidate matching and the cumulative union (bitwise-OR) of previous matchings in our partial factorization is zero.

    A C++ program that does this is here, and just the backtracking part (specialized for k=5) does not need any C++ features so it's extracted out as a C program here. It runs in about 4–5 hours on my laptop, and finds all 1225566720 1-factorizations.

    Another way to look at this problem is to say that two elements of M have an edge between them if they intersect (have a pair (element of S) in common), and that we're looking for all maximum independent set in M. Again, the simplest way to solve that problem would still probably be backtracking (we'd write the same program).

    Our programs can be made quite a lot more efficient by exploiting the symmetry in our problem: for example we could pick any matching as our first 1-factor in the 1-factorization (and then generate the rest by relabelling, being careful not to avoid duplicates). This is how the number of 1-factorizations for K12 (the current record) was calculated.


    A note on the wisdom of generating all solutions

    In The Art of Computer Programming Volume 4A, at the end of section 7.2.1.2 Generating All Permutations, Knuth has this important piece of advice:

    Think twice before you permute. We have seen several attractive algorithms for permutation generation in this section, but many algorithms are known by which permutations that are optimum for particular purposes can be found without running through all possibilities. For example, […] the best way to arrange records on a sequential storage […] takes only O(n log n) steps. […] the assignment problem, which asks how to permute the columns of a square matrix so that the sum of the diagonal elements is maximized […] can be solved in at most O(n3) operations, so it would be foolish to use a method of order n! unless n is extremely small. Even in cases like the traveling salesrep problem, when no efficient algorithm is known, we can usually find a much better approach than to examine every possible solution. Permutation generation is best used when there is good reason to look at each permutation individually.

    This is what seems to have happened here (from the comments below the question):

    I wanted to calculate all solutions to run different attribute metrics on these and find an optional match […]. As the number of results seems to grow quicker than expected, this is impractical.

    Generally, if you're trying to "generate all solutions" and you don't have a very good reason for looking at each one (and one almost never does), there are many other approaches that are preferable, ranging from directly trying to solve an optimization problem, to generating random solutions and looking at them, or generating solutions from some subset (which is what you seem to have done).


    Further reading

    Following up references from OEIS led to a rich history and theory.

    Some other stuff: