algorithmsortingarray-merge

Efficient algorithm to create ordered union of lists with unknown true ordering


I have a bunch of lists I need to combine. Each of these lists has an ordering, and each ordering is consistent with the ordering of an original list. (By consistent, I mean that each list can be reproduced by dropping items from the original list, without rearranging any items.)

My problem is that I don't have the original list, and I have to reproduce it as closely as possible using the partial lists I do have.

For example, consider the following ordered lists:

a = ["first", "fourth", "fifth", "sixth", "eighth", "sophomore", "junior"]
b = ["second", "third", "fourth", "sixth", "seventh", "eighth", "freshman", "sophomore", "senior"]
c = ["first", "second", "freshman", "sophomore", "junior", "senior"]
...
partial_lists = [a, b, c, ...]

From these three lists, it's possible to recover the original list. In some cases, however, that may not be possible. Either way, I want to create a list merged_list such that merged_list preserves the order of each of the partial lists. (That is, any specified partial list could theoretically be reconstructed from merged_list using only merged_list.remove() operations.) It's safe to assume that each partial list contains no duplicates, and merged_list must not contain any duplicates either.

For this example, merged_list would be ["first", "second", "third", "fourth", "fifth", "sixth", "seventh", "eighth", "freshman", "sophomore", "junior", "senior"]

Is there an efficient algorithm that can handle this for an arbitrary number of partial lists?


Solution

  • This is about topological sorting, given some dependencies.

    There are several ways to find a topological sorted original list. It will help to build a graph data structure that has all the dependencies keyed by nodes. Then yield the nodes that don't have any predecessor, and remove them from the graph. This will give a new set of nodes that don't have any predecessor, ...etc.

    Here is an implementation in Python:

    from collections import defaultdict
    
    def topological_sort(lists):
        # create graph (pred + succ)
        succ = defaultdict(set)
        pred = defaultdict(set)
        nodes = set()
        for lst in lists:
            nodes.update(lst)
            for a, b in zip(lst, lst[1:]):
                succ[a].add(b)
                pred[b].add(a)
            
        # get nodes that don't have a predecessor
        first = { node for node in nodes if not pred[node] }
    
        while first:
            collect = list(first)
            first.clear()
            yield from collect  # yield the nodes that don't have a predecessor (anymore)
            for a in collect:
                for b in succ[a]:
                    pred[b].remove(a) # those can't be a predecessor anymore
                    if not pred[b]: # this may make another node to be without predecessor
                        first.add(b)
        
            
    # demo run:
    a = ["first", "fourth", "fifth", "sixth", "eighth", "sophomore", "junior"]
    b = ["second", "third", "fourth", "sixth", "seventh", "eighth", "freshman", "sophomore", "senior"]
    c = ["first", "second", "freshman", "sophomore", "junior", "senior"]
    partial_lists = [a, b, c]
    result = list(topological_sort(partial_lists))
    

    For this example run, the result will come out as:

    ['first', 'second', 'third', 'fourth', 'fifth', 'sixth', 'seventh', 'eighth', 'freshman', 'sophomore', 'junior', 'senior']