algorithm sorting language-agnostic array-merge

Efficient algorithm to create ordered union of lists with unknown true ordering

I have a bunch of lists I need to combine. Each of these lists has an ordering, and each ordering is consistent with the ordering of an original list. (By consistent, I mean that each list can be reproduced by dropping items from the original list, without rearranging any items.)

My problem is that I don't have the original list, and I have to reproduce it as closely as possible using the partial lists I do have.

For example, consider the following ordered lists:

a = ["first", "fourth", "fifth", "sixth", "eighth", "sophomore", "junior"]
b = ["second", "third", "fourth", "sixth", "seventh", "eighth", "freshman", "sophomore", "senior"]
c = ["first", "second", "freshman", "sophomore", "junior", "senior"]
...
partial_lists = [a, b, c, ...]

From these three lists, it's possible to recover the original list. In some cases, however, that may not be possible. Either way, I want to create a list merged_list such that merged_list preserves the order of each of the partial lists. (That is, any specified partial list could theoretically be reconstructed from merged_list using only merged_list.remove() operations.) It's safe to assume that each partial list contains no duplicates, and merged_list must not contain any duplicates either.

For this example, merged_list would be ["first", "second", "third", "fourth", "fifth", "sixth", "seventh", "eighth", "freshman", "sophomore", "junior", "senior"]

Is there an efficient algorithm that can handle this for an arbitrary number of partial lists?

Solution

This is about topological sorting, given some dependencies.

There are several ways to find a topological sorted original list. It will help to build a graph data structure that has all the dependencies keyed by nodes. Then yield the nodes that don't have any predecessor, and remove them from the graph. This will give a new set of nodes that don't have any predecessor, ...etc.

Here is an implementation in Python:

from collections import defaultdict

def topological_sort(lists):
    # create graph (pred + succ)
    succ = defaultdict(set)
    pred = defaultdict(set)
    nodes = set()
    for lst in lists:
        nodes.update(lst)
        for a, b in zip(lst, lst[1:]):
            succ[a].add(b)
            pred[b].add(a)
        
    # get nodes that don't have a predecessor
    first = { node for node in nodes if not pred[node] }

    while first:
        collect = list(first)
        first.clear()
        yield from collect  # yield the nodes that don't have a predecessor (anymore)
        for a in collect:
            for b in succ[a]:
                pred[b].remove(a) # those can't be a predecessor anymore
                if not pred[b]: # this may make another node to be without predecessor
                    first.add(b)
    
        
# demo run:
a = ["first", "fourth", "fifth", "sixth", "eighth", "sophomore", "junior"]
b = ["second", "third", "fourth", "sixth", "seventh", "eighth", "freshman", "sophomore", "senior"]
c = ["first", "second", "freshman", "sophomore", "junior", "senior"]
partial_lists = [a, b, c]
result = list(topological_sort(partial_lists))

For this example run, the result will come out as:

['first', 'second', 'third', 'fourth', 'fifth', 'sixth', 'seventh', 'eighth', 'freshman', 'sophomore', 'junior', 'senior']