I have a (potentially large) list data of 3-tuples of small non-negative integers, like
data = [
(1, 0, 5),
(2, 4, 2),
(3, 2, 1),
(4, 3, 4),
(3, 3, 1),
(1, 2, 2),
(4, 0, 3),
(0, 3, 5),
(1, 5, 1),
(1, 5, 2),
]
I want to order the tuples within data so that neighboring tuples (data[i] and data[i+1]) are "as similar as possible".
Define the dissimilarity of two 3-tuples as the number of elements which are unequal between them. E.g.

(0, 1, 2) vs. (0, 1, 2): Dissimilarity 0.
(0, 1, 2) vs. (0, 1, 3): Dissimilarity 1.
(0, 1, 2) vs. (0, 2, 1): Dissimilarity 2.
(0, 1, 2) vs. (3, 4, 5): Dissimilarity 3.
(0, 1, 2) vs. (2, 0, 1): Dissimilarity 3.
Question: What is a good algorithm for finding the ordering of data which minimizes the sum of dissimilarities between all neighboring 3-tuples?
Here's a function which computes the dissimilarity between two 3-tuples:
def dissimilar(t1, t2):
    return sum(int(a != b) for a, b in zip(t1, t2))
Here's a function which computes the summed total dissimilarity of data, i.e. the number which I seek to minimize:

def score(data):
    return sum(dissimilar(t1, t2) for t1, t2 in zip(data, data[1:]))
The problem can be solved by simply running score() over every permutation of data:
import itertools

n_min = 3*len(data)  # some large number
for perm in itertools.permutations(data):
    n = score(perm)
    if n < n_min:
        n_min = n
        data_sorted = list(perm)

print(data_sorted, n_min)
Though the above works, it's very slow as we explicitly check each and every permutation (resulting in O(N!) complexity). On my machine the above takes about 20 seconds when data has 10 elements.
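Equivalently (and just as slow), the same exhaustive search can be written more compactly by letting min() do the bookkeeping; this is only a restatement of the loop above, not a faster method:

import itertools

# still O(N!): examines every permutation and keeps one with the lowest score()
data_sorted = list(min(itertools.permutations(data), key=score))
print(data_sorted, score(data_sorted))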
For completeness, here's the result of running the above given the example data:
data_sorted = [
(1, 0, 5),
(4, 0, 3),
(4, 3, 4),
(0, 3, 5),
(3, 3, 1),
(3, 2, 1),
(1, 5, 1),
(1, 5, 2),
(1, 2, 2),
(2, 4, 2),
]
with n_min = 15. Note that several other orderings (10 in total) with a score of 15 exist. For my purposes these are all equivalent and I just want one of them.

In practice the size of data may be as large as say 10000.
The sought-after algorithm should beat O(N!), i.e. probably be polynomial in time (and space).
If no such algorithm exists, I would be interested in "near-solutions", i.e. a fast algorithm which gives an ordering of data with a small but not necessarily minimal total score. One such algorithm would be lexicographic sorting, i.e.

sorted(data)  # score 18

though I hope to be able to do better than this.
I have tried all of the heuristic solutions below that were given as code (I have not tried e.g. Google OR-tools). For large len(data), I find that the solution of Andrej Kesely is both quick and gives the best results.

The idea behind this method is quite simple. The sorted list of data elements (3-tuples) is built up one by one. Given some data element, the next element is chosen to be the most similar one out of the remaining (not yet sorted) data.

Essentially this solves a localized version of the problem where we only "look one ahead", rather than optimizing globally over the entire data set. We can imagine a hierarchy of algorithms looking n ahead, each successively delivering better (or at least as good) results, but at the cost of being much more expensive. The solution of Andrej Kesely sits lowest in this hierarchy. The algorithm at the highest spot, looking len(data) ahead, solves the problem exactly.
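For illustration only (this is a sketch of the idea, not one of the posted answers), the next rung in that hierarchy, "looking 2 ahead", could look something like the following. It reuses dissimilar() from above and scores each candidate together with the best element that could follow it, at roughly O(n³) cost:

def order_look2(data):
    # greedy ordering that also considers the best possible successor
    remaining = list(data)
    out = [remaining.pop(0)]
    while remaining:
        best_idx, best_cost = 0, float("inf")
        for i, cand in enumerate(remaining):
            cost = dissimilar(out[-1], cand)
            rest = remaining[:i] + remaining[i + 1:]
            if rest:
                cost += min(dissimilar(cand, nxt) for nxt in rest)
            if cost < best_cost:
                best_idx, best_cost = i, cost
        out.append(remaining.pop(best_idx))
    return out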
Let's settle for "looking 1 ahead", i.e. the answer by Andrej Kesely. This leaves room for a) the choice of initial element, and b) what to do when several elements are equally good candidates (same dissimilarity) for use as the next one. By choosing the first element in data as the initial element and the first occurrence of an element with minimal dissimilarity, both a) and b) are determined by the original order of elements within data. As Andrej Kesely points out, it then helps to (lex)sort data in advance.
In the end I went with this solution, but refined in a few ways:

- Rather than a single lex sort of data, try several: lex sort for columns (0, 1, 2), (2, 0, 1), (1, 2, 0), all in ascending as well as descending order.
- For large len(data), the algorithm becomes too slow for me. I suspect it scales like O(n²). I thus process chunks of the data of size n_max independently, with the final result being the different sorted chunks concatenated (see the sketch after this list). Transitioning from one chunk to the next we expect a dissimilarity of 3, but this is unimportant as long as n_max is kept large. I go with n_max = 1000.
- As an implementation note, the performance can be improved by not using data.pop(idx), as this itself is O(n). Instead, either leave the original data as is and use another data structure to keep track of which elements/indices have been used, or replace data[idx] with some marker value upon use.
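To make these refinements concrete, here is a rough sketch (not my exact code) of the chunked greedy pass. It assumes the dissimilar() and score() functions from above, processes chunks of size n_max, and marks used elements with None rather than popping them, so no O(n) list removals occur:

def greedy_chunked(data, n_max=1000):
    out = []
    for start in range(0, len(data), n_max):
        chunk = list(data[start:start + n_max])
        current = chunk[0]
        chunk[0] = None  # mark as used instead of popping (pop is O(n))
        out.append(current)
        for _ in range(len(chunk) - 1):
            # index of the most similar not-yet-used element in this chunk
            best = min(
                (i for i, t in enumerate(chunk) if t is not None),
                key=lambda i: dissimilar(current, chunk[i]),
            )
            current = chunk[best]
            chunk[best] = None
            out.append(current)
    return out

The different initial lex sorts can then be tried by running this on differently sorted copies of data and keeping the ordering with the lowest score, e.g.

orders = [sorted(data), sorted(data, reverse=True),
          sorted(data, key=lambda t: (t[2], t[0], t[1]))]
best = min((greedy_chunked(d) for d in orders), key=score)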
This isn't an exact algorithm, just a heuristic, but it should be better than naive sorting:
# you can sort the data first for a lower total average score:
# data = sorted(data)

out = [data.pop(0)]
while data:
    idx, t = min(enumerate(data), key=lambda k: dissimilar(out[-1], k[1]))
    out.append(data.pop(idx))

print(score(out))
Testing (100 repeats with len(data)=1000):
import random
from functools import lru_cache

def get_data(n=1000):
    f = lambda n: random.randint(0, n)
    return [(f(n // 30), f(n // 20), f(n // 10)) for _ in range(n)]

@lru_cache(maxsize=None)
def dissimilar(t1, t2):
    a, b, c = t1
    x, y, z = t2
    return (a != x) + (b != y) + (c != z)

def score(data):
    return sum(dissimilar(t1, t2) for t1, t2 in zip(data, data[1:]))

def lexsort(data):
    return sorted(data)

def heuristic(data, sort_data=False):
    data = sorted(data) if sort_data else data[:]
    out = [data.pop(0)]
    while data:
        idx, t = min(enumerate(data), key=lambda k: dissimilar(out[-1], k[1]))
        out.append(data.pop(idx))
    return out

N, total, total_lexsort, total_heuristic, total_heuristic2 = 100, 0, 0, 0, 0

for i in range(N):
    data = get_data()
    r0 = score(data)
    r1 = score(lexsort(data))
    r2 = score(heuristic(data))
    r3 = score(heuristic(data, True))
    print("original data", r0)
    print("lexsort", r1)
    print("heuristic", r2)
    print("heuristic with sorted", r3)
    total += r0
    total_lexsort += r1
    total_heuristic += r2
    total_heuristic2 += r3

print("total original data score", total)
print("total score lexsort", total_lexsort)
print("total score heuristic", total_heuristic)
print("total score heuristic(with sorted)", total_heuristic2)
Prints:
...
total original data score 293682
total score lexsort 178240
total score heuristic 162722
total score heuristic(with sorted) 160384