I built a function that finds an alignment by some metric.
It receives a matrix of already computed similarity values:
weighted_res may look like:
[[0.2, 0.5, 0.3],
[0.1, 0.2, 0.4],
[0.8, 0.2, 0.4],
[0.1, 0.2, 0.7],
[0.1, 0.2, 0.4]]
My function maximizes the sum of the values over all combinations of indices of exs1 and exs2, where no index may be used twice. The result is the set of optimal index pairs. Here the pairs (0,1), (2,0), (3,2), with sum 0.5+0.8+0.7, produce the maximal score.
There are many cases where taking the maximum of each column/row is not enough. Let the matrix be:
[[0.1, 0.0, 0.1],
[0.5, 0.6, 0.4],
[0.5, 0.8, 0.3],
[0.0, 0.0, 0.2]]
Here it chooses (1,0), (2,1), (3,2), because 0.5+0.8+0.2 is the maximal reachable score. (Greedily picking the maximum of each row would take column 1 twice.)
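For illustration, a tiny brute-force check over all row permutations (a sketch of exactly the exhaustive search described below) confirms that assignment:

```python
import itertools
import numpy as np

weighted_res = np.array([[0.1, 0.0, 0.1],
                         [0.5, 0.6, 0.4],
                         [0.5, 0.8, 0.3],
                         [0.0, 0.0, 0.2]])

n_rows, n_cols = weighted_res.shape
best_score, best_pairs = -1.0, None
# Try every way of assigning a distinct row to each column.
for rows in itertools.permutations(range(n_rows), n_cols):
    score = weighted_res[list(rows), np.arange(n_cols)].sum()
    if score > best_score:
        best_score, best_pairs = score, list(zip(rows, range(n_cols)))

print(best_pairs, best_score)  # [(1, 0), (2, 1), (3, 2)] with score 0.5+0.8+0.2 = 1.5
```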
My code looks like the following, and I fear it is maximally inefficient. I would be happy about a hint toward a more efficient algorithm than computing all the possibilities, summing up, and maximizing. Here is the code:
import itertools
import numpy as np

def one_to_one(weighted_res, exs1, exs2, mask):
    inner_cube_len = min(len(exs1), len(exs2))
    turned = False
    # Make exs1 the longer axis, so we permute over its indices.
    if len(exs1) < len(exs2):
        exs1, exs2 = exs2, exs1
        weighted_res = weighted_res.T
        mask = mask.T
        turned = True
    # All ways to pick a distinct row for every column.
    x_to_choose = np.array(list(itertools.permutations(range(len(exs1)), inner_cube_len)))
    y_to_choose = np.arange(len(exs2))
    weighted_res_overall = weighted_res[x_to_choose, y_to_choose].sum(axis=1)
    best_overall_row = np.argmax(weighted_res_overall)
    best_x_values = x_to_choose[best_overall_row]
    valid_mask = mask[best_x_values, y_to_choose]
    if not valid_mask.any():
        return [], []
    best_res1 = best_x_values[valid_mask]
    best_res2 = y_to_choose[valid_mask]
    if turned:
        left_value = best_res2.tolist()
        right_values = [[x] for x in best_res1.tolist()]
    else:
        left_value = best_res1.tolist()
        right_values = [[x] for x in best_res2.tolist()]
    return left_value, right_values
With inputs of lengths 8 and 6, weighted_res_overall already has 8!/2! = 20160 entries, and that number grows extremely fast (factorially) with the input sizes.
I found it: it's called the Hungarian algorithm, but maximizing instead of minimizing the score: https://en.wikipedia.org/wiki/Hungarian_algorithm
There is a SciPy implementation of it: https://docs.scipy.org/doc/scipy-0.18.1/reference/generated/scipy.optimize.linear_sum_assignment.html
or, alternatively, https://github.com/src-d/lapjv
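A minimal sketch of how scipy.optimize.linear_sum_assignment could replace the brute force, using the first example matrix. Note that linear_sum_assignment minimizes the total cost, so the matrix is negated to maximize instead (newer SciPy versions also accept a maximize flag):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

weighted_res = np.array([[0.2, 0.5, 0.3],
                         [0.1, 0.2, 0.4],
                         [0.8, 0.2, 0.4],
                         [0.1, 0.2, 0.7],
                         [0.1, 0.2, 0.4]])

# Negate to turn the maximization into the minimization SciPy solves.
# Rectangular matrices are handled: min(n_rows, n_cols) pairs are returned.
row_ind, col_ind = linear_sum_assignment(-weighted_res)
pairs = list(zip(row_ind, col_ind))
score = weighted_res[row_ind, col_ind].sum()
print(pairs, score)  # the optimal pairs (0,1), (2,0), (3,2), score 0.5+0.8+0.7 = 2.0
```

This runs in polynomial time instead of enumerating all permutations.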
Thanks for thinking about it!