pythonalgorithmsequence-alignment

choosing one to one result by a similarity matrix


I build a function, that finds some alignment by some metric.

It gets a matrix with already computed similarity values: weighted_res may be:

[[0.2, 0.5, 0.3],
 [0.1, 0.2, 0.4],
 [0.8, 0.2, 0.4],
 [0.1, 0.2, 0.7],
 [0.1, 0.2, 0.4],

My function maximizes the sum of the values for all combinations the indices of exs1 and exs2, but no index can be taken twice. The results are these optimal indices. The sum for (0,1), (2,0), (3,2), accordingly 0.5+0.8+0.7 produces the maximal score.

There are many cases, where finding for each column/row the maximum isn't enough. Let the matrix be:

[[0.1, 0.0, 0.1]
 [0.5, 0.6, 0.4],
 [0.5, 0.8, 0.3],
 [0.0, 0.0, 0.2]]

Here, it chooses (1,1), (2,1), (3,2), because 0.5+0.8+0.2 is the maximal reachable score.

My code is like the following and I fear, it is maximally ineffective. I would be happy about some hint to find a more efficient algorithm, than to compute all the possibilities and sum up and maximize. Here is that code:

def one_to_one(weighted_res, exs1, exs2, mask):

    inner_cube_len = min(len(list(exs1)), len(list(exs2)))
    turned = False

    if (len(exs1) < len(exs2)):
        exs1, exs2 = exs2, exs1
        weighted_res = weighted_res.T
        mask = mask.T
        turned = True

    x_to_choose = np.array(list(itertools.permutations(range(len(exs1)), inner_cube_len)))
    y_to_choose  = np.array(list(range (len(exs2))))

    weighted_res_overall = \
        weighted_res[x_to_choose,y_to_choose].sum(axis=1)

    best_overall_row  = np.argmax(weighted_res_overall)
    best_x_values     = np.array (x_to_choose[best_overall_row] )

    valid_mask        = mask[best_x_values,y_to_choose]
    best_res1         = best_x_values[valid_mask]
    best_res2         = y_to_choose[valid_mask]

    if not valid_mask.any():
        return [],[]
    if turned:
        left_value   = best_res2.tolist()
        right_values = [[x] for x in best_res1.tolist()]
        exs1, exs2 = exs2, exs1
        weighted_res = weighted_res.T
        mask = mask.T
    else:
        right_values =  [[x] for x in best_res2.tolist()]
        left_value   =  best_res1.tolist()
    return left_value, right_values

With input values with lengths of 8 and 6 of the input results, the weighted_res_overall has a size of 20160 and that grows extremly fast.


Solution

  • I found it, it's named Hungarian Algorithm, but with maximizing instead of minimizing the score. https://en.wikipedia.org/wiki/Hungarian_algorithm

    There is a scipy implementation of it: https://docs.scipy.org/doc/scipy-0.18.1/reference/generated/scipy.optimize.linear_sum_assignment.html

    Or https://github.com/src-d/lapjv

    Thanks for thinking about it!