I have a data frame that contains all the possible combinations between the elements of two vectors and for each combination I have a corresponding score. I was trying to find an efficient way to find the subset of unique pairs with unique elements (i.e., an element from one vector can be found only once across all pairs) that maximizes the sum of scores corresponding to each combination.
As example data, consider this df
:
df = data.frame(Var1 = c("A", "B", "C"), Var2 = c("A", "C", "D"))
df = expand.grid(df$Var1, df$Var2)
df$score = c(1, 0.5, 2, 1, 0.5, 0.5, 1, 2, 1)
> df
Var1 Var2 score
1 A A 1.0
2 B A 0.5
3 C A 2.0
4 A C 1.0
5 B C 0.5
6 C C 0.5
7 A D 1.0
8 B D 2.0
9 C D 1.0
>
The expected result would be:
A C 1
B D 2
C A 2
Note that you can have overlap between the elements of the two vectors, but again each element from each vector should appear only once. Also, the pair A A 1
is allowed and would've been possible, but that would make it impossible then to generate the pair C A 2
which would increase the overall sum of the score
.
As an attempt I have used this one liner with the functionality from dplyr
df <- df %>% group_by(Var1) %>% slice(which.max(score)) %>% as.data.frame()
which produces:
> df
Var1 Var2 score
1 A A 1
2 B D 2
3 C A 2
which is close enough.. but the A
from the second vector is repeated. Do you have any suggestions? Thank you in advance!
Well, I eventually found the solution based on the Hungarian algorithm implemented in the solve_LSAP
function of the clue
R package. To have it working, transform your df
in a matrix like so:
df = matrix(sapply(df$score, function(x) x), nrow=length(unique(df$Var1)), ncol=length(unique(df$Var2)), dimnames = list(unique(df$Var1), unique(df$Var2)))
and apply the function
df.res = solve_LSAP(df, maximum = T)
> df.res
Optimal assignment:
1 => 2, 2 => 3, 3 => 1
and then get back the actual nodes or names
df.res = cbind(rownames(df), colnames(df)[df.res])
> df.res
[,1] [,2]
[1,] "A" "C"
[2,] "B" "D"
[3,] "C" "A"
>
Tadaaaaam!