I have a data frame in R where each row contains multiple columns with categorical values. My goal is to rearrange the values within each row so that no value is repeated across columns in the same row. The original data frame may contain missing values represented as empty strings or NA, and I want to keep the same number of values per column after the rearrangement.
df <- data.frame(
t1 = c("A", "B", "C", "D", "A", "B", "C", "D", "A", "B"),
t2 = c("A", "B", "C", "D", "A", "B", "C", "D", "A", "B"),
t3 = c("A", "B", "C", "D", "", "", "", "", "", ""),
t4 = c("A", "B", "C", "D", "", "", "", "", "", "")
)
I want to rearrange each row so there are no duplicate values across columns, it doesn't matter the sequence, for instance, a row could be A, B, C, NA, or A, B, C, D, as far it doesn't have a repeated value, I'm ok. I also need to preserve the number of non-missing values from the original data frame. Here is an example of the desired output:
# Example of expected rearrangement (order may vary):
df_rearranged <- data.frame(
t1 = c("A", "B", "C", "D", "A", "B", "C", "D", "A", "B"),
t2 = c("D", "A", "B", "C", "D", "A", "B", "C", "D", "A"),
t3 = c("B", "", "A", "", "C", "", "D", "", "", ""),
t4 = c("", "", "", "A", "", "C", "", "", "", "D")
)
For context, each column indicates one time many coders (values in time columns) will rate an item (RID). Each new time, I need the coder to rate a different item. The coders will code all items for the first two times (time1 and time2); however, for time3 and time4, coders will code only 25% of the items (in the example above, I used 40% for simplicity, but this % will vary so I need syntax to adjust that automatically). Any help is appreciated; I'm stuck here.
structure(list(RID = c(2L, 9L, 14L, 24L, 44L, 64L, 95L, 116L,
165L, 169L, 170L, 171L, 172L, 177L, 192L, 215L, 217L, 226L, 246L,
247L, 288L, 292L, 300L, 306L, 313L, 316L, 339L, 344L, 352L, 355L,
375L, 378L, 384L, 421L, 476L, 488L, 493L, 495L, 498L, 503L, 532L,
553L, 581L, 588L, 604L, 605L, 608L, 639L, 640L, 642L, 664L, 669L,
702L, 742L, 744L, 746L, 749L, 756L, 767L, 820L, 822L, 824L, 825L,
826L, 842L, 843L, 856L, 865L, 895L, 901L, 916L, 920L, 921L, 929L,
930L, 934L, 936L, 939L, 952L, 958L), time1 = c("MV", "MV", "AF",
"RP", "MV", "AC", "RP", "MV", "FL", "MV", "AF", "AF", "AF", "RP",
"MV", "AF", "RP", "RP", "MV", "AC", "AC", "FL", "MV", "AF", "FL",
"AC", "AF", "RP", "FL", "AF", "AC", "AL", "FL", "AL", "FL", "AF",
"RP", "AC", "RP", "RP", "FL", "AL", "FL", "FL", "RP", "MV", "MV",
"AC", "MV", "AL", "AL", "RP", "AC", "AF", "AC", "MV", "AL", "AL",
"RP", "AL", "FL", "MV", "RP", "AL", "AL", "AC", "RP", "FL", "AC",
"AL", "MV", "AC", "AF", "AF", "AL", "AL", "FL", "AC", "FL", "AF"
), time2 = c("RP", NA, NA, NA, NA, NA, "AL", "RP", NA, NA, NA,
"FL", NA, NA, NA, NA, NA, NA, NA, "MV", NA, NA, NA, "FL", "AL",
"MV", NA, "AL", NA, NA, NA, NA, NA, NA, "RP", NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, "AF", NA, NA, NA, "AF", "AC", NA, NA,
"AC", NA, NA, NA, NA, NA, NA, NA, NA, "AC", NA, NA, NA, NA, "MV",
NA, "MV", NA, "AF", NA, NA, NA, NA, NA, "RP", NA, "FL"), time3 = c("MV",
"MV", "AF", "RP", "MV", "AC", "RP", "MV", "FL", "MV", "AF", "AF",
"AF", "RP", "MV", "AF", "RP", "RP", "MV", "AC", "AC", "FL", "MV",
"AF", "FL", "AC", "AF", "RP", "FL", "AF", "AC", "AL", "FL", "AL",
"FL", "AF", "RP", "AC", "RP", "RP", "FL", "AL", "FL", "FL", "RP",
"MV", "MV", "AC", "MV", "AL", "AL", "RP", "AC", "AF", "AC", "MV",
"AL", "AL", "RP", "AL", "FL", "MV", "RP", "AL", "AL", "AC", "RP",
"FL", "AC", "AL", "MV", "AC", "AF", "AF", "AL", "AL", "FL", "AC",
"FL", "AF"), time4 = c("RP", NA, NA, NA, NA, NA, "AL", "RP",
NA, NA, NA, "FL", NA, NA, NA, NA, NA, NA, NA, "MV", NA, NA, NA,
"FL", "AL", "MV", NA, "AL", NA, NA, NA, NA, NA, NA, "RP", NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, "AF", NA, NA, NA, "AF", "AC",
NA, NA, "AC", NA, NA, NA, NA, NA, NA, NA, NA, "AC", NA, NA, NA,
NA, "MV", NA, "MV", NA, "AF", NA, NA, NA, NA, NA, "RP", NA, "FL"
)), row.names = c(NA, -80L), class = c("tbl_df", "tbl", "data.frame"
))
structure(list(RID = c(2, 9, 14, 24, 44, 64, 95, 116, 165, 169,
170, 171, 172, 177, 192, 215, 217, 226, 246, 247, 288, 292, 300,
306, 313, 316, 339, 344, 352, 355, 375, 378, 384, 421, 476, 488,
493, 495, 498, 503, 532, 553, 581, 588, 604, 605, 608, 639, 640,
642, 664, 669, 702, 742, 744, 746, 749, 756, 767, 820, 822, 824,
825, 826, 842, 843, 856, 865, 895, 901, 916, 920, 921, 929, 930,
934, 936, 939, 952, 958), time1 = c("MV", "MV", "AF", "RP", "MV",
"AC", "RP", "MV", "FL", "MV", "AF", "AF", "AF", "RP", "MV", "AF",
"RP", "RP", "MV", "AC", "AC", "FL", "MV", "AF", "FL", "AC", "AF",
"RP", "FL", "AF", "AC", "AL", "FL", "AL", "FL", "AF", "RP", "AC",
"RP", "RP", "FL", "AL", "FL", "FL", "RP", "MV", "MV", "AC", "MV",
"AL", "AL", "RP", "AC", "AF", "AC", "MV", "AL", "AL", "RP", "AL",
"FL", "MV", "RP", "AL", "AL", "AC", "RP", "FL", "AC", "AL", "MV",
"AC", "AF", "AF", "AL", "AL", "FL", "AC", "FL", "AF"), time2 = c("RP",
NA, NA, NA, NA, NA, "AL", "RP", NA, NA, NA, "FL", NA, NA, NA,
NA, NA, NA, NA, "MV", NA, NA, NA, NA, "AL", "MV", NA, "AL", NA,
"FL", NA, NA, NA, NA, "RP", NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, "AF", NA, NA, NA, "AF", "AC", NA, NA, "AC", NA, NA, NA, NA,
NA, NA, NA, NA, "AC", NA, NA, NA, NA, "MV", NA, "MV", NA, "AF",
NA, NA, NA, NA, NA, "RP", NA, "FL"), time3 = c("FL", "AF", "MV",
"MV", "AF", "RP", "MV", "AC", "RP", "AL", "FL", "MV", "FL", "AF",
"AF", "RP", "MV", "AF", "RP", "RP", "MV", "AC", "AC", "FL", "MV",
"AF", "FL", "AC", "AF", "RP", "FL", "AF", "AC", "MV", "AF", "AL",
"FL", "AF", "FL", "AC", "RP", "RP", "AL", "AL", "FL", "FL", "RP",
"MV", "RP", "AC", "MV", "AL", "AL", "RP", "FL", "AF", "AC", "MV",
"AL", "MV", "RP", "AL", "FL", "MV", "RP", "AL", "AL", "AC", "RP",
"FL", "AC", "AL", "MV", "AC", "AF", "AF", "AL", "AL", "AC", "AC"
), time4 = c(NA, NA, NA, NA, "AL", NA, NA, NA, NA, NA, "RP",
NA, NA, NA, "FL", NA, NA, "MV", NA, NA, NA, NA, "FL", NA, NA,
"AL", "MV", NA, "AL", NA, NA, NA, NA, "RP", NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, "AF", NA, NA, "AF", NA, NA, NA, "AC", NA,
NA, NA, "AC", NA, NA, NA, NA, NA, "AC", NA, NA, NA, NA, "MV",
NA, "MV", NA, "AF", NA, NA, NA, NA, NA, "RP", "FL", "RP", NA)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -80L))
This seems like a Multidimensional assignment problem.
In your case, you would like to arrange the values in columns t2, t3, and t4 such that there are no duplicates across rows. The Hungarian Algorithm Solver (HungarianSolver
) from the RcppHungarian
package seems to be a good choice here. The function "solves weighted bipartite matching problems (e.g., optimal matching of people to cars or optimal matching of students to colleges, etc...)"
Use of the function is straightforward. It needs just one argument, a cost matrix. I think for this situation, we simply need 0s and 1s where a 0 indicates no cost, i.e. any pairing is allowed, and a 1 indicates some non-zero cost where that particular pairing is not desired. For this, I use outer
with ==
as the FUN (excluding NA). The result is a cost matrix which can be used as input to the solver, giving a vector of pairings, the second column being the desired indices that minimizes the cost.
library(RcppHungarian)
Since you provided a toy dataset and a real one, I'll wrap my code in a function so that I can call it twice. The only argument is the data.
fn <- function(data) {
# Helper function for the `outer` function.
equal <- function(x, y) (x==y) & !is.na(x) & !is.na(y)
# Extract the four columns
t1 <- data[,1, drop=TRUE]
t2 <- data[,2, drop=TRUE]
t3 <- data[,3, drop=TRUE]
t4 <- data[,4, drop=TRUE]
# Create the cost matrix for t1 and t2
cost2 <- outer(t1, t2, FUN=equal)
# Solve the problem for t2 and assign the result
res2 <- HungarianSolver(cost2)
t2a <- t2[res2$pairs[,2]]
# Repeat for t3 and t4 (aggregating the costs)
cost3 <- outer(t1, t3, equal) + outer(t2a, t3, equal)
res3 <- HungarianSolver(cost3)
t3a <- t3[res3$pairs[,2]]
cost4 <- outer(t1, t4, equal) + outer(t2a, t4, equal) + outer(t3a, t4, equal)
res4 <- HungarianSolver(cost4)
t4a <- t4[res4$pairs[,2]]
return(list(data=data.frame(t1, t2=t2a, t3=t3a, t4=t4a),
cost=c(res2$cost, res3$cost, res4$cost)))
}
Call the above function for the toy data set
fn(df)
$data
t1 t2 t3 t4
1 A B C D
2 B A D C
3 C D A B
4 D C B A
5 A B <NA> <NA>
6 B A <NA> <NA>
7 C D <NA> <NA>
8 D C <NA> <NA>
9 A B <NA> <NA>
10 B A <NA> <NA>
$cost
[1] 0 0 0
We see that the rows are not duplicated and the cost is 0. Now we try on the real data.
DF_arranged <- fn(DF[,-1])
head(DF_arranged$data, 10)
t1 t2 t3 t4
1 MV AF RP FL
2 MV RP <NA> <NA>
3 AF MV <NA> RP
4 RP MV FL <NA>
5 MV AC <NA> <NA>
6 AC MV <NA> <NA>
7 RP MV AL <NA>
8 MV RP <NA> <NA>
9 FL MV RP <NA>
10 MV FL <NA> RP
...
The first ten rows look good (no duplicates across rows). Further inspection verifies the remaining rows.
sum(apply(DF_arranged, 1, FUN=\(x) sum(duplicated(x, incomparables=NA))))
# [1] 0
DF_arranged$cost
# [1] 0 0 0
There are no duplicates and the cost is also 0.
Data:
df
is the toy data (provided by OP).
DF
is the real data (also provided by OP except that time2
and time3
were interchanged since it was stated that all rows in time1
and time2
should have a value)