Tags: r, dataframe, duplicates, r-factor

Remove rows with duplicate combinations of factor levels in two columns


After calling bind_rows() on a number of large data.frames, I end up with a data.frame like this:

tmp <- data.frame(Query=c("A", "B", "C", "D", "A"), target=c("D", "A", "A", "A", "B"), values=runif(5))
tmp
  Query target     values
1     A      D 0.06075322
2     B      A 0.43179750
3     C      A 0.32325309
4     D      A 0.26714620
5     A      B 0.96854999

I need to remove every row whose combination of Query and target has already appeared in an earlier row, in either direction (A x D counts as a duplicate of D x A). In the example, row 4 is a duplicate of row 1 and row 5 is a duplicate of row 2, so the desired output would be:

tmp
  Query target     values
1     A      D 0.06075322
2     B      A 0.43179750
3     C      A 0.32325309

Thank you very much!


Solution

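  • Sort the values of the selected columns within each row, so that A/D and D/A become the same pair, then discard the rows whose sorted pair has already appeared (the values differ from the question because runif() is not seeded):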

    cols = c("Query", "target")
    tmp[!duplicated(t(apply(tmp[cols], 1, sort))), ]
    
    #  Query target    values
    #1     A      D 0.7205899
    #2     B      A 0.5484203
    #3     C      A 0.4503456
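
Since the question mentions large data.frames, a vectorized variant of the same idea may be worth trying, because apply() works row by row. The following is only a sketch under assumptions: the names `first`, `second`, `pair_key` and `result` are made up for illustration, and set.seed() is added just to make the example reproducible. It uses pmin() and pmax() to put each pair into a canonical order; as.character() is there in case the columns are factors.

```r
# Reproducible copy of the example data (set.seed is an addition for illustration)
set.seed(1)
tmp <- data.frame(Query  = c("A", "B", "C", "D", "A"),
                  target = c("D", "A", "A", "A", "B"),
                  values = runif(5))

# Convert to character so the comparison also works if the columns are factors
first  <- as.character(tmp$Query)
second <- as.character(tmp$target)

# Canonical order per row: smaller label first, larger label second
pair_key <- data.frame(lo = pmin(first, second),
                       hi = pmax(first, second))

# Keep only the first occurrence of each unordered pair
result <- tmp[!duplicated(pair_key), ]
result
```

This keeps the same rows as the apply()/sort() approach above, but it only handles exactly two columns; for more than two columns the row-wise sort is the more general choice.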