rdeduplication

Deduping Column pairs in R


I have a dataframe containing 7 columns and would like to records that have same info in the first two columns even they are in reverse order.

Here is a snippet of my df

 zip1  zip2       APP       PCR       SCR       APJ       PJR
1 01701 01701 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
2 01701 01702 0.9887567 0.9898379 0.9811615 0.9993856 0.9842659
3 01701 01703 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
4 01701 01704 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
5 01704 01701 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000

I know how to use unique, but the twist here is that I'd like to treat instances where zip1 = a and zip2 = b the same as zip1 = b and zip2 = a. So I'd essentially want only one records for those two instances. So for example I'd only want column 4 and not column 5 Any advice?

Thanks, Ben


Solution

  • First create a new vector which identifies rows with a particular zip pair but doesn't distinguish based upon the ordering:

    zipUp<-paste(pmin(df$zip1,df$zip2),pmax(df$zip1,df$zip2))
    

    Now find duplicates in that vector, and discard them from the original data frame.

    dups<-duplicated(zipUp)
    
    newdf<-df[!dups,]
    

    I am assuming that the first two columns will not contain NA. If they do you will need to adjust the pmin, pmax calls to keep any non NA value for each pair