rconcatenationmatchanti-join

how to find what is in data frame and not another in R


I have two data frames: codes and supply. Codes (shown below) is comprised of four fields: state,codetype,code,codetitle supply has 12 columns but three of them are state,codetype, and code

An example of this is below

    state     codetype    code    codetitle
      32          15     123456     Something
      32          15     123455     Something Else
      32          10     123455     Something Else

From there, I use the following code to concatenate the item

    supply1<- supply%>%mutate(supply1= paste0(state,codetype,code))
    codes1<- codes%>%mutate(codes1= paste0(state,codetype,code))

My question is how do I find out what combinations of state,codetype,code are in supply1 but not codes1. I would use excel and the match function to do this but there are 1.9 million rows and that exceeds the capacity of Excel.

Have looked at documentation about antijoin. However, being that there is no common field such as ID, getting a bit confused.


Solution

  • tidyverse

    library(dplyr)
    anti_join(supply, codes, by = c("state", "codetype", "code"))
    #   state codetype   code   codetitle
    # 1    34       15 123459 Something_4
    

    base R

    codes$code_rn <- seq_len(nrow(codes))
    supply$supply_rn <- seq_len(nrow(supply))
    temp <- merge(codes, supply, by = c("state", "codetype", "code"))
    temp
    #   state codetype   code    codetitle.x code_rn codetitle.y supply_rn
    # 1    32       15 123455 Something_Else       2 Something_3         2
    # 2    32       15 123456      Something       1 Something_2         1
    supply[ !supply$supply_rn %in% temp$supply_rn, ]
    #   state codetype   code   codetitle supply_rn
    # 3    34       15 123459 Something_4         3
    

    (and some column clean-up)

    alternative base R

    This is effectively what you were starting with:

    supply_id <- with(supply, paste(state, codetype, code, sep = "|"))
    supply_id
    # [1] "32 15 123456" "32 15 123455" "34 15 123459"
    codes_id <- with(codes, paste(state, codetype, code, sep = "|"))
    codes_in
    # [1] "32|15|123456" "32|15|123455" "32|10|123455"
    supply[!supply_id %in% codes_id,]
    #   state codetype   code   codetitle supply_rn
    # 3    34       15 123459 Something_4         3
    

    data

    codes <- read.table(header = TRUE, text="
        state     codetype    code    codetitle
          32          15     123456     Something
          32          15     123455     Something_Else
          32          10     123455     Something_Else")
    supply <- read.table(header = TRUE, text="
        state     codetype    code    codetitle
          32          15     123456     Something_2
          32          15     123455     Something_3
          34          15     123459     Something_4")