rdplyrsemi-join

Semi_join to filter columns of X based on multiple Y columns


Starting with these two data frames:

data <- data.frame("Run_ID" = c(1,2,3), "Sample" = c("A", "B", "C"), "Value" = c(1,2,3))
metadata <- data.frame("Run_ID" = c(1,3), "Sample" = c("A","C"))

I would like to subset data so that it only includes the values from the pairs of Run_ID + Sample which also are present in metadata. The output should contain the same columns as data.

Expected Output:

Run_ID Sample Value
1      A     1
3      C     3

Based on the documentation, it seems like semi_join() should be the solution, but I cannot figure out join based on the two variables.

>semi_join(data, metadata, by = c("Run_ID", "Sample"))
[1] Run_ID Sample Value 
<0 rows> (or 0-length row.names)

Any suggestions are greatly appreciated!


Solution

  • Your code is ok but the input metadata is not in a friendly format but I guess this is what you're after:

    semi_join(
      data,
      metadata %>% separate_rows(Sample, sep = ','), 
      by = c('Run_ID', 'Sample')
    )
    #   Run_ID Sample Value
    # 1      1      A     1
    # 2      3      C     3