r dplyr

How to remove rows based on values in two columns


I have a data frame with 35 columns and about 250,000 rows. Depending on the values in the year column and the network_id column I want to remove some rows. The specification of which to remove is given in this list:

remove.nets <- list(r19=c(14, 31),
                    r21=c(31),
                    r23=c(32),
                    r24=c(1, 4, 8, 24, 30, 59))

So if the year is 2019 and the network ID is either 14 or 31, remove the row, and similarly for the other years. I tried something like this:

test.data2 <- test.data %>%
     {if (year==2019) filter(., !network_id %in% remove.nets$r19)}

This seemed like the obvious way to do it, but it didn't work. It threw an error that I don't understand:

Error in year == 2019 : 
  comparison (==) is possible only for atomic and list types

I had to make a data frame out of the remove.nets list and do an anti_join like this:

remove.nets <- data.frame(year=c(2019, 2019, 2021, 2023, rep(2024, 6)),
                          network_id=c(14, 31, 31, 32, 1, 4, 8, 24, 30, 59))
test.data2 <- test.data %>%
     anti_join(remove.nets, by=c("year", "network_id"))

This works, but it's aesthetically unpleasing. Can anyone help me make it simpler and prettier?


Solution

  • There's nothing aesthetically unpleasing about anti_join. To get the data frame from the list, just do:

    # Turn names like "r19" into numeric years so the type matches test.data$year
    remove.nets.df <- data.frame(year=rep(as.numeric(sub('r', '20', names(remove.nets))),
                                          lengths(remove.nets)),
                                 network_id=unlist(remove.nets))
    

    And then:

    library(dplyr)
    
    anti_join(test.data, remove.nets.df, by=c("year", "network_id"))
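
As for why the original attempt failed: inside `{ }` the bare name `year` is not looked up as a column of the piped data frame. It presumably resolved to a function instead (for example lubridate's `year()`, if that package was loaded), which would explain the complaint from `==`. Below is a minimal sketch on made-up data that compares the `anti_join` approach with a `filter()` loop closer to what the question tried. The toy `test.data`, its `value` column, and the names `kept.join` and `kept.filter` are assumptions for illustration, and `year` is assumed to be numeric.

```r
library(dplyr)

# Hypothetical toy data standing in for the real 250,000-row data frame
test.data <- data.frame(
  year       = c(2019, 2019, 2019, 2021, 2023, 2024, 2024),
  network_id = c(14,   31,   5,    31,   7,    59,   2),
  value      = 1:7
)

remove.nets <- list(r19 = c(14, 31),
                    r21 = c(31),
                    r23 = c(32),
                    r24 = c(1, 4, 8, 24, 30, 59))

# Build the lookup table; year is converted to numeric so that its type
# matches test.data$year (assumed numeric here)
remove.nets.df <- data.frame(
  year       = rep(as.numeric(sub("r", "20", names(remove.nets))),
                   lengths(remove.nets)),
  network_id = unlist(remove.nets, use.names = FALSE)
)

# Option 1: anti_join drops every row that has a match in the lookup table
kept.join <- anti_join(test.data, remove.nets.df, by = c("year", "network_id"))

# Option 2: the same rule written with filter(), one list entry at a time.
# Inside filter(), `year` and `network_id` refer to columns of the data.
kept.filter <- test.data
for (nm in names(remove.nets)) {
  yr <- as.numeric(sub("r", "20", nm))
  kept.filter <- filter(kept.filter,
                        !(year == yr & network_id %in% remove.nets[[nm]]))
}
```

Both options should keep the same rows. The join is a single pass over the data, whereas the loop filters once per list entry, so the join is likely the better fit for a large data frame.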