rdata-sciencedata-scrubbing

R editing dataframe based on column value


Suppose I have a table of around 1M rows with the following format:

id  paid_2000  paid_2001  paid_2002  paid_2003  censor_yr
1   10         20         10         20         2001
2   15         25         15         15         2003

What would be an efficient way to set each observation to NA where the paid year is greater than or equal to the censor year? In particular, I would like the table to look like this:

id  paid_2000  paid_2001  paid_2002  paid_2003  censor_yr
1   10         NA         NA         NA         2001
2   15         25         15         NA         2003

Solution

  • Using dplyr:

    library(dplyr)
    df %>%
      gather(paid_yr, value, grep("paid", names(.))) %>%
      mutate(value = ifelse(as.numeric(gsub(".*_", "", paid_yr)) >= censor_yr, 
                            NA, value)) %>%
      spread(paid_yr, value)
    

    Throw a %>% select chain at the bottom to move censor_yr back to the end.

    Can explain how it works if you'd like. May or may not be easier to read than akrun's answers.