rdplyrrowsum

rowSums over different columns for each row


I would like to perform a rowSums() over specified columns. My trouble is that the desired columns to sum over are different for each row and given by another variable. The motivation is to check whether the three values before a given column are NA.

An example data frame can be built by

toy <- data.frame(id = c(rep(1, 10), rep(2, 10), rep(3, 10)),
                  day = c(1:10, 1:10, 1:10),
                  response = c(1, NA, NA, NA, 1, 1, 1, NA, NA, 1,
                          1, 1, NA, 1 ,1 ,1 ,1 ,1 ,1, 1,
                          NA, 1, NA, 1, 1, 1, NA, NA, NA, NA),
                  colnum = c(rep(5, 10), rep(7, 10), rep(10, 10)))
toy <- toy %>% pivot_wider(names_from = 2, values_from = 3, names_prefix = "day")
toy

In this example, the colnum variable gives the day for which to check the previous three days. Looking at the data,

toy
# A tibble: 3 x 12
     id colnum  day1  day2  day3  day4  day5  day6  day7  day8  day9 day10
  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     1      5     1    NA    NA    NA     1     1     1    NA    NA     1
2     2      7     1     1    NA     1     1     1     1     1     1     1
3     3     10    NA     1    NA     1     1     1    NA    NA    NA    NA

I would want to create a new column in the data frame which would have values of FALSE, TRUE, FALSE for ids 1, 2, and 3, respectively. Let's call it three_miss. E.g., for id=1, we want to check if days 2 through 4 are all NA since colnum=5. I tried a simple implementation of rowSums()

 toy <- toy %>% mutate(three_miss = rowSums(select(., (colnum-1):(colnum+1)), na.rm = TRUE) == 3)

Using colnum-1:colnum+1 was deduced by which column numbers we would need. The code will run but gives FALSE for all values of three_miss, which is not the correct or desired output. I receive warning messages:

There were 2 warnings in `mutate()`.
The first warning was:
i In argument: `three_miss = ==...`.
Caused by warning in `x:y`:
! numerical expression has 3 elements: only the first used

I get the sense from this that either the select() is causing the issue, or rowSums() is producing more than one sum per row. I have tried adding group_by(id) which does not help. I want to stay within the context of piping, but I am open to other solutions if this is not possible.

p.s. I understand there may be issues if colnum is less than 4 or greater than 10. I was planning to amend this by using case_when(). E.g.,

toy <- toy %>% mutate(three_miss =
  case_when(
    colnum <=3 ~ NA,
    colnum > 10 ~ NA,
    TRUE ~ code_that_works
  )
)

Solution

  • If toy0 is the data before pivoting, you can do

    toy0 %>%
      filter(between(day, colnum-3, colnum-1)) %>%
      summarize(three_wise = all(!is.na(response)), .by = id) %>%
      left_join(toy, ., by = "id")
    # # A tibble: 3 × 13
    #      id colnum  day1  day2  day3  day4  day5  day6  day7  day8  day9 day10 three_wise
    #   <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>     
    # 1     1      5     1    NA    NA    NA     1     1     1    NA    NA     1 FALSE     
    # 2     2      7     1     1    NA     1     1     1     1     1     1     1 TRUE      
    # 3     3     10    NA     1    NA     1     1     1    NA    NA    NA    NA FALSE