rdataframedplyrrowsum

rowSums() to count number of both non-missing and unique values


Let's say I have this dataframe

> df
   mr_daterd mr_daterd_fu1 mr_daterd_fu2
1 2018-03-05    2018-03-05          <NA>
2 2019-05-04          <NA>    2020-03-05
3 2020-01-03    2020-06-06    2021-04-02

Each row represent a patient and the dates represent MRI scans. I want to count the number of MRI scans per row, i.e. rowSums() of non-missing values. However, some patient had several scans at the same date. Therefore, the rowSums() should only count unique non-missing values.

Eg, using

df_new <- df %>%
  mutate(
n_mri = rowSums(!is.na(select(., contains('mr_daterd'))))
)

Gives

> df_new
   mr_daterd mr_daterd_fu1 mr_daterd_fu2 n_mri
1 2018-03-05    2018-03-05          <NA>     2
2 2019-05-04          <NA>    2020-03-05     2
3 2020-01-03    2020-06-06    2021-04-02     3

The n_mri for row 1 should be 1, and not 2, because 2018-03-05 is duplicated in mr_daterd and mr_daterd_fu1.

Expected output:

> df_new
   mr_daterd mr_daterd_fu1 mr_daterd_fu2 n_mri
1 2018-03-05    2018-03-05          <NA>     1
2 2019-05-04          <NA>    2020-03-05     2
3 2020-01-03    2020-06-06    2021-04-02     3

Data

df <- structure(list(mr_daterd = structure(c(17595, 18020, 18264), class = "Date"), 
    mr_daterd_fu1 = structure(c(17595, NA, 18419), class = "Date"), 
    mr_daterd_fu2 = structure(c(NA, 18326, 18719), class = "Date")), class = "data.frame", row.names = c(NA, 
-3L))

Solution

  • dplyr solution using n_distinct and c_across.

    df %>% 
      rowwise %>% 
      mutate(n_mri = n_distinct(
        c_across(contains('mr_daterd')), 
        na.rm=TRUE)) %>%
      ungroup()
    
    
    # A tibble: 3 × 4
    # Rowwise: 
      mr_daterd  mr_daterd_fu1 mr_daterd_fu2 n_mri
      <date>     <date>        <date>        <int>
    1 2018-03-05 2018-03-05    NA                1
    2 2019-05-04 NA            2020-03-05        2
    3 2020-01-03 2020-06-06    2021-04-02        3