rdplyr

Count occurrences of multiple values across multiple columns


I am having issues with making a special count variable in r. I have seen that I can use something like this to just do a general count:

 D <- D %>% 
    group_by(X1) %>% 
    mutate(S1 = n())

I would like to have four different variables (S1, S2, S3 and S4) using dplyr. Each variable must count occurrences by scanning the columns (X1-X5) and count an occurrence as ONLY 0 or 1. For example, I would like S1 to be a variable that tells me that R has scanned X1 to X5 and counts if the value of '1' occurred at all, and computes 1 for S1 if an occurrence happened in the X1 to X5 columns, and computes 0 if the occurrence did NOT happen.

Here is an example of my data:

  structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), X1 = c(1, 
NA, NA, NA, NA, NA, 1, 1, 1, 1), X2 = c(2, 1, 1, 1, 1, NA, 2, 
NA, NA, 2), X3 = c(NA, NA, NA, NA, NA, 1, NA, NA, NA, NA), X4 = c(4, 
2, NA, NA, NA, NA, 3, NA, 2, 3), X5 = c(5, NA, 2, 2, 2, NA, NA, 
NA, NA, NA)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-10L))

Solution

  • Adapting from the approach here: https://stackoverflow.com/a/74732921/6851825

    df |>
      rowwise() |>
      mutate(S1 = 1*any(na.omit(c_across(X1:X5)) == 1),
             S2 = 1*any(na.omit(c_across(X1:X5)) == 2),
             S3 = 1*any(na.omit(c_across(X1:X5)) == 3),
             S4 = 1*any(na.omit(c_across(X1:X5)) == 4)) |>
      ungroup()
    

    Result

    
    # A tibble: 10 × 10
          ID    X1    X2    X3    X4    X5    S1    S2    S3    S4
       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
     1     1     1     2    NA     4     5     1     1     0     1
     2     2    NA     1    NA     2    NA     1     1     0     0
     3     3    NA     1    NA    NA     2     1     1     0     0
     4     4    NA     1    NA    NA     2     1     1     0     0
     5     5    NA     1    NA    NA     2     1     1     0     0
     6     6    NA    NA     1    NA    NA     1     0     0     0
     7     7     1     2    NA     3    NA     1     1     1     0
     8     8     1    NA    NA    NA    NA     1     0     0     0
     9     9     1    NA    NA     2    NA     1     1     0     0
    10    10     1     2    NA     3    NA     1     1     1     0
    

    Alternatively, we could reshape with tidyr::pivot_longer/wider to avoid needing to make any formulas for specific values. This version creates an S5 column since 5 exists in the data.

    
    left_join(df, df |>
      tidyr::pivot_longer(-ID, values_drop_na = TRUE) |>
      count(ID, value = paste0("S", value)) |>
      arrange(value) |>
      tidyr::pivot_wider(names_from = value, values_fn = ~1, values_fill = 0) |>
      select(-n))
    

    Result

    
    Joining with `by = join_by(ID)`
    # A tibble: 10 × 11
          ID    X1    X2    X3    X4    X5    S1    S2    S3    S4    S5
       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
     1     1     1     2    NA     4     5     1     1     0     1     1
     2     2    NA     1    NA     2    NA     1     1     0     0     0
     3     3    NA     1    NA    NA     2     1     1     0     0     0
     4     4    NA     1    NA    NA     2     1     1     0     0     0
     5     5    NA     1    NA    NA     2     1     1     0     0     0
     6     6    NA    NA     1    NA    NA     1     0     0     0     0
     7     7     1     2    NA     3    NA     1     1     1     0     0
     8     8     1    NA    NA    NA    NA     1     0     0     0     0
     9     9     1    NA    NA     2    NA     1     1     0     0     0
    10    10     1     2    NA     3    NA     1     1     1     0     0