rdataframetidyversedata-wrangling

How to compute which values and how many values of a given variable satisfy a condition for another variable?


I have a dataframe of the form

# Minimum example
> data.frame(variable = c("A", "B", "C", "A", "B", "C"),
+            quantity1 = c(2,4,5,4,6,7),
+            quantity2 = c(3,5,6,7,8,9),
+            group = c("G_A", "G_A", "G_A", "G_B", "G_B", "G_B"))
  variable quantity1 quantity2 group
1        A         2         3   G_A
2        B         4         5   G_A
3        C         5         6   G_A
4        A         4         7   G_B
5        B         6         8   G_B
6        C         7         9   G_B

I want to create a dataset with newly computed summaries. Concretely, I want: 1) to compute how many values in quantity1 and quantity2 are higher than a given number (new variable: threshold) 2) Then, get two columns which indicates 3) which values in 'variable' are higher than that given number for quantity1 and quantity2, 4) how many values in quantity1 and quantity2 are higher than that given number.

  threshold quantity1values quantity1_nvalues quantity2values quantity2_nvalues group
1        2         B, C         2                    A, B, C         3           G_A
2        2         A, B, C      3                    A, B, C         3           G_B
3        4         C           1                     B, C            2           G_A
4        4         B, C        2                     A, B, C         3           G_B

I could get here with a combination of list-columns, strings, the length() column, and grouped summaries. But I feel like there should be an efficient solution, probably writing a function, to generalize this, so that if I add a new threshold, or have more than 2 quantities (i.e., quantit1, quantity2, and quantity3), the new values would be computed. Actually, getting the previous dataframe in long format would also work. Base R or tidivyerse solutions appreciated!

UPDATE

I managed to get most of what I meant with the following code:

# Minimum example
data <- data.frame(variable = c("A", "B", "C", "A", "B", "C"),
           quantity1 = c(2,4,5,4,6,7),
           quantity2 = c(3,5,6,7,8,9),
           group = c("G_A", "G_A", "G_A", "G_B", "G_B", "G_B")) 
data |> 
  select(variable, quantity1, quantity2, group) |> 
  mutate(threshold1 = 2,
         threshold2 = 4) |> 
  pivot_longer(cols = starts_with("thres"),
               names_to = "threshold", values_to = "threshold_value") |> 
  pivot_longer(cols = starts_with("quant"),
               names_to = "quantity", values_to = "quantity_value") |> 
  group_by(group, threshold, quantity) |>
  mutate(above_threshold = sum(quantity_value > threshold_value))

However, two questions remain: 1) how can I get the information of which particular values in 'variable' are above the threshold? E.g., as a character string 2) My original question refered to efficiency (for what I meant shortness of code, and generalizibility). I guess generalizibity is no longer a concern, because extending the previous code for more thresholds is pretty straightforward.


Solution

  • Here's a tidyverse::purrr take.

    library(tidyverse)
    
    # Toy data --------------------
    my_df <- tibble::tribble(
      ~var, ~q1, ~q2, ~group, ~threshold,
      "A",   2,   3,  "G_A",         3,
      "B",   4,   5,  "G_A",        10,
      "C",   5,   6,  "G_A",         6,
      "A",   4,   7,  "G_B",         1,
      "B",   6,   8,  "G_B",         0,
      "C",   7,   9,  "G_B",         7)
    
    # Select all `q` columns and prefix them with "pos_"
    new_df <- rename_with(select(my_df, starts_with("q")), \(col) str_glue("pos_{col}"))
    
    # Find where `q` is greater than `threshold` (`pos_` columns are lists of logical)
    new_df <- map_dfr(my_df$threshold, \(t) map(new_df, \(q_col) list(which(q_col > t)))) 
    new_df <- bind_cols(my_df, new_df) 
    
    # Calculate how many `q`s are greater than each `threshold` and with which values
    new_df <- mutate(
      new_df,
      across(
        starts_with("pos"),
        .fns = list(
            n = \(pos_q)  map_int(pos_q, length),
          val = \(pos_q) map2_chr(pos_q, list(var), \(pos, var) str_flatten_comma(var[pos]))), 
        .names = "{.fn}_{str_extract(.col, 'q.$')}"))
    
    # Just sorting columns
    new_df <- select(new_df, var, group, threshold, matches("\\d$"), -starts_with("pos"))
    

    Here's the output:

    # Output
    new_df
    #> # A tibble: 6 × 9
    > new_df
    # A tibble: 6 × 9
      var   group threshold    q1    q2  n_q1 val_q1              n_q2 val_q2          
      <chr> <chr>     <int> <dbl> <dbl> <int> <chr>              <int> <chr>           
    1 A     G_A           2     2     3     5 "B, C, A, B, C"        6 A, B, C, A, B, C
    2 B     G_A           3     4     5     5 "B, C, A, B, C"        5 B, C, A, B, C   
    3 C     G_A           0     5     6     6 "A, B, C, A, B, C"     6 A, B, C, A, B, C
    4 A     G_B           5     4     7     2 "B, C"                 4 C, A, B, C      
    5 B     G_B           7     6     8     0 ""                     2 B, C            
    6 C     G_B           4     7     9     3 "C, B, C"              5 B, C, A, B, C 
    

    If "" on val_q is undesirable, just..
    mutate(across(starts_with("val_q"), \(x) if_else(x == "", NA_character_, x)))

    Hope it helps!

    Created on 2024-04-30 with reprex v2.1.0