I have a dataframe of the form
# Minimum example
> data.frame(variable = c("A", "B", "C", "A", "B", "C"),
+ quantity1 = c(2,4,5,4,6,7),
+ quantity2 = c(3,5,6,7,8,9),
+ group = c("G_A", "G_A", "G_A", "G_B", "G_B", "G_B"))
variable quantity1 quantity2 group
1 A 2 3 G_A
2 B 4 5 G_A
3 C 5 6 G_A
4 A 4 7 G_B
5 B 6 8 G_B
6 C 7 9 G_B
I want to create a dataset with newly computed summaries. Concretely, I want: 1) to compute how many values in quantity1 and quantity2 are higher than a given number (new variable: threshold) 2) Then, get two columns which indicates 3) which values in 'variable' are higher than that given number for quantity1 and quantity2, 4) how many values in quantity1 and quantity2 are higher than that given number.
threshold quantity1values quantity1_nvalues quantity2values quantity2_nvalues group
1 2 B, C 2 A, B, C 3 G_A
2 2 A, B, C 3 A, B, C 3 G_B
3 4 C 1 B, C 2 G_A
4 4 B, C 2 A, B, C 3 G_B
I could get here with a combination of list-columns, strings, the length() column, and grouped summaries. But I feel like there should be an efficient solution, probably writing a function, to generalize this, so that if I add a new threshold, or have more than 2 quantities (i.e., quantit1, quantity2, and quantity3), the new values would be computed. Actually, getting the previous dataframe in long format would also work. Base R or tidivyerse solutions appreciated!
UPDATE
I managed to get most of what I meant with the following code:
# Minimum example
data <- data.frame(variable = c("A", "B", "C", "A", "B", "C"),
quantity1 = c(2,4,5,4,6,7),
quantity2 = c(3,5,6,7,8,9),
group = c("G_A", "G_A", "G_A", "G_B", "G_B", "G_B"))
data |>
select(variable, quantity1, quantity2, group) |>
mutate(threshold1 = 2,
threshold2 = 4) |>
pivot_longer(cols = starts_with("thres"),
names_to = "threshold", values_to = "threshold_value") |>
pivot_longer(cols = starts_with("quant"),
names_to = "quantity", values_to = "quantity_value") |>
group_by(group, threshold, quantity) |>
mutate(above_threshold = sum(quantity_value > threshold_value))
However, two questions remain: 1) how can I get the information of which particular values in 'variable' are above the threshold? E.g., as a character string 2) My original question refered to efficiency (for what I meant shortness of code, and generalizibility). I guess generalizibity is no longer a concern, because extending the previous code for more thresholds is pretty straightforward.
Here's a tidyverse::purrr
take.
library(tidyverse)
# Toy data --------------------
my_df <- tibble::tribble(
~var, ~q1, ~q2, ~group, ~threshold,
"A", 2, 3, "G_A", 3,
"B", 4, 5, "G_A", 10,
"C", 5, 6, "G_A", 6,
"A", 4, 7, "G_B", 1,
"B", 6, 8, "G_B", 0,
"C", 7, 9, "G_B", 7)
# Select all `q` columns and prefix them with "pos_"
new_df <- rename_with(select(my_df, starts_with("q")), \(col) str_glue("pos_{col}"))
# Find where `q` is greater than `threshold` (`pos_` columns are lists of logical)
new_df <- map_dfr(my_df$threshold, \(t) map(new_df, \(q_col) list(which(q_col > t))))
new_df <- bind_cols(my_df, new_df)
# Calculate how many `q`s are greater than each `threshold` and with which values
new_df <- mutate(
new_df,
across(
starts_with("pos"),
.fns = list(
n = \(pos_q) map_int(pos_q, length),
val = \(pos_q) map2_chr(pos_q, list(var), \(pos, var) str_flatten_comma(var[pos]))),
.names = "{.fn}_{str_extract(.col, 'q.$')}"))
# Just sorting columns
new_df <- select(new_df, var, group, threshold, matches("\\d$"), -starts_with("pos"))
Here's the output:
# Output
new_df
#> # A tibble: 6 × 9
> new_df
# A tibble: 6 × 9
var group threshold q1 q2 n_q1 val_q1 n_q2 val_q2
<chr> <chr> <int> <dbl> <dbl> <int> <chr> <int> <chr>
1 A G_A 2 2 3 5 "B, C, A, B, C" 6 A, B, C, A, B, C
2 B G_A 3 4 5 5 "B, C, A, B, C" 5 B, C, A, B, C
3 C G_A 0 5 6 6 "A, B, C, A, B, C" 6 A, B, C, A, B, C
4 A G_B 5 4 7 2 "B, C" 4 C, A, B, C
5 B G_B 7 6 8 0 "" 2 B, C
6 C G_B 4 7 9 3 "C, B, C" 5 B, C, A, B, C
If ""
on val_q
is undesirable, just..
mutate(across(starts_with("val_q"), \(x) if_else(x == "", NA_character_, x)))
Hope it helps!
Created on 2024-04-30 with reprex v2.1.0