I have a data frame like this:
v_1 <- c("1a", "1b","1c", "2a", "2b", "2c", "3a", "3b","3c", "4a", "4b", "4c")
v_2 <- c(1,1,1,2,2,2,3,3,3,4,4,4)
v_3 <- c("dog", "dog", "dog", "dog", "dog", "dog", "cat", "cat", "cat", "cat", "cat", "cat")
v_4 <- c(1:12)
df <- data.frame(v_1, v_2, v_3, v_4)
df
v_1 v_2 v_3 v_4
1 1a 1 dog 1
2 1b 1 dog 2
3 1c 1 dog 3
4 2a 2 dog 4
5 2b 2 dog 5
6 2c 2 dog 6
7 3a 3 cat 7
8 3b 3 cat 8
9 3c 3 cat 9
10 4a 4 cat 10
11 4b 4 cat 11
12 4c 4 cat 12
I want to group this data frame and count the distinct values for v_1 and v_2. If I am just interested in the count in v_1 it is quite easy. I just do:
library(dplyr)
df_grouped <- df %>%
group_by(v_3) %>%
summarise(v_4_sum = sum(v_4),
v_1_count = n())
v_3 v_4_sum v_1_count
<chr> <int> <int>
1 cat 57 6
2 dog 21 6
If I want to also see te count of v_2 it seems like I have to use group_by two times like this:
df_grouped_v2 <- df %>%
group_by(v_2, v_3) %>%
summarise(v_4_sum = sum(v_4),
v_1_count = n())
df_grouped_v22 <- df_grouped_v2 %>%
group_by(v_3) %>%
summarise(v_4_sum = sum(v_4_sum),
v_1_count = sum(v_1_count),
v_2_count = n())
df_grouped_v22
v_3 v_4_sum v_1_count v_2_count
<chr> <int> <int> <int>
1 cat 57 6 2
2 dog 21 6 2
This is the result I want but it seems not straight forward. Especially if I have a huge data frame the group_by operation is time intensive and I would prefer to use it just one time.
For distinct values, you can use n_distinct()
rather than n()
:
library(dplyr)
df |>
summarise(v_4_sum = sum(v_4),
across(c(v_1, v_2), n_distinct, .names = "{.col}_count"),
.by = v_3)
v_3 v_4_sum v_1_count v_2_count
1 dog 21 6 2
2 cat 57 6 2