I have data that I've downloaded from a server. It has product sales per month and the products have various categories. I would like to summarise the data by category and by quarter.
I can't figure out how to keep the group_by groups when I use the timetk summarise_by_time.
I have tried reading the documentation for summarise_by_time, but I can't see anything about other groups that are not time groupings. This is what I have so far:
plot_dat<-df_m %>%
group_by(date,Area,Main_Category) %>%
summarise(sum_products=sum(Products_sold)) %>%
summarise_by_time(
.by = "quarter",
value = last(sum_products),
.type = "ceiling"
)
This gives the correct output
plot_dat<-df_m %>%
group_by(date,Area,Main_Category) %>%
summarise(sum_products=sum(Products_sold))
output
`summarise()` has grouped output by 'date', 'Area'. You can override using the `.groups` argument.
# A tibble: 900 × 4
# Groups: date, Area [240]
date Area Main_Category sum_products
<date> <chr> <chr> <int>
1 2018-11-27 Analysis "NGS" 4
2 2018-11-27 Analysis "PCR" 0
3 2018-11-27 Collection/Preservation "" 0
4 2018-11-27 Collection/Preservation "Preservation Device" 172
5 2018-11-27 Collection/Preservation "Preservation Reagent" 10
6 2018-11-27 Isolation_and_Purification "DNA kits" 96
But when I add the summarise_by_time they are only grouped by Area and I lose most of the data (ie the sum of the products sold is now only ~800 as opposed to 74,000)
output
# A tibble: 84 × 3
# Groups: date [21]
date Area value
<date> <chr> <int>
1 2019-01-01 Analysis 0
2 2019-01-01 Collection/Preservation 9
3 2019-01-01 Isolation_and_Purification 4
4 2019-01-01 Lab_Supplies 2
5 2019-04-01 Analysis 0
6 2019-04-01 Collection/Preservation 0
7 2019-04-01 Isolation_and_Purification 2
8 2019-04-01 Lab_Supplies 1
9 2019-07-01 Analysis 0
10 2019-07-01 Collection/Preservation 3
My educated guess based on your question is that you're wanting to get one row for the sum of Products, for each combination of (Main_Category, Area, Quarter) that exists in your dataset?
The issue with your code can be seen by looking at the code in your question:
This gives the correct output
plot_dat <- df_m %>% group_by(date,Area,Main_Category) %>% summarise(sum_products=sum(Products_sold))
output
`summarise()` has grouped output by 'date', 'Area'. You can override using the `.groups` argument. # A tibble: 900 × 4 # Groups: date, Area [240] date Area Main_Category sum_products <date> <chr> <chr> <int> 1 2018-11-27 Analysis "NGS" 4 2 2018-11-27 Analysis "PCR" 0 3 2018-11-27 Collection/Preservation "" 0 4 2018-11-27 Collection/Preservation "Preservation Device" 172 5 2018-11-27 Collection/Preservation "Preservation Reagent" 10 6 2018-11-27 Isolation_and_Purification "DNA kits" 96
You say it's giving the correct answer, but in fact, you can see that the dataframe is now only grouped by date and area, not date, area, and category, so it's really not the correct answer. Why is it doing this? To quote the documentation for summarise()
:
.groups
[Experimental] Grouping structure of the result.
"drop_last": dropping the last level of grouping. This was the only supported option before version 1.0.0.
When .groups is not specified, it is chosen based on the number of rows of the results:
If all the results have 1 row, you get "drop_last".
So in this case, it is dropping the last level of grouping, which in your code is Main_Category, hence your result.
You could fix it in a few different ways, but maybe the simplest is to simply make the date column the last one, and then use that as the .by
argument in summarise_by_time()
:
df |>
group_by(Area,Main_Category, date) |>
summarise(sum_products=sum(Products_sold)) |>
summarise_by_time(.date_var = date, .by = "quarter", sum_products = sum(sum_products))
Output:
# A tibble: 72 × 4
# Groups: Area, Main_Category [18]
Area Main_Category date sum_products
<chr> <chr> <date> <int>
1 Analysis "" 2023-01-01 735
2 Analysis "" 2023-04-01 374
3 Analysis "" 2023-07-01 477
4 Analysis "" 2023-10-01 830
5 Analysis "DNA kits" 2023-01-01 1101
6 Analysis "DNA kits" 2023-04-01 513
7 Analysis "DNA kits" 2023-07-01 527
8 Analysis "DNA kits" 2023-10-01 1458
9 Analysis "NGS" 2023-01-01 622
10 Analysis "NGS" 2023-04-01 775
# ℹ 62 more rows
Another option would be to give summarise()
the argument .groups = "keep"
, which would achieve the same result.
Note, also, if you would like there to be a row for every combination that could possibly exist, not just those that do exist in your dataset, you can use complete()
to add rows for every possible combination of Area, Main_Category, and Quarter. E.g. like this:
# add onto the end of the above code:
ungroup() |>
complete(Area, Main_Category, date = seq.Date(from = min(date), to = max(date), by = "quarter"), fill = list(sum_products = 0))
Data:
set.seed(0)
df <- data.frame(
date = sample(seq.Date(from = as.Date("2023-01-01"), to = as.Date("2023-12-31"), by = "day"), 1095, replace = TRUE),
Area = sample(c("Analysis", "Collection/Preservation", "Isolation_and_Purification"), 1095, replace = TRUE),
Main_Category = sample(c("NGS", "PCR", "", "Preservation Device", "Preservation Reagent", "DNA kits"), 1095, replace = TRUE),
Products_sold = sample(0:100, 1095, replace = TRUE))