I have two dataframes with values for two different years, categorized by group, and I'd like to combine them together to calculate running amounts per group while filling in missing dates. Here are the example dataframes I'm working with.
df1 <- data.frame(
group = sample(c('g1', 'g2', 'g3', 'g4'), 365, replace = TRUE),
date = sample(seq(as.Date('2023-01-01'), as.Date('2023-12-31'), by = "day"), 365),
values = rnorm(365, 10, 1)
)
df2 <- data.frame(
group = sample(c('g1', 'g2', 'g3', 'g4'), 365, replace = TRUE),
date = sample(seq(as.Date('2023-01-01'), as.Date('2023-12-31'), by = "day"), 365),
values = rnorm(365, 10, 1)
)
This create two random dataframes with a random value for each date, but for only one group, so there are missing dates. Below, I use data.table functions to combine the two dataframes together, fill in the missing dates, and calculate the mean average running amounts per group using the runner
package. I noticed, though, that dates with values of 0 are not counted against the mean. For example, the resulting data frame from this code:
df <- rbindlist(list(df1, df2))
idx <- df[,.(date = seq(min(date), max(date), "day")), by = group]
setkey(df, group, date)
setkey(idx, group, date)
df <- df[idx] %>%
setorder(date) %>%
.[,.(
date = lubridate::ymd(date),
values = ifelse(is.na(values), 0, as.numeric(values)),
running_values = runner::mean_run(x = values, k = 90, idx = date)
), by = group]
results in a dataframe that looks like this:
group | date | values | running_values |
---|---|---|---|
g1 | 2023-01-04 | 9.412 | 9.412 |
g1 | 2023-01-05 | 0 | 9.412 |
g1 | 2023-01-06 | 0 | 9.412 |
g1 | 2023-01-07 | 0 | 9.412 |
g1 | 2023-01-08 | 10.788 | 10.100 |
g1 | 2023-01-09 | 0 | 10.100 |
However, the output that I expect (using tidyverse functions instead) is:
df <- bind_rows(df1, df2) %>%
complete(
nesting(group),
date = seq.Date(min(date), max(date), by = "day")
) %>%
group_by(group) %>%
arrange(date) %>%
mutate(
values = ifelse(is.na(values), 0, as.numeric(values)),
date = lubridate::ymd(date),
running_values = runner::mean_run(x = values, k = 90, idx = date)
)
This gives the dataframe I want.
group | date | values | running_values |
---|---|---|---|
g1 | 2023-01-01 | 0 | 0 |
g1 | 2023-01-02 | 0 | 0 |
g1 | 2023-01-03 | 0 | 0 |
g1 | 2023-01-04 | 9.412 | 2.353 |
g1 | 2023-01-05 | 0 | 1.882 |
g1 | 2023-01-06 | 0 | 1.568 |
Using the tidyverse method will also make the groups start at the earliest date across all groups, while the data.table will only use the earliest date per each group, though it's not my main concern. How do I reproduce the tidyverse table but using data.table functions so I don't have to load those dependencies? A simple solution would seem to be to convert the NAs to arbitrarily small numbers (0.000000001) which seems to work well enough, but I'd like to understand why this difference occurs at all.
df1 <- data.frame(
group = sample(c('g1', 'g2', 'g3', 'g4'), 365, replace = TRUE),
date = sample(seq(as.Date('2023-01-01'), as.Date('2023-12-31'), by = "day"), 365),
values = rnorm(365, 10, 1)
)
df2 <- data.frame(
group = sample(c('g1', 'g2', 'g3', 'g4'), 365, replace = TRUE),
date = sample(seq(as.Date('2023-01-01'), as.Date('2023-12-31'), by = "day"), 365),
values = rnorm(365, 10, 1)
)
# Bad output (data.table method)
df <- rbindlist(list(df1, df2))
idx <- df[,.(date = seq(min(date), max(date), "day")), by = group]
setkey(df, group, date)
setkey(idx, group, date)
df <- df[idx] %>%
setorder(date) %>%
.[,.(
date = lubridate::ymd(date),
values = ifelse(is.na(values), 0, as.numeric(values)),
running_values = runner::mean_run(x = values, k = 90, idx = date)
), by = group]
# Desired output (tidyverse method)
df <- bind_rows(df1, df2) %>%
complete(
nesting(group),
date = seq.Date(min(date), max(date), by = "day")
) %>%
group_by(group) %>%
arrange(date) %>%
mutate(
values = ifelse(is.na(values), 0, as.numeric(values)),
date = lubridate::ymd(date),
running_values = runner::mean_run(x = values, k = 90, idx = date)
)
You can use the cross-product function CJ()
to create a filled panel (every day for each group) and right-join it to the data:
df <- rbind(df1,df2)[
CJ(group=unique(df$group), date=seq(min(df$date), max(df$date), "day")),
on=.(group, date)
][is.na(values), values:=0]
Then do the calculation:
df[, running_values := runner::mean_run(x = values, k = 90, idx = date), by=.(group)]
Edit - adding an explanation of why the original attempt didn't work:
First, as an aside, your example data has multiple values per group per day in some cases, so we should probably be aggregating to max one row per date/group:
rbind(df1,df2)[, .(value=sum(values)), keyby=.(group, date)]
(NB, rbindlist()
is intended for large lists of unknown length - rbind
will despatch to rbind.data.table
, so is all you need)
This will give 4 * 365 = 1460 rows after joining with the filled panel frame.
As for the other issues, setkey
just orders the data (and tags it as ordered) - it isn't the issue. (And you don't actually need to key df
here - see below.)
The issues are:
idx <- df[,.(date = seq(min(date), max(date), "day")), by = group]
This explicitly does what you don't want - it creates a series by group from the group min to the group max. This is why you need CJ()
(or else replace with seq(min(df$date), max(df$date), "day")
).
There are a lot of odd things next, which must be the result of you trying different things to see what happens. You set keys on both the df
and idx
tables (even though they are both already in the right order thanks to the join with idx
, and you don't even use the idx
table again). But then you immediately undo this by putting df
in the wrong order:
df |> setorder(date) # now cycles g1,g2,...,g4 by date
Next:
date
is already a date so you don't need to convert it to one with lubridate().
You convert NA to 0 in values
and give the output the same name values
- this bit is okay.
But you create running_values
in the same step in the pipe, which means that it will be based on the original values
(with NAs). You need to do this in a second step instead.
In short the only steps you need are: