I have data in a database that look like:
ID month_year value
1 01/06/2014 10
1 01/07/2014 100
1 01/10/2014 25
I would like to fill in the missing months:
ID month_year value
1 01/06/2014 10
1 01/07/2014 100
1 01/08/2014 NA
1 01/09/2014 NA
1 01/10/2014 25
I am using the BigQuery package to use dbplyr. I know this is possible in BigQuery with UNNEST(GENERATE_DATE_ARRAY(... but I can't make that work with dbplyr. Possibly related to this github issue
You can do this with an outer join.
list_of_dates = data_with_missing_dates %>%
select(month_year) %>%
distinct()
data_with_filled_dates = data_with_missing_dates %>%
right_join(list_of_dates, by = "month_year")
These are all standard dplyr
commands, hence dbplyr
can translate them into bigquery.
The above assumes that your existing data includes all the dates you want in the final output (but spread over different ID values), hence list_of_dates
can be constructed from your initial dataset.
If there are dates that do not appear for any ID in your initial data that you want to appear in your final data, then you will need to construct list_of_dates
some other way. In this case even complete()
would not be sufficient by itself.
Edit so each ID has its own start and end
list_of_dates = data_with_missing_dates %>%
select(month_year) %>%
distinct() %>%
mutate(placeholder = 1)
date_limits = data_with_missing_dates %>%
group_by(ID) %>%
summarise(min_date = min(month_year),
max_date = max(month_year)) %>%
mutate(placeholder = 1)
data_with_filled_dates = date_limits %>%
outer_join(list_of_dates, by = "placeholder") %>%
filter(min_date <= month_year,
max_date >= month_year) %>%
select(ID, month_year) %>%
left_join(data_with_missing_dates, by = c("ID", "month_year"))