rdbplyrbigrquery

Fill in missing values in dbplyr


I have data in a database that look like:

ID    month_year   value
1     01/06/2014   10
1     01/07/2014   100
1     01/10/2014   25

I would like to fill in the missing months:

ID    month_year   value
1     01/06/2014   10
1     01/07/2014   100
1     01/08/2014   NA
1     01/09/2014   NA
1     01/10/2014   25

I am using the BigQuery package to use dbplyr. I know this is possible in BigQuery with UNNEST(GENERATE_DATE_ARRAY(... but I can't make that work with dbplyr. Possibly related to this github issue


Solution

  • You can do this with an outer join.

    list_of_dates = data_with_missing_dates %>%
      select(month_year) %>%
      distinct()
    
    data_with_filled_dates = data_with_missing_dates %>%
      right_join(list_of_dates, by = "month_year")
    

    These are all standard dplyr commands, hence dbplyr can translate them into bigquery.

    The above assumes that your existing data includes all the dates you want in the final output (but spread over different ID values), hence list_of_dates can be constructed from your initial dataset.

    If there are dates that do not appear for any ID in your initial data that you want to appear in your final data, then you will need to construct list_of_dates some other way. In this case even complete() would not be sufficient by itself.

    Edit so each ID has its own start and end

    list_of_dates = data_with_missing_dates %>%
      select(month_year) %>%
      distinct() %>%
      mutate(placeholder = 1)
    
    date_limits = data_with_missing_dates %>%
      group_by(ID) %>%
      summarise(min_date = min(month_year),
                max_date = max(month_year)) %>%
      mutate(placeholder = 1)
    
    data_with_filled_dates = date_limits %>%
      outer_join(list_of_dates, by = "placeholder") %>%
      filter(min_date <= month_year,
             max_date >= month_year) %>%
      select(ID, month_year) %>%
      left_join(data_with_missing_dates, by = c("ID", "month_year"))