time-series, forecasting, google-cloud-automl

Google's AutoML time series forecasting: understanding data exported to BigQuery during training


When training a time series forecasting model, I checked the option to "Export test dataset to BigQuery." I'm having a hard time understanding the meaning of the "predicted_on" timestamps that appear in the BigQuery table.

Some info about my model: the granularity is weekly. The context window is 26 weeks, and the forecast horizon is 26 weeks. The 10% test data split also contains exactly 26 weeks of data. In our training data, we have a submission_week column which is designated as the "timestamp" column.

When I sort the BigQuery table by submission_week and then by predicted_on_submission_week, it looks like this:†

predicted_on_submission_week | submission_week
06/05/2022                   | 06/05/2022
---
06/05/2022                   | 06/12/2022
06/12/2022                   | 06/12/2022
---
06/05/2022                   | 06/19/2022
06/12/2022                   | 06/19/2022
06/19/2022                   | 06/19/2022

† Note that each row above actually corresponds to multiple rows in the BigQuery table, one for each time series.

The pattern above continues until there are at most 6 predicted_on_submission_week timestamps for each submission_week timestamp.
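The staircase pattern above can be reproduced with a small sketch (the dates and column names mirror the table; the window logic is my own assumption, purely for illustration): each forecast window starts at a predicted_on week and covers later weeks, and sorting by target week first produces the grouping shown.

```python
from datetime import date, timedelta

# Three consecutive weekly timestamps, as in the table above.
start = date(2022, 6, 5)
weeks = [start + timedelta(weeks=i) for i in range(3)]

# Assumed structure: a window whose horizon starts at predicted_on
# produces one row per target (submission) week at or after it.
rows = [
    (predicted_on, submission_week)
    for predicted_on in weeks
    for submission_week in weeks
    if predicted_on <= submission_week
]

# Sort by submission_week, then predicted_on, as in the question.
rows.sort(key=lambda r: (r[1], r[0]))

for predicted_on, submission_week in rows:
    print(predicted_on.isoformat(), submission_week.isoformat())
```

Running this prints one pair per row of the table: a single row for 06-05, two rows ending at 06-12, and three rows ending at 06-19.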

My questions: What is the meaning of the predicted_on_submission_week timestamps? Why are there multiple (at most 6) such timestamps for each submission_week timestamp?

(I suspect this may be related to how the context window and forecast horizon are used during training and forecasts as described here in Google's documentation, but I'm not sure...)


Solution

  • Regarding my first question (What is the meaning of the predicted_on_submission_week timestamps?):

    I've since learned that a predicted_on timestamp marks the first date of the forecast horizon of a sliding forecast window.

    I found that it's easier to understand and interpret the data when I sort it first by predicted_on_submission_week and then by submission_week. This way I can view the data in terms of the sliding forecast windows.

  • Regarding my second question (Why are there multiple - at most 6 - such timestamps for each submission_week timestamp?):

    I'm not sure, but I did discover the following.

    The timestamp format I was using (mm/dd/yyyy) is not among the timestamp formats supported by Google according to this documentation. I changed my timestamps to yyyy-mm-dd. I also ensured that every number in my target column includes a decimal point (previously it was a mix of integers and decimal numbers). After making these changes, I trained a new model and examined the data exported to BigQuery.
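The cleanup described above can be sketched with the standard library (the sample values are hypothetical; only the two transformations matter - reformatting mm/dd/yyyy timestamps to yyyy-mm-dd and making the target column uniformly floating point):

```python
from datetime import datetime

# Hypothetical (timestamp, target) rows with the problematic formats:
# mm/dd/yyyy dates and a mix of integer and decimal targets.
rows = [("06/05/2022", 10), ("06/12/2022", 12.5)]

cleaned = [
    (datetime.strptime(ts, "%m/%d/%Y").strftime("%Y-%m-%d"), float(v))
    for ts, v in rows
]

print(cleaned)  # [('2022-06-05', 10.0), ('2022-06-12', 12.5)]
```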

    Now I see that there are 26 weeks of submission_week timestamps associated with the first predicted_on_submission_week timestamp (2022-06-05). This suggests that each sliding forecast window's horizon is 26 weeks long, which makes more sense to me, given that I set the forecast horizon to 26 weeks when I trained the model. (Also note that the 26th week is the last week of training data - this is relevant to the next point.)

    The next predicted_on_submission_week timestamp after 2022-06-05 is 2022-06-12. There are 25 weeks of submission_week timestamps for this date. This makes sense because now the forecast horizon is extending one week beyond the end of the training data.
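The per-window counts described above can be checked with a short simulation (the constants and the truncation-at-end-of-data rule are my assumptions, based on the behavior observed in the export): each window's horizon starts at its predicted_on week, but only target weeks that still fall inside the available data appear.

```python
from datetime import date, timedelta

HORIZON = 26      # forecast horizon in weeks (assumed, per model settings)
TEST_WEEKS = 26   # length of the test split in weeks (assumed)

first_test_week = date(2022, 6, 5)
test_weeks = [first_test_week + timedelta(weeks=i) for i in range(TEST_WEEKS)]
last_week = test_weeks[-1]

# For each sliding window, count how many of its HORIZON target weeks
# fall at or before the last week of available data.
counts = {}
for predicted_on in test_weeks:
    targets = [predicted_on + timedelta(weeks=k) for k in range(HORIZON)]
    counts[predicted_on] = sum(t <= last_week for t in targets)

print(counts[test_weeks[0]])  # 26 target weeks for 2022-06-05
print(counts[test_weeks[1]])  # 25 target weeks for 2022-06-12
```

Under these assumptions, the first window keeps all 26 target weeks and each later window loses one, matching the 26-then-25 pattern seen in the export.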