python time-series missing-data imputation missing-features

Impute time series data in Python using a given set of features


So my data looks like:

year, y, x1, x2, x3, x4
2009, 0.5, 0.4, 0.4, 0.9
2013, nan, 0.4, 0.5, 0.8
2020, 0.8, 0.39, 0.51, 0.7

The data is yearly, but the interval between years is not consistent. The value of y depends on both time and the features. In some cases y, which is the value I need most, is missing. Other features can be missing too, but mostly they are all there. I have tried imputing the data with df.interpolate(), but the values do not fit well in the interval for most of the methods. I have tried ARIMA, LSTM and others, but they do not consider the input features. I have also considered regression techniques, but they do not incorporate the time series nature of the data.

So what is the best approach for this case? That is:

How to impute Time Series values based on input features?


Solution

  • Interesting question. There is no fixed rule or single good answer to your problem.

    It seems you would like to predict t+n points starting from t+1, where t is your last known point.

    If so, you first need to handle the unknown target values (rows where y is NaN). It is important to remove them before training, but by doing this you will lose some important information. Therefore, one way is to create two models: one for data imputation, to fill in the unknown values of y, and a second for forecasting future values of y.

    The first model may be represented as an autoencoder, where the features represent the current time step. In other words, given n features, predict y, where the n features and y come from the same time t (same row).
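The "given the features of a row, predict y for that row" idea can be sketched as follows. This is only a minimal illustration: it swaps the autoencoder for a plain least-squares regression, adds the elapsed years as an extra input so that time is not ignored, and runs on made-up, noise-free data. The column names, the `START_YEAR` constant, and the `design_matrix` helper are assumptions for the example, not part of the question.

```python
import numpy as np
import pandas as pd

# Made-up data with irregular year spacing (NOT the asker's data).
rng = np.random.default_rng(0)
years = np.array([2001, 2003, 2004, 2007, 2009, 2010,
                  2013, 2015, 2016, 2018, 2020, 2021])
df = pd.DataFrame({
    "year": years,
    "x1": rng.uniform(0.3, 0.5, len(years)),
    "x2": rng.uniform(0.4, 0.6, len(years)),
    "x3": rng.uniform(0.6, 0.9, len(years)),
})
# Noise-free target that depends on both time and the features.
y_true = 0.02 * (df["year"] - 2000) + df["x1"] - 0.5 * df["x3"] + 0.3
df["y"] = y_true.copy()
df.loc[[3, 6, 9], "y"] = np.nan  # pretend these targets are missing

START_YEAR = 2001  # hypothetical reference year, keeps the time column small

def design_matrix(frame):
    """Intercept, years elapsed since START_YEAR, and the same-row features."""
    elapsed = (frame["year"] - START_YEAR).to_numpy(dtype=float)
    feats = frame[["x1", "x2", "x3"]].to_numpy(dtype=float)
    return np.column_stack([np.ones(len(frame)), elapsed, feats])

# Fit only on rows where y is known.
known = df["y"].notna()
coef, *_ = np.linalg.lstsq(
    design_matrix(df[known]), df.loc[known, "y"].to_numpy(), rcond=None
)

# Fill only the missing targets; known values are left untouched.
df["y_imputed"] = df["y"]
df.loc[~known, "y_imputed"] = design_matrix(df[~known]) @ coef
```

Any regressor could take the place of the least-squares fit here; the point is only that the year enters the model alongside the features.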

    The second model handles forecasting: after imputing the missing y values, predict the future t+n, where n ∈ {1, 2, ...}.
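One way to set up such a forecasting step on irregularly spaced years is to give the model the previous y and the gap in years since the previous observation. The sketch below is an assumption-laden illustration on made-up numbers: the lag-plus-gap design, the single feature `x1`, the least-squares fit, and the hypothetical next year 2023 are all choices made for the example.

```python
import numpy as np
import pandas as pd

# Made-up series with irregular spacing; y is assumed to be already imputed.
df = pd.DataFrame({
    "year": [2009, 2011, 2013, 2016, 2018, 2020],
    "x1":   [0.40, 0.41, 0.40, 0.38, 0.39, 0.39],
    "y":    [0.50, 0.55, 0.62, 0.70, 0.74, 0.80],
})

# Lagged target and the number of years since the previous observation.
df["y_prev"] = df["y"].shift(1)
df["gap"] = df["year"].diff()
train = df.dropna()  # the first row has no previous observation

# Ordinary least squares: y ~ 1 + y_prev + gap + x1
X = np.column_stack(
    [np.ones(len(train)), train[["y_prev", "gap", "x1"]].to_numpy(dtype=float)]
)
coef, *_ = np.linalg.lstsq(X, train["y"].to_numpy(dtype=float), rcond=None)

# One-step forecast for a hypothetical next observation in 2023,
# assuming x1 stays at its last known value.
last = df.iloc[-1]
x_next = np.array([1.0, last["y"], 2023 - last["year"], last["x1"]])
y_forecast = float(x_next @ coef)
```

Forecasting further ahead would mean feeding each forecast back in as `y_prev`, and it needs some assumption about the future feature values.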

    Another good approach to deal with missing values is to create three models instead of two.

    The first is the data imputation model mentioned above.

    After filling in the missing target values, feed the new matrix into a second autoencoder.

    Use the hidden state of the second AE as input to the third model. This way the data may still contain missing values, and the AE can learn a compressed representation of them that keeps what is most useful for predicting the future.

    The best architecture varies from problem to problem. For example, in your case you may be able to just drop the rows with missing target values and still get a good final model.

    One adjustment that will probably be necessary is to impute the missing feature values, but I would first try with the missing values left in, since imputing them adds some noise. If needed, you can fill them with the mean, median, min or max of a rolling window (use the rolling method in pandas).
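The rolling-window fill for features can be sketched like this. The data, the window size of 3, and the `x1_filled` column name are made up for illustration; `.mean()` can be swapped for `.median()`, `.min()` or `.max()`.

```python
import numpy as np
import pandas as pd

# Made-up feature column with two gaps.
df = pd.DataFrame({
    "year": [2009, 2011, 2013, 2016, 2018, 2020],
    "x1":   [0.40, 0.42, np.nan, 0.38, np.nan, 0.39],
})

# Rolling mean over the current row and up to two previous rows.
# NaNs inside the window are skipped, and min_periods=1 lets the window
# return a value as long as it holds at least one non-NaN entry.
rolling_mean = df["x1"].rolling(window=3, min_periods=1).mean()

# Fill only the gaps; observed values are left unchanged.
df["x1_filled"] = df["x1"].fillna(rolling_mean)
```

Note that the window counts rows, not years, so with irregular spacing a window of 3 rows can cover very different time spans.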