python machine-learning time-series missing-data

Filling large chunks of missing time-series data


What would be the best way to fill in missing values in time-series data? The data varies a lot over working hours, and it is missing in huge chunks.

I have tried backward filling, forward filling, and mean imputation to fill in the data. I have also tried interpolation (linear, nearest, and polynomial) with the pandas package, but the results are not very useful.

The first graph shows the missing data around 6-9 April. The second graph is plotted after filling the missing values using linear interpolation.

What would be the best method to fill such data? I am afraid linear interpolation will end up polluting the data.

I have read a bit about the Kalman filter, but I am not sure how to use it.


Solution

  • It really depends on the size of the chunks of missing data, but training a model to predict the missing values could work in some cases.
    Apart from linear regression, you could also try other models, for example k-NN regression. In addition, the datawig module (GitHub) uses neural networks to impute missing values in tables.

    A Kalman filter implementation in Python can be found in the FilterPy module. For more information, you can read its documentation.

    Moreover, since you are working with time-series data, you could check whether an ARIMA model can do the job of predicting your missing values.
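
To make the model-based idea concrete, here is a minimal sketch of k-NN regression used as a gap filler. It is only an illustration under several assumptions: the data is synthetic (an hourly series with a working-hours bump and a three-day gap, on made-up 2021 dates), the features are just hour of day and a weekend flag, and the names `time_features`, `knn_predict`, `fill_gaps_knn` and `load` are hypothetical. The k-NN step is written out with NumPy to keep the example dependency-free; a library regressor such as scikit-learn's `KNeighborsRegressor` would normally take its place.

```python
import numpy as np
import pandas as pd


def time_features(index):
    """Encode each timestamp as cyclical hour-of-day plus a weekend flag."""
    hour = np.asarray(index.hour) + np.asarray(index.minute) / 60.0
    angle = 2 * np.pi * hour / 24
    weekend = (np.asarray(index.dayofweek) >= 5).astype(float)
    return np.column_stack([np.sin(angle), np.cos(angle), weekend])


def knn_predict(x_train, y_train, x_query, k):
    """Average the targets of the k nearest training points (Euclidean)."""
    dists = np.linalg.norm(x_query[:, None, :] - x_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)


def fill_gaps_knn(series, k=5):
    """Return a copy of `series` with NaNs replaced by k-NN predictions."""
    missing = series.isna().to_numpy()
    filled = series.copy()
    if not missing.any():
        return filled
    x_train = time_features(series.index[~missing])
    y_train = series.to_numpy()[~missing]
    x_query = time_features(series.index[missing])
    filled.iloc[missing] = knn_predict(x_train, y_train, x_query, k)
    return filled


# Synthetic stand-in for the real data: three weeks of hourly readings that
# are high during weekday working hours and low otherwise (assumed pattern).
index = pd.date_range("2021-03-29", periods=21 * 24, freq=pd.Timedelta(hours=1))
rng = np.random.default_rng(0)
working = (index.dayofweek < 5) & (index.hour >= 9) & (index.hour < 17)
load = pd.Series(10.0 + 50.0 * working + rng.normal(0, 1, len(index)), index=index)

# Knock out a three-day chunk, then fill it from the observed points.
load.loc["2021-04-06":"2021-04-08"] = np.nan
filled = fill_gaps_knn(load, k=5)
```

Unlike linear interpolation, which draws a straight line across the gap, this fills each missing hour from observed hours with a similar time of day and day type, so the daily working-hours shape is kept. It only works if the gap is short relative to the observed history and the pattern is reasonably stable; it does not model trend or the values right before and after the gap, which is where ARIMA or a Kalman filter may do better.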