python pandas scikit-learn data-mining

Sklearn or Pandas, impute missing values with simple linear regression


I have time series data and I want to impute the missing values. I can't use the mean of the column because I don't think it's appropriate for time series data.

So I want to use simple linear regression to impute them:

Day, Price
 1, NaN
 2, NaN
 3, 1800
 4, 1900
 5, NaN
 6, NaN
 7, 2000
 8, 2200
 

I would prefer to do this using Pandas, but if there is no other way I'm OK with doing it using sklearn :)


Solution

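  • You can do this using interpolate, which connects neighbouring known values with straight lines (linear interpolation, not a fitted regression line). Assign the result back to the column, since calling it with inplace=True on a selected column may not modify the DataFrame:

    df['Price'] = df['Price'].interpolate(method='linear')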
    

    Result:

             Price  Day
    0          NaN    1
    1          NaN    2
    2  1800.000000    3
    3  1900.000000    4
    4  1933.333333    5
    5  1966.666667    6
    6  2000.000000    7
    7  2200.000000    8
    

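    As you can see, this only fills missing values that come after a known value, so the first two rows stay NaN. If you want to fill those as well, use the parameter limit_direction="both". Note that the linear method does not extrapolate: the leading values are filled with the nearest valid value (1800 here).

    df['Price'] = df['Price'].interpolate(method='linear', limit_direction="both")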
    

    Other interpolation methods are available, e.g. quadratic or spline (both require SciPy, and spline also needs an order argument). For more information, see the docs: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.interpolate.html
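
The question asks for simple linear regression, which differs from interpolation: a regression fits one least-squares line through all known points, so it can also extrapolate to the leading rows. Below is a minimal sketch of that idea, not part of the original answer. It assumes the sample data from the question and uses numpy.polyfit for the fit; the column name Price_imputed is made up for illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Day": [1, 2, 3, 4, 5, 6, 7, 8],
    "Price": [np.nan, np.nan, 1800, 1900, np.nan, np.nan, 2000, 2200],
})

# Fit Price = slope * Day + intercept using only the rows where Price is known.
known = df["Price"].notna()
slope, intercept = np.polyfit(
    df.loc[known, "Day"].to_numpy(),
    df.loc[known, "Price"].to_numpy(),
    deg=1,
)

# Keep observed prices; fill only the gaps with the fitted line's prediction.
# "Price_imputed" is a hypothetical column name chosen for this sketch.
df["Price_imputed"] = df["Price"].fillna(slope * df["Day"] + intercept)

print(df)
```

Unlike interpolate, the fitted line generally does not pass through the observed points, so an imputed value can land above or below its known neighbours. Whether that is acceptable depends on how linear the series really is.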