I have time series data and I want to impute the missing values. I can't use the mean of the column because I don't think it is a good fit for time series data.
So I want to use simple linear regression to impute them:
Day, Price
1, NaN
2, NaN
3, 1800
4, 1900
5, NaN
6, NaN
7, 2000
8, 2200
I would prefer to do this using Pandas, but if there is no other way I'm OK with doing it using sklearn :)
You can do this using interpolate:
df['Price'] = df['Price'].interpolate(method='linear')
Result:
Price Date
0 NaN 1
1 NaN 2
2 1800.000000 3
3 1900.000000 4
4 1933.333333 5
5 1966.666667 6
6 2000.000000 7
7 2200.000000 8
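To reproduce this end to end, here is a minimal sketch. The DataFrame construction is my assumption based on the data in the question (column names "Day" and "Price" as given there), and the result is assigned back to the column rather than using inplace=True on a column selection:

```python
import numpy as np
import pandas as pd

# Data copied from the question; the construction itself is an assumption.
df = pd.DataFrame({
    "Day": [1, 2, 3, 4, 5, 6, 7, 8],
    "Price": [np.nan, np.nan, 1800, 1900, np.nan, np.nan, 2000, 2200],
})

# Linear interpolation between the neighbouring known values.
# With the default direction, the two leading NaNs are left untouched.
df["Price"] = df["Price"].interpolate(method="linear")

print(df)
```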
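As you can see, this only fills the missing values in the forward direction, so the two leading NaNs remain. If you want to fill the first two values as well, use the parameter limit_direction="both":
df['Price'] = df['Price'].interpolate(method='linear', limit_direction="both")
Note that with method='linear' the leading values are not extrapolated along the trend; they are filled with the first valid value (1800 here).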
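To check what ends up in the first two rows, here is a small self-contained sketch of the limit_direction="both" call. The Series is built by hand from the question's data, and the variable names s and filled are just for illustration:

```python
import numpy as np
import pandas as pd

# Prices from the question, in Day order (Day 1 to Day 8).
s = pd.Series([np.nan, np.nan, 1800, 1900, np.nan, np.nan, 2000, 2200])

# Fill gaps in both directions; interior gaps are interpolated linearly.
filled = s.interpolate(method="linear", limit_direction="both")

# The leading gaps take the first valid value instead of a trend estimate.
print(filled.tolist())
```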
There are other interpolation methods as well, e.g. quadratic or spline (these require SciPy to be installed). For more info see the docs: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.interpolate.html
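Since the question asked for a simple linear regression specifically, note that interpolation is not the same thing: it connects neighbouring points rather than fitting one line through all of them. If a single fitted trend line is really what you want, one possible sketch (my own suggestion, not part of interpolate) is to fit a straight line with numpy.polyfit on the known rows and fill the gaps from it. The names known, slope and intercept are only illustrative:

```python
import numpy as np
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    "Day": [1, 2, 3, 4, 5, 6, 7, 8],
    "Price": [np.nan, np.nan, 1800, 1900, np.nan, np.nan, 2000, 2200],
})

# Fit Price = intercept + slope * Day on the rows where Price is known.
known = df["Price"].notna()
slope, intercept = np.polyfit(
    df.loc[known, "Day"].to_numpy(),
    df.loc[known, "Price"].to_numpy(),
    deg=1,
)

# Fill only the missing rows with the fitted line; known prices are unchanged.
df["Price"] = df["Price"].fillna(intercept + slope * df["Day"])

print(df)
```

Unlike interpolation, the filled values here do not have to pass through the neighbouring observations, so pick whichever behaviour suits your data.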