I have a DataFrame with a time series index and several thousand rows. Unfortunately, some data points are missing. It looks like this:
A
2023-03-18 05:00:00 3.0
2023-03-18 06:00:00 4.0
2023-03-18 07:00:00 24.4
2023-03-18 12:00:00 5.6
2023-03-18 13:00:00 3.4
2023-03-18 15:00:00 4.5
2023-03-18 20:00:00 8.8
2023-03-18 21:00:00 3.2
I want my data to be hourly and thus interpolate the missing data points. This can be done with .resample() (or .reindex()) and .interpolate() like this:
df = df.resample("1h").first()
df = df.interpolate(method="time")
df
which gives:
A
2023-03-18 05:00:00 3.00
2023-03-18 06:00:00 4.00
2023-03-18 07:00:00 24.40
2023-03-18 08:00:00 20.64
2023-03-18 09:00:00 16.88
2023-03-18 10:00:00 13.12
2023-03-18 11:00:00 9.36
2023-03-18 12:00:00 5.60
2023-03-18 13:00:00 3.40
2023-03-18 14:00:00 3.95
2023-03-18 15:00:00 4.50
2023-03-18 16:00:00 5.36
2023-03-18 17:00:00 6.22
2023-03-18 18:00:00 7.08
2023-03-18 19:00:00 7.94
2023-03-18 20:00:00 8.80
2023-03-18 21:00:00 3.20
QUESTION
However, I don't want to interpolate if the gaps are too large, e.g. longer than 3 hours. At these large gaps I want to split the DataFrame by rows, so that I end up with multiple DataFrames (as a list or whatever). How can I do this?
The result should look something like this - 3 DataFrames:
A
2023-03-18 05:00:00 3.00
2023-03-18 06:00:00 4.00
2023-03-18 07:00:00 24.40
A
2023-03-18 12:00:00 5.60
2023-03-18 13:00:00 3.40
2023-03-18 14:00:00 3.95
2023-03-18 15:00:00 4.50
A
2023-03-18 20:00:00 8.80
2023-03-18 21:00:00 3.20
Edit:
My idea was to look for a convenience function which does some consecutive logic on a boolean list.
Imagine there is an index / a series with boolean values
[True, False, True, False, False, False, False, True, True, False, False, True]
and we want to generate another Series like this
[True, True, True, False, False, False, False, True, True, True, True, True]
That is, only entries that are part of a run of more than 3 consecutive False values should stay False; shorter runs of False become True.
The first boolean list can be obtained like this:
~df.resample('1h').first().isnull().any(axis=1)
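The consecutive logic described above can be sketched with plain pandas operations. This is only a minimal sketch on the boolean list from the example; the names present, run_id, run_len, max_gap and keep are made up for illustration:

```python
import pandas as pd

# Boolean values from the example: True = value present, False = value missing.
present = pd.Series([True, False, True, False, False, False, False,
                     True, True, False, False, True])

# Give every run of identical consecutive values its own id.
run_id = (present != present.shift()).cumsum()

# Length of the run that each element belongs to.
run_len = present.groupby(run_id).transform("size")

# Hypothetical threshold: False runs longer than this stay False.
max_gap = 3

# An entry is kept as True if it was True already,
# or if it is a False inside a short run (length <= max_gap).
keep = present | (run_len <= max_gap)

print(keep.tolist())
```

On an hourly index a run of more than 3 missing values marks a large gap, so keep could then be used to decide where the DataFrame gets split.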
SOLUTION:
Thanks to @triky for his very nice solution. I also found a very similar question on Stack Overflow.
I ended up with the following snippet (which is essentially triky's method):
# specify threshold
th = pd.Timedelta(hours=3)
# calculate the groups
groups = (df.index.diff() > th).cumsum()
# split the df according to groups
dfs = [g for _, g in df.groupby(groups)]
# several further computations ...
# resample and interpolate each piece separately (g avoids shadowing the original df)
dfs = [g.resample('1h').first().interpolate(method='time') for g in dfs]
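Note that Index.diff() is, as far as I know, only available from pandas 2.1 on. A minimal sketch of the same idea that should also run on older versions uses df.index.to_series().diff() instead. It rebuilds the sample data from the question and assumes a unique, sorted index; the names gaps and pieces are made up for illustration:

```python
import pandas as pd

# Sample data from the question.
idx = pd.to_datetime([
    "2023-03-18 05:00", "2023-03-18 06:00", "2023-03-18 07:00",
    "2023-03-18 12:00", "2023-03-18 13:00", "2023-03-18 15:00",
    "2023-03-18 20:00", "2023-03-18 21:00",
])
df = pd.DataFrame({"A": [3.0, 4.0, 24.4, 5.6, 3.4, 4.5, 8.8, 3.2]}, index=idx)

# Gaps longer than this start a new group.
th = pd.Timedelta(hours=3)

# Time difference to the previous row (NaT for the first row).
# NaT > th evaluates to False, so the first row ends up in group 0.
gaps = df.index.to_series().diff()
groups = (gaps > th).cumsum()

# Split at the large gaps, then fill the small gaps inside each piece.
pieces = [g.resample("1h").first().interpolate(method="time")
          for _, g in df.groupby(groups)]

for p in pieces:
    print(p, end="\n\n")
```

With the sample data this gives three pieces, and only the missing 14:00 row in the second piece gets interpolated.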
Create the groups from the index: use diff to get the time difference between consecutive rows, compare it with your threshold, and then groupby on the cumulative sum of that condition. Within each group, use resample to create the missing timestamps and then interpolate.
groups = df.index.diff() > pd.to_timedelta(3, unit='h')
dfs = {f'df{idx}': g.resample('1h').first().interpolate(method="time")
       for idx, g in df.groupby(groups.cumsum())}
In case you want a list instead, e.g. to concat them back together:
dfs2 = [g.resample('1h').first().interpolate(method="time")
        for idx, g in df.groupby(groups.cumsum())]
End result:
{'df0': A
2023-03-18 05:00:00 3.0
2023-03-18 06:00:00 4.0
2023-03-18 07:00:00 24.4,
'df1': A
2023-03-18 12:00:00 5.60
2023-03-18 13:00:00 3.40
2023-03-18 14:00:00 3.95
2023-03-18 15:00:00 4.50,
'df2': A
2023-03-18 20:00:00 8.8
2023-03-18 21:00:00 3.2}
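If the pieces should be put back into a single DataFrame, a sketch could look like the following. It rebuilds the sample data so that it runs on its own, uses to_series().diff() so it does not depend on Index.diff(), and the names combined and hourly are made up for illustration:

```python
import pandas as pd

# Sample data from the question.
idx = pd.to_datetime([
    "2023-03-18 05:00", "2023-03-18 06:00", "2023-03-18 07:00",
    "2023-03-18 12:00", "2023-03-18 13:00", "2023-03-18 15:00",
    "2023-03-18 20:00", "2023-03-18 21:00",
])
df = pd.DataFrame({"A": [3.0, 4.0, 24.4, 5.6, 3.4, 4.5, 8.8, 3.2]}, index=idx)

# Same grouping as above: a gap of more than 3 hours starts a new group.
groups = (df.index.to_series().diff() > pd.Timedelta(hours=3)).cumsum()
dfs2 = [g.resample("1h").first().interpolate(method="time")
        for _, g in df.groupby(groups)]

# Concatenate the pieces: the rows inside the large gaps are simply absent.
combined = pd.concat(dfs2)

# Optional: put everything on a regular hourly grid again,
# which leaves NaN in the large gaps instead of interpolated values.
hourly = combined.asfreq("1h")

print(hourly)
```

Whether the large gaps should be absent (combined) or present as NaN (hourly) depends on what the later computations expect.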