I have data that looks like this:
Timestamp | Variable A | Variable B | Variable C |
---|---|---|---|
2023-01-01 00:00:00.000 | Value | Nan | Nan |
2023-01-01 00:00:00.050 | Value | Nan | Nan |
2023-01-01 00:00:00.150 | Nan | Value | Value |
2023-01-01 05:00:00.450 | Nan | Value | Value |
2023-01-01 05:00:00.500 | Value | Nan | Nan |
Variable A comes from one source and Variable B and C from a second, so the timestamps are different which gives me a lot of missing data (the data comes from a dozen sources, this is just a simplified example).
I tried to use the resample
method of pandas.
But this method creates too many lines for me, in the example above there would be new lines between 2023-01-01 00:00:00.150 and 2023-01-01 05:00:00.450, which I do not want.
Can you help me? Thank you
One option could be to use the successive time difference and a threshold to group your data with groupby.first
(or groupby.mean
):
out = df.groupby(df['Timestamp'].diff().ge('1s').cumsum(), as_index=False).first()
Output:
Timestamp Variable A Variable B Variable C
0 2023-01-01 00:00:00.000 1.0 3.0 4.0
1 2023-01-01 05:00:00.450 7.0 5.0 6.0
With resample
:
out = df.resample('5H', on='Timestamp').mean().reset_index()
Output:
Timestamp Variable A Variable B Variable C
0 2023-01-01 00:00:00 1.5 3.0 4.0
1 2023-01-01 05:00:00 7.0 5.0 6.0
Used input:
Timestamp Variable A Variable B Variable C
0 2023-01-01 00:00:00.000 1.0 NaN NaN
1 2023-01-01 00:00:00.050 2.0 NaN NaN
2 2023-01-01 00:00:00.150 NaN 3.0 4.0
3 2023-01-01 05:00:00.450 NaN 5.0 6.0
4 2023-01-01 05:00:00.500 7.0 NaN NaN