I am trying to efficiently clean up a Pandas time series that contains a daily cumulative rain precipitation sum. Sometimes, multiple consecutive values are lower than the previous value. I would like to replace those values with the last correct value.
Note that the time series spans multiple days, and the cumulative sum therefore resets at the first timestamp of each day.
Example data, with erroneous values on the first day at timestamps 2024-08-12 14:14:48+00:00 and 2024-08-12 14:19:55+00:00:
RR_TODAY day day_change
timestamp
2024-08-12 14:04:57+00:00 0.00 2024-08-13 00:00:00+10:00 True
2024-08-12 14:09:58+00:00 1.50 2024-08-13 00:00:00+10:00 False
2024-08-12 14:14:48+00:00 1.40 2024-08-13 00:00:00+10:00 False
2024-08-12 14:19:55+00:00 1.40 2024-08-13 00:00:00+10:00 False
2024-08-12 14:24:58+00:00 1.50 2024-08-13 00:00:00+10:00 False
2024-08-12 14:24:58+00:00 1.60 2024-08-13 00:00:00+10:00 False
2024-08-14 13:39:59+00:00 0.91 2024-08-14 00:00:00+10:00 True
2024-08-14 13:44:52+00:00 1.01 2024-08-14 00:00:00+10:00 False
2024-08-14 13:49:56+00:00 1.40 2024-08-14 00:00:00+10:00 False
2024-08-14 13:54:51+00:00 9.91 2024-08-14 00:00:00+10:00 False
2024-08-14 13:59:53+00:00 9.91 2024-08-14 00:00:00+10:00 False
What I would like to end up with:
RR_TODAY day day_change
timestamp
2024-08-12 14:04:57+00:00 0.00 2024-08-13 00:00:00+10:00 True
2024-08-12 14:09:58+00:00 1.50 2024-08-13 00:00:00+10:00 False
2024-08-12 14:14:48+00:00 1.50 2024-08-13 00:00:00+10:00 False
2024-08-12 14:19:55+00:00 1.50 2024-08-13 00:00:00+10:00 False
2024-08-12 14:24:58+00:00 1.50 2024-08-13 00:00:00+10:00 False
2024-08-12 14:24:58+00:00 1.60 2024-08-13 00:00:00+10:00 False
2024-08-14 13:39:59+00:00 0.91 2024-08-14 00:00:00+10:00 True
2024-08-14 13:44:52+00:00 1.01 2024-08-14 00:00:00+10:00 False
2024-08-14 13:49:56+00:00 1.40 2024-08-14 00:00:00+10:00 False
2024-08-14 13:54:51+00:00 9.91 2024-08-14 00:00:00+10:00 False
2024-08-14 13:59:53+00:00 9.91 2024-08-14 00:00:00+10:00 False
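For reference, the example frame can be rebuilt roughly as below; the values are copied from the tables above, and "Australia/Brisbane" (UTC+10) is only an assumed stand-in for station.timezone, chosen because the day column shows a +10:00 offset.
import pandas as pd

# Rough reconstruction of the example frame shown above.
# "Australia/Brisbane" is an assumption standing in for station.timezone.
timestamps = pd.to_datetime([
    "2024-08-12 14:04:57", "2024-08-12 14:09:58", "2024-08-12 14:14:48",
    "2024-08-12 14:19:55", "2024-08-12 14:24:58", "2024-08-12 14:24:58",
    "2024-08-14 13:39:59", "2024-08-14 13:44:52", "2024-08-14 13:49:56",
    "2024-08-14 13:54:51", "2024-08-14 13:59:53",
], utc=True)
df = pd.DataFrame(
    {"RR_TODAY": [0.00, 1.50, 1.40, 1.40, 1.50, 1.60,
                  0.91, 1.01, 1.40, 9.91, 9.91]},
    index=pd.Index(timestamps, name="timestamp"),
)
df["day"] = df.index.tz_convert("Australia/Brisbane").floor("D")
df["day_change"] = df["day"] != df["day"].shift(periods=1)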
I came up with the following solution, which works but is highly inefficient:
import numpy as np
import pandas as pd


def cleanup_RR_TODAY(df: pd.DataFrame) -> pd.Series:
    """
    Clean up the RR_TODAY measure by raising any value that is lower than
    the running maximum seen so far for the current day.
    :param df: The DataFrame to clean up
    :return: The cleaned-up RR_TODAY series
    """
    max_value = np.nan
    clean_series = pd.Series(dtype="float64")
    for index, value in df.iterrows():
        if (
            np.isnan(max_value)
            or value["day_change"]
            or (not np.isnan(value["RR_TODAY"]) and value["RR_TODAY"] > max_value)
        ):
            max_value = value["RR_TODAY"]
            clean_series[index] = max_value
        elif value["RR_TODAY"] < max_value:
            clean_series[index] = max_value
        else:
            clean_series[index] = value["RR_TODAY"]
    return clean_series
df["day"] = df.index.tz_convert(station.timezone).floor("D")
df["day_change"] = df["day"] != df["day"].shift(periods=1)
df["RR_TODAY"] = cleanup_RR_TODAY(df)
I am looking for a vectorised solution that would run much faster.
It seems like the following should work. It is vectorized and I suspect it should run faster at scale.
df["day"] = df.index.tz_convert(station.timezone).floor("D")
df["RR_MAX"] = (df.groupby("day")["RR_TODAY"]
.transform("cummax"))
df["RR_TODAY"] = df[["RR_TODAY", "RR_MAX"]].max(axis = 1)
The RR_MAX column keeps track of the maximum RR_TODAY value seen so far in the given day, with NaNs zeroed out so that they do not change the corresponding RR_TODAY entry.
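As a quick illustration of how the grouped cummax behaves (a standalone toy example, not the original data), the dip within a day is lifted back up and the running maximum restarts on the next day:
import pandas as pd

# Toy frame: one dip (1.4 after 1.5) on day d1, then a fresh start on d2
toy = pd.DataFrame({
    "day": ["d1", "d1", "d1", "d2", "d2"],
    "RR_TODAY": [0.0, 1.5, 1.4, 1.4, 1.6],
})
toy["RR_MAX"] = toy.groupby("day")["RR_TODAY"].transform("cummax")
print(toy["RR_MAX"].tolist())  # [0.0, 1.5, 1.5, 1.4, 1.6]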
EDIT: Apparently, the pandas max function ignores NaNs (unlike the NumPy max function), so there's no need to zero them out.
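A quick way to check that behaviour (standalone sketch):
import numpy as np
import pandas as pd

frame = pd.DataFrame({"a": [1.0, np.nan], "b": [2.0, 3.0]})
print(frame.max(axis=1).tolist())    # [2.0, 3.0] -> pandas skips the NaN
print(np.max(frame.values, axis=1))  # [ 2. nan]  -> numpy propagates it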