I have this DataFrame:
df = pd.DataFrame({"val": [1, 2, 3, 5], "signal": [0, 1, 0, 0]})
df
val signal
0 1 0
1 2 1
2 3 0
3 5 0
Then I do:
df["target"] = np.where(df.signal, df.val + 3, np.nan)
df["target"] = df.target.ffill()
df["hit"] = df.val >= df.target
df
val signal target hit
0 1 0 NaN False
1 2 1 5.0 False
2 3 0 5.0 False
3 5 0 5.0 True
To see if my target has been hit.
Here's my issue: let's say that the starting DataFrame was this:
val signal
0 1 0
1 2 1
2 3 0
3 5 1 # <-- new signal
4 6 0 # <-- new row
If I do the same operations as before I get:
df["target"] = np.where(df.signal, df.val + 3, np.nan)
df["target"] = df.target.ffill()
df["hit"] = df.val >= df.target
df
val signal target hit
0 1 0 NaN False
1 2 1 5.0 False
2 3 0 5.0 False
3 5 1 7.0 False
4 6 0 7.0 False
Now I lost the hit on index 3, as target
has been replaced by the second signal.
What I would like is for signal
to not create a new target
if the previous target has not been hit yet.
Desired output (example 1):
val signal target hit
0 1 0 NaN False
1 2 1 5.0 False
2 3 0 5.0 False
3 5 1 5.0 True
4 6 0 NaN False
Desired output (example 2):
val signal target hit
0 1 0 NaN False
1 2 1 5.0 False
2 3 1 5.0 False
3 5 0 5.0 True
4 6 0 NaN False
Desired output (example 3):
val signal target hit
0 1 1 4.0 False
1 4 0 4.0 True
2 3 0 NaN False
3 4 1 7.0 False
4 7 0 7.0 True
Desired output (example 4):
val signal target hit
0 5 0 NaN False
1 3 1 6.0 False
2 6 1 6.0 True
3 2 1 5.0 False
4 7 0 5.0 True
P.S. Ideally, this needs to be done with vectorization as I'm going to perform this operation for millions of rows.
EDIT: Just so the logic is clearer, here's the "loopy" version of the algorithm:
def loopy_way(vals: list, signals: list) -> list:
active_trgt = None
hits = []
for val, signal in zip(vals, signals):
if active_trgt:
if val >= active_trgt: # Arbitrary logic
hits.append(True)
active_trgt = None
continue
# There's an active target, so ignore signal
hits.append(False)
continue
if signal:
active_trgt = val + 3 # Arbitrary condition
hits.append(False) # Couldn't be otherwise
continue
# No signal and no active target
hits.append(False)
return hits
You can look at both the new target and the previous target at each signal point using the .shift
method in pandas.
Tracking both will allow you to signal if either we are over the current or the previous target.
Additionally, you want to track what the largest historical value you have seen in the previous signal window. You can enumerate signal windows with df.signal.cumsum()
and then group by that window enumeration to get the cummax just per signal window with df.groupby(df.signal_window).val.cummax().shift(1)
.
As an additional condition for non-monotonic data, you can accept the candidate target in signal rows if it is less than the previous target.
Combining these, you can get your desired output.
I calculate and store these as intermediate columns below to show how the logic works, but you don't have to store and then drop them in your code.
Note: All of this said, it may not be worth vectorizing this calculation. Using numba or similar you could get a very fast implementation in a loop with more readable/maintainable code and substantial runtime memory savings since you don't have to do all your intermediate calculations for every row at once.
import numpy as np
import pandas as pd
df1 = pd.DataFrame({
"val": [1, 2, 3, 5, 6], "signal": [0, 1, 0, 1, 0],
})
df2 = pd.DataFrame({
"val": [1, 2, 3, 5, 6], "signal": [0, 1, 1, 0, 0],
})
df3 = pd.DataFrame({
"val": [1, 4, 3, 4, 7], "signal": [1, 0, 0, 1, 0],
})
df4 = pd.DataFrame({
"val": [5, 3, 6, 2, 7], "signal": [0, 1, 1, 1, 0],
})
for df in [df1, df2, df3, df4]:
# add candidate target at signal times
df["candidate_target"] = np.where(df.signal, df.val + 3, np.nan)
# track previous target at signal times
df["prev_target"] = np.where(
df.signal,
df.candidate_target.ffill().shift(1),
np.nan
)
# enumerate the signal windows with cumsum
df["signal_window"] = df.signal.cumsum()
# track max value we have seen in previous signal window
df["max_to_date"] = df.groupby(df.signal_window).val.cummax().shift(1)
# for signal rows, actual target is candidate if previous has been exceeded, else previous
df["signal_target"] = np.where(
(df.max_to_date >= df.prev_target) | df.prev_target.isnull() | (df.prev_target > df.candidate_target),
df.candidate_target,
df.prev_target
)
# for non-signal rows, add target only if it has not been hit
df["non_signal_target"] = np.where(
(df.signal == 0) & (df.max_to_date < df.signal_target.ffill()),
df.signal_target.ffill(),
np.nan,
)
# combine signal target and non-signal target rows
df["target"] = df.signal_target.fillna(df.non_signal_target)
# hit is where value exceeds or equal to target
df["hit"] = df.val >= df.target
# drop intermediate calculations
df.drop(["max_to_date", "signal_target", "signal_window", "non_signal_target", "candidate_target", "prev_target"], axis=1, inplace=True)
print(df)
#> val signal target hit
#> 0 1 0 NaN False
#> 1 2 1 5.0 False
#> 2 3 0 5.0 False
#> 3 5 1 5.0 True
#> 4 6 0 NaN False
#> val signal target hit
#> 0 1 0 NaN False
#> 1 2 1 5.0 False
#> 2 3 1 5.0 False
#> 3 5 0 5.0 True
#> 4 6 0 NaN False
#> val signal target hit
#> 0 1 1 4.0 False
#> 1 4 0 4.0 True
#> 2 3 0 NaN False
#> 3 4 1 7.0 False
#> 4 7 0 7.0 True
#> val signal target hit
#> 0 5 0 NaN False
#> 1 3 1 6.0 False
#> 2 6 1 6.0 True
#> 3 2 1 5.0 False
#> 4 7 0 5.0 True