pythonpandastradingback-testing

Backtesting problem with Pandas and vectorization


I have this DataFrame:

df = pd.DataFrame({"val": [1, 2, 3, 5], "signal": [0, 1, 0, 0]})
df
    val signal
0   1   0
1   2   1
2   3   0
3   5   0

Then I do:

df["target"] = np.where(df.signal, df.val + 3, np.nan)
df["target"] = df.target.ffill()
df["hit"] = df.val >= df.target
df
    val signal  target  hit
0   1   0       NaN     False
1   2   1       5.0     False
2   3   0       5.0     False
3   5   0       5.0     True

To see if my target has been hit.

Here's my issue: let's say that the starting DataFrame was this:

    val signal
0   1   0
1   2   1
2   3   0
3   5   1 # <-- new signal
4   6   0 # <-- new row

If I do the same operations as before I get:

df["target"] = np.where(df.signal, df.val + 3, np.nan)
df["target"] = df.target.ffill()
df["hit"] = df.val >= df.target
df
    val signal  target  hit
0   1   0       NaN     False
1   2   1       5.0     False
2   3   0       5.0     False
3   5   1       7.0     False
4   6   0       7.0     False

Now I lost the hit on index 3, as target has been replaced by the second signal.

What I would like is for signal to not create a new target if the previous target has not been hit yet.

Desired output (example 1):

    val signal  target  hit
0   1   0       NaN     False
1   2   1       5.0     False
2   3   0       5.0     False
3   5   1       5.0     True
4   6   0       NaN     False

Desired output (example 2):

    val signal  target  hit
0   1   0       NaN     False
1   2   1       5.0     False
2   3   1       5.0     False
3   5   0       5.0     True
4   6   0       NaN     False

Desired output (example 3):

    val signal  target  hit
0   1   1       4.0     False
1   4   0       4.0     True
2   3   0       NaN     False
3   4   1       7.0     False
4   7   0       7.0     True

Desired output (example 4):

    val signal  target  hit
0   5   0       NaN     False
1   3   1       6.0     False
2   6   1       6.0     True
3   2   1       5.0     False
4   7   0       5.0     True

P.S. Ideally, this needs to be done with vectorization as I'm going to perform this operation for millions of rows.

EDIT: Just so the logic is clearer, here's the "loopy" version of the algorithm:

def loopy_way(vals: list, signals: list) -> list:
    active_trgt = None
    hits = []
    for val, signal in zip(vals, signals):

        if active_trgt:
            if val >= active_trgt: # Arbitrary logic
                hits.append(True)
                active_trgt = None
                continue

            # There's an active target, so ignore signal
            hits.append(False)
            continue

        if signal:
            active_trgt = val + 3 # Arbitrary condition
            hits.append(False) # Couldn't be otherwise
            continue

        # No signal and no active target
        hits.append(False)
    return hits

Solution

  • You can look at both the new target and the previous target at each signal point using the .shift method in pandas.

    Tracking both will allow you to signal if either we are over the current or the previous target.

    Additionally, you want to track what the largest historical value you have seen in the previous signal window. You can enumerate signal windows with df.signal.cumsum() and then group by that window enumeration to get the cummax just per signal window with df.groupby(df.signal_window).val.cummax().shift(1).

    As an additional condition for non-monotonic data, you can accept the candidate target in signal rows if it is less than the previous target.

    Combining these, you can get your desired output.

    I calculate and store these as intermediate columns below to show how the logic works, but you don't have to store and then drop them in your code.

    Note: All of this said, it may not be worth vectorizing this calculation. Using numba or similar you could get a very fast implementation in a loop with more readable/maintainable code and substantial runtime memory savings since you don't have to do all your intermediate calculations for every row at once.

    import numpy as np
    import pandas as pd
    
    df1 = pd.DataFrame({
        "val": [1, 2, 3, 5, 6], "signal": [0, 1, 0, 1, 0],
    })
    
    df2 = pd.DataFrame({
        "val": [1, 2, 3, 5, 6], "signal": [0, 1, 1, 0, 0],
    })
    
    df3 = pd.DataFrame({
        "val": [1, 4, 3, 4, 7], "signal": [1, 0, 0, 1, 0],
    })
    
    df4 = pd.DataFrame({
        "val": [5, 3, 6, 2, 7], "signal": [0, 1, 1, 1, 0],
    })
    
    for df in [df1, df2, df3, df4]:
        # add candidate target at signal times
        df["candidate_target"] = np.where(df.signal, df.val + 3, np.nan)
        
        # track previous target at signal times
        df["prev_target"] = np.where(
            df.signal,
            df.candidate_target.ffill().shift(1),
            np.nan
        )
        
        # enumerate the signal windows with cumsum
        df["signal_window"] = df.signal.cumsum()
        
        # track max value we have seen in previous signal window
        df["max_to_date"] = df.groupby(df.signal_window).val.cummax().shift(1)
        
        # for signal rows, actual target is candidate if previous has been exceeded, else previous
        df["signal_target"] = np.where(
            (df.max_to_date >= df.prev_target) | df.prev_target.isnull() | (df.prev_target > df.candidate_target),
            df.candidate_target,
            df.prev_target
        )
        
        # for non-signal rows, add target only if it has not been hit
        df["non_signal_target"] = np.where(
            (df.signal == 0) & (df.max_to_date < df.signal_target.ffill()),
            df.signal_target.ffill(),
            np.nan,
        )
        
        # combine signal target and non-signal target rows
        df["target"] = df.signal_target.fillna(df.non_signal_target)
        
        # hit is where value exceeds or equal to target
        df["hit"] = df.val >= df.target
        
        # drop intermediate calculations
        df.drop(["max_to_date", "signal_target", "signal_window", "non_signal_target", "candidate_target", "prev_target"], axis=1, inplace=True)
        
        print(df)
    #> val  signal  target    hit
    #> 0    1       0     NaN  False
    #> 1    2       1     5.0  False
    #> 2    3       0     5.0  False
    #> 3    5       1     5.0   True
    #> 4    6       0     NaN  False
    #>    val  signal  target    hit
    #> 0    1       0     NaN  False
    #> 1    2       1     5.0  False
    #> 2    3       1     5.0  False
    #> 3    5       0     5.0   True
    #> 4    6       0     NaN  False
    #>    val  signal  target    hit
    #> 0    1       1     4.0  False
    #> 1    4       0     4.0   True
    #> 2    3       0     NaN  False
    #> 3    4       1     7.0  False
    #> 4    7       0     7.0   True
    #>    val  signal  target    hit
    #> 0    5       0     NaN  False
    #> 1    3       1     6.0  False
    #> 2    6       1     6.0   True
    #> 3    2       1     5.0  False
    #> 4    7       0     5.0   True