pythonpandasvectorization

Vectorisation of loop accessing preceding values on a row by row basis


I need to roll through a dataframe and update the estimate in column 2 based on the previous value in column 1 and column 2. I am looking for ideas on how to vectorise this approach/speed it up as it is currently quite costly in the loop format below. The first row is set to 0 as there is no preceding value.

df = pd.DataFrame({'column1': [1, 2, 3, 4, 5]})

for idx, row in df.iterrows():
    if idx == 0:
        df.loc[idx, 'column2'] = 0
    else:
        df.loc[idx, 'column2'] = (df.loc[idx - 1, 'column1'] + df.loc[idx - 1, 'column2']) * DECAY

# Expected output -

   column1  column2
0        1      0.0
1        2      0.9
2        3      2.61
3        4      5.049
4        5     8.1441

Solution

  • You can use numpy to get the dot product of a matrix that takes DECAY into account:

    df = pd.DataFrame({"column1": [1, 2, 3, 4, 5]})
    
    DECAY = 0.9
    
    n = len(df)
    dm = np.tril(np.power(DECAY, np.subtract.outer(np.arange(n), np.arange(n))), -1)
    df["column2"] = dm.dot(df["column1"])
    
       column1  column2
    0        1   0.0000
    1        2   0.9000
    2        3   2.6100
    3        4   5.0490
    4        5   8.1441
    

    Intermediate steps for clarity:

    Create a matrix m with the subtraction of all pairs from an array with range [0, n]:

    m = np.subtract.outer(np.arange(n), np.arange(n))
    
    [[ 0 -1 -2 -3 -4]
     [ 1  0 -1 -2 -3]
     [ 2  1  0 -1 -2]
     [ 3  2  1  0 -1]
     [ 4  3  2  1  0]]
    

    Elevate DECAY to the power of each element in m:

    p = np.power(DECAY, m)
    
    [[1.         1.11111111 1.2345679  1.37174211 1.5241579 ]
     [0.9        1.         1.11111111 1.2345679  1.37174211]
     [0.81       0.9        1.         1.11111111 1.2345679 ]
     [0.729      0.81       0.9        1.         1.11111111]
     [0.6561     0.729      0.81       0.9        1.        ]]
    

    Keep the lower triangle of the matrix:

    dm = np.tril(p, -1)
    
    [[0.     0.     0.     0.     0.    ]
     [0.9    0.     0.     0.     0.    ]
     [0.81   0.9    0.     0.     0.    ]
     [0.729  0.81   0.9    0.     0.    ]
     [0.6561 0.729  0.81   0.9    0.    ]]
    

    Alternative approach using less memory, taking on average 2.7 seconds for the example dataframe on my machine:

    df = pd.DataFrame({"column1": np.arange(1, 60001)})
    
    DECAY = 0.9
    
    n = len(df)
    c = np.zeros(n)
    factors = np.power(DECAY, np.arange(n))
    
    for i in range(1, n):
        c[i] = np.dot(factors[1 : i + 1][::-1], df["column1"][:i])
    
    df["column2"] = c
    
           column1      column2
    0            1       0.0000
    1            2       0.9000
    2            3       2.6100
    3            4       5.0490
    4            5       8.1441
    ...        ...          ...
    59995    59996  539874.0000
    59996    59997  539883.0000
    59997    59998  539892.0000
    59998    59999  539901.0000
    59999    60000  539910.0000
    
    [60000 rows x 2 columns]