[SOLVED] Vectorisation of loop accessing preceding values on a row by row basis

I need to iterate over a dataframe and set each value in column 2 from the previous row's values in column 1 and column 2. I am looking for ways to vectorise this or otherwise speed it up, as the loop below is quite costly. The first row is set to 0 because it has no preceding row.

import pandas as pd

DECAY = 0.9

df = pd.DataFrame({'column1': [1, 2, 3, 4, 5]})

for idx, row in df.iterrows():
    if idx == 0:
        df.loc[idx, 'column2'] = 0
    else:
        df.loc[idx, 'column2'] = (df.loc[idx - 1, 'column1'] + df.loc[idx - 1, 'column2']) * DECAY

Expected output:

   column1  column2
0        1   0.0000
1        2   0.9000
2        3   2.6100
3        4   5.0490
4        5   8.1441

Solution

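Unrolling the recurrence gives column2[i] = sum over j < i of DECAY**(i - j) * column1[j], which is a matrix-vector product. You can build that matrix of DECAY powers with numpy and take its dot product with column1: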

import numpy as np
import pandas as pd

df = pd.DataFrame({"column1": [1, 2, 3, 4, 5]})

DECAY = 0.9

n = len(df)
dm = np.tril(np.power(DECAY, np.subtract.outer(np.arange(n), np.arange(n))), -1)
df["column2"] = dm.dot(df["column1"])

   column1  column2
0        1   0.0000
1        2   0.9000
2        3   2.6100
3        4   5.0490
4        5   8.1441

Intermediate steps for clarity:

Create a matrix m holding the pairwise differences i - j for all i, j in the range 0 to n - 1:

m = np.subtract.outer(np.arange(n), np.arange(n))

[[ 0 -1 -2 -3 -4]
 [ 1  0 -1 -2 -3]
 [ 2  1  0 -1 -2]
 [ 3  2  1  0 -1]
 [ 4  3  2  1  0]]

Raise DECAY to the power of each element in m:

p = np.power(DECAY, m)

[[1.         1.11111111 1.2345679  1.37174211 1.5241579 ]
 [0.9        1.         1.11111111 1.2345679  1.37174211]
 [0.81       0.9        1.         1.11111111 1.2345679 ]
 [0.729      0.81       0.9        1.         1.11111111]
 [0.6561     0.729      0.81       0.9        1.        ]]

Keep only the strict lower triangle of the matrix (k=-1 also zeroes the diagonal, so each row uses preceding values only):

dm = np.tril(p, -1)

[[0.     0.     0.     0.     0.    ]
 [0.9    0.     0.     0.     0.    ]
 [0.81   0.9    0.     0.     0.    ]
 [0.729  0.81   0.9    0.     0.    ]
 [0.6561 0.729  0.81   0.9    0.    ]]
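
As a sanity check, the matrix result can be compared against the original recurrence. The sketch below is not part of the original answer: the names x, expected and result are illustrative, and the recurrence is run over a plain NumPy array rather than through df.loc.

```python
import numpy as np
import pandas as pd

DECAY = 0.9
df = pd.DataFrame({"column1": [1, 2, 3, 4, 5]})
x = df["column1"].to_numpy(dtype=float)
n = len(x)

# Reference: the recurrence from the question, run over a NumPy array
expected = np.zeros(n)
for i in range(1, n):
    expected[i] = (x[i - 1] + expected[i - 1]) * DECAY

# Matrix version: dm[i, j] = DECAY ** (i - j) for j < i, else 0
idx = np.arange(n)
dm = np.tril(np.power(DECAY, np.subtract.outer(idx, idx)), -1)
result = dm.dot(x)

# Compare with a tolerance, since the two paths round differently
print(np.allclose(result, expected))
```

For the five-row example this should print True. The reference loop is itself a linear-time option if a Python loop over an array is acceptable, since most of the cost in the question's version comes from iterrows and df.loc.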

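Alternative approach using less memory. The matrix above needs n x n floats (about 28.8 GB of float64 for 60,000 rows), whereas this version keeps only arrays of length n. It still does O(n^2) work in total and takes on average 2.7 seconds for the 60,000-row dataframe below on my machine: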

import numpy as np
import pandas as pd

df = pd.DataFrame({"column1": np.arange(1, 60001)})

DECAY = 0.9

n = len(df)
c = np.zeros(n)
factors = np.power(DECAY, np.arange(n))

for i in range(1, n):
    c[i] = np.dot(factors[1 : i + 1][::-1], df["column1"][:i])

df["column2"] = c

       column1      column2
0            1       0.0000
1            2       0.9000
2            3       2.6100
3            4       5.0490
4            5       8.1441
...        ...          ...
59995    59996  539874.0000
59996    59997  539883.0000
59997    59998  539892.0000
59998    59999  539901.0000
59999    60000  539910.0000

[60000 rows x 2 columns]
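
If DECAY lies in [0, 1), the same recurrence can also be written with pandas' exponentially weighted mean, which avoids both the n x n matrix and the Python loop. This is a sketch under that assumption, not part of the original answer; prev and smoothed are illustrative names.

```python
import numpy as np
import pandas as pd

DECAY = 0.9  # assumption: 0 <= DECAY < 1, so that alpha = 1 - DECAY is valid for ewm

df = pd.DataFrame({"column1": np.arange(1, 60001)})

# Shift so that each row sees the previous row's column1; row 0 sees 0
prev = df["column1"].shift(fill_value=0).astype(float)

# With adjust=False, ewm computes
#   y[0] = prev[0]
#   y[t] = DECAY * y[t-1] + (1 - DECAY) * prev[t]
smoothed = prev.ewm(alpha=1 - DECAY, adjust=False).mean()

# Rescale so the weight on prev[t] is DECAY rather than (1 - DECAY),
# which gives column2[t] = DECAY * (column2[t-1] + column1[t-1])
df["column2"] = smoothed * DECAY / (1 - DECAY)

print(df.head())
print(df.tail())
```

The rescaling works because the recurrence is linear: multiplying y by DECAY / (1 - DECAY) turns the (1 - DECAY) weight into DECAY without changing the DECAY factor on the previous value. Results should agree with the outputs above up to floating-point rounding.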