pythonpandasnumpynumbavaex

How to iterate with result of previous rows of same column?


Starting from a Data Frame with the columns A B D P:

import numba
import numpy as np
import pandas as pd
import vaex

d = {'A':[0,1,2,3,4,5,6],'B':[30,35,32,35,31,39,37],'D':[12,10,13,19,12,21,13],'P':[3,3,3,3,3,3,3]}
df = pd.DataFrame(data=d)

df['C'] = 0

df

Exporting to hdf5:

df.to_csv("v_df", index=False)
    
df = vaex.from_csv("v_df", convert=True, chunk_size=5_000_000)

I need to calculate the column 'C' with each row with a desired result as follow:

'C': [0, 1.666666667, 0.552380952, 2.333630952, 0.962202381, 6.38155722, 5.714890553]

To reach this result I need to iterate each row starting on row (1) and calculate something like:

%%time
@numba.njit
def func(B, D, C, b_0=0):
    n = len(B)
    b = np.full(n, b_0, dtype=np.float64)
    for i in range(1, n):
        b[i] = ((((B[i] - B[i - 1]) / B[i - 1])) * D[i]) + C[i-1]
    return b
df['C'] = func(df['B'].to_numpy(),df['D'].to_numpy(),df['C'].to_numpy())
df

But it doesn't work.

It results ok for the first and second row:

'C': [0, 1.666666667, -1.114286, 1.781250, -1.371429, 5.419355, -0.666667]

I've tried the 'shift' (suggestion by sammywemmy). It works ok for this example data frame before 'vaex.from_csv', but when implemented the concept to the big data frame via vaex, shift is not working.

So, the question is: Is there a 'good practice' way to perform this calculus (iterate the result of a row in column 'C' considering the previous rows in 'C') with a large data source (I´m using vaex to import from a 1 million rows csv)?

Thanks in advance.


Solution

  • I am not sure it is the optimal solution but at least it is working: you can use the apply method with vectorize set to True.

    Here is the full snippet:

    import numba
    import numpy as np
    import pandas as pd
    import vaex
    
    d = {'A':[0,1,2,3,4,5,6],'B':[30,35,32,35,31,39,37],'D':[12,10,13,19,12,21,13],'P':[3,3,3,3,3,3,3]}
    df = pd.DataFrame(data=d)
    
    df['C'] = 0
    
    df
    
    # I removed the b_0 for simplicity
    def my_func(B, D, C):
        n = len(B)
        b = np.full(n, 0, dtype=np.float64)
        for i in range(1, n):
            b[i] = ((((B[i] - B[i - 1]) / B[i - 1])) * D[i]) + C[i-1]
    
        return b
    
    df_vaex = vaex.from_pandas(df)
    
    df_vaex.apply(my_func, arguments=[df_vaex["B"], df_vaex["D"], df_vaex["C"]], vectorize=True)
    

    It gives the expected output:

    0          0
    1    1.66667
    2   -1.11429
    3    1.78125
    4   -1.37143
    5    5.41935
    6  -0.666667
    

    Basically, apply method allows to apply functions row by row and if you set vectorize to True the full arrays will be passed instead of the row values. In your case it is mandatory since you need the values from the previous rows to compute the C value in a given row.