Starting from a Data Frame with the columns A B D P:
import numba
import numpy as np
import pandas as pd
import vaex
d = {'A':[0,1,2,3,4,5,6],'B':[30,35,32,35,31,39,37],'D':[12,10,13,19,12,21,13],'P':[3,3,3,3,3,3,3]}
df = pd.DataFrame(data=d)
df['C'] = 0
df
Exporting to hdf5:
df.to_csv("v_df", index=False)
df = vaex.from_csv("v_df", convert=True, chunk_size=5_000_000)
I need to calculate the column 'C' with each row with a desired result as follow:
'C': [0, 1.666666667, 0.552380952, 2.333630952, 0.962202381, 6.38155722, 5.714890553]
To reach this result I need to iterate each row starting on row (1) and calculate something like:
%%time
@numba.njit
def func(B, D, C, b_0=0):
n = len(B)
b = np.full(n, b_0, dtype=np.float64)
for i in range(1, n):
b[i] = ((((B[i] - B[i - 1]) / B[i - 1])) * D[i]) + C[i-1]
return b
df['C'] = func(df['B'].to_numpy(),df['D'].to_numpy(),df['C'].to_numpy())
df
But it doesn't work.
It results ok for the first and second row:
'C': [0, 1.666666667, -1.114286, 1.781250, -1.371429, 5.419355, -0.666667]
I've tried the 'shift' (suggestion by sammywemmy). It works ok for this example data frame before 'vaex.from_csv', but when implemented the concept to the big data frame via vaex, shift is not working.
So, the question is: Is there a 'good practice' way to perform this calculus (iterate the result of a row in column 'C' considering the previous rows in 'C') with a large data source (I´m using vaex to import from a 1 million rows csv)?
Thanks in advance.
I am not sure it is the optimal solution but at least it is working: you can use the apply method with vectorize
set to True
.
Here is the full snippet:
import numba
import numpy as np
import pandas as pd
import vaex
d = {'A':[0,1,2,3,4,5,6],'B':[30,35,32,35,31,39,37],'D':[12,10,13,19,12,21,13],'P':[3,3,3,3,3,3,3]}
df = pd.DataFrame(data=d)
df['C'] = 0
df
# I removed the b_0 for simplicity
def my_func(B, D, C):
n = len(B)
b = np.full(n, 0, dtype=np.float64)
for i in range(1, n):
b[i] = ((((B[i] - B[i - 1]) / B[i - 1])) * D[i]) + C[i-1]
return b
df_vaex = vaex.from_pandas(df)
df_vaex.apply(my_func, arguments=[df_vaex["B"], df_vaex["D"], df_vaex["C"]], vectorize=True)
It gives the expected output:
0 0
1 1.66667
2 -1.11429
3 1.78125
4 -1.37143
5 5.41935
6 -0.666667
Basically, apply
method allows to apply functions row by row and if you set vectorize
to True
the full arrays will be passed instead of the row values. In your case it is mandatory since you need the values from the previous rows to compute the C value in a given row.