In a dataframe with 12 columns and 4 million rows, I need to add a column that holds, for each row, the maximum number of consecutive columns with values above zero.
Here's a sample:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[284.77, 234.37, 243.8, 84.36, 0., 0., 0., 55.04, 228.2, 181.97, 0., 0.],
[13.78, 0., 38.58, 33.16, 0., 38.04, 74.02, 45.74, 27.2, 9.19, 0., 0.],
[88.66, 255.72, 323.19, 7.24, 0., 73.38, 45.73, 0., 0., 77.39, 26.57, 279.34],
[0., 0., 34.42, 9.16, 0., 43.4, 42.17, 123.69, 60.5, 25.47, 72.32, 7.29],
[320.6, 1445.56, 856.23, 371.21, 0., 244.22, 134.58, 631.59, 561.82, 1172.44, 895.68, 186.28],
[0., 0., 32.29, 1000.91, 0., 680., 585.46, 466.6, 0., 493.48, 157.1, 125.31]]),
columns=[1,2,3,4,5,6,7,8,9,10,11,12])
And here's an example of my goal:
df['MAX_CONSECUTIVE_COL'] = pd.Series([4,5,4,7,7,3])
Due to the size of the dataframe, performance is a must for the solution.
I've tried to mask the data with boolean values and do a cumulative sum to identify each group of consecutive columns with values == 0 or != 0:
((df>0) != (df>0).shift(axis=1)).cumsum(axis=1)
Then I got the result for one row:
((df>0) != (df>0).shift(axis=1)).cumsum(axis=1).iloc[0]
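If I've understood the trick correctly, for the first row this should produce one group id per column, roughly:
# columns 1-12 -> group ids 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4
# (a new id starts whenever the row switches between zero and non-zero)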
Applied a value_counts and transformed the result into a dataframe:
pd.DataFrame(((df>0) != (df>0).shift(axis=1)).cumsum(axis=1).iloc[0].value_counts())
Applied a sort_values:
pd.DataFrame(((df>0) != (df>0).shift(axis=1)).cumsum(axis=1).iloc[0].value_counts()).sort_values('count', ascending=False)
And finally got the first value (the max number of consecutive columns with values != 0 or == 0):
pd.DataFrame(((df>0) != (df>0).shift(axis=1)).cumsum(axis=1).iloc[0].value_counts()).sort_values('count', ascending=False).iloc[0,0]
Now, I've got a problem: I don't know how to filter only the consecutive columns with values != 0.
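One idea I had, though I'm not sure it's the right approach, is to mask the group ids with the boolean frame first, so that value_counts only sees the groups where the values are > 0 (assuming df holds only the 12 value columns):
mask = df > 0
groups = (mask != mask.shift(axis=1)).cumsum(axis=1)
# keep group ids only where the value is positive; the largest count is then the longest run
groups.where(mask).iloc[0].value_counts().max()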
But let's assume this method worked and we now have the number of consecutive columns with values != 0 for the first row. The only solution I was able to come up with to get the results for the other rows is iterating over each one.
Something like this:
df['MAX_CONSECUTIVE_COL'] = 0
for n in range(df.shape[0]):
    df.loc[df.index[n], 'MAX_CONSECUTIVE_COL'] = pd.DataFrame(((df>0) != (df>0).shift(axis=1)).cumsum(axis=1).iloc[n].value_counts()).sort_values('count', ascending=False).iloc[0,0]
But remember we have 4 million rows, so this iteration would take a very long time to complete, and that's the second problem I have.
If performance is a concern, I'd consider using numba:
from numba import njit, prange

@njit(parallel=True)
def get_max(matrix, out):
    n, m = matrix.shape
    for row in prange(n):                  # rows are processed in parallel
        mx, cnt = 0, 0
        for col in range(m - 1):           # -1 because the last column is OUT
            if matrix[row, col] > 0:
                cnt += 1                   # extend the current run of positives
                mx = max(mx, cnt)
            else:
                cnt = 0                    # a zero breaks the run
        out[row] = mx

df["OUT"] = 0
get_max(df.values, df["OUT"].values)
print(df)
Prints:
1 2 3 4 5 6 7 8 9 10 11 12 OUT
0 284.77 234.37 243.80 84.36 0.0 0.00 0.00 55.04 228.20 181.97 0.00 0.00 4
1 13.78 0.00 38.58 33.16 0.0 38.04 74.02 45.74 27.20 9.19 0.00 0.00 5
2 88.66 255.72 323.19 7.24 0.0 73.38 45.73 0.00 0.00 77.39 26.57 279.34 4
3 0.00 0.00 34.42 9.16 0.0 43.40 42.17 123.69 60.50 25.47 72.32 7.29 7
4 320.60 1445.56 856.23 371.21 0.0 244.22 134.58 631.59 561.82 1172.44 895.68 186.28 7
5 0.00 0.00 32.29 1000.91 0.0 680.00 585.46 466.60 0.00 493.48 157.10 125.31 3
Quick benchmark:
from time import monotonic
# >4 million rows
df = pd.concat([df] * 800_000, ignore_index=True)
start_time = monotonic()
df["OUT"] = 0
get_max(df.values, df["OUT"].values)  # already compiled by the earlier call, so this times execution only
print(monotonic() - start_time)
Prints on my computer (AMD 5700x):
0.21921994583681226
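For comparison, if you'd rather avoid the numba dependency, a fully vectorized pandas sketch of the same idea could look like this (assuming df holds only the 12 value columns; the OUT_PANDAS name is just for illustration, and I haven't benchmarked it against the numba version):
mask = df > 0
cum = mask.cumsum(axis=1)                         # running count of positive cells per row
reset = cum.where(~mask).ffill(axis=1).fillna(0)  # count value at the most recent zero
df["OUT_PANDAS"] = (cum - reset).max(axis=1).astype(int)  # longest run of positives per row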