I have a pandas Series as follows:
1 1
2 2
3 3
4 4
5 0
6 0
7 1
8 2
9 3
10 0
11 0
12 0
13 0
14 1
15 2
I have to arrange it in the following format:
1 1
2 2
3 3
4 4
5 0
6 0
7 3 ---> 4-2+1 (last value of the previous non-zero run - length of the preceding zero run + current value)
8 4 ---> 4-2+2 (last value of the previous non-zero run - length of the preceding zero run + current value)
9 5 ---> 4-2+3 (last value of the previous non-zero run - length of the preceding zero run + current value)
10 0
11 0
12 0
13 0
14 2 ---> 5-4+1 (last value of the previous non-zero run - length of the preceding zero run + current value)
15 3 ---> 5-4+2 (last value of the previous non-zero run - length of the preceding zero run + current value)
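In plain Python the rule would look roughly like this (just a restatement of the logic above; I am looking for a pandas way to do it):
def transform(values):
    out = []
    base = 0   # last output value of the previous non-zero run
    zeros = 0  # length of the zero run preceding the current position
    for i, v in enumerate(values):
        if v == 0:
            if i > 0 and values[i - 1] != 0:  # a non-zero run just ended
                base = out[-1]
                zeros = 0
            zeros += 1
            out.append(0)
        else:
            out.append(base - zeros + v)
    return out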
I am stuck here. So far I have only been able to produce a series with the running count of consecutive zeros:
zero = ser.eq(0).groupby(ser.ne(0).cumsum()).cumsum()  # count of zeros within each run, reset after every non-zero value
which gave me:
1 0
2 0
3 0
4 0
5 1
6 2
7 0
8 0
9 0
10 1
11 2
12 3
13 4
14 0
15 0
If someone is willing to assist with this, here is the boilerplate for this problem, which will create the above series:
d = {'1': 1, '2': 2, '3': 3, '4':4, '5':0, '6':0, '7':1, '8':2, '9':3, '10':0, '11':0, '12':0, '13':0, '14':1, '15':2}
ser = pd.Series(data=d)
Although this can only be done with Pandas in a rather convoluted way IMHO, here is a straightforward implementation using Numba (which should also be faster than all Pandas solutions):
import numba as nb
import numpy as np
@nb.njit(['(int32[:],)', '(int64[:],)'])
def compute(arr):
    res = np.empty(arr.size, dtype=arr.dtype)
    z_count = 0       # length of the preceding/current zero run
    last_nnz_val = 0  # last output value of the previous non-zero run
    nnz_count = 0     # position within the current non-zero run
    for i in range(arr.size):
        if arr[i] == 0:
            if i > 0 and arr[i-1] != 0:  # If there is a switch from nnz to zero
                last_nnz_val += nnz_count - z_count  # Save the last nnz result
                z_count = 0
            z_count += 1
            res[i] = 0
        else:
            if i > 0 and arr[i-1] == 0:  # If there is a switch from zero to nnz
                nnz_count = 0
            nnz_count += 1
            res[i] = last_nnz_val - z_count + nnz_count
    return res
# [...]
compute(ser.to_numpy())
Note that the result is a plain NumPy array, but you can easily build a Series or DataFrame from it.
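For example, to wrap the result back into a Series while keeping the index of the input ser from the question:
res = pd.Series(compute(ser.to_numpy()), index=ser.index)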
Here are performance results on my machine (i5-9600KF CPU) on the tiny example dataset:
MichaelCao's answer: 886 µs
This answer: 2 µs <-----
On a 1000x larger dataset (repeated), I get:
MichaelCao's answer: 1240 µs
This answer: 20 µs <-----
It is much faster than the other answer. I also get different outputs from the two implementations, so one of them is certainly wrong.
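For reference, a minimal timing sketch for the 1000x case (not the exact benchmark harness used above, and exact numbers will vary by machine):
import timeit
big = np.tile(ser.to_numpy(), 1000)  # 1000x larger (repeated) dataset
t = timeit.timeit(lambda: compute(big), number=1000)
print(f'{t / 1000 * 1e6:.1f} µs per call')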