python pandas numpy

Rearrange values after consecutive zeros in a pandas Series by a specific rule


I have a pandas Series like the following:

    1   1
    2   2
    3   3 
    4   4
    5   0
    6   0
    7   1
    8   2
    9   3
   10   0
   11   0
   12   0
   13   0
   14   1
   15   2

I need to transform it into the following format:

    1   1
    2   2
    3   3 
    4   4
    5   0
    6   0
    7   3  ---> 4-2+1 (last non-zero value of the result so far - number of preceding zeros + current value)
    8   4  ---> 4-2+2 (last non-zero value of the result so far - number of preceding zeros + current value)
    9   5  ---> 4-2+3 (last non-zero value of the result so far - number of preceding zeros + current value)
   10   0
   11   0
   12   0
   13   0
   14   2  ---> 5-4+1 (last non-zero value of the result so far - number of preceding zeros + current value)
   15   3  ---> 5-4+2 (last non-zero value of the result so far - number of preceding zeros + current value)

I am stuck on this. So far I have been able to produce a Series that counts the consecutive zeros:

    zero = ser.eq(0).groupby(ser.ne(0).cumsum()).cumsum()

which gave me:

    1   0
    2   0
    3   0 
    4   0
    5   1
    6   2
    7   0
    8   0
    9   0
   10   1
   11   2
   12   3
   13   4
   14   0
   15   0

If anyone is willing to help, here is a snippet that creates the series above:

    import pandas as pd

    d = {'1': 1, '2': 2, '3': 3, '4': 4, '5': 0, '6': 0, '7': 1, '8': 2, '9': 3, '10': 0, '11': 0, '12': 0, '13': 0, '14': 1, '15': 2}
    ser = pd.Series(data=d)

Solution

  • IMHO, this can be done in pure Pandas only in a rather convoluted way, so here is a straightforward implementation using Numba (which should also be faster than the Pandas solutions):

    import numba as nb
    import numpy as np
    
    @nb.njit(['(int32[:],)', '(int64[:],)'])
    def compute(arr):
        res = np.empty(arr.size, dtype=arr.dtype)
        z_count = 0
        last_nnz_val = 0
        nnz_count = 0
        for i in range(arr.size):
            if arr[i] == 0:
                if i > 0 and arr[i-1] != 0:   # If there is a switch from nnz to zero
                    last_nnz_val += nnz_count - z_count   # Save the last nnz result
                    z_count = 0
                z_count += 1
                res[i] = 0
            else:
                if i > 0 and arr[i-1] == 0:   # If there is a switch from zero to nnz
                    nnz_count = 0
                nnz_count += 1
                res[i] = last_nnz_val - z_count + nnz_count
        return res
    
    # [...]
    compute(ser.to_numpy())
    

    Note that the result is a plain NumPy array, but you can easily create a Series or DataFrame from it.
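As a minimal sketch of that last step, the array can be wrapped in a Series that reuses the original index. To keep the snippet runnable without Numba, the array below is hard-coded from the expected output in the question and stands in for `compute(ser.to_numpy())`; the name `"arranged"` is purely illustrative.

```python
import numpy as np
import pandas as pd

# Series from the question (string labels '1'..'15').
d = {'1': 1, '2': 2, '3': 3, '4': 4, '5': 0, '6': 0, '7': 1, '8': 2,
     '9': 3, '10': 0, '11': 0, '12': 0, '13': 0, '14': 1, '15': 2}
ser = pd.Series(data=d)

# Stand-in for `compute(ser.to_numpy())`: the expected output listed in the
# question, hard-coded here (an assumption) so the sketch runs without Numba.
result = np.array([1, 2, 3, 4, 0, 0, 3, 4, 5, 0, 0, 0, 0, 2, 3])

# Reattach the original labels so each value lines up with its source row.
# The series name "arranged" is hypothetical.
out = pd.Series(result, index=ser.index, name="arranged")
print(out)
```

Passing `index=ser.index` matters because the bare array carries no labels; without it the new Series would get a default 0-based integer index.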


    Benchmark

    Here are performance results on my machine (i5-9600KF CPU) on the tiny example dataset:

    MichaelCao's answer:    886 µs
    This answer:              2 µs   <-----
    

    On a 1000x larger dataset (the example data repeated), I get:

    MichaelCao's answer:   1240 µs
    This answer:             20 µs   <-----
    

    It is much faster than the other answer. I also get different output, so one of the two implementations must be wrong.
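One way to decide which output is right is to compare it against the expected values given in the question. The following is only a sketch: `apply_rule` is a hypothetical pure-Python helper written from a literal reading of the rule (last non-zero result value, minus the length of the preceding zero run, plus the current value), not code taken from either answer.

```python
def apply_rule(values):
    """Literal reading of the rule in the question (hypothetical helper)."""
    out = []
    prev_nonzero = 0  # last non-zero result value before the current zero run
    zeros = 0         # length of the most recent zero run
    offset = 0        # prev_nonzero - zeros, fixed for the current non-zero run
    for i, v in enumerate(values):
        if v == 0:
            if i > 0 and values[i - 1] != 0:
                # A new zero run starts: remember the last result, reset count.
                prev_nonzero = out[-1]
                zeros = 0
            zeros += 1
            out.append(0)
        else:
            if i > 0 and values[i - 1] == 0:
                # A new non-zero run starts: fix its offset once.
                offset = prev_nonzero - zeros
            out.append(offset + v)
    return out


data = [1, 2, 3, 4, 0, 0, 1, 2, 3, 0, 0, 0, 0, 1, 2]
expected = [1, 2, 3, 4, 0, 0, 3, 4, 5, 0, 0, 0, 0, 2, 3]  # from the question
assert apply_rule(data) == expected
```

Any candidate implementation can be checked the same way, for example by comparing `compute(ser.to_numpy()).tolist()` with `expected`. Note that the helper adds the current value itself, while the Numba loop adds the position within the run; the two agree only when each non-zero run counts up from 1, as it does in the example data.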