pandasdataframelistcountelement

Count consecutive elements in Pandas list


I have created the following pandas dataframe:

import pandas as pd
import numpy as np

ds = {'col1':[1,"S",3,4,"S"], 'col2' : [6,"S",8,9,"S"],'col3' : [67,None,87,79,"S"]}

df = pd.DataFrame(data=ds)

df['col4']= df[['col1','col2','col3']].values.tolist()

The dataframe looks like this:

print(df)

  col1 col2  col3          col4
0    1    6    67    [1, 6, 67]
1    S    S  None  [S, S, None]
2    3    8    87    [3, 8, 87]
3    4    9    79    [4, 9, 79]
4    S    S     S     [S, S, S]

For each record, I need to calculate the number of consecutive "S" inside col4. The resulting dataframe would look like this:

  col1 col2  col3          col4   iCount
0    1    6    67    [1, 6, 67]      0
1    S    S  None  [S, S, None]      2
2    3    8    87    [3, 8, 87]      0
3    4    9    79    [4, 9, 79]      0
4    S    S     S     [S, S, S]      3

I have tried this code:

col4 = np.array(df['col4'])
iCount = 0
for i in range(len(df)):
    for j in range(len(col4[i])):

        if(col4[i][j] == "S"):
            iCount += 1
    
        else:
            iCount = 0

df['iCount'] = iCount

But I get the following dataframe:

  col1 col2  col3          col4  iCount
0    1    6    67    [1, 6, 67]       3
1    S    S  None  [S, S, None]       3
2    3    8    87    [3, 8, 87]       3
3    4    9    79    [4, 9, 79]       3
4    S    S     S     [S, S, S]       3

Please can someone help me find the error?


Solution

  • I would use itertools.groupby:

    from itertools import groupby
    
    def consec(lst):
        return max((len(list(g)) for k,g in
                    groupby(lst, lambda x: x=='S') if k), default=0)
    
    df['iCount'] = df['col4'].map(consec)
    

    NB. using max here to get the longest sequence since there could be more than one stretch of S's, but you could use min/sum or any other logic.

    If you are sure there is a maximum of one series of S per list, you could simplify to:

    df['iCount'] = [sum(x=='S' for x in lst) for lst in df['col4']]
    

    Output:

      col1 col2  col3          col4  iCount
    0    1    6    67    [1, 6, 67]       0
    1    S    S  None  [S, S, None]       2
    2    3    8    87    [3, 8, 87]       0
    3    4    9    79    [4, 9, 79]       0
    4    S    S     S     [S, S, S]       3