I have created the following pandas dataframe:
import pandas as pd
import numpy as np
ds = {'col1':[1,"S",3,4,"S"], 'col2' : [6,"S",8,9,"S"],'col3' : [67,None,87,79,"S"]}
df = pd.DataFrame(data=ds)
df['col4']= df[['col1','col2','col3']].values.tolist()
The dataframe looks like this:
print(df)
col1 col2 col3 col4
0 1 6 67 [1, 6, 67]
1 S S None [S, S, None]
2 3 8 87 [3, 8, 87]
3 4 9 79 [4, 9, 79]
4 S S S [S, S, S]
For each record, I need to calculate the number of consecutive "S" inside col4
. The resulting dataframe would look like this:
col1 col2 col3 col4 iCount
0 1 6 67 [1, 6, 67] 0
1 S S None [S, S, None] 2
2 3 8 87 [3, 8, 87] 0
3 4 9 79 [4, 9, 79] 0
4 S S S [S, S, S] 3
I have tried this code:
col4 = np.array(df['col4'])
iCount = 0
for i in range(len(df)):
for j in range(len(col4[i])):
if(col4[i][j] == "S"):
iCount += 1
else:
iCount = 0
df['iCount'] = iCount
But I get the following dataframe:
col1 col2 col3 col4 iCount
0 1 6 67 [1, 6, 67] 3
1 S S None [S, S, None] 3
2 3 8 87 [3, 8, 87] 3
3 4 9 79 [4, 9, 79] 3
4 S S S [S, S, S] 3
Please can someone help me find the error?
I would use itertools.groupby
:
from itertools import groupby
def consec(lst):
return max((len(list(g)) for k,g in
groupby(lst, lambda x: x=='S') if k), default=0)
df['iCount'] = df['col4'].map(consec)
NB. using max
here to get the longest sequence since there could be more than one stretch of S's, but you could use min
/sum
or any other logic.
If you are sure there is a maximum of one series of S per list, you could simplify to:
df['iCount'] = [sum(x=='S' for x in lst) for lst in df['col4']]
Output:
col1 col2 col3 col4 iCount
0 1 6 67 [1, 6, 67] 0
1 S S None [S, S, None] 2
2 3 8 87 [3, 8, 87] 0
3 4 9 79 [4, 9, 79] 0
4 S S S [S, S, S] 3