I have a dataframe that looks like this:
import pandas as pd

test = pd.DataFrame(
    {'onset': [1,3,18,33,35,50],
     'duration': [2,15,15,2,15,15],
     'type': ['Instr', 'Remember', 'SocTestString', 'Rating', 'SelfTestString', 'XXX']
    }
)
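I want to create a new dataframe such that when type contains "TestString", the row is split into three consecutive rows: each gets one third of the original duration, the onsets are shifted accordingly, and a numbered suffix (_1, _2, _3) is appended to type.
The final dataframe should look like this: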
test_final = pd.DataFrame(
    {'onset': [1,3,18,23,28,33,35,40,45,50],
     'duration': [2,15,5,5,5,2,5,5,5,15],
     'type': ['Instr', 'Remember', 'SocTestString_1', 'SocTestString_2', 'SocTestString_3', 'Rating', 'SelfTestString_1', 'SelfTestString_2', 'SelfTestString_3', 'XXX']
    }
)
How may I accomplish this?
You could use str.contains to identify the target rows, then Index.repeat to duplicate them, and finally boolean indexing with groupby.cumcount to update the new rows:
N = 3 # number of rows to create
# identify target rows
m = test['type'].str.contains('TestString')
# repeat them
out = test.loc[test.index.repeat(m.mul(N-1).add(1))]
# divide duration
out.loc[m, 'duration'] /= N
# compute the cumcount
cc = out.loc[m].groupby(level=0).cumcount()
# increment the onset (5 is the duration of each new row here, i.e. 15/N)
out.loc[m, 'onset'] += cc*5
# add the suffix
out.loc[m, 'type'] += '_'+cc.add(1).astype(str)
# optionally, reset the index
out.reset_index(drop=True, inplace=True)
NB. This assumes that the original index has no duplicated labels: both the boolean mask m and groupby(level=0) rely on each original row having a unique label.
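If the durations are not all 15, or the index might contain duplicates, the same steps can be wrapped in a small helper. The following is only a sketch: split_rows and its parameter names are hypothetical (not part of pandas), it assumes the type column has no missing values, and it derives the onset step from each row's own duration instead of the hard-coded 5.

```python
import numpy as np
import pandas as pd


def split_rows(df, pattern, n, col="type", onset="onset", duration="duration"):
    """Split each row whose `col` contains `pattern` into `n` consecutive rows."""
    # A fresh RangeIndex sidesteps the "no duplicated labels" assumption.
    df = df.reset_index(drop=True)
    mask = df[col].str.contains(pattern, regex=False).to_numpy()
    reps = np.where(mask, n, 1)                       # n copies for matches, 1 otherwise
    out = df.loc[df.index.repeat(reps)]
    cc = out.groupby(level=0).cumcount().to_numpy()   # 0..n-1 within each original row
    mask = np.repeat(mask, reps)                      # mask aligned with the repeated rows
    out = out.reset_index(drop=True)

    step = out[duration].to_numpy() / n               # duration of each piece
    out[onset] = out[onset] + cc * step               # cc is 0 for rows that were not split
    out[duration] = np.where(mask, step, out[duration])
    suffix = "_" + pd.Series(cc + 1, index=out.index).astype(str)
    out[col] = out[col].where(~mask, out[col] + suffix)
    return out


test = pd.DataFrame({
    "onset": [1, 3, 18, 33, 35, 50],
    "duration": [2, 15, 15, 2, 15, 15],
    "type": ["Instr", "Remember", "SocTestString", "Rating", "SelfTestString", "XXX"],
})
result = split_rows(test, "TestString", 3)
print(result)
```

Because of the true division, onset and duration come back as floats in this sketch; if every duration divides evenly by n, they can be cast back with astype(int).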
Output:
   onset  duration              type
0      1         2             Instr
1      3        15          Remember
2     18         5   SocTestString_1
3     23         5   SocTestString_2
4     28         5   SocTestString_3
5     33         2            Rating
6     35         5  SelfTestString_1
7     40         5  SelfTestString_2
8     45         5  SelfTestString_3
9     50        15               XXX
Intermediates (without updating the original columns and resetting the index):
   onset  duration            type      m    cc  cc*5 _{cc+1}
0      1         2           Instr  False  <NA>  <NA>     NaN
1      3        15        Remember  False  <NA>  <NA>     NaN
2     18        15   SocTestString   True     0     0      _1
2     18        15   SocTestString   True     1     5      _2
2     18        15   SocTestString   True     2    10      _3
3     33         2          Rating  False  <NA>  <NA>     NaN
4     35        15  SelfTestString   True     0     0      _1
4     35        15  SelfTestString   True     1     5      _2
4     35        15  SelfTestString   True     2    10      _3
5     50        15             XXX  False  <NA>  <NA>     NaN
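For inspection, a table like the one above can be rebuilt roughly as follows. This is a sketch rather than the exact code used to produce it: the names reps, mask, inter and offset are chosen here for illustration, and cc is simply 0 (not <NA>) for the rows that were not repeated.

```python
import pandas as pd

test = pd.DataFrame({
    "onset": [1, 3, 18, 33, 35, 50],
    "duration": [2, 15, 15, 2, 15, 15],
    "type": ["Instr", "Remember", "SocTestString", "Rating", "SelfTestString", "XXX"],
})
N = 3

m = test["type"].str.contains("TestString")
reps = m.mul(N - 1).add(1)                # 3 for matching rows, 1 otherwise
out = test.loc[test.index.repeat(reps)]   # labels 2 and 4 now appear three times

# Broadcast the mask to the repeated rows and number the copies of each label.
mask = m.loc[out.index].to_numpy()
cc = out.groupby(level=0).cumcount().to_numpy()

# Attach the helpers as plain arrays, so no index alignment is involved.
inter = out.assign(m=mask, cc=cc, offset=cc * 5)
print(inter)
```

Passing arrays rather than Series to assign avoids aligning on the repeated index, which contains duplicated labels at this stage.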