Creating new rows in a dataframe based on previous values


I have a dataframe that looks like this:

import pandas as pd

test = pd.DataFrame(
    {'onset': [1,3,18,33,35,50],
     'duration': [2,15,15,2,15,15],
     'type': ['Instr', 'Remember', 'SocTestString', 'Rating', 'SelfTestString', 'XXX']
    }
)

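I want to create a new dataframe such that when type contains "TestString", the row is split into three rows: the duration is divided by three, each new onset is offset by the new duration, and type gets a numbered suffix (_1, _2, _3).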

The final dataframe should look like this:

test_final = pd.DataFrame(
    {'onset': [1,3,18,23,28,33,35,40,45,50],
     'duration': [2,15,5,5,5,2,5,5,5,15],
     'type': ['Instr', 'Remember', 'SocTestString_1', 'SocTestString_2', 'SocTestString_3', 'Rating', 'SelfTestString_1', 'SelfTestString_2', 'SelfTestString_3', 'XXX']
    })

How may I accomplish this?


Solution

  • You could use str.contains to identify the target rows, Index.repeat to duplicate them, and then boolean indexing with groupby.cumcount to update the new rows:

    N = 3 # number of rows to create
    # identify target rows
    m = test['type'].str.contains('TestString')
    # repeat them
    out = test.loc[test.index.repeat(m.mul(N-1).add(1))]
    # divide duration
    out.loc[m, 'duration'] /= N
    # compute the cumcount
    cc = out.loc[m].groupby(level=0).cumcount()
    # increment the onset by the new per-row duration (15 / N = 5 here)
    out.loc[m, 'onset'] += cc*5
    # add the suffix
    out.loc[m, 'type'] += '_'+cc.add(1).astype(str)
    
    # optionally, reset the index
    out.reset_index(drop=True, inplace=True)
    

    NB: this assumes that the original index has no duplicate labels, since groupby(level=0) relies on the index to tell the copies of each original row apart.
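If the original index might contain duplicates, one possible guard is to reset it before repeating the rows. This is a minimal sketch, not part of the answer above; the two-row `df` with a duplicated label is a made-up input for illustration.

```python
import pandas as pd

# Hypothetical input whose index has a duplicated label (0 appears twice)
df = pd.DataFrame(
    {'onset': [1, 18],
     'duration': [2, 15],
     'type': ['Instr', 'SocTestString']},
    index=[0, 0],
)

N = 3  # number of rows to create

# Give every row a unique label first, so that groupby(level=0)
# only groups the copies of one original row
if not df.index.is_unique:
    df = df.reset_index(drop=True)

m = df['type'].str.contains('TestString')
out = df.loc[df.index.repeat(m.mul(N - 1).add(1))]
cc = out.loc[m].groupby(level=0).cumcount()
```

Without the reset, both rows would share the label 0 and the per-row counter could no longer tell the copies of one row from the other row.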

    Output:

       onset  duration              type
    0      1         2             Instr
    1      3        15          Remember
    2     18         5   SocTestString_1
    3     23         5   SocTestString_2
    4     28         5   SocTestString_3
    5     33         2            Rating
    6     35         5  SelfTestString_1
    7     40         5  SelfTestString_2
    8     45         5  SelfTestString_3
    9     50        15               XXX
    

    Intermediates (without updating the original columns and resetting the index):

       onset  duration            type      m    cc  cc*5 _{cc+1}
    0      1         2           Instr  False  <NA>  <NA>     NaN
    1      3        15        Remember  False  <NA>  <NA>     NaN
    2     18        15   SocTestString   True     0     0      _1
    2     18        15   SocTestString   True     1     5      _2
    2     18        15   SocTestString   True     2    10      _3
    3     33         2          Rating  False  <NA>  <NA>     NaN
    4     35        15  SelfTestString   True     0     0      _1
    4     35        15  SelfTestString   True     1     5      _2
    4     35        15  SelfTestString   True     2    10      _3
    5     50        15             XXX  False  <NA>  <NA>     NaN
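
The same steps can be wrapped in a function that derives the onset step from each row's duration instead of hard-coding 5. This is only a sketch: the name `split_rows` and its parameters are hypothetical, it assumes a unique index, and it uses integer division, so it assumes the durations are divisible by `n`.

```python
import numpy as np
import pandas as pd

def split_rows(df, pattern='TestString', n=3):
    """Split each row whose 'type' contains `pattern` into `n` rows."""
    m = df['type'].str.contains(pattern)
    # 1 copy for ordinary rows, n copies for matching rows
    out = df.loc[df.index.repeat(m.mul(n - 1).add(1))].copy()
    # boolean mask over the repeated rows
    mask = out['type'].str.contains(pattern).to_numpy()
    # position of each copy within its original row: 0, 1, ..., n-1
    # (ordinary rows have a single copy, so their counter is 0)
    cc = out.groupby(level=0).cumcount().to_numpy()
    dur = out['duration'].to_numpy()
    step = dur // n  # assumption: durations are divisible by n
    out['onset'] = np.where(mask, out['onset'].to_numpy() + cc * step,
                            out['onset'].to_numpy())
    out['duration'] = np.where(mask, step, dur)
    out['type'] = [f'{t}_{c + 1}' if hit else t
                   for t, c, hit in zip(out['type'], cc, mask)]
    return out.reset_index(drop=True)

test = pd.DataFrame(
    {'onset': [1, 3, 18, 33, 35, 50],
     'duration': [2, 15, 15, 2, 15, 15],
     'type': ['Instr', 'Remember', 'SocTestString', 'Rating',
              'SelfTestString', 'XXX']}
)
result = split_rows(test)
```

On the `test` dataframe from the question, `result` should match `test_final`.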