Tags: pandas, algorithm, performance, rows

Best way to generate rows based on other rows in pandas for a big file


I have a CSV with around 8 million rows, something like this:

a b c
0 2 3

and I want to generate new rows from it based on the second and third values, so that I get:

a b c
0 2 3
0 3 3
0 4 3
0 5 3

This is basically just iterating through every row (in this example, a single row) and, for each value i from 0 to c inclusive, creating a new row whose b value is b+i. The c column is irrelevant after the rows have been generated. The problem is that the file has millions of rows, and expanding it can generate many more, so how can I do it efficiently? (Loops are too slow for that amount of data.) Thanks.
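For concreteness, the transformation described above can be sketched as a slow reference loop (fine for a toy frame, far too slow for 8 million rows):

```python
import pandas as pd

# Reference (slow) implementation of the expansion described above:
# each row (a, b, c) becomes the rows (a, b+i, c) for i = 0..c inclusive.
df = pd.DataFrame({"a": [0], "b": [2], "c": [3]})

rows = []
for _, row in df.iterrows():
    for i in range(row["c"] + 1):
        rows.append({"a": row["a"], "b": row["b"] + i, "c": row["c"]})

expanded = pd.DataFrame(rows)
print(expanded)
```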


Solution

  • You can reindex on the repeated index:

    # Repeat each row c+1 times, then offset b by its position within the group
    out = df.loc[df.index.repeat(df['c']+1)]
    out['b'] += out.groupby(level=0).cumcount()
    print(out)
    

    Output (reset index if you want):

       a  b  c
    0  0  2  3
    0  0  3  3
    0  0  4  3
    0  0  5  3
    

    Note: since you blow your data up by the c column and you already have 8 million rows, the resulting DataFrame may be too large to hold in memory on its own.
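    If the expanded frame will not fit in memory, one option is to apply the same repeat/cumcount trick chunk by chunk and stream the result to disk. A minimal sketch, with placeholder file names, a toy chunk size you would tune yourself, and a tiny demo input standing in for the real 8-million-row file:

    ```python
    import pandas as pd

    def expand_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
        # Same technique as above, applied to one chunk at a time.
        out = chunk.loc[chunk.index.repeat(chunk["c"] + 1)].copy()
        out["b"] += out.groupby(level=0).cumcount()
        return out.drop(columns="c")  # c is no longer needed after expansion

    # Tiny demo input standing in for the real file.
    pd.DataFrame({"a": [0, 1], "b": [2, 10], "c": [3, 1]}).to_csv(
        "input.csv", index=False
    )

    first = True
    for chunk in pd.read_csv("input.csv", chunksize=1):
        expand_chunk(chunk).to_csv(
            "output.csv", mode="w" if first else "a", header=first, index=False
        )
        first = False
    ```

    Each chunk is expanded and written independently, so peak memory is bounded by the largest expanded chunk rather than by the full result.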