I have a CSV with around 8 million rows, something like this:
a b c
0 2 3
and I want to generate new rows from it based on the second and third values, so that I get:
a b c
0 2 3
0 3 3
0 4 3
0 5 3
That is, iterate over every row (one row in this example) and create new rows with the value b+i, where i runs from 0 to c inclusive. The c column is irrelevant once the rows have been generated. The problem is that the file has millions of rows and this might generate many more, so how can I do it efficiently? (Loops are too slow for that amount of data.) Thanks.
You can reindex on the repeated index:
out = df.loc[df.index.repeat(df['c']+1)]
out['b'] += out.groupby(level=0).cumcount()
print(out)
Output (reset index if you want):
a b c
0 0 2 3
0 0 3 3
0 0 4 3
0 0 5 3
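For reference, here is the same idea as a self-contained sketch you can run on the example row. The sample frame is built by hand, and the final step that drops c and resets the index is my addition, based on the question saying c is not needed afterwards. It assumes the original index is unique and that c is a non-negative integer.

```python
import pandas as pd

# Sample data matching the single-row example from the question.
df = pd.DataFrame({"a": [0], "b": [2], "c": [3]})

# Repeat each row c + 1 times; the index label is repeated along with the row.
out = df.loc[df.index.repeat(df["c"] + 1)].copy()

# Number the copies of each original row 0, 1, ..., c and add that to b.
out["b"] += out.groupby(level=0).cumcount()

# c is no longer needed; drop it and rebuild a clean 0..n-1 index.
out = out.drop(columns="c").reset_index(drop=True)
print(out)
```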
Note that since each row is repeated c + 1 times and you already have 8 million rows, the new dataframe may be too big on its own.
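If size is a concern, one option is to check the resulting row count first and then expand the data piece by piece. The following is only a rough sketch, not a tuned solution: `expand` is a hypothetical helper name, `chunk_size` is an arbitrary illustrative value, and the two-row frame is made-up sample data. It uses NumPy for the per-row offsets instead of `groupby`, and again assumes c is a non-negative integer.

```python
import numpy as np
import pandas as pd

def expand(chunk: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: expand one chunk into rows b, b+1, ..., b+c."""
    counts = chunk["c"].to_numpy() + 1
    # Repeat by position, so this does not depend on the index being unique.
    pos = np.repeat(np.arange(len(chunk)), counts)
    out = chunk.iloc[pos].drop(columns="c").reset_index(drop=True)
    # Offset within each original row: running position minus the group's start.
    starts = np.repeat(np.cumsum(counts) - counts, counts)
    out["b"] += np.arange(len(out)) - starts
    return out

# Made-up sample data with two rows.
df = pd.DataFrame({"a": [0, 1], "b": [2, 10], "c": [3, 1]})

# Number of rows the expanded frame will have, computed before expanding.
n_rows = int((df["c"] + 1).sum())
print(n_rows)

# Expand in slices; chunk_size is illustrative and should be far larger in practice.
chunk_size = 1
parts = [expand(df.iloc[i:i + chunk_size]) for i in range(0, len(df), chunk_size)]
result = pd.concat(parts, ignore_index=True)
print(result)
```

The sketch concatenates the pieces only to keep the example short. With real data you would more likely write each expanded piece straight to disk (for example with `to_csv` in append mode) so the full result never has to sit in memory at once.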