[SOLVED] Pandas read_csv with column behind value

Pandas read_csv with column behind value

I want to read a csv file with following structure:

1 file, 13 insertions, 23 deletions
1 file, 43 deletions
2 file, 7 insertions
4 file, 13 insertions, 45 deletions

I would like to get a Dataframe, which should look following:

file   insertions  deletions
1      13          23
1      0           43
1      7           0
4      13          45

Is there a good way to do this?

Solution

You can read your file, split the values with extractall (or split), then pivot:

df = pd.read_csv('file.csv', sep=', *')
out = (df.stack().str.extract(r'(\d+) *(.*)')
         .droplevel(-1).pivot(columns=1, values=0)
         .fillna(0).rename_axis(index=None, columns=None)
        #.astype(int) # optional, to convert strings to integers
      )

NB. to use split, replace .str.extract(r'(\d+) *(.*)') with .str.split(' +', n=1, expand=True).

Output:

  deletions file insertions
0        23    1         13
1        43    1          0
2         0    2          7
3        45    4         13

Variant with unstack to keep the original order of columns:

out = (df.stack().str.extract(r'(\d+) *(.*)')
         .droplevel(-1)
         .set_index(1, append=True)[0]
         .unstack(fill_value=0, sort=False)
         .rename_axis(index=None, columns=None)
      )

Output:

  file insertions deletions
0    1         13        23
1    1         43         0
2    0          2         7
3    4         13        45

NB. I just realized that the output with unstack is incorrect when using sort=False which seems to be a bug.