pandasdataframeread-csv

Pandas read_csv with column behind value


I want to read a csv file with following structure:

1 file, 13 insertions, 23 deletions
1 file, 43 deletions
2 file, 7 insertions
4 file, 13 insertions, 45 deletions

I would like to get a Dataframe, which should look following:

file   insertions  deletions
1      13          23
1      0           43
1      7           0
4      13          45

Is there a good way to do this?


Solution

  • You can read your file, split the values with extractall (or split), then pivot:

    df = pd.read_csv('file.csv', sep=', *')
    out = (df.stack().str.extract(r'(\d+) *(.*)')
             .droplevel(-1).pivot(columns=1, values=0)
             .fillna(0).rename_axis(index=None, columns=None)
            #.astype(int) # optional, to convert strings to integers
          )
    

    NB. to use split, replace .str.extract(r'(\d+) *(.*)') with .str.split(' +', n=1, expand=True).

    Output:

      deletions file insertions
    0        23    1         13
    1        43    1          0
    2         0    2          7
    3        45    4         13
    

    Variant with unstack to keep the original order of columns:

    out = (df.stack().str.extract(r'(\d+) *(.*)')
             .droplevel(-1)
             .set_index(1, append=True)[0]
             .unstack(fill_value=0, sort=False)
             .rename_axis(index=None, columns=None)
          )
    

    Output:

      file insertions deletions
    0    1         13        23
    1    1         43         0
    2    0          2         7
    3    4         13        45
    

    NB. I just realized that the output with unstack is incorrect when using sort=False which seems to be a bug.