I want to read a csv file with following structure:
1 file, 13 insertions, 23 deletions
1 file, 43 deletions
2 file, 7 insertions
4 file, 13 insertions, 45 deletions
I would like to get a Dataframe, which should look following:
file insertions deletions
1 13 23
1 0 43
1 7 0
4 13 45
Is there a good way to do this?
You can read your file, split the values with extractall
(or split
), then pivot
:
df = pd.read_csv('file.csv', sep=', *')
out = (df.stack().str.extract(r'(\d+) *(.*)')
.droplevel(-1).pivot(columns=1, values=0)
.fillna(0).rename_axis(index=None, columns=None)
#.astype(int) # optional, to convert strings to integers
)
NB. to use split
, replace .str.extract(r'(\d+) *(.*)')
with .str.split(' +', n=1, expand=True)
.
Output:
deletions file insertions
0 23 1 13
1 43 1 0
2 0 2 7
3 45 4 13
Variant with unstack
to keep the original order of columns:
out = (df.stack().str.extract(r'(\d+) *(.*)')
.droplevel(-1)
.set_index(1, append=True)[0]
.unstack(fill_value=0, sort=False)
.rename_axis(index=None, columns=None)
)
Output:
file insertions deletions
0 1 13 23
1 1 43 0
2 0 2 7
3 4 13 45
NB. I just realized that the output with unstack
is incorrect when using sort=False
which seems to be a bug.