I have a DataFrame like this:
ID | MS | DS |
---|---|---|
654 | 1500,10000,20000,30000 | 60,365,730 |
131 | 1500,10000,20000 | 60,365,730 |
598 | 1500,10000,20000,30000 | 60,365,730 |
The desired output looks like this:
ID | MS | DS |
---|---|---|
654 | 1500 | 60 |
654 | 10000 | 365 |
654 | 20000 | 730 |
654 | 30000 | Nan |
131 | 1500 | 60 |
131 | 10000 | 365 |
131 | 20000 | 730 |
598 | 1500 | 60 |
I have tried the following code:
from itertools import zip_longest
df.apply(lambda row: list(zip_longest(row['MS'],
                                      row['DS'],
                                      fillvalue='Nan')), axis=1)
This did not work as expected: it creates a Series whose values are lists like this: [('1500,10000,20000,30000', '60,365,730')], [('1500,10000,20000', '60,365,730')].
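A minimal sketch of why this happens, assuming each cell is a one-element list holding a single comma-separated string (which is what the output above suggests). `zip_longest` pairs list elements, so with one element per list it pairs the two whole strings. Splitting the strings first gives it individual values to pair. The names `ms_cell` and `ds_cell` are illustrative only.

```python
from itertools import zip_longest

# Assumed shape of one row's cells: a one-element list per column.
ms_cell = ['1500,10000,20000,30000']
ds_cell = ['60,365,730']

# zip_longest iterates over the lists, each of which has exactly one element,
# so the result is a single tuple containing the two unsplit strings.
unsplit = list(zip_longest(ms_cell, ds_cell, fillvalue='Nan'))
print(unsplit)
# [('1500,10000,20000,30000', '60,365,730')]

# Splitting each string on ',' first lets zip_longest pair individual values
# and pad the shorter sequence with the fill value.
pairs = list(zip_longest(ms_cell[0].split(','),
                         ds_cell[0].split(','),
                         fillvalue='Nan'))
print(pairs)
# [('1500', '60'), ('10000', '365'), ('20000', '730'), ('30000', 'Nan')]
```

Note that the values are still strings at this point, and the fill value `'Nan'` is just a string too, not a real missing value.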
Here's one approach:
cols = ['MS', 'DS']
dfs = [df[c].explode().str.split(',', expand=True)
       .astype('Int64')
       .stack()
       .rename(c)
       for c in cols]
out = (pd.concat(dfs, axis=1)
       .droplevel(1)
       .join(df['ID'])
       .loc[:, df.columns]
       )
out
    ID     MS    DS
0  654   1500    60
0  654  10000   365
0  654  20000   730
0  654  30000  <NA>
1  131   1500    60
1  131  10000   365
1  131  20000   730
2  598   1500    60
2  598  10000   365
2  598  20000   730
2  598  30000  <NA>
Explanation

- For each column in `['MS', 'DS']`:
  - use `Series.explode` to get list elements into multiple rows;
  - `Series.str.split` to split on `','`, with `expand` set to `True` to get the result in multiple columns;
  - `DataFrame.astype` with `Int64` to turn what will be string values into proper integers, while allowing for `NaN` values (the shorter sequences will have these for columns not "reached");
  - `df.stack` to get a `pd.Series` result (with column values now as a second index level);
  - `Series.rename` to restore the original column name.
- Combine the resulting `Series`, add the non-exploded column, and re-order:
  - `pd.concat` along `axis=1` (columns);
  - `df.droplevel` to drop the added index level;
  - `df.join` to re-add the non-exploded column `ID`;
  - pass `df.columns` to `df.loc` to reset the column order.

Data used
import pandas as pd
data = {
    'ID': [654, 131, 598],
    'MS': [['1500,10000,20000,30000'], ['1500,10000,20000'],
           ['1500,10000,20000,30000']],
    'DS': [['60,365,730'], ['60,365,730'],
           ['60,365,730']]
}
df = pd.DataFrame(data)
df
    ID                        MS            DS
0  654  [1500,10000,20000,30000]  [60,365,730]
1  131        [1500,10000,20000]  [60,365,730]
2  598  [1500,10000,20000,30000]  [60,365,730]
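
The steps from the explanation can also be inspected one at a time. The sketch below rebuilds the sample frame so that it runs on its own, then follows a single column through them. `DS` is used because every row has three values, so no missing entries are involved. The intermediate names (`exploded`, `wide`, `stacked`) are illustrative and not part of the approach above.

```python
import pandas as pd

# Same sample frame as in "Data used": each cell holds a one-element list.
df = pd.DataFrame({
    'ID': [654, 131, 598],
    'MS': [['1500,10000,20000,30000'], ['1500,10000,20000'],
           ['1500,10000,20000,30000']],
    'DS': [['60,365,730'], ['60,365,730'], ['60,365,730']],
})

# Step 1: explode unwraps the one-element lists into plain strings.
exploded = df['DS'].explode()

# Step 2: split on ',' with expand=True gives one column per position.
wide = exploded.str.split(',', expand=True)

# Step 3: the nullable Int64 dtype turns the strings into integers.
wide = wide.astype('Int64')

# Step 4: stack moves the column positions into a second index level,
# and rename restores the original column name.
stacked = wide.stack().rename('DS')

print(wide.shape)             # rows x positions
print(stacked.index.nlevels)  # original index plus the position level
print(stacked.loc[(1, 2)])    # value for row 1, third position
```

After `droplevel(1)` only the original row index remains, which is what lets `join` line the values up with `ID` again.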