So I have many dataframes coming in that need to be exploded. they look something like this:
df = pd.DataFrame({'A': [1, [11,22], [111,222]],
'B': [2, [33,44], float('nan')],
'C': [3, [55,66], [333,444]],
'D': [4, [77,88], float('nan')]
})
+-----------+---------+-----------+---------+
| A | B | C | D |
+-----------+---------+-----------+---------+
| 1 | 2 | 3 | 4 |
+-----------+---------+-----------+---------+
| [11,22] | [33,44] | [55,66] | [77,88] |
+-----------+---------+-----------+---------+
| [111,222] | NaN | [333,444] | NaN |
+-----------+---------+-----------+---------+
Typically if a column couldn't be exploded I'd just remove it from the column list like so:
colList = df.columns.values.tolist()
colList.remove("B")
colList.remove("D")
df = df.explode(colList)
But that would leave me with a dataframe that looks like:
+-----+---------+-----+---------+
| A | B | C | D |
+-----+---------+-----+---------+
| 1 | 2 | 3 | 4 |
+-----+---------+-----+---------+
| 11 | [33,44] | 55 | [77,88] |
+-----+---------+-----+---------+
| 22 | [33,44] | 66 | [77,88] |
+-----+---------+-----+---------+
| 111 | NaN | 333 | NaN |
+-----+---------+-----+---------+
| 222 | NaN | 444 | NaN |
+-----+---------+-----+---------+
I still need to explode those columns (B and D in example), but if I do, it'll throw an error due to the nulls. How can I successfully explode dataframes with this sort of problem?
One option could be to explode
each column separately and deduplicate the index before concat
:
def explode_dedup(s):
s = s.explode()
return s.set_axis(
pd.MultiIndex.from_arrays([s.index, s.groupby(level=0).cumcount()])
)
out = pd.concat({c: explode_dedup(df[c]) for c in df}, axis=1)
Output:
A B C D
0 0 1 2 3 4
1 0 11 33 55 77
1 22 44 66 88
2 0 111 NaN 333 NaN
1 222 NaN 444 NaN
To explode a subset of the columns:
cols = ['C', 'D']
others = df.columns.difference(cols)
out = (df[others]
.join(pd.concat({c: explode_dedup(df[c])
for c in cols}, axis=1)
.droplevel(-1)
)[df.columns]
)
Output:
A B C D
0 1 2 3 4
1 [11, 22] [33, 44] 55 77
1 [11, 22] [33, 44] 66 88
2 [111, 222] NaN 333 NaN
2 [111, 222] NaN 444 NaN
Reproducible input:
df = pd.DataFrame({'A': [1, [11,22], [111,222]],
'B': [2, [33,44], float('nan')],
'C': [3, [55,66], [333,444]],
'D': [4, [77,88], float('nan')]
})