pythonpandasdataframe

How to explode a pandas dataframe that has nulls in some rows, but populated in others


So I have many dataframes coming in that need to be exploded. they look something like this:

df = pd.DataFrame({'A': [1, [11,22], [111,222]],
                   'B': [2, [33,44], float('nan')],
                   'C': [3, [55,66], [333,444]],
                   'D': [4, [77,88], float('nan')]
                  })
+-----------+---------+-----------+---------+
| A         | B       | C         | D       |
+-----------+---------+-----------+---------+
| 1         | 2       | 3         | 4       |
+-----------+---------+-----------+---------+
| [11,22]   | [33,44] | [55,66]   | [77,88] |
+-----------+---------+-----------+---------+
| [111,222] | NaN     | [333,444] | NaN     |
+-----------+---------+-----------+---------+

Typically if a column couldn't be exploded I'd just remove it from the column list like so:

colList = df.columns.values.tolist()
colList.remove("B")
colList.remove("D")
df = df.explode(colList)

But that would leave me with a dataframe that looks like:

+-----+---------+-----+---------+
| A   | B       | C   | D       |
+-----+---------+-----+---------+
| 1   | 2       | 3   | 4       |
+-----+---------+-----+---------+
| 11  | [33,44] | 55  | [77,88] |
+-----+---------+-----+---------+
| 22  | [33,44] | 66  | [77,88] |
+-----+---------+-----+---------+
| 111 | NaN     | 333 | NaN     |
+-----+---------+-----+---------+
| 222 | NaN     | 444 | NaN     |
+-----+---------+-----+---------+

I still need to explode those columns (B and D in example), but if I do, it'll throw an error due to the nulls. How can I successfully explode dataframes with this sort of problem?


Solution

  • One option could be to explode each column separately and deduplicate the index before concat:

    def explode_dedup(s):
        s = s.explode()
        return s.set_axis(
            pd.MultiIndex.from_arrays([s.index, s.groupby(level=0).cumcount()])
        )
    
    
    out = pd.concat({c: explode_dedup(df[c]) for c in df}, axis=1)
    

    Output:

           A    B    C    D
    0 0    1    2    3    4
    1 0   11   33   55   77
      1   22   44   66   88
    2 0  111  NaN  333  NaN
      1  222  NaN  444  NaN
    

    To explode a subset of the columns:

    cols = ['C', 'D']
    others = df.columns.difference(cols)
    
    out = (df[others]
           .join(pd.concat({c: explode_dedup(df[c])
                            for c in cols}, axis=1)
                   .droplevel(-1)
                )[df.columns]
          )
    

    Output:

                A         B    C    D
    0           1         2    3    4
    1    [11, 22]  [33, 44]   55   77
    1    [11, 22]  [33, 44]   66   88
    2  [111, 222]       NaN  333  NaN
    2  [111, 222]       NaN  444  NaN
    

    Reproducible input:

    df = pd.DataFrame({'A': [1, [11,22], [111,222]],
                       'B': [2, [33,44], float('nan')],
                       'C': [3, [55,66], [333,444]],
                       'D': [4, [77,88], float('nan')]
                      })