python pandas dataframe logistic-regression

Explode / unpivot DataFrame containing count data to one row per item

Context: Data transformation for a logistic regression problem. I have the following data structure:

df = pd.DataFrame({"group": ["A", "B"], "total": [3, 5], "occurrence": [2, 1]})

I want to do sth. like pd.explode, but creating one row for item of total, i.e. 5+6 rows where occurrence number of rows hold 1 and the rest 0 (either in the occurrence columns or a new target column).

Currently I'm doing it iteratively which is quite slow on large data:

expanded = []
for ix, row in df.iterrows():
    for i in range(row["total"]):
        row["y"] = 1 if i < row["occurrence"] else 0
        expanded.append(row.copy())
df_out = pd.DataFrame(expanded).reset_index(drop=True)
df_out.drop(["total", "occurrence"], axis=1, inplace=True)
df_out


  group  y
0     A  1
1     A  1
2     A  0
3     B  1
4     B  0
5     B  0
6     B  0
7     B  0

Solution

You could repeat the rows, then assign a new column based on the output from groupby.cumcount:

out = (df.loc[df.index.repeat(df['total']), ['group', 'occurrence']]
         .assign(y=lambda x: x.groupby(level=0).cumcount()
                              .lt(x.pop('occurrence'))
                              .astype(int))
      )

Or using numpy's tile/repeat:

import numpy as np

out = pd.DataFrame({'group': np.repeat(df['group'], df['total']),
                    'y': np.repeat(np.tile([1, 0], len(df)),
                                   np.r_[df['occurrence'],
                                         df['total']-df['occurrence']])
                    })

Output:

If you want to use explode:

out = (df.assign(y= lambda x: [[1 if i<o else 0 for i in range(t)] for
                               t, o in zip(x['total'], x['occurrence'])])
         .explode('y')
      )

Output:

  group  total  occurrence  y
0     A      3           2  1
0     A      3           2  1
0     A      3           2  0
1     B      5           1  1
1     B      5           1  0
1     B      5           1  0
1     B      5           1  0
1     B      5           1  0

timings

On 20k rows. The pure numpy approach is the fastest.

# original approach
9.96 s ± 248 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# repetition
7.17 ms ± 242 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# numpy tile+repeat
1.31 ms ± 23.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

# explode
18.2 ms ± 959 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# numpy + list comprehension (PaulS)
125 ms ± 2.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# numba (PaulS)
3.88 ms ± 144 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)