I need a memory-efficient way of exploding a dataframe and then grouping by a column to get either normalized value_counts or the mean depending on dtype.
I have a dataframe similar to this:
   key  cnt X1  X2
0    1    8  a   1
1    1    3  b   0
2    1    4  a   0
3    2    2  b   1
4    2    6  a   0
5    3    3  a   1
The cnt variable indicates the count of the values in the other columns. For example, in the first row you can interpret X1 as having 8 instances of 'a' (i.e., like ['a']*8 or ['a','a','a','a','a','a','a','a']).
I am able to do what I need using .apply(), then .explode(), then .groupby(), something like this:
import pandas as pd

df = pd.DataFrame([[1, 8, 'a', 1], [1, 3, 'b', 0], [1, 4, 'a', 0],
                   [2, 2, 'b', 1], [2, 6, 'a', 0], [3, 3, 'a', 1]],
                  columns=['key', 'cnt', 'X1', 'X2'])

# repeat each value cnt times, then explode to one row per repetition
df['X1'] = df.apply(lambda row: [row['X1']] * row['cnt'], axis=1)
df['X2'] = df.apply(lambda row: [row['X2']] * row['cnt'], axis=1)
df = df.explode(['X1', 'X2']).drop(columns=['cnt']).reset_index(drop=True)

# normalized value counts of X1 per key, spread into columns
vc = pd.DataFrame(df.groupby(['key'])['X1'].value_counts(normalize=True, dropna=True)).unstack()
vc.columns = [f'X1_{col}' for col in vc.columns.get_level_values(1).values]

# per-key mean of the remaining numeric column, merged back on
df = pd.merge(left=vc.reset_index(),
              right=df.drop(columns=['X1']).groupby(['key']).agg('mean').astype(float).reset_index(),
              how='left')
print(df)
   key  X1_a  X1_b        X2
0    1  0.80  0.20  0.533333
1    2  0.75  0.25  0.250000
2    3  1.00   NaN  1.000000
But the data I'm working with is enormous, with many different variables that need to be aggregated like this, and most of the cnt values are >15000, which results in using too much memory and freezing my machine.
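To put the blow-up in numbers: the exploded frame has one row per unit of cnt, so on the original frame (before the apply/explode steps) a rough size check, just for illustration, is

n_exploded = df['cnt'].sum()   # 26 rows for the toy frame above
# with the real data this is roughly len(df) * 15000 rows materialized in memory
# before the groupby and merge even start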
I feel like there must be a more memory-efficient way to do this. Anyone have any ideas?
Calculate the normalized sum of counts per key and X1
cnts = pd.crosstab(df['key'], df['X1'], df['cnt'],
                   aggfunc='sum', normalize='index').add_prefix('X1_')
# X1   X1_a  X1_b
# key
# 1    0.80  0.20
# 2    0.75  0.25
# 3    1.00  0.00
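If you want to see what that crosstab call is doing under the hood, an equivalent hand-rolled sketch (the variable names here are mine) is a weighted groupby followed by row normalization:

cat_counts = df.groupby(['key', 'X1'])['cnt'].sum().unstack(fill_value=0)
cat_share = cat_counts.div(cat_counts.sum(axis=1), axis=0).add_prefix('X1_')
# same table as cnts above, without ever exploding the rows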
Calculate the weighted average of X2 per key
# numerator: total of cnt * X2 per key; denominator: total cnt per key
weighted_sum = df['cnt'].mul(df['X2']).groupby(df['key']).sum()
total_weight = df.groupby('key')['cnt'].sum()
average = weighted_sum / total_weight
# key
# 1    0.533333
# 2    0.250000
# 3    1.000000
# dtype: float64
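As a quick check against the sample data, the key 1 average is (8*1 + 3*0 + 4*0) / (8 + 3 + 4) = 8/15 ≈ 0.533333, matching the explode-based result in the question.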
Join the dataframes
result = cnts.join(average.rename('X2'))
#      X1_a  X1_b        X2
# key
# 1    0.80  0.20  0.533333
# 2    0.75  0.25  0.250000
# 3    1.00  0.00  1.000000
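If you want key back as a regular column like in your original output, finish with result.reset_index(). And since you mention having many variables to aggregate this way, here is a rough sketch (my own loop, assuming every non-key, non-cnt column is either object-typed/categorical or numeric) that applies the same weighted logic column by column without ever exploding:

pieces = []
for col in df.columns.drop(['key', 'cnt']):
    if df[col].dtype == object:
        # categorical: weighted, normalized value counts per key
        pieces.append(pd.crosstab(df['key'], df[col], df['cnt'],
                                  aggfunc='sum', normalize='index').add_prefix(f'{col}_'))
    else:
        # numeric: weighted mean per key
        pieces.append((df['cnt'].mul(df[col]).groupby(df['key']).sum()
                       / df.groupby('key')['cnt'].sum()).rename(col))

result = pd.concat(pieces, axis=1).reset_index()
# memory stays proportional to the original frame, not to the sum of cnt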