pythonpandasmemory

pandas memory issue when apply list to groupby


I am doing the below but getting memory issues.

make frame

data = {'link': [1,2,3,4,5,6,7],
        'code': ['xx', 'xx', 'xy', '', 'aa', 'ab', 'aa'],
        'Name': ['Tom', 'Tom', 'Tom', 'Tom', 'nick', 'nick', 'nick'],
        'Age': [20,20,20,20, 21, 21, 21]}

# Create DataFrame
df = pd.DataFrame(data)

print(df)

output

   link code  Name  Age
0     1   xx   Tom   20
1     2   xx   Tom   20
2     3   xy   Tom   20
3     4        Tom   20
4     5   aa  nick   21
5     6   ab  nick   21
6     7   aa  nick   21

minimal code example that works on subset of data but not on full dataset.

temp = df.groupby(['Name', 'Age'])['code'].apply(list).reset_index()
pd.merge(df, temp, on=['Name', 'Age']).explode('code_y').replace(r'^\s*$', np.nan, regex=True).dropna(subset='code_y').drop_duplicates()

output enter image description here

error when used on full dataset.

### MemoryError: Unable to allocate 5.34 TiB for an array with shape (733324768776,) and data type object

The apply list makes a big long list with duplicates. Is there a way to drop dups from the lists or is there maybe a better way to do this?

update

Going with this code as it seem to work best.

# Select relevant columns and drop duplicates directly
d = df[['code', 'Name', 'Age']].replace('', pd.NA).drop_duplicates()

# Perform the merge and drop rows with missing 'code_y' in one step
df.merge(d, how='outer', on=['Name', 'Age']).dropna(subset=['code_y'])

Solution

  • Your method is inefficient as it explodes then drops the duplicates. Ensure to drop the duplicates first then merge:

    d = df.mask(df.eq(''))[['code', 'Name', 'Age']].drop_duplicates()
    df.merge(d, how = 'outer', on = ['Name', 'Age']).dropna(subset='code_y') 
    
        link code_x  Name  Age code_y
    0      1     xx   Tom   20     xx
    1      1     xx   Tom   20     xy
    3      2     xx   Tom   20     xx
    4      2     xx   Tom   20     xy
    6      3     xy   Tom   20     xx
    7      3     xy   Tom   20     xy
    9      4          Tom   20     xx
    10     4          Tom   20     xy
    12     5     aa  nick   21     aa
    13     5     aa  nick   21     ab
    14     6     ab  nick   21     aa
    15     6     ab  nick   21     ab
    16     7     aa  nick   21     aa
    17     7     aa  nick   21     ab