I am running the code below but hitting memory issues.

Make the frame:
import numpy as np
import pandas as pd

data = {'link': [1, 2, 3, 4, 5, 6, 7],
        'code': ['xx', 'xx', 'xy', '', 'aa', 'ab', 'aa'],
        'Name': ['Tom', 'Tom', 'Tom', 'Tom', 'nick', 'nick', 'nick'],
        'Age': [20, 20, 20, 20, 21, 21, 21]}

# Create DataFrame
df = pd.DataFrame(data)
print(df)
Output:

   link code  Name  Age
0     1   xx   Tom   20
1     2   xx   Tom   20
2     3   xy   Tom   20
3     4        Tom   20
4     5   aa  nick   21
5     6   ab  nick   21
6     7   aa  nick   21
Minimal code example that works on a subset of the data but not on the full dataset:

temp = df.groupby(['Name', 'Age'])['code'].apply(list).reset_index()
(df.merge(temp, on=['Name', 'Age'])
   .explode('code_y')
   .replace(r'^\s*$', np.nan, regex=True)
   .dropna(subset=['code_y'])
   .drop_duplicates())
Error when used on the full dataset:

MemoryError: Unable to allocate 5.34 TiB for an array with shape (733324768776,) and data type object
The apply(list) step builds one long list per group, duplicates included. Is there a way to drop the duplicates from the lists, or is there a better way to do this?
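One way to keep the per-group lists short, sketched on the sample frame above: drop the empty codes and the in-group duplicates before building the lists, so explode() never sees redundant values.

```python
import pandas as pd

# Sample frame from the question
data = {'link': [1, 2, 3, 4, 5, 6, 7],
        'code': ['xx', 'xx', 'xy', '', 'aa', 'ab', 'aa'],
        'Name': ['Tom', 'Tom', 'Tom', 'Tom', 'nick', 'nick', 'nick'],
        'Age': [20, 20, 20, 20, 21, 21, 21]}
df = pd.DataFrame(data)

# Drop empty codes and in-group duplicates *before* building the lists,
# so each list holds only the distinct codes of its group
temp = (df.loc[df['code'] != '']
          .drop_duplicates(subset=['Name', 'Age', 'code'])
          .groupby(['Name', 'Age'])['code']
          .apply(list)
          .reset_index())

# Merge and explode now produce no duplicate or empty rows to clean up
out = df.merge(temp, on=['Name', 'Age']).explode('code_y')
print(len(out))  # 14 rows on the sample frame
```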
Update

Going with this code as it seems to work best.
# Select relevant columns and drop duplicates directly
d = df[['code', 'Name', 'Age']].replace('', pd.NA).drop_duplicates()
# Perform the merge and drop rows with missing 'code_y' in one step
df.merge(d, how='outer', on=['Name', 'Age']).dropna(subset=['code_y'])
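A quick sanity check on the sample frame: the dedupe-first merge returns the same (link, code) pairs as the original explode pipeline, while keeping the right-hand side of the merge small.

```python
import numpy as np
import pandas as pd

# Sample frame from the question
data = {'link': [1, 2, 3, 4, 5, 6, 7],
        'code': ['xx', 'xx', 'xy', '', 'aa', 'ab', 'aa'],
        'Name': ['Tom', 'Tom', 'Tom', 'Tom', 'nick', 'nick', 'nick'],
        'Age': [20, 20, 20, 20, 21, 21, 21]}
df = pd.DataFrame(data)

# Original pipeline: list per group, merge, explode, then clean up
temp = df.groupby(['Name', 'Age'])['code'].apply(list).reset_index()
old = (df.merge(temp, on=['Name', 'Age'])
         .explode('code_y')
         .replace(r'^\s*$', np.nan, regex=True)
         .dropna(subset=['code_y'])
         .drop_duplicates())

# Dedupe-first pipeline: the right-hand side of the merge stays small
d = df[['code', 'Name', 'Age']].replace('', pd.NA).drop_duplicates()
new = df.merge(d, how='outer', on=['Name', 'Age']).dropna(subset=['code_y'])

# Both yield the same (link, code) pairs
pairs = lambda t: set(zip(t['link'], t['code_y']))
print(pairs(old) == pairs(new))  # True
```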
Your method is inefficient because it explodes first and only then drops duplicates. Drop the duplicates first, then merge:
d = df.mask(df.eq(''))[['code', 'Name', 'Age']].drop_duplicates()
df.merge(d, how='outer', on=['Name', 'Age']).dropna(subset=['code_y'])
    link code_x  Name  Age code_y
0      1     xx   Tom   20     xx
1      1     xx   Tom   20     xy
3      2     xx   Tom   20     xx
4      2     xx   Tom   20     xy
6      3     xy   Tom   20     xx
7      3     xy   Tom   20     xy
9      4          Tom   20     xx
10     4          Tom   20     xy
12     5     aa  nick   21     aa
13     5     aa  nick   21     ab
14     6     ab  nick   21     aa
15     6     ab  nick   21     ab
16     7     aa  nick   21     aa
17     7     aa  nick   21     ab
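For context on the MemoryError: the explode-then-dedupe pipeline is quadratic in group size. With apply(list), every row in a group of n rows carries a list of n codes, so explode() produces n² rows per group, which is consistent with the ~7.3 × 10¹¹-element array in the error. A sketch of that count on the sample frame:

```python
import pandas as pd

# Sample frame from the question
data = {'link': [1, 2, 3, 4, 5, 6, 7],
        'code': ['xx', 'xx', 'xy', '', 'aa', 'ab', 'aa'],
        'Name': ['Tom', 'Tom', 'Tom', 'Tom', 'nick', 'nick', 'nick'],
        'Age': [20, 20, 20, 20, 21, 21, 21]}
df = pd.DataFrame(data)

# Each (Name, Age) group of size n produces n rows after the merge,
# each carrying a list of n codes, so explode() yields n**2 rows per group
rows_after_explode = df.groupby(['Name', 'Age']).size().pow(2).sum()
print(rows_after_explode)  # 4**2 + 3**2 = 25
```

Deduplicating first keeps the merge output at (group size) × (distinct codes) rows instead.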