I want to aggregate information per Name.
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['A','A','A','A','B','B','B','B'],
                   'Pair': ['b','c','d','e','b','f','d','g'],
                   'List 1': ['1,2','2,3,4',np.nan,'5',np.nan,'1,2,3','3,6','5,4'],
                   'List 2': ['1,2,3',',3,4','3','4,5',np.nan,'1,3','1,3,6','5,4']})
For the output, I want to split 'List 1' and 'List 2' into individual digits and aggregate them, along with 'Pair', per name:
'Unique Pair': group by each name, collect the unique letters, and join them in sorted order.
'Unique Item': group by each name, split the strings, drop NA, and join the unique digits in sorted order.
'All Item': the same, except that every occurrence is kept rather than only the unique values, then sorted. For example, '2' appears twice for name A, so it appears twice in the output.
out = pd.DataFrame({'Name': ['A','B'],
                    'Unique Pair': ['b,c,d,e','b,d,f,g'],
                    'Unique List 1 Item': ['1,2,3,4,5','1,2,3,4,5,6'],
                    'All List 1 Item': ['1,2,2,3,4,5','1,2,3,3,4,5,6'],
                    'Unique List 2 Item': ['1,2,3,4,5','1,3,4,5,6'],
                    'All List 2 Item': ['1,2,3,3,3,4,4,5','1,1,3,3,4,5,6']})
I have tried the following, but I don't know how to join all the items together:
df.groupby('Name')['List'].fillna('NA').apply(lambda x: x.split(','))
My raw dataset has more than 500,000,000 rows, so I am looking for an efficient solution.
Here's one approach with df.groupby:
out = df.groupby('Name', as_index=False).agg(
    **{
        'Unique Pair': ('Pair', lambda x: ','.join(np.unique(x))),
        'Unique Item': ('List 1', lambda x: ','.join(
            np.unique(x.dropna().str.split(',').explode()))),
        'All Item': ('List 1', lambda x: ','.join(
            x.dropna().str.split(',').explode().sort_values()))
    })
Output:
Name Unique Pair Unique Item All Item
0 A b,c,d,e 1,2,3,4,5 1,2,2,3,4,5
1 B b,d,f,g 1,2,3,4,5,6 1,2,3,3,4,5,6
Explanation

Use groupby.agg:

- 'Unique Pair': use str.join after applying np.unique, which returns sorted unique elements.
- 'Unique Item': drop NaN values with Series.dropna, apply Series.str.split and Series.explode, pass the result to np.unique, and use str.join.
- 'All Item': do the same, but after explode, chain Series.sort_values and then use str.join.

On the use of named aggregations, see here.
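As a quick check of the key fact above: np.unique de-duplicates and sorts in one step, so no separate sort is needed for the 'Unique' columns. A minimal sketch (the series s is just illustrative data):

```python
import numpy as np
import pandas as pd

s = pd.Series(['d', 'b', 'b', 'c'])

# np.unique returns the unique elements already sorted
u = np.unique(s)
print(','.join(u))  # b,c,d
```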
To avoid repeating x.dropna().str.split(',').explode() (which is expensive) for both 'Unique Item' and 'All Item', you can also reshape df before starting the groupby aggregations. Note that dropna is not needed here: str.split leaves NaN as NaN, and explode keeps those rows. E.g., do something like this:
df = (df.assign(**{'List 1': df['List 1'].str.split(',')})
        .explode('List 1', ignore_index=True)
      )
out2 = df.groupby('Name', as_index=False).agg(
    **{
        'Unique Pair': ('Pair', lambda x: ','.join(np.unique(x))),
        'Unique Item': ('List 1', lambda x: ','.join(np.unique(x.dropna()))),
        'All Item': ('List 1', lambda x: ','.join(x.dropna().sort_values()))
    })
Output:
out2.equals(out)
# True
If performance matters for a dataset of this size, the second approach will probably be faster, since the split/explode work is done once up front rather than once per aggregation.
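The code above handles a single list column, while the expected output in the question covers both 'List 1' and 'List 2'. A sketch extending the same pattern to both columns (the names out_full and _tokens are mine; I also drop empty tokens, since the leading comma in ',3,4' would otherwise produce an empty string, and the expected output implies it should be ignored):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['A','A','A','A','B','B','B','B'],
                   'Pair': ['b','c','d','e','b','f','d','g'],
                   'List 1': ['1,2','2,3,4',np.nan,'5',np.nan,'1,2,3','3,6','5,4'],
                   'List 2': ['1,2,3',',3,4','3','4,5',np.nan,'1,3','1,3,6','5,4']})

def _tokens(x):
    # split on commas, flatten, drop NaN and empty tokens
    # (',3,4' in 'List 2' would otherwise yield an empty string)
    s = x.dropna().str.split(',').explode()
    return s[s != '']

aggs = {'Unique Pair': ('Pair', lambda x: ','.join(np.unique(x)))}
for col in ('List 1', 'List 2'):
    aggs[f'Unique {col} Item'] = (col, lambda x: ','.join(np.unique(_tokens(x))))
    aggs[f'All {col} Item'] = (col, lambda x: ','.join(_tokens(x).sort_values()))

out_full = df.groupby('Name', as_index=False).agg(**aggs)
print(out_full)
```

This reproduces all six columns of the expected output, including the duplicate counts in the 'All … Item' columns.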