I have a dataset with 'tag-like' groupings:
Id tags
0 item1 ['friends','family']
1 item2 ['friends']
2 item3 []
3 item4 ['family','holiday']
So a row can belong to several groups. I want to create an object similar to groupby, so that I can use agg etc.
df.groupby('tags').count()
expected result
tags count
0 'friends' 2
1 'family' 2
2 'holiday' 1
But of course it won't work because it treats the whole list as the key, rather than the individual tags. Here's an attempt
tagset = set(df.tags.explode())
grpby = { t: df.loc[df.tags.str.contains(t, regex=False)]
for t in tagset }
From what I understand, groupby objects are structured a bit like this. But how to make it a groupby
object? So that I can do things like grpby.year.mean()
etc?
You can't have a row belong to multiple groups like your grpby
object. Thus what you want to do is impossible in pure pandas, unless you duplicate the rows with explode
, then you will be able to groupby.agg
:
out = (df.explode('tags')
.groupby('tags', as_index=False)
.agg(**{'count': ('tags', 'size')})
)
Output:
tags count
0 family 2
1 friends 2
2 holiday 1
With a more meaningful aggregation:
out = (df.explode('tags')
.groupby('tags', as_index=False)
.agg({'Id': frozenset})
)
Output:
tags Id
0 family (item4, item1)
1 friends (item2, item1)
2 holiday (item4)
Note however that explode
is quite expensive, so if you just want to count the tags, better use pure python:
from collections import Counter
from itertools import chain
out = Counter(chain.from_iterable(df['tags']))
Output:
Counter({'friends': 2, 'family': 2, 'holiday': 1})
And if you want to split the DataFrame like your grpby
object:
tmp = df.assign(group=df['tags']).explode('group')
group = tmp.pop('group')
out = dict(list(tmp.groupby(group)))
Output:
{'family': Id tags
0 item1 [friends, family]
3 item4 [family, holiday],
'friends': Id tags
0 item1 [friends, family]
1 item2 [friends],
'holiday': Id tags
3 item4 [family, holiday]}