I have the following code snippet using Python 3.10 and Pandas within a class method (not __init__
since I noticed this could lead to problems):
self.features = self.features.groupby(["token", "feature"], as_index=False).size() \
.rename(columns={"size": "freq"})
My self.features DataFrame is very large, since I am processing a lot of textual data/documents. It also consists of elements from custom classes, which are not easily pickleable (I try to use dill whenever I can, e.g. for other parallelized tasks, I used pathos instead of standard multiprocessing).
Are there any ways of parallelizing the processing of .groupby(...).size()? I know there are a few parallelization methods for Pandas, but they often use .apply(), which I know is very slow.
groupby.size can be replaced by value_counts, which is quite a bit faster:
features[['token', 'feature']].value_counts(sort=False).reset_index(name='freq')
Parallelizing won't be of much help as the limiting step (building the groups) cannot be parallelized.
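To check that the replacement returns the same counts, here is a minimal sketch on a toy frame. The data is made up and only stands in for the real self.features; both results are sorted before comparing so the check does not depend on row order.

```python
import pandas as pd

# Hypothetical toy data standing in for the large self.features frame.
features = pd.DataFrame({
    "token": ["a", "a", "b", "b", "b", "c"],
    "feature": ["x", "x", "x", "y", "y", "x"],
})

# Original approach from the question: group, count rows, rename the column.
by_groupby = (
    features.groupby(["token", "feature"], as_index=False)
    .size()
    .rename(columns={"size": "freq"})
)

# Suggested replacement: count unique (token, feature) rows directly.
by_value_counts = (
    features[["token", "feature"]]
    .value_counts(sort=False)
    .reset_index(name="freq")
)

# Normalize row order before comparing the two results.
keys = ["token", "feature"]
left = by_groupby.sort_values(keys).reset_index(drop=True)
right = by_value_counts.sort_values(keys).reset_index(drop=True)
print(left.values.tolist() == right.values.tolist())
```

How large the speedup is depends on the data and the Pandas version, so it is worth timing both variants (e.g. with timeit) on a representative sample of the real frame.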