python-3.x, pandas, group-by, multiprocessing, dill

Parallelize Pandas's .size()


I have the following code snippet, using Python 3.10 and Pandas, inside a class method (not in __init__, since I noticed that can lead to problems):

self.features = self.features.groupby(["token", "feature"], as_index=False).size() \
            .rename(columns={"size": "freq"})

My self.features DataFrame is very large, since I am processing a lot of textual data/documents. It also contains elements of custom classes that are not easily picklable (I try to use dill whenever I can; e.g., for other parallelized tasks I used pathos instead of the standard multiprocessing module).

Are there any ways of parallelizing the processing of .groupby(...).size()? I know there are a few parallelization methods for Pandas, but they often rely on .apply(), which I know is very slow.


Solution

  • groupby.size can be replaced by value_counts, which is considerably faster.

    features[['token', 'feature']].value_counts(sort=False).reset_index(name='freq')
    

    Parallelizing won't be of much help, as the limiting step (building the groups) cannot be parallelized. A small sketch comparing the two approaches is shown below.
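
    For reference, a minimal, self-contained sketch (with a tiny hypothetical DataFrame standing in for the real self.features, which is not shown in the question) illustrating that the two approaches produce the same token/feature/freq table:

    import pandas as pd

    # Hypothetical toy data standing in for the real self.features DataFrame
    features = pd.DataFrame({
        "token":   ["a", "a", "b", "b", "b"],
        "feature": ["x", "y", "x", "x", "y"],
    })

    # Original approach: groupby + size
    by_groupby = (features.groupby(["token", "feature"], as_index=False)
                          .size()
                          .rename(columns={"size": "freq"}))

    # Suggested replacement: value_counts (DataFrame.value_counts requires pandas >= 1.1)
    by_counts = (features[["token", "feature"]]
                 .value_counts(sort=False)
                 .reset_index(name="freq"))

    # Same counts in both cases; only the row order may differ
    print(by_groupby.sort_values(["token", "feature"]).reset_index(drop=True))
    print(by_counts.sort_values(["token", "feature"]).reset_index(drop=True))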