Tags: python, pandas, dataframe, spell-checking, autocorrect

Python: Is there a faster way than using autocorrect for spell correction?


I am doing sentiment analysis. I have train and test CSV files, and a train DataFrame (created by reading the train CSV) with the columns text and sentiment.

Tried in Google Colab:

!pip install autocorrect
from autocorrect import spell

# spell-correct every whitespace-separated word in every row
train['text'] = [' '.join([spell(i) for i in x.split()]) for x in train['text']]

But it takes forever to finish. Is there a better way to auto-correct the pandas column? How should I do it?

P.S.: The dataset is fairly large: around 5,000 rows, and each train['text'] value has around 300 words and is of type str. I have not broken train['text'] into sentences.


Solution

  • First, some sample data:

    from typing import List
    from autocorrect import spell
    import pandas as pd
    from sklearn.datasets import fetch_20newsgroups
    
    data_train: List[str] = fetch_20newsgroups(
        subset='train',
        categories=['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space'],
        shuffle=True,
        random_state=444
    ).data
    
    df = pd.DataFrame({"train": data_train})
    

    Corpus size:

    >>> df.shape
    (2034, 1)
    

    Mean length of document in characters:

    >>> df["train"].str.len().mean()
    1956.4896755162242
    

    First observation: spell() (I've never used autocorrect before) is really slow. It takes 7.77 s on just one document!

    >>> first_doc = df.iat[0, 0]
    >>> len(first_doc.split())
    547
    >>> first_doc[:100]
    'From: dbm0000@tm0006.lerc.nasa.gov (David B. Mckissock)\nSubject: Gibbons Outlines SSF Redesign Guida'
    >>> %time " ".join((spell(i) for i in first_doc.split()))
    CPU times: user 7.77 s, sys: 159 ms, total: 7.93 s
    Wall time: 7.93 s
    

    So spell() itself, not the choice between a vectorized Pandas method and .apply(), is probably your bottleneck. A back-of-the-envelope calculation, given that this document is roughly 1/3 the length of the average one, puts your total non-parallelized computation time at 7.93 * 3 * 2034 ≈ 48,388 seconds. Not pretty.
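    To make the estimate concrete, here is that arithmetic as a quick sketch (it assumes the serial cost scales roughly linearly with document length):

    per_doc = 7.93 * 3        # the timed doc is ~1/3 the length of the average one
    n_docs = 2034             # corpus size, from df.shape
    total = per_doc * n_docs  # ~48,400 seconds, i.e. about 13.4 hours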

    To that end, consider parallelization. This is a highly parallelizable task: applying a CPU-bound, simple callable across a collection of documents. concurrent.futures has an easy API for this. At this point, you can take the data structure out of Pandas and into something lightweight, such as a list or tuple.

    Example:

    >>> corpus = df["train"].tolist()  # or just data_train from above...                                                                                                                                                                                        
    >>> import concurrent.futures                                                                                                                                                                                                                                
    >>> import os                                                                                                                                                                                                                                                
    >>> os.cpu_count()                                                                                                                                                                                                                                           
    24
    >>> with concurrent.futures.ProcessPoolExecutor() as executor: 
    ...     corrected = executor.map(lambda doc: " ".join((spell(i) for i in doc)), corpus)
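    The list(...) call materializes the results inside the with block, before the pool shuts down. From there, putting the corrected text back into the DataFrame is a single assignment; the column name train_corrected below is just an illustration, and you can assign back to "train" instead to overwrite in place:

    >>> df["train_corrected"] = corrected

    Either way, the expensive part, spell(), now runs across all available cores rather than serially.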