I am doing sentiment analysis and have train and test csv files, with a train dataframe (created after reading the csv files) which has the columns text and sentiment.
Here is what I tried in Google Colab:
!pip install autocorrect
from autocorrect import spell
train['text'] = [' '.join([spell(i) for i in x.split()]) for x in train['text']]
But it runs forever without finishing. Is there a better way to auto-correct a pandas column? How should I do it?
P.S.: the dataset is fairly large, with around 5000 rows, and each train['text'] value has around 300 words and is of type str. I have not broken train['text'] into sentences.
First, some sample data:
from typing import List
from autocorrect import spell
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
data_train: List[str] = fetch_20newsgroups(
subset='train',
categories=['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space'],
shuffle=True,
random_state=444
).data
df = pd.DataFrame({"train": data_train})
Corpus size:
>>> df.shape
(2034, 1)
Mean length of document in characters:
>>> df["train"].str.len().mean()
1956.4896755162242
First observation: spell() (I've never used autocorrect before) is really slow. It takes 7.77 seconds on just one document!
>>> first_doc = df.iat[0, 0]
>>> len(first_doc.split())
547
>>> first_doc[:100]
'From: dbm0000@tm0006.lerc.nasa.gov (David B. Mckissock)\nSubject: Gibbons Outlines SSF Redesign Guida'
>>> %time " ".join((spell(i) for i in first_doc.split()))
CPU times: user 7.77 s, sys: 159 ms, total: 7.93 s
Wall time: 7.93 s
So that function itself, rather than the choice between a vectorized Pandas method and .apply(), is probably your bottleneck. A back-of-the-envelope calculation, given that this document is roughly 1/3 as long as the average one, puts your total non-parallelized computation time at 7.93 * 3 * 2034 == 48,388 seconds. Not pretty.
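To put that figure in more familiar units, here is the same estimate converted to hours (this just restates the arithmetic above, nothing new):

>>> est_seconds = 7.93 * 3 * 2034  # ~time per average-length doc, times number of docs
>>> round(est_seconds / 3600, 1)   # roughly half a day of single-core work
13.4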
To that end, consider parallelization. This is a highly parallelizable task: applying a CPU-bound, simple callable across a collection of documents. concurrent.futures has a straightforward API for this. At that point, you can take the data structure out of Pandas and into something lightweight, such as a list or tuple.
Example:
>>> corpus = df["train"].tolist() # or just data_train from above...
>>> import concurrent.futures
>>> import os
>>> os.cpu_count()
24
>>> def correct_doc(doc):
...     # A named, module-level function: ProcessPoolExecutor can't pickle lambdas.
...     # Split first so we spell-check words, not individual characters.
...     return " ".join(spell(word) for word in doc.split())
...
>>> with concurrent.futures.ProcessPoolExecutor() as executor:
...     corrected = list(executor.map(correct_doc, corpus))
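executor.map() yields results in the same order as the input, so once the pool is done you can attach them straight back onto the frame. A minimal sketch, assuming you want the corrected text in a new column (the name corrected is my own choice, not from the question):

>>> df["corrected"] = corrected  # corrected is a list of 2034 strings, aligned with df

If the per-document work stays this expensive, you can also pass a chunksize larger than the default of 1 to executor.map(), which batches documents per worker process and reduces inter-process serialization overhead.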