I have a pandas dataframe with 3 million rows of social media comments. I'm using the language-tool-python library to count the grammatical errors in each comment. As far as I know, language_tool_python by default sets up a local LanguageTool server on your machine and queries it for responses.
Getting the number of grammatical errors just consists of creating an instance of the LanguageTool object and calling its .check() method with the string you want to check as a parameter.
>>> tool = language_tool_python.LanguageTool('en-US')
>>> text = 'A sentence with a error in the Hitchhiker’s Guide tot he Galaxy'
>>> matches = tool.check(text)
>>> len(matches)
2
So the method I used is:

df['body_num_errors'] = df['body'].apply(lambda row: len(tool.check(row)))

I am pretty sure this works; it's quite straightforward. But this single line of code has been running for the past hour. Since running the example above took 10-20 seconds, with 3 million rows it might as well take forever.
Is there any way I can cut my losses and speed this process up? Would iterating over every row and putting the whole thing inside a ThreadPoolExecutor help? Intuitively that makes sense to me, since it's an I/O-bound task.
I am open to any suggestions on how to speed this up, and if the ThreadPoolExecutor idea is sound, I would appreciate it if someone could show me some sample code.
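For reference, something like this is roughly what I had in mind (untested, and I'm not sure whether a single tool instance can safely be shared across threads):

from concurrent.futures import ThreadPoolExecutor

def count_errors(text):
    # each call is an HTTP request to the local LanguageTool server
    return len(tool.check(text))

# untested sketch: fan the rows out over a pool of worker threads
with ThreadPoolExecutor(max_workers=8) as executor:
    df['body_num_errors'] = list(executor.map(count_errors, df['body']))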
Edit (correction): the 10-20 seconds includes instantiating the tool; calling the .check() method itself is almost instantaneous.
I'm the creator of language_tool_python. First, none of the comments here make sense. The bottleneck is in tool.check(); there is nothing slow about using pd.DataFrame.map().
LanguageTool is running on a local server on your machine. There are at least two major ways to speed this up:
import language_tool_python

# start multiple local LanguageTool servers (each one is its own Java process)
servers = []
for i in range(100):
    servers.append(language_tool_python.LanguageTool('en-US'))
Then call each server from a different thread. Alternatively, initialize each server within its own thread.
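Here is a minimal sketch of that second variant, assuming each worker thread lazily creates its own LanguageTool instance via thread-local storage (the worker count of 8 is arbitrary):

import threading
from concurrent.futures import ThreadPoolExecutor

import language_tool_python

thread_local = threading.local()

def get_server():
    # lazily start one LanguageTool server per worker thread
    if not hasattr(thread_local, 'server'):
        thread_local.server = language_tool_python.LanguageTool('en-US')
    return thread_local.server

def count_errors(text):
    return len(get_server().check(text))

# df is the 3-million-row dataframe from the question
with ThreadPoolExecutor(max_workers=8) as executor:
    # executor.map preserves input order, so results line up with df['body']
    df['body_num_errors'] = list(executor.map(count_errors, df['body']))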
LanguageTool takes a maxCheckThreads option (see the LT HTTPServerConfig documentation), so you could also try playing around with that. From a glance at LanguageTool's source code, it looks like the default number of threads in a single LanguageTool server is 10.
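If your installed version of language_tool_python lets you pass LanguageTool server options through its config argument (recent releases do), raising that thread count might look roughly like this; the value 20 is only an illustration:

import language_tool_python

# assumption: this version of language_tool_python forwards `config` entries
# (such as maxCheckThreads) into the local LanguageTool server's configuration
tool = language_tool_python.LanguageTool(
    'en-US',
    config={'maxCheckThreads': 20},
)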