I am trying to run language detection on a Series object in a pandas DataFrame. However, I am dealing with millions of rows of string data, and the standard Python language detection libraries langdetect and langid are too slow; after hours of running, the job still hasn't completed.
I set up my code as follows:
import langid

# function to detect language
def detect_language(cell):
    if len(cell) > 0:
        lan = langid.classify(cell)  # returns a (language_code, score) tuple
    else:
        lan = "NaN"
    return lan

# language detection using the langid module
df['language'] = df.apply(lambda row: detect_language(row.Series), axis=1)
Does anybody have suggestions on how to speed up my code or if there is another library out there?
You could use swifter to make your df.apply() call more efficient. In addition to that, you might want to try the whatthelang library, which should be faster than langdetect.
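As a rough sketch of the apply-speedup idea: swifter exposes a drop-in `df.swifter.apply(...)` with the same signature as pandas' own `apply`, and applying over the single text column avoids building a row object for every row, which is already faster than `df.apply(..., axis=1)`. The stub classifier below stands in for `langid.classify` so the example is self-contained, and the column name `text` is an assumption.

```python
import pandas as pd

def classify_stub(cell):
    """Placeholder for langid.classify: returns a (language_code, score) tuple."""
    return ("en", 0.99) if cell else "NaN"

# assumed example data; your real DataFrame has millions of rows
df = pd.DataFrame({"text": ["hello world", "", "bonjour"]})

# Column-wise apply: no per-row Series objects are constructed,
# unlike df.apply(..., axis=1)
df["language"] = df["text"].apply(classify_stub)

# With swifter installed, the drop-in equivalent would be:
# import swifter  # noqa: F401
# df["language"] = df["text"].swifter.apply(langid.classify)
```

Even without swifter, switching from a row-wise `df.apply(..., axis=1)` to a column-wise `Series.apply` usually gives a noticeable speedup on large frames.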