python · nlp · bigdata · language-detection

Language detection in Python for big data


I am trying to run language detection on a Series of strings in a pandas DataFrame. However, I am dealing with millions of rows, and the standard Python language detection libraries, langdetect and langid, are too slow; after hours of running, the job still hasn't completed.

I set up my code as follows:

# language detection using the langid module
import langid

# function to detect the language of a single cell
def detect_language(cell):
    if len(cell) > 0:
        lan = langid.classify(cell)  # returns a (language, score) tuple
    else:
        lan = "NaN"
    return lan

# 'text' stands in for the actual string column; the original
# `row.Series` raises an AttributeError, and applying to the
# column directly avoids a row-wise apply over the whole frame
df['language'] = df['text'].apply(detect_language)

Does anybody have suggestions on how to speed up my code or if there is another library out there?


Solution

  • You could use swifter to parallelize your df.apply() and make it more efficient. In addition to that, you might want to try the whatthelang library, which is built on fastText and should be faster than langdetect.
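
    A minimal sketch of that combination, assuming `swifter` and `whatthelang` are installed (`pip install swifter whatthelang`); the stub detector and the `text` column name are illustrative fallbacks, not part of either library:

    ```python
    import pandas as pd

    try:
        from whatthelang import WhatTheLang  # fastText-based detector

        _wtl = WhatTheLang()  # build the model once, not per row

        def detect_language(text):
            return _wtl.predict_lang(text) if text else "NaN"
    except ImportError:
        # illustrative stub so the sketch runs without whatthelang installed
        def detect_language(text):
            return "en" if text else "NaN"

    df = pd.DataFrame({"text": ["Hello world", "", "Bonjour le monde"]})

    try:
        import swifter  # noqa: F401 -- adds a .swifter accessor to pandas objects
        df["language"] = df["text"].swifter.apply(detect_language)
    except ImportError:
        df["language"] = df["text"].apply(detect_language)  # plain pandas fallback
    ```

    Instantiating the detector once outside the function matters at this scale: with millions of rows, any per-row setup cost dominates the actual classification time.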