python language-detection

Detect languages in a column, but ignore ambiguous values. Why am I getting an error?


Here is a sample dataset:

ID Details
1 Here Are the Details on Facebook's Global Part...
2 Aktien New York Schluss: Moderate Verluste nac...
3 Clôture de Wall Street : Trump plombe la tend...
4 ''
5 NaN

I need to add a 'Language' column that records which language is used in the 'Details' column, so that the result looks like this:

ID Details Language
1 Here Are the Details on Facebook's Global Part... en
2 Aktien New York Schluss: Moderate Verluste nac... de
3 Clôture de Wall Street : Trump plombe la tend... fr
4 '' NaN
5 NaN NaN

I tried this code:

!pip install langdetect
from langdetect import detect, DetectorFactory
DetectorFactory.seed = 0
df2=df.dropna(subset=['Details'])
df2['Language']=df2['Details'].apply(detect)

It failed. I guess this is because of rows with values like the one at 'ID'=4, so I tried this:

!pip install langdetect
from langdetect import detect, DetectorFactory
DetectorFactory.seed = 0
df2=df.dropna(subset=['Details'])
df2['Language']=df2['Details'].apply(lambda x: detect(x) if len(x)>1 else np.NaN)

However, I still got an error:

LangDetectException: No features in text.


Solution

  • You can catch the error and return NaN from the function you apply. Note that .apply() accepts any callable that takes one input and returns one output; it doesn't have to be a lambda.

    import numpy as np
    from langdetect import detect
    from langdetect.lang_detect_exception import LangDetectException

    def detect_lang(x):
        # Mirror the length check from your lambda
        if len(x) <= 1:
            return np.nan
        try:
            lang = detect(x)
            if lang:
                return lang  # Return lang if it is not empty
        except LangDetectException:
            pass  # Swallow the error and fall through to the NaN return below
        return np.nan  # Reached if lang was empty or detect() raised

    df2['Language'] = df2['Details'].apply(detect_lang)
    

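    I'm not sure why you had if len(x)>1 in there: it only returns NaN when the original string has zero or one characters. Any longer value that contains nothing usable for detection (for example, if row 4 literally holds the two quote characters '', or a string of only digits, whitespace, or punctuation) still reaches detect(), which is most likely what raises "No features in text". I kept the check in detect_lang so that it behaves consistently with your lambda.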
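
As a possible alternative to the length check, here is a minimal sketch that skips values with no alphabetic characters before calling the detector, and still catches errors as a safety net. The names has_letters, detect_or_nan, and fake_detect are hypothetical helpers, not part of langdetect. The "no letters" rule is only an assumption about which inputs trigger the error, and fake_detect is a stand-in so the snippet runs without the library installed.

```python
import math


def has_letters(text):
    """Return True if text is a string with at least one alphabetic character."""
    return isinstance(text, str) and any(ch.isalpha() for ch in text)


def detect_or_nan(text, detector, errors=(Exception,)):
    """Call detector(text) only when text has letters; return NaN otherwise.

    `errors` is the tuple of exception types to swallow. With langdetect you
    would pass (LangDetectException,) instead of the broad default.
    """
    if not has_letters(text):
        return math.nan
    try:
        # An empty result is treated the same as a failed detection
        return detector(text) or math.nan
    except errors:
        return math.nan


# Stand-in detector for demonstration only (hypothetical).
# Replace it with langdetect.detect in real use.
def fake_detect(text):
    if not text.strip("'\" "):
        raise ValueError("No features in text.")
    return "en"


samples = ["Here are the details", "''", "", "12345", None, float("nan")]
results = [detect_or_nan(s, fake_detect) for s in samples]
print(results)
```

With langdetect installed, the same helper could be applied to the column as df['Details'].apply(lambda x: detect_or_nan(x, detect, (LangDetectException,))). Because has_letters rejects non-string values, this version also tolerates NaN in the column, so the dropna step would be optional under these assumptions.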