python, python-3.x, chardet

chardet.detect returns an empty language


I'm using chardet.detect to detect the language of a string, as in one of the solutions suggested here.

My code looks like this:

import chardet

print(chardet.detect('test'.encode()))
print(chardet.detect('בדיקה'.encode()))
print(chardet.detect('тест'.encode()))
print(chardet.detect('テスト'.encode()))

The result I got looks like this:

{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
{'encoding': 'utf-8', 'confidence': 0.9690625, 'language': ''}
{'encoding': 'utf-8', 'confidence': 0.938125, 'language': ''}
{'encoding': 'utf-8', 'confidence': 0.87625, 'language': ''}

The result I expected looks like this:

{'encoding': 'ascii', 'confidence': 1.0, 'language': 'English'}
{'encoding': 'utf-8', 'confidence': 0.9690625, 'language': 'Hebrew'}
{'encoding': 'utf-8', 'confidence': 0.938125, 'language': 'Russian'}
{'encoding': 'utf-8', 'confidence': 0.87625, 'language': 'Japanese'}

I'd prefer to stick with chardet because I'm already importing it in my application, and I want to keep the application as slim as possible.


Solution

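  • The chardet module is not very good at detecting either charsets or languages. Its language field is tied to the detected encoding and is typically filled in only for legacy encodings that imply a language (Shift_JIS for Japanese, for example), so for ASCII and UTF-8 input it comes back empty. Of the options listed at Python: How to determine the language?, I've found pyCLD3 easy to install and accurate even on fairly short snippets of text, though it is not perfect with single words like your test strings: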

    >>> import cld3
    >>> cld3.get_language("test")
    LanguagePrediction(language='ko', probability=0.3396911025047302, is_reliable=False, proportion=1.0)

    >>> cld3.get_language("בדיקה")
    LanguagePrediction(language='iw', probability=0.9995728731155396, is_reliable=True, proportion=1.0)

    >>> cld3.get_language("тест")
    LanguagePrediction(language='bg', probability=0.9895398616790771, is_reliable=True, proportion=1.0)

    >>> cld3.get_language("テスト")
    LanguagePrediction(language='ja', probability=1.0, is_reliable=True, proportion=1.0)
    

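    That is effectively three out of four: "test" is misidentified as Korean (and flagged is_reliable=False), while тест is a valid word in Bulgarian as well as Russian, so bg is a reasonable answer. Note that iw is the legacy language code for Hebrew. The langid module gets all four of these right, so it may be a good option too.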
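
If adding another dependency really is off the table, one rough fallback is to guess the writing system instead of the language, using only the standard library. The sketch below is not what chardet, pyCLD3 or langid do internally; guess_script is a hypothetical helper name, and it relies on the assumption that the first word of a letter's Unicode character name (LATIN, HEBREW, CYRILLIC, KATAKANA, ...) is a usable stand-in for its script.

```python
import unicodedata
from collections import Counter


def guess_script(text):
    """Return the most common script-like prefix among the letters in text.

    Hypothetical helper: it uses the first word of each letter's Unicode
    name as an approximation of the script. It identifies a writing
    system, not a language.
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation and whitespace
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0]] += 1
    if not counts:
        return None  # no letters found
    return counts.most_common(1)[0][0]


for sample in ("test", "בדיקה", "тест", "テスト"):
    print(sample, guess_script(sample))
```

This separates the four samples from the question by script, but it cannot tell Russian from Bulgarian or English from French, since those pairs share an alphabet. For actual language names, a dedicated detector such as langid or pyCLD3 is still needed.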