python language-detection

How to fix langdetect's unstable results


I'd like to detect the language of texts using langdetect. According to the documentation, I have to set a seed to get stable results:

Language detection algorithm is non-deterministic, which means that if you try to run it on a text which is either too short or too ambiguous, you might get different results everytime you run it. To enforce consistent results, call following code before the first language detection:

As shown below, setting the seed does not seem to work: the results still change from one call to the next. What did I miss?

from langdetect import detect, detector_factory, detect_langs

my_string = "Hi, my friend lives next to me. Can you call her? Thibault François. Envoyé depuis mon mobile"

detector_factory.seed = 42

for i in range(5):
    print(detect_langs(my_string), detect(my_string))

Example result:

[fr:0.7142820855500301, en:0.28571744799229243] en
[fr:0.7142837342663328, en:0.2857140098811736] en
[en:0.571427940246422, fr:0.4285710874902514] fr
[en:0.5714284102904427, fr:0.42857076299207464] fr
[en:0.5714277269187811, fr:0.4285715961184375] fr

Solution

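  • If you use DetectorFactory (as suggested in the documentation) instead of detector_factory, it works. As far as I can tell from the langdetect source, detector_factory is the submodule that defines the DetectorFactory class, so detector_factory.seed = 42 only creates a new, unused attribute on the module. The seed that the detector actually reads is the class attribute DetectorFactory.seed.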

    from langdetect import detect, DetectorFactory, detect_langs
    
    my_string = "Hi, my friend lives next to me. Can you call her? Thibault François. Envoyé depuis mon mobile"
    
    DetectorFactory.seed = 42
    
    for i in range(5):
        print(detect_langs(my_string), detect(my_string))
    

    Result:

    [en:0.5714271973455635, fr:0.42857096898887964] en
    [en:0.5714271973455635, fr:0.42857096898887964] en
    [en:0.5714271973455635, fr:0.42857096898887964] en
    [en:0.5714271973455635, fr:0.42857096898887964] en
    [en:0.5714271973455635, fr:0.42857096898887964] en
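
To illustrate why the original attempt failed silently, here is a minimal toy model of the situation. It does not use langdetect's real code: toy_module, ToyFactory and toy_detect are hypothetical stand-ins, and the only assumption is that the library reads its seed from a class attribute, as DetectorFactory.seed suggests. Python lets you assign any attribute to a module object without raising an error, so the wrong assignment goes unnoticed.

```python
import random
import types

# Toy stand-in for a library submodule; NOT langdetect's real code.
toy_module = types.ModuleType("toy_detector_factory")


class ToyFactory:
    # Class-level seed, read each time a "detection" runs.
    seed = None


def toy_detect():
    """Return a pseudo-random score from a generator seeded with ToyFactory.seed."""
    # A seed of None makes random.Random fall back to system entropy.
    rng = random.Random(ToyFactory.seed)
    return rng.random()


toy_module.ToyFactory = ToyFactory
toy_module.toy_detect = toy_detect

# Mistake: this only adds a new attribute to the module; nothing reads it.
toy_module.seed = 42
unseeded = {toy_module.toy_detect() for _ in range(5)}

# Fix: set the attribute that the code actually reads.
toy_module.ToyFactory.seed = 42
seeded = {toy_module.toy_detect() for _ in range(5)}

print(len(unseeded))  # more than one distinct value in practice
print(len(seeded))    # 1, because every call starts from the same seed
```

The same reasoning applies to the question's code: the assignment to detector_factory.seed succeeds, but each detection still runs unseeded.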