Load spaCy language model according to the detected language


Everywhere I look, I see this example, which uses the LanguageDetector class from the spacy_langdetect package:

import spacy
from spacy.language import Language
from spacy_langdetect import LanguageDetector

def get_lang_detector(nlp, name):
    return LanguageDetector()

nlp = spacy.load("en_core_web_sm")
Language.factory("language_detector", func=get_lang_detector)
nlp.add_pipe('language_detector', last=True)
text = 'This is an english text.'
doc = nlp(text)
print(doc._.language)

But how do I load the correct language model for the detected language, given that the code above always loads only the English model?

I want something like:

languageCode = LanguageDetector.detect('This is a text example')
nlp = spacy.load(languageCode.lower() + "_core_web_sm")

Solution

  • If you are not restricted to spaCy alone, you can use the lingua-language-detector library to detect the language first.

    The spaCy documentation lists all available trained models, so you can build a dictionary like the following (add as many languages as you need):

    spacy_model_mapping = {
        "english": "en_core_web_sm",
        # The models below are named "..._core_news_sm", not "..._core_web_sm".
        "french": "fr_core_news_sm",
        "german": "de_core_news_sm",
        "spanish": "es_core_news_sm",
        "portuguese": "pt_core_news_sm",
        "italian": "it_core_news_sm",
        "dutch": "nl_core_news_sm",
    }
    

    Proceeding as follows:

    import spacy
    from lingua import Language, LanguageDetectorBuilder

    # Languages the detector should choose between
    supported_languages = [
        Language.ENGLISH, Language.FRENCH, Language.GERMAN, Language.SPANISH,
        Language.PORTUGUESE, Language.ITALIAN, Language.DUTCH,
    ]
    detector = LanguageDetectorBuilder.from_languages(*supported_languages).build()
    
    text = "Ceci est un texte en français."
    
    # detect_language_of returns a lingua Language member
    # (it may be None if no language can be determined)
    result = detector.detect_language_of(text)
    detected_language_name = result.name.lower()  # e.g. "french"
    
    # Look up the matching spaCy model name
    spacy_model_name = spacy_model_mapping.get(detected_language_name)
    print("SpaCy model name:", spacy_model_name)
    

    Obtaining:

    SpaCy model name: fr_core_news_sm
    

    And finally:

    nlp = spacy.load(spacy_model_name)
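
If you need this for many texts, the steps above could be wrapped in a small helper so that each model is loaded only once. This is only a sketch under a few assumptions of mine: the names model_name_for, load_pipeline and nlp_for_text are made up for illustration (they are not part of spaCy or lingua), and falling back to the English model when the language is unknown or undetected is my own choice, not something either library does for you.

```python
from functools import lru_cache

# Same mapping as above: lower-cased lingua language name -> spaCy model name.
SPACY_MODEL_MAPPING = {
    "english": "en_core_web_sm",
    "french": "fr_core_news_sm",
    "german": "de_core_news_sm",
    "spanish": "es_core_news_sm",
    "portuguese": "pt_core_news_sm",
    "italian": "it_core_news_sm",
    "dutch": "nl_core_news_sm",
}

# Assumption: use English when the language is unknown or not detected.
DEFAULT_MODEL = "en_core_web_sm"


def model_name_for(language_name, default=DEFAULT_MODEL):
    """Map a language name such as "FRENCH" (or None) to a spaCy model name."""
    if language_name is None:
        return default
    return SPACY_MODEL_MAPPING.get(language_name.lower(), default)


@lru_cache(maxsize=None)
def load_pipeline(model_name):
    """Load a spaCy pipeline once and reuse it on later calls."""
    import spacy  # imported lazily so the mapping logic works without spaCy

    return spacy.load(model_name)


def nlp_for_text(text, detector):
    """Detect the language of `text` with a lingua detector and return the
    matching spaCy pipeline (hypothetical helper)."""
    result = detector.detect_language_of(text)
    language_name = result.name if result is not None else None
    return load_pipeline(model_name_for(language_name))
```

With the detector built as above, usage would be nlp = nlp_for_text(text, detector) followed by doc = nlp(text). Each model still has to be installed beforehand, e.g. with python -m spacy download fr_core_news_sm, otherwise spacy.load raises an error.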