Everywhere I look, I see this example for the LanguageDetector class from the spacy_langdetect package:
import spacy
from spacy.language import Language
from spacy_langdetect import LanguageDetector
def get_lang_detector(nlp, name):
    return LanguageDetector()
nlp = spacy.load("en_core_web_sm")
Language.factory("language_detector", func=get_lang_detector)
nlp.add_pipe('language_detector', last=True)
text = 'This is an english text.'
doc = nlp(text)
print(doc._.language)
But how do I load the correct language module according to the detected language, if the previous code always only loads the English module?
I want something like
languageCode = LanguageDetector.detect('This is a text example')
nlp = spacy.load(languageCode.lower() + "_core_web_sm")
If you are not restricted to spaCy alone, you can use the lingua-language-detector library to detect the language first.
The spaCy documentation lists all available models, so you can build a dictionary like the following (including as many languages as you want):
# Model names are not uniform: English uses "core_web",
# while the other languages listed here use "core_news".
spacy_model_mapping = {
    "english": "en_core_web_sm",
    "french": "fr_core_news_sm",
    "german": "de_core_news_sm",
    "spanish": "es_core_news_sm",
    "portuguese": "pt_core_news_sm",
    "italian": "it_core_news_sm",
    "dutch": "nl_core_news_sm",
}
Then proceed as follows:
import spacy
from lingua import Language, LanguageDetectorBuilder
# Restrict detection to the languages that have an entry in the mapping
supported_languages = [Language.ENGLISH, Language.FRENCH, Language.GERMAN, Language.SPANISH, Language.PORTUGUESE, Language.ITALIAN, Language.DUTCH]
detector = LanguageDetectorBuilder.from_languages(*supported_languages).build()
text = "Ceci est un texte en français."
# detect_language_of returns a Language member, or None if no language could be determined
result = detector.detect_language_of(text)
# Language.FRENCH.name is "FRENCH"; lowercase it to match the dictionary keys
detected_language_name = result.name.lower()
spacy_model_name = spacy_model_mapping.get(detected_language_name)
print("SpaCy model name:", spacy_model_name)
This prints:
>>> SpaCy model name: fr_core_news_sm
And finally, load the model (it must already be downloaded):
nlp = spacy.load(spacy_model_name)
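If you process many texts, you probably do not want to call spacy.load for every one of them. Here is a minimal sketch of one way to wrap the steps above, assuming you only need three languages and want to fall back to English when detection fails. The names SPACY_MODELS, PipelineRouter and build_router are made up for this illustration and are not part of spaCy or lingua; the detector and loader are passed in as plain callables so the routing logic can be tried without either library installed.

```python
from typing import Any, Callable, Dict, Optional

# Hypothetical mapping from lowercase language names to spaCy model names.
# Extend it with whichever models you have installed.
SPACY_MODELS: Dict[str, str] = {
    "english": "en_core_web_sm",
    "french": "fr_core_news_sm",
    "german": "de_core_news_sm",
}


class PipelineRouter:
    """Pick a pipeline per text and load each model at most once."""

    def __init__(
        self,
        detect: Callable[[str], Optional[str]],
        load: Callable[[str], Any],
        models: Dict[str, str] = SPACY_MODELS,
        default: str = "english",
    ) -> None:
        self._detect = detect    # text -> lowercase language name, or None
        self._load = load        # model name -> loaded pipeline
        self._models = models
        self._default = default  # assumed fallback when detection fails
        self._cache: Dict[str, Any] = {}

    def model_name_for(self, text: str) -> str:
        language = self._detect(text)
        if language is None or language not in self._models:
            language = self._default
        return self._models[language]

    def pipeline_for(self, text: str) -> Any:
        name = self.model_name_for(text)
        if name not in self._cache:
            # Loading a model is slow, so keep the result for later texts.
            self._cache[name] = self._load(name)
        return self._cache[name]


def build_router() -> PipelineRouter:
    """Wire the router to lingua and spaCy (imported only when called)."""
    import spacy
    from lingua import Language, LanguageDetectorBuilder

    languages = [Language.ENGLISH, Language.FRENCH, Language.GERMAN]
    detector = LanguageDetectorBuilder.from_languages(*languages).build()

    def detect(text: str) -> Optional[str]:
        result = detector.detect_language_of(text)
        return result.name.lower() if result is not None else None

    return PipelineRouter(detect, spacy.load)
```

Usage would then look like router = build_router() followed by doc = router.pipeline_for(text)(text). Each model still has to be installed first, for example with python -m spacy download fr_core_news_sm.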