[SOLVED] Language detection for short user-generated string

I need to detect the language of text sent in a chat, and I am facing two problems:

the messages are very short
the messages may contain errors and noise (emoji, etc.)

For the noise, I clean the message and that works fine, but the length of the message is a problem.
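
The cleaning code is not shown here. As a minimal sketch of what such a step could look like, assuming "noise" means URLs, @mentions, emoji, digits and punctuation, the following uses only the standard library. The name clean_message and the regular expression are hypothetical and are not the code actually used for this task.

```python
import re
import unicodedata

# Assumption: URLs and @mentions count as noise for language detection.
_URL_OR_MENTION = re.compile(r"https?://\S+|www\.\S+|@\w+")


def clean_message(text: str) -> str:
    """Keep letters, apostrophes and marks attached to letters; drop the rest."""
    text = _URL_OR_MENTION.sub(" ", text)
    kept = []
    for ch in text:
        major = unicodedata.category(ch)[0]  # "L" = letter, "M" = mark, ...
        follows_kept_char = bool(kept) and kept[-1] != " "
        if major == "L" or ch in "'’":
            kept.append(ch)
        elif major == "M" and follows_kept_char:
            # Combining marks are part of the word in many scripts,
            # but only when they follow a character that was kept.
            kept.append(ch)
        else:
            # Emoji, digits, punctuation, newlines and the like become spaces.
            kept.append(" ")
    # Collapse runs of whitespace into single spaces.
    return " ".join("".join(kept).split())


print(clean_message("Hi 😀!!! https://example.com @bob"))
```

Whether digits and punctuation should really be dropped depends on the data; this sketch errs on the side of keeping only word characters.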

For example, if a user writes "hi", fastText detects the language as German (de), but Google Translate detects it as English, and the message is most likely English.

I tried to train my own fastText model, but how can I adjust the model to get better results on short strings? Do I need to train it on dictionaries for many languages to get better results?

I use fastText because it's the most accurate language detector I have found.

Here is an example of the problem with fastText:

# wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin

import fasttext

text = "Hi"

pretrained_lang_model = "lid.176.bin"
model = fasttext.load_model(pretrained_lang_model)

predictions = model.predict(text, k=2)
print(predictions)
# (('__label__de', '__label__en'), array([0.51606238, 0.31865335]))

Solution

I found a way to get better results. If you sum the probabilities for each language across several detectors, such as fastText and lingua, and add dictionary-based detection for short texts, you can get very good results (for my task, I also trained a fastText model on my own data).
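
The idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the exact code used: each detector's output is assumed to have been converted to a {language_code: probability} dict first, and the names SHORT_WORDS, dictionary_scores, combine and max_dict_tokens, as well as the tiny word lists, are hypothetical. Only the fastText scores for "Hi" are taken from the output shown in the question.

```python
from collections import defaultdict

# Hypothetical word lists; a real one would hold many words per language.
SHORT_WORDS = {
    "en": {"hi", "hello", "yes", "no", "thanks"},
    "de": {"hallo", "ja", "nein", "danke"},
    "fr": {"salut", "bonjour", "oui", "non", "merci"},
}


def fasttext_to_dict(labels, probs):
    """Turn fastText's (labels, probabilities) pair into {lang: probability}."""
    return {
        label.replace("__label__", ""): float(p)
        for label, p in zip(labels, probs)
    }


def dictionary_scores(text):
    """Score each language by the share of tokens found in its word list."""
    tokens = text.lower().split()
    if not tokens:
        return {}
    scores = {}
    for lang, words in SHORT_WORDS.items():
        hits = sum(1 for tok in tokens if tok in words)
        if hits:
            scores[lang] = hits / len(tokens)
    return scores


def combine(text, detector_outputs, max_dict_tokens=3):
    """Sum per-language probabilities over all detectors.

    For texts of at most max_dict_tokens tokens (an assumed threshold),
    the dictionary score is added as one more vote.
    """
    totals = defaultdict(float)
    for output in detector_outputs:
        for lang, prob in output.items():
            totals[lang] += prob
    if len(text.split()) <= max_dict_tokens:
        for lang, score in dictionary_scores(text).items():
            totals[lang] += score
    if not totals:
        return None, {}
    best = max(totals, key=totals.get)
    return best, dict(totals)


# fastText scores for "Hi", copied from the output in the question.
fasttext_scores = fasttext_to_dict(
    ("__label__de", "__label__en"), [0.51606238, 0.31865335]
)
best, totals = combine("Hi", [fasttext_scores])
print(best, totals)
```

With these inputs the dictionary vote for "hi" is enough to move the result from de to en. In practice, more detectors (for example lingua, or a fastText model trained on your own data) would each contribute one dict to detector_outputs, and the threshold and any per-detector weights would need tuning on real messages.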