nlp · multilingual · word2vec

How can I train a multilingual Word2Vec model with aligned embeddings?


I'm working on a cross-lingual project involving semantic search in both Persian and English.

I want to create a single Word2Vec model where semantically equivalent words (e.g., "خانه" ↔︎ "house") have close vector representations.


Solution

  • Good day!

    There are two solutions.

    Solution 1:

    Train a single model on mixed data that contains text in both English and Persian. Ideally you should have a large, balanced dataset. (You can also find data on the internet and machine-translate existing text into the language you need, though translation artifacts are possible.)

    Explanation: by "mixed data" I mean a corpus in which individual sentences freely combine English and Persian words, e.g. "en en fa en en en" / "en en en en en en" / "fa fa fa fa fa fa" / "fa fa en fa fa fa". Because translation pairs then occur in the same contexts, this approach causes the words you care about to end up with similar vectors.

    Solution number 2:

    The difference from the first solution is that here you train two separate models, one per language. Download an English–Persian dictionary, then use VecMap (available on GitHub) to align the two vector spaces using the dictionary as supervision. Finally, combine the aligned vectors into a single model.
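    VecMap itself is a command-line tool, but the core idea behind supervised alignment can be sketched as an orthogonal Procrustes problem: given two embedding matrices whose rows are dictionary translation pairs, find the rotation that maps one space onto the other. Everything below is synthetic data for illustration only:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n_pairs, dim = 100, 50

    # Hypothetical "English" vectors for 100 dictionary entries.
    en = rng.normal(size=(n_pairs, dim))

    # Simulate "Persian" vectors as a rotated copy of the English
    # ones plus a little noise, so a good alignment exists.
    true_rot = np.linalg.qr(rng.normal(size=(dim, dim)))[0]
    fa = en @ true_rot.T + 0.01 * rng.normal(size=(n_pairs, dim))

    # Orthogonal Procrustes: W = argmin ||fa @ W - en||_F with W orthogonal.
    # Closed-form solution via SVD of the cross-covariance matrix.
    u, _, vt = np.linalg.svd(fa.T @ en)
    W = u @ vt

    aligned = fa @ W
    err = np.linalg.norm(aligned - en) / np.linalg.norm(en)
    print(f"relative alignment error: {err:.4f}")
    ```

    Real pipelines (VecMap included) add refinements such as length normalization, mean centering, and iterative dictionary induction, but this rotation step is the heart of the method.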

    If you have another solution, I would be glad to hear it.