I'm working on a cross-lingual project involving semantic search in both Persian and English.
I want to create a single Word2Vec model where semantically equivalent words (e.g., "خانه" ↔︎ "house") have close vector representations.
Good day!
You can train a single model on mixed data, i.e., a corpus that contains both English and Persian text. A large, balanced dataset works best. You can also find data on the internet and translate existing text into the language you need with a machine translator (translation artifacts are possible!).
Explanation: by "mixed data" I mean that individual sentences freely mix English and Persian words, e.g. "en en fa en en en" / "en en en en en en" / "fa fa fa fa fa fa" / "fa fa en fa fa fa". Because equivalent words then appear in shared contexts, they end up with similar vectors.
The second solution differs in that you train two separate monolingual models. Download an English-Persian dictionary, then use VecMap (available on GitHub) to align the two vector spaces, and finally combine the aligned vectors into a single model.
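In practice you would run VecMap's own scripts on the two saved embedding files together with the dictionary. To show the core idea behind supervised alignment, here is a numpy-only sketch of an orthogonal Procrustes mapping (one of the building blocks of this kind of alignment); the matrices are synthetic stand-ins for the dictionary-pair vectors, not real embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

# X: English vectors for the dictionary words (5 pairs, 4-dim, hypothetical).
X = rng.normal(size=(5, 4))

# Pretend the Persian space is the English one under an unknown rotation R_true,
# so Y holds the Persian vectors for the same dictionary words.
R_true = np.linalg.qr(rng.normal(size=(4, 4)))[0]
Y = X @ R_true

# Orthogonal Procrustes: the rotation W minimizing ||X W - Y|| is U @ Vt,
# where U, S, Vt is the SVD of X^T Y.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

# Map the English vectors into the Persian space; after this, vectors from
# both models can be merged into one shared space.
aligned = X @ W
```

VecMap does considerably more than this (normalization, iterative refinement, unsupervised modes), so for real data use its scripts rather than this sketch; the sketch only shows why a bilingual dictionary is enough to tie the two spaces together.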
If you have another solution, I would be glad to hear it.