Tags: nlp, tokenize, word-embedding, text2vec

text2vec word embeddings: compounding some tokens but not all


I am using {text2vec} word embeddings to build a dictionary of similar terms pertaining to a certain semantic category.

Is it OK to compound some tokens in the corpus but not all? For example, I want to find terms similar to "future generation" or "rising generation", but these collocations of course occur as separate tokens in the original corpus. Is it bad practice to gsub("rising generation", "rising_generation", ...) without also compounding other terms that frequently occur together, such as "climate change"? A minimal sketch of the preprocessing I have in mind is below.
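
For concreteness, this is the kind of manual compounding I mean (the example sentences are made up):

    corpus <- c("what will the rising generation do about climate change",
                "the future generation will inherit climate change")
    # join only the target collocations; "climate change" stays as two tokens
    corpus <- gsub("rising generation", "rising_generation", corpus, fixed = TRUE)
    corpus <- gsub("future generation", "future_generation", corpus, fixed = TRUE)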

Thanks!


Solution

  • Yes, it's fine. It may or may not work exactly the way you want, but it's worth trying.

    You might want to look at the Collocations class in text2vec, which can automatically detect and join frequent phrases for you (see the first sketch below). You can still join phrases manually on top of that if you want. In Python, Gensim's Phrases module does the same thing.

    Given that training word vectors usually doesn't take long, it's best to try the different approaches and see which one works better for your goal; the second sketch below continues from the first and queries neighbours of a compounded token.
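
    A minimal sketch of the automatic route, using text2vec's Collocations class on the movie_review data that ships with the package. The thresholds (collocation_count_min, n_iter) are illustrative values to tune, not recommendations:

        library(text2vec)
        data("movie_review")

        it <- itoken(movie_review$review, preprocessor = tolower,
                     tokenizer = word_tokenizer)

        # learn frequent phrases; a second pass lets longer phrases
        # build on shorter ones detected in the first
        cc <- Collocations$new(collocation_count_min = 50)
        cc$fit(it, n_iter = 2)
        cc$collocation_stat  # inspect which phrases were detected

        # iterator in which detected phrases are joined into single
        # tokens such as "special_effects"
        it_phrases <- cc$transform(it)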
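
    From there, training GloVe vectors and querying cosine neighbours of a compounded token might look like the sketch below. The hyperparameters (rank, x_max, skip_grams_window, term_count_min) are placeholders, and "rising_generation" will only have a vector if it actually survives vocabulary pruning in your own corpus (it won't exist in movie_review):

        v <- prune_vocabulary(create_vocabulary(it_phrases),
                              term_count_min = 5)
        vectorizer <- vocab_vectorizer(v)
        tcm <- create_tcm(it_phrases, vectorizer, skip_grams_window = 5)

        glove <- GlobalVectors$new(rank = 50, x_max = 10)
        wv_main <- glove$fit_transform(tcm, n_iter = 10)
        wv <- wv_main + t(glove$components)  # sum main + context vectors

        # cosine neighbours of a compounded token; assumes the token
        # made it into the vocabulary (use your own corpus here)
        sim <- sim2(wv, wv["rising_generation", , drop = FALSE],
                    method = "cosine")
        head(sort(sim[, 1], decreasing = TRUE), 10)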