javascript node.js nlp algorithmia

Avoid synonyms in an array generated via AutoTag (Text Tagging Algorithm)


I have been working on a text analysis task in which I need to identify the most frequently used words in a paragraph.

I am using the algorithmia npm package for this. It gives me the words repeated most often in my text.

The package works quite well, but I still have two issues:

(1) I get an array of tags like the one shown below:

['integrate', 'integration', 'policy', 'conversation', 'demo', 'test']

Here, 'integrate' and 'integration' have the same meaning, so I want to drop 'integrate'.

(2) The process identifies tags from the words repeated most often. My input paragraph contains words like 'pricing', 'cost', 'payment', etc., but since these are not exact matches of one another, I do not get the tag 'cost' or anything similar.

Improving either of these would help me with the task.


I have already tried many libraries for synonyms, nouns, verbs, etc., but none of them seems to work. Here are the packages I have already tried:

thesaurus-com

sentence-similarity

string-similarity

compromise

wordnet

node-snowball

datamuse


I have also tried setting a similarity threshold to match the words 'integrate' and 'integration'. It does remove the 'integrate' tag, but it also affects some of my other tags that need to stay.


Thanks in advance


Solution

  • Your problem lies deep in Natural Language Understanding. You are not only dealing with finding words that are similar, you are dealing with the concepts that lie beneath the words.

    In your case, "integrate" and "integration" are not similar at all. They are not even synonyms. One is a verb, the other is a noun; one is an action, the other is a situation.

    What they do share is a common semantic root: the idea of having things together as one, integral.

    No single tool available (as of now) does this, but you can combine several tools.

    You mentioned WordNet and said that it did not work. However, it is probably the best bet for your problem. WordNet's own explanation shows how it is useful in your situation:

    "[In WordNet,] Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations." and also "WordNet superficially resembles a thesaurus, in that it groups words together based on their meanings. However, there are some important distinctions. First, WordNet interlinks not just word forms—strings of letters—but specific senses of words. As a result, words that are found in close proximity to one another in the network are semantically disambiguated." - WordNet Official Website

    With WordNet, you can find real synonyms and group them together ('pricing' and 'cost', for example - 'payment' is a whole other story...).

    Now, regarding your original 'integrate' and 'integration': if you really want to group them together, add another heuristic that uses a stemmer to group words by their stem (this is not guaranteed to work 100% of the time, since it depends on the stemmer's rules).
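
The stemmer heuristic could look roughly like the sketch below. It is only an illustration under stated assumptions: `naiveStem` is a hypothetical, deliberately tiny suffix stripper standing in for a real stemmer (such as a Snowball/Porter implementation), and keeping the longest form per stem is an arbitrary rule chosen so that 'integration' survives and 'integrate' is dropped. Neither comes from the algorithmia package.

```javascript
// Hypothetical, deliberately naive suffix list (assumption, for illustration only).
// A real stemmer has far more rules; swap one in via the `stem` parameter below.
const SUFFIXES = ['ation', 'ions', 'ion', 'ing', 'ate', 'es', 'ed', 's', 'e'];

function naiveStem(word) {
  const lower = word.toLowerCase();
  for (const suffix of SUFFIXES) {
    // Strip the first matching suffix, but keep at least 3 characters
    // so that short words are not mangled.
    if (lower.endsWith(suffix) && lower.length - suffix.length >= 3) {
      return lower.slice(0, -suffix.length);
    }
  }
  return lower;
}

// Collapse tags that share a stem into a single representative tag.
function dedupeByStem(tags, stem = naiveStem) {
  const kept = new Map(); // stem -> tag currently representing that stem
  for (const tag of tags) {
    const key = stem(tag);
    const current = kept.get(key);
    // Assumed rule: prefer the longer surface form for each stem.
    if (current === undefined || tag.length > current.length) {
      kept.set(key, tag);
    }
  }
  return [...kept.values()];
}

const tags = ['integrate', 'integration', 'policy', 'conversation', 'demo', 'test'];
const deduped = dedupeByStem(tags);
console.log(deduped);
// → [ 'integration', 'policy', 'conversation', 'demo', 'test' ]
```

Because the stemming function is passed in as a parameter, the naive stripper can be replaced by a proper stemmer without touching the grouping logic. As noted above, any stemmer will occasionally merge unrelated words or miss related ones, so the result is worth spot-checking against your real tags.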