nlpgoogle-natural-language

Find sentences with similar relative meaning from a list of sentences against an example one


I want to be able to find sentences with the same meaning. I have a query sentence, and a long list of millions of other sentences. Sentences are words, or a special type of word called a symbol which is just a type of word symbolizing some object being talked about.

For example, my query sentence is:

Example: add (x) to (y) giving (z)

There may be a list of sentences already existing in my database such as: 1. the sum of (x) and (y) is (z) 2. (x) plus (y) equals (z) 3. (x) multiplied by (y) does not equal (z) 4. (z) is the sum of (x) and (y)

The example should match the sentences in my database 1, 2, 4 but not 3. Also there should be some weight for the sentence matching.

Its not just math sentences, its any sentence which can be compared to any other sentence based upon the meaning of the words. I need some way to have a comparison between a sentence and many other sentences to find the ones with the closes relative meaning. I.e. mapping between sentences based upon their meaning.

Thanks! (the tag is language-design as I couldn't create any new tag)


Solution

  • First off: what you're trying to solve is a very hard problem. Depending on what's in your dataset, it may be AI-complete.

    You'll need your program to know or learn that add, plus and sum refer to the same concept, while multiplies is a different concept. You may be able to do this by measuring distance between the words' synsets in WordNet/FrameNet, though your distance calculation will have to be quite refined if you don't want to find multiplies. Otherwise, you may want to manually establish some word-concept mappings (such as {'add' : 'addition', 'plus' : 'addition', 'sum' : 'addition', 'times' : 'multiplication'}).

    If you want full sentence semantics, you will in addition have to parse the sentences and derive the meaning from the parse trees/dependency graphs. The Stanford parser is a popular choice for parsing.

    You can also find inspiration for this problem in Question Answering research. There, a common approach is to parse sentences, then store fragments of the parse tree in an index and search for them by common search engines techniques (e.g. tf-idf, as implemented in Lucene). That will also give you a score for each sentence.