I'm working on an app that builds a data bank of questions from old question papers. I want to maintain a table linking similar questions together as they are inserted. (The table I have in mind is a Modified Preorder Tree Traversal structure.)
The requirements I have are:
Any idea on how to proceed on the algorithm side of things would be very much appreciated.
Also, I'll be dealing with images containing math notation. Should I make sure all my images have LaTeX in the 'ALT' attribute so that they too can be processed by this algorithm, or is there a better way of doing it?
It sounds like you want to consider two questions similar when they have the same sentence structure, after stripping out a laundry list of syntactic patterns you expect to vary. As such, this problem looks much like the problem of detecting near-duplicate documents in a corpus.
One way to do that is a technique called "simhashing": you take a (preprocessed) document and calculate a simhash fingerprint. Like a typical hash, the fingerprint has a fixed size and looks like binary gibberish. Unlike a typical hash, documents that are textually similar will also have similar fingerprints, meaning fingerprints that differ in only a few bit positions. By choosing a maximum (Hamming) distance by which fingerprints may differ, you can define clusters of documents (questions) that you consider "similar".
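To make the idea concrete, here is a minimal sketch of a simhash in Python. It is an illustration under several assumptions, not a reference implementation: the features are plain lowercase words (a real system would likely use shingles or weighted terms after the preprocessing described above), MD5 is used only as a convenient 64-bit feature hash, and the names `tokens`, `simhash`, `hamming` and `find_similar` are made up for this example. The lookup is a linear scan, which is fine for a small question bank but would need a smarter index for a large one.

```python
import hashlib
import re
from collections import Counter


def tokens(text):
    # Assumed preprocessing: lowercase, keep only runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def simhash(text, bits=64):
    # Each feature (here: a word) votes +weight or -weight on every bit
    # position, depending on whether that bit is set in the feature's hash.
    votes = [0] * bits
    for word, weight in Counter(tokens(text)).items():
        digest = hashlib.md5(word.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # 64-bit feature hash
        for i in range(bits):
            votes[i] += weight if (h >> i) & 1 else -weight
    # The fingerprint has a 1 wherever the votes came out positive.
    fingerprint = 0
    for i, v in enumerate(votes):
        if v > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    # Number of bit positions in which two fingerprints differ.
    return bin(a ^ b).count("1")


def find_similar(fingerprint, index, max_distance=3):
    # index: dict mapping a question id to its stored fingerprint.
    # Linear scan; returns ids whose fingerprint is within max_distance.
    return [qid for qid, fp in index.items()
            if hamming(fingerprint, fp) <= max_distance]
```

With this sketch, two questions that reduce to the same set of words get the same fingerprint, and questions sharing most of their words tend to land within a small Hamming distance of each other. The threshold (`max_distance=3` here) is an arbitrary placeholder that you would tune against your own data.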
The process for indexing a new question would then look like this:
This book is an excellent primer on information retrieval in general. This is the simhash paper. Here's the manpage of a simple program to compute simhashes; it may be a good starting point if you don't want to implement the algorithm yourself.