In information retrieval, the inverted index has entries which are the words of corpus, and each word has a posting list which is the list of documents it appears in.
If stemming is applied, index entry would be a stem, so multiple words may finally map to the same entry if they share the same stem. For example:
Without stemming:
(slowing) --> [D1, D5, D9,...]
(slower) --> [D9, D10, D20,...]
(slow) --> [D2,...]
With stemming:
(slow) --> [D1, D2, D5, D9, , D10, D20...]
I want to avoid stemming, and instead would like to make each entry in my inverted index as a bag of words (inflections) such as (slow, slows, slowing, slowed, slower, slowest). For example:
(slow, slows, slowing, slowed, slower, slowest) --> [D1, D2, D5, D9, , D10, D20...]
Would that be possible and feasible or not?
Short Answer:
Simply avoid stemming to suit your need of not considering slow
and slows
to be a match.
Long Answer:
Question:
I want to avoid stemming, and instead would like to make each entry in my inverted index as a bag of words (inflections) such as (slow, slows, slowing, slowed, slower, slowest).
Let me try to clear some confusion that you have about inverted lists. It is the documents that are stored in the postings for each term (not the terms themselves).
The words are typically stored in a in-memory dictionary (implemented with a hash-table or a trie) containing pointers to the postings (list of documents which contain that particular term) stored and loaded on the fly from secondary storage.
A simple example (without showing document weights):
(information) --> [D1, D5, D9,...]
(informative) --> [D9, D10, D20,...]
(retrieval) --> [D1, D9, D17,...]
..
So, if you don't want to apply stemming, that's fine... In fact, the above example shows an unstemmed index, where the words information
and informative
appear in their non-conflated forms. In a conflated term index (with a stemmer or a lemmatizer), you would substitute the different forms with an equivalent representation (say inform
). In that case, the index will be:
(inform) --> [D1, D5, D9, D10, D20...]. --- union of the different forms
(retrieval) --> [D1, D9, D17,...]
..
So, this conflated representation matches all possible forms of the word information
, e.g. informative
, informational
etc.
Longer Answer
Now let's say you want to achieve the best of both worlds, i.e. a representation which allows this conflation to be done in a user controlled way, e.g. wrapping a word around quotes to denote requiring an exact match (
"slow"vs.
slowin the query), or some indicator to include synonyms for a query term for semantic search (e.g.
syn(slow)` to include synonyms of the word slow).
For this, you need to maintain separate postings for the non-conflated words and maintain additional equivalence indicating pointers
between a set of equivalent (stem relation/synonym relation/ semantic relation etc.)
terms.
Coming back to our example, you would have something like:
(E1)-->(information) --> [D1, D5, D9,...]
|---->(informative) --> [D9, D10, D20,...]
|---->(data) --> [D20, D23, D25,...]
(E2)-->(retrieval) --> [D1, D9, D17,...]
|---->(search) --> [D20, D30, D31,...]
..
Here, I have shown two examples of equivalence classes (concept representations) of two sets of terms information, data...
and retrieval, search...
.
Depending on the query syntax, it would then be possible at the retrieval time to facilitate exact search or relaxed search (based on inflections/synonyms etc.)