Tags: python, information-retrieval, whoosh

Language Model through Whoosh in Information Retrieval


I am working in information retrieval (IR).

Can anyone guide me on how to implement a language model in Whoosh? I have already applied TF-IDF and BM25. I am new to IR.

For example, the simplest form of language model throws away all conditioning context and estimates each term independently. Such a model is called a unigram language model:

P_{uni}(t_1t_2t_3t_4) = P(t_1)P(t_2)P(t_3)P(t_4)

There are many more complex kinds of language models, such as bigram language models, which condition each term on the previous term:

P_{bi}(t_1t_2t_3t_4) = P(t_1)P(t_2\vert t_1)P(t_3\vert t_2)P(t_4\vert t_3)
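To make the two formulas concrete, here is a minimal sketch that estimates both models from raw counts (maximum-likelihood estimates, no smoothing) and multiplies the per-term probabilities exactly as written above. The function names `train`, `p_uni` and `p_bi` and the toy token list are made up for illustration; they have nothing to do with Whoosh.

```python
from collections import Counter


def train(tokens):
    """Count unigrams, bigrams and bigram contexts in a token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    # A token can only act as a "previous term" if something follows it.
    contexts = Counter(tokens[:-1])
    return unigrams, bigrams, contexts


def p_uni(seq, unigrams):
    """P_uni(t_1 ... t_n) = product of P(t_i)."""
    total = sum(unigrams.values())
    p = 1.0
    for t in seq:
        p *= float(unigrams[t]) / total
    return p


def p_bi(seq, unigrams, bigrams, contexts):
    """P_bi(t_1 ... t_n) = P(t_1) * product of P(t_i | t_{i-1})."""
    if not seq:
        return 1.0
    total = sum(unigrams.values())
    p = float(unigrams[seq[0]]) / total
    for prev, cur in zip(seq, seq[1:]):
        if contexts[prev] == 0:
            return 0.0  # unseen context: the unsmoothed estimate is zero
        p *= float(bigrams[(prev, cur)]) / contexts[prev]
    return p


tokens = "a b a c a b".split()
unigrams, bigrams, contexts = train(tokens)
uni = p_uni(["a", "b"], unigrams)
bi = p_bi(["a", "b"], unigrams, bigrams, contexts)
```

Any term or term pair that never occurs in the training text gets probability zero here, which is why retrieval systems usually add some form of smoothing on top of these estimates.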

Solution

  • Take a look at Whoosh's scoring module and use BM25F (lines 276 to 332) as a reference for building your own weighting and scoring models. You need to create a Weighting Model and a Scorer. Assuming you want to call your model Unigram, the main steps would be:

    1. Implement your own Unigram weighting model class and inherit from scoring.WeightingModel:

      class Unigram(scoring.WeightingModel):

      Implement the methods required by the base class, the main one being scorer(), which returns an instance of your scorer class (next). You pass this weighting model to the searcher when you create it, and it determines how the searcher scores documents.

    2. Implement a UnigramScorer class and inherit from scoring.WeightLengthScorer:

      class UnigramScorer(scoring.WeightLengthScorer):

      Implement the __init__ and _score methods. __init__ takes the field name and term text and is called once for each term in your query when you call searcher.search(). _score is called for each matching document in your results. It takes the term's weight in the document and the length of the document's field, and returns a score for that field.

    3. When you create your searcher at search time, specify your custom language model using the weighting parameter:

      ix.searcher(weighting=Unigram)
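As a rough idea of what the body of `_score` could compute, the sketch below turns a weight and a length into the log of a unigram probability P(t | d). It is plain Python and does not use Whoosh at all. The function name `unigram_log_prob`, the `collection_prob` argument and the mixing weight `lam` are assumptions made for this example. The linear interpolation with a collection-wide probability (Jelinek-Mercer smoothing) is also my choice rather than anything the steps above prescribe; it is only there so that the estimate is never zero.

```python
import math


def unigram_log_prob(weight, length, collection_prob, lam=0.5):
    """Log of a smoothed unigram probability P(t | d).

    weight          -- occurrences of the term in the document field
    length          -- total number of terms in the document field
    collection_prob -- P(t) over the whole collection (assumed to be
                       computed once per query term, e.g. in __init__)
    lam             -- hypothetical mixing weight for the collection model
    """
    # Maximum-likelihood estimate from the document alone.
    doc_prob = float(weight) / length if length else 0.0
    # Mix with the collection model so the result is never log(0).
    return math.log((1.0 - lam) * doc_prob + lam * collection_prob)


score_hit = unigram_log_prob(2, 10, 0.1)
score_miss = unigram_log_prob(0, 10, 0.1)
```

A `UnigramScorer._score(weight, length)` could delegate to a function like this, with the collection probability stored on the scorer. Note that log probabilities are negative, so check how your result handling treats negative scores before relying on them.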