
Wapiti CRF: Understanding the model file and forcing hard constraints


I'm currently using Wapiti to detect specific product names in web pages. I've trained a model, and I'd like to list its 10 most important rules, that is, the rules with the largest weights in absolute value (positive or negative).

Here is an example of a trained model taken from the Wapiti documentation:

[...]
12:*:Pre-3 X='s,
13:*:Pre-3 X=Wel,
13:*:Suf-3 X=rid,
[...]
10=-0x1.32892bf985df3p-1
11=0x1.73883325ee8edp-4
15=0x1.034d12a224d71p-2
16=-0x1.1fa154002a2f9p+0

So, of the 3 rules above, how do I know which one has the largest weight? The rule *:Pre-3 X='s, is associated with the number "12". Is this number the weight, or is it a reference to the lines below? The number "12" does not appear in those lines, though.

Another question: is it possible to force a "hard constraint", that is, to write a rule so that whenever a given observation is seen, it always produces a given tag?


Solution

  • For your first question, look at the dump mode of Wapiti. It turns the model file into a more readable format from which it is easy to extract the features with the highest or lowest weights.

    wapiti dump model > model.txt

    This will give you a text file with one feature per line, described by 4 columns: first the pattern with its substitutions expanded, then the label at the previous position (or # for unigram patterns), then the label at the current position, and finally the feature weight.
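To pull the top 10 features out of that dump, something along these lines may work. It is only a sketch under assumptions: the four columns are taken to be whitespace-separated, the weight may appear either as a decimal number or as a C99 hexadecimal float (like the values in the model file above), and the function names and sample labels are made up for illustration.

```python
def parse_weight(text):
    """Parse a weight written as a decimal or as a C99 hex float (0x1.8p-1)."""
    try:
        return float(text)
    except ValueError:
        return float.fromhex(text)


def top_features(lines, k=10):
    """Return the k features with the largest absolute weight.

    Each line is assumed to hold: pattern, previous label, current label, weight.
    """
    features = []
    for line in lines:
        # Split from the right so that spaces inside the pattern are preserved.
        parts = line.rsplit(None, 3)
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        pattern, prev_label, cur_label, weight = parts
        features.append((pattern, prev_label, cur_label, parse_weight(weight)))
    features.sort(key=lambda feature: abs(feature[3]), reverse=True)
    return features[:k]


# Hypothetical dump lines; the labels O and B are invented for the example.
sample = [
    "u:Pre-3 X='s\t#\tO\t-0x1.32892bf985df3p-1\n",
    "u:Pre-3 X=Wel\t#\tB\t0x1.73883325ee8edp-4\n",
    "b:Suf-3 X=rid\tO\tB\t-1.25\n",
]
for pattern, prev_label, cur_label, weight in top_features(sample, k=2):
    print(f"{weight:+.6f}\t{prev_label}\t{cur_label}\t{pattern}")
```

On a real dump, the same function can be fed the open file directly, for example `top_features(open("model.txt", encoding="utf-8"))`. Sorting on the absolute value keeps strongly negative features in the ranking, which is what the question asks for.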

  • For your second question, Wapiti has a forced decoding mode made for this. If your data has N columns of observations, give Wapiti a file with N+1 columns and put the constraints in the last column. With the --force switch of the label mode, whenever a valid label is present in this last column, Wapiti will force the decoder to predict this label at this position and will take it into account at the neighboring positions.
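Building the N+1 column file can be scripted. The sketch below is one way to do it under assumptions: columns are whitespace-separated, blank lines separate sequences, the constraint is triggered by the token in one chosen column, and `_` is used as a filler that must not be the name of any label in the model. The function name, the `forced` mapping and the B-PROD label are hypothetical.

```python
def add_constraints(lines, forced, obs_col=0, placeholder="_"):
    """Append a constraint column to each data line.

    forced maps an observation (the token in column obs_col) to the label that
    must be predicted for it. Other rows get the placeholder, which is assumed
    not to be a valid label and so leaves the decoder free at that position.
    """
    out = []
    for line in lines:
        cols = line.split()
        if not cols:
            out.append("")  # blank line: keep the sequence boundary
            continue
        label = forced.get(cols[obs_col], placeholder)
        out.append("\t".join(cols + [label]))
    return out


# Hypothetical two-column data (token, POS tag) with one hard constraint.
data = ["iPhone NNP", "is VBZ", "", "nice JJ"]
for row in add_constraints(data, {"iPhone": "B-PROD"}):
    print(row)
```

The resulting lines can be written to a file and passed to the label mode together with the --force switch. A lookup keyed on a single column is the simplest case; a constraint that depends on several columns would need a different test in place of `forced.get`.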