Tags: nlp, crf, crf++

CRF++: does anybody understand what the float numbers in the model file mean?


When you build your model file with the -t option of crf_learn: crf_learn template train_data -t model

It will then generate two model files; one of them is model.txt.

Can anybody tell me what the float numbers mean?

See the following example:

version: 100
cost-factor: 1
maxid: 40
xsize: 1

B
I

U00:%x[0,0]
B

36 B 20 U00:、 26 U00:か 18 U00:が 22 U00:こ 8 U00:た 10 U00:ち 2 U00:っ 4 U00:て 34 U00:に 12 U00:の 0 U00:よ 28 U00:ら 24 U00:れ 32 U00:上 14 U00:世 16 U00:代 30 U00:地 6 U00:私

-0.3022268562246992 0.3022268562246989 -0.3629407244093161 0.3629407244093156 -0.3327259487028221 0.3327259487028215 0.3462799099537973 -0.3462799099537980 0.3452020097664334 -0.3452020097664336 -0.3218750203631590 0.3218750203631575 0.0376944272290242 -0.0376944272290280 0.3329631783491211 -0.3329631783491230 -0.3092967308014029 0.3092967308014015 0.3413769126433928 -0.3413769126433950 0.3786782765859961 -0.3786782765859980 0.5208645073272351 -0.5208645073272384 -0.3261580548802839 0.3261580548802814 -0.3615756495615902 0.3615756495615884 -0.3248593224319323 0.3248593224319312 0.3281895709166696 -0.3281895709166719 -0.3040331359589971 0.3040331359589951 0.2836939567332580 -0.2836939567332600 -0.1530917919770705 -0.1613508585854637 0.4245699543724943 -0.1101273038099901

My understanding is that each float number should correspond to one entry; for instance, the first float number, "-0.3022268562246992", should correspond to "36 B". But why is the number of floats double the number of entries? What do those float numbers mean?

Many thanks,

Shuai Hua


Solution

  • After reading parts of the CRF++ 0.58 source code, I understand how to interpret the crf_learn output. I will use some examples to explain it.

    ==== Basic ====

    Let's assume we have the following training data:

    毎  k   B
    日  k   I
    新  k   I  
    聞  k   I
    社  k   I
    特  k   B
    別  k   I 
    顧  k   B
    問  k   I
    

    And our template is very simple; it has only one line: U00:%x[0,0]

    1. So the number of features in this case is 9; they are: 毎, 日, 新, 聞, 社, 特, 別, 顧, 問.
    2. Now let's keep the training data unchanged and add a second rule to the template:

    U00:%x[0,0]

    U01:%x[-1,0]/%x[0,0]/%x[1,0]

    Now we have two rules in the template, so the total number of features becomes 18; they are:

    毎, 日, 新, 聞, 社, 特, 別, 顧, 問    
     ../毎/日
    毎/日/新
    日/新/聞
    新/聞/社   
    聞/社/特
    社/特/別
    特/別/顧    
    別/顧/問    
    顧/問/..
    

    (Both template rules are applied at every position, i.e. to each single word.)
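    The expansion above can be sketched in Python. This is a toy illustration, not CRF++ source code: I give the second rule the distinct prefix U01 (the usual convention), and use the ".." padding from the list above for out-of-range positions (CRF++ itself uses special boundary tokens).

    ```python
    # Toy sketch of template expansion (not CRF++ source code).
    sentence = list("毎日新聞社特別顧問")  # the 9 words of the training data

    def expand(rule, i):
        """Expand one template rule at position i; '..' pads the boundaries."""
        def tok(offset):
            j = i + offset
            return sentence[j] if 0 <= j < len(sentence) else ".."
        if rule == "U00:%x[0,0]":
            return "U00:" + tok(0)
        if rule == "U01:%x[-1,0]/%x[0,0]/%x[1,0]":
            return "U01:" + "/".join(tok(o) for o in (-1, 0, 1))

    rules = ("U00:%x[0,0]", "U01:%x[-1,0]/%x[0,0]/%x[1,0]")
    features = {expand(r, i) for r in rules for i in range(len(sentence))}
    print(len(features))  # 9 unigrams + 9 trigrams = 18
    ```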

    3. Now let's add a duplicated word to the training data, as follows:
    毎  k   B
    毎  k   B
    日  k   I
    新  k   I  
    聞  k   I
    社  k   I
    特  k   B
    別  k   I 
    顧  k   B
    問  k   I
    

    The word "毎" now appears twice, but duplicate occurrences are merged: it still yields only a single feature, U00:毎, so duplicates do not increase the feature count.
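    In set terms (a sketch; CRF++ actually stores features in an internal dictionary keyed by the expanded string, which has the same dedup effect):

    ```python
    # Duplicated rows expand to the same feature string, so they are
    # merged; a Python set shows the same dedup behaviour.
    rows = ["毎", "毎", "日", "新", "聞", "社", "特", "別", "顧", "問"]
    unigram_features = {"U00:" + w for w in rows}
    print(len(unigram_features))  # 9, not 10
    ```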

    ==== Advanced ====

    Now let's see how to understand the content in "model.txt".

    A blank line is used to delimit the different blocks:

    1. First block:

        version: 100
        cost-factor: 1
        maxid: 670
        xsize: 1
    

    The maxid depends on the number of features and the number of tags.

    Using the first training data as an example (9 different words, and two tags => B and I):

    the ids start from 0 and advance in steps of 2: 0, 0+2=2, 2+2=4, ..., 16. The highest id is 16, and since that last feature also takes two weights, maxid (the total number of weights) is 18.

    Why is the step 2 here?

    Because we have two types of tag; each word's feature actually corresponds to two (feature, tag) pairs, like:

    0 毎 ==> B
    1 毎 ==> I
    
    2 日 ==> B
    3 日 ==> I 
    ...
    14 問 ==> B
    15 問 ==> I
    
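    A sketch of this id assignment. On my reading of the source, maxid counts the total number of weight slots, not the last id: each feature reserves one slot per tag. This also matches the question's model, which has maxid: 40 and exactly 40 floats (18 unigram features × 2 tags, plus 2 × 2 = 4 slots for the bigram B template).

    ```python
    # Sketch: assigning feature ids in steps of len(tags), as described above.
    words = list("毎日新聞社特別顧問")
    tags = ["B", "I"]

    feature_id = {}
    next_id = 0
    for w in words:
        feature_id["U00:" + w] = next_id
        next_id += len(tags)       # one weight slot per tag

    maxid = next_id                # total number of weight slots
    print(feature_id["U00:毎"], feature_id["U00:問"], maxid)  # 0 16 18
    ```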

    2. Second block:

    lists all the tags in the training data:

    B
    I

    3. Third block:

    lists all the templates used (the final B is CRF++'s bigram template, which generates features over adjacent tag pairs):

    U00:%x[0,0]

    B

    4. Fourth block:

    the feature id, followed by the template expanded with the corresponding word:

    0 U00:毎
    2 U00:日
    ...
    
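    Reading this block back is straightforward; a sketch with a few illustrative entries (the "<id> <expanded template>" line format is as shown above):

    ```python
    # Sketch: parsing "<id> <feature>" lines from the fourth block.
    lines = ["0 U00:毎", "2 U00:日", "4 U00:新"]  # illustrative entries
    id_to_feature = {}
    for line in lines:
        fid, feat = line.split(" ", 1)
        id_to_feature[int(fid)] = feat
    print(id_to_feature[2])  # U00:日
    ```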

    5. Fifth block:

    for each feature, the weight of each tag:

    There are two weights corresponding to each word's feature, one per tag. These floats are the CRF's learned feature weights (the lambda parameters), not probabilities; a weight < 0 simply makes that tag less likely when the feature fires, rather than being ignored.
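    This answers the original question of why there are twice as many floats as unigram features: the float at position id + tag_index is the weight for emitting that tag when the feature fires. A sketch of the lookup, with ids and floats copied from the model.txt in the question (0 U00:よ, 2 U00:っ):

    ```python
    # Sketch: each unigram feature id owns one consecutive float per tag.
    tags = ["B", "I"]
    feature_id = {"U00:よ": 0, "U00:っ": 2}              # from the fourth block
    weights = [-0.3022268562246992, 0.3022268562246989,  # id 0+0: よ/B, id 0+1: よ/I
               -0.3629407244093161, 0.3629407244093156]  # id 2+0: っ/B, id 2+1: っ/I

    def weight(feature, tag):
        return weights[feature_id[feature] + tags.index(tag)]

    print(weight("U00:よ", "B"))  # -0.3022268562246992
    ```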

    - Shuai Hua