Tags: nlp, crf, crf++

CRF++: does anybody understand what the float numbers in the model file mean?


When you build your model file with the -t option of crf_learn: crf_learn template train_data -t model

It will then generate two model files; one of them is model.txt.

Can anybody tell me what the float numbers mean?

See the following example:

version: 100
cost-factor: 1
maxid: 40
xsize: 1

B
I

U00:%x[0,0]
B

36 B 20 U00:、 26 U00:か 18 U00:が 22 U00:こ 8 U00:た 10 U00:ち 2 U00:っ 4 U00:て 34 U00:に 12 U00:の 0 U00:よ 28 U00:ら 24 U00:れ 32 U00:上 14 U00:世 16 U00:代 30 U00:地 6 U00:私

-0.3022268562246992 0.3022268562246989 -0.3629407244093161 0.3629407244093156 -0.3327259487028221 0.3327259487028215 0.3462799099537973 -0.3462799099537980 0.3452020097664334 -0.3452020097664336 -0.3218750203631590 0.3218750203631575 0.0376944272290242 -0.0376944272290280 0.3329631783491211 -0.3329631783491230 -0.3092967308014029 0.3092967308014015 0.3413769126433928 -0.3413769126433950 0.3786782765859961 -0.3786782765859980 0.5208645073272351 -0.5208645073272384 -0.3261580548802839 0.3261580548802814 -0.3615756495615902 0.3615756495615884 -0.3248593224319323 0.3248593224319312 0.3281895709166696 -0.3281895709166719 -0.3040331359589971 0.3040331359589951 0.2836939567332580 -0.2836939567332600 -0.1530917919770705 -0.1613508585854637 0.4245699543724943 -0.1101273038099901

My understanding is that each float number should correspond to one entry; for instance, the first float number, "-0.3022268562246992", should correspond to "36 B". But why is the number of floats double the number of entries? What do those float numbers mean?

Many thanks,

Shuai Hua


Solution

  • After reading parts of the CRF++ 0.58 source code, I understand how to interpret the crf_learn output. I will use some examples to explain it.

    ==== Basic ====

    Let's assume we have the following training data:

    毎  k   B
    日  k   I
    新  k   I  
    聞  k   I
    社  k   I
    特  k   B
    別  k   I 
    顧  k   B
    問  k   I
    

    And our template is very simple; it has only one line: U00:%x[0,0]

    1. So the number of features in this case is 9; they are: 毎, 日, 新, 聞, 社, 特, 別, 顧, 問.
    2. Now let's keep the training data unchanged and add a second rule to the template:

    U00:%x[0,0]

    U01:%x[-1,0]/%x[0,0]/%x[1,0]

    Now we have two rules in the template, so the total number of features becomes 18; they are:

    毎, 日, 新, 聞, 社, 特, 別, 顧, 問    
     ../毎/日
    毎/日/新
    日/新/聞
    新/聞/社   
    聞/社/特
    社/特/別
    特/別/顧    
    別/顧/問    
    顧/問/..
    

    (Both template rules are applied at every position, i.e. to each single word.)
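    The expansion above can be sketched in Python. This is a toy illustration, not CRF++ source code: I give the second rule the distinct prefix U01 (the usual convention), and use the ".." padding from the list above for out-of-range positions (CRF++ itself uses special boundary tokens).

    ```python
    # Toy sketch of template expansion (not CRF++ source code).
    sentence = list("毎日新聞社特別顧問")  # the 9 words of the training data

    def expand(rule, i):
        """Expand one template rule at position i; '..' pads the boundaries."""
        def tok(offset):
            j = i + offset
            return sentence[j] if 0 <= j < len(sentence) else ".."
        if rule == "U00:%x[0,0]":
            return "U00:" + tok(0)
        if rule == "U01:%x[-1,0]/%x[0,0]/%x[1,0]":
            return "U01:" + "/".join(tok(o) for o in (-1, 0, 1))

    rules = ("U00:%x[0,0]", "U01:%x[-1,0]/%x[0,0]/%x[1,0]")
    features = {expand(r, i) for r in rules for i in range(len(sentence))}
    print(len(features))  # 9 unigrams + 9 trigrams = 18
    ```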

    3. Now let's add a duplicated word to the training data, as follows:
    毎  k   B
    毎  k   B
    日  k   I
    新  k   I  
    聞  k   I
    社  k   I
    特  k   B
    別  k   I 
    顧  k   B
    問  k   I
    

    The word "毎" now appears twice, but duplicate occurrences are merged: it still yields only a single feature, U00:毎, so duplicates do not increase the feature count.
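    In set terms (a sketch; CRF++ actually stores features in an internal dictionary keyed by the expanded string, which has the same dedup effect):

    ```python
    # Duplicated rows expand to the same feature string, so they are
    # merged; a Python set shows the same dedup behaviour.
    rows = ["毎", "毎", "日", "新", "聞", "社", "特", "別", "顧", "問"]
    unigram_features = {"U00:" + w for w in rows}
    print(len(unigram_features))  # 9, not 10
    ```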

    ==== Advanced ====

    Now let's see how to understand the content in "model.txt".

    A blank line is used to delimit the different blocks:

    1. First block:

        version: 100
        cost-factor: 1
        maxid: 670
        xsize: 1
    

    The maxid depends on the number of features and the number of tags.

    Using the first training data as an example (9 different words, and two tags => B and I):

    the ids start from 0 and advance in steps of 2: 0, 0+2=2, 2+2=4, ..., 16. The highest id is 16, and since that last feature also takes two weights, maxid (the total number of weights) is 18.

    Why is the step 2 here?

    Because we have two types of tag; each word's feature actually corresponds to two (feature, tag) pairs, like:

    0 毎 ==> B
    1 毎 ==> I
    
    2 日 ==> B
    3 日 ==> I 
    ...
    14 問 ==> B
    15 問 ==> I
    
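    A sketch of this id assignment. On my reading of the source, maxid counts the total number of weight slots, not the last id: each feature reserves one slot per tag. This also matches the question's model, which has maxid: 40 and exactly 40 floats (18 unigram features × 2 tags, plus 2 × 2 = 4 slots for the bigram B template).

    ```python
    # Sketch: assigning feature ids in steps of len(tags), as described above.
    words = list("毎日新聞社特別顧問")
    tags = ["B", "I"]

    feature_id = {}
    next_id = 0
    for w in words:
        feature_id["U00:" + w] = next_id
        next_id += len(tags)       # one weight slot per tag

    maxid = next_id                # total number of weight slots
    print(feature_id["U00:毎"], feature_id["U00:問"], maxid)  # 0 16 18
    ```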

    2. Second block:

    lists all the tags in the training data:

    B
    I

    3. Third block:

    lists all the templates used (the final B is CRF++'s bigram template, which generates features over adjacent tag pairs):

    U00:%x[0,0]

    B

    4. Fourth block:

    the feature id, followed by the template expanded with the corresponding word:

    0 U00:毎
    2 U00:日
    ...
    
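    Reading this block back is straightforward; a sketch with a few illustrative entries (the "<id> <expanded template>" line format is as shown above):

    ```python
    # Sketch: parsing "<id> <feature>" lines from the fourth block.
    lines = ["0 U00:毎", "2 U00:日", "4 U00:新"]  # illustrative entries
    id_to_feature = {}
    for line in lines:
        fid, feat = line.split(" ", 1)
        id_to_feature[int(fid)] = feat
    print(id_to_feature[2])  # U00:日
    ```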

    5. Fifth block:

    for each feature, the weight of each tag:

    There are two weights corresponding to each word's feature, one per tag. These floats are the CRF's learned feature weights (the lambda parameters), not probabilities; a weight < 0 simply makes that tag less likely when the feature fires, rather than being ignored.
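    This answers the original question of why there are twice as many floats as unigram features: the float at position id + tag_index is the weight for emitting that tag when the feature fires. A sketch of the lookup, with ids and floats copied from the model.txt in the question (0 U00:よ, 2 U00:っ):

    ```python
    # Sketch: each unigram feature id owns one consecutive float per tag.
    tags = ["B", "I"]
    feature_id = {"U00:よ": 0, "U00:っ": 2}              # from the fourth block
    weights = [-0.3022268562246992, 0.3022268562246989,  # id 0+0: よ/B, id 0+1: よ/I
               -0.3629407244093161, 0.3629407244093156]  # id 2+0: っ/B, id 2+1: っ/I

    def weight(feature, tag):
        return weights[feature_id[feature] + tags.index(tag)]

    print(weight("U00:よ", "B"))  # -0.3022268562246992
    ```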

    - Shuai Hua