I am trying to train a CRF sequence model using the Mallet library, but I am missing some important information. I found an example in the library itself at https://github.com/mimno/Mallet/blob/master/src/cc/mallet/examples/TrainCRF.java; however, the example does not state the format of the input training data, so I do not know how to recreate it.
Mallet does have a data import example at http://mallet.cs.umass.edu/import-devel.php, but that example seems to be for document classification rather than CRF sequence models, which is my use case.
I tried putting the input training data in the form used at http://mallet.cs.umass.edu/sequences.php i.e.
Bill CAPITALIZED noun
slept non-noun
here LOWERCASE STOPWORD non-noun
and test data in the form
CAPITAL Al
slept
here
however, based on the output logs, it does not seem to be the correct format. For example, one line in the log is INFO: testing label slept P � R 0 F1 � but slept is not a label; the labels should be noun or non-noun.
So if someone could tell me what format the training data should be in that would be great.
The code sample you link to has the line that refers to the training file commented out. Is it possible your code is trying to train on the test file? That would cause slept to look like a label since it's at the end of the line, and would explain the error.
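To illustrate the point about the last token, here is a minimal sketch in plain Java. It does not use Mallet at all; the class name LabelColumnDemo and its two helper methods are made up for this illustration, and it only assumes the convention described above, that the last whitespace-separated token on a line is read as the label and everything before it as features.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: this is not Mallet code.
public class LabelColumnDemo {

    // Returns the last whitespace-separated token of a line, i.e. the
    // token that would be read as the label under the assumed convention.
    static String labelOf(String line) {
        String[] tokens = line.trim().split("\\s+");
        return tokens[tokens.length - 1];
    }

    // Returns every token before the last one, i.e. the features.
    static List<String> featuresOf(String line) {
        String[] tokens = line.trim().split("\\s+");
        return Arrays.asList(Arrays.copyOf(tokens, tokens.length - 1));
    }

    public static void main(String[] args) {
        // The training and test lines quoted in the question.
        String[] trainLines = {
            "Bill CAPITALIZED noun",
            "slept non-noun",
            "here LOWERCASE STOPWORD non-noun"
        };
        String[] testLines = {"CAPITAL Al", "slept", "here"};

        System.out.println("Read as training data:");
        for (String line : trainLines) {
            System.out.println("  label=" + labelOf(line)
                    + " features=" + featuresOf(line));
        }

        // If the test file is passed where the training file is expected,
        // the last token of each line is taken as the label instead.
        System.out.println("Test file misread as training data:");
        for (String line : testLines) {
            System.out.println("  label=" + labelOf(line)
                    + " features=" + featuresOf(line));
        }
    }
}
```

Under this reading, the line slept from the test file has slept as its last (and only) token, so it would be picked up as a label, which matches the log line in the question.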
For the record, I tried the example using the test data you gave above (using the command line, not the code sample) and it worked, so the test/train format seems to be OK.