
How to set up training and feature template files for NER? - CRF++


For the problem of named entity recognition: after tokenizing the sentences, how do you set up the columns? It looks like one column in the documentation is a POS tag, but where do these come from? Am I supposed to tag the POS myself, or is there a tool to generate them?

What does the next column represent? A class like PERSON, LOCATION, etc.? Does it have to be in any particular format?

Is there any example of a completed training file and template for NER?


Solution

  • You can find example training and test data in the CRF++ repo here. The training data for noun phrase chunking looks like this:

    Confidence NN B
    in IN O
    the DT B
    pound NN I
    is VBZ O
    widely RB O
    expected VBN O
    ... etc ...
    

    The columns are arbitrary: they can hold anything, except that in CRF++ the last column is the label to be predicted. CRF++ requires that every line have the same number of columns (or be blank, to separate sentences); not all CRF packages require that. You will have to provide the data values yourself; they are the data the classifier learns from.
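Because a single line with the wrong number of columns is enough to trip up training, it can help to check the file first. The following is a minimal sketch, not part of CRF++; the function name `check_columns` and the sample text are made up for illustration, and it assumes columns are separated by whitespace.

```python
def check_columns(text):
    """Return (line_number, column_count) for every non-blank line whose
    column count differs from the first non-blank line.

    Blank lines are sentence separators and are skipped.
    """
    expected = None
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        cols = line.split()
        if not cols:
            continue  # sentence boundary
        if expected is None:
            expected = len(cols)  # first data line sets the expected width
        elif len(cols) != expected:
            problems.append((lineno, len(cols)))
    return problems


# Hypothetical sample: line 4 is missing its label column.
sample = "Confidence NN B\nin IN O\n\nthe DT\npound NN I\n"
print(check_columns(sample))  # [(4, 2)]
```

An empty list means every data line has the same number of columns.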

    While anything can go in the various columns, one convention you should know is IOB format. To deal with potentially multi-token entities, you mark each token as Inside, Outside, or Beginning. An example may help. Pretend we are training a classifier to detect names; for compactness I'll write this on one line:

    John/B Smith/I ate/O an/O apple/O ./O
    

    In columnar format it would look like this:

    John B
    Smith I
    ate O
    an O
    apple O
    . O
    

    With these tags, B (beginning) means the word is the first in an entity, I (inside) means the word continues an entity (it comes after a B or another I tag), and O (outside) means the word is not part of any entity. If you have more than one type of entity, it's typical to use labels like B-PERSON or I-PLACE.
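If your annotations are stored as entity spans rather than per-token tags, the conversion to IOB is mechanical. Here is a small sketch; the helper name `spans_to_iob` and the span format (start index, exclusive end index, label) are assumptions for illustration, not anything CRF++ prescribes.

```python
def spans_to_iob(tokens, spans):
    """Convert entity spans to one IOB tag per token.

    tokens: list of str
    spans:  list of (start, end, label), token indices with end exclusive
    """
    tags = ["O"] * len(tokens)  # everything is outside by default
    for start, end, label in spans:
        tags[start] = "B-" + label          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # remaining tokens of the entity
    return tags


tokens = ["John", "Smith", "visited", "New", "York", "."]
spans = [(0, 2, "PERSON"), (3, 5, "PLACE")]
for token, tag in zip(tokens, spans_to_iob(tokens, spans)):
    print(token, tag)
# John B-PERSON
# Smith I-PERSON
# visited O
# New B-PLACE
# York I-PLACE
# . O
```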

    The reason for using IOB tags is so that the classifier can learn different transition probabilities for starting, continuing, and ending entities. So if you're learning company names, it'll learn that Inc./I-COMPANY usually transitions to an O label, because Inc. is usually the last part of a company name.

    Templates are a separate matter: CRF++ uses its own special format, but again, there are examples in the source distribution you can look at. Also see this question.
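To give a rough idea of the template format: each line names a feature, and the macro `%x[row,col]` refers to the value at a row offset relative to the current token and an absolute column index. Lines starting with `U` are unigram features, and a line containing only `B` adds bigram features over adjacent output labels. The sketch below generates a simple windowed template; the helper `make_template` and its parameters are hypothetical conveniences, and the examples shipped with CRF++ remain the better reference for what a good NER template looks like.

```python
def make_template(n_feature_columns, window=2):
    """Build CRF++ template lines: one unigram feature per (row offset,
    column) pair inside the window, followed by a single bigram line.

    n_feature_columns counts the input columns only (not the label column).
    """
    lines = []
    idx = 0
    for col in range(n_feature_columns):
        for row in range(-window, window + 1):
            lines.append(f"U{idx:02d}:%x[{row},{col}]")
            idx += 1
    lines.append("B")  # bigram features over previous/current output label
    return lines


# Two feature columns (word and POS tag), looking one token left and right.
print("\n".join(make_template(n_feature_columns=2, window=1)))
# U00:%x[-1,0]
# U01:%x[0,0]
# U02:%x[1,0]
# U03:%x[-1,1]
# U04:%x[0,1]
# U05:%x[1,1]
# B
```

Saved to a file, the result is passed to training as `crf_learn template_file train_file model_file`.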


    To answer the comment on my answer, you can generate POS tags using any POS tagger. You don't even have to provide POS tags at all, though they're usually helpful. The other labels can be added by hand or automatically; for example, you can use a list of known nouns as a starting point. Here's an example using spaCy for a simple name detector:

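    import spacy

    # Assumes the small English model is installed:
    #   python -m spacy download en_core_web_sm
    # (older spaCy releases used the shortcut name 'en')
    nlp = spacy.load('en_core_web_sm')
    names = {'John', 'Jane'}  # ... etc.
    doc = nlp("John ate an apple.")
    for word in doc:
        person = 'O'  # default: not a person
        if word.text in names:
            person = 'B-PERSON'
        # One token per line: word, POS tag, label
        print(word.text, word.pos_, person)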
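Once you have rows like the ones printed above for several sentences, they still need to be laid out as CRF++ expects: one token per line, with a blank line between sentences. A minimal sketch of that step follows; the function name `to_crfpp` and the sample rows are illustrative assumptions, and it assumes no token contains whitespace.

```python
def to_crfpp(sentences):
    """Format sentences as CRF++ training text.

    sentences: list of sentences, each a list of (token, pos, label) tuples.
    Returns one token per line, with a blank line between sentences.
    """
    blocks = []
    for sentence in sentences:
        rows = [" ".join(row) for row in sentence]
        blocks.append("\n".join(rows) + "\n")
    # Joining the blocks with a newline produces the blank separator line.
    return "\n".join(blocks)


# Hypothetical rows; the POS tags are written by hand here.
sentences = [
    [("John", "PROPN", "B-PERSON"), ("ate", "VERB", "O")],
    [("Jane", "PROPN", "B-PERSON"), ("slept", "VERB", "O")],
]
print(to_crfpp(sentences), end="")
```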