
NLTK - Convert a chunked tree into a list (IOB tagging)


I need to perform Named Entity Recognition / Classification, and generate output in IOB tagged format.

I'm using an NLTK chunker, as provided by the nltk-trainer library, but it produces a Tree, not a list of IOB tags.

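import nltk

def chunk_iob(list_of_words):
    # Load the pickled POS tagger and chunker
    nltk_tagger = nltk.data.load("taggers/conll2002_aubt.pickle")
    nltk_chunker = nltk.data.load("chunkers/conll2002_NaiveBayes.pickle")

    t = nltk_tagger.tag(list_of_words)  # list of (word, POS tag) tuples
    print(t)
    c = nltk_chunker.parse(t)  # chunked result, an nltk Tree
    print(c)
    return c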

This gives c as a Tree, like:

(S
  (LOC Barcelona/NC)
  (PER Juan/NC :/Fd)

...

But I am looking for something like:

Barcelona - LOC
Juan - PER
...

which is the IOB tagged list of the list_of_words parameter, in the same order as list_of_words.

How can I get that tagged list from the tree?


Solution

  • What you are looking for is tree2conlltags and its inverse, conlltags2tree. Here's how they work:

    from nltk import word_tokenize, pos_tag, ne_chunk
    from nltk import conlltags2tree, tree2conlltags
    
    
    # Tokenize, POS-tag, and chunk the sentence into a Tree
    tree = ne_chunk(pos_tag(word_tokenize("New York is my favorite city")))
    print(tree)
    # (S (GPE New/NNP York/NNP) is/VBZ my/PRP$ favorite/JJ city/NN)
    
    # Flatten the Tree into (word, POS tag, IOB tag) triples
    iob_tags = tree2conlltags(tree)
    print(iob_tags)
    # [('New', 'NNP', 'B-GPE'), ('York', 'NNP', 'I-GPE'), ('is', 'VBZ', 'O'), ('my', 'PRP$', 'O'), ('favorite', 'JJ', 'O'), ('city', 'NN', 'O')]
    
    # Rebuild the Tree from the triples
    tree = conlltags2tree(iob_tags)
    print(tree)
    # (S (GPE New/NNP York/NNP) is/VBZ my/PRP$ favorite/JJ city/NN)
    

    Note that the IOB tags use this format: B-{tag} for the beginning of a chunk, I-{tag} for inside a chunk, and O for outside any chunk.
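
To get the word/tag list the question asks for, one option is to drop the POS column from the triples that tree2conlltags returns. The following is a minimal sketch, not part of NLTK: the helper names to_word_tag_pairs and strip_iob_prefix are hypothetical, and the triples are hard-coded from the example output above so the snippet runs without loading any models.

```python
def to_word_tag_pairs(conll_triples):
    """Drop the POS column from (word, pos, iob) triples, keeping word order."""
    return [(word, iob) for word, _pos, iob in conll_triples]


def strip_iob_prefix(iob):
    """Turn 'B-GPE' or 'I-GPE' into 'GPE'; leave 'O' unchanged."""
    return iob.split("-", 1)[1] if "-" in iob else iob


# Triples copied from the tree2conlltags output in the answer above.
# In real use this would be: iob_tags = tree2conlltags(tree)
iob_tags = [
    ("New", "NNP", "B-GPE"),
    ("York", "NNP", "I-GPE"),
    ("is", "VBZ", "O"),
    ("my", "PRP$", "O"),
    ("favorite", "JJ", "O"),
    ("city", "NN", "O"),
]

pairs = to_word_tag_pairs(iob_tags)
for word, iob in pairs:
    print(word, "-", iob)
```

Because tree2conlltags walks the tree left to right, the pairs come out in the same order as the input words. If only the entity type is wanted (LOC, PER), strip_iob_prefix can be applied to each tag, at the cost of losing the boundary between adjacent chunks of the same type.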