python, nltk, pos-tagger, text-chunking

Training IOB Chunker using nltk.tag.brill_trainer (Transformation-Based Learning)


I'm trying to train a specific chunker (let's say a noun chunker, for simplicity) using NLTK's brill module. I'd like to use three features: word, POS tag, and IOB tag.

I want to incorporate them via nltk.tbl.feature, but there are only two kinds of feature objects, i.e. brill.Word and brill.Pos. Limited by this design, I could only combine the word and POS features into a single token like (word, pos), and thus used ((word, pos), iob) pairs as training features. For example,

from nltk.tbl import Template
from nltk.tag import brill, brill_trainer, untag
from nltk.corpus import treebank_chunk
from nltk.chunk.util import tree2conlltags, conlltags2tree

# Code from (Perkins, 2013)
def train_brill_tagger(initial_tagger, train_sents, **kwargs):
    templates = [
        brill.Template(brill.Word([0])),
        brill.Template(brill.Pos([-1])),
        brill.Template(brill.Word([-1])),
        brill.Template(brill.Word([0]),brill.Pos([-1])),]
    trainer = brill_trainer.BrillTaggerTrainer(initial_tagger, templates, trace=3,)
    return trainer.train(train_sents, **kwargs)

# generating ((word, pos), iob) pairs as training features.
def chunk_trees2train_chunks(chunk_sents):
    tag_sents = [tree2conlltags(sent) for sent in chunk_sents]
    return [[((w,t),c) for (w,t,c) in sent] for sent in tag_sents]

>>> from nltk.tag import DefaultTagger
>>> tagger = DefaultTagger('NN')
>>> train = treebank_chunk.chunked_sents()[:2]
>>> t = chunk_trees2train_chunks(train)
>>> bt = train_brill_tagger(tagger, t)
TBL train (fast) (seqs: 2; tokens: 31; tpls: 4; min score: 2; min acc: None)
Finding initial useful rules...
    Found 79 useful rules.

           B      |
   S   F   r   O  |        Score = Fixed - Broken
   c   i   o   t  |  R     Fixed = num tags changed incorrect -> correct
   o   x   k   h  |  u     Broken = num tags changed correct -> incorrect
   r   e   e   e  |  l     Other = num tags changed incorrect -> incorrect
   e   d   n   r  |  e
------------------+-------------------------------------------------------
  12  12   0  17  | NN->I-NP if Pos:NN@[-1]
   3   3   0   0  | I-NP->O if Word:(',', ',')@[0]
   2   2   0   0  | I-NP->B-NP if Word:('the', 'DT')@[0]
   2   2   0   0  | I-NP->O if Word:('.', '.')@[0]

As shown above, (word, pos) is treated as a single feature. This does not fully capture the three features (word, POS tag, IOB tag) separately.


Solution

  • The nltk3 brill trainer api (I wrote it) does handle training on sequences of tokens described with multidimensional features, which is exactly what your data is. However, the practical limits may be severe: the number of possible templates in multidimensional learning increases drastically, and the current nltk implementation of the brill trainer trades memory for speed, similar to Ramshaw and Marcus 1994, "Exploring the statistical derivation of transformation-rule sequences...". Memory consumption may be HUGE, and it is very easy to give the system more data and/or templates than it can handle. A useful strategy is to rank templates according to how often they produce good rules (see print_template_statistics() in the example below). Usually, you can discard the lowest-scoring fraction (say 50-90%) with little or no loss in performance and a major decrease in training time.
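
    For instance, a rough sketch of that pruning workflow (a pilot run on a small sample; the trainer, templates, initial_tagger and train names are taken from the full example further down, and the exact cut-off is something you tune by hand):

    pilot_tagger = trainer.train(train[:50])   # quick pilot run on a small subsample
    pilot_tagger.print_template_statistics()   # per-template usage statistics
    # ...then rebuild the trainer with only the templates that did well
    # (surviving_templates below is whatever subset you decide to keep):
    # trainer = BrillTaggerTrainer(initial_tagger, surviving_templates, trace=3)
    # tagger = trainer.train(train)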

    Another (or additional) possibility is to use the nltk implementation of Brill's original algorithm, which has very different memory-speed tradeoffs: it does no indexing and so uses much less memory. It employs some optimizations and is actually rather quick at finding the very best rules, but is generally extremely slow towards the end of training, when there are many competing, low-scoring candidates. Sometimes you don't need those anyway. For some reason this implementation seems to have been omitted from newer nltks, but here is the source (I just tested it): http://www.nltk.org/_modules/nltk/tag/brill_trainer_orig.html.

    There are other algorithms with other tradeoffs; in particular, the fast memory-efficient indexing algorithms of Florian and Ngai 2000 (http://www.aclweb.org/anthology/N/N01/N01-1006.pdf) and the probabilistic rule sampling of Samuel 1998 (https://www.aaai.org/Papers/FLAIRS/1998/FLAIRS98-045.pdf) would be useful additions. Also, as you noticed, the documentation is incomplete and too focused on part-of-speech tagging, and it is not clear how to generalize from it. Fixing the docs is (also) on the todo list.

    However, interest in generalized (non-POS-tagging) tbl in nltk has been rather limited (the totally unsuited api of nltk2 went untouched for 10 years), so don't hold your breath. If you get impatient, you may wish to check out more dedicated alternatives, in particular mutbl and fntbl (google them, I only have reputation for two links).

    Anyway, here is a quick sketch for nltk:

    First, a hardcoded convention in nltk is that tagged sequences ('tags' meaning any label you would like to assign to your data, not necessarily part-of-speech) are represented as sequences of pairs, [(token1, tag1), (token2, tag2), ...]. The tags are strings; in many basic applications, so are the tokens. For instance, the tokens may be words and the strings their POS, as in

    [('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]
    

    (As an aside, this sequence-of-token-tag-pairs convention is pervasive in nltk and its documentation, but it should arguably be better expressed as named tuples rather than pairs, so that instead of saying

    [token for (token, _tag) in tagged_sequence]
    

    you could say for instance

    [x.token for x in tagged_sequence]
    

    The first case fails on non-pairs, but the second exploits duck typing so that tagged_sequence could be any sequence of user-defined instances, as long as they have an attribute "token".)
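
    A tiny illustration of that duck-typed variant (purely hypothetical; this Token type is not an existing nltk interface):

    from collections import namedtuple

    Token = namedtuple("Token", ["token", "tag"])
    tagged_sequence = [Token("And", "CC"), Token("now", "RB")]
    # namedtuples still unpack as (token, tag) pairs, and any other object
    # with a .token attribute would work just as well here
    print([x.token for x in tagged_sequence])   # ['And', 'now']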

    Now, you may well have a richer representation of what a token is at your disposal. An existing tagger interface (nltk.tag.api.FeaturesetTaggerI) expects each token to be a featureset rather than a string: a dictionary mapping feature names to feature values for each item in the sequence.

    A tagged sequence may then look like

    [({'word': 'Pierre', 'tag': 'NNP', 'iob': 'B-NP'}, 'NNP'),
     ({'word': 'Vinken', 'tag': 'NNP', 'iob': 'I-NP'}, 'NNP'),
     ({'word': ',',      'tag': ',',   'iob': 'O'   }, ','),
     ...
    ]
    

    There are other possibilities (though with less support in the rest of nltk). For instance, you could have a named tuple for each token, or a user-defined class which allows you to add any amount of dynamic calculation to attribute access (perhaps using @property to offer a consistent interface).
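
    As an illustration only (this class is hypothetical, not part of nltk), such a token type might compute derived features on demand:

    class ChunkToken:
        """Hypothetical token class; anything exposing the attributes your
        feature extractors ask for would work just as well."""
        def __init__(self, word, pos, iob):
            self.word, self.pos, self.iob = word, pos, iob

        @property
        def shape(self):
            # a derived feature computed lazily instead of being stored
            return "X" if self.word[:1].isupper() else "x"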

    The brill tagger doesn't need to know what view you currently provide on your tokens. However, it does require you to provide an initial tagger which can take sequences of tokens-in-your-representation to sequences of tags. You cannot use the existing taggers in nltk.tag.sequential directly, since they expect [(word, tag), ...]. But you may still be able to exploit them. The example below uses this strategy (in MyInitialTagger), and the token-as-featureset-dictionary view.

    from __future__ import division, print_function, unicode_literals
    
    import sys
    
    from nltk import tbl, untag
    from nltk.tag.brill_trainer import BrillTaggerTrainer
    # or: 
    # from nltk.tag.brill_trainer_orig import BrillTaggerTrainer
    # 100 templates and a tiny 500 sentences (11700
    # tokens) produce 420000 rules and use a
    # whopping 1.3GB of memory on my system;
    # brill_trainer_orig is much slower, but uses 0.43GB
    
    from nltk.corpus import treebank_chunk
    from nltk.chunk.util import tree2conlltags
    from nltk.tag import DefaultTagger
    
    
    def get_templates():
        wds10 = [[Word([0])],
                 [Word([-1])],
                 [Word([1])],
                 [Word([-1]), Word([0])],
                 [Word([0]), Word([1])],
                 [Word([-1]), Word([1])],
                 [Word([-2]), Word([-1])],
                 [Word([1]), Word([2])],
                 [Word([-1,-2,-3])],
                 [Word([1,2,3])]]
    
        pos10 = [[POS([0])],
                 [POS([-1])],
                 [POS([1])],
                 [POS([-1]), POS([0])],
                 [POS([0]), POS([1])],
                 [POS([-1]), POS([1])],
                 [POS([-2]), POS([-1])],
                 [POS([1]), POS([2])],
                 [POS([-1, -2, -3])],
                 [POS([1, 2, 3])]]
    
        iobs5 = [[IOB([0])],
                 [IOB([-1]), IOB([0])],
                 [IOB([0]), IOB([1])],
                 [IOB([-2]), IOB([-1])],
                 [IOB([1]), IOB([2])]]
    
    
        # the 5 * (10+10) = 100 3-feature templates 
        # of Ramshaw and Marcus
        templates = [tbl.Template(*wdspos+iob) 
            for wdspos in wds10+pos10 for iob in iobs5]
        # Footnote:
        # any template-generating functions in new code 
        # (as opposed to recreating templates from earlier
        # experiments like Ramshaw and Marcus) might 
        # also consider the mass generating Feature.expand()
        # and Template.expand(). See the docs, or for 
        # some examples the original pull request at
        # https://github.com/nltk/nltk/pull/549 
        # ("Feature- and Template-generating factory functions")
    
        return templates
    
    def build_multifeature_corpus():
        # The true value of the target fields is unknown in testing, 
        # and, of course, templates must not refer to it in training.
        # But we may wish to keep it for reference (here, truepos).
    
        def tuple2dict_featureset(sent, tagnames=("word", "truepos", "iob")):
            return (dict(zip(tagnames, t)) for t in sent)
    
        def tag_tokens(tokens):
            return [(t, t["truepos"]) for t in tokens]
        # conlltagged_sents :: [[(word, tag, iob)]]
        conlltagged_sents = (tree2conlltags(sent)
            for sent in treebank_chunk.chunked_sents())
        conlltagged_tokenses = (tuple2dict_featureset(sent)
            for sent in conlltagged_sents)
        conlltagged_sequences = (tag_tokens(sent) 
            for sent in conlltagged_tokenses)
        return conlltagged_sequences
    
    class Word(tbl.Feature):
        @staticmethod
        def extract_property(tokens, index):
            return tokens[index][0]["word"]
    
    class IOB(tbl.Feature):
        @staticmethod
        def extract_property(tokens, index):
            return tokens[index][0]["iob"]
    
    class POS(tbl.Feature):
        @staticmethod
        def extract_property(tokens, index):
            return tokens[index][1]
    
    
    class MyInitialTagger(DefaultTagger):
        def choose_tag(self, tokens, index, history):
            tokens_ = [t["word"] for t in tokens]
            return super().choose_tag(tokens_, index, history)
    
    
    def main(argv):
        templates = get_templates()
        trainon = 100
    
        corpus = list(build_multifeature_corpus())
        train, test = corpus[:trainon], corpus[trainon:]
    
        print(train[0], "\n")
    
        initial_tagger = MyInitialTagger('NN')
        print(initial_tagger.tag(untag(train[0])), "\n")
    
        trainer = BrillTaggerTrainer(initial_tagger, templates, trace=3)
        tagger = trainer.train(train)
    
        taggedtest = tagger.tag_sents([untag(t) for t in test])
        print(test[0])
        print(initial_tagger.tag(untag(test[0])))
        print(taggedtest[0])
        print()
    
        tagger.print_template_statistics()
    
    if __name__ == '__main__':
        sys.exit(main(sys.argv))
    

    The setup above builds a POS tagger. If you instead wish to target another attribute, say to build an IOB tagger, you need a couple of small changes, so that the target attribute (which you can think of as read-write) is accessed from the 'tag' position in your corpus [(token, tag), ...], while any other attributes (which you can think of as read-only) are accessed from the 'token' position. (A small sketch for folding the resulting IOB output back into chunk trees follows after step 3.) For instance:

    1) construct your corpus [(token,tag), (token,tag), ...] for IOB tagging

    def build_multifeature_corpus():
        ...
    
        def tuple2dict_featureset(sent, tagnames=("word", "pos", "trueiob")):
            return (dict(zip(tagnames, t)) for t in sent)
    
        def tag_tokens(tokens):
            return [(t, t["trueiob"]) for t in tokens]
        ...
    

    2) change the initial tagger accordingly

    ...
    initial_tagger = MyInitialTagger('O')
    ...
    

    3) modify the feature-extracting class definitions

    class POS(tbl.Feature):
        @staticmethod
        def extract_property(tokens, index):
            return tokens[index][0]["pos"]
    
    class IOB(tbl.Feature):
        @staticmethod
        def extract_property(tokens, index):
            return tokens[index][1]
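
    With those changes in place, the IOB tagger's output can be folded back into chunk trees for inspection or scoring. A rough sketch (reusing conlltags2tree, already imported in the question code, and assuming the "word"/"pos" featureset keys used above):

    from nltk.chunk.util import conlltags2tree

    def tagged2tree(tagged_sent):
        # tagged_sent is [(featureset, iob), ...] as produced by tagger.tag();
        # rebuild the (word, pos, iob) triples that conlltags2tree expects
        triples = [(tok["word"], tok["pos"], iob) for (tok, iob) in tagged_sent]
        return conlltags2tree(triples)

    # e.g. print(tagged2tree(taggedtest[0]))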