deep-learning pytorch recurrent-neural-network part-of-speech

Loading manually annotated data to train RNN POS tagger


I've got a large manually annotated dataset. I would like to train a part-of-speech tagger using an RNN. The data looks similar to the text below:

Lorem <NP> Ipsum <NP> dummy <N> text <ADV> printing <VREL> typesetting <NUMCR> Ipsum <VREL> Ipsum <NP> Ipsum <NP> Lorem <N> Ipsum <NP> Ipsum <N> Ipsum <NP> Lorem <ADJ> Lorem <NP> Ipsum <N> Lorem <VN> Lorem <ADJ> Lorem <N> Lorem <N> ፣ <PUNC> Lorem <ADJ> Lorem <ADJ> Ipsum <NC> Ipsum <NC> Ipsum <NP> 

Please guide me on how to load this data to train an RNN based tagger.


Solution

  • To read it, I suggest you convert it to a TSV file with examples separated by blank lines (a.k.a. the CoNLL format), as follows:

    src_fp, tgt_fp = "source/file/path.txt", "target/file/path.tsv"
    with open(src_fp) as src_f, open(tgt_fp, 'w') as tgt_f:
        for line in src_f:
            tokens = line.split()  # alternating word, tag, word, tag, ...
            words, tags = tokens[0::2], tokens[1::2]
            for w, t in zip(words, tags):
                tgt_f.write(w + '\t' + t + '\n')
            tgt_f.write('\n')  # blank line after each example, not after each token
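
    For the sample line above, the generated file would begin like this (token and tag separated by a tab, with a blank line after the full example):

    Lorem	<NP>
    Ipsum	<NP>
    dummy	<N>
    text	<ADV>
    ...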
    

    Then, you'll be able to read it using SequenceTaggingDataset from torchtext.datasets as follows:

    import torchtext
    from torchtext import data

    text_field, label_field = data.Field(), data.Field()
    # path should point to the tsv file created above
    pos_dataset = torchtext.datasets.SequenceTaggingDataset(
            path='data/pos/pos_wsj_train.tsv',
            fields=[('text', text_field),
                    ('labels', label_field)])
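
    If you want to sanity-check the loading, each example exposes the two fields declared above (this assumes the legacy torchtext API used here, where a dataset is indexable):

    print(pos_dataset[0].text)    # list of words of the first example
    print(pos_dataset[0].labels)  # list of tags of the first example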
    

    The last steps are to build your vocabularies and to get iterators over your data:

    text_field.build_vocab(pos_dataset)
    label_field.build_vocab(pos_dataset)
    train_iter = data.BucketIterator(
        pos_dataset, batch_size=MY_BATCH_SIZE, device=MY_DEVICE)
    # using the iterator
    for ex in train_iter:
        train(ex.text, ex.labels)
    

    I suggest you take a moment to read the documentation for the functions used above, so that you can adapt them to your needs (maximum vocabulary size, whether to shuffle your examples, sequence lengths, etc.). For building an RNN for classification, the official PyTorch tutorial is very easy to learn from. So I suggest you start there and adapt the network inputs and outputs from sequence classification (1 label for each text span) to sequence tagging (1 label for each token), along the lines of the sketch below.
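
    As a minimal sketch (not the tutorial's code, and with hyperparameters picked arbitrarily), a tagger only needs an embedding layer, a recurrent layer, and a per-token linear projection onto the tag vocabulary; the vocabulary sizes would come from the fields built above:

    import torch.nn as nn

    class RNNTagger(nn.Module):
        """Minimal sketch of an RNN tagger: one tag score vector per token."""
        def __init__(self, vocab_size, tagset_size, emb_dim=100, hidden_dim=128):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, emb_dim)
            self.rnn = nn.LSTM(emb_dim, hidden_dim)
            self.out = nn.Linear(hidden_dim, tagset_size)

        def forward(self, tokens):
            # tokens: (seq_len, batch) indices, as produced by the iterator above
            embedded = self.embedding(tokens)  # (seq_len, batch, emb_dim)
            outputs, _ = self.rnn(embedded)    # (seq_len, batch, hidden_dim)
            return self.out(outputs)           # (seq_len, batch, tagset_size)

    # hypothetical sizes taken from the vocabularies built above
    model = RNNTagger(len(text_field.vocab), len(label_field.vocab))

    A hypothetical train(text, labels) function would then flatten the first two dimensions and apply nn.CrossEntropyLoss between the (seq_len * batch, tagset_size) scores and the (seq_len * batch,) gold tag indices.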