Tags: nlp, dataset, pos-tagger, pytorch-dataloader, torchtext

Unable to create a custom torchtext BucketIterator


I'm trying to create a POS tagger with an LSTM and I'm facing some difficulties preparing the data.

I've successfully followed a guide that used the following code to prepare the data iterators:

import torch
# torchtext >= 0.9 keeps the old Field/iterator API under torchtext.legacy
from torchtext.legacy import data, datasets

TEXT = data.Field(lower = True)
UD_TAGS = data.Field(unk_token = None)
PTB_TAGS = data.Field(unk_token = None)
fields = (("text", TEXT), ("udtags", UD_TAGS), ("ptbtags", PTB_TAGS))
train_data, valid_data, test_data = datasets.UDPOS.splits(fields)
MIN_FREQ = 2

TEXT.build_vocab(train_data, 
                 min_freq = MIN_FREQ,
                 vectors = "glove.6B.100d",
                 unk_init = torch.Tensor.normal_)
UD_TAGS.build_vocab(train_data)
PTB_TAGS.build_vocab(train_data)

BATCH_SIZE = 128

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE,
    device = device)

And then, when training the model, the code is:

for batch in iterator:
        text = batch.text
        tags = batch.udtags

Now for my problem - I have a dataset of lists: a list of sentences (where every sentence is a list of words) and a list of lists of tags corresponding to the sentences' words.
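
For concreteness, the two lists look roughly like this (the sentences here are made up, just to show the shape):

x_train = [['the', 'dog', 'barks'], ['she', 'reads', 'books']]
y_train = [['DET', 'NOUN', 'VERB'], ['PRON', 'VERB', 'NOUN']]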

I created a torch Dataset instance from x_train, y_train (each one is a list of lists). But it does not behave like the train_data that comes from datasets.UDPOS.splits(fields), so when trying to access the data with:

for batch in iterator:
        text = batch.text
        tags = batch.udtags

I'm getting an error, since my iterator does not have the fields inside. I tried accessing the data in a different manner but couldn't find a way around it. I also noticed that in the above example the data in the batch consists of vocabulary (embedding) indices, while the batches in my code still contain the words themselves.
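
As far as I understand, that difference comes from the Field: the torchtext iterator numericalizes every token through TEXT.vocab before building a batch, roughly like this, while a plain PyTorch DataLoader just returns whatever __getitem__ gives it:

# rough illustration of what the torchtext iterator does under the hood:
# each word is replaced by its index in the field's vocabulary
indices = [TEXT.vocab.stoi[w] for w in ['the', 'dog', 'barks']]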

All of the examples I found on the internet use datasets from torchtext.legacy.datasets, so they don't really help with my problem.

If it helps, here is my code (it's part of a bigger project, so it's a bit messy):

import torch
from torch.utils.data import Dataset
from torchtext.legacy.data import Field, BucketIterator


class ConvertDataset(Dataset):
    """
    Create an instance of a PyTorch Dataset from lists.
    """

    def __init__(self, x, y):
        # store the raw sentences (x) and their tag sequences (y)
        self.x = x
        self.y = y

    def __getitem__(self, index):
        return {'text': self.x[index], 'tags': self.y[index]}

    def __len__(self):
        return len(self.x)


# ## model variables
DROPOUT = 0.25
HIDDEN_DIM = 128

# ## load and prepare train data
train_set = load_annotated_corpus(params_d['data_fn'])
x_train, y_train = _prepare_data(train_set)
TEXT = Field(lower=True)
UD_TAGS = Field(unk_token=None)

# ## build words and tags vocabularies
TEXT.build_vocab(x_train,
                 min_freq=params_d['min_frequency'],
                 vectors='glove.6B.100d',
                 unk_init=torch.Tensor.normal_,
                 max_size=None if params_d['max_vocab_size'] == -1
                          else params_d['max_vocab_size'])

UD_TAGS.build_vocab(y_train)

# ## more model variables
INPUT_DIM = len(TEXT.vocab)
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

# ## initiate a model
lstm_model = BiLSTM.LSTM(input_dim=INPUT_DIM,
                         embedding_dim=params_d['embedding_dimension'],
                         hidden_dim=HIDDEN_DIM,
                         output_dim=params_d['output_dimension'],
                         n_layers=params_d['num_of_layers'],
                         dropout=DROPOUT,
                         pad_idx=PAD_IDX)

lstm_model.apply(_init_weights)

pretrained_embeddings = TEXT.vocab.vectors
lstm_model.embedding.weight.data.copy_(pretrained_embeddings)
# set pad tag embedding to 0
lstm_model.embedding.weight.data[PAD_IDX] = torch.zeros(params_d['embedding_dimension'])

BATCH_SIZE = 128

My data (lists of lists):

x_train, y_train = _prepare_data(train_data)

Data preparation:

train_torch_dataset = ConvertDataset(x_train, y_train)

# ## create data iterators
train_iterator = BucketIterator(
        train_torch_dataset,
        batch_size=BATCH_SIZE,
        device=device,
        # Function to use for sorting examples.
        sort_key=lambda x: len(x['text']),
        # Repeat the iterator for multiple epochs.
        repeat=True,
        # Sort all examples in data using `sort_key`.
        sort=False,
        # Shuffle data on each epoch run.
        shuffle=True,
        # Use `sort_key` to sort examples in each batch.
        sort_within_batch=True
    )
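
I think this is where it breaks: BucketIterator builds batch.text / batch.udtags from the dataset's fields, so it apparently needs a torchtext data.Dataset made of Example objects rather than a plain torch Dataset. A quick check of the mismatch:

from torchtext.legacy import data

print(isinstance(train_torch_dataset, data.Dataset))  # False - it's only a torch.utils.data.Dataset
print(hasattr(train_torch_dataset, 'fields'))         # False - no fields, so no batch.text / batch.udtags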

Solution

  • It took me a while, but I found a solution. To create a torchtext dataset from input data given as lists, use SequenceTaggingDataset (from torchtext.legacy.datasets), but you need to make a small change to the original source code's __init__ function, like this:

        def __init__(self, columns, fields, encoding="utf-8", separator="\t", **kwargs):
            examples = []
            # for 2 fields data sets (text, tags)
            for words, labels in zip(columns[0], columns[-1]):
                examples.append(data.Example.fromlist([words, labels], fields))
            super(SequenceTaggingDataset, self).__init__(examples, fields,
                                                         **kwargs)
    

    Then, assuming you have data with two fields (in my example, text and POS tags), you can define the dataset like this:

    from torchtext.legacy import data
    
    TEXT = data.Field()
    UD_TAGS = data.LabelField()
    # define torchtext fields
    fields = (("text", TEXT), ("udtags", UD_TAGS))
    # push the data into a torchtext type of dataset (** modified SequenceTaggingDataset **)
    train_torchtext_dataset = SequenceTaggingDataset([x_train, y_train], fields=fields)
    

    Note that x_train, y_train are nested lists.
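
    For completeness, here is a sketch of how this plugs back into the vocab/iterator code from the question. Instead of editing the installed package, the same __init__ change can be written as a small subclass (the name ListSequenceTaggingDataset is just illustrative); UD_TAGS is defined with data.Field(unk_token=None) as in the tutorial code at the top, since the tags form a sequence per sentence:

    import torch
    from torchtext.legacy import data
    from torchtext.legacy.datasets import SequenceTaggingDataset

    class ListSequenceTaggingDataset(SequenceTaggingDataset):
        """Same __init__ change as above, applied as a subclass instead of editing the source."""

        def __init__(self, columns, fields, **kwargs):
            examples = [data.Example.fromlist([words, labels], fields)
                        for words, labels in zip(columns[0], columns[-1])]
            # skip SequenceTaggingDataset.__init__ (it expects a file path) and
            # call the generic torchtext Dataset constructor directly
            super(SequenceTaggingDataset, self).__init__(examples, fields, **kwargs)

    TEXT = data.Field(lower=True)
    UD_TAGS = data.Field(unk_token=None)
    fields = (("text", TEXT), ("udtags", UD_TAGS))

    train_torchtext_dataset = ListSequenceTaggingDataset([x_train, y_train], fields=fields)

    # the dataset now behaves like the one from datasets.UDPOS.splits(fields),
    # so the vocab / iterator code from the top of the question works unchanged
    TEXT.build_vocab(train_torchtext_dataset, min_freq=2,
                     vectors="glove.6B.100d", unk_init=torch.Tensor.normal_)
    UD_TAGS.build_vocab(train_torchtext_dataset)

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    train_iterator = data.BucketIterator(train_torchtext_dataset,
                                         batch_size=128,
                                         sort_key=lambda ex: len(ex.text),
                                         sort_within_batch=True,
                                         device=device)

    for batch in train_iterator:
        text = batch.text      # LongTensor of vocab indices, shape [sentence length, batch size]
        tags = batch.udtags
        break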