python nlp torchtext

"Token second\team not found and default index is not set" error in torchtext function


This is my code. The function works well for the training set, but for the test set it raises this error: RuntimeError: Token second\team not found and default index is not set

train_data, train_labels = text_classification._create_data_from_iterator(
    vocab, text_classification._csv_iterator(train_csv_path, ngrams, yield_cls=True), False)
test_data, test_labels = text_classification._create_data_from_iterator(
    vocab, text_classification._csv_iterator(test_csv_path, ngrams, yield_cls=True), False)

Does anyone know what is wrong?


Solution

  • The vocabulary acts as a lookup table for your data, mapping each token (str) to an index (int). When a given string (in this case "second\team") doesn't appear in the vocabulary, there are two ways to handle it:

    1. Throw an error because you don't know how to handle it. Think of the KeyError you get when calling {}[1] in Python.
    2. Map every missing token to a default "unknown" token. Think of a default value, as in {}.get(1, "I don't know!") in Python.

    Your code is currently doing #1. The training set works, most likely because the vocabulary was built from it, so every training token is present; the test set contains tokens the vocabulary has never seen. You seem to want #2, which you can get with vocab.set_default_index. When you build your vocab, add the specials=["<unk>"] kwarg and then call vocab.set_default_index(vocab['<unk>']).
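
    To make the two strategies concrete, here is a minimal sketch in plain Python. TinyVocab is a hypothetical stand-in written for this illustration, not torchtext's actual class; it only imitates the lookup behaviour described above (an error when no default index is set, the "<unk>" index once it is). The sample tokens are made up.

```python
class TinyVocab:
    """Hypothetical stand-in for a vocabulary (NOT torchtext's real class)."""

    def __init__(self, tokens, specials=()):
        # Specials are inserted first so "<unk>" gets a stable, low index.
        self.stoi = {}
        for tok in list(specials) + list(tokens):
            self.stoi.setdefault(tok, len(self.stoi))
        self.default_index = None  # no fallback until one is set explicitly

    def set_default_index(self, index):
        self.default_index = index

    def __getitem__(self, token):
        if token in self.stoi:
            return self.stoi[token]
        if self.default_index is None:
            # Strategy 1: refuse to guess.
            raise RuntimeError(
                f"Token {token} not found and default index is not set"
            )
        # Strategy 2: fall back to the "unknown" index.
        return self.default_index


# Made-up training tokens; the vocabulary only knows these plus "<unk>".
train_tokens = ["first", "team", "second"]
vocab = TinyVocab(train_tokens, specials=["<unk>"])

# Strategy 1: an unseen test token raises, as in the question.
try:
    vocab["second\\team"]
    strict_lookup_failed = False
except RuntimeError:
    strict_lookup_failed = True

# Strategy 2: after setting a default index, unseen tokens map to "<unk>".
vocab.set_default_index(vocab["<unk>"])
unk_id = vocab["<unk>"]
oov_id = vocab["second\\team"]
```

    With a real torchtext vocab the same two steps apply: include "<unk>" in specials when building it, then call set_default_index before converting the test set.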