python pytorch torchtext

Torchtext TabularDataset() reads in Datafields incorrectly


Goal: I want to create a text classifier based on my custom dataset, similar to (and following) this (now deleted) tutorial from mlexplained.

What happened: I successfully formatted my data and created training, validation and test datasets, so that they match the "toxic tweet" dataset used in the tutorial (one column per tag, with 1/0 for true/not true). Most of the other parts also worked as intended, but when it came to iterating I got an error:

The `device` argument should be set by using `torch.device` or passing a string as an argument. This behavior will be deprecated soon and currently defaults to cpu.
The `device` argument should be set by using `torch.device` or passing a string as an argument. This behavior will be deprecated soon and currently defaults to cpu.
The `device` argument should be set by using `torch.device` or passing a string as an argument. This behavior will be deprecated soon and currently defaults to cpu.
The `device` argument should be set by using `torch.device` or passing a string as an argument. This behavior will be deprecated soon and currently defaults to cpu.
  0%|          | 0/25517 [00:01<?, ?it/s]
Traceback (most recent call last):
... (trace back messages)
AttributeError: 'Example' object has no attribute 'text'

The lines the Traceback indicated:

opt = optim.Adam(model.parameters(), lr=1e-2)
loss_func = nn.BCEWithLogitsLoss()

epochs = 2

for epoch in range(1, epochs + 1):
    running_loss = 0.0
    running_corrects = 0
    model.train() # turn on training mode
    for x, y in tqdm.tqdm(train_dl): # **THIS LINE CONTAINS THE ERROR**
        opt.zero_grad()

        preds = model(x)
        loss = loss_func(preds, y)  # BCEWithLogitsLoss expects (input, target)
        loss.backward()
        opt.step()

        running_loss += loss.item() * x.size(0)  # .item() replaces the old loss.data[0]

    epoch_loss = running_loss / len(trn)

    # calculate the validation loss for this epoch
    val_loss = 0.0
    model.eval() # turn on evaluation mode
    for x, y in valid_dl:
        preds = model(x)
        loss = loss_func(preds, y)  # BCEWithLogitsLoss expects (input, target)
        val_loss += loss.item() * x.size(0)

    val_loss /= len(vld)
    print('Epoch: {}, Training Loss: {:.4f}, Validation Loss: {:.4f}'.format(epoch, epoch_loss, val_loss))

Attempts I have already made to solve the problem, and what I think the reason was:

I know that this problem has occurred to others; there are even two questions about it on here, but both were caused by columns or rows being skipped in the dataset (I checked for empty lines/columns and found none). Another solution was that the parameters given to the model had to be in the same order (with none missing) as the columns in the .csv file.
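A check of that kind can be sketched with the standard `csv` module: count how many fields each row yields for a given delimiter and compare that with the length of the field list. The helper name `count_fields_per_row` and the sample rows are made up for illustration; the sample only stands in for a real train.csv.

```python
import csv
import io
from collections import Counter


def count_fields_per_row(lines, delimiter=","):
    """Return a Counter mapping 'fields in a row' -> 'number of such rows'."""
    reader = csv.reader(lines, delimiter=delimiter)
    return Counter(len(row) for row in reader)


# Hypothetical sample standing in for train.csv (header plus two rows).
sample = io.StringIO(
    "ID;text;tag_a;tag_b\n"
    "1;hello world;1;0\n"
    "2;well, this has a comma;0;1\n"
)

# Parse the same content twice, once per candidate delimiter.
comma_counts = count_fields_per_row(sample, delimiter=",")
sample.seek(0)
semicolon_counts = count_fields_per_row(sample, delimiter=";")

print("split on ',':", dict(comma_counts))
print("split on ';':", dict(semicolon_counts))
```

If the counts for the delimiter the loader uses do not all equal the number of entries in the field list, the file is not being split the way the field list assumes.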

However, the relevant code (the loading and creation of the tst, trn and vld sets):

import pickle

from torchtext.data import Field, TabularDataset

def createTestTrain():

    # Create a Tokenizer
    tokenize = lambda x: x.split()
    
    # Defining Tag and Text
    TEXT = Field(sequential=True, tokenize=tokenize, lower=True)
    LABEL = Field(sequential=False, use_vocab=False)
    
    # Our Datafield
    tv_datafields = [("ID", None),
                     ("text", TEXT)]
    
    # Loading our Additional columns we added earlier
    with open(PATH + 'columnList.pickle', 'rb') as handle:
        addColumns = pickle.load(handle)
    
        
    # Adding the extra columns, no way we are defining 1000 tags by hand            
    for column in addColumns:
        tv_datafields.append((column, LABEL))
        
    #tv_datafields.append(("split", None))
    # Loading Train/Test Split we created
    trn = TabularDataset(
                   path=PATH+'train.csv', 
                   format='csv',
                   skip_header=True, 
                   fields=tv_datafields)
    
    vld = TabularDataset(
            path=PATH+'train.csv',
            format='csv',
            skip_header=True, 
            fields=tv_datafields)
    
    # Creating Test Datafield
    tst_datafields = [("id", None), 
              ("text", TEXT)]
    # Using TabularDataset, as we want to Analyse Text on it
    tst = TabularDataset(
               path=PATH+"test.csv", # the file path
               format='csv',
               skip_header=True, 
               fields=tst_datafields)
    
    return trn, vld, tst

It uses the same list and order as my CSV does; tv_datafields is structured exactly like the file. Furthermore, as the entries of the dataset are essentially dicts of data points, I read out the keys of the dictionary, as the tutorial did, via:

trn[0].__dict__.keys()

What should have happened: in the tutorial, the example behaved like this:

trn[0]
<torchtext.data.example.Example at 0x10d3ed3c8>
trn[0].__dict__.keys()
dict_keys(['comment_text', 'toxic', 'severe_toxic', 'threat', 'obscene', 'insult', 'identity_hate'])

My result:

trn[0].__dict__.keys()
Out[19]: dict_keys([])

trn[1].__dict__.keys()
Out[20]: dict_keys([])

trn[2].__dict__.keys()
Out[21]: dict_keys([])

trn[3].__dict__.keys()
Out[22]: dict_keys(['text'])

While trn[0] contains nothing, the content is instead spread over entries 3 to 15, and the number of columns that should be there is far higher than what shows up.
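One pattern that would produce this kind of output is a reader that splits each line on a different delimiter than the file uses, combined with values being paired with fields by position. The sketch below illustrates that under stated assumptions: `build_example` is a hypothetical, simplified stand-in (not torchtext code), the field objects are placeholders, and the two rows are invented.

```python
import csv
import io


def build_example(row, fields):
    """Pair each (name, field) with a row value by position.

    Fields set to None are skipped; zip() stops at the shorter sequence,
    so a row with too few values silently yields fewer attributes.
    """
    attrs = {}
    for (name, field), value in zip(fields, row):
        if field is not None:
            attrs[name] = value
    return attrs


# Placeholder field objects; only "None or not None" matters here.
TEXT, LABEL = object(), object()
fields = [("ID", None), ("text", TEXT), ("tag_a", LABEL), ("tag_b", LABEL)]

# Hypothetical semicolon-separated rows, parsed with the default "," delimiter.
raw = io.StringIO(
    "1;hello world;1;0\n"
    "2;well, this has a comma;0;1\n"
)
rows = list(csv.reader(raw))

examples = [build_example(row, fields) for row in rows]
keys = [list(example.keys()) for example in examples]
print(keys)  # [[], ['text']]
```

A row without any comma collapses into a single value, which lands on the skipped ID column and leaves the example empty; a row whose text happens to contain a comma yields two values and therefore only a 'text' attribute.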

Now I am at a loss as to where I went wrong. The data fits and the function obviously works, but TabularDataset() seems to read in my columns the wrong way (if at all). Did I define

# Defining Tag and Text
TEXT = Field(sequential=True, tokenize=tokenize, lower=True)
LABEL = Field(sequential=False, use_vocab=False)

the wrong way? At least that is what my debugging seems to indicate.

With the meager documentation on Torchtext I have trouble finding that out, but when I look at the definitions of Data or Fields I can't see anything wrong with it.

Thank you for your help.


Solution

  • I found out where my problem was: apparently Torchtext only accepts data in quotes and only with "," as the separator. My data was not within quotes and had ";" as the separator.
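A minimal sketch of converting such a file with the standard `csv` module, assuming the source uses ";" and no quoting. The function name `convert_to_quoted_csv` and the sample content are hypothetical; in-memory buffers stand in for the real files.

```python
import csv
import io


def convert_to_quoted_csv(src, dst, src_delimiter=";"):
    """Rewrite delimited text as comma-separated CSV with every field quoted."""
    reader = csv.reader(src, delimiter=src_delimiter)
    writer = csv.writer(
        dst, delimiter=",", quoting=csv.QUOTE_ALL, lineterminator="\n"
    )
    for row in reader:
        writer.writerow(row)


# Hypothetical input: semicolon-separated, unquoted, with a comma in the text.
src = io.StringIO("ID;text;tag_a\n1;well, this has a comma;1\n")
dst = io.StringIO()
convert_to_quoted_csv(src, dst)

converted = dst.getvalue()
print(converted)

# Reading the result back with the default "," delimiter keeps the text intact.
round_trip = list(csv.reader(io.StringIO(converted)))
print(round_trip)
```

With real files, the same function can be called on handles opened with `open(path, newline="")`, which is what the `csv` module documentation recommends for both reading and writing.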