python nlp pytorch transformer-model torchtext

How to create an iterable DataPipe with PyTorch using txt files


I have two text files to train a transformer model. However, instead of using PyTorch's own datasets, I'm using something I downloaded from the internet.

source = open('./train_de.de', encoding='utf-8').read().split('\n')
target = open('./train_en.en', encoding='utf-8').read().split('\n')

With the code above, I have Danish sentences in a list named "source" and their English translations in another list named "target".

My question is, how can I make an iterable DataPipe with PyTorch such that when I write something like:

source, target = next(iter(train_iter))

it will give me a Danish sentence and its corresponding English translation as separate strings?


Solution

  • You can use the Dataset and DataLoader classes for that.

    import torch
    
    class YourDataset(torch.utils.data.Dataset):
        def __init__(self) -> None:
            self.source = open('./train_de.de', encoding='utf-8').read().split('\n')
            self.target = open('./train_en.en', encoding='utf-8').read().split('\n')
    
        def __getitem__(self, idx) -> tuple:
            # load one sample by index, e.g. like this:
            source_sample = self.source[idx]
            target_sample = self.target[idx]
            
            # do some preprocessing, convert to tensor and what not
            
            return source_sample, target_sample
    
        def __len__(self):
            return len(self.source)
    

    Now you can create a DataLoader from your custom dataset:

    yourDataset = YourDataset()
    dataloader = torch.utils.data.DataLoader(
            yourDataset,
            batch_size=8,
            num_workers=0,
            shuffle=True
    )
    

    Now you can iterate through the dataloader (using a loop or next(iter(...))), and each iteration returns as many samples as your batch size (in this case 8 stacked samples).
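
    For example, a minimal sketch of what one iteration might look like (the variable names are just illustrative; with string samples, PyTorch's default collate function should return a list of strings for each field):

    source_batch, target_batch = next(iter(dataloader))
    
    # with batch_size=8 and string samples, each of these should be a list of 8 strings
    print(source_batch[0], target_batch[0])  # first source/target pair of the batch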
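
    If you specifically want an iterable DataPipe rather than a map-style Dataset, a rough sketch using torchdata's IterableWrapper could look like the following (this assumes the separate torchdata package is installed; note that its DataPipe API has been deprecated in recent torchdata releases):

    from torchdata.datapipes.iter import IterableWrapper
    
    source = open('./train_de.de', encoding='utf-8').read().split('\n')
    target = open('./train_en.en', encoding='utf-8').read().split('\n')
    
    # wrap each list in a DataPipe and zip them so every item is a (source, target) pair
    train_iter = IterableWrapper(source).zip(IterableWrapper(target))
    
    source_sample, target_sample = next(iter(train_iter))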