I have two text files for training a transformer model. However, instead of using one of PyTorch's own datasets, I'm using files I downloaded from the internet.
source = open('./train_de.de', encoding='utf-8').read().split('\n')
target = open('./train_en.en', encoding='utf-8').read().split('\n')
With the code above, I have some Danish sentences in a list named "source" and their English translations in another list named "target".
My question is: how can I make an iterable DataPipe with PyTorch so that when I write something like:
source, target = next(iter(train_iter))
this will give me a Danish sentence and its corresponding English translation as separate strings?
You can use the Dataset and DataLoader classes for that.
import torch

class YourDataset(torch.utils.data.Dataset):
    def __init__(self) -> None:
        # read both files once and keep the aligned sentence lists in memory
        self.source = open('./train_de.de', encoding='utf-8').read().split('\n')
        self.target = open('./train_en.en', encoding='utf-8').read().split('\n')

    def __getitem__(self, idx):
        # load one sample (a source/target sentence pair) by index
        source_sample = self.source[idx]
        target_sample = self.target[idx]
        # do some preprocessing, convert to tensor and what not
        return source_sample, target_sample

    def __len__(self):
        return len(self.source)
Now you can create a DataLoader from your custom dataset:
yourDataset = YourDataset()
dataloader = torch.utils.data.DataLoader(
    yourDataset,
    batch_size=8,
    num_workers=0,
    shuffle=True
)
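Note that indexing the dataset itself (without the DataLoader) already returns a single (source, target) pair of plain strings, which is what the question's next(iter(...)) call is after. A minimal sketch, with index 0 chosen only for illustration:
src_sentence, tgt_sentence = yourDataset[0]
print(src_sentence)  # a Danish sentence from train_de.de
print(tgt_sentence)  # its English translation from train_en.en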
Now you can iterate over the dataloader (using a loop or next(iter(...))), and each iteration returns as many samples as your batch size (in this case 8 samples per batch).
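Here is a minimal sketch of both access patterns, assuming the dataloader defined above. Because the samples are plain strings, the default collate function groups each batch into a list of 8 source strings and a list of 8 target strings rather than stacked tensors:
# one batch via next(iter(...)), as in the question
src_batch, tgt_batch = next(iter(dataloader))
print(len(src_batch), len(tgt_batch))  # 8 and 8

# or loop over the whole dataset, one batch at a time
for src_batch, tgt_batch in dataloader:
    pass  # tokenize / feed to your transformer here
If you want exactly one sentence pair per call, set batch_size=1 (which yields one-element lists) or index the dataset directly as shown above to get plain strings.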