I have a custom dataset where the data has been stored as dictionaries in the form
file1.pt {tensor1: tensor2}, file2.pt {tensor1: tensor2}, and so on for about 50k files totaling roughly 20GB.
Tensor1 is the data and Tensor2 is its label. What would be the best way to retrieve the tensors, or to load them into a DataLoader as tensors rather than dict_keys or dict_values types, from all the files?
I currently have all the dictionaries loaded into a dataset. Using dict.keys() and dict.values() requires casting to a list and then further processing; I'm looking for something quicker.
Implement a custom dataset class that overrides the __getitem__ and __len__ methods. That way each file is loaded lazily, only when its sample is requested, rather than keeping all 50k dictionaries (20GB) in memory at once, which is far more memory-efficient.
import os
import torch
from torch.utils.data import Dataset, DataLoader

class CustomTensorDataset(Dataset):
    def __init__(self, file_paths):
        self.file_paths = file_paths

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        # Load a single file on demand; each file holds one {data_tensor: label_tensor} pair
        file_path = self.file_paths[idx]
        data_dict = torch.load(file_path)
        # The data tensor is the key and the label tensor is the value,
        # so unpack the single item instead of going through dict_keys/dict_values
        data, label = next(iter(data_dict.items()))
        return data, label

# Collect file paths
folder_path = "./your_data_folder"
file_paths = [os.path.join(folder_path, fname)
              for fname in os.listdir(folder_path) if fname.endswith('.pt')]

# Initialize custom dataset and DataLoader
custom_dataset = CustomTensorDataset(file_paths)
data_loader = DataLoader(custom_dataset, batch_size=32, shuffle=True)
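The DataLoader then yields batched tensors directly. A minimal usage sketch, assuming every file's data tensor has the same shape so the default collate function can stack them; num_workers is optional but helps hide the per-file torch.load latency across 50k small files:

# Optionally parallelize the per-file loading across worker processes
data_loader = DataLoader(custom_dataset, batch_size=32, shuffle=True, num_workers=4)

for data_batch, label_batch in data_loader:
    # data_batch and label_batch are stacked torch.Tensors, e.g. shape [32, ...]
    print(data_batch.shape, label_batch.shape)
    break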