I have a large dataset (> 62 GiB) that, after processing, is saved as two NumPy memmap arrays, one for the data and the other for the labels. Their shapes are (7390, 60, 224, 224, 3) and (7390,), and the dataset is NOT shuffled, so I need to shuffle it first.
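For context, the two arrays are memory-mapped along these lines (the file names and dtypes here are placeholders, not my actual ones):

import numpy as np

# Open the arrays read-only so nothing is pulled into RAM eagerly
numpy_array = np.memmap("data.dat", dtype=np.uint8, mode="r",
                        shape=(7390, 60, 224, 224, 3))
labels = np.memmap("labels.dat", dtype=np.int32, mode="r", shape=(7390,))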
Now, I use TensorFlow 2, and I have been using this code with my generator to manage the memmap arrays:
import tensorflow as tf

def my_generator():
    # Yield one sample at a time straight from the memmap
    for i in range(len(numpy_array)):
        yield numpy_array[i, :, :, :, :], np.array(labels[i]).reshape(1)

full_dataset = tf.data.Dataset.from_generator(
    generator=my_generator,
    output_types=(np.uint8, np.int32),
    output_shapes=((60, 224, 224, 3), (1,))
)
full_dataset = full_dataset.shuffle(SHUFFLE_BUFFER_SIZE, reshuffle_each_iteration=False)
train_dataset = full_dataset.take(train_size)
test_dataset = full_dataset.skip(train_size)
val_dataset = test_dataset.skip(test_size)
test_dataset = test_dataset.take(test_size)
That way I can train with shuffling and batching without loading the entire dataset into memory.
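Batching is then applied per split, e.g. (BATCH_SIZE is a placeholder):

train_dataset = train_dataset.batch(BATCH_SIZE)
# Overlap reading from the memmap with training on the GPU
train_dataset = train_dataset.prefetch(tf.data.experimental.AUTOTUNE)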
Now, with the current model and dataset, there isn't enough VRAM to hold more than 2 samples as tensors at a time, and training with a batch size of 2 doesn't work.
I thought of gradient accumulation, but I couldn't get it working in TF2. I found it easy with PyTorch, but I can't figure out how to handle the memmap arrays with shuffling and splitting the way I do in TensorFlow with generators.
So I need to know how to load the dataset in PyTorch with the same shuffling and splitting I have in TensorFlow.
Or, if someone has ready-made code for gradient accumulation on TF2, that would work too.
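For reference, the kind of accumulation loop I have in mind looks roughly like this (an untested sketch; model, optimizer, and loss_fn are placeholders):

ACCUM_STEPS = 8  # 8 micro-batches of 2 -> effective batch size 16

# Running sum of gradients, one slot per trainable variable
accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]
for step, (x, y) in enumerate(train_dataset):
    with tf.GradientTape() as tape:
        # Scale the loss so the accumulated gradient averages over micro-batches
        loss = loss_fn(y, model(x, training=True)) / ACCUM_STEPS
    grads = tape.gradient(loss, model.trainable_variables)
    accum_grads = [a + g for a, g in zip(accum_grads, grads)]
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.apply_gradients(zip(accum_grads, model.trainable_variables))
        accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]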
I will just address the shuffle question.
Instead of shuffling with tf.data.Dataset, do it at the generator level. This should work:
import itertools
import numpy as np

class Generator(object):
    def __init__(self, images, labels, batch_size):
        self.images = images
        self.labels = labels
        self.batch_size = batch_size
        self.idxs = np.arange(len(self.images))
        self.on_epoch_end()

    def on_epoch_end(self):
        # Shuffle the indices
        np.random.shuffle(self.idxs)

    def generator(self):
        # Walk the shuffled indices, yielding one (image, label) pair at a time
        for idx in self.idxs:
            yield self.images[idx], self.labels[idx]
        self.on_epoch_end()

    def batch_generator(self):
        it = self.generator()
        while True:
            # Take the next batch_size samples; stop cleanly at the end of the epoch
            vals = list(itertools.islice(it, self.batch_size))
            if not vals:
                return
            images, labels = zip(*vals)
            yield images, labels
Then you can use it by:

gen = Generator(...)
it = gen.batch_generator()
batch = next(it)  # Call this every time you want a new batch
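If you want to keep the rest of your tf.data pipeline, you can also wrap the batch generator, something along these lines (untested; shapes taken from your question, with None for a possibly smaller last batch):

gen = Generator(images, labels, batch_size=BATCH_SIZE)
dataset = tf.data.Dataset.from_generator(
    gen.batch_generator,
    output_types=(tf.uint8, tf.int32),
    output_shapes=((None, 60, 224, 224, 3), (None,))
)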
I'm sure PyTorch has built-in methods for this kind of stuff, though.
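In case it helps, a minimal PyTorch sketch (the 80/10/10 split and names are just examples) is a map-style Dataset over the memmaps plus a DataLoader, which gives you shuffling and batching out of the box:

import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset, random_split

class MemmapDataset(Dataset):
    def __init__(self, images, labels):
        self.images = images  # np.memmap of shape (7390, 60, 224, 224, 3)
        self.labels = labels  # np.memmap of shape (7390,)

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        # Copy a single sample out of the memmap into a regular tensor
        x = torch.from_numpy(np.array(self.images[idx]))
        y = torch.tensor(int(self.labels[idx]))
        return x, y

full = MemmapDataset(images, labels)
train_size = int(0.8 * len(full))
test_size = int(0.1 * len(full))
val_size = len(full) - train_size - test_size
train_ds, test_ds, val_ds = random_split(full, [train_size, test_size, val_size])

# shuffle=True reshuffles the indices every epoch; only one batch is in memory at a time
train_loader = DataLoader(train_ds, batch_size=2, shuffle=True)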