Tags: python, deep-learning, pytorch, mini-batch

Training on minibatches of varying size


I'm trying to train a deep learning model in PyTorch on images that have been bucketed to particular dimensions. I'd like to train my model using mini-batches, but the mini-batch size does not neatly divide the number of examples in each bucket.

One solution I saw in a previous post was to pad the images with additional whitespace (either on the fly or all at once at the beginning of training), but I do not want to do this. Instead, I would like to allow the batch size to be flexible during training.

Specifically, if N is the number of images in a bucket and B is the batch size, then for that bucket I would like to get N // B batches if B divides N, and N // B + 1 batches otherwise. The last batch can have fewer than B examples.
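In other words, the number of batches per bucket is the ceiling of N / B. A quick sketch of that count (the function name here is just illustrative):

```python
import math

def num_batches(n_examples, batch_size):
    # ceil(N / B): one extra, smaller batch when B does not divide N
    return math.ceil(n_examples / batch_size)

print(num_batches(10, 3))  # 4 batches: three of size 3 and one of size 1
print(num_batches(9, 3))   # 3 batches: B divides N exactly
```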

As an example, suppose I have the indexes [0, 1, ..., 19] and I'd like to use a batch size of 3.

The indexes 0-9 correspond to images in bucket 0 (shape (C, W1, H1))
The indexes 10-19 correspond to images in bucket 1 (shape (C, W2, H2))

(The channel depth C is the same for all images.) Then an acceptable partitioning of the indexes would be

batches = [
    [0, 1, 2], 
    [3, 4, 5], 
    [6, 7, 8], 
    [9], 
    [10, 11, 12], 
    [13, 14, 15], 
    [16, 17, 18], 
    [19]
]

I would prefer to process the images indexed at 9 and 19 separately because they have different dimensions.

Looking through PyTorch's documentation, I found the BatchSampler class that generates lists of mini-batch indexes. I made a custom Sampler class that emulates the partitioning of indexes described above. If it helps, here's my implementation for this:

import random
from collections import defaultdict

from torch.utils.data import Sampler

class CustomSampler(Sampler):

    def __init__(self, dataset, batch_size):
        self.batch_size = batch_size
        self.buckets = self._get_buckets(dataset)
        self.num_examples = len(dataset)

    def __iter__(self):
        batch = []
        # Process buckets in random order
        dims = random.sample(list(self.buckets), len(self.buckets))
        for dim in dims:
            # Process images in buckets in random order
            bucket = self.buckets[dim]
            bucket = random.sample(bucket, len(bucket))
            for idx in bucket:
                batch.append(idx)
                if len(batch) == self.batch_size:
                    yield batch
                    batch = []
            # Yield the final, possibly smaller batch before moving to the next bucket
            if len(batch) > 0:
                yield batch
                batch = []

    def __len__(self):
        return self.num_examples

    def _get_buckets(self, dataset):
        buckets = defaultdict(list)
        for i in range(len(dataset)):
            img, _ = dataset[i]
            dims = img.shape
            buckets[dims].append(i)
        return buckets

However, when I use my custom Sampler class I generate the following error:

Traceback (most recent call last):
    File "sampler.py", line 143, in <module>
        for i, batch in enumerate(dataloader):
    File "/home/roflcakzorz/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 263, in __next__
        indices = next(self.sample_iter)  # may raise StopIteration
    File "/home/roflcakzorz/anaconda3/lib/python3.6/site-packages/torch/utils/data/sampler.py", line 139, in __iter__
        batch.append(int(idx))
TypeError: int() argument must be a string, a bytes-like object or a number, not 'list'

The DataLoader class seems to expect to be passed individual indexes, not lists of indexes.

Should I not be using a custom Sampler class for this task? I also considered making a custom collate_fn to pass to the DataLoader, but with that approach I don't believe I can control which indexes are allowed to be in the same mini-batch. Any guidance would be greatly appreciated.


Solution

  • Do you have a separate network for each kind of sample (a CNN's kernel size has to be fixed)? If so, just pass the CustomSampler above to the batch_sampler argument of the DataLoader class instead of the sampler argument. The sampler argument expects an iterable of single indexes, which is why int() is being called on each list your sampler yields; batch_sampler expects exactly the lists of indexes you are producing. That would fix the issue.
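Concretely, here is a minimal, self-contained sketch of this fix. The toy BucketDataset below (with two image shapes) stands in for the question's real dataset, and the sampler is a compact version of the CustomSampler above; all names in the toy dataset are illustrative:

```python
import random
from collections import defaultdict

import torch
from torch.utils.data import DataLoader, Dataset, Sampler


class BucketDataset(Dataset):
    """Toy dataset: indexes 0-9 are 3x8x8 images, indexes 10-19 are 3x16x16."""
    def __getitem__(self, i):
        shape = (3, 8, 8) if i < 10 else (3, 16, 16)
        return torch.zeros(shape), 0  # (image, label)

    def __len__(self):
        return 20


class CustomSampler(Sampler):
    def __init__(self, dataset, batch_size):
        self.batch_size = batch_size
        self.buckets = self._get_buckets(dataset)

    def __iter__(self):
        # Visit buckets in random order, then yield index lists per bucket
        for bucket in random.sample(list(self.buckets.values()), len(self.buckets)):
            bucket = random.sample(bucket, len(bucket))
            for start in range(0, len(bucket), self.batch_size):
                yield bucket[start:start + self.batch_size]

    def __len__(self):
        # A batch sampler's length is the number of batches, not examples
        return sum(-(-len(b) // self.batch_size) for b in self.buckets.values())

    def _get_buckets(self, dataset):
        buckets = defaultdict(list)
        for i in range(len(dataset)):
            img, _ = dataset[i]
            buckets[tuple(img.shape)].append(i)
        return buckets


dataset = BucketDataset()
# Key point: pass the sampler to batch_sampler=, not sampler=
loader = DataLoader(dataset, batch_sampler=CustomSampler(dataset, batch_size=3))
for images, labels in loader:
    print(images.shape)  # e.g. torch.Size([3, 3, 8, 8]) or torch.Size([1, 3, 16, 16])
```

Because every batch comes from a single bucket, all images in a batch share the same shape, so the default collate function can stack them without any padding.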