I have an input batch, a list of 8 images of shape (480, 640, 3), which I would like to convert to PyTorch tensors, normalize with mean and std, and pass to a model as a tensor of shape (8, 3, 480, 640). Presently I'm doing the following, which works.
import torch as T
from torchvision import transforms
batch_size=8
height = 480
width = 640
input_shape = (batch_size, 3, height, width)
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225])
])
input_batch = [...]  # list of 8 arrays, e.g. np.ones((480, 640, 3))
pre_items = [transform(item) for item in input_batch]
pre_items = T.stack(pre_items).to("cuda")
This is obviously not optimal because the preprocessing happens on CPU before being moved to CUDA.
What's the correct way to perform this on GPU on the batch as a whole?
My attempt at a solution was:
import torch as T
batch_size = 8
height = 480
width = 640
mean = T.ones((batch_size, height, width, 3)).to("cuda") * T.tensor([0.485, 0.456, 0.406]).to("cuda")
std = T.ones((batch_size, height, width, 3)).to("cuda") * T.tensor([0.229, 0.224, 0.225]).to("cuda")
input_batch = T.stack([T.tensor(item).to("cuda").float() for item in input_batch])
pre_items = (input_batch - mean)/std
pre_items = T.permute(pre_items, (0,3,1,2))
The output of this script does not match the expected tensor from the bottlenecked solution.
According to the OP's clarification, this is a fast way to perform the normalization on the GPU. I reshaped the mean and std variables so that they can be applied to input_batch without stacking the same values multiple times (this is called broadcasting).
import torch as T
import numpy as np
batch_size = 8
height = 480
width = 640
channels = 3
# CPU
input_batch = np.ones((batch_size, height, width, channels))
mean = np.array([0.485, 0.456, 0.406]).reshape((1, 1, 1, -1)) # match input_batch dimension
std = np.array([0.229, 0.224, 0.225]).reshape((1, 1, 1, -1)) # match input_batch dimension
# GPU
input_batch = T.from_numpy(input_batch).to("cuda")
mean = T.from_numpy(mean).to("cuda")
std = T.from_numpy(std).to("cuda")
pre_items = (input_batch - mean)/std
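Note that the snippet above leaves the batch in (N, H, W, C) layout and does not rescale the pixel values. If the goal is to reproduce what ToTensor followed by Normalize gives, and assuming the images are uint8 NumPy arrays (for which ToTensor divides by 255 and moves the channels first), a minimal sketch could look like the following. The helper name normalize_batch and the random test images are made up for illustration, and the code falls back to the CPU when CUDA is not available.

```python
import numpy as np
import torch as T


def normalize_batch(images, mean, std, device):
    """Hypothetical helper: list of HxWx3 uint8 arrays -> (N, 3, H, W) float32 tensor."""
    # One host-to-device copy for the whole batch, still uint8 at this point.
    batch = T.from_numpy(np.stack(images)).to(device)      # (N, H, W, 3)
    # Channels first, then scale to [0, 1] (assumes uint8 input).
    batch = batch.permute(0, 3, 1, 2).float().div(255.0)   # (N, 3, H, W)
    # Shape (1, 3, 1, 1) broadcasts over batch, height and width.
    mean = T.tensor(mean, dtype=T.float32, device=device).view(1, -1, 1, 1)
    std = T.tensor(std, dtype=T.float32, device=device).view(1, -1, 1, 1)
    return (batch - mean) / std


device = "cuda" if T.cuda.is_available() else "cpu"

# Illustrative input: 8 random uint8 images of shape (480, 640, 3).
rng = np.random.default_rng(0)
images = [rng.integers(0, 256, size=(480, 640, 3), dtype=np.uint8) for _ in range(8)]

pre_items = normalize_batch(
    images, [0.485, 0.456, 0.406], [0.229, 0.224, 0.225], device
)
print(pre_items.shape)
```

Moving the uint8 batch to the device before converting to float keeps the transfer four times smaller than moving float32 data. If the input arrays are already floats in [0, 1], the division by 255 should be dropped.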