I have written a convolutional neural network from scratch before, but I've decided to use PyTorch for its speed. However, I could not find documentation on how to format the input for the Conv2d layer. In general, there seem to be a lot of wrappers and layers of abstraction that prevent me from seeing exactly what is happening and writing my code accordingly.
I have trained a model on the MNIST dataset, and imported the model in order to run it (as per the tutorial):
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 8, 3, stride=1, padding=1)
        self.pool = nn.MaxPool2d(2, stride=2)
        self.conv2 = nn.Conv2d(8, 8, 3, stride=1, padding=1)
        # two 2x2 poolings reduce 28x28 to 7x7, with 8 channels
        self.linear1 = nn.Linear(7 * 7 * 8, 128)
        self.linear2 = nn.Linear(128, 128)
        self.linear3 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1)  # keeps dimension 0 (the batch) intact
        x = F.relu(self.linear1(x))
        x = F.relu(self.linear2(x))
        x = self.linear3(x)
        return x

my_model = NeuralNetwork()
my_model.load_state_dict(torch.load("model_weights.pth", weights_only=True))
my_model.eval()
Now, I have a web application where:
Here is a sketch of what I want to do:
formatted_array = some_formatting_function(flattened_array_of_0_and_1)
x = torch.tensor(formatted_array)
pred = my_model(x)
guessed_digit = some_reading_function(pred)
print(guessed_digit)
# eventually return the guessed_digit
What should my some_formatting_function and some_reading_function be?
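The input of the model should have the shape expected by its first layer, which is a Conv2d in your case. According to PyTorch's documentation on Conv2d, the input of such a layer must be of shape (N, C_in, H_in, W_in) or (C_in, H_in, W_in), where N is the batch size, C_in is the number of channels (1 in your case), H_in is the image height (28) and W_in is the image width (28). Conv2d itself accepts either form, but your forward() calls torch.flatten(x, 1), which assumes that dimension 0 is the batch dimension. With an unbatched (1, 28, 28) input, the flatten would produce a tensor of shape (8, 49) and linear1 would fail with a shape mismatch. Since you evaluate inputs one by one, use the first form with N = 1.
This means you should pass a float tensor of shape (1, 1, 28, 28) to your model. To obtain it, you could do something like: formatted_array = torch.tensor(flattened_array_of_0_and_1, dtype=torch.float32).view(1, 1, 28, 28), optionally followed by a .transpose(2, 3) to swap the two spatial dimensions if they are inverted in the resulting tensor. The explicit dtype matters because a list of integer 0s and 1s would otherwise become an int64 tensor, which the convolution rejects.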
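As a minimal sketch of the formatting step, assuming the web application sends a flat, row-major list of 784 zeros and ones (the row-major order and the length check are assumptions, not something stated in the question), and assuming the model from the question, whose forward() expects a leading batch dimension:

```python
import torch

def some_formatting_function(flat_pixels):
    """Turn a flat sequence of 784 zeros and ones into a (1, 1, 28, 28) float tensor."""
    if len(flat_pixels) != 28 * 28:
        raise ValueError(f"expected 784 values, got {len(flat_pixels)}")
    # Conv2d weights are float32, so the input must be float32 as well
    x = torch.tensor(flat_pixels, dtype=torch.float32)
    # (784,) -> (N=1, C=1, H=28, W=28)
    return x.view(1, 1, 28, 28)

# Hypothetical drawing with a single lit pixel
flat = [0] * 784
flat[5] = 1  # row 0, column 5 when the data is row-major
x = some_formatting_function(flat)
print(x.shape, x.dtype)
```

The function already returns a tensor, so it can be passed to the model directly. If the training pipeline normalized or inverted the images, the same preprocessing would presumably have to be applied here too.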
You may also consider not flattening the data between the user's drawing and the neural network inference, but you would still need .view(...) or .unsqueeze(...) to add the leading dimensions (such as "channels") that the model expects.
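If the drawing is kept as a 2D grid, the missing leading dimensions can be added without reshaping the pixel data. The sketch below assumes a hypothetical 28x28 nested list indexed as grid[row][col]:

```python
import torch

# Hypothetical 28x28 drawing, indexed as grid[row][col]
grid = [[0] * 28 for _ in range(28)]
grid[3][7] = 1

x = torch.tensor(grid, dtype=torch.float32)  # shape (28, 28)
x = x.unsqueeze(0).unsqueeze(0)              # shape (1, 1, 28, 28): batch, channel, height, width
print(x.shape)
```

Here x.view(1, 1, 28, 28) would give the same result; unsqueeze just makes it explicit that only size-1 dimensions are being added.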
Classifier networks are typically trained against one-hot targets: for each training sample, the target is a vector of all zeros except for a one in the dimension corresponding to the sample's category. Training pushes the output toward that representation, so during inference we pick the dimension with the highest value in the output vector and use its index as the predicted label. Your model outputs one score per digit, so you can do this using argmax(): guessed_digit = pred.argmax().item()
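A possible some_reading_function could look like the sketch below. The scores in pred are made up for illustration; in the real application they would come from my_model(x). The softmax step is optional and is only there to show how a rough confidence value could be derived from the raw scores.

```python
import torch

def some_reading_function(pred):
    """Return the predicted digit as a Python int from scores of shape (1, 10) or (10,)."""
    # argmax over the last dimension picks the class with the highest score
    return pred.argmax(dim=-1).item()

# Made-up scores standing in for the model output of a single image
pred = torch.tensor([[0.1, -1.2, 0.3, 4.5, 0.0, -0.7, 1.1, 0.2, -2.0, 0.9]])

guessed_digit = some_reading_function(pred)

# Optional: convert the raw scores to probabilities to get a confidence value
probs = torch.softmax(pred, dim=-1)
confidence = probs[0, guessed_digit].item()
print(guessed_digit, confidence)
```

For inference, wrapping the model call in torch.no_grad() avoids building the autograd graph, which is not needed when only the prediction is read.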