deep-learning pytorch conv-neural-network resnet encoder-decoder

Encoder-decoder neural network architecture with different input and output sizes


I am trying to figure out a good architecture for a neural network that takes projections (2D images) from different angles and creates a volume consisting of 2D slices (CT-like).

So for example:

I have ground truth volumes.

I came up with the idea of using ResNet as the encoder, but I'm not sure how to implement the decoder or which model would be a good choice for this kind of problem. I considered a U-Net architecture, but its output dimensions would differ from what I need, so I abandoned the idea.

I am using PyTorch.


Solution

  • Specifying the whole network is beyond the scope of a single answer, but in general you want something like this:

    1. Use a ResNet or vision transformer as the encoder
    2. Use the encoder to map the input down to a latent tensor
    3. Reshape the latent tensor as needed
    4. Use ConvTranspose3d layers to upsample the latent tensor to the desired output size
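
The four steps above can be sketched roughly as follows. This is a minimal illustration, not a tuned model. It assumes the projections are stacked along the channel axis and that the output depth is fixed by the architecture. The names `ResBlock2d`, `ProjectionsToVolume`, `n_views`, `latent_ch` and `latent_depth` are hypothetical, and the small residual encoder only stands in for a full ResNet (a torchvision ResNet backbone could be swapped in).

```python
import torch
import torch.nn as nn


class ResBlock2d(nn.Module):
    """Basic 2D residual block; the strided conv halves H and W."""

    def __init__(self, in_ch, out_ch, stride=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # 1x1 conv so the shortcut has the same shape as the body output
        self.shortcut = nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.body(x) + self.shortcut(x))


def up3d(in_ch, out_ch):
    """ConvTranspose3d with kernel 4, stride 2, padding 1 doubles D, H and W."""
    return nn.ConvTranspose3d(in_ch, out_ch, kernel_size=4, stride=2, padding=1)


class ProjectionsToVolume(nn.Module):
    """Hypothetical sketch: (B, n_views, H, W) projections -> (B, 1, D, H, W) volume."""

    def __init__(self, n_views=4, latent_ch=64, latent_depth=4):
        super().__init__()
        self.latent_ch = latent_ch
        self.latent_depth = latent_depth
        # Steps 1-2: residual encoder maps the stacked views to a 2D latent
        # with latent_ch * latent_depth channels at 1/8 resolution.
        self.encoder = nn.Sequential(
            ResBlock2d(n_views, 64),
            ResBlock2d(64, 128),
            ResBlock2d(128, latent_ch * latent_depth),
        )
        # Step 4: three transposed 3D convs, each doubling D, H and W.
        self.decoder = nn.Sequential(
            up3d(latent_ch, 32), nn.ReLU(inplace=True),
            up3d(32, 16), nn.ReLU(inplace=True),
            up3d(16, 1),
        )

    def forward(self, x):
        z = self.encoder(x)                      # (B, C*D, H/8, W/8)
        b, _, h, w = z.shape
        # Step 3: split the channel axis into (channels, depth)
        z = z.reshape(b, self.latent_ch, self.latent_depth, h, w)
        return self.decoder(z)                   # (B, 1, 8*D, H, W)


if __name__ == "__main__":
    model = ProjectionsToVolume(n_views=4).eval()
    with torch.no_grad():
        volume = model(torch.randn(2, 4, 64, 64))
    print(volume.shape)  # torch.Size([2, 1, 32, 64, 64])
```

With these settings, four 64x64 projections become a 32x64x64 volume. Changing `latent_depth` or the number of upsampling stages changes the output depth, so the decoder has to be sized to match the ground-truth volumes.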

    You can also do a UNet-like setup with skip connections between encoder layers and decoder layers; you would just need a projection layer to map the encoder activations into a shape compatible with the decoder activations.
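
One possible form of such a projection layer is sketched below. It is an assumption, not the only option: a 1x1 `Conv2d` expands the channels of a 2D encoder feature map, the result is reshaped into a 3D tensor, and it is resized with trilinear interpolation if its size still differs from the decoder activation. The class name `SkipProjection` and its parameters are hypothetical, and adding the skip (rather than concatenating it) is just one choice.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SkipProjection(nn.Module):
    """Hypothetical skip connection from a 2D encoder map to a 3D decoder map."""

    def __init__(self, enc_ch, dec_ch, depth):
        super().__init__()
        self.dec_ch = dec_ch
        self.depth = depth
        # 1x1 conv produces dec_ch * depth channels, later split into (C, D)
        self.proj = nn.Conv2d(enc_ch, dec_ch * depth, kernel_size=1)

    def forward(self, enc_feat, dec_feat):
        # enc_feat: (B, enc_ch, H, W); dec_feat: (B, dec_ch, D', H', W')
        b, _, h, w = enc_feat.shape
        skip = self.proj(enc_feat).reshape(b, self.dec_ch, self.depth, h, w)
        target = tuple(dec_feat.shape[2:])
        if tuple(skip.shape[2:]) != target:
            # Resize depth/height/width to match the decoder activation
            skip = F.interpolate(skip, size=target, mode="trilinear",
                                 align_corners=False)
        return dec_feat + skip


if __name__ == "__main__":
    skip_layer = SkipProjection(enc_ch=128, dec_ch=32, depth=8)
    enc_feat = torch.randn(2, 128, 16, 16)     # 2D encoder activation
    dec_feat = torch.randn(2, 32, 8, 32, 32)   # 3D decoder activation
    print(skip_layer(enc_feat, dec_feat).shape)  # torch.Size([2, 32, 8, 32, 32])
```

Concatenating along the channel axis instead of adding would work too, at the cost of a wider following decoder layer.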