I have been studying the UNet-inspired architecture ENet and I think I follow the basic concepts. A cornerstone of ENet's efficiency is dilated convolution (among other things). I understand how it preserves spatial resolution and how it is computed, but I can't understand why it is computationally and memory-wise less expensive than e.g. max-pooling.
With a dilated convolution layer you can simply skip a computational layer: the receptive field grows without stacking extra convolutions or adding a pooling step.
For example, a 3×3 convolution with dilation rate 2 covers a 5×5 receptive field while still using only 9 weights per filter,
which is comparable to a regular 5×5 convolution (25 weights per filter) in terms of receptive field, but cheaper in both computation and memory.
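A minimal PyTorch sketch of that comparison (not from the original answer; the channel counts and input shape are arbitrary assumptions) shows that both layers preserve the spatial resolution and have the same 5×5 receptive field, while the dilated layer carries far fewer parameters:

```python
import torch
import torch.nn as nn

# A regular 5x5 convolution vs. a 3x3 convolution with dilation=2.
# Both have a 5x5 receptive field; the dilated one uses only 9 weights per filter.
conv5x5 = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=5, padding=2)
dilated3x3 = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3,
                       padding=2, dilation=2)

x = torch.randn(1, 64, 32, 32)  # dummy feature map (assumed shape)

# Both preserve the 32x32 spatial resolution.
print(conv5x5(x).shape)     # torch.Size([1, 64, 32, 32])
print(dilated3x3(x).shape)  # torch.Size([1, 64, 32, 32])

# Parameter counts: 64*64*5*5 + 64 vs. 64*64*3*3 + 64
print(sum(p.numel() for p in conv5x5.parameters()))     # 102464
print(sum(p.numel() for p in dilated3x3.parameters()))  # 36928
```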
For further reference, see the excellent paper by Vincent Dumoulin and Francesco Visin: A guide to convolution arithmetic for deep learning.
The GitHub repository accompanying the paper also has an animation of how dilated convolution works: https://github.com/vdumoulin/conv_arithmetic