machine-learning julia flux

Why doesn't the loss calculated by Flux `withgradient` match what I have calculated?


I'm trying to train a simple CNN with Flux and running into a weird issue. During training the loss appears to go down (suggesting that training is working), but despite what the loss curve suggests, the output of the "trained" model is very bad. When I calculated the loss by hand it differed from what training reported, as if the model hadn't been trained at all.

I then started comparing the loss returned inside the gradient call vs. outside it, and after a lot of digging I think the problem is related to the BatchNorm layer. Consider the following minimal example:

using Flux
x = rand(100,100,1,1) #say a greyscale image 100x100 with 1 channel (greyscale) and 1 batch
y = @. 5*x + 3 #output image, some relationship to the input values (doesn't matter for this)
m = Chain(BatchNorm(1),Conv((1,1),1=>1)) #very simple model (doesn't really do anything but illustrates the problem)
l_init = Flux.mse(m(x),y) #initial loss after model creation
l_grad, grad = Flux.withgradient(m -> Flux.mse(m(x),y), m) #loss calculated by gradient
l_final = Flux.mse(m(x),y) #loss calculated again using the model (no parameters have been updated)
println("initial loss: $l_init")
println("loss calculated in withgradient: $l_grad")
println("final loss: $l_final")

All of the losses above are different, sometimes drastically so (running it just now I got 22.6, 30.7, and 23.0), when I would expect them all to be the same.

Interestingly, if I remove the BatchNorm layer the outputs are all the same, i.e. running:

using Flux
x = rand(100,100,1,1) #say a greyscale image 100x100 with 1 channel (greyscale) and 1 batch
y = @. 5*x + 3 #output image
m = Chain(Conv((1,1),1=>1))
l_init = Flux.mse(m(x),y) #initial loss after model creation
l_grad, grad = Flux.withgradient(m -> Flux.mse(m(x),y), m)
l_final = Flux.mse(m(x),y)
println("initial loss: $l_init")
println("loss calculated in withgradient: $l_grad")
println("final loss: $l_final")

produces the same number for each loss calculation.

Why does including the BatchNorm layer change the value of the loss like this?

My (limited) understanding was that BatchNorm just normalizes the input values. I can see how that would make the loss differ between the unnormalized and normalized cases, but I don't understand why it would produce different loss values for the same input on the same model when none of the model's parameters have been updated.


Solution

  • Look at the documentation of BatchNorm:

    BatchNorm(channels::Integer, λ=identity;
                initβ=zeros32, initγ=ones32,
                affine=true, track_stats=true, active=nothing,
                eps=1f-5, momentum= 0.1f0)
    
      Batch Normalization (https://arxiv.org/abs/1502.03167) layer. channels should
      be the size of the channel dimension in your data (see below).
    
      Given an array with N dimensions, call the N-1th the channel dimension. For a
      batch of feature vectors this is just the data dimension, for WHCN images it's
      the usual channel dimension.
    
      BatchNorm computes the mean and variance for each D_1×...×D_{N-2}×1×D_N input
      slice and normalises the input accordingly.
    
      If affine=true, it also applies a shift and a rescale to the input through to
      learnable per-channel bias β and scale γ parameters.
    
      After normalisation, elementwise activation λ is applied.
    
      If track_stats=true, accumulates mean and var statistics in training phase that
      will be used to renormalize the input in test phase.
    
      Use testmode! during inference.
    
      Examples
      ≡≡≡≡≡≡≡≡≡≡
    
      julia> using Statistics
      
      julia> xs = rand(3, 3, 3, 2);  # a batch of 2 images, each having 3 channels
      
      julia> m = BatchNorm(3);
      
      julia> Flux.trainmode!(m);
      
      julia> isapprox(std(m(xs)), 1, atol=0.1) && std(xs) != std(m(xs))
      true
    

    The key bit here is that by default track_stats=true, which is what causes the changing loss values. If you don't want this behaviour, initialise your model with

    m = Chain(BatchNorm(1, track_stats=false), Conv((1,1),1=>1)) # same model, but without tracked running statistics
    

    and you'll get the same value for all three losses, as in your second example.
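    To double-check, here is your first snippet with only that one line changed (a minimal sketch; it should now print the same value three times):

    using Flux
    x = rand(100,100,1,1)
    y = @. 5*x + 3
    m = Chain(BatchNorm(1, track_stats=false), Conv((1,1),1=>1))
    l_init = Flux.mse(m(x), y)                                   # before the gradient call
    l_grad, grad = Flux.withgradient(m -> Flux.mse(m(x), y), m)  # inside the gradient call
    l_final = Flux.mse(m(x), y)                                  # after the gradient call
    println((l_init, l_grad, l_final))                           # all three values agree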

    The BatchNorm layer is initialised with zero mean and unit std, and your input data doesn't have those statistics. With track_stats=true the layer behaves differently inside and outside the gradient call: inside withgradient it is in training mode, so it normalises with the batch statistics and updates the running statistics as a side effect; outside it is in test mode and normalises with the stored running statistics. That is why you get changing loss values even with repeated identical input and no parameter updates, as far as I can see.
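
    Alternatively, if you want to keep track_stats=true (you usually do for inference), you can force the layer's mode yourself, as the docs suggest with testmode!. A small sketch of that idea:

    using Flux
    x = rand(100,100,1,1)
    y = @. 5*x + 3
    m = Chain(BatchNorm(1), Conv((1,1),1=>1))
    Flux.testmode!(m)  # always use (and never update) the stored running statistics
    l_init = Flux.mse(m(x), y)
    l_grad, grad = Flux.withgradient(m -> Flux.mse(m(x), y), m)
    l_final = Flux.mse(m(x), y)
    println((l_init, l_grad, l_final))  # identical, since the normalisation no longer depends on being inside a gradient call

    During real training you'd leave the layer in its automatic (or train) mode; this is only meant to show that the discrepancy comes from the train/test mode switch and the updated running statistics, not from any parameter update.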