I am testing the example here: https://fluxml.ai/Flux.jl/stable/models/overview/
using Flux
actual(x) = 4x + 2
x_train, x_test = hcat(0:5...), hcat(6:10...)
y_train, y_test = actual.(x_train), actual.(x_test)
predict = Dense(1 => 1)
predict(x_train)
loss(x, y) = Flux.Losses.mse(predict(x), y)
loss(x_train,y_train)
using Flux: train!
opt = Descent(0.1)
data = [(x_train, y_train)]
parameters = Flux.params(predict)
predict.weight in parameters, predict.bias in parameters
train!(loss, parameters, data, opt)
loss(x_train, y_train)
for epoch in 1:1000
train!(loss, parameters, data, opt)
end
loss(x_train, y_train)
predict(x_test)
y_test
As you can see, it is just a very simple model, actual(x) = 4x + 2. If you run this code you get an almost perfect prediction result (predict(x_test) vs. y_test):
1×5 Matrix{Float32}: 26.0001 30.0001 34.0001 38.0001 42.0001
1×5 Matrix{Int64}: 26 30 34 38 42
But if I make a minor change, feeding the model one more data point, like this:
x_train, x_test = hcat(0:6...), hcat(6:10...)
So I didn't change anything except that one line: I just changed the 5 to a 6. Then the prediction result blows up to NaN:
1×5 Matrix{Float32}: NaN NaN NaN NaN NaN
1×5 Matrix{Int64}: 26 30 34 38 42
But why?
I think this is simply a case of a high learning rate gone wrong. I can reproduce the same NaN behaviour with Descent(0.1). Printing the loss during training shows it going to Inf first and then NaN - a classic sign of divergence caused by too large a learning rate. With a learning rate of 0.01 it works just fine and gives the expected answer. The training probably diverges when x_train is hcat(0:6...) because the extra, larger input (x = 6) increases the gradient magnitudes, so the same step size now overshoots. A smaller learning rate lets the network take smaller steps, and it manages to find the minimum as expected.
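For example, here is a minimal sketch of the fix. The only changes from the question's code are the learning rate and a loss printout; it uses the same implicit Flux.params / train! style as the question, which newer Flux versions may flag as deprecated.

using Flux
using Flux: train!

actual(x) = 4x + 2

# The "problematic" training range with the extra point x = 6
x_train, x_test = hcat(0:6...), hcat(6:10...)
y_train, y_test = actual.(x_train), actual.(x_test)

predict = Dense(1 => 1)
loss(x, y) = Flux.Losses.mse(predict(x), y)

parameters = Flux.params(predict)
data = [(x_train, y_train)]

# Smaller step size: 0.01 instead of 0.1
opt = Descent(0.01)

for epoch in 1:1000
    train!(loss, parameters, data, opt)
    # Watch the loss: with Descent(0.1) it shoots to Inf and then NaN,
    # with Descent(0.01) it keeps decreasing.
    epoch % 100 == 0 && println("epoch $epoch: loss = ", loss(x_train, y_train))
end

predict(x_test)   # should now be close to y_test = [26 30 34 38 42]

With this change the printed loss should decrease steadily and predict(x_test) ends up close to y_test; switching the optimiser back to Descent(0.1) reproduces the Inf-then-NaN behaviour described above.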