How do Hessian-Free (HF) Optimization techniques compare against Gradient Descent techniques (e.g., Stochastic Gradient Descent (SGD), Batch Gradient Descent, Adaptive Gradient Descent) for training Deep Neural Networks (DNNs)?
Under what circumstances should one prefer HF techniques over Gradient Descent techniques?
In short, HFO is a way to avoid the vanishing gradient problem that comes from (naively) using backpropagation in deep nets. However, much of Deep Learning is about avoiding that same issue by tweaking the learning procedure and/or the architecture, so in the end it comes down to specific comparisons between each particular network model (and training strategy, such as pre-training) and HFO. There is a lot of recent work on this topic, but it's not fully explored yet: in some cases HFO performs better, in some it doesn't. Afaik (this might be outdated soon), Elman-style RNNs (not LSTMs) benefit from it the most.
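To make the contrast concrete, here is a minimal sketch of the core idea behind HFO: instead of stepping along the raw gradient, approximately solve H d = -g with conjugate gradient, using only Hessian-vector products and never forming H itself. This is not Martens' full HF algorithm (no damping, no Gauss-Newton matrix, no mini-batched curvature, no CG backtracking); the toy loss, step size, and iteration counts are just illustrative assumptions. It uses JAX, where exact Hessian-vector products fall out of forward-over-reverse autodiff.

```python
import jax
import jax.numpy as jnp

def loss(w):
    # Toy ill-conditioned quadratic: curvature 200 along w[0], 2 along w[1].
    return 100.0 * w[0] ** 2 + w[1] ** 2

grad_fn = jax.grad(loss)

def hvp(w, v):
    # Hessian-vector product H(w) @ v without ever forming H,
    # via forward-over-reverse autodiff (Pearlmutter's trick).
    return jax.jvp(grad_fn, (w,), (v,))[1]

def cg_solve(w, b, iters=10):
    # Plain conjugate gradient for H d = b; real HF truncates this loop early.
    d = jnp.zeros_like(b)
    r = b - hvp(w, d)
    p = r
    for _ in range(iters):
        rr = r @ r
        if rr < 1e-12:  # already solved; also guards a 0/0 below
            break
        Hp = hvp(w, p)
        alpha = rr / (p @ Hp)
        d = d + alpha * p
        r = r - alpha * Hp
        p = r + ((r @ r) / rr) * p
    return d

w_sgd = w_hf = jnp.array([1.0, 1.0])
for _ in range(5):
    # SGD: fixed step along -g; stability forces lr < 2/200 = 0.01
    # (the steepest curvature direction), so progress along w[1] crawls.
    w_sgd = w_sgd - 0.009 * grad_fn(w_sgd)
    # HF: solve H d = -g with CG, i.e. a (truncated) Newton step.
    w_hf = w_hf + cg_solve(w_hf, -grad_fn(w_hf))

print("loss after 5 steps  SGD:", loss(w_sgd), " HF:", loss(w_hf))
```

On ill-conditioned curvature like this toy loss, the HF step adapts to the curvature in each direction, while plain gradient descent is forced into a tiny learning rate by the steepest direction. That is exactly the regime (deep nets, pathological curvature) where HFO was proposed to pay off.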
Tl;dr: SGD is still the go-to method, flawed as it is, until someone finds a better non-SGD way of learning. HFO is one suggestion among many, but it hasn't been shown to be state-of-the-art yet.