machine-learning, neural-network, recurrent-neural-network

Can a neural network be trained while it changes in size?


Are there known methods of continuous training and graceful degradation of a neural net while it shrinks or grows in size (by number of nodes, connections, whatever)?

As far as I can remember, everything I've read about neural networks takes a static perspective: you define the network and then train it.

If there is some neural network X with N nodes (neurons, whatever), is it possible to train the network X so that, while N increases or decreases, the network remains useful and capable of performing its task?


Solution

  • In general, changing a network's architecture (adding new layers, adding more neurons to existing layers) after the network has already been trained makes sense and is a rather common operation in the deep learning domain. One example is dropout: during training, half of the neurons are randomly switched off completely and only the remaining half participate in training during a given iteration (each iteration, or 'epoch' as it is often called, uses a different random set of switched-off neurons). Another example is transfer learning, where you train a network on one set of input data, cut off part of the outgoing layers, replace them with new layers, and retrain the model on another dataset.
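    To make the "adding more neurons to an existing layer" point concrete, here is a minimal sketch of growing a trained layer. It assumes tf.keras and tiny illustrative layer sizes, and the scheme shown (keep the old weights, give the extra units a fresh random start) is just one simple choice, not the only way to do this:

    ```python
    import numpy as np
    from tensorflow import keras

    # A small model assumed to be already trained: 10 inputs -> 32 hidden units -> 3 classes.
    old = keras.Sequential([
        keras.layers.Dense(32, activation="relu", input_shape=(10,)),
        keras.layers.Dense(3, activation="softmax"),
    ])
    # ... old.fit(...) would have been run here ...

    # A wider version of the same network (32 -> 48 hidden units).
    new = keras.Sequential([
        keras.layers.Dense(48, activation="relu", input_shape=(10,)),
        keras.layers.Dense(3, activation="softmax"),
    ])

    # Copy the trained weights into the first 32 units; the extra 16 units keep
    # their fresh random initialization and are learned on top of the old ones.
    W_old, b_old = old.layers[0].get_weights()
    W_new, b_new = new.layers[0].get_weights()
    W_new[:, :32] = W_old
    b_new[:32] = b_old
    new.layers[0].set_weights([W_new, b_new])

    # The output layer now has 48 inputs instead of 32; reuse the old rows as well.
    V_old, c_old = old.layers[1].get_weights()
    V_new, _ = new.layers[1].get_weights()
    V_new[:32, :] = V_old
    new.layers[1].set_weights([V_new, c_old])

    new.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    # new.fit(...) then continues training without starting from scratch.
    ```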

    To better explain why this makes sense, let's step back for a moment. In deep networks, where you have lots of hidden layers, each layer learns some abstraction from the incoming data. Each additional layer uses the abstract representations learned by the previous layer and builds upon them, combining those abstractions to form a higher-level representation of the data. For instance, suppose you are trying to classify images with a DNN. The first layer will learn rather simple concepts from the images, like edges or points. The next layer could combine these simple concepts to learn primitives, like triangles, circles, or squares. The next layer could take this further and combine the primitives to represent objects that appear in images, like 'a car' or 'a house', and using softmax it calculates the probabilities of the answer you are looking for (what to actually output). I should mention that these learned representations can actually be checked: you can visualize the activations of your hidden layers and see what they learned. For example, this was done in Google's 'Inceptionism' project.
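    Inspecting those representations is straightforward to try yourself: you can build a second model that shares the trained model's input but outputs one of its hidden layers. Below is a minimal tf.keras sketch with a toy convolutional network and random stand-in images; the layer sizes and the probed layer index are arbitrary choices for illustration:

    ```python
    import numpy as np
    from tensorflow import keras

    # Toy stand-in for a trained image model (in practice: your trained network).
    model = keras.Sequential([
        keras.layers.Conv2D(8, 3, activation="relu", input_shape=(28, 28, 1)),
        keras.layers.Conv2D(16, 3, activation="relu"),
        keras.layers.Flatten(),
        keras.layers.Dense(10, activation="softmax"),
    ])

    # A probe model: same input, but its output is an intermediate layer's activations.
    probe = keras.Model(inputs=model.input, outputs=model.layers[1].output)

    images = np.random.rand(4, 28, 28, 1).astype("float32")   # stand-in for real images
    feature_maps = probe.predict(images)                      # shape: (4, 24, 24, 16)
    # Plotting these feature maps for a trained network shows edge- and shape-like detectors.
    ```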

    With that in mind, let's get back to what I mentioned earlier. Dropout is used to improve the generalization of the network. It forces each neuron to 'not be so sure' that some pieces of information from the previous layer will be available, and makes it learn representations that also rely on the less favorable and less informative abstractions from the previous layer. It forces the neuron to consider all of the representations from the previous layer when making decisions, instead of putting all of its weight on the couple of neurons it 'likes most of all'. By doing this, the network is usually better prepared for new data whose inputs differ from the training set.
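    For reference, dropout is usually written as its own layer between the layers it regularizes. A minimal tf.keras sketch, where the layer sizes and the 0.5 rate are only illustrative:

    ```python
    from tensorflow import keras

    model = keras.Sequential([
        keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        keras.layers.Dropout(0.5),   # randomly zeroes ~half of these activations each step
        keras.layers.Dense(128, activation="relu"),
        keras.layers.Dropout(0.5),
        keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    # Dropout is only active during model.fit(...); at prediction time all neurons are used.
    ```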

    Q: "As far as you're aware is the quality of the stored knowledge (whatever training has done to the net) still usable following the dropout? Maybe random halves could be substituted by random 10ths with a single 10th dropping, that might result in less knowledge loss during the transition period."

    A: Unfortunately I can't properly explain why precisely half of the neurons are switched off and not 10% (or any other number). Maybe there is an explanation, but I haven't seen it; in general it just works. I should also mention that the purpose of dropout is to ensure that each neuron doesn't rely on just a few neurons from the previous layer and is ready to make a decision even when the neurons that usually helped it make the correct decision are not available. It is used for generalization only and helps the network cope better with data it hasn't seen previously; nothing else is achieved with dropout.
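    In practice the drop probability is just a hyperparameter of the layer, so the 'random 10ths' idea from the question is a one-line change to experiment with (tf.keras shown; the specific rates are only illustrative):

    ```python
    from tensorflow import keras

    keras.layers.Dropout(0.5)   # the commonly used rate for hidden layers
    keras.layers.Dropout(0.1)   # a milder variant: drop only 10% of activations each step
    ```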

    Now let's consider transfer learning again. Suppose you have a network with 4 layers. You train it to recognize specific objects in pictures (cat, dog, table, car, etc.). Then you cut off the last layer, replace it with three additional layers, and train the resulting 6-layer network on a dataset which, for instance, produces short sentences about what is shown in the image ('a cat is on the car', 'house with windows and a tree nearby', etc.). What did we do with this operation? Our original 4-layer network was capable of telling whether some specific object was in the image we fed it. Its first 3 layers learned good representations of the images: the first layer learned about possible edges, points, or other extremely primitive geometric shapes; the second layer learned more elaborate geometric figures like circles or squares; the last layer knows how to combine them to form higher-level objects such as 'car', 'cat', or 'house'. Now we can simply reuse this good representation, learned in a different domain, and just add several more layers. Each of them will use the abstractions from the last (3rd) layer of the original network and learn how to combine them to create meaningful descriptions of images. While you train on the new dataset, with images as input and sentences as output, the first 3 layers taken from the original network will be adjusted as well, but those adjustments will be mostly minor, while the 3 new layers will change significantly during training. What we achieve with transfer learning is (see the code sketch after this list):

    1) We can learn much better data representations. We can create a network which is very good at a specific task and then build upon that network to perform something different.

    2) We can save training time: the first layers of the network will already be trained well enough that the layers closer to the output receive rather good data representations from the start, so training should finish much faster when using pre-trained first layers.
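    Here is a rough sketch of this kind of layer surgery in tf.keras. The file name, layer sizes, and vocabulary size are hypothetical, and the new 'head' is simplified to a single softmax output (a real caption generator would need a sequence decoder on top), so treat it as an illustration of cutting off and re-attaching layers rather than a complete captioning model:

    ```python
    from tensorflow import keras

    vocab_size = 10_000   # hypothetical size of the caption vocabulary for the new task

    # A network already trained to classify objects in images (cat, dog, table, car, ...).
    base = keras.models.load_model("image_classifier.keras")   # hypothetical file name

    # Cut off the old classification layer; keep everything learned before it.
    features = base.layers[-2].output
    for layer in base.layers:
        layer.trainable = False       # optionally freeze the reused layers at first

    # Add new layers that learn the new task on top of the reused representations.
    x = keras.layers.Dense(256, activation="relu")(features)
    x = keras.layers.Dense(256, activation="relu")(x)
    outputs = keras.layers.Dense(vocab_size, activation="softmax")(x)

    model = keras.Model(inputs=base.input, outputs=outputs)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    # model.fit(caption_dataset, ...)  # the reused layers change only slightly, the new ones a lot
    ```

    Once the new layers have settled, the reused layers can be unfrozen and fine-tuned with a small learning rate, which matches the 'mostly minor adjustments' described above.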

    So the bottom line is that pre-training a network and then reusing part of it (or the whole network) in another network makes perfect sense and is not something uncommon.