Tags: python, text, deep-learning, nlp, grover

Effect of max sequence length on Grover


I have been working on the Grover model by rowanz. I was able to train Grover's large model with a batch size of 4, but I was getting a memory allocation error while fine-tuning the mega model. I then reduced the batch size to 1 and training is now ongoing. I also tried reducing max_seq_length to 512 with batch_size set to 4, and that worked as well.

My question is: which parameter affects performance more, reducing the batch size or reducing max_seq_length?

Also, can I set max_seq_length to a value that is not a power of 2, for example something between 512 and 1024?


Solution

  • Which parameter affects performance more, reducing the batch size or reducing max_seq_length?

    Effects of batch size:

    1. On performance: essentially none. It is a common misconception that batch size affects the end metrics (e.g. accuracy) in any meaningful way. Smaller batches do mean metrics are reported over shorter intervals, which gives the illusion of much larger variability than there actually is; the effect is most noticeable with a batch size of 1, for obvious reasons. Larger batches yield smoother, more reliable metric estimates because each value is averaged over more data points, but the end metrics usually converge to the same place (allowing for the random initialization of weights).
    2. On efficiency: a larger batch size means metrics are computed less often, but it also requires more memory, since activations and gradients are held for every example in the batch. That is exactly the issue you ran into. So batch size is more of an efficiency concern (and a question of how often you want to check the model's output) than a performance one. If memory forces you down to a batch size of 1, see the gradient-accumulation sketch after this list.
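
    When memory is what forces the batch size down to 1 (as with the mega model here), a common workaround is gradient accumulation: run several small forward/backward passes and apply the optimizer update once over their summed gradients, recovering a larger effective batch without the memory cost. The sketch below is illustrative only and written in PyTorch for brevity; Grover's own training code is TensorFlow, and `model`, `optimizer`, `data_loader`, and `loss_fn` are placeholders for your own objects.

    ```python
    import torch

    # Minimal gradient-accumulation sketch (PyTorch, illustrative only --
    # Grover itself is TensorFlow). `model`, `optimizer`, `data_loader`,
    # and `loss_fn` stand in for your own training objects.
    ACCUMULATION_STEPS = 4  # effective batch = per-step batch size * 4

    def train_epoch(model, optimizer, data_loader, loss_fn):
        model.train()
        optimizer.zero_grad()
        for step, (inputs, targets) in enumerate(data_loader):
            loss = loss_fn(model(inputs), targets)
            (loss / ACCUMULATION_STEPS).backward()   # scale so gradients average
            if (step + 1) % ACCUMULATION_STEPS == 0:
                optimizer.step()                     # one update per 4 micro-batches
                optimizer.zero_grad()
    ```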

    Effects of max_seq_length:

    1. On performance: probably the most important parameter for the performance of language models like Grover. The reason is that the perplexity of human-written text is lower than that of randomly sampled text, and this gap increases with sequence length: generally, the longer the sequence, the easier it is for a language model to stay consistent over the whole course of the output. So yes, it does help model performance. However, you may want to check the documentation for your particular model for any "Goldilocks zone" of sequence lengths, and for whether power-of-2 lengths are preferred over others.

    2. On efficiency: larger sequence lengths of course require more processing power and memory, so the higher you go, the more resources you will need. For transformer-based models like Grover, the self-attention score matrix grows quadratically with sequence length, so increasing max_seq_length is typically more expensive than increasing batch size by the same factor. A rough estimate of this scaling is sketched below.
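
    As a rough back-of-the-envelope illustration of that scaling (the hidden size and layer count below are assumed placeholder values in the ballpark of GPT-2-scale models, not Grover-Mega's exact configuration), activation memory grows linearly with batch size, while the attention score matrix grows quadratically with sequence length:

    ```python
    # Very rough activation-memory estimate for a transformer forward pass.
    # hidden=1536 and layers=48 are assumed placeholder values; adjust them
    # to your model's actual configuration.
    def rough_activation_bytes(batch_size, seq_len, hidden=1536, layers=48,
                               bytes_per_value=4):
        hidden_acts = batch_size * seq_len * hidden   # linear in seq_len
        attn_scores = batch_size * seq_len * seq_len  # quadratic in seq_len
        return (hidden_acts + attn_scores) * bytes_per_value * layers

    for bs, sl in [(4, 1024), (1, 1024), (4, 512)]:
        gb = rough_activation_bytes(bs, sl) / 1024 ** 3
        print(f"batch_size={bs}, max_seq_length={sl}: ~{gb:.2f} GB")
    ```

    This estimate is deliberately crude (it ignores attention heads, feed-forward activations, optimizer state, and so on), but it shows why, under these assumptions, halving the sequence length frees at least as much memory as halving the batch size.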

  • Can I set max_seq_length to a value that is not a power of 2, for example something between 512 and 1024?

    Yes, why not? No model is designed to work with only a fixed set of values. Experiment with different sequence lengths and see which works best for you. Tuning parameters in powers of two is a classical practice that gives a small computational advantage thanks to their simple binary representation, but for large models today the difference is negligible. A minimal padding/truncation sketch for an arbitrary length follows below.
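
    As a concrete (and purely illustrative) sketch: nothing in tokenization or padding requires a power of two, so a value like 768 works the same way as 512 or 1024; each tokenized example is simply padded or truncated to whatever length you configure. The pad id of 0 below is an assumption, not Grover's actual padding token.

    ```python
    # Illustrative only: max_seq_length does not need to be a power of 2.
    # `tokens` stands in for an already-tokenized example; pad_id=0 is assumed.
    def pad_or_truncate(tokens, max_seq_length=768, pad_id=0):
        if len(tokens) >= max_seq_length:
            return tokens[:max_seq_length]
        return tokens + [pad_id] * (max_seq_length - len(tokens))

    example = list(range(600))
    print(len(pad_or_truncate(example, max_seq_length=768)))  # -> 768
    ```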