tensorflow, keras, tensorflow-lite, quantization-aware-training

Why is a TFLite model derived from a quantization-aware trained model different from a normal model with the same weights?


I am training a Keras model that I want to deploy with TFLite in a quantized, 8-bit environment (a microcontroller). To improve quantization performance, I perform quantization-aware training. I then create the quantized TFLite model using my validation set as the representative dataset. Performance is evaluated on the validation set and illustrated in this image:

[Figure: error rate for various batches of 20 runs in different conditions]
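For reference, my QAT and conversion steps look roughly like the sketch below (names such as `model`, `x_train`, `y_train`, and `x_val` are placeholders for my actual model and data):

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Wrap the float Keras model for quantization-aware training.
qat_model = tfmot.quantization.keras.quantize_model(model)
qat_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
qat_model.fit(x_train, y_train, epochs=5)

def representative_dataset():
    # Yield calibration samples from the validation set, one at a time,
    # each with a leading batch dimension.
    for sample in x_val:
        yield [sample[None, ...].astype("float32")]

converter = tf.lite.TFLiteConverter.from_keras_model(qat_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Force full integer quantization for the 8-bit microcontroller target.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_from_qat = converter.convert()
```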

If, instead of simply generating the TFLite model (cyan in the figure) from the QA-trained model (red in the figure), I copy the weights from the QA-trained model into the original one and generate the TFLite model from that copy (purple in the figure) to work around an issue, I get slightly different predictions. Why is that?
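The weight copy is along these lines, continuing the sketch above (a rough sketch; it assumes tfmot's `QuantizeWrapper` keeps the original layer, with its original name and float kernel/bias, in its `.layer` attribute, and that `float_model` is a fresh copy of the original architecture):

```python
# Copy the trained float weights out of the QAT model back into the
# original float architecture, then convert that copy instead.
float_layers = {layer.name: layer for layer in float_model.layers}

for layer in qat_model.layers:
    # tfmot wraps each layer in a QuantizeWrapper; the wrapped original
    # layer is stored in `.layer`. Layers without a wrapper (e.g. the
    # input QuantizeLayer tfmot adds) are skipped.
    inner = getattr(layer, "layer", None)
    if inner is not None and inner.name in float_layers:
        float_layers[inner.name].set_weights(inner.get_weights())

converter = tf.lite.TFLiteConverter.from_keras_model(float_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
tflite_from_float = converter.convert()
```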

I understand that the TFLite models would be slightly different from the QA-trained model, since the conversion applies post-training quantization based on the validation set. But shouldn't the quantization be identical if the structure, weights, and biases of the network are the same?

Sub-question: why is the TFLite model on average slightly worse than the normal Keras model? Since I am quantizing and evaluating on the same validation set, if anything I would expect it to perform artificially better.


Solution

  • It sounds like you are combining post-training quantization and quantization-aware training. If I understand correctly, you are training a quantized model, copying only the float weights into the original float model, and then running post-training quantization on that copy.

    This procedure is a bit unusual: the quantized version of the model also quantizes activations, so just copying the weights does not reproduce the exact same network. The activation quantization parameters used by the quantized TF model may end up different from those calibrated on the representative dataset, and that will lead to different answers. The sketch at the end of this answer shows one way to compare the parameters the two paths produce.

    You can also expect the QAT model to work better than the resulting TFLite model, because its activation quantization parameters were learned during training rather than estimated afterwards.

    I suggest resolving your earlier question instead, so that you can convert the QAT model directly; that should give a cleaner solution and higher accuracy.
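
    If you want to verify this concretely, you can compare the quantization parameters that ended up in the two flatbuffers. Here is a minimal sketch; the variables `tflite_from_qat` and `tflite_from_float` stand for the bytes produced by the two conversion paths:

    ```python
    import tensorflow as tf

    def dump_quant_params(tflite_bytes):
        # Print the scale/zero-point chosen for every tensor so the two
        # conversion paths can be compared side by side.
        interpreter = tf.lite.Interpreter(model_content=tflite_bytes)
        interpreter.allocate_tensors()
        for detail in interpreter.get_tensor_details():
            scale, zero_point = detail["quantization"]
            print(f'{detail["name"]}: scale={scale}, zero_point={zero_point}')

    dump_quant_params(tflite_from_qat)    # converted directly from the QAT model
    dump_quant_params(tflite_from_float)  # converted after the weight-copy workaround
    ```

    If the activation tensors show different scales or zero points between the two models, that difference alone explains the diverging predictions, even with identical weights.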