python, pytorch, onnx, quantization, onnxruntime

Converting a PyTorch model to ONNX increases file size for ALBert


Goal: Use this notebook to perform quantization on the albert-base-v2 model.

Kernel: conda_pytorch_p36.


Outputs in Sections 1.2 & 2.2 show that converting vanilla BERT from PyTorch to ONNX leaves the file size roughly unchanged. However, when running ALBert, the ONNX export is several times larger than the PyTorch model.

I suspect this is the reason both quantized versions of ALBert perform worse than vanilla ALBert.

PyTorch:

PyTorch full precision model size (MB): 44.58906650543213
PyTorch quantized model size (MB): 22.373255729675293

ONNX:

ONNX full precision model size (MB): 341.64233207702637
ONNX quantized model size (MB): 85.53886985778809
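For context, here is a minimal sketch of how such sizes can be measured; the file names, dummy input, and export settings below are illustrative rather than the notebook's exact code:

    import os
    import torch
    from transformers import AlbertModel

    model = AlbertModel.from_pretrained("albert-base-v2")
    model.eval()
    model.config.return_dict = False  # tuple outputs are easier to export

    # Save the PyTorch weights, then export to ONNX with a dummy input.
    torch.save(model.state_dict(), "albert.pt")
    dummy = torch.ones(1, 128, dtype=torch.long)
    torch.onnx.export(model, (dummy,), "albert.onnx",
                      input_names=["input_ids"], opset_version=11)

    print("PyTorch size (MB):", os.path.getsize("albert.pt") / 1e6)
    print("ONNX size (MB):", os.path.getsize("albert.onnx") / 1e6)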

Why might exporting ALBert from PyTorch to ONNX increase model size, but not for BERT?

Please let me know if there's anything else I can add to the post.


Solution

  • Explanation

    The ALBert model shares weights across its layers. torch.onnx.export writes each layer's weights out as separate tensors, so the shared parameters are duplicated and the exported file grows far beyond the PyTorch checkpoint.

    A number of GitHub issues about this phenomenon have been marked as solved.

    The most common solution is to remove the duplicated shared weights, i.e. to find initializer tensors that contain exactly the same values, keep one copy, and re-point all references to it. A quick check of the duplication is sketched below.
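
    For example, you can hash each initializer's values and count repeats; "albert.onnx" here is a hypothetical path to the exported model:

    import hashlib
    import onnx
    from onnx import numpy_helper

    model = onnx.load("albert.onnx")  # hypothetical path
    seen = set()
    redundant = 0
    for tensor in model.graph.initializer:
        # Hash the raw values so identical weight copies map to the same digest.
        digest = hashlib.sha256(numpy_helper.to_array(tensor).tobytes()).hexdigest()
        if digest in seen:
            redundant += 1
        else:
            seen.add(digest)
    print(redundant, "initializers duplicate an earlier tensor's values")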


    Solutions

    Section "Removing shared weights" in onnx_remove_shared_weights.ipynb.

    Pseudo-code:

    import onnx
    from onnxruntime.transformers.onnx_model import OnnxModel

    model = onnx.load(path)
    onnx_model = OnnxModel(model)

    count = len(model.graph.initializer)
    same = [-1] * count  # same[j] = i marks initializer j as a duplicate of initializer i

    # Find initializers that hold identical values.
    for i in range(count - 1):
        if same[i] >= 0:
            continue
        for j in range(i + 1, count):
            if OnnxModel.has_same_value(model.graph.initializer[i], model.graph.initializer[j]):
                same[j] = i

    # Re-point every node that reads a duplicate to the surviving copy.
    for i in range(count):
        if same[i] >= 0:
            onnx_model.replace_input_of_all_nodes(
                model.graph.initializer[i].name, model.graph.initializer[same[i]].name
            )

    onnx_model.update_graph()  # prunes initializers that are no longer referenced
    onnx_model.save_model_to_file(output_path)
    
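    After saving, the file should shrink back toward the PyTorch checkpoint size. A quick sanity check, using the same path/output_path placeholders as above:

    import os
    print("Before (MB):", os.path.getsize(path) / 1e6)
    print("After (MB):", os.path.getsize(output_path) / 1e6)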

    Source of both solutions