python pickle cloudpickle

cloudpickle output for an object of an imported class versus a class defined in the module where pickling occurs


I noticed that the pickle file resulting from cloudpickle.dump(obj) is different depending on whether the class of obj (call it SubClass, a subclass of BaseClass) is imported or defined in the same module as where the cloudpickling occurs.

In particular, if BaseClass and SubClass are imported, then the pickle file only stores a reference to the myclass module and the SubClass class, as shown by disassembling the pickle.

If BaseClass and SubClass are defined in the same module as where the cloudpickling occurs, then the pickle file seems to store the code of BaseClass and SubClass.

Does anyone know why this happens? Is this because cloudpickle serializes objects with their classes completely when they are defined in the main module?

BaseClass and SubClass defined in the same module where cloudpickling occurs:

import cloudpickle
import pickletools


class BaseClass:
    def func(self):
        print("BaseClass")


class SubClass(BaseClass):
    def subfunc(self):
        print("SubClass")


obj = SubClass()
with open("cloudpickle_object.pkl", "wb") as f:
    cloudpickle.dump(obj, f)

with open("cloudpickle_object.pkl", "rb") as infile:
    pickletools.dis(infile)

Output of disassembler shows BaseClass and SubClass code in pickle file:

   83: \x8c     SHORT_BINUNICODE 'SubClass'
   93: \x94     MEMOIZE    (as 6)
   94: h        BINGET     2
   96: (        MARK
   97: h            BINGET     5
   99: \x8c         SHORT_BINUNICODE 'BaseClass'
  110: \x94         MEMOIZE    (as 7)
  111: h            BINGET     3
  113: \x8c         SHORT_BINUNICODE 'object'

BaseClass and SubClass imported from a different module than the one where cloudpickling occurs:

import cloudpickle
import pickletools
from myclass import SubClass


obj = SubClass()
with open("cloudpickle_object.pkl", "wb") as f:
    cloudpickle.dump(obj, f)

with open("cloudpickle_object.pkl", "rb") as infile:
    pickletools.dis(infile)

Output contains only a reference to SubClass, with no BaseClass and no code:

    0: \x80 PROTO      4
    2: \x95 FRAME      27
   11: \x8c SHORT_BINUNICODE 'myclass'
   20: \x94 MEMOIZE    (as 0)
   21: \x8c SHORT_BINUNICODE 'SubClass'
   [...]
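Instead of reading the full disassembly, the names embedded in a pickle can be listed programmatically. The following is a minimal sketch using pickletools.genops from the standard library. The helper name embedded_strings is made up for illustration, and OrderedDict is only a stand-in for an imported class so that the snippet runs without cloudpickle or the myclass module.

```python
import pickle
import pickletools
from collections import OrderedDict


def embedded_strings(payload):
    """Return every string argument found in the pickle's opcode stream."""
    # genops yields (opcode, arg, position) tuples without executing the pickle.
    return [arg for _opcode, arg, _pos in pickletools.genops(payload)
            if isinstance(arg, str)]


# An instance of an importable class is stored as a module name plus a class
# name, the same pattern as 'myclass' / 'SubClass' in the output above.
names = embedded_strings(pickle.dumps(OrderedDict(), protocol=4))
print(names)
```

Running the same helper on the bytes returned by cloudpickle.dumps(obj) should make it easy to see whether method names such as 'subfunc' appear (class stored by value) or only the module and class names (class stored by reference).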

Solution

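  • cloudpickle pickles by value only the functions and classes that are defined in the __main__ module (or interactively); anything importable from another module is pickled by reference, as mentioned in the project README at https://github.com/cloudpipe/cloudpickle.

    So, it is natural for the pickle of an object whose class was imported to be smaller, since that module is expected to be importable when unpickling.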
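    To see what "expected to be imported when unpickling" means in practice, here is a rough sketch. It writes a throwaway module to a temporary directory (the name demo_mod_by_reference is made up for this example), pickles an instance by reference, and then tries to load it after the module is gone. For contrast it also uses cloudpickle.register_pickle_by_value, which as far as I know is available in cloudpickle 2.0 and later, to force the same classes to be pickled by value. The methods return strings instead of printing so the result is easy to check.

```python
import importlib
import pickle
import sys
import tempfile
import textwrap
from pathlib import Path

import cloudpickle

MODULE_NAME = "demo_mod_by_reference"  # hypothetical module name

SOURCE = textwrap.dedent('''
    class BaseClass:
        def func(self):
            return "BaseClass"

    class SubClass(BaseClass):
        def subfunc(self):
            return "SubClass"
''')


def run_demo():
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, MODULE_NAME + ".py").write_text(SOURCE)
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            module = importlib.import_module(MODULE_NAME)
            obj = module.SubClass()

            # Default behaviour for an importable class: pickled by reference.
            by_reference = cloudpickle.dumps(obj)

            # Assumption: cloudpickle >= 2.0, which provides this function.
            cloudpickle.register_pickle_by_value(module)
            try:
                by_value = cloudpickle.dumps(obj)
            finally:
                cloudpickle.unregister_pickle_by_value(module)
        finally:
            # Make the module unimportable again, as on a machine without it.
            sys.path.remove(tmp)
            sys.modules.pop(MODULE_NAME, None)

    try:
        pickle.loads(by_reference)
        reference_loaded = True
    except ModuleNotFoundError:
        reference_loaded = False

    restored = pickle.loads(by_value)
    return {
        "reference_has_method_name": b"subfunc" in by_reference,
        "value_has_method_name": b"subfunc" in by_value,
        "reference_loaded": reference_loaded,
        "value_result": restored.subfunc(),
    }


result = run_demo()
print(result)
```

    The by-reference pickle carries only the module and class names, so loading it fails once the module cannot be imported, while the by-value pickle carries the class definitions themselves and is correspondingly larger.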

    Interestingly, there have been some feature requests and some work done to make cloudpickle serialize some imported modules. For example,