python memory reflection protocol-buffers

Why does asizeof() claim that a Python instance of a protobuf message takes so much memory?


I'm investigating a memory issue related to a protobuf message in Python.

Here is the simple protobuf message:

syntax="proto3";
message LiveAgentMessage {
  string text = 1;
}

protoc compiles it to a Python class. I then ran the following script:

from pympler import asizeof
from foo_pb2 import LiveAgentMessage

print(f"{asizeof.asizeof(LiveAgentMessage(text=''))=:}")
print(f"{asizeof.asizeof(LiveAgentMessage(text='a'))=:}")

The output is

asizeof.asizeof(LiveAgentMessage(text=''))=816
asizeof.asizeof(LiveAgentMessage(text='a'))=895904

Why does LiveAgentMessage(text='a') (the second line) take up so much memory for a one-letter string? It is roughly 1000x the size reported on the first line.

I'm using pympler to calculate the size, but I'm not sure it is the right tool for protobuf-generated Python classes.


Solution

  • The exact numbers depend on the version of python-protobuf you use, as well as on the environment variable PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION. But in any case the large size is mostly an artefact of how asizeof() works.

    You can pass stats=1 to asizeof() to see the details:

    >>> asizeof.asizeof(LiveAgentMessage(text='a'), stats=1)
    
     316712 bytes or 309.3 KiB
          8 byte aligned
          8 byte sizeof(void*)
          1 object given
       2838 objects sized
      19577 objects seen
        271 objects ranked
          0 objects missed
        725 duplicates
         50 deepest recursion
    
         10 largest objects (of 271 over 1024 bytes or 1.0 KiB)
     316712 bytes or 309.3 KiB: class foo_pb2.LiveAgentMessage: text: "a"\n, ix 0
     316104 bytes or 308.7 KiB: class dict: {<google.protobuf.descriptor.FieldDescriptor object at 0x7915ce4f9d80>: 'a'} leng 0, ix 1 (at 1), pix 0
     315816 bytes or 308.4 KiB: class google.protobuf.descriptor.FieldDescriptor: <google.protobuf.descriptor.FieldDescriptor object at 0x7915ce4f9d80>, ix 2 (at 2), pix 1
     315768 bytes or 308.4 KiB: class dict: {'_features': field_presence: IMPLICIT.....MakeScalarDefault at 0x7915cddbd990>} leng 32!, ix 3 (at 3), pix 2
     314560 bytes or 307.2 KiB: class google.protobuf.descriptor_pb2.FeatureSet: field_presence: IMPLICIT\nenum_type: O.... LENGTH_PREFIXED\njson_format: ALLOW\n, ix 4 (at 4), pix 3
     314008 bytes or 306.6 KiB: class dict: {<google.protobuf.descriptor.FieldDesc....scriptor object at 0x7915ce22b940>: 1} leng 8!, ix 5 (at 5), pix 4
     313648 bytes or 306.3 KiB: class google.protobuf.descriptor.FieldDescriptor: <google.protobuf.descriptor.FieldDescriptor object at 0x7915ce22b850>, ix 6 (at 6), pix 5
     313600 bytes or 306.2 KiB: class dict: {'_features': <google.protobuf.descrip....ocals>.EncodeField at 0x7915ce2863b0>} leng 32!, ix 7 (at 7), pix 6
     311968 bytes or 304.7 KiB: class google.protobuf.descriptor.FileDescriptor: <google.protobuf.descriptor.FileDescriptor object at 0x7915ce1e7640>, ix 8 (at 8), pix 7
     311920 bytes or 304.6 KiB: class dict: {'_features': <google.protobuf.descrip....ncies': [], 'public_dependencies': []} leng 32!, ix 9 (at 9), pix 8
    

    In this case, the message holds a reference to the Protocol Buffers descriptor set, as every message does when the pure Python protobuf implementation is used. There is only one global instance of that data, so it is not duplicated no matter how many messages you create; asizeof() simply counts everything reachable from the object it is given.

    The size shown for the empty message is different because asizeof() seems to ignore msg.DESCRIPTOR and only reaches the descriptors when one is referenced from msg._fields:

    >>> LiveAgentMessage(text = 'a')._fields
    {<google.protobuf.descriptor.FieldDescriptor object at 0x7f96163b26e0>: 'a'}
    
    >>> LiveAgentMessage()._fields
    {}
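
    The effect can be reproduced without protobuf or pympler. The following is a minimal sketch, not asizeof()'s actual algorithm: deep_size() is a hypothetical helper that follows references recursively, and FakeDescriptor and FakeMessage are made-up stand-ins for the shared descriptor data and for a message whose field dict is keyed by a descriptor.

```python
import sys

def deep_size(obj, seen=None):
    """Rough recursive size in bytes; every object is counted only once."""
    seen = set() if seen is None else seen
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    total = sys.getsizeof(obj)
    if isinstance(obj, dict):
        total += sum(deep_size(k, seen) + deep_size(v, seen) for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        total += sum(deep_size(item, seen) for item in obj)
    elif hasattr(obj, "__dict__"):
        total += deep_size(vars(obj), seen)
    return total

class FakeDescriptor:
    """Hypothetical stand-in for the large, globally shared descriptor data."""
    def __init__(self):
        self.payload = [str(i) * 10 for i in range(10_000)]

SHARED = FakeDescriptor()  # one global instance, as with the real descriptors

class FakeMessage:
    """Hypothetical stand-in: only a populated field references the descriptor."""
    def __init__(self, text=""):
        self.fields = {SHARED: text} if text else {}

empty = deep_size(FakeMessage())
one = deep_size(FakeMessage("a"))
many = deep_size([FakeMessage("a") for _ in range(100)])

print(f"empty message: {empty} bytes")
print(f"one populated message: {one} bytes")
print(f"100 populated messages sized together: {many} bytes")
```

    The populated stand-in looks enormous because the shared object is reachable from it, yet sizing 100 of them together costs only slightly more than sizing one, since the shared part is counted once.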
    

    You can check how asizeof() breaks the object down, attribute by attribute, by using its internal functions:

    >>> msg = LiveAgentMessage(text = 'a')
    >>> {x.name: asizeof.asizeof(x.ref) for x in asizeof._typedef(msg).refs(msg, True)}
    
    {'__class__': 0, '_cached_byte_size': 24, '_cached_byte_size_dirty': 32,
     '_fields': 317656, '_unknown_fields': 40,
     '_unknown_field_set': 16, '_is_present_in_parent': 24,
     '_listener': 152, '_listener_for_children': 376, '_oneofs': 64}
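
    To cross-check that the descriptor data is shared rather than paid for by every message, one option is to measure the marginal cost of creating many instances with the standard-library tracemalloc module. This is only a sketch under assumptions: bytes_per_instance() and StandInMessage are hypothetical names, and StandInMessage should be swapped for LiveAgentMessage from foo_pb2. tracemalloc only sees allocations made through Python's allocators, so memory allocated by a C-extension protobuf backend may not show up.

```python
import tracemalloc

class StandInMessage:
    """Hypothetical stand-in; replace with LiveAgentMessage from foo_pb2."""
    def __init__(self, text=""):
        self.text = text

def bytes_per_instance(factory, count=10_000):
    """Average number of traced bytes newly allocated per object built by factory()."""
    tracemalloc.start()
    try:
        before, _peak = tracemalloc.get_traced_memory()
        objects = [factory() for _ in range(count)]  # keep them alive while measuring
        after, _peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return (after - before) / len(objects)

per_message = bytes_per_instance(lambda: StandInMessage(text="a"))
print(f"{per_message:.0f} bytes per instance")
```

    If the per-instance figure stays small as count grows, the large number reported by asizeof() for a single message is shared data rather than a per-message cost.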