I'm investigating a memory issue related to a protobuf message in Python.
Here is the simple protobuf message:
syntax="proto3";
message LiveAgentMessage {
string text = 1;
}
It gets compiled to a Python class with protoc. I ran the following script:
from pympler import asizeof
from foo_pb2 import LiveAgentMessage
print(f"{asizeof.asizeof(LiveAgentMessage(text=''))=:}")
print(f"{asizeof.asizeof(LiveAgentMessage(text='a'))=:}")
The output is
asizeof.asizeof(LiveAgentMessage(text=''))=816
asizeof.asizeof(LiveAgentMessage(text='a'))=895904
I wonder why LiveAgentMessage(text='a') (the second line) takes up so much memory with just a one-letter string: it is roughly 1000x the size reported for the first line.
I'm using pympler for the size calculation, but I'm not sure whether it is the right tool for Python proto classes.
This depends on the version of python-protobuf you use, as well as on the environment variable PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION. In any case, it is mostly an artefact of how asizeof() works.
You can pass stats=1 to asizeof() to see details:
>>> asizeof.asizeof(LiveAgentMessage(text='a'), stats=1)
316712 bytes or 309.3 KiB
8 byte aligned
8 byte sizeof(void*)
1 object given
2838 objects sized
19577 objects seen
271 objects ranked
0 objects missed
725 duplicates
50 deepest recursion
10 largest objects (of 271 over 1024 bytes or 1.0 KiB)
316712 bytes or 309.3 KiB: class foo_pb2.LiveAgentMessage: text: "a"\n, ix 0
316104 bytes or 308.7 KiB: class dict: {<google.protobuf.descriptor.FieldDescriptor object at 0x7915ce4f9d80>: 'a'} leng 0, ix 1 (at 1), pix 0
315816 bytes or 308.4 KiB: class google.protobuf.descriptor.FieldDescriptor: <google.protobuf.descriptor.FieldDescriptor object at 0x7915ce4f9d80>, ix 2 (at 2), pix 1
315768 bytes or 308.4 KiB: class dict: {'_features': field_presence: IMPLICIT.....MakeScalarDefault at 0x7915cddbd990>} leng 32!, ix 3 (at 3), pix 2
314560 bytes or 307.2 KiB: class google.protobuf.descriptor_pb2.FeatureSet: field_presence: IMPLICIT\nenum_type: O.... LENGTH_PREFIXED\njson_format: ALLOW\n, ix 4 (at 4), pix 3
314008 bytes or 306.6 KiB: class dict: {<google.protobuf.descriptor.FieldDesc....scriptor object at 0x7915ce22b940>: 1} leng 8!, ix 5 (at 5), pix 4
313648 bytes or 306.3 KiB: class google.protobuf.descriptor.FieldDescriptor: <google.protobuf.descriptor.FieldDescriptor object at 0x7915ce22b850>, ix 6 (at 6), pix 5
313600 bytes or 306.2 KiB: class dict: {'_features': <google.protobuf.descrip....ocals>.EncodeField at 0x7915ce2863b0>} leng 32!, ix 7 (at 7), pix 6
311968 bytes or 304.7 KiB: class google.protobuf.descriptor.FileDescriptor: <google.protobuf.descriptor.FileDescriptor object at 0x7915ce1e7640>, ix 8 (at 8), pix 7
311920 bytes or 304.6 KiB: class dict: {'_features': <google.protobuf.descrip....ncies': [], 'public_dependencies': []} leng 32!, ix 9 (at 9), pix 8
In this case, the message class contains a reference to the Protocol Buffers descriptor set. This reference is included in all messages that use the pure Python protobuf implementation. There is only one global instance, so the data is not duplicated even if you have multiple messages.
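The effect of a shared object on a single-object measurement can be sketched with the standard library alone. Everything here is a hypothetical stand-in: `Descriptor`, `Message` and `deep_sizeof` are illustrative names, not protobuf's or pympler's actual implementation. The sketch only assumes, as described above, that a message references one global descriptor once a field is set.

```python
import sys

class Descriptor:
    """Hypothetical stand-in for a large, shared schema object."""
    def __init__(self):
        self.payload = [str(i) * 10 for i in range(10_000)]

SHARED = Descriptor()  # one global instance, shared by every message

class Message:
    """Hypothetical message: references the descriptor only when a field is set."""
    def __init__(self, text=""):
        self.fields = {SHARED: text} if text else {}

def deep_sizeof(obj, seen=None):
    """Sum sys.getsizeof over everything reachable from obj, counting each object once."""
    seen = set() if seen is None else seen
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        for key, value in obj.items():
            size += deep_sizeof(key, seen) + deep_sizeof(value, seen)
    elif isinstance(obj, (list, tuple, set, frozenset)):
        for item in obj:
            size += deep_sizeof(item, seen)
    if hasattr(obj, "__dict__"):
        size += deep_sizeof(vars(obj), seen)
    return size

empty = deep_sizeof(Message())          # never reaches SHARED
one = deep_sizeof(Message("a"))         # reaches SHARED through the field map
many = deep_sizeof([Message("a") for _ in range(100)])  # SHARED counted once
print(empty, one, many)
```

In this sketch, a single non-empty message looks huge because the traversal walks into the shared object, while a list of 100 such messages is only slightly larger than one, since the shared object is counted once.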
The reason the size shown differs for the empty message is that asizeof seems to ignore msg.DESCRIPTOR and only finds the descriptor if it is referenced in msg._fields:
>>> LiveAgentMessage(text = 'a')._fields
{<google.protobuf.descriptor.FieldDescriptor object at 0x7f96163b26e0>: 'a'}
>>> LiveAgentMessage()._fields
{}
You can check how asizeof() interprets the type by using its internal methods:
>>> msg = LiveAgentMessage(text = 'a')
>>> dict((x.name, asizeof.asizeof(x.ref)) for x in asizeof._typedef(msg).refs(msg, True))
{'__class__': 0, '_cached_byte_size': 24, '_cached_byte_size_dirty': 32,
'_fields': 317656, '_unknown_fields': 40,
'_unknown_field_set': 16, '_is_present_in_parent': 24,
'_listener': 152, '_listener_for_children': 376, '_oneofs': 64}
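If the goal is the marginal cost per message rather than everything reachable from one message, one option is to allocate many instances and measure the growth with the standard-library tracemalloc module. This is a minimal sketch, not a definitive method: `bytes_per_instance` is a hypothetical helper and `Point` is a stand-in class used so the example runs on its own. For the message in the question you would pass something like `lambda: LiveAgentMessage(text='a')` as the factory. Note that tracemalloc only sees allocations made through Python's memory allocators, so memory allocated inside a C-extension backend may not be counted.

```python
import tracemalloc

def bytes_per_instance(factory, count=10_000):
    """Estimate marginal memory per object by allocating `count` of them."""
    tracemalloc.start()
    try:
        before, _ = tracemalloc.get_traced_memory()
        objs = [factory() for _ in range(count)]  # keep objects alive while measuring
        after, _ = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return (after - before) / count

class Point:
    """Stand-in for a message class; not part of protobuf."""
    def __init__(self):
        self.text = "a"

per_obj = bytes_per_instance(Point)
print(f"{per_obj:.1f} bytes per instance")
```

Because shared objects such as a global descriptor are allocated once, before the measurement, they do not inflate the per-instance figure.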