python, c++, protocol-buffers

Why is protobuf bad for large data structures?


I'm new to protobuf. I need to serialize a complex graph-like structure and share it between C++ and Python clients. I'm trying to apply protobuf because:

But the Protobuf user guide says:

Protocol Buffers are not designed to handle large messages. As a general rule of thumb, if you are dealing in messages larger than a megabyte each, it may be time to consider an alternate strategy.

https://developers.google.com/protocol-buffers/docs/techniques#large-data

I have graph-like structures that are sometimes up to 1 GB in size, way above 1 MB.

Why is protobuf bad for serializing large datasets? What should I use instead?


Solution

  • It is just general guidance, so it doesn't apply to every case. For example, the OpenStreetMap project uses a protobuf-based file format for its maps, and those files are often 10-100 GB in size. Another example is Google's own TensorFlow, which uses protobuf to store graphs that are often up to 1 GB in size.

    However, OpenStreetMap does not store the entire map as a single message. Instead, the file consists of thousands of individual messages, each encoding a part of the map. You can apply a similar approach, so that each message encodes, for example, only one node (see the first sketch after this answer).

    The main problem with protobuf for large files is that it doesn't support random access: you'll have to read the whole file, even if you only want to access a specific item. If your application will be reading the whole file into memory anyway, this is not an issue. This is what TensorFlow does, and it appears to store everything in a single message.

    If you need a random access format that is compatible across many languages, I would suggest HDF5 or SQLite (see the second sketch below).
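
Here is a minimal sketch of the chunked-message approach in Python. It assumes a hypothetical graph.proto defining a Node message (an id plus repeated neighbor ids) compiled into a graph_pb2 module; those names are illustrative, not from the original question. Each node is written as its own length-prefixed protobuf message, so the file can be read back as a stream rather than parsed as one giant message.

```python
import struct

import graph_pb2  # hypothetical module generated from graph.proto:
                  # message Node { uint64 id = 1; repeated uint64 neighbor_ids = 2; }


def write_nodes(path, nodes):
    """Write each Node as its own length-prefixed protobuf message."""
    with open(path, "wb") as f:
        for node in nodes:
            payload = node.SerializeToString()
            f.write(struct.pack("<I", len(payload)))  # 4-byte little-endian length prefix
            f.write(payload)


def read_nodes(path):
    """Stream Nodes back one at a time instead of loading one huge message."""
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if not header:
                break
            (length,) = struct.unpack("<I", header)
            node = graph_pb2.Node()
            node.ParseFromString(f.read(length))
            yield node
```

Protocol buffers themselves do not define a framing format for multiple messages in one file, so the 4-byte length prefix is just one possible convention; OpenStreetMap's PBF format, for instance, defines its own blob framing. The C++ client can read the same stream by decoding the length prefix and parsing each payload with the generated C++ classes.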
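
If you do need random access, one common pattern is to keep the per-node protobuf encoding but index the serialized blobs in SQLite. The sketch below reuses the same hypothetical graph_pb2.Node and Python's built-in sqlite3 module; it is an illustration of the idea under those assumptions, not a drop-in implementation.

```python
import sqlite3

import graph_pb2  # same hypothetical generated module as above


def store_nodes(db_path, nodes):
    """Store each node as a protobuf blob keyed by its id."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS nodes (id INTEGER PRIMARY KEY, blob BLOB)")
    con.executemany(
        "INSERT OR REPLACE INTO nodes VALUES (?, ?)",
        ((n.id, n.SerializeToString()) for n in nodes),
    )
    con.commit()
    con.close()


def load_node(db_path, node_id):
    """Fetch and parse a single node without reading the rest of the file."""
    con = sqlite3.connect(db_path)
    row = con.execute("SELECT blob FROM nodes WHERE id = ?", (node_id,)).fetchone()
    con.close()
    if row is None:
        return None
    node = graph_pb2.Node()
    node.ParseFromString(row[0])
    return node
```

Storing one blob per row keeps the cross-language property: the C++ client can open the same database with the SQLite C API and parse each blob with its own generated protobuf classes.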