Tags: json, serialization, bson, json-serialization, cbor

Binary JSON format that supports traversal


Does anyone know of a serialisation format that:

  1. Is binary and at least relatively compact,
  2. Can store JSON-style data (not Protobuf, Thrift, etc.),
  3. Supports traversal (i.e. you don't need to parse the entire document to read one part of it), and
  4. Supports large files (e.g. 30 GB)?

I have looked at the following:

BSON was so close, but the maximum file size kills it for me. Are there any formats that would work? Obviously I can write my own, but there are so many binary JSON formats that surely someone has made a decent one?

Edit: By "traversal" I mean the same thing that the BSON authors mean: you should be able to find a given object without having to parse the entire file. Amazon calls this "sparse" or "shallow" reading.


Solution

  • Found one! Amazon Ion. From the FAQ:

    Many reads are shallow or sparse, meaning that the application is focused on only a subset of the values in the stream, and that it can quickly determine if full materialization of a value is required.

    In the spirit of these principles, the Ion specification includes features that make Ion’s binary encoding more efficient to read than other schema-free formats. These features include length-prefixing of binary values and Ion’s use of symbol tables.

    Brief notes on Ion:

    It is not very popular. Libraries are available for only a few languages and I can't even find a command line tool that uses it. Still, it seems to be the only option if you want these features!
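To make the length-prefixing point from the FAQ concrete, here is a minimal sketch of why a length prefix allows sparse reads. This is NOT Ion's actual binary encoding: the record layout (2-byte key length, 8-byte payload length) and the helper names `write_record` and `find_payload` are made up for illustration. The idea is only that a reader can seek past any value it does not care about instead of parsing it.

```python
import io
import struct

# Hypothetical record layout (not Ion's): big-endian 2-byte key length,
# 8-byte payload length, then the key bytes, then the payload bytes.
HEADER = ">HQ"
HEADER_SIZE = struct.calcsize(HEADER)


def write_record(out, key: bytes, payload: bytes) -> None:
    """Append one length-prefixed record to a binary stream."""
    out.write(struct.pack(HEADER, len(key), len(payload)))
    out.write(key)
    out.write(payload)


def find_payload(inp, wanted: bytes):
    """Return the payload stored under `wanted`, or None if absent.

    Payloads of non-matching records are skipped with seek(), so they are
    never read or parsed.
    """
    while True:
        header = inp.read(HEADER_SIZE)
        if len(header) < HEADER_SIZE:
            return None  # end of stream
        key_len, payload_len = struct.unpack(HEADER, header)
        key = inp.read(key_len)
        if key == wanted:
            return inp.read(payload_len)
        inp.seek(payload_len, io.SEEK_CUR)  # jump over the unwanted value


# Usage: the large first payload is skipped, not parsed.
buf = io.BytesIO()
write_record(buf, b"big", b"x" * 1_000_000)
write_record(buf, b"small", b"hello")
buf.seek(0)
result = find_payload(buf, b"small")
```

With a real 30 GB file the same loop would run over a file opened in binary mode; the cost is one small header read per skipped top-level value rather than a full parse.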

    Edit:

    In the end we went with SQLite, which is pretty excellent. It doesn't really follow the JSON data model, but it does let you do sparse reads very easily, and it is very fast. Another possibility is DuckDB, which is kind of a modern take on SQLite but less widely supported.
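For anyone wondering what a sparse read looks like with SQLite, here is a rough sketch using Python's built-in `sqlite3` module. The schema is an assumption for illustration, not the one we actually used: the `docs` table, its `key` and `body` columns, and the helpers `create_store`, `put` and `get` are all hypothetical. Each JSON value is stored as text under a primary key, so looking up one key goes through the index and does not touch the other rows.

```python
import json
import sqlite3


def create_store(conn: sqlite3.Connection) -> None:
    """Create a hypothetical key -> JSON text table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS docs ("
        "key TEXT PRIMARY KEY, body TEXT NOT NULL)"
    )


def put(conn: sqlite3.Connection, key: str, value) -> None:
    """Serialise `value` as JSON and store it under `key`."""
    conn.execute(
        "INSERT OR REPLACE INTO docs (key, body) VALUES (?, ?)",
        (key, json.dumps(value)),
    )


def get(conn: sqlite3.Connection, key: str):
    """Fetch and decode a single value; other rows are not read."""
    row = conn.execute(
        "SELECT body FROM docs WHERE key = ?", (key,)
    ).fetchone()
    return None if row is None else json.loads(row[0])


# Usage with an in-memory database; pass a file path for a real store.
conn = sqlite3.connect(":memory:")
create_store(conn)
put(conn, "user/1", {"name": "Ada", "tags": ["a", "b"]})
put(conn, "user/2", {"name": "Bob", "tags": []})
conn.commit()
user = get(conn, "user/2")
```

The trade-off is the one mentioned above: the nesting lives inside each JSON blob rather than in the table structure, so how finely you can read sparsely depends on how you split the data into rows.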