Tags: json, filter, unique, stream-compaction

JSON log file compaction


Given a file with line-delimited JSON records:

{"id": 1, "position": 1234}
{"id": 2, "position": 23}
{"id": 3, "position": 43}
{"id": 1, "position": 223}

I would like to compact such a file by keeping only the last record for each id. For the example above, the output would be:

{"id": 2, "position": 23}
{"id": 3, "position": 43}
{"id": 1, "position": 223}

tl;dr: Is there a uniq that works with line-delimited JSON (and is fast)?


The input files might contain 1 billion records, of which maybe 10-20% can be thrown away.

I tried various approaches:

  1. Seen ids

    Keep a set of "seen" ids in memory. This runs out of memory.

  2. Sort and unique

    Sort the file by "id" first (with a stable sort, so the original order is preserved within each id). Then run through the sorted file and keep only the last record for each id. This is the usual Unix sort | uniq approach (a rough sketch follows this list). Sorting a file of this size is expensive and may simply be too much work.

  3. Extract offset and length information

    Extract the id plus the byte offset and length of each record from the file, e.g.

    id  offset  length
    1   0       27
    2   28      25
    3   54      25
    1   80      26
    

    and figure out which records belong to the compacted set, then seek to and read just those records from the file. Extracting this information is reasonably fast, but the millions of seeks and reads needed to pull out the records slow this approach down.
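
A rough sketch of the sort-based variant in approach 2, assuming bash, GNU coreutils and jq, and assuming ids contain no tab characters; file.ldj and compacted.ldj are placeholder names. Each record is tagged with its id and line number, sorted by id, reduced to the last record per id, and finally put back into the original file order:

    # file.ldj and compacted.ldj are placeholder file names
    # 1. tag every record with "<id> TAB <line number> TAB <record>"
    # 2. sort by id, then by line number, so the last record per id comes last
    # 3. keep only the last line of each id group
    # 4. restore the original file order and drop the helper columns
    paste <(jq -r '.id' file.ldj) <(awk -v OFS='\t' '{ print NR, $0 }' file.ldj) |
        sort -t $'\t' -k1,1 -k2,2n |
        awk -F'\t' 'prev != "" && $1 != prev { print saved }
                    { prev = $1; saved = $0 }
                    END { if (saved != "") print saved }' |
        sort -t $'\t' -k2,2n |
        cut -f3- > compacted.ldj

The first sort runs over the full annotated file, which is exactly the expensive part noted in approach 2.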


What could be a better, faster (or fastest) approach? Are there existing tools that solve this kind of problem?


Solution

  • This problem can be solved by a three-step process:

    1. Extract the interesting value (here the id) plus the line number of each record with a tool like jq or ldjtab.
    2. Use tac and sort -u to keep only the relevant lines (the last occurrence of each id).
    3. Filter the original file and keep only the specified lines (a tool like filterline, or one of a few other approaches, will filter a file and keep only the given line numbers).

    The overall process is quite efficient: steps 1 and 2 are parallelizable, and step 3 can be made fast. A sketch of the full pipeline is shown below.
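
    A minimal sketch of the three steps, assuming bash, GNU coreutils and jq, and assuming ids contain no tab or newline characters; file.ldj, ids.tsv, keep.txt and compacted.ldj are placeholder names, and the last step uses an awk one-liner that does the same job as filterline (keep only the listed line numbers):

        # file.ldj, ids.tsv, keep.txt and compacted.ldj are placeholder file names

        # 1. Extract the id and the line number of every record ("<id> TAB <lineno>").
        jq -r '.id' file.ldj | awk -v OFS='\t' '{ print $0, NR }' > ids.tsv

        # 2. Reverse the list so the last occurrence of each id comes first, keep
        #    one line per id (sort -u on the id field), then collect the line
        #    numbers to keep, in ascending order.
        tac ids.tsv | sort -t $'\t' -k1,1 -s -u | cut -f2 | sort -n > keep.txt

        # 3. Filter the original file, keeping only the listed line numbers.
        awk 'FNR == NR { keep[$1]; next } FNR in keep' keep.txt file.ldj > compacted.ldj

    On the four-line example above this keeps lines 2, 3 and 4, which is exactly the expected output. The sort in step 2 only sees the small id/line-number pairs, not the full records, and step 3 is a single sequential pass over both files (the awk version holds the kept line numbers in memory).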