Given a file with line delimited JSON records:
{"id": 1, "position": 1234}
{"id": 2, "position": 23}
{"id": 3, "position": 43}
{"id": 1, "position": 223}
I would like to compact such a file by keeping only the last record for each id, e.g. for the above example I would like to have as output:
{"id": 2, "position": 23}
{"id": 3, "position": 43}
{"id": 1, "position": 223}
tl;dr: Is there a uniq that works with line-delimited JSON (and is fast)?
The input files might contain around 1 billion records, of which maybe 10-20% can be thrown away.
I tried various approaches:
Seen ids
Keep a set of "seen" ids in memory. This runs out of memory.
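A minimal sketch of this idea with standard tools (assuming bash, GNU tac/awk, and a made-up file name records.ldj; the regex only handles flat records like the ones above): reversing the file makes the last record per id the first one seen, and the awk seen array is exactly the in-memory set that grows too large here.

    # Reverse the file, keep the first occurrence of each id (= last record
    # in the original order), then reverse again to restore the order.
    # The seen[] array holds one entry per distinct id, which is what blows
    # up the memory for ~1 billion mostly-distinct records.
    tac records.ldj \
      | awk 'match($0, /"id":[[:space:]]*[0-9]+/) && !seen[substr($0, RSTART, RLENGTH)]++' \
      | tac > compacted.ldj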
Sort and unique
Sort the file by "id" first (with a stable sort, so order is preserved). Then run through the sorted file and keep only the last record for each id. This is like the usual unix sort | uniq approach. Sorting is expensive here and maybe just too much work.
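One way to spell this variant out (a sketch, assuming bash, GNU coreutils and jq, and the same made-up file name): tag every line with its id and line number, sort so that the last record per id comes first within its group, drop the duplicates, then restore the original order and strip the tags.

    # "id <TAB> lineno <TAB> record" -> sort by id, line number descending ->
    # keep the first line per id (= last record in the file) -> restore
    # file order -> strip the two tag columns again.
    paste <(jq -c '.id' records.ldj) <(nl -ba -w1 records.ldj) |
      sort -t $'\t' -k1,1n -k2,2nr |
      awk -F '\t' '!seen[$1]++' |
      sort -t $'\t' -k2,2n |
      cut -f3- > compacted.ldj

This still sorts the full data set once, which is exactly the expensive part.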
Extract offset and length information
Extract the id of each record together with its offset and length from the file, e.g.
id offset length
1 0 27
2 27 25
3 52 25
1 77 26
and figure out which records belong to the compacted set. Then seek to and read each of them from the file. Extracting this information is reasonably fast, but the millions of seeks and reads needed to extract the records slow this approach down.
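For reference, a sketch of how such a table could be built in a single pass (assuming GNU awk, a C locale so that offsets count bytes, and again a regex that only fits flat records like the sample):

    # Walk the file once, printing "id <TAB> byte offset <TAB> length"
    # for every record; length($0) + 1 accounts for the trailing newline.
    LC_ALL=C awk '{
        if (match($0, /"id":[[:space:]]*[0-9]+/)) {
            id = substr($0, RSTART, RLENGTH)
            gsub(/[^0-9]/, "", id)
            printf "%s\t%d\t%d\n", id, offset, length($0) + 1
        }
        offset += length($0) + 1
    }' records.ldj > table.tsv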
What could be a better, faster (or fastest) approach? Are there existing tools that solve this kind of problem?
This problem can be solved by a three-step process:

1. Extract the id and the line number of every record.
2. Run that listing through tac and sort -u on the id, to keep only the relevant lines (for each id, the line number of its last occurrence).
3. Run through the original file once more and print only those lines.

The overall process is quite efficient. Steps 1 and 2 are parallelizable. Step 3 can be made fast.
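A sketch of the whole pipeline under those assumptions (bash, GNU coreutils plus jq; records.ldj, keys.tsv and keep.lst are made-up names):

    # 1. Extract the id of every record, tagged with its line number
    #    ("lineno <TAB> id"). This extraction pass can be parallelized by
    #    splitting the input.
    jq -c '.id' records.ldj | nl -ba -w1 > keys.tsv

    # 2. tac + sort -u on the id column: after reversing, the first line
    #    kept per id corresponds to the last occurrence in the original
    #    file (-s makes the "keep the first of equal keys" behaviour of
    #    GNU sort explicit). Then bring the surviving line numbers back
    #    into file order. sort can be given more memory and cores with
    #    -S and --parallel.
    tac keys.tsv | sort -u -s -t $'\t' -k2,2 | cut -f1 | sort -n > keep.lst

    # 3. A single sequential pass over the original file, printing only
    #    the kept lines -- no seeking required.
    awk 'NR == FNR { keep[$1]; next } FNR in keep' keep.lst records.ldj > compacted.ldj

On the four sample records above this keeps lines 2, 3 and 4, i.e. exactly the desired output.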