bash, parallel-processing, jq, ndjson

How do I run jq in parallel on multiple CPUs?


My script processes ~30 lines per second and uses just one CPU core.

while read -r line; do echo "$line" | jq -c '{some-transformation-logic}'; done < input.json >> output.json

input.json is a ~6GB file with 17M lines. It is newline-delimited JSON, not an array.

I have 16 cores (vCPUs on GCP), or more if that makes sense, and want to run this process in parallel. I know Hadoop is the way to go, but this is a one-time job. Is there a simple way to speed the process up to ~600 lines per second?

Line order is not important.


Solution

  • Given that the output order does not have to match the input order, try GNU parallel (here '(transformation)' stands for the complete jq command):

    parallel -j16 --spreadstdin '(transformation)' < input.json > output.json
    

    GNU parallel can also derive the number of jobs from the number of available cores, so the script adapts to the actual machine: -j100% runs one job per core, and is the default when -j is omitted. (-j0 means "as many jobs as possible", not "one per core".) Check the man page for options and syntax.

    parallel -j100% --spreadstdin '(transformation)' < input.json > output.json
    

    This solution also "batches" multiple input lines into each jq invocation, which avoids the overhead of starting jq once per line as the original post does (see the comment thread with @stkvtflw).

    The --keep-order option makes the output follow the input order, at the cost of some extra processing time. Per the OP, this is not needed here.
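
The batching idea above can be sketched as a few bash functions. This is a minimal sketch, not the answerer's exact command: it assumes GNU parallel (not the moreutils tool of the same name) and jq are installed, FILTER is a hypothetical stand-in for the real transformation, the function names are made up for illustration, and --block 10M is an assumed tuning value.

```shell
#!/usr/bin/env bash
# Hypothetical filter standing in for the real transformation logic.
# It must not contain single quotes, because it is embedded in '...' below.
FILTER='{id: .id, double: (.value * 2)}'

# The approach from the question, for comparison: one jq process per line.
transform_per_line() {
  while IFS= read -r line; do
    printf '%s\n' "$line" | jq -c "$FILTER"
  done
}

# A single jq process reads the whole NDJSON stream; no per-line startup cost.
transform_stream() {
  jq -c "$FILTER"
}

# GNU parallel --pipe (alias: --spreadstdin) cuts stdin into blocks on newline
# boundaries and feeds each block to its own jq process. With -j omitted it
# runs one job per CPU core. Output order across blocks is not guaranteed.
transform_parallel() {
  parallel --pipe --block 10M "jq -c '$FILTER'"
}

# Same as above, but --keep-order emits block outputs in input order.
transform_parallel_ordered() {
  parallel --pipe --block 10M --keep-order "jq -c '$FILTER'"
}
```

Usage would look like `transform_parallel < input.json > output.json`. Even `transform_stream` alone should be much faster than the per-line loop, since jq is started once instead of 17M times; the parallel variants then spread that work across cores.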