json, jq

Process large JSON stream with jq


I get a very large JSON stream (several GB) from curl and try to process it with jq.

The relevant data I want to parse with jq is wrapped in a document that describes the result structure:

{
  "results":[
    {
      "columns": ["n"],

      // get this
      "data": [    
        {"row": [{"key1": "row1", "key2": "row1"}], "meta": [{"key": "value"}]},
        {"row": [{"key1": "row2", "key2": "row2"}], "meta": [{"key": "value"}]}
      //  ... millions of rows      

      ]
    }
  ],
  "errors": []
}

I want to extract the row data with jq. This is simple:

curl XYZ | jq -r -c '.results[0].data[].row[]'

Result:

{"key1": "row1", "key2": "row1"}
{"key1": "row2", "key2": "row2"}

However, this always waits until curl has completed.

I played with the --stream option, which is meant for dealing with this, and tried the following command, but it also waits until the full object is returned from curl:

curl XYZ | jq -n --stream 'fromstream(1|truncate_stream(inputs)) | .[].data[].row[]'

Is there a way to 'jump' to the data field and start parsing the rows one by one, without waiting for the closing brackets of the surrounding document?


Solution

  • (1) The vanilla filter to use here would be as follows:

    jq -r -c '.results[0].data[].row'
    
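    For the illustrative input, this should print one array per data element:

    [{"key1":"row1","key2":"row1"}]
    [{"key1":"row2","key2":"row2"}]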

    (2) One way to use the streaming parser here would be to use it to process the output of .results[0].data, but the combination of the two steps will probably be slower than the vanilla approach.

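    One possible shape of that two-step pipeline, assuming the document layout shown above (i.e. the wanted entries live at .results[0].data[]), would be:

    curl XYZ | jq -cn --stream 'fromstream(4|truncate_stream(inputs))' | jq -c '.row[]'

    Here the depth of 4 corresponds to the path prefix ["results", 0, "data", <index>], so the first jq should emit each data element as soon as curl has delivered it, and the second jq picks out the row objects.
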
    (3) To produce the output you want, you could run:

    jq -nc --stream '
      fromstream(inputs
        | select( [.[0][0,2,4]] == ["results", "data", "row"])
        | del(.[0][0:5]) )'
    
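    For the illustrative input, this should emit the same two arrays as the vanilla filter in (1), but one by one as the corresponding rows arrive from curl, rather than after the whole document has been read.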

    (4) Alternatively, you may wish to try something along these lines:

    jq -nc --stream 'inputs
          | select(length==2)
          | select( [.[0][0,2,4]] == ["results", "data", "row"])
          | [ .[0][6], .[1]] '
    

    For the illustrative input, the output from the last invocation would be:

    ["key1","row1"] ["key2","row1"] ["key1","row2"] ["key2","row2"]