bash sorting jq

Sorting user-defined array with strings gives wrong order, even when file content is fully available on disk


I am querying ElasticSearch and sorting the documents locally in Bash with jq, as sorting in ES is too slow for me.

The original purpose is to create a CSV file.

But the sorting does not work properly: the sort step seems to do nothing.

Since I am launching cURL requests, I thought the wrong order was caused by the response being chunked, so I saved some results into a local test.json file and tried again, but it still does not work.

test.json:

{
    "took": 680,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "max_score": 1.0,
        "hits": [
            {
                "_index": "my-index",
                "_type": "_doc",
                "_id": "111111113584925",
                "_score": 1.0,
                "fields": {
                    "field2": [
                        "FOO"
                    ],
                    "field1": [
                        "111111113584925"
                    ]
                }
            },
            {
                "_index": "my-index",
                "_type": "_doc",
                "_id": "111111121254059",
                "_score": 1.0,
                "fields": {
                    "field2": [
                        "FOO"
                    ],
                    "field1": [
                        "111111121254059"
                    ]
                }
            }
        ]
    }
}

(There are many more records - edited for brevity.)

Command that I use:

jq '.hits.hits[].fields | [.field1[0] + "," + .field2[0]] | sort | .[0]' -r test.json

The result:

111111113584925,FOO
111111121254059,FOO
111111116879444,FOO

etc.

Why?

Should I rely on jq sorting? Am I using it correctly? I want to compare strings in alphabetical order. The values of field1 are all unique, so there will never be a tie that falls through to comparing field2 (which can also have various values, but I only want to sort by field1).

Should I use Bash sort -k 1 instead? Which is faster when it comes to 100K rows?


Solution

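  • Your filter iterates over the hits with .hits.hits[], so sort runs once per hit on a one-element array and has nothing to reorder. Collect all the rows into a single array first and sort that. You're looking for something like this: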

    .hits.hits | map(.fields | .field1[0] + "," + .field2[0]) | sort[]
    

    Online demo
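
    To see the difference side by side, here is a minimal sketch that runs the filter from the question and a corrected one against the same input. The inline sample is hypothetical: it mirrors the structure of test.json with three records placed out of order on purpose. The corrected filter uses sort_by on field1 as one way to sort by that field alone, assuming every hit has a non-empty field1 array.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical sample mirroring the structure of test.json, deliberately out of order.
sample='{"hits":{"hits":[
  {"fields":{"field1":["111111121254059"],"field2":["FOO"]}},
  {"fields":{"field1":["111111113584925"],"field2":["FOO"]}},
  {"fields":{"field1":["111111116879444"],"field2":["FOO"]}}
]}}'

# Filter from the question: `.hits.hits[]` emits one hit at a time, so `sort`
# only ever sees a one-element array and the input order is preserved.
unsorted=$(jq -r '.hits.hits[].fields | [.field1[0] + "," + .field2[0]] | sort | .[0]' <<<"$sample")

# Corrected filter: keep the hits in one array, sort it by field1 only,
# then emit one CSV row per hit.
sorted=$(jq -r '.hits.hits | sort_by(.fields.field1[0]) | .[].fields | .field1[0] + "," + .field2[0]' <<<"$sample")

printf 'filter from the question:\n%s\n\ncorrected filter:\n%s\n' "$unsorted" "$sorted"
```

    Piping unsorted rows through sort -t, -k1,1 should give the same order for values like these. Which approach is faster at 100K rows is best settled by timing both on your own data.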