Tags: json, select, jq, blacklist

Discard JSON objects if they contain substrings from a list


I want to parse a JSON file and extract some values, while skipping any entries whose name contains a substring from another list passed in as an argument. The purpose is to exclude objects containing miscellaneous human-readable keywords from a master list.

input.json

{
  "entities": [
    {
      "id": 600,
      "name": "foo-001"
    },
    {
      "id": 601,
      "name": "foo-002"
    },
    {
      "id": 602,
      "name": "foobar-001"
    }
  ]
}

args.json (list of keywords)

"foobar-"
"BANANA"

The output must definitely contain the foo-* entries (but not the excluded foobar- ones), and it may contain any other names as long as they don't contain foobar- or BANANA. The exclusion must be substring-based, not an exact match.
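
For the sample input above, the expected output would therefore be something like:

[
  {
    "id": 600,
    "name": "foo-001"
  },
  {
    "id": 601,
    "name": "foo-002"
  }
]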

I'm looking for a more performant way of doing this, because currently I just do my normal filters:

    jq '[.entities[] | select(.name != "")] | walk(if type == "string" then gsub("\t"; "") else . end)' input.json > file

(the input file has some erroneous tab escapes and null fields in it, which this pass preprocesses away)

At this stage, the file has only been minimally prepared. Then I iterate through this file line by line in shell and invoke grep -vf with a long list of invalid patterns from the keywords file. This gives a "master list" that is sanitized for later parsing by other applications. This seems intuitively wrong, though.
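
Sketched out, the second pass amounts to something like this (keywords.txt is a placeholder name for the keywords file):

# drop every line that matches any pattern in the keywords file
grep -vf keywords.txt file > master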

It seems like this should be done in one fell swoop on the first pass with jq instead of brute forcing it in a loop later.

I tried various invocations of INDEX and --slurpfile, but I seem to be missing something:

jq '.entities | INDEX(.name)[inputs]' input.json args.json

The above is a simplistic way of indexing the input args; it at least demonstrates that the patterns in the file can be matched verbatim, but it doesn't account for substrings (contains).
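
For reference, INDEX(.name) folds the array into an object keyed on the exact name string, so any lookup through it can only ever be a verbatim match:

jq -c '.entities | INDEX(.name)' input.json
{"foo-001":{"id":600,"name":"foo-001"},"foo-002":{"id":601,"name":"foo-002"},"foobar-001":{"id":602,"name":"foobar-001"}}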

jq '.[] | walk(if type == "object" and (.name | contains($args[]))then empty else . end)' --slurpfile args args.json input.json

This looks to be getting closer to the idea, but something is screwy here. It seems to regurgitate the entire input file once per argument in the keywords file, returning N copies for N arguments; instead of actually emptying the matching objects from the original input, it just checks the whole file for the presence of a single keyword and then starts over.

It seems like I need to unwrap $args[] and map over it somehow so that the input file is only iterated through once, with each keyword checked against each record, rather than the entire file being re-scanned over and over again.
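
The cross-product behaviour is simply how jq generators work: every value produced by $args[] re-runs everything downstream of it. A tiny standalone example (illustrative values only) makes this visible:

jq -n '["a","b"] as $args | "input" | . + $args[]'
"inputa"
"inputb"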

I found some conflicting information about whether --slurpfile is strictly necessary here and can't determine what the optimal approach is.

Thanks.


Solution

  • You could use all/2 as follows:

    < input.json jq --slurpfile blacklist args.json '
      .entities
      | map(select(.name as $n
            | all($blacklist[]; . as $b | $n | index($b) | not)))
    '
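
    Against the sample input.json and args.json above, this yields just the two foo-* objects (shown here with -c for brevity):

    [{"id":600,"name":"foo-001"},{"id":601,"name":"foo-002"}]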
    

    or more concisely (but perhaps less obviously correct):

    .entities | map(select(all(.name; index($blacklist[]) | not)))
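
    If the keyword list ever holds regular expressions rather than fixed strings (an assumption; the sample keywords are plain substrings), the same shape works with test in place of index:

    .entities | map(select(.name as $n | all($blacklist[]; . as $b | $n | test($b) | not)))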
    

    You might wish to write .entities |= map( ... ) instead if you want to retain the original structure.
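
    For example, a complete invocation that keeps the wrapping object intact could look like this:

    < input.json jq --slurpfile blacklist args.json '
      .entities |= map(select(.name as $n
            | all($blacklist[]; . as $b | $n | index($b) | not)))
    '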