Tags: bash, grep, comm

How to use "grep -f file" if "file" has null-delimited items?


I need to find null-delimited items from numerous files (data2, data3, ...) that are present in data1. Exact match is required.

Everything works with grep -f data1 data2 data3 ... until the items in data1 are null-delimited as well.

  1. Using only newlines - ok:

    $ cat data1
    1234
    abcd
    efgh
    5678
    $ cat data2
    1111
    oooo
    abcd
    5678
    $ grep -xFf data1 data2
    abcd
    5678
    
  2. data2 contains null-delimited items - ok when -z used:

    $ printf '1111\0oooo\0abcd\0005678' > data2
    $ grep -zxFf data1 data2 | xargs -0 printf '%s\n'
    abcd
    5678
    
  3. Now both data1 and data2 contain null-delimited items - this fails. It seems that the -z option does not apply to the file specified with -f:

    $ printf '1234\0abcd\0efgh\0005678' > data1
    $ grep -zxFf data1 data2 | xargs -0 printf '%s\n'
    
    $
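
Case 2 shows that -z works as long as the pattern file is newline-delimited. A minimal sketch of one way to exploit that, assuming no item contains a newline and that GNU grep is in use: keep data1 null-delimited on disk and convert it on the fly, passing the patterns on standard input with -f -. The sample values are the ones used above.

```shell
# Recreate the null-delimited inputs from case 3.
printf '%s\0' 1234 abcd efgh 5678 > data1
printf '%s\0' 1111 oooo abcd 5678 > data2

# grep reads its patterns from stdin (-f -) as newline-delimited lines,
# while -z still applies to the file being searched.
# The final tr only makes the null-delimited output readable on a terminal.
tr '\0' '\n' < data1 | grep -zxFf - data2 | tr '\0' '\n'
```

This is not safe if items may contain newlines, because each newline would split one item into two patterns.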
    

The problem is that I do need both files to have null-delimited items. An obvious workaround would be (for example) a good old while loop:

    while IFS= read -rd '' line || [[ $line ]]; do
        # "--" keeps an item that starts with "-" from being parsed as an option
        if grep -zqxF -- "$line" data2; then
            printf '%s\n' "$line"
        fi
    done < data1

But since I have many files with lots of items, this will be painfully slow! Is there a better approach (I do not insist on using grep)?


Solution

  • Since order retention isn't important, you're trying to match exact strings, and you have GNU tools available, instead of using fgrep I'd suggest comm -z.

    $ printf '%s\0' 1111 oooo abcd 5678 >data2
    $ printf '%s\0' 1234 abcd efgh 5678 >data1
    $ comm -z12 <(sort -uz <data1) <(sort -uz <data2) | xargs -0 printf '%s\n'
    5678
    abcd
    

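    If you generate your files sorted in the first place (and thus can leave out the sort operations), this will also have very good memory and performance characteristics: comm makes a single sequential pass over both inputs and does not need to hold either file in memory. The files must be sorted in the same collation order that comm uses.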
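
Since the question mentions numerous files, here is a minimal sketch of how the comm approach might be extended, assuming GNU sort and GNU comm (both need -z). The function name common_items and the file data3 are made up for illustration. The reference file is sorted once into a temporary file, and each of the other files is sorted and piped to comm on standard input. LC_ALL=C is set on both tools so that they agree on the collation order.

```shell
#!/bin/sh
# Print the null-delimited items of a reference file that also occur in
# each of the other files. common_items is a hypothetical helper name.
common_items() {
    ref=$1; shift
    sorted_ref=$(mktemp) || return 1
    # Sort and de-duplicate the reference file only once.
    LC_ALL=C sort -zu "$ref" > "$sorted_ref"
    for f in "$@"; do
        # -1 and -2 suppress items unique to either input,
        # leaving only the items common to both.
        LC_ALL=C sort -zu "$f" | LC_ALL=C comm -z12 "$sorted_ref" -
    done
    rm -f "$sorted_ref"
}

# Demo data: data1 and data2 use the question's values, data3 is invented.
printf '%s\0' 1234 abcd efgh 5678 > data1
printf '%s\0' 1111 oooo abcd 5678 > data2
printf '%s\0' efgh zzzz > data3

# tr is only here to make the null-delimited output readable.
common_items data1 data2 data3 | tr '\0' '\n'
```

An item found in several files is printed once per file. If the combined result should be unique, it can be piped through sort -zu once more.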