[SOLVED] How to use "grep -f file" if "file" has null-delimited items?

I need to find null-delimited items from numerous files (data2, data3, ...) that are present in data1. Exact match is required.

Everything works well with grep -f data1 data2 data3 ... as long as the items in data1 are newline-delimited. It breaks once the items in data1 are null-delimited too.

Using only newlines - ok:

```
$ cat data1
1234
abcd
efgh
5678
$ cat data2
1111
oooo
abcd
5678
$ grep -xFf data1 data2
abcd
5678
```

data2 contains null-delimited items - ok when -z used:

```
$ printf '1111\0oooo\0abcd\0005678' > data2
$ grep -zxFf data1 data2 | xargs -0 printf '%s\n'
abcd
5678
```

Now both data1 and data2 contain null-delimited items, and it fails. It seems that the -z option does not apply to the file specified with -f:
```
$ printf '1234\0abcd\0efgh\0005678' > data1
$ grep -zxFf data1 data2 | xargs -0 printf '%s\n'

$
```

The problem is that I do need both files to have null-delimited items. An obvious workaround would be (for example) a good old while loop:

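```
# Read NUL-delimited items from data1; "|| [[ $line ]]" keeps a last item without a trailing NUL.
while IFS= read -rd '' line || [[ $line ]]; do
    # -e stops grep from treating an item that starts with "-" as an option.
    if grep -zqxF -e "$line" data2; then
        printf '%s\n' "$line"
    fi
done < data1
```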

But since I have many files with lots of items, this will be painfully slow! Is there a better approach (I do not insist on using grep)?

Solution

Since order retention isn't important, you're matching exact strings, and you have GNU tools available, I'd suggest comm -z instead of grep -F:

```
$ printf '%s\0' 1111 oooo abcd 5678 >data2
$ printf '%s\0' 1234 abcd efgh 5678 >data1
$ comm -z12 <(sort -uz <data1) <(sort -uz <data2) | xargs -0 printf '%s\n'
5678
abcd
```
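
Since the question mentions many files (data2, data3, ...), here is a minimal sketch of how the same comm -z idea could be applied to several files while sorting the reference file only once. The function name common_items and the temporary file are illustrative assumptions, not part of the original answer. It assumes GNU sort and comm, and forces LC_ALL=C so that both tools agree on a byte-wise order.

```shell
# Sketch: print the items of a NUL-delimited reference file that also
# occur in each of the given NUL-delimited data files.
# common_items is a hypothetical helper name; GNU sort and comm are assumed.
common_items() {
    local ref=$1 sorted_ref f
    shift
    sorted_ref=$(mktemp) || return 1
    # Sort and de-duplicate the reference file once, not once per data file.
    LC_ALL=C sort -zu -- "$ref" > "$sorted_ref"
    for f in "$@"; do
        # -12 drops items unique to either input, leaving the intersection.
        # "-" makes comm read the sorted data file from standard input.
        LC_ALL=C sort -zu -- "$f" | LC_ALL=C comm -z12 -- "$sorted_ref" -
    done
    rm -f -- "$sorted_ref"
}
```

It could be called as common_items data1 data2 data3 | xargs -0 printf '%s\n'. The output stays null-delimited until xargs converts it for display.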

If you generate your files sorted in the first place (and thus can leave out the sort operations), this will also have very good memory and performance characteristics.
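
If the order of items in the data files does matter, a possible alternative is to stay with grep and convert only the pattern list to newlines, since grep reads the -f file line by line regardless of -z. This is a sketch under two assumptions: GNU grep is available (it accepts "-f -" to read patterns from standard input), and no item contains a newline character. The function name match_in_order is hypothetical.

```shell
# Sketch: match NUL-delimited items from a reference file against
# NUL-delimited data files, keeping the order found in the data files.
# Assumes GNU grep and that no item contains a newline.
match_in_order() {
    local ref=$1
    shift
    # tr turns the NUL-delimited reference items into one pattern per line.
    # -z: NUL-delimited input records, -x: whole-record match,
    # -F: fixed strings, -h: no file name prefix when several files are given.
    tr '\0' '\n' < "$ref" | grep -zxFh -f - -- "$@"
}
```

For example, match_in_order data1 data2 data3 | xargs -0 printf '%s\n' prints the matching items in the order they appear in each data file. Unlike the comm approach, duplicates in the data files are reported each time they occur.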