I need to find null-delimited items from numerous files (data2, data3, ...) that are present in data1. An exact match is required.
Everything works well with grep -f data1 data2 data3 ... until the items in data1 are also null-delimited.

Using only newlines - ok:
$ cat data1
1234
abcd
efgh
5678
$ cat data2
1111
oooo
abcd
5678
$ grep -xFf data1 data2
abcd
5678
When data2 contains null-delimited items - still ok when -z is used:
$ printf '1111\0oooo\0abcd\0005678' > data2
$ grep -zxFf data1 data2 | xargs -0 printf '%s\n'
abcd
5678
Now both data1 and data2 contain null-delimited items - fail. It seems that the -z option does not apply to the file specified with -f:
$ printf '1234\0abcd\0efgh\0005678' > data1
$ grep -zxFf data1 data2 | xargs -0 printf '%s\n'
$
The problem is that I do need both files to have null-delimited items. An obvious work-around could be (for example) a good old while loop:
while IFS= read -rd '' line || [[ $line ]]; do
if grep -zqxF "$line" data2; then
printf '%s\n' "$line"
fi
done < data1
But since I have many files with lots of items, this will be painfully slow! Is there a better approach (I do not insist on using grep)?
Since order retention isn't important, you're trying to match exact strings, and you have GNU tools available, instead of using fgrep I'd suggest comm -z.
$ printf '%s\0' 1111 oooo abcd 005678 >data2
$ printf '%s\0' 1234 abcd efgh 005678 >data
$ comm -z12 <(sort -uz <data) <(sort -uz <data2) | xargs -0 printf '%s\n'
005678
abcd
If you generate your files sorted in the first place (and thus can leave out the sort operations), this will also have very good memory and performance characteristics.
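comm compares exactly two inputs, but the many data files from the question can all be merged and deduplicated in a single sort step, so you still only run comm once. A minimal sketch (assuming bash for process substitution and GNU sort/comm; the file names and items mirror the question's examples):

```shell
# Build sample NUL-delimited files (names and items from the question).
printf '%s\0' 1234 abcd efgh 5678 > data1
printf '%s\0' 1111 oooo abcd 5678 > data2
printf '%s\0' 2222 efgh           > data3

# Pin the locale so both sides sort identically for comm.
export LC_ALL=C

# sort -uz merges and deduplicates all data files in one pass;
# comm -z12 keeps only the NUL-terminated records common to both sides.
comm -z12 <(sort -uz data1) <(sort -uz data2 data3) |
  xargs -0 printf '%s\n'
# -> 5678
#    abcd
#    efgh
```

Because sort -uz accepts any number of input files, the cost stays at one sort per side regardless of how many data files you have.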