I have two CSV files I want to compare. The first looks like:
```
aaaaaaaa-fb34-4e3e-aed5-eec78d02d59b,1234
bbbbbbbb-f76a-4a05-bb53-63aba9d03fe2,5678
cccccccc-e351-4d8e-b44a-080f6ccdef7d,9012
```
The second looks like (only one field):
```
bbbbbbbb-f76a-4a05-bb53-63aba9d03fe2
cccccccc-e351-4d8e-b44a-080f6ccdef7d
```
I want to compare these two sets (with a shell script and commands like `comm`, `awk`, or others) based on the first field (the UUID in the example), showing the full rows in the first file whose UUIDs do not appear in the second file. So, for the sample data above, I want to write `myscript` so that
```
$ myscript file1.csv file2.csv
aaaaaaaa-fb34-4e3e-aed5-eec78d02d59b,1234
```
If I just had two lists of UUIDs I could easily compare them with `comm`, but I'm not sure how to deal with the extra field. I guess I could cut off the second field, run `comm`, and then loop over the results and `grep` for each one in the original two-field set, but this seems a little inefficient. Are there better ways with standard utilities? I also saw this question, but that seems to discard rows/fields that are not common.
If `awk` is an option, you can use it to first read all lines of the second file as keys into an array, and then filter the first file by whether its first column appears among those keys. Note the order of the files.
```
$ awk -F, 'NR==FNR {a[$0]; next} !($1 in a)' file2.csv file1.csv
aaaaaaaa-fb34-4e3e-aed5-eec78d02d59b,1234
```
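As an alternative sketch using another standard utility, `join -v1` prints the lines of the first file whose join field has no match in the second file. Unlike the `awk` approach, it requires both inputs to be sorted on the join field (the sample data above already is; otherwise pipe each through `sort` first):

```shell
# Recreate the sample data from the question (file names as in the question).
cat > file1.csv <<'EOF'
aaaaaaaa-fb34-4e3e-aed5-eec78d02d59b,1234
bbbbbbbb-f76a-4a05-bb53-63aba9d03fe2,5678
cccccccc-e351-4d8e-b44a-080f6ccdef7d,9012
EOF
cat > file2.csv <<'EOF'
bbbbbbbb-f76a-4a05-bb53-63aba9d03fe2
cccccccc-e351-4d8e-b44a-080f6ccdef7d
EOF

# -t, sets the field separator; -v1 prints unpairable lines from file1.csv,
# i.e. rows whose first field does not appear in file2.csv.
# Both inputs must be sorted on the join field.
join -t, -v1 file1.csv file2.csv
# prints: aaaaaaaa-fb34-4e3e-aed5-eec78d02d59b,1234
```

The sorting requirement is the trade-off: `join` streams both files in one pass, while the `awk` version keeps the second file's keys in memory but accepts unsorted input.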