I have two files that each contain n lines, with one string per line. I want to print the difference in characters between the corresponding lines of the two files. You could imagine the operation as a sort of "subtraction" of letters. This is what it should look like:
List1     List2   Result
AaBbCcDd  AaCcDd  Bb
AaBbCcE   AaBbCc  E
AaBbCcF   AaCcF   Bb
This means that the second list is not sorted alphabetically, but all the substrings to remove are sorted within each string (Aa comes before Bb, which comes before Cc). Note that the elements to remove can be either 1 or 2 characters long (Aa or F), always starting with an uppercase letter that is sometimes followed by a lowercase letter. The strings are composed entirely of permutations of a few "elements" like Aa, Bb, Cc, Dd, E, F, Gg, ... and so on.
This question has been answered in a very similar form here: "Bash script Find difference between two strings", but only for two strings entered manually, whereas I need to do the operation many hundreds of times. I am struggling with using files as the input to this command while also separating the characters correctly. Here is my adaptation:
split_chars() { sed $'s/./&\\\n/g' <<< "$1"; }
comm -23 <(split_chars AaBbCcDd) <(split_chars AaCcDd)
which gives as output
B
b
so still not quite what I want, even in this single case. I guess that the split_chars command is the key here, but I was not able to apply it to my files in any way. Putting the file names inside the brackets obviously does not work.
For reference, a simple
comm -23 List1 List2
just leads to
AaBbCcDd
AaBbCcEe
AaBbCcF
comm: file 2 is not in sorted order
Since you don't want to split the string into single characters, but into substrings that each start with an uppercase letter, you should replace split_chars with the following function.
split() { sed 's/[A-Z]/\n&/g' <<< "$1"; }
Splitting a line can be undone by deleting all newline characters using tr -d \\n.
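As a quick check of the two steps above, here is a minimal sketch of a split and re-join round trip. It assumes GNU sed (which treats \n in the replacement as a newline); join_lines is just a hypothetical name for the tr step, and printf is used instead of a here-string only to keep the sketch portable.

```shell
# Split a string into its elements, one per line, by inserting a newline
# before every uppercase letter. Assumes GNU sed.
split() { printf '%s\n' "$1" | sed 's/[A-Z]/\n&/g'; }

# Hypothetical helper: undo the split by deleting all newlines,
# then print one final newline.
join_lines() { tr -d '\n'; echo; }

split AaBbCcE               # prints an empty line, then Aa, Bb, Cc, E
split AaBbCcE | join_lines  # prints AaBbCcE
```

The leading empty line produced by split is harmless: tr removes it again when the elements are joined.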
To subtract one list of lines from another list of lines, you can use grep without having to sort.
grep -vFxf subtrahend minuend
This will print, in their original order, those lines from file minuend which are not in file subtrahend.
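To see the grep flags in isolation, here is a small sketch that uses two throwaway files in a temporary directory (the file names are only for illustration). -F treats the patterns as fixed strings, -x requires a match of the whole line, -f reads the patterns from a file, and -v inverts the match.

```shell
tmp=$(mktemp -d)
printf '%s\n' Aa Bb Cc Dd > "$tmp/minuend"
printf '%s\n' Aa Cc Dd    > "$tmp/subtrahend"

# Keep only the lines of minuend that do not appear in subtrahend.
result=$(grep -vFxf "$tmp/subtrahend" "$tmp/minuend")
echo "$result"   # prints Bb

rm -r "$tmp"
```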
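To put everything together, you have to read both files line by line in parallel, split both lines into their elements, subtract the elements of the second line from those of the first, and join the remaining elements again.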
Here is a simplified version, assuming your input files contain only lines of the described format and have the same number of lines.
split() { sed 's/[A-Z]/\n&/g' <<< "$1"; }
subtract() { grep -vFxf "$2" "$1"; }
union() { tr -d \\n; echo; }
paste List1 List2 | while read -r minuend subtrahend; do
subtract <(split "$minuend") <(split "$subtrahend") | union
done
Bash scripts with loops are slow, and this one also starts several processes (sed, grep, tr) for every line. If you need a faster solution, you should rewrite this script in a more advanced language like perl or python.
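If you would rather stay in the shell, one possible alternative (not the perl or python rewrite suggested above) is to let a single awk process handle all lines. The following is only a sketch under the same assumptions as before: every element is an uppercase letter optionally followed by lowercase letters, and both files have the same number of lines. subtract_files is a hypothetical helper name.

```shell
# Hypothetical helper: takes the two list files as arguments and prints,
# for each pair of lines, the elements of the first that are not in the second.
subtract_files() {
    paste "$1" "$2" | LC_ALL=C awk -F '\t' '
    {
        # Collect the elements of the second column (the subtrahend).
        split("", drop)                 # portable way to empty an array
        s = $2
        while (match(s, /[A-Z][a-z]*/)) {
            drop[substr(s, RSTART, RLENGTH)] = 1
            s = substr(s, RSTART + RLENGTH)
        }
        # Keep the elements of the first column that were not collected.
        out = ""
        s = $1
        while (match(s, /[A-Z][a-z]*/)) {
            e = substr(s, RSTART, RLENGTH)
            if (!(e in drop)) out = out e
            s = substr(s, RSTART + RLENGTH)
        }
        print out
    }'
}
```

It would be called as subtract_files List1 List2. Like the grep version, it removes every occurrence of an element that appears in the second string, and it keeps the remaining elements in their original order.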