bash comm

Print differences between unsorted strings from files


I have two files that each contain n lines, with one string per line. I want to print the difference in characters between corresponding lines of the two files. You can think of the operation as a kind of "subtraction" of letters. This is what it should look like:

List1       List2      Result
AaBbCcDd    AaCcDd     Bb
AaBbCcE     AaBbCc     E
AaBbCcF     AaCcF      Bb

This means that the second list is not sorted alphabetically, but the substrings to remove are sorted within each string (Aa comes before Bb, which comes before Cc). Note that the elements to remove can be either 1 or 2 characters long (F or Aa): always an uppercase letter, sometimes followed by a lowercase letter. The strings are built entirely from a few "elements" such as Aa, Bb, Cc, Dd, E, F, Gg, and so on.

A very similar question has been answered here: "Bash script Find difference between two strings", but only for two strings entered manually, whereas I need to perform the operation many hundreds of times. I am struggling to use files as the input to this command while also splitting the characters correctly. Here is my adaptation:

split_chars() { sed $'s/./&\\\n/g' <<< "$1"; }
comm -23 <(split_chars AaBbCcDd) <(split_chars AaCcDd)

which gives as output

B
b

so this is still not quite what I want, even in this single case. I guess that the split_chars function is the key here, but I was not able to apply it to my files in any way. Putting the file names inside the parentheses obviously does not work. For reference, a simple

comm -23 List1 List2

just leads to

AaBbCcDd
AaBbCcEe
AaBbCcF
comm: file 2 is not in sorted order

Solution

  • Since you want to split the strings not into single characters but into substrings that start with an uppercase letter, replace split_chars with the following function:

    split() { sed 's/[A-Z]/\n&/g' <<< "$1"; }
    

    Splitting a line can be undone by deleting all newline characters using tr -d \\n.
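
    As a quick check of the split and join steps, here is a minimal sketch. It assumes GNU sed (which reads \n in the replacement as a newline) and uses printf instead of a here-string so that it also runs in plain sh; the name join_lines is only illustrative.

    ```shell
    # Split a string before every uppercase letter, one element per line.
    # Assumption: GNU sed, which treats \n in the replacement as a newline.
    split() { printf '%s\n' "$1" | sed 's/[A-Z]/\n&/g'; }

    # Join all lines of standard input back into a single line.
    # join_lines is a hypothetical name chosen for this sketch.
    join_lines() { tr -d '\n'; echo; }

    split AaBbCcF               # prints an empty line, then Aa, Bb, Cc, F
    split AaBbCcF | join_lines  # prints AaBbCcF
    ```

    The leading empty line appears because the newline is inserted before the first uppercase letter as well. It does no harm here, since it is removed again when the lines are joined.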

    To subtract a list of lines from another list of lines you can use grep without having to sort.

    grep -vFxf subtrahend minuend
    

    This will print in original order those lines from file minuend which are not in file subtrahend.
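
    To see the subtraction on its own, the following sketch runs the same grep call on two throwaway files. The temporary directory and the sample elements are assumptions made for illustration; only the grep options come from the command above.

    ```shell
    # Throwaway files holding one element per line; the file names mirror the
    # command above, the contents are made up for this sketch.
    dir=$(mktemp -d)
    printf '%s\n' Aa Bb Cc Dd > "$dir/minuend"
    printf '%s\n' Cc Aa Dd    > "$dir/subtrahend"

    # -F: fixed strings, -x: match whole lines only, -f: read patterns from a file,
    # -v: print the lines that match none of the patterns.
    result=$(grep -vFxf "$dir/subtrahend" "$dir/minuend")
    echo "$result"    # prints Bb
    rm -r "$dir"
    ```

    Note that each pattern removes every matching line, so an element that occurs twice in minuend disappears completely even if subtrahend lists it only once.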

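    To put everything together, you have to read both files in parallel line by line, split each pair of lines, subtract the pieces of the second line from those of the first, and join the remaining pieces back into one line.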

    Here is a simplified version assuming your input files contain only lines of the described format and have the same number of lines.

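    # Put each element (uppercase letter plus optional lowercase letter) on its own line.
    split() { sed 's/[A-Z]/\n&/g' <<< "$1"; }
    # Print the lines of file $1 that do not occur in file $2.
    subtract() { grep -vFxf "$2" "$1"; }
    # Join the remaining lines back into a single line.
    union() { tr -d \\n; echo; }
    # Read both files in parallel, one pair of lines at a time.
    paste List1 List2 | while read -r minuend subtrahend; do
        subtract <(split "$minuend") <(split "$subtrahend") | union
    done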
    

    Bash loops that start several processes per iteration, as this one does with sed, grep and tr for every pair of lines, are slow. If you need a faster solution, you should rewrite this script in a language like perl or python.
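
The answer names perl or python as faster options. As a sketch that stays within the shell, the same subtraction can also be done by a single awk process, which is a swapped-in technique and not the answer's own method. The function name subtract_files is hypothetical, and the code assumes that every element is an uppercase letter optionally followed by one lowercase letter, as described in the question.

```shell
# subtract_files FILE1 FILE2 (hypothetical helper): for each pair of lines,
# print the elements of the FILE1 line that do not occur in the FILE2 line.
# Usage: subtract_files List1 List2
subtract_files() {
    paste "$1" "$2" | awk -F '\t' '
    {
        split("", seen)        # clear the lookup table for this pair of lines
        rest = $2
        # Record every element of the subtrahend.
        while (match(rest, /[A-Z][a-z]?/)) {
            seen[substr(rest, RSTART, RLENGTH)] = 1
            rest = substr(rest, RSTART + RLENGTH)
        }
        out = ""
        rest = $1
        # Keep the elements of the minuend that were not recorded, in order.
        while (match(rest, /[A-Z][a-z]?/)) {
            elem = substr(rest, RSTART, RLENGTH)
            if (!(elem in seen)) out = out elem
            rest = substr(rest, RSTART + RLENGTH)
        }
        print out
    }'
}
```

Because paste and awk each start only once for the whole input, the cost per line no longer includes process creation. Like the grep version, this drops every occurrence of an element that appears in the second string.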