Tags: unix, sed, comm

How to get the output from the comm command into 3 separate files?


The question Unix command to find lines common in two files has an answer suggesting the use of the comm command to do the task:

comm -12 1.sorted.txt 2.sorted.txt

This shows the lines common to the two files (the -1 suppresses the lines that are only in the first file, and the -2 suppresses the lines only in the second file, leaving just the lines common to both files as output). As the file names suggest, the input files must be in sorted order.
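
As a small illustration of how the suppression flags combine, the sketch below creates two short sorted files and prints each of the three columns on its own. The file contents are sample data, and setting LC_ALL=C is an assumption that the inputs are sorted in byte order.

```shell
#!/bin/sh
# Sketch: each pair of flags suppresses two columns, leaving the third.
LC_ALL=C; export LC_ALL   # assumption: inputs are sorted in byte order

printf '%s\n' 1.line-1 1.line-2 1.line-4 1.line-6 2.line-2 3.line-5 > 1.sorted.txt
printf '%s\n' 1.line-3 2.line-1 2.line-2 2.line-4 2.line-6 3.line-5 > 2.sorted.txt

only1=$(comm -23 1.sorted.txt 2.sorted.txt)   # lines only in the first file
only2=$(comm -13 1.sorted.txt 2.sorted.txt)   # lines only in the second file
both=$(comm -12 1.sorted.txt 2.sorted.txt)    # lines common to both files

printf 'only in first:\n%s\n' "$only1"
printf 'only in second:\n%s\n' "$only2"
printf 'in both:\n%s\n' "$both"
```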

In a comment to that question, bapors asks:

How would one have the outputs in different files?

Seeking clarification, I asked:

If you want the lines only in File1 in one file, those only in File2 in another, and those in both in a third, then (provided that none of the lines in the files starts with a tab) you could use sed to split the output to three files.

User bapors confirmed:

It is exactly what I was asking. Would you show an example?

The answer is relatively long-winded and would spoil the simplicity of the answer to the other question (drowning it out with lots of information), so I've asked the question separately here — and provided an answer too.


Solution

  • The basic solution using sed relies on the fact that comm outputs lines found only in the first file with no prefix; it outputs the lines found only in the second file with a single tab; and it outputs the lines found in both files with two tabs.

    It also relies on sed's w command to write to files.
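
For readers who have not met the w command, here is a minimal sketch of it in isolation. The input words and the output name w-demo.txt are made up for the example.

```shell
#!/bin/sh
# Sketch: sed's w command writes the current line to the named file.
# -n suppresses the normal printing; w still writes the matching lines.
# w-demo.txt is a hypothetical output name chosen for this example.
printf '%s\n' apple banana avocado |
    sed -n '/^a/w w-demo.txt'

cat w-demo.txt
```

Only the lines starting with "a" end up in w-demo.txt. The output file is created when sed starts, so it exists (empty) even when no line matches.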

    Given file 1.sorted.txt containing:

    1.line-1
    1.line-2
    1.line-4
    1.line-6
    2.line-2
    3.line-5
    

    and file 2.sorted.txt containing:

    1.line-3
    2.line-1
    2.line-2
    2.line-4
    2.line-6
    3.line-5
    

    the basic output from comm 1.sorted.txt 2.sorted.txt is:

    1.line-1
    1.line-2
            1.line-3
    1.line-4
    1.line-6
            2.line-1
                    2.line-2
            2.line-4
            2.line-6
                    3.line-5
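
Because tabs are hard to tell apart from spaces in a listing like this, it can help to make them visible. The sketch below is one way to do so, using sed's l command, which writes each tab as \t and marks each end of line with $. It recreates the two sample files and assumes byte-order sorting via LC_ALL=C.

```shell
#!/bin/sh
# Sketch: show comm's tab prefixes explicitly with sed's l command.
LC_ALL=C; export LC_ALL   # assumption: inputs are sorted in byte order

printf '%s\n' 1.line-1 1.line-2 1.line-4 1.line-6 2.line-2 3.line-5 > 1.sorted.txt
printf '%s\n' 1.line-3 2.line-1 2.line-2 2.line-4 2.line-6 3.line-5 > 2.sorted.txt

# sed -n l prints every line in an unambiguous form: \t for tab, $ at the end.
visible=$(comm 1.sorted.txt 2.sorted.txt | sed -n l)
printf '%s\n' "$visible"
```

Lines found only in the second file show up as \t1.line-3$, and lines common to both files as \t\t2.line-2$.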
    

    Given a file script.sed containing:

    /^\t\t/ {
        s///
        w file.3
        d
    }
    /^\t/ {
        s///
        w file.2
        d
    }
    /^[^\t]/ {
        w file.1
        d
    }
    

    you can run the command below and get the desired output:

    $ comm 1.sorted.txt 2.sorted.txt | sed -f script.sed
    $ cat file.1
    1.line-1
    1.line-2
    1.line-4
    1.line-6
    $ cat file.2
    1.line-3
    2.line-1
    2.line-4
    2.line-6
    $ cat file.3
    2.line-2
    3.line-5
    $
    

    The script works by:

    1. matching lines that start with 2 tabs, deleting the tabs, writing the line to file.3, and deleting the line (so the rest of the script is ignored),
    2. matching lines that start with 1 tab, deleting the tab, writing the line to file.2, and deleting the line (so the rest of the script is ignored),
    3. matching lines that do not start with a tab, writing the line to file.1, and deleting the line.

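    The match and delete operations in step 3 are more for symmetry than anything else. The match could be omitted, because any line that reaches that point has no leading tab (and omitting it would also send empty lines, which /^[^\t]/ does not match, to file.1). The contents of file.1 would be the same without the d as well, but then sed would also echo those lines to standard output unless it is run with -n. See script3.sed below for further justification for keeping the symmetry.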

    As written, that requires GNU sed; BSD sed doesn't recognize the \t escapes. The file could instead be written with literal tab characters in place of the \t notation, and then BSD sed accepts the script.
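
One way to get literal tabs into the script without typing them by hand is to generate the file from the shell. This is only a sketch: the name script-tabs.sed is made up for the example, the sample inputs are recreated so the block stands alone, and LC_ALL=C assumes byte-order sorting.

```shell
#!/bin/sh
# Sketch: write a copy of script.sed that contains literal tab characters,
# so it does not depend on sed understanding the \t escape.
LC_ALL=C; export LC_ALL   # assumption: inputs are sorted in byte order

printf '%s\n' 1.line-1 1.line-2 1.line-4 1.line-6 2.line-2 3.line-5 > 1.sorted.txt
printf '%s\n' 1.line-3 2.line-1 2.line-2 2.line-4 2.line-6 3.line-5 > 2.sorted.txt

tab=$(printf '\t')        # a single literal tab character

# The here-document is unquoted, so ${tab} expands to a real tab.
# script-tabs.sed is a hypothetical file name.
cat > script-tabs.sed <<EOF
/^${tab}${tab}/ {
    s///
    w file.3
    d
}
/^${tab}/ {
    s///
    w file.2
    d
}
/^[^${tab}]/ {
    w file.1
    d
}
EOF

comm 1.sorted.txt 2.sorted.txt | sed -f script-tabs.sed
```

After this runs, file.1, file.2 and file.3 should hold the same lines as in the session shown earlier.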

    It is possible to make it work all on the command line, but it is fiddly (and that's being polite about it). Using Bash's ANSI C Quoting, you can write:

    $ comm 1.sorted.txt 2.sorted.txt |
    > sed -e $'/^\t\t/  { s///\n w file.3\n d\n }' \
    >     -e $'/^\t/    { s///\n w file.2\n d\n }' \
    >     -e $'/^[^\t]/ {        w file.1\n d\n }'
    $
    

    which writes each of the three 'paragraphs' of script.sed in a separate -e option. The w command is fussy: it treats everything after it, up to the end of the script line, as the file name, hence the \n after each file name. Many of the spaces could be eliminated, but the symmetry is clearer with the layout shown. Using the -f script.sed file is probably simpler. It is certainly a technique worth knowing, because it avoids the quoting problems that arise when a sed script containing single quotes, double quotes and back-quotes has to be written on the Bash command line.
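
If Bash's $'...' quoting is not available, a variation is to keep a tab in a shell variable and pass each sed command in its own -e option, so that every w file name is followed by the end of its script fragment. The sketch below assumes a sed that joins multiple -e fragments with newlines, as POSIX describes; the sample inputs are recreated so the block stands alone.

```shell
#!/bin/sh
# Sketch: the same three-way split on the command line, without $'...'.
LC_ALL=C; export LC_ALL   # assumption: inputs are sorted in byte order

printf '%s\n' 1.line-1 1.line-2 1.line-4 1.line-6 2.line-2 3.line-5 > 1.sorted.txt
printf '%s\n' 1.line-3 2.line-1 2.line-2 2.line-4 2.line-6 3.line-5 > 2.sorted.txt

tab=$(printf '\t')        # a single literal tab character

# One command per -e: each fragment ends where a newline would be,
# so the w file names are not followed by anything else.
comm 1.sorted.txt 2.sorted.txt |
sed -e "/^${tab}${tab}/{" -e 's///' -e 'w file.3' -e 'd' -e '}' \
    -e "/^${tab}/{"       -e 's///' -e 'w file.2' -e 'd' -e '}' \
    -e 'w file.1'         -e 'd'
```

The final pair of commands has no address, since any line that gets that far has no leading tab.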

    Finally, if the two files can contain lines starting with tabs, this technique requires some more brute force to make it work. One variant solution exploits Bash's process substitution to add a prefix before the lines in the files, and then the post-processing sed script removes the prefixes before writing to the output files.

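    script3.sed (shown here with each tab expanded to as many as 8 spaces; the actual file contains tab characters). Note that this time a substitute s/// is needed in the third paragraph to remove the X prefix (the d is still optional as far as the output files are concerned, but may as well be included):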

    /^              X/ {
        s///
        w file.3
        d
    }
    /^      X/ {
        s///
        w file.2
        d
    }
    /^X/ {
        s///
        w file.1
        d
    }
    

    And the command line:

    $ comm <(sed 's/^/X/' 1.sorted.txt) <(sed 's/^/X/' 2.sorted.txt) |
    > sed -f script3.sed
    $
    

    For the same input files, this produces the same output, but by adding and then removing the X at the start of each line, the code doesn't change the sort order of the data and would handle leading tabs if they were present.
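
To see the prefix technique cope with leading tabs, here is a sketch with made-up input in which one line in each file starts with a tab. To keep it runnable in plain sh, it uses temporary files in place of Bash process substitution, which is a substitution of mine and not the command shown above. The input file names and script3-tabs.sed are hypothetical, and LC_ALL=C assumes byte-order sorting.

```shell
#!/bin/sh
# Sketch: prefix each line with X, split comm's output, strip the prefix.
LC_ALL=C; export LC_ALL   # assumption: inputs are sorted in byte order
tab=$(printf '\t')        # a single literal tab character

# Hypothetical inputs; the first line of each file starts with a tab.
printf '%s\n' "${tab}shared-tab" alpha common > a.sorted.txt
printf '%s\n' "${tab}shared-tab" beta  common > b.sorted.txt

# Temporary files stand in for <(sed 's/^/X/' ...) process substitution.
sed 's/^/X/' a.sorted.txt > a.prefixed.txt
sed 's/^/X/' b.sorted.txt > b.prefixed.txt

# script3-tabs.sed is script3.sed written with literal tabs.
cat > script3-tabs.sed <<EOF
/^${tab}${tab}X/ {
    s///
    w file.3
    d
}
/^${tab}X/ {
    s///
    w file.2
    d
}
/^X/ {
    s///
    w file.1
    d
}
EOF

comm a.prefixed.txt b.prefixed.txt | sed -f script3-tabs.sed
rm -f a.prefixed.txt b.prefixed.txt
```

With this input, file.3 receives the tab-indented shared line with its own tab intact, because only comm's tabs and the X are stripped.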

    You can also easily write solutions that use Perl or Awk, and those do not even have to use comm (and can be made to work with unsorted files, provided the files fit into memory).
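
As a rough sketch of the Awk route, the script below splits two unsorted files without using comm. It is an illustration under some assumptions: the input names and contents are made up, the first file is not empty (the FNR == NR test relies on that), and duplicate lines are treated as a set, which differs from how comm counts repeated lines.

```shell
#!/bin/sh
# Sketch: three-way split of two unsorted files with awk, holding them in memory.
# 1.unsorted.txt and 2.unsorted.txt are hypothetical sample inputs.
printf '%s\n' 3.line-5 1.line-1 2.line-2 > 1.unsorted.txt
printf '%s\n' 2.line-2 1.line-3 3.line-5 > 2.unsorted.txt

# Create the outputs up front, since awk only opens a file it writes to.
: > file.1; : > file.2; : > file.3

awk '
    # First file: remember every line and the order in which it appeared.
    FNR == NR { order[++n] = $0; in1[$0] = 1; next }
    # Second file: a line also seen in the first file is common to both.
    ($0 in in1) { both[$0] = 1; print > "file.3"; next }
    # Otherwise the line is only in the second file.
    { print > "file.2" }
    END {
        # Lines from the first file that never turned up in the second.
        for (i = 1; i <= n; i++)
            if (!(order[i] in both)) print order[i] > "file.1"
    }
' 1.unsorted.txt 2.unsorted.txt
```

The common lines land in file.3 in the order they appear in the second file, and no sorting is required of either input.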