linux shell awk grep comm

Remove blank spaces from comm output


I have two lists of IDs that I am comparing with the comm command. My problem is that the output looks like this:

YAL002W
YAL003W
        YAL004W
        YAL005C
                YAL008W
        YAL011W

All I want is to pipe it somehow so that the file is written without the empty spaces, which turn into blank cells when I open the file in Excel. I have tried every combination of grep, awk and sed I could find to remove the blank spaces, without luck.
So I have come to the conclusion that the columns are separated by one or two tabs respectively, and therefore I cannot remove them as easily as ordinary blank spaces without destroying the formatting of the file.

Any help or suggestion is welcome. Thanks.

EDIT:

I want my output to be three tab-delimited columns, without the blank spaces:

YAL002W YAL004W YAL008W
YAL003W YAL005C
        YAL011W

EDIT2, to avoid the XY problem as referenced:

Original problem (X): I have two lists and I want to find the common and unique words between them (to generate a Venn diagram later on). So comm seemed like the perfect solution, since I get all three lists at the same time and can later import them into Excel easily.

Secondary problem (Y): The three columns that are generated are not really three columns (or so I am starting to think), since I can't cut -f them, nor can I remove the blank spaces with the usual awk 'NF' or grep . (for example).
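
To check whether the columns really are tab-separated, the whitespace can be made visible. The sketch below is only an illustration: it fakes three lines of comm output with printf (the helper name sample is made up here) instead of running comm, then prints them with sed's l command and counts the tab-separated fields with awk.

```shell
#!/bin/sh
# Hypothetical helper: emits three lines shaped like comm output,
# one per column (IDs taken from the example above).
sample() {
    printf 'YAL002W\n\tYAL004W\n\t\tYAL008W\n'
}

# sed's "l" command prints each line unambiguously:
# a tab is shown as \t and the end of the line as $.
sample | sed -n l

# Count the tab-separated fields on each line and show the last one.
# A column-1 line has 1 field, column 2 has 2, column 3 has 3.
sample | awk -F'\t' '{ print NF ": " $NF }'
```

Every line contains a non-empty field, which is why awk 'NF' and grep . filter nothing out. Note also that cut -f passes lines containing no tab through unchanged unless -s is given, so column-1 lines show up whichever field is requested.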


Solution

  • Given this input and comm output:

    $ cat file1
    YAL002W
    YAL003W
    YAL008W
    
    $ cat file2
    YAL004W
    YAL005C
    YAL008W
    YAL011W
    
    $ comm file1 file2
    YAL002W
    YAL003W
            YAL004W
            YAL005C
                    YAL008W
            YAL011W
    

    This will do what you asked for:

    $ cat tst.awk
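    BEGIN { FS=OFS="\t" }
    {
        # comm prefixes column 2 with one tab and column 3 with two,
        # so the number of tab-separated fields is the column number.
        colNr = NF
        rowNr = ++rowNrs[colNr]    # next free row in that column
        val[rowNr,colNr] = $NF
        numCols = (colNr > numCols ? colNr : numCols)
        numRows = (rowNr > numRows ? rowNr : numRows)
    }
    END {
        # Print the compacted table; cells with no value stay empty.
        for (rowNr=1; rowNr<=numRows; rowNr++) {
            for (colNr=1; colNr<=numCols; colNr++) {
                printf "%s%s", val[rowNr,colNr], (colNr<numCols ? OFS : ORS)
            }
        }
    }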
    


    $ comm file1 file2 | awk -f tst.awk
    YAL002W YAL004W YAL008W
    YAL003W YAL005C
            YAL011W
    

    But of course you could skip the call to comm and use awk right off the bat:

    $ cat tst.awk
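    BEGIN { FS=OFS="\t" }
    # First file: remember every line, plus the order in which it was first seen.
    NR==FNR {
        if (!($0 in file1)) {
            order[++numFile1] = $0
        }
        file1[$0]
        next
    }
    # Second file: lines also present in file1 go to column 3, the rest to column 2.
    {
        if ($0 in file1) {
            colNr = 3
            delete file1[$0]
        }
        else {
            colNr = 2
        }
        rowNr = ++rowNrs[colNr]
        val[rowNr,colNr] = $0
    }
    END {
        # Whatever is left in file1 is unique to it and goes to column 1.
        # Walk the saved order, because "for (v in file1)" visits keys
        # in an unspecified order.
        for (i=1; i<=numFile1; i++) {
            if (order[i] in file1) {
                colNr = 1
                rowNr = ++rowNrs[colNr]
                val[rowNr,colNr] = order[i]
            }
        }
    
        numRows = (rowNrs[1] > rowNrs[2] ? rowNrs[1] : rowNrs[2])
        numRows = (numRows   > rowNrs[3] ? numRows   : rowNrs[3])
        numCols = 3
        for (rowNr=1; rowNr<=numRows; rowNr++) {
            for (colNr=1; colNr<=numCols; colNr++) {
                printf "%s%s", val[rowNr,colNr], (colNr<numCols ? OFS : ORS)
            }
        }
    }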
    


    $ awk -f tst.awk file1 file2
    YAL002W YAL004W YAL008W
    YAL003W YAL005C
            YAL011W
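
For comparison, the same three-column layout can probably also be had without any awk, by asking comm for one column at a time and gluing the results together with paste. This is a different technique from the scripts above, shown only as a sketch: it assumes both inputs are already sorted (as comm requires), and the function name three_columns and the temporary file names are made up for the example.

```shell
#!/bin/sh
# Sketch: one comm call per column, then paste the three lists side by side.
# Assumes both input files are sorted, as comm requires.

three_columns() {
    # $1, $2: the two sorted input files
    tmpdir=$(mktemp -d) || return 1
    comm -23 "$1" "$2" > "$tmpdir/only1"   # lines unique to the first file
    comm -13 "$1" "$2" > "$tmpdir/only2"   # lines unique to the second file
    comm -12 "$1" "$2" > "$tmpdir/both"    # lines common to both files
    # paste joins line N of each file with tabs; lists that have run out
    # contribute an empty field.
    paste "$tmpdir/only1" "$tmpdir/only2" "$tmpdir/both"
    rm -rf "$tmpdir"
}

# Recreate the sample data from the answer in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' YAL002W YAL003W YAL008W > "$workdir/file1"
printf '%s\n' YAL004W YAL005C YAL008W YAL011W > "$workdir/file2"

three_columns "$workdir/file1" "$workdir/file2"
```

Unlike the awk scripts, paste always emits all three fields, so shorter columns leave trailing tabs on the later rows. That should be harmless when the file is imported into Excel, but it is worth knowing if the output is compared byte for byte.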