joinawkgawkgnuwin32

How to improve a GNUwin32 join command?


Am not able to produce the desired results using join.

Am running GNUwin32 on Windows 7 64 bit. Am running join version 5.3.0.1936 and gawk version 3.1.6.2962.

The following two tables are input:

Table_1

UID_C   CID
C000002 31799
C000002 31800
C000386 14950
C000386 9807916
C000386 10255083
C008114 5318432
C008117 799
C008117 444150
C008117 46878464

Table_2

UID_C   CID name
C000002 31799   bevonium
C000002 31800   bevonium
C002284 24832095    hypromellose
C008117 799 indoleglycerol phosphate
C008117 444150  indoleglycerol phosphate
C008117 46878464    indoleglycerol phosphate

Am using the following command in a bat file:

C:\gnuwin32\bin\join -t"|" -1 1 -2 1 -a1 -a2 -e "NULL" -o "0,1.2,2.2,2.3" C:\directory\Table_1.txt C:\directory\Table_2.txt > C:\directory\Table_3.txt

In my illustration on stackoverflow, the Tables are formatted using tabs for easier reading, but in practice I use pipe as the input and output delimiter.

The following table is output:

Table_3

UID_C   CID CID name
C000002 31800   31799   bevonium
C000002 31800   31800   bevonium
C000002 31799   31799   bevonium
C000002 31799   31800   bevonium
C000386 10255083    NULL    NULL
C000386 9807916 NULL    NULL
C000386 14950   NULL    NULL
C002284 NULL    24832095    hypromellose
C008114 5318432 NULL    NULL
C008117 46878464    799 indoleglycerol phosphate
C008117 46878464    444150  indoleglycerol phosphate
C008117 46878464    46878464    indoleglycerol phosphate
C008117 444150  799 indoleglycerol phosphate
C008117 444150  444150  indoleglycerol phosphate
C008117 444150  46878464    indoleglycerol phosphate
C008117 799 799 indoleglycerol phosphate
C008117 799 444150  indoleglycerol phosphate
C008117 799 46878464    indoleglycerol phosphate

The desired output is:

Table_4

UID_C   CID name
C000002 31799   bevonium
C000002 31800   bevonium
C000386 14950   NULL
C000386 9807916 NULL
C000386 10255083    NULL
C002284 24832095    hypromellose
C008114 5318432 NULL
C008117 799 indoleglycerol phosphate
C008117 444150  indoleglycerol phosphate
C008117 46878464    indoleglycerol phosphate

How do I change the join command to produce the desired output?

Alternatively, how should I use awk as a post process for Table_3 to produce Table_4?

Thanks in advance for advice.


Solution

  • I think you need more logic than join providex:

    awk -F"|" -v "OFS=|" '
        NR==FNR {uid_cid[$1 OFS $2]=1; next}
        { 
            key = $1 OFS $2
            if (key in uid_cid) {
                delete uid_cid[key]
            }
            print
        }
        END {
            for (key in uid_cid) {
                print key, "NULL"
            }
        }
    ' Table_1 Table_2 | sort -k1,1 -k2,2n -t "|"
    
    C000002|31799|bevonium
    C000002|31800|bevonium
    C000386|14950|NULL
    C000386|9807916|NULL
    C000386|10255083|NULL
    C002284|24832095|hypromellose
    C008114|5318432|NULL
    C008117|799|indoleglycerol phosphate
    C008117|444150|indoleglycerol phosphate
    C008117|46878464|indoleglycerol phosphate