Tags: bash, awk, grep, cut, ncbi

Awk command to set variable name while matching a regular expression


I have a names.dmp file which contains taxonomy ids and scientific names among other details.

I want to fetch the scientific name of a particular tax-id, for which I am running this command:

cat names.dmp | grep "scientific name" | awk '$1~/^10090$/{print $0}' | cut -d "|" -f1,2

which gives me the output:

10090 | Mus musculus

But I need this to be dynamic, i.e., set a variable id=10090 and use that variable inside the regular expression. The match on "id" must be exact, because otherwise I also get entries such as 210090 and 100904, which I don't want.

I am quite inexperienced when it comes to awk, so any help is appreciated.

EDIT:

Here is the example input:

10089   |       Mus formosanus Kuroda, 1925     |               |       authority       |
10089   |       Mus formosanus  |               |       synonym |
10089   |       ricefield mouse |               |       common name     |
10089   |       Ryukyu mouse    |               |       genbank common name     |
10090   |       house mouse     |               |       genbank common name     |
10090   |       LK3 transgenic mice     |               |       includes        |
10090   |       mouse   |       mouse <Mus musculus>    |       common name     |
10090   |       Mus musculus Linnaeus, 1758     |               |       authority       |
10090   |       Mus musculus    |               |       scientific name |
10090   |       Mus sp. 129SV   |               |       includes        |
10090   |       nude mice       |               |       includes        |
10090   |       transgenic mice |               |       includes        |
10091   |       Mus castaneus   |               |       synonym |
10091   |       Mus musculus castaneus  |               |       scientific name |
10091   |       Mus musculus castaneus Waterhouse, 1843 |               |       authority       |
10091   |       southeastern Asian house mouse  |               |       genbank common name     |
10092   |       Mus domesticus  |               |       synonym |
10092   |       Mus musculus domesticus Schwarz & Scharz 1943   |               |       authority       |
10092   |       Mus musculus domesticus |               |       scientific name |
10092   |       Mus musculus praetextus |               |       synonym |
100902  |       Fusarium oxysporum f. sp. conglutinans  |               |       scientific name |
100903  |       Fusarium oxysporum f. sp. fragariae     |               |       scientific name |
100905  |       Cloning vector pACN     |               |       scientific name |
100906  |       Nitrosomonas sp. ENI-11 |               |       scientific name |
100907  |       Chilean sea bass        |               |       common name     |

And the output I need is:

10090 | Mus musculus


Solution

  • With awk, you often don't need any other tool:

    $ awk -F'[[:space:]]*\\|[[:space:]]*' -v id="10090" '
      /scientific name/ && $1 == id {print $1 " | " $2}' file
    10090 | Mus musculus
    
    1. -F'[[:space:]]*\\|[[:space:]]*': set the input field separator to a | with optional whitespace on either side.
    2. -v id="10090": declare the awk variable id and assign it 10090 (change this as needed).
    3. If the input record contains the string scientific name and the first field equals id, print the first two fields separated by " | ".
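Since the question asks for the id to come from a variable, the same command can be wrapped so that a shell variable is passed through -v. This is only a minimal sketch: the function name sci_name and the file names_sample.dmp are hypothetical, and the sample lines are copied from the example input in the question.

```shell
# Hypothetical sample file with a few lines from the question's example input.
cat > names_sample.dmp <<'EOF'
10090   |       house mouse     |               |       genbank common name     |
10090   |       Mus musculus    |               |       scientific name |
10091   |       Mus musculus castaneus  |               |       scientific name |
100902  |       Fusarium oxysporum f. sp. conglutinans  |               |       scientific name |
EOF

# sci_name ID FILE: print "ID | scientific name" for an exact match on field 1.
# (sci_name is an illustrative helper, not part of any tool.)
sci_name() {
  awk -F'[[:space:]]*\\|[[:space:]]*' -v id="$1" '
    /scientific name/ && $1 == id { print $1 " | " $2 }' "$2"
}

id=10090
sci_name "$id" names_sample.dmp
```

Because $1 == id compares whole field values rather than matching a pattern, an id such as 100902 is not picked up when id is 10090.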

    As noted in the comments, this does not preserve the input field separators. If you want to preserve them, you can use GNU awk's split function with its fourth argument, instead of the input field separator, to save the fields in one array and the separators in another:

    $ awk -v id="10090" '/scientific name/ {
        split($0,f,/[[:space:]]*\|[[:space:]]*/,s)
        if(f[1] == id) print f[1] s[1] f[2]}' file
    10090   |       Mus musculus
    

    Finally, if your awk is not GNU awk but you want to preserve the field separators, you can use match and substr instead of split:

    $ awk -F'[[:space:]]*\\|[[:space:]]*' -v id="10090" '
      /scientific name/ && $1==id {
        a=match($0,/\|/); b=match(substr($0,a+1),/[[:space:]]*\|/)
        print substr($0,1,a+b-1)}' file
    10090   |       Mus musculus
    

    We use match to find the index of the first | (a), then the offset, counted from the character after that |, at which the whitespace preceding the second | begins (b), and print everything before that point (substr).
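To make the two offsets concrete, here is a small sketch that prints a, b and the extracted prefix for one record. The helper name show_offsets and the shortened input line (single spaces instead of the wider padding in names.dmp) are assumptions made for illustration only.

```shell
# show_offsets: for each input line, report the two match() offsets and the
# prefix that substr() keeps. (Hypothetical helper, for illustration only.)
show_offsets() {
  awk '{
    a = match($0, /\|/)                              # index of the first |
    b = match(substr($0, a + 1), /[[:space:]]*\|/)   # start of the whitespace before the second |,
                                                     # counted from the character after the first |
    printf "a=%d b=%d [%s]\n", a, b, substr($0, 1, a + b - 1)
  }'
}

printf '%s\n' '10090 | Mus musculus | | scientific name |' | show_offsets
# prints: a=7 b=14 [10090 | Mus musculus]
```

The whitespace between the fields is taken from the input unchanged, which is why this variant keeps the original separators.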