I have a names.dmp file which contains taxonomy ids and scientific names among other details.
I want to fetch the scientific name of a particular tax-id, for which I am running this command:
cat names.dmp | grep "scientific name" | awk '$1~/^10090$/{print $0}' | cut -d "|" -f1,2
which gives me the output:
10090 | Mus musculus
But I need this to be dynamic, i.e., set a variable id=10090 and use this variable inside the regular expression. I need an exact match on the value of id, because entries such as 210090 and 100904 also show up in my output and are not wanted.
I am quite inexperienced when it comes to awk, so any help is appreciated.
EDIT:
Here is the example input:
10089 | Mus formosanus Kuroda, 1925 | | authority |
10089 | Mus formosanus | | synonym |
10089 | ricefield mouse | | common name |
10089 | Ryukyu mouse | | genbank common name |
10090 | house mouse | | genbank common name |
10090 | LK3 transgenic mice | | includes |
10090 | mouse | mouse <Mus musculus> | common name |
10090 | Mus musculus Linnaeus, 1758 | | authority |
10090 | Mus musculus | | scientific name |
10090 | Mus sp. 129SV | | includes |
10090 | nude mice | | includes |
10090 | transgenic mice | | includes |
10091 | Mus castaneus | | synonym |
10091 | Mus musculus castaneus | | scientific name |
10091 | Mus musculus castaneus Waterhouse, 1843 | | authority |
10091 | southeastern Asian house mouse | | genbank common name |
10092 | Mus domesticus | | synonym |
10092 | Mus musculus domesticus Schwarz & Scharz 1943 | | authority |
10092 | Mus musculus domesticus | | scientific name |
10092 | Mus musculus praetextus | | synonym |
100902 | Fusarium oxysporum f. sp. conglutinans | | scientific name |
100903 | Fusarium oxysporum f. sp. fragariae | | scientific name |
100905 | Cloning vector pACN | | scientific name |
100906 | Nitrosomonas sp. ENI-11 | | scientific name |
100907 | Chilean sea bass | | common name |
And the output I need is:
10090 | Mus musculus
When you use awk, you frequently don't need anything else:
$ awk -F'[[:space:]]*\\|[[:space:]]*' -v id="10090" '
/scientific name/ && $1 == id {print $1 " | " $2}' file
10090 | Mus musculus
-F'[[:space:]]*\\|[[:space:]]*': set the input field separator to a | surrounded by optional whitespace.
-v id="10090": declare the awk variable id and assign it 10090 (change this if needed).
/scientific name/ && $1 == id: if the line contains scientific name and the first field equals id, print the first two fields separated by " | ".
As noted in the comments, this does not preserve the input field separators. If you want to preserve them, you can use the split function of GNU awk instead of the input field separator, saving the fields in one array and the separators in another:
$ awk -v id="10090" '/scientific name/ {
split($0,f,/[[:space:]]*\|[[:space:]]*/,s)
if(f[1] == id) print f[1] s[1] f[2]}' file
10090 | Mus musculus
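To make the separator point concrete, here is a small sketch on a tab-delimited line. The tab layout is an assumption made up for illustration (the sample in the question uses spaces), and the separator is written as [ \t]*[|][ \t]* only so the example also runs on older awk builds without POSIX character classes. The field-separator approach rebuilds its output with a hard-coded " | ", so the original tabs are lost:

```shell
# Assumed tab-delimited variant of one input line (illustration only).
line=$(printf '10090\t|\tMus musculus\t|\t\t|\tscientific name\t|')

# Split on the separator, then re-join with a literal " | ":
# whatever whitespace surrounded the pipes in the input is discarded.
out=$(printf '%s\n' "$line" |
    awk -F'[ \t]*[|][ \t]*' -v id="10090" '
        /scientific name/ && $1 == id { print $1 " | " $2 }')

printf '%s\n' "$out"
```

This prints 10090 | Mus musculus with spaces around the pipe, not tabs. Approaches that cut the text out of the original line, such as the four-argument split above, keep the separators as they were.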
Finally, if your awk is not GNU awk but you want to preserve the field separators, you can use match and substr instead of split:
$ awk -F'[[:space:]]*\\|[[:space:]]*' -v id="10090" '
/scientific name/ && $1==id {
a=match($0,/\|/); b=match(substr($0,a+1),/[[:space:]]*\|/)
print substr($0,1,a+b-1)}' file
10090 | Mus musculus
We simply use match to find the index of the first | (a), then the position, counted from just after that |, of the whitespace preceding the second | (b), and print everything before that with substr.
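Since the question wants the id to come from a shell variable, here is a minimal sketch of that wiring. The function name sci_name and the file sample.dmp are hypothetical, introduced only for this example; the sample lines are copied from the question's input. The separator is again written as [ \t]*[|][ \t]* so that it does not depend on POSIX character class support.

```shell
# Hypothetical sample file with a few lines from the question's input.
cat > sample.dmp <<'EOF'
10090 | Mus musculus Linnaeus, 1758 | | authority |
10090 | Mus musculus | | scientific name |
10091 | Mus musculus castaneus | | scientific name |
100902 | Fusarium oxysporum f. sp. conglutinans | | scientific name |
EOF

# sci_name ID FILE (illustrative helper, not part of any tool):
# print "ID | name" for the scientific-name line whose first field equals ID.
sci_name() {
    awk -F'[ \t]*[|][ \t]*' -v id="$1" '
        /scientific name/ && $1 == id { print $1 " | " $2 }' "$2"
}

id=10090
sci_name "$id" sample.dmp
```

With the sample above this prints 10090 | Mus musculus and skips 100902, because $1 == id compares the whole field. If you would rather keep the regex form from the question, the anchors have to be concatenated around the variable, as in $1 ~ ("^" id "$"); a bare $1 ~ id matches any field that merely contains the id, which is how entries like 210090 and 100904 get through.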