[SOLVED] Parsing sdf file in bash

I found this code for parsing an SDF file, but it does not handle the whitespace inside tag names, so the Ki (nM) value is missing from the output.

My file looks like this:

> <Ligand InChI Key>
CPZBLNMUGSZIPR-NVXWUHKLSA-N

> <BindingDB MonomerID>
50417287

> <BindingDB Ligand Name>
Aloxi::Aurothioglucose::PALONOSETRON::PALONOSETRON HYDROCHLORIDE

> <Target Name Assigned by Curator or DataSource>
5-hydroxytryptamine receptor 3A

> <Target Source Organism According to Curator or DataSource>
Homo sapiens

> <Ki (nM)>
 0.0316

> <IC50 (nM)>


> <Kd (nM)>


> <EC50 (nM)>
---------------------------

awk -v  OFS='\t' '
    /^>/ { tag=$2; next }
    NF { f[tag]=$1 }
    $0 == "$$$$" {print f["<pH>"], f["<PMID>"], f["<Ki (nM)>"] }
' P46098.sdf

Thank you!

Solution

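Please try the match() function to extract the tag between < and > inclusive. With tag=$2, awk splits the header line on whitespace, so for the line > <Ki (nM)> the tag becomes just <Ki and f["<Ki (nM)>"] is never set.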

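awk -v OFS='\t' '
    # header line such as "> <Ki (nM)>": keep everything from < to >
    /^>/ { match($0, /<.+>/); tag = substr($0, RSTART, RLENGTH); next }
    # non-empty data line: store its first field under the current tag
    NF { f[tag]=$1 }
    # end of record: print the selected fields
    $0 == "$$$$" {print f["<pH>"], f["<PMID>"], f["<Ki (nM)>"] }
' P46098.sdf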
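
To see the difference on a single header line, here is a small sketch that runs both approaches on the Ki header from the question. The variable names tag_by_field and tag_by_match are only illustrative and are not part of the original script.

```shell
#!/bin/sh
# One header line copied from the sample file in the question.
line='> <Ki (nM)>'

# Original approach: $2 is the second whitespace-separated field,
# so the tag is cut off at the first space.
tag_by_field=$(printf '%s\n' "$line" | awk '{ print $2 }')

# match() approach: take everything from the first < to the last >.
tag_by_match=$(printf '%s\n' "$line" |
    awk '{ if (match($0, /<.+>/)) print substr($0, RSTART, RLENGTH) }')

printf 'field split: %s\n' "$tag_by_field"
printf 'match():     %s\n' "$tag_by_match"
```

The first line printed shows the truncated tag <Ki, the second the full tag <Ki (nM)>, which is the key the print statement looks up.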

The function match($0, /<.+>/) returns a non-zero value (the start position of the match) if the regex <.+> matches $0, and sets the awk variables RSTART and RLENGTH to the start position and the length of the matched substring.
The regex <.+> matches a substring that starts with < and ends with >. Because . also matches whitespace characters, the tag may contain spaces.
substr($0, RSTART, RLENGTH) returns the substring of $0 that starts at RSTART and is RLENGTH characters long. That substring is then assigned to the variable tag.
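
One thing to watch for, as far as I can tell: the array f is never emptied, so a record with an empty value (like the IC50 and Kd entries in the sample) would print the value left over from an earlier record. Below is a hedged variant that resets the array after each record. The two-record input is a made-up, simplified sample (no molfile block, placeholder pH and PMID values), not data from P46098.sdf.

```shell
#!/bin/sh
# Hypothetical sample: the second record has an empty Ki value.
sample=$(mktemp)
cat > "$sample" <<'EOF'
> <Ki (nM)>
 0.0316

> <pH>
7.4

> <PMID>
12345

$$$$
> <Ki (nM)>


> <pH>
7.0

> <PMID>
67890

$$$$
EOF

result=$(awk -v OFS='\t' '
    # header line: remember the full tag, including any spaces
    /^>/ { tag = ""; if (match($0, /<.+>/)) tag = substr($0, RSTART, RLENGTH); next }
    # end of record: print, then forget everything from this record
    $0 == "$$$$" {
        print f["<pH>"], f["<PMID>"], f["<Ki (nM)>"]
        split("", f)   # portable way to empty an array
        tag = ""
        next
    }
    # non-empty data line under a known tag
    NF && tag != "" { f[tag] = $1 }
' "$sample")
rm -f "$sample"

printf '%s\n' "$result"
```

With this sample the second output line ends with an empty Ki column instead of repeating 0.0316. Note that $1 still keeps only the first word of a value, which is fine for numeric fields but would shorten a value such as Homo sapiens.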