I found this code for parsing an SDF file, but I cannot get it to handle the whitespace inside the tag names, which is why the Ki (nM) value does not show in the output.
My file looks like this:
> <Ligand InChI Key>
CPZBLNMUGSZIPR-NVXWUHKLSA-N
> <BindingDB MonomerID>
50417287
> <BindingDB Ligand Name>
Aloxi::Aurothioglucose::PALONOSETRON::PALONOSETRON HYDROCHLORIDE
> <Target Name Assigned by Curator or DataSource>
5-hydroxytryptamine receptor 3A
> <Target Source Organism According to Curator or DataSource>
Homo sapiens
> <Ki (nM)>
0.0316
> <IC50 (nM)>
> <Kd (nM)>
> <EC50 (nM)>
---------------------------
awk -v OFS='\t' '
/^>/ { tag=$2; next }
NF { f[tag]=$1 }
$0 == "$$$$" {print f["<pH>"], f["<PMID>"], f["<Ki (nM)>"] }
' P46098.sdf
Thank you!
Please try the match() function to extract the tag between < and > inclusive.
awk -v OFS='\t' '
/^>/ { match($0, /<.+>/); tag = substr($0, RSTART, RLENGTH); next }
NF { f[tag]=$1 }
$0 == "$$$$" {print f["<pH>"], f["<PMID>"], f["<Ki (nM)>"] }
' P46098.sdf
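For context, a quick check of why the original tag=$2 loses part of the tag: awk splits each line on whitespace by default, so a tag that contains a space is spread over several fields. This is a minimal sketch on a single line taken from the sample; the variable names line, by_field and by_match are only illustrative.

```shell
line='> <Ki (nM)>'

# Default field splitting breaks the line at whitespace, so $2 holds only "<Ki".
by_field=$(printf '%s\n' "$line" | awk '{ print $2 }')

# match() sets RSTART and RLENGTH for the whole "<...>" span, spaces included.
by_match=$(printf '%s\n' "$line" |
    awk '{ if (match($0, /<.+>/)) print substr($0, RSTART, RLENGTH) }')

echo "$by_field"    # <Ki
echo "$by_match"    # <Ki (nM)>
```

Because the lookup key f["<Ki (nM)>"] never equals "<Ki", the original script printed an empty column for it.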
match($0, /<.+>/) returns a non-zero value if the regex <.+> matches $0, and sets the awk variables RSTART and RLENGTH to the start position and the length of the matched substring.
<.+> matches a substring which starts with < and ends with >. The substring may contain whitespace characters.
substr($0, RSTART, RLENGTH) returns the substring of $0 that starts at position RSTART and is RLENGTH characters long. That substring is then assigned to the variable tag.
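One thing to watch for, assuming the file holds several records: the array f is never cleared, so a tag with no value (such as <IC50 (nM)> in the sample) would print whatever an earlier record stored under that tag. The following is a minimal sketch that empties f after each record. The two-record sample and its PMID values are made up for illustration, and the temporary file stands in for P46098.sdf.

```shell
# Hypothetical two-record sample; the second record has no Ki value.
sdf=$(mktemp)
cat > "$sdf" <<'EOF'
> <Ki (nM)>
0.0316
> <PMID>
111
$$$$
> <Ki (nM)>
> <PMID>
222
$$$$
EOF

result=$(awk -v OFS='\t' '
/^>/ { match($0, /<.+>/); tag = substr($0, RSTART, RLENGTH); next }
$0 == "$$$$" {
    print f["<PMID>"], f["<Ki (nM)>"]
    split("", f)   # portable way to empty the array before the next record
    tag = ""       # forget the last tag so later lines are not filed under it
    next
}
NF { f[tag] = $1 }
' "$sdf")
rm -f "$sdf"

printf '%s\n' "$result"
```

With this sample the second output row has an empty Ki column instead of repeating 0.0316 from the first record.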