I found this code for parsing an SDF file, but it does not handle the whitespace inside tag names, which is why the Ki (nM) value is missing from the output.
My file looks like this:
> <Ligand InChI Key>
CPZBLNMUGSZIPR-NVXWUHKLSA-N
> <BindingDB MonomerID>
50417287
> <BindingDB Ligand Name>
Aloxi::Aurothioglucose::PALONOSETRON::PALONOSETRON HYDROCHLORIDE
> <Target Name Assigned by Curator or DataSource>
5-hydroxytryptamine receptor 3A
> <Target Source Organism According to Curator or DataSource>
Homo sapiens
> <Ki (nM)>
0.0316
> <IC50 (nM)>
> <Kd (nM)>
> <EC50 (nM)>
---------------------------
awk -v OFS='\t' '
/^>/ { tag=$2; next }
NF { f[tag]=$1 }
$0 == "$$$$" {print f["<pH>"], f["<PMID>"], f["<Ki (nM)>"] }
' P46098.sdf
Thank you!
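The problem is tag=$2: awk splits fields on whitespace, so for the line > <Ki (nM)> the tag becomes <Ki and never matches f["<Ki (nM)>"]. Try the match() function to extract the whole tag from < to >, inclusive: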
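awk -v OFS='\t' '
# Tag line: capture everything from "<" to ">" so spaces inside the tag survive.
/^>/ { match($0, /<.+>/); tag = substr($0, RSTART, RLENGTH); next }
# End of record: print the selected fields, then clear them so stale values do not leak into the next record.
$0 == "$$$$" { print f["<pH>"], f["<PMID>"], f["<Ki (nM)>"]; delete f; tag = ""; next }
# Any other non-empty line is stored as the value of the most recent tag.
NF { f[tag]=$1 }
' P46098.sdf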
match($0, /<.+>/) returns a non-zero value if the regex <.+> matches $0, and sets the awk variables RSTART and RLENGTH to the start position and the length of the matched substring.
<.+> matches a substring that starts with < and ends with >. The substring may contain whitespace characters.
substr($0, RSTART, RLENGTH) returns the substring of $0 that starts at RSTART and is RLENGTH characters long. That substring is then assigned to the variable tag.
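To see the difference on a single line, here is a small sketch that feeds one tag line from the sample record to awk twice: once using default field splitting, once using match(). The variable names line, by_field and by_match are only for illustration and are not part of the original script.

```shell
line='> <Ki (nM)>'   # one tag line taken from the sample record

# Default field splitting: $2 stops at the first space inside the tag.
by_field=$(printf '%s\n' "$line" | awk '{ print $2 }')

# match() sets RSTART and RLENGTH; substr() then returns the whole tag.
by_match=$(printf '%s\n' "$line" | awk '{ match($0, /<.+>/); print RSTART, RLENGTH, substr($0, RSTART, RLENGTH) }')

printf '%s\n' "$by_field"   # <Ki
printf '%s\n' "$by_match"   # 3 9 <Ki (nM)>
```

The tag starts at character 3 and is 9 characters long, so substr() returns <Ki (nM)> with the space intact, which is exactly the key used in f["<Ki (nM)>"].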