
Parsing an SDF file in bash


I found this code for parsing an SDF file, but it does not cope with whitespace inside the tag names, which is why the Ki (nM) value is not printed.

My file looks like this:


> <Ligand InChI Key>
CPZBLNMUGSZIPR-NVXWUHKLSA-N

> <BindingDB MonomerID>
50417287

> <BindingDB Ligand Name>
Aloxi::Aurothioglucose::PALONOSETRON::PALONOSETRON HYDROCHLORIDE

> <Target Name Assigned by Curator or DataSource>
5-hydroxytryptamine receptor 3A

> <Target Source Organism According to Curator or DataSource>
Homo sapiens

> <Ki (nM)>
 0.0316

> <IC50 (nM)>


> <Kd (nM)>


> <EC50 (nM)>
---------------------------
awk -v  OFS='\t' '
    /^>/ { tag=$2; next }
    NF { f[tag]=$1 }
    $0 == "$$$$" {print f["<pH>"], f["<PMID>"], f["<Ki (nM)>"] }
' P46098.sdf 

Thank you!


Solution

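  • The script sets tag=$2, but awk splits each line on whitespace, so for the header "> <Ki (nM)>" $2 is only "<Ki" and f["<Ki (nM)>"] is never filled. Try the match() function to extract the whole tag, from < to > inclusive.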

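    awk -v OFS='\t' '
        # Header line: keep everything from "<" to ">" as the tag, spaces included
        /^>/ { match($0, /<.+>/); tag = substr($0, RSTART, RLENGTH); next }
        # Non-blank line: store its first field as the value of the current tag
        NF { f[tag]=$1 }
        # End of record: print the selected fields, tab-separated
        $0 == "$$$$" { print f["<pH>"], f["<PMID>"], f["<Ki (nM)>"] }
    ' P46098.sdf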
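
The command above never clears the array f, so a field that is empty in one record can still show the value from an earlier record. It also keeps only the first word of each value. The following is a minimal sketch of one way to handle both, not a definitive implementation. The function name extract_fields is hypothetical, and the sketch assumes each value sits on the single line directly after its header and that the file uses Unix line endings.

```shell
#!/bin/sh
# Hypothetical helper: print pH, PMID and Ki (nM) for every record of an SDF file.
# Assumption: each value is on the first non-blank line right after its header.
extract_fields() {
    awk -v OFS='\t' '
        # Header such as "> <Ki (nM)>": remember the full tag, spaces included.
        /^>/ {
            tag = match($0, /<.+>/) ? substr($0, RSTART, RLENGTH) : ""
            next
        }
        # End of record: print the wanted fields, then forget this record.
        $0 == "$$$$" {
            print f["<pH>"], f["<PMID>"], f["<Ki (nM)>"]
            split("", f)    # empties the array in any awk
            tag = ""
            next
        }
        # Value line: trim surrounding blanks and keep the whole line.
        tag != "" && NF {
            val = $0
            gsub(/^[ \t]+|[ \t]+$/, "", val)
            f[tag] = val
            tag = ""        # ignore further lines until the next header
        }
    ' "$1"
}
```

It would be called as extract_fields P46098.sdf. Resetting tag after each value also keeps lines of the next molecule block from being stored under the last tag of the previous record.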