Tags: unix, awk, extract, bioinformatics, sdf

Extract molecules in order from SDF file according to IDs given in another file


I have an SDF file containing thousands of molecules, and I need to extract molecules from it according to their IDs, which are given in a simple one-column file. Here is an example of the SDF, file1.sdf:

MOL108108
  -Chem-8567890432

 15 15  0     0  0  0  0  0  0999 V2000
    6.1792   -2.6875    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0
    6.9542   -2.6875    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    5.4125   -2.7167    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    6.1667   -3.4667    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    6.1667   -1.9000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    2.7375   -3.4625    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.1000   -2.7667    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    3.1500   -4.1292    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    5.0542   -3.3792    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.0167   -2.0542    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.8792   -2.7542    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    7.2542   -3.7125    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    4.2500   -2.0792    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.2875   -3.4042    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9542   -3.4875    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
  1  3  1  0  0  0  0
  1  4  2  0  0  0  0
  1  5  2  0  0  0  0
  6  7  1  0  0  0  0
  7 11  1  0  0  0  0
  6  8  2  0  0  0  0
  3  9  1  0  0  0  0
  3 10  2  0  0  0  0
 11 13  2  0  0  0  0
  2 12  1  0  0  0  0
 10 13  1  0  0  0  0
  9 14  2  0  0  0  0
  6 15  1  0  0  0  0
 11 14  1  0  0  0  0
M  END
> <mol_id>
MOL108108

$$$$
MOL16520
  -Chem4051902312

 22 21  0     1  0  0  0  0  0999 V2000
    0.2750    0.1500    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   -0.2458   -0.1500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7917   -0.1500    0.0000 C   0  0  2  0  0  0  0  0  0  0  0  0
    1.3167    0.1458    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.7625    0.1583    0.0000 C   0  0  1  0  0  0  0  0  0  0  0  0
   -1.8083    0.1583    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7917   -0.7500    0.0000 C   0  0  3  0  0  0  0  0  0  0  0  0
   -1.2833   -0.1417    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.2458   -0.7500    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    1.3167    0.7458    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   -0.7625    0.7583    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.8000    0.7583    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    1.8292   -0.1542    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   -2.3208   -0.1417    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   -2.8375    0.1583    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   -0.6083    1.3333    0.0000 C   0  0  3  0  0  0  0  0  0  0  0  0
    1.3125   -1.0542    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7875   -1.3500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.2750   -1.0500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.3542    0.1458    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.0375    1.7583    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.0333    1.4875    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
  1  3  1  0  0  0  0
  3  4  1  0  0  0  0
  2  5  1  0  0  0  0
  6  8  1  0  0  0  0
  3  7  1  6  0  0  0
  5  8  1  0  0  0  0
  2  9  2  0  0  0  0
  4 10  2  0  0  0  0
  5 11  1  1  0  0  0
  6 12  2  0  0  0  0
  4 13  1  0  0  0  0
  6 14  1  0  0  0  0
 14 15  1  0  0  0  0
 11 16  1  0  0  0  0
  7 17  1  0  0  0  0
  7 18  1  0  0  0  0
  7 19  1  0  0  0  0
 13 20  1  0  0  0  0
 16 21  1  0  0  0  0
 16 22  1  0  0  0  0
M  END
> <mol_id>
MOL16520

$$$$
MOL55310
  -Chem04051902312

 11 11  0     0  0  0  0  0  0999 V2000
    6.7292   -1.5750    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0
    7.5542   -1.5750    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    6.7250   -2.4000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    6.7292   -0.7500    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    5.9125   -1.5917    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    7.9667   -0.8542    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    5.5167   -2.3292    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.4792   -0.8917    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.6542   -0.9167    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.6917   -2.3417    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.2625   -1.6375    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
  1  3  2  0  0  0  0
  1  4  2  0  0  0  0
  1  5  1  0  0  0  0
  2  6  1  0  0  0  0
  5  7  2  0  0  0  0
  5  8  1  0  0  0  0
  8  9  2  0  0  0  0
  7 10  1  0  0  0  0
  9 11  1  0  0  0  0
 10 11  2  0  0  0  0
M  END
> <mol_id>
MOL55310

$$$$

.........

And this is an example of the IDs file, file2:

MOL101103
MOL103108
MOL108108

I use awk: awk 'BEGIN{ORS="$$$$"}NR==FNR{a[$1];next}$1 in a' file2 RS="$" file1.sdf

However, the resulting output is not in the order I need. I want to extract the molecules from file1.sdf whose IDs are listed in file2, in the same order as in file2, so that the output is an SDF like this:

MOL101103
  -Chem-6789043209

12 12  0     0  0  0  0  0  0999 V2000
    5.5667   -2.7625    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0
    6.3292   -2.7625    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    4.8292   -2.7917    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.6292   -3.7167    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    5.5542   -2.0042    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    4.4375   -2.1500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.4792   -3.4375    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    6.7667   -3.9167    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    3.7417   -3.4542    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.6917   -2.1750    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.3500   -2.8292    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.5917   -2.8417    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
  1  3  1  0  0  0  0
  1  4  2  0  0  0  0
  1  5  2  0  0  0  0
  3  6  2  0  0  0  0
  3  7  1  0  0  0  0
  2  8  1  0  0  0  0
  7  9  2  0  0  0  0
  6 10  1  0  0  0  0
  9 11  1  0  0  0  0
 11 12  1  0  0  0  0
 10 11  2  0  0  0  0
M  END
> <mol_id>
MOL101103

$$$$
MOL103108
  -Chem-6789005434

14 14  0     0  0  0  0  0  0999 V2000
    5.9250   -2.8417    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0
    2.8875   -2.9292    0.0000 N   0  3  0  0  0  0  0  0  0  0  0  0
    6.6917   -2.8417    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    5.1667   -2.8750    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.6542   -2.9125    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.9167   -3.6167    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    5.9167   -2.0667    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    2.7042   -3.9042    0.0000 O   0  5  0  0  0  0  0  0  0  0  0  0
    2.4042   -2.1500    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    4.8167   -3.5292    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.7792   -2.2167    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.0167   -2.2417    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.0542   -3.5542    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    7.0125   -3.7792    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  2  5  1  0  0  0  0
  1  3  1  0  0  0  0
  1  4  1  0  0  0  0
  5 12  2  0  0  0  0
  1  6  2  0  0  0  0
  1  7  2  0  0  0  0
  2  8  1  0  0  0  0
  2  9  2  0  0  0  0
  4 10  1  0  0  0  0
  4 11  2  0  0  0  0
 11 12  1  0  0  0  0
 10 13  2  0  0  0  0
  3 14  1  0  0  0  0
  5 13  1  0  0  0  0
M  CHG  2   2   1   8  -1
M  END
> <mol_id>
MOL103108

$$$$
MOL108108
  -Chem-8567890432

12 12  0     0  0  0  0  0  0999 V2000
    5.8875   -2.8500    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0
    6.6500   -2.8500    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    5.1542   -2.8750    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.8792   -3.7292    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    5.8792   -2.0875    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    4.7542   -2.2292    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.8000   -3.5167    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.6667   -2.9125    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    6.9417   -3.8125    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    2.9125   -2.9292    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    4.0167   -2.2625    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.0667   -3.5417    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
  1  3  1  0  0  0  0
  1  4  2  0  0  0  0
  1  5  2  0  0  0  0
  3  6  2  0  0  0  0
  3  7  1  0  0  0  0
  8 12  1  0  0  0  0
  2  9  1  0  0  0  0
  8 10  1  0  0  0  0
  6 11  1  0  0  0  0
  7 12  2  0  0  0  0
  8 11  2  0  0  0  0
M  END
> <mol_id>
MOL108108

$$$$

......

So the first molecule of the output file is the first molecule of the ID file and so on. Thank you!


Solution

  • Adapting your original awk:

    awk 'BEGIN { RS = "\\$\\$\\$\\$\n"; ORS = "$$$$\n" }  # \n keeps every $$$$ on its own line
         NR == FNR { a[$1] = $0; next }                  # file1.sdf: store each record by molecule name
         ($1 in a) { print a[$1] }' file1.sdf RS="\n" file2
    

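    The idea is to read the SDF file into memory, record by record, using the $$$$ delimiter line as the record separator. Note that a multi-character RS is treated as a regular expression by gawk and mawk; POSIX awk only guarantees single-character record separators.

    For the first file (NR==FNR), we store each full record $0 in an array indexed by its first field, the molecule name (a[$1]=$0).

    The second file uses a plain newline as its record separator (RS="\n"). Every time we read an ID from it, we check whether that ID is a key of a and, if so, print the stored record. Because the printing is driven by the ID file, the output follows the order of the IDs rather than the order of the SDF.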
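To see the ordering behaviour on something small, here is a minimal sketch of the same two-pass approach on toy data. The file names (toy.sdf, ids.txt, ordered.sdf), the molecule names, and the one-line "bodies" are made up for illustration, and the "not found" message for missing IDs is an addition that is not part of the answer above. It assumes an awk that accepts a regular expression as RS (gawk or mawk).

```shell
# Toy SDF written to the current directory: three shortened records in the order C, A, B.
printf '%s\n' \
  'MOL_C' 'body C' '' '$$$$' \
  'MOL_A' 'body A' '' '$$$$' \
  'MOL_B' 'body B' '' '$$$$' > toy.sdf

# ID list: requested order is A, then C; MOL_X does not exist in toy.sdf.
printf '%s\n' 'MOL_A' 'MOL_X' 'MOL_C' > ids.txt

# Pass 1 (toy.sdf): cache every record under its molecule name.
# Pass 2 (ids.txt): print cached records in ID order; report unknown IDs on stderr.
awk 'BEGIN { RS = "\\$\\$\\$\\$\n"; ORS = "$$$$\n" }
     NR == FNR   { rec[$1] = $0; next }
     ($1 in rec) { print rec[$1]; next }
     NF          { printf "not found: %s\n", $1 > "/dev/stderr" }' \
    toy.sdf RS='\n' ids.txt > ordered.sdf

cat ordered.sdf
```

The output contains the MOL_A record first and the MOL_C record second, each terminated by its own $$$$ line, even though MOL_C comes first in toy.sdf. Since the whole SDF is held in memory, this is fine for thousands of molecules but may need rethinking for very large files.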