I have an SD file (SDF) containing thousands of molecules, and I need to extract molecules from it according to their IDs, which are listed in a simple one-column file. Here is an example of the SDF, file1.sdf:
MOL108108
-Chem-8567890432
15 15 0 0 0 0 0 0 0999 V2000
6.1792 -2.6875 0.0000 S 0 0 0 0 0 0 0 0 0 0 0 0
6.9542 -2.6875 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
5.4125 -2.7167 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
6.1667 -3.4667 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
6.1667 -1.9000 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
2.7375 -3.4625 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.1000 -2.7667 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
3.1500 -4.1292 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
5.0542 -3.3792 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.0167 -2.0542 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.8792 -2.7542 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
7.2542 -3.7125 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
4.2500 -2.0792 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.2875 -3.4042 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.9542 -3.4875 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
1 3 1 0 0 0 0
1 4 2 0 0 0 0
1 5 2 0 0 0 0
6 7 1 0 0 0 0
7 11 1 0 0 0 0
6 8 2 0 0 0 0
3 9 1 0 0 0 0
3 10 2 0 0 0 0
11 13 2 0 0 0 0
2 12 1 0 0 0 0
10 13 1 0 0 0 0
9 14 2 0 0 0 0
6 15 1 0 0 0 0
11 14 1 0 0 0 0
M END
> <mol_id>
MOL108108
$$$$
MOL16520
-Chem4051902312
22 21 0 1 0 0 0 0 0999 V2000
0.2750 0.1500 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
-0.2458 -0.1500 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.7917 -0.1500 0.0000 C 0 0 2 0 0 0 0 0 0 0 0 0
1.3167 0.1458 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.7625 0.1583 0.0000 C 0 0 1 0 0 0 0 0 0 0 0 0
-1.8083 0.1583 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.7917 -0.7500 0.0000 C 0 0 3 0 0 0 0 0 0 0 0 0
-1.2833 -0.1417 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.2458 -0.7500 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
1.3167 0.7458 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
-0.7625 0.7583 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-1.8000 0.7583 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
1.8292 -0.1542 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
-2.3208 -0.1417 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
-2.8375 0.1583 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
-0.6083 1.3333 0.0000 C 0 0 3 0 0 0 0 0 0 0 0 0
1.3125 -1.0542 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.7875 -1.3500 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.2750 -1.0500 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.3542 0.1458 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-1.0375 1.7583 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.0333 1.4875 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
1 3 1 0 0 0 0
3 4 1 0 0 0 0
2 5 1 0 0 0 0
6 8 1 0 0 0 0
3 7 1 6 0 0 0
5 8 1 0 0 0 0
2 9 2 0 0 0 0
4 10 2 0 0 0 0
5 11 1 1 0 0 0
6 12 2 0 0 0 0
4 13 1 0 0 0 0
6 14 1 0 0 0 0
14 15 1 0 0 0 0
11 16 1 0 0 0 0
7 17 1 0 0 0 0
7 18 1 0 0 0 0
7 19 1 0 0 0 0
13 20 1 0 0 0 0
16 21 1 0 0 0 0
16 22 1 0 0 0 0
M END
> <mol_id>
MOL16520
$$$$
MOL55310
-Chem04051902312
11 11 0 0 0 0 0 0 0999 V2000
6.7292 -1.5750 0.0000 S 0 0 0 0 0 0 0 0 0 0 0 0
7.5542 -1.5750 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
6.7250 -2.4000 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
6.7292 -0.7500 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
5.9125 -1.5917 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
7.9667 -0.8542 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
5.5167 -2.3292 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.4792 -0.8917 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.6542 -0.9167 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.6917 -2.3417 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.2625 -1.6375 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
1 3 2 0 0 0 0
1 4 2 0 0 0 0
1 5 1 0 0 0 0
2 6 1 0 0 0 0
5 7 2 0 0 0 0
5 8 1 0 0 0 0
8 9 2 0 0 0 0
7 10 1 0 0 0 0
9 11 1 0 0 0 0
10 11 2 0 0 0 0
M END
> <mol_id>
MOL55310
$$$$
.........
And this is an example of the IDs file, file2:
MOL101103
MOL103108
MOL108108
I use awk:
awk 'BEGIN{ORS="$$$$"}NR==FNR{a[$1];next}$1 in a' file2 RS="$" file1.sdf
but the resulting output is not in the order I need. I need to extract the molecules from file1.sdf that correspond to the IDs in file2, in the same order as in file2, so that the output is an SDF like this:
MOL101103
-Chem-6789043209
12 12 0 0 0 0 0 0 0999 V2000
5.5667 -2.7625 0.0000 S 0 0 0 0 0 0 0 0 0 0 0 0
6.3292 -2.7625 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
4.8292 -2.7917 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.6292 -3.7167 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
5.5542 -2.0042 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
4.4375 -2.1500 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.4792 -3.4375 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
6.7667 -3.9167 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
3.7417 -3.4542 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.6917 -2.1750 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.3500 -2.8292 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.5917 -2.8417 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
1 3 1 0 0 0 0
1 4 2 0 0 0 0
1 5 2 0 0 0 0
3 6 2 0 0 0 0
3 7 1 0 0 0 0
2 8 1 0 0 0 0
7 9 2 0 0 0 0
6 10 1 0 0 0 0
9 11 1 0 0 0 0
11 12 1 0 0 0 0
10 11 2 0 0 0 0
M END
> <mol_id>
MOL101103
$$$$
MOL103108
-Chem-6789005434
14 14 0 0 0 0 0 0 0999 V2000
5.9250 -2.8417 0.0000 S 0 0 0 0 0 0 0 0 0 0 0 0
2.8875 -2.9292 0.0000 N 0 3 0 0 0 0 0 0 0 0 0 0
6.6917 -2.8417 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
5.1667 -2.8750 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.6542 -2.9125 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.9167 -3.6167 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
5.9167 -2.0667 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
2.7042 -3.9042 0.0000 O 0 5 0 0 0 0 0 0 0 0 0 0
2.4042 -2.1500 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
4.8167 -3.5292 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.7792 -2.2167 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.0167 -2.2417 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.0542 -3.5542 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
7.0125 -3.7792 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
2 5 1 0 0 0 0
1 3 1 0 0 0 0
1 4 1 0 0 0 0
5 12 2 0 0 0 0
1 6 2 0 0 0 0
1 7 2 0 0 0 0
2 8 1 0 0 0 0
2 9 2 0 0 0 0
4 10 1 0 0 0 0
4 11 2 0 0 0 0
11 12 1 0 0 0 0
10 13 2 0 0 0 0
3 14 1 0 0 0 0
5 13 1 0 0 0 0
M CHG 2 2 1 8 -1
M END
> <mol_id>
MOL103108
$$$$
MOL108108
-Chem-8567890432
12 12 0 0 0 0 0 0 0999 V2000
5.8875 -2.8500 0.0000 S 0 0 0 0 0 0 0 0 0 0 0 0
6.6500 -2.8500 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
5.1542 -2.8750 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.8792 -3.7292 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
5.8792 -2.0875 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
4.7542 -2.2292 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.8000 -3.5167 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.6667 -2.9125 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
6.9417 -3.8125 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
2.9125 -2.9292 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
4.0167 -2.2625 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.0667 -3.5417 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
1 3 1 0 0 0 0
1 4 2 0 0 0 0
1 5 2 0 0 0 0
3 6 2 0 0 0 0
3 7 1 0 0 0 0
8 12 1 0 0 0 0
2 9 1 0 0 0 0
8 10 1 0 0 0 0
6 11 1 0 0 0 0
7 12 2 0 0 0 0
8 11 2 0 0 0 0
M END
> <mol_id>
MOL108108
$$$$
......
So the first molecule of the output file is the first molecule of the ID file and so on. Thank you!
Adapting your original awk:
awk 'BEGIN{RS="\\$\\$\\$\\$"; ORS="$$$$"}
(NR==FNR){a[$1]=$0; next}
($1 in a) { print a[$1] }' file1.sdf RS="\n" file2
The idea is to read the SDF file into memory, record by record.
The record separator is $$$$. You can set this in GNU awk as RS="\\$\\$\\$\\$". Here you need to escape the $, as it has a special meaning in a regex (it anchors the match to the end). There is a double escape going on: the first escape is consumed by awk's lexical analyzer, which converts \\$ into \$, and that is then the properly escaped $ for the regex.
The output record separator (the one used when printing records) is just ORS="$$$$". Here we do not need to escape it, as it is a plain string and not a regex.
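To see the difference between the two separators in isolation, here is a small sketch. It assumes an awk that treats a multi-character RS as a regex (GNU awk and mawk do); the file demo.sdf and its two dummy records are made up for illustration and are not real molecules.

```sh
tmp=$(mktemp -d)

# Two dummy records, each terminated by a $$$$ line (hypothetical sample data).
printf 'A\n1\n$$$$\nB\n2\n$$$$\n' > "$tmp/demo.sdf"

# RS is a regex, so every $ is escaped; ORS is a plain string and is printed as-is.
# The NF guard skips the empty record formed by the newline after the last $$$$.
ids=$(awk 'BEGIN { RS = "\\$\\$\\$\\$"; ORS = "$$$$" } NF { print $1 }' "$tmp/demo.sdf")

printf '%s\n' "$ids"
```

Each record's first field is printed, followed by the literal ORS string, so the IDs come out joined by $$$$.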
For the first file (NR==FNR), we store the full record $0 in an array indexed by the first field, the molecule name (a[$1]=$0).
The second file has a normal record separator, the newline (RS="\n"). So every time we read a record (one ID), we check whether it is an index of a and, if so, print the stored record.
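One detail to be aware of: with RS="\\$\\$\\$\\$", every stored record after the first one starts with the newline that followed the previous $$$$, and the output does not end with a newline. A possible variant, sketched below, puts the newline into both RS and ORS to avoid that. It assumes every $$$$ line in the SDF is followed by a newline and, again, an awk with regex RS support (GNU awk, mawk). The three short records are stand-ins for real molecule blocks, and MOLX is a deliberately missing ID.

```sh
tmp=$(mktemp -d)

# Stand-in records (not real molecules), stored in the order C, A, B.
printf 'MOLC\nblock C\n$$$$\nMOLA\nblock A\n$$$$\nMOLB\nblock B\n$$$$\n' > "$tmp/file1.sdf"

# IDs in the order wanted in the output; MOLX is absent from the SDF on purpose.
printf 'MOLA\nMOLX\nMOLC\n' > "$tmp/file2"

# The newline is part of RS and ORS, so stored records carry no stray leading newline.
result=$(awk 'BEGIN { RS = "\\$\\$\\$\\$\n"; ORS = "$$$$\n" }
NR == FNR { a[$1] = $0; next }   # first file: store each record under its ID
($1 in a) { print a[$1] }        # second file: emit records in ID-file order
' "$tmp/file1.sdf" RS="\n" "$tmp/file2")

printf '%s\n' "$result"
```

The output follows the order of file2 (MOLA, then MOLC), and IDs that do not occur in the SDF are skipped silently. As in the original approach, the whole SDF is held in memory, which should be fine for thousands of molecules.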