I wrote a bash script in order to split a file. The file looks like this:
@<TRIPOS>MOLECULE
ZINC32514653
....
....
@<TRIPOS>MOLECULE
ZINC982347645
....
....
Here is the script I wrote:
#!/bin/bash
#split the file into files named xx##.mol2
csplit -b '%02d.mol2' ./Zincpharmer_ligprep_1.mol2 '/@<TRIPOS>MOLECULE/' '{*}'
#rename all files called xx##.mol2 by their 2nd line which is ZINC######
for filename in ./xx*.mol2;
do
    #read the 2nd line of the piece; quoting protects against spaces in names
    newFilename=$(sed -n 2p "$filename")
    if [ ! -e "./$newFilename.mol2" ]; then
        mv -i "$filename" "./$newFilename.mol2"
    else
        #name already taken: append _2, _3, ... until a free name is found
        num=2
        while [ -e "./${newFilename}_$num.mol2" ]; do
            num=$((num+1))
        done
        mv "$filename" "./${newFilename}_$num.mol2"
    fi
done
I have two questions:
1) Is there a way to pass the prefix option to csplit and tell it that the prefix is the line after the separator?
2) The first file created by csplit, xx00, is empty because the separator is on the first line. How can I avoid this?
The expected output would be files named ZINC32514653.mol2 and ZINC982347645.mol2, and, in case there are two entries with the same ZINC### ID, ZINC982347645_2.mol2.
All you need to know is available from the man csplit page:-
To tell csplit to change the prefix:-
-f, --prefix=PREFIX
use PREFIX instead of 'xx'
To exclude empty files:-
-z, --elide-empty-files
remove empty output files
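Note that -f only takes a fixed string, so csplit cannot by itself name each piece after a line of its content; a rename step is still needed. A minimal sketch combining both options with the rename loop from the question, assuming GNU csplit (the -z and -b options are GNU extensions) and that the second line of every piece is a usable file name. The function name split_mol2 and the prefix part_ are made up for illustration:

```shell
#!/bin/bash
# Sketch: split a multi-molecule mol2 file in the current directory and
# name each piece after its 2nd line. split_mol2 and the "part_" prefix
# are hypothetical names, not something csplit provides.
split_mol2() {
    local input=$1 piece name target num

    # -z skips the empty first piece, -f replaces the default "xx" prefix,
    # -s silences the per-file byte counts. The zero-padded suffix keeps
    # the glob below in the same order as the molecules in the input.
    csplit -s -z -f part_ -b '%06d.mol2' "$input" '/@<TRIPOS>MOLECULE/' '{*}'

    for piece in part_*.mol2; do
        name=$(sed -n 2p "$piece")
        target="$name.mol2"
        # If the name is taken, try NAME_2.mol2, NAME_3.mol2, ...
        num=2
        while [ -e "$target" ]; do
            target="${name}_$num.mol2"
            num=$((num + 1))
        done
        mv -- "$piece" "$target"
    done
}
```

Calling split_mol2 ./Zincpharmer_ligprep_1.mol2 from the directory where the output should go would then leave only the ZINC-named files behind.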