I have a very long file (yes, this is DNA in FASTA format) that is actually several files concatenated together and written to stdout. E.g.:
>id1
ACGT
>id2
GTAC
=
>id3
ACGT
=
>id4
ACCGT
>id6
AACCGT
I want to split this stream according to a pattern (here shown as =) and perform actions on each piece individually.
I've looked into something like
myprogram | while read -d = STRING; do
    # do something
done
but I'm concerned that putting a large amount of data into a shell variable will be very inefficient. In addition, I've read that read itself is inefficient.
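For what it's worth, the loop I was considering would look like this in full (a sketch, assuming bash, since read -d is a bashism; the printf is a placeholder action):

myprogram | while IFS= read -r -d '=' chunk; do
    # "$chunk" holds one piece, without the trailing separator
    printf '%s\n' "$chunk"    # placeholder: act on the piece here
done

Note that if the stream does not end with a separator, read returns non-zero for the final piece, so that piece would need an extra check after the loop.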
I'd like to find something like csplit that outputs the pieces into a loop, but I couldn't come up with anything smart. Ideally something like this very bad pseudocode:
myprogram | csplit - '=' | while csplit_outputs; do
    # do something with csplit_outputs
done
I'd like to avoid writing temporary files as well, as I fear that would also be very inefficient.
Does that make any sense?
Any help appreciated!
I would use awk and set the record separator to =.
awk '{do something}' RS='=' input.file
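Applied to the stream from your question, that looks like this (a sketch: process_chunk is a hypothetical placeholder for whatever command you want to run on each piece):

myprogram | awk 'BEGIN { RS = "=" } {
    # $0 now holds one whole chunk between separators
    print $0 | "process_chunk"   # feed the chunk to the per-piece command
    close("process_chunk")       # close the pipe so the next chunk starts a fresh process
}'

Closing the pipe after each record is what makes process_chunk run once per piece instead of once for the whole stream, and nothing is ever written to a temporary file.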