bash, loops, large-files, csplit

Split a long file (on stdout) according to a pattern and feed each piece into a loop


I have a very long file (yes, this is DNA in FASTA format) that is actually several files concatenated together and written to stdout. E.g.:

>id1
ACGT
>id2
GTAC
=
>id3
ACGT
=
>id4
ACCGT
>id6
AACCGT

I want to split this stream according to a pattern (here shown as =) and perform actions on each piece individually.

I've looked into something like

myprogram | while read -d = STRING; do 
  # do something
done
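
For reference, that sketch can be made runnable as below (with printf standing in for myprogram). One gotcha: the final chunk has no trailing =, so read returns non-zero for it, and without the || [ -n "$chunk" ] guard it would be silently dropped.

```shell
# Runnable version of the read-based sketch; printf stands in for
# myprogram. The || guard keeps the final chunk, which has no
# trailing '=' and therefore makes read return non-zero.
printf '>id1\nACGT\n=\n>id2\nGG\n' |
while IFS= read -r -d '=' chunk || [ -n "$chunk" ]; do
  printf 'got a chunk of %d characters\n' "${#chunk}"
done
```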

but I'm concerned that putting a large amount of data into a variable will be very inefficient. In addition, I've read that read itself is inefficient.

I'd like to find something like csplit that outputs the pieces into a loop, but I couldn't come up with something smart. Ideally something like this very bad pseudocode:

myprogram | csplit - '=' | while csplit_outputs; do
  # do something with csplit_outputs
done

I'd like to avoid writing temporary files as well, as I fear it will also be very inefficient.
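
One way to get what the pseudocode is after, without temporary files, is to let awk stream each =-delimited record into a separate command invocation over a pipe. This is only a sketch: wc -c stands in for the real per-piece action, and printf for myprogram.

```shell
# Sketch: stream each '='-delimited piece into its own command
# invocation, with no temporary files. 'wc -c' stands in for the
# real per-piece action, and printf for myprogram.
printf '>id1\nACGT\n=\n>id3\nACGT\n' |
awk 'BEGIN { RS = "=" } {
  cmd = "wc -c"          # command run once per piece
  printf "%s", $0 | cmd  # feed this piece to it
  close(cmd)             # flush and restart for the next piece
}'
```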

Does that make any sense?

Any help appreciated!


Solution

  • I would use awk and set the record separator to =:

    awk '{do something}' RS='=' input.file
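
    Applied to the sample input above (with printf standing in for myprogram), and using a BEGIN block so the separator is set even when reading from a pipe:

```shell
# Each '='-delimited piece becomes one awk record; NR numbers them.
# printf stands in for myprogram.
printf '>id1\nACGT\n>id2\nGTAC\n=\n>id3\nACGT\n=\n>id4\nACCGT\n>id6\nAACCGT\n' |
awk 'BEGIN { RS = "=" } { printf "--- piece %d ---%s", NR, $0 }'
```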