bash macos

How to split a text file into multiple files by a specific pattern in the terminal?


I have a large text file containing FASTA sequences (basically text) for multiple genes. I would like to split it into multiple files, each named after the gene in its header line.

The structure of the file looks like this:

file1.txt

>PDGFRB|ENST00000522466.1
TCAGTCATCCTTTCCCTCTCTAGCCCCCTACCCTATCCCCAAGCTGAAGTGCTAGTGGCT
GGTGGTGACTTCCCCAGACCTAAGCCAATCTCTCTCTACCAGTGTCATCCATCAACGTCT
CTGTGAACGCAGTGCAGACTGTGGTCCGCCAGGGTGAGAACATCACCCTCATGTGCATTG
TGATCGGGAATGAGGTGGTCAACTTCGAGTGGACATACCCCCGCAAAGAAGTAATGTGGG
GCCAGGCAGGGGTCGGAGGAGGGGCCAGGAACGGGTGGATATCTGGCTTGCAGGCTGATT
TCTCCCCGGCCCCTCCTGATTTGGGGGGCCTGCCCAACCTGTTGCTGCAGAGTGGGCGGC
TGGTGGAGCCGGTGACTGACTTCCTCTTGGATATGCCTTACCACATCCGCTCCATC
>DGAT2|ENST00000604935.5
AGAAAGGCCGGGCGCGGCGAGGCTGGGCGCTGGGCGGCTGCGGCGCGCGGTGCGCGGTGC
GTAGTCTGGAGCTATGGTGGTGGTGGCAGCCGCGCCGAACCCGGCCGACGGGACCCCTAA
AGTTCTGCTTCTGTCGGGGCAGCCCGCCTCCGCCGCCGGAGCCCCGGCCGGCCAGGCCCT
GCCGCTCATGGTGCCAGCCCAGAGAGGGGCCAGCCCGGAGGCAGCGAGCGGGGGGCTGCC
CCAGGCGCGCAAGCGACAGCGCCTCACGCACCTGAGCCCCGAGGAGAAGGCGCTGAGGAG
GTGGGCGAGGGGCCGGGGTCTGGGGCCAGATCTGAAGCCGGGACTAGGGACAGGGGCAGG

I want two files with the outputs as:

PDGFRB|ENST00000522466.1.txt

>PDGFRB|ENST00000522466.1
TCAGTCATCCTTTCCCTCTCTAGCCCCCTACCCTATCCCCAAGCTGAAGTGCTAGTGGCT
GGTGGTGACTTCCCCAGACCTAAGCCAATCTCTCTCTACCAGTGTCATCCATCAACGTCT
CTGTGAACGCAGTGCAGACTGTGGTCCGCCAGGGTGAGAACATCACCCTCATGTGCATTG
TGATCGGGAATGAGGTGGTCAACTTCGAGTGGACATACCCCCGCAAAGAAGTAATGTGGG
GCCAGGCAGGGGTCGGAGGAGGGGCCAGGAACGGGTGGATATCTGGCTTGCAGGCTGATT
TCTCCCCGGCCCCTCCTGATTTGGGGGGCCTGCCCAACCTGTTGCTGCAGAGTGGGCGGC
TGGTGGAGCCGGTGACTGACTTCCTCTTGGATATGCCTTACCACATCCGCTCCATC

and DGAT2|ENST00000604935.5.txt

>DGAT2|ENST00000604935.5
AGAAAGGCCGGGCGCGGCGAGGCTGGGCGCTGGGCGGCTGCGGCGCGCGGTGCGCGGTGC
GTAGTCTGGAGCTATGGTGGTGGTGGCAGCCGCGCCGAACCCGGCCGACGGGACCCCTAA
AGTTCTGCTTCTGTCGGGGCAGCCCGCCTCCGCCGCCGGAGCCCCGGCCGGCCAGGCCCT
GCCGCTCATGGTGCCAGCCCAGAGAGGGGCCAGCCCGGAGGCAGCGAGCGGGGGGCTGCC
CCAGGCGCGCAAGCGACAGCGCCTCACGCACCTGAGCCCCGAGGAGAAGGCGCTGAGGAG
GTGGGCGAGGGGCCGGGGTCTGGGGCCAGATCTGAAGCCGGGACTAGGGACAGGGGCAGG

I tried the script below. It splits the file, but it does not save the pieces into separate files named after the genes. It also gives the error 'ambiguous redirect'.

#!/bin/bash

IFS=">" read -r -d '' -a my_array < file1.txt

for element in "${my_array[@]}";
do
    gene_name=$(echo "$element" | awk '{print $1}')
    gene_name=$(echo "$gene_name" | cut -d $'\n' -f 1)
    echo "$gene_name"
    echo $"element" > $gene_name.txt
done

Solution

  • Did you consider awk for this task?

    awk -F'\n' -v RS='>' '
        FNR > 1 {
          outFile = $1 ".txt";
          printf("%s", RS $0) > outFile;
          close(outFile);
        }
    ' file1.txt
    

    The idea is to read the input file using > as the record separator (instead of the linefeed character). Each record then contains the header (stripped of its leading >) on its first line and the whole sequence on the remaining lines. Setting the field separator to the linefeed makes $1 the header, so building the output filename is straightforward.

    The very first record is whatever precedes the first >, so it is expected to be empty (or to contain comments). The condition FNR > 1 skips it.
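To see how awk carves up the input with these settings, here is a minimal sketch that only prints the record number and the first field of each record. The two-gene input (GENE1|TX1, GENE2|TX2) is made up for illustration and is not the file from the question.

```shell
# Hypothetical two-record input, read with the same -F and RS settings as above.
out=$(printf '>GENE1|TX1\nACGT\nTTGA\n>GENE2|TX2\nGGCC\n' |
    awk -F'\n' -v RS='>' '{ printf("record %d: header=[%s]\n", FNR, $1) }')
printf '%s\n' "$out"
# record 1: header=[]
# record 2: header=[GENE1|TX1]
# record 3: header=[GENE2|TX2]
```

Record 1 is the empty text before the first >, which is why the main script starts writing files only from the second record on.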


    ASIDE

    Not that it is wrong, but do you really want to keep the | in the filenames?
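If you would rather not have | in the filenames, one possible variation is to sanitize the name before using it. The sketch below is only an illustration under a few assumptions: the directory fasta_demo, the file sample.fa, the variables outDir and name, and the choice of _ as the replacement character are all made up here. Only the filename changes; the header line written into each file keeps its |.

```shell
# Hypothetical demo directory and sample input; substitute your own FASTA file.
mkdir -p fasta_demo
printf '>GENE1|TX1\nACGT\nTTGA\n>GENE2|TX2\nGGCC\n' > fasta_demo/sample.fa

awk -F'\n' -v RS='>' -v outDir='fasta_demo' '
    FNR > 1 {
        name = $1                           # header line without the leading >
        gsub(/[^A-Za-z0-9._-]/, "_", name)  # replace | and other unusual characters
        outFile = outDir "/" name ".txt"
        printf("%s", RS $0) > outFile       # the header inside the file is unchanged
        close(outFile)
    }
' fasta_demo/sample.fa

ls fasta_demo
```

With this input the directory should end up containing GENE1_TX1.txt and GENE2_TX2.txt next to sample.fa. Note that two headers which differ only in the replaced characters would map to the same filename, and the later record would overwrite the earlier one.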