bash macos

How to split a text file into multiple files by a specific pattern in the terminal?


I have a large text file containing FASTA sequences (basically text) for multiple genes. I would like to split it into multiple files, one per gene, with each file named after its gene.

The structure of the file looks like this:

file1.txt

>PDGFRB|ENST00000522466.1
TCAGTCATCCTTTCCCTCTCTAGCCCCCTACCCTATCCCCAAGCTGAAGTGCTAGTGGCT
GGTGGTGACTTCCCCAGACCTAAGCCAATCTCTCTCTACCAGTGTCATCCATCAACGTCT
CTGTGAACGCAGTGCAGACTGTGGTCCGCCAGGGTGAGAACATCACCCTCATGTGCATTG
TGATCGGGAATGAGGTGGTCAACTTCGAGTGGACATACCCCCGCAAAGAAGTAATGTGGG
GCCAGGCAGGGGTCGGAGGAGGGGCCAGGAACGGGTGGATATCTGGCTTGCAGGCTGATT
TCTCCCCGGCCCCTCCTGATTTGGGGGGCCTGCCCAACCTGTTGCTGCAGAGTGGGCGGC
TGGTGGAGCCGGTGACTGACTTCCTCTTGGATATGCCTTACCACATCCGCTCCATC
>DGAT2|ENST00000604935.5
AGAAAGGCCGGGCGCGGCGAGGCTGGGCGCTGGGCGGCTGCGGCGCGCGGTGCGCGGTGC
GTAGTCTGGAGCTATGGTGGTGGTGGCAGCCGCGCCGAACCCGGCCGACGGGACCCCTAA
AGTTCTGCTTCTGTCGGGGCAGCCCGCCTCCGCCGCCGGAGCCCCGGCCGGCCAGGCCCT
GCCGCTCATGGTGCCAGCCCAGAGAGGGGCCAGCCCGGAGGCAGCGAGCGGGGGGCTGCC
CCAGGCGCGCAAGCGACAGCGCCTCACGCACCTGAGCCCCGAGGAGAAGGCGCTGAGGAG
GTGGGCGAGGGGCCGGGGTCTGGGGCCAGATCTGAAGCCGGGACTAGGGACAGGGGCAGG

I want two files with the outputs as:

PDGFRB|ENST00000522466.1.txt

>PDGFRB|ENST00000522466.1
TCAGTCATCCTTTCCCTCTCTAGCCCCCTACCCTATCCCCAAGCTGAAGTGCTAGTGGCT
GGTGGTGACTTCCCCAGACCTAAGCCAATCTCTCTCTACCAGTGTCATCCATCAACGTCT
CTGTGAACGCAGTGCAGACTGTGGTCCGCCAGGGTGAGAACATCACCCTCATGTGCATTG
TGATCGGGAATGAGGTGGTCAACTTCGAGTGGACATACCCCCGCAAAGAAGTAATGTGGG
GCCAGGCAGGGGTCGGAGGAGGGGCCAGGAACGGGTGGATATCTGGCTTGCAGGCTGATT
TCTCCCCGGCCCCTCCTGATTTGGGGGGCCTGCCCAACCTGTTGCTGCAGAGTGGGCGGC
TGGTGGAGCCGGTGACTGACTTCCTCTTGGATATGCCTTACCACATCCGCTCCATC

and DGAT2|ENST00000604935.5.txt

>DGAT2|ENST00000604935.5
AGAAAGGCCGGGCGCGGCGAGGCTGGGCGCTGGGCGGCTGCGGCGCGCGGTGCGCGGTGC
GTAGTCTGGAGCTATGGTGGTGGTGGCAGCCGCGCCGAACCCGGCCGACGGGACCCCTAA
AGTTCTGCTTCTGTCGGGGCAGCCCGCCTCCGCCGCCGGAGCCCCGGCCGGCCAGGCCCT
GCCGCTCATGGTGCCAGCCCAGAGAGGGGCCAGCCCGGAGGCAGCGAGCGGGGGGCTGCC
CCAGGCGCGCAAGCGACAGCGCCTCACGCACCTGAGCCCCGAGGAGAAGGCGCTGAGGAG
GTGGGCGAGGGGCCGGGGTCTGGGGCCAGATCTGAAGCCGGGACTAGGGACAGGGGCAGG

I tried the script below. It splits the text, but it does not save the records into separate files named after the genes, and it fails with the error 'ambiguous redirect'.

#!/bin/bash

IFS=">" read -r -d '' -a my_array < file1.txt

for element in "${my_array[@]}";
do
    gene_name=$(echo "$element" | awk '{print $1}')
    gene_name=$(echo "$gene_name" | cut -d $'\n' -f 1)
    echo "$gene_name"
    echo $"element" > $gene_name.txt
done

Solution

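  • Using any awk. With '>' as the field separator, only header lines have more than one field, so each header closes the previous output file and switches to a new one named after the gene: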

    $ awk -F'>' 'NF>1{ close(out); out=$2".txt" } { print > out }' file1.txt
    

    $ head *\|*
    ==> DGAT2|ENST00000604935.5.txt <==
    >DGAT2|ENST00000604935.5
    AGAAAGGCCGGGCGCGGCGAGGCTGGGCGCTGGGCGGCTGCGGCGCGCGGTGCGCGGTGC
    GTAGTCTGGAGCTATGGTGGTGGTGGCAGCCGCGCCGAACCCGGCCGACGGGACCCCTAA
    AGTTCTGCTTCTGTCGGGGCAGCCCGCCTCCGCCGCCGGAGCCCCGGCCGGCCAGGCCCT
    GCCGCTCATGGTGCCAGCCCAGAGAGGGGCCAGCCCGGAGGCAGCGAGCGGGGGGCTGCC
    CCAGGCGCGCAAGCGACAGCGCCTCACGCACCTGAGCCCCGAGGAGAAGGCGCTGAGGAG
    GTGGGCGAGGGGCCGGGGTCTGGGGCCAGATCTGAAGCCGGGACTAGGGACAGGGGCAGG
    
    ==> PDGFRB|ENST00000522466.1.txt <==
    >PDGFRB|ENST00000522466.1
    TCAGTCATCCTTTCCCTCTCTAGCCCCCTACCCTATCCCCAAGCTGAAGTGCTAGTGGCT
    GGTGGTGACTTCCCCAGACCTAAGCCAATCTCTCTCTACCAGTGTCATCCATCAACGTCT
    CTGTGAACGCAGTGCAGACTGTGGTCCGCCAGGGTGAGAACATCACCCTCATGTGCATTG
    TGATCGGGAATGAGGTGGTCAACTTCGAGTGGACATACCCCCGCAAAGAAGTAATGTGGG
    GCCAGGCAGGGGTCGGAGGAGGGGCCAGGAACGGGTGGATATCTGGCTTGCAGGCTGATT
    TCTCCCCGGCCCCTCCTGATTTGGGGGGCCTGCCCAACCTGTTGCTGCAGAGTGGGCGGC
    TGGTGGAGCCGGTGACTGACTTCCTCTTGGATATGCCTTACCACATCCGCTCCATC
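
For comparison, here is a minimal sketch of the same split written as a plain shell loop, closer in spirit to the attempt in the question. The 'ambiguous redirect' there most likely arises because gene_name still holds several lines (awk prints the first field of every line), so the unquoted $gene_name.txt expands to more than one word. In addition, echo $"element" prints the literal word element rather than the variable. The function name split_fasta is hypothetical, and the sketch assumes every record starts with a header line beginning with '>'.

```shell
# Sketch: split a FASTA file into one file per record with shell built-ins.
# split_fasta is a hypothetical helper name, not taken from the original post.
# The body runs in a subshell ( ... ) so that out and line stay local.
split_fasta() (
    out=""
    # "|| [ -n "$line" ]" also handles a last line with no trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            '>'*)
                # Header line: name the output after the text following '>'.
                out="${line#\>}.txt"
                : > "$out"    # create or truncate the file for this record
                ;;
        esac
        # Lines before the first header have no target file and are skipped.
        if [ -n "$out" ]; then
            # Quoting "$out" keeps names containing '|' as a single word.
            printf '%s\n' "$line" >> "$out"
        fi
    done < "$1"
)
```

Called as split_fasta file1.txt, this should write the same two files as the awk command into the current directory. For large inputs the awk version is likely to be considerably faster, since a shell read loop processes one line at a time.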