I have a large text file containing FASTA sequences (plain text) for multiple genes. I would like to split it into multiple files, one per gene, with each file named after the gene's header line.
The structure of the file looks like this:
file1.txt
>PDGFRB|ENST00000522466.1
TCAGTCATCCTTTCCCTCTCTAGCCCCCTACCCTATCCCCAAGCTGAAGTGCTAGTGGCT
GGTGGTGACTTCCCCAGACCTAAGCCAATCTCTCTCTACCAGTGTCATCCATCAACGTCT
CTGTGAACGCAGTGCAGACTGTGGTCCGCCAGGGTGAGAACATCACCCTCATGTGCATTG
TGATCGGGAATGAGGTGGTCAACTTCGAGTGGACATACCCCCGCAAAGAAGTAATGTGGG
GCCAGGCAGGGGTCGGAGGAGGGGCCAGGAACGGGTGGATATCTGGCTTGCAGGCTGATT
TCTCCCCGGCCCCTCCTGATTTGGGGGGCCTGCCCAACCTGTTGCTGCAGAGTGGGCGGC
TGGTGGAGCCGGTGACTGACTTCCTCTTGGATATGCCTTACCACATCCGCTCCATC
>DGAT2|ENST00000604935.5
AGAAAGGCCGGGCGCGGCGAGGCTGGGCGCTGGGCGGCTGCGGCGCGCGGTGCGCGGTGC
GTAGTCTGGAGCTATGGTGGTGGTGGCAGCCGCGCCGAACCCGGCCGACGGGACCCCTAA
AGTTCTGCTTCTGTCGGGGCAGCCCGCCTCCGCCGCCGGAGCCCCGGCCGGCCAGGCCCT
GCCGCTCATGGTGCCAGCCCAGAGAGGGGCCAGCCCGGAGGCAGCGAGCGGGGGGCTGCC
CCAGGCGCGCAAGCGACAGCGCCTCACGCACCTGAGCCCCGAGGAGAAGGCGCTGAGGAG
GTGGGCGAGGGGCCGGGGTCTGGGGCCAGATCTGAAGCCGGGACTAGGGACAGGGGCAGG
I want two output files, with the following names and contents:
PDGFRB|ENST00000522466.1.txt
>PDGFRB|ENST00000522466.1
TCAGTCATCCTTTCCCTCTCTAGCCCCCTACCCTATCCCCAAGCTGAAGTGCTAGTGGCT
GGTGGTGACTTCCCCAGACCTAAGCCAATCTCTCTCTACCAGTGTCATCCATCAACGTCT
CTGTGAACGCAGTGCAGACTGTGGTCCGCCAGGGTGAGAACATCACCCTCATGTGCATTG
TGATCGGGAATGAGGTGGTCAACTTCGAGTGGACATACCCCCGCAAAGAAGTAATGTGGG
GCCAGGCAGGGGTCGGAGGAGGGGCCAGGAACGGGTGGATATCTGGCTTGCAGGCTGATT
TCTCCCCGGCCCCTCCTGATTTGGGGGGCCTGCCCAACCTGTTGCTGCAGAGTGGGCGGC
TGGTGGAGCCGGTGACTGACTTCCTCTTGGATATGCCTTACCACATCCGCTCCATC
and DGAT2|ENST00000604935.5.txt
>DGAT2|ENST00000604935.5
AGAAAGGCCGGGCGCGGCGAGGCTGGGCGCTGGGCGGCTGCGGCGCGCGGTGCGCGGTGC
GTAGTCTGGAGCTATGGTGGTGGTGGCAGCCGCGCCGAACCCGGCCGACGGGACCCCTAA
AGTTCTGCTTCTGTCGGGGCAGCCCGCCTCCGCCGCCGGAGCCCCGGCCGGCCAGGCCCT
GCCGCTCATGGTGCCAGCCCAGAGAGGGGCCAGCCCGGAGGCAGCGAGCGGGGGGCTGCC
CCAGGCGCGCAAGCGACAGCGCCTCACGCACCTGAGCCCCGAGGAGAAGGCGCTGAGGAG
GTGGGCGAGGGGCCGGGGTCTGGGGCCAGATCTGAAGCCGGGACTAGGGACAGGGGCAGG
I tried the script below. It splits the file into records, but it does not save them into separate files named after the genes, and it also fails with the error 'ambiguous redirect'.
#!/bin/bash
IFS=">" read -r -d '' -a my_array < file1.txt
for element in "${my_array[@]}";
do
gene_name=$(echo "$element" | awk '{print $1}')
gene_name=$(echo "$gene_name" | cut -d $'\n' -f 1)
echo "$gene_name"
echo $"element" > $gene_name.txt
done
Using any awk:
$ awk -F'>' 'NF>1{ close(out); out=$2".txt" } { print > out }' file1.txt
$ head *\|*
==> DGAT2|ENST00000604935.5.txt <==
>DGAT2|ENST00000604935.5
AGAAAGGCCGGGCGCGGCGAGGCTGGGCGCTGGGCGGCTGCGGCGCGCGGTGCGCGGTGC
GTAGTCTGGAGCTATGGTGGTGGTGGCAGCCGCGCCGAACCCGGCCGACGGGACCCCTAA
AGTTCTGCTTCTGTCGGGGCAGCCCGCCTCCGCCGCCGGAGCCCCGGCCGGCCAGGCCCT
GCCGCTCATGGTGCCAGCCCAGAGAGGGGCCAGCCCGGAGGCAGCGAGCGGGGGGCTGCC
CCAGGCGCGCAAGCGACAGCGCCTCACGCACCTGAGCCCCGAGGAGAAGGCGCTGAGGAG
GTGGGCGAGGGGCCGGGGTCTGGGGCCAGATCTGAAGCCGGGACTAGGGACAGGGGCAGG
==> PDGFRB|ENST00000522466.1.txt <==
>PDGFRB|ENST00000522466.1
TCAGTCATCCTTTCCCTCTCTAGCCCCCTACCCTATCCCCAAGCTGAAGTGCTAGTGGCT
GGTGGTGACTTCCCCAGACCTAAGCCAATCTCTCTCTACCAGTGTCATCCATCAACGTCT
CTGTGAACGCAGTGCAGACTGTGGTCCGCCAGGGTGAGAACATCACCCTCATGTGCATTG
TGATCGGGAATGAGGTGGTCAACTTCGAGTGGACATACCCCCGCAAAGAAGTAATGTGGG
GCCAGGCAGGGGTCGGAGGAGGGGCCAGGAACGGGTGGATATCTGGCTTGCAGGCTGATT
TCTCCCCGGCCCCTCCTGATTTGGGGGGCCTGCCCAACCTGTTGCTGCAGAGTGGGCGGC
TGGTGGAGCCGGTGACTGACTTCCTCTTGGATATGCCTTACCACATCCGCTCCATC
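For readers who want to see what the one-liner does step by step, here is the same awk logic spread over several lines with comments. This is only a sketch: the scratch directory, the file name sample.fa, and the geneA/geneB records are made up for the demo, and it assumes the first line of the input is a header (otherwise `out` would be empty at the first `print`).

```shell
#!/bin/bash
# Hypothetical demo data: a tiny two-record FASTA in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' '>geneA|tx1' 'ACGT' 'TTGA' '>geneB|tx2' 'GGCC' > sample.fa

# Same logic as the one-liner, with one comment per step.
awk -F'>' '
NF > 1 {                 # header line: splitting on ">" leaves $1 empty and the name in $2
    close(out)           # close the previous output so open files do not accumulate
    out = $2 ".txt"      # name the output file after the header text
}
{ print > out }          # write every line, header included, to the current file
' sample.fa

# The names contain "|", so quote them whenever they are used in the shell.
ls
```

Two things in the question's script are worth checking against this: `echo $"element"` prints the literal word element rather than the variable (the `$` belongs inside the quotes), and an unquoted redirect target such as `$gene_name.txt` is a likely source of 'ambiguous redirect' whenever the variable expands to more than one word.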