I try to replace in several files some patterns by other patterns. For example my infile looks like this:
>Genus_species_SRR13259292|ENSG00000000457_ENST00000367772
TACGCCGCGCACTTCACGCGAGAGCAGCTGCGCACTATCGTCCTGCCCCAGGTGCTGCTGGGCCTGCGAGACACCAGCACCCCCATCGTGGCCATCACCCTGCACAGCCTCGCCGTGCTGGTCTCCCTGCTCGGACCAGAGGTGGTTGTGGGCGGAGAAAGAACCAAGATCTTCAAACGCACTGCCCCCAGCTTTACAAAAACCACTGACCTCTCCCCAGAAGAC
and I want output:
>Genus_species_Something_something|ENSG00000000457_ENST00000367772
TACGCCGCGCACTTCACGCGAGAGCAGCTGCGCACTATCGTCCTGCCCCAGGTGCTGCTGGGCCTGCGAGACACCAGCACCCCCATCGTGGCCATCACCCTGCACAGCCTCGCCGTGCTGGTCTCCCTGCTCGGACCAGAGGTGGTTGTGGGCGGAGAAAGAACCAAGATCTTCAAACGCACTGCCCCCAGCTTTACAAAAACCACTGACCTCTCCCCAGAAGAC
I have two list files, my old patterns:
Genus_species_SRR13259292
and new patterns:
Genus_species_Something_something
I tried to do this with sed
. Here is my command:
while IFS= read -r line1 && IFS= read -r line2 <&3; do
for f in *.fasta; do
sed -e "s/${line1}/${line2}/g" "$f" > "${f%.fasta}_NewName.fasta"
done
done < "List_oldpattern.txt" 3<"List_newpatterns.txt"
But this doesn't work, maybe it is because of the >
and |
delimited the pattern?
If sed
doesn't work it may be possible with Awk?
Since the question has been tagged with awk
I propose we replace all of OP's current code with a single awk
script ...
My sample .fasta
files:
$ head f?.fasta
==> f1.fasta <==
>Genus_species_SRR13259292|ENSG00000000457_ENST00000367772 # change
TACGCCGCGCACTTCACGCGAGAGCAGCTGCGCACTATCGTCCTGCCCCAGGTGCTGC....
>Genus_buckets_ABC13259292|ENSG00000000457_ENST00000367772 # do not change
TACGCCGCGCACTTCACGCGAGAGCAGCTGCGCACTATCGTCCTGCCCCAGGTGCTGC....
==> f2.fasta <==
>Genus_species_SRR13259292|ENSG00000000457_ENST00000367772 # change
TACGCCGCGCACTTCACGCGAGAGCAGCTGCGCACTATCGTCCTGCCCCAGGTGCTGC....
>Genus_buckets_ABC13259292|ENSG00000000457_ENST00000367772 # do not change
TACGCCGCGCACTTCACGCGAGAGCAGCTGCGCACTATCGTCCTGCCCCAGGTGCTGC....
NOTE: files do not contain the comments
We'll make use of the paste
command to append OP's old and new patterns into a single line; we'll use a |
as the delimiter:
$ paste -d'|' List_oldpattern.txt List_newpatterns.txt
Genus_species_SRR13259292|Genus_species_Something_something
Now the awk
script:
awk '
BEGIN { FS = OFS = "|" } # input/output field delimiter
FNR==NR { map[">" $1] = ">" $2; next } # 1st file (paste output): populate our map[] array; $1==old $2==new; then skip to next input line
FNR==1 { close(outf) # 2nd-nth files: 1st record; close previous output file
outf = FILENAME # make copy of input FILENAME
sub(/\.fasta$/,"",outf) # strip trailing ".fasta"
outf = outf "_NewName.fasta" # append new suffix to our output filename
}
$1 in map { $1 = map[$1] } # if 1st field (">some_string") is an index in the map[] array then replace 1st field with array contents
{ print > outf } # print current line to output file
' <(paste -d'|' List_oldpattern.txt List_newpatterns.txt) *.fasta
NOTE: assuming OP has more than one old/new pattern pair, this script has the added benefit of only scanning each *.fasta
file once (as opposed to OP's current while/read/for/sed
loop which scans each .fasta
file N
times - where N
is the number of old/new pattern pairs)
This generates:
$ head *_NewName.fasta
==> f1_NewName.fasta <==
>Genus_species_Something_something|ENSG00000000457_ENST00000367772 # changed
TACGCCGCGCACTTCACGCGAGAGCAGCTGCGCACTATCGTCCTGCCCCAGGTGCTGC....
>Genus_buckets_ABC13259292|ENSG00000000457_ENST00000367772 # not changed
TACGCCGCGCACTTCACGCGAGAGCAGCTGCGCACTATCGTCCTGCCCCAGGTGCTGC....
==> f2_NewName.fasta <==
>Genus_species_Something_something|ENSG00000000457_ENST00000367772 # changed
TACGCCGCGCACTTCACGCGAGAGCAGCTGCGCACTATCGTCCTGCCCCAGGTGCTGC....
>Genus_buckets_ABC13259292|ENSG00000000457_ENST00000367772 # not changed
TACGCCGCGCACTTCACGCGAGAGCAGCTGCGCACTATCGTCCTGCCCCAGGTGCTGC....