I have a fasta file with the following header structure:
>Saurogobio_punctatus-NC_080528.1|taxid=1771284|cellularorganisms,Eukaryota,Opisthokonta,Metazoa
GCTAGCGTAGCTTAATATAAAGCATAACACTGAAGATGTTAAGATGAGCCCTAA
Where each section is separated by a pipe '|', and the first section is a combination of species_name-accessionID.
I want to remove the accession IDs after the hyphen '-', but keep everything else. Like this:
>Saurogobio_punctatus|taxid=1771284|cellularorganisms,Eukaryota,Opisthokonta,Metazoa
GCTAGCGTAGCTTAATATAAAGCATAACACTGAAGATGTTAAGATGAGCCCTAA
I've tried:
sed -E '/^>/s/(\|[^-]*)-.*$/\1/' input.fasta > output.fasta
But this removes everything after the hyphen '-':
>Saurogobio_punctatus
GCTAGCGTAGCTTAATATAAAGCATAACACTGAAGATGTTAAGATGAGCCCTAA
I've used this piece of code before to edit my header and include the taxid= before my 2nd column:
awk 'BEGIN { FS=OFS="|" } /^>/ { print $1, "taxid=", $2, $3; next } { print }' file.fa > edit_file
I was wondering if there is a way to combine these two commands, so that I edit my first column and then reprint the rest, but I don't know how to do it :(
I appreciate any help with this!
I suggest using sed:
sed 's/-[^|]*//' file
Output to stdout:
>Saurogobio_punctatus|taxid=1771284|cellularorganisms,Eukaryota,Opisthokonta,Metazoa
GCTAGCGTAGCTTAATATAAAGCATAACACTGAAGATGTTAAGATGAGCCCTAA
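The sed command above runs on every line, so a sequence line that happens to contain a hyphen (for example, gap characters in an aligned FASTA) would be edited too. As a minimal sketch, the substitution can be limited to header lines with a /^>/ address, and the same edit can be done in awk in the style of the command from the question, by changing only the first '|'-separated field and reprinting the rest. The file names sample.fasta, sed_out.fasta and awk_out.fasta are placeholders, and both variants assume the species name itself contains no hyphen.

```shell
# Build a small sample file (placeholder name) from the header in the question.
cat > sample.fasta <<'EOF'
>Saurogobio_punctatus-NC_080528.1|taxid=1771284|cellularorganisms,Eukaryota,Opisthokonta,Metazoa
GCTAGCGTAGCTTAATATAAAGCATAACACTGAAGATGTTAAGATGAGCCCTAA
EOF

# sed: on header lines only, delete from the first hyphen up to
# (but not including) the next pipe.
sed '/^>/s/-[^|]*//' sample.fasta > sed_out.fasta

# awk: split on '|', strip the hyphen and everything after it from
# field 1 of header lines, then print every line (headers are rejoined with '|').
awk 'BEGIN { FS = OFS = "|" } /^>/ { sub(/-.*/, "", $1) } { print }' sample.fasta > awk_out.fasta
```

The awk form is the easier one to extend if other columns need editing in the same pass, since each field can be changed independently before the line is printed.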