bashsedbioinformaticsfasta

Rename entries in fasta/fastq files in UNIX


I have fasta file with x reads and I want to rename the ids of each read. My file is as follow:

>e855552f-9484-4674-8fc4-d9b1f1add023 runid=4140cbe7f7cd36a17a00b732f27dd37bd09e4380 sampleid=mariposas read=55861 ch=2 start_time=2022-11-23T17:43:40Z model_version_id=2021-05-17_dna_r9.4.1_minion_768_2f1c8637 barcode=barcode06
GCTTTAACATTTAGCTATTTATGACACAGTGAAATAAAAGTAATATCTTTTTATTTTTAATTGTATTTATTAGTTACATGTTTTCACATGCATTTAACATAAATGTGATAATTTATGGGAATTACACTACTGTCAAAGTAGTT
>c9d90319-ec63-4347-9244-ad080b0815c5 runid=4140cbe7f7cd36a17a00b732f27dd37bd09e4380 sampleid=mariposas read=30196 ch=317 start_time=2022-11-23T14:47:32Z model_version_id=2021-05-17_dna_r9.4.1_minion_768_2f1c8637 barcode=barcode11
GCCTTGACTATATGGTTTACCTGTTCAAATACGACTCTACTCATGGTCGTTTCAAGGGAACAGTTGAGGTTCAAGGATGGTTTCCTCGTAGTAGTCTCAATGGAAACAAATCTCCTGTCTTCTGTGAAAGAGACCCTAAAATC

And I want to rename the reads with the first id and the barcode (last word) linked by "_".

My expect out is:

>e855552f-9484-4674-8fc4-d9b1f1add023_barcode06
GCTTTAACATTTAGCTATTTATGACACAGTGAAATAAAAGTAATATCTTTTTATTTTTAATTGTATTTATTAGTTACATGTTTTCACATGCATTTAACATAAATGTGATAATTTATGGGAATTACACTACTGTCAAAGTAGTT
>c9d90319-ec63-4347-9244-ad080b0815c5_barcode11
GCCTTGACTATATGGTTTACCTGTTCAAATACGACTCTACTCATGGTCGTTTCAAGGGAACAGTTGAGGTTCAAGGATGGTTTCCTCGTAGTAGTCTCAATGGAAACAAATCTCCTGTCTTCTGTGAAAGAGACCCTAAAATC

I'm trying with sed command, but I don't know if it's the best option. I don't know how to specify a number of words (first and last) with sed instead of a fixed word.


Solution

  • This sed one-liner should do the trick:

    sed 's/ .* barcode=/_/' file
    

    which simply replaces the substring from the first space character to the barcode= with a _.