bashawksedbioinformaticsfasta

How to substitute a specific string pattern in a multi-line FASTA file using bash?


I have a large multi-line FASTA file that looks like the following:

>NWQ47741.1 CLTR1 protein, partial [Melospiza_melodia] Vertebrate
CLSQGTMTALSPNLSCHNPSIDDFRNSVYSTLYSMISIMGFVGNGVVLYVLIRTYRQKTA
FQIYMLNLALSDFLCVLTLPLRVIYYVHKGHWFFSDFLCRLSSYALYVNLYCSIFFMTAM
SFFRCIAIVFPVRNISLVSEKKAKFLCVGIWVFVTLTSAPFLRNGTYQHGNKTKCFEPPE
NSQKTNMVVILDFIALFVGFIFPFVIITICYTMIIRTLLRNSLRKNEANRRKAVWMIVIV
TATFLVSFTPYHVLRTVHLHALRLRGPGCADTVFLQKAVIVTLPLAAANCCFDPLLYFFS
GGNFRQRLTTLRKASSSSLSQAFRKKISVKEKEEEPFGE
>XP_002763076.2 cysteinyl leukotriene receptor 1 [Callithrix_jacchus] Vertebrate
MDGTGNLTVSSATCHDTIDEFRNQVYSTLYSMISVVGFFGNGFVLYVLIKTYHEKSAFQI
YMINLAIADLLCVCTLPLRVVYYVHKGIWFFGDFLCRLSTYALYVNLYCSIFFMTAMSFF
RCIAIVFPVQNINLVTQKKARFVCVGIWIFVILASSPFLITKSYKDEKNNTKCFEPPQDN
QTKNHVLILHYVSLFLGFIIPFVIIIVCYTMIILTLLKKSMKKNLSSHKKAIRMIMVVTA
AFLVSFMPYHIQRTIHLHFLHNETKPCDSVLRMQKSVVITLSLAASNCCFDPLLYFFSGG
NFRRRLSTFRKHSLSSMTYVPRKKASLPEKGEEICKV
>XP_036988076.1 cysteinyl leukotriene receptor 1 [Artibeus_jamaicensis] Vertebrate
MDGTGNLTASSASNNMCNSSIDDFRNQVYSTMYSMISIVGFFGNGFVLYVLIRTYHEKSA
FQIYMINLAVSDLLCVCTLPLRVVYYVHKGMWFFGDILCRLSTYALYVNLYCSIFFMTAM
SFFRCIAIVFPVKNINLVTEKKARFVCASIWVFVILTSSPFLMSKSYKDEKNNTKCFEPP
QDNETKNHIFILHYVSLLVGFLIPFIIIIVCYTMIIFTLLKNSMQKNVPSRKKAVGMIII
VTAAFLISFMPYHIQRTIHLHFLYNETKPCDSVLRMQKSVVITLSLAASNCCFDPLLYFF
SGGNFRRRLSTFRKHSLSSMTYVPKKKVSLPEKEDEVCK

I need to replace a specific substring within each sequence in the file. The substring to be replaced is the part after the accession ID and before the first '[' character. For example, in the first sequence, I want to replace "CLTR1 protein, partial" with "new_string", resulting in:

>NWQ47741.1 new_string [Melospiza_melodia] Vertebrate
CLSQGTMTALSPNLSCHNPSIDDFRNSVYSTLYSMISIMGFVGNGVVLYVLIRTYRQKTA
FQIYMLNLALSDFLCVLTLPLRVIYYVHKGHWFFSDFLCRLSSYALYVNLYCSIFFMTAM
SFFRCIAIVFPVRNISLVSEKKAKFLCVGIWVFVTLTSAPFLRNGTYQHGNKTKCFEPPE
NSQKTNMVVILDFIALFVGFIFPFVIITICYTMIIRTLLRNSLRKNEANRRKAVWMIVIV
TATFLVSFTPYHVLRTVHLHALRLRGPGCADTVFLQKAVIVTLPLAAANCCFDPLLYFFS
GGNFRQRLTTLRKASSSSLSQAFRKKISVKEKEEEPFGE

I'm looking for a way to achieve this using AWK or Sed, as the file is large, and I want an efficient solution. Can someone provide a sample script or command for this task?

I tried the following script:

awk '/^>/ {gsub(/\[[^[]*/, "new_string"); print; next} 1' your_file.fasta > modified_file.fasta

But got the following output:

>NWQ47741.1 CLTR1 protein, partial new_string
CLSQGTMTALSPNLSCHNPSIDDFRNSVYSTLYSMISIMGFVGNGVVLYVLIRTYRQKTA
FQIYMLNLALSDFLCVLTLPLRVIYYVHKGHWFFSDFLCRLSSYALYVNLYCSIFFMTAM
SFFRCIAIVFPVRNISLVSEKKAKFLCVGIWVFVTLTSAPFLRNGTYQHGNKTKCFEPPE
NSQKTNMVVILDFIALFVGFIFPFVIITICYTMIIRTLLRNSLRKNEANRRKAVWMIVIV
TATFLVSFTPYHVLRTVHLHALRLRGPGCADTVFLQKAVIVTLPLAAANCCFDPLLYFFS
GGNFRQRLTTLRKASSSSLSQAFRKKISVKEKEEEPFGE
>XP_002763076.2 cysteinyl leukotriene receptor 1 new_string
MDGTGNLTVSSATCHDTIDEFRNQVYSTLYSMISVVGFFGNGFVLYVLIKTYHEKSAFQI
YMINLAIADLLCVCTLPLRVVYYVHKGIWFFGDFLCRLSTYALYVNLYCSIFFMTAMSFF
RCIAIVFPVQNINLVTQKKARFVCVGIWIFVILASSPFLITKSYKDEKNNTKCFEPPQDN
QTKNHVLILHYVSLFLGFIIPFVIIIVCYTMIILTLLKKSMKKNLSSHKKAIRMIMVVTA
AFLVSFMPYHIQRTIHLHFLHNETKPCDSVLRMQKSVVITLSLAASNCCFDPLLYFFSGG
NFRRRLSTFRKHSLSSMTYVPRKKASLPEKGEEICKV
>XP_036988076.1 cysteinyl leukotriene receptor 1 new_string
MDGTGNLTASSASNNMCNSSIDDFRNQVYSTMYSMISIVGFFGNGFVLYVLIRTYHEKSA
FQIYMINLAVSDLLCVCTLPLRVVYYVHKGMWFFGDILCRLSTYALYVNLYCSIFFMTAM
SFFRCIAIVFPVKNINLVTEKKARFVCASIWVFVILTSSPFLMSKSYKDEKNNTKCFEPP
QDNETKNHIFILHYVSLLVGFLIPFIIIIVCYTMIIFTLLKNSMQKNVPSRKKAVGMIII
VTAAFLISFMPYHIQRTIHLHFLYNETKPCDSVLRMQKSVVITLSLAASNCCFDPLLYFF
SGGNFRRRLSTFRKHSLSSMTYVPKKKVSLPEKEDEVCK

Solution

  • Using sed

    $ sed -E '/^>/s/ [^[]*/ new_string /' input_file
    >NWQ47741.1 new_string [Melospiza_melodia] Vertebrate
    CLSQGTMTALSPNLSCHNPSIDDFRNSVYSTLYSMISIMGFVGNGVVLYVLIRTYRQKTA
    FQIYMLNLALSDFLCVLTLPLRVIYYVHKGHWFFSDFLCRLSSYALYVNLYCSIFFMTAM
    SFFRCIAIVFPVRNISLVSEKKAKFLCVGIWVFVTLTSAPFLRNGTYQHGNKTKCFEPPE
    NSQKTNMVVILDFIALFVGFIFPFVIITICYTMIIRTLLRNSLRKNEANRRKAVWMIVIV
    TATFLVSFTPYHVLRTVHLHALRLRGPGCADTVFLQKAVIVTLPLAAANCCFDPLLYFFS
    GGNFRQRLTTLRKASSSSLSQAFRKKISVKEKEEEPFGE
    >XP_002763076.2 new_string [Callithrix_jacchus] Vertebrate
    MDGTGNLTVSSATCHDTIDEFRNQVYSTLYSMISVVGFFGNGFVLYVLIKTYHEKSAFQI
    YMINLAIADLLCVCTLPLRVVYYVHKGIWFFGDFLCRLSTYALYVNLYCSIFFMTAMSFF
    RCIAIVFPVQNINLVTQKKARFVCVGIWIFVILASSPFLITKSYKDEKNNTKCFEPPQDN
    QTKNHVLILHYVSLFLGFIIPFVIIIVCYTMIILTLLKKSMKKNLSSHKKAIRMIMVVTA
    AFLVSFMPYHIQRTIHLHFLHNETKPCDSVLRMQKSVVITLSLAASNCCFDPLLYFFSGG
    NFRRRLSTFRKHSLSSMTYVPRKKASLPEKGEEICKV
    >XP_036988076.1 new_string [Artibeus_jamaicensis] Vertebrate
    MDGTGNLTASSASNNMCNSSIDDFRNQVYSTMYSMISIVGFFGNGFVLYVLIRTYHEKSA
    FQIYMINLAVSDLLCVCTLPLRVVYYVHKGMWFFGDILCRLSTYALYVNLYCSIFFMTAM
    SFFRCIAIVFPVKNINLVTEKKARFVCASIWVFVILTSSPFLMSKSYKDEKNNTKCFEPP
    QDNETKNHIFILHYVSLLVGFLIPFIIIIVCYTMIIFTLLKNSMQKNVPSRKKAVGMIII
    VTAAFLISFMPYHIQRTIHLHFLYNETKPCDSVLRMQKSVVITLSLAASNCCFDPLLYFF
    SGGNFRRRLSTFRKHSLSSMTYVPKKKVSLPEKEDEVCK