bashfilterbioinformaticsfasta

Filter a .fasta file


I have the following two files:

  1. A .fasta file with >700 protein sequences. Each sequence occupies a few lines, and has a header with some information about the protein, including an accession code (e.g. A0A2K5BUY4). Here is an extract from the file:
>A0A2K5RDW4|unreviewed|Olfactory receptor|taxID:2715852
MAHINCTQVAEFILVGLTDHQELKMPLFVLFLSIYLFTVVGNLGLILLIRADSRLSTPMYFFLSNLAFVDFCYSSVITPK
MLGTFLYKQNVISFNACATQLCCFLTFMISECLLLASIAYDRYVAICNPLLYMVVMSPGICIQLVAVPYSYSFLMALFHT
ILTFRLSYCHSNIVNHFYCDDMPLLRLTCSDTSSKQLWIFACAGIMFTSSLLIVFVSYMFIISAILRMRSAEGRQKAFST
CGSHMLAVTIFYGTLIFMYLQPSSNHALDTDKMASVFYTVIIPMLNPLIYSLRNKEVKEALKKVIINKNQ
>A0A2K5RFA5|unreviewed|Olfactory receptor|taxID:2715852
MSRWSNGSSNVSYTSFLLAGFPGLRESRALLVLPFLSLYLMIIFTNALVIHTVLTQQSLHQPMYLLIALLLAVNICATST
VVPAMLFSFSTHFNCISLARCLIQMFCIYFLIVFDCNILLVMALDRYVAICYPLCYPEIVTGQLLAGLVGAAATRSTCIV
APVVFLASRVHFCRSDMIQHFACEHMALMKLSCGDISLNKTVGLTVRIFNRVLDMLLLGTSYSHIIHAAFRISSGRARSK
ALNTCGSHLLVIFTVYLSTMSSSIVYRTAHTASQDVHNLLSTFYLLLPCLVNPIIYGTRTKEIRYHLGEQFLSSGPRDTM
MSI
  1. A .txt file with 229 lines, each one containing one accession code, like the ones in the header of the sequences in the .fasta file. Here is an extract from this file:
A0A2K5BUY4
A0A2K5BUY7
A0A2K5BUZ1
A0A2K5BW26
A0A2K5BW36
A0A2K5BWA0

I want to create a second .fasta file, that contains only the sequences (including headers) whose accession codes are in the .txt file.

I am new to bioinformatics and to programming in general, so I did the following with help from chatGPT:

# Use pcregrep to extract sequences that match the accession codes
pcregrep -M -f platyrrhini_or_accessions.txt platyrrhini_784ORs.fasta | \
  pcregrep -M ">([^>]+)(\n[^>]+)*" > filtered_platyrrhini_ORs.fasta

but this did not catch the sequences, as the output contained only the headers of the matching proteins:

>A0A2R8PLQ0|unreviewed|Olfactory receptor|taxID:9483
>A0A5F4VRG8|unreviewed|Olfactory receptor|taxID:9483
>A0A5F4VRT9|unreviewed|Olfactory receptor|taxID:9483
>A0A5F4VS80|unreviewed|Olfactory receptor|taxID:9483

I also tried with grep, but it didn't work because the sequences do not occupy a fixed number of lines:

grep -A 1 -f platyrrhini_or_accessions.txt platyrrhini_784ORs.fasta > filtered_platyrrhini_ORs.fasta

This captured the 229 headers but only 1 line from each sequence.

And lastly I used awk:

awk 'BEGIN {FS="|"} NR==FNR {accessions[$1]; next} 
     {if (substr($0, 1, 1) == ">") {seq_id = substr($0, 2); seq = $0} 
      else {seq = seq "\n" $0} 
      if (seq_id in accessions) {print seq_id "\n" seq}}' \
platyrrhini_or_accessions.txt platyrrhini_784ORs.fasta > filtered_platyrrhini_ORs.fasta

and the output file was empty.

I would really appreciate any help.


Solution

  • Using awk for this is a good idea but there are several issues with your code:

    If the header lines always start with a > and no other lines start with a >, you can try:

    awk -F '|' '
      NR == FNR {accessions[">" $1]; next}
      /^>/ {to_print = $1 in accessions}
      to_print' platyrrhini_or_accessions.txt platyrrhini_784ORs.fasta
    

    awk statements are pattern-action pairs applied to the input records, where the action is executed if and only if the pattern is true for the current record. In your case records are regular lines.

    Note: in awk there is no real boolean type; 0 and the empty string are considered as false, anything else as true. With any POSIX-compliant awk, in the above code, the to_print variable takes value 0 (false) or a non-zero numeric value (true), likely 1 (but this is not specified by POSIX).