linuxawkbioinformaticssequencesfasta

Select sequences in a fasta file with more than 300 aa and "C" occurs at least 4 times


I have a fasta file which contains protein sequences. I'd like to select sequences with more than 300 amino acids and Cysteine (C) amino acid appears more than 4 times.

I've used this command to select sequences with more than 300 aa:

 cat 72hDOWN-fasta.fasta | bioawk -c fastx 'length($seq) > 300{ print ">"$name; print $seq }' 

Some sequence example:

  >jgi|Triasp1|216614|CE216613_3477
 MPSLYLTSALGLLSLLPAAQAGWNPNSKDNIVVYWGQDAGSIGQNRLSYYCENAPDVDVI
 NISFLVGITDLNLNLANVGNNCTAFAQDPNLLDCPQVAADIVECQQTYGKTIMMSLFGST
 YTESGFSSSSTAVSAAQEIWAMFGPVQSGNSTPRPFGNAVIDGFDFDLEDPIENNMEPFA
 AELRSLTSAATSKKFYLSAAPQCVYPDASDESFLQGEVAFDWLNIQFYNNGCGTSYYPSG
 YNYATWDNWAKTVSANPNTKLLVGTPASVHAVNFANYFPTNDQLAGAISSSKSYDSFAGV
 MLWDMAQLFGNPGYLDLIVADLGGASTPPPPASTTLSTVTRSSTASTGPTSPPPSGGSVP
 QWGQCGGQGYTGPTQCQSPYTCVVESQWWSSCQ* 

Solution

  • I do not know bioawk but I assume it is identical to with some initial parsing and constant definitions.

    I would proceed as follows. Assuming you want the find the strings with more then 4 times the letter C in and a length of more than 300, then you could do :

    bioawk -c fastx '
       (length($seq) > 300) && (gsub("C","C",$seq)>4) {
           print ">"$name; print $seq
       }' 72hDOWN-fasta.fasta
    

    but this assumes that seq is the full character sequence.

    The idea behind it is the following. The gsub command performs substitutions in strings and returns the total substitutions it did. Hence, if we substitute all characters "C" with "C" we actually did not change the string, but get the total amount of "C"'s in the string back.

    From the POSIX standard IEEE Std 1003.1-2017:

    gsub(ere, repl[, in]): Behave like sub (see below), except that it shall replace all occurrences of the regular expression (like the ed utility global substitute) in $0 or in the in argument, when specified.

    sub(ere, repl[, in ]): Substitute the string repl in place of the first instance of the extended regular expression ere in string in and return the number of substitutions. An <ampersand> ( & ) appearing in the string repl shall be replaced by the string from in that matches the ERE. An <ampersand> preceded with a <backslash> shall be interpreted as the literal <ampersand> character. An occurrence of two consecutive <backslash> characters shall be interpreted as just a single literal <backslash> character. Any other occurrence of a <backslash> (for example, preceding any other character) shall be treated as a literal <backslash> character. Note that if repl is a string literal (the lexical token STRING; see Grammar), the handling of the <ampersand> character occurs after any lexical processing, including any lexical <backslash>-escape sequence processing. If in is specified and it is not an lvalue (see Expressions in awk), the behavior is undefined. If in is omitted, awk shall use the current record ($0) in its place.

    Note: BioAwk is based on Brian Kernighan's awk which is documented in "The AWK Programming Language", by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X) . I'm not sure if this version is compatible with POSIX.