I have a fasta file which contains protein sequences. I'd like to select sequences with more than 300 amino acids and Cysteine (C) amino acid appears more than 4 times.
I've used this command to select sequences with more than 300 aa:
cat 72hDOWN-fasta.fasta | bioawk -c fastx 'length($seq) > 300{ print ">"$name; print $seq }'
Some sequence example:
>jgi|Triasp1|216614|CE216613_3477
MPSLYLTSALGLLSLLPAAQAGWNPNSKDNIVVYWGQDAGSIGQNRLSYYCENAPDVDVI
NISFLVGITDLNLNLANVGNNCTAFAQDPNLLDCPQVAADIVECQQTYGKTIMMSLFGST
YTESGFSSSSTAVSAAQEIWAMFGPVQSGNSTPRPFGNAVIDGFDFDLEDPIENNMEPFA
AELRSLTSAATSKKFYLSAAPQCVYPDASDESFLQGEVAFDWLNIQFYNNGCGTSYYPSG
YNYATWDNWAKTVSANPNTKLLVGTPASVHAVNFANYFPTNDQLAGAISSSKSYDSFAGV
MLWDMAQLFGNPGYLDLIVADLGGASTPPPPASTTLSTVTRSSTASTGPTSPPPSGGSVP
QWGQCGGQGYTGPTQCQSPYTCVVESQWWSSCQ*
I do not know bioawk
but I assume it is identical to awk with some initial parsing and constant definitions.
I would proceed as follows. Assuming you want the find the strings with more then 4 times the letter C
in and a length of more than 300, then you could do :
bioawk -c fastx '
(length($seq) > 300) && (gsub("C","C",$seq)>4) {
print ">"$name; print $seq
}' 72hDOWN-fasta.fasta
but this assumes that seq
is the full character sequence.
The idea behind it is the following. The gsub
command performs substitutions in strings and returns the total substitutions it did. Hence, if we substitute all characters "C" with "C" we actually did not change the string, but get the total amount of "C"'s in the string back.
From the POSIX standard IEEE Std 1003.1-2017:
gsub(ere, repl[, in])
: Behave likesub
(see below), except that it shall replace all occurrences of the regular expression (like theed
utility global substitute) in$0
or in the in argument, when specified.
sub(ere, repl[, in ])
: Substitute the stringrepl
in place of the first instance of the extended regular expressionere
in stringin
and return the number of substitutions. An <ampersand> (&
) appearing in the stringrepl
shall be replaced by the string fromin
that matches the ERE. An <ampersand> preceded with a <backslash> shall be interpreted as the literal <ampersand> character. An occurrence of two consecutive <backslash> characters shall be interpreted as just a single literal <backslash> character. Any other occurrence of a <backslash> (for example, preceding any other character) shall be treated as a literal <backslash> character. Note that ifrepl
is a string literal (the lexical token STRING; see Grammar), the handling of the <ampersand> character occurs after any lexical processing, including any lexical <backslash>-escape sequence processing. Ifin
is specified and it is not an lvalue (see Expressions in awk), the behavior is undefined. Ifin
is omitted, awk shall use the current record ($0
) in its place.
Note: BioAwk is based on Brian Kernighan's awk which is documented in "The AWK Programming Language", by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X) . I'm not sure if this version is compatible with POSIX.