linux shell sed awk

How to filter lines by counting occurrences of a char in AWK or bash?


The input is like this:

CNNCC
NCNCN
NNNCC
CCNNN
CCCCN

The output should be like this:

CNNCC
CCCCN

That is, if a line has 3 or more occurrences of N, it is filtered out; otherwise it is kept. (In my work, I need to filter out lines with more than 500 N from 100000 lines, so performance might be important.)

I know how to filter by consecutive N in awk, but I don't know how to count non-consecutive ones.

Does anyone have ideas about this? Solutions in shell are also OK.

Among all the answers, I think this one might be the simplest:

awk -FN 'NF<=3'

Solution

  • awk -FN -v count=3 'NF<=count'
    

    or, for older awk which does not support the -v option,

    awk -FN 'NF<=count' count=3
    

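    The command uses the target char as the field separator, so a line with n occurrences of the char is split into n+1 fields. count is therefore the maximum allowed number of fields, i.e. the maximum allowed number of occurrences plus one. By comparing the resulting number of fields (NF) against count, we print only the lines that meet our criteria.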

    The intention of the statement is not immediately obvious, which makes it less readable. It does, however, have the advantage that the char and count are parametrised, so it can easily be reused for different settings.

    Admittedly, this would not be very efficient for large values of count. Setting the maximum number of fields to count+1 would overcome this performance issue; unfortunately, the -mf option is ignored by gawk.
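If field splitting turns out to be a bottleneck, one possible alternative is to count the occurrences directly instead of reading them off NF. The sketch below relies on the fact that awk's gsub() returns the number of substitutions it made. The function name filter_by_count and the variable names char and max are made up for illustration, and I have not benchmarked it against the -F approach, so treat any speed-up as an assumption to be measured on your own data.

```shell
# Hypothetical helper: keep lines with at most "max" occurrences of "char".
# gsub() returns how many substitutions it performed, so running it on a
# copy of the line counts the occurrences without touching $0 or NF.
# Assumption: "char" is a plain character; a regex metacharacter such as
# "." or "*" would need escaping because gsub() treats it as a pattern.
filter_by_count() {
    awk -v char="$1" -v max="$2" '{ line = $0 } gsub(char, "", line) <= max'
}

# Sample input from the question: lines with 3 or more N are dropped.
printf '%s\n' CNNCC NCNCN NNNCC CCNNN CCCCN | filter_by_count N 2
```

Note that max here is the largest number of occurrences still allowed, so max=2 corresponds to count=3 in the NF-based command. On the sample input this prints CNNCC and CCCCN.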