regex perl alternation

How do I count regex matches in Perl when using multiple possible match targets separated by "|"?


I have a (very) long list of strings of numbers, and I need to count the occurrences of certain values in each string in order to decide whether to pull the line that string belongs to. Essentially, the file is formatted like this:

,4,8,9,11,12,
,5,6,7,9,11,
etc.

where the strings range in length from 1 to 100 values, the values range from 1 to 100, and the values in each string are always ordered smallest to largest.

I'm trying to find all the lines that have, for example, at least two out of the three values 4, 9, and 11, so here is the test code I wrote to try out my regex:

my $string = ",4,8,9,11,12,";

my $test = ",4,|,9,|,11,";

my @c = $string =~ m/$test/g;
my $count = @c;

print "count: $count\n";
print "\@c:", join(" ", @c), "\n";

The output when I run this is:

count: 2
@c:,4, ,9,

I expected count to be 3 and @c to be ,4, ,9, ,11,.

I realize this is because the 9 and the 11 share the same comma, but I'm wondering if anyone knows how to get around this. I can't just drop the last comma from the match, because if I'm trying to match ,4 in a string that has a ,41, for example, it will then erroneously count the ,41,.

I suppose I could do something like:

my $test = "4|9|11";
$string =~ s/,/ /g;
my @c = $string =~ m/\b($test)\b/g;

which works, but adds another step before the match counting. Is there a way to perform the matches keeping the original string unchanged?

I'm also trying to avoid looping through my match targets individually and summing the individual match counts, because I need this to be as efficient as possible. I'm working with some really massive lists of values requiring millions of permutations, and my current loop-based script takes days to complete. I'm hoping regex matching will be faster.

Thanks


Solution

  • The problem is that the trailing , is consumed by the ,9, match, so the search for the next match starts at 11,12,. There's no leading , before the 11 at that point, so it can't match. I'd recommend using a lookahead like this:

    ,(4|9|11)(?=,)
    

    This way, the trailing , will not be consumed as part of the match.

    For example:

    my $string = ",4,8,9,11,12,";
    
    my $test = ",(4|9|11)(?=,)";
    
    my @c = $string =~ m/$test/g;
    my $count = @c;
    print "count: $count\n";
    print "\@c:", join(" ", @c), "\n";
    

    Outputs:

    count: 3
    @c:4 9 11
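
To apply the lookahead to the original goal of keeping only the lines that contain at least two of the three values, something along these lines might work. This is only a minimal sketch: the helper name count_targets, the @targets and $min_hits variables, and the sample lines are assumptions made for illustration, not part of the original post.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): counts how many target
# values occur in a comma-delimited line such as ",4,8,9,11,12,".
sub count_targets {
    my ($line, $re) = @_;
    # Assigning the match to an empty list forces list context, so the
    # outer scalar assignment receives the number of matches found.
    my $count = () = $line =~ /$re/g;
    return $count;
}

# Assumed inputs for illustration only.
my @targets  = (4, 9, 11);
my $min_hits = 2;

# Build and compile the pattern once. The lookahead (?=,) checks for the
# trailing comma without consuming it, so adjacent values can share a comma.
my $alternation = join '|', @targets;
my $re = qr/,(?:$alternation)(?=,)/;

# Sample lines; the third one contains 41 and 90, which must not be counted.
my @lines = (",4,8,9,11,12,", ",5,6,7,9,11,", ",1,41,90,");

for my $line (@lines) {
    # Print only the lines with at least $min_hits matching values.
    print "$line\n" if count_targets($line, $re) >= $min_hits;
}
```

The original string is never modified here, and the pattern is compiled once with qr// instead of being rebuilt for every line. Whether this is actually faster than the loop-based version would need to be measured on the real data.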