
Perl - Regex to extract only the comma-separated strings


I have a question I am hoping someone could help with...

I have a variable that contains the content from a webpage (scraped using WWW::Mechanize).

The variable contains data such as these:

$var = "ewrfs sdfdsf cat_dog,horse,rabbit,chicken-pig";
$var = "fdsf iiukui aawwe dffg elephant,MOUSE_RAT,spider,lion-tiger hdsfds jdlkf sdf";
$var = "dsadp poids pewqwe ANTELOPE-GIRAFFE,frOG,fish,crab,kangaROO-KOALA sdfdsf hkew";

The only bits I am interested in from the above examples are:

@array = ("cat_dog","horse","rabbit","chicken-pig");
@array = ("elephant","MOUSE_RAT","spider","lion-tiger");
@array = ("ANTELOPE-GIRAFFE","frOG","fish","crab","kangaROO-KOALA");

The problem I am having:

I am trying to extract only the comma-separated strings from the variables and then store these in an array for use later on.

What is the best way to make sure that I also get the strings at the start (i.e. cat_dog) and end (i.e. chicken-pig) of the comma-separated list of animals, given that they are not prefixed/suffixed with a comma?

Also, as the variables will contain webpage content, there will inevitably be instances where a comma is immediately followed by a space and then another word, as that is the normal way of using commas in paragraphs and sentences...

For example:

Saturn was long thought to be the only ringed planet, however, this is now known not to be the case. 
                                                     ^        ^
                                                     |        |
                                    note the spaces here and here

I am not interested in any cases where the comma is followed by a space (as shown above).

I am only interested in cases where the comma DOES NOT have a space after it (i.e. cat_dog,horse,rabbit,chicken-pig).

I have tried a number of ways of doing this but cannot work out the best way to construct the regular expression.


Solution

  • How about

    [^,\s]+(,[^,\s]+)+
    

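    This matches one or more characters that are neither whitespace nor a comma ([^,\s]+), followed one or more times by a comma and another run of such characters. Because every comma must be followed immediately by a non-whitespace character, the commas in ordinary prose (comma, then space) never match.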

    Further to comments

    To match more than one sequence, add the g modifier for global matching.
    The following splits each match ($&) on commas and pushes the results onto @matches.

    my $str = "sdfds cat_dog,horse,rabbit,chicken-pig then some more pig,duck,goose";
    my @matches;
    
    while ($str =~ /[^,\s]+(,[^,\s]+)+/g) {
        push(@matches, split(/,/, $&));
    }   
    
    print join("\n",@matches),"\n";
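
As a variation on the loop above, here is a minimal sketch that avoids $& by capturing the whole run in one group and using the match in list context, and that keeps each comma-separated run as its own list instead of flattening everything into one array. The helper name extract_lists and the variable $list_re are made up for illustration; the inner group is changed to non-capturing (?:...) so that the only capture is the full run.

```perl
use strict;
use warnings;

# Same pattern as in the answer, except that the inner group is
# non-capturing and the whole run is wrapped in a single capture group.
my $list_re = qr/([^,\s]+(?:,[^,\s]+)+)/;

# Hypothetical helper: returns one array reference per comma-separated run.
sub extract_lists {
    my ($text) = @_;
    # In list context, a /g match returns the captured text of every match,
    # so there is no need for $&.
    return map { [ split /,/ ] } $text =~ /$list_re/g;
}

my @samples = (
    "ewrfs sdfdsf cat_dog,horse,rabbit,chicken-pig",
    "dsadp poids pewqwe ANTELOPE-GIRAFFE,frOG,fish,crab,kangaROO-KOALA sdfdsf hkew",
    "Saturn was long thought to be the only ringed planet, however, this is now known not to be the case.",
    "sdfds cat_dog,horse,rabbit,chicken-pig then some more pig,duck,goose",
);

for my $text (@samples) {
    my @lists = extract_lists($text);
    print scalar(@lists), " list(s): ",
          join(" | ", map { join(" ", @$_) } @lists), "\n";
}
```

Note that the pattern only looks at commas and whitespace, so anything in the page of the form word,word with no space (a number such as 1,000, for instance) would also be picked up, and trailing punctuation attached to the last item would be kept as part of it. If that matters for your pages, the character class may need to be tightened, for example to [\w-]+.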