variables, raku, user-defined, character-class

How to insert variable in user-defined character class?


I am trying to let a program define a character class based on text it has already encountered. However, <[ ]> takes its contents literally, so the following yields an error:

my $all1Line = slurp "htmlFile";
my @a = ($all1Line ~~ m:g/ (\" || \') ~ $0 {} :my $marker = $0; http <-[ $marker ]>*? page <-[ $marker ]>*? /); # error: $marker is taken literally as $ m a r k e r

I want to match all links of the form "https://foo?page=0?ssl=1" or 'http ... page ...'.


Solution

  • Based on your example code and text, I'm not entirely sure what your source data looks like, so I can't give much more detail. That said, if the goal is to match characters taken from an earlier part of the match, the easiest way is array matching. An array interpolated into a regex matches any one of its elements, so an array of single characters behaves like a character class built at match time:

    my $input = "(abc)aaaaaa(def)ddee(ghi)gihgih(jkl)mnmnoo";
    
    my @output = $input ~~ m:g/
        :my @valid;                # initialize variable in regex scope
        '(' ~ ')'  $<valid>=(.*?)  # capture initial text
        { @valid = $<valid>.comb } # split the text into characters
        $<text>=(@valid+)          # capture the following run made up only of those characters
    /;
    
    say @output;
    .say for @output.map(*<text>.Str);
    

    The output is shown below. The final group, (jkl)mnmnoo, does not match because none of j, k, or l follows it:

    [「(abc)aaaaaa」
     valid => 「abc」
     text => 「aaaaaa」 「(def)ddee」
     valid => 「def」
     text => 「ddee」 「(ghi)gihgih」
     valid => 「ghi」
     text => 「gihgih」]
    aaaaaa
    ddee
    gihgih
    

    Alternatively, you could store the entire character class definition in a string variable and interpolate it as regex with <$marker-char-class>, or, if you want to avoid the extra variable, build it inline as code whose result is interpreted as regex: <{ '<[' ~ $marker ~ ']>' }>. Both methods share the same drawback: you are assembling regex source text from the matched characters, so any character that is special inside a character class (for example ], \ or -) needs escaping or careful ordering. That makes them more fragile than array matching.
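
    As a rough sketch, the two string-based alternatives could look like this when applied to quoted links like those in the question. The sample input, the variable names ($html, $class, $marker) and the capture names (quote, url) are all made up for illustration, and prefixing the marker with a backslash is an assumption that only holds for non-alphanumeric markers such as quote characters:

    ```raku
    # Hypothetical sample input; the question does not show the real HTML file.
    my $html = q{<a href="https://foo?page=0?ssl=1">first</a> }
             ~ q{<a href='http://bar/page/2'>second</a> }
             ~ q{<a href="https://baz/index">third</a>};

    # Variant 1: keep the class source in a variable and interpolate it with <$class>.
    my @links = $html ~~ m:g/
        :my $class;                                # regex-scoped variable
        $<quote>=[ \" | \' ]                       # opening quote, either kind
        { $class = '<-[\\' ~ $<quote> ~ ']>' }     # class source: anything but that quote
        $<url>=[ http <$class>*? page <$class>* ]  # URL that contains the word page
        $<quote>                                   # backreference: same closing quote
    /;

    # Variant 2: build the same class source inline with <{ ... }>.
    my @inline = $html ~~ m:g/
        :my $marker;
        $<quote>=[ \" | \' ]
        { $marker = ~$<quote> }                    # remember the quote as a string
        $<url>=[ http <{ '<-[\\' ~ $marker ~ ']>' }>*? page <{ '<-[\\' ~ $marker ~ ']>' }>* ]
        $<quote>
    /;

    .say for @links.map(*<url>.Str);
    .say for @inline.map(*<url>.Str);
    ```

    Both variants should report the two URLs that contain "page" and skip the third. The escaping step is where these approaches get fragile: a backslash in front of a letter or digit (\d, \n and so on) means something else entirely, so a marker that can be alphanumeric would need different handling.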

    If it's something you'll do often rather than ad hoc, you could also define your own regex method (token), but that's probably overkill here and would be better asked as its own question.