LPeg pattern which matches strings without consecutive hyphens


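I'm trying to write an LPeg pattern to match strings which start with a letter, contain only letters, digits, and hyphens, and never have two hyphens in a row or a trailing hyphen.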

For reference, the regular expression [a-zA-Z](-?[a-zA-Z0-9])* matches what I'm looking for.

Here's the code I'm working with, for reference:

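-- Bind the module explicitly: recent LPeg builds do not set a global "lpeg".
local lpeg = require "lpeg"
local P, R, C = lpeg.P, lpeg.R, lpeg.C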

dash  = P"-"
ucase  = R"AZ"
lcase  = R"az"
digit  = R"09"
letter = ucase + lcase
alphanum = letter + digit

str_match = C(letter * ((dash^-1) * alphanum)^0)

strs = {
    "1too",
    "too0",
    "t-t-t",
    "t-t--t",
    "t--t-t",
    "t-1-t",
    "t--t",
    "t-one1",
    "1-1",
    "t-1",
    "t",
    "tt",
    "t1",
    "1",
}

for _,v in ipairs(strs) do
    if lpeg.match(str_match,v) ~= nil then
        print(v," => match!")
    else
        print(v," => no match")
    end
end

However, much to my frustration, I get the following output:

1too     => no match
too0     => match!
t-t-t    => match!
t-t--t   => match!
t--t-t   => match!
t-1-t    => match!
t--t     => match!
t-one1   => match!
1-1      => no match
t-1      => match!
t        => match!
tt       => match!
t1       => match!
1        => no match

Despite what the code outputs, t-t--t, t--t-t, and t--t shouldn't match.


Solution

  • In your pattern letter * ((dash^-1) * alphanum)^0, lpeg.match is anchored at the start of the subject but does not require the pattern to consume the whole string, so it succeeds as soon as any prefix matches. For the cases where you didn't expect a match, the prefix that matches is shown on the right:

    t-t--t    (matches "t-t")
    t--t-t    (matches "t")
    t--t      (matches "t")

    When a pattern produces no captures, lpeg.match returns the index of the first character after the match (a number). Here the pattern is wrapped in C(...), so the matched prefix is captured and returned instead. Either way the result is not nil, which explains the erroneous output you're seeing.
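To see the prefix behaviour directly, here is a minimal sketch that runs the question's pattern with and without the capture. It assumes LPeg is installed and loadable as "lpeg"; the names prefix_only and captured are made up for this illustration.

```lua
local lpeg = require "lpeg"
local P, R, C = lpeg.P, lpeg.R, lpeg.C

local letter   = R("AZ", "az")
local alphanum = letter + R"09"

-- Same pattern as in the question, without and with the capture.
-- (prefix_only and captured are hypothetical names for this example.)
local prefix_only = letter * (P"-"^-1 * alphanum)^0
local captured    = C(prefix_only)

-- No capture: the result is the index just past the matched prefix "t-t".
print(lpeg.match(prefix_only, "t-t--t"))  --> 4
-- With C(...): the matched prefix itself is returned.
print(lpeg.match(captured, "t-t--t"))     --> t-t
print(lpeg.match(captured, "t--t"))       --> t
-- A subject that fails on its very first character gives nil.
print(lpeg.match(captured, "1too"))       --> nil
```

Only the last call returns nil, which is why a plain "~= nil" test reports a match for every string that merely begins with a valid prefix.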

    If you're just matching each string one at a time, you can modify your pattern to check that there are no remaining characters left after the parse.

    str_match = C(letter * ((dash^-1) * alphanum)^0) * -1
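As a quick check of the anchored pattern, the sketch below runs it over a few of the strings from the question plus one with a trailing hyphen. It assumes LPeg is loadable as "lpeg"; is_valid is a hypothetical helper introduced only for this example.

```lua
local lpeg = require "lpeg"
local P, R, C = lpeg.P, lpeg.R, lpeg.C

local letter   = R("AZ", "az")
local alphanum = letter + R"09"

-- "* -1" succeeds only when no input is left, so the whole subject must match.
local str_match = C(letter * (P"-"^-1 * alphanum)^0) * -1

-- Hypothetical helper: true if the entire string matches the pattern.
local function is_valid(s)
    return lpeg.match(str_match, s) ~= nil
end

for _, s in ipairs{ "t-t-t", "t-t--t", "t--t-t", "t--t", "t-1", "1-1", "t-" } do
    print(s, is_valid(s) and "=> match!" or "=> no match")
end
```

With the end-of-input check in place, only t-t-t and t-1 from this list are reported as matches.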
    

    Similarly, using the re module that ships with LPeg (loaded with local re = require "re"):

    re_pat = re.compile "{ %a ('-'? %w)* } !."
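A small usage sketch for the re version, assuming the re module is on the Lua path alongside LPeg. Note that %a and %w are locale-dependent character classes, so they are only equivalent to [a-zA-Z] and [a-zA-Z0-9] under the default C locale.

```lua
local re = require "re"

-- { ... } is a simple capture; "!." succeeds only at the end of the subject.
local re_pat = re.compile "{ %a ('-'? %w)* } !."

print(re_pat:match("t-one1"))  --> t-one1
print(re_pat:match("t--t"))    --> nil
print(re_pat:match("1-1"))     --> nil
```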
    

    For stream matching or finding all pattern occurrences in the target string, stack the grammar rules together like this

    stream_parse = re.compile
    [[
      stream_match  <- ((str_match / skip_nonmatch) delim)* str_match?
      str_match     <- { %a ('-'? %w)* } (&delim / !.)
      skip_nonmatch <- !str_match (!delim .)*
    
      delim         <- %s+
    ]]
    

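    Any matches will get captured and returned. If there are no matches, you'll get back a number indicating where in the string the pattern stopped parsing rather than nil, because every part of stream_match is optional and the grammar can therefore succeed without capturing anything.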
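For illustration, here is the first grammar applied to a whitespace-separated sample. The input strings are invented for this sketch, and it assumes the re module is loadable as "re".

```lua
local re = require "re"

-- First stream grammar from the answer, unchanged.
local stream_parse = re.compile [[
  stream_match  <- ((str_match / skip_nonmatch) delim)* str_match?
  str_match     <- { %a ('-'? %w)* } (&delim / !.)
  skip_nonmatch <- !str_match (!delim .)*

  delim         <- %s+
]]

-- Valid words come back as multiple return values, in order.
local a, b, c = stream_parse:match("foo b--ar t-1 9x baz")
print(a, b, c)                             --> foo  t-1  baz (tab-separated)

-- No valid word at all: nothing is captured, so a position is returned.
print(type(stream_parse:match("9x 1-1")))  --> number
```

Wrapping the call in a table constructor, as in { stream_parse:match(s) }, is one way to collect all captures when their count is not known in advance.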

    Edit: For cases where you need the parse to return nil on no match, this tweak to the grammar should do the trick

    stream_parse = re.compile
    [[
      stream_match  <- (str_match / skip_nonmatch+ &str_match)+
      str_match     <- { %a ('-'? %w)* } (&delim / !.)
      skip_nonmatch <- !str_match (!delim .)* delim
    
      delim         <- %s+
    ]]
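A short sketch of how the tweaked grammar behaves, again with invented sample input and assuming the re module is loadable as "re":

```lua
local re = require "re"

-- Second stream grammar from the answer, unchanged.
local stream_parse = re.compile [[
  stream_match  <- (str_match / skip_nonmatch+ &str_match)+
  str_match     <- { %a ('-'? %w)* } (&delim / !.)
  skip_nonmatch <- !str_match (!delim .)* delim

  delim         <- %s+
]]

-- No valid word anywhere: the "+" loop cannot complete one iteration.
print(stream_parse:match("9x 1-1"))  --> nil

-- At least one valid word: the captures are returned, starting with the first.
local first = stream_parse:match("foo b--ar t-1 9x baz")
print(first)                         --> foo
```

Because stream_match now needs at least one successful str_match, a nil result can be used directly as the "nothing found" signal.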