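I'm trying to write an LPeg pattern to match strings which start with a letter and continue with letters or digits, each optionally preceded by a single dash. A string with consecutive dashes, such as test--string, should not match.
For reference, the regular expression [a-zA-Z](-?[a-zA-Z0-9])* matches what I'm looking for.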
Here's the code I'm working with, for reference:
local lpeg = require "lpeg"
P,R,C = lpeg.P,lpeg.R,lpeg.C
dash = P"-"
ucase = R"AZ"
lcase = R"az"
digit = R"09"
letter = ucase + lcase
alphanum = letter + digit
str_match = C(letter * ((dash^-1) * alphanum)^0)
strs = {
  "1too",
  "too0",
  "t-t-t",
  "t-t--t",
  "t--t-t",
  "t-1-t",
  "t--t",
  "t-one1",
  "1-1",
  "t-1",
  "t",
  "tt",
  "t1",
  "1",
}
for _,v in ipairs(strs) do
  if lpeg.match(str_match,v) ~= nil then
    print(v," => match!")
  else
    print(v," => no match")
  end
end
However, much to my frustration, I get the following output:
1too => no match
too0 => match!
t-t-t => match!
t-t--t => match!
t--t-t => match!
t-1-t => match!
t--t => match!
t-one1 => match!
1-1 => no match
t-1 => match!
t => match!
tt => match!
t1 => match!
1 => no match
Despite what the code outputs, t-t--t, t--t-t, and t--t shouldn't match.
In your pattern letter * ((dash^-1) * alphanum)^0, lpeg only has to match a prefix of the string, not the whole string. For the cases where you didn't expect a match, the prefix that matched is:
t-t--t (matches the prefix t-t)
t--t-t (matches the prefix t)
t--t (matches the prefix t)
When a pattern produces no captures, lpeg.match returns the position just past the end of the match (which is a number). Your pattern is wrapped in C, so for the three cases above it returns the captured prefix instead. That value is not nil, which explains the erroneous output you're seeing.
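A minimal sketch of that behaviour, assuming LPeg is installed and loadable with require "lpeg". The name body is only illustrative; it stands for the question's pattern without the surrounding C:

```lua
local lpeg = require "lpeg"
local P, R, C = lpeg.P, lpeg.R, lpeg.C

local letter   = R("AZ", "az")
local alphanum = letter + R"09"
-- "body" is a hypothetical name for the question's pattern without C(...)
local body     = letter * (P"-"^-1 * alphanum)^0

-- No capture: match returns the index just past the matched prefix.
local pos    = lpeg.match(body, "t--t")       -- 2, only "t" was consumed
-- With a capture: match returns the captured prefix instead.
local short  = lpeg.match(C(body), "t--t")    -- "t"
local longer = lpeg.match(C(body), "t-t--t")  -- "t-t"
-- nil only appears when not even the first letter matches.
local none   = lpeg.match(C(body), "1too")    -- nil

print(pos, short, longer, none)
```

Comparing the result against nil therefore only tells you that some prefix matched, not that the whole string did.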
If you're matching each string one at a time, you can modify your pattern to check that no characters remain after the parse:
str_match = C(letter * ((dash^-1) * alphanum)^0) * -1
Similarly, using the re module that ships with LPeg:
local re = require "re"
re_pat = re.compile "{ %a ('-'? %w)* } !."
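As a rough check, the two anchored patterns can be run side by side. This sketch assumes both lpeg and re can be loaded with require; the sample strings, including "t-", are just illustrative inputs:

```lua
local lpeg = require "lpeg"
local re   = require "re"
local P, R, C = lpeg.P, lpeg.R, lpeg.C

local letter   = R("AZ", "az")
local alphanum = letter + R"09"
-- The number -1 is converted to lpeg.P(-1), which succeeds only when
-- no character is left in the subject.
local str_match = C(letter * (P"-"^-1 * alphanum)^0) * -1
local re_pat    = re.compile "{ %a ('-'? %w)* } !."

for _, s in ipairs { "t-t-t", "t-1", "t--t", "t-t--t", "t-", "1too" } do
  -- Both patterns return the whole string on success and nil otherwise.
  print(s, str_match:match(s), re_pat:match(s))
end
```

With the end-of-input check in place, a string is reported only when the pattern consumes all of it, so the nil test in the original loop behaves as intended.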
For stream matching, or for finding every occurrence of the pattern in the target string, stack the grammar rules together like this:
stream_parse = re.compile
[[
stream_match <- ((str_match / skip_nonmatch) delim)* str_match?
str_match <- { %a ('-'? %w)* } (&delim / !.)
skip_nonmatch <- !str_match (!delim .)*
delim <- %s+
]]
Any matches will get captured and returned. If there are no matches, you'll get back a number indicating where in the string the pattern stopped parsing; since every part of stream_match is optional, this version never returns nil.
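A short usage sketch for the grammar above, assuming the re module is available. The sample subjects are made up for illustration, and the captures are collected into a table because match returns them as multiple values:

```lua
local re = require "re"

local stream_parse = re.compile [[
  stream_match  <- ((str_match / skip_nonmatch) delim)* str_match?
  str_match     <- { %a ('-'? %w)* } (&delim / !.)
  skip_nonmatch <- !str_match (!delim .)*
  delim         <- %s+
]]

-- Whitespace-separated words; "t--t" and "1too" are skipped.
local found = { stream_parse:match("t-1 t--t foo 1too bar") }
print(table.concat(found, ", "))  -- t-1, foo, bar

-- No word qualifies, so a position (a number) comes back, not nil.
local stopped = stream_parse:match("1too t--t")
print(type(stopped))  -- number
```

Because a no-match result is a number rather than nil, callers need to check the type of the first returned value before treating it as a captured string.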
Edit: For cases where you need the parse to return nil on no match, this tweak to the grammar should do the trick:
stream_parse = re.compile
[[
stream_match <- (str_match / skip_nonmatch+ &str_match)+
str_match <- { %a ('-'? %w)* } (&delim / !.)
skip_nonmatch <- !str_match (!delim .)* delim
delim <- %s+
]]
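A small sketch of how this version behaves, again assuming the re module is available and using made-up subjects. Only the first returned value is inspected, which is enough to tell a match from nil:

```lua
local re = require "re"

local stream_parse = re.compile [[
  stream_match  <- (str_match / skip_nonmatch+ &str_match)+
  str_match     <- { %a ('-'? %w)* } (&delim / !.)
  skip_nonmatch <- !str_match (!delim .)* delim
  delim         <- %s+
]]

-- A valid word at the start is captured; the trailing "t--t" is ignored.
local first   = stream_parse:match("foo t--t")   -- "foo"
-- A leading non-match is skipped to reach the valid word.
local skipped = stream_parse:match("1too foo")   -- "foo"
-- No valid word anywhere: the whole parse fails.
local none    = stream_parse:match("1too t--t")  -- nil

print(first, skipped, none)
```

The outer + requires at least one successful str_match, which is what turns a subject with no valid words into an outright failure.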