I've been working on a text editor that uses LPEG to implement syntax highlighting support. Getting things up and running was pretty simple, but I've only done the minimum required.
I've defined a bunch of patterns like this:
-- Keywords
local keyword = C(
P"auto" +
P"break" +
P"case" +
P"char" +
P"int"
-- more ..
) / function() add_syntax( RED, ... )
This correctly handles input, but unfortunately matches too much. For example int
matches in the middle of printf
, which is expected because I'm using "P
" for a literal match.
Obviously to perform "proper" highlighting I need to match on word-boundaries, such that "int" matches "int", but not "printf", "vsprintf", etc.
I tried to use this to limit the match to only occurring after "<[{ \n
", but this didn't do what I want:
-- space, newline, comma, brackets followed by the keyword
S(" \n(<{,")^1 * P"auto" +
Is there a simple, obvious, solution I'm missing here to match only keywords/tokens that are surrounded by whitespace or other characters that you'd expect in C-code? I do need the captured token so I can highlight it, but otherwise I'm not married to any particular approach.
e.g. These should match:
int foo;
void(int argc,std::list<int,int> ) { .. };
But this should not:
fprintf(stderr, "blah. patterns are hard\n");
The LPeg construction -pattern
(or more specifically -idchar
in the following example) does a good job of making sure that the current match is not followed by pattern
(i.e. idchar
). Luckily this also works for empty strings at the end of the input, so we don't need special handling for that. For making sure that a match is not preceded by a pattern, LPeg provides lpeg.B(pattern)
. Unfortunately, this requires a pattern that matches a fixed length string, and so won't work at the beginning of the input. To fix that the following code separately tries to match without lpeg.B()
at the beginning of the input before falling back to a pattern that checks suffixes and prefixes for the rest of the string:
local L = require( "lpeg" )
local function decorate( word )
-- highlighting in UNIX terminals
return "\27[32;1m"..word.."\27[0m"
end
-- matches characters that may be part of an identifier
local idchar = L.R( "az", "AZ", "09" ) + L.P"_"
-- list of keywords to be highlighted
local keywords = L.C( L.P"in" +
L.P"for" )
local function highlight( s )
local p = L.P{
(L.V"nosuffix" + "") * (L.V"exactmatch" + 1)^0,
nosuffix = (keywords / decorate) * -idchar,
exactmatch = L.B( 1 - idchar ) * L.V"nosuffix",
}
return L.match( L.Cs( p ), s )
end
-- tests:
print( highlight"" )
print( highlight"hello world" )
print( highlight"in 0in int for xfor for_ |for| in" )