I'm trying to write a syntax highlighter for VSCode, which uses the TextMate format. I've got an entry for one-line comments, copied from an example, and it works fine, but I'd like to extend/modify it.

    "linecomment": {
        "name": "comment",
        "match": "(%)(?!(\\[=*\\[|\\]=*\\])).*$\n?",
        "captures": {
            "1": {
                "name": "comment"
            }
        }
    },
The problem is, the regular expressions used here are not documented anywhere that I can find. I understand basic grep and the theory behind regular expressions, but I have no idea what is going on in `?!(\\[=*\\[|\\]=*\\])).*$\n?`. In particular, I don't know which characters are in the regex language, and which are being matched.
Can somebody explain to me:
I don't know the answer to (1), but the answer to (2) is as follows:
Firstly, if you've only used grep and not other flavours of regex, you should know that there are some syntax differences. In most flavours, for example, `\+` is a literal `+` and `+` is the quantifier; in grep's default (basic) syntax, `+` is literal and `\+` is the quantifier. There are other characters where the meaning of `\` is reversed in this way.
Secondly, the string literal isn't the same as the string itself, because of backslash-escaping. The string literal looks like this:
"(%)(?!(\\[=*\\[|\\]=*\\])).*$\n?"
while the string itself looks like this:
(%)(?!(\[=*\[|\]=*\])).*$
?
(with a newline character near the end).
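To see the difference between the literal and the string for yourself, here is a minimal Python sketch. It assumes Python's `json` module unescapes the literal the same way VSCode's JSON parser does, which holds for `\\` and `\n`; the variable names are just for illustration.

```python
import json

# The pattern exactly as it appears in the grammar file, quotes included.
# A raw Python string keeps every backslash as typed, so json.loads sees
# the same characters that are stored in the JSON file.
json_literal = r'"(%)(?!(\\[=*\\[|\\]=*\\])).*$\n?"'

# Decoding the JSON literal gives the string that is handed to the regex engine.
pattern = json.loads(json_literal)

print(pattern)        # the decoded regex, with a real line break before the final ?
print(repr(pattern))  # Python's own escaped view of the same string
```

Each `\\` in the file collapses to a single backslash, and `\n` becomes an actual newline character, so the regex engine only ever sees four backslashes.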
Let's look at the following subexpression:
\[=*\[|\]=*\]
At first I thought this was a character class, delimited by `\[` and `\]`. But (a) I don't know of any flavour of regex where backslash-escaped square brackets are character class delimiters and unescaped ones are literal square brackets, rather than vice versa; (b) there's no reason to write a character class with repeated characters; and (c) there's no obvious reason why the first `\]` would be a literal `]` and the second one would end the character class. So it looks like `\[` and `\]` are literal square brackets.
`|` means "or" in regexes. It is a low-precedence operator, so this subexpression means either `\[=*\[` or `\]=*\]`. In other words, it matches strings such as `[[`, `[=[`, `[======[`, etc., as well as `]]`, `]=]`, etc.
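A quick way to check this reading is to try the subexpression on a few strings. The sketch below uses Python's `re` module as a stand-in for the engine VSCode uses; for this particular subexpression I'd expect the two to agree, but that is an assumption.

```python
import re

# "[" + any number of "=" + "[", OR the same shape built from "]".
bracket = re.compile(r"\[=*\[|\]=*\]")

candidates = ["[[", "[=[", "[======[", "]]", "]=]", "[=]", "[", "=="]
for text in candidates:
    # fullmatch succeeds only if the whole string fits one of the alternatives
    print(f"{text!r:12} {bool(bracket.fullmatch(text))}")
```

The mixed form `[=]` is rejected, which is consistent with `|` splitting the whole subexpression into two alternatives rather than applying to just the characters next to it.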
`(?!...)` is a zero-width assertion. It is a negative lookahead: it matches at any point in the string where the positive lookahead `(?=...)` would not match. In general, if the regex `A` matches the string `a` and `C` matches the string `c`, then the regex `A(?!B)C` matches the string `ac`, unless the regex `B` matches at the start of `c` (that is, it matches `c` or some prefix of `c`). In other words, the match fails if the string is something like `%]==]`.
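Here is a small sketch of the lookahead on its own, again using Python's `re` as a stand-in (an assumption; `(?!...)` has the same meaning in every flavour I know of). The name `guarded` is made up for the example.

```python
import re

# A = "%", B = the bracket alternation, C = ".*"
guarded = re.compile(r"%(?!\[=*\[|\]=*\]).*")

def matched_text(text):
    """Return what the regex matches at the start of text, or None."""
    m = guarded.match(text)
    return m.group(0) if m else None

print(matched_text("% plain comment"))    # B does not match right after the %
print(matched_text("%[[ block"))          # B matches right after the %: no match
print(matched_text("%]==]"))              # same, with closing brackets
print(matched_text("% text [[ later"))    # B matches later on, but not right after the %
```

Because the assertion is zero-width, the characters it inspected are still there for `.*` to consume. The last case also shows that the lookahead is only tested at the position immediately after the `%`, so a bracket pair further along the line doesn't block the match.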
`.*` matches any number of characters, including zero. (I assume this doesn't match newlines.) `$` is another zero-width assertion: it can only match at the end of the line. Actually, it's not needed in this case: the `.*` subexpression is greedy and will match all non-newline characters, so the end of the `.*` match is guaranteed to be the end of the line. That is, unless there's some edge case I'm not aware of involving carriage returns or some even more exotic line-terminating character.
Finally, `\n?` will match the newline character itself, if it exists (`?` is a quantifier). If this is the last line of the string then there may not be a newline; in that case the regex match would fail without the `?`.
Putting it all together: the regex will match from a `%` until the end of the line, including the newline character if it exists, unless the string it's trying to match starts with `%[[` or `%]==]` or something similar.
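If you want to experiment before editing the grammar, something like the following sketch may help. It runs the full pattern through Python's `re`, which is not the engine VSCode uses, so take it as an approximation; `comment_span` is a hypothetical helper, not part of any TextMate or VSCode API.

```python
import re

# The decoded pattern. In a raw Python string, \n is passed to the regex
# engine as an escape that still means "newline".
line_comment = re.compile(r"(%)(?!(\[=*\[|\]=*\])).*$\n?", re.MULTILINE)

def comment_span(line):
    """Return the text the line-comment rule would cover, or None."""
    m = line_comment.search(line)
    return m.group(0) if m else None

print(repr(comment_span("x = 1 % set x\n")))       # from the % through the newline
print(repr(comment_span("% last line")))           # no newline to consume, still matches
print(repr(comment_span("%[[ block comment\n")))   # None: rejected by the lookahead
print(repr(comment_span("%]==]\n")))               # None: rejected by the lookahead
print(repr(comment_span("%[ not a block\n")))      # one bracket is not enough to reject
```

Group 1 of a successful match is just the `%`, which is what the `"captures"` entry in the grammar assigns a scope to.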