I'm trying to capture VBA comments. I have the following so far:
'[^";]+\Z
This matches anything that starts with a single quote and contains no double quotes up to the end of the string, so it will not match a single quote inside a double-quoted string.
dim s as string ' a string variable -- works
s = "the cat's hat" ' quote within string -- works
But it fails if the comment itself contains a double-quoted string, e.g.:
dim s as string ' string should be set to "ten"
How can I fix my regex to handle that too?
The pattern in @Jeff Wurz's comment (^\'[^\r\n]+$|''[^\r\n]+$) doesn't match any of your test samples, and the linked question is useless: the regex in there only matches the specific comment in that OP's question, not "the VBA comment syntax".
The regex you have come up with works even better than what I had when I gave up the regex approach.
Well done!
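To make the comparison concrete, here is a minimal sketch that runs both patterns against the three sample lines from the question. It is written in Python rather than C# purely for brevity, and the names asker_pattern, commenter_pattern and first_match are made up for this illustration.

```python
import re

# Both patterns are copied from the question and from the comment quoted above.
asker_pattern = re.compile(r"""'[^";]+\Z""")
commenter_pattern = re.compile(r"""^\'[^\r\n]+$|''[^\r\n]+$""")

samples = [
    "dim s as string ' a string variable -- works",
    "s = \"the cat's hat\" ' quote within string -- works",
    "dim s as string ' string should be set to \"ten\"",
]


def first_match(pattern, line):
    """Return the text of the first match in line, or None if nothing matches."""
    match = pattern.search(line)
    return match.group(0) if match else None


for line in samples:
    print(repr(first_match(asker_pattern, line)),
          repr(first_match(commenter_pattern, line)))
```

The question's pattern finds the comment in the first two samples and nothing in the third, while the commenter's pattern finds nothing in any of them, because none of the samples starts with an apostrophe or contains two consecutive apostrophes.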
The problem is that you can't parse VBA comments with a regex.
In Lexers vs Parsers, @SasQ's answer does a good job at explaining Chomsky's grammar levels:
Level 3: Regular grammars
They use regular expressions, that is, they can consist only of the symbols of alphabet (a,b), their concatenations (ab,aba,bbb etc.), or alternatives (e.g. a|b). They can be implemented as finite state automata (FSA), like NFA (Nondeterministic Finite Automaton) or better DFA (Deterministic Finite Automaton). Regular grammars can't handle with nested syntax, e.g. properly nested/matched parentheses (()()(()())), nested HTML/BBcode tags, nested blocks etc. It's because state automata to deal with it should have to have infinitely many states to handle infinitely many nesting levels.
Level 2: Context-free grammars
They can have nested, recursive, self-similar branches in their syntax trees, so they can handle with nested structures well. They can be implemented as state automaton with stack. This stack is used to represent the nesting level of the syntax. In practice, they're usually implemented as a top-down, recursive-descent parser which uses machine's procedure call stack to track the nesting level, and use recursively called procedures/functions for every non-terminal symbol in their syntax. But they can't handle with a context-sensitive syntax. E.g. when you have an expression x+3 and in one context this x could be a name of a variable, and in other context it could be a name of a function etc.
Level 1: Context-sensitive grammars
Regular expressions simply aren't the appropriate tool for solving this problem. Whenever a line has more than one single quote (apostrophe), or double quotes are involved, you need to figure out whether the left-most apostrophe in the code line is inside double quotes. If it is, you need to match the double quotes and find the left-most apostrophe after the closing double quote. In other words, the left-most apostrophe that isn't part of a string literal is your comment marker.
My understanding is that VBA comment syntax is a context-sensitive grammar (level 1), because the apostrophe is only your marker if it's not part of a string literal. To figure out whether an apostrophe is part of a string literal, the easiest approach is probably to walk your string left to right and toggle some IsInsideQuote flag as you encounter double quotes, but only if they're not escaped (doubled-up). Actually, you don't even check whether there's an apostrophe inside the string literal: you just keep walking until the open quotes are closed, and only when the "in-quotes flag" is False does a single quote mean you have found a comment marker.
Good luck!
Here's a test case you're missing:
s = "abc'def ""xyz""'nutz!" 'string with apostrophes and escaped double quotes
If you don't care about capturing the string literals, you can simply ignore the escaped double quotes and see 3 string literals here: "abc'def ", "xyz" and "'nutz!".
This C# code outputs 'string with apostrophes and escaped double quotes (all in-string double quotes are escaped with a backslash in the code), and works with all the test strings I gave it:
    using System;

    class Program
    {
        static void Main(string[] args)
        {
            var instruction = "s = \"abc'def \"\"xyz\"\"'nutz!\" 'string with apostrophes and escaped double quotes";
            // var instruction = "s = \"the cat's hat\" ' quote within string -- works";
            // var instruction = "dim s as string ' string should be set to \"ten\"";

            int? commentStart = null;
            var isInsideQuotes = false;

            for (var i = 0; i < instruction.Length; i++)
            {
                // Every double quote toggles the flag; an escaped ("") pair toggles it twice, which cancels out.
                if (instruction[i] == '"')
                {
                    isInsideQuotes = !isInsideQuotes;
                }

                // The first apostrophe outside a string literal is the comment marker.
                if (!isInsideQuotes && instruction[i] == '\'')
                {
                    commentStart = i;
                    break;
                }
            }

            if (commentStart.HasValue)
            {
                Console.WriteLine(instruction.Substring(commentStart.Value));
            }

            Console.ReadLine(); // keep the console window open
        }
    }
Then if you want to capture all legal comments, you need to handle the legacy Rem keyword, and consider line continuations:
Rem this is a legal comment
' this _
is also _
a legal comment
In other words, \r\n in itself isn't enough to correctly identify all end-of-statement tokens.
A proper lexer+parser seems the only way to capture all comments.
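As a rough illustration of the Rem and line-continuation cases above, and not a substitute for a real lexer, the flag-walking approach can be stretched a little further. The sketch below is in Python, and every name in it (logical_lines, find_comment, comments) is invented for this example. It assumes a continuation is a physical line ending in a space followed by an underscore, and it only recognizes Rem at the start of a logical line, so a Rem after a : statement separator is missed.

```python
import re

# Assumption: Rem only counts at the start of a logical line, as a whole word.
REM = re.compile(r"\s*Rem(?=\s|$)", re.IGNORECASE)


def logical_lines(source):
    """Group physical lines that are joined by a trailing ' _' continuation."""
    buffer = []
    for physical in source.splitlines():
        buffer.append(physical)
        if not physical.rstrip().endswith(" _"):
            yield "\n".join(buffer)
            buffer = []
    if buffer:  # source ended in the middle of a continuation
        yield "\n".join(buffer)


def find_comment(line):
    """Return the comment text of one logical line, or None if it has none."""
    if REM.match(line):
        return line.lstrip()
    in_quotes = False
    for i, ch in enumerate(line):
        if ch == '"':
            in_quotes = not in_quotes  # escaped "" pairs toggle twice and cancel out
        elif ch == "'" and not in_quotes:
            return line[i:]
    return None


def comments(source):
    """Collect the comments of every logical line in source."""
    found = (find_comment(line) for line in logical_lines(source))
    return [c for c in found if c is not None]


source = (
    "Rem this is a legal comment\n"
    "' this _\n"
    "is also _\n"
    "a legal comment\n"
    "s = \"it's\" ' trailing\n"
    "Remainder = 5\n"
)
for comment in comments(source):
    print(repr(comment))
```

Even this small extension needs two passes and a special case, which is the point made above: the more of the language you want to cover, the closer you get to writing a lexer anyway.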