This is a simplified version of a larger parser. I has been driving me slightly mad as I can't seem to get it to work consistently. The larger version works but has so many edge cases it does not work well enough, I'm finding myself having to teach python the entire syntax and all I want is the #ifdef pre-processor directives.
I tried to simplify it and got to this (below) but it never finds the ifdef / endif. Why?
from parsimonious.grammar import Grammar
CPP = Grammar(r"""
file = (code_block / nl / ws)+
label = ~"[A-z_]+\w*" # Alpha or _ followed by alphanumeric or _
ifdef = "#ifdef" ws+ label nl
else = "#else" nl
endif = "#endif" nl
line = ~r".+" nl
lines = line+
ws = ~r"[\s\t]+" # One or more whitespace
nl = ~r"[\s\t]*[\n\r]+" # Zero or more whitespace followed by a newline
if_block = ifdef code_block* endif
if_else_block = ifdef code_block* else code_block* endif
code_block = (if_block / if_else_block / lines)
""")
code = """
int a;
#ifdef test1
a += 2;
#ifdef test2
a--;
#endif
#else
a += 20;
#endif
"""
res = CPP.parse(code)
print(res)
It finds many 'lines' but never a 'if' block. (note 'code' is not suppose to do anything useful, it's just some sample text)
PEG parsing is essentially greedy, as parsimonious notes in its documentation. (This is not parsimonious-specific. You'll hit the same issue with any PEG parser.)
things*
Zero or morething
s. This is greedy, always consuming as many repetitions as it can.
So you have a pattern for code_block
:
code_block = (if_block / if_else_block / lines)
Suppose the parser is trying to match that at some point in the input. If the corresponding text starts with an if_block
or if_else_block
(more about this later), then that's fine. But suppose it starts with something else, like int a;
? In that case, the parser is going to try to match lines
, which is to say line*
, which is to say as many repetitions of line
as possible, as per the documentation above.
Now, line
is just a line: any arbitrary sequence of characters terminated at the first newline. So line
will happily match #ifdef test1
or #endif
or whatever.
In other words, if a match for code_block
is attempted at some point in the input and the first thing in the input is not an if_block
or an if_else_block
, then code_block
will greedily match the rest of the input.
Now, let's go back to matching if_block
(and note that the same comment will apply to matching if_else_block
). The rule for if_block
is:
if_block = ifdef code_block endif
ifdef
is a line starting with #ifdef
[Note 1]. The next thing is a code_block
, which as we just saw will greedily match to the end of the input (unless it happens to immediately match a nested #ifdef
). In other words, code_block
does not care that you would like it to stop when it reaches an #endif
; it knows nothing about #endif
. It's a greedy little monster chomping its way across your input. Even if #endif
is the next line, it will get absorbed by lines
, along with the rest of the input.
Since #endif
can never be recognised, if_block
can never match and the parser will have to fall back to the next possibility, if_else_block
. But that can never match either, so it will have to fall back to the greedy monster.
In short, this grammatical model cannot work, and that's not the fault of parsimonious. If you stick to this model, it doesn't matter which PEG parser you use, you will end up with the same problem.
From the above analysis, it seems clear that any repeated match for lines needs to be aware of preprocessor directives, in order to not match them [Note 2]. That can be done, but it makes things a bit more complicated because it forces us to categorise the possible inputs, and not just rely on ordered alternatives [Note 3].
So the key is that there are two kinds of input line: preprocessor directives (at least, the ones we're interested in) and everything else. (This is a radical simplification, because "everything else" includes comments and quoted literals, which need to be handled separately. But we'll ignore that for now.) It's easy to see how to recognize a preprocessor directive, so let's focus on recognising a line which is not a preprocessor directive. Fortunately, parsimonious (like many PEG parser generators) has a negative lookahead operator, which we can use to our benefit here:
line = ws? !("#" ws? ("ifdef" / "else" / "endif")) ~".*" nl
(Below, there's a correction to the definitions of ws
and nl
)
There's no need for lines
, because of another issue I didn't note above. In the definition
code_block = (if_block / if_else_block / lines)
a code_block
can be any number of lines but only exactly one preprocessor block. That's not right: a code_block
can be any mix of any number of lines and nested blocks. The correct definition would be:
code_block = (if_block / if_else_block / line)*
It seems to me simpler to use a conditional in order to avoid having two different preprocessor block definitions. After all, both of them start with an #ifdef
, and there's no point trying to scan that again after a fallback, even if parsimonious manages to do the rescan efficiently by packratting. So I propose the following:
ifdef = ws? "#" ws? "ifdef" ws label nl
else = ws? "#" ws? "else" nl
endif = ws? "#" ws? "endif" nl
code_block = (if_block / line)*
if_block = ifdef code_block (else code_block)? endif
(I removed the redundant +
from ifdef
because ws
already includes a repetition operator.)
At this point, a small digression to fix your regex patterns. The goal here is that ws
matches any horizontal whitespace (excluding newline characters), while nl
matches possible horizontal whitespace followed by a sequence of newline characters. But your ws
pattern include \s
, which in Python regexes matches any whitespace including newlines, which is a bit too generous. And while we're fixing regexes, [A-z]
is not the same as [A-Za-z]
because a
does not immediately follow Z
; they are separated by several punctuation characters (including _
, as it happens, but also []
among others). The following isn't perfect either, but it's a start:
label = ~"[A-Za-z_]\w*"
ws = ~"[ \t]+"
nl = ~"\s*[\n\r]+"
So, to put all that together:
file = code_block
label = ~"[A-Za-z_]\w*"
ws = ~"[ \t]+"
nl = ~"\s*[\n\r]+"
ifdef = ws? "#" ws? "ifdef" ws label nl
else = ws? "#" ws? "else" nl
endif = ws? "#" ws? "endif" nl
line = ws? !("#" ws? ("ifdef" / "else" / "endif")) ~".*" nl
code_block = (if_block / line)*
if_block = ifdef code_block (else code_block)? endif
That seems to handle your example code, at least.
This isn't really correct, because in C a preprocessor directive can be preceded with whitespace, and there is no requirement for the #
to be immediately followed by the directive name. But that's not the issue here.
Warning: opinion follows. Contrary to the oft-repeated advertisement that PEG parsers "simplify the grammar" by not forcing a division into lexical analysis and syntax, it seems to me that in many cases (such as this one), a systematic separation of concerns would actually be the simplest.
There is a place for ordered alternatives. They work well in lexical analysis, for example. But that's well outside the scope of this answer.