pythonparsingpegparsimonious

Why does this Python parser using parsimonious never find the ifdef blocks


This is a simplified version of a larger parser. I has been driving me slightly mad as I can't seem to get it to work consistently. The larger version works but has so many edge cases it does not work well enough, I'm finding myself having to teach python the entire syntax and all I want is the #ifdef pre-processor directives.

I tried to simplify it and got to this (below) but it never finds the ifdef / endif. Why?

from parsimonious.grammar import Grammar
CPP = Grammar(r"""
    file  = (code_block / nl / ws)+
    label = ~"[A-z_]+\w*"                         # Alpha or _ followed by alphanumeric or _
    
    ifdef = "#ifdef" ws+ label nl
    else  = "#else" nl
    endif = "#endif" nl
    
    line = ~r".+" nl
    lines = line+
    
    ws    = ~r"[\s\t]+"                           # One or more whitespace
    nl    = ~r"[\s\t]*[\n\r]+"                    # Zero or more whitespace followed by a newline
    
    if_block = ifdef code_block* endif
    if_else_block = ifdef code_block* else code_block* endif
    
    code_block = (if_block / if_else_block / lines)
    """)

code = """
int a;

#ifdef test1

a += 2;

#ifdef test2
a--;
#endif

#else

a += 20;

#endif

"""

res = CPP.parse(code)
print(res)

It finds many 'lines' but never a 'if' block. (note 'code' is not suppose to do anything useful, it's just some sample text)


Solution

  • PEG parsing is essentially greedy, as parsimonious notes in its documentation. (This is not parsimonious-specific. You'll hit the same issue with any PEG parser.)

    things*
       Zero or more things. This is greedy, always consuming as many repetitions as it can.

    So you have a pattern for code_block:

    code_block = (if_block / if_else_block / lines)
    

    Suppose the parser is trying to match that at some point in the input. If the corresponding text starts with an if_block or if_else_block (more about this later), then that's fine. But suppose it starts with something else, like int a;? In that case, the parser is going to try to match lines, which is to say line*, which is to say as many repetitions of line as possible, as per the documentation above.

    Now, line is just a line: any arbitrary sequence of characters terminated at the first newline. So line will happily match #ifdef test1 or #endif or whatever.

    In other words, if a match for code_block is attempted at some point in the input and the first thing in the input is not an if_block or an if_else_block, then code_block will greedily match the rest of the input.

    Now, let's go back to matching if_block (and note that the same comment will apply to matching if_else_block). The rule for if_block is:

    if_block = ifdef code_block endif
    

    ifdef is a line starting with #ifdef [Note 1]. The next thing is a code_block, which as we just saw will greedily match to the end of the input (unless it happens to immediately match a nested #ifdef). In other words, code_block does not care that you would like it to stop when it reaches an #endif; it knows nothing about #endif. It's a greedy little monster chomping its way across your input. Even if #endif is the next line, it will get absorbed by lines, along with the rest of the input.

    Since #endif can never be recognised, if_block can never match and the parser will have to fall back to the next possibility, if_else_block. But that can never match either, so it will have to fall back to the greedy monster.

    In short, this grammatical model cannot work, and that's not the fault of parsimonious. If you stick to this model, it doesn't matter which PEG parser you use, you will end up with the same problem.

    From the above analysis, it seems clear that any repeated match for lines needs to be aware of preprocessor directives, in order to not match them [Note 2]. That can be done, but it makes things a bit more complicated because it forces us to categorise the possible inputs, and not just rely on ordered alternatives [Note 3].

    So the key is that there are two kinds of input line: preprocessor directives (at least, the ones we're interested in) and everything else. (This is a radical simplification, because "everything else" includes comments and quoted literals, which need to be handled separately. But we'll ignore that for now.) It's easy to see how to recognize a preprocessor directive, so let's focus on recognising a line which is not a preprocessor directive. Fortunately, parsimonious (like many PEG parser generators) has a negative lookahead operator, which we can use to our benefit here:

    line = ws? !("#" ws? ("ifdef" / "else" / "endif")) ~".*" nl
    

    (Below, there's a correction to the definitions of ws and nl)

    There's no need for lines, because of another issue I didn't note above. In the definition

    code_block = (if_block / if_else_block / lines)
    

    a code_block can be any number of lines but only exactly one preprocessor block. That's not right: a code_block can be any mix of any number of lines and nested blocks. The correct definition would be:

    code_block = (if_block / if_else_block / line)*
    

    It seems to me simpler to use a conditional in order to avoid having two different preprocessor block definitions. After all, both of them start with an #ifdef, and there's no point trying to scan that again after a fallback, even if parsimonious manages to do the rescan efficiently by packratting. So I propose the following:

    ifdef = ws? "#" ws? "ifdef" ws label nl
    else  = ws? "#" ws? "else" nl
    endif = ws? "#" ws? "endif" nl
    
    code_block = (if_block / line)*
    if_block = ifdef code_block (else code_block)? endif
    

    (I removed the redundant + from ifdef because ws already includes a repetition operator.)

    At this point, a small digression to fix your regex patterns. The goal here is that ws matches any horizontal whitespace (excluding newline characters), while nl matches possible horizontal whitespace followed by a sequence of newline characters. But your ws pattern include \s, which in Python regexes matches any whitespace including newlines, which is a bit too generous. And while we're fixing regexes, [A-z] is not the same as [A-Za-z] because a does not immediately follow Z; they are separated by several punctuation characters (including _, as it happens, but also [] among others). The following isn't perfect either, but it's a start:

    label = ~"[A-Za-z_]\w*"
    ws    = ~"[ \t]+"
    nl    = ~"\s*[\n\r]+"
    

    So, to put all that together:

    file  = code_block
    
    label = ~"[A-Za-z_]\w*"
    ws    = ~"[ \t]+"
    nl    = ~"\s*[\n\r]+"
    
    ifdef = ws? "#" ws? "ifdef" ws label nl
    else  = ws? "#" ws? "else" nl
    endif = ws? "#" ws? "endif" nl
    
    line  = ws? !("#" ws? ("ifdef" / "else" / "endif")) ~".*" nl
    
    code_block = (if_block / line)*
    if_block = ifdef code_block (else code_block)? endif
    

    That seems to handle your example code, at least.

    Notes

    1. This isn't really correct, because in C a preprocessor directive can be preceded with whitespace, and there is no requirement for the # to be immediately followed by the directive name. But that's not the issue here.

    2. Warning: opinion follows. Contrary to the oft-repeated advertisement that PEG parsers "simplify the grammar" by not forcing a division into lexical analysis and syntax, it seems to me that in many cases (such as this one), a systematic separation of concerns would actually be the simplest.

    3. There is a place for ordered alternatives. They work well in lexical analysis, for example. But that's well outside the scope of this answer.