python grammar indentation jflex cup

Parsing blocks as Python


I am writing a lexer + parser in JFlex + CUP, and I wanted to have Python-like syntax regarding blocks; that is, indentation marks the block level.

I am unsure how to tackle this, and whether it should be done at the lexical or syntactic level.

My current approach solves the issue at the lexical level: newlines are parsed as instruction separators, and when one is processed I move the lexer into a special state. That state checks how many characters appear at the start of the new line, remembers the column where the last line started, and accordingly emits an open-block or close-block token.

However, I am running into all sorts of trouble. For example:

  1. JFlex cannot match empty strings, so my instructions need to have at least one blank after every newline.
  2. I cannot close two blocks at the same time with this approach.

Is my approach correct? Should I be doing things differently?


Solution

  • Your approach of handling indents in the lexer rather than the parser is correct. Well, it’s doable either way, but this is usually the easier way, and it’s the way Python itself (or at least CPython and PyPy) does it.

    I don’t know much about JFlex, and you haven’t given us any code to work with, but I can explain in general terms.

    For your first problem: you're already putting the lexer into a special state after each newline, so "grab 0 or more spaces" should be doable by escaping the normal flow of things and just running a regex against the start of the line.
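    In Python terms (JFlex rules would look different, but the idea carries over), the trick is just an anchored pattern that is allowed to match the empty string, so it succeeds even on a line with no indentation. The function name here is made up for illustration:

    ```python
    import re

    def leading_spaces(line):
        # " *" matches zero or more spaces, so this never fails to match,
        # even when the line starts with a non-space character.
        return re.match(r" *", line).group()

    leading_spaces("    x = 1")  # "    " (4 spaces)
    leading_spaces("x = 1")      # "" (empty match, not an error)
    ```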

    For your second problem, the simplest solution (and the one Python uses) is to keep a stack of indents. I'll demonstrate something a bit simpler than what Python does.

    First:

    indents = [0]
    

    After each newline, grab a run of 0 or more spaces as spaces. Then:

    if len(spaces) == indents[-1]:
        pass                        # same level: no block change
    elif len(spaces) > indents[-1]:
        indents.append(len(spaces))
        emit(INDENT_TOKEN)          # one new block opens
    else:
        while len(spaces) != indents[-1]:
            indents.pop()
            emit(DEDENT_TOKEN)      # may close several blocks at once
    

    Now your parser just sees INDENT_TOKEN and DEDENT_TOKEN, which are no different from, say, OPEN_BRACE_TOKEN and CLOSE_BRACE_TOKEN in a C-like language.
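    Putting the stack scheme together, here is a minimal self-contained sketch, assuming input already split into lines; the function name and token names are made up for illustration, and it skips the error handling discussed below:

    ```python
    def tokenize_indents(lines):
        # Yields (kind, text) pairs: INDENT/DEDENT for block changes,
        # LINE for the line's remaining content.
        indents = [0]
        for line in lines:
            stripped = line.lstrip(" ")
            if not stripped:
                continue            # blank line: indentation is not significant
            depth = len(line) - len(stripped)
            if depth > indents[-1]:
                indents.append(depth)
                yield ("INDENT", stripped)
            else:
                while depth != indents[-1]:
                    indents.pop()   # note: no error check if depth matches nothing
                    yield ("DEDENT", stripped)
            yield ("LINE", stripped)
        # End of input closes every block still open.
        while len(indents) > 1:
            indents.pop()
            yield ("DEDENT", "")
    ```

    So for input like `a:` / `    b:` / `        c` / `d`, the parser sees a LINE, then INDENT, LINE, INDENT, LINE, then two DEDENTs in a row before the final LINE — which is exactly the multi-block close that was hard to express as a single lexer action.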

    Or you may want better error handling — raise some kind of tokenizer error rather than an implicit IndexError; maybe use < instead of != so you can detect that you've gone too far instead of exhausting the stack (for better error recovery, if you want to continue emitting further errors instead of bailing at the first one); etc.
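    That < variant might look like this sketch — the helper name and error message are illustrative, not anything standard:

    ```python
    def pop_dedents(indents, depth, emit):
        # Pop only while we are strictly shallower than the top of the
        # stack; if we stop at a depth that matches no enclosing block,
        # report a real error instead of letting pop() blow up later.
        while depth < indents[-1]:
            indents.pop()
            emit("DEDENT")
        if depth != indents[-1]:
            raise SyntaxError("unindent does not match any outer block")
    ```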

    For real-life example code (with error handling, and tabs as well as spaces, and backslash newline escaping, and handling non-syntactic indentation inside of parenthesized expressions, etc.), see the tokenize docs and source in the stdlib.