c++parsingbisonlexparser-generator

Choice of Parser Generator


OK, I understand this question may sound quite opinion-based, however since I have several specific criteria of choice, I think it would make a nice fit for SO. So, here I am...

I've worked with compiler/interpreter construction in the past quite a lot (mostly as a hobby obviously) and for some reason I stuck with Lex/Yacc (or Flex/Bison, I'm quite confused as to how they call them now... lol).

However, since I'm finding myself currently playing with yet another hobbyist interpreter project, I thought I should give a try to something different maybe in order to avoid what I don't like about Lex/Yacc.

So, namely :

OK, I hope that wasn't too verbose. I'm all ears! :-)


Solution

  • I've been building parser generators and parsers since 1969.

    Recursive descent, YACC and JavaCC are the typical answers you hear.

    These are your grandpa's parser generators, and suffer from limitations in the grammars they will accept. Invariably, (esp on Stack Overflow), some poor soul asks "how do I solve this shift/reduce" problem (for LR parser generators like YACC) or "how do I eliminate left recursion" (for recursive descent or LL parser generators like JavaCC). Worse, they can't handle grammars that really have syntactic ambiguity, as occurs in most complex languages.

    GLR (and GLL) parsers allow you to write context-free grammars ... and parse them, with no fuss or muss. This is a real productivity enhancement. There's a price: you can end up with ambiguous parses but there are ways to handle that. (see this discussion of C++ parsing troubles that neither YACC nor JavaCC can handle by themselves).

    Bison (widely available) has a GLR option; use it! Recent multilingual program manipulation tools seems to all use GLL or GLR. Our DMS Software Reengineering Toolkit uses GLR and parses C++ (full C++14 in MS and GNU variants!), Java, COBOL, and a raft of other complicated languages; GLR has been one of the best technical choices I've made in my career. Stratego uses GLR. I think RascalMPL uses GLL. Scott McPeak's Elkhound GLR parser generator is C++ based and generates, I'm pretty sure, C++ code (OP asked for a C++ based answer).

    Hot topics these days are PEG and ANTLR4. These are better that LL or LR parsers but still give one grief in trying to shape the grammar. (With PEG, you have to order the productions, assuming you can find such an order, to handle ambiguous rules with priorities. With ANTLR4, you still have specify lookaheads to resolve ambiguity; I don't know how it handles infinite lookahead). AFAIK, nobody has built practical C++ parsers with either of these technologies, so they aren't living up to their reputation.

    I think GLR and GLL are much, much better answers.