antlr4 antlr3

How to convert a lexer rule with a semantic predicate from ANTLR3 to ANTLR4?


I'm trying to convert the ActionSplitter.g grammar to ANTLR4 and came across a problem wrt. semantic predicates. The grammar has rules like this:

QUALIFIED_ATTR
    :   '$' x=ID '.' y=ID {input.LA(1)!='('}? {delegate.qualifiedAttr($text, $x, $y);}
    ;

ATTR
    :   '$' x=ID {delegate.attr($text, $x);}
    ;

fragment
ID  :   ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
    ;

I know how to handle the actions, but I can't find a way to reproduce the semantic predicate in QUALIFIED_ATTR. In ANTLR3, when the predicate fails, the entire rule is rejected and the lexer falls back to ATTR, which matches only the '$' and the first ID.

In ANTLR4 this doesn't happen. The predicate essentially does nothing, since the ID rule doesn't match an opening parenthesis. I tried throwing an exception instead:

QUALIFIED_ATTR:
    '$' ID '.' ID {
        if (this.inputStream.LA(1) == 0x28 /* '(' */) {
          throw new Error("Qualified attribute cannot be followed by '('");
        }
      };

But that exception is of course not caught by the lexer; it bubbles up to the app code and stops parsing altogether. So I think I need to write these two rules differently, but I haven't come up with a way to do it.


Solution

  • You could move the predicate to the front of the rule and have it check whether the text ahead is a qualified attribute followed by an opening parenthesis. A quick demo:

    lexer grammar PredicateTestLexer;
    
    @members {
      private boolean ahead(String pattern) {
        StringBuilder builder = new StringBuilder();
    
        // Collect chars (including the terminating one) until we hit EOF
        // or a char that cannot be part of a QUALIFIED_ATTR, such as '('
        for (int steps = 1; ; steps++) {
          int nextChar = _input.LA(steps);
          builder.append((char)nextChar);
          if (nextChar == EOF || !String.valueOf((char)nextChar).matches("[$\\w.]")) {
            break;
          }
        }
    
        return builder.toString().matches(pattern);
      }
    }
    
    QUALIFIED_ATTR
     : {!ahead("\\$\\w+\\.\\w+\\(")}? '$' ID '.' ID
     ;
    
    ATTR
     : '$' ID
     ;
    
    OPAR   : '(';
    CPAR   : ')';
    DOT    : '.';
    ID     : [a-zA-Z_] [a-zA-Z_0-9]*;
    SPACES : [ \t\r\n\f]+ -> skip;
    

    This tokenizes the input $id.getText $blah.blub() as follows:

    7 tokens:
      1    QUALIFIED_ATTR                 '$id.getText'
      2    ATTR                           '$blah'
      3    DOT                            '.'
      4    ID                             'blub'
      5    OPAR                           '('
      6    CPAR                           ')'
      7    EOF                            '<EOF>'
    

    EDIT

    I've updated the check inside ahead(...) to stop looping once we encounter a char that cannot be in a QUALIFIED_ATTR. It is still an ugly method, but I don't see any other way to support this in v4...
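
To see what the ahead(...) check decides for a given input without generating a lexer, the same scan can be tried on a plain string. This is only a sketch: the class LookaheadDemo, the helper isAttrChar and the explicit start offset are hypothetical names made up for illustration. They stand in for _input.LA(...) and are not part of the ANTLR runtime.

```java
final class LookaheadDemo {

    // Hypothetical helper: chars that may occur inside '$' ID '.' ID
    private static boolean isAttrChar(char c) {
        return c == '$' || c == '.' || c == '_'
            || (c >= 'a' && c <= 'z')
            || (c >= 'A' && c <= 'Z')
            || (c >= '0' && c <= '9');
    }

    // Same idea as ahead(...) in the grammar, but reading from a String at
    // offset 'start' instead of from the lexer's input stream (an assumption
    // made so the logic can run standalone).
    static boolean ahead(String input, int start, String pattern) {
        StringBuilder builder = new StringBuilder();
        for (int i = start; i < input.length(); i++) {
            char c = input.charAt(i);
            // Append first, so a terminating '(' is part of the matched text
            builder.append(c);
            if (!isAttrChar(c)) {
                break;
            }
        }
        return builder.toString().matches(pattern);
    }

    public static void main(String[] args) {
        String pattern = "\\$\\w+\\.\\w+\\(";
        String input = "$id.getText $blah.blub()";

        // "$id.getText " is not followed by '(': QUALIFIED_ATTR stays enabled
        System.out.println(ahead(input, 0, pattern));   // prints false

        // "$blah.blub(" is followed by '(': QUALIFIED_ATTR is disabled
        System.out.println(ahead(input, 12, pattern));  // prints true
    }
}
```

In the grammar the result is negated ({!ahead(...)}?), so a true here means QUALIFIED_ATTR is switched off and ATTR matches just $blah, as in the token list above.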