parsingantlrantlr4context-free-grammar

How to enforce using the same production for non-terminal symbol in a rule?


Say I have an ANTLR grammar:

program = word (' ' word)*
  ;

word = 'dog' | 'cat' | 'bird'
  ;

As I understand, it will match any sequence of words above, e.g. "dog dog cat", "dog cat bird cat", etc. But what if I want to match only repetition of the same value, e.g. "dog dog", "cat cat cat cat", "bird bird bird", etc. How can the grammar above be modified to do this?

In other words, I want repeated occurrences of non-terminal symbol "word" in the rule for "program" to always match the same production rule instead of any of the number of rules specified for "word", but without having to explicitly list each of them in the rule for "program" (e.g. if there is large number of alternatives for "word"), meaning I want to avoid something like this:

program:
    'dog' (' ' 'dog')*
  | 'cat' (' ' 'cat')*
  | 'bird' (' ' 'bird')*
  ...
  ;

I think that in regular expressions this is achieved using backreferences (e.g. "\1"), is there an equivalent for this for ANTLR grammars, or some other way to do it?


Solution

  • As mentioned by kaby76: handling this in a parser rule would mean introducing a predicate, which would mean adding target specific code to your grammar (which is usually not advised). That could look like this (using the Java target):

    grammar P;
    
    
    @parser::members {
      boolean sameTokenAhead() {
        Token previous = _input.LT(-1);
        Token next = _input.LT(1);
        return previous.getType() == next.getType();
      }
    }
    
    parse
     : repeated_words* EOF
     ;
    
    repeated_words
     : word ( {sameTokenAhead()}? word )*
     ;
    
    word
     : DOG
     | CAT
     | BIRD
     ;
    

    Note that spaces are discarded by the lexer in the example above. Parsing bird dog dog dog cat cat would result in the following parse tree:

    '- parse
       |- repeated_words
       |  '- word
       |     '- 'bird' (BIRD)
       |- repeated_words
       |  |- word
       |  |  '- 'dog' (DOG)
       |  |- word
       |  |  '- 'dog' (DOG)
       |  '- word
       |     '- 'dog' (DOG)
       |- repeated_words
       |  |- word
       |  |  '- 'cat' (CAT)
       |  '- word
       |     '- 'cat' (CAT)
       '- '<EOF>'
    

    Another possibility would be to create a single token from these repeated words:

    DOG
     : 'dog' (' ' 'dog')*
     ;
    

    Or just match one or more words as you mentioned in your question:

    program
     : word (' ' word)*
     ;
    
    word
     : 'dog' | 'cat' | 'bird'
     ;
    

    and then after parsing, check in a listener if the words are the same and potentially throw an error if they do not.