
JavaCC: how to define a token for one or more numbers


I recently started learning the JavaCC parser generator. I was asked to write a parser in which one token matches one or more numbers and another token matches two or more numbers. I came up with something like this:

TOKEN : {
<NUM1: <NUM> (<NUM>)* > //for one or more
<NUM2: (<NUM>)+> //for two or more
<NUM :(["0"-"9"])+>
}

 // and in the function
    void calc():
    {}
    {
     (
      (<NUM1>)+
     (<NUM2>)+
     )* <EOF>
    }

However, even if I pass text with no numbers in it, it is parsed successfully. What am I doing wrong?


Solution

  • The JavaCC syntax for lexical tokens lets you repeat an element by enclosing it in parentheses () and following it with one of:

    ? - zero or one time
    * - zero or more times
    + - one or more times
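
The same three postfix operators exist in java.util.regex, so their effect can be tried out without generating a parser. The following is a minimal sketch under that assumption; the class name RepetitionDemo and its members are made up for illustration and are not part of JavaCC.

```java
import java.util.regex.Pattern;

final class RepetitionDemo {
    // java.util.regex counterparts of the JavaCC forms
    // (["0"-"9"])?, (["0"-"9"])* and (["0"-"9"])+
    static final Pattern ZERO_OR_ONE  = Pattern.compile("[0-9]?");
    static final Pattern ZERO_OR_MORE = Pattern.compile("[0-9]*");
    static final Pattern ONE_OR_MORE  = Pattern.compile("[0-9]+");

    // True if the whole string is accepted by the pattern.
    static boolean accepts(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"", "7", "42"}) {
            System.out.printf("\"%s\": ?=%b *=%b +=%b%n", s,
                    accepts(ZERO_OR_ONE, s),
                    accepts(ZERO_OR_MORE, s),
                    accepts(ONE_OR_MORE, s));
        }
    }
}
```

Note that only + rejects the empty string, which is why a "one or more" token has to be written with + rather than *.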
    

    In your case, you need two tokens:

    TOKEN: 
    {
      <NUM2: ["0"-"9"] (["0"-"9"])+> // for two or more
    | <NUM1:           (["0"-"9"])+> // for one or more
    }
    

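    You read that as: NUM2 is one digit followed by one or more digits, that is, two or more digits in total; NUM1 is one or more digits.

    The lexical machinery in JavaCC consumes the input character stream one character at a time and attempts to recognize a token. The two automata are as follows: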

    [Figure: lexer automata for NUM1 and NUM2]

    The lexer advances through both automata simultaneously, one per token. When neither automaton can advance any further, the longest match found so far is recognized as the token. If more than one token type matches that same text, the one declared first is recognized. For this reason NUM2 is declared before NUM1.

    For the input 1, NUM2 is not recognized, because it needs more than one digit, so NUM1 is the only token type that matches. For the input 12, both token types match, but NUM2 is recognized because it is declared first. If you declare NUM1 first and NUM2 second, you will never receive a NUM2 token, because NUM1 always "wins" with its higher precedence.
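
The rule above (longest match, with ties going to the token declared first) can be imitated in plain Java. This is only a sketch built on java.util.regex, not the token manager that JavaCC generates; the class LongestMatchDemo and its methods are hypothetical names chosen for this illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LongestMatchDemo {
    // Builds the token kinds in declaration order (LinkedHashMap keeps insertion order).
    static Map<String, Pattern> kinds(boolean num2First) {
        Pattern num1 = Pattern.compile("[0-9]+");       // one or more digits
        Pattern num2 = Pattern.compile("[0-9][0-9]+");  // two or more digits
        Map<String, Pattern> result = new LinkedHashMap<>();
        if (num2First) {
            result.put("NUM2", num2);
            result.put("NUM1", num1);
        } else {
            result.put("NUM1", num1);
            result.put("NUM2", num2);
        }
        return result;
    }

    // Returns the kind recognized at the start of the input, or null if none matches.
    static String recognize(String input, Map<String, Pattern> kinds) {
        String best = null;
        int bestLength = 0;
        for (Map.Entry<String, Pattern> kind : kinds.entrySet()) {
            Matcher m = kind.getValue().matcher(input);
            // Only a strictly longer match replaces the current best,
            // so on a tie the kind declared earlier is kept.
            if (m.lookingAt() && m.end() > bestLength) {
                best = kind.getKey();
                bestLength = m.end();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(recognize("1", kinds(true)));
        System.out.println(recognize("12", kinds(true)));
        System.out.println(recognize("12", kinds(false)));
    }
}
```

Calling recognize("12", ...) with the two declaration orders shows the effect of ordering: the input and the patterns are the same, and only the order of the map entries decides which kind is reported.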

    To use them, you can have two parser functions like these:

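    // A number with two or more digits is always lexed as NUM2, so "one or more digits" must accept either kind.
    void match_one_to_many_numbers() : {} { (<NUM1> | <NUM2>) (" " (<NUM1> | <NUM2>))* <EOF> }
    void match_two_to_many_numbers() : {} { <NUM2> (" " <NUM2>)* <EOF> }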
    

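    You read that as: a number token, followed by zero or more occurrences of a space and another number token, and then the end of the input.

    Because both tokens accept an unbounded number of digits, you cannot have a sequence of these tokens without a non-digit delimiter between them: the lexer would read adjacent numbers as one longer token.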
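
To see why the delimiter matters, here is a small hand-written sketch that collects maximal runs of digits the way a longest-match lexer would. It is an illustration only: DelimiterDemo and digitRuns are hypothetical names, and unlike the grammar above it silently skips every non-digit character instead of accepting only a single space.

```java
import java.util.ArrayList;
import java.util.List;

final class DelimiterDemo {
    // Splits the input into maximal runs of the digits 0-9.
    static List<String> digitRuns(String input) {
        List<String> runs = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            if (isDigit(input.charAt(i))) {
                int start = i;
                // Keep consuming digits: a run only ends at a non-digit or at the end of input.
                while (i < input.length() && isDigit(input.charAt(i))) {
                    i++;
                }
                runs.add(input.substring(start, i));
            } else {
                i++; // simplification: any non-digit acts as a delimiter
            }
        }
        return runs;
    }

    static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    public static void main(String[] args) {
        System.out.println(digitRuns("123"));
        System.out.println(digitRuns("12 3"));
    }
}
```

Without the space, the digits of 12 and 3 end up in one run, so there is no way to tell afterwards where one number was meant to stop and the next to begin.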