Tags: parsing, grammar, lexer, jflex, cup

Parse tree generation with Java CUP


I am using CUP with JFlex to validate expression syntax. I have the basic functionality working: I can tell if an expression is valid or not.

The next step is to implement simple arithmetic operations, such as "add 1". For example, if my expression is "1 + a", the result should be "2 + a". I need access to the parse tree to do that, because simply finding a numeric term is not enough: adding 1 to "(1 + a) * b" should yield "(1 + a) * b + 1", not "(2 + a) * b".

Does anyone have a CUP example that generates a parse tree? I think I will be able to take it from there.

As an added bonus, is there a way to get a list of all tokens in an expression using JFlex? It seems like a typical use case, but I cannot figure out how to do it.

Edit: I found a promising clue on Stack Overflow: "Create abstract tree problem from parser"

Discussion of CUP and AST:

http://pages.cs.wisc.edu/~fischer/cs536.s08/lectures/Lecture16.4up.pdf

Specifically, this paragraph:

The Symbol returned by the parser is associated with the grammar’s start symbol and contains the AST for the whole source program

This does not help. How do I traverse the tree given a Symbol instance if the Symbol class has no navigation pointers to its children? In other words, it does not look or behave like a tree node:

package java_cup.runtime;
/**
 * Defines the Symbol class, which is used to represent all terminals
 * and nonterminals while parsing.  The lexer should pass CUP Symbols 
 * and CUP returns a Symbol.
 *
 * @version last updated: 7/3/96
 * @author  Frank Flannery
 */

/* ****************************************************************
  Class Symbol
  what the parser expects to receive from the lexer. 
  the token is identified as follows:
  sym:    the symbol type
  parse_state: the parse state.
  value:  is the lexical value of type Object
  left :  is the left position in the original input file
  right:  is the right position in the original input file
******************************************************************/

public class Symbol {

/*******************************
  Constructor for l,r values
 *******************************/

  public Symbol(int id, int l, int r, Object o) {
    this(id);
    left = l;
    right = r;
    value = o;
  }

/*******************************
  Constructor for no l,r values
********************************/

  public Symbol(int id, Object o) {
    this(id, -1, -1, o);
  }

/*****************************
  Constructor for no value
  ***************************/

  public Symbol(int id, int l, int r) {
    this(id, l, r, null);
  }

/***********************************
  Constructor for no value or l,r
***********************************/

  public Symbol(int sym_num) {
    this(sym_num, -1);
    left = -1;
    right = -1;
    value = null;
  }

/***********************************
  Constructor to give a start state
***********************************/
  Symbol(int sym_num, int state)
    {
      sym = sym_num;
      parse_state = state;
    }

/*. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .*/

  /** The symbol number of the terminal or non terminal being represented */
  public int sym;

  /*. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .*/

  /** The parse state to be recorded on the parse stack with this symbol.
   *  This field is for the convenience of the parser and shouldn't be 
   *  modified except by the parser. 
   */
  public int parse_state;
  /** This allows us to catch some errors caused by scanners recycling
   *  symbols.  For the use of the parser only. [CSA, 23-Jul-1999] */
  boolean used_by_parser = false;

/*******************************
  The data passed to parser
 *******************************/

  public int left, right;
  public Object value;

  /*****************************
    Printing this token out. (Override for pretty-print).
    ****************************/
  public String toString() { return "#"+sym; }
}

Solution

  • OK, I got it. Unfortunately, I cannot publish all my code here as-is, but I will outline the solution anyway. Please ask questions if something is not clear.

    The JFlex jar bundles its own copy of the CUP runtime Symbol class. Look here: JFlex.jar/java_cup.runtime/Symbol.class

    You will see a couple of constructors added:

    public Symbol(int id, Symbol left, Symbol right, Object o){
        this(id,left.left,right.right,o);
    }
    public Symbol(int id, Symbol left, Symbol right){
        this(id,left.left,right.right);
    }
    

    The key here is Object o, which becomes the value of the Symbol. It can hold any object you like, including a tree node that knows its own children.

    Define your own class to represent an AST node, and another one to represent a lexer token. Granted, you could use the same class for both, but I found it clearer to keep them separate. Both JFlex and CUP generate Java code, and it is easy to get your tokens and nodes mixed up.
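
The two classes are not shown in this answer, so here is a minimal sketch of what they might look like. The field names val and start and the constructor argument order are inferred from the grammar actions further down; isLeaf and everything else is an assumption for illustration, not the original code.

```java
// Hypothetical token class: the matched text plus its start column.
final class LexerToken {
    final String val;
    final int start;

    LexerToken(String val, int start) {
        this.val = val;
        this.start = start;
    }
}

// Hypothetical AST node: a label, a source position and up to two children.
final class Node {
    final String val;
    final int start;
    final Node left;   // null for leaves
    final Node right;  // null for leaves and for single-argument nodes

    // Leaf node, e.g. a number or a variable.
    Node(String val, int start) {
        this(val, null, null, start);
    }

    // Inner node; argument order follows the grammar actions: (val, left, right, start).
    Node(String val, Node left, Node right, int start) {
        this.val = val;
        this.left = left;
        this.right = right;
        this.start = start;
    }

    boolean isLeaf() {
        return left == null && right == null;
    }
}
```

Unlike Symbol, this Node carries explicit child references, which is what makes the result traversable as a tree.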

    Then, in the lexical rules section of your parser.flex, do something like this for each token (yycolumn is only available when the %column directive is set):

    {float_lit}        { return symbol(sym.NUMBER, createToken(yytext(), yycolumn)); }
    

    Your createToken could look something like this:

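    %{
        // Wraps the matched text and its start column in a LexerToken
        // and records it (addToken is a helper you define yourself).
        private LexerToken createToken(String val, int start) {
            LexerToken tk = new LexerToken(val, start);
            addToken(tk);
            return tk;
        }
    %}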
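
The addToken helper is not shown, and it is also what answers the bonus question about getting a list of all tokens. One plausible shape is a list that grows as the lexer matches tokens. The sketch below is written as a standalone class so that it compiles on its own; TokenLog and getTokens are hypothetical names. In a real JFlex spec the list field and the two methods would more likely sit next to createToken in the user-code block, with the imports in the first section of the spec.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical token class, same shape as used by createToken.
final class LexerToken {
    final String val;
    final int start;

    LexerToken(String val, int start) {
        this.val = val;
        this.start = start;
    }
}

// Stand-in for the members that would live in the lexer's user code.
final class TokenLog {
    private final List<LexerToken> tokens = new ArrayList<>();

    // Called once per matched token, in input order.
    void addToken(LexerToken tk) {
        tokens.add(tk);
    }

    // Read-only view of every token seen so far.
    List<LexerToken> getTokens() {
        return Collections.unmodifiableList(tokens);
    }
}
```

Once the input has been scanned, getTokens() returns the tokens in the order they appeared.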
    

    Now let's move on to parser.cup. Declare all your terminals to be of type LexerToken, and all your non-terminals to be of type Node. You should read the CUP manual, but as a quick refresher: a terminal is anything recognized by the lexer (e.g. numbers, variables, operators), and a non-terminal is a part of your grammar (e.g. expression, factor, term...).
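
As a sketch, the declarations could look like the fragment below. It lists only the symbol names that appear elsewhere in this answer; a full grammar would also declare its remaining operators and tokens, whose names are not shown here.

```
/* Terminals carry the LexerToken created by the lexer. */
terminal LexerToken NUMBER, TIMES, DIVIDE, LPAREN, RPAREN, UFUN;

/* Non-terminals carry the Node built by the grammar actions. */
non terminal Node expr, factor, term;
```

With these types declared, CUP generates the casts for you, so a label such as times or f in an action already has the right Java type.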

    Finally, this all comes together in the grammar definition. Consider the following example:

       factor    ::= factor:f TIMES:times term:t
                     {: RESULT = new Node(times.val, f, t, times.start); :}
                     |
                     factor:f DIVIDE:div term:t
                     {: RESULT = new Node(div.val, f, t, div.start); :}
                     |
                     term:t
                     {: RESULT = t; :}
                     ;
    

    The syntax factor:f binds the value of factor to the name f, which you can then use in the action code {: ... :} that follows. Remember, our terminals have values of type LexerToken, and our non-terminals have values of type Node.

    A term may be defined as follows:

       term  ::= LPAREN expr:e RPAREN
             /* e is already a Node; pass it through so its children are kept */
             {: RESULT = e; :}
             |
             NUMBER:n
             {: RESULT = new Node(n.val, n.start); :}
             ;
    

    Once the parser code is generated successfully, your parser.java will contain the part where the parent-child relationship between nodes is established. Here is the generated case for one of my productions:

      case 16: // term ::= UFUN LPAREN expr RPAREN 
        {
          Node RESULT =null;
        int ufleft = ((java_cup.runtime.Symbol)CUP$parser$stack.elementAt(CUP$parser$top-3)).left;
        int ufright = ((java_cup.runtime.Symbol)CUP$parser$stack.elementAt(CUP$parser$top-3)).right;
        LexerToken uf = (LexerToken)((java_cup.runtime.Symbol) CUP$parser$stack.elementAt(CUP$parser$top-3)).value;
        int eleft = ((java_cup.runtime.Symbol)CUP$parser$stack.elementAt(CUP$parser$top-1)).left;
        int eright = ((java_cup.runtime.Symbol)CUP$parser$stack.elementAt(CUP$parser$top-1)).right;
        Node e = (Node)((java_cup.runtime.Symbol) CUP$parser$stack.elementAt(CUP$parser$top-1)).value;
         RESULT = new Node(uf.val, e, null, uf.start); 
          CUP$parser$result = parser.getSymbolFactory().newSymbol("term",0, ((java_cup.runtime.Symbol)CUP$parser$stack.elementAt(CUP$parser$top-3)), ((java_cup.runtime.Symbol)CUP$parser$stack.peek()), RESULT);
        }
      return CUP$parser$result;
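
With the children linked like this, the root of the tree is the value of the Symbol returned by the parser, i.e. something like (Node) parser.parse().value. From there the "add 1" operation from the question can be sketched as below. The Node class is a compact stand-in for the real one. AddOne, show, and the restriction to integer literals are assumptions for illustration.

```java
// Compact hypothetical Node, mirroring the constructors used in the grammar actions.
final class Node {
    final String val;
    final Node left;
    final Node right;
    final int start;

    Node(String val, int start) { this(val, null, null, start); }

    Node(String val, Node left, Node right, int start) {
        this.val = val;
        this.left = left;
        this.right = right;
        this.start = start;
    }
}

final class AddOne {
    // True for a leaf whose label is a plain integer literal.
    static boolean isIntLiteral(Node n) {
        return n != null && n.left == null && n.right == null && n.val.matches("\\d+");
    }

    static Node increment(Node lit) {
        return new Node(Long.toString(Long.parseLong(lit.val) + 1), lit.start);
    }

    // Returns a new tree representing "root + 1"; the input tree is not modified.
    static Node addOne(Node root) {
        if (isIntLiteral(root)) {
            return increment(root);
        }
        if (root.val.equals("+") && root.right != null) {
            // A top-level sum with a literal operand: fold the 1 into that literal.
            if (isIntLiteral(root.left)) {
                return new Node("+", increment(root.left), root.right, root.start);
            }
            if (isIntLiteral(root.right)) {
                return new Node("+", root.left, increment(root.right), root.start);
            }
        }
        // Otherwise append "+ 1" at the top (-1 marks a node with no source position).
        return new Node("+", root, new Node("1", -1), -1);
    }

    // Fully parenthesized rendering, for inspection only.
    static String show(Node n) {
        if (n.left == null && n.right == null) return n.val;
        if (n.right == null) return n.val + "(" + show(n.left) + ")";
        return "(" + show(n.left) + " " + n.val + " " + show(n.right) + ")";
    }
}

final class AddOneDemo {
    public static void main(String[] args) {
        Node sum = new Node("+", new Node("1", 0), new Node("a", 4), 2);
        Node product = new Node("*", sum, new Node("b", 10), 8);
        System.out.println(AddOne.show(AddOne.addOne(sum)));
        System.out.println(AddOne.show(AddOne.addOne(product)));
    }
}
```

This version only looks at the direct operands of a top-level "+", so "1 + a + b" gets "+ 1" appended rather than its literal incremented. Walking the whole chain of additions would be a natural refinement.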
    

    I am sorry that I cannot publish a complete code example, but hopefully this will save someone a few hours of trial and error. Not having the complete code is also good because it won't render all those CS homework assignments useless.

    As a proof of life, here's a pretty-print of my sample AST.

    Input expression:

    T21 + 1A / log(max(F1004036, min(a1, a2))) * MIN(1B, 434) -LOG(xyz) - -3.5+10 -.1 + .3 * (1)
    

    Resulting AST:

    |--[+]
       |--[-]
       |  |--[+]
       |  |  |--[-]
       |  |  |  |--[-]
       |  |  |  |  |--[+]
       |  |  |  |  |  |--[T21]
       |  |  |  |  |  |--[*]
       |  |  |  |  |     |--[/]
       |  |  |  |  |     |  |--[1A]
       |  |  |  |  |     |  |--[LOG]
       |  |  |  |  |     |     |--[MAX]
       |  |  |  |  |     |        |--[F1004036]
       |  |  |  |  |     |        |--[MIN]
       |  |  |  |  |     |           |--[A1]
       |  |  |  |  |     |           |--[A2]
       |  |  |  |  |     |--[MIN]
       |  |  |  |  |        |--[1B]
       |  |  |  |  |        |--[434]
       |  |  |  |  |--[LOG]
       |  |  |  |     |--[XYZ]
       |  |  |  |--[-]
       |  |  |     |--[3.5]
       |  |  |--[10]
       |  |--[.1]
       |--[*]
          |--[.3]
          |--[1]
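
A recursive printer along the following lines produces output in the same shape as the tree above. It is a sketch under the same assumptions as before: Node is a compact hypothetical stand-in with val, left and right, and TreePrinter is an invented name, not the code that generated the listing.

```java
// Compact hypothetical Node with up to two children.
final class Node {
    final String val;
    final Node left;
    final Node right;

    Node(String val) { this(val, null, null); }

    Node(String val, Node left, Node right) {
        this.val = val;
        this.left = left;
        this.right = right;
    }
}

final class TreePrinter {
    static String render(Node root) {
        StringBuilder out = new StringBuilder();
        render(root, "", true, out);
        return out.toString();
    }

    // prefix is the indentation inherited from the ancestors; last tells whether
    // this node is the final child of its parent (the root counts as last).
    private static void render(Node n, String prefix, boolean last, StringBuilder out) {
        out.append(prefix).append("|--[").append(n.val).append("]\n");
        // Below a non-last child, keep a vertical bar so later siblings stay connected.
        String childPrefix = prefix + (last ? "   " : "|  ");
        if (n.left != null) render(n.left, childPrefix, n.right == null, out);
        if (n.right != null) render(n.right, childPrefix, true, out);
    }
}

final class TreePrinterDemo {
    public static void main(String[] args) {
        Node tree = new Node("+",
                new Node("*", new Node("a"), new Node("b")),
                new Node("c"));
        System.out.print(TreePrinter.render(tree));
    }
}
```

The same depth-first walk is the basis for any other traversal of the tree, such as evaluation or rewriting.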