javasyntaxgrammarcup

How to define a syntax that uses multiple dashes for nested "if" instructions?


I'm trying to create a Syntax Analyzer in Java (using CUP) that could recognise this piece of code:

if ¿b? then
~ a = 2;
~ if ¿b && c? then
~ ~ a = 3;
else
~ a = 4;

My productions used by the "if" statement are these that follow:

Instr ::= ...
       | IF CONOP Exp:e CONCL THEN CondInstrList:l
       ...
       ;
...
CondInstrList ::= CondInstrList CondInstr
       | /*empty*/
       ;
...
CondInstr ::= CONTROLD Instr
       | CONTROLD CondInstr
       ;

where Instr stands for instruction/statement, CondInstrList stands for Conditional Instruction List and CONTROLD stands for Control Dash (~). (CONOP and CONCL mean Condition Open/Close)

The problem is that with that grammar, the generated AST is as follows:

if
|-condition b
|-condInstrListT
  |---asig a = 2
  |---if
      |---condition b and c
      |---condInstrListT 
      |   |---asig a = 2
      |---condInstrListF
          |---asig a = 4

and so, the "else" part is associated with the inner "if".

I just don't know how to write a grammar that respects the way I want my language to be.

Any help is appreciated.

I can give more detail if needed.


Solution

  • I don't think you can do what you have in mind through the grammar alone. But it is possible with a slightly different grammar and some help from your lexical analyzer.

    Here's what to do: rather than treat the ~ marks as individual grammar symbols, have the lexical analyzer turn sequences of ~ at the start of a line into INDENT and OUTDENT tokens that will work in your grammar the same way that { and } work in Java. You keep track of a "current indent level", which starts at zero. At the start of each line, count the ~ characters. For each ~ in excess of the current indent level, generate an INDENT token and increase the current indent level; for each ~ less than the current indent level, generate an OUTDENT token and decrease the current indent level.

    So your example text of

    if ¿b? then
    ~ a = 2;
    ~ if ¿b && c? then
    ~ ~ a = 3;
    else
    ~ a = 4;
    

    would be tokenized as:

    // Indent level = 0 and no ~, so no INDENT here
    [IF] [CONOP] [ID b] [CONCL] [THEN]
    // Indent level = 0, one ~, so one INDENT
    [INDENT]
        // Indent level = 1
        [ID a] [OP =] [CONST 2] [SEMICOLON]
        // Indent level = 1, one ~, so no INDENT here
        [IF] [CONOP] [ID b] [OP &&] [ID c] [CONCL] [THEN]
        // Indent level = 1, two ~, so one INDENT
        [INDENT]
            // Indent level = 2
            [ID a] [ASSIGN] [CONST 3] [SEMICOLON]
            // Indent level = 2, lines starts with no ~, two OUTDENTs
        [OUTDENT]
        // Indent level = 1
    [OUTDENT]
    //Indent level = 0
    [ELSE] // No ~ at start of this line, so no INDENT
    // Indent level = 0; one ~, so one INDENT
    [INDENT] 
        // Indent level = 1
        [ID a] [ASSIGN] [CONST 4] [SEMICOLON]
    // End-of-input.  Indent level = 1, so 1 OUTDENT
    [OUTDENT]
    // Done; indent level = 0;
    

    The INDENT and OUTDENT tokens would act in your grammar like left- and right-braces do in Java, so your grammar could look something like:

    Instr ::= ...
           | IF CONOP Exp:e CONCL THEN INDENT CondInstrList:l OUTDENT
           ...
           ;
    ...
    CondInstrList ::= CondInstrList Instr
           | /*empty*/
           ;
    ...
    

    The language Python does the same thing but with just white space instead of ~. You can download Python source here if you're interested. Look for the files Grammar\Grammar and Parser\tokenizer.c.