I'm trying to create a Syntax Analyzer in Java (using CUP) that could recognise this piece of code:
if ¿b? then
~ a = 2;
~ if ¿b && c? then
~ ~ a = 3;
else
~ a = 4;
My productions used by the "if" statement are these that follow:
Instr ::= ...
| IF CONOP Exp:e CONCL THEN CondInstrList:l
...
;
...
CondInstrList ::= CondInstrList CondInstr
| /*empty*/
;
...
CondInstr ::= CONTROLD Instr
| CONTROLD CondInstr
;
where Instr stands for instruction/statement, CondInstrList stands for Conditional Instruction List and CONTROLD stands for Control Dash (~). (CONOP and CONCL mean Condition Open/Close)
The problem is that with that grammar, the generated AST is as follows:
if
|-condition b
|-condInstrListT
|---asig a = 2
|---if
|---condition b and c
|---condInstrListT
| |---asig a = 2
|---condInstrListF
|---asig a = 4
and so, the "else" part is associated with the inner "if".
I just don't know how to write a grammar that respects the way I want my language to be.
Any help is appreciated.
I can give more detail if needed.
I don't think you can do what you have in mind through the grammar alone. But it is possible with a slightly different grammar and some help from your lexical analyzer.
Here's what to do: rather than treat the ~ marks as individual grammar symbols, have the lexical analyzer turn sequences of ~ at the start of a line into INDENT and OUTDENT tokens that will work in your grammar the same way that { and } work in Java. You keep track of a "current indent level", which starts at zero. At the start of each line, count the ~ characters. For each ~ in excess of the current indent level, generate an INDENT token and increase the current indent level; for each ~ less than the current indent level, generate an OUTDENT token and decrease the current indent level.
So your example text of
if ¿b? then
~ a = 2;
~ if ¿b && c? then
~ ~ a = 3;
else
~ a = 4;
would be tokenized as:
// Indent level = 0 and no ~, so no INDENT here
[IF] [CONOP] [ID b] [CONCL] [THEN]
// Indent level = 0, one ~, so one INDENT
[INDENT]
// Indent level = 1
[ID a] [OP =] [CONST 2] [SEMICOLON]
// Indent level = 1, one ~, so no INDENT here
[IF] [CONOP] [ID b] [OP &&] [ID c] [CONCL] [THEN]
// Indent level = 1, two ~, so one INDENT
[INDENT]
// Indent level = 2
[ID a] [ASSIGN] [CONST 3] [SEMICOLON]
// Indent level = 2, lines starts with no ~, two OUTDENTs
[OUTDENT]
// Indent level = 1
[OUTDENT]
//Indent level = 0
[ELSE] // No ~ at start of this line, so no INDENT
// Indent level = 0; one ~, so one INDENT
[INDENT]
// Indent level = 1
[ID a] [ASSIGN] [CONST 4] [SEMICOLON]
// End-of-input. Indent level = 1, so 1 OUTDENT
[OUTDENT]
// Done; indent level = 0;
The INDENT and OUTDENT tokens would act in your grammar like left- and right-braces do in Java, so your grammar could look something like:
Instr ::= ...
| IF CONOP Exp:e CONCL THEN INDENT CondInstrList:l OUTDENT
...
;
...
CondInstrList ::= CondInstrList Instr
| /*empty*/
;
...
The language Python does the same thing but with just white space instead of ~. You can download Python source here if you're interested. Look for the files Grammar\Grammar
and Parser\tokenizer.c
.