compiler-construction antlr abstract-syntax-tree transpiler implicit-declaration

Transpiling/code generation - declaration of variables issue


I have recently been working with ANTLR and Java, and I built a simple grammar that parses the code below and generates an AST. I also wrote a built-in interpreter that executes the code, and it seems to work well.

Some notes on my toy language:

/* A sample program */
BEGIN
    j := 1;
    WHILE j <= 5 DO
        PRINT "ITERATION NO: "; PRINTLN j;
        sumA1 := 0;
        WHILE 1 = 1 DO 
            PRINT "Enter a number, 0 to quit: ";
            i := INPUT;
            IF i = 0 THEN
                BREAK;
            ENDIF
            sumA1 := ADD sumA1, i;
        ENDWHILE
        j := ADD j, 1;
        PRINT "The sum is: "; PRINTLN sumA1;
    ENDWHILE
    j := MINUS j;
    PRINTLN j;
END

I then added code generation functions to my AST classes to emit C, and I get this result (beautified):

#include <stdio.h>

#include <stdlib.h>

int main(int argc, char * argv[]) {
  double j;
  j = 1.00000;
  while (j <= 5.0) {
    printf("ITERATION NO: ");
    printf("%g\n", j);
    double sumA1;
    sumA1 = 0.00000;
    while (1.0 == 1.0) {
      printf("Enter a number, 0 to quit: ");
      double i;
      scanf("%lf", & i);
      if (i == 0.0) {
        break;
      }
      sumA1 = sumA1 + i;
    }
    j = j + 1.00000;
    printf("The sum is: ");
    printf("%g\n", sumA1);
  }
  j = -j;
  printf("%g\n", j);
}

During code generation, I first check whether the variable name is already in the HashMap that serves as my symbol table. For assignment and input statements, I emit the variable declaration just before the assignment, as you can see. For any other use of a variable that is not yet in the table, I throw an exception, because the variable is being used before it has been initialized.

All well and good. The above code works for this example, since in my source program I am not using any variable outside the scope in which it is declared.

But there is one issue. Some variables are first assigned inside a block (such as a WHILE body), so their C declarations are scoped to that block and cannot be used outside it. I need a way to collect all the variables used in my source program and declare them as globals in C (or at least at the top of the main() function). Declaring variables just before their first use will cause valid source programs to fail to compile in C whenever a variable is used outside the block in which it was first assigned.

I thought I could solve this by first resolving all the variables, declaring them at the start of the C program, and only then generating the code.

But if I update the symbol table (HashMap) before generating the code, I won't have a way to know if the variable is actually assigned before usage.

What is the best way to re-design this to ensure that:

This is the first time I have attempted something like this. Please point me to any possible solution.


Solution

  • In the general case, detecting use before assignment is impossible. Consider the following (not very good) C code:

    int sum;          /* uninitialised */
    for (i = 0; i < n; ++i) {
      if (check(i)) sum = 0;
      sum += val[i];  /* Is sum initialised here? */
      process(sum);
    }
    

    If check(i) is, say, i % 10 == 0, then sum will certainly be initialised. But if it is i % 10 == 1, then sum is used uninitialised in the first iteration. In general, whether sum is used uninitialised depends on the value of check(0). But there may be no way to know what that is. check() might be an external function. Or its return value might be dependent on input. Or it might be based on a difficult computation.

    That doesn't mean you shouldn't try to detect the problem. You could use symbolic execution, for example, to try to compute a conservative estimate of undefined use. You could throw an exception if you could prove undefined use, and issue a warning if you can't prove that all uses are defined. (Many compilers use a variant of this technique.) That might be an interesting exercise in control-flow analysis.

    But for a real-world solution, given that all variables are numeric, I'd suggest just automatically initialising all variables to 0, as part of the language semantics.
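
As a rough illustration of both suggestions, here is a minimal Java sketch that uses two passes over the AST. It is not your AST: the `Stmt` class and the helpers `assign`, `print`, `guarded`, `collect`, `declarations` and `check` are all hypothetical names, and expressions are reduced to the list of variables they read. The first pass gathers every variable name so the declarations can be emitted, initialised to 0, at the top of main(). The second pass is a deliberately conservative check that assumes a WHILE or IF body might not run at all, so it can only produce warnings, not proofs.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class HoistSketch {
    /** Hypothetical mini-AST: an expression is reduced to the names it reads. */
    static class Stmt {
        final String target;      // assigned variable, or null if none
        final List<String> reads; // variables read by the expression or condition
        final List<Stmt> body;    // non-null only for WHILE / IF (without ELSE)
        Stmt(String target, List<String> reads, List<Stmt> body) {
            this.target = target; this.reads = reads; this.body = body;
        }
    }
    static Stmt assign(String target, String... reads) {
        return new Stmt(target, Arrays.asList(reads), null);
    }
    static Stmt print(String... reads) {
        return new Stmt(null, Arrays.asList(reads), null);
    }
    static Stmt guarded(List<String> cond, Stmt... body) {
        return new Stmt(null, cond, Arrays.asList(body));
    }

    /** Pass 1: collect every variable name, in order of first appearance. */
    static void collect(List<Stmt> stmts, Set<String> names) {
        for (Stmt s : stmts) {
            names.addAll(s.reads);
            if (s.target != null) names.add(s.target);
            if (s.body != null) collect(s.body, names);
        }
    }

    /** Emits one zero-initialised C declaration per collected variable. */
    static String declarations(Set<String> names) {
        StringBuilder sb = new StringBuilder();
        for (String n : names) sb.append("  double ").append(n).append(" = 0;\n");
        return sb.toString();
    }

    /** Pass 2: `assigned` holds the names assigned on every path so far. */
    static void check(List<Stmt> stmts, Set<String> assigned, Set<String> warnings) {
        for (Stmt s : stmts) {
            for (String r : s.reads) {
                if (!assigned.contains(r)) warnings.add(r); // possibly unassigned
            }
            if (s.target != null) assigned.add(s.target);
            if (s.body != null) {
                // The body may run zero times, so it is checked against a copy
                // and whatever it assigns is forgotten afterwards.
                check(s.body, new LinkedHashSet<>(assigned), warnings);
            }
        }
    }

    /** j := 1; WHILE j ... (sumA1, i assigned inside) ... ; PRINTLN sumA1; PRINTLN j */
    static List<Stmt> sample() {
        return Arrays.asList(
            assign("j"),
            guarded(Arrays.asList("j"),
                assign("sumA1"),
                assign("i"),
                assign("sumA1", "sumA1", "i")),
            print("sumA1"),
            print("j"));
    }

    public static void main(String[] args) {
        Set<String> names = new LinkedHashSet<>();
        collect(sample(), names);
        System.out.print(declarations(names));

        Set<String> warnings = new LinkedHashSet<>();
        check(sample(), new LinkedHashSet<>(), warnings);
        System.out.println("possibly unassigned: " + warnings);
    }
}
```

Because the two passes are separate, hoisting the declarations no longer hides the information needed for the check. In the sample, `sumA1` is flagged only because the sketch does not try to prove that the loop runs at least once; with zero initialisation as part of the language semantics, that warning is harmless and the generated C still compiles.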