I'm implementing a compiler for a simple toy language in C. I have a working scanner and parser, and a reasonable background on the conceptual function/construction of an AST. My question is related to the specific way to represent an AST in C. I've come across three styles pretty frequently in different texts/resources online:
One struct per type of node.
This has a base node "class"(struct) that is the first field in all the child structs. The base node contains an enum that stores the type of node(constant, binary operator, assignment, etc). Members of the struct are accessed using a set of macros, with one set per struct. It looks something like this:
struct ast_node_base {
enum {CONSTANT, ADD, SUB, ASSIGNMENT} class;
};
struct ast_node_constant {
struct ast_node_base *base;
int value;
};
struct ast_node_add {
struct ast_node_base *base;
struct ast_node_base *left;
struct ast_node_base *right;
};
struct ast_node_assign {
struct ast_node_base *base;
struct ast_node_base *left;
struct ast_node_base *right;
};
#define CLASS(node) ((ast_node_base*)node)->class;
#define ADD_LEFT(node) ((ast_node_add*)node)->left;
#define ADD_RIGHT(node) ((ast_node_add*)node)->right;
#define ASSIGN_LEFT(node) ((ast_node_assign*)node)->left;
#define ASSIGN_RIGHT(node) ((ast_node_assign*)node)->right;
One struct per layout of node.
This appears to be mostly the same as the above layout, except instead of having ast_node_add and ast_node_assign it would have an ast_node_binary to represent both, because the layout of the two structs is the same and they only differ by the contents of base->class. The advantage to this seems to be a more uniform set of macros(LEFT(node) for all nodes with a left and right instead of one pair of macros per), but the disadvantage seems that the C type checking won't be as useful(there would be no way to detect an ast_node_assign where there should only be an ast_node_add, for example).
One struct total, with a union to hold different types of node data.
A better explanation of this than I can give can be found here. Using the types from the previous example it would look like:
struct ast_node {
enum { CONSTANT, ADD, SUB, ASSIGNMENT } class;
union {
int value;
struct {
struct ast_node* left;
struct ast_node* right;
} op;
};
};
I'm inclined to like the third option the most because it makes recursive traversal much easier(in that lots of pointer casting is avoided in favor of the union), but it also doesn't take advantage of C type checking. The first option seems the most dangerous in that it relies on pointers to structs being cast to access the member of any node(even different members of the same node requiring different cases to access(base vs. left)), but these casts are type checked so that might be moot. The second option to me seems like the worst of both worlds, although maybe I'm missing something.
Which of these three schemes are the best, and why? Is there a better fourth option I haven't come across yet? I'm assuming none of them are a "one size fits all" solution, so if it matters the language I'm implementing is a statically typed imperative language, almost a small subset of C.
A specific question I have about the third(union) layout. If I use only the value field, will there be empty space following the value to accommodate for the possibility of op being written to?
You can make any of these work.
I prefer the union layout, because then all nodes have "the same" layout.
[You may find it useful to have a "child sublist" option, e.g., and arbitarily big, dynamic array of children, instead of having left- or right-leaning lists.]
You are going to find that this issue isn't the one that makes building your compiler hard. Rather, it is having symbol tables, performing various kinds of analyses, choosing a machine-level IR, building a code generator, and doing code optimizations. Then you're going to encounter real users and you'll discover what you really did wrong :-}
I'd pick one and run with it, so that you have a chance to get near the other issues.