I have a simple EBNF:
<program> ::= <function>
<function> ::= int <id> ( ) { <statement> }
<statement> ::= return <expr> ;
<expr> ::= <int>
I am writing pure C and targeting Linux (for now). The answer doesn't have to be written in C.
I have a Lexer
that simply indexes a source file and creates a linked list of token
structs. The token
only carries a char*
and size_t
as well as referencing the next token. That's all it has.
struct Token {
char *ndx;
size_t length;
struct Token *next;
}
The Lexer
indexes the source with only two basic conditions which it follows to discern - or differentiate - one token from another:
The source file I'm starting with looks like:
int main()
{
return 2;
}
I am trying to figure out how I get from the EBNF to a ProgramExpr
with the members - a FunctionExpr
and a StatementExpr
. Since I have the tokens representing the source file, the parser needs to resolve the tokens - i.e., discover what they are. Based on the production rules, the parser will build an Abstract Syntax Tree (what I also call ExpressionNode
.
How do I get from the EBNF to an Expression Node
?
I am thinking I will have a struct something like:
struct ExprNode {
enum ExprType {
PROG,
FUNC,
IDNT,
STMT,
EXPR,
INT_LIT
} type;
char *term;
int pos;
int line;
struct ExprNode* left;
struct ExprNode* right;
};
What is the procedural order of parsing? Since my EBNF leads off with <program>
, is my first parsing function looking for a program - i.e., a <function>
? And if we don't have the function
yet, does the parser drop into a looking for a function?
When I ask, everyone just says, "Use ANTLR." But I have a problem, and I want learn how to resolve it before I start using someone else's black box.
I've read Nora Sandler's article, Write a Compiler. While a good article, it seems some bits are skipped over and I'm just not getting it. I also realize that I may be confusing Parser Generator and Parser Combinator.
P.S. - someone suggested the Dragon Book ... that is on my list of necessary reading.
Since my EBNF leads off with
<program>
, is my first parsing function looking for a program - i.e., a<function>
?
Yes, so you might have a function look_for_a_program()
that calls look_for_a_function()
.
And if we don't have the function yet, does the parser drop into a looking for a function?
look_for_a_function()
will be more interesting. First it will expect the keyword int
(i.e., a Token
that points at the text "int" in the input). If that succeeds, it expects an <id>
(a Token
that points at word-like text in the input). And so on, mirroring the structure of the <function>
rule in your EBNF. (When it gets to the <statement>
part, it doesn't look for a Token, it calls look_for_a_statement
.) If you can successfully get to recognizing the right brace at the end of the rule, then you can make an ExprNode
to represent the <function>
that you just recognized. (Dealing with errors is another matter.)
This is a particular kind of parser called a "recursive descent parser", which you can find tons of help for on the web.
(Very roughly speaking, ANTLR is a parser generator that will take your EBNF and create something like the above code for you.)
Some thoughts:
Token
structure to convey the kind of token that the lexer has recognized. (E.g., an integer literal vs a word-like thing vs punctuation) This isn't strictly necessary, but it will make things easier for the parser.left
and right
fields of the ExprNode
structure suggest that your AST will be a binary tree. This might work initially, but eventually (as you enlarge your EBNF) you'll probably want to allow an arbitrary number of "child nodes".