How to use EBNF to drive the Parser?

I have a simple EBNF:

<program>   ::= <function>
<function>  ::= int <id> ( ) { <statement> }
<statement> ::= return <expr> ;
<expr>      ::= <int>

I am writing pure C and targeting Linux (for now). The answer doesn't have to be written in C.

I have a Lexer that simply indexes a source file and creates a linked list of token structs. The token only carries a char* and size_t as well as referencing the next token. That's all it has.

struct Token {
  char *ndx;
  size_t length;
  struct Token *next;
}

The Lexer indexes the source with only two basic conditions which it follows to discern - or differentiate - one token from another:

a space separates tokens
symbols are tokens - ( ) [ ] { } ; ... etc.

The source file I'm starting with looks like:

int main()
{
    return 2;
}

I am trying to figure out how I get from the EBNF to a ProgramExpr with the members - a FunctionExpr and a StatementExpr. Since I have the tokens representing the source file, the parser needs to resolve the tokens - i.e., discover what they are. Based on the production rules, the parser will build an Abstract Syntax Tree (what I also call ExpressionNode.

How do I get from the EBNF to an Expression Node?

I am thinking I will have a struct something like:

struct ExprNode {
  enum ExprType {
    PROG,
    FUNC,
    IDNT,
    STMT,
    EXPR,
    INT_LIT
  } type;
  char *term;
  int pos;
  int line;

  struct ExprNode* left;
  struct ExprNode* right;
};

What is the procedural order of parsing? Since my EBNF leads off with <program>, is my first parsing function looking for a program - i.e., a <function>? And if we don't have the function yet, does the parser drop into a looking for a function?

When I ask, everyone just says, "Use ANTLR." But I have a problem, and I want learn how to resolve it before I start using someone else's black box.

I've read Nora Sandler's article, Write a Compiler. While a good article, it seems some bits are skipped over and I'm just not getting it. I also realize that I may be confusing Parser Generator and Parser Combinator.

P.S. - someone suggested the Dragon Book ... that is on my list of necessary reading.

Solution

Since my EBNF leads off with <program>, is my first parsing function looking for a program - i.e., a <function>?

Yes, so you might have a function look_for_a_program() that calls look_for_a_function().

And if we don't have the function yet, does the parser drop into a looking for a function?

look_for_a_function() will be more interesting. First it will expect the keyword int (i.e., a Token that points at the text "int" in the input). If that succeeds, it expects an <id> (a Token that points at word-like text in the input). And so on, mirroring the structure of the <function> rule in your EBNF. (When it gets to the <statement> part, it doesn't look for a Token, it calls look_for_a_statement.) If you can successfully get to recognizing the right brace at the end of the rule, then you can make an ExprNode to represent the <function> that you just recognized. (Dealing with errors is another matter.)

This is a particular kind of parser called a "recursive descent parser", which you can find tons of help for on the web.

(Very roughly speaking, ANTLR is a parser generator that will take your EBNF and create something like the above code for you.)

Some thoughts:

You might want to add a field to the Token structure to convey the kind of token that the lexer has recognized. (E.g., an integer literal vs a word-like thing vs punctuation) This isn't strictly necessary, but it will make things easier for the parser.
The left and right fields of the ExprNode structure suggest that your AST will be a binary tree. This might work initially, but eventually (as you enlarge your EBNF) you'll probably want to allow an arbitrary number of "child nodes".