[SOLVED] Parse a list of subroutines

I have written parser_sub.mly and lexer_sub.mll, which together parse a single subroutine. A subroutine is a block of statements enclosed by Sub and End Sub.

The raw file I actually need to handle contains a list of subroutines mixed with irrelevant text. Here is an example:

' a example file
Sub f1()
  ...
End Sub
haha
' hehe
Sub f2()
  ...
End Sub

So I need to write parser.mly and lexer.mll that parse this file by ignoring everything outside the subroutines (e.g. haha, ' hehe, etc.), calling parser_sub.main on each subroutine, and returning a list of subroutines.

Could anyone tell me how to make the parser ignore all the irrelevant text (anything outside a Sub ... End Sub block)?

Here is a part of parser.mly I tried to write:

%{
  open Syntax
%}
%start main
%type <Syntax.ev> main
%%
main:
  subroutine_declaration*  { $1 };

subroutine_declaration:
  SUB name = subroutine_name LPAREN RPAREN EOS
  body = procedure_body?
  END SUB 
  { { subroutine_name = name;
      procedure_body_EOS_opt = body; } }

The rules for procedure_body are complex and are already defined in parser_sub.mly and lexer_sub.mll. How can I avoid redefining them in parser.mly and lexer.mll, and just call parser_sub.main instead?

Solution

If the stuff you want to skip can have any form (not necessarily valid tokens of your language), you pretty much have to solve this by hacking your lexer, as Kakadu suggests. This may be the easiest thing in any case.
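One way to picture the "do it before the grammar" approach is a small pre-pass in plain OCaml that cuts the raw text into one chunk per subroutine and drops everything else. This is only a sketch under assumptions that are not in the question: the Sub and End Sub markers each sit on their own line, they are spelled exactly as in the example (no case folding), and End Sub has no trailing comment. The name extract_subs is made up for illustration.

```ocaml
(* Sketch: keep only the text between "Sub ..." and "End Sub" lines.
   Assumes each marker is on its own line, as in the example file. *)

(* True if [line], ignoring surrounding blanks, starts with [prefix]. *)
let starts_with prefix line =
  let line = String.trim line in
  let n = String.length prefix in
  String.length line >= n && String.sub line 0 n = prefix

let is_sub_start line = starts_with "Sub " line
let is_sub_end line = String.trim line = "End Sub"

(* Split the raw text into one string per subroutine, dropping filler. *)
let extract_subs (text : string) : string list =
  let lines = String.split_on_char '\n' text in
  (* [current] is [Some acc] while inside a subroutine; [acc] holds the
     lines seen so far, most recent first. *)
  let rec go current done_ = function
    | [] -> List.rev done_          (* an unterminated Sub is dropped *)
    | l :: rest ->
      (match current with
       | None when is_sub_start l -> go (Some [l]) done_ rest
       | None -> go None done_ rest (* filler outside a Sub: skip it *)
       | Some acc when is_sub_end l ->
         let chunk = String.concat "\n" (List.rev (l :: acc)) in
         go None (chunk :: done_) rest
       | Some acc -> go (Some (l :: acc)) done_ rest)
  in
  go None [] lines

(* Hypothetical use with the existing parser, assuming the lexer entry
   point is called Lexer_sub.token:

   let parse_all text =
     List.map
       (fun chunk -> Parser_sub.main Lexer_sub.token (Lexing.from_string chunk))
       (extract_subs text)
*)
```

With this split, the grammar for procedure_body stays in parser_sub.mly only, and no outer parser.mly is needed. The price is that line and column positions reported by the sub-parser are relative to each chunk, not to the original file.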

If the filler (stuff to skip) consists of valid tokens, and you want to skip using a grammar rule, it seems to me the main problem is to define a nonterminal that matches any token other than END. This will be unpleasant to keep up to date, but seems possible.

Finally, you have the problem that your end marker is two symbols, END SUB. You have to handle the case where you see END not followed by SUB. This is even trickier because SUB is also your beginning marker. Again, one way to simplify this would be to hack your lexer so that it treats END SUB as a single token. (Usually this is trickier than you'd expect, say if you want to allow comments between END and SUB.)
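A possible way to get a single END SUB token without touching the lexer rules is to wrap the lexer function with one token of lookahead. The sketch below assumes a hypothetical token type (in a real project it would come from the generated parser) and a made-up END_SUB constructor. Because the wrapper works on tokens, any comments or whitespace that the underlying lexer already discards between END and SUB do not get in the way.

```ocaml
(* Hypothetical token type, for illustration only. *)
type token =
  | SUB | END | END_SUB
  | IDENT of string
  | LPAREN | RPAREN | EOS | EOF

(* Wrap a lexer so that END directly followed by SUB comes out as END_SUB.
   [next] is the underlying lexer function, e.g. the [token] entry point
   of an ocamllex-generated module. *)
let merge_end_sub (next : 'buf -> token) : 'buf -> token =
  let pending = ref None in           (* at most one looked-ahead token *)
  fun buf ->
    let tok =
      match !pending with
      | Some t -> pending := None; t  (* replay the saved token first *)
      | None -> next buf
    in
    match tok with
    | END ->
      (match next buf with
       | SUB -> END_SUB               (* the pair collapses into one token *)
       | other -> pending := Some other; END)
    | t -> t
```

The wrapped function would then be passed to the parser in place of the raw lexer. One caveat: the generated parser reads token positions from the lexbuf after each call, so with lookahead the positions attached to END or END_SUB are those of the following token unless they are saved and restored as well.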