[SOLVED] How do I perform 'lookahead' in an OCaml lexer / how do I rollback a lexeme?

Well, I'm writing my first parser, in OCaml, and I immediately somehow managed to make one with an infinite loop.

In particular, I'm trying to lex identifiers according to the rules of the Scheme specification (I have no idea what I'm doing, obviously), and there's some language in there requiring that an identifier be followed by a delimiter. My approach right now is to have a delimited_identifier regex that includes one of the delimiter characters, which should not be consumed by the main lexer. Once that regex has matched, the reading of that lexeme is reverted by Sedlexing.rollback (well, my wrapper thereof), and the buffer is passed to a sublexer that eats only the actual identifier, hopefully leaving the delimiter in the buffer to be eaten as a different lexeme by the parent lexer.

I'm using Menhir and Sedlex, mostly synthesizing the examples from @smolkaj's ocaml-parsing example-repo and RWO's parsing chapter; here's the simplest reduction of my current parser and lexer:

%token LPAR RPAR LVEC APOS TICK COMMA COMMA_AT DQUO SEMI EOF
%token <string> IDENTIFIER
(* %token <bool> BOOL *)
(* %token <int> NUM10 *)
(* %token <string> STREL *)

%start <Parser.AST.t> program

%%

program:
  | p = list(expression); EOF { p }
  ;

expression:
  | i = IDENTIFIER { Parser.AST.Atom i }

%%

… and …

(** Regular expressions *)
let newline = [%sedlex.regexp? '\r' | '\n' | "\r\n" ]
let whitespace = [%sedlex.regexp? ' ' | newline ]
let delimiter = [%sedlex.regexp? eof | whitespace | '(' | ')' | '"' | ';' ]

let digit = [%sedlex.regexp? '0'..'9']
let letter = [%sedlex.regexp? 'A'..'Z' | 'a'..'z']

let special_initial = [%sedlex.regexp?
   '!' | '$' | '%' | '&' | '*' | '/' | ':' | '<' | '=' | '>' | '?' | '^' | '_' | '~' ]
let initial = [%sedlex.regexp? letter | special_initial ]

let special_subsequent = [%sedlex.regexp? '+' | '-' | '.' | '@' ]
let subsequent = [%sedlex.regexp? initial | digit | special_subsequent ]

let peculiar_identifier = [%sedlex.regexp? '+' | '-' | "..." ]
let identifier = [%sedlex.regexp? initial, Star subsequent | peculiar_identifier ]
let delimited_identifier = [%sedlex.regexp? identifier, delimiter ]


(** Swallow whitespace and comments. *)
let rec swallow_atmosphere buf =
   match%sedlex buf with
   | Plus whitespace -> swallow_atmosphere buf
   | ";" -> swallow_comment buf
   | _ -> ()

and swallow_comment buf =
   match%sedlex buf with
   | newline -> swallow_atmosphere buf
   | any -> swallow_comment buf
   | _ -> assert false

(** Return the next token. *)
let rec token buf =
   swallow_atmosphere buf;
   match%sedlex buf with
   | eof -> EOF

   | delimited_identifier ->
     Sedlexing.rollback buf;
     identifier buf

   | '(' -> LPAR
   | ')' -> RPAR

   | _ -> illegal buf (Char.chr (next buf))

and identifier buf =
   match%sedlex buf with
   | _ -> IDENTIFIER (Sedlexing.Utf8.lexeme buf)

(Yes, it's basically a no-op / the simplest thing possible rn. I'm trying to learn! :x)

Unfortunately, this combination results in an infinite loop in the parsing automaton:

State 0:
Lookahead token is now IDENTIFIER (1-1)
Shifting (IDENTIFIER) to state 1
State 1:
Lookahead token is now IDENTIFIER (1-1)
Reducing production expression -> IDENTIFIER 
State 5:
Shifting (IDENTIFIER) to state 1
State 1:
Lookahead token is now IDENTIFIER (1-1)
Reducing production expression -> IDENTIFIER 
State 5:
Shifting (IDENTIFIER) to state 1
State 1:
...

I'm new to parsing and lexing and all this; any advice would be welcome. I'm sure it's just a newbie mistake, but …

Thanks!

Solution

As said before, implementing too much logic inside the lexer is a bad idea. However, the infinite loop does not come from the rollback but from your definition of identifier:

 identifier buf =
   match%sedlex buf with
   | _ -> IDENTIFIER (Sedlexing.Utf8.lexeme buf)

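Within this definition, _ is the only case, and it matches the shortest possible word in the language of all strings. In other words, _ always matches the empty word ε without consuming any part of its input. Since Sedlexing.rollback has already reset the buffer to the start of the lexeme, token returns an IDENTIFIER without ever advancing, so every later call yields the same token. That is the repeated "Lookahead token is now IDENTIFIER (1-1)" in your trace, and it is what sends the parser into an infinite loop.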
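
To make the difference concrete, here is a minimal sketch in plain OCaml, with no Sedlex involved. Everything in it (the buf record, is_delimiter, broken_identifier, next_token, tokens) is a hypothetical stand-in rather than Sedlex's API, and the delimiter set is a simplified subset of the one in the question. It contrasts a zero-width sublexer with one that consumes the identifier characters and only peeks at the delimiter:

```ocaml
(* Hypothetical stand-in for a lexer buffer: the input plus a cursor. *)
type buf = { src : string; mutable pos : int }

type token = IDENTIFIER of string | LPAR | RPAR | EOF

let is_whitespace c = c = ' ' || c = '\n' || c = '\r'

(* Simplified delimiter set (assumption): whitespace and parentheses only. *)
let is_delimiter c = is_whitespace c || c = '(' || c = ')'

(* Zero-width sublexer, analogous to a lone `_` case: it reports an
   (empty) lexeme but never moves the cursor. *)
let broken_identifier (_ : buf) : string = ""

(* Consumes characters up to, but not including, the next delimiter.
   The delimiter is only inspected, so it stays in the buffer. *)
let identifier (b : buf) : string =
  let start = b.pos in
  while b.pos < String.length b.src && not (is_delimiter b.src.[b.pos]) do
    b.pos <- b.pos + 1
  done;
  String.sub b.src start (b.pos - start)

(* Main lexer, parameterised over the identifier sublexer. *)
let next_token (sub : buf -> string) (b : buf) : token =
  while b.pos < String.length b.src && is_whitespace b.src.[b.pos] do
    b.pos <- b.pos + 1
  done;
  if b.pos >= String.length b.src then EOF
  else
    match b.src.[b.pos] with
    | '(' -> b.pos <- b.pos + 1; LPAR
    | ')' -> b.pos <- b.pos + 1; RPAR
    | _ -> IDENTIFIER (sub b)

(* Collects tokens until EOF, or until [limit] tokens have been read,
   so that a lexer that never advances cannot hang the demonstration. *)
let tokens ?(limit = 10) (sub : buf -> string) (src : string) : token list =
  let b = { src; pos = 0 } in
  let rec go n acc =
    if n = 0 then List.rev acc
    else
      match next_token sub b with
      | EOF -> List.rev (EOF :: acc)
      | t -> go (n - 1) (t :: acc)
  in
  go limit []
```

With this model, tokens identifier "(foo bar)" gives LPAR, IDENTIFIER "foo", IDENTIFIER "bar", RPAR, EOF, while tokens ~limit:3 broken_identifier "foo" gives three empty identifiers and would keep going without the limit. In the Sedlex version, the analogous change would presumably be to have the sublexer match the identifier regexp instead of a lone _. None of the characters that regexp accepts is a delimiter, so the match stops in front of the delimiter on its own.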