I am working on making a Lisp interpreter in OCaml. I naturally started with the front-end. So far I have an S-expression parsing algorithm that works most of the time. For both simple S-expressions like (a b)
and ((a b) (c d))
my function, ast_as_str
, shows that the output list structure is incorrect. I've documented that below. After trying countless variations on parse
nothing seems to work. Does someone adept in writing parsers in OCaml have a suggestion as to how I can fix my code?
type s_expression = Nil | Atom of string | Pair of s_expression * s_expression
let rec parse tokens =
match tokens with
| [] -> Nil
| token :: rest ->
match token with
| "(" -> parse rest
| ")" -> Pair(Nil, parse rest)
| atom -> Pair(Atom atom, parse rest)
let rec ast_as_str ast =
match ast with
| Nil -> "nil"
| Atom a -> Printf.sprintf "%s" a
| Pair(a, b) -> Printf.sprintf "(%s %s)" (ast_as_str a) (ast_as_str b);;
let check_output test = print_endline (ast_as_str (parse test));;
(*
Input:
(a b)
Output:
(a (b (nil nil)))
Almost correct...
*)
check_output ["("; "a"; "b"; ")"];;
(*
Input:
((w x) (y z))
Output:
(w (x (nil (y (z (nil (nil nil)))))))
Incorrect.
*)
check_output ["("; "("; "w"; "x"; ")"; "("; "y"; "z"; ")"; ")"]
I'm going to assume this isn't homework. If it is, I will change the answer to some less specific hints.
A recursive descent parser works by recognizing the beginning token of a construct, then parsing the contents of the construct, then (very often) recognizing the ending token of the construct. S-expressions have just one construct, the parenthesized list. Your parser isn't doing the recognizing of the end of the construct.
If you assume your parser works correctly, then encountering a right parenthesis )
is a syntax error. There shouldn't be any unmatched right parentheses, and matching right parentheses are parsed as part of the parenthesized list construct (as I described above).
If you swear that this is just a personal project I'd be willing to write up a parser. But you should try writing up something as described above.
Note that when you see an atom, you aren't seeing a pair. It's not correct to return Pair (Atom xyz, rest)
when seeing an atom.
Update
The way to make things work in a functional setting is to have the parsing functions return not only the construct that they saw, but also the remaining tokens that haven't been parsed yet.
The following code works for your examples and is probably pretty close to correct:
let rec parse tokens =
match tokens with
| [] -> failwith "Syntax error: end of input"
| "(" :: rest ->
(match parselist rest with
| (sexpr, ")" :: rest') -> (sexpr, rest')
| _ -> failwith "Syntax error: unmatched ("
)
| ")" :: _ -> failwith "Syntax error: unmatched )"
| atom :: rest -> (Atom atom, rest)
and parselist tokens =
match tokens with
| [] | ")" :: _ -> (Nil, tokens)
| _ ->
let (sexpr1, rest) = parse tokens in
let (sexpr2, rest') = parselist rest in
(Pair (sexpr1, sexpr2), rest')
You can define check_output like this:
let check_output test =
let (sexpr, toks) = parse test in
if toks <> [] then
Printf.printf "(extra tokens in input)\n";
print_endline (ast_as_str sexpr)
Here's what I see for your two test cases:
# check_output ["("; "a"; "b"; ")"];;
(a (b nil))
- : unit = ()
# check_output ["("; "("; "w"; "x"; ")"; "("; "y"; "z"; ")"; ")"];;
((w (x nil)) ((y (z nil)) nil))
- : unit = ()
I think these are the right results.