I'm building a small, single file parser combinator library in Haskell for my own fun. A challenge that I gave myself for this library is building the tools so I can create a parser for Java function signatures, such as parsing boolean isPrime(int n)
, into a language agnostic representation, such as
Function {
returnType = Type {
typeName = "boolean",
subTypes = []
},
functionName = "isPrime",
args = ...
}
However, when it comes to parsing Java types, I'm running into an issue. A type like Tuple<Integer, List<Integer>>
is fine to parse because its grammar can be represented without left recursion:
javatype := identifier, subtypes
identifier := regex("[a-zA-Z_][a-zA-Z0-9_]*")
subtypes := ('<', javatype, (',', javatype) zero or more times, '>') | NOTHING
(Apologies for my bad grammar writing skills and for simplifying Java identifier rules)
However, array types introduce left recursion, which parser combinators famously have difficulty dealing with. With array types, we change the javatype
rule to: javatype := (javatype, "[]") | (identifier, subtypes)
. This gives us left recursion. We can try to rewrite the rule to javatype := (identifier, subtypes) | (javatype, "[]")
but that would mean that for a type like int[]
, the parser would just parse the "int"
and not the "[]"
.
How can I rewrite the type grammar rules to avoid left recursion while allowing array types?
You can elimnate the left-recursion in this rule:
javatype := (javatype, "[]") | (identifier, subtypes)
by translating it into these two equivalent rules:
javatype := (identifier, subtypes) javatype2
javatype2 := "[]" javatype2 | e
(where e is the empty string)