parsing haskell functional-programming ll-grammar arrow-abstraction

John Hughes' Deterministic LL(1) parsing with Arrow and errors

I wanted to write a parser based on John Hughes' paper Generalizing Monads to Arrows. When reading through and trying to reimplement his code I realized there were some things that didn't quite make sense. In one section he lays out a parser implementation based on Swierstra and Duponchel's paper Deterministic, error-correcting combinator parsers using Arrows. The parser type he describes looks like this:

data StaticParser ch = SP Bool [ch]
data DynamicParser ch a b = DP (a, [ch]) -> (b, [ch])
data Parser ch a b = P (StaticParser ch) (DynamicParser ch a b)

with the composition operator looking something like this:

(.) :: Parser ch b c -> Parser ch a b -> Parser ch a c
  P (SP e2 st2) (DP f2) . P (SP e1 st1) (DP f1) =
  P (SP (e1 && e2) (st1 `union` if e1 then st2 else []))
    (DP $ f2 . f1)

The issue is that the composition of parsers q . p 'forgets' q's starting symbols. One possible interpretation I thought of is that Hughes' expects all our DynamicParsers to be total such that a symbol parser's type signature would be symbol :: ch -> Parser ch a (Maybe ch) instead of symbol :: ch -> Parser ch a ch. This still seems awkward though since we have to duplicate information putting starting symbol information in both the StaticParser and DynamicParser. Another issue is that almost all parsers will have the potential to throw which means we will have to spend a lot of time inside Maybe or Either creating what is essentially the "monads do not compose problem." This could be remedied by rewriting DynamicParser itself to handle failure or as an Arrow transformer, but this is straying quite a bit from the paper. None of these issues are addressed in the paper, and the Parser is presented as if it obviously works, so I feel like I must me missing something basic. If someone can catch what I missed that would be super helpful.

Solution

I think the deterministic parsers described by Swierstra and Duponcheel are a bit different from traditional parsers: they do not handle failure at all, only choice.

See also the invokeDet function in the S&D paper:

invokeDet :: Symbol s => DetPar s a -> Input s -> a
invokeDet (_, p) inp = case p inp [] of (a, _) -> a

This function clearly assumes it will always be able to find a valid parse.

With the arrow version of the parsers described by Hughes you can write a examples like this:

main = do
  let p = symbol 'a' >>> (symbol 'b' <+> symbol 'c')
  print $ invokeDet p "ab"
  print $ invokeDet p "ac"

Which will print the expected:

'b'
'c'

However, if you write a "failing" parse:

main = do
  let p = symbol 'a' >>> (symbol 'b' <+> symbol 'c')
  print $ invokeDet p "ad"

It will still print:

'c'

To make this behavior a bit more sensible, Swierstra and Duponcheel also introduce error-correction. The output 'c' is expected if we assume the erroneous character d has been corrected to be a c in the input. This requires an extra mechanism which presumably was too complicated to include in Hughes' paper.

I have uploaded the implementation I used to get these results here: https://gist.github.com/noughtmare/eced4441332784cc8212e9c0adb68b35

For more information about a more practical parser in the same style (but no longer deterministic and no longer limited to LL(1)) I really like the "Combinator Parsing: A Short Tutorial" by Swierstra. An interesting excerpt from section 9.3:

A subtle point here is the question how to deal with monadic parsers. As we described in [13] the static analysis does not go well with monadic computations, since in that case we dynamically build new parses based on the input produced thus far: the whole idea of a static analysis is that it is static. This observation has lead John Hughes to propose arrows for dealing with such situations [7]. It is only recently that we realised that, although our arguments still hold in general, they do not apply to the case of the LL(1) analysis. If we want to compute the symbols which can be recognised as the first symbol by a parser of the form p >>= q then we are only interested in the starting symbols of the right hand side if the left hand side can recognise the empty string; the good news is that in that case we statically know what value will be returned as a witness, and can pass this value on to q, and analyse the result of this call statically too. Unfortunately we will have to take special precautions in case the left hand side operator contains a call to pErrors in one of the empty derivations, since then it is no longer true that the witness of this alternative can be determined statically.

The full parser implementation by Swierstra can be found in the uu-parsinglib package, although I do not know how many of the extensions are implemented there.