parsinghaskellparsecleft-recursion

Parsing left-recursive grammar in sum-type inifinite recursion


I'm trying to write a parser for the Tiger language from modern compiler implementation in ML, and am stuck on one of the recursive types.

I have the following type

data LValue =                                                                                                                       
    Id Atom                                                                                                                         
    | RecordAccess LValue Atom                                                                                                      
    | ArraySubscript LValue Expression  

with the following grammar:

lvalue -> id
       -> lvalue.id
       -> lvalue[exp]
id -> atom
exp -> (the big, overarching, everything-is-an-expression type)

And I'm trying to parse it with Parsec, but I'm getting stuck in an infinitely recursive loop. Here's my current base parser:

lvalueParser :: Parsec String () LValue                                                                                             
lvalueParser =                                                                                                                      
    try (Id <$> (atomParser <* (notFollowedBy (char '.'))))                                                                         
    <|> try recordAccessParser                                                                                                      
    where recordAccessParser = (uncurry RecordAccess) <$> do {                                                                      
      record <- lvalueParser;                                                                                                         
      char '.';                                                                                                                     
      atom <- atomParser;                                                                                                           
      return (record, atom)                                                                                                      
      }

(Note: I haven't yet attempted to implement anything to handle the ArrayAccess portion)

Clearly, the infinite-loopiness happens when the recordAccessParser calls back to the lvalueParser.

I can change the recordAccessParser thusly:

recordAccessParser = (uncurry RecordAccess) <$> do {                                                                      
          record <- atomParser;                                                                                                         
          char '.';                                                                                                                     
          atom <- atomParser;                                                                                                           
          return (Id record, atom)                                                                                                      
          }

and it terminates. However, it will not parse record access more than a single level deep:

Parsec.parse lvalueParser "" "record_1.field_1.field_2"
#=> RecordAccess (Id record_1) (Id field_1)

And I expect

#=> RecordAccess (RecordAccess (Id record_1) (Id field_1)) (Id field_2)

I looked at chainl1, but the type of the chaining parser is a -> a -> a, and that doesn't match the type of LValue that reflects the grammar. I also looked at many; however I don't have a constant prefix for each term - the left recursion term is what I'm trying to parse into part of the result type.

I imagine I'm missing a specific concept of Parsec/parsing and would love to be pointed in the right direction. There are more types in the language that I'm writing a parser for that will have similar constructions.


Solution

  • Though you cannnot use chainl1 here, You can define chainl1-like combinator like this:

    leftRec :: (Stream s m t)
            => ParsecT s u m a -> ParsecT s u m (a -> a) -> ParsecT s u m a
    leftRec p op = rest =<< p
      where
        rest x = do f <- op
                    rest (f x)
              <|> return x
    

    I consulted here to implement this. By using this combinator, lvalueParser can be defined as follows:

    lvalueParser :: Parser LValue
    lvalueParser = leftRec idParser (recordAccessModifier <|> arraySubscriptModifier)
      where
        idParser = Id <$> atomParser
        recordAccessModifier = do
          a <- char '.' *> atomParser
          return (\l -> RecordAccess l a)
        arraySubscriptModifier = do
          e <- between (char '[') (char ']') expParser
          return (\l -> ArraySubscript l e)
    

    Example:

    main = parseTest lvalueParser "x.y[2].z"
    -- => RecordAccess (ArraySubscript (RecordAccess (Id 'x') 'y') (ENat 2)) 'z'
    

    where Atom, Expression, and their parsers are defined as follows:

    type Atom = Char
    atomParser :: Parser Atom
    atomParser = letter <?> "atom"
    
    data Expression = ENat Int
      deriving Show
    expParser :: Parser Expression
    expParser = (ENat . read) <$> many digit