parsinghaskellparser-combinators

Mixfix Parser Bug: Infix Right Associative Operators failing


Introduction

Heya,

Before diving into it, I'll admit that this is a long and very specific question. I've tried to remove as much unnecessary info as possible without loss of clarity. Anyways, on to the question:

I'm working on a parser for mixfix expressions in Haskell. I found this paper and thought I'd base a first-draft implementation on it.

To try and summarize the paper, the idea is this: we construct a DAG, which represents a precedence graph. A node contains a set of operators which all have the same precedence. These operators can be infix (left/right/nonassociative), prefix, postfix or closed. Edges point towards nodes of higher precedence. I've taken a screenshot of the example graph from the paper.

precedence-graph

If an infix operator has multiple 'slots' (like if_then_else_), then it is represented as a List of 'name parts' (i.e. [if, then, else]) and the fixity is only talking about the left and rightmost name parts, so if_then_else_ is prefix, but the python ternary if _if_else_ is infix.

This graph is then used to produce a parser combinator. For reference, the libraries I'm using are:

Before moving on to the code, I will mention that I am aware that my implementation is probably very inefficient (particularly with my liberal usage of backtracking) and I'll worry about that once it's working properly.

The Code

I use two types as the 'output' of the parser: RawCore and Telescope. RawCore represents an output expression, which can be either a variable or a curried function application. A telescope is simply a way of representing repeated application more conveniently.

-- Technically RawCore is more complex than this, but this definition contains the relevant bits
data RawCore = Var Text | App RawCore RawCore

-- Telescope e [a, b] represents (App (App e a) b)
data Telescope = Tel RawCore [RawCore]

type Parser = Parsec Text Text

For representing operators and the precedence graph, I use the following types:

data Associativity = LeftAssociative | RightAssociative | NonAssociative
  deriving (Eq, Ord, Show)
  
data Fixity = Closed | Prefix | Postfix | Infix Associativity 
  deriving (Eq, Ord, Show)

data Operator = Operator { fixity :: Fixity, name_parts :: Vector Text }
  deriving (Eq, Ord, Show)

-- adapted from the paper to use the Topograph library
type PrecedenceNode = (Set Operator)

type PrecedenceGraph i = G PrecedenceNode i

Just before I get to the implementation, I'll briefly describe some helper functions I use (full definitions are just below the parser definition):

-- Note: I'm using the scoped type variables extension, so the 'i' is captured
mixfix :: forall i. PrecedenceGraph i -> Parser RawCore
mixfix G {..} = expr
  where
    expr :: Parser RawCore
    expr = precs gVertices

    -- precs will take a list of nodes (lowest -> highest precedence) and return the
    -- first one which successfully parses the string.
    precs :: [i] -> Parser RawCore
    precs (p:ps) = prec p <||> precs ps
    precs [] = customFailure "ran out of operators in precedence graph" 

    -- prec will build a parser originating at a single precedence node
    prec :: i -> Parser RawCore
    prec node = choice
      [ try (unscope <$> close Closed)
      , try (appn <$> psucs <*> close (Infix NonAssociative) <*> psucs)
      , try (appr <$> many1 preRight <*> psucs)
      , try (appl <$> psucs <*> many1 postLeft)
      ]

      where
        close :: Fixity -> Parser Telescope
        close = inner . ops current_ops
          where
            current_ops :: [Operator]
            current_ops = Set.toList $ gFromVertex node

        psucs :: Parser RawCore
        psucs = precs $ gEdges node

        preRight :: Parser (RawCore -> RawCore)
        preRight = 
              (\(Tel core lst) val -> unscope $ Tel core (lst <> [val])) <$> close Prefix
          <||> (\l (Tel core lst) r -> unscope $ Tel core (l : lst <> [r]))
               <$> psucs <*> close (Infix RightAssociative)

        postLeft :: Parser (RawCore -> RawCore)
        postLeft =
              (\(Tel core lst) val -> unscope $ Tel core (val : lst)) <$> close Postfix
          <||> (\(Tel core lst) l r -> unscope $ Tel core (r : lst <> [l]))
               <$> close (Infix LeftAssociative) <*> psucs

        appn e (Tel core lst) e' = unscope $ Tel core (e : lst <> [e'])
        appr fs e = foldr (\f e -> f e) e fs
        appl e fs = foldl (\e f -> f e) e fs

    -- inner : for a given operator, return a parser which parses that operator with 
    --         only expressions between name bits, so, e.g.
    --         inner if_then_else_ would return a parser parsing expressions if e1 then e2 else
    --         inner _+_ would return a parser parsing the symbol '+'
    inner :: [Operator] -> Parser Telescope
    inner [] = customFailure "inner ran out of operators"
    inner (op : ops) =
      Tel (Var (opName op))
        <$> betweenM (fmap symbol $ name_parts op) expr
      <||> inner ops

    -- ops  : get all operators in a given node with a specified fixity
    --        also, get all operators of successor nodes
    ops :: [Operator] -> Fixity -> [Operator]
    ops op f = filter ((== f) . fixity) op

-- For the time being, I'm liberally applying 'try' at each alternative so
-- I don't have to think about whether it's causing the bug! (I know this is inefficient)
infixl 3 <||>
(<||>) :: Parser a -> Parser a -> Parser a
l <||> r = try l <|> r   

choice' :: [Parser a] -> Parser a
choice' = choice . fmap try

betweenM :: Vector (Parser b) -> Parser a -> Parser [a]  
betweenM vec p = case length vec of 
  0 -> pure []
  1 -> head vec *> pure []
  2 -> between (head vec) (last vec) ((\x -> [x]) <$> p)
  _ -> (head vec) *> ((:) <$> p <*> betweenM (tail vec) p)

many1 :: Parser a -> Parser [a]
many1 p = (:) <$> p <*> many p 

opName :: Operator -> Text
opName (Operator {..}) = case fixity of
  Closed -> name
  Prefix -> "_" <> name
  Postfix -> name <> "_"
  Infix _ -> "_" <> name <> "_"
  where name = underscore (Vector.toList name_parts)
        underscore [] = ""
        underscore [x] = x
        underscore (x:y:[]) = x <> "_" <> y
        underscore (x:y:xs) = x <> "_" <> y <> "_" <> underscore xs

unscope :: Telescope -> RawCore
unscope (Tel core l) = go core l where
  go :: RawCore -> [RawCore] -> RawCore
  go core [] = core 
  go core (c:cs) = go (App core c) cs

sc :: Parser () 
sc = L.space
  space1
  (L.skipLineComment ";;")
  (L.skipBlockComment "(;;" ";;)")

symbol :: Text -> Parser Text
symbol = L.symbol sc

The Problem

I've written some tests for the above code, and have found that for some reason, right-associative operators are not parsing. I've got the following precedence graph:

_&_ right -> _=_ non -> _+_ left -> _! -> if_then_else_ -> (_)
                        _-_ left                           false
                                                           true                        

In code:

ops :: Map PrecedenceNode (Set PrecedenceNode)
ops = Map.fromList
  [ (node_and,   Set.fromList [node_eq])
  , (node_eq,    Set.fromList [node_arith, node_fact])
  , (node_arith, Set.fromList [node_close])
  , (node_fact,  Set.fromList [node_close])
  , (node_if,    Set.fromList [node_close])
  , (node_close, Set.fromList []) ]

  where 
    node_and   = Set.fromList [and_op]
    node_eq    = Set.fromList [eq_op]
    node_arith = Set.fromList [add_op, sub_op]
    node_fact  = Set.fromList [fact_op]
    node_if    = Set.fromList [if_op]
    node_close = Set.fromList [paren_op, true, false]
    
    true     = Operator Closed $ Vec.singleton "true"
    false    = Operator Closed $ Vec.singleton "false"
    and_op   = Operator (Infix RightAssociative) $ Vec.fromList ["&"]
    eq_op    = Operator (Infix NonAssociative)   $ Vec.fromList ["="]
    add_op   = Operator (Infix LeftAssociative)  $ Vec.fromList ["+"]
    sub_op   = Operator (Infix LeftAssociative)  $ Vec.fromList ["-"]
    fact_op  = Operator Postfix $ Vec.fromList ["!"]
    if_op    = Operator Prefix $ Vec.fromList ["if", "then", "else"]
    paren_op = Operator Closed $ Vec.fromList ["(", ")"]

When running the parser, I use transitive closure of this graph, i.e. runG ops (parse . closure). If I provide the string "true & false", it parses as simply true. If I change the and_op definition to be either NonAssociative or LeftAssociative, then the parser works just fine (i.e. it returns App 'true' 'false'. My question then is: why, and how can I fix it?

Below are some sample strings and their parses - I have used as a left-associative binary operator instead of App to make things clearer.

"true" (var "true")
"false" (var "false")
"( true )" ((var "(_)") ⋅ (var "true"))
"true = false" ((var "_=_") ⋅ (var "true" ) ⋅ (var "false"))
"true + false" ((var "_+_") ⋅ (var "true" ) ⋅ (var "false"))
"true + false - false" ((var "_-_") ⋅ ((var "_+_") ⋅ (var "true" ) ⋅ (var "false")) ⋅ (var "false"))

And to any internet stranger who got this far - even if you don't answer this question, I'd like you to know I appreciate you taking the time to read my ramblings :)


Solution

  • I found an answer to my own question. Turns out it was an issue with one of the helper functions. My symmetric choice operator:

    infixl 3 <||>
    (<||>) :: Parser a -> Parser a -> Parser a
    l <||> r = try l <|> r   
    

    Should have been defined:

    infixl 3 <||>
    (<||>) :: Parser a -> Parser a -> Parser a
    l <||> r = try l <|> try r   
    

    I suppose if there's any lesson to be learned it's this: when using parser-combinators, pay attention to backtracking.