Heya,
Before diving into it, I'll admit that this is a long and very specific question. I've tried to remove as much unnecessary info as possible without loss of clarity. Anyways, on to the question:
I'm working on a parser for mixfix expressions in Haskell. I found this paper and thought I'd base a first-draft implementation on it.
To try and summarize the paper, the idea is this: we construct a DAG, which represents a precedence graph. A node contains a set of operators which all have the same precedence. These operators can be infix (left/right/nonassociative), prefix, postfix or closed. Edges point towards nodes of higher precedence. I've taken a screenshot of the example graph from the paper.
If an infix operator has multiple 'slots' (like if_then_else_
), then it is represented
as a List of 'name parts' (i.e. [if, then, else]) and the fixity is only talking about the left and rightmost name parts, so if_then_else_
is prefix, but the python ternary if _if_else_
is infix.
This graph is then used to produce a parser combinator. For reference, the libraries I'm using are:
Before moving on to the code, I will mention that I am aware that my implementation is probably very inefficient (particularly with my liberal usage of backtracking) and I'll worry about that once it's working properly.
I use two types as the 'output' of the parser: RawCore and Telescope. RawCore represents an output expression, which can be either a variable or a curried function application. A telescope is simply a way of representing repeated application more conveniently.
-- Technically RawCore is more complex than this, but this definition contains the relevant bits
data RawCore = Var Text | App RawCore RawCore
-- Telescope e [a, b] represents (App (App e a) b)
data Telescope = Tel RawCore [RawCore]
type Parser = Parsec Text Text
For representing operators and the precedence graph, I use the following types:
data Associativity = LeftAssociative | RightAssociative | NonAssociative
deriving (Eq, Ord, Show)
data Fixity = Closed | Prefix | Postfix | Infix Associativity
deriving (Eq, Ord, Show)
data Operator = Operator { fixity :: Fixity, name_parts :: Vector Text }
deriving (Eq, Ord, Show)
-- adapted from the paper to use the Topograph library
type PrecedenceNode = (Set Operator)
type PrecedenceGraph i = G PrecedenceNode i
Just before I get to the implementation, I'll briefly describe some helper functions I use (full definitions are just below the parser definition):
Parser Helper Functions
<||>
: backtracking version of <|>
choice'
: backtracking version of choice
many1
: repeat a parser 1 or more timesbetweenM
: best explained by example - take a vector of parsers (e.g. [symbol "a", symbol "b", symbol "c"]
) and a singular parser (e.g. int :: Parser Int
) and return a parser which would parse a 2 b 3 c
and return [2, 3]. If the vector contains only a single parser [p]
, is equivalent to p >> pure []
. If the vector is empty, is equivalent to pure []
.symbol
: take a string as input (e.g. "if") and produce a parser which parses that string followed by any trailing whitespace.Misc. Helper Functions
opName
: take a vector of name parts and fixity, returning the name of the operator with underscores, i.e. turn (Prefix, [if, then, else])
into "if_then_else"
.unscope
: unfold a telescope, converting it into an expression (RawCore
).-- Note: I'm using the scoped type variables extension, so the 'i' is captured
mixfix :: forall i. PrecedenceGraph i -> Parser RawCore
mixfix G {..} = expr
where
expr :: Parser RawCore
expr = precs gVertices
-- precs will take a list of nodes (lowest -> highest precedence) and return the
-- first one which successfully parses the string.
precs :: [i] -> Parser RawCore
precs (p:ps) = prec p <||> precs ps
precs [] = customFailure "ran out of operators in precedence graph"
-- prec will build a parser originating at a single precedence node
prec :: i -> Parser RawCore
prec node = choice
[ try (unscope <$> close Closed)
, try (appn <$> psucs <*> close (Infix NonAssociative) <*> psucs)
, try (appr <$> many1 preRight <*> psucs)
, try (appl <$> psucs <*> many1 postLeft)
]
where
close :: Fixity -> Parser Telescope
close = inner . ops current_ops
where
current_ops :: [Operator]
current_ops = Set.toList $ gFromVertex node
psucs :: Parser RawCore
psucs = precs $ gEdges node
preRight :: Parser (RawCore -> RawCore)
preRight =
(\(Tel core lst) val -> unscope $ Tel core (lst <> [val])) <$> close Prefix
<||> (\l (Tel core lst) r -> unscope $ Tel core (l : lst <> [r]))
<$> psucs <*> close (Infix RightAssociative)
postLeft :: Parser (RawCore -> RawCore)
postLeft =
(\(Tel core lst) val -> unscope $ Tel core (val : lst)) <$> close Postfix
<||> (\(Tel core lst) l r -> unscope $ Tel core (r : lst <> [l]))
<$> close (Infix LeftAssociative) <*> psucs
appn e (Tel core lst) e' = unscope $ Tel core (e : lst <> [e'])
appr fs e = foldr (\f e -> f e) e fs
appl e fs = foldl (\e f -> f e) e fs
-- inner : for a given operator, return a parser which parses that operator with
-- only expressions between name bits, so, e.g.
-- inner if_then_else_ would return a parser parsing expressions if e1 then e2 else
-- inner _+_ would return a parser parsing the symbol '+'
inner :: [Operator] -> Parser Telescope
inner [] = customFailure "inner ran out of operators"
inner (op : ops) =
Tel (Var (opName op))
<$> betweenM (fmap symbol $ name_parts op) expr
<||> inner ops
-- ops : get all operators in a given node with a specified fixity
-- also, get all operators of successor nodes
ops :: [Operator] -> Fixity -> [Operator]
ops op f = filter ((== f) . fixity) op
-- For the time being, I'm liberally applying 'try' at each alternative so
-- I don't have to think about whether it's causing the bug! (I know this is inefficient)
infixl 3 <||>
(<||>) :: Parser a -> Parser a -> Parser a
l <||> r = try l <|> r
choice' :: [Parser a] -> Parser a
choice' = choice . fmap try
betweenM :: Vector (Parser b) -> Parser a -> Parser [a]
betweenM vec p = case length vec of
0 -> pure []
1 -> head vec *> pure []
2 -> between (head vec) (last vec) ((\x -> [x]) <$> p)
_ -> (head vec) *> ((:) <$> p <*> betweenM (tail vec) p)
many1 :: Parser a -> Parser [a]
many1 p = (:) <$> p <*> many p
opName :: Operator -> Text
opName (Operator {..}) = case fixity of
Closed -> name
Prefix -> "_" <> name
Postfix -> name <> "_"
Infix _ -> "_" <> name <> "_"
where name = underscore (Vector.toList name_parts)
underscore [] = ""
underscore [x] = x
underscore (x:y:[]) = x <> "_" <> y
underscore (x:y:xs) = x <> "_" <> y <> "_" <> underscore xs
unscope :: Telescope -> RawCore
unscope (Tel core l) = go core l where
go :: RawCore -> [RawCore] -> RawCore
go core [] = core
go core (c:cs) = go (App core c) cs
sc :: Parser ()
sc = L.space
space1
(L.skipLineComment ";;")
(L.skipBlockComment "(;;" ";;)")
symbol :: Text -> Parser Text
symbol = L.symbol sc
I've written some tests for the above code, and have found that for some reason, right-associative operators are not parsing. I've got the following precedence graph:
_&_ right -> _=_ non -> _+_ left -> _! -> if_then_else_ -> (_)
_-_ left false
true
In code:
ops :: Map PrecedenceNode (Set PrecedenceNode)
ops = Map.fromList
[ (node_and, Set.fromList [node_eq])
, (node_eq, Set.fromList [node_arith, node_fact])
, (node_arith, Set.fromList [node_close])
, (node_fact, Set.fromList [node_close])
, (node_if, Set.fromList [node_close])
, (node_close, Set.fromList []) ]
where
node_and = Set.fromList [and_op]
node_eq = Set.fromList [eq_op]
node_arith = Set.fromList [add_op, sub_op]
node_fact = Set.fromList [fact_op]
node_if = Set.fromList [if_op]
node_close = Set.fromList [paren_op, true, false]
true = Operator Closed $ Vec.singleton "true"
false = Operator Closed $ Vec.singleton "false"
and_op = Operator (Infix RightAssociative) $ Vec.fromList ["&"]
eq_op = Operator (Infix NonAssociative) $ Vec.fromList ["="]
add_op = Operator (Infix LeftAssociative) $ Vec.fromList ["+"]
sub_op = Operator (Infix LeftAssociative) $ Vec.fromList ["-"]
fact_op = Operator Postfix $ Vec.fromList ["!"]
if_op = Operator Prefix $ Vec.fromList ["if", "then", "else"]
paren_op = Operator Closed $ Vec.fromList ["(", ")"]
When running the parser, I use transitive closure of this graph, i.e. runG ops (parse . closure)
. If I provide the string "true & false", it parses as simply true
. If I change the and_op
definition to be either NonAssociative
or LeftAssociative
, then the parser works just fine (i.e. it returns App 'true' 'false'
. My question then is: why, and how can I fix it?
Below are some sample strings and their parses - I have used ⋅
as a left-associative binary operator instead of App
to make things clearer.
"true" (var "true")
"false" (var "false")
"( true )" ((var "(_)") ⋅ (var "true"))
"true = false" ((var "_=_") ⋅ (var "true" ) ⋅ (var "false"))
"true + false" ((var "_+_") ⋅ (var "true" ) ⋅ (var "false"))
"true + false - false" ((var "_-_") ⋅ ((var "_+_") ⋅ (var "true" ) ⋅ (var "false")) ⋅ (var "false"))
And to any internet stranger who got this far - even if you don't answer this question, I'd like you to know I appreciate you taking the time to read my ramblings :)
I found an answer to my own question. Turns out it was an issue with one of the helper functions. My symmetric choice operator:
infixl 3 <||>
(<||>) :: Parser a -> Parser a -> Parser a
l <||> r = try l <|> r
Should have been defined:
infixl 3 <||>
(<||>) :: Parser a -> Parser a -> Parser a
l <||> r = try l <|> try r
I suppose if there's any lesson to be learned it's this: when using parser-combinators, pay attention to backtracking.