parsingf#abstract-syntax-treerecursive-descent

Parsing nested parentheses in F# using recursive descent


I'm trying to make a recursive descent parser in F#. I looked at http://www.fssnip.net/bM but this type of parser uses a string instead of a list.

I'm struggling with parsing parentheses, and especially nested parentheses. There are a lot of edge cases where the parser fails.

I used the following data type to represent the tokens.

type Token =
    | ParOpen
    | ParClose
    | Identifier of string
    | IntLiteral of int
    | Add

let example = [ParOpen;IntLiteral 3;Token.Add;IntLiteral 5;ParClose]

The data type below is used to represent nodes in the AST. It is a little bit like production rules.

type Expression =
  | Add of Expression * Expression
  | Par of Expression
  | Constant of int
  | Leaf //not used currently

The function below can be used for parsing "3 + 5" for instance. But it does not work for most cases when parsing parentheses. For example "(3 + 5) + 1" will fail.

let rec parse listOfTokens = 
    match listOfTokens with
    | IntLiteral i::Token.Add::list -> Add(Constant i, parse list)
    | IntLiteral i::[] -> Constant i
    | ParOpen::list -> Par(parse list)
    | IntLiteral i::ParClose::[] -> Constant i
    | _ -> failwith "Unexpected token"

let example = [ParOpen;IntLiteral 3;Token.Add;IntLiteral 5;ParClose;Token.Add;IntLiteral 1]

The parse function produces an AST which can be evaluated using the following function. But it does not always calculate correctly.

let rec evaluateExpression exp =
    match exp with
    | Add(x, y) -> evaluateExpression(x) + evaluateExpression(y)
    | Par(expression) -> evaluateExpression(expression)
    | Constant i -> i 
    | _ -> failwith "Unexpected expression"

A big problem is that pattern matching with lists allows me to only look at a few tokens at a time. For instance:

match tokens with
| ParOpen::exp::ParClose -> Par(parse exp) //exp is just one token right now, this does not work

I could implement a function which would filter/remove tokens after a ParClose is detected. Or is there an easier/better way to solve this problem?

Any advice would be appreciated. Or maybe you have a useful link?


Solution

  • You're trying to pack too much into this one feeble function. You're trying to do everything all at once, as a single, simple recursive function, but the problem is inherently more complex, which is what you're discovering in the form of "too many edge cases". As a general rule, if you have too many edge cases, it means you need to redefine your edges.

    The key insight here is that a parsing function shouldn't have a simple signature Token list -> Expression. Instead, the parsing function should:


    Let's try to write down the type of this function, just so we have it in front of our eyes and can go back to it for reference:

    type ParseError = string  // Error is going to be just a string for now
    type ParseFn<'a> = Token list -> Result< ('a * Token list), ParseError >
    

    So here you can see that ParserFn is:

    With that signature in mind, let's try to implement the simplest parsing function that can be - parsing an "open paren" token:

    let parseParOpen tokens = 
        match tokens with
        | ParOpen::rest -> Ok (ParOpen, rest)
        | first::_ -> Error $"expected ParOpen, got {first}"
        | [] -> Error $"expected ParOpen, got EOF"
    

    Let's test it out:

    > parseParOpen [ParOpen; Add]
    Ok (ParOpen, [Add])
    
    > parseParOpen [Add]
    Error "expected ParOpen, got Add"
    
    > parseParOpen []
    Error "expected ParOpen, got EOF"
    

    Nice! Now lets implement a parser for ParClose:

    let parseParClose tokens = 
        match tokens with
        | ParClose::rest -> Ok (ParClose, rest)
        | first::_ -> Error $"expected ParClose, got {first}"
        | [] -> Error $"expected ParClose, got EOF"
    

    Hmmm, that seems awfully repetitive, doesn't it? Can we extract the common parts? Of course we can!

    let parseToken t tokens = 
        match tokens with
        | first::rest when first = t -> Ok (t, rest)
        | first::_ -> Error $"expected {t}, got {first}"
        | [] -> Error $"expected {t}, got EOF"
    

    The parseToken function takes the token it's supposed to parse as a parameter, and returns a ParserFn<Token> as a result. Let's try it out:

    > parseToken ParOpen [ParOpen; Add; ParClose]
    Ok (ParOpen, [Add; ParClose])
    
    > parseToken ParClose [ParOpen; Add; ParClose]
    Error "expected ParClose, got ParOpen"
    
    > parseToken ParClose [ParClose; Add]
    Ok (ParClose, [Add])
    

    Even nicer!

    How about parsing a literal? That's a bit more tricky: we can't just call parseToken (IntLiteral 3), because that would only parse the literal 3, but error out on literal 5. So what to do? Well, this is a legitimate case of requiring a special function for parsing literals. And in general, when designing parsers, you will more or less find a separate function for every rule in your grammar. Sometimes they can be parametrized, like parseParOpen and parseParClose, but in general they can't.

    So let's parse a literal:

    let parseIntLiteral tokens =
        match tokens with
        | (IntLiteral i)::rest -> Ok (i, rest)
        | first::_ -> Error $"expected an int literal, got {first}"
        | [] -> Error $"expected an int literal, got EOF"
    

    (note that even this can be combined, to an extent, with parseToken, to avoid repeating the two error lines; but I won't go into that, since this answer is already too long)

    Let's test it out:

    > parseIntLiteral [IntLiteral 5; Add; ParClose]
    Ok (5, [Add; ParClose])
    
    > parseIntLiteral [ParOpen; Add; ParClose]
    Error "expected an int literal, got ParOpen"
    

    Now, how would we go about parsing several things in succession - like, for example, "3 + 5"? Well, that should be pretty straightforward: first parse an int, then a plus sign, then an int again. Let's try that out:

    let parseAddition tokens = 
        match parseIntLiteral tokens with
        | Error e -> Error e
        | Ok (x, tokens1) ->
            match parseToken Add tokens1 with
            | Error e -> Error e
            | Ok (_, tokens2) ->
                match parseIntLiteral tokens2 with
                | Error e -> Error e
                | Ok (y, rest) -> Ok (Expression.Add (Constant x, Constant y), rest)
    

    Wow, that's what one calls a "Pyramid of Doom"! This one has just three levels, but imagine if we got to more complex stuff!

    So what to do? Can we, perhaps, extract some common parts? Oh yes we can! Notice how it's the same pattern at all three levels: first we apply some parser to the incoming tokens, then if it's an error, we bail right away, otherwise we apply the next parser, rinse and repeat. We can surely capture that pattern as a function:

    let applyNextParser firstParser nextParser tokens =
        match firstParser tokens with
        | Error e -> Error e
        | Ok (r, rest) -> nextParser r rest
    

    Note how firstParser is just a parser, but nextParser is a function that takes the first parser's result and returns another parser, which we then apply to the rest tokens.

    let parseAddition tokens =
        let combinedParser =
            applyNextParser parseIntLiteral (fun x ->
                applyNextParser (parseToken Add) (fun _ ->
                    applyNextParser parseIntLiteral (fun y ->
                        fun restTokens -> Ok (Expression.Add (Constant x, Constant y), restTokens)
                    )
                )
            )
          
        combinedParser tokens
    

    One thing to note about this is that applyNextParser returns a parsing function, which we store in the combinedParser variable and then call it with our parameter tokens. Of course we can just get rid of the intermediate variable:

    let parseAddition =
        applyNextParser parseIntLiteral (fun x ->
            applyNextParser (parseToken Add) (fun _ ->
                applyNextParser parseIntLiteral (fun y ->
                    fun restTokens -> Ok (Expression.Add (Constant x, Constant y), restTokens)
                )
            )
        )
    

    Another thing to note is the expression in the very middle - it's a function that takes restTokens and returns Ok without consuming any of those restTokens. Such function can also be looked at as a kind of parser. It's a parser that doesn't consume any input, but already has a ready parsing result for you. This will be very useful as we proceed, so let's extract this pattern as well:

    let constParser c tokens = Ok (c, tokens)
    

    And then:

    let parseAddition =
        applyNextParser parseIntLiteral (fun x ->
            applyNextParser (parseToken Add) (fun _ ->
                applyNextParser parseIntLiteral (fun y ->
                    constParser (Expression.Add (Constant x, Constant y))
                )
            )
        )
    

    Ok, this is much nicer than the original Pyramid of Doom, but still pretty pyramidal. Can we do any better? Oh yes we can, if only we make applyNextParser into an operator:

    let (>>=) firstParser nextParser tokens =
        match firstParser tokens with
        | Error e -> Error e
        | Ok (r, rest) -> nextParser r rest
    

    And then:

    let parseAddition =
        parseIntLiteral >>= (fun x ->
            parseToken Add >>= (fun _ ->
                parseIntLiteral >>= (fun y ->
                    constParser (Expression.Add (Constant x, Constant y))
                )
            )
        )
    

    What's that? Not much better you say? But wait! Now that it's an operator, the F# syntax allows us to remove the parentheses:

    let parseAddition =
        parseIntLiteral >>= fun x ->
            parseToken Add >>= fun _ ->
                parseIntLiteral >>= fun y ->
                    constParser (Expression.Add (Constant x, Constant y))
    

    What, still pyramidal? But wait again! The F# syntax also allows us to get rid of the indentation:

    let parseAddition =
        parseIntLiteral >>= fun x ->
        parseToken Add >>= fun _ ->
        parseIntLiteral >>= fun y ->
        constParser (Expression.Add (Constant x, Constant y))
    

    And behold! Now it almost looks like we're "assigning" the result of each parser to a variable. The first parseIntLiteral is "assigned" to x, the second parseIntLiteral is "assigned" to y, and then the ultimate result combines x and y to get a new parsing result. We finally have a sane way to combine multiple parsers in a sequence!

    Let's test it out:

    > parseAddition [IntLiteral 3; Add; IntLiteral 5]
    Ok (Add (Constant 3, Constant 5), [])
    
    > parseAddition [IntLiteral 3; ParOpen; IntLiteral 5]
    Error "expected Add, got ParOpen"
    
    > parseAddition [Add; IntLiteral 5]
    Error "expected an int literal, got Add"
    

    Phew! Now, finally, we can parse the coveted parenthesized expression:

    let parseParens =
        parseToken ParOpen >>= fun _ ->
        parseAddition >>= fun exp ->
        parseToken ParClose >>= fun _ ->
        constParser exp
    
    > parseParens [ParOpen; IntLiteral 3; Add; IntLiteral 5; ParClose]
    Ok (Add (Constant 3, Constant 5), [])
    
    > parseParens [ParOpen; IntLiteral 3; Add; IntLiteral 5]
    Error "expected ParClose, got EOF"
    

    Ok, that works, but wait! Does this mean we can only parse parentheses? What about an expression without any parentheses? Well, here we have finally come to the question of trying out several parsers in sequence and seeing which one works. For you see, your expression grammar actually has a hidden structure: it could either be parenthesized or not, and parenthesized should take precedence. That is, we first try to parse a parenthesized expression, and only if that doesn't work, we should fall back to the "regular" one.

    So, just like before, let's try to just do this:

    let parseExpression tokens =
        match parseParens tokens with
        | Ok (exp, rest) -> Ok (exp, rest)
        | Error _ ->
            match parseAddition tokens with
            | Ok (exp, rest) -> Ok (exp, rest)
            | Error e -> Error e
    

    Here, we first try parseParens, and only if that fails, we go back and try parseAddition. This works, but once again, we got a Pyramid of Doom. Sure it's tiny and cute right now, but imagine adding a few more alternatives. Except this time, the pyramid grows out of the Error cases, whereas last time it was the Ok cases. But no matter: just like last time, we can extract the pyramid aspect into a handy operator:

    let (<|>) p1 p2 tokens =
        match p1 tokens with
        | Ok (result, rest) -> Ok (result, rest)
        | Error _ ->
            match p2 tokens with
            | Ok (result, rest) -> Ok (result, rest)
            | Error e -> Error e
    
    let parseExpression =
        parseParens <|> parseAddition
    

    Boom! Now we can clearly see that an "expression" is either "parens" or "addition". Readable, eh?

    (note that our definition of <|> swallows the first error and only preserves the second one; this answer is already way too long, so I'll leave the logic of combining the errors as an exercise)

    Ok, let's test it out:

    > parseExpression [ParOpen; IntLiteral 3; Add; IntLiteral 5; ParClose]
    Ok (Add (Constant 3, Constant 5), [])
    
    > parseExpression [IntLiteral 3; Add; IntLiteral 5]
    Ok (Add (Constant 3, Constant 5), [])
    

    Nice! Let's test it out some more:

    > parseExpression [IntLiteral 3]
    Error "expected Add, got EOF"
    

    Oh. Oh my. What happened here?

    Well, the program behaves exactly as we told it to, actually: an expression is either a parenthesized addition or a naked addition. A single int literal is not allowed. That's what we wrote, so that's what the program does. In fact, we need to do some more work to define what our expression actually is:

    Note 1: these definitions are mutually recursive. They have to be, because you want sub-expressions inside parens, and parens as part of expressions.
    Note 2: I had to invent a new concept - TERM. This is required to avoid making the grammar infinitely recursive: because EXPRESSION can be an ADDITION, we can't have ADDITION itself start with an EXPRESSION. This is the kind of subtlety you only get with experience (or formal education).
    Note 3: I left out Identifier: you're not talking about it at all, so I'll leave it as an exercise.

    So now that we have these rules, let's try to rewrite our parsers. They will have to be mutually recursive of course, and we can follow the above English definitions very closely, more or less translating them into F# word for word:

    let rec parseExpression =
        parseAddition <|> parseTerm
    
    and parseTerm =
        parseParens
        <|> (parseIntLiteral >>= fun i -> constParser (Constant i))
    
    and parseAddition =
        parseTerm >>= fun x ->
        parseToken Add >>= fun _ ->
        parseExpression >>= fun y ->
        constParser (Expression.Add (x, y))
    
    and parseParens =
        parseToken ParOpen >>= fun _ ->
        parseExpression >>= fun exp ->
        parseToken ParClose >>= fun _ ->
        constParser exp
    

    Let's test it out:

    > parseExpression [ParOpen; IntLiteral 3; Add; IntLiteral 5; ParClose]
    Ok (Add (Constant 3, Constant 5), [])
    
    > parseExpression [ParOpen; IntLiteral 3; Add; IntLiteral 5; ParClose; Add; IntLiteral 8]
    Ok (Add (Add (Constant 3, Constant 5), Constant 8), [])
    
    > parseExpression [IntLiteral 3; Add; IntLiteral 5; Add; IntLiteral 8]
    Ok (Add (Constant 3, Add (Constant 5, Constant 8)), [])
    
    > parseExpression [IntLiteral 3]
    Ok (Constant 3, [])
    
    > parseExpression [ParOpen; IntLiteral 3; ParClose]
    Ok (Constant 3, [])
    
    > parseExpression [ParOpen; IntLiteral 3; Add; IntLiteral 5; ParClose; Add; ParOpen; IntLiteral 8; Add; ParOpen; IntLiteral 1; Add; IntLiteral 10; ParClose; ParClose];;
    Ok
      (Add
         (Add (Constant 3, Constant 5),
          Add (Constant 8, Add (Constant 1, Constant 10))), [])
    

    And in conclusion, a few general notes: