regexparsinghaskellalternation

Does order really matter with <|> here in this parser?


The Monday Morning Haskell post Parsing Part 2: Applicative Parsing says this about alternation with regex-applicative:

Note that order matters! If we put the integer parser first, we’ll be in trouble! If we encounter a decimal, the integer parser will greedily succeed and parse everything before the decimal point. We'll either lose all the information after the decimal, or worse, have a parse failure.

Referring to this function from their Git repository:

numberParser :: RE Char Value
numberParser = (ValueNumber . read) <$>
  (negativeParser <|> decimalParser <|> integerParser)
  where
    integerParser = some (psym isNumber)
    decimalParser = combineDecimal <$> many (psym isNumber) <*> sym '.' <*> some (psym isNumber)
    negativeParser = (:) <$> sym '-' <*> (decimalParser <|> integerParser)

    combineDecimal :: String -> Char -> String -> String
    combineDecimal base point decimal = base ++ (point : decimal)

However, I can't figure out why that would be so. When I change decimalParser <|> integerParser to integerParser <|> decimalParser, it still seems like it always parses the right thing (in particular, I did that and ran stack test, and their tests all still passed). The decimal parser must have a decimal point, and the integer parser can't have one, so it will stop parsing there, resulting in the decimal point making the next piece of the parse fail, backtracking us back to the decimal parser. It seems like the only case this wouldn't occur in would be if the next part of the overall parser after this one could accept the decimal point (making it an ambiguous grammar), but you still wouldn't "lose all the information after the decimal, or worse, have a parse failure". Is my reasoning correct and this a mistake in that article, or is there a case I'm not seeing where one of their outcomes could happen?


Solution

  • There is a difference, and it matters, but part of the reason is that the rest of the parser is quite fragile.

    When I change decimalParser <|> integerParser to integerParser <|> decimalParser, it still seems like it always parses the right thing (in particular, I did that and ran stack test, and their tests all still passed).

    The tests pass because the tests don't cover this part of the parser (the closest ones only exercise stringParser).

    Here's a test that currently passes, but wouldn't if you swapped those parsers (stick it in test/Spec.hs and add it to the do block under main):

    badex :: Spec
    badex = describe "Bad example" $ do
      it "Should fail" $
        shouldMatch
          exampleLineParser
          "| 3.4 |\n"
          [ ValueNumber 3.4 ]
    

    If you swap the parsers, you get as a result ValueNumber 3.0: the integerParser (which is now first) succeeds parsing 3, but then the rest of the input gets discarded.

    To give more context, we have to see where numberParser is used:

    1. numberParser is one of the alternatives of valueParser...
    2. which is used in exampleLineParser, where valueParser is followed by readThroughBar (and I mean the relevant piece of code is literally valueParser <* readThroughBar);
    3. readThroughBar discards all characters until the next vertical bar (using many (psym (\c -> c /= '|' && c /= '\n'))).

    So if valueParser succeeds parsing just 3, then the subsequent readThroughBar will happily consume and discard the rest .4 |.

    The explanation from the blogpost you quote is only partially correct:

    Note that order matters! If we put the integer parser first, we’ll be in trouble! If we encounter a decimal, the integer parser will greedily succeed and parse everything before the decimal point. We'll either lose all the information after the decimal, or worse, have a parse failure.

    (emphasis mine) You will only lose information if your parser actively discards it, which readThroughBar does here.

    As you already suggested, the backtracking behavior of RE means that the noncommutativity of <|> really only matters for correctness with ambiguous syntaxes (it might still have an effect on performance in general), which would not be a problem here if readThroughBar were less lenient, e.g., by consuming only whitespace before |.

    I think that shows that using psym with (/=) is at least a code smell, if not a clear antipattern. By only looking for the delimiter without restricting the characters in the middle, it makes it hard to catch mistakes where the preceding parser does not consume as much input as it should. A better alternative is to ensure that the consumed characters may contain no meaningful information, for example, requiring them to all be whitespace.