performanceparsingtime-complexityidrisparser-combinators

Why does this expression parser scale so bad in the number of (some?) rules?


I'm trying to use Idris 2's Text.Parser library to parse a pre-tokenized byte stream. I wrote the following utility function in the style of Parsec's expression parser:

module Text.Parser.Expression

import Text.Parser

public export
data Assoc
   = AssocNone
   | AssocLeft
   | AssocRight

public export
data Op state k a
  = Prefix (Grammar state k True (a -> a))
  | Infix (Grammar state k True (a -> a -> a)) Assoc

public export
OpTable : Type -> Type -> Type -> Type
OpTable state k a = List (List (Op state k a))

public export
expressionParser :
  OpTable state k a ->
  Grammar state k True a ->
  Grammar state k True a
expressionParser table term = foldl level term table
  where
    level : Grammar state k True a -> List (Op state k a) -> Grammar state k True a
    level factor ops = choiceMap toP ops <|> factor
      where
        toP : Op state k a -> Grammar state k True a
        toP (Infix op AssocNone) = do
          x <- factor
          f <- op
          y <- factor
          pure $ f x y
        toP (Infix op AssocLeft) = do
          x <- factor
          fs <- some (flip <$> op <*> factor)
          pure $ foldl (flip ($)) x fs
        toP (Infix op AssocRight) = do
          fs <- some (factor >>= \x => op <*> pure x)
          y <- factor
          pure $ foldr ($) y fs
        toP (Prefix op) = op <*> factor

For certain inputs, this seems to scale really badly with the number of operator definitions. Here's a cut-down example:

public export
Number : Type
Number = Double

public export
data Fun
  = IntFun
  | Rnd

public export
data BinOp
   = Eq
   | NEq
   | LT
   | LE
   | GT
   | GE
   | Plus
   | Minus
   | Mul
   | And
   | Or

public export
data Expr
  = NumLitE Number
  | Bin BinOp Expr Expr
  | FunE Fun (List1 Expr)

public export
Show Fun where
  show IntFun = "INT"
  show Rnd = "RND"

public export
Show BinOp where
  show Eq = "="
  show NEq = "<>"
  show LT = "<"
  show LE = "<="
  show GT = ">"
  show GE = ">="
  show Plus = "+"
  show Minus = "-"
  show Mul = "*"
  show And = "AND"
  show Or = "OR"

public export
Show Expr where
  show (NumLitE n) = show n
  show (Bin op x y) = unwords [show x, show op, show y]
  show (FunE f args) = show f ++ "(" ++ show args ++ ")"

bits8 : Bits8 -> Grammar state Bits8 True ()
bits8 x = terminal ("Byte " ++ show x) $ \x' => if x == x' then Just () else Nothing

lexeme : {c : Bool} -> Grammar state Bits8 c a -> Grammar state Bits8 c a
lexeme p = afterMany (bits8 0x20) p

comma : Grammar state Bits8 True ()
comma = lexeme $ bits8 0x2c

parens : {c : Bool} -> Grammar state Bits8 c a -> Grammar state Bits8 True a
parens = between (lexeme $ bits8 0x28) (lexeme $ bits8 0x29)

digit : Grammar state Bits8 True Bits8
digit = terminal "digit" $ \x =>
  toMaybe (0x30 <= x && x <= 0x39) x

digitLit : (Num a) => Grammar state Bits8 True a
digitLit = fromInteger . cast . (\x => x - 0x30) <$> digit

numLit : (Num a, Neg a) => Grammar state Bits8 True a
numLit {a} = fromDigits <$> lexeme sign <*> lexeme (some digitLit)
  where
    fromDigits : Bool -> List1 a -> a
    fromDigits neg =
      (if neg then negate else id) .
      foldl (\x => \y => x * 10 + y) (the a 0)

    sign : Grammar state Bits8 False Bool
    sign = option False $ True <$ bits8 0xab

expr : Grammar state Bits8 True Expr
expr = expressionParser table term <|> fail "expression"
  where
    table : List (List (Op state Bits8 Expr))
    table =
      [ [ Infix (lexeme $ Bin Mul <$ bits8 0xac) AssocLeft
        ]
      , [ Infix (lexeme $ Bin Plus  <$ bits8 0xaa) AssocLeft
        , Infix (lexeme $ Bin Minus <$ bits8 0xab) AssocLeft
        ]
      , -- This next group is the one I will keep shrinking
        [ Infix (lexeme $ Bin Eq  <$ bits8 0xb2) AssocNone
        , Infix (lexeme $ Bin NEq <$ (bits8 0xb3 *> bits8 0xb1)) AssocNone
        , Infix (lexeme $ Bin GE  <$ (bits8 0xb1 *> bits8 0xb2)) AssocNone
        , Infix (lexeme $ Bin GT  <$ bits8 0xb1) AssocNone
        , Infix (lexeme $ Bin LE  <$ (bits8 0xb3 *> bits8 0xb2)) AssocNone
        , Infix (lexeme $ Bin LT  <$ bits8 0xb3) AssocNone
        ]
      , [ Infix (lexeme $ Bin And <$ bits8 0xaf) AssocLeft
        , Infix (lexeme $ Bin Or  <$ bits8 0xb0) AssocLeft
        ]
      ]

    fun : Grammar state Bits8 True Fun
    fun = lexeme $ choice
      [ IntFun  <$ bits8 0xb5
      , Rnd     <$ bits8 0xbb
      ]

    term : Grammar state Bits8 True Expr
    term =
          NumLitE <$> numLit
      <|> FunE <$> fun <*> parens (sepBy1 comma expr)
      <|> parens expr

For measurement, I have tried parsing [181,40,40,187,40,49,41,172,51,41,170,49,41] while removing the parsing rules for Eq, NEq, ..., Lt. Here is the user time of parsing the above list of bytes with the number of rules not commented out in that parsing rule group:

n usr (seconds)
1 0.41
2 1.56
3 4.67
4 13.92
5 25.71
6 45.92

What is going on here?


Solution

  • I fixed this by copying more of Parsec's design. As can be seen at that link, the important idea is to parse a leading term just once, and then parse a chain of associative operators and operands following it. This avoids repeated re-parsing of higher-precedence terms, which is what's causing the slowdown in the code in the question.