haskellhappyalex

Implementing eval() function with Happy parser generator


Using Alex and Happy on a tiny grammar for arithmetic expressions (only + and - among integers), I want to implement an eval() function, which parses as an Expr the string given as an argument.

Here's an example of the expected result:

$ stack run
1 + eval("3-2") 
Plus (ConstInt 1) (Minus (ConstInt 3) (ConstInt 2))

The content of eval was parsed correctly.

This question is about understanding a bit better why the way I implemented works (I would expect NOT to work!) and whether it should be done differently.

The Lexer:

{
module Lexer where

import Control.Monad (when)
}

%wrapper "monadUserState"

$digit = 0-9

tokens :-

<0> $white+     { skip }

<0> "("       { tok TokLParen }
<0> ")"       { tok TokRParen }
<0> "+"       { tok TokPlus }
<0> "-"       { tok TokMinus }

<0> "eval"    { tok TokBuiltinEvalExpr }

<0> $digit+                                             { tokInt }

<0> \"        { enterString `andBegin` string }
<string> \"   { exitString `andBegin` 0 }
<string> \\\\ { emit '\\' }
<string> \\\" { emit '"' }
<string> \\n  { emit '\n' }
<string> \\t  { emit '\t' }
<string> .    { emitCurrent }

{
data Token =
    TokLParen
  | TokRParen
  | TokPlus
  | TokMinus
  | TokBuiltinEvalExpr
  | TokConstInt Int
  | TokConstStr String
  | EOF
  deriving (Eq, Show)

data AlexUserState = AlexUserState {
    nestLevel :: Int
  , strStart :: AlexPosn
  , strBuffer :: String
} deriving (Show)

alexInitUserState :: AlexUserState
alexInitUserState = AlexUserState
  { nestLevel = 0
  , strStart = AlexPn 0 0 0
  , strBuffer = []
  }

alexModifyUserState :: (AlexUserState -> AlexUserState) -> Alex ()
alexModifyUserState f = f <$> alexGetUserState >>= alexSetUserState

alexEOF :: Alex Token
alexEOF = do
  (AlexPn _ line column, _, _, _) <- alexGetInput
  startCode <- alexGetStartCode
  when (startCode == string) $
    alexError $ "Error: unclosed string at line " <> show line <> ", column " <> show column
  pure EOF

tok :: Token -> AlexAction Token
tok ctor _ _ = pure ctor

tokInt :: AlexAction Token
tokInt (_, _, _, str) len =
  pure $ TokConstInt $ read $ take len str

enterString, exitString :: AlexAction Token
enterString inp@(pos, _, _, _) len = do
  alexModifyUserState $ \s -> s{strStart = pos, strBuffer = strBuffer s}
  skip inp len
exitString _ _ = do
  s <- alexGetUserState
  alexSetUserState s{strStart = AlexPn 0 0 0, strBuffer = []}
  pure $ TokConstStr $ reverse $ strBuffer s

emit :: Char -> AlexAction Token
emit c inp len = do
  alexModifyUserState $ \s -> s{strBuffer = c : strBuffer s}
  skip inp len

emitCurrent :: AlexAction Token
emitCurrent inp@(_, _, _, str) len = do
  alexModifyUserState $ \s -> s{strBuffer = head str : strBuffer s}
  skip inp len
}

The Parser:

{
module Parser where

import qualified Lexer as L
}

%name parser expr
%tokentype { L.Token }
%error { parseError }
%monad { L.Alex } { >>= } { pure }
%lexer { lexer } { L.EOF }

%token

'('     { L.TokLParen }
')'     { L.TokRParen }
'+'     { L.TokPlus }
'-'     { L.TokMinus }
'eval'  { L.TokBuiltinEvalExpr }

int   { L.TokConstInt $$ }
str   { L.TokConstStr $$ }

%left '+' '-'

%%

expr :: { Expr }
  : op_expr { $1 }
  | int { ConstInt $1 }
  | eval_expr { $1 }

inject_string :: { () }
  : str {% injectString $1 }

eval_expr :: { Expr }
  : 'eval' '(' inject_string ')' expr { $5 }

op_expr :: { Expr }
  : expr '+' expr { Plus $1 $3 }
  | expr '-' expr { Minus $1 $3 }


{
data Expr =
    ConstInt Int
  | Plus Expr Expr
  | Minus Expr Expr
  deriving (Show)

injectString :: String -> L.Alex ()
injectString s = do
  (a, b, c, t) <- L.alexGetInput
  L.alexSetInput (a, b, c, s ++ t)

parseError :: L.Token -> L.Alex a
parseError tok = do
  (L.AlexPn _ line column, _, _, _) <- L.alexGetInput
  L.alexError $ "Parse error. Unexpected token " ++ show tok ++ " at line " ++ show line <> ", column " <> show column

lexer :: (L.Token -> L.Alex a) -> L.Alex a
lexer = (=<< L.alexMonadScan)
}

My approach was:

It works, but in my opinion it should fail.

I repeat here the productions involved:

inject_string :: { () }
  : str {% injectString $1 }

eval_expr :: { Expr }
  : 'eval' '(' inject_string ')' expr { $5 }

With an initial input

1 + eval("3-2")

Before the inject_string production has been processed, Alex's input is:

"3-2")

After the inject_string production has been processed, Alex's input is:

3-2)

That is, an expression followed by a closed parenthesis. But rule eval_expr expects the opposite: ')' expr.

Why does that work?

If I turn the rule into what seems more logical to me:

eval_expr :: { Expr }
  : 'eval' '(' inject_string expr ')' { $4 }

then I get:

1 + eval("3-2") 
Parse error. Unexpected token TokRParen at line 1, column 16

My second question is whether there is a better approach to implement eval().

EDIT

Perhaps I understood why it works.

I said that after the inject_string production the input is:

3-2)

Because 3-2 is what I insert and ) is what is still to be read.

But ) is already consumed as a look-ahead token, isn't it? So the input after the inject_string production is just

3-2

I'm still confused as why the next ) token in the production doesn't fail then. Is the look-ahead token not just consumed but already matched with the ) in the rule, even before inject_string is processed?

My understanding was that it is only matched in a "non-persistent" way to be able to solve possible rule ambiguities, but that it would still need to be matched after the inject_string production is done. Is this what I got wrong?

The second question still stands. Can I do this better?


Solution

  • With respect to your first question, in monadic Happy parsers, the monadic actions are executed at the time of reduction. If you dump the debugging info for your Happy parser with:

    happy Parser.y -i
    

    and inspect the resulting Parser.info file, you'll find the following state:

    State 12
    
        inject_string -> str .                              (rule 4)
    
        ')'            reduce using rule 4
    

    When the parser is in state 12, having just read the string argument for an eval, it will reduce the inject_string non-terminal only AFTER reading an expected ) token. When this reduction happens Alex will have just finished reading the ) character, and your string will be injected at that point.

    With respect to your second question, you might want to consider the following issues. First, your parser happily accepts weird things like:

    $ stack exec main
    1+eval("3-")2
    Right (Plus (ConstInt 1) (Minus (ConstInt 3) (ConstInt 2)))
    1+eval("3-2")50
    Right (Plus (ConstInt 1) (Minus (ConstInt 3) (ConstInt 250)))
    

    which seems undesirable. Second, if you think about the purpose of having an eval function, it's to parse and evaluate a string at runtime, not parse time. In your current grammar, there's no way to generate strings that aren't just literals, but if you added syntax to convert numbers to strings with str and concatenate strings with, say, a . operator, then you'd ideally like something like the following to work:

    1+eval(str(4-3)."2".str(1+2))
    123
    

    and it doesn't seem reasonable to do string injection in the middle of the parser to accomplish this.

    I'd suggest an alternative design where eval("3-2") is parsed into an Eval constructor with the string intact:

    data Expr = ... | Eval String | ...
    

    (Future grammars could use StringExpr in place of String, to support more complicated string expressions.)

    Your evaluation routine, at runtime, should then re-invoke the Happy parser to parse the string into an expression for evaluation.

    If you really want to do this at parse time, I think you should take much the same approach. Don't fiddle with the parser's internal state. Instead, have your eval rule re-invoke the parser on the string:

    eval_expr :: { Expr }
      : 'eval' '(' str ')' {%
      case L.runAlex $3 parser of
        Right e -> pure e
        Left err -> L.alexError $ "while evaluting string " <> show $3 <> ", " ++ err }