Using Alex and Happy on a tiny grammar for arithmetic expressions (only +
and -
among integers), I want to implement an eval()
function, which parses as an Expr
the string given as an argument.
Here's an example of the expected result:
$ stack run
1 + eval("3-2")
Plus (ConstInt 1) (Minus (ConstInt 3) (ConstInt 2))
The content of eval
was parsed correctly.
This question is about understanding a bit better why the way I implemented works (I would expect NOT to work!) and whether it should be done differently.
The Lexer:
{
module Lexer where
import Control.Monad (when)
}
%wrapper "monadUserState"
$digit = 0-9
tokens :-
<0> $white+ { skip }
<0> "(" { tok TokLParen }
<0> ")" { tok TokRParen }
<0> "+" { tok TokPlus }
<0> "-" { tok TokMinus }
<0> "eval" { tok TokBuiltinEvalExpr }
<0> $digit+ { tokInt }
<0> \" { enterString `andBegin` string }
<string> \" { exitString `andBegin` 0 }
<string> \\\\ { emit '\\' }
<string> \\\" { emit '"' }
<string> \\n { emit '\n' }
<string> \\t { emit '\t' }
<string> . { emitCurrent }
{
data Token =
TokLParen
| TokRParen
| TokPlus
| TokMinus
| TokBuiltinEvalExpr
| TokConstInt Int
| TokConstStr String
| EOF
deriving (Eq, Show)
data AlexUserState = AlexUserState {
nestLevel :: Int
, strStart :: AlexPosn
, strBuffer :: String
} deriving (Show)
alexInitUserState :: AlexUserState
alexInitUserState = AlexUserState
{ nestLevel = 0
, strStart = AlexPn 0 0 0
, strBuffer = []
}
alexModifyUserState :: (AlexUserState -> AlexUserState) -> Alex ()
alexModifyUserState f = f <$> alexGetUserState >>= alexSetUserState
alexEOF :: Alex Token
alexEOF = do
(AlexPn _ line column, _, _, _) <- alexGetInput
startCode <- alexGetStartCode
when (startCode == string) $
alexError $ "Error: unclosed string at line " <> show line <> ", column " <> show column
pure EOF
tok :: Token -> AlexAction Token
tok ctor _ _ = pure ctor
tokInt :: AlexAction Token
tokInt (_, _, _, str) len =
pure $ TokConstInt $ read $ take len str
enterString, exitString :: AlexAction Token
enterString inp@(pos, _, _, _) len = do
alexModifyUserState $ \s -> s{strStart = pos, strBuffer = strBuffer s}
skip inp len
exitString _ _ = do
s <- alexGetUserState
alexSetUserState s{strStart = AlexPn 0 0 0, strBuffer = []}
pure $ TokConstStr $ reverse $ strBuffer s
emit :: Char -> AlexAction Token
emit c inp len = do
alexModifyUserState $ \s -> s{strBuffer = c : strBuffer s}
skip inp len
emitCurrent :: AlexAction Token
emitCurrent inp@(_, _, _, str) len = do
alexModifyUserState $ \s -> s{strBuffer = head str : strBuffer s}
skip inp len
}
The Parser:
{
module Parser where
import qualified Lexer as L
}
%name parser expr
%tokentype { L.Token }
%error { parseError }
%monad { L.Alex } { >>= } { pure }
%lexer { lexer } { L.EOF }
%token
'(' { L.TokLParen }
')' { L.TokRParen }
'+' { L.TokPlus }
'-' { L.TokMinus }
'eval' { L.TokBuiltinEvalExpr }
int { L.TokConstInt $$ }
str { L.TokConstStr $$ }
%left '+' '-'
%%
expr :: { Expr }
: op_expr { $1 }
| int { ConstInt $1 }
| eval_expr { $1 }
inject_string :: { () }
: str {% injectString $1 }
eval_expr :: { Expr }
: 'eval' '(' inject_string ')' expr { $5 }
op_expr :: { Expr }
: expr '+' expr { Plus $1 $3 }
| expr '-' expr { Minus $1 $3 }
{
data Expr =
ConstInt Int
| Plus Expr Expr
| Minus Expr Expr
deriving (Show)
injectString :: String -> L.Alex ()
injectString s = do
(a, b, c, t) <- L.alexGetInput
L.alexSetInput (a, b, c, s ++ t)
parseError :: L.Token -> L.Alex a
parseError tok = do
(L.AlexPn _ line column, _, _, _) <- L.alexGetInput
L.alexError $ "Parse error. Unexpected token " ++ show tok ++ " at line " ++ show line <> ", column " <> show column
lexer :: (L.Token -> L.Alex a) -> L.Alex a
lexer = (=<< L.alexMonadScan)
}
My approach was:
expr
productionIt works, but in my opinion it should fail.
I repeat here the productions involved:
inject_string :: { () }
: str {% injectString $1 }
eval_expr :: { Expr }
: 'eval' '(' inject_string ')' expr { $5 }
With an initial input
1 + eval("3-2")
Before the inject_string
production has been processed, Alex's input is:
"3-2")
After the inject_string
production has been processed, Alex's input is:
3-2)
That is, an expression followed by a closed parenthesis.
But rule eval_expr
expects the opposite: ')' expr
.
Why does that work?
If I turn the rule into what seems more logical to me:
eval_expr :: { Expr }
: 'eval' '(' inject_string expr ')' { $4 }
then I get:
1 + eval("3-2")
Parse error. Unexpected token TokRParen at line 1, column 16
My second question is whether there is a better approach to implement eval()
.
EDIT
Perhaps I understood why it works.
I said that after the inject_string
production the input is:
3-2)
Because 3-2
is what I insert and )
is what is still to be read.
But )
is already consumed as a look-ahead token, isn't it?
So the input after the inject_string
production is just
3-2
I'm still confused as why the next )
token in the production doesn't fail then. Is the look-ahead token not just consumed but already matched with the )
in the rule, even before inject_string
is processed?
My understanding was that it is only matched in a "non-persistent" way to be able to solve possible rule ambiguities, but that it would still need to be matched after the inject_string
production is done. Is this what I got wrong?
The second question still stands. Can I do this better?
With respect to your first question, in monadic Happy parsers, the monadic actions are executed at the time of reduction. If you dump the debugging info for your Happy parser with:
happy Parser.y -i
and inspect the resulting Parser.info
file, you'll find the following state:
State 12
inject_string -> str . (rule 4)
')' reduce using rule 4
When the parser is in state 12, having just read the string argument for an eval
, it will reduce the inject_string
non-terminal only AFTER reading an expected )
token. When this reduction happens Alex
will have just finished reading the )
character, and your string will be injected at that point.
With respect to your second question, you might want to consider the following issues. First, your parser happily accepts weird things like:
$ stack exec main
1+eval("3-")2
Right (Plus (ConstInt 1) (Minus (ConstInt 3) (ConstInt 2)))
1+eval("3-2")50
Right (Plus (ConstInt 1) (Minus (ConstInt 3) (ConstInt 250)))
which seems undesirable. Second, if you think about the purpose of having an eval
function, it's to parse and evaluate a string at runtime, not parse time. In your current grammar, there's no way to generate strings that aren't just literals, but if you added syntax to convert numbers to strings with str
and concatenate strings with, say, a .
operator, then you'd ideally like something like the following to work:
1+eval(str(4-3)."2".str(1+2))
123
and it doesn't seem reasonable to do string injection in the middle of the parser to accomplish this.
I'd suggest an alternative design where eval("3-2")
is parsed into an Eval
constructor with the string intact:
data Expr = ... | Eval String | ...
(Future grammars could use StringExpr
in place of String
, to support more complicated string expressions.)
Your evaluation routine, at runtime, should then re-invoke the Happy parser to parse the string into an expression for evaluation.
If you really want to do this at parse time, I think you should take much the same approach. Don't fiddle with the parser's internal state. Instead, have your eval
rule re-invoke the parser on the string:
eval_expr :: { Expr }
: 'eval' '(' str ')' {%
case L.runAlex $3 parser of
Right e -> pure e
Left err -> L.alexError $ "while evaluting string " <> show $3 <> ", " ++ err }