I'm parsing a language which has both `<` and `<<`. In my Alex definition I've got something that contains something like

```
tokens :-
  "<"  { token Lt          }
  "<<" { token (BinOp Shl) }
```

so whenever I encounter `<<`, it gets tokenized as a left shift and not as two less-thans (Alex always prefers the longest match). This is generally a good thing, since I end up throwing out whitespace after tokenization and want to differentiate between `1 < < 2` and `1 << 2`.
However, there are other times I wish `<<` had been read as two `<`. For example, I have things like

```
<<A>::B>
```

which I want read like

```
< < A > :: B >
```

Obviously I can try to adjust my Happy parser rules to accommodate the extra cases, but that scales badly. In other, imperative parser generators, I might try to push back "part" of the token (something like `push_back("<")` when I encountered `<<` but only needed `<`).
Has anyone else had such a problem and, if so, how did you deal with it? Are there ways of "pushing back" tokens in Happy? Or should I instead try to keep a whitespace token around? I'm actually leaning towards the last alternative - although a huge headache, it would let me deal with `<<` by just making sure there is no whitespace between the two `<`.
While I initially went with @Jon's answer, I ended up running into a variety of precedence-related issues (think of the precedence around `expr < expr` vs. `expr << expr`) which caused me a lot of headaches. I recently (successfully) went back to lexing `<<` as one token. The solution was twofold:
First, I bit the bullet and added extra rules for `<<` (where previously I only had rules for `<`). For the example in the question (`<<A>::B>`), my rule went from something like

```
ty_qual_path
  : '<' ty_sum '>' '::' ident
```

to

```
ty_qual_path
  : '<' ty_sum '>' '::' ident
  | '<<' ty_sum '>' '::' ident '>' '::' ident
```

so that a leading `<<` does double duty as the opening bracket of both the outer and the inner qualified path. (The actual rule was a bit more involved, but that is beyond the scope of this answer.)
Second, I found a clever way to deal with tokens that start with `>` (these would cause problems around things like `vector<i32,vector<i32>>`, where the trailing `>>` gets lexed as one token): use a threaded lexer (section 2.5.2 of the Happy manual), exploit the `{%% ... }` RHS of rules, which lets you reconsider the lookahead token, and add a `pushToken` facility to my parser monad (this turned out to be quite simple; a sketch of the idea is below, after the rule). I then added a dummy rule - something like
```
gt :: { () }
  : {- empty -}   {%% \tok ->
      case tok of
        Tok ">>"  -> pushToken (Tok ">")  *> pushToken (Tok ">")   -- '>>'  becomes '>' '>'
        Tok ">="  -> pushToken (Tok "=")  *> pushToken (Tok ">")   -- '>='  becomes '>' '='
        Tok ">>=" -> pushToken (Tok ">=") *> pushToken (Tok ">")   -- '>>=' becomes '>' '>='
        _         -> pushToken tok                                 -- anything else: put it back unchanged
      }
```
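The `pushToken` facility itself is small. The following is a minimal sketch of the idea rather than my actual code - it assumes a plain state monad, and a pre-made token list stands in for the real Alex input (names like `PState` and `popToken` are illustrative):

```haskell
import Control.Monad.State

newtype Tok = Tok String deriving (Eq, Show)

-- Parser state: tokens that have been pushed back (most recently
-- pushed first), plus the remaining input stream.
data PState = PState { pushedBack :: [Tok], input :: [Tok] }

type P = State PState

-- Make a token the next one the lexer will hand out.
pushToken :: Tok -> P ()
pushToken tok = modify $ \s -> s { pushedBack = tok : pushedBack s }

-- Prefer pushed-back tokens; otherwise take the next token from
-- the (stand-in) input stream.
popToken :: P Tok
popToken = do
  s <- get
  case (pushedBack s, input s) of
    (tok : toks, _)  -> put s { pushedBack = toks } >> return tok
    ([], tok : toks) -> put s { input = toks }      >> return tok
    ([], [])         -> return (Tok "<eof>")

-- The continuation-passing lexer that Happy threads through the parse,
-- hooked up with a directive along the lines of:
--   %lexer { lexer } { Tok "<eof>" }
lexer :: (Tok -> P a) -> P a
lexer cont = popToken >>= cont
```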
And every time in some other rule I expected a `>` but there could also be any other token starting with `>`, I would precede the `>` token with `gt`. This has the effect of looking ahead to the next token, which could start with `>` without being `>`, and trying to convert that token into one `>` token plus another token for the "rest" of the initial token.
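For example (a hypothetical rule - the nonterminal names are made up for illustration), a production that needs to consume a closing `>` goes through `gt` first, so that a lookahead of `>>`, `>=`, or `>>=` gets split before the `>` is matched:

```
generic_args
  : '<' ty_sums gt '>'
```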