While writing a C++ parser for learning purposes, I had a question.
int a, b;
template <typename T> T Foo() {}
int main() {
bool x = a < b;
auto y = Foo<int>();
}
Here, <
symbol in a <
is less-than operator. But in Foo<
, that means the start of template parameters. Both are essentially identifier, followed by the <
symbol, in expression. Can the recursive-descent parser handle this situation? Or can only more powerful parsers like the LR parser handle this?
Recursive-descent parsers are more powerful than LR parsers, not less. A recursive-descent parser is usually written in a Turing-complete programming language (such as C++), while an LR parser only has the theoretical power of a pushdown automaton.
Now, recursive-descent parsers also typically have a regular, top-down structure similar to LL parsers, and since LL is less powerful than LR, it's understandable to suspect recursive-descent is also less powerful. But, because it has access to all of the features of a conventional programming language, a recursive-descent parser can use very powerful lookahead techniques, limited only by decidability, including arbitrary backtracking (which is essentially using the parser as its own lookahead mechanism).
In any case, how does a typical recursive-descent parser handle the C++ left-angle-bracket ambiguity? First we have to know what we want to accomplish, and for that, C++17 17.2 ([temp.names]) paragraph 3 says:
When a name is considered to be a template-name, and it is followed by a
<
, the<
is always taken as the delimiter of a template-argument-list and never as the less-than operator. [...]
So, in an expression context, if we see:
identifier <
then we need to look up identifier
to determine if it should be
"considered to be a template-name" (the rules for which are in paragraph
2 of the same section, and have some subtleties I'll omit), which is
principally determined by looking up the identifier in the
symbol table. If it is
found to refer to a template, then we treat the <
as starting a
template argument list, otherwise as the less-than operator.
A concrete example of the technique can be found in the Clang parser.
In
Parser::ParseUnqualifiedId
,
we have:
// unqualified-id:
// identifier
// template-id (when it hasn't already been annotated)
if (Tok.is(tok::identifier)) {
ParseIdentifier:
// Consume the identifier.
IdentifierInfo *Id = Tok.getIdentifierInfo();
SourceLocation IdLoc = ConsumeToken();
[...]
// If the next token is a '<', we may have a template.
TemplateTy Template;
if (Tok.is(tok::less))
return ParseUnqualifiedIdTemplateId(
SS, ObjectType, ObjectHadErrors,
TemplateKWLoc ? *TemplateKWLoc : SourceLocation(), Id, IdLoc,
EnteringContext, Result, TemplateSpecified);
and then
ParseUnqualifiedIdTemplateId
contains code to check if the identifier is a template:
TNK = Actions.isTemplateName(getCurScope(), SS,
TemplateKWLoc.isValid(), Id,
ObjectType, EnteringContext, Template,
MemberOfUnknownSpecialization);
and, if TNK
indicates a template was found, code that parses the
template argument list:
// Parse the enclosed template argument list.
SourceLocation LAngleLoc, RAngleLoc;
TemplateArgList TemplateArgs;
if (ParseTemplateIdAfterTemplateName(true, LAngleLoc, TemplateArgs, RAngleLoc,
Template))
return true;
Meanwhile, surrounding code I have not shown deals with the case where
the name is not considered to be a template name. In that case, it sets
Result
to just the identifier, then returns false
. That (perhaps
unexpectedly) means "successful parse", and bubbles up to
Parser::tryParseCXXIdExpression
which processes the identifier alone,
leaving the <
as the next token in the input token stream, where it
will eventually be treated as less-than.
There are other contexts in which a similar decision has to be made, and other ambiguities that require more mechanism to handle (including the right-angle-bracket ambiguity, which is its own can of worms), but this code illustrates the basic idea: when faced with a difficult choice, check the possibilities one by one, taking care to isolate any parser state changes caused by speculative parsing that didn't work out.
But, having explained the basic idea, I should caution that parsing C++ is extremely difficult, far more difficult than any other language I'm aware of. If you're an experienced language implementor eager for a "final boss" challenge that will take years if not decades to complete, go for it. But if you just want to learn the basics, I strongly recommend picking a different language.