I am playing around with LLVM's frontend and would appreciate some help with general processing technqiues of the AST. In the CodeGen directory, I have a method that processes an instance of the CodeGenFunction class. I know that this CGF contains information about a call to a function foo() in my source file, so as far as I understand, this call would correspond to a node in the AST.
Inside that CodeGen method I would like to identify that node and process the information contained in it. When I dump the entire AST for my source file, I can identify the node I am interested in, but I need a more methodical way of traversing the AST.
In my CodeGen method I defined an Expr variable, and after dumping it to stderr I can see something like:
MemberExpr /*address*/ '<bound member function type>' -> foo /*address*/
`-ImplicitCastExpr /*address*/ 'class Base *' <UncheckedDerivedToBase> (Base)>
`-ImplicitCastExpr /*address*/ 'Derived1 *' <LValueToRValue>
`-DeclRefExpr /*address*/ 'Derived1 *' lvalue ParmVar /*address*/ 'obj1' 'Derived1 *'
I would like to access and process the line starting with -DeclRefExpr, but I am not sure how to do that. All help is very appreacited!
Also, I am still new to this, so if I am not thinking about it right please let me know. Any explanation is welcome!
As a prerequisite to traversing the AST, one should start with an understanding of the structure itself.
The Clang documentation has two overview articles:
Introduction to the Clang AST has a little bit of info, plus a link to a talk that is now somewhat old but still mostly accurate.
In the Clang CFE Internals Manual, there is the section The AST Library, which goes into much more detail, although the organization and choice of topics to cover is a bit incoherent.
After reading the above, what I recommend is to print the AST for small
examples by feeding them to
clang -fsyntax-only -Xclang -ast-dump filename.cc
, which produces
output similar to the dump you showed for the MemberExpr
. Then, each
of the AST node classes has its own API documentation page, for
example:
What I find works the best is to go to https://clang.llvm.org/doxygen/namespaceclang.html and then do a text search for a name of interest.
However, the API documentation only covers the public methods, whereas what you really want to understand is the private data. For that, you need to open or click through to the source code ("More", then the line number of the definition file) and look at the private data declarations. The private data is the "ground truth" of what the AST node contains, and often has useful commentary, but unfortunately neither the private declarations nor their comments get copied into the Doxygen output.
We can now apply this knowledge. The question asks how to navigate
from MemberExpr
to the contained DeclRefExpr
. Based on the AST dump
you showed, the original source code was something like:
// Preliminary declarations.
struct Base {
void foo();
};
struct Derived1 : Base {};
Derived1 *obj1;
void f()
{
obj1->foo();
//^^^^^^^^^ Expression whose AST we are examining.
}
By following the documentation and source code linked above, we find the following access path:
MemberExpr
has private member Stmt *Base
, which is the obj1
in obj1->foo
. That private member is retrieved by calling
getBase()
, yielding an Expr*
. (Note: Expr
is derived from
Stmt
.)
We can downcast the Expr*
to ImplicitCastExpr*
using
dyn_cast
.
ImplicitCastExpr
is derived from CastExpr
, which has private
data member Stmt *Op
which is not documented but from context and
its name must be the operand. That member is retrieved by calling
getSubExpr()
, yielding Expr*
again.
Repeat for the nested ImplicitCastExpr
.
DeclRefExpr
has private member ValueDecl *D
, which is the
declaration of obj1
, and can be retrieved using getDecl()
.
Putting these steps into code, it would be something like (not tested):
using namespace clang;
MemberExpr *memberExpr = ...;
ImplicitCastExpr *outerICE = dyn_cast<ImplicitCastExpr>(memberExpr->getBase());
ImplicitCastExpr *innerICE = dyn_cast<ImplicitCastExpr>(outerICE->getSubExpr());
DeclRefExpr *dre = dyn_cast<DeclRefExpr>(innerICE->getSubExpr());
ValueDecl *objValueDecl = dre->getDecl();
Of course, in a real analysis you should not blindly assume you know the structure of the AST in advance. Instead, at each step, you need to check what the type of the nested nodes are and treat them accordingly.
There are three main ways to check the type of a node:
dyn_cast
returns a nullable pointer, so a common idiom is if (auto derived = dyn_cast<DerivedType>(base)) {...}
, which enters the block only when base
is DerivedType
.
isa
, another template in the dyn_cast
family, returns a boolean (which is basically the same as checking if dyn_cast
returns non-null).
All of the AST root superclasses (of which the three most important are Stmt
, Decl
, and Type
) have a getKind()
or similar method that returns an enumeration. Thus, another common idiom is to use switch (base->getKind()) {...}
and have one case per AST node kind. Be aware that the enumerators correspond to the most-derived type, so, for example, you would need multiple case
statements to get all of the types derived from CastExpr
, whereas dyn_cast
and isa
work fine with such "intermediate" classes. Under the hood, dyn_cast
calls getKind()
.
Finally, note that you cannot in general use ordinary C++ dynamic_cast
. The Clang source code is normally compiled with RTTI disabled, and even if you change that, some of the AST classes do not have vtables.
RecursiveASTVisitor
Depending on what processing you are doing, it is sometimes convenient
to use
RecursiveASTVisitor
to automatically traverse the tree and dispatch on the node types. It's
a mixed bag, since it provides a lot of convenient automation at first,
but can be quite challenging to work with if your analysis does not fit
perfectly into its traversal system.
To use this method, you initiate a traversal from the MemberExpr
, and
then at some point during that, VisitDeclRefExpr
will be called. See
How to write RecursiveASTVisitor based ASTFrontendActions
for details. If that suffices, great! Otherwise, and especially if you
are intending to do code generation (which often needs to do things in a
particular order and with varying contexts), it may be better to stick
with direct traversal for maximum control.
Another method of AST inspection is to write AST match expressions, which let one write a pattern that is then compared to all of the input, with matching fragments passed to a handler. The article Tutorial for building tools using LibTooling and LibASTMatchers discusses this method in some detail.
That method does not seem like a good fit for a code generator, but is good for things like searching for problematic code patterns.
If you try the match expression route, it is convenient to test your
match expressions using the
clang-query
tool, which is a useful general AST search/grep tool in its own right.