Tags: python, c++, libclang

How can I get the fully qualified names of return types and argument types using libclang's python bindings?


Consider the following example. I use python clang_example.py to parse the header my_source.hpp for function and method declarations.

my_source.hpp

#pragma once 

namespace ns {

    struct Foo {
        struct Bar {};

        Bar fun1(void*);
    };

    using Baz = Foo::Bar;

    void fun2(Foo, Baz const&);
}

clang_example.py

I use the following code to parse the function & method declarations using libclang's python bindings:

import clang.cindex
import typing


def filter_node_list_by_predicate(
    nodes: typing.Iterable[clang.cindex.Cursor], predicate: typing.Callable
) -> typing.Iterable[clang.cindex.Cursor]:

    for i in nodes:
        if predicate(i):
            yield i
        yield from filter_node_list_by_predicate(i.get_children(), predicate)


if __name__ == '__main__':
    index = clang.cindex.Index.create()
    translation_unit = index.parse('my_source.hpp', args=['-std=c++17'])

    for i in filter_node_list_by_predicate(
        translation_unit.cursor.get_children(), 
        lambda n: n.kind in [clang.cindex.CursorKind.FUNCTION_DECL, clang.cindex.CursorKind.CXX_METHOD]
    ):
        print(f"Function name: {i.spelling}")
        print(f"\treturn type: \t{i.type.get_result().spelling}")
        for arg in i.get_arguments():
            print(f"\targ: \t{arg.type.spelling}")

Output

Function name: fun1
        return type:    Bar
        arg:    void *
Function name: fun2
        return type:    void
        arg:    Foo
        arg:    const Baz &

Now I would like to extract the fully qualified name of the return type and argument types so I can correctly reference them from the outermost scope:

Function name: ns::Foo::fun1
        return type:    ns::Foo::Bar
        arg:    void *
Function name: ns::fun2
        return type:    void
        arg:    ns::Foo
        arg:    const ns::Baz &

Using this SO answer I can get the fully qualified name of the function declaration, but not of the return and argument types.

How do I get the fully qualified name of a type (not a cursor) in clang?

Note:

I tried using Type.get_canonical and it gets me close:

print(f"\treturn type: \t{i.type.get_result().get_canonical().spelling}")
for arg in i.get_arguments():
    print(f"\targ: \t{arg.type.get_canonical().spelling}")

But Type.get_canonical also resolves typedefs and aliases, which I do not want. I want the second argument of fun2 to be resolved as const ns::Baz & and not const ns::Foo::Bar &.

EDIT:

After having tested Scott McPeak's answer on my real application case I realized that I need this code to properly resolve template classes and nested types of template classes as well.

Given the above code as well as

namespace ns {
   template <typename T>
   struct ATemplate {
      using value_type = T;
   };

   typedef ATemplate<Baz> ABaz;

   ABaz::value_type fun3();
}

I would want the return type to be resolved to ns::ABaz::value_type and not ns::ATemplate::value_type or ns::ATemplate<ns::Foo::Bar>::value_type. I would be willing to settle for ns::ATemplate<Baz>::value_type.

Also, I can migrate to the C++ API if the functionality of the Python bindings is too limited for what I want to do.


Solution

  • Unfortunately, there does not appear to be a simple way to print a type using fully-qualified names. Even in the C++ API, QualType::getAsString(PrintingPolicy&) ignores the SuppressScope flag due to the intervention of the ElaboratedTypePolicyRAII class (I don't know why, and the git commit history offers no clues that I could find). Even if the C++ API worked as I would have hoped/expected, PrintingPolicy isn't exposed in the C or Python APIs.

    Consequently, to do this, we have to resort to taking apart the type structure in the client code, printing fully qualified names whenever we hit a named type, which is typically expressed as TypeKind.ELABORATED. (I'm not sure if they always are.)

    The following example program demonstrates the technique, embodied by the type_str function. As a proof of concept, it does not exhaustively handle all of the cases, although it does cover the most common ones. You can look at the source of TypePrinter.cpp to get an idea of what handling all cases entails.

    #!/usr/bin/env python3
    """
    Print types with fully-qualified names.
    
    This demonstrates the basic approach, digging into the type structure to
    print its details, including fully-qualified names when we encounter
    named types.  However, there are several unhandled cases, some of which
    are indicated with TODOs below.
    
    Also, beware that I was unable to get 'mypy' to work properly on this
    (despite installing the 'types-clang' package), so the type annotations
    below might be incorrect.
    """
    
    import clang.cindex
    import typing
    
    
    def get_decl_fqn(decl: clang.cindex.Cursor) -> str:
        """
        Given a Cursor that refers to a Declaration, get its fully
        qualified name.
        """
    
        # The semantic parent is the enclosing class, namespace, or
        # translation unit.
        parent = decl.semantic_parent
        assert parent is not None
    
        # When we hit the TU, just return the simple identifier.
        if parent.kind == clang.cindex.CursorKind.TRANSLATION_UNIT:
            return decl.spelling
    
        # Otherwise, print the parent name as a qualifier.
        else:
            return get_decl_fqn(parent) + "::" + decl.spelling
    
    
    def starts_with_letter(s: str) -> bool:
        """
        True if 's' starts with a letter.
        """
    
        return s != "" and s[0].isalpha()
    
    
    def ends_with_letter(s: str) -> bool:
        """
        True if 's' ends with a letter.
        """
    
        return s != "" and s[-1].isalpha()
    
    
    def join_type_strs(s1: str, s2: str) -> str:
        """
        Join two strings containing fragments of type syntax, inserting a
        space if both are non-empty and either has a letter adjacent to the
        joined edge.
        """
    
        if s1 != "" and s2 != "" and (ends_with_letter(s1) or starts_with_letter(s2)):
            return s1 + " " + s2
        else:
            return s1 + s2
    
    
    def type_str(t: clang.cindex.Type) -> str:
        """
        Print 't' in C++ syntax, using fully qualified names for named
        types.  (In contrast, 't.spelling' omits qualifiers.)
        """
    
        return join_type_strs(before_type_str(t), after_type_str(t))
    
    
    def before_type_str(t: clang.cindex.Type) -> str:
        """
        Print the part of 't' that would go before the declarator name in a
        declaration of a variable with that type.
        """
    
        return join_type_strs(before_type_str_nq(t), cv_qualifiers_str(t))
    
    
    def cv_qualifiers_str(t: clang.cindex.Type) -> str:
        """
        If 't' has any const/volatile/restrict qualifiers, return a string
        containing them, separated by spaces.  Otherwise, return "".
        """
    
        qualifiers = []
        if t.is_const_qualified():
            qualifiers.append("const")
        if t.is_volatile_qualified():
            qualifiers.append("volatile")
        if t.is_restrict_qualified():
            qualifiers.append("restrict")
    
        return " ".join(qualifiers)
    
    
    def before_type_str_nq(t: clang.cindex.Type) -> str:
        """
        Print the part of 't' that would go before the declarator name in a
        declaration of a variable with that type, ignoring any CV
        qualifiers.
        """
    
        if t.kind == clang.cindex.TypeKind.ELABORATED:
            # Most named types are represented with the "elaborated" node,
            # which typically has a name.
            return get_decl_fqn(t.get_declaration())
    
        elif t.kind == clang.cindex.TypeKind.POINTER:
            p = t.get_pointee()
    
            # TODO: This does not handle pointer-to-function properly, since
            # that requires additional parentheses.
            return join_type_strs(before_type_str(p), "*")
    
        elif t.kind == clang.cindex.TypeKind.LVALUEREFERENCE:
            p = t.get_pointee()
            return join_type_strs(before_type_str(p), "&")
    
        elif t.kind == clang.cindex.TypeKind.RVALUEREFERENCE:
            p = t.get_pointee()
            return join_type_strs(before_type_str(p), "&&")
    
        elif t.kind == clang.cindex.TypeKind.FUNCTIONPROTO:
            rettype = t.get_result()
            return before_type_str(rettype)
    
        # TODO: FUNCTIONNOPROTO, pointer-to-member, and possibly others.
    
        else:
            # For other types, just use the spelling as its "before" syntax.
            return t.spelling
    
    
    def after_type_str(t: clang.cindex.Type) -> str:
        """
        Print the part of 't' that would go after the declarator name in a
        declaration of a variable with that type.
        """
    
        if t.kind == clang.cindex.TypeKind.FUNCTIONPROTO:
            res = "("
            count = 0
            for argtype in t.argument_types():
                if count > 0:
                    res += ", "
                count += 1
                res += type_str(argtype)
            res += ")"
            return res
    
        # TODO: FUNCTIONNOPROTO and the various array types.
    
        return ""
    
    
    # ------------- Original code, edited to call 'type_str' ---------------
    def filter_node_list_by_predicate(
        nodes: typing.Iterable[clang.cindex.Cursor], predicate: typing.Callable
    ) -> typing.Iterable[clang.cindex.Cursor]:
    
        for i in nodes:
            if predicate(i):
                yield i
            yield from filter_node_list_by_predicate(i.get_children(), predicate)
    
    
    if __name__ == '__main__':
        index = clang.cindex.Index.create()
        translation_unit = index.parse('my_source.hpp', args=['-std=c++17'])
    
        for i in filter_node_list_by_predicate(
            translation_unit.cursor.get_children(),
            lambda n: n.kind in [clang.cindex.CursorKind.FUNCTION_DECL, clang.cindex.CursorKind.CXX_METHOD]
        ):
            print(f"Function name: {i.spelling}")
    
            # ---- Edited section ----
            # Compare the 'spelling' method to 'type_str' defined above.
            t = i.type
            print(f"\tFunction type spelling  : {t.spelling}")
            print(f"\tFunction type type_str(): {type_str(t)}")
    
    
    # EOF
    

    On your example input, it prints:

    $ ./fq-type-name.py
    Function name: fun1
            Function type spelling  : Bar (void *)
            Function type type_str(): ns::Foo::Bar (void *)
    Function name: fun2
            Function type spelling  : void (Foo, const Baz &)
            Function type type_str(): void (ns::Foo, ns::Baz const &)
    

    Notably, this fully qualifies ns::Foo::Bar in the return type of fun1. It also uses ns::Baz in the argument list of fun2, rather than using the underlying type, Bar.
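
    The pointer-to-function TODO in before_type_str_nq could be handled by splitting the pointer case across the "before" and "after" halves, as in the rough sketch below. It is duck-typed on the accessors that clang.cindex.Type provides (kind.name, spelling, get_pointee, get_result, argument_types) so that it runs without libclang; fake_type is a hypothetical stand-in, not part of any API. It assumes the pointee of a function pointer reports its kind as FUNCTIONPROTO, and it leaves out the name qualification and CV-qualifier handling of the full program.

```python
from types import SimpleNamespace


def fake_type(kind, spelling="", pointee=None, result=None, args=()):
    """Hypothetical stand-in exposing the clang.cindex.Type accessors used below."""
    return SimpleNamespace(
        kind=SimpleNamespace(name=kind),
        spelling=spelling,
        get_pointee=lambda: pointee,
        get_result=lambda: result,
        argument_types=lambda: list(args),
    )


def is_function(t):
    return t.kind.name == "FUNCTIONPROTO"


def before_str(t):
    """Syntax that precedes the declarator name."""
    if t.kind.name == "POINTER":
        p = t.get_pointee()
        base = before_str(p)
        sep = "" if base.endswith("*") else " "
        # A pointer to a function opens a parenthesis that after_str closes.
        return base + sep + ("(*" if is_function(p) else "*")
    if is_function(t):
        return before_str(t.get_result())
    return t.spelling


def after_str(t):
    """Syntax that follows the declarator name."""
    if t.kind.name == "POINTER":
        p = t.get_pointee()
        return (")" if is_function(p) else "") + after_str(p)
    if is_function(t):
        params = ", ".join(type_str(a) for a in t.argument_types())
        return "(" + params + ")" + after_str(t.get_result())
    return ""


def type_str(t):
    return before_str(t) + after_str(t)


# Stand-in for the type 'pointer to function taking int and returning void'.
int_t = fake_type("INT", "int")
void_t = fake_type("VOID", "void")
fn = fake_type("FUNCTIONPROTO", result=void_t, args=[int_t])
print(type_str(fake_type("POINTER", pointee=fn)))
```

    With these stand-ins, the pointer-to-function prints as void (*)(int). To use the idea in the program above, the same two branches would go into before_type_str_nq and after_type_str.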


    The revised question asks about a case involving templates and a typedef that is used as a scope qualifier, and wants to recover a fully-qualified name that uses that typedef. This is not possible using the approach outlined above because we construct the qualifiers by walking up the scope stack from the found declaration, ignoring how the type was expressed originally.
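
    To make that limitation concrete, here is a small sketch of the same upward walk, run on stand-in objects rather than real cursors so that it works without libclang. fake_cursor is hypothetical, and the scope chain it builds (including the cursor kinds) is my assumption about what the declaration behind ABaz::value_type looks like, chosen to match the ns::ATemplate::value_type result mentioned in the question.

```python
from types import SimpleNamespace


def fake_cursor(kind, spelling, parent=None):
    """Hypothetical stand-in for a clang.cindex.Cursor (kind.name, spelling, semantic_parent)."""
    return SimpleNamespace(kind=SimpleNamespace(name=kind), spelling=spelling,
                           semantic_parent=parent)


def decl_fqn(decl):
    """Join the spellings of 'decl' and its enclosing scopes with '::'."""
    parts = []
    node = decl
    while node is not None and node.kind.name != "TRANSLATION_UNIT":
        parts.append(node.spelling)
        node = node.semantic_parent
    return "::".join(reversed(parts))


# Assumed scope chain for the declarations in the revised question.
tu = fake_cursor("TRANSLATION_UNIT", "my_source.hpp")
ns = fake_cursor("NAMESPACE", "ns", tu)
tmpl = fake_cursor("STRUCT_DECL", "ATemplate", ns)
value_type = fake_cursor("TYPE_ALIAS_DECL", "value_type", tmpl)

# The typedef is a sibling of the template, not an ancestor of value_type.
abaz = fake_cursor("TYPEDEF_DECL", "ABaz", ns)

print(decl_fqn(value_type))
```

    The walk only ever visits ancestors of the declaration, so the sibling typedef ABaz can never appear in the result, no matter how the type was written in the source.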

    Using the Python API, it is possible to see the original type syntax and its qualifiers by iterating over children, but the child list is difficult to interpret. For example, if the input is:

    namespace ns {
      struct A {
        struct Inner {};
      };
      typedef A B;
      B::Inner f(int x, A a);
    }
    

    and we use this code to print the TU:

    def print_ast(node: clang.cindex.Cursor, label: str, indentLevel: int) -> None:
        """
        Recursively print the subtree rooted at 'node'.
        """
    
        indent = "  " * indentLevel
        print(f"{indent}{label}: kind={node.kind} " +
              f"spelling='{node.spelling}' " +
              f"loc={node.location.line}:{node.location.column}")
        indentLevel += 1
    
        index = 0
        for c in node.get_children():
            print_ast(c, f"child {index}", indentLevel)
            index += 1
    

    then the output is:

    TU: kind=CursorKind.TRANSLATION_UNIT spelling='test3.cc' loc=0:0
      child 0: kind=CursorKind.NAMESPACE spelling='ns' loc=1:11
        child 0: kind=CursorKind.STRUCT_DECL spelling='A' loc=2:10
          child 0: kind=CursorKind.STRUCT_DECL spelling='Inner' loc=3:12
        child 1: kind=CursorKind.TYPEDEF_DECL spelling='B' loc=5:13
          child 0: kind=CursorKind.TYPE_REF spelling='struct ns::A' loc=5:11
        child 2: kind=CursorKind.FUNCTION_DECL spelling='f' loc=6:12
          child 0: kind=CursorKind.TYPE_REF spelling='ns::B' loc=6:3
          child 1: kind=CursorKind.TYPE_REF spelling='struct ns::A::Inner' loc=6:6
          child 2: kind=CursorKind.PARM_DECL spelling='x' loc=6:18
          child 3: kind=CursorKind.PARM_DECL spelling='a' loc=6:23
            child 0: kind=CursorKind.TYPE_REF spelling='struct ns::A' loc=6:21
    

    Observe that the qualified type B::Inner is expressed as this pair of adjacent children:

          child 0: kind=CursorKind.TYPE_REF spelling='ns::B' loc=6:3
          child 1: kind=CursorKind.TYPE_REF spelling='struct ns::A::Inner' loc=6:6
    

    There's no simple way to see that the first child is the qualifier portion of the second child. This is a general problem with the Clang C API, and consequently of the Python API: accessors for specific roles are often missing, so one must resort to iterating over children and trying to reverse-engineer which is which. (I spent a couple weeks going down this road for a different project, and eventually had to admit defeat.)
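
    For illustration only, this is the kind of guesswork that reverse-engineering involves. The sketch below takes the TYPE_REF children that come before the first PARM_DECL and guesses that the last one names the return type while the earlier ones are its qualifiers. It is a heuristic of my own, not something the API guarantees, and I would expect it to misfire on other inputs (for instance, a return type whose template arguments are themselves named types). fake_cursor is a hypothetical stand-in that mimics the kind.name, spelling, and get_children accessors so the example runs without libclang.

```python
from types import SimpleNamespace


def fake_cursor(kind, spelling, children=()):
    """Hypothetical stand-in for a clang.cindex.Cursor."""
    return SimpleNamespace(kind=SimpleNamespace(name=kind), spelling=spelling,
                           get_children=lambda: iter(children))


def guess_return_type_refs(func):
    """
    Heuristic: return (qualifier_spellings, type_spelling) guessed from the
    TYPE_REF children that precede the first PARM_DECL of 'func'.
    """
    refs = []
    for child in func.get_children():
        if child.kind.name == "PARM_DECL":
            break
        if child.kind.name == "TYPE_REF":
            refs.append(child.spelling)
    if not refs:
        return [], None
    return refs[:-1], refs[-1]


# Stand-in for the FUNCTION_DECL 'f' from the dump above.
f = fake_cursor("FUNCTION_DECL", "f", [
    fake_cursor("TYPE_REF", "ns::B"),
    fake_cursor("TYPE_REF", "struct ns::A::Inner"),
    fake_cursor("PARM_DECL", "x"),
    fake_cursor("PARM_DECL", "a", [fake_cursor("TYPE_REF", "struct ns::A")]),
])

qualifiers, named = guess_return_type_refs(f)
print(qualifiers, named)
```

    On this one input the guess happens to be right: ns::B is reported as the qualifier and struct ns::A::Inner as the named type. Nothing in the child list itself confirms that pairing, which is exactly the problem.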

    The revised requirement is therefore harder: not merely to compute a type syntax string that uses fully-qualified names, but to produce one that adheres to the original syntax as closely as possible. I think that will be difficult to do robustly with the Python API, since the original syntax is tough to retrieve unambiguously.

    I recommend instead using the C++ API. This is still non-trivial, but all the information is there and available through accessors that distinguish the various "child" roles. If you want a tip on getting started, I have a tool on GitHub called print-clang-ast that prints a lot (but by no means all) of the Clang AST in a moderately readable JSON format. I even just added code to print the details of NestedNameSpecifier (which is how qualified names are represented) while trying to see if the Python API could be used for what you want. If you try to accomplish this using the C++ API but run into trouble, you could then ask a new question based on where you get stuck.