
clang AST for single line multiple variable declaration


I'm trying to use libclang from Python to traverse the AST for the following snippet of C:

    /* simple.c */
    bool LED0 = 0; /* state0 */
    bool LED1 = 0; /* state1 */
    bool LED2 = 1; /* state2 */
    
    /* inputs */
    bool  a0,a1,a2;

I'm using the following Python code to parse this C code:

    import sys
    import clang.cindex
    
    var_decls = []
    function_decls = []
    
    def printnode(node):
        print(f"Found : {node.displayname} type: {node.kind} [line={node.location.line}]")
    
    def traverse(node):
        for child in node.get_children():
            traverse(child)
        if node.kind == clang.cindex.CursorKind.DECL_STMT:
            print("DECL_STMT")
            printnode(node)
        if node.kind == clang.cindex.CursorKind.VAR_DECL:
            var_decls.append(node)
            printnode(node)
        if node.kind == clang.cindex.CursorKind.FUNCTION_DECL:
            function_decls.append(node)
    
    index = clang.cindex.Index.create()
    translation_unit = index.parse("simple.c")
    traverse(translation_unit.cursor)

When I run this Python script I get this output:

    Found : LED0 type: CursorKind.VAR_DECL [line=2]
    Found : LED1 type: CursorKind.VAR_DECL [line=3]
    Found : LED2 type: CursorKind.VAR_DECL [line=4]
    Found : a0 type: CursorKind.VAR_DECL [line=7]

The problem is that only a0 from the declaration bool a0,a1,a2; was found; a1 and a2 are missing. How do I get the a1 and a2 declarations from the AST?

(NOTE: an older post from 2016 entitled "clang ast visitor for single line multiple variable declaration" suggested checking for a CursorKind of DECL_STMT, but as the output above shows, no node of that kind was ever observed. A commenter on that answer said: "This will work only for local variables. For global variables and class members, AST doesn't show any common DeclStmt". Since this is a global declaration, that seems to be the problem here. Has any workaround been discovered since 2016?)


Solution

    How are multiple-declarator declarations represented?

    When a single declaration at file scope contains multiple declarators, as is the case with:

    // test1.c
    int x,y,z;
    

    this is represented in the Clang AST as multiple, seemingly-independent VarDecl nodes. This can be seen by running it through the clang compiler and having it dump the AST like this:

    $ clang -fsyntax-only -Xclang -ast-dump test1.c
    TranslationUnitDecl 0x56471dbd3328 <<invalid sloc>> <invalid sloc>
    |-TypedefDecl 0x56471dbd3b50 <<invalid sloc>> <invalid sloc> implicit __int128_t '__int128'
    | `-BuiltinType 0x56471dbd38f0 '__int128'
    |-TypedefDecl 0x56471dbd3bc0 <<invalid sloc>> <invalid sloc> implicit __uint128_t 'unsigned __int128'
    | `-BuiltinType 0x56471dbd3910 'unsigned __int128'
    |-TypedefDecl 0x56471dbd3ec8 <<invalid sloc>> <invalid sloc> implicit __NSConstantString 'struct __NSConstantString_tag'
    | `-RecordType 0x56471dbd3ca0 'struct __NSConstantString_tag'
    |   `-Record 0x56471dbd3c18 '__NSConstantString_tag'
    |-TypedefDecl 0x56471dbd3f60 <<invalid sloc>> <invalid sloc> implicit __builtin_ms_va_list 'char *'
    | `-PointerType 0x56471dbd3f20 'char *'
    |   `-BuiltinType 0x56471dbd33d0 'char'
    |-TypedefDecl 0x56471dbd4258 <<invalid sloc>> <invalid sloc> implicit __builtin_va_list 'struct __va_list_tag[1]'
    | `-ConstantArrayType 0x56471dbd4200 'struct __va_list_tag[1]' 1 
    |   `-RecordType 0x56471dbd4040 'struct __va_list_tag'
    |     `-Record 0x56471dbd3fb8 '__va_list_tag'
    |-VarDecl 0x56471dc2f630 <test1.c:2:1, col:5> col:5 x 'int'
    |-VarDecl 0x56471dc2f6f8 <col:1, col:7> col:7 y 'int'
    `-VarDecl 0x56471dc2f778 <col:1, col:9> col:9 z 'int'
    

    When your Python script is run on this input, it produces:

    $ python3 visit.py
    Found : x type: CursorKind.VAR_DECL [line=2]
    Found : y type: CursorKind.VAR_DECL [line=2]
    Found : z type: CursorKind.VAR_DECL [line=2]
    

    I think this is what you're after; the script works here, visiting each of the VAR_DECL nodes.

    Incidentally, if you specifically want to know whether the VAR_DECLs all arose from the same syntactic declaration, you have to compare their source location information, since the AST has no common parent node for them. See the question "Clang AST: VarDecl (global variables) and DeclStmt".
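
    As a rough sketch of that location check: in the AST dump above, the source ranges of x, y and z all begin at line 2, column 1, so grouping cursors by where their extent starts should recover the shared declaration. This assumes libclang's cursor.extent reports the same range start as the dump. The helper name group_by_declaration and the stand-in cursor objects are hypothetical and used only so the example runs without libclang.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_declaration(var_cursors):
    """Group cursors whose extents start at the same (file, line, column)."""
    groups = defaultdict(list)
    for cur in var_cursors:
        start = cur.extent.start
        # start.file can be None for declarations with no real source location.
        fname = start.file.name if start.file is not None else None
        groups[(fname, start.line, start.column)].append(cur.spelling)
    return dict(groups)


def _fake_cursor(name, line, col):
    """Hypothetical stand-in exposing the clang.cindex.Cursor attributes used above."""
    src_file = SimpleNamespace(name="test1.c")
    start = SimpleNamespace(file=src_file, line=line, column=col)
    return SimpleNamespace(spelling=name, extent=SimpleNamespace(start=start))


# Extent starts mirror the -ast-dump output: x, y and z all begin at 2:1.
demo = [_fake_cursor(n, 2, 1) for n in ("x", "y", "z")]
demo.append(_fake_cursor("w", 3, 1))  # a separate declaration on another line
grouped = group_by_declaration(demo)
print(grouped)
```

    In the real script you would pass the var_decls list collected by traverse() instead of the stand-in objects.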

    So why doesn't it work in the original example?

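    The original simple.c is compiled as C (not C++) because it uses the .c extension. But C before C23 has no built-in bool type (Clang 18, the current version at the time of writing, defaults to gnu17, that is, C17 with GNU extensions); bool is only available as a macro after including <stdbool.h>. Since simple.c does not include that header, there are errors: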

    $ clang -fsyntax-only -Xclang -ast-dump simple.c
    simple.c:2:1: error: unknown type name 'bool'
    bool LED0 = 0; /* state0 */
    ^
    simple.c:3:1: error: unknown type name 'bool'
    bool LED1 = 0; /* state1 */
    ^
    simple.c:4:1: error: unknown type name 'bool'
    bool LED2 = 1; /* state2 */
    ^
    simple.c:7:1: error: unknown type name 'bool'
    bool  a0,a1,a2;
    ^
    TranslationUnitDecl 0x556e7fc99328 <<invalid sloc>> <invalid sloc>
    |-TypedefDecl 0x556e7fc99b50 <<invalid sloc>> <invalid sloc> implicit __int128_t '__int128'
    | `-BuiltinType 0x556e7fc998f0 '__int128'
    |-TypedefDecl 0x556e7fc99bc0 <<invalid sloc>> <invalid sloc> implicit __uint128_t 'unsigned __int128'
    | `-BuiltinType 0x556e7fc99910 'unsigned __int128'
    |-TypedefDecl 0x556e7fc99ec8 <<invalid sloc>> <invalid sloc> implicit __NSConstantString 'struct __NSConstantString_tag'
    | `-RecordType 0x556e7fc99ca0 'struct __NSConstantString_tag'
    |   `-Record 0x556e7fc99c18 '__NSConstantString_tag'
    |-TypedefDecl 0x556e7fc99f60 <<invalid sloc>> <invalid sloc> implicit __builtin_ms_va_list 'char *'
    | `-PointerType 0x556e7fc99f20 'char *'
    |   `-BuiltinType 0x556e7fc993d0 'char'
    |-TypedefDecl 0x556e7fc9a258 <<invalid sloc>> <invalid sloc> implicit __builtin_va_list 'struct __va_list_tag[1]'
    | `-ConstantArrayType 0x556e7fc9a200 'struct __va_list_tag[1]' 1 
    |   `-RecordType 0x556e7fc9a040 'struct __va_list_tag'
    |     `-Record 0x556e7fc99fb8 '__va_list_tag'
    |-VarDecl 0x556e7fcf69b0 <simple.c:2:1, col:6> col:6 invalid LED0 'int'
    |-VarDecl 0x556e7fcf6a50 <line:3:1, col:6> col:6 invalid LED1 'int'
    |-VarDecl 0x556e7fcf6af0 <line:4:1, col:6> col:6 invalid LED2 'int'
    `-VarDecl 0x556e7fcf6b90 <line:7:1, col:7> col:7 invalid a0 'int'
    4 errors generated.
    

    Despite the errors, Clang still produces an AST. (It does this, among other reasons, to allow it to be used within IDEs where the code may have syntax errors.) But its error recovery is not able to recognize that the declaration contains three variables, and ends up discarding the second and third.

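    The Python script then traverses this "best guess" AST, and finds that it only has one VarDecl on line 7. Adding #include <stdbool.h> at the top of simple.c should remove the errors, after which a0, a1 and a2 should each get their own VarDecl, just as x, y and z do in test1.c.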

    How do I stop if there are syntax errors?

    You probably do not want to continue processing if there are syntax errors. If so, you can check the diagnostics property of translation_unit like this:

    # ... as in the original code ...
    
    index = clang.cindex.Index.create()
    translation_unit = index.parse("simple.c")
    
    # INSERT this check:
    if translation_unit.diagnostics:
        for d in translation_unit.diagnostics:
            print(d, file=sys.stderr)
        sys.exit(2)
    
    traverse(translation_unit.cursor)
    

    With that, the script will halt after printing the errors:

    $ python3 visit.py
    simple.c:2:1: error: unknown type name 'bool'
    simple.c:3:1: error: unknown type name 'bool'
    simple.c:4:1: error: unknown type name 'bool'
    simple.c:7:1: error: unknown type name 'bool'
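
    One caveat: the check above stops on any diagnostic, including warnings. If you would rather stop only on errors, a possible refinement is to filter by severity. The sketch below assumes the severity values from clang.cindex (Diagnostic.Warning is 2, Diagnostic.Error is 3, Diagnostic.Fatal is 4); the helper name errors_only and the stand-in diagnostic objects are hypothetical, used so the example runs without libclang.

```python
import sys
from types import SimpleNamespace

# Numeric value of clang.cindex.Diagnostic.Error (Warning is 2, Fatal is 4).
ERROR_SEVERITY = 3


def errors_only(diagnostics, threshold=ERROR_SEVERITY):
    """Return the diagnostics whose severity is at least `threshold`."""
    return [d for d in diagnostics if d.severity >= threshold]


# Hypothetical stand-ins exposing the Diagnostic attributes used here.
demo = [
    SimpleNamespace(severity=2, spelling="unused variable 'tmp'"),
    SimpleNamespace(severity=3, spelling="unknown type name 'bool'"),
]
errors = errors_only(demo)
for d in errors:
    print(d.spelling, file=sys.stderr)
```

    In the script itself, you would call errors_only(translation_unit.diagnostics) and call sys.exit(2) only when the returned list is non-empty.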