python c++ clang abstract-syntax-tree libclang

How to parse multi-dimensional arrays with python libclang bindings


I'm having a difficult time with what, in my mind, should be a fairly simple task: using the Python bindings to libclang, I want to get the dimensions of a multi-dimensional array field of a POD C++ structure. When I traverse the AST, I can drill down to the cursor containing the array field declaration, and I can even see that it has two child nodes that, I assume, hold the sizes of its dimensions, but I've had no luck accessing those sizes as values. Below is a minimal example of my attempt:

test.hpp

#pragma once

struct my_struct {
    int mdarr[10][20];
};

test.py

import clang.cindex as cl

def process(c):
    if c.kind in [cl.CursorKind.STRUCT_DECL, cl.CursorKind.CLASS_DECL]:
        print("Found struct: ", c.spelling)
        for field in c.type.get_fields():
           print("Found field: ", field.spelling) 
           # Returns size of first dimension but not the second dimension
           print("Array size: ", field.type.get_array_size())
           for child in field.get_children():
                # Prints an empty string
                print("Found child: ", child.spelling)
                # How do I extract the value from the `INTEGER_LITERAL`?
                print("Child cursor kind: ", child.kind)

        return

    for child in c.get_children():
        process(child)

idx = cl.Index.create()
tu = idx.parse("test.hpp")
process(tu.cursor)

Output

Found struct:  my_struct
Found field:  mdarr
Array size:  10
Found child:
Child cursor kind:  CursorKind.INTEGER_LITERAL
Found child:
Child cursor kind:  CursorKind.INTEGER_LITERAL

Is there an easy way to extract each dimension's size using libclang?


Solution

    What is a multidimensional array?

    C and C++ do not have multidimensional arrays per se. Instead, an array type can have another array type as its element type. In the declaration:

    int mdarr[10][20];
    

    this is parsed "inside out" (like all C/C++ declarators) as:

        mdarr
        ^^^^^               mdarr is ...
        mdarr[10]
             ^^^^           ... an array of 10 elements, each element being ...
        mdarr[10][20]
                 ^^^^       ... an array of 20 elements, each element being ...
    int mdarr[10][20]
    ^^^                     ... an integer.
    

    Getting the dimensions in libclang

    Clang's representation follows this structure, representing the type as an array(10) of an array(20) of int.

    The type of a Python libclang Cursor is obtained with Cursor.type.

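    From a Type one may invoke:

      • get_array_size() to get the number of elements of a constant-size array type, and
      • get_array_element_type() to get the type of its elements, which may itself be an array type.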

    Adjusting the code to print all of the dimensions

    The following is a modified version of the code in the question (the additions sit between BEGIN ADDED and END ADDED) that prints all of the array dimensions:

    import clang.cindex as cl
    
    def process(c):
        if c.kind in [cl.CursorKind.STRUCT_DECL, cl.CursorKind.CLASS_DECL]:
            print("Found struct: ", c.spelling)
            for field in c.type.get_fields():
               print("Found field: ", field.spelling) 
               # Returns size of first dimension but not the second dimension
               print("Array size: ", field.type.get_array_size())
    
               # BEGIN ADDED
               t = field.type
               while t.kind == cl.TypeKind.CONSTANTARRAY:
                   print("  ADDED: Array size:", t.get_array_size())
                   t = t.get_array_element_type()
                   print("  ADDED: Element type:", t.spelling)
               # END ADDED
    
               for child in field.get_children():
                    # Prints an empty string
                    print("Found child: ", child.spelling)
                    # How do I extract the value from the `INTEGER_LITERAL`?
                    print("Child cursor kind: ", child.kind)
    
            return
    
        for child in c.get_children():
            process(child)
    
    idx = cl.Index.create()
    tu = idx.parse("test.hpp")
    process(tu.cursor)
    

    On the original input, this script prints:

    Found struct:  my_struct
    Found field:  mdarr
    Array size:  10
      ADDED: Array size: 10
      ADDED: Element type: int[20]
      ADDED: Array size: 20
      ADDED: Element type: int
    Found child:  
    Child cursor kind:  CursorKind.INTEGER_LITERAL
    Found child:  
    Child cursor kind:  CursorKind.INTEGER_LITERAL
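
If you would rather collect the dimensions into a list than print them, the same loop can be wrapped in a small helper. The sketch below is only an illustration: `array_dimensions` is a made-up name, not part of libclang, and it assumes that `get_array_size()` returns a negative value for any type that is not a constant-size array (which is how the underlying `clang_getArraySize` is documented to behave), instead of comparing `t.kind` as the code above does.

```python
def array_dimensions(t):
    """Collect the sizes of a nested constant-size array type.

    `t` is expected to behave like a clang.cindex.Type: it needs
    get_array_size() and get_array_element_type().
    Returns (list_of_sizes, innermost_element_type).
    """
    dims = []
    while True:
        size = t.get_array_size()
        # Assumption: a negative size means "not a constant-size array",
        # so the unwrapping stops here.
        if size < 0:
            break
        dims.append(size)
        # Step one level inward: array(10) of array(20) of int -> array(20) of int
        t = t.get_array_element_type()
    return dims, t
```

Called as `array_dimensions(field.type)` inside the loop over `get_fields()`, this should give the sizes `[10, 20]` and the element type `int` for `mdarr`, matching the ADDED lines in the output above.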
    

    Getting the value of an INTEGER_LITERAL

    Relatedly, you ask (in a comment in the question code) how to get the value of an INTEGER_LITERAL node. To do that, retrieve the tokens that make up the literal, get the first one, and get its "spelling".

    See the Q+A "How to retrieve function call argument values using libclang" for more details about that.
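
As a rough sketch of the token approach described above: the helper below takes the first token of a cursor and tries to turn its spelling into a Python int. `integer_literal_value` is a hypothetical name, and the conversion rules are my own simplification: it strips `u`/`l` suffixes and relies on Python's `int(text, 0)`, so decimal and `0x` literals work, while C-style octal (`010`) and digit separators (`1'000`) are not handled and yield `None`.

```python
def integer_literal_value(cursor):
    """Return the value of an INTEGER_LITERAL cursor as an int, or None.

    `cursor` is expected to behave like a clang.cindex.Cursor, i.e.
    get_tokens() yields objects that have a `spelling` attribute.
    """
    # The first token of the literal carries its source text, e.g. "10".
    first = next(iter(cursor.get_tokens()), None)
    if first is None:
        # No tokens available for this cursor.
        return None
    # Drop integer suffixes such as "u", "UL", "ll" (simplified handling).
    text = first.spelling.rstrip("uUlL")
    try:
        # Base 0 lets Python accept both decimal and 0x-prefixed forms.
        return int(text, 0)
    except ValueError:
        # Spelling this sketch does not understand (octal, separators, ...).
        return None
```

In the question's loop, `integer_literal_value(child)` would replace the `child.spelling` call, which is empty for literals.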