How can I parse a C format string in Python?


I have this code in my C file:

printf("Worker name is %s and id is %d", worker.name, worker.id);

I want, with Python, to be able to parse the format string and locate the "%s" and "%d".

So I want to have a function:

>>> my_function("Worker name is %s and id is %d")
[Out1]: ((15, "%s"), (28, "%d"))

I've tried to do this with libclang's Python bindings and with pycparser, but I couldn't see how to do it with either tool.

I've also tried to solve this with a regex, but it is not simple at all: think of cases where the printf format contains "%%s" and the like.

Both gcc and clang obviously do this as part of compiling. Has no one exported this logic to Python?


Solution

  • You can certainly find properly formatted candidates with a regex.

    Take a look at the definition of the C Format Specification. (I'm using Microsoft's here, but use whichever you want.)

    It is:

    %[flags] [width] [.precision] [{h | l | ll | w | I | I32 | I64}] type
    

    You also have the special case of %% which becomes % in printf.

    You can translate that pattern into a regex:

    (                                 # start of capture group 1
    %                                 # literal "%"
    (?:                               # first option
    (?:[-+0 #]{0,5})                  # optional flags
    (?:\d+|\*)?                       # width
    (?:\.(?:\d+|\*))?                 # precision
    (?:h|l|ll|w|I|I32|I64)?           # size
    [cCdiouxXeEfgGaAnpsSZ]            # type
    ) |                               # OR
    %%)                               # literal "%%"
    

    And then into a Python regex:

    import re
    
    lines='''\
    Worker name is %s and id is %d
    That is %i%%
    %c
    Decimal: %d  Justified: %.6d
    %10c%5hc%5C%5lc
    The temp is %.*f
    %ss%lii
    %*.*s | %.3d | %lC | %s%%%02d'''
    
    cfmt=r'''
    (                                  # start of capture group 1
    %                                  # literal "%"
    (?:                                # first option
    (?:[-+0 #]{0,5})                   # optional flags
    (?:\d+|\*)?                        # width
    (?:\.(?:\d+|\*))?                  # precision
    (?:h|l|ll|w|I|I32|I64)?            # size
    [cCdiouxXeEfgGaAnpsSZ]             # type
    ) |                                # OR
    %%)                                # literal "%%"
    '''
    
    for line in lines.splitlines():
        # collect (start index, matched specifier) for every match on this line
        matches = tuple((m.start(1), m.group(1)) for m in re.finditer(cfmt, line, flags=re.X))
        print('"{}"\n\t{}\n'.format(line, matches))
    

    Prints:

    "Worker name is %s and id is %d"
        ((15, '%s'), (28, '%d'))
    
    "That is %i%%"
        ((8, '%i'), (10, '%%'))
    
    "%c"
        ((0, '%c'),)
    
    "Decimal: %d  Justified: %.6d"
        ((9, '%d'), (24, '%.6d'))
    
    "%10c%5hc%5C%5lc"
        ((0, '%10c'), (4, '%5hc'), (8, '%5C'), (11, '%5lc'))
    
    "The temp is %.*f"
        ((12, '%.*f'),)
    
    "%ss%lii"
        ((0, '%s'), (3, '%li'))
    
    "%*.*s | %.3d | %lC | %s%%%02d"
        ((0, '%*.*s'), (8, '%.3d'), (15, '%lC'), (21, '%s'), (23, '%%'), (25, '%02d'))
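
To get the function asked for in the question, the same pattern can be compiled once and wrapped. This is only a sketch: the name `find_format_specs` and the `keep_escaped` parameter are made up for illustration, and the pattern still follows the Microsoft grammar above, so length modifiers from other C libraries are not covered. Because the regex consumes `%%` as a single match, the `%%s` case from the question is not mistaken for `%s`.

```python
import re

# Same pattern as above, compiled once in verbose mode.
C_FORMAT_RE = re.compile(r'''
    (                              # start of capture group 1
    %                              # literal "%"
    (?:                            # first option
    (?:[-+0 #]{0,5})               # optional flags
    (?:\d+|\*)?                    # width
    (?:\.(?:\d+|\*))?              # precision
    (?:h|l|ll|w|I|I32|I64)?        # size
    [cCdiouxXeEfgGaAnpsSZ]         # type
    ) |                            # OR
    %%)                            # literal "%%"
    ''', re.VERBOSE)


def find_format_specs(fmt, keep_escaped=False):
    """Return a tuple of (index, specifier) pairs found in fmt.

    Hypothetical helper: "%%" matches are dropped unless keep_escaped is True.
    """
    return tuple(
        (m.start(1), m.group(1))
        for m in C_FORMAT_RE.finditer(fmt)
        if keep_escaped or m.group(1) != '%%'
    )


print(find_format_specs("Worker name is %s and id is %d"))
# ((15, '%s'), (28, '%d'))
print(find_format_specs("100%%s done, %d left"))
# ((13, '%d'),)
```

Passing `keep_escaped=True` returns the `%%` entries as well, which matches the output of the loop above.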