
pyparsing nestedExpr and double closing characters


I am trying to parse nested column type definitions such as

1  string
2  struct<col_1:string,col_2:int>
3  row(col_1 string,array(col_2 string),col_3 boolean)
4  array<struct<col_1:string,col_2:int>,col_3:boolean>
5  array<struct<col_1:string,col2:int>>

Using nestedExpr works as expected for cases 1-4, but throws a parse error on case 5. Adding a space between the double closing brackets, as in "> >", seems to work, and might be explained by this quote from the author:

"By default, nestedExpr will look for space-delimited words of printables" (https://sourceforge.net/p/pyparsing/bugs/107/)

I'm mostly looking for alternatives to pre- and post-processing the input string:

type_str = type_str.replace(">", "> ")
# parse string here
type_str = type_str.replace("> ", ">")

I've tried using infix_notation, but I haven't been able to figure out how to use it in this situation. I'm probably just using it the wrong way...

Code snippet

import pyparsing as pp

array_keyword = pp.Keyword('array')
row_keyword = pp.Keyword('row')
struct_keyword = pp.Keyword('struct')

nest_open = pp.Word('<([')
nest_close = pp.Word('>)]')

col_name = pp.Word(pp.alphanums + '_')
col_type = pp.Forward()
col_type_delimiter = pp.Word(':') | pp.White(' ')
column = col_name('name') + col_type_delimiter + col_type('type')
col_list = pp.delimitedList(pp.Group(column))

struct_type = pp.nestedExpr(
    opener=struct_keyword + nest_open, closer=nest_close, content=col_list | col_type, ignoreExpr=None
)


row_type = pp.locatedExpr(pp.nestedExpr(
    opener=row_keyword + nest_open, closer=nest_close, content=col_list | col_type, ignoreExpr=None
))

array_type = pp.nestedExpr(
    opener=array_keyword + nest_open, closer=nest_close, content=col_type, ignoreExpr=None
)

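# scalar_type was not shown in the original snippet; assumed here to be a small set of type names
scalar_type = pp.oneOf('string int boolean')

col_type <<= struct_type('children') | array_type('children') | row_type('children') | scalar_type('type')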

Solution

  • nestedExpr and infixNotation are not really appropriate for this project. nestedExpr is generally a shortcut expression for text you don't really want to parse in detail: you just want to detect and step over some chunk of text that happens to have nested opening and closing punctuation. infixNotation is intended for parsing expressions with unary and binary operators, usually some kind of arithmetic. You might be able to treat the punctuation in your grammar as operators, but that is a stretch, and definitely doing things the hard way.
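
As an aside on the posted code, the failure on ">>" looks like a consequence of how the closer is defined rather than of nestedExpr's default content. `pp.Word('>)]')` matches one or more characters from that set, so two adjacent closing brackets are consumed as a single closer and the outer expression never finds its own. A minimal sketch of the difference (the variable names are only for illustration):

```python
import pyparsing as pp

# Word matches one OR MORE characters from the given set, so a run of
# closing brackets is consumed as a single token.
greedy_close = pp.Word(">)]")
greedy_tokens = greedy_close.parseString(">>").asList()

# Restricting the Word to exactly one character leaves the second '>'
# in the input for the enclosing expression to match.
single_close = pp.Word(">)]", exact=1)
single_tokens = single_close.parseString(">>").asList()

print(greedy_tokens)  # ['>>']
print(single_tokens)  # ['>']
```

The same applies to an opener defined as `pp.Word('<([')`. Fixing this may get case 5 past the parse error, but it does not change the point that an explicit recursive grammar is the better fit here.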

    For your project, you will really need to define the different elements, and it will be a recursive grammar (since the array and struct types will themselves be defined in terms of other types, which could also be arrays or structs).

    I took a stab at a BNF for a subset of your grammar, using scalar types int, float, boolean, and string, and compound types array and struct, with just the '<' and '>' nesting punctuation. An array will take a single type argument, to define the type of the elements in the array. A struct will take one or more struct fields, where each field is an identifier:type pair.

    scalar_type ::= 'int' | 'float' | 'string' | 'boolean'
    array_type ::= 'array' '<' type_defn '>'
    struct_type ::= 'struct' '<' struct_element (',' struct_element)... '>'
    struct_element ::= identifier ':' type_defn
    type_defn ::= scalar_type | array_type | struct_type
    

    (If you later want to add a row definition also, think about what the row is supposed to look like, and how its elements would be defined, and then add it to this BNF.)

    You look pretty comfortable with the basics of pyparsing, so I'll just start you off with some intro pieces, and then let you fill in the rest.

    # define punctuation
    LT, GT, COLON = map(pp.Suppress, "<>:")
    ARRAY = pp.Keyword('array')
    STRUCT = pp.Keyword('struct')
    
    # create a Forward that will be used in other type expressions
    type_defn = pp.Forward()
    
    # here is the array type, you can fill in the other types following this model
    # and the definitions in the BNF
    array_type = pp.Group(ARRAY + LT + type_defn + GT)
    
    ...
    
    # then finally define type_defn in terms of the other type expressions
    type_defn <<= scalar_type | array_type | struct_type
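
For reference, here is one way the remaining pieces might be filled in. This is a sketch that follows the BNF above, not the only possible grammar: the `identifier` and `struct_element` definitions, the choice of `Keyword` for the scalar types, and the extra `Group` around the field list are my own assumptions.

```python
import pyparsing as pp

# punctuation is suppressed so it does not clutter the parsed results
LT, GT, COLON = map(pp.Suppress, "<>:")
ARRAY = pp.Keyword("array")
STRUCT = pp.Keyword("struct")

# scalar_type ::= 'int' | 'float' | 'string' | 'boolean'
scalar_type = pp.MatchFirst(
    [pp.Keyword(name) for name in ("int", "float", "string", "boolean")]
)

# assumed identifier syntax: a letter or underscore, then letters, digits, underscores
identifier = pp.Word(pp.alphas + "_", pp.alphanums + "_")

# Forward placeholder, since array and struct refer back to type_defn
type_defn = pp.Forward()

# array_type ::= 'array' '<' type_defn '>'
array_type = pp.Group(ARRAY + LT + type_defn + GT)

# struct_element ::= identifier ':' type_defn
struct_element = pp.Group(identifier + COLON + type_defn)

# struct_type ::= 'struct' '<' struct_element (',' struct_element)... '>'
struct_type = pp.Group(
    STRUCT + LT + pp.Group(pp.delimitedList(struct_element)) + GT
)

# type_defn ::= scalar_type | array_type | struct_type
type_defn <<= scalar_type | array_type | struct_type

nested = type_defn.parseString(
    "array<struct<col_1:string,col2:int>>", parseAll=True
).asList()
print(nested)
```

Because each `'>'` is its own `Suppress` expression, the adjacent closing brackets in `>>` are matched one at a time and no padding with spaces is needed. With the grouping chosen here, compound types come back wrapped in one outer list, so `nested` is `[['array', ['struct', [['col_1', 'string'], ['col2', 'int']]]]]`.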
    

    Once you have that finished, try it out with some tests:

    type_defn.runTests("""\
        string
        struct<col_1:string,col_2:int>
        array<struct<col_1:string,col2:int>>
        """, fullDump=False)
    

    And you should get something like:

    string
    ['string']
    
    struct<col_1:string,col_2:int>
    ['struct', [['col_1', 'string'], ['col_2', 'int']]]
    
    array<struct<col_1:string,col2:int>>
    ['array', ['struct', [['col_1', 'string'], ['col2', 'int']]]]
    

    Once you have that, you can play around with extending it to other types, such as your row type, maybe unions, or arrays that take multiple types (if that was your intention in your posted example). Always start by updating the BNF - then the changes you'll need to make in the code will generally follow.
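
As an illustration of the "BNF first" workflow, here is a sketch of adding a row type. The row syntax is a guess based on the posted example (parentheses, with each field written as a name, a space, and a type), so the two new rules and the names `row_type` and `row_element` are hypothetical:

    row_type ::= 'row' '(' row_element (',' row_element)... ')'
    row_element ::= identifier type_defn
    type_defn ::= scalar_type | array_type | struct_type | row_type

```python
import pyparsing as pp

LT, GT, COLON, LPAR, RPAR = map(pp.Suppress, "<>:()")
ARRAY, STRUCT, ROW = map(pp.Keyword, ("array", "struct", "row"))

scalar_type = pp.MatchFirst(
    [pp.Keyword(name) for name in ("int", "float", "string", "boolean")]
)
identifier = pp.Word(pp.alphas + "_", pp.alphanums + "_")
type_defn = pp.Forward()

array_type = pp.Group(ARRAY + LT + type_defn + GT)
struct_element = pp.Group(identifier + COLON + type_defn)
struct_type = pp.Group(
    STRUCT + LT + pp.Group(pp.delimitedList(struct_element)) + GT
)

# new rules, mirroring the (assumed) BNF additions for row
row_element = pp.Group(identifier + type_defn)
row_type = pp.Group(ROW + LPAR + pp.Group(pp.delimitedList(row_element)) + RPAR)

# type_defn gains one more alternative; nothing else changes
type_defn <<= scalar_type | array_type | struct_type | row_type

row_result = type_defn.parseString(
    "row(col_1 string,col_3 array<int>)", parseAll=True
).asList()
print(row_result)
```

This deliberately does not handle the `array(col_2 string)` form from the question, since it is not clear what that is meant to define; that would need its own BNF rule first.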