python, pyparsing, lvm

Convert lvm.conf to python dict using pyparsing


I'm trying to convert lvm.conf into a Python (JSON-like) object. The LVM (Logical Volume Management) configuration file looks like this:

# Configuration section config.
# How LVM configuration settings are handled.
config {

    # Configuration option config/checks.
    # If enabled, any LVM configuration mismatch is reported.
    # This implies checking that the configuration key is understood by
    # LVM and that the value of the key is the proper type. If disabled,
    # any configuration mismatch is ignored and the default value is used
    # without any warning (a message about the configuration key not being
    # found is issued in verbose mode only).
    checks = 1

    # Configuration option config/abort_on_errors.
    # Abort the LVM process if a configuration mismatch is found.
    abort_on_errors = 0

    # Configuration option config/profile_dir.
    # Directory where LVM looks for configuration profiles.
    profile_dir = "/etc/lvm/profile"
}


local {
}
log {
    verbose=0
    silent=0
    syslog=1
    overwrite=0
    level=0
    indent=1
    command_names=0
    prefix="  "
    activation=0
    debug_classes=["memory","devices","activation","allocation","lvmetad","metadata","cache","locking","lvmpolld","dbus"]
}

I'd like to get a Python dict like this:

{"section_name":
    {"value1": 1,
     "value2": "some_string",
     "value3": ["list", "of", "strings"]},
 ... and so on}

The parser function:

def parseLvmConfig2(path="/etc/lvm/lvm.conf"):
    try:
        EQ, LBRACE, RBRACE, LQ, RQ = map(pp.Suppress, "={}[]")
        comment = pp.Suppress("#") + pp.Suppress(pp.restOfLine)
        configSection = pp.Word(pp.alphas + "_") + LBRACE
        sectionKey = pp.Word(pp.alphas + "_")
        sectionValue = pp.Forward()
        entry = pp.Group(sectionKey + EQ + sectionValue)
        real = pp.Regex(r"[+-]?\d+\.\d*").setParseAction(lambda x: float(x[0]))
        integer = pp.Regex(r"[+-]?\d+").setParseAction(lambda x: int(x[0]))
        listval = pp.Regex(r'(?:\[)(.*)?(?:\])').setParseAction(lambda x: eval(x[0]))

        pp.dblQuotedString.setParseAction(pp.removeQuotes)

        struct = pp.Group(pp.ZeroOrMore(entry) + RBRACE)
        sectionValue << (pp.dblQuotedString | real | integer | listval)
        parser = pp.ZeroOrMore(configSection + pp.Dict(struct))
        res = parser.parseFile(path)
        print(res)
    except (pp.ParseBaseException, ) as e:
        print("lvm.conf bad format {0}".format(e))

The result is messy. How can I make pyparsing do the job without additional post-processing logic?

UPDATE (SOLVED):

For anyone who wants to understand pyparsing better, please check @PaulMcG's explanation below. (Thanks for pyparsing, Paul!)

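import pyparsing as pp
from pyparsing import pyparsing_common as ppc

def parseLvmConf(conf="/etc/lvm/lvm.conf", res_type="dict"):
    EQ, LBRACE, RBRACE, LQ, RQ = map(pp.Suppress, "={}[]")
    comment = "#" + pp.restOfLine
    # ppc.integer and ppc.real convert the matched text to int and float.
    # (pp.nums is just the string "0123456789", not a parser expression.)
    integer = ppc.integer
    real = ppc.real
    pp.dblQuotedString.setParseAction(pp.removeQuotes)
    scalar_value = real | integer | pp.dblQuotedString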
    list_value = pp.Group(LQ + pp.delimitedList(scalar_value) + RQ)
    key = pp.Word(pp.alphas + "_", pp.alphanums + '_')
    key_value = pp.Group(key + EQ + (scalar_value | list_value))
    struct = pp.Forward()
    entry = key_value | pp.Group(key + struct)
    struct <<= pp.Dict(LBRACE + pp.ZeroOrMore(entry) + RBRACE)
    parser = pp.Dict(pp.ZeroOrMore(entry))
    parser.ignore(comment)
    try:
        #return lvm.conf as dict
        if res_type == "dict":
            return parser.parseFile(conf).asDict()
        # return lvm.conf as list
        elif res_type == "list":
            return parser.parseFile(conf).asList()
        else:
            #return lvm.conf as ParseResults
            return parser.parseFile(conf)
    except (pp.ParseBaseException,) as e:
        print("lvm.conf bad format {0}".format(e))
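
Since the goal was a JSON-like object, here is a minimal sketch of serializing the parsed result with the standard json module. It restates the grammar inside a helper called build_lvm_parser (a name made up for this example), parses an in-memory sample instead of /etc/lvm/lvm.conf, and assumes a pyparsing version that ships pyparsing_common:

```python
import json

import pyparsing as pp
from pyparsing import pyparsing_common as ppc


def build_lvm_parser():
    """Hypothetical helper: the grammar from above, returned as one parser."""
    EQ, LBRACE, RBRACE, LQ, RQ = map(pp.Suppress, "={}[]")
    comment = "#" + pp.restOfLine
    # Work on a copy so the shared pp.dblQuotedString is left untouched.
    quoted = pp.dblQuotedString.copy().setParseAction(pp.removeQuotes)
    scalar_value = ppc.real | ppc.integer | quoted
    list_value = pp.Group(LQ + pp.delimitedList(scalar_value) + RQ)
    key = pp.Word(pp.alphas + "_", pp.alphanums + "_")
    key_value = pp.Group(key + EQ + (scalar_value | list_value))
    struct = pp.Forward()
    entry = key_value | pp.Group(key + struct)
    struct <<= pp.Dict(LBRACE + pp.ZeroOrMore(entry) + RBRACE)
    parser = pp.Dict(pp.ZeroOrMore(entry))
    parser.ignore(comment)
    return parser


sample = '''
# Configuration section config.
config {
    checks = 1
    profile_dir = "/etc/lvm/profile"
}
log {
    prefix = "  "
    debug_classes = ["memory", "devices"]
}
'''

# parseAll=True makes pyparsing raise if any input is left unparsed.
conf = build_lvm_parser().parseString(sample, parseAll=True).asDict()
as_json = json.dumps(conf, indent=2, sort_keys=True)
print(as_json)
```

asDict() yields plain dicts, lists, ints and strings, so json.dumps needs no custom encoder here.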

Solution

  • Step 1 should always be to at least rough out a BNF for the format you are going to parse. This really helps organize your thoughts, and gets you thinking about the structure and data you are parsing, before starting to write actual code.

    Here is a BNF that I came up with for this config (it looks like a Python string because that makes it easy to paste into your code for future reference - but pyparsing does not work with or require such strings, they are purely a design tool):

    BNF = '''
        key_struct ::= key struct
        struct ::= '{' (key_value | key_struct)... '}'
        key_value ::= key '=' (scalar_value | list_value)
        key ::= word composed of alphas and '_'
        list_value ::= '[' scalar_value [',' scalar_value]... ']'
        scalar_value ::= real | integer | double-quoted-string
        comment ::= '#' rest-of-line
    '''
    

    Notice that the opening and closing {}'s and []'s are at the same level, rather than having an opener in one expression and a closer in another.

    This BNF also will allow for structs nested within structs, which is not strictly required in the sample text you posted, but since your code looked to be supporting that, I included it.

    Translating to pyparsing is pretty straightforward from here, working bottom-up through the BNF:

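    import pyparsing as pp
    from pyparsing import pyparsing_common as ppc

    EQ, LBRACE, RBRACE, LQ, RQ = map(pp.Suppress, "={}[]")
    comment = "#" + pp.restOfLine
    
    # ppc.integer and ppc.real convert the matched text to int and float;
    # they replace the hand-written versions from the question:
    #   pp.Regex(r"[+-]?\d+").setParseAction(lambda x: int(x[0]))
    #   pp.Regex(r"[+-]?\d+\.\d*").setParseAction(lambda x: float(x[0]))
    integer = ppc.integer
    real = ppc.real
    pp.dblQuotedString.setParseAction(pp.removeQuotes)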
    scalar_value = real | integer | pp.dblQuotedString
    
    # `delimitedList(expr)` is a shortcut for `expr + ZeroOrMore(Suppress(',') + expr)`
    list_value = pp.Group(LQ + pp.delimitedList(scalar_value) + RQ)
    
    key = pp.Word(pp.alphas + "_", pp.alphanums + '_')
    key_value = pp.Group(key + EQ + (scalar_value | list_value))
    
    struct = pp.Forward()
    entry = key_value | pp.Group(key + struct)
    struct <<= (LBRACE + pp.ZeroOrMore(entry) + RBRACE)
    parser = pp.ZeroOrMore(entry)
    parser.ignore(comment)
    

    Running this code:

    # lvm_source holds the lvm.conf text shown in the question
    try:
        res = parser.parseString(lvm_source)
        # print(res.dump())
        res.pprint()
    except pp.ParseBaseException as e:
        print("lvm.conf bad format {0}".format(e))
    

    Gives this nested list:

    [['config',
      ['checks', 1],
      ['abort_on_errors', 0],
      ['profile_dir', '/etc/lvm/profile']],
     ['local'],
     ['log',
      ['verbose', 0],
      ['silent', 0],
      ['syslog', 1],
      ['overwrite', 0],
      ['level', 0],
      ['indent', 1],
      ['command_names', 0],
      ['prefix', '  '],
      ['activation', 0],
      ['debug_classes',
       ['memory',
        'devices',
        'activation',
        'allocation',
        'lvmetad',
        'metadata',
        'cache',
        'locking',
        'lvmpolld',
        'dbus']]]]
    

    I think the format you would prefer is one where you can access the values as keys in a nested dict or in a hierarchical object. Pyparsing has a class called Dict that will do this at parse time, so that the results names are automatically assigned for nested subgroups. Change these two lines to have their sub-entries automatically dict-ified:

    struct <<= pp.Dict(LBRACE + pp.ZeroOrMore(entry) + RBRACE)
    parser = pp.Dict(pp.ZeroOrMore(entry))
    

    Now if we call dump() instead of pprint(), we'll see the hierarchical naming:

    [['config', ['checks', 1], ['abort_on_errors', 0], ['profile_dir', '/etc/lvm/profile']], ['local'], ['log', ['verbose', 0], ['silent', 0], ['syslog', 1], ['overwrite', 0], ['level', 0], ['indent', 1], ['command_names', 0], ['prefix', '  '], ['activation', 0], ['debug_classes', ['memory', 'devices', 'activation', 'allocation', 'lvmetad', 'metadata', 'cache', 'locking', 'lvmpolld', 'dbus']]]]
    - config: [['checks', 1], ['abort_on_errors', 0], ['profile_dir', '/etc/lvm/profile']]
      - abort_on_errors: 0
      - checks: 1
      - profile_dir: '/etc/lvm/profile'
    - local: ''
    - log: [['verbose', 0], ['silent', 0], ['syslog', 1], ['overwrite', 0], ['level', 0], ['indent', 1], ['command_names', 0], ['prefix', '  '], ['activation', 0], ['debug_classes', ['memory', 'devices', 'activation', 'allocation', 'lvmetad', 'metadata', 'cache', 'locking', 'lvmpolld', 'dbus']]]
      - activation: 0
      - command_names: 0
      - debug_classes: ['memory', 'devices', 'activation', 'allocation', 'lvmetad', 'metadata', 'cache', 'locking', 'lvmpolld', 'dbus']
      - indent: 1
      - level: 0
      - overwrite: 0
      - prefix: '  '
      - silent: 0
      - syslog: 1
      - verbose: 0
    

    You can then access the fields as res['config']['checks'] or res.log.indent.
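
The grammar also accepts sections nested inside sections, as mentioned earlier. Here is a small sketch of what access looks like in that case. The section names outer and inner and their keys are invented for the example and are not real lvm.conf settings, and the grammar is restated inside a hypothetical make_parser helper so the snippet runs on its own:

```python
import pyparsing as pp
from pyparsing import pyparsing_common as ppc


def make_parser():
    """Hypothetical helper wrapping the Dict-based grammar from this answer."""
    EQ, LBRACE, RBRACE, LQ, RQ = map(pp.Suppress, "={}[]")
    comment = "#" + pp.restOfLine
    # Copy before attaching the parse action, to avoid changing the shared expression.
    quoted = pp.dblQuotedString.copy().setParseAction(pp.removeQuotes)
    scalar_value = ppc.real | ppc.integer | quoted
    list_value = pp.Group(LQ + pp.delimitedList(scalar_value) + RQ)
    key = pp.Word(pp.alphas + "_", pp.alphanums + "_")
    key_value = pp.Group(key + EQ + (scalar_value | list_value))
    struct = pp.Forward()
    entry = key_value | pp.Group(key + struct)
    struct <<= pp.Dict(LBRACE + pp.ZeroOrMore(entry) + RBRACE)
    parser = pp.Dict(pp.ZeroOrMore(entry))
    parser.ignore(comment)
    return parser


# Made-up nested input, only to exercise struct-within-struct parsing.
sample = '''
outer {
    flag = 1
    inner {
        label = "x"
        sizes = [1, 2, 3]
    }
}
'''

res = make_parser().parseString(sample, parseAll=True)

print(res["outer"]["flag"])            # key-style access
print(res.outer.inner.label)           # attribute-style access
print(res.outer.inner.sizes.asList())  # list value as a plain Python list
print(res.asDict())                    # the whole result as nested dicts and lists
```

Attribute-style access only works for keys that do not collide with ParseResults method names such as keys or items; for those, use the bracket form.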