python regex parsing configparser moses

Parsing a Moses config file


Given a config file like this one from the Moses Machine Translation Toolkit:

#########################
### MOSES CONFIG FILE ###
#########################

# input factors
[input-factors]
0

# mapping steps
[mapping]
0 T 0

[distortion-limit]
6

# feature functions
[feature]
UnknownWordPenalty
WordPenalty
PhrasePenalty
PhraseDictionaryMemory name=TranslationModel0 num-features=4 path=/home/gillin/jojomert/phrase-jojo/work.src-ref/training/model/phrase-table.gz input-factor=0 output-factor=0
LexicalReordering name=LexicalReordering0 num-features=6 type=wbe-msd-bidirectional-fe-allff input-factor=0 output-factor=0 path=/home/gillin/jojomert/phrase-jojo/work.src-ref/training/model/reordering-table.wbe-msd-bidirectional-fe.gz
Distortion
KENLM lazyken=0 name=LM0 factor=0 path=/home/gillin/jojomert/ru.kenlm order=5

# dense weights for feature functions
[weight]
UnknownWordPenalty0= 1
WordPenalty0= -1
PhrasePenalty0= 0.2
TranslationModel0= 0.2 0.2 0.2 0.2
LexicalReordering0= 0.3 0.3 0.3 0.3 0.3 0.3
Distortion0= 0.3
LM0= 0.5

I need to read the parameters from the [weight] section:

UnknownWordPenalty0= 1
WordPenalty0= -1
PhrasePenalty0= 0.2
TranslationModel0= 0.2 0.2 0.2 0.2
LexicalReordering0= 0.3 0.3 0.3 0.3 0.3 0.3
Distortion0= 0.3
LM0= 0.5

I have been doing it like this:

def read_params_from_moses_ini(mosesinifile):
    # Walk the file backwards, collecting lines until the [weight] header.
    parameters_string = ""
    with open(mosesinifile, 'r') as fin:
        lines = fin.readlines()
    for line in reversed(lines):
        if line.startswith('[weight]'):
            return parameters_string
        parameters_string += line.strip() + ' '

to get this output:

LM0= 0.5 Distortion0= 0.3 LexicalReordering0= 0.3 0.3 0.3 0.3 0.3 0.3 TranslationModel0= 0.2 0.2 0.2 0.2 PhrasePenalty0= 0.2 WordPenalty0= -1 UnknownWordPenalty0= 1 

Then I parse the output with:

import re

# One "name= v1 v2 ..." entry: the name, then whitespace-separated values.
moses_param_pattern = re.compile(r'''([^\s=]+)=\s*((?:[^\s=]+(?:\s|$))*)''')

def parse_parameters(parameters_string):
    return dict((k, list(map(float, v.split())))
                for k, v in moses_param_pattern.findall(parameters_string))


mosesinifile = 'mertfiles/moses.ini'

print(parse_parameters(read_params_from_moses_ini(mosesinifile)))

to get:

{'UnknownWordPenalty0': [1.0], 'PhrasePenalty0': [0.2], 'WordPenalty0': [-1.0], 'Distortion0': [0.3], 'LexicalReordering0': [0.3, 0.3, 0.3, 0.3, 0.3, 0.3], 'TranslationModel0': [0.2, 0.2, 0.2, 0.2], 'LM0': [0.5]}

The current solution involves an awkward reversed read of the config file, followed by a fairly complicated regex to extract the parameters.

Is there a simpler or less hacky/verbose way to read the file and achieve the desired parameter dictionary output?

Is it possible to configure configparser so that it reads the Moses config file? This is difficult because the file has sections that are really bare parameters, e.g. [distortion-limit], whose value 6 has no key. In a file that configparser could parse, it would have been distortion-limit = 6.
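
A minimal reproduction of the problem, assuming Python 3's standard configparser with its default settings. The two-line sample is cut down from the file above, and the variable names are only illustrative:

```python
import configparser

# Cut-down sample: a section whose only line is a bare value with no key.
sample = "[distortion-limit]\n6\n"

parser = configparser.ConfigParser()  # default settings
try:
    parser.read_string(sample)
    error = None
except configparser.ParsingError as exc:
    # The line "6" has no "=" or ":" delimiter, so the parser rejects it.
    error = exc

print(type(error).__name__)
```

With default settings the bare value is reported as a parsing error rather than being stored under the section name.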


Note: Python's built-in configparser is unable to handle a moses.ini config file. The answers to "How to read and write INI file with Python3?" will not work.


Solution

  • Here is another short regex-based solution that returns a dictionary of the values similar to your output:

    import re

    dct = {}

    text = "MOSES_INI_FILE_CONTENTS"

    # Get the [weight] section: the header plus every line up to the first blank line.
    # Equivalent in effect to r"(?s)\[weight].*?(?:$|\n\n)"
    match_weight = re.search(r"\[weight][^\n]*(?:\n(?!$|\n)[^\n]*)*", text)
    if match_weight:
        weight = match_weight.group()  # text of the [weight] section
        # Map each "name= v1 v2 ..." line to name -> list of floats
        dct = {name: [float(v) for v in values.split()]
               for name, values in re.findall(r"(\w+)\s*=\s*(.*)\s*", weight)}

    print(dct)
    

    See IDEONE demo

    The resulting dictionary contents:

    {'UnknownWordPenalty0': [1.0], 'LexicalReordering0': [0.3, 0.3, 0.3, 0.3, 0.3, 0.3], 'LM0': [0.5], 'PhrasePenalty0': [0.2], 'TranslationModel0': [0.2, 0.2, 0.2, 0.2], 'Distortion0': [0.3], 'WordPenalty0': [-1.0]}
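
To see what each of the two regex steps produces, here is a minimal sketch on a shortened sample. The sample string is an assumption modeled on the file in the question, and the trailing section is hypothetical, added only to show where the match stops:

```python
import re

# Shortened sample modeled on the moses.ini in the question (assumed input).
sample = """[distortion-limit]
6

# dense weights for feature functions
[weight]
WordPenalty0= -1
TranslationModel0= 0.2 0.2 0.2 0.2
LM0= 0.5

[hypothetical-later-section]
1
"""

# Step 1: grab "[weight]" plus every following line up to the first blank line.
section = re.search(r"\[weight][^\n]*(?:\n(?!$|\n)[^\n]*)*", sample).group()

# Step 2: split each "name= v1 v2 ..." line into a (name, values) pair.
pairs = re.findall(r"(\w+)\s*=\s*(.*)\s*", section)

# Step 3: convert the whitespace-separated values to floats.
weights = {name: [float(v) for v in values.split()] for name, values in pairs}

print(section)
print(pairs)
print(weights)
```

The first print shows only the [weight] header and its three lines, the second shows the raw (name, values) string pairs, and the third shows the final dictionary.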
    

    The logic: