
XML parsing in Python: how to get the string indexes of child nodes with regard to the flattened string


I'm new to XML parsing in Python, and I need to obtain some data about the inner text of some phrase nodes and their children (preferably using Minidom, but that is not essential).

Example:

<phrase id="x.y">This example
    <foo id="x.y.z">
        <bar type="likelihood" ref="x.y.z">might</bar> 
    be useful</foo>.
</phrase>

What I want to get is the following data:

In the XML example, the inner text of <bar> (might) starts at index 13 and ends at index 18, whereas the contents of <foo> (be useful) start at index 19 and end at index 28 (counting from 0, with the end index exclusive). Running this example should return something like this (the order of the children does not matter):

('This example might be useful.', [('bar', 13, 18), ('foo', 19, 28)])


Solution

  • This was an interesting project! It is somewhat convoluted, and I'm not sure how well it will carry over to other inputs, but try something like this:

    from lxml import etree

    phrase = """[your xml above]"""
    doc = etree.fromstring(phrase)

    # This requires a couple of helper functions to clean up spaces, find indexes, etc.:

    def space_rem(text):
        # Collapse runs of spaces into a single space.
        while '  ' in text:
            text = text.replace('  ', ' ')
        return text

    def build(tag):
        # Join the text nodes directly under <tag>, then locate them in ttxt.
        # Note: str.find only returns the first match.
        parts = doc.xpath(f'//{tag}/text()')
        target = ''
        for s in parts:
            target += s.strip()
        target = space_rem(target)
        ind = ttxt.find(target)
        return ind, ind + len(target)

    foo_lst = ['foo']
    bar_lst = ['bar']
    ttxt = ''

    # Flatten every text node, in document order, into one string.
    for t in doc.xpath('//*/text()'):
        ttxt += t.replace('\n', '')
    ttxt = space_rem(ttxt)

    foo_lst.extend(build('foo'))
    bar_lst.extend(build('bar'))

    ttxt, foo_lst, bar_lst
    

    Output:

    ('This example might be useful.', ['foo', 19, 28], ['bar', 13, 18])
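
Since the question mentions Minidom, here is a rough sketch of the same idea using only xml.dom.minidom from the standard library. It is not the approach above but a variation: it walks the tree once, builds the flattened string character by character while collapsing whitespace, and records offsets as it goes, so it does not depend on str.find and should cope with repeated words. The name text_spans is a hypothetical helper, and the sketch assumes 0-based offsets with an exclusive end index and that the root element is left out of the result.

```python
from xml.dom import minidom

def text_spans(xml_source):
    """Return (flat_text, [(tag, start, end), ...]) for each non-root element
    that directly contains non-whitespace text. Hypothetical helper."""
    root = minidom.parseString(xml_source).documentElement
    chars = []             # characters of the flattened string
    spans = {}             # id(element) -> [tag, start, end]
    pending_space = False  # whitespace seen since the last kept character

    def walk(element):
        nonlocal pending_space
        for child in element.childNodes:
            if child.nodeType == child.ELEMENT_NODE:
                walk(child)
            elif child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE):
                for ch in child.data:
                    if ch.isspace():
                        pending_space = True
                        continue
                    # Emit one space for any whitespace run, but never a leading one.
                    if pending_space and chars:
                        chars.append(' ')
                    pending_space = False
                    span = spans.setdefault(
                        id(element), [element.tagName, len(chars), 0])
                    chars.append(ch)
                    span[2] = len(chars)  # exclusive end, updated as text grows

    walk(root)
    spans.pop(id(root), None)  # assumption: the root element is not reported
    return ''.join(chars), [tuple(s) for s in spans.values()]

example = """<phrase id="x.y">This example
    <foo id="x.y.z">
        <bar type="likelihood" ref="x.y.z">might</bar> 
    be useful</foo>.
</phrase>"""

print(text_spans(example))
# ('This example might be useful.', [('bar', 13, 18), ('foo', 19, 28)])
```

One caveat: each span runs from the first to the last character of an element's own text, so if an element has text on both sides of a child, its span will also cover the child's text.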