XML parsing in Python: how to get the string indexes of child nodes with regard to the flattened string

I'm new to XML parsing in Python and I need to extract some data about the inner text of certain phrase nodes and their children (preferably using minidom, but that is not essential).

Example:

<phrase id="x.y">This example
    <foo id="x.y.z">
        <bar type="likelihood" ref="x.y.z">might</bar> 
    be useful</foo>.
</phrase>

What I want to get is the following data:

The whole text as a single string, combining the text of the parent node and its children (just as the recursive getText function featured in the minidom documentation does)
A list of triplets containing the children's data:
- tag name
- start index within the whole string
- end index within the whole string

In the XML example, the <bar> inner text (might) starts at index 13 and ends at index 18, whereas the <foo> contents (be useful) start at index 19 and end at index 28. Running this example should return something like this (the order of the children is of no importance):

('This example might be useful.', [('bar', 13, 18), ('foo', 19, 28)])

Solution

This was an interesting project! It's somewhat convoluted, and I'm not sure how well it will hold up in other situations (it locates each child's text with str.find, so it only finds the first occurrence), but try something like this:

from lxml import etree
phrase = """[your xml above]"""
doc = etree.fromstring(phrase)

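# this requires a couple of helper functions to clean up spaces, find indexes, etc.:

def space_rem(text):
    # collapse runs of spaces into a single space
    while '  ' in text:
        text = text.replace('  ', ' ')
    return text

def build(tag):
    # join the text nodes that sit directly inside <tag>, stripped of whitespace
    parts = doc.xpath(f'//{tag}/text()')
    inner = ''
    for s in parts:
        inner += s.strip()
    inner = space_rem(inner)
    # find() returns the first match only, so repeated text will be misreported
    start = ttxt.find(inner)
    return start, start + len(inner)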

foo_lst = ['foo']
bar_lst = ['bar']
ttxt = ''

for t in doc.xpath('//*/text()'):
    ttxt+=t.replace('\n','')
ttxt = space_rem(ttxt)

foo_lst.extend(build('foo'))
bar_lst.extend(build('bar'))

print((ttxt, foo_lst, bar_lst))

Output:

('This example might be useful.', ['foo', 19, 28], ['bar', 13, 18])
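
Since minidom was the preferred tool, here is a rough sketch of a different technique: instead of searching the flattened string afterwards, it records offsets while walking the tree, so repeated words are not a problem. The function name flatten_with_spans is made up for this illustration, and the whitespace rule (collapse every run of whitespace into one space) is an assumption based on the expected string above. It reports one triplet per child element, covering that element's own text from its first to its last non-whitespace character; if an element has text on both sides of a nested child, the span will include the nested child's text as well.

```python
from xml.dom import minidom

# Sample from the question; note the single space after </bar>.
PHRASE = """<phrase id="x.y">This example
    <foo id="x.y.z">
        <bar type="likelihood" ref="x.y.z">might</bar> 
    be useful</foo>.
</phrase>"""

def flatten_with_spans(xml_text):
    """Return (flattened_text, [(tag, start, end), ...]) for the root's descendants."""
    root = minidom.parseString(xml_text).documentElement
    out = []    # characters of the flattened string, built in document order
    spans = []  # one (tag, start, end) triplet per element that owns some text

    def walk(node):
        start = end = None  # extent of the text owned directly by this node
        for child in node.childNodes:
            if child.nodeType == child.ELEMENT_NODE:
                walk(child)
            elif child.nodeType == child.TEXT_NODE:
                for ch in child.data:
                    if ch.isspace():
                        # collapse whitespace runs; never start the string with a space
                        if out and out[-1] != ' ':
                            out.append(' ')
                    else:
                        if start is None:
                            start = len(out)
                        out.append(ch)
                        end = len(out)  # exclusive end, as in slicing
        if node is not root and start is not None:
            spans.append((node.tagName, start, end))

    walk(root)
    return ''.join(out).rstrip(), spans

text, spans = flatten_with_spans(PHRASE)
print((text, spans))
# The indexes follow slice semantics, so each triplet can be checked directly:
for tag, start, end in spans:
    print(tag, repr(text[start:end]))
```

For the sample this gives the same indexes as the output above, (13, 18) for bar and (19, 28) for foo, with bar listed first because its element is closed first.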