I'm new to XML parsing in Python, and I need to obtain some data about the inner text of some phrase nodes and their children (preferably using Minidom, but that is not essential).
Example:
<phrase id="x.y">This example
<foo id="x.y.z">
<bar type="likelihood" ref="x.y.z">might</bar>
be useful</foo>.
</phrase>
What I want to get is the inner text of the phrase, including the text of its children (as getText, which is featured in the Minidom documentation, does), together with the start and end indexes of each child's inner text within it. In the XML example, the inner text of <bar> (might) starts at index 14 and ends at index 18, whereas the contents of <foo> (be useful) start at index 19 and end at index 28. Running this example should return something like this (the order of the children is of no importance):
('This example might be useful.', [('bar', 14, 18), ('foo', 19, 28)])
This was an interesting project! It's somewhat convoluted, and I'm not sure how well it will carry over to other situations, but try something like this:
from lxml import etree

phrase = """[your xml above]"""
doc = etree.fromstring(phrase)

# This requires a couple of helper functions to clean up spaces, find indexes, etc.:
def space_rem(text):
    # Collapse runs of spaces into a single space
    while '  ' in text:
        text = text.replace('  ', ' ')
    return text

def build(tag):
    # Join the direct text nodes of <tag>, then locate the result in the full text
    inner = ''
    for s in doc.xpath(f'//{tag}/text()'):
        inner += s.strip()
    inner = space_rem(inner)
    start = ttxt.find(inner)
    return start, start + len(inner)

foo_lst = ['foo']
bar_lst = ['bar']

# Build the full inner text; newlines become spaces so that words on
# adjacent lines stay separated, then the extra spaces are collapsed
ttxt = ''
for t in doc.xpath('//*/text()'):
    ttxt += t.replace('\n', ' ')
ttxt = space_rem(ttxt).strip()

foo_lst.extend(build('foo'))
bar_lst.extend(build('bar'))
ttxt, foo_lst, bar_lst
Output:
('This example might be useful.', ['foo', 19, 28], ['bar', 13, 18])
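Since the question prefers Minidom, here is a rough sketch of the same idea using only the standard library. It walks the text nodes in document order, collapses whitespace as it goes, and records where each element's own text lands in the result. The function name text_and_spans is made up for this example, and it rests on a few assumptions: indexes are 0-based with an exclusive end (so they work as Python slices, which is why bar starts at 13 rather than 14), the root element itself is not reported, and an element whose own text is split around a child gets a single span from its first to its last non-whitespace character.

```python
from xml.dom import minidom, Node

XML = """<phrase id="x.y">This example
<foo id="x.y.z">
<bar type="likelihood" ref="x.y.z">might</bar>
be useful</foo>.
</phrase>"""


def iter_text(element):
    """Yield (owning element, raw text) pairs in document order."""
    for child in element.childNodes:
        if child.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
            yield element, child.data
        elif child.nodeType == Node.ELEMENT_NODE:
            yield from iter_text(child)


def text_and_spans(xml_string):
    """Return (normalized text, [(tag, start, end), ...]) for the root element."""
    root = minidom.parseString(xml_string).documentElement
    out = []              # characters of the whitespace-normalized text
    spans = {}            # id(element) -> [tag, start, end]
    pending_space = False
    for owner, data in iter_text(root):
        for ch in data:
            if ch.isspace():
                # Remember the gap; emit one space only if more text follows
                pending_space = bool(out)
                continue
            if pending_space:
                out.append(' ')
                pending_space = False
            pos = len(out)
            out.append(ch)
            if owner is not root:
                span = spans.setdefault(id(owner), [owner.tagName, pos, pos + 1])
                span[2] = pos + 1   # extend the span to the latest character
    return ''.join(out), [tuple(s) for s in spans.values()]


print(text_and_spans(XML))
# ('This example might be useful.', [('bar', 13, 18), ('foo', 19, 28)])
```

Because the offsets are computed while the text is being built, this does not rely on str.find, so it should also cope with a word that appears more than once in the phrase.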