I'm not able to parse this XLIFF fragment:
<source>text1 <g id="1">text2</g> text3 <x id="2"/><x id="3"/>text4</source>
I would like to have an iterative method that runs over the source element and fills something like:
parsed_source[0]='text1'
parsed_source[1]='<g id="1">text2</g>'
parsed_source[2]='text3'
parsed_source[3]='<x id="2"/>'
parsed_source[4]='<x id="3"/>'
parsed_source[5]='text4'
so that I can iterate again over the XML fragments [1], [3] and [4] if needed.
Using lxml, for example:
from lxml import etree
tree = etree.iterparse('aFile.xlf')
for action, elem in tree:
    print("%s: %s %s" % (action, elem.tag, elem.text))
I get something similar to:
end: source text1
end: g text2
end: x None
end: x None
And I'm not able to get text3 and text4.
How can I do that? Thanks.
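You need to take the tail property (the text that follows an element's closing tag) into account. elem.text only holds the text before the first child element, which is why text3 and text4 were missing. Read about it here: https://lxml.de/tutorial.html#elements-contain-text.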
The following snippet (a slight modification of your code) demonstrates it:
from lxml import etree
tree = etree.iterparse('aFile.xlf')
for action, elem in tree:
    # elem.tail is the text between this element's end tag and the next tag
    print("%s: %s %s %s" % (action, elem.tag, elem.text, elem.tail))
Output:
end: g text2 text3
end: x None None
end: x None text4
end: source text1 None