We are converting our ancient FrameMaker docs to XML. My job is to convert this:
<?FM MARKER [Index] foo, bar ?>`
to this:
<indexterm>
<primary>foo, bar</primary>
</indexterm>
I'm not worried about that part (yet); what is stumping me is that the ProcessingInstruction
s are all over the documents and could potentially be under any element, so I need to be able to search the entire tree, find them, and then process them. I cannot figure out how to iterate over an entire XML tree using minidom
. Am I missing some secret method/iterator? This is what I've looked at thus far:
Elementtree
has the excellent Element.iter()
method, which is a depth-first search, but it doesn't process ProcessingInstruction
s.
ProcessingInstruction
s don't have tag names, so I cannot search for them using minidom
's getElementsByTagName
.
xml.sax
's ContentHandler.processingInstruction
looks like it's only used to create ProcessingInstruction
s.
Short of creating my own depth-first search algorithm, is there a way to generate a list of ProcessingInstruction
s in an XML file, or identify their parents?
Use the XPath API of the lxml
module as such:
from lxml import etree
foo = StringIO('<foo><bar></bar></foo>')
tree = etree.parse(foo)
result = tree.xpath('//processing-instruction()')
The node test processing-instruction() is true for any processing instruction. The processing-instruction() test may have an argument that is Literal; in this case, it is true for any processing instruction that has a name equal to the value of the Literal.
References