I'm using Python to look inside these files: Each zip contains a single XML file with the same base name as the zip file. Each XML file is a concatenation of thousands of individual XML files, which I have separated out into individual files. Some of these XML files have a tag named ?GOVINT (with a leading question mark), and I'm having trouble finding it in the parse tree. This is the code I have so far:
import os
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9

fname = 'extracted_xmls/ipg140107/1163_G_08622343.xml'
parsed = ET.parse(fname)
root = parsed.getroot()

# Locate the publication reference; its position depends on the root tag.
if root.tag == "us-patent-grant":
    bibref = root.find('us-bibliographic-data-grant')
    pubref = bibref.find('publication-reference')
    prefix = "G"
elif root.tag == "sequence-cwu":
    pubref = root.find('publication-reference')
    prefix = "S"
else:
    print(fname, "...uncoded tag")

# Print every tag nested under each <description> element.
for g in root.iter():
    if g.tag == 'description':
        print(g.tag)
        for ga in g.iter():
            print(ga.tag)

# Attempt to find the ?GOVINT tag; this loop never prints anything.
for g in root.findall('?GOVINT'):
    print(g)
But the ?GOVINT tag doesn't show up. I think these special tags with question marks in front are called "processing instructions," but I can't figure out how to extract them. Any comments, pointers, and especially code snippets for traversing those things would be appreciated.
The documentation for ElementTree says that parse() ignores any comments or processing instructions. So the question now is: is there a parser that does not do this?
The answer is this: Tags with a question mark in front of them are not really tags. They are "processing instructions." According to the documentation for ElementTree, its default parser skips processing instructions, so they never appear in the parsed tree and find() or findall() cannot match them.
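For anyone on a newer interpreter, here is a minimal sketch of one way around this. It assumes Python 3.8 or later, where xml.etree.ElementTree.TreeBuilder accepts an insert_pis flag that keeps processing instructions found inside the root element. The function name find_processing_instructions and the sample XML are made up for illustration; they are not from the patent files.

```python
import xml.etree.ElementTree as ET


def find_processing_instructions(xml_text, target=None):
    """Return (target, data) pairs for processing instructions in xml_text.

    Hypothetical helper. Assumes Python 3.8+, where TreeBuilder takes
    insert_pis. Only instructions inside the root element are kept.
    """
    # A TreeBuilder with insert_pis=True keeps PIs as special tree nodes.
    parser = ET.XMLParser(target=ET.TreeBuilder(insert_pis=True))
    root = ET.fromstring(xml_text, parser=parser)

    found = []
    for elem in root.iter():
        # PI nodes use the ProcessingInstruction factory function as their tag.
        if elem.tag is ET.ProcessingInstruction:
            # The node text is the PI target, a space, then the PI data.
            name, _, data = (elem.text or "").partition(" ")
            if target is None or name == target:
                found.append((name, data))
    return found


# Made-up sample document, not taken from the real files.
sample = "<root><description><?GOVINT example text?><p>body</p></description></root>"
print(find_processing_instructions(sample, target="GOVINT"))
```

To use this on a file rather than a string, the same parser object can be passed as ET.parse(fname, parser=parser). On older Python versions this flag does not exist, so a different parser would be needed.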