I am trying to parse an XML file using Python. Due to the size of the XML, I want to use a Pull Parser. I found this one.
My code starts with
from xml.dom import pulldom

doc = pulldom.parse("myfile.xml")
for event, node in doc:
    # code here...
I am using
if (node.localName == "b"):
to get the XML tag name, and it works fine.
What I can't work out is how to get the text between the tags. Using node.nodeValue returns None. I can use node.toxml() to get the full XML for the node, but I only want the text between the tags. Is there a way to do this other than using a regex replace to strip the tags out of node.toxml()?
You get two events with local name "b" for every tag with text: a START_ELEMENT and an END_ELEMENT. The text itself arrives as a separate event in between, so normally you should receive something like this:
START_ELEMENT
CHARACTERS
END_ELEMENT
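If you want to check this ordering yourself, a small sketch like the following dumps the event stream. It assumes Python 3 and parses an inline string with parseString instead of reading a file; event_trace and SAMPLE are hypothetical names used only for illustration.

```python
from xml.dom.pulldom import CHARACTERS, parseString

# Small inline document, used here instead of a file on disk (assumption).
SAMPLE = "<a>\n<b>c1</b>\n<b>c2</b>\n</a>"


def event_trace(xml_text):
    """Return (event, detail) pairs for every event the pull parser emits."""
    trace = []
    for event, node in parseString(xml_text):
        # Text nodes carry their content in .data; other nodes have a nodeName.
        detail = node.data if event == CHARACTERS else node.nodeName
        trace.append((event, detail))
    return trace


for event, detail in event_trace(SAMPLE):
    print(event, repr(detail))
```

The repr() makes the whitespace-only CHARACTERS events between elements visible.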
So you are looking for the characters after a matching start-element. You may want to try something like this:
from xml.dom.pulldom import CHARACTERS, START_ELEMENT, parse

doc = parse("myfile.xml")
text_expected = False  # True right after a <b> start tag
for event, node in doc:
    if text_expected:
        text_expected = False
        if event != CHARACTERS:
            # strange: there should be some text (e.g. the element was empty)
            continue
        print(node.data)
    else:
        text_expected = (event == START_ELEMENT) and (node.localName == "b")
With this myfile.xml:
<a>
    <b>c1</b>
    <b>c2</b>
</a>
I get the output
c1
c2
Note that you might need to strip() each string, and you must ignore every other CHARACTERS event: every line break and run of whitespace between two elements generates a CHARACTERS event of its own.
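A slightly more defensive variant of the same idea is sketched below. It collects every CHARACTERS event between the start and end of the element, joins the pieces, and strips the result, so whitespace events outside the element are ignored. This assumes Python 3 and a document without namespaces (it compares node.tagName); element_texts is a hypothetical helper name, and parseString is used only to keep the example self-contained.

```python
from xml.dom.pulldom import CHARACTERS, END_ELEMENT, START_ELEMENT, parseString


def element_texts(xml_text, tag):
    """Collect the stripped text of every <tag> element (hypothetical helper)."""
    texts = []
    chunks = None  # None while outside <tag>; a list of text pieces while inside
    for event, node in parseString(xml_text):
        if event == START_ELEMENT and node.tagName == tag:
            chunks = []
        elif event == CHARACTERS and chunks is not None:
            # Text may be delivered in several pieces, so accumulate them.
            chunks.append(node.data)
        elif event == END_ELEMENT and node.tagName == tag:
            if chunks is not None:
                texts.append("".join(chunks).strip())
            chunks = None
    return texts


print(element_texts("<a>\n<b>c1</b>\n<b> c2 </b>\n</a>", "b"))
```

Text inside child elements of the matched tag is collected too, and nested elements with the same tag name are not handled, so treat this as a starting point rather than a general solution.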