Is there a way using lxml iterparse to skip an element without checking the tag? Take this xml for example:
<root>
<sample>
<tag1>text1</tag1>
<tag2>text2</tag2>
<tag3>text3</tag3>
<tag4>text4</tag4>
</sample>
<sample>
<tag1>text1</tag1>
<tag2>text2</tag2>
<tag3>text3</tag3>
<tag4>text4</tag4>
</sample>
</root>
If I only care about tag1 and tag4, checking tag2 and tag3 eats up some time. If the file isn't big it doesn't really matter, but if I have a million <sample> nodes, I could reduce search time somewhat by not having to check tag2 and tag3. They're always there and I never need them.
Using iterparse in lxml:
from lxml import etree
xmlfile = 'myfile.xml'
my_list = []
context = etree.iterparse(xmlfile, events=('end',), tag='sample')
for event, elem in context:
    for child in elem:
        if child.tag == 'tag1':
            my_list.append(child.text)
            # HERE I'd like to advance the loop twice without checking tag2 and tag3 at all
            # something like:
            # next(child)
            # next(child)
        elif child.tag == 'tag4':
            my_list.append(child.text)
If you use the tag arg in iterchildren like you do in iterparse, you can "skip" elements other than tag1 and tag4.
Example...
from lxml import etree
xmlfile = "myfile.xml"
my_list = []
for event, elem in etree.iterparse(xmlfile, tag="sample"):
    for child in elem.iterchildren(tag=["tag1", "tag4"]):
        if child.tag == "tag1":
            my_list.append(child.text)
        elif child.tag == "tag4":
            my_list.append(child.text)
print(my_list)
Printed output...
['text1', 'text4', 'text1', 'text4']
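Since the question mentions a million <sample> nodes, here is a minimal sketch of two variants that may be worth timing against each other. The function names `collect_by_filter` and `collect_by_position` and the inline `XML` bytes are made up for illustration. The second variant assumes, as the question states, that the four children are always present and always in the same order; if that ever fails it will pick up the wrong elements. Both call `elem.clear()` after each <sample> so processed children can be released, which is an addition not shown above.

```python
from io import BytesIO
from lxml import etree

# Inline copy of the question's sample document, so the sketch runs without a file.
XML = b"""<root>
<sample><tag1>text1</tag1><tag2>text2</tag2><tag3>text3</tag3><tag4>text4</tag4></sample>
<sample><tag1>text1</tag1><tag2>text2</tag2><tag3>text3</tag3><tag4>text4</tag4></sample>
</root>"""


def collect_by_filter(source, wanted=("tag1", "tag4")):
    """Let lxml filter the children by tag name; no tag test in Python code."""
    texts = []
    for _event, elem in etree.iterparse(source, events=("end",), tag="sample"):
        # iterchildren accepts several tag names and yields only matching children.
        texts.extend(child.text for child in elem.iterchildren(*wanted))
        elem.clear()  # drop the children of the <sample> we just handled
    return texts


def collect_by_position(source):
    """ASSUMPTION: tag1 is always the first child and tag4 always the last."""
    texts = []
    for _event, elem in etree.iterparse(source, events=("end",), tag="sample"):
        texts.append(elem[0].text)   # first child, assumed to be tag1
        texts.append(elem[-1].text)  # last child, assumed to be tag4
        elem.clear()
    return texts


filtered = collect_by_filter(BytesIO(XML))
positional = collect_by_position(BytesIO(XML))
print(filtered)
print(positional)
```

Either way the parser still has to read tag2 and tag3 from the file; what you save is the per-child comparison in the Python loop, so measure on your real data before settling on one.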