There is probably a simple solution, but I just cannot find it, so I would be glad if someone could help me out. I am using Python 3.7 and trying to parse the "date" element, but only the one inside application-reference:
<bibliographic-data>
<publication-reference>
<document-id document-id-type="docdb">
<country>EP</country>
<doc-number>1001100</doc-number>
<kind>A1</kind>
<date>20000517</date>
</document-id>
<document-id document-id-type="epodoc">
<doc-number>EP1000000</doc-number>
<date>20000517</date>
</document-id>
<application-reference doc-id="17397285">
<document-id document-id-type="docdb">
<country>EP</country>
<doc-number>99203729</doc-number>
<kind>A</kind>
</document-id>
<document-id document-id-type="epodoc">
<doc-number>EP199903729</doc-number>
<date>19991108</date>
</document-id>
<document-id document-id-type="original">
<doc-number>993729</doc-number>
</document-id>
</application-reference>
Many other dates may appear in front of application-reference, so I cannot simply print, e.g., the 4th date.
I tried simple xmldom and xml.etree queries, but none of them worked.
As I am not sure how exactly to access it, I tried:
root = ElementTree.fromstring(js).getroot()
for appl in root.findall("application-reference"):
    ElementTree.dump(appl)
and then I am stuck.
The result should be 19991108.
Try using lxml and xpath:
bibli = """[your xml above; make sure it's properly formatted!]"""
from lxml import etree
root = etree.fromstring(bibli)
print(root.xpath('//application-reference//date/text()'))
Output:
['19991108']
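If you would rather stay with the standard library, something along these lines should also work. This is only a sketch: `xml_text` is a trimmed copy of the sample with the missing closing tags added (an assumption about how the real document is nested), and `application_dates` is a hypothetical helper name, not part of any library.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the sample. The closing tags for
# publication-reference and bibliographic-data are assumed.
xml_text = """
<bibliographic-data>
  <publication-reference>
    <document-id document-id-type="epodoc">
      <doc-number>EP1000000</doc-number>
      <date>20000517</date>
    </document-id>
  </publication-reference>
  <application-reference doc-id="17397285">
    <document-id document-id-type="docdb">
      <country>EP</country>
      <doc-number>99203729</doc-number>
      <kind>A</kind>
    </document-id>
    <document-id document-id-type="epodoc">
      <doc-number>EP199903729</doc-number>
      <date>19991108</date>
    </document-id>
  </application-reference>
</bibliographic-data>
"""


def application_dates(text):
    """Return the text of every <date> nested under <application-reference>."""
    root = ET.fromstring(text)  # fromstring already returns the root element
    return [
        date.text
        for app in root.iter("application-reference")  # any depth
        for date in app.iter("date")                    # any depth below it
    ]


print(application_dates(xml_text))
```

Because the search starts from each application-reference element, dates under publication-reference are never visited, no matter how many of them come first.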