
Parsing same-named tags in XML using Python


There is probably a simple solution, but I just cannot find it, so I would be glad if someone could help me out. I am using Python 3.7 and trying to extract the "date" value, but only the one inside application-reference:

<bibliographic-data>
    <publication-reference>
        <document-id document-id-type="docdb">
            <country>EP</country>
            <doc-number>1001100</doc-number>
            <kind>A1</kind>
            <date>20000517</date>
        </document-id>
        <document-id document-id-type="epodoc">
            <doc-number>EP1000000</doc-number>
            <date>20000517</date>
        </document-id>
    </publication-reference>
    <application-reference doc-id="17397285">
        <document-id document-id-type="docdb">
            <country>EP</country>
            <doc-number>99203729</doc-number>
            <kind>A</kind>
        </document-id>
        <document-id document-id-type="epodoc">
            <doc-number>EP199903729</doc-number>
            <date>19991108</date>
        </document-id>
        <document-id document-id-type="original">
            <doc-number>993729</doc-number>
        </document-id>
    </application-reference>
</bibliographic-data>

Many other dates may appear in front of application-reference, so I cannot simply print, e.g., the 4th date.

I tried simple xml.dom and xml.etree queries, but none of them worked.

As I am not sure how exactly to access it, I tried:

root = ElementTree.fromstring(js).getroot()

for appl in root.findall("application-reference"):
    ElementTree.dump(appl)

and then I got stuck.

The result should be 19991108.


Solution

  • Try using lxml and XPath:

    bibli = """[your xml above; make sure it's properly formatted!]"""
    
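    from lxml import etree
    
    # fromstring() returns the root element directly
    root = etree.fromstring(bibli)
    # select the text of every <date> nested anywhere under <application-reference>
    print(root.xpath('//application-reference//date/text()'))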
    

    Output:

    ['19991108']
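
    If installing lxml is not an option, the same lookup can probably be done with the standard library, since xml.etree.ElementTree supports a limited XPath subset. The following is only a sketch: the variable names (xml_text, dates) are made up for illustration, and the XML is a trimmed, well-formed copy of the snippet from the question. It also shows two likely reasons the attempt in the question got stuck: fromstring() already returns the root element (so there is no getroot() to call), and findall() with a bare tag name only matches direct children of the element it is called on.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the XML from the question (assumed structure).
xml_text = """
<bibliographic-data>
    <publication-reference>
        <document-id document-id-type="epodoc">
            <doc-number>EP1000000</doc-number>
            <date>20000517</date>
        </document-id>
    </publication-reference>
    <application-reference doc-id="17397285">
        <document-id document-id-type="docdb">
            <country>EP</country>
            <doc-number>99203729</doc-number>
            <kind>A</kind>
        </document-id>
        <document-id document-id-type="epodoc">
            <doc-number>EP199903729</doc-number>
            <date>19991108</date>
        </document-id>
    </application-reference>
</bibliographic-data>
"""

# fromstring() returns the root Element itself, so no getroot() call is needed.
root = ET.fromstring(xml_text)

# ".//" searches all descendants of root; the second "//" then looks for
# <date> at any depth below each <application-reference>.
dates = [d.text for d in root.findall(".//application-reference//date")]
print(dates)
```

    With the sample above this prints ['19991108']; dates that belong to publication-reference are skipped because they are not descendants of application-reference.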