python xml elementtree

Finding a particular tag in an XML file in Python with xml.etree.ElementTree: root.find() does not work as the docs describe


I have an XML file "bib_full-001664.xml" and want to find element: <issn pub-type="ppub">2544-1558</issn>

My XML file:

<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responseDate>2024-09-19T14:16:51.314812Z</responseDate>
<request identifier="oai:bibliotekanauki.pl:1664" metadataPrefix="jats" verb="GetRecord">https://bibliotekanauki.pl/api/oai/articles</request>
<GetRecord>
<record>
<header>
<identifier>oai:bibliotekanauki.pl:1664</identifier>
<datestamp>2022-04-07T18:08:48.997Z</datestamp>
<setSpec>4</setSpec>
</header>
<metadata>
<article xmlns="http://jats.nlm.nih.gov" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://jats.nlm.nih.gov https://jats.nlm.nih.gov/archiving/1.2/xsd/JATS-archivearticle1.xsd" article-type="research-article">
    <front>
    <journal-meta>
<journal-title-group>
<journal-title>Medical Science Pulse</journal-title>
</journal-title-group>
<issn pub-type="ppub">2544-1558</issn>
<issn pub-type="epub">2544-1620</issn>
[...] rest of the file cut (too long)

I am using code in Python:

import xml.etree.ElementTree as ET

tree = ET.parse("bib_full-001664.xml")
xml_data = tree.getroot()
issn_find = xml_data.find("issn")

and issn_find is None - can anybody help me?

I can extract this info using the code:

front = xml_data.findall('*')[2][0][1][0][0]
journal_meta = front[0]
lista = [el for el in journal_meta]
journal_title = lista[0][0].text
journal_issn = lista[1].text

But this is the hard way: I have to look at the XML file and count every tag and level. I have not worked with the xml module before, this is my first time, so please bear with me.


Solution

  • If you don't know about XML namespaces, then I can understand your frustration. The XML document is also a little unusual since the important namespace is not declared on the root element.

    Many similar questions have been asked before (for example Empty list returned from ElementTree findall), and I won't go into details about namespaces.

    There are two problems with issn_find = xml_data.find("issn"):

    1. issn is not a direct child of the root element. It is deeper down in the hierarchy. To search all descendants, use .//.
    2. issn is bound to a namespace, but the search does not take this into account.
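Both problems can be reproduced on a cut-down document. The sketch below is only an illustration: SAMPLE is an abbreviated stand-in for the file in the question (an assumption, not the real file), and the variable names are made up for the example. It shows that neither the plain name nor the descendant search alone finds the element, and how ElementTree actually stores the tag.

```python
import xml.etree.ElementTree as ET

# Abbreviated stand-in for the file in the question (illustrative only).
SAMPLE = """\
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord><record><metadata>
    <article xmlns="http://jats.nlm.nih.gov">
      <front><journal-meta>
        <issn pub-type="ppub">2544-1558</issn>
        <issn pub-type="epub">2544-1620</issn>
      </journal-meta></front>
    </article>
  </metadata></record></GetRecord>
</OAI-PMH>
"""

root = ET.fromstring(SAMPLE)

# Problem 1: a bare tag name only matches direct children of the root.
direct = root.find("issn")

# Problem 2: ".//issn" searches all descendants, but only for an
# element named "issn" that is in no namespace.
no_namespace = root.find(".//issn")

print(direct, no_namespace)

# ElementTree stores every tag as "{namespace-uri}localname".
print(root.tag)
first_issn_tag = [el.tag for el in root.iter() if el.tag.endswith("issn")][0]
print(first_issn_tag)
```

Printing the tags this way is a quick check of which namespace URI has to go into the search path.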

    The wanted issn element is bound to the http://jats.nlm.nih.gov namespace (declared on the article element). The following code works (it finds the first issn element):

    issn_find = xml_data.find(".//{http://jats.nlm.nih.gov}issn")
    

    It is also possible to use a namespace wildcard (supported since Python 3.8):

    issn_find = xml_data.find(".//{*}issn")
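A third option, sketched here under the same assumptions as the question's data, is to pass a prefix-to-URI mapping as the namespaces argument of find(), findall() and findtext(). The prefix "jats" and the SAMPLE string are hypothetical choices made for this example; only the URI has to match the document. An attribute predicate then selects the print ISSN rather than whichever issn element happens to come first.

```python
import xml.etree.ElementTree as ET

# Abbreviated stand-in for the file in the question (illustrative only).
SAMPLE = """\
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord><record><metadata>
    <article xmlns="http://jats.nlm.nih.gov">
      <front><journal-meta>
        <journal-title-group>
          <journal-title>Medical Science Pulse</journal-title>
        </journal-title-group>
        <issn pub-type="ppub">2544-1558</issn>
        <issn pub-type="epub">2544-1620</issn>
      </journal-meta></front>
    </article>
  </metadata></record></GetRecord>
</OAI-PMH>
"""

root = ET.fromstring(SAMPLE)

# The prefix is an arbitrary local name; it is mapped to the real URI.
ns = {"jats": "http://jats.nlm.nih.gov"}

# Select the issn element whose pub-type attribute is "ppub".
ppub = root.find(".//jats:issn[@pub-type='ppub']", ns)

# Collect every issn, keyed by its pub-type attribute.
all_issns = {el.get("pub-type"): el.text
             for el in root.findall(".//jats:issn", ns)}

# findtext() returns the text of the first match directly.
title = root.findtext(".//jats:journal-title", namespaces=ns)

print(ppub.text)
print(all_issns)
print(title)
```

With the real file, replacing ET.fromstring(SAMPLE) with ET.parse("bib_full-001664.xml").getroot() should work the same way, provided the elided part of the file uses the same namespace.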