I have an XML file "bib_full-001664.xml" and want to find this element:
<issn pub-type="ppub">2544-1558</issn>
My XML file:
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responseDate>2024-09-19T14:16:51.314812Z</responseDate>
<request identifier="oai:bibliotekanauki.pl:1664" metadataPrefix="jats" verb="GetRecord">https://bibliotekanauki.pl/api/oai/articles</request>
<GetRecord>
<record>
<header>
<identifier>oai:bibliotekanauki.pl:1664</identifier>
<datestamp>2022-04-07T18:08:48.997Z</datestamp>
<setSpec>4</setSpec>
</header>
<metadata>
<article xmlns="http://jats.nlm.nih.gov" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://jats.nlm.nih.gov https://jats.nlm.nih.gov/archiving/1.2/xsd/JATS-archivearticle1.xsd" article-type="research-article">
<front>
<journal-meta>
<journal-title-group>
<journal-title>Medical Science Pulse</journal-title>
</journal-title-group>
<issn pub-type="ppub">2544-1558</issn>
<issn pub-type="epub">2544-1620</issn>
[...] rest of file cut - too long
I am using this Python code:
import xml.etree.ElementTree as ET
tree = ET.parse("bib_full-001664.xml")
xml_data = tree.getroot()
issn_find = xml_data.find("issn")
But issn_find is None - can anybody help me?
I can extract this info using the code:
front = xml_data.findall('*')[2][0][1][0][0]
journal_meta = front[0]
lista = [el for el in journal_meta]
journal_title = lista[0][0].text
journal_issn = lista[1].text
But this is the hard way - I have to look at the XML file and count each tag and level. I have not worked with the xml module before - this is my first time, so please bear with me.
If you don't know about XML namespaces, then I can understand your frustration. The XML document is also a little unusual since the important namespace is not declared on the root element.
Many similar questions have been asked before (for example, "Empty list returned from ElementTree findall"), so I won't go into details about namespaces.
There are two problems with issn_find = xml_data.find("issn"):
1. issn is not a direct child of the root element. It is deeper down in the hierarchy. To search all descendants, start the path with .//
2. issn is bound to a namespace, but this is not taken into account.
The wanted issn element is bound to the http://jats.nlm.nih.gov namespace (declared on the article element). The following code works (it finds the first issn element):
issn_find = xml_data.find(".//{http://jats.nlm.nih.gov}issn")
It is also possible to use a namespace wildcard (supported in Python 3.8 and later):
issn_find = xml_data.find(".//{*}issn")
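If you want a specific issn rather than just the first one, a prefix-to-URI mapping combined with an attribute predicate may be more readable. The following is only a sketch: SAMPLE is a trimmed-down stand-in for the real file (an assumption, since the full file was not posted), and the prefix "jats" is an arbitrary label chosen here; only the namespace URI has to match the document.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for bib_full-001664.xml (assumption: only the
# elements relevant to the question are reproduced here).
SAMPLE = """\
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord><record><metadata>
    <article xmlns="http://jats.nlm.nih.gov">
      <front><journal-meta>
        <journal-title-group>
          <journal-title>Medical Science Pulse</journal-title>
        </journal-title-group>
        <issn pub-type="ppub">2544-1558</issn>
        <issn pub-type="epub">2544-1620</issn>
      </journal-meta></front>
    </article>
  </metadata></record></GetRecord>
</OAI-PMH>
"""

root = ET.fromstring(SAMPLE)

# "jats" is a local label picked for this example; the URI is what matters.
ns = {"jats": "http://jats.nlm.nih.gov"}

# The original attempt: no namespace, direct children only.
not_found = root.find("issn")

# Select the print ISSN by its pub-type attribute, anywhere below the root.
ppub = root.find(".//jats:issn[@pub-type='ppub']", ns)

# Collect every ISSN keyed by its pub-type attribute.
all_issns = {el.get("pub-type"): el.text for el in root.iterfind(".//jats:issn", ns)}

# findtext returns the text of the first match directly.
title = root.findtext(".//jats:journal-title", namespaces=ns)

print(not_found)   # None
print(ppub.text)   # 2544-1558
print(all_issns)   # {'ppub': '2544-1558', 'epub': '2544-1620'}
print(title)       # Medical Science Pulse
```

With the real file, replace ET.fromstring(SAMPLE) with ET.parse("bib_full-001664.xml").getroot(); the find calls stay the same.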